.sen format, for sentences: one sentence per line,
separating all tokens by whitespace. Be sure to put spaces before and after
all punctuation symbols!
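If your raw text is not already tokenized, a small script can insert the spaces for you. The following is only a sketch: the function name to_sen_line is hypothetical, and it assumes that "punctuation" means any character that is neither a letter, digit, underscore, nor whitespace. In particular it splits apostrophes off too, which may not match how the lab's vocabulary treats contractions.

```python
import re

def to_sen_line(text: str) -> str:
    """Return text as one .sen-style line: tokens separated by single spaces."""
    # Assumption: every non-word, non-space character is a punctuation
    # symbol and becomes its own token.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse the runs of whitespace introduced above.
    return " ".join(spaced.split())

if __name__ == "__main__":
    print(to_sen_line("Hello, world!"))
```

Check the result by eye before parsing it; a tokenization that disagrees with the grammar's vocabulary will show up as unknown-word warnings.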
.gr format, for grammar rules: one rule per line, in the format:
weight X Y1 Y2 ... Yn
for the rule X -> Y1 Y2 ... Yn (where X is a non-terminal and the Ys are
any mix of terminals and non-terminals). Anything on a line after a #
is considered a comment.
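To make the rule format concrete, here is a minimal sketch of reading one .gr line. It is not the lab's own reader. It assumes that each rule line is a numeric weight, then the left-hand side X, then one or more right-hand-side symbols, with no literal arrow on the line; the name parse_rule is hypothetical.

```python
from typing import List, Optional, Tuple

Rule = Tuple[float, str, List[str]]

def parse_rule(line: str) -> Optional[Rule]:
    """Parse one .gr line into (weight, lhs, rhs); None for blank or comment-only lines."""
    # Anything after a '#' is a comment.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    fields = body.split()
    # Assumption: a rule needs a weight, a left-hand side, and at least
    # one right-hand-side symbol.
    if len(fields) < 3:
        raise ValueError(f"expected 'weight X Y1 ... Yn', got: {line!r}")
    return float(fields[0]), fields[1], fields[2:]

if __name__ == "__main__":
    print(parse_rule("1 S1 NP VP .   # a simple sentence"))
```

Running every line of your grammar through a check like this before the evaluation is a cheap way to catch a missing weight or a stray token.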
randsent [ -t ] [ -n num-sentences ] [ -s start-symbol ] grammar-files
-t: produce tree output instead of flat sentences.
-n num-sentences: e.g., -n 10 will generate
ten sentences instead of just one.
-s start-symbol: The generator's start symbol is S1 by default.
You may want to specify another so you can test just part of
the grammar.
grammar-files: Typically *.gr. In general, a list of .gr
files containing the parts of the grammar. Different parts
may be written by different people.
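To get a feel for what random generation from a weighted grammar involves, here is a rough sketch. It is not the lab's randsent. It assumes that rules are given as (weight, lhs, rhs) triples, that a rule is chosen with probability proportional to its weight among the rules sharing its left-hand side, and that any symbol with no rule of its own is a terminal. The function name generate is hypothetical.

```python
import random
from collections import defaultdict

def generate(rules, start="S1", rng=random):
    """Expand `start` top-down and return the generated words joined by spaces."""
    by_lhs = defaultdict(list)
    for weight, lhs, rhs in rules:
        by_lhs[lhs].append((weight, rhs))

    def expand(symbol):
        if symbol not in by_lhs:
            # Assumption: a symbol that no rule rewrites is a terminal.
            return [symbol]
        weights = [w for w, _ in by_lhs[symbol]]
        options = [rhs for _, rhs in by_lhs[symbol]]
        # Pick one right-hand side, proportionally to its weight.
        chosen = rng.choices(options, weights=weights, k=1)[0]
        words = []
        for child in chosen:
            words.extend(expand(child))
        return words

    return " ".join(expand(start))

if __name__ == "__main__":
    toy = [
        (1.0, "S1", ["NP", "VP", "."]),
        (1.0, "NP", ["Arthur"]),
        (3.0, "VP", ["rides"]),
        (1.0, "VP", ["speaks"]),
    ]
    print(generate(toy))
```

Note that a grammar whose recursive rules carry too much weight can make a sketch like this recurse very deeply, which is itself a useful hint that the weights need adjusting.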
parse [-s start-symbol] sentence-file grammar-files
-s start-symbol:
The parser's default start symbol is START.
You may want to specify another.
sentence-file: Name of a file with one input sentence per line.
The special filename - means to read the input sentences from
the keyboard instead (or in general, from standard input).
grammar-files: As above.
If a sentence fails to parse, its output will be
NONE on a single line. If the parse is successful, you
will see the single best-scoring (highest-weight) parse on one line,
followed by the negative log-probability of the sentence according to
your model.
You will also see warnings, such as words in the sentences that are
missing in your grammar.
If you want to make the parses a bit more readable, pipe the
output of parse to prettyprint:
parse sentence-file *.gr | prettyprint
If you want to see how you're doing in terms of cross-entropy, pipe the
output of parse to crossent:
parse sentence-file *.gr | crossent
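If you are curious how a cross-entropy figure relates to the per-sentence negative log-probabilities, the sketch below uses one common definition: total negative log-probability, converted to bits, divided by the total number of words. This is an assumption, not a description of what crossent actually computes; the function name, the log_base parameter, and the choice of bits per word are all hypothetical. It also assumes every sentence parsed, since a sentence with output NONE has no finite score.

```python
import math

def cross_entropy_bits_per_word(neg_logprobs, word_counts, log_base=2.0):
    """Average negative log-probability per word, in bits."""
    if len(neg_logprobs) != len(word_counts):
        raise ValueError("need one word count per sentence")
    total_words = sum(word_counts)
    if total_words == 0:
        raise ValueError("no words to average over")
    # Convert from the given log base to base 2 (bits).
    total_bits = sum(neg_logprobs) * math.log(log_base) / math.log(2.0)
    return total_bits / total_words

if __name__ == "__main__":
    # Two sentences of two words each, scores given in bits.
    print(cross_entropy_bits_per_word([3.0, 5.0], [2, 2]))
```

Lower is better: a grammar that assigns higher probability to the test sentences yields fewer bits per word.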
If you use tcsh:
setenv PATH /export/ws03_mt_2/scratch/ws03lab/bin:$PATH
If you use bash:
export PATH=/export/ws03_mt_2/scratch/ws03lab/bin:$PATH
Please use tcsh or bash for this exercise, not csh. We're pretty sure that the default on
your new account is tcsh.
umask 000
/export/ws03_mt_2/scratch/ws03lab/. Someone
should copy the files you need to get started:
cp ../start/* ./
The script test.csh will do this for you.
members that contains the
(complete) email addresses of everyone on your team, one per line. This
is so that we can give you sentences for grammaticality judgements later in the lab.
.gr)
files. A good place to start is by modifying Top.gr, which holds
the weights for S1 and S2 and the initial S1. We recommend
that you not change the file Vocab.gr,
but rather create new Tag -> word
rules in addition to the ones we've given you, in a different file.
You will want to use parse to see how well you can parse
training data from the other teams, and you will want to use
randsent to see if the sentences you are generating
look like English.
cat
together all your grammar files into one file called GRAMMAR.gr
so we can evaluate your model. Don't forget to include files
like S2.gr and Vocab.gr when you do this!
Then, to make sure everything will run in the evaluation, run:
randsent -n 10 GRAMMAR.gr
parse examples.sen GRAMMAR.gr
Noun, Verb,
Misc, etc.) to something more
complete and realistic.
START -> S1 and
START -> S2. How much do you trust
your grammar?
START, S1, and S2 for
evaluation purposes - don't use them elsewhere.