You do not have to write any code for this exercise!
Your system, which is in competition with those of the other teams, will consist of two sub-grammars:
One way to see what your grammar does is to generate
random sentences from it. For our purposes, generation is just
repeated symbol expansion. To expand a symbol such as NP, our
sentence generator will randomly choose one of your grammar's
NP -> ...
rules, with probability proportional to the rule's
weight.
Your task is to write the grammar rules and also choose the weight for each rule. The weights allow you to express your knowledge of which English phenomena are common and which ones are rare. By giving a low weight to a rule (or to S2 itself), you express a belief that it doesn't happen very often in English.
Another way to see what your grammar does is to parse sentences with it. You can use our parser to look at a sentence and figure out whether your grammar could have generated it -- and how, and with what probability. We say that your grammar predicts a sentence well if (according to the parser) your grammar has a comparatively high probability of generating exactly that sentence.
You will have to decide how much effort to expend on designing S1 and S2, and how much you will trust S1 relative to S2. Your goal is to describe English accurately. This means that your grammar (especially the S1 subgrammar) should be able to generate lots of syntactic constructions that show up in English. Moreover, it should have a higher probability of generating common sentences, and a low or zero probability of generating ungrammatical sentences.
The performance of your system will be measured in two ways:
A weighted context free grammar consists of:
START), and the terminal symbols as the words.
A derivation rule tells one way to rewrite a non-terminal symbol
into a sequence of non-terminal symbols and terminal symbols.
For example, S -> NP VP says that an S
(perhaps indicating a declarative sentence) can be rewritten as
an NP (noun phrase) followed by a VP (verb phrase).
A word about weights: Today you heard a lot about probability models. In a probabilistic CFG, each rule would have some probability associated with it, and the probability of a derivation for a sentence would be the product of all the rules that went into the derivation. We don't want you to worry about making probabilities sum up to one, so you can use any positive number as a weight.
An additional note for the truly curious: the
normalization script norm-log normalizes the
weights, then takes their negative logarithms. This is because
the parser we provide works by adding the rule scores and
seeking a minimum. This is identical to multiplying the
probabilities and seeking a maximum.
In the file Vocab.gr, we are giving you the set of terminal
symbols (words), embedded in rules
of the form Tag -> word.
Note that the vocabulary is closed. There will
be no unknown words in the test sentences, and you are not allowed
to add any
words (terminal symbols) to the grammar.
We have given equal weights to all the Tag ->
word rules, but you are free to change the weights if
you like. You are also free to add, remove, or change these
rules, as long as you don't add any new words.
Tag -> word rules, but
you might find that the tags aren't useful when trying to extend
S1 into a bigger English grammar. So you are free to create new
tags for word classes that you find convenient or use the words
directly in the rules, if you find it advantageous. We tried to
keep the vocabulary relatively small to make it easier for you
to do this.
You will almost certainly want to change the tags in rules
of the form Misc ->
word. But be careful: you don't want to
change Misc -> goes
to VerbT ->
goes, since goes doesn't behave like
other VerbT's. In particular, you want your S1
to generate Guinevere has the chalice .
but not Guinevere goes the chalice .,
which is ungrammatical. This is why you may want to invent
some new tags.
Our method of enforcing this requirement is to use a grammar that
is effectively a bigram (or finite-state) model. Suppose we only have
two tag types, A and B. The set of rules
we would allow in S2 would be:
S2 -> _A
This grammar can generate any sequence of
S2 -> _B
S2 ->
_A -> A
_A -> A _A
_A -> A _B
_B -> B
_B -> B _A
_B -> B _B
As and Bs,
and there is no ambiguity in parsing with it: there is always a
single, right-branching
parse.
START -> S1 and
START -> S2. The relative weight of these determines how
likely it is that S1 (with start symbol S1) or S2 (with
start symbol S2) would be selected in generating a sentence,
and how costly it is to choose one or the other when parsing a sentence.
Choosing the relative weight of these two rules is a gamble. If you are over-confident in your ``real'' English grammar (S1), and you weight it too highly, then you risk assigning very low probability to sentences generated by a team whose grammar is more extensive (since the parser will have to resort to your S2 to get a parse).
On the other hand, if you weight S2 too highly, then you will probably do a poor job of predicting the other teams' sentences, since S2 will not make any sentences very likely (it accepts everything, so probability mass is spread very thin across the space of word strings). Of course, you can invest some effort in trying to make S2 a better n-gram model, but that's a tedious task and a risky investment.
randsent.
We will then make a grammaticality judgement about each sentence,
and your score will be the fraction of sentences judged to be
grammatical. To judge the grammaticality of the sentences,
we will be using the best tool available: human speakers (yourselves!).
Toward the end of the lab, we'll elicit grammaticality judgments from each of you
on a set of sentences
(which might have been generated by your grammar or someone else's -- so
be honest!).
The second score evaluates your full grammar (S1+S2) and is much more important. Here we will take the other teams' test sets (removing the sentences that human judges couldn't parse) and try to parse them with your grammar (the combination of S1 and S2). You will be able to parse all of the sentences (because S2 will accept anything, albeit with low probability). The score will be the cross-entropy of your model with respect to the test set. A low cross-entropy is good; it means that your model is unsurprised by the test sentences generated by the other teams. We will report your model's cross-entropy on each of the other teams' test sets.
If the test set of sentences is {s1, s2, s3, ...}, then your
model's cross-entropy is defined as
average(
-log(p(s1)), -log(p(s2)), -log(p(s3)), ... )
Note that
-log(1)=0, -log(1/2)=1, -log(1/4)=2, -log(1/8)=3, ... -log(0)=infinity
So a high cross-entropy corresponds to a low probability and
vice-versa. You will be severely penalized for probabilities
close to zero.
Note: p(s) denotes the probability that a randomly generated sentence is exactly s. Often your grammar will have many different ways to generate s, corresponding to many different trees. In that case, p(s) is the total probability of all those trees. However, for convenience, we approximate this with the probability of the single most likely tree, because that is what our parser happens to find. So we are really only measuring approximate cross-entropy.
| Team | Score 1 (Precision) | Score 2 (Cross-entropy) |
| 01 | 6/20 | 35.57 |
| 02 | 0/20 | 54.01 |
| 03 | 16/20 | 38.48 |
| 04 | 5/20 | 49.37 |
| 05 | 11/20 | 39.59 |
| 06 | 0/20 | 39.56 |
| 07 | 3/20 | 40.82 |
| 08 | 13/20 | 40.97 |
| 09 | 6/20 | 36.53 |
| 10 | 14/20 | 36.17 |
| 11 | 0/20 | |
We only want to draw attention to one of last year's teams' blunders.
Remember the part about weighting the
START -> S1 and
START -> S2 rules? Only teams 01 and 09 changed
those weights from their defaults; doing so helped their cross-entropy scores a lot.
Note that teams 10 and 03 are to be commended for having top precision and
being in second and fourth place in cross entropy -- without even reweighting S1 and S2!
There are two file formats to know about. The first is for sentences
generated by a grammar, or to be parsed by a grammar. We name files in this
format with the suffix .sen . Simply put one sentence per
line, with spaces between tokens. Make sure you put spaces before and after
punctuation symbols, since, for example Arthur, will look like
one token to the parser, while Arthur , is what you want
(two tokens).
The file format for the grammar itself is quite simple. We
name these files with the suffix .gr . On a
given line, anything following # is an ignored
comment. Each rule goes on one line, in the format:
weight X Y1
Y2 ... Yn
signifying
the rule X -> Y1
Y2 ... Yn (where
X is a non-terminal and the Ys are
any mix of terminals and non-terminals).
Be careful not to name any of your non-terminals the same as any terminals.
From here on .gr files refer to grammar files,
and .sen files refer to sentence files.
randsent [ -t ] [ -n num-sentences ]
[ -s start-symbol ] grammar-files -t: produce tree output instead of flat sentences.
-n num-sentences: e.g., -n 10 will generate
ten sentences instead of just one.
-s start-symbol: The generator's start symbol is S1 by default.
You may want to specify another so you can test just part of
the grammar.
grammar-files: Typically *.gr. In general, a list of .gr
files containing the parts of the grammar. Different parts
may be written by different people.
randsent -t -n 17 -s START *.gr | prettyprint parse [-s start-symbol] sentence-file
grammar-files -s start-symbol:
The parser's default start symbol is START.
You may want to specify another
sentence-file: Name of a file with one input sentence per line.
The special filename - means to read the input sentences from
the keyboard instead (or in general, from standard input).
grammar-files: As above.
It is expected that your grammar will be in multiple files. If you follow
our convention and name
all of these ending in .gr you can parse with something like:
parse - *.gr
which will take sentences as you type them in and use all your grammar
files.
If a sentence fails to parse, its output will be
NONE on a single line. If the parse is successful, you
will see the single best-scoring (highest-weight) parse on one line,
followed by the negative log-probability of the sentence according to
your model.
You will also see warnings, such as words in the sentences that are missing in your grammar.
If you want to make the parses a bit more readable, pipe the
output of parse to prettyprint.
parse sentence-file *.gr | prettyprint
If you want
to see how you're doing in terms of cross-entropy, pipe the output of
parse to crossent:
parse sentence-file *.gr | crossent
check-for-new-terms grammar-files
Note that this script is called automatically whenever you call
parse or randsent, so you will get constant
warnings until you fix the illegal terminals.
/export/ws03_mt_2/scratch/ws03lab/data/. It will be updated
frequently, so check back often to keep on top of what
kinds of sentences the others are generating!
In addition, we
have generated examples of sentences you might want to try to design for.
They
are listed in the file called examples.sen in the start
directory.
These are just
suggestions; they aren't part of the evaluation. If you want to get more ideas
about how to stump the other teams, we suggest looking on the Web. It's not
hard to find complicated sentences.