I expect this homework to be a lot of work. I strongly advise starting on it right away.
The assignment is to build a bigram tagger, using the FSM tools.
You should do the following:
Use the Simple Good Turing program http://www-nlp.stanford.edu/fsnlp/statest/SGT.c. The comment at the beginning of the program explains how to use it. Note that you will probably find you'll have to change the #define for MAX_ROWS to be something larger than 200, since you will probably have more spectrum elements than that. You will have to compile with the math library, thus:
cc -o SGT SGT.c -lm
Now, you may find that SGT gives you a probability of 0 for the unseen cases. If this is the case, then check to see if there is a spectrum element for count 1. If not, there's your answer. In that case, you could just use the MLE and assign probability 0 to unseen cases. (In the bigram FSM, this effectively means omitting the arcs corresponding to those cases.) But beware: this means that if the test data should contain any word sequence that must involve an unseen bitag, then you will get no analysis for that example. You might consider then just assigning a very very small probility to those cases.
Note that FSM label files are not the lextools format symbol files. The FSM label files are in the format:
lab1 int1 lab2 int2 lab3 int3(This is the identical format to the .lab file that lexmakelab produces from the .sym file.) Note that the arc label that you intend to use for <epsilon> must have numeric value 0.
For the probabilities, use the tropical semiring, with negative log probabilities. This is the default semiring in the FSM library.
0 . <epsilon> 0 . NN NN -logP(NN|.) . VB VB -logP(VB|.) . . . -logP(.|.) NN NN NN -logP(NN|NN) NN VB VB -logP(VB|NN) NN . . -logP(.|NN) VB VB VB -logP(VB|VB) VB NN NN -logP(NN|VB) VB . . -logP(.|VB) NN VB .
Note that I have downcased all the words, so that you don't have to deal with typographical variants ("The", "THE", "the"). Of course this also loses information where capitalization is actually meaningful.
Since I am not giving you a dictionary, which could inform you about other possible tags for words, assume (bad assumption, but what the heck) that if you haven't seen a given word with a given tag, then it does not occur with that tag.
However, you will need an unknown word: call it <unk>. If you have enough frequency spectrum elements for a given tag T (I believe the minimum is 5), use SGT to smooth the counts, and assign P(unseen) to P(<unk>|T). Otherwise, if you don't have enough spectrum elements, assign some small nominal "count" (say 0.1) to C(<unk>|T), renormalize and use the MLE to compute the probabilities.
The transducer will look something like the following toy example:
0 0 NN dog -logP(dog|NN) 0 0 VB dog -logP(dog|VB) 0 0 VB eat -logP(eat|VB) 0 0 VB <unk> -logP(<unk>|VB) 0When you compile this, remember you have to use the "-t" flag to "fsmcompile", so that "fsmcompile" knows it's a transducer.
S o C-1 o L
(You all know what the raised -1 means, yes?) Or in terms of the FSM toolkit:fsmcompose sentence_fsm channel_inv language_model | fsmbestpath | fsmprint -i yourlabels -o yourlabels
Note that the form of a sentence is a simple FSM, where here we assume you pre-pad the sentence with ".":
0 1 . 1 2 my 2 3 dog 3 4 has 4 5 fleas 5 6 . 6
Find the best path, and print out each result in the same format as the tagged data I gave you. Give it back to me in one long file making sure that the number of rows is the same as the number of words, and that the sentences are given in the order of the test sentences I gave you.
The output sentences should be in the form:
the DT fulton NNP county NNP grand NNP jury NNP said VBD friday NNP an DT investigation NN of IN atlanta NNP 's POSI.e., the same form as I gave you for the training data.
You could also hold out a bit of the training data to tune the small nominal count that you add for the unknown word for cases where there are too few spectrum elements to use SGT: try fiddling with the weight to see if it has any effect on performance on the held-out data. My expectation is that it will make little if any difference.
Turn in the homework in the normal way: i.e., give me a tar file named "hw4_YOUREMAILHANDLE.tar" or "hw4_YOUREMAILHANDLE.tgz". Also, people have been getting sloppy about this: when I unpack your file, I am expecting that it will give me a directory called hw4_YOUREMAILHANDLE.