Homework 4

Due Mon, April 14

Problems

Data for this problem set can be found at http://serrano.ai.uiuc.edu/L406_08/Homeworks/hw4data/training.gz. This contains training data from the tagged Brown Corpus. I cannot just give this corpus away free, the directory is password-protected. I will give out the access information in class.

I expect this homework to be a lot of work. I strongly advise starting on it right away.

The assignment is to build a bigram tagger, using the FSM tools.

You should do the following:

  1. Compute the tag bigram transition probabilities. Sentences can be assumed, for the sake of this problem, to end with the tag "." (period). You can pad the beginning of each sentence with "." (i.e. the first bigram conditional probability is just the probability of the first tag, given that the previous tag was ".".
  2. Produce a channel model C which will give you the P(word|tag). This should be implemented as a transducer in the tropical semiring. Input labels should be tags. Output labels should be words. Weights (costs) should be P(word|tag). You should only need one state. The one state is both an initial and a final state.

    Note that I have downcased all the words, so that you don't have to deal with typographical variants ("The", "THE", "the"). Of course this also loses information where capitalization is actually meaningful.

    Since I am not giving you a dictionary, which could inform you about other possible tags for words, assume (bad assumption, but what the heck) that if you haven't seen a given word with a given tag, then it does not occur with that tag.

    However, you will need an unknown word: call it <unk>. If you have enough frequency spectrum elements for a given tag T (I believe the minimum is 5), use SGT to smooth the counts, and assign P(unseen) to P(<unk>|T). Otherwise, if you don't have enough spectrum elements, assign some small nominal "count" (say 0.1) to C(<unk>|T), renormalize and use the MLE to compute the probabilities.

    The transducer will look something like the following toy example:

    0      0      NN      dog     -logP(dog|NN)
    0      0      VB      dog     -logP(dog|VB)
    0      0      VB      eat     -logP(eat|VB)
    0      0      VB      <unk>   -logP(<unk>|VB)
    0
    
    When you compile this, remember you have to use the "-t" flag to "fsmcompile", so that "fsmcompile" knows it's a transducer.
  3. On Saturday April 12 at 5:30PM I will release the test data. This will consist of a bunch of ascii representations of FSM's, which you will compile using your label file. I will be nice to you and provide the out-of-vocabulary words (those not in the training data) as <unk>. You will compose the sentences individually as follows, where S is an FSA representing an individual sentence:

    S o C-1 o L

    (You all know what the raised -1 means, yes?) Or in terms of the FSM toolkit:
    fsmcompose sentence_fsm channel_inv language_model | fsmbestpath |
    fsmprint -i yourlabels -o yourlabels
    

    Note that the form of a sentence is a simple FSM, where here we assume you pre-pad the sentence with ".":

    0       1      .
    1       2      my
    2       3      dog
    3       4      has
    4       5      fleas
    5       6      .
    6
    

    Find the best path, and print out each result in the same format as the tagged data I gave you. Give it back to me in one long file making sure that the number of rows is the same as the number of words, and that the sentences are given in the order of the test sentences I gave you.

    The output sentences should be in the form:

    the     DT
    fulton  NNP
    county  NNP
    grand   NNP
    jury    NNP
    said    VBD
    friday  NNP
    an      DT
    investigation   NN
    of              IN
    atlanta         NNP
    's              POS
    
    I.e., the same form as I gave you for the training data.
  4. I am not assuming anyone will get 100% tag accuracy: this is not likely. I will grade according to how close you get to 100%, with the A+'s going to the people who are closest, and so on...
  5. As a general suggestion, it is probably a good idea for you to test out that everything works by holding out a bit of the training data for testing, until you get all of the kinks out of the system: then you can retrain on all the training data once you're sure everything works.

    You could also hold out a bit of the training data to tune the small nominal count that you add for the unknown word for cases where there are too few spectrum elements to use SGT: try fiddling with the weight to see if it has any effect on performance on the held-out data. My expectation is that it will make little if any difference.

  6. Again, I don't need to see your code, but you should write up enough of a description of what you did so that I know you understood what you were doing.

Turn in the homework in the normal way: i.e., give me a tar file named "hw4_YOUREMAILHANDLE.tar" or "hw4_YOUREMAILHANDLE.tgz". Also, people have been getting sloppy about this: when I unpack your file, I am expecting that it will give me a directory called hw4_YOUREMAILHANDLE.