Homework 5
Due Friday, May 2
Note that this is the first of two homeworks in lieu of a final
exam, to be completed by everyone who is not doing a final project.
The second homework will be distributed on April 23.
Implement a version of Yarowsky's log-likelihood-based approach to
sense disambiguation for the two homographs bass
(FISH versus MUSICAL-RANGE) and sake (CAUSE versus
RICE-BEER). The following files contain training and test data for
each sense:
Each line in the file contains the target word (bass or
sake), followed by a colon, followed by five words of left
context, the target word, and five words of right context. Lines
beginning with a star are to be interpreted as follows:
- *bass: is the FISH sense; otherwise the sense is MUSICAL-RANGE
- *sake: is the RICE-BEER sense; otherwise the sense is CAUSE
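As a starting point, the line format described above could be read with something like the following sketch. It assumes that tokens are separated by whitespace, that the star immediately precedes the target word, and that contexts may be shorter than five words at text boundaries. The names Example, parse_line and sense_of are illustrative only, and the sample lines in the usage note are made up, not taken from the data files.

```python
import string
from typing import List, NamedTuple

# (starred sense, unstarred sense) for each target word, as described above.
SENSES = {"bass": ("FISH", "MUSICAL-RANGE"), "sake": ("RICE-BEER", "CAUSE")}


class Example(NamedTuple):
    target: str       # "bass" or "sake"
    starred: bool     # True if the line began with a star
    left: List[str]   # left context, nearest word last
    right: List[str]  # right context, nearest word first


def parse_line(line: str) -> Example:
    """Split one data line into target, star flag, and left/right context."""
    head, sep, rest = line.strip().partition(":")
    if not sep:
        raise ValueError("no colon found in line: %r" % line)
    starred = head.startswith("*")
    target = head.lstrip("*").strip().lower()
    tokens = rest.split()
    # Find the target inside the context, ignoring case and punctuation.
    matches = [i for i, tok in enumerate(tokens)
               if tok.lower().strip(string.punctuation) == target]
    if not matches:
        raise ValueError("target %r not found in context" % target)
    # With five words of left context the target sits at index 5;
    # if the word occurs more than once, take the occurrence nearest to that.
    pos = min(matches, key=lambda i: abs(i - 5))
    return Example(target, starred, tokens[:pos], tokens[pos + 1:])


def sense_of(example: Example) -> str:
    """Map the star flag to the sense label for this target word."""
    starred_sense, plain_sense = SENSES[example.target]
    return starred_sense if example.starred else plain_sense
```

For example, parse_line("*bass: he caught a large striped bass in the lake last summer") yields an Example whose sense_of value is FISH.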
The senses were tagged by hand, by me, so there are likely to be some errors.
You are to implement Yarowsky's basic log-likelihood-based
decision list model and report on the results.
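To make the shape of the model concrete, here is a minimal sketch of a decision list ranked by the absolute log-likelihood ratio of the two senses given a feature. It is not a reference solution. The function names, the (features, sense) input format, and the additive constant alpha are all assumptions of this sketch; in particular, alpha is only a crude stand-in for proper smoothing of zero counts.

```python
import math
from collections import Counter, defaultdict


def build_decision_list(examples, sense_a, sense_b, alpha=0.1):
    """Return rules (score, feature, sense), sorted by decreasing score.

    examples: iterable of (features, sense) pairs, where features is a
    collection of hashable feature values seen in one training context.
    alpha: additive constant used here as a placeholder for smoothing.
    """
    counts = defaultdict(Counter)
    for features, sense in examples:
        for f in set(features):          # count each feature once per example
            counts[f][sense] += 1

    rules = []
    for f, c in counts.items():
        # count(f, a) / count(f, b) equals P(a | f) / P(b | f).
        llr = math.log((c[sense_a] + alpha) / (c[sense_b] + alpha))
        if llr == 0.0:
            continue                     # no evidence either way
        rules.append((abs(llr), f, sense_a if llr > 0 else sense_b))

    # Strongest evidence first; ties broken by feature name for stability.
    rules.sort(key=lambda r: (-r[0], str(r[1])))
    return rules


def classify(rules, features, default):
    """Apply the first (highest-ranked) rule whose feature is present."""
    present = set(features)
    for _score, f, sense in rules:
        if f in present:
            return sense
    return default                       # e.g. the more frequent sense
```

Only the single highest-ranked matching rule decides the sense; the scores of the other matching rules are not combined.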
Some suggestions:
- You will probably want to do some cleanup of the text. This might
  include:
  - Normalizing all words to lower case (so that trivial differences
    in case don't get in the way)
  - Removing punctuation symbols
- Your features can be anything you want, but it is suggested that you
  at least use something like the following:
  - Word anywhere in context
  - Word in position -2
  - Word in position +2
  - Word in position -1
  - Word in position +1
- You will want to smooth the zero counts as discussed in Yarowsky. One
  suggestion is to use Good-Turing smoothing for all counts for a given
  feature (e.g. the feature "word anywhere in context"). One way to do
  this is to use SGT, which you have already used.
Note that the default definition for "MAX_ROWS" is too small. Search for the line:
#define MAX_ROWS 200
and change it to some value large enough to accommodate the number of
spectrum elements you will have, e.g.:
#define MAX_ROWS 1000
before you compile SGT.
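The cleanup and feature suggestions above might be sketched as follows. This is one possible arrangement under stated assumptions, not a required design: the function names and the feature encoding (a tuple of feature type and word) are made up for illustration, punctuation is taken to mean the ASCII characters in Python's string.punctuation, and positions are counted after cleanup, so a token consisting only of punctuation does not occupy a position. The last function tabulates frequency-of-frequency pairs (r, N_r), the quantity that Good-Turing estimation is based on; how these are passed to SGT depends on your copy of the program.

```python
import string
from collections import Counter

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(tokens):
    """Lower-case each token, strip ASCII punctuation, drop empty results."""
    out = []
    for tok in tokens:
        tok = tok.lower().translate(_PUNCT_TABLE)
        if tok:
            out.append(tok)
    return out


def extract_features(left, right):
    """Build the suggested features from left and right context words.

    left is ordered with the word nearest the target last;
    right is ordered with the word nearest the target first.
    """
    left, right = clean(left), clean(right)
    feats = set()
    for w in left + right:
        feats.add(("any", w))                       # word anywhere in context
    for offset in (1, 2):
        if len(left) >= offset:
            feats.add((-offset, left[-offset]))     # word in position -1, -2
        if len(right) >= offset:
            feats.add((offset, right[offset - 1]))  # word in position +1, +2
    return feats


def count_of_counts(feature_counts):
    """Return sorted (r, N_r) pairs: N_r items were observed exactly r times."""
    return sorted(Counter(feature_counts.values()).items())
```

Keeping a separate count table per feature type (for instance, one for "any" and one for each position) makes it straightforward to smooth each feature type on its own, as suggested above.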
You should report on the following for each of the two test sets:
- The prior probability for guessing (that is, the baseline accuracy
  you would get by always choosing the more probable sense)
- Your improvement over this baseline, if any
- The decision list derived for each term. I won't require that you
  worry about pruning, but I would like to see a rank-ordered list that
  indicates which contextual evidence was most important for each sense,
  and so forth. On the other hand, you may find that your list contains
  lots of useless rules if you don't prune. One honest way to prune is
  to hold out a portion of the training data and see which rules are
  useful on the held out portion (one of the methods that Yarowsky used).
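The baseline comparison and the held-out pruning idea above could be sketched roughly as below. Several things here are assumptions of the sketch rather than requirements: the prior is estimated from the training labels, rules are represented as (score, feature, sense) tuples in rank order, and the pruning criterion (keep a rule only if it is right more often than wrong on the held-out examples) is just one simple possibility, not necessarily the procedure Yarowsky used.

```python
from collections import Counter


def majority_sense(train_senses):
    """The sense with the highest prior probability in the training labels."""
    return Counter(train_senses).most_common(1)[0][0]


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def report(train_senses, test_gold, test_predicted):
    """Compare model accuracy with always guessing the majority sense."""
    guess = majority_sense(train_senses)
    baseline = accuracy([guess] * len(test_gold), test_gold)
    model = accuracy(test_predicted, test_gold)
    return {"majority_sense": guess, "baseline": baseline,
            "model": model, "improvement": model - baseline}


def prune_rules(rules, held_out):
    """Keep rules that are right more often than wrong on held-out data.

    rules: list of (score, feature, sense) tuples in rank order.
    held_out: list of (features, sense) pairs not used to build the rules.
    Rules whose feature never occurs in the held-out data are dropped.
    """
    kept = []
    for score, feature, sense in rules:
        right = sum(1 for feats, s in held_out
                    if feature in feats and s == sense)
        wrong = sum(1 for feats, s in held_out
                    if feature in feats and s != sense)
        if right > wrong:
            kept.append((score, feature, sense))
    return kept
```

Because prune_rules preserves the original order, the pruned list can be used for classification exactly as the unpruned one was.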
You should also check your test results (there are only 100 in each
case): some of your apparent errors may actually be correct and
reflect mistaggings on my part. Report any such cases that you find.