Homework 6
Due Monday, November 6
The fragment of the tagged Brown Corpus which we looked at in class
can be found at /home/rws/CodeFall06/brown_corpus. For these
exercises, please do not make your own copy of the
corpus. It is not necessary. Just use mine.
-
Write a function that will read in the Brown Corpus, and compute:
- The frequency of each word
- The frequency of each part of speech associated with each word
(Another way to think about this is that you are simply computing the
frequency of each word-POS pair.)
Use this function to answer the following question: if I used the
stupidest possible baseline of simply tagging every word
with its most common POS, how well would I do at tagging the Brown
Corpus? You should be able to report an error rate in terms of the
percentage of words that I would get wrong by this method.
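
As a rough illustration of the counting and the baseline calculation, here is a minimal Python sketch. It assumes the corpus has already been split into tokens of the form "word/TAG" (as in "might/MD"); the actual file layout may differ, and the function names count_pairs and baseline_error_rate are made up for this example. Word frequencies can be recovered by summing the pair counts for each word.

```python
from collections import Counter, defaultdict

def count_pairs(tokens):
    """Count (word, POS) pairs from an iterable of 'word/TAG' strings.

    Assumption: each token is 'word/TAG'; the real corpus format may differ.
    """
    pair_counts = Counter()
    for token in tokens:
        # Split on the last slash so a word that itself contains '/' stays intact.
        word, _, pos = token.rpartition("/")
        pair_counts[(word, pos)] += 1
    return pair_counts

def baseline_error_rate(pair_counts):
    """Fraction of tokens tagged wrongly if every word gets its most common POS."""
    best = defaultdict(int)  # word -> count of its most frequent tag
    total = 0
    for (word, pos), n in pair_counts.items():
        total += n
        best[word] = max(best[word], n)
    if total == 0:
        return 0.0
    correct = sum(best.values())
    return (total - correct) / total

# Toy data, not taken from the corpus.
toy = "the/AT might/MD might/MD might/NN dog/NN".split()
pairs = count_pairs(toy)
error = baseline_error_rate(pairs)
print(f"error rate: {100 * error:.1f}%")
```

On the toy data only the single "might/NN" token is mis-tagged, so one token in five is wrong.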
-
Write a function to compute the set of word-POS
types for all words. A type is just a word form that is
distinct from other word forms. For example, in the sentence "the blue
dog has a blue hat", there are 7 tokens, but 6 types (since "blue"
occurs twice). If you are computing word-POS types you consider each
POS to be different, so that "might/MD" is a distinct type from
"might/NN", but two instances of "might/MD" are just two instances of
the same type.
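
The token/type distinction can be sketched in a few lines of Python. This is only an illustration: it again assumes tokens of the form "word/TAG", and word_pos_types is a hypothetical name.

```python
def word_pos_types(tokens):
    """Return the set of distinct (word, POS) pairs among 'word/TAG' tokens."""
    types = set()
    for token in tokens:
        # Split on the last slash; a set keeps each pair only once.
        word, _, pos = token.rpartition("/")
        types.add((word, pos))
    return types

# Plain word types for the example sentence: 7 tokens, 6 types.
sentence = "the blue dog has a blue hat".split()
n_tokens = len(sentence)
n_types = len(set(sentence))

# Word-POS types: the two "might/MD" tokens collapse into one type.
toy = "might/MD might/NN might/MD".split()
types = word_pos_types(toy)
```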
-
Write a function to take the set of word-POS types and compute, for
words longer than 6 characters, the ten most frequent 3- and
4-character suffixes for each of the parts of speech "VBG",
"NN", "JJ", "VB". You should write your function in a general way so
that I could give it a POS and it would go off and compute the
requested histograms. Then use it to report results for those four
specific parts of speech.
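
One possible shape for such a general function is sketched below in Python. Two things here are assumptions rather than part of the assignment: the types are represented as (word, POS) tuples, and the 3- and 4-character suffixes are pooled into a single histogram (keeping them in separate histograms would be an equally reasonable reading). The name top_suffixes and its parameters are hypothetical.

```python
from collections import Counter

def top_suffixes(types, pos, min_len=7, suffix_lens=(3, 4), n=10):
    """Return the n most frequent suffixes among types tagged `pos`.

    types: iterable of (word, POS) pairs, each counted once.
    min_len=7 implements "longer than 6 characters".
    Assumption: suffixes of all requested lengths share one histogram.
    """
    counts = Counter()
    for word, tag in types:
        if tag == pos and len(word) >= min_len:
            for k in suffix_lens:
                counts[word[-k:]] += 1
    return counts.most_common(n)

# Toy types, not taken from the corpus.
toy_types = {
    ("running", "VBG"), ("jumping", "VBG"), ("walking", "VBG"),
    ("sing", "VBG"),        # too short, ignored
    ("building", "NN"),
}
vbg_top = top_suffixes(toy_types, "VBG")
nn_top = top_suffixes(toy_types, "NN")
```

Calling the function once per tag, for example in a loop over ["VBG", "NN", "JJ", "VB"], would then produce the four reports.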
-
Write a function to collect those word-POS types that occur just
once in the corpus. (The technical term for these is hapax
legomena.) Then use your code from the previous problem to report
on the top 10 suffixes for "NN" and "JJ". Do you notice anything interesting
about the difference you find for "NN" and "JJ"? (Hint: what are your
intuitions about which affixes are more productive?)
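
Selecting the hapax legomena is a simple filter once the pair counts exist. The Python sketch below assumes the counts are held in a Counter keyed by (word, POS) tuples; hapax_types is a hypothetical name, and the toy counts are invented.

```python
from collections import Counter

def hapax_types(pair_counts):
    """Return the set of (word, POS) pairs that occur exactly once."""
    return {pair for pair, n in pair_counts.items() if n == 1}

# Toy counts, not taken from the corpus.
toy_counts = Counter({
    ("the", "AT"): 120,
    ("might", "MD"): 8,
    ("greenness", "NN"): 1,
    ("walkable", "JJ"): 1,
})
hapaxes = hapax_types(toy_counts)
```

The resulting set has the same shape as a set of word-POS types, so it can be handed directly to the suffix function written for the previous problem.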