Homework 6

Due Monday, November 6

The fragment of the tagged Brown Corpus which we looked at in class can be found at /home/rws/CodeFall06/brown_corpus. For these exercises, please do not make your own copy of the corpus. It is not necessary. Just use mine.

  1. Write a function that will read in the Brown Corpus and compute, for each word, the number of times it occurs with each part-of-speech (POS) tag. (Another way to think about this is that you are simply computing the frequency of each word-POS pair.)

    Use this function to answer the following question: if I used the stupidest possible baseline of simply assigning every word its most common POS, how well would I do at tagging the Brown Corpus? You should be able to report an error rate, that is, the percentage of word tokens that I would get wrong by this method.

  2. Write a function to compute the set of word-POS types for all words. A type is just a word form that is distinct from other word forms. For example, in the sentence "the blue dog has a blue hat", there are 7 tokens, but 6 types (since "blue" occurs twice). If you are computing word-POS types, you consider each POS to be different, so that "might/MD" is a distinct type from "might/NN", but two instances of "might/MD" are just two instances of the same type.

  3. Write a function to take the set of word-POS types and compute, for words longer than 6 characters, the ten most frequent 3- and 4-character suffixes for each of the parts of speech "VBG", "NN", "JJ", "VB". You should write your function in a general way so that I could give it a POS and it would go off and compute the requested histogram. Then use it to report results for those specific four POS tags.

  4. Write a function to collect those word-POS types that occur only once in the corpus. (The technical term for these is hapax legomena.) Then use your code from the previous problem to report on the top 10 suffixes for "NN" and "JJ" among these types. Do you notice anything interesting about the difference you find for "NN" and "JJ"? (Hint: what are your intuitions about which affixes are more productive?)
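
To make the bookkeeping in the problems above concrete, here is a minimal Python sketch that runs on a toy string rather than the real corpus. It assumes whitespace-separated word/TAG tokens (as in "might/MD"), which may not match the actual file layout. The names SAMPLE, read_tagged, baseline_error_rate, hapax_types and top_suffixes are made up for illustration, and lowercasing words and pooling 3- and 4-character suffixes into one count are choices of this sketch, not requirements of the assignment.

```python
from collections import Counter

# Toy stand-in for the corpus. The word/TAG format is an assumption.
SAMPLE = (
    "the/AT blue/JJ dog/NN has/HVZ a/AT blue/JJ hat/NN "
    "he/PPS might/MD go/VB "
    "it/PPS might/MD rain/VB "
    "the/AT might/NN of/IN the/AT dog/NN"
)


def read_tagged(text):
    """Turn whitespace-separated word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash, in case a word itself contains one.
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs


def baseline_error_rate(pairs):
    """Error rate when every word is given its most frequent tag."""
    best = {}  # word -> count of its most frequent tag
    for (word, _tag), n in Counter(pairs).items():
        best[word] = max(best.get(word, 0), n)
    return 1 - sum(best.values()) / len(pairs)


def hapax_types(pairs):
    """Word-POS types that occur exactly once."""
    return {pair for pair, n in Counter(pairs).items() if n == 1}


def top_suffixes(types, pos, min_len=7, top=10):
    """Most frequent 3- and 4-character suffixes among types tagged `pos`."""
    counts = Counter()
    for word, tag in types:
        if tag == pos and len(word) >= min_len:  # longer than 6 characters
            counts[word[-3:]] += 1
            counts[word[-4:]] += 1
    return counts.most_common(top)


pairs = read_tagged(SAMPLE)
types = set(pairs)  # word-POS types: might/MD and might/NN stay distinct
print(len(pairs), "tokens,", len(types), "types")
print("baseline error rate:", baseline_error_rate(pairs))
print("hapax legomena:", sorted(hapax_types(pairs)))
```

On the toy data, the only baseline error is the single "might/NN" token, since "might" is tagged "MD" more often. The same functions would apply to the real corpus once its file format is confirmed.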