Due Friday, December 8, 7:00 PM
The file cant contains a Cantonese character dictionary mapping 6,496 characters in the (Mainland) standard GB-2312 character set (EUC_CN encoding) to their pronunciation in Roman transliteration. Note that every Chinese character in this coding scheme corresponds to two bytes, or characters: note that the length of the stuff in the first column is always two.
You won't be able to see the characters as Chinese characters unless you are in a window that can display Chinese characters. If you wish, you can view the text in a browser with the appropriate character encoding settings. For example, you should see Chinese characters for the following if you set your browser to the "Chinese Simplified (GB2312)" setting (from either Mozilla, Firefox or IE, follow "View" then "(Character) Encoding"):
The file chinnames.txt contains ten two- or three-character names. In each case, the first character (i.e. first two bytes) is the family name, and the last one or two characters (i.e. last two or four bytes) are the given name. Write a script that reads in the dictionary and for each of the ten names, produces a Roman transliteration of the name. The format of the transliteration (a fairly standard format in transliterating Chinese names these days) is:
The family name should have the first letter in upper case and all other letters in lower case. The given name should also have the first letter in upper case and all the remaining letters in lower case. Thus the name above would be transliterated as:
There are three files matching the wildcarded name 1850*.txt, which contain data from the 1850 US Census:
| 1850fnfl.txt | Frequency-ordered female given names |
| 1850fnml.txt | Frequency-ordered male given names |
| 1850freq.txt | Frequency-ordered family names |
In each case, the first line of the file gives the total size of the sample, then on lines with names, the second column gives the name, and the third the frequency.
Write a script that will estimate the probability of any name of the form, given that we know that a person is male, and given that we know a person is female:
Use your script to estimate these probabilities for the following 10 names. I expect you to give me two numbers in each case:
Albert Farrar
Clarra Gates
Elizbeth Hutchinson
Floyd Cline
Isham Cline
Infant Thorn
Juliana Ireland
Lucetta Roberson
Sharlott Fuller
W.M Douglass
To get the estimate of a particular first or last name, you can use the maximum likelihood estimator, which is given by c/N, where c is the count of the name and N is the sample size. Note that the N to be used here is the count given at the top of the file (which is different for family names, male names and female names). You should have your code read these numbers from the file, not hard code them!
To estimate the probability of a full name, assume independence: i.e., assume that the probability of a first name is independent of the probability of a last name.
Then the probability of a name, given that we know that the person is female, is just:
Remember that probabilities have to be numbers between 0 and 1 inclusive. If you have an answer outside that range then something is definitely wrong.
Write a script that loads one of the US Census files from Problem 2, and randomly generates names according to their estimated probability in the data. That is, if a name has a probability of 0.2 it should occur in the output of your generator roughly twice as often as one that has a probability of 0.1.
Use this script to run a so-called Monte Carlo experiment, where you loop generating names for some number of trials. I would like you to run 10,000 trials. For this reason, I recommend you define a function that generates a single name randomly, and then run that function 10,000 times.
Run the experiment on the family names, and show the 20 most frequently generated names in descending order, in the format:
freq1 name1 freq2 name2 . . . . . .
(Note: I do not want to see the whole list, just the top 20.) How well does your list match the list of 20 most frequent names in the US Census sample?
I am expecting that you will write a single program. There should be no need to write the data out to a file and then read it back in with another program and I will deduct points if you do it that way.
You will recall our previous work with corpora such as Alice in Wonderland.
Define a class called "corpus", that supports the following operations:
A by-product of encode should be a wordlist that maps between words and their integers and the other way around. That is, you should provide me the functionality to map from 2 to cat, and from cat to 2, depending upon whether I want to get the index from the word, or the word from the index.
You should use the array module for this, which you will need to import. Then you can set up the array as follows:
encoding = array.array('i')
Here 'i' means that the data in the array are integers. Then you can just append to the end of the array on each new word. Arrays behave much like lists. See here for more details.
Why should you want to do this? For many applications involving large corpora (things much bigger than Alice), it is much more efficient to represent the corpus as a sequence of integers than as a sequence of character strings.
mycorpus.computeNgrams(3)
and have it compute the counts for all of the distinct word trigrams.
mycorpus.computeNgrams(2)
would do the same for the bigrams. Have the result sorted in order of descending frequency.
mycorpus.printNgrams(2)
would print the (frequency ordered) set of bigrams.
For things like histograms and the encoded corpus, it is highly advisable to compute these once, when you first compute them. Then if you call the method again, just have it use the already computed data.
I have provided a cleaned up version of Alice in /home/rws/Final/alice_clean.txt. Use that version so you do not have to worry about cleaning up punctuation etc.
Demonstrate that your program works by:
Consider Grimm's Law as below, for which a more extensive description can be found in various places, including (of course) Wikipedia:
For the sake of this problem you may assume that the /h/ resulting from Grimm's Law is really /x/ (which it presumably was at one point). Given that, the Law clearly involves manipulating the features [continuant], [voiced] and [aspirated].
You are all linguists, so I assume you can figure out a reasonable formulation of the rule(s). What I want you to do for this problem is design a system that implements the rule(s). You should do this as follows. Define a class that allows one to input a segment and output a changed segment. The class should have associated with it the following elements:
Your class should be general so that I could define any context-independent segmental change I want to: you may ignore context. (If you read on in that Wikipedia article, you will see that there were some contextual restrictions on the Law, but you may ignore those for the sake of this problem.) I.e., you should not build Grimm's law into the class definition, but rather define it in terms of a class instance.
I do not care how you represent your segments as strings. I.e., don't spend time worrying about how to represent "þ", for example: just pick something that is clear and unambiguous.