Final Take-Home Exam

Due Friday, December 8, 7:00 PM

Introduction

As advertised in the syllabus, this final will account for 20% of your grade. You should make sure you follow the following instructions carefully:

Problems

Unless otherwise indicated, all data for the problems can be found in /home/rws/Final under the named files.
  1. 25 Points

    The file cant contains a Cantonese character dictionary mapping 6,496 characters in the (Mainland) standard GB-2312 character set (EUC_CN encoding) to their pronunciation in Roman transliteration. Note that every Chinese character in this coding scheme corresponds to two bytes, or characters: note that the length of the stuff in the first column is always two.

    You won't be able to see the characters as Chinese characters unless you are in a window that can display Chinese characters. If you wish, you can view the text in a browser with the appropriate character encoding settings. For example, you should see Chinese characters for the following if you set your browser to the "Chinese Simplified (GB2312)" setting (from either Mozilla, Firefox or IE, follow "View" then "(Character) Encoding"):

    ʯ»ùÁÕ

    The file chinnames.txt contains ten two- or three-character names. In each case, the first character (i.e. first two bytes) is the family name, and the last one or two characters (i.e. last two or four bytes) are the given name. Write a script that reads in the dictionary and for each of the ten names, produces a Roman transliteration of the name. The format of the transliteration (a fairly standard format in transliterating Chinese names these days) is:

    Family Given

    The family name should have the first letter in upper case and all other letters in lower case. The given name should also have the first letter in upper case and all the remaining letters in lower case. Thus the name above would be transliterated as:

    Sek Geilam

  2. 25 Points

    There are three files matching the wildcarded name 1850*.txt, which contain data from the 1850 US Census:

    1850fnfl.txt Frequency-ordered female given names
    1850fnml.txt Frequency-ordered male given names
    1850freq.txt Frequency-ordered family names

    In each case, the first line of the file gives the total size of the sample, then on lines with names, the second column gives the name, and the third the frequency.

    Write a script that will estimate the probability of any name of the form, given that we know that a person is male, and given that we know a person is female:

    FirstName LastName

    Use your script to estimate these probabilities for the following 10 names. I expect you to give me two numbers in each case:

    Albert Farrar
    Clarra Gates
    Elizbeth Hutchinson
    Floyd Cline
    Isham Cline
    Infant Thorn
    Juliana Ireland
    Lucetta Roberson
    Sharlott Fuller
    W.M Douglass

    In other words, I want to know the answer to the following two questions:

    To get the estimate of a particular first or last name, you can use the maximum likelihood estimator, which is given by c/N, where c is the count of the name and N is the sample size. Note that the N to be used here is the count given at the top of the file (which is different for family names, male names and female names). You should have your code read these numbers from the file, not hard code them!

    To estimate the probability of a full name, assume independence: i.e., assume that the probability of a first name is independent of the probability of a last name.

    Then the probability of a name, given that we know that the person is female, is just:

    P(first|female) * P(last)

    Similarly for males:

    P(first|male) * P(last)

    I am expecting that you will write one program to do this, it will load all the name lists, do the appropriate computations, and take a file containing a list of names and return, for each name, the two probabilities I ask for.

    Remember that probabilities have to be numbers between 0 and 1 inclusive. If you have an answer outside that range then something is definitely wrong.

    Extra credit: 20 points.

    Compute the probability that the names in the above list are male or female. Again, I want two numbers. This is the inverse of what I asked for before. What I asked for before was, given that I know that a person is male, what is the probability of them having a certain name? This time I want to know, if I have a certain name, what is the probability of them being, say, female?

  3. 25 Points

    Write a script that loads one of the US Census files from Problem 2, and randomly generates names according to their estimated probability in the data. That is, if a name has a probability of 0.2 it should occur in the output of your generator roughly twice as often as one that has a probability of 0.1.

    Use this script to run a so-called Monte Carlo experiment, where you loop generating names for some number of trials. I would like you to run 10,000 trials. For this reason, I recommend you define a function that generates a single name randomly, and then run that function 10,000 times.

    Run the experiment on the family names, and show the 20 most frequently generated names in descending order, in the format:

    freq1     name1
    freq2     name2
      .         .
      .         .
      .         .
    

    (Note: I do not want to see the whole list, just the top 20.) How well does your list match the list of 20 most frequent names in the US Census sample?

    How to randomly generate names

    You will (of course) need to import the module random. One simple-minded, but for data of this size perfectly workable, approach is to create a list that contains N instances of each name, where N is the number observed instances in the data. Then you can use random.choice() to select a name.

    I am expecting that you will write a single program. There should be no need to write the data out to a file and then read it back in with another program and I will deduct points if you do it that way.

  4. 25 points

    You will recall our previous work with corpora such as Alice in Wonderland.

    Define a class called "corpus", that supports the following operations:

    For things like histograms and the encoded corpus, it is highly advisable to compute these once, when you first compute them. Then if you call the method again, just have it use the already computed data.

    I have provided a cleaned up version of Alice in /home/rws/Final/alice_clean.txt. Use that version so you do not have to worry about cleaning up punctuation etc.

    Demonstrate that your program works by:

  5. 25 Points

    Consider Grimm's Law as below, for which a more extensive description can be found in various places, including (of course) Wikipedia:

    For the sake of this problem you may assume that the /h/ resulting from Grimm's Law is really /x/ (which it presumably was at one point). Given that, the Law clearly involves manipulating the features [continuant], [voiced] and [aspirated].

    You are all linguists, so I assume you can figure out a reasonable formulation of the rule(s). What I want you to do for this problem is design a system that implements the rule(s). You should do this as follows. Define a class that allows one to input a segment and output a changed segment. The class should have associated with it the following elements:

    Your class should be general so that I could define any context-independent segmental change I want to: you may ignore context. (If you read on in that Wikipedia article, you will see that there were some contextual restrictions on the Law, but you may ignore those for the sake of this problem.) I.e., you should not build Grimm's law into the class definition, but rather define it in terms of a class instance.

    I do not care how you represent your segments as strings. I.e., don't spend time worrying about how to represent "þ", for example: just pick something that is clear and unambiguous.