Homework 7

Due Friday, May 2


Note that this is the second of two homeworks in lieu of a final exam, to be completed by everyone who is not doing a final project.
Here are a bunch of personal names tagged for gender. These are derived from a list of Chinese names, but I have encoded the characters so that if you know Chinese you won't have any special advantage over those who don't. The first column is the tag ("M" or "F"), and the remaining three are individual characters of the name.
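
As a starting point, reading the data might look something like the sketch below. It assumes that the columns are separated by whitespace and that each encoded character is a single token. The function names and the sample lines are made up for illustration, so check them against the actual file.

```python
from typing import Iterable, List, Tuple

# One example: the tag ("M" or "F") and the encoded characters of the name.
Example = Tuple[str, Tuple[str, ...]]

def parse_line(line: str) -> Example:
    """Split one line into (tag, characters).

    Assumption: whitespace-separated columns, tag first.
    """
    fields = line.split()
    if len(fields) < 2 or fields[0] not in ("M", "F"):
        raise ValueError(f"unexpected line: {line!r}")
    return fields[0], tuple(fields[1:])

def parse_lines(lines: Iterable[str]) -> List[Example]:
    """Parse every non-blank line, keeping the original order."""
    return [parse_line(line) for line in lines if line.strip()]

# Made-up sample lines; the real character encoding may look different.
sample = ["M c17 c203 c5", "F c88 c41 c9"]
examples = parse_lines(sample)
```

Keeping the original line order matters later, since the predictions must be returned in the order in which the names were given.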

Your task is to build a system -- a classifier -- that is able to predict the gender of a novel name based on this training data. You are free to use any technique you want. Presumably the features will be based on the characters, and some characters will likely be more relevant than others, but you should make no a priori assumptions about what features are relevant. Instead, I strongly advise you to divide the data I have given you into a training set and a held-out test set of your own, try out various ideas, and see which one works best on your held-out data.
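
One way to set up such an experiment is sketched below. It is only an illustration, not a required or recommended method: it holds out a fraction of the data and scores a simple naive Bayes classifier with add-one smoothing under two candidate feature sets. All names here (split, positional, bag, train, predict, accuracy) and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def positional(chars):
    """Candidate feature set 1: each character together with its position."""
    return [f"{i}:{c}" for i, c in enumerate(chars)]

def bag(chars):
    """Candidate feature set 2: which characters occur, ignoring position."""
    return sorted(set(chars))

def train(examples, featurize):
    """Count tags and per-tag feature occurrences."""
    tag_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for tag, chars in examples:
        tag_counts[tag] += 1
        for f in featurize(chars):
            feat_counts[tag][f] += 1
            vocab.add(f)
    return tag_counts, feat_counts, vocab

def predict(model, featurize, chars):
    """Return the tag with the highest smoothed log-probability."""
    tag_counts, feat_counts, vocab = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in sorted(tag_counts):
        # Add-one smoothing; the extra +1 reserves mass for unseen features.
        denom = sum(feat_counts[tag].values()) + len(vocab) + 1
        score = math.log(tag_counts[tag] / total)
        for f in featurize(chars):
            score += math.log((feat_counts[tag][f] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

def accuracy(model, featurize, examples):
    """Fraction of examples whose predicted tag matches the true tag."""
    correct = sum(predict(model, featurize, chars) == tag
                  for tag, chars in examples)
    return correct / len(examples)

# Toy data, invented for illustration only.
toy = [("M", ("a", "x", "y")), ("M", ("a", "y", "x")),
       ("F", ("b", "x", "y")), ("F", ("b", "y", "x"))]
model = train(toy, positional)
```

With the real data, you would train on the first part returned by split, call accuracy on the held-out part once per feature function, and keep whichever does better. Any other technique can be compared the same way.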

On April 30 at 7:00 PM, I will release the test data. You will then run your classifier on that data.

You will turn in a tar file in the usual form, containing two files:

data.tst  your predictions
readme    a file that explains how your classifier works

Note that I want data.tst to be exactly in the same format as the training data, with the lines in the same order as I gave them to you. If you are not sure what "exactly in the same format" means, then you should ask.