Homework 2

Due Monday, February 11

See below on how to hand in this homework.

Quick note

The links to the documentation for lextools have been given out already. One thing to bear in mind is that the regular expression syntax of lextools is a bit quirky when it comes to the interpretation of scope of different operators. The basic rule to follow is, when in doubt use parentheses to make the scope clear. So if you want a b* (a followed by zero or more b's) then it is safest to write this as a (b*).

Note that documentation is also available online: for fsm it is http://www.research.att.com/~fsmtools/fsm/ (you probably just care about fsmintro(1) and fsm(1)); for lextools manual pages see http://catarina.ai.uiuc.edu/L406_08/Homeworks/lextools.1.html and http://catarina.ai.uiuc.edu/L406_08/Homeworks/lextools.5.html.

Don't forget that multicharacter symbols in lextools must be enclosed in square brackets, otherwise the compiler will attempt to interpret it as a sequence of single character symbols, which would not be what you want.

The problems

  1. Use lextools to implement the Soundex algorithm. This is described in J&M's book, problem 3.5. You can also find a description here.
  2. Email handles are often formed in one of the following ways: Make up a list of thirty names (first name followed by last name), then use lextools to develop a transducer that can map from any of the above email types into one or more names in your list.

    You may ignore case, which is not distinctive for email handles anyway. You can't use space as a symbol, so use "ww" -- that's a single multi-character symbol -- (for word boundary).

  3. Implement the Latin third stem examples given in Roark & Sproat, Table 2.10, page 47.

    Your analysis should have the following properties:

    This problem will require some amount of linguistic analysis, and you may find that you will want to check some online Latin grammatical sources in order to figure out what the right analysis is.

  4. Here you have three choices:
    1. Consider the following alternations in English:

      happy happiness
      happy happier
      grammatical grammaticality
      delegate delegation
      camp encampment
      but also:
      body embodiment
      The latter should probably be handled by a context-dependent rewrite rule.

      List at least ten but preferably twenty pairs of each kind above. You may want to use the dictionary at http://catarina.ai.uiuc.edu/L406_08/Data/alldict, that you've seen before as a guide.

      Using lextools write a finite-state grammar that will decompose words on the right into sequences of affixes and words on the left. Make sure you also account for the spelling/phonological alternations. So for example, if I type "happiness", I want to see the decomposition "happy+ness". If I type "embodiment" I want "en+body+ment".

    2. If you are familiar with a language other than English, you are encouraged to instead pick an example of equal complexity in that language, and build a similar morphological analyzer. You will of course need to provide me (when you hand the homework in) with a short description of the phenomena you are describing.
    3. Alternatively, you can pick some morphological alternations from English of equal or greater complexity to the alternations I gave above.
    Extra credit: If you picked either 2 or 3 above, and would like to get extra credit for this homework (worth up to half of the points assigned to this homework), you may prepare a short (5-10 minute) presentation of what you did. In the presentation you should describe the linguistic phenomena (especially if it is from a language that you can't expect everyone else to know), and explain how you solved the problem. You should be ready with your presentation on Monday, February 11. You may do the presentation in any way you choose: using the podium machine, or on the blackboard. Let me know by email by Monday, February 11, if you plan to take this option.
  5. Another extra credit problem:

    Assume the alphabet {a, b, c} and a rewrite rule:

    a --> b / c __

    Following the discussion in Mohri & Sproat and the lecture on rule compilation, show what each of the component transducers -- i.e. r, f, replace, l1 and l2 -- look like. You would be well-advised to look at the specification of each of these machines in the numbered list on page 233 of the Mohri & Sproat paper.

    Note that you can actually use lextools to build these: you will just need to augment the label set to include the auxiliary brackets. If you do use lextools you can easily verify that the machine gives the desired result by composing the five machines together in the proper order and seeing if the resulting transducer does the right thing.

Important: How to hand in this homework.

Obviously it will be silly to hand this one in on paper (unless for some reason you have no choice).

Here's what you should do

  1. Keep all of the files associated with the four problems above four separate directories. Each directory should include:
    • the symbol (.sym) file you created for that problem
    • all the input files for lextools
    • a script (or a Makefile) that builds the transducer(s)
    • a script that demonstrates that the transducer works. For instance you might have a script that looks like the following:

      lexcompre -l mylables.lab -s "myexample" |
      fsmcompose - mytransducer.fst |
      lexfsmstrings -l mylabels.lab

      Please make sure that you don't put anything in your script that depends upon your own environment or directory structure. I need to be able to run these things and I do not want to have to edit your files.

    • For the third problem, if you choose to work on an example from a language other than English, a brief explanation of the linguistic facts being described.
  2. Tar or zip the directories up into a single file with your name. So something like myname.zip or myname.tar
  3. Send them to me:
    • I would strongly prefer you place them somewhere on a webserver and point me to them so that I can retrieve them.
    • However if that is impossible, then I will accept them as an email attachment. However, I do not want to receive the homework as an "inline" in your email. If you do that, I will return it to you and ask you to resubmit it in an acceptable format.
The homework is due by MIDNIGHT on February 11.