Homework 2
Due Monday, February 11
See below on how to hand in this homework.
Quick note
The links to the documentation for lextools have been given out
already. One thing to bear in mind is that the regular expression
syntax of lextools is a bit quirky when it comes to the interpretation
of scope of different operators. The basic rule to follow is, when in
doubt use parentheses to make the scope clear. So if you want
a b* (a followed by zero or more b's)
then it is safest to write this as a (b*).
Note that documentation is also available online:
for fsm it is
http://www.research.att.com/~fsmtools/fsm/ (you probably
just care about fsmintro(1) and fsm(1)); for lextools manual pages
see http://catarina.ai.uiuc.edu/L406_08/Homeworks/lextools.1.html and
http://catarina.ai.uiuc.edu/L406_08/Homeworks/lextools.5.html.
Don't forget that multicharacter symbols in lextools must be
enclosed in square brackets, otherwise the compiler will attempt to
interpret it as a sequence of single character symbols, which would
not be what you want.
The problems
-
Use lextools to implement the Soundex algorithm. This is
described in J&M's book, problem 3.5. You can also
find a description
here.
-
Email handles are often formed in one of the following ways:
-
First and last name concatenated together: johnsmith
-
First name only: john
-
Last name only: smith
-
First initial plus last name: jsmith
-
Initials: js
Make up a list of thirty names (first name followed by last
name), then use lextools to develop a transducer that can map from any
of the above email types into one or more names in your list.
You may ignore case, which is not distinctive for email handles
anyway. You can't use space as a symbol, so use "ww" -- that's a
single multi-character symbol -- (for word boundary).
-
Implement the Latin third stem examples given in Roark & Sproat,
Table 2.10, page 47.
Your analysis should have the following properties:
- There should be an FST that takes a verb stem and produces the
third stem.
- Assume that the verb stem is what you get if you remove the
infinitive ending (-are, -ere, -ire).
-
In some cases the third stem is irregular in which case you will
probably just want to have a transducer that maps between the given
verb stem and the third stem.
-
In other cases, as in first conjugation verbs like laudare,
the third stem formation is regular: in this case it just involves
adding "at" to the stem.
-
Your third stem transducer should tag the third stem with a
distinctive tag such as [3St]. The derived forms listed in the table
(perfect participle, future participle, supine ... ) should involve
rules that are sensitive to the presence of this feature.
-
An analysis of "laudaturus" would involve the following stages and
the output would look something like the following:
laud o 3StemFST o FutParticipleFST = laudaturus[FutPart]
You should union all of your individual transducers that take third
stems as input (e.g. FutParticpleFST, PerfectParticipleFST, ...)
together into a single transducer that produces all third stem-derived
forms. You can then compose that with the third stem transducer to
produce a single transducer that, given a verb stem, will produce an
appropriate set of forms.
-
For the purposes of this problem you may ignore vowel length
distinctions indicated with the macron.
This problem will require some amount of linguistic analysis, and you
may find that you will want to check some online Latin grammatical
sources in order to figure out what the right analysis is.
-
Here you have three choices:
-
Consider the following alternations in English:
| happy | happiness |
| happy | happier |
| grammatical | grammaticality |
| delegate | delegation |
| camp | encampment |
| but also: |
| body | embodiment |
The latter should probably be handled by a context-dependent rewrite rule.
List at least ten but preferably twenty pairs of each kind above. You
may want to use the dictionary at http://catarina.ai.uiuc.edu/L406_08/Data/alldict,
that you've seen before as a guide.
Using lextools write a finite-state grammar that will decompose words
on the right into sequences of affixes and words on the left. Make
sure you also account for the spelling/phonological alternations. So
for example, if I type "happiness", I want to see the decomposition
"happy+ness". If I type "embodiment" I want "en+body+ment".
-
If you are familiar with a language other than English, you are
encouraged to instead pick an example of equal complexity in that
language, and build a similar morphological analyzer. You will of
course need to provide me (when you hand the homework in) with a short
description of the phenomena you are describing.
-
Alternatively, you can pick some morphological alternations from
English of equal or greater complexity to the alternations I gave
above.
Extra credit: If you picked either 2 or 3 above, and would like
to get extra credit for this homework (worth up to half of
the points assigned to this homework), you may prepare a
short (5-10 minute) presentation of what you did. In the
presentation you should describe the linguistic phenomena (especially
if it is from a language that you can't expect everyone else to know),
and explain how you solved the problem. You should be ready with your
presentation on Monday, February 11. You may do the presentation in
any way you choose: using the podium machine, or on the
blackboard. Let me know by email by Monday, February 11, if you plan
to take this option.
- Another extra credit problem:
Assume the alphabet {a, b, c} and a rewrite rule:
a --> b / c __
Following the discussion in
Mohri &
Sproat and the
lecture
on rule compilation, show what each of the component transducers --
i.e. r, f, replace, l1
and l2 -- look like. You would be well-advised to
look at the specification of each of these machines in the numbered
list on page 233 of the Mohri & Sproat paper.
Note that you can actually use lextools to build these: you
will just need to augment the label set to include the auxiliary
brackets. If you do use lextools you can easily verify that the
machine gives the desired result by composing the five machines
together in the proper order and seeing if the resulting transducer
does the right thing.
Important: How to hand in this homework.
Obviously it will be silly to hand this one in on paper (unless for
some reason you have no choice).
Here's what you should do
-
Keep all of the files associated with the four problems above four
separate directories. Each directory should include:
-
the symbol (.sym) file you created for that problem
-
all the input files for lextools
-
a script (or a Makefile) that builds the transducer(s)
-
a script that demonstrates that the transducer works. For instance you
might have a script that looks like the following:
lexcompre -l mylables.lab -s "myexample" |
fsmcompose - mytransducer.fst |
lexfsmstrings -l mylabels.lab
Please make sure that you don't put anything in your script that
depends upon your own environment or directory structure. I need to be
able to run these things and I do not want to have to edit your files.
-
For the third problem, if you choose to work on an example from a
language other than English, a brief explanation of the linguistic
facts being described.
-
Tar or zip the directories up into a single file with your name. So
something like myname.zip or myname.tar
-
Send them to me:
- I would strongly prefer you place
them somewhere on a webserver and point me to them so that I can
retrieve them.
- However if that is impossible, then I will accept
them as an email attachment.
However, I do not want to receive the homework as an
"inline" in your email. If you do that, I will return it to you and
ask you to resubmit it in an acceptable format.
The homework is due by MIDNIGHT on February 11.