LING 406: Take-home Midterm

Due Wednesday, March 12, 5:30 PM

Introduction

All of the examples in this homework will relate in one way or another to the syntax, morphology and corpus statistics of a mystery language. We start with some background on the morphology and syntax.

Morphology and phonology

The morphology of interest involves nouns and verbs.

Noun Morphology

There are three noun classes, conventionally known as n1, n2, and n3, four cases (nominative, accusative, genitive, dative) and two numbers (singular and plural). The following tables illustrate the paradigms:

CLASS N1: "baflopub":

Nom: baflopub        Singular
Gen: baflopubler     
Dat: baflopupte      
Acc: baflopubzo      

Nom: baflopupte      Plural
Gen: baflopupser     
Dat: baflopubne      
Acc: baflopubve      


CLASS N2: "bait":

Nom: baita           Singular
Gen: baitlar 
Dat: baitta  
Acc: baitsu  

Nom: baitta          Plural
Gen: baitsar 
Dat: baitna  
Acc: baitva  

CLASS N3: "beeb":

Nom: beebata         Singular
Gen: beebatalar
Dat: beebata 
Acc: beebatsu

Nom: beebata         Plural
Gen: beebasar
Dat: beebanna
Acc: beebatva
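
As a rough illustration only (this is not a lextools grammar), the CLASS N2 table above can be reproduced by attaching one suffix per case/number cell to the stem. The names N2_SUFFIXES, n2_paradigm and n2_analyses are hypothetical, the suffixes are simply read off the "bait" table, and the sketch assumes plain concatenation with no phonological alternations, so it should not be expected to carry over to N1 or N3 stems as is.

```python
# Suffixes read off the CLASS N2 table for "bait". Assumption: plain
# concatenation, with no phonological alternations applied.
N2_SUFFIXES = {
    ("nom", "sg"): "a",
    ("gen", "sg"): "lar",
    ("dat", "sg"): "ta",
    ("acc", "sg"): "su",
    ("nom", "pl"): "ta",
    ("gen", "pl"): "sar",
    ("dat", "pl"): "na",
    ("acc", "pl"): "va",
}


def n2_paradigm(stem):
    """Return {(case, number): surface form} for a hypothetical N2 stem."""
    return {cell: stem + suffix for cell, suffix in N2_SUFFIXES.items()}


def n2_analyses(stem, surface):
    """Return every (case, number) cell whose form equals `surface`."""
    return [cell for cell, form in n2_paradigm(stem).items() if form == surface]


if __name__ == "__main__":
    for (case, number), form in n2_paradigm("bait").items():
        print(f"{form}\t{case}+{number}")
    # "baitta" fills two cells of the table, so two analyses come back.
    print(n2_analyses("bait", "baitta"))
```

Going from a surface form back to all matching cells, as n2_analyses does, is what makes syncretic forms such as "baitta" (dative singular and nominative plural) show up with more than one analysis.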

Verb Morphology

Verbs are divided into three classes depending upon how many arguments they take. This determines agreement patterns. There is either subject agreement (v1), subject+direct-object agreement (v2) or subject+direct-object+indirect-object agreement (v3).

There are also two aspects marked on the verb: perfect and imperfect.

Here are some sample verbs for each of the classes:

Class v1, baisun:

imperfect:   baisunnime
perfect:     baisunme

Class v2, baetog:

imperfect:   levaetognotime  
perfect:     levaetokteme    

Class v3, mynfox:

imperfect:   televynfoxnalime
perfect:     televynfoxlame  

Phonological Alternations

You should be aware of two phonological processes:

Syntax

The grammar of this fragment of the language is fairly simple. Verbs are final. If there is a direct object, it comes right before the verb. If there is an indirect object, it comes before the direct object. Noun phrases may contain a possessive noun marked in the genitive, in which case the genitive-marked noun comes after the head noun of the phrase. Note that subjects are nominative, direct objects accusative, indirect objects dative, and possessives genitive.

An example parsed structure is as follows:


[S zapwoata/NP-nom [VP3 bepsuessogata/NP-dat [VPX [NP-acc tribxwibblaazo/NP-acc xlyatlozuler/NP-gen] telesukplognalime/V3]]]

Note that if a non-terminal node dominates just a single word, I mark it as "word/cat".
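
The word order described above can be sketched as a handful of binary rules, and a CYK-style chart that stores backpointers is enough to print a structure in the bracketed notation. This is a minimal sketch under several assumptions: the rules for V3 are inferred from the single example structure, the V1 and V2 rules and the VP2 label are guesses, possessors are attached only to nominative, accusative and dative heads, and each word is assumed to arrive already tagged with its category (for instance from a morphological analyzer). The names RULES, cyk_parse and build are hypothetical.

```python
# Binary rules inferred from the example structure. The V1 and V2 rules and
# the VP2 label are assumptions; only the V3 pattern appears in the text.
RULES = {
    ("NP-nom", "V1"): "S",
    ("NP-nom", "VP2"): "S",
    ("NP-nom", "VP3"): "S",
    ("NP-acc", "V2"): "VP2",
    ("NP-dat", "VPX"): "VP3",
    ("NP-acc", "V3"): "VPX",
}
# A head noun may be followed by a genitive possessor (nested possessors
# are not handled in this sketch).
for case in ("nom", "acc", "dat"):
    RULES[(f"NP-{case}", "NP-gen")] = f"NP-{case}"


def cyk_parse(tagged):
    """tagged: list of (word, category). Returns a bracketed parse or None."""
    n = len(tagged)
    # chart[i][j] maps a label over tokens i..j-1 to its backpointer:
    # None for a single word, otherwise (split, left_label, right_label).
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, (_, cat) in enumerate(tagged):
        chart[i][i + 1][cat] = None
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        parent = RULES.get((left, right))
                        # Keep only the first derivation found per label.
                        if parent is not None and parent not in chart[i][j]:
                            chart[i][j][parent] = (k, left, right)
    if "S" not in chart[0][n]:
        return None
    return build(chart, tagged, 0, n, "S")


def build(chart, tagged, i, j, label):
    """Follow backpointers to rebuild the bracketed structure."""
    back = chart[i][j][label]
    if back is None:
        return f"{tagged[i][0]}/{label}"
    k, left, right = back
    left_str = build(chart, tagged, i, k, left)
    right_str = build(chart, tagged, k, j, right)
    return f"[{label} {left_str} {right_str}]"


EXAMPLE = [
    ("zapwoata", "NP-nom"),
    ("bepsuessogata", "NP-dat"),
    ("tribxwibblaazo", "NP-acc"),
    ("xlyatlozuler", "NP-gen"),
    ("telesukplognalime", "V3"),
]

if __name__ == "__main__":
    print(cyk_parse(EXAMPLE))
```

Storing the split point and the two child labels in each chart cell is what allows the structure to be rebuilt afterwards; a chart that records only which labels are present can accept or reject a sentence but cannot print its parse.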

Problems

Please follow these directions very carefully. If there are questions about what something means, ask.
  1. Here you will find a list of nouns and verbs and their morphological analyses. Build a grammar that allows you to exactly match what I have here.

    In what you turn in, include your grammar, and a demonstration that this does indeed match what I gave you. You will probably want to use lextools for this, but you don't have to: if you want to do this in some programming language where you program in the morphology, that is also fine. But see the important caveat below.

    Important caveat: you will only be able to do the problems below if you do this in a way that allows you to handle previously unseen cases. So simply compiling the list I give you here is not going to work. You are strongly advised to set it up so that if I give you a word and its category, you can compose it with your morphological analyzer and produce all legal surface forms and their features.

  2. Here you will find a set of sentences and their associated syntactic structures.

    You will augment the CYK program you wrote for Homework 4 to include the backpointers, so that you can actually recover the structure.

    Then:

  3. On Tuesday morning, March 11, at around 10:00, a test corpus will appear containing a set of words and a set of sentences, both unanalyzed. For the sentences, I give you all the base forms of the words and their categories (n1, n2, n3, v1, v2, v3), but you have to use your morphological grammar to produce the forms that you'll actually see in the sentences. Note that the words for which you will have to produce morphological analyses will all be forms of the base forms I give you for the sentences.

    So to clarify, here are the names of the files you will be getting and what they are:

    For the set of words, produce all possible analyses using your morphological analyzers. Arrange the analyses in the form:

    word1     analysis1
    word1     analysis2
      .         .
      .         .
      .         .
    word2     analysis1
    word2     analysis2
      .         .
      .         .
    
    Note that this is exactly the same format as you had here.

    For the sentences, produce the legal parse for each sentence. Note that all sentences will be accepted by the grammar (assuming you did the grammar correctly). Your output should be in the form:

    sentence1
    output1
    sentence2
    output2
    
    as in the example in problem 2.
  4. Here is a corpus of about 90,000 words from this language and here is a comparably sized corpus from English. Compute the Good-Turing estimate, n1/N, for both corpora and answer the following three questions:
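
For reference, in the estimate n1/N named in problem 4, n1 is the number of word types that occur exactly once in the corpus and N is the total number of tokens. A minimal sketch of the computation follows; the function names are hypothetical, and whitespace tokenization is an assumption, since the corpus format is not described here.

```python
from collections import Counter


def good_turing_unseen_mass(tokens):
    """Return n1/N: types seen exactly once, divided by the token count."""
    counts = Counter(tokens)
    n1 = sum(1 for c in counts.values() if c == 1)
    total = sum(counts.values())
    return n1 / total if total else 0.0


def read_tokens(path):
    """Read a corpus file; splitting on whitespace is an assumption."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()


if __name__ == "__main__":
    # Toy input: "b" and "c" occur once, so n1 = 2 and N = 7.
    print(good_turing_unseen_mass("a b a c d d d".split()))
```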

As before:

Once you have done the problems, I want you to create a directory called "midterm_youremailhandle" (where "youremailhandle" should be replaced with your email handle). Put all files in subdirectories corresponding to each of the problems above. Then tar it up as before:

 tar -cvf midterm_youremailhandle.tar midterm_youremailhandle 
and post it somewhere where I can find it.