LING 270: Second Homework for Unit 9

Available: Monday April 14

Due: Monday April 21

Consider the following data, namely a set of word unigrams and bigrams from a familiar text:

Note that <s> is the beginning-of-sentence marker and </s> is the end-of-sentence marker.

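The conditional probability P(w2|w1) can be computed as C(w1w2)/C(w1), where C(w1w2) is the number of times the bigram w1 w2 occurs and C(w1) is the number of times the unigram w1 occurs.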
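As an illustration of the count ratio above, here is a minimal Python sketch. The two-sentence corpus is a made-up toy example, not the assignment data, and the names `sentences`, `unigrams`, `bigrams`, and `cond_prob` are hypothetical helpers chosen for this sketch.

```python
from collections import Counter

# Toy corpus for illustration only; this is NOT the assignment data.
sentences = [["i", "am", "sam"], ["sam", "i", "am"]]

unigrams = Counter()
bigrams = Counter()
for words in sentences:
    # Surround each sentence with the boundary markers before counting.
    tokens = ["<s>"] + words + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))


def cond_prob(w1, w2):
    """Estimate P(w2 | w1) as C(w1 w2) / C(w1), with no smoothing."""
    if unigrams[w1] == 0:
        raise ValueError(f"no occurrences of {w1!r} in the counts")
    return bigrams[(w1, w2)] / unigrams[w1]


print(cond_prob("<s>", "i"))  # C(<s> i) = 1, C(<s>) = 2
print(cond_prob("i", "am"))   # C(i am) = 2, C(i) = 2
```

A bigram that never occurs gets probability 0 under this unsmoothed estimate, which matters when such estimates are multiplied together.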

  1. (20 points)
    What are the conditional probabilities for the following pairs of words? In each case, compute the conditional probability of the second word given the first:
    1. see </s>
    2. sam you
    3. will try
    4. will see
    5. would eat
    6. train </s>
    7. like them
    8. <s> do
    9. ham </s>
    10. not with
  2. (6 points)
    Which of these conditional probability estimates do you trust most to be accurate? Why?
  3. (24 points)
    As we discussed in class, with an n-gram language model you compute the probability of a sentence by multiplying the conditional probabilities of its n-grams. For a bigram model:

    P(<s> w1 w2 w3 </s>) = P(w1 | <s>) * P(w2 | w1) * P(w3 | w2) * P(</s> | w3)

    What probability would a bigram language model based on the data above assign to each of the following sentences? Assume that in each case the sentence is surrounded by the beginning-of-sentence and end-of-sentence markers:

    1. would you eat them with a fox
    2. would you eat them in a box
    3. i would not eat them anywhere
    4. i do so like them sam i am
    5. i do so like them sam-i-am
    6. i would not eat them in a fox

    Your answer should be a floating-point number in each case. Show how you calculated it, so that if you get the answer wrong, I will at least know that you understand how to calculate the probability and simply made a clerical error.
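
The product formula in question 3 can be sketched in Python as follows. The counts below are invented toy values, not the assignment data, and `unigram_counts`, `bigram_counts`, and `sentence_prob` are hypothetical names for this sketch. It assumes an unsmoothed model, so any unseen bigram makes the whole sentence probability 0.

```python
# Toy counts for illustration only; these are NOT the assignment data.
unigram_counts = {"<s>": 2, "i": 2, "am": 2, "sam": 2}
bigram_counts = {
    ("<s>", "i"): 1,
    ("<s>", "sam"): 1,
    ("i", "am"): 2,
    ("am", "sam"): 1,
    ("am", "</s>"): 1,
    ("sam", "i"): 1,
    ("sam", "</s>"): 1,
}


def sentence_prob(words):
    """Multiply P(w2 | w1) = C(w1 w2) / C(w1) over every bigram in the sentence."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        context_count = unigram_counts.get(w1, 0)
        if context_count == 0:
            # Assumption: an unseen context word gives probability 0 (no smoothing).
            return 0.0
        prob *= bigram_counts.get((w1, w2), 0) / context_count
    return prob


# P(i|<s>) * P(am|i) * P(sam|am) * P(</s>|sam) = 0.5 * 1.0 * 0.5 * 0.5
print(sentence_prob(["i", "am", "sam"]))  # 0.125
```

Writing out each factor, as in the final comment, is the kind of worked calculation that makes a clerical slip easy to spot.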