Homework 3

Due Tuesday, September 26


This homework revolves around a dataset of phonetically transcribed Mandarin speech, stored as 424 files in the directory /home/rws/Data/duration.

Each file contains the transcription of one utterance. The following is an example of a portion of one utterance:

** wo3 hai2_shi4 xiao4 , xiao4_de0 zi4_ji3 dou1 you5_die3_r2 bu4_hao3_yi4_si0_le0 , kan4 ta1 hen3 you1_xin1_chong1_chong1_de0 muo2_yang4 , you4 ren3_bu2_zhu4 xiao4 
        
          si             1.824
wo3     
           w          1.937583
         3_o         2.0480001
hai2_shi4       
           h           2.18325
         2_I         2.2945831
           S         2.3939171
         4_%         2.4460831
xiao4   
           x          2.576833
           y         2.6087501
         4_W         2.6954169
,       
           }         3.6730001

The first line is the pinyin transliteration of the utterance.

Subsequent lines give a word-by-word phonetic transcription. Each word begins with a line giving the word in pinyin transliteration. Following this are the word's phonetic segments (in an ASCII-based phonetic transcription system), one per line, each with its end time in seconds.
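
As a starting point, here is a minimal sketch of how these line types might be told apart in awk. It assumes, based only on the sample above, that a segment line has exactly two whitespace-separated fields (symbol and end time) and a word line has exactly one; you should check that assumption against the real files. The function name classify_lines is purely illustrative.

```sh
# Hypothetical helper: label each line of a transcription file by type.
# Assumptions (from the sample only): line 1 of each file is the pinyin line,
# segment lines have two fields, word lines have one, other lines are blank.
classify_lines() {
  awk '
    FNR == 1 { print "pinyin"; next }           # first line of each file
    NF == 2  { print "segment", $1, $2; next }  # symbol and end time
    NF == 1  { print "word", $1; next }         # pinyin word (or punctuation)
  ' "$@"
}
```

Run as `classify_lines /home/rws/Data/duration/*`, it would label every line in the database; with no arguments it reads standard input.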

For example, the first word is wo3 (`I'), consisting of two segments, "w" and "o". The "w" ends 1.94 seconds into the utterance, and the "o" ends 2.05 seconds into the utterance. The prefix "3_" on the "o" marks the third (low) tone.
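
The tone prefix can be separated from the segment symbol with a few string operations. The sketch below is one possible way to do it, not the required one. It assumes that a tone mark is always a single digit followed by an underscore at the start of the symbol, which is what the sample shows but is not guaranteed here. The function name split_tone and the "-" placeholder for toneless segments are illustrative choices.

```sh
# Hypothetical helper: print each segment symbol without its tone prefix,
# followed by the tone digit (or "-" if the segment carries no tone mark).
# Assumption: segment lines have two fields; tone marks look like "3_".
split_tone() {
  awk '
    FNR > 1 && NF == 2 {
      seg = $1
      tone = "-"
      if (seg ~ /^[0-9]_/) {       # tone prefix present, e.g. "3_o"
        tone = substr(seg, 1, 1)   # the digit
        seg  = substr(seg, 3)      # everything after the underscore
      }
      print seg, tone
    }
  ' "$@"
}
```

On the sample above, the line for "3_o" would come out as "o 3" and the line for "w" as "w -".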

Here are some of the transcription conventions you'll need to know:


Problems:

  1. Write an awk script to produce a list of all the segment symbols used in this database. Don't forget to remove the tone marks: I want a list of segments without tone.
  2. Write an awk script to compute the average duration for all segments in the database. This will necessarily involve remembering what the previous end time was, and computing the difference with the current segment's end time.
  3. Write an awk script to compute the average duration for all stop closures in the database.
  4. Write an awk script to compute the average duration for all vowels in the database.
  5. What is the most common tone mark? Show a script that computes this.
  6. What is the most common vowel? Show a script that computes this. Again, don't forget to remove the tone marks.
  7. Extra Credit Problem. Compute the average duration for utterance-final versus non-utterance-final vowels. Is there a difference? Show the script that computes this.