Tools and Techniques for Speech and Language Processing

Richard Sproat

Fall 2006

MW 4-5:20, FLB G27

Office Hours: Wednesdays 11-12:30, BI 2057


Overview

This course introduces basic computing and programming, with a particular view to the kinds of skills one needs to deal with data from language and speech in a unix/linux operating environment. It is intended for students who have no prior computing background.

The course is intended to serve two groups of people:

  1. People who have had no prior computing experience who wish to pursue courses in Computational Linguistics starting with LING 406 and moving on to LING 506.
  2. People who have no particular interest in the computational track, but who nonetheless have the need to process linguistic data.

At the end of this course, you will know more than enough to implement all of the things asked for in this posting on the Linguist list from a couple of years ago:

I'd like recommendations for software (preferably inexpensive) for a research project looking at fifth grade student writing. Most of the analysis will be quite basic: number of total words/types/tokens, word lists, and so on.
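As a rough illustration of the kind of analysis the posting asks for, here is a minimal Python sketch that counts tokens and types and builds a frequency-ordered word list. The function name `word_stats` and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for this example, not part of the course materials; real student writing would likely need a more careful definition of "word".

```python
import re
from collections import Counter


def word_stats(text):
    """Return (number of tokens, number of types, frequency list) for a text."""
    # Assumption: a word is a run of letters or apostrophes, compared in lowercase.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # most_common() gives (word, count) pairs, most frequent first.
    return len(tokens), len(counts), counts.most_common()


sample = "The cat saw the dog. The dog ran."
n_tokens, n_types, word_list = word_stats(sample)
print(n_tokens, n_types)   # 8 tokens, 5 types
print(word_list[:2])       # [('the', 3), ('dog', 2)]
```

The same counts can be produced with a short pipeline of standard unix text tools, which is closer to how the first half of the course approaches the problem.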

While this course will focus on skills needed to deal with linguistic data, I am not billing this as a course in "Computing for Linguists". Courses and books so billed do exist, but it's an odd concept, since much of what one needs to learn in order to work efficiently consists of general computing skills that one would need in any field. So the focus of the course will be on developing these skills. But the data, where practical, will be of a kind that is likely to be of interest to linguists.

This is not primarily a lecture course, but a lab course. Each week will be devoted to a topic. I will generally lecture on this topic on Mondays. The homeworks will be fairly extensive, and become more so as the course progresses. The Wednesday class will primarily be a lab session where you can work on your homework and ask questions as needed. Homeworks will generally be due on Fridays (all homework will be submitted electronically). The first homework explains how to hand in your homework assignments.

Syllabus

Week | Topic | Reading | Homeworks
1: 8/23 | Introduction to Unix, Getting Started | Fiamingo et al. Ch. 1--3 | HW 1
2: 8/28, 8/30 | Resources and Shells; Text Processing Tools; awk (see notes below) | Fiamingo et al. Ch. 4--6; Fiamingo et al., Ch. 7 (see notes below) | -
3: 9/6 | Awk and text processing tools | - | HW 2
4: 9/11, 9/13 | More on Awk: General programming concepts. See the relevant sections of the Awk manual for descriptions of concepts like variables, assignment operators, increment operators, associative arrays, and multidimensional arrays. In class we will build a concordancing program. | - | HW 3
5: 9/18, 9/20 | Further Commands/Shell Programming | Fiamingo et al., Ch. 8--9 | -
6: 9/25, 9/27 | Introduction to Python, Types and Operations | Lutz & Ascher, Part I, Part II, Chs. 4-5 | -
7: 10/2, 10/4 | Types and Operations; Statements and Syntax | Lutz & Ascher, Part II, Chs. 6-7; Part III | HW 4
8: 10/9, 10/11 | Functions. See also the Regular Expression HOWTO | Lutz & Ascher, Part IV | HW 5
9: 10/16, 10/18 | Functions | Lutz & Ascher, Part IV | -
10: 10/23, 10/25 | Modules | Lutz & Ascher, Part V | HW 6
11: 10/30, 11/1 | Object-Oriented Programming | Lutz & Ascher, Part VI | -
12: 11/6, 11/8 | Object-Oriented Programming | Lutz & Ascher, Part VI | HW 7
13: 11/13, 11/15 | Unicode | - | HW 8
Thanksgiving Break
14: 11/27, 11/29 | TBA | TBA | Final
15: 12/4, 12/6 | TBA | TBA | -

Notes for week 2:
  • For awk a good place to start is with the one-liners.
  • I suggest that now would be a good time to order the Python book. There is a 37% discount at Amazon.com.
  • Ignore the stuff on printing in Fiamingo et al. Ch. 4.
  • Ignore the stuff on C-shell in Fiamingo et al. Ch. 5: we are using bash, which is a descendant of the "Bourne shell" described in Ch. 5. (Bash stands for "Bourne-again shell".)
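
To give a sense of what the concordancing program mentioned for week 4 does, here is a minimal keyword-in-context sketch. It is written in Python rather than awk, and it is not the program built in class: the function name `concordance`, the `width` parameter, and the whitespace-based word splitting are all assumptions made for this illustration.

```python
def concordance(text, keyword, width=2):
    """Return keyword-in-context lines with `width` words on each side of a hit."""
    # Assumption: words are whitespace-separated; surrounding punctuation
    # is ignored only when comparing a word against the keyword.
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        if word.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


for line in concordance("The cat saw the dog. The dog ran away.", "dog"):
    print(line)
# saw the [dog.] The dog
# dog. The [dog] ran away.
```

An awk version would follow the same idea, holding the words of each line in an array and printing the neighbours of every match.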

Course Requirements

There are two texts for this course:
The first (Fiamingo et al.) is a general introduction to the unix/linux operating system, including basics such as the shell, basic commands, and commonly used tools such as sed and awk. We will supplement this reading with resources from the web. This material will be covered in the first half of the course.

The second book (Lutz & Ascher), which we will start in the second half of the course, is an introduction to the Python scripting/programming language. The advantage of Python over many other programming languages is that there is a considerably smaller "startup" cost associated with learning Python than there is with almost any other language. Yet at the same time it is a fully functional object-oriented language that has many of the same constructs as other languages like Java or C++.

The grade in this course will be determined by:

Online Resources

The following are some of the many online resources on unix, linux,