USC Computer Science 562 - Empirical Methods in Natural Language Processing, Fall 2006

This web page is for the Fall 2006 class. Please refer to the Fall 2007 page for current information.

Instructors: Prof. Kevin Knight and Prof. Daniel Marcu

Teaching Assistant: Jonathan May

Class Meeting Time:

Tues & Thurs 11am-12:20pm 

Class Location:


Prerequisite: CS561

Course Description

This graduate course covers the basics of statistical methods for processing human language. It is intended for:


(1) students who want to understand current natural-language processing (NLP) research,
(2) students interested in tools for building NLP applications,
(3) machine-learning students looking for large-scale application domains, and
(4) students seeking experience with probabilistic methods that can be applied to a range of AI problems.


Students will experiment with existing NLP software toolkits and write their own programs. Grades will be based on six programming assignments (12% each, 72% total) and a final project (28%); there will be no midterm or final exam.


Office hours: TBA.


Course software:

         Carmel finite-state string toolkit

         Tiburon tree automata toolkit


Aug 22

Sample NLP Application: Overview of Machine Translation

Example state-of-the-art natural language application: Machine Translation.

Aug 24

Basic linguistic theory. Words, parts-of-speech, ambiguity, morphology, phrase structure, word senses, speech. Text corpora and processing tools.

Programming Assignment 0 (no credit) out Aug 24, nothing to turn in.

Assignment 0

Aug 29, 31

Basic automata theory. Finite-state acceptors and intersection. Finite-state transducers and composition. Applications in morphology and text-to-sound conversion. Context-free grammars and parsing.

Programming Assignment 1 out Aug 31, due beginning of class Sept 7.

Assignment 1

Topic: Finite-state acceptors for natural language.
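The product construction for intersecting two finite-state acceptors, covered in these lectures, can be sketched in a few lines. The toy machines below are invented examples for illustration; the course assignments use the Carmel toolkit rather than hand-rolled code like this.

```python
# A deterministic finite-state acceptor (DFA) sketch: a dict with a start
# state, a set of accepting states, and a (state, symbol) -> state table.

def accepts(dfa, string):
    """Run the DFA on a string; accept iff it ends in an accepting state."""
    state = dfa["start"]
    for symbol in string:
        key = (state, symbol)
        if key not in dfa["trans"]:
            return False          # no transition: reject
        state = dfa["trans"][key]
    return state in dfa["accept"]

def intersect(a, b):
    """Product construction: the result accepts exactly the strings
    accepted by both machines."""
    trans = {}
    for (qa, sym), ra in a["trans"].items():
        for (qb, sym2), rb in b["trans"].items():
            if sym == sym2:
                trans[((qa, qb), sym)] = (ra, rb)
    return {"start": (a["start"], b["start"]),
            "accept": {(x, y) for x in a["accept"] for y in b["accept"]},
            "trans": trans}

# Toy machines over {a, b}: one accepts strings ending in "a",
# the other accepts strings of even length.
ends_in_a = {"start": 0, "accept": {1},
             "trans": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}}
even_len = {"start": 0, "accept": {0},
            "trans": {(0, "a"): 1, (0, "b"): 1, (1, "a"): 0, (1, "b"): 0}}

both = intersect(ends_in_a, even_len)
print(accepts(both, "ba"))   # even length and ends in "a" -> True
print(accepts(both, "a"))    # odd length -> False
```

The same product idea, with weights multiplied along matching transitions, underlies composition of weighted transducers later in the course.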

Sept 5, 7

Basic probability theory. Conditional probability, Bayes rule, estimating parameter values from data, building generative stochastic models, the noisy-channel framework. Probabilistic finite-state acceptors and transducers.
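The noisy-channel framework above reduces to simple arithmetic: choose the source word e maximizing P(f | e) P(e). A minimal sketch, with invented toy probability tables (not from any course dataset):

```python
# Noisy-channel decoding sketch: argmax_e P(f | e) * P(e).
# All probabilities below are made-up illustrative values.

channel = {                      # channel model P(f | e)
    ("maison", "house"): 0.8,
    ("maison", "home"):  0.5,
}
lm = {"house": 0.02, "home": 0.01}   # language model P(e)

def decode(f, candidates):
    """Pick the candidate e with the highest P(f | e) * P(e)."""
    return max(candidates,
               key=lambda e: channel.get((f, e), 0.0) * lm.get(e, 0.0))

# 0.8 * 0.02 = 0.016 beats 0.5 * 0.01 = 0.005, so "house" wins.
print(decode("maison", ["house", "home"]))
```

The same Bayes-rule decomposition scales up to full machine translation, with the candidate set defined implicitly by transducers rather than enumerated.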

Sept 12, 14, 19, 21

Language modeling. Estimating the frequency of English strings. Using language models to resolve ambiguities across a wide range of applications. Training and testing data. The sparse data problem. Smoothing with held-out data.

Assignment 2

Programming Assignment 2 out Sept 14, due beginning of class Sept 21.

Topic: Weighted finite-state acceptors for language modeling.
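As a rough illustration of the n-gram ideas above (the assignment itself uses weighted finite-state acceptors, and the lectures cover held-out smoothing; the simpler add-one smoothing stands in here), a bigram language model over an invented toy corpus:

```python
from collections import Counter

# Bigram language model sketch with add-one (Laplace) smoothing.
# The training sentences are an invented toy corpus.

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])                 # bigram left contexts
    bigrams.update(zip(tokens[:-1], tokens[1:]))

vocab = set(unigrams) | {"</s>"}

def p(word, prev):
    """Smoothed bigram probability P(word | prev); never zero."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

def sentence_prob(sent):
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= p(word, prev)
    return prob

# Fluent word orders score higher than scrambled ones, and smoothing
# keeps unseen strings above zero probability.
print(sentence_prob(["the", "cat", "sat"]) > sentence_prob(["sat", "the", "cat"]))
```

This is exactly the kind of score a language model contributes as P(e) in the noisy-channel setup, used to rank candidate outputs across applications.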

Sept 26, 28; Oct 3, 5

String transformations. A simple framework for stochastically modeling many types of string transformations, such as tagging word sequences with parts of speech, cleaning up misspelled word sequences, and automatically marking up names, organizations, and locations in raw text. Estimating parameter values from annotated data.

Programming Assignment 3 out Sept 28, due beginning of class Oct 5.

Eppstein k-best algorithm

Assignment 3

Topic: Weighted finite-state transducers for string transformation.
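One of the string transformations above, part-of-speech tagging, can be sketched as Viterbi decoding over an HMM. The transition and emission tables below are invented toy values, not trained parameters; in the assignment the same search is carried out by composing weighted transducers.

```python
# Viterbi decoding sketch for HMM part-of-speech tagging.
# All probabilities are invented toy values; 1e-9 stands in for
# unseen events so the max is always defined.

states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05, ("DET", "DET"): 0.05,
         ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
         ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.2}
emit = {("DET", "the"): 0.9, ("NOUN", "dog"): 0.4, ("NOUN", "barks"): 0.1,
        ("VERB", "barks"): 0.5, ("VERB", "dog"): 0.05}

def viterbi(words):
    """Return the highest-probability tag sequence for the words."""
    # chart[t][s] = (best probability of any path ending in s, backpointer)
    chart = [{s: (start[s] * emit.get((s, words[0]), 1e-9), None)
              for s in states}]
    for word in words[1:]:
        col = {}
        for s in states:
            prev = max(states,
                       key=lambda p: chart[-1][p][0] * trans.get((p, s), 1e-9))
            score = chart[-1][prev][0] * trans.get((prev, s), 1e-9)
            col[s] = (score * emit.get((s, word), 1e-9), prev)
        chart.append(col)
    # Follow backpointers from the best final state.
    tags = [max(states, key=lambda s: chart[-1][s][0])]
    for col in reversed(chart[1:]):
        tags.append(col[tags[-1]][1])
    return list(reversed(tags))

print(viterbi(["the", "dog", "barks"]))  # -> ['DET', 'NOUN', 'VERB']
```

Swapping the max for a k-best extraction is where the Eppstein algorithm linked above comes in.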

Oct 10, 12, 17, 19

Hidden parameters. Problems involving incomplete data, such as: elementary cryptanalysis, transliteration, machine translation, NL interfaces, deciphering ancient scripts. The EM algorithm.

Programming Assignment 4 out Oct 12, due beginning of class Oct 19.

Topic: Unsupervised learning of natural language structure.

Assignment 4

More EM Applications
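The EM algorithm from these lectures can be illustrated on the classic two-coin problem: each run of flips comes from one of two biased coins, but the coin identity is hidden. The flip counts and starting guesses below are invented for illustration.

```python
# EM sketch for two coins with hidden identity. E-step: compute the
# posterior probability each trial came from coin A. M-step: re-estimate
# each coin's bias from the resulting expected counts.

data = [(9, 1), (5, 5), (8, 2), (4, 6), (7, 3)]  # (heads, tails) per trial
theta_a, theta_b = 0.6, 0.5                      # initial bias guesses

for _ in range(20):
    heads_a = tails_a = heads_b = tails_b = 0.0
    for h, t in data:
        # E-step: likelihood of this trial under each coin
        like_a = theta_a**h * (1 - theta_a)**t
        like_b = theta_b**h * (1 - theta_b)**t
        w = like_a / (like_a + like_b)   # P(coin A | trial)
        heads_a += w * h; tails_a += w * t
        heads_b += (1 - w) * h; tails_b += (1 - w) * t
    # M-step: maximum-likelihood update from expected counts
    theta_a = heads_a / (heads_a + tails_a)
    theta_b = heads_b / (heads_b + tails_b)

# The two biases separate, roughly 0.8 vs 0.5, without ever
# observing which coin produced which trial.
print(round(theta_a, 2), round(theta_b, 2))
```

The decipherment and translation applications in lecture run the same expectation/maximization loop, with forward-backward style dynamic programming replacing the explicit enumeration.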

Oct 24, 26, 31

Syntactic structures, context-free grammars, parsing, lexicalized grammars, regular tree grammars, syntax-based language models, the inside-outside algorithm.

Programming Assignment 5 out Oct 26, due beginning of class Nov 2.

Topic: Modeling syntactic structure of English.

Assignment 5

Tree Lectures
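The parsing material above can be illustrated with a CKY recognizer for a context-free grammar in Chomsky normal form. The toy grammar is invented, not taken from the lectures.

```python
# CKY recognizer sketch for a CNF context-free grammar.
# Toy grammar: S -> NP VP, NP -> DET N, with preterminals in `lexical`.

lexical = {"the": {"DET"}, "dog": {"N"}, "barks": {"VP"}}
binary = {("DET", "N"): {"NP"}, ("NP", "VP"): {"S"}}

def cky(words, start="S"):
    """Return True iff the grammar derives the word sequence."""
    n = len(words)
    # chart[i][j] = set of nonterminals deriving words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical.get(w, set()))
    for span in range(2, n + 1):            # widest spans last
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # try every split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= binary.get((left, right), set())
    return start in chart[0][n]

print(cky(["the", "dog", "barks"]))   # True
print(cky(["dog", "the", "barks"]))   # False
```

Replacing set union with probability maximization (or summation) in the same chart gives probabilistic parsing and the inner half of the inside-outside algorithm.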

Nov 2, 7, 9, 14

Tree transformations and applications.

Programming Assignment 6 out Nov 9, due beginning of class Nov 16.

Topic: Modeling syntactic structure.

Assignment 6





Initial project proposal due beginning of class Nov 9.

Final project scope settled Nov 16.

Final project write-ups due Dec 12 by email.

Nov 16, 21


Nov 23

No class (Thanksgiving).

Nov 28, 30

Current research in natural language processing.