Computational Linguistics I
CMSC 723
Fall 2010
Schedule: Tue/Thr 12:30-1:45pm
Location: CSI 2107
Instructor: Hal Daume III: me AT hal3 DOT name
Office Hours: AVW 3227; TBD or by appointment
Course blog: http://cmsc723.blogspot.com/ or Atom feed
TA: Amit Goyal (office hours: 10:50-11:50 Tue/Thr, AVW 1152 [Grad Lounge])


 Background and Description

Computational linguistics (CL) is the science of doing what linguists do with language, but using computers. Natural language processing (NLP) is getting computers to do what people do with language. Despite the title, most of this course is actually about NLP, not CL! But we'll hit a lot of linguistics along the way.

This course is intended to be a broad introduction to a very broad field. We will cover both rule-based and statistical approaches to a wide variety of challenging problems in natural language processing. We will discover that language ambiguity is the rabid wolf of NLP, and develop techniques to try to tame it. Along the way, we will see some linguistic theories developed specifically for computational linguistics, which shed some light on what sort of linguistic models make sense, computationally.

Prerequisites: You must be able to program. You must find language interesting. If you cannot write breadth- and depth-first search in your programming language of choice in under an hour, you will struggle in this class. If you cannot find humor in the sentence pair "I ate spaghetti with meatballs / I ate spaghetti with a fork" then you might not enjoy the class. Linguistic background is not necessary, though of course it never hurts.
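As a rough yardstick for the programming prerequisite, here is a minimal sketch of breadth- and depth-first search in Python. The adjacency-list dictionary, the function names `bfs` and `dfs`, and the toy graph are assumptions made for illustration; they are not taken from any course assignment.

```python
from collections import deque


def bfs(graph, start):
    """Return the nodes reachable from start in breadth-first order."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def dfs(graph, start):
    """Return the nodes reachable from start in depth-first (preorder) order."""
    order = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbors in reverse so the first-listed neighbor is explored first.
        for neighbor in reversed(graph.get(node, [])):
            if neighbor not in seen:
                stack.append(neighbor)
    return order


# Hypothetical toy graph: each key maps to the list of its neighbors.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(toy_graph, "a"))
print(dfs(toy_graph, "a"))
```

The only structural difference between the two functions is the container: a queue gives breadth-first order, a stack gives depth-first order.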

 Structure of Class

I will take a slightly non-standard approach to class time. I will not spend 3 hours per week going over material that was in the readings. As a result, you should read. And you should do the short written assignments. My responsibility will be to help you understand things that are hard, and to give you an insider's view of the field. Class time will be interactive. Certain homework problems will be marked for in-class presentation, and you will do them. The rest of class time will be spent talking about issues that arise and things that I think are particularly interesting, and doing activities and/or demos.

Your responsibilities are as follows:

Given that this is a three-credit class, I expect you to spend nine hours per week working on CL stuff. Three of those hours will be in class. Of the remaining six, I expect about two to be spent reading (one hour per assignment), two to be spent on written homeworks and two to be spent on projects. If things are taking significantly more time than this, you should talk to us.


 Grading

The purpose of grading (in my mind) is to provide extra incentive for you to keep up with the material and to ensure that you exit the class as a computational linguistics genius. If everyone gets an A, that would make me happy (sadly, it hasn't happened yet). The components of grading are:
40% Programming projects
There are four programming projects, each worth 10% of your final grade. You will be graded on both code correctness and your analysis of the results. These must be completed in teams of two or three students, with cross-department (CS to linguistics) teams highly encouraged.
30% Written homeworks
There are eleven written homeworks (one per week), each worth 3% of your final grade. They will be graded on a high-pass (100%), low-pass (50%) or fail (0%) basis. These are to be completed individually. Your lowest-scoring homework will be dropped. (The initial homework, HW00, is not graded, but required if you do not want to fail.)
25% Final project
Everyone is to complete a final project, in teams of up to three. We will discuss the scope of the project later in class.
5% Class participation
You will be graded on your in-class presentations of homework questions and other general participation, including participation in the comments on the blog. This is mostly subjective.
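The arithmetic behind these weights can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every score is expressed as a fraction between 0 and 1, and the function name `final_grade` and its argument names are hypothetical, not part of any official grading script.

```python
def final_grade(project_scores, homework_scores, final_project, participation):
    """Return the weighted course grade in percent.

    Assumes every input score is a fraction in [0, 1]; homework scores
    would be 1.0 (high-pass), 0.5 (low-pass) or 0.0 (fail).
    """
    if len(project_scores) != 4:
        raise ValueError("expected four programming projects")
    if len(homework_scores) != 11:
        raise ValueError("expected eleven graded homeworks")

    # Four projects at 10% each: 40% in total.
    projects = 10.0 * sum(project_scores)

    # Eleven homeworks at 3% each with the lowest one dropped: 30% in total.
    kept = sorted(homework_scores)[1:]
    homeworks = 3.0 * sum(kept)

    # Final project is 25%, participation is 5%.
    return projects + homeworks + 25.0 * final_project + 5.0 * participation


# Hypothetical student: one half-credit project, one failed homework
# (which is the one that gets dropped), 80% on the final project.
example = final_grade(
    project_scores=[1.0, 1.0, 1.0, 0.5],
    homework_scores=[1.0] * 10 + [0.0],
    final_project=0.8,
    participation=1.0,
)
print(example)
```

Because the lowest homework is dropped, ten homeworks at 3% each account for the full 30%.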

Important note for MS-comp students: Your grading will be different, as required by departmental policy. You will have a midterm and final exam, both take-home. You will not have a final project. Your final exam will be worth 30%, and the midterm worth 20%. Of the remaining 50%, 2% will be for participation, 28% for projects (7% each) and 20% for homeworks (2% each).

Late homeworks are not allowed (without prior approval). This is because I need to put solutions up on the web page. You may hand any project in up to 48 hours late; however, once it is late by one minute, your final score will be halved.

We will post notes on the blog when assignments have been graded. If you handed something in and do not get a score for an assignment, you have a one-week moratorium on complaints.


 Textbooks

The textbook is the new-ish book by Jurafsky and Martin, Speech and Language Processing (Second Edition) (ISBN 978-0-13-605234-0).

Other recommended (but not required) books:

 Schedule (tentative)

The following schedule is subject to change, but likely not by very much. The readings listed are those you should have finished by that date (all from Jurafsky+Martin unless otherwise noted). Everything is due by 12:20pm on the date listed on the schedule. Written assignments are to be submitted in PDF format.

Date | Topics | Readings | Due | Notes
31 Aug | Welcome to Computational Linguistics: What is this class about, linguistic phenomena | - | - | blog
02 Sep | History and Approaches: Initial attempts, ALPAC, statistics and data | 1-1.6 | HW00 | blog
Nuts and Bolts
07 Sep | Regular Languages: Finite state machines and morphology | 2-2.2, 3-3.3 | - | blog
09 Sep | Probability and Statistics: A refresher, with a language focus | 4-4.3, 4.10-4.11 | HW01 | blog
14 Sep | N-gram models: Language modeling and smoothing | 4.4-4.6 | - | blog
16 Sep | Part of Speech Tagging: Rule-based approaches | 5.1-5.4 | HW02 | blog
21 Sep | Part of Speech Tagging II: Hidden Markov Models and the Viterbi algorithm | 5.5, 5.8 | - | -
23 Sep | Context Free Grammars: Expressivity, X-bar theory and parsing as search | 12-12.3, 12.5, | HW03 | blog
28 Sep | Context Free Grammars II: Dynamic programming and the CKY algorithm | 13.4, X-bar_theory | - | blog
30 Sep | No class: finish up P1 (Deadline extended 2 hours) | - | P1 | -
05 Oct | No class: Hal sick (again) :( | - | HW04 | -
07 Oct | Statistical Parsing: From treebanks to grammars, and Markovization | 12.4, 14-14.4 | - | blog
12 Oct | Incorporating Context: Feature-based grammars, unification | 15-15.4 | - | blog
14 Oct | Representing Meaning: First-order logic | 17-17.3, 18.4 | HW05 | blog
19 Oct | Interpreting Text: Interpretation as abduction | abduct (sec 1-3) | P2 | blog
21 Oct | Linguistic Challenges: Metaphor, metonymy, time, scope, quantifiers, etc. | 17.4, 18.3, 18.6, 19.6 | HW06 | blog
26 Oct | Computational Lexical Semantics: Word sense disambiguation + midterm | 19-19.3 | - | blog
28 Oct | Computational Semantics: Semantic roles and frames | 19.4-19.5 | HW07 | blog
Machine Learning for NLP
02 Nov | Classification with Decision Trees: Learning, generalization and features | dt | - | blog
04 Nov | Linear Models for Learning: Perceptron learning for sentiments | Perceptron | HW08 | blog
09 Nov | Sequential Learning: Named entity recognition | 22.1 | - | blog
11 Nov | Using World Knowledge: Bootstrapping knowledge from text | 20.5, boot | HW09 | blog
Higher-level Structure
16 Nov | Local Discourse Context: Anaphors, antecedents and coreference | 21.3-21.7 | - | blog
18 Nov | Document Coherence: TextTiling and argumentative zoning | 21-21.1, zone | HW10 | blog
23 Nov | Hierarchical Text Structure: Rhetorical structure theory and the discourse treebank | 21.2, discourse | P3 | blog
30 Nov | Information Extraction | 22.2, 22.4 | HW11 | blog
02 Dec | Mapping Text to Actions | mapping | - | blog
07 Dec | Machine Translation | 25-25.5 | - | -
09 Dec | Automatic Document Summarization | 23.3-23.6 | P4 | -
17 Dec | Final Exam and Final Projects Due | - | - | -

 Homework Assignments

All written homeworks are due on Thursday. See the schedule above for due dates. You may hand in your homework/projects here. You're free to use the LaTeX source in any way you want, but you'll need haldefs.sty and notes.sty to build it.

Written Homeworks

Programming Projects

 Final Project


 Useful Links

This course has been taught at UMD (though not by me) in the past: Fall 2009, Fall 2008 and Fall 2007.

 Course Policies

Cheating: Any assignment or exam that is handed in must be your own work. However, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important. If you copy someone else's solution, you are cheating. If you let someone else copy your solution, you are cheating. If someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words, and based on your own understanding of the solution. If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class, in the book, and general issues relevant to the assignments. When taking an exam, you must work independently. Any collaboration during an exam will be considered cheating. Any student who is caught cheating will be given an E in the course and referred to the University Student Behavior Committee. Please don't take that chance - if you're having trouble understanding the material, please let us know and we will be more than happy to help.

ADA: Any student eligible for and requesting reasonable academic accommodations due to a disability is requested to provide, to the instructor in office hours, a letter of accommodation from the Office of Disability Support Services (DSS) within the first two weeks of the semester. You may reach them at 301-314-7682 or by visiting Susquehanna Hall on the 4th Floor.

College guidelines: Document concerning adding, dropping, etc. here.