Computational Linguistics I
CMSC 723
Fall 2012
Schedule: Tue/Thu 9:30-10:45am
Location: CSIC 2107
Instructor: Hal Daume III: me AT hal3 DOT name
Office Hours: AVW 3227; Mon 11a-noon or by appointment
Q/A: Piazza:CMSC 723
TA: Greg Sanders (office hours: AVW 3126 [knock], Tue 2p-3:30p)


 Background and Description

Computational linguistics (CL) is the science of doing what linguists do with language, but using computers. Natural language processing (NLP) is the engineering discipline of doing what people do with language, but using computers. Despite the title, most of this course is actually about NLP (not CL), but we'll still do CL along the way.

CL and NLP are broad fields and we cannot possibly cover everything. We will cover both rule-based and statistical approaches to a wide variety of challenging problems in natural language processing and computational linguistics. We will discover that language ambiguity is the rabid wolf of NLP, and develop techniques to try to tame it. Along the way, we will see some linguistic theories developed specifically for computational linguistics, which shed some light on what sort of linguistic models make sense, computationally.

Prerequisites: You must be able to program. You must find language interesting. Anyone who has taken an undergrad AI course, a machine learning course, an algorithms course, or LING 689/889 (Computational Psycholinguistics) should be able to do well in this course. That said, it is also a prerequisite that you be willing to work hard and catch up on things you don't know on your own. In particular, the following are considered background material and I will not cover them (though you must know their contents): Unix for Poets and very basic prob/stats and slightly less basic stats.

The official textbook is:
Speech and Language Processing (Second Edition)
by Dan Jurafsky and James Martin
(ISBN 978-0-13-605234-0)

Other recommended (but not required) books:


 Structure of Class

I like the view that Ben Shneiderman has of teachers: a guide by your side, not a sage on a stage. At most half of class time will be "lecture-ish," which means you must read. We will spend the rest of class time doing exercises, working interactively on projects (mostly on Tuesdays), and having you present homework solutions (on Thursdays). Students are encouraged to bring laptops to class: in fact, if you do not have one and would like one (during class), let me know and I'll arrange to have some supplied by the CS department.

Your responsibilities are as follows:

Given that this is a three credit class, I expect you to spend nine hours per week working on CL stuff. Three of those hours will be in class. Of the remaining six, I expect about two to be spent reading (one hour per assignment), one to be spent on written homeworks and three to be spent on projects. If things are taking significantly more time than this, you should talk to us to see if we need to adjust or if there's some key background piece we've incorrectly assumed.


 Grading

The purpose of grading (in my mind) is to provide extra incentive for you to keep up with the material and to ensure that you exit the class as a computational linguistics genius. If everyone gets an A, that would make me happy (sadly, it hasn't happened yet). The components of grading are:
20% Written homeworks
There are eleven written homeworks (roughly one per week). Your lowest score will be dropped, and each of the remaining 10 is worth 2% of your final grade. They are graded on a high-pass (100%), low-pass (50%) or fail (0%) basis. These are to be completed individually. (The initial homework, HW00, is not graded, but it is required if you do not want to fail.)
30% Programming projects
There are three programming projects, each worth 10% of your final grade. You will be graded on both code correctness and your analysis of the results. These must be completed in teams of two or three students, with cross-department (e.g., CS to linguistics) teams highly encouraged.
20% Midterm exam
There will be an in-class "midterm" exam in early November.
30% Final exam
The final exam is a take-home practical project of your choosing (but you must clear it with me). During the final exam slot, you will give brief presentations of your work (probably poster presentations; this is TBD).
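
As a rough illustration of how the components above combine, here is a minimal Python sketch. It is not the official grade calculator. The function names (homework_component, course_score) are hypothetical, and the sketch assumes that the eleven graded homeworks are HW01 through HW11 and that projects, the midterm and the final are each scored as a percentage from 0 to 100.

```python
# Homework marks map to the percentages given in the grading breakdown.
HOMEWORK_POINTS = {"high": 100.0, "low": 50.0, "fail": 0.0}


def homework_component(marks):
    """Return the written-homework contribution (0 to 20 points).

    marks: list of eleven strings, each "high", "low" or "fail".
    The single lowest mark is dropped; each remaining one is worth 2%.
    """
    scores = sorted(HOMEWORK_POINTS[m] for m in marks)
    kept = scores[1:]  # drop the lowest score
    return sum(s / 100.0 * 2.0 for s in kept)


def course_score(marks, project_scores, midterm, final):
    """Combine all components into a 0 to 100 course score.

    project_scores: three percentages, each project worth 10%.
    midterm, final: percentages, worth 20% and 30% respectively.
    (Assumption: every component is reported on a 0-100 scale.)
    """
    projects = sum(p / 100.0 * 10.0 for p in project_scores)
    return homework_component(marks) + projects + 0.20 * midterm + 0.30 * final


if __name__ == "__main__":
    # Hypothetical student: nine high-passes, one low-pass, one fail.
    marks = ["high"] * 9 + ["low", "fail"]
    print(course_score(marks, [90, 80, 100], midterm=85, final=90))
```

In this hypothetical case the failed homework is the one dropped, so the low-pass still counts for half of its 2%.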

Late homeworks are not allowed. Period. No exceptions. The time deadlines are automatic and unforgiving. Late projects are allowed: you get two extra days. However, once the project is 1 minute late, you lose 25% (absolute). We will post notes on Piazza when assignments have been graded. If you handed something in and do not get a score for an assignment, you have a one week moratorium on complaints.

Your overall grade in the class will be based on the following scale: 90+ (A), 80+ (B), 70+ (C), 60+ (D). If you're in the "012" range (e.g., 90-92) then you'll get a "minus"; if you're in the "789" range (e.g., 87-89) you'll get a "plus." These letter grades are lower bounds: I may adjust them up, but will not adjust them down. You can view your grades on grades.cs for individual assignments. To compute your overall course grade, view your grades table, save the HTML of the page in your browser (e.g., to "hal.html"), and run "perl mygrade.pl < hal.html" to get your grade. You can download mygrade.txt here, which you'll have to rename to "mygrade.pl".
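
The letter scale above can be sketched in a few lines of Python. This is only an illustration, not a port of mygrade.pl. The function name letter_grade is hypothetical, and three behaviors are assumptions the text does not settle: fractional scores are rounded down, a score of 100 or more is treated like the 97-99 band, and anything below 60 is reported as "F". The result is a lower bound, since grades may be adjusted up.

```python
import math


def letter_grade(score):
    """Map a 0-100 course score to a letter grade (a lower bound)."""
    points = math.floor(score)  # assumption: fractional scores round down
    if points >= 100:
        return "A+"  # assumption: 100 is treated like the 97-99 band
    if points < 60:
        return "F"  # assumption: the scale above lists no letter below D
    letter = {9: "A", 8: "B", 7: "C", 6: "D"}[points // 10]
    last_digit = points % 10
    if last_digit <= 2:  # the "012" range earns a minus
        return letter + "-"
    if last_digit >= 7:  # the "789" range earns a plus
        return letter + "+"
    return letter


if __name__ == "__main__":
    for s in (91, 88, 85, 73.5):
        print(s, letter_grade(s))
```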


 Schedule (tentative)

The following schedule is subject to change, but likely not by very much. The readings listed are readings that you should have finished by that date (all from Jurafsky+Martin unless otherwise noted).

Date Topics Required Readings Suggested Readings Due
30 Aug Welcome to Computational Linguistics
What is this class about, linguistic phenomena
- - -
Linguistic Ambiguity
04 Sep Regular languages
Finite state machines and baby morphology
2-2.2, 3-3.4 - HW00
06 Sep N-gram models
Language modeling and information theory
4-4.2, 4.4-4.5 - HW01
11 Sep Noisy channel models
Automatic morphological disambiguation
3.5, 5.9, Link (3, 4-4.2) - -
13 Sep Unsupervised learning via EM
In-class example: morphological disambiguation
- HW02
18 Sep Unsupervised learning II
Word alignment and model 1
25.5-25.6 - -
20 Sep Phonological change
Cognate lists and diachronic linguistics
(not all will make sense: that's ok)
- HW03
Syntax
25 Sep Introduction to syntax
Flavors, tests and goals
- P1
27 Sep Part of speech tagging
Finite state solutions
5.1-5.5 - HW04
02 Oct Syntactic parsing
Treebanks, PCFGs and the CKY algorithm
13, 13.2-13.4.1, 14.2 - -
04 Oct Dependency grammars
Graph-based models
(through 3.3), 12.7 - HW05
09 Oct Left-to-right parsing
Efficient and psycholinguistically plausible
12.9, 14.10, -
11 Oct Discussion about projects
(and catch-up...)
TBD - HW06
Semantics
16 Oct Working Day
(come with P2 questions)
None - -
18 Oct Categorial grammars
Representation and parsing
or - P2+HW07
23 Oct Lambda calculus/first order logic
Semantic interpretation
17.1, 17.3 -
25 Oct Lexical semantics
Word sense disambiguation
19-19.3, 20.7 - HW08
01 Nov Multilingual semantics
Lexical substitution
// - Proposals
01 Nov 6pm, AVW 3258!
Images and words: Multimodal inference of meaning
None - HW09
06 Nov No class: Hal is sick - - -
08 Nov Textual inference
Entailment and paraphrasing
(sec 1.0, 1.3, 2.4, 4.2, 4.3) - -
Discourse and Pragmatics
13 Nov Words and actions
Following instructions
P3
15 Nov Anaphora and coreference
Supervised and unsupervised approaches
- -
20 Nov Midterm review - - -
Fun Stuff
27 Nov Midterm
One double-sided page of notes allowed
- - -
29 Nov Multilingual language processing
Shared representations and learning history
None - Progress
04 Dec Topic modeling
(Guest speaker: Jordan Boyd-Graber)
- HW10
06 Dec Modeling conversations
(Guest speaker: Philip Resnik)
-
11 Dec Computational Phonology
(Guest speaker: Ewan Dunbar)
- HW11

 Homework Assignments

All written homeworks are due on Thursday before class (i.e., by 9:25a). See the schedule above for due dates. All projects are due on Tuesdays by 10pm. There is no allowance for late homeworks or projects without prior permission (for which you had better have a good reason). Everything should go through our online handin system.

Written Homeworks

Programming Projects


 Useful Links

This course has been taught at UMD in the past (once by me, in 2010): Fall 2011, Fall 2010, Fall 2009, Fall 2008 and Fall 2007.


 Course Policies

Cheating: Any assignment or exam that is handed in must be your own work. However, talking with one another to understand the material better is strongly encouraged. Recognizing the distinction between cheating and cooperation is very important. If you copy someone else's solution, you are cheating. If you let someone else copy your solution, you are cheating. If someone dictates a solution to you, you are cheating. Everything you hand in must be in your own words, and based on your own understanding of the solution. If someone helps you understand the problem during a high-level discussion, you are not cheating. We strongly encourage students to help one another understand the material presented in class, in the book, and general issues relevant to the assignments. When taking an exam, you must work independently. Any collaboration during an exam will be considered cheating. Any student who is caught cheating will be given an E in the course and referred to the University Student Behavior Committee. Please don't take that chance - if you're having trouble understanding the material, please let us know and we will be more than happy to help.

ADA: Any student eligible for and requesting reasonable academic accommodations due to a disability is requested to provide, to the instructor in office hours, a letter of accommodation from the Office of Disability Support Services (DSS) within the first two weeks of the semester. You may reach them at 301-314-7682 or by visiting Susquehanna Hall on the 4th Floor.

College guidelines: Document concerning adding, dropping, etc. here.