Review Questions
----------------------
IMPORTANT: Please note that there is no guarantee that the questions in the final will be anything like the review questions. These questions are just a sampling of the material and should NOT be considered a substitute for going through the lectures and the textbook.
n-gram language model
=====================
-- Why are language models useful in NLP? What are some NLP applications that use n-gram LMs?
-- What is the operative assumption that allows us to build n-gram LMs?
-- How would you compute the maximum likelihood estimate for the bigram probability P(w1,w2)?
-- What are the two problems associated with scaling n-gram language models to higher n?
-- Why is smoothing necessary? Why is it also called discounting?
-- Is Laplace's law an effective smoothing technique? Why/Why not?
-- Derive the expression for the bigram conditional probability under Laplace's law P_LAP(w2|w1).
-- Prove that Good-Turing is actually performing discounting (not a formal proof, just an intuitive one).
-- Given a corpus with N tokens, V types, B unique bigrams and the (r,Nr) distribution from the example in the lecture slides (second slide on page 30 in the pdf), explain how you would go about performing Good-Turing discounting.
-- Is it ok to use the MLE P(w1) as the denominator when computing P_GT(w2|w1)? Why/Why not?
-- Is it a good idea to perform Good-Turing discounting uniformly, i.e., for all r? If not, what possible alternatives do we have?
-- What is the difference between interpolation and backoff (in the context of LMs)?
-- Can I perform backoff with regular MLEs of n-gram probabilities or do I need to do discounting first?
-- Does Katz Backoff yield well-formed probability estimates?
-- What are grammatical zeros? Can Katz Backoff deal with genuine grammatical zeros?
-- What is the other disadvantage of Katz Backoff?
-- Is the unigram model P_{cont} in Kneser Ney smoothing different from a regular MLE unigram model? If so, how? Why is it useful?
-- What is the difference between closed vocabulary and open vocabulary language modeling?
-- What is perplexity? Why is it an appropriate measure for evaluating LMs? Intuitively, can you say if perplexity differs significantly from entropy?
-- Why do closed vocabulary scenarios lead to lower perplexities?
-- Why would testing a newswire-trained LM on blogs yield higher perplexities?
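As a study aid for the estimation and evaluation questions above, here is a minimal sketch of bigram MLE, Laplace (add-one) smoothing, and perplexity. The toy corpus and the function names (`p_mle`, `p_laplace`, `perplexity`) are assumptions made for illustration and are not taken from the lectures.

```python
import math
from collections import Counter

# Toy training corpus (hypothetical, for illustration only).
sentences = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
]

bigram_counts = Counter()   # C(w1, w2)
context_counts = Counter()  # C(w1), counted as the first word of a bigram
vocab = set()
for sent in sentences:
    vocab.update(sent)
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1
V = len(vocab)  # number of word types

def p_mle(w2, w1):
    """MLE conditional probability: C(w1, w2) / C(w1).
    Undefined (ZeroDivisionError) if w1 was never seen as a context."""
    return bigram_counts[(w1, w2)] / context_counts[w1]

def p_laplace(w2, w1):
    """Add-one estimate: (C(w1, w2) + 1) / (C(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + V)

def perplexity(sent, prob):
    """2 ** (average negative log2 probability per bigram).
    Fails if any bigram has probability zero, which is one reason
    smoothing is needed before evaluating on unseen text."""
    pairs = list(zip(sent, sent[1:]))
    log_sum = sum(math.log2(prob(w2, w1)) for w1, w2 in pairs)
    return 2 ** (-log_sum / len(pairs))

print(p_mle("cat", "the"))      # 0.5
print(p_laplace("cat", "the"))  # 0.25
```

Note how the smoothed estimate for the seen bigram ("the", "cat") is lower than its MLE: the probability mass that was removed is what gets redistributed to unseen bigrams.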
WSD questions
=============
-- What is the difference between homonymy and polysemy?
-- What is a synset?
-- What do the edges in WordNet represent?
-- What are hypernyms?
-- What are meronyms?
-- What is the difference between the lexical sample task and the all-words WSD tasks?
-- What are some commonly used features in a WSD task?
-- Give an example of an extrinsic evaluation of a WSD system.
-- What is Senseval/SemEval?
-- What is sense-annotated data? Why is it needed?
-- What do knowledge-lean and knowledge-rich WSD approaches mean?
-- What are supervised, unsupervised, and semi-supervised WSD approaches?
-- Describe Lesk's WSD algorithm.
-- Describe Yarowsky's cotraining algorithm for WSD.
-- What is the difference between word sense disambiguation and word sense discrimination?
-- Give two ways in which the basic Lesk algorithm can be improved.
-- What are training and test data?
-- Why does more training data give better results?
-- What is the Naive Bayes Algorithm for WSD?
-- What is a decision list?
-- What are collocations?
-- Why is Yarowsky's algorithm appealing?
-- What are the drawbacks of Yarowsky's algorithm?
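To make the Lesk questions above more concrete, here is a minimal sketch of a simplified Lesk variant that picks the sense whose gloss shares the most content words with the context. The two-sense inventory, the glosses, and the stopword list are all made up for illustration; a real system would take glosses from a dictionary or WordNet.

```python
# Small hypothetical stopword list; real systems use a much larger one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "that", "and", "or", "to", "for", "is"}

# Hypothetical sense inventory for "bank"; the glosses are invented.
SENSES = {
    "bank#finance": "an institution that accepts deposits and lends money",
    "bank#river": "sloping land beside a body of water such as a river",
}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context words.
    Ties go to the sense listed first."""
    context_set = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context_set & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("he sat on the bank of the river and watched the water", SENSES))
# bank#river
```

The second test sentence one might try, "she went to the bank to deposit money", only matches on "money" because "deposit" and "deposits" are treated as different strings. That weakness hints at the kinds of improvements (stemming, expanded glosses) the questions above ask about.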
Semantic Distance
=================
-- What is the difference between semantic similarity and semantic relatedness?
-- Give examples of how semantic distance algorithms are evaluated.
-- Understand the edge-distance, Wu--Palmer, Hirst--St-Onge, Resnik, and Jiang--Conrath methods.
-- What is the basic idea behind distributional measures of similarity?
-- What is information content?
-- What are the differences between WordNet-based and distributional measures of similarity?
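For the taxonomy-based measures above, the following sketch computes edge distance and Wu--Palmer similarity on a tiny hand-built hierarchy. The taxonomy is hypothetical, and the depth convention (root has depth 1) is an assumption; other conventions give different numbers, so check which one the lectures use.

```python
# Hypothetical is-a taxonomy: child -> parent. "entity" is the root.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

def path_to_root(concept):
    """Return [concept, parent, ..., root]."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(concept):
    """Number of nodes from the concept up to the root (root has depth 1)."""
    return len(path_to_root(concept))

def lowest_common_subsumer(c1, c2):
    """Deepest concept that is an ancestor of (or equal to) both c1 and c2."""
    ancestors_of_c2 = set(path_to_root(c2))
    for node in path_to_root(c1):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_c2:
            return node
    return None

def edge_distance(c1, c2):
    """Number of edges on the path between c1 and c2 through their LCS."""
    lcs = lowest_common_subsumer(c1, c2)
    return depth(c1) + depth(c2) - 2 * depth(lcs)

def wu_palmer(c1, c2):
    """2 * depth(LCS) / (depth(c1) + depth(c2))."""
    lcs = lowest_common_subsumer(c1, c2)
    return 2 * depth(lcs) / (depth(c1) + depth(c2))

print(edge_distance("dog", "cat"), edge_distance("dog", "car"))  # 2 4
```

In this toy hierarchy "dog" and "cat" come out more similar than "dog" and "car" under both measures, because their lowest common subsumer ("animal") sits deeper than "entity".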
Clustering
==========
-- What is text clustering?
-- How does partitional clustering work?
-- What are first-order and second-order co-occurrences?
-- What are intra-cluster and inter-cluster distances?
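As one possible illustration of partitional clustering, here is a small k-means sketch in plain Python. K-means is used here only as a common example of a partitional method and may not be the exact algorithm from the lectures; the 2-D points and the fixed initial centroids are made up so that the run is deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data: two visibly separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Intra-cluster distance can then be measured between points within one of the returned clusters, and inter-cluster distance between the two centroids.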
MT, IR, Speech
==============
For the guest lectures on MT, IR and Speech processing, we will most likely ask you very high-level questions that do not go into any significant technical detail but rather test your understanding of the basic concepts that the lectures touched on.
For example, one such question for MT might ask you to identify and briefly describe the various components of a typical phrase-based SMT system. This is something that Philip covered in great detail in class.