This examination is open book, open notes, and you may consult any source of information during the exam except another person. You are limited to TWO HOURS from the time that you begin until you complete writing the exam. The exam must be returned to me (oard@glue.umd.edu) by 5:30 P.M., Saturday, December 18, 1999, either by email or by placing it in my mailbox. It is your responsibility to ensure that I receive the exam.

Answer any TWO of the following THREE questions. If you answer all three, I will grade only the first two that I happen to read, which may not be read in the order that you wrote them.

1. Term Selection.
In character-coded electronic text, bags of words, word stems, phrases and/or character n-grams are commonly used to represent the meaning of a document. The same representations could also be used to index documents for which you have only images of printed pages or recorded speech (that is originally in audio form) if accurate optical character recognition or speech recognition is available. Identify one other way of automatically representing the content of a document image (of a printed page) when accurate optical character recognition cannot be provided, or one way of automatically representing the meaning of recorded speech when accurate speech recognition cannot be provided. In each case, explain how the representation can be used by explaining the query formulation and document detection (matching, ranking, ...) processes.

2. Evaluation.
Recall and precision are often criticized as not accurately reflecting the way in which people actually use information retrieval systems. Identify the assumptions that underlie the calculation of recall and precision, give at least one example for each assumption that shows a case in which that assumption is not true, and then explain why recall and precision are so widely used despite the apparent problems that you have identified.

3. Why Things Work.
People who are new to text retrieval are often surprised by how well such simple techniques work. This good performance is surprising because the use of language is complex. Some words have several meanings, some meanings can be expressed using different words, and the order in which words appear can change their meaning. Using an interactive retrieval system based on the vector space model as an example, explain how text retrieval systems overcome each of these three problems (to the extent that they do). In your answer, be sure to focus on each of the three problems separately. You might find it helpful to use examples to illustrate your points, but the key to a good answer will be to explore the way authors and searchers behave, and how the design of vector space text retrieval systems can take advantage of that behavior.

------------------------------ End -------------------------------