Assignment 3: Who Hangs Out Together on Wikipedia

Due: Thursday 3/31 (11:59pm)

Background

Pointwise mutual information (PMI) is a function of two events x and y:

PMI(x, y) = log( p(x, y) / ( p(x) p(y) ) )

The larger the magnitude of the PMI for x and y, the more information you have about the probability of seeing y having just seen x (and vice versa; PMI is symmetric). If seeing x gives you no information about seeing y, then x and y are independent and the PMI is zero.
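To make the properties above concrete, here is a minimal sketch that estimates PMI from counts. It assumes the usual definition (the log of the joint probability over the product of the marginals) and estimates each probability as a count divided by the total number of sentences. The function name `pmi`, its parameters, and the choice of natural log are illustrative assumptions, not part of the assignment.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Estimate PMI(x, y) from counts (hypothetical helper).

    count_xy: number of sentences containing both x and y
    count_x, count_y: number of sentences containing x (resp. y)
    total: total number of sentences
    Natural log is an assumption; any base only rescales the values.
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))
```

With these estimates, independent events (where the joint count matches what the marginals predict) give a PMI of zero, and swapping x and y does not change the result.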

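Proper nouns are nouns that refer to distinct entities. In English, they're usually capitalized even when they don't start a sentence. Examples include Ontario, the Chicago Blackhawks, and Chelmsford High School, all of which appear (lowercased) in the sample input below.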

These are discovered automatically using a parser (but you don't need to worry about the specifics).

In this project you're going to compute the PMI for different entities appearing in the same sentence.

Input Files

We created files with the proper nouns set off in angle brackets:

⟨ joel norman quenneville ⟩ ( born ⟨ september ⟩ 15 , 1958 in ⟨ windsor ⟩ , ⟨ ontario ⟩ , ⟨ canada ⟩ ) is the head coach of the ⟨ chicago blackhawks ⟩ professional ice hockey team .
a grand coalition of ⟨ cdu ⟩ and ⟨ spd ⟩ lasted from 1968 to 1972. a new grand coalition lasted from 1992 to 1996. since 1996 , the ⟨ cdu ⟩ is cooperating with the fdp .
the elementary school in " the ⟨ simpsons ⟩ " is based on ⟨ mccarthy middle school ⟩ , which was ⟨ chelmsford ⟩ 's high school before the construction of ⟨ chelmsford high school ⟩ in 1974. the town hall in the show is based on the ⟨ chelmsford public library ⟩ ( prior to the recent reconstruction ) .

These files can be found in "/umd-lin/jbg/data/wackypedia/np". Note that each entity is the whole string inside the angle brackets; e.g., "joel norman quenneville" is one entity and "mccarthy middle school" is another.
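As an illustration of treating the whole bracketed string as one entity, here is a small sketch that pulls the entities out of a line of text. It assumes the delimiters are the Unicode angle brackets shown in the sample (U+27E8 and U+27E9) and also accepts ASCII "<" and ">" in case the real files use those; check the actual data before relying on either. The name `extract_entities` is hypothetical.

```python
import re

# Assumption: an entity is opened by U+27E8 or "<" and closed by U+27E9 or ">".
ENTITY_RE = re.compile(r"[\u27e8<]\s*(.*?)\s*[\u27e9>]")


def extract_entities(text):
    """Return the bracketed entities in text, in order of appearance."""
    return ENTITY_RE.findall(text)
```

For example, a line containing the bracketed strings "cdu" and "spd" would yield the list ["cdu", "spd"]. Repeated mentions are kept, so a caller that counts sentences rather than mentions would need to deduplicate.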

Required Tasks

In this assignment, you will compute the PMI of pairs of proper nouns that appear together in more than 25 sentences, restricted to entities that each appear in more than 100 sentences. (In other words, if an entity by itself appears in 100 sentences or fewer, we're not interested; if two entities appear together in 25 sentences or fewer, we're also not interested.) Write the code necessary to do this and at least one unit test (you'll likely want to write more!).
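The two thresholds can be sketched on a small in-memory example as follows. This is only an illustration of the counting logic, not a distributed solution: it assumes the input has already been split into sentences, each given as a list of entity strings, and that probabilities are estimated as sentence counts divided by the total number of sentences. The name `pmi_table` and its parameters are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences, min_entity=100, min_pair=25):
    """Map each qualifying (x, y) pair to its PMI (natural log).

    sentences: iterable of entity lists, one list per sentence.
    A pair qualifies if it co-occurs in MORE than min_pair sentences and
    each of its entities occurs in MORE than min_entity sentences.
    """
    entity_counts = Counter()
    pair_counts = Counter()
    total = 0
    for entities in sentences:
        unique = sorted(set(entities))  # count each entity once per sentence
        total += 1
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # x < y, so no duplicates

    table = {}
    for (x, y), n_xy in pair_counts.items():
        n_x, n_y = entity_counts[x], entity_counts[y]
        if n_xy <= min_pair or n_x <= min_entity or n_y <= min_entity:
            continue
        # log( (n_xy/total) / ((n_x/total) * (n_y/total)) )
        table[(x, y)] = math.log(n_xy * total / (n_x * n_y))
    return table
```

Passing small thresholds makes the function easy to unit test on a handful of hand-built sentences, where the expected counts and PMI can be worked out by hand.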

When you're done answer the following questions:

  1. What strategy did you use to get the information needed to compute PMI?
  2. Did you use a combiner? If so, describe what it looked like.
  3. Did you use a partitioner? If so, describe what it looked like.
  4. How many entities appear more than 100 times?
  5. How many entity pairs appear more than 25 times?
  6. What are the entities that have the highest PMI with: "seinfeld", "maryland", and "clinton"?
  7. What is the pair of entities with the largest PMI?

Submission Instructions

This assignment is due by 11:59pm, Thursday 3/31. Please send us (both Jordan and Yingying) an email with "[CCC Assignment 3]" as the subject. In the body of the email put answers to the questions above and any source code you created or modified.

Hints / Tips

