WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China

PSST: A Web-Based System for Tracking Political Statements

Samantha Kleinberg, Bud Mishra
Courant Institute of Mathematical Sciences, New York University
New York, NY, USA
samantha@cs.nyu.edu, mishra@nyu.edu

ABSTRACT
Determining candidates' views on important issues is critical in deciding whom to support and vote for, but finding their statements and votes on an issue can be laborious. In this paper we present PSST (Political Statement and Support Tracker), a search engine to facilitate analysis of political statements and votes over time. We show that prior tools for text analysis can be combined with minimal manual processing to provide a first step in the full automation of this process.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; J.4 [Computer Applications]: Social and Behavioral Sciences

General Terms
Algorithms

Copyright is held by the author/owner(s). WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

1. INTRODUCTION
During the 2004 US Presidential election, the notion of "flip-flopping" was made salient, but despite analyses [2, 3, 4] of speeches given by both candidates in the final months before the election, the "flip-flopper" label was not applied equally. In this work, we aim to provide a non-partisan, unbiased method of viewing how public statements on issues change over time, as well as how statements correlate with the politicians' votes on these issues. This is useful for voters who value consistency and want to be aware of candidates' histories, as well as for holding politicians accountable for their statement histories both before and after elections.

Ideally, if this could be done in an automated way, for any politician or candidate, as the news is happening, any person could check the facts for themselves immediately, rather than having just a feeling that rhetoric has changed. On the other side of things, if politicians are aware that anyone can check on their histories in a simple way, we may avoid this shift in arguments and rationale. We present our first work toward this end, PSST (Political Statement and Support Tracker), a web-based system currently allowing analysis of many of the candidates in the 2008 United States Presidential election, and President Bush, on a set of issues and key votes.

2. METHODS
The main steps of PSST are:
(1) Retrieving webpages with speeches, press releases, and votes
(2) Extracting relevant quotes from the texts and relating these to political issues
(3) Displaying the statements and voting histories

2.1 Data collection
In order to analyze candidates' consistency on issues, we gathered speech transcripts from their official websites, and later examined these to determine their focus and find statements on particular issues. Voting records were also gathered and similarly analyzed. By using the politicians' own words, we obtain a truer representation of their publicly stated positions, not filtered through the lens of the media.

As valid RSS feeds of speeches do not exist for all candidates, and news stories proved to be too ambiguous, we restricted the system to a pre-defined subset of all political figures. The candidates and politicians currently supported in the PSST system are: George Bush, Hillary Clinton, John Edwards, John McCain, Barack Obama, Mitt Romney, and Fred Thompson. These were the top candidates, according to polls at the time, for which speech transcripts were available. A full list of data sources is given on the project web site. Similarly, for politicians with voting histories, key votes were retrieved from the Project Vote Smart website.¹

2.2 Getting key phrases
In the second phase of PSST, key phrases are extracted from the texts and then linked with issues.
Here Extractor [1, 5] was used for key phrase extraction, as well as for identifying the supporting text of those key phrases. Given a link to a speech, Extractor returns a list of the key phrases in the speech and, for each phrase, a list of the sentences in the text that support that phrase. These phrases represent the topic of the text. For example, a speech about health care policy may return phrases such as "insurance companies" and "health." Other techniques for identifying key phrases, such as TF-IDF, tend to focus on unusual or frequently used words that do not always accurately describe the speech's focus.

PSST compares the key phrases found by Extractor with a list of words and phrases that we have manually identified as relating to specified political issues (e.g., campaign finance, the environment). For each occurrence of such a phrase in the speech, PSST searches the surrounding sentences for any additional significant phrases that might indicate other relevant issues. (See the website for the full list of issues supported.) Then, using the same rules as for the statements, we automatically classified each of the bills.

In order to determine these main issue categories, as well as what words fall within them, we studied websites such as that of Project Vote Smart to get an idea of the general categories the statements fell into. We then reviewed the phrases and sentences extracted for one Democratic and one Republican candidate (Barack Obama and John McCain) to see which phrases correspond to which issues and to account for different wordings between parties.

Manually reviewing the phrases and sentences extracted for each speech aided in finding issue-specific language. For each speech we looked at the phrases extracted, verified that they corresponded to at least one of the points of the speech, and then determined, based on common and issue-specific knowledge, what category the phrase belonged in and what other similar phrases may be used to make such classifications. For example, phrases such as "contraception" and "family planning" are common in texts dealing with abortion but rare in other contexts.

We continued this process using vote titles, in order to identify more phrases that had originally been missed or that might only be used in the context of a bill rather than a speech. One example of this is CHIP (Child Health Insurance Plan), as there were a number of votes on the plan, though it was rarely mentioned by name outside of bill titles.

2.3 Queries
At query time, results are retrieved from the database. Using a drag and drop interface, a user can select any number of candidates and issues. Then, for each issue and each candidate selected, the related votes and statements are retrieved and arranged in descending chronological order, with votes and statements displayed together. Figure 1 shows the user interface.

Figure 1: Query interface in use. The user is adding a third candidate to the selected query.

¹ www.vote-smart.org

3. DISCUSSION
The system is successful in aiding the discovery of inconsistencies between statements and votes, highlighting the reasoning for a vote that may seem inconsistent with a politician's stated beliefs, and facilitating the identification of changing positions in cases where votes are not available. Rather than simply doing a search on Google for a politician or issue, a user can immediately find all statements for multiple candidates on the topics that interest them. We were able to identify inconsistencies such as changes in President Bush's arguments for war with Iraq, Senator Clinton's statements on bringing troops home versus her votes against redeployment, and Senator McCain's rationale for a vote against a health care bill after stating the importance of health care reform.

However, the system does not succeed in all areas. In some cases, particularly with categorizing votes based solely on their titles, there were misclassifications. For example, the "Healthy Forests Restoration Act of 2003" was included in the health category due to the string "health" appearing in the title. Another issue is that phrases pertaining to multiple issues may be identified only in their primary context, as the project favors precision in the main topic of the text over recall.

Finally, we note that while new statements and votes are being added as they occur, the system is currently closed in terms of the candidates and issues supported. Adding a new candidate or politician currently requires identifying data sources for them, analyzing their structure, and writing code that can parse them. This makes the system quite vulnerable to small changes in a web page's structure. If, in the future, all candidates support a standardized format to disseminate their views (e.g., using RSS feeds) and provide transcripts in a common layout, this would enable PSST to support very flexible queries involving a nearly unlimited number of politicians without any manual intervention. Adapting the system to a different country or set of candidates would then involve simply updating the issues, as some may be more or less relevant in the future and new issues may arise.

4. CONCLUSIONS AND FUTURE WORK
We presented PSST, a system for the analysis of political statements and votes, currently implemented for a predefined set of politicians and issues. Preliminary experiments support the validity of the approach. We plan to make improvements in the characterization of statements, integration of other data sources, and facilitation of expansion to include new candidates and issues. The project is located at: http://cs.nyu.edu/samantha/search/psst.html

5. ACKNOWLEDGMENTS
The authors would like to thank Ernest Davis for helpful discussions.

6. REFERENCES
[1] Extractor. www.extractorlive.com.
[2] D. P. Kuhn. Bush's top ten flip-flops: CBSNews.com charts the opinion switches, part 1: George Bush. www.cbsnews.com, September 2004.
[3] D. P. Kuhn. Kerry's top ten flip-flops: CBSNews.com charts the opinion switches, part 2: John Kerry. www.cbsnews.com, September 2004.
[4] M. Sandalow. News analysis: Flip-flopping charge unsupported by facts. www.sfgate.com, September 2004.
[5] P. Turney. Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2(4):303-336, 2000.
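
Illustrative sketch. The phrase-to-issue matching of Section 2.2, the title misclassification noted in Section 3, and the result ordering of Section 2.3 can be illustrated with a short Python sketch. This is not the PSST implementation: the phrase lists, the `Record` type, and all function names below are hypothetical, and whole-word matching is shown only as one possible stricter variant, not as something the paper reports using.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical issue -> phrase lists. The paper's lists were built by hand
# and are not reproduced here; these entries are placeholders.
ISSUE_PHRASES = {
    "health": ["health", "insurance companies"],
    "abortion": ["contraception", "family planning"],
    "environment": ["forests", "environment"],
}


def classify_substring(text):
    """Naive matching: an issue matches if any phrase occurs as a substring."""
    lowered = text.lower()
    return {issue for issue, phrases in ISSUE_PHRASES.items()
            if any(p in lowered for p in phrases)}


def classify_whole_word(text):
    """Stricter (assumed) variant: a phrase must appear as whole words."""
    lowered = text.lower()
    return {issue for issue, phrases in ISSUE_PHRASES.items()
            if any(re.search(r"\b" + re.escape(p) + r"\b", lowered)
                   for p in phrases)}


@dataclass
class Record:
    candidate: str
    kind: str   # "vote" or "statement"
    when: date
    text: str   # vote title or statement text


def query(records, candidates, issues, classify=classify_whole_word):
    """For each (candidate, issue) pair, return matching votes and
    statements together, newest first."""
    results = {}
    for cand in candidates:
        for issue in issues:
            hits = [r for r in records
                    if r.candidate == cand and issue in classify(r.text)]
            results[(cand, issue)] = sorted(
                hits, key=lambda r: r.when, reverse=True)
    return results
```

With these placeholder lists, `classify_substring("Healthy Forests Restoration Act of 2003")` includes "health", reproducing the kind of error described in Section 3, while `classify_whole_word` returns only "environment". Whole-word matching has its own cost: it would miss variants such as "healthcare", so it trades some recall for precision.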