First DC-area IR Experts meeting

Monday November 1, 2010 
University of Maryland, College Park
A.V. Williams Building, Room 3258

Doug Oard, University of Maryland
Welcome, Background, Future Plans

Bill Frakes, Virginia Tech
Automatic Vocabulary Extraction Methods for Domain Analysis

Nazli Goharian, Georgetown University
On Passage Detection

Paul McNamee, Johns Hopkins University HLT COE 
Two Unusual Forms of Tokenization for Text Retrieval


Dina Demner-Fushman, National Library of Medicine
Combining Text and Visual Features for Information Retrieval

Doug Oard, University of Maryland
Evaluating E-Discovery Search

Justin Martineau, University of Maryland Baltimore County
Dynamic Domain Specific Sentimental Word Identification

Bill Kules, Catholic University of America
Gaze Behavior with Faceted Search Interfaces in Academic and Health Domains

Open Discussion


Others Attending:

Jae-Wook Ahn, University of Maryland, College Park
Oleg Aulov, University of Maryland, Baltimore County
Rezarta Islamaj, National Center for Biotechnology Information
Jacob Kogan, University of Maryland Baltimore County
Dawn Lawrie, Loyola University
Jim Mayfield, Johns Hopkins University HLT COE
Charles Nicholas, University of Maryland Baltimore County
Delip Rao, Johns Hopkins University
Hagit Shatkay, University of Delaware
Ian Soboroff, NIST
Kanti Srikantaiah, University of Maryland, College Park
Kaizhi Tang, Intelligent Automation, Inc.
Earl Wagner, University of Maryland, College Park
Lidan Wang, University of Maryland, College Park
Tan Xu, University of Maryland, College Park
Lana Yeganova, National Library of Medicine
Dongming Zhang, Johns Hopkins University


The campus is easily accessible by car and metro.  For directions,
please see:  Visitor
parking costs $15 for the full day; pay before you leave the lot.
Parking for speakers is complimentary (contact Doug Oard for the code
number that you will need to give the machine in lieu of money).  A
free shuttle is available from the College Park metro station.

Coffee, water, and (for lunch) a variety of pizza will be provided.
For those with other dietary preferences, a food concession located in
the same building is available.


William B. Frakes, Virginia Tech
Automatic Vocabulary Extraction Methods for Domain Analysis

Software reuse is the use of existing software or software knowledge
to construct new software.  A key concept in systematic reuse is the
domain, a software business area that contains systems sharing
commonalities.  Most organizations work in only a few domains,
repeatedly building similar systems with variations to meet the needs
of different customers.  Rather than building each variant system from
scratch, as is often done today, significant gains are achievable by
reusing large portions of previously built systems in the domain to
construct new ones.  The process of identifying domains, bounding
them, and discovering commonalities and variabilities among the
systems in the domain is called domain analysis.  The entire process
of reusing domain knowledge in the production of new systems is called
domain engineering or product line engineering.  I will describe
several experiments that evaluated automatic term extraction for the
development of domain vocabularies.  In the first experiment, fourteen
word frequency metrics were tested to evaluate their effectiveness in
identifying vocabulary in a domain.  Fifteen domain-engineering
projects were examined to measure how close the vocabularies
selected by the fourteen word frequency metrics were to the
vocabularies produced by domain engineers.  Stemming and stop word
removal were also evaluated to measure their impact on selecting
proper vocabulary terms.  The results of the experiment show that
stemming and stop word removal do improve performance and that term
frequency is a valuable contributor to performance.  Variations on
term frequency do not always improve performance significantly.
The second study compared the vocabularies created by various domain
experts and the source documents selected by them to create the
vocabulary.  The results indicate that there is similarity among the
vocabularies created and the source documents selected.
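
As a minimal sketch of the kind of term extraction described above,
the Python below ranks candidate vocabulary terms by raw term
frequency after stop word removal.  It is an illustration only, not
one of the fourteen metrics from the experiments: the stop list, the
function name candidate_vocabulary, and the sample documents are all
assumptions, and stemming is omitted to keep the example short.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "that"}


def candidate_vocabulary(documents, top_k=5):
    """Return the top_k most frequent non-stop-word terms across documents."""
    counts = Counter()
    for text in documents:
        # Lowercase and keep alphabetic runs only; no stemming is applied.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_k)]


# Made-up domain documents, used only to exercise the function.
docs = [
    "The account holds the balance of the customer.",
    "A customer opens an account and the account is active.",
]
print(candidate_vocabulary(docs, top_k=2))
```

A term list produced this way could then be compared against a
vocabulary chosen by a domain engineer, which is the comparison the
first experiment performs.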


Nazli Goharian, Georgetown
On Passage Detection 

Passages can be hidden within text to circumvent their disallowed
transfer. Such release of compartmentalized information is of concern
to all corporate and governmental organizations. Passage retrieval is
well studied; we posit, however, that passage detection is not.
Passage retrieval is the determination of the degree of relevance of
blocks of text, namely passages, comprising a document.  Rather than
determining the relevance of a document in its entirety, passage
retrieval determines the relevance of the individual passages.  As
such, modified traditional information retrieval techniques compare
terms found in user queries with the individual passages to determine
a similarity score for passages of interest.  In passage detection,
passages are classified into predetermined categories.  More often
than not, passage detection techniques are deployed to detect hidden
paragraphs in documents.  That is, to hide information, hidden text is
injected into passages of documents.  Rather than matching query
terms against passages to determine their relevance, using text mining
techniques, the passages are classified.  Those documents with hidden
passages are defined as infected.  Thus, simply stated, passage
retrieval is the search for passages relevant to a user query, while
passage detection is the classification of passages.  Our approach
identifies passages and utilizes relationship nets to improve


Paul McNamee, JHU HLT COE
Two unusual forms of tokenization for text retrieval

This talk discusses preliminary work using two relatively unstudied
methods of tokenizing text.  The first method is based on synthetic
syllabification - breaking words down into component 'syllables'.  The
second approach is based on a generalization of character n-gram
tokenization that allows for skip characters within a fixed template.
Syllabification aims to segment words into syllables. Because of
irregularities and variations in pronunciation, and due to homonymy,
determining precise syllable boundaries in most alphabetic languages
is non-trivial. We examine the use of simple approximate methods for
inducing syllable boundaries with the goal of creating suitable
indexing representations for text. In preliminary experiments we find
that sequences of syllables or 'syllable-grams' can lead to effective
text retrieval, especially in morphologically complex languages.  An
entirely different representation for text involves the use of
templatic character skip-grams.  As an example, the word 'park' can be
represented as a set of symbols: '*ark', 'p*rk', 'pa*k', and 'par*'
using one skip and relying on the symbol '*' to indicate a single
wildcard character.  As the template size increases and the number of
permitted skips changes, the number of symbols (i.e., indexing terms)
per word increases dramatically.  While this can be problematic from
an efficiency point of view, we envision some applications of the
method.  In particular, dealing with irregular conjugations and
infix morphology becomes tractable: 'swim', 'swam', and 'swum' all
have similar representations, something a suffix stemmer might have
trouble accomplishing.  Similarly, 'foot' and 'feet' share the symbol
'f**t'.  We have some preliminary experiments showing the promise of
the method in morphologically complex languages. We also envision that
the method may prove suitable for dealing with heavily degraded text
such as might be produced by low-quality optical character


Dina Demner-Fushman in collaboration with Matthew Simpson, Sameer
Antani, Md. Rahman and George Thoma:
Combining text and visual features for information retrieval

The search for relevant and actionable information is key to achieving
clinical and research goals in biomedicine. Biomedical information
exists in different forms: as text and illustrations in journal
articles and other documents, in images stored in databases, and as
patients' cases in electronic health records. We study ways to move
beyond conventional text-based searching of these resources by
combining text and visual features in search queries and document
representation. We use a combination of IR, CBIR and NLP techniques
and tools to develop building blocks for advanced information
services. Such services will enable searching by textual as well as
visual queries, and retrieving citations enriched by relevant images,
charts, and other illustrations from the journal literature, patient
records and image databases.


Douglas W. Oard, University of Maryland, College Park
Evaluating E-Discovery Search

Civil litigation relies on each side making relevant evidence
available to the other, a process known as "discovery."  The explosive
growth of information in digital form has led to an increasing focus
on how search technology can best be applied to balance costs and
responsiveness in what has come to be known as "e-discovery".  This is
now a $4 billion USD business, one in which new vendors are entering
the market frequently, usually with impressive claims about the
efficacy of their products or services.  Courts, attorneys, and
companies are actively looking to understand what should constitute
best practice, both in the design of search technology and in how that
technology is employed.  In this talk I will provide an overview of
the e-discovery process, and then I will use that background to
motivate a discussion of which aspects of that process the TREC Legal
Track is seeking to model.  I will then spend most of the talk
describing two novel aspects of evaluation design: (1) recall-focused
evaluation in large collections, and (2) modeling an interactive
process for "responsive review" with fairly high fidelity.  Although I
will draw on the results of participating teams to illustrate what we
have learned, my principal focus will be on discussing what we
presently understand to be the strengths and weaknesses of our
evaluation designs.


Justin Martineau, UMBC
Dynamic Domain Specific Sentimental Word Identification

Query-driven sentiment analysis is a difficult problem because the
strength and polarity of sentimental words and expressions are
dependent upon the topic. This necessitates a dynamic approach fast
enough to operate at run time.  In this talk I will outline the
problem by presenting new experiments supporting the claim that
topical sentiment is expressed by the sum of a large number of very
weak signals. As a partial solution I will present a fast,
statistically well-grounded, relevance-feedback-style technique to
determine sentimental word orientation for a given topic. Then I will
present preliminary results using document synthesis to bootstrap
topical sentiment polarity classifiers at run time.


Bill Kules, Catholic University of America
Gaze Behavior with Faceted Search Interfaces in Academic and Health Domains

Faceted search is an accepted technique to support complex and
iterative information seeking tasks like exploratory search. Faceted
search interfaces incorporate clickable categories into search results
so searchers can narrow and browse the results without reformulating
their query. These techniques have the potential to support more
flexible information seeking strategies for exploratory search by
allowing searchers to fluidly transition between browsing and search
strategies. My research is motivated by a desire to better understand
how these interfaces affect searcher actions, tactics, and strategies.
In this talk I will discuss several studies (two completed and two
underway) that examine searcher gaze behavior in two types of faceted
search interfaces: an academic library catalog and the Medline Plus
web site. The studies apply multiple methods, including eye tracking
and stimulated recall interviews, to investigate several factors in
searcher gaze behavior (what components of the interface searchers
look at), including: differences when training is and is not provided;
changes as searchers become familiar with the interface; differences
depending on the stage of search; and differences by domain and task;