Unsupervised Query Segmentation Using Generative Language Models and Wikipedia

Bin Tan (Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801; bintan@uiuc.edu)*
Fuchun Peng (Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089; fuchun@yahoo-inc.com)

* Work done while the author was an intern at Yahoo! Inc.

ABSTRACT
In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover a query's underlying concepts that compose its original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia. Experiments show that our approach dramatically improves performance over the traditional approach based on mutual information, and produces results comparable to a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual-information-based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total improvement of 46% (from 0.530 to 0.774).

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms

Keywords
Query segmentation, concept discovery

1. INTRODUCTION
A typical search engine today allows a user to submit a set of keywords as a query. This simple user interface greatly eases use of the search engine, and is sufficient for the traditional bag-of-words retrieval model, where query and document are assumed to be composed of individual words that are neither related nor ordered. In the quest for higher retrieval accuracy, however, it is necessary to understand the user's search intent beyond the simple bag-of-words query model. At a minimum, we would like to know whether some words comprise an entity such as an organization name, which makes it possible to enforce word proximity and ordering constraints on document matching, among other things.

A partial solution provided by many search engines is that a user can put double quotes around some query words to mandate that they appear together as an inseparable group in retrieved documents. Aside from sometimes being overly limiting, this double-quote proximity operator requires additional effort from the user. Moreover, a typical user may be unaware of the syntax, or reluctant to make this clarification unless really dissatisfied with the results. It would be desirable if the structural relationship underlying the query words could be identified automatically.

Query segmentation is one of the first steps in this direction. It aims to separate query words into segments so that each segment maps to a semantic component, which we refer to as a "concept". For example, for the query "new york times subscription", [new york times] [subscription] is a good segmentation, while [new] [york times] [subscription] and [new york] [times subscription] are not, as segments like [new york] and [york times] greatly deviate from the intended meaning of the query. Ideally, each segment should map to exactly one "concept". For segments like [new york times subscription], whether it should be left intact as a compound concept or further segmented into multiple atomic concepts depends on the connection strength of the components (i.e., how strongly and how often "new york times" and "subscription" are associated) and on the application (e.g., whether query segmentation is used for query understanding or for document retrieval). This leaves some ambiguity in query segmentation, as we will discuss later.

Query segmentation could help a retrieval system improve its accuracy, as segments carry implicit word proximity / ordering constraints, which may be used to filter documents. For example, we would expect "new" to be near (and probably just before) "york" in matching documents. Query segmentation should also be useful in other query processing tasks such as query rewriting and query log analysis, where one would be able to work on semantic concepts instead of individual words [2, 8, 21].

Interestingly, little investigation of query segmentation has been done, in spite of its importance. To our knowledge, two approaches have been studied in previous work. The first is based on (pointwise) mutual information (MI) between pairs of query words. In [23, 14], if the MI value between two adjacent words is below a predefined threshold, a segment boundary is inserted at that position. The problem with this approach is that MI, by definition, cannot capture correlations among more than two words, so it cannot handle long entities such as song names, where MI may be low at certain positions (e.g., between "heart" and "will" in "my heart will go on"). Another problem with this approach, which is also true of most unsupervised learning methods, is its heavy reliance on corpus statistics. In some cases highly frequent patterns with incomplete semantic meaning may be produced. For example, "leader of the" is a fairly frequent pattern, but it is certainly not a self-contained semantic concept.

The second approach [4] uses supervised learning. At each word position, a binary decision is made whether to create a segment boundary or not. The input features to the decision function are defined on words within a small window at that position. The features include part-of-speech tags, position, web counts, etc., with their weights learned from labeled training data. Although the window size is set to 4 to capture a certain degree of long-range dependencies, the segmentation decisions are still local in essence. Moreover, their features are specifically designed for noun-phrase queries (queries comprised of multiple noun phrases). For example, one feature is the web count of "X and Y" for two adjacent words X and Y, which is not helpful for general queries. Another problem with this approach, and with all supervised learning methods, is that it requires a significant number of labeled training samples and well-designed features to achieve good performance. This makes it hard to adapt to other languages, domains, or tasks.

In this paper, we propose an unsupervised method for query segmentation that alleviates the shortcomings of the current approaches. Three characteristics make our approach unique.
First, we use a statistical language modeling approach that captures the high-level sequence generation process, rather than focusing on local between-word dependencies as mutual information does. Second, to avoid running the expectation-maximization (EM) algorithm on the whole dataset, which is quite expensive, we propose an EM algorithm that performs optimization on the fly for incoming queries. Third, instead of limiting ourselves to query logs, we combine resources from the Web. In particular, we use Wikipedia to provide additional evidence for concept discovery, which proves to be quite successful. In our experiments, we test our method on human-annotated test sets, and find that language modeling, EM optimization, and Wikipedia knowledge each bring improvements to query segmentation performance.

In the rest of the paper, we first describe related work in Section 2. We then describe the generative language model for query segmentation in Section 3. Section 4 discusses the techniques for parameter optimization, including a method for estimating the frequency of long n-grams and an EM learning algorithm. In Section 5, we describe how to incorporate additional knowledge such as Wikipedia into our segmentation model. Practical system implementation is discussed in Section 6. We present experimental results in Section 7 and discuss them in Section 8. Finally, we conclude the paper in Section 9.

2. RELATED WORK
In terms of unsupervised methods for text segmentation, the expectation-maximization (EM) algorithm has been used for Chinese word segmentation [19] and phoneme discovery [16], where a standard EM algorithm is applied to the whole corpus. As pointed out in [19, 16], running the EM algorithm over the whole corpus is very expensive. To avoid this costly procedure, for each incoming query we run an EM algorithm on the fly over the affected sub-corpus only. Since a query is normally short and the sub-corpus relevant to it is also quite small, this online procedure is very fast. A second difference is that we augment unsupervised learning with Wikipedia as external knowledge, which dramatically improves query segmentation performance. We combine multiple sources of evidence using the minimum description length (MDL) principle, which has been applied widely to problems such as grammar induction [13] and word clustering [15]. To some extent, utilizing external knowledge is similar to having some seed labeled data for unsupervised learning as used in [1], except that Wikipedia is readily available.

Wikipedia has been used in many NLP applications, including named entity disambiguation [7, 9], question answering [10], text categorization [12], and coreference resolution [20]. To our knowledge, this is the first time it has been used for query segmentation.

3. A GENERATIVE MODEL FOR QUERY SEGMENTATION
When a query is being formulated in a person's mind, it is natural to believe that the building blocks in the thinking process are "concepts" (possibly multi-word), rather than single words. For example, with the query "new york", the two words must have popped into one's brain together; if they had come to mind separately, the intended meaning would be vastly different. It is when the query is uttered (e.g., typed into a search box) that the concepts are "serialized" into a sequence of words, with their boundaries dissolved. The task of query segmentation is to recover the boundaries that separate the concepts.
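As a toy illustration of this generative story, the following sketch (our own Python, not from the paper; the concept distribution P_C is invented) samples concepts from a toy distribution and serializes them into a query, discarding exactly the boundary information that segmentation must later recover.

    import random

    random.seed(0)

    # Hypothetical toy concept distribution P_C; real probabilities would be
    # estimated from a corpus, as described later in the paper.
    P_C = {"new york times": 0.4, "subscription": 0.3, "travel guide": 0.3}

    # Sample concepts from P_C ...
    concepts = random.choices(list(P_C), weights=list(P_C.values()), k=2)
    # ... and "serialize" them: joining the words dissolves the concept boundaries.
    query = " ".join(concepts)

    print(concepts)  # the hidden segmentation, e.g. ['new york times', 'subscription']
    print(query)     # the observed query, e.g. 'new york times subscription'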
Given that the basic units in query generation are concepts, we make the further assumption that they are independent and identically distributed (i.i.d.). In other words, there is a probability distribution $P_C$ over concepts, which is sampled repeatedly to produce mutually independent concepts that construct a query. This is essentially a unigram language model, with a "gram" being not a word, but a concept / segment.

The above i.i.d. assumption carries several limitations. First, concepts are not really independent of each other. For example, we are more likely to observe "travel guide" after "new york" than after "new york times". Second, the probability of a concept may vary by its position in the text. For example, we expect to see "travel guide" more often at the end of a query than at the beginning. We could tackle these problems by using a higher-order model (e.g., a bigram model) and adding a position variable, but this would dramatically increase the number of parameters needed to describe the model. Thus for simplicity the unigram model is used, and it proves to work reasonably well for the query segmentation task.

Let $T = w_1 w_2 \cdots w_n$ be a piece of text of $n$ words, and $S^T = s_1 s_2 \cdots s_m$ be a possible segmentation consisting of $m$ segments, where $s_i = w_{k_i} w_{k_i+1} \cdots w_{k_{i+1}-1}$ and $1 = k_1 < k_2 < \cdots < k_{m+1} = n + 1$. The probability of generating $T$ via $S^T$ is

    $P(S^T) = \prod_{i=1}^{m} P_C(s_i)$

and the cumulative probability of generating query $Q$ is

    $P(Q) = \sum_{S^Q} P(S^Q)$

where $S^Q$ is one of the $2^{n-1}$ different segmentations of $Q$, with $n$ being the number of query words.

For two segmentations $S_1^T$ and $S_2^T$ of the same piece of text $T$, suppose they differ at only one segment boundary, i.e.,

    $S_1^T = s_1 s_2 \cdots s_{k-1}\, s_k\, s_{k+1}\, s_{k+2} \cdots s_m$
    $S_2^T = s_1 s_2 \cdots s_{k-1}\, s_k'\, s_{k+2} \cdots s_m$

where $s_k' = (s_k s_{k+1})$ is the concatenation of $s_k$ and $s_{k+1}$. Naturally, we favor segmentations with a higher probability of generating the query. In the above case, $P(S_1^T) > P(S_2^T)$ (thus the boundary between $s_k$ and $s_{k+1}$ is preferred) if and only if $P_C(s_k) P_C(s_{k+1}) > P_C(s_k')$, i.e., when $s_k$ and $s_{k+1}$ are negatively correlated. In other words, a segment boundary is justified if and only if the pointwise mutual information between the two segments resulting from the split is negative:

    $\mathrm{MI}(s_k, s_{k+1}) = \log \dfrac{P_C(s_k')}{P_C(s_k)\, P_C(s_{k+1})} < 0$

Note that this is fundamentally different from the MI-based approach in [14]. MI as computed above is between adjacent segments, rather than words. More importantly, the segmentation decision is non-local (i.e., it involves context beyond the words near the segment boundary of concern): whether $s_k$ and $s_{k+1}$ should be joined or split depends on the positions of $s_k$'s left boundary and $s_{k+1}$'s right boundary, which in turn involve other segmentation decisions.

If we enumerate all possible segmentations, the "best" segmentation is the one with the highest likelihood of generating the query. We can also rank them by likelihood and output the top $k$. Having multiple possibly correct segmentations is desirable if the correct segmentation cannot be easily determined at query preprocessing time: with a small number of candidates one can afford to ask the user for feedback, or to re-score them by inspecting retrieved documents.
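To make the scoring concrete, here is a small Python sketch of our own (the concept probabilities are invented, and a tiny probability floor stands in for segments never seen as concepts). It enumerates all $2^{n-1}$ segmentations of the running example, ranks them by $\prod_i P_C(s_i)$, and applies the segment-level pointwise MI test at one boundary.

    from itertools import product
    from math import log

    # Hypothetical concept probabilities; in the paper these come from web n-gram counts.
    P_C = {
        "new": 1e-3, "york": 8e-4, "times": 9e-4, "subscription": 5e-4,
        "new york": 6e-4, "york times": 1e-5, "new york times": 4e-4,
    }
    EPS = 1e-9  # tiny floor for segments never observed as concepts

    def pc(segment):
        return P_C.get(segment, EPS)

    def segmentations(words):
        """Yield all 2^(n-1) segmentations as lists of segments (strings)."""
        n = len(words)
        for cuts in product([0, 1], repeat=n - 1):  # cuts[i-1] == 1: boundary after word i
            seg, start = [], 0
            for i, cut in enumerate(cuts, start=1):
                if cut:
                    seg.append(" ".join(words[start:i]))
                    start = i
            seg.append(" ".join(words[start:]))
            yield seg

    def score(seg):
        """Unigram concept model: P(S) = product of P_C(s_i) over segments."""
        p = 1.0
        for s in seg:
            p *= pc(s)
        return p

    query = "new york times subscription".split()
    ranked = sorted(segmentations(query), key=score, reverse=True)
    print(ranked[0])  # with these toy numbers: ['new york times', 'subscription']

    # Segment-level pointwise MI: a boundary between s_k and s_{k+1} is kept iff MI < 0.
    s_k, s_k1 = "new york times", "subscription"
    mi = log(pc(s_k + " " + s_k1) / (pc(s_k) * pc(s_k1)))
    print(mi < 0)  # True here, so the boundary is justified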
In practice, segmentation enumeration is infeasible except for short queries, as the number of possible segmentations grows exponentially with query length. However, the i.i.d. nature of the unigram model makes it possible to use dynamic programming (Algorithm 1) to compute the top $k$ segmentations. The complexity is $O(n k m \log(km))$, where $n$ is the query length and $m$ is the maximum allowed segment length.

Algorithm 1: Dynamic programming algorithm to compute top k segmentations

    Input:  n-gram (word sequence) w_1 w_2 ... w_n
    Output: B[n], a list of the top k segmentations
    for i in [1..n]
        s <- w_1 w_2 ... w_i
        if P_C(s) > 0
            a <- new segmentation
            a.seg  <- {s}
            a.prob <- P_C(s)
            B[i] <- {a}
        for j in [1..i-1]
            for b in B[j]
                s <- w_{j+1} w_{j+2} ... w_i
                if P_C(s) > 0
                    a <- new segmentation
                    a.seg  <- b.seg ∪ {s}
                    a.prob <- b.prob × P_C(s)
                    B[i] <- B[i] ∪ {a}
        sort B[i] by prob
        truncate B[i] to size k
    return B[n]

4. PARAMETER ESTIMATION
The central question that needs to be addressed is how to determine the parameters of our unigram language model, i.e., the probabilities of the concepts, which take the form of variable-length n-grams. As this work focuses on unsupervised learning, we would like the parameters to be estimated automatically from the provided textual data. The primary source of data we use is a text corpus consisting of a 1% sample of the web pages crawled by the Yahoo! search engine. We count the frequency of all possible n-grams up to a certain length ($n = 1, 2, \cdots, 5$) that occur at least once in the corpus. It is usually impractical to do this for longer n-grams, as their number grows exponentially with $n$, posing difficulties for storage space and access time. However, for long n-grams ($n > 5$) that are also frequent in the corpus, it is often possible to approximate their counts using those of shorter n-grams.

4.1 Frequency lower bounds for long n-grams
We compute lower bounds of long n-gram counts using set inequalities, and take them as approximations to the real counts. For example, the frequency of "harry potter and the goblet of fire" can be determined to lie in the reasonably narrow range [5783, 6399], and we simply use 5783 as an estimate of its true frequency. A dynamic programming algorithm is given next.

If we have frequencies of occurrence in a text corpus for all n-grams up to a given length, we can infer lower bounds on the frequencies of longer n-grams, whose real frequencies are unknown. The bound is a lower bound in the sense that any smaller number would contradict the known frequencies. Let $\#(x)$ denote n-gram $x$'s frequency. Let $A$, $B$, $C$ be arbitrary n-grams, and $AB$, $BC$, $ABC$ their concatenations. Let $\#(AB \vee BC)$ denote the number of times $B$ follows $A$ or is followed by $C$ in the corpus. We have

    $\#(ABC) = \#(AB) + \#(BC) - \#(AB \vee BC)$    (1)
    $\#(ABC) \ge \#(AB) + \#(BC) - \#(B)$    (2)

(1) follows directly from a basic equation on set cardinality,

    $|X \cap Y| = |X| + |Y| - |X \cup Y|$

where $X$ is the set of occurrences of $B$ in which $B$ follows $A$, and $Y$ is the set of occurrences of $B$ in which $B$ is followed by $C$. Since $\#(B) \ge \#(AB \vee BC)$, (2) holds.

Therefore, for any n-gram $x = w_1 w_2 \cdots w_n$ ($n \ge 3$), if we define

    $f_{i,j}(x) \stackrel{\mathrm{def}}{=} \#(w_1 \cdots w_j) + \#(w_i \cdots w_n) - \#(w_i \cdots w_j)$

we have

    $\#(x) \ge f_{i,j}(x)$ for $1 < i \le j < n$.
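As a concrete illustration of inequality (2) and the definition of $f_{i,j}$, the sketch below (our own Python; the count table is invented, and real counts would come from the web corpus described above) evaluates $f_{i,j}(x)$ over all valid split points and keeps the largest value as a lower bound. The paper develops a dynamic-programming version of this computation for longer n-grams; that extension is not reproduced here.

    # Toy n-gram counts #(.), invented for illustration; the paper stores counts
    # from a 1% web sample for all n-grams up to length 5.
    COUNTS = {
        ("potter",): 25000, ("and",): 10_000_000,
        ("harry", "potter"): 20000, ("potter", "and"): 9000, ("and", "the"): 500000,
        ("harry", "potter", "and"): 8500, ("potter", "and", "the"): 8200,
    }

    def count(ngram):
        """#(ngram): stored corpus frequency (0 if the n-gram never occurs)."""
        return COUNTS.get(tuple(ngram), 0)

    def freq_lower_bound(x):
        """
        max over 1 < i <= j < n of f_{i,j}(x) = #(w_1..w_j) + #(w_i..w_n) - #(w_i..w_j),
        a lower bound on #(x) that follows from #(ABC) >= #(AB) + #(BC) - #(B).
        Indices are 1-based as in the text; the code converts them to 0-based slices.
        """
        n = len(x)
        best = 0
        for i in range(2, n):        # 1-based i in [2, n-1]
            for j in range(i, n):    # 1-based j in [i, n-1]
                f = count(x[:j]) + count(x[i - 1:]) - count(x[i - 1:j])
                best = max(best, f)
        return best

    x = tuple("harry potter and the".split())
    print(freq_lower_bound(x))  # 7700 with the toy counts; the i=2, j=3 split is tightest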