WWW 2008 / Refereed Track: Social Networks & Web 2.0 - Applications & Infrastructures for Web 2.0

Exploring Social Annotations for Information Retrieval

Ding Zhou (Facebook Inc., 156 University Avenue, Palo Alto, CA 94301)
Jiang Bian (College of Computing, Georgia Institute of Technology, Atlanta, GA 30332)
Shuyi Zheng (Computer Science & Engineering, Pennsylvania State University, University Park, PA 16802)
Hongyuan Zha (College of Computing, Georgia Institute of Technology, Atlanta, GA 30332)
C. Lee Giles (Information Sciences and Technology, Computer Science & Engineering, Pennsylvania State University, University Park, PA 16802)

This work was done at The Pennsylvania State University.

ABSTRACT
Social annotation has gained increasing popularity in many Web-based applications, leading to an emerging research area in text analysis and information retrieval. This paper is concerned with developing probabilistic models and computational algorithms for social annotations. We propose a unified framework to combine the modeling of social annotations with the language modeling-based methods for information retrieval. The proposed approach consists of two steps: (1) discovering topics in the contents and annotations of documents while categorizing the users by domains; and (2) enhancing document and query language models by incorporating user domain interests as well as topical background models. In particular, we propose a new general generative model for social annotations, which is then simplified to a computationally tractable hierarchical Bayesian network. We then apply smoothing techniques in a risk minimization framework to incorporate the topical information into the language models. Experiments are carried out on a real-world annotation data set sampled from del.icio.us. Our results demonstrate significant improvements over traditional approaches.

Categories and Subject Descriptors
G.3 [Probability and Statistics]: Models; H.3.3 [Information Search and Retrieval]: Clustering

General Terms
Algorithms, Models, Experiments

Keywords
social annotations, folksonomy, information retrieval, language modeling

1. INTRODUCTION
The goal of the semantic Web [1] is to make Web resources understandable to both humans and machines. This has motivated a stream of new Web applications including Web blogs [10], social annotations (a.k.a. social bookmarking) [4, 3, 20], and Web social networks [22]. Research in Web blogs and social networks has been especially focused on discovering latent communities [10, 22], detecting topics from temporal text streams [14], and the retrieval of such highly dynamic information. In this paper, we focus on social annotations¹, in large part motivated by their increasing availability across many Web-based applications.

¹By a social annotation, we mean the annotation tags associated with the document. Each tag is generated by a user (or shared by several users) and can include several terms.

Social annotation is a form of folksonomy, which refers to Internet-based methods for collaboratively generating open-ended text labels that categorize content such as Web pages, online photographs, and Web links. Many popular Web services rely on folksonomies, including delicious (del.icio.us) and flickr (flickr.com). Despite the rising popularity of those Web services, research on folksonomies is still at an early stage. Much of the work has been focused on the study of data properties, the analysis of usage patterns of tagging systems [4], and the discovery of hidden semantics in tags [20]. The objective of this paper, however, is to leverage the efforts and expertise of users embodied in social annotations for improving user experience in information retrieval (IR). We advance previous work by combining topic analysis with language modeling methods used in contemporary IR [6].
Incorporating social annotations with document content is a natural idea, especially for IR applications. Considering the IR methods based on language modeling, for example [15, 12], we may simply treat the terms in annotation tags the same as those in document content, consider them as additional terms of the documents, and then follow the existing IR approaches. The pitfalls here, however, come in several forms. First, a tag term is generated differently than a document content term: a tag, upon its generation by a user, represents an abstract of the document from the single perspective of a single user. Second, the differences in domain expertise of users should be taken into consideration when incorporating user tags. Some users in certain domains might be more trustworthy than others, and some users may, for various reasons, give incorrect tags. Although it remains an open problem to discover the domain expertise of users, such peer differences are believed to be important [7] for effective societal information retrieval. Finally, the improvement for IR will be limited without considering the semantics of the tag terms. Usually the number of tag terms is much smaller than the number of terms in the document being tagged. Therefore, using the tag terms in the same way as the document terms will lead to the same problems observed in traditional language modeling-based IR, such as the lack of smoothness of results and the sparsity of observations.

In this paper, we develop a framework that combines the modeling of social annotations with the expansion of traditional language modeling-based IR using user domain expertise. First, we seek to discover topics in the content and annotations of documents and to categorize the users by domains. We propose a probabilistic generative model for the generation of document content as well as the associated tags. Second, we follow an IR framework based on risk minimization proposed earlier [12]. The framework is based on Bayesian decision theory and focuses on improving the language models for queries and documents. We then study several ways of expanding the language models in which the user domain interests and expertise and the background collection language models are incorporated. In particular, we apply linear smoothing between the original term-level language models and the new topic-level language models.

The newly proposed framework benefits from the consideration of the differences between document content terms and tag terms in the modeling process. User domain expertise can be readily included in the retrieval framework by the proposed ways of language model expansion. The smoothing of the original term-level language model with the topic-level language models addresses the issues raised by the sparsity of observations.
The main contributions of this paper include: (1) a general and a simplified probabilistic generative model for the generation of document content as well as the associated social annotations; (2) a new way of categorizing users by domains based on social annotations, where the domain expertise of users, evaluated by activity frequency, is used to weight user interests; and (3) a study of several ways of combining term-level language models with the topic-level models obtained from topics in documents and users.

The rest of this paper is organized as follows: Sec. 2 introduces the related work on topic analysis and language modeling. Sec. 3 proposes the new probabilistic generative models for social annotations, including a brief discussion on choosing the correct topic number. In Sec. 5, we review the risk minimization framework for information retrieval as a Bayesian decision process. Sec. 6 explores several methods for incorporating the discovered domain interests into language modeling-based IR. Experimental results are presented in two sections, Sec. 4 and Sec. 7, respectively for topic analysis and IR quality. We conclude the paper and discuss future work in Sec. 8.

2. RELATED WORK
We review two lines of work which are closely related to the approach we will propose: document content characterization, and information retrieval based on language modeling.

2.1 Topic Analysis using Generative Models
Related work on document content characterization [2, 17, 13, 18, 22] introduces a set of probabilistic models to simulate the generation of a document. Several factors in producing a document, either observable (e.g. author [17, 18]) or latent (e.g. topic [2, 13], community [22]), are modeled as variables in a generative Bayesian network and have been shown to work well for document content characterization. The Latent Dirichlet Allocation (LDA) model [2] is based upon the idea that the probability distribution over words in a document can be expressed as a mixture of topics, where each topic is a probability distribution over words. Along the line of LDA, the Author-Word model proposed in [13] considers the interests of single authors as the origin of a word. Influential follow-up work, the Author-Topic model, combines the Topic-Word and Author-Word models, regarding the generation of a document as affected by both factors in a hierarchical manner [17, 18]. A recent work on social network analysis extends the previous model with an additional layer that captures community influence in the setting of an information society. The model proposed in this paper is different from the Author-Topic model proposed before [17]: here the users or sources of the tags and documents are observed instead of being latent.

2.2 Information Retrieval based on Language Modeling
This work also overlaps with the research on information retrieval (IR) using probabilistic language modeling. Language modeling is a recent approach to IR which is considered an alternative to traditional vector space models and other probabilistic models. Initially proposed by Ponte and Croft [15], the basic idea is to estimate the probability of generating the query from each candidate document, where each document is modeled as a language model. The research line in IR using language models was later supported [12] by a framework based on Bayesian decision theory, which shifts the focus to improving the language models.
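To make the query-likelihood idea of [15] concrete, the following is a minimal Python sketch (ours, not from the original work) that scores a document by the log-probability that its unigram language model generates the query; the add-epsilon smoothing constant and the toy documents are illustrative placeholders only.

import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, vocab_size, eps=1.0):
    # Score log P(q | theta_d) under a unigram document LM with simple
    # add-eps smoothing (illustrative; see Sec. 6 for the smoothing used here).
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return sum(math.log((counts[w] + eps) / (total + eps * vocab_size))
               for w in query_terms)

# Rank candidate documents by descending query log-likelihood.
docs = {"d1": "web search ranking".split(), "d2": "cooking recipes blog".split()}
query = "web ranking".split()
ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d], 10000),
                 reverse=True)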
A common way of improving a language model is smoothing, which seeks to counter the difficulty of estimating an accurate language model from the insufficient data available. A relatively complete study of smoothing methods for statistical language modeling is given in [21]. Usually the document language model is smoothed with the background collection model, a pre-built model believed to be smoother and to contain more words. This paper employs linear interpolation [9] of the original language model with the reference models discovered beforehand. In this way, the social expertise of the users is imported into the language models and further improves the quality of information retrieval.

In addition to the above traditional work, a recent work [20] presents a preliminary study on clustering annotations based on EM for the semantic Web and search. The probability of seeing certain words for a URL is estimated, which is then used for retrieval. However, the URL content and the differences among users are not considered in that work.

3. MODELING SOCIAL ANNOTATIONS
We propose a probabilistic generative model for social annotations. The model specifies the generation process of the terms in document content as well as the associated user tags. The motivation for modeling the social annotations together with document content is to obtain a simultaneous topical analysis of the terms, documents, and users. As we will discuss later, the topical analysis of terms (or the clustering of them by topics) essentially provides the basis for expanding query and document language models. In addition, the topical analysis of users, which categorizes the users by domains, enables the input of the domain expertise of users in addition to the tags generated by them. This section starts with the introduction of a model of user tag generation, as affected by document content. Then we simplify the general generative model for tags to a structure which is tractable and easier to estimate. Finally, we present the training method and a discussion on selecting the number of topics using the perplexity measure.

3.1 Generative Models for Annotations
We start by modeling the generation of words in documents and annotations. Intuitively, the content of documents and annotations is generated by two similar but correlated processes. We illustrate our understanding of the generation process in plate notation in Fig. 1. On the document side (left-hand side), for an arbitrary word in document d, a topic z is first drawn; then, conditioned on this topic, the word w is drawn. Repeating this process Nd times, where Nd is the number of words in d, generates d. The whole collection repeats the same process D times.² On the annotation side (right-hand side), each word in the annotation is generated similarly. First, an observed user a decides to make an annotation on a particular document d; then the user picks a topic z to describe d, followed by the generation of the tag word w. The generation of z by the user, however, depends not only on the user but also on the topic of d. Note that the dependency of user topics on document topics can be seen as a mapping between two sets of concepts. Generally speaking, there are different numbers of topics on the two sides, Td and Ta. The two topic sets can be different but are usually very similar.

²Note the document side of the general annotation model is essentially the LDA model proposed in [2], but the right-hand side takes into consideration the generation of annotations as dependent on the document content generation.

Figure 1: The general generative model for content of documents and annotations in plate notation. Td (Ta) is the number of topics in documents (annotations); Nd (Nt) is the number of content words (tag words) in document d; A and D are the numbers of users and documents; θd, θa, φd, and φa have Dirichlet priors parameterized respectively by αd, αa, βd, and βa. Dark circles denote observed variables and blank circles denote hidden ones. Rectangles denote the repetition of models with the same structure but different parameters, where the lower-right symbol indicates the number of repetitions.

Inspired by related work on topic analysis [2, 17, 18], we make assumptions about the probability structures of the generative model in Fig. 1. First, we assume all the conditional probabilities follow multinomial distributions; for example, each topic is a multinomial distribution over words in which the conditional probability of each word is fixed. Second, we assume that the prior distributions for the topic and word multinomials (θd, φd for documents and θa, φa for annotations) are Dirichlet, which are conjugate priors for the multinomial, parameterized respectively by αd, βd and αa, βa.

The generative model illustrated in Fig. 1 is not quite tractable in practice. The probability distributions we would have to estimate include: (1) D + A multinomial distributions over topics for the documents and users; (2) Td + Ta multinomial distributions over words for the topics; and (3) Td × Ta conditional probabilities to capture the correlation between the topics in documents and the topics in annotations. In addition, there are many parameters that add difficulty to tuning in practice (αd, βd, αa, βa, Td, and Ta). Therefore, in the next section, we simplify this general annotation model with some relaxations in assumptions, arriving at a tractable model with easy training algorithms available.

3.2 A Simplified Annotation Model
In this section, we simplify the general annotation model given before. In order to reduce the general model to one that is tractable with fewer parameters, we make several compromises in assumptions. First, we assume the topics in documents and annotations are the same. This assumes that the taggers conceptually agree with the original document authors, without variation of information in their understanding. Second, we assume that documents and users have the same structure of prior distributions, which are only parameterized differently. Although arguably the users and documents might have different types of distributions over topics, we make the assumption here for the sake of simplicity.

The assumptions above lead to a simplified generative model for annotations. As illustrated in Fig. 2, we have a single topic-word distribution φ with parameter β, and a single source-topic distribution θ with parameter α and extended dimension (here a source can be a document or a tagger). Now we have far fewer distributions to estimate, making the modeling more tractable in practice.

Figure 2: The User-Content-Annotation (UCA) Model in plate notation. T, A, and D are the numbers of topics, users, and documents. Nd and Nt denote the number of terms in the document and the number of terms in the tags. φ is the topic-word distribution with parameter β; θ is the source-topic distribution with parameter α.

Let us name the model in Fig. 2 the User-Content-Annotation (UCA) model. The UCA model describes the generation of words in document content and in the tags by similar but distinct processes. For document content, each observed term in document d is generated from the source x (each document d maps one-to-one to a source x). Then, from the conditional probability distribution θ on x, a topic z is drawn. Given the topic z, the word w is finally generated from the conditional probability distribution φ on the topic z. For document tags, similarly, each observed tag word for document d is generated by a user x. Specific to this user, there is a conditional probability distribution over topics, from which a topic z is then chosen. This hidden variable of topic again finally generates the word w in the tag.
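To make the UCA generative story concrete, the following minimal Python sketch (ours, not part of the paper) simulates the process for one document and one tagging user; the topic count, vocabulary size, and word counts are illustrative placeholders, and the Dirichlet hyperparameters anticipate the settings used in Sec. 3.3.

import numpy as np

rng = np.random.default_rng(0)
T, V = 5, 1000                    # number of topics and vocabulary size (illustrative)
alpha, beta = 50.0 / T, 0.05      # Dirichlet hyperparameters (cf. Sec. 3.3)

phi = rng.dirichlet([beta] * V, size=T)     # T topic-word distributions
theta_doc = rng.dirichlet([alpha] * T)      # topic mixture of the document source
theta_user = rng.dirichlet([alpha] * T)     # topic mixture of one tagging user

def generate(theta, n_words):
    # Draw n_words by sampling a topic z ~ theta, then a word w ~ phi[z].
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)
        words.append(rng.choice(V, p=phi[z]))
    return words

content_words = generate(theta_doc, 200)    # N_d content words of the document
tag_words = generate(theta_user, 5)         # N_t tag words from the user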
According to the model structure, the conditional joint probability of θ, φ, x, z, and w, given the parameters α and β, is

  P(θ, φ, x, z, w | α, β) = P(w | z, φ) P(φ | β) P(z | x, θ) P(θ | α) P(x).   (1)

For inference, we can calculate the conditional probability given the observed words as

  P(θ, φ, x, z | w, α, β) = P(θ, φ, x, z, w | α, β) / Σx Σz P(θ, φ, x, z, w | α, β).   (3)

Again, similar to related work, we make assumptions regarding the probability structures. We assume the prior distributions of topics and terms follow Dirichlet distributions parameterized respectively by α and β. Let T be the number of topics (input as a parameter), A the number of users, and D the number of documents; Nd and Nt respectively denote the number of terms in the document and the number of terms in the tags. Each topic is a probabilistic multinomial distribution over terms, denoted by φ; each user (or source) is a probabilistic multinomial distribution over topics, denoted by θ. As illustrated in Fig. 2, there are A + D distributions over topics, each of which corresponds to an observed user or source, and T distributions over words, each of which corresponds to an unobserved topic. For each document, the generation process repeats Nd + Nt times, where Nd of the iterations correspond to the terms in the document content and Nt to the terms in the tags. The above again repeats D times for all documents.

3.3 Model Training
The UCA model includes two sets of unknown parameters, the source-topic distributions θ and the topic-word distributions φ, corresponding to the assignments of individual words to topics z and sources x. One can use the Expectation-Maximization (EM) algorithm for estimating parameters in models with latent variables. However, the approach is susceptible to local maxima. In addition, according to the posterior probability in Eq. 3, we know that EM will be very computationally expensive due to the sum in the denominator. Thus, we pursue an alternative parameter estimation method, Gibbs sampling [16], which has been gaining popularity in topic analysis recently [5, 22]. Instead of estimating the parameters directly, we evaluate the posterior distributions.

When using Gibbs sampling to train generative models, typically a Markov chain is formed, where the transition between successive states is simulated by repeatedly drawing a topic for each observed term from its conditional probability. The algorithm keeps track of the number of times C^TW_zw that a term w is assigned to a topic z, and the number of times C^(A+D)T_xz that a topic z is assigned to a user or source x. Here C^TW denotes a T × W matrix and C^(A+D)T denotes an (A + D) × T matrix, where x, z, and w are the indices of the sources (documents or users), topics, and words. We repeat the Gibbs sampling until the perplexity score³ measured on the distributions converges. Algorithm 1 illustrates the Gibbs sampling algorithm for model training.

³The measurement of perplexity will be introduced in Sec. 3.4.

Algorithm 1 Training the User-Content-Annotation Model
 1: Given a sequence of triplets ⟨x, d, w⟩, where d is the document id; w is the word id; x = nil if w is a content word; x = user id if w is a tag word.
 2: Given ε as the threshold for determining convergence.
 3: Initialize C^TW, C^(A+D)T with random positive values.
 4: repeat
 5:   for all ⟨x, d, w⟩ do
 6:     t = z(w)  // get the current topic assignment
 7:     C^TW_tw ← C^TW_tw − 1  // decrement count
 8:     if x == nil then
 9:       // w is a document word
10:       C^(A+D)T_dt ← C^(A+D)T_dt − 1  // decrement count
11:       // compute P(t) below
12:       for all z = 1, ..., T do
13:         P(z) ∝ P(d, z | w) = P(d | z) P(z | w)
14:       end for
15:       sample to obtain t using P(t)
16:       C^(A+D)T_dt ← C^(A+D)T_dt + 1  // increment count
17:     else
18:       // w is a tag word
19:       C^(A+D)T_xt ← C^(A+D)T_xt − 1  // decrement count
20:       // compute P(t) below
21:       for all z = 1, ..., T do
22:         P(z) ∝ P(x, z | w) = P(x | z) P(z | w)
23:       end for
24:       sample to obtain t using P(t)
25:       C^(A+D)T_xt ← C^(A+D)T_xt + 1  // increment count
26:     end if
27:     C^TW_tw ← C^TW_tw + 1
28:   end for
29:   measure the perplexity on a held-out sample
30:   measure the perplexity change δ
31: until δ < ε

It can be seen from Algorithm 1 that the key issue here is the evaluation of the posterior conditional probabilities, i.e., P(z|w), P(d|z), and P(x|z), which leads to the evaluation of P(d|w) or P(x|w). Let us again consider the joint probabilities P(x, z|w) and P(d, z|w). Similar to earlier work [5, 22], we know the posterior conditional probabilities can be expressed as the product of several conditional probabilities on the edges of the Bayesian network. In particular, for documents we have

  P(d, z | w) ∝ [(C^TW_zw + β) / (Σk C^TW_zk + Vβ)] × [(C^(A+D)T_dz + α) / (Σk C^(A+D)T_dk + Tα)],   (4)

and for users we have

  P(x, z | w) ∝ [(C^TW_zw + β) / (Σk C^TW_zk + Vβ)] × [(C^(A+D)T_xz + α) / (Σk C^(A+D)T_xk + Tα)].   (5)

Here the unit conditional probabilities are in fact Bayesian estimates of the posteriors P(d|z), P(x|z), and P(z|w):

  P(d|z) = (C^(A+D)T_dz + α) / (Σk C^(A+D)T_dk + Tα),   (6)
  P(x|z) = (C^(A+D)T_xz + α) / (Σk C^(A+D)T_xk + Tα),   (7)
  P(z|w) = (C^TW_zw + β) / (Σk C^TW_zk + Vβ).   (8)

Accordingly, for implementation we need to keep track of Σk C^(A+D)T_dk, Σk C^(A+D)T_xk, and Σk C^TW_zk, in addition to C^(A+D)T_dt, C^(A+D)T_xt, and C^TW_tw. It is easy to implement this counting using several hash tables. In practice, we set α and β to 50/T and 0.05, respectively. These parameters seem to affect only the convergence of the Gibbs sampling, but not much the output results, unless the problem is very ill-conditioned.
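As a concrete illustration of the counting scheme behind Eqs. 4-8, the sketch below (ours; the toy initialization and variable names are illustrative) performs collapsed Gibbs sweeps over the topic assignments using a word-topic count matrix CWT (V × T) and a source-topic count matrix CST ((A + D) × T).

import numpy as np

rng = np.random.default_rng(1)
T, V, A_plus_D = 80, 5000, 100        # topics, vocabulary, sources (illustrative)
alpha, beta = 50.0 / T, 0.05          # hyperparameters as in Sec. 3.3

def gibbs_sweep(tokens, z, CWT, CST):
    # One sweep: resample the topic of every (source, word) token given all others.
    for i, (x, w) in enumerate(tokens):
        t = z[i]
        CWT[w, t] -= 1                # remove the current assignment from the counts
        CST[x, t] -= 1
        # unnormalized P(topic): product of the two smoothed ratios (Eqs. 4-5)
        p = ((CWT[w, :] + beta) / (CWT.sum(axis=0) + V * beta)) \
            * ((CST[x, :] + alpha) / (CST[x, :].sum() + T * alpha))
        t = rng.choice(T, p=p / p.sum())
        z[i] = t
        CWT[w, t] += 1                # add the new assignment back
        CST[x, t] += 1

# toy corpus: random (source, word) tokens with random initial topic assignments
tokens = [(int(rng.integers(A_plus_D)), int(rng.integers(V))) for _ in range(500)]
z = [int(rng.integers(T)) for _ in tokens]
CWT = np.zeros((V, T), dtype=int)
CST = np.zeros((A_plus_D, T), dtype=int)
for (x, w), t in zip(tokens, z):
    CWT[w, t] += 1
    CST[x, t] += 1
for _ in range(20):                    # iterate until held-out perplexity converges (Sec. 3.4)
    gibbs_sweep(tokens, z, CWT, CST)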
3.4 Topic Number Selection
The remaining question is how to select the number of topics. We resort to the perplexity measure, which is a standard measure for estimating the performance of a probabilistic model. The perplexity of a set of term-source test pairs (wd, xd), for all documents d ∈ Dtest, is defined as the exponential of the negative normalized predictive log-likelihood under the trained model:

  perplexity(Dtest) = exp[ − Σ_{d=1..D} ln P(wd | xd) / Σ_{d=1..D} |{wd, xd}| ].   (9)

Here the probability of the set of term-source pairs of a particular document is obtained by a straightforward calculation:

  P(wd | xd) = Π_{(w,x) ∈ {wd, xd}} P(w | x),   (10)

where the probability of an individual term-source pair P(w | x) is evaluated using the model hierarchy:

  P(w | x) = Σ_{t=1..T} P(w | t) P(t | x).   (11)

Note that better generalization performance of a model is indicated by a lower perplexity score over a held-out document set. We run the Gibbs sampling using the perplexity score as the termination criterion; the topic number is then determined by choosing the smallest T whose perplexity comes close to the lowest reached. A similar approach is also used in previous work for choosing parameters in generative models [2, 17].
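A minimal sketch (ours) of the held-out perplexity in Eqs. 9-11; it assumes the trained model has been summarized as a topic-word matrix p_w_given_t (T × V) and a source-topic matrix p_t_given_x ((A + D) × T), both placeholder names.

import numpy as np

def perplexity(test_pairs, p_w_given_t, p_t_given_x):
    # test_pairs: list of held-out (source_index, word_index) pairs.
    # Returns exp of the negative mean predictive log-likelihood (Eq. 9).
    log_lik = 0.0
    for x, w in test_pairs:
        p = float(np.dot(p_w_given_t[:, w], p_t_given_x[x, :]))   # Eq. 11
        log_lik += np.log(p)
    return float(np.exp(-log_lik / len(test_pairs)))

# In model selection one would compute this for several T and keep the smallest
# T whose perplexity is close to the plateau (cf. Fig. 4).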
4. EXPERIMENTS ON ANNOTATION MODELING

4.1 Data Preparation
A data sample was collected from del.icio.us using a method similar to [20]. We crawled the del.icio.us Web site starting with a set of popular URLs in Jan. 2006. We then followed the URL collections of users who had tagged these URLs, arriving at a new set of URLs. By iteratively repeating the above process, we ended up with a collection of 84,961 URLs tagged from May 1995 to Apr. 2006. There are 9,070 users along with 62,007 distinct tag words. We then crawled the URLs to collect document content. There are 34,530 URLs in the collection which are still valid and have textual content, including 747,935 content words. The activity of users seems to follow a power-law distribution. Since the data we collected is relatively small, many infrequent users and tags might not be included. How to handle resources distributed on the long tail remains an interesting question to explore.

4.2 Topic Number Selection
We first perform the training of the proposed model using the algorithm introduced above. For different settings of the desired topic number, we test the perplexity of the trained model on a held-out sample dataset. Over the iterations, the perplexity score always decreases dramatically after the first several iterations and then soon converges to a stable level. We show a plot of perplexities for five different settings of T in Fig. 3, where the training set is a 1% random sample of the available data. We are able to see that a larger setting of the topic number leads to a lower perplexity score from the start, indicating better prediction performance. This is because the increased number of topics (before a certain point) reduces the uncertainty in training. For the same reason, the larger settings of topics also lead to smaller perplexity values in the first several iterations, followed by a sharper drop in perplexity. From the figure, we can see that empirically the algorithm converges within 20 iterations for a relatively small sample. For the full dataset, we repeat the Gibbs sampling for 100 iterations.

Figure 3: The perplexities over the iterations in training for five settings of the topic number (T = 10, 40, 80, 160, 320). The training set is a 1% random sample of the available data. The perplexity is tested on a held-out sample.

The second set of experiments seeks to determine the best number of topics, using the perplexity measure defined in Eq. 9 - Eq. 11. We perform the experiments by setting different numbers of topics in training on various sizes of samples from the available data. Generally, the perplexity score first decreases and then remains stable after T reaches a certain size. We prefer the smallest T that yields convergence since a greater T requires more computation. In Fig. 4, we show the perplexity scores over different T for various sample sizes. It is clear that the perplexity decreases much more slowly after T = 80. Accordingly, we choose the desired topic number to be 80 in the following experiments.

Figure 4: The perplexities for T = 10, 40, 80, 160. Different sample sizes (1%, 5%, 10%, 50%, and 100%) are tested, yielding similar curves and indicating a minimum optimal topic number of 80 on the collected data. The perplexity is tested on a held-out sample.

4.3 Discovered Topic Words
We also examine the top words discovered for each topic to judge the quality. Usually the determination of topic words is very subjective and lacks quantitative measures. Nevertheless, without quantitative assertions, we observe generally high semantic correlations among the top words that are discovered in the same topic. Typically, most discovered topic words are about the Web and related applications or software; we consider this a bias of the del.icio.us collection. For Web-based systems with a more general focus, the topic words could be more sparsely distributed. In addition, we see several cases where some seemingly irrelevant words appear in a relatively coherent word set, but there are few such cases and the noisy words are usually not ranked high. Here, we present a subset of the discovered topics due to the space limit. As illustrated in Table 1, five topics and their top words are presented. Topic 0 is a topic on the Web; Topic 2 is interested in research groups and the programming tools they offer; Topic 3 has many geographical locations; Topic 9 seems to concern dining and restaurants; Topic 32 is a topic on cooking and kitchens.

Table 1: Top words for a selected sample of discovered topics.
  Topic 0:  web site news http information time www page free home
  Topic 2:  software search online text links data work research services group science programming library education file code
  Topic 3:  world states usa country west japan europe north asia australia south russian worldwide
  Topic 9:  product process quality cool sale feedback catalog suggestions patterns pretty rates clothing cds
  Topic 32: cookies tea sugar cafe orange organic milk bread food egg meat diet fruit kitchen snacks
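For reference, per-topic top-word lists like those in Table 1 can be read off the trained counts directly; the short sketch below (ours) uses the smoothed estimate of P(w | t) from the word-topic count matrix, with placeholder argument names.

import numpy as np

def top_words(CWT, vocab, beta=0.05, n=10):
    # CWT: V x T word-topic count matrix from Gibbs sampling; vocab: index -> word.
    V, T = CWT.shape
    p = (CWT + beta) / (CWT.sum(axis=0, keepdims=True) + V * beta)   # P(w | t)
    return [[vocab[i] for i in np.argsort(-p[:, t])[:n]] for t in range(T)]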
5. INFORMATION RETRIEVAL BASED ON RISK MINIMIZATION
In this section, we propose a method to incorporate the topic discovery results discussed in the previous sections into language modeling-based information retrieval. We first review an information retrieval framework based on Bayesian decision theory. Then, in the next section, we propose a method that naturally combines the topical analysis language models to improve retrieval quality, which is incremental and requires little computational overhead.

In the language modeling (LM) approach to information retrieval (IR), queries and documents are each modeled by a probabilistic LM. Let θQ denote the parameters of a query model, and let θD denote the parameters of a document model. LM-based IR involves two independent phases. In one, the generation of a query is viewed as a probabilistic process associated with a certain user: this user first selects the query model θQ and then picks a query q from the query model θQ with probability P(q|θQ). In the other, the document generation has been carried out: first the document language model θD is chosen and then the document d is generated word by word with probability P(d|θD). The task of an IR system is to determine the probability of a document being relevant to the query, given that their LMs have been respectively estimated.

Here we work within a risk minimization framework for IR proposed earlier [12]. The framework views the retrieval of relevant documents as actions to be carried out in Bayesian decision theory; the goal of retrieval is equivalent to minimizing the expected loss.

Risk Minimization Framework: Suppose relevance is a binary random variable R ∈ {0, 1}. Consider the task of a retrieval system as the problem of returning a list of documents for the issued query q. In the general framework of Bayesian decision theory, each action has an associated loss, which in our case is the loss for returning a particular document to the user. Assuming that the loss function depends only on θQ, θD, and R, the expected risk of returning document di is

  R(di; q) = Σ_{R∈{0,1}} ∫_{θD} ∫_{θQ} L(θQ, θD, R) P(θQ|q) P(θD|di) P(R|θQ, θD) dθQ dθD,   (12)

where L(θQ, θD, R) is the loss function, P(θQ|q) is the probability of the query model being parameterized by θQ given the query q, P(θD|di) is the probability of the document model being parameterized by θD given the document di, and P(R|θQ, θD) is the probability of relevance R given the parameter sets θQ and θD.

Following earlier work [12], we assume that the loss function depends only on θQ and θD and is proportional to the distance between θQ and θD, i.e.,

  L(θQ, θD, R) ∝ Δ(θQ, θD).   (13)

The expected risk for returning di to q is thus

  R(di; q) ∝ ∫_{θD} ∫_{θQ} Δ(θQ, θD) P(θQ|q) P(θD|di) dθQ dθD.   (14)

Note that P(θQ|q) depends on the input q only and is the same for all candidate documents di. Rather than explicitly computing the risk in integral form, we can use the point estimate with the posterior estimates θq and θdi:

  R(di; q) ∝ Δ(θq, θdi) P(θD|di),   (15)

where θq and θdi can be obtained using maximum likelihood estimation by observing the words in the query and documents. Further assuming that P(θD|di) is the same for all di, the risk minimization framework finally becomes a measurement of the distance between two LMs, θq and θdi. As in other related work, we can employ the Kullback-Leibler divergence to measure Δ, yielding

  R(di; q) ∝ Δ(θq, θdi) = Σw P(w|θq) log [ P(w|θq) / P(w|θdi) ].   (16)

Comments: According to Eq. 16, the setup of the risk minimization framework has made the measurement of relevance depend only on the LMs of the query and the document, i.e., the posterior parameters θq and θdi. This paper proposes a refinement of the query and document LMs using the LMs obtained from social annotations.
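A minimal sketch (ours) of ranking by Eq. 16: documents are scored by the negative KL divergence between the query LM and the document LM, with LMs represented as word-probability dictionaries; the floor constant is a placeholder for whatever smoothing the document LM already carries (Sec. 6).

import math

def neg_kl_score(query_lm, doc_lm, floor=1e-10):
    # Return -KL(query_lm || doc_lm); a higher score means a lower risk (Eq. 16).
    score = 0.0
    for w, pq in query_lm.items():
        pd = doc_lm.get(w, floor)
        score -= pq * math.log(pq / pd)
    return score

# Rank documents by descending score, i.e., by ascending expected risk:
# ranked = sorted(doc_lms, key=lambda d: neg_kl_score(query_lm, doc_lms[d]), reverse=True)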
6. LANGUAGE MODEL EXPANSION USING SOCIAL ANNOTATIONS
Our goal now is to improve the LMs of the query and the documents, say θq → θ'q and θdi → θ'di. Here θq → θ'q is also known as query expansion [11] and θdi → θ'di is also known as document expansion [19]. There are several ways to perform LM expansion. In this paper we focus on linear interpolation [9] (a.k.a. linear smoothing) for combining two LMs. Define an operator ⊕ for linear smoothing, where a ⊕ b ≡ λa + (1 − λ)b, assuming a and b are both normalized to the same scale. When applied to combining two LMs θ1 and θ2, we define θ1 ⊕ θ2 as:

  ∀v ∈ θ1 ∪ θ2:  P(v | θ1 ⊕ θ2) = λ P(v|θ1) + (1 − λ) P(v|θ2),   (17)

where v can be a word, a phrase, or simply a token that denotes a special meaning (e.g. a topic). In the case when v ∉ θ1, P(v|θ1 ⊕ θ2) = (1 − λ)P(v|θ2); similarly, P(v|θ1 ⊕ θ2) = λP(v|θ1) when v ∉ θ2. That is, one LM can easily be improved by smoothing it with another "better" LM, as long as the two can be combined using the above linear operator.

Now let us suppose the LMs we want to improve are already estimated. In the following, we give three types of additional LMs we can estimate based on the previous topical analysis of annotations and content. The first simply treats the annotations as additional terms of the documents; the second expands the query with the topics; the third proposes several expansion methods on the document LM.

6.1 Word-Level Annotation Language Model
The annotation LM we give here is an ad-hoc improvement. For each document d, consider the set of words in its tags, each with the frequency of being used for d. From these observations we can estimate a LM, say L^d_w,⁴ for every d. It easily follows that L^d_w can be combined with θdi using Eq. 17. For the word-level annotation language model, we focus on the simple case of a unigram LM, in which each word is assumed to occur according to the latent probability distribution regardless of the surrounding words.

⁴Note we use L instead of θ to denote the additional LMs used in expansion, for clarity. L^d_w means a LM trained at the word level for document expansion; similarly, L^q_t indicates a LM at the topic level for query expansion.

6.2 Topic-Level Query Language Models
In this and the following section, we seek to make use of the topical analysis of documents previously made in Sec. 3. Recall that in the standard framework, θq is just the empirical distribution of the query q = w1, ..., wk. This original word-level query model has been shown to underperform [12, 11]. In our approach, we seek to estimate the LMs at a higher level. In particular, we consider each discovered topic as a token in the LM. These tokens will later be matched against the topics discovered for the documents to determine their relevance.

First, we estimate the conditional probability that a query word belongs to the topic t, say P(t|w). Over all topics, we have a vector v_{t|w} = ⟨P(t1|w), ..., P(tT|w)⟩. After normalization, v_{t|w} becomes a probability distribution over topics, or rather, a topic-level LM. Second, we merge the multiple topic distributions of the query words into a single topic distribution. Let the desired topic-level query LM be L^q_t. In the unigram case, L^q_t is also a vector of T dimensions, where each element denotes the probability of a particular topic. Formally, we have

  L^q_t = Σ_{w ∈ q} γw v_{t|w},   (18)

where γw is the normalized weight for the word w, and L^q_t(i) denotes the probability of topic i under this model. Note the setting of γw allows us to have Σi L^q_t(i) = 1. Again, using ⊕, we combine the models at the different levels.

6.3 Topic-Level Document Language Models
Now let us focus on the document LMs. It is easy to see that each document already has a probability distribution over the topics discovered by the proposed modeling, denoted by a vector v_{t|d} = ⟨P(t1|d), ..., P(tT|d)⟩. Consider this vector as a LM where each topic is a unit. We use ⊕ to combine this topic-level LM with the original document LM.
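The ⊕ operator of Eq. 17 and the topic-level query LM of Eq. 18 are simple to realize in code. Below is a short sketch (ours); LMs are dictionaries over tokens (words or topic ids), and the uniform per-word weights stand in for γw.

def smooth(lm1, lm2, lam=0.2):
    # Linear interpolation of two LMs (Eq. 17): lam*P(v|lm1) + (1-lam)*P(v|lm2).
    # Tokens may be words or topic ids; a token missing from one LM contributes zero there.
    return {v: lam * lm1.get(v, 0.0) + (1.0 - lam) * lm2.get(v, 0.0)
            for v in set(lm1) | set(lm2)}

def topic_query_lm(query_words, p_topic_given_word):
    # Topic-level query LM (Eq. 18) with uniform normalized weights in place of gamma_w.
    # p_topic_given_word: dict word -> list of T probabilities P(t|w).
    lm = {}
    for w in query_words:
        for t, p in enumerate(p_topic_given_word[w]):
            lm[t] = lm.get(t, 0.0) + p / len(query_words)
    return lm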
How, then, can we leverage the user information in the annotations? Recall that the probabilistic model in Sec. 3 also outputs the topic distribution of users. Denote this distribution by a T-dimensional vector u_{t|x} = ⟨P(t1|x), ..., P(tT|x)⟩, where each element P(ti|x) denotes the probability of user x belonging to topic ti. Let the document d be tagged by a set of users, say U(d). We combine the multiple LMs of the users in U(d). In particular, the desired model L^d_t is generated in addition to, and will be combined with, the original topic-level LM of the document, v_{t|d}. Let the trust or importance of user x be λx. The L^d_t is obtained as

  L^d_t = λd v_{t|d} + Σ_{x ∈ U(d)} λx u_{t|x},   (19)

where λd + Σ_{x∈U(d)} λx = 1. The λd accounts for the emphasis we place on the original discovery of topics for d, and λx, x ∈ U(d), determines the trust we place on each user x. We have now incorporated the topical analysis of documents and users into the original LM-based IR, with user domain differences also considered. How to best evaluate user importance is out of the scope of this paper.
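A small sketch (ours) of the mixture in Eq. 19, using each user's annotation count as the unnormalized trust score, as Sec. 7.1 later does; lam_d is an assumed setting.

def user_topic_doc_lm(v_t_d, user_topic_dists, user_tag_counts, lam_d=0.7):
    # Combine the document topic LM with the topic LMs of its taggers (Eq. 19).
    # v_t_d: list of T probabilities P(t|d); user_topic_dists: {user: [P(t|x)]};
    # user_tag_counts: {user: number of annotations}, used as trust weights.
    T = len(v_t_d)
    total = sum(user_tag_counts[u] for u in user_topic_dists) or 1.0
    lm = [lam_d * v_t_d[t] for t in range(T)]
    for u, dist in user_topic_dists.items():
        lam_x = (1.0 - lam_d) * user_tag_counts[u] / total   # user trusts sum to 1 - lam_d
        for t in range(T):
            lm[t] += lam_x * dist[t]
    return lm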
7. EXPERIMENTS ON IR QUALITY

7.1 User Domains & Expertise Evaluation
We first show the probability distribution over topics for several active users. We consider the topics as domains to which the users belong: a higher probability on certain topics indicates stronger interests of the user in those topics. For users with insufficient observations, the domain discovery tends to be less reliable. Figure 5 illustrates the distributions over the 80 topics for three randomly chosen active users. Users are clearly separated by their distributions. In general, the overall interest of each user is a mixture of interests in several topics; some topics are more interesting to a user than others, and among the interesting topics some are more preferred than others.

Figure 5: The probability distributions over the 80 topics for three active users.

In the following, we discuss the evaluation of user-specific trust. We start by showing the properties of user activities. Fig. 6 presents the number of users with respect to the number of tags they have generated in the data. It is clear that over 60% of the users contribute fewer than 50 tags to the data and very few of them generate more than 300. From the log-scale and log-log scale plots, we can see that the intensities of user activities follow a power-law distribution.

Figure 6: The number of users vs. the number of generated tags, in normal scale and log-log scale.

The power-law property of user activities is in fact helpful for determining the trust we should place on each user. In most cases, users are more or less equivalent in their activity intensity, and we should not differentiate them in the trust scores; for some very active users, we might want to give higher priority. For simplicity, this paper uses the number of annotations a user has made as the user-specific trust score. One might consider combining other metrics, such as the time elapsed since the last visit or the visit frequency. Note that the framework we propose allows a flexible definition of trust scores for users.

7.2 IR Quality
Now let us evaluate the IR quality of various language modeling (LM) approaches. The methods we compare are:
· Word-level LM on content (W-QD): The query LM is trained on the original query and the document LM is trained on the original document content.
· Word-level LM on content and annotations (W-QDA): The query LM is trained on the original query and the document LM is trained on both the document content and the annotations.
· Word-level LM + LDA on content and annotations (WT-LDA): We run LDA on the documents plus annotations by treating annotations as additional words, without consideration of user differences. The topic-level LM is combined with W-QDA using the parameter λ1.
· Word-level LM + Topic-level LM (WT-QDA): We run the proposed topic analysis model on the documents and annotations, obtaining topic information for documents and users. The topic-level LM is then combined with the word-level LM W-QDA, using the parameter λ1.
· Word-level LM + Topic-level LM on documents and users (WT-QDAU): User domain interests are considered here. First, the word-level LM and topic-level LM and their combination are trained as in WT-QDA. Second, the document LM is combined with the mixture of topics of the users who tag the document, using the parameter λ2. Note that the users are treated the same in the first step.
The EM-based IR proposed in related work [20] increased the DCG scores by 11.5% over traditional LMbased IR (W-QD); The method that uses annotations as additional words improved the DCG by 18.3% (W-QDA over W-QD), which demonstrates that the use of annotation can dramatically improve IR quality. Second, we examine the improvement based on topical analysis on both document content and annotations. The basic use of the topic information (WT-LDA) further improves the use of annotations (W-QDA) by 2.7%. The topic analysis based on the new generative model, compared with WT-LDA, achieves a gain of 1.3%. It is worthwhile to mention that the LDA-based topic analysis improves a very recent related work [20] (EM-IR) by 9.1%. Third, we test the improvement by incorporating tagger interests. As illustrated in Table 2, WT-QDAU outperforms pure topic-based IR by 1.1%, showing the importance of user interests. Fourth, we show the improvement by considering the differences of users while incorporating user interests. The WT-QDAU+ adds another 1.3% in DCG over WT-QDAU. This shows that due to the different user expertise, the quality of tags can be different and thus should be taken into consideration. Overall, the top performance of our proposed model (WTQDAU+ ) improved the traditional LM-based IR model by W-QD 7.6192 WT-QDA 9.3820 EM-IR 8.4945 WT-QDAU 9.4938 W-QDA 9.0167 WT-QDAU+ 9.6167 WT-LDA 9.2602 Table 2: The DCG10 scores of six compared approaches: W-QD, EM-IR, W-QDA, WT-LDA, WTQDA, WT-QDAU, WT-QDAU+ . 26%, compared with the the 11.5% improvements by the EM-based approach in [20]. 9.2 1 2 9.1 9 DCG10 scores 8.9 8.8 8.7 8.6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Different settings of 1 and 2 Figure 7: The change of DCG10 scores for different settings of 1 and 2 , where 1 is the parameter for combining topics with the original LM and 2 is the parameter for combining user topic mo dels. 7.3 Sensitivity to Parameter Selection Finally, we study the effects of parameters in the proposed approach. Two parameters are examined, one being for the WT-QDA (1 ) and the other for the WT-QDAU (2 ). Note 1 is the weight on the topic-level LM on query and documents and 2 is the weight on the LM generated on users. To determine the optimal 1 and 2 , we perform crossvalidation against user judgement. Figure 7 demonstrates the change of DCG scores for different settings of 1 and 2 . From the figure, we can see the proposed approach is very sensitive to 1 but less sensitive to 2 . The 1 reaches best performance at around 1 = 0.2. The 2 reaches best performances at about 2 = 0.3. This indicates a limited input of topic information will improve LM-based IR but relying on topic information too much fails to differentiate the information to be retrieved. 8. CONCLUSIONS & FUTURE WORK This paper presents a framework that combines the modeling of information retrieval on the documents associated with social annotations. A new probabilistic generative model is proposed for the generation of document content as well as the associated social annotations. A new way for discovering user domains is presented based on social annotations. Several methods are proposed for combining language models from tags with those from the documents. We then evaluate user expertise based on activity intensities. Experimental evaluation on real-world datasets demonstrates effectiveness of the proposed model and the improvements over traditional IR approach based on language modeling. 
For future work, one could take a closer look at the effects of the parameter sets. It would be useful to reduce the number of parameters for easier tuning in practical use, and to explore more indicators of the domain expertise of users and their use in improving user experiences. The inter-personal social networks and communities of users could be studied more thoroughly: how the user social network correlates with social annotations is not clear and remains an interesting question. The temporal dimension of user activities could also be considered for specific queries. In addition, it would be interesting to model the changes in user annotation behaviors; patterns in the development of user annotations might further advance the use of annotations for more effective information retrieval.

9. ACKNOWLEDGMENTS
We would like to thank David J. Miller from the Department of Electrical Engineering at the Pennsylvania State University for his valuable discussions. This work was funded in part by the National Science Foundation and the Raytheon Corporation.

10. REFERENCES
[1] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34-43, 2001.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
[3] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. A. Tomlin, and J. Y. Zien. SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation. In Proceedings of the 12th International Conference on World Wide Web, pages 178-186, 2003.
[4] S. Golder and B. A. Huberman. Usage patterns of collaborative tagging systems. Journal of Information Science, pages 198-208, 2006.
[5] T. Griffiths and M. Steyvers. Finding scientific topics. In Proceedings of the National Academy of Sciences, 2004.
[6] A. Hotho, R. Jaschke, C. Schmitz, and G. Stumme. Information retrieval in folksonomies: Search and ranking. In Y. Sure and J. Domingue, editors, The Semantic Web: Research and Applications, volume 4011 of LNAI, pages 411-426, Heidelberg, June 2006. Springer.
[7] P. Jackson. Introduction to Expert Systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.
[8] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-48, 2000.
[9] F. Jelinek and R. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice, 1980.
[10] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. In Proceedings of the 12th International Conference on World Wide Web, pages 568-576, 2003.
[11] O. Kurland, L. Lee, and C. Domshlak. Better than the real thing?: Iterative pseudo-query processing using cluster-based language models. In SIGIR '05: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19-26, New York, NY, USA, 2005. ACM Press.
[12] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In SIGIR '01: Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval, pages 111-119, 2001.
[13] A. K. McCallum. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning, 1999.
[14] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: An exploration of temporal text mining. In KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 198-207, New York, NY, USA, 2005. ACM Press.
[15] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275-281, New York, NY, USA, 1998. ACM Press.
[16] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 2005.
[17] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI '04: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487-494. AUAI Press, 2004.
[18] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315. ACM Press, 2004.
[19] T. Tao, X. Wang, Q. Mei, and C. Zhai. Language model information retrieval with document expansion. In HLT-NAACL, 2006.
[20] X. Wu, L. Zhang, and Y. Yu. Exploring social annotations for the semantic web. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 417-426, New York, NY, USA, 2006. ACM Press.
[21] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179-214, 2004.
[22] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 173-182. ACM Press, 2006.