Modeling Online Reviews with Multi-grain Topic Models

Ivan Titov
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
titov@uiuc.edu
(This work was done while at Google Inc.)

Ryan McDonald
Google Inc., 76 Ninth Avenue, New York, NY 10011
ryanmcd@google.com

ABSTRACT
In this paper we present a novel framework for extracting the ratable aspects of objects from online user reviews. Extracting such aspects is an important challenge in automatically mining product opinions from the web and in generating opinion-based summaries of user reviews [18, 19, 7, 12, 27, 36, 21]. Our models are based on extensions to standard topic modeling methods such as LDA and PLSA to induce multi-grain topics. We argue that multi-grain models are more appropriate for our task, since standard models tend to produce topics that correspond to global properties of objects (e.g., the brand of a product type) rather than the aspects of an object that tend to be rated by a user. The models we present not only extract ratable aspects, but also cluster them into coherent topics, e.g., waitress and bartender are part of the same topic staff for restaurants. This differentiates our approach from much of the previous work, which extracts aspects through term frequency analysis with minimal clustering. We evaluate the multi-grain models both qualitatively and quantitatively to show that they improve significantly upon standard topic models.

Categories and Subject Descriptors: H.2.8 [Information Systems]: Data Mining; H.3.1 [Information Systems]: Content Analysis and Indexing; H.4 [Information Systems]: Information Systems Applications

General Terms: Design, experimentation

1. INTRODUCTION
The amount of Web 2.0 content is expanding rapidly. Due to its source, this content is inherently noisy. However, UI tools often allow for at least some minimal labeling, such as topics in blogs, numerical product ratings in user reviews, and helpfulness rankings in online discussion forums. This unique mix has led to the development of tailored mining and retrieval algorithms for such content [18, 11, 24]. In this study we focus on online user reviews that have been provided for products or services, e.g., electronics, hotels and restaurants.

The most studied problem in this domain is sentiment and opinion classification: the task of classifying a text as being either subjective or objective, or as having positive, negative or neutral sentiment [34, 25, 31]. However, the sentiment of online reviews is often provided by the user. As such, a more interesting problem is to adapt classifiers to blogs and discussion forums to extract additional opinions of products and services [24, 21]. Recently, there has been a focus on systems that produce fine-grained sentiment analysis of user reviews [19, 27, 6, 36]. As an example, consider hotel reviews. A standard hotel review will probably discuss such aspects of the hotel as cleanliness, rooms, location, staff, dining experience, business services, amenities, etc. Similarly, a review of an Mp3 player is likely to discuss aspects like sound quality, battery life, user interface, appearance, etc.
Readers are often interested not only in the general sentiment towards an object, but also in a detailed opinion analysis for each of these aspects. For instance, a couple on their honeymoon is probably not interested in the quality of the Internet connection at a hotel, whereas this aspect can be of primary importance for a manager on a business trip. These considerations underline the need for models that automatically detect the aspects discussed in an arbitrary fragment of a review and predict the sentiment of the reviewer towards these aspects. If such a model were available, it would be possible to systematically generate a list of sentiment ratings for each aspect and, at the same time, to extract textual evidence from the reviews supporting each of these ratings.

Such a model would have many uses. The example above, where users search for products or services based on a set of critical criteria, is one such application. A second application would be a mining tool for companies that want fine-grained results for tracking online opinions of their products. Another application could be Zagat (http://www.zagat.com) or TripAdvisor (http://www.tripadvisor.com) style aspect-based opinion summarizations for a wide range of services beyond just restaurants and hotels.

Fine-grained sentiment systems typically solve the task in two phases. The first phase attempts to extract the aspects of an object that users frequently rate [18, 7]. The second phase uses standard techniques to classify and aggregate sentiment over each of these aspects [19, 6]. In this paper we focus on improved models for the first phase: ratable aspect extraction from user reviews. In particular, we focus on unsupervised models for extracting these aspects. The model we describe can extend both Probabilistic Latent Semantic Analysis (PLSA) [17] and Latent Dirichlet Allocation (LDA) [3], both of which are state-of-the-art topic models.

We start by showing that standard topic modeling methods, such as LDA and PLSA, do not model the appropriate aspects of user reviews. In particular, these models tend to build topics that globally classify terms into product instances (e.g., Creative Labs Mp3 players versus iPods, or New York versus Paris hotels). To combat this we extend both PLSA and LDA to induce multi-grain topics. Specifically, we allow the models to generate terms from either a global topic, which is chosen based on the document-level context, or a local topic, which is chosen based on a sliding window context over the text. The local topics more faithfully model aspects that are rated throughout the review corpus. Furthermore, the number of quality topics is drastically improved over standard topic models, which have a tendency to produce many useless topics in addition to a number of coherent ones.

We evaluate the models both qualitatively and quantitatively. For the qualitative analysis we present a number of topics generated by both standard topic models and our new multi-grain topic models to show that the multi-grain topics are both more coherent and better correlated with ratable aspects of an object. For the quantitative analysis we show that the topics generated from the multi-grain topic model can significantly improve multi-aspect ranking [30], which attempts to rate the sentiment of individual aspects from the text of user reviews in a supervised setting.

The rest of the paper is structured as follows.
Section 2 begins with a review of the standard topic modeling approaches, PLSA and LDA, and a discussion of their applicability to extracting ratable aspects of products and services. In the rest of the section we introduce a multi-grain model as a way to address the discovered limitations of PLSA and LDA. Section 3 describes an inference algorithm for the multi-grain model. In Section 4 we provide an empirical evaluation of the proposed method. Section 5 examines related work, and Section 6 concludes with a summary and directions for future work.

Throughout this paper we use the term aspect to denote properties of an object that are rated by a reviewer. Other terms in the literature include features and dimensions, but we opted for aspects due to ambiguity in the use of the alternatives.

2. UNSUPERVISED TOPIC MODELING
As discussed in the preceding section, our goal is to provide a method for extracting ratable aspects from reviews without any human supervision. Therefore, it is natural to use generative models of documents, which represent a document as a mixture of latent topics, as a basis for our approach. In this section we consider the applicability of the most standard methods for unsupervised modeling of documents, Probabilistic Latent Semantic Analysis (PLSA) [17] and Latent Dirichlet Allocation (LDA) [3], to the problem at hand. This analysis allows us to recognize the limitations of these models in this context and to propose a new model, Multi-grain LDA.

2.1 PLSA & LDA
Unsupervised topic modeling has been an area of active research since the PLSA method was proposed in [17] as a probabilistic variant of the LSA method [9], an approach widely used in information retrieval to perform dimensionality reduction of documents. PLSA uses the aspect model [29] to define a generative model of a document. It assumes that the document is generated using a mixture of $K$ topics, where the mixture coefficients are chosen individually for each document. The model is defined by parameters $\varphi$, $\theta$ and $\rho$, where $\varphi_z$ is the distribution $P(w|z)$ of words in latent topic $z$, $\theta_d$ is the distribution $P(z|d)$ of topics in document $d$, and $\rho_d$ is the probability of choosing document $d$, i.e. $P(d)$. Generation of a word in this model is then defined as follows:

- choose document $d \sim \rho$,
- choose topic $z \sim \theta_d$,
- choose word $w \sim \varphi_z$.

The probability of the observed word-document pair $(d, w)$ is obtained by marginalizing over the latent topics:

$$P(d, w) = \rho(d) \sum_z \theta_d(z)\,\varphi_z(w).$$

The Expectation Maximization (EM) algorithm [10] is used to calculate maximum likelihood estimates of the parameters. This leads to $\rho(d)$ being proportional to the length of document $d$. As a result, the interesting parts of the model are the distributions of words in latent topics, $\varphi$, and the distributions of topics in each document, $\theta$. The number of parameters grows linearly with the size of the corpus, which leads to overfitting. A regularized version of the EM algorithm, Tempered EM (TEM) [26], is normally used in practice.

Along with the need to combat overfitting by using appropriately chosen regularization parameters, the main drawback of the PLSA method is that it is inherently transductive, i.e., there is no direct way to apply the learned model to new documents. In PLSA each document $d$ in the collection is represented as a mixture of topics with mixture coefficients $\theta_d$, but no such representation is defined for documents outside the collection. The hierarchical Bayesian LDA model proposed in [3] solves both of these problems by defining a generative model for the distributions $\theta_d$. In LDA, generation of a collection starts by sampling a word distribution $\varphi_z$ from a prior Dirichlet distribution $Dir(\beta)$ for each latent topic. Then each document $d$ is generated as follows:

- choose a distribution of topics $\theta_d \sim Dir(\alpha)$,
- for each word $i$ in document $d$:
  - choose topic $z_{d,i} \sim \theta_d$,
  - choose word $w_{d,i} \sim \varphi_{z_{d,i}}$.

The model is represented in Figure 1a using standard graphical model notation. LDA has only two parameters, $\alpha$ and $\beta$ (usually the symmetric Dirichlet distribution $Dir(a) = \frac{1}{B(a)} \prod_i x_i^{a-1}$ is used for both of these priors, which implies that $\alpha$ and $\beta$ are both scalars), which prevents it from overfitting. Unfortunately, exact inference in such a model is intractable, and various approximations have been considered [3, 23, 14]. Originally, the variational EM approach was proposed in [3], which, instead of generating $\varphi$ from Dirichlet priors, uses point estimates of the distributions $\varphi$ and performs approximate inference in the resulting model using variational techniques.
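For concreteness, the LDA generative story just described can be sketched as follows. This is a minimal illustration, not code from the paper; the corpus sizes and hyperparameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not from the paper)
K, W, D = 5, 1000, 100        # topics, vocabulary size, documents
alpha, beta = 0.1, 0.01       # scalar symmetric Dirichlet hyperparameters

# Draw a word distribution phi_z ~ Dir(beta) for each latent topic z
phi = rng.dirichlet(np.full(W, beta), size=K)

def generate_document(length):
    """Generate one document by the LDA generative process."""
    theta_d = rng.dirichlet(np.full(K, alpha))   # topic mixture of this document
    doc = []
    for _ in range(length):
        z = rng.choice(K, p=theta_d)             # choose topic z ~ theta_d
        doc.append(rng.choice(W, p=phi[z]))      # choose word w ~ phi_z
    return doc

corpus = [generate_document(int(rng.integers(50, 200))) for _ in range(D)]
```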
The number of parameters in this empirical Bayes model is still not directly dependent on the number of documents and, therefore, the model is not expected to suffer from overfitting. Another approach is to use a Markov chain Monte Carlo algorithm for inference with LDA, as proposed in [14]. In Section 3 we describe a modification of this sampling method for the proposed Multi-grain LDA model.

Both the LDA and PLSA methods use the bag-of-words representation of documents and therefore can only exploit co-occurrences at the document level. This is fine provided the goal is to represent the overall topic of a document, but our goal is different: extracting ratable aspects. The main topic of all the reviews for a particular item is virtually the same: a review of this item. Therefore, when such topic modeling methods are applied to a collection of reviews for different items, they infer topics corresponding to the distinguishing properties of these items. E.g., when applied to a collection of hotel reviews, these models are likely to infer topics such as hotels in France, New York hotels or youth hostels; similarly, when applied to a collection of Mp3 player reviews, they will infer topics like reviews of iPods or reviews of Creative Zen players. Though these are all valid topics, they do not represent ratable aspects; rather, they define clusterings of the reviewed items into specific types. In further discussion we refer to such topics as global topics, because they correspond to a global property of the object in the review, such as its brand or base of operation.

Discovering topics that correlate with ratable aspects, such as cleanliness and location for hotels, is much more problematic with LDA or PLSA. Most of these topics are present in some way in every review; therefore, it is difficult to discover them using only co-occurrence information at the document level. In this case an exceedingly large amount of training data is needed, as well as a very large number of topics K. Even then there is a danger that the model will be overwhelmed by very fine-grained global topics, or that the resulting topics will be intersections of global topics and ratable aspects, like location for hotels in New York. We will show in Section 4 that this hypothesis is confirmed experimentally.

One way to address this problem would be to consider co-occurrences at the sentence level, i.e., to apply LDA or PLSA to individual sentences. But in this case we would not have a sufficient co-occurrence domain, and it is known that LDA and PLSA behave badly when applied to very short documents. This problem can be addressed by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16], but these topic n-gram models are considerably more computationally expensive. Also, like LDA and PLSA, they are not able to distinguish between topics corresponding to ratable aspects and global topics representing properties of the reviewed item. In the following section we introduce a method which explicitly models both types of topics and efficiently infers ratable aspects from a limited amount of training data.
2.2 MG-LDA
We propose a model called Multi-grain LDA (MG-LDA), which models two distinct types of topics: global topics and local topics. As in PLSA and LDA, the distribution of global topics is fixed for a document. However, the distribution of local topics is allowed to vary across the document. A word in the document is sampled either from the mixture of global topics or from the mixture of local topics specific to the local context of the word. The hypothesis is that ratable aspects will be captured by local topics, while global topics will capture properties of the reviewed items. For example, consider an extract from a review of a London hotel: ". . . public transport in London is straightforward, the tube station is about an 8 minute walk . . . or you can get a bus for £1.50". It can be viewed as a mixture of the topic London shared by the entire review (words: "London", "tube", "£") and the ratable aspect location, specific to the local context of the sentence (words: "transport", "walk", "bus"). Local topics are expected to be reused between very different types of items, whereas global topics will correspond only to particular types of items. In order to capture only genuine local topics, we allow a large number of global topics, effectively creating a bottleneck at the level of local topics. Of course, this bottleneck is specific to our purposes; other applications of multi-grain topic models might conceivably prefer the bottleneck reversed. Finally, we note that our definition of multi-grain is simply for two levels of granularity, global and local. In principle, though, there is nothing preventing the model described in this section from extending beyond two levels. One might expect that for other tasks even more levels of granularity could be beneficial.

We represent a document as a set of sliding windows, each covering T adjacent sentences within it. Each window v in document d has an associated distribution over local topics $\theta^{loc}_{d,v}$ and a distribution defining the preference for local topics versus global topics, $\pi_{d,v}$. A word can be sampled using any window covering its sentence s, where the window is chosen according to a categorical distribution $\psi_s$. Importantly, the fact that the windows overlap permits the model to exploit a larger co-occurrence domain. These simple techniques are capable of modeling local topics without the more expensive modeling of topic transitions used in [5, 15, 33, 32, 28, 16]. Introducing a symmetric Dirichlet prior $Dir(\gamma)$ for the distribution $\psi_s$ permits control over the smoothness of topic transitions in our model.

The formal definition of the model with $K^{gl}$ global and $K^{loc}$ local topics is the following. First, draw $K^{gl}$ word distributions for global topics $\varphi^{gl}_z$ from a Dirichlet prior $Dir(\beta^{gl})$ and $K^{loc}$ word distributions for local topics $\varphi^{loc}_z$ from $Dir(\beta^{loc})$. Then, for each document d:

- choose a distribution of global topics $\theta^{gl}_d \sim Dir(\alpha^{gl})$,
- for each sentence s choose a distribution $\psi_{d,s}(v) \sim Dir(\gamma)$,
- for each sliding window v:
  - choose $\theta^{loc}_{d,v} \sim Dir(\alpha^{loc})$,
  - choose $\pi_{d,v} \sim Beta(\alpha^{mix})$,
- for each word i in sentence s of document d:
  - choose window $v_{d,i} \sim \psi_{d,s}$,
  - choose $r_{d,i} \sim \pi_{d,v_{d,i}}$,
  - if $r_{d,i} = gl$, choose global topic $z_{d,i} \sim \theta^{gl}_d$,
  - if $r_{d,i} = loc$, choose local topic $z_{d,i} \sim \theta^{loc}_{d,v_{d,i}}$,
  - choose word $w_{d,i}$ from the word distribution $\varphi^{r_{d,i}}_{z_{d,i}}$.

Here, $Beta(\alpha^{mix})$ is a prior Beta distribution for choosing between local and global topics. Though symmetric Beta distributions can be considered, we use a non-symmetric one, as it permits regulating the preference for either global or local topics by setting $\alpha^{mix}_{gl}$ and $\alpha^{mix}_{loc}$ accordingly. Figure 1b presents the corresponding graphical model.

Figure 1: (a) LDA model. (b) MG-LDA model. [graphical models; figure not reproduced]

As we will show in the following section, this model allows for fast approximate inference with collapsed Gibbs sampling.
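The generative story above can be sketched directly. The following is a minimal illustration under stated assumptions (topic counts, window size and hyperparameter values are invented; sentence s is taken to be covered by windows v = s, ..., s+T-1, one possible indexing convention):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not values from the paper)
K_gl, K_loc, W, T = 30, 10, 1000, 3
alpha_gl = alpha_loc = gamma = 0.1
beta_gl = beta_loc = 0.01
amix_gl, amix_loc = 1.0, 1.0                        # Beta prior for global/local choice

phi_gl = rng.dirichlet(np.full(W, beta_gl), size=K_gl)    # global topic word dists
phi_loc = rng.dirichlet(np.full(W, beta_loc), size=K_loc)  # local topic word dists

def generate_document(sentence_lengths):
    """Follow the MG-LDA generative story for one document."""
    S = len(sentence_lengths)
    V = S + T - 1                                   # number of sliding windows
    theta_gl = rng.dirichlet(np.full(K_gl, alpha_gl))
    psi = rng.dirichlet(np.full(T, gamma), size=S)  # window choice per sentence
    theta_loc = rng.dirichlet(np.full(K_loc, alpha_loc), size=V)
    pi = rng.beta(amix_gl, amix_loc, size=V)        # P(r = gl) for each window
    words = []
    for s, n in enumerate(sentence_lengths):
        for _ in range(n):
            v = s + rng.choice(T, p=psi[s])         # a window covering sentence s
            if rng.random() < pi[v]:                # r = gl: sample a global topic
                z = rng.choice(K_gl, p=theta_gl)
                words.append(rng.choice(W, p=phi_gl[z]))
            else:                                   # r = loc: sample a local topic
                z = rng.choice(K_loc, p=theta_loc[v])
                words.append(rng.choice(W, p=phi_loc[z]))
    return words

doc = generate_document([12, 8, 15, 9])             # a 4-sentence document
```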
We should note a fundamental difference between MG-LDA and other methods that model topics at different levels or granularities, such as hierarchical topic models like hLDA [2] and Pachinko Allocation [20, 22]. MG-LDA topics are multi-grain with respect to the context from which they are derived, e.g., the document level or the sentence level. Hierarchical topic models instead model semantic interactions between topics that are all typically at the document level. The two methods are complementary, and one can conceive of a hierarchical MG-LDA.

3. INFERENCE WITH MG-LDA
In this section we describe a modification of the inference algorithm proposed in [14].
But before starting with our Gibbs sampling algorithm, we should note that instead of sampling from the Dirichlet and Beta priors we could fix $\psi_{d,s}$ as a uniform distribution and compute maximum likelihood estimates for $\varphi$ and $\theta$. Such a model can be trained using the EM or TEM algorithm and viewed as a generalization of the PLSA aspect model.

Gibbs sampling is an example of a Markov chain Monte Carlo algorithm [13]. It is used to produce a sample from a joint distribution when only the conditional distributions of each variable can be computed efficiently. In Gibbs sampling, variables are sequentially sampled from their distributions conditioned on all other variables in the model. Such a chain of model states converges to a sample from the joint distribution. A naive application of this technique to LDA would imply that both the assignments of topics to words z and the distributions $\theta$ and $\varphi$ must be sampled. However, Griffiths and Steyvers [14] demonstrated that an efficient collapsed Gibbs sampler can be constructed, in which only the assignments z need to be sampled, whereas the dependency on the distributions $\theta$ and $\varphi$ can be integrated out analytically. Though the derivation of the collapsed Gibbs sampler for MG-LDA is similar to the one proposed by Griffiths and Steyvers for LDA, we rederive it here for completeness.

In order to perform Gibbs sampling with MG-LDA we need to compute the conditional probability $P(v_{d,i}=v, r_{d,i}=r, z_{d,i}=z \mid \mathbf{v}', \mathbf{r}', \mathbf{z}', \mathbf{w})$, where $\mathbf{v}'$, $\mathbf{r}'$ and $\mathbf{z}'$ are the vectors of assignments of sliding windows, context (global or local) and topics for all the words in the collection except for the considered word at position $i$ in document $d$. We denote by $\mathbf{w}$ the vector of all the words in the collection. We start by showing how the joint probability of the assignments and the words, $P(\mathbf{w},\mathbf{v},\mathbf{r},\mathbf{z}) = P(\mathbf{w}\mid\mathbf{r},\mathbf{z})\,P(\mathbf{v},\mathbf{r},\mathbf{z})$, can be evaluated. By integrating out $\varphi^{gl}$ and $\varphi^{loc}$ we obtain the first term:

$$P(\mathbf{w}\mid\mathbf{r},\mathbf{z}) = \prod_{r\in\{gl,loc\}} \left(\frac{\Gamma(W\beta^{r})}{\Gamma(\beta^{r})^{W}}\right)^{K^{r}} \prod_{z=1}^{K^{r}} \frac{\prod_{w}\Gamma(n_{w}^{r,z}+\beta^{r})}{\Gamma(n^{r,z}+W\beta^{r})} \qquad (1)$$

where $W$ is the size of the vocabulary, $n_{w}^{gl,z}$ and $n_{w}^{loc,z}$ are the numbers of times word w appeared in global and local topic z, $n^{gl,z}$ and $n^{loc,z}$ are the total numbers of words assigned to global or local topic z, and $\Gamma$ is the gamma function. To evaluate the second term, we factor it as $P(\mathbf{v},\mathbf{r},\mathbf{z}) = P(\mathbf{v})\,P(\mathbf{r}\mid\mathbf{v})\,P(\mathbf{z}\mid\mathbf{r},\mathbf{v})$ and compute each of these factors individually. By integrating out $\psi$ we obtain

$$P(\mathbf{v}) = \left(\frac{\Gamma(T\gamma)}{\Gamma(\gamma)^{T}}\right)^{N_{s}} \prod_{d,s} \frac{\prod_{v}\Gamma(n_{v}^{d,s}+\gamma)}{\Gamma(n^{d,s}+T\gamma)} \qquad (2)$$

in which $N_s$ denotes the number of sentences in the collection, $n^{d,s}$ denotes the length of sentence s in document d, and $n_{v}^{d,s}$ is the number of times a word from this sentence is assigned to window v. Similarly, by integrating out $\pi$ we compute

$$P(\mathbf{r}\mid\mathbf{v}) = \left(\frac{\Gamma(\sum_{r}\alpha_{r}^{mix})}{\prod_{r\in\{gl,loc\}}\Gamma(\alpha_{r}^{mix})}\right)^{N_{v}} \prod_{d,v} \frac{\prod_{r\in\{gl,loc\}}\Gamma(n_{r}^{d,v}+\alpha_{r}^{mix})}{\Gamma(n^{d,v}+\sum_{r}\alpha_{r}^{mix})} \qquad (3)$$

In this expression $N_v$ is the total number of windows in the collection, $n^{d,v}$ is the number of words assigned to window v, and $n_{gl}^{d,v}$ and $n_{loc}^{d,v}$ are the numbers of times a word from window v was assigned to global and to local topics, respectively. Finally, we can compute the conditional probability of assignments of words to topics by integrating out both $\theta^{gl}$ and $\theta^{loc}$:

$$P(\mathbf{z}\mid\mathbf{r},\mathbf{v}) = \left(\frac{\Gamma(K^{gl}\alpha^{gl})}{\Gamma(\alpha^{gl})^{K^{gl}}}\right)^{D} \prod_{d} \frac{\prod_{z}\Gamma(n_{z}^{d,gl}+\alpha^{gl})}{\Gamma(n_{gl}^{d}+K^{gl}\alpha^{gl})} \times \left(\frac{\Gamma(K^{loc}\alpha^{loc})}{\Gamma(\alpha^{loc})^{K^{loc}}}\right)^{N_{v}} \prod_{d,v} \frac{\prod_{z}\Gamma(n_{loc,z}^{d,v}+\alpha^{loc})}{\Gamma(n_{loc}^{d,v}+K^{loc}\alpha^{loc})} \qquad (4)$$

Here $D$ is the number of documents, $n_{gl}^{d}$ is the number of times a word in document d was assigned to one of the global topics and $n_{z}^{d,gl}$ is the number of times a word in this document was assigned to global topic z. The counts $n_{loc}^{d,v}$ and $n_{loc,z}^{d,v}$ are defined similarly for local topics in window v of document d.

Now the conditional distribution $P(v_{d,i}=v, r_{d,i}=r, z_{d,i}=z \mid \mathbf{v}',\mathbf{r}',\mathbf{z}',\mathbf{w})$ can be obtained by cancellation of terms in expressions (1)-(4). For global topics we get

$$P(v_{d,i}=v, r_{d,i}=gl, z_{d,i}=z \mid \mathbf{v}',\mathbf{r}',\mathbf{z}',\mathbf{w}) \propto \frac{n_{v}^{d,s}+\gamma}{n^{d,s}+T\gamma} \times \frac{n_{w_{d,i}}^{gl,z}+\beta^{gl}}{n^{gl,z}+W\beta^{gl}} \times \frac{n_{gl}^{d,v}+\alpha_{gl}^{mix}}{n^{d,v}+\sum_{r}\alpha_{r}^{mix}} \times \frac{n_{z}^{d,gl}+\alpha^{gl}}{n_{gl}^{d}+K^{gl}\alpha^{gl}}$$

where s is the sentence in which word i appears. The factors correspond to the probabilities of choosing word $w_{d,i}$, choosing window v, choosing between global and local topics, and choosing topic z among the global topics.
For local topics, the conditional probability is estimated as

$$P(v_{d,i}=v, r_{d,i}=loc, z_{d,i}=z \mid \mathbf{v}',\mathbf{r}',\mathbf{z}',\mathbf{w}) \propto \frac{n_{v}^{d,s}+\gamma}{n^{d,s}+T\gamma} \times \frac{n_{w_{d,i}}^{loc,z}+\beta^{loc}}{n^{loc,z}+W\beta^{loc}} \times \frac{n_{loc}^{d,v}+\alpha_{loc}^{mix}}{n^{d,v}+\sum_{r}\alpha_{r}^{mix}} \times \frac{n_{loc,z}^{d,v}+\alpha^{loc}}{n_{loc}^{d,v}+K^{loc}\alpha^{loc}}$$

In both of these expressions, the counts are computed without taking into account the assignment of the considered word $w_{d,i}$. Sampling with such a model is fast, and in practice convergence can be achieved in time similar to that needed for standard LDA implementations. A sample obtained from such a chain can be used to approximate the distribution of words in topics:

$$\hat{\varphi}_{z}^{r}(w) \propto n_{w}^{r,z}+\beta^{r} \qquad (5)$$
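To make the sampler concrete, the following is a small self-contained sketch of how the unnormalized sampling scores over (v, r, z) follow from the two conditionals above. This is our illustration, not the authors' code: the counts are random toy values (a real sampler maintains them incrementally, subtracting the current word's assignment before each step), and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and hyperparameters (illustrative assumptions)
W, K_gl, K_loc, T, V = 50, 4, 3, 3, 5
beta_gl = beta_loc = 0.01
alpha_gl = alpha_loc = gamma = 0.1
amix_gl = amix_loc = 1.0

# Toy count statistics for one document
n_gl_zw    = rng.integers(0, 5, (K_gl, W))     # word w in global topic z
n_loc_zw   = rng.integers(0, 5, (K_loc, W))    # word w in local topic z
n_ds_v     = rng.integers(0, 4, T)             # sentence words per covering window
n_dv       = rng.integers(1, 10, V)            # words assigned to window v
n_dv_gl    = rng.integers(0, n_dv + 1)         # ... of which global
n_dv_loc   = n_dv - n_dv_gl                    # ... of which local
n_dv_loc_z = rng.integers(0, 3, (V, K_loc))    # local topic z in window v
n_d_gl_z   = rng.integers(0, 8, K_gl)          # global topic z in the document

def gibbs_scores(w, windows):
    """Unnormalized P(v, r, z | rest) for one word; `windows` lists the T
    windows covering the word's sentence."""
    out = []
    for j, v in enumerate(windows):
        p_win = (n_ds_v[j] + gamma) / (n_ds_v.sum() + T * gamma)
        denom = n_dv[v] + amix_gl + amix_loc
        for z in range(K_gl):                   # r = gl
            out.append((v, "gl", z, p_win
                * (n_gl_zw[z, w] + beta_gl) / (n_gl_zw[z].sum() + W * beta_gl)
                * (n_dv_gl[v] + amix_gl) / denom
                * (n_d_gl_z[z] + alpha_gl) / (n_d_gl_z.sum() + K_gl * alpha_gl)))
        for z in range(K_loc):                  # r = loc
            out.append((v, "loc", z, p_win
                * (n_loc_zw[z, w] + beta_loc) / (n_loc_zw[z].sum() + W * beta_loc)
                * (n_dv_loc[v] + amix_loc) / denom
                * (n_dv_loc_z[v, z] + alpha_loc) / (n_dv_loc[v] + K_loc * alpha_loc)))
    return out

# One Gibbs step: normalize the scores and sample a joint assignment
scores = gibbs_scores(w=7, windows=[1, 2, 3])
p = np.array([s[-1] for s in scores], dtype=float)
v, r, z, _ = scores[rng.choice(len(p), p=p / p.sum())]
```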
The distribution of topics in sentence s of document d can be estimated as follows:

$$\hat{\theta}_{d,s}^{gl}(z) = \sum_{v} \frac{n_{v}^{d,s}+\gamma}{n^{d,s}+T\gamma} \times \frac{n_{gl}^{d,v}+\alpha_{gl}^{mix}}{n^{d,v}+\sum_{r}\alpha_{r}^{mix}} \times \frac{n_{z}^{d,gl}+\alpha^{gl}}{n_{gl}^{d}+K^{gl}\alpha^{gl}} \qquad (6)$$

$$\hat{\theta}_{d,s}^{loc}(z) = \sum_{v} \frac{n_{v}^{d,s}+\gamma}{n^{d,s}+T\gamma} \times \frac{n_{loc}^{d,v}+\alpha_{loc}^{mix}}{n^{d,v}+\sum_{r}\alpha_{r}^{mix}} \times \frac{n_{loc,z}^{d,v}+\alpha^{loc}}{n_{loc}^{d,v}+K^{loc}\alpha^{loc}} \qquad (7)$$

One problem of the collapsed sampling approach is that, when computing statistics, it is not possible to aggregate over several samples from the probabilistic model [15]. This happens because there is no correspondence between the indices of topics in different samples. For large collections one sample is generally sufficient, but with small collections such estimates might become very random. In all our experiments we used collapsed sampling methods. For smaller collections, maximum likelihood estimation with EM can be used, or variational approximations can be derived [3].

4. EXPERIMENTS
In this section we present qualitative and quantitative experiments. For the qualitative analysis we show that local topics inferred by MG-LDA do correspond to ratable aspects. We compare the quality of topics obtained by MG-LDA with topics discovered by the standard LDA approach. For the quantitative analysis we show that the topics generated from the multi-grain models can significantly improve multi-aspect ranking.

4.1 Qualitative Experiments
4.1.1 Data
To perform qualitative experiments we used a subset of reviews for Mp3 players from Google Product Search (http://www.google.com/products) and subsets of reviews of hotels and restaurants from Google Local Search (http://local.google.com). These reviews are either entered by users directly through Google, or are taken from review feeds provided by external vendors. All the datasets were automatically tokenized and sentence split. Properties of these three datasets are presented in Table 1.

Table 1: Datasets used for qualitative evaluation.

Domain      | Reviews | Sentences | Words     | Words per review
Mp3 players | 3,872   | 69,986    | 1,596,866 | 412.4
Hotels      | 32,861  | 231,983   | 4,456,972 | 135.6
Restaurants | 32,563  | 136,906   | 2,513,986 | 77.2

Before applying the topic models we removed punctuation and also removed stop words using a standard stop-word list (http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words).
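A minimal sketch of this kind of preprocessing is shown below; the sentence-splitting and tokenization regexes and the stop-word list are illustrative stand-ins for the actual tools and the Glasgow stop-word list, which are not specified in detail in the paper.

```python
import re

# Illustrative stop-word list (a stand-in for the full list referenced above)
STOP = {"the", "a", "an", "is", "was", "it", "and", "or", "of", "to", "in", "about"}

def preprocess(review: str):
    """Split a review into sentences, tokenize, drop punctuation and stop words."""
    sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    return [
        [t for t in re.findall(r"[a-z0-9$£]+", s.lower()) if t not in STOP]
        for s in sentences
    ]

print(preprocess("The tube station is about an 8 minute walk. Staff was friendly!"))
```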
4.1.2 Experiments and Results
We used the Gibbs sampling algorithm for both MG-LDA and LDA, and ran the chain for 800 iterations to produce a sample for each of the experiments. Distributions of words in each topic were then estimated as in (5). The sliding windows were chosen to cover 3 sentences in all the experiments. Coarse tuning of the parameters of the prior distributions was performed for both the MG-LDA and LDA models. We varied the number of topics in LDA and the numbers of local and global topics in MG-LDA. The quality of local topics for MG-LDA did not seem to be influenced by the number of global topics $K^{gl}$, as long as $K^{gl}$ exceeded the number of local topics $K^{loc}$ by a factor of 2. For the Mp3 and hotel review datasets, as $K^{loc}$ increased, most of the local topics represented ratable aspects, up to a point at which further increases of $K^{loc}$ started to produce mostly non-meaningful topics. For LDA we selected the number of topics corresponding to the largest number of discovered ratable aspects. In this way our comparison was as fair to LDA as possible.

Top words for the discovered local topics and for some of the global topics of the MG-LDA models are presented in Tables 2 and 3, one topic per line, along with selected topics from the LDA models. (Though we did not remove numbers from the datasets before applying the topic models, we removed them from the tables of results to improve readability.) We manually assigned labels to coherent topics to reflect our interpretation of their meaning. Note that the MG-LDA local topics shown represent the entire set of local topics used in the MG-LDA models. For LDA we selected only the coherent topics which captured ratable aspects, plus a number of example topics to show typical LDA topics. Global topics of MG-LDA are not supposed to capture ratable aspects and are not of primary interest in these experiments; in the tables we present only typical MG-LDA global topics and any global topics which, contrary to our expectations, discovered ratable aspects.

Table 2: Top words from MG-LDA and LDA topics for Mp3 players' reviews.

MG-LDA local (all topics):
- sound quality: sound quality headphones volume bass earphones good settings ear rock excellent
- features: games features clock contacts calendar alarm notes game quiz feature extras solitaire
- connection with PC: usb pc windows port transfer computer mac software cable xp connection plug firewire
- tech. problems: reset noise backlight slow freeze turn remove playing icon creates hot cause disconnect
- appearance: case pocket silver screen plastic clip easily small blue black light white belt cover
- controls: button play track menu song buttons volume album tracks artist screen press select
- battery: battery hours life batteries charge aaa rechargeable time power lasts hour charged
- accessories: usb cable headphones adapter remote plug power charger included case firewire
- managing files: files software music computer transfer windows media cd pc drag drop file using
- radio/recording: radio fm voice recording record recorder audio mp3 microphone wma formats

MG-LDA global (selected):
- iPod: ipod music apple songs use mini very just itunes like easy great time new buy really
- Creative Zen: zen creative micro touch xtra pad nomad waiting deleted labs nx sensitive 5gb eax
- Sony Walkman: sony walkman memory stick sonicstage players atrac3 mb atrac far software format
- video players: video screen videos device photos tv archos pictures camera movies dvd files view player
- support: product did just bought unit got buy work $ problem support time months

LDA (out of 40):
- iPod: ipod music songs itunes mini apple battery use very computer easy time just song
- Creative: creative nomad zen xtra jukebox eax labs concert effects nx 60gb experience lyrics
- memory/battery: card memory cards sd flash batteries lyra battery aa slot compact extra mmc 32mb
- radio/recording: radio fm recording record device audio voice unit battery features usb recorder
- controls: button menu track play volume buttons player song tracks press mode screen settings
- opinion: points reviews review negative bad general none comments good please content aware
- (incoherent): player very use mp3 good sound battery great easy songs quality like just music

For the reviews of Mp3 players we present results of the MG-LDA model with 10 local and 30 global topics. All 10 local topics seem to correspond to ratable aspects. Furthermore, the majority of global topics represent brands of Mp3 players or additional categorizations of players, such as those with video capability. The only genuine ratable aspect in the set of global topics is support. Though not entirely clear, the presence of the support topic in the list of global topics might be explained by the considerable number of reviews in the dataset focused almost entirely on problems with technical support. The LDA model had 40 topics, and only 4 of them (memory/battery, radio/recording, controls and possibly opinion) corresponded to ratable aspects. Even these 4 topics are of relatively low quality. Though mixing the related topics radio and recording is probably appropriate, combining the concepts memory and battery is clearly undesirable. Also, the top words for LDA topics contain entries corresponding to player properties or brands (such as lyra in memory/battery), or less related words (such as battery and unit in radio/recording). Beyond the top 10 words this happens much more frequently for LDA than for MG-LDA. The other topics of the LDA model seem either semantically incoherent (as the last topic in Table 2) or represent player brands or types.

For the hotel reviews we present results of the MG-LDA model with 15 local topics and 45 global topics, and results of the LDA model with 45 topics. Again, top words for all the MG-LDA local topics are given in Table 3. Only 9 of the 45 LDA topics corresponded to ratable aspects, and these are shown in the table. Also, as with the Mp3 player reviews, we chose 3 typical LDA topics (beach resorts, Las Vegas and an incoherent topic).
All the local topics of MG-LDA again reflect ratable aspects, and no global topics seem to capture any ratable aspects. All the global topics of MG-LDA appear to correspond to hotel types and locations, such as beach resorts or hotels in Las Vegas, though some global topics are not semantically coherent. Most of the LDA topics are similar to MG-LDA global topics. We should note that, as with the Mp3 reviews, increasing the number of topics for LDA beyond 45 did not produce any more topics corresponding to ratable aspects.

Table 3: Top words from MG-LDA and LDA topics for hotel reviews.

MG-LDA local (all topics):
- amenities: coffee microwave fridge tv ice room refrigerator machine kitchen maker iron dryer
- food and drink: food restaurant bar good dinner service breakfast ate eat drinks menu buffet meal
- noise/conditioning: air noise door room hear open night conditioning loud window noisy doors windows
- bathroom: shower water bathroom hot towels toilet tub bath sink pressure soap shampoo
- breakfast: breakfast coffee continental morning fruit fresh buffet included free hot juice
- spa: pool area hot tub indoor nice swimming outdoor fitness spa heated use kids
- parking: parking car park lot valet garage free street parked rental cars spaces space
- staff: staff friendly helpful very desk extremely help directions courteous concierge
- Internet: internet free access wireless use lobby high computer available speed business
- getting there: airport shuttle minutes bus took taxi train hour ride station cab driver line
- check in: early check morning arrived late hours pm ready day hour flight wait
- smells/stains: room smoking bathroom smoke carpet wall smell walls light ceiling dirty
- comfort: room bed beds bathroom comfortable large size tv king small double bedroom
- location: walk walking restaurants distance street away close location shopping shops
- pricing: $ night rate price paid worth pay cost charge extra day fee parking

MG-LDA global (selected):
- beach resorts: beach ocean view hilton balcony resort ritz island head club pool oceanfront
- Las Vegas: vegas strip casino las rock hard station palace pool circus renaissance

LDA (out of 45):
- beach resorts: beach great pool very place ocean stay view just nice stayed clean beautiful
- Las Vegas: vegas strip great casino $ good hotel food las rock room very pool nice
- smells/stains: room did smoking bed night stay got went like desk smoke non-smoking smell
- getting there: airport hotel shuttle bus very minutes flight hour free did taxi train car
- breakfast: breakfast coffee fruit room juice fresh eggs continental very toast morning
- location: hotel rooms very centre situated well location excellent city comfortable good
- pricing: card credit $ charged hotel night room charge money deposit stay pay cash did
- front desk: room hotel told desk did manager asked said service called stay rooms
- noise: room very hotel night noise did hear sleep bed door stay floor time just like
- opinion: hotel best stay hotels stayed reviews service great time really just say rooms
- cleanliness: hotel room dirty stay bathroom rooms like place carpet old very worst bed
- (incoherent): motel rooms nice hotel like place stay parking price $ santa stayed good

Additionally, we performed an experiment on the Mp3 reviews where we applied the LDA model to individual sentences. This "local" LDA model infers a number of valid aspects, but a significant proportion of the topics are still related to brands of Mp3 players. Even the topics which corresponded to ratable aspects were contaminated by brand-specific words: the 20 top words for about half of the topics (depending on the total number of topics) contained brand-related words such as "ipod", "apple", "sony", "yepp", etc. This result suggests that simultaneous modeling of both local and global topics is important for the discovery of coherent ratable aspects.

The dataset of restaurant reviews appeared to be challenging for both models. Both MG-LDA and LDA managed to capture only a few ratable aspects: MG-LDA discovered topics corresponding to the ratable dimensions service, atmosphere, location and decor, while LDA discovered waiting time and service. Space constraints do not allow us to present detailed results for this domain. One problem with this dataset is that restaurant reviews are generally short (the average review length is 4.2 sentences). These results can also probably be explained by the fact that the majority of natural ratable aspects are specific to a type of restaurant. E.g., appropriate ratable aspects for Italian restaurants could be pizza and pasta, whereas for Japanese restaurants they are probably sushi and noodles. We could imagine generic categories like meat dishes and fish dishes, but they are unlikely to be revealed by any unsupervised model, as the overlap in the vocabulary describing these aspects in different cuisines is small. Preliminary experiments suggested that MG-LDA is able to infer appropriate ratable aspects if applied to a set of reviews of restaurants with a specific cuisine. For example, for MG-LDA with 15 local topics applied to the collection of Italian restaurant reviews, 9 topics corresponded to ratable dimensions: wine, pizza, pasta, general food, location, service, waiting, value and atmosphere. Another approach to this problem is to attempt hierarchical topic modeling [2, 22].

4.2 Quantitative Experiments
4.2.1 Data and Problem Set-up
Topic models are typically evaluated quantitatively using measures like likelihood on held-out data [17, 3, 16]. However, likelihood does not reflect our actual purpose, since we are not trying to predict whether a new piece of text is likely to be a review of some particular category. Instead we wish to evaluate how well our learned topics correspond to the aspects of an object that users typically rate. To accomplish this we look at the problem of multi-aspect opinion rating [30].
In this task a system needs to predict a discrete numeric rating for multiple aspects of an object. For example, given a restaurant review, a system would predict on a scale of 1-5 how a user liked the food, service, and decor of the restaurant. This is a challenging problem, since users employ a wide variety of language to describe each aspect. A user might say "The X was great", where X could be "duck", "steak" or "soup", each indicating that the food aspect should receive a high rating. If our topic model identifies a food topic (or topics), then this information provides valuable features when predicting the sentiment of an aspect, since it informs the classifier which sentences are genuinely about which aspects.

To test this we used a set of 27,564 reviews of hotels taken from TripAdvisor.com (content (c) 2005-06, TripAdvisor, LLC). These reviews are labeled with a rating of 1-5 for a variety of ratable aspects for hotels. We selected our review set to span hotels from a large number of cities. Furthermore, we ensured that all reviews in our set had ratings for each of 6 aspects: check-in, service, value, location, rooms, and cleanliness. The reviews were automatically sentence split and tokenized.

The multi-aspect rater we used was the PRanking algorithm [8], which is a perceptron-based online learning method. The PRanking algorithm scores each input feature vector $x \in \mathbb{R}^m$ with a linear classifier,

$$score_i(x) = w_i \cdot x$$

where $score_i$ is the score and $w_i$ the parameter vector for the $i$th aspect. For each aspect, the PRanking model also maintains $k-1$ boundary values $b_{i,1}, \ldots, b_{i,k-1}$ that divide the scores into $k$ buckets, each representing a particular rating. For aspect $i$ a text gets the $j$th rating if and only if

$$b_{i,j-1} < score_i(x) < b_{i,j}$$
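This prediction rule, together with the perceptron-style boundary update described next, can be sketched compactly for a single aspect as follows. This is our illustration of PRank as described by Crammer and Singer [8], not the implementation actually used in the experiments; the feature dimensionality in the usage lines is invented.

```python
import numpy as np

class PRank:
    """Perceptron-style ordinal ranker for one aspect: a weight vector plus
    k-1 boundary values, following Crammer and Singer [8]."""

    def __init__(self, n_features, k=5):
        self.w = np.zeros(n_features)
        self.b = np.zeros(k - 1)        # boundaries b_1 <= ... <= b_{k-1}
        self.k = k

    def predict(self, x):
        # rating j iff b_{j-1} < w.x <= b_j, with ratings in 1..k
        return int(np.searchsorted(self.b, self.w @ x)) + 1

    def update(self, x, y):
        score = self.w @ x
        # y_r = +1 if the true rating y lies above boundary r, else -1
        yr = np.where(np.arange(1, self.k) < y, 1.0, -1.0)
        mistakes = yr * (score - self.b) <= 0   # violated boundaries
        self.w += yr[mistakes].sum() * x
        self.b -= yr * mistakes                 # move violated boundaries

model = PRank(n_features=3)
x = np.array([1.0, 0.0, 2.0])
model.update(x, y=4)                            # one online step
print(model.predict(x))
```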
Parameters and boundary values are updated using a perceptron-style online algorithm. We used the Snyder and Barzilay implementation (http://people.csail.mit.edu/bsnyder/naacl07/) that was used in their study on agreement models for aspect ranking [30].

The input vector x is typically a set of binary features representing textual cues in a review. Our base set of features consists of unigrams, bigrams and frequently occurring trigrams in the text. To add topic model features to the input representation, we first estimated the topic distributions for each sentence using both LDA and MG-LDA. For MG-LDA we could use estimators (6) and (7), but there are no equivalent estimators for LDA. Instead, for both models we set the probability of a topic for a sentence to be proportional to the number of words assigned to this topic. To improve the reliability of the estimator we produced 100 samples for each document while keeping the assignments of topics to all other words in the collection fixed. The probability estimates were then obtained by averaging over these samples. This approach allows for a more direct comparison of the two models. Also, unlike the estimators given in (6) and (7), it is applicable to arbitrary text fragments, not necessarily sentences, which is desirable for topic segmentation.

We then found the top 3 topics for each sentence using both models, bucketed these topics by their probability, and concatenated them with the original features in x. For example, if a sentence is about topic 3 with probability between 0.4 and 0.5 and the sentence contains the word "great", then we might have the binary feature

x contains "great" & topic=3 & bucket=0.4-0.5

To bucket the probabilities produced by LDA and MG-LDA we chose 5 buckets, using thresholds to distribute the values as evenly as possible. We also tried many alternative methods for using the real-valued topic probabilities and found that bucketing with raw probabilities worked best. Alternatives attempted include: using the probabilities directly as feature values; normalizing values to (0,1) with and without bucketing; using log-probabilities with and without bucketing; and using z-scores with and without bucketing.
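For concreteness, a sketch of the bucketed feature construction described above is given below. The bucket thresholds here are illustrative, evenly spaced stand-ins; the paper chose thresholds so that values spread as evenly as possible over 5 buckets.

```python
def topic_features(tokens, topic_probs, buckets=(0.2, 0.4, 0.6, 0.8)):
    """Cross each token with the sentence's top-3 topics, bucketed by
    probability, mirroring the feature template in the example above."""
    feats = set(tokens)                              # base unigram features
    top3 = sorted(topic_probs.items(), key=lambda kv: -kv[1])[:3]
    for topic, p in top3:
        lo = max((b for b in buckets if b <= p), default=0.0)
        hi = min((b for b in buckets if b > p), default=1.0)
        for tok in tokens:
            feats.add(f"{tok}&topic={topic}&bucket={lo}-{hi}")
    return feats

print(topic_features(["great", "walk"], {3: 0.45, 7: 0.30, 1: 0.15}))
```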
In that study association mining is used to extract product asp ects that can b e rated. Hu and Liu defined an asp ect as simply a string and there was no attempt to cluster or infer asp ects that are mentioned implicitly, e.g., "The amount of stains in the room was overwhelming" is ab out the cleanliness asp ect for hotels. A similar work by Pop escu and Etzioni [27] also extract explicit asp ects mentions without describing how implicit mentions are extracted and clustered.10 Clustering can b e of particular imp ortance for domains in which asp ects are describ ed with a large vocabulary, such as food for restaurants or rooms for hotels. Both implicit mentions and clustering arise naturally out of the topic model formulation requiring no additional augmentations. Gamon et al. [12] present an unsup ervised system that does incorp orate clustering, however, their method clusters sentences and not individual asp ects to produce a sentence based summary. Sentence clusters are lab eled with the most frequent non-stop word stem in the cluster. Carenini et al. [7] present a weakly sup ervised model that uses the algorithms of Hu and Liu [18, 19] to extract explicit asp ect mentions from reviews. The method is extended through a user supplied asp ect hierarchy of a product class. Extracted asp ects are clustered by placing the asp ects into the hierarchy using various string and semantic similarity metrics. This method is then used to compare extractive versus abstractive summarizations for sentiment [6]. There has also b een some studies of sup ervised asp ect extraction methods. For example, Zhuang et al. [36] work on sentiment summarization for movie reviews. In that work, asp ects are extracted and clustered, but they are done so manually through the examination of a lab eled data set. The short-coming of such an approach is that it requires a lab eled corpus for every domain of interest. A key p oint of note is that our topic model approach is orthogonal to most of the methods mentioned ab ove. For example, the topic model can b e used to help cluster explicit asp ects extracted by [18, 19, 27] or used to improve the recall of knowledge driven approaches that require domain sp ecific ontologies [7] or lab eled data [36]. A closely related model to ours is that of Mei et al. [21] which p erforms joint topic and sentiment modeling of collections. Their Topic-Sentiment Model (TSM) is essentially equivalent to the PLSA asp ect model with two additional topics.11 One of these topics has a prior towards p ositive sentiment words and the other towards negative sentiment words, where b oth priors are induced from sentiment lab eled data. Though results on web-blog p osts are encouraging, it is not clear if their method can model sentiments towards discovered topics: induced distributions of the sentiment words are universal and indep endent of topics, and their model uses the bag-of-words assumption, which does not p ermit exploitation of co-occurrences of sentiment words with topical words. Also it is still not known whether their model can achieve good results on review data, b ecause, as discussed in section 2 and confirmed in the empirical exp eri10 11 4.2.2 Results All system runs are evaluated using ranking loss [8, 30] which measures the average distance b etween the true and predicted numerical ratings. If given N test instances, the ranking loss for an asp ect is equal to X |actual rating - predicted rating | n n n N Overall ranking loss is simply the average over each asp ect. 
Note that a lower loss means better performance. We compared four models. The baseline simply rates each aspect as a 5, which is the most common rating in the data set for all aspects. The second model is the standard PRanking algorithm over the input features, denoted "PRank". The third model is the PRanking algorithm including features derived from the LDA topic model, denoted "PRank+LDA". The fourth and final model uses the PRanking algorithm with features derived from the MG-LDA topic model, denoted "PRank+MG-LDA". All topic models were run to generate 15 topics.

We ran two experiments. The first experiment used only unigram features plus the LDA and MG-LDA features. Results are given in Table 4. Clear gains are had by adding topic model features. In particular, the MG-LDA features result in a statistically significant improvement in loss over the LDA features. Significance was tested using a paired t-test over multiple runs of the classifier on different splits of the data (significant results, at p < 0.001, were marked in bold in the original table). Our second experiment used the full input feature space (unigrams, bigrams, and frequent trigrams) plus the LDA and MG-LDA features. In this experiment we would expect the gains from topic model features to be smaller, since the bigram and trigram features capture some non-local context, which in fact does happen. However, there are still significant improvements in performance from adding the MG-LDA features. Furthermore, the PRank+MG-LDA model still outperforms the PRank+LDA model, providing more evidence that the topics learned by multi-grain topic models are more representative of the ratable aspects of an object.

Table 4: Multi-aspect ranking experiments with the PRanking algorithm for hotel reviews.

Unigram features only:
Model         | Overall | Check-in | Service | Value | Location | Rooms | Cleanliness
Baseline      | 1.118   | 1.126    | 1.208   | 1.272 | 0.742    | 1.356 | 1.002
PRank         | 0.774   | 0.831    | 0.799   | 0.793 | 0.707    | 0.798 | 0.715
PRank+LDA     | 0.735   | 0.786    | 0.762   | 0.749 | 0.677    | 0.746 | 0.690
PRank+MG-LDA  | 0.706   | 0.748    | 0.731   | 0.725 | 0.635    | 0.719 | 0.676

Unigram, bigram and trigram features:
Model         | Overall | Check-in | Service | Value | Location | Rooms | Cleanliness
PRank         | 0.689   | 0.735    | 0.725   | 0.710 | 0.627    | 0.700 | 0.637
PRank+LDA     | 0.682   | 0.728    | 0.717   | 0.705 | 0.620    | 0.684 | 0.637
PRank+MG-LDA  | 0.669   | 0.717    | 0.700   | 0.696 | 0.607    | 0.672 | 0.636

When analyzing the results we note that for the TripAdvisor data the MG-LDA model produced clear topics for the check-in and location aspects and several coherent rooms topics. This corresponds rather closely with the improvements seen over the PRank system alone. Note that we still see improvements in service, cleanliness and value, since a user's rankings of different aspects are highly correlated [30]. In particular, users who have favorable opinions of most of the aspects almost certainly rate value high. The LDA model produced clear topics corresponding to check-in, but noisy topics for location and rooms, with location topics often specific to a single locale (e.g., Paris) and room topics often mixed with service, dining and hotel lobby terms.

5. RELATED WORK
Recently there has been a tremendous amount of work on summarizing sentiment [1] and, in particular, on summarizing sentiment by extracting and aggregating sentiment over ratable aspects. Many methods have been proposed, from unsupervised to fully supervised systems. In terms of unsupervised aspect extraction, in which this work can be categorized, the system of Hu and Liu [18, 19] was one of the earliest endeavors. In that study, association mining is used to extract product aspects that can be rated. Hu and Liu defined an aspect as simply a string, and there was no attempt to cluster aspects or to infer aspects that are mentioned implicitly, e.g., "The amount of stains in the room was overwhelming" is about the cleanliness aspect for hotels. Similar work by Popescu and Etzioni [27] also extracts explicit aspect mentions without describing how implicit mentions are extracted and clustered (though they imply that this is done in their system). Clustering can be of particular importance for domains in which aspects are described with a large vocabulary, such as food for restaurants or rooms for hotels. Both implicit mentions and clustering arise naturally out of the topic model formulation, requiring no additional augmentation. Gamon et al. [12] present an unsupervised system that does incorporate clustering; however, their method clusters sentences, not individual aspects, to produce a sentence-based summary. Sentence clusters are labeled with the most frequent non-stop word stem in the cluster. Carenini et al. [7] present a weakly supervised model that uses the algorithms of Hu and Liu [18, 19] to extract explicit aspect mentions from reviews. The method is extended through a user-supplied aspect hierarchy of a product class. Extracted aspects are clustered by placing them into the hierarchy using various string and semantic similarity metrics. This method was then used to compare extractive and abstractive summarizations for sentiment [6]. There have also been studies of supervised aspect extraction methods. For example, Zhuang et al. [36] work on sentiment summarization for movie reviews. In that work, aspects are extracted and clustered, but this is done manually through the examination of a labeled data set. The shortcoming of such an approach is that it requires a labeled corpus for every domain of interest.

A key point of note is that our topic model approach is orthogonal to most of the methods mentioned above. For example, the topic model can be used to help cluster explicit aspects extracted by [18, 19, 27], or to improve the recall of knowledge-driven approaches that require domain-specific ontologies [7] or labeled data [36].

A closely related model to ours is that of Mei et al. [21], which performs joint topic and sentiment modeling of collections. Their Topic-Sentiment Model (TSM) is essentially equivalent to the PLSA aspect model with two additional topics (another difference from PLSA is that Mei et al. use a background component to capture common English words). One of these topics has a prior towards positive sentiment words and the other towards negative sentiment words, where both priors are induced from sentiment-labeled data. Though results on web-blog posts are encouraging, it is not clear whether their method can model sentiment towards discovered topics: the induced distributions of sentiment words are universal and independent of topics, and their model uses the bag-of-words assumption, which does not permit exploitation of co-occurrences of sentiment words with topical words. It is also not known whether their model can achieve good results on review data because, as discussed in Section 2 and confirmed in our empirical experiments, modeling co-occurrences at the document level is not sufficient.
Another approach for joint sentiment and topic modeling was proposed in [4]: a supervised LDA (sLDA) model which tries to infer topics appropriate for use in a given classification or regression problem. As an application, the authors consider prediction of the overall document sentiment, though they do not consider multi-aspect ranking. Both of these joint sentiment-topic models are orthogonal to the multi-grain model proposed in our paper; it should be easy to construct an sLDA or TSM model on top of MG-LDA. In our work we assumed a sentiment classifier as the next model in a pipeline, but building a joint sentiment-topic model is certainly a challenging next step in this work.

Several models have been proposed to overcome the bag-of-words assumption by explicitly modeling topic transitions [5, 15, 33, 32, 28, 16]. In our MG-LDA model we instead proposed sliding windows to model local topics, as this is computationally less expensive and leads to good results. However, it is possible to construct a multi-grain model which uses an n-gram topic model for local topics and a distribution fixed per document for global topics. The model of Blei and Moreno [5] also uses windows, but their windows do not overlap and, therefore, it is known a priori from which window a word is going to be sampled.

An approach related to ours is described in [35]. They consider the discovery of topics from a set of comparable text collections. Their cross-collection mixture model discovers cross-collection topics and a sub-topic of each cross-collection topic for every collection in the set. These sub-topics summarize the differences between collections for every discovered cross-collection topic. Though the use of different topic types bears some similarity to the MG-LDA model, the models are in fact quite different. The MG-LDA model infers only types of collections (global topics) and cross-collection topics (local topics) and does not try to infer collection-specific topics. Topics in the cross-collection mixture model are all global, because words for both types of topics are generated from a mixture associated with an entire document. However, it should be possible to construct a combination of the cross-collection mixture model of Zhai et al. and MG-LDA to infer both cross-collection local topics and their within-collection sub-topics. The crucial property of the MG-LDA model is that its topic distributions are associated with different scopes in a text, which, to our knowledge, has not been attempted before.

6. SUMMARY AND FUTURE WORK
In this work we presented multi-grain topic models and showed that they are superior to standard topic models when extracting ratable aspects from online reviews.
These models are particularly suited to this problem, since they not only identify important terms, but also cluster them into coherent groups, which is a deficiency of many previously proposed methods.

There are many directions we plan to investigate in the future for the problem of aspect extraction from reviews. A promising possibility is to develop a supervised version of the model, similar to supervised LDA [4]. In such a model it would be possible to infer topics for a multi-aspect classification task. Another direction is to investigate hierarchical topic models. Ideally, for a corpus of restaurant reviews, we could induce a hierarchy representing cuisines. Within each cuisine we could then extract cuisine-specific aspects such as food and possibly decor and atmosphere. Other ratable aspects, like service, would ideally be shared across all cuisines in the hierarchy, since there typically is a standard vocabulary for describing them. The next major step in this work is to combine the aspect extraction methods presented here with standard sentiment analysis algorithms to aggregate and summarize sentiment for products and services. Currently we are investigating a two-stage approach where aspects are first extracted and sentiment is then aggregated. However, we are also interested in examining joint models such as the TSM model [21].

7. REFERENCES
[1] P. Beineke, T. Hastie, C. Manning, and S. Vaithyanathan. An exploration of sentiment summarization. In Proc. of AAAI, 2003.
[2] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16, 2004.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993-1022, 2003.
[4] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), 2008.
[5] D. M. Blei and P. J. Moreno. Topic segmentation with an aspect hidden Markov model. In Proc. of the Conference on Research & Development on Information Retrieval (SIGIR), pages 343-348, 2001.
[6] G. Carenini, R. Ng, and A. Pauls. Multi-document summarization of evaluative text. In Proc. of the Conf. of the European Chapter of the Association for Computational Linguistics, 2006.
[7] G. Carenini, R. Ng, and E. Zwart. Extracting knowledge from evaluative text. In Proc. of the 3rd Int. Conf. on Knowledge Capture, pages 11-18, 2005.
[8] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems (NIPS), pages 641-647, 2002.
[9] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[11] K. Fujimura, T. Inoue, and M. Sugisaki. The EigenRumor algorithm for ranking blogs. In WWW Workshop on the Weblogging Ecosystem, 2005.
[12] M. Gamon, A. Aue, S. Corston-Oliver, and E. Ringger. Pulse: Mining customer opinions from free text. In Proc. of the 6th International Symposium on Intelligent Data Analysis, pages 121-132, 2005.
[13] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[14] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. of the National Academy of Sciences, 101 Suppl 1:5228-5235, 2004.
[15] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems, 2004.
[16] A. Gruber, Y. Weiss, and M. Rosen-Zvi. Hidden topic Markov models. In Proc. of the Conference on Artificial Intelligence and Statistics, 2007.
[17] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177-196, 2001.
[18] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proc. of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-177, 2004.
[19] M. Hu and B. Liu. Mining opinion features in customer reviews. In Proc. of the Nineteenth National Conference on Artificial Intelligence, 2004.
[20] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proc. Int. Conference on Machine Learning, 2006.
[21] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proc. of the 16th Int. Conference on World Wide Web, pages 171-180, 2007.
[22] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with Pachinko allocation. In Proc. 24th Int. Conf. on Machine Learning (ICML), 2007.
[23] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Proc. of the 18th Conf. on Uncertainty in Artificial Intelligence, 2002.
[24] I. Ounis, M. de Rijke, C. Macdonald, G. Mishne, and I. Soboroff. Overview of the TREC-2006 blog track. In Text REtrieval Conference (TREC), 2006.
[25] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proc. of the Conf. on Empirical Methods in Natural Language Processing, 2002.
[26] F. Pereira, N. Tishby, and L. Lee. Distributional clustering of English words. In Proc. 31st Meeting of the Association for Computational Linguistics, 1993.
[27] A. Popescu and O. Etzioni. Extracting product features and opinions from reviews. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2005.
[28] M. Purver, K. Kording, T. Griffiths, and J. Tenenbaum. Unsupervised topic modelling for multi-party spoken discourse. In Proc. of the Annual Meeting of the ACL and the International Conference on Computational Linguistics, pages 17-24, 2006.
[29] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proc. of the 2nd Int. Conf. on Empirical Methods in Natural Language Processing, 1997.
[30] B. Snyder and R. Barzilay. Multiple aspect ranking using the Good Grief algorithm. In Proc. of the Joint Conference of the North American Chapter of the Association for Computational Linguistics and Human Language Technologies, pages 300-307, 2007.
[31] P. Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proc. of the Annual Meeting of the ACL, 2002.
[32] H. M. Wallach. Topic modeling: beyond bag-of-words. In Int. Conference on Machine Learning, 2006.
[33] X. Wang and A. McCallum. A note on topical n-grams. Technical Report UM-CS-2005-071, University of Massachusetts, 2005.
[34] J. Wiebe. Learning subjective adjectives from corpora. In Proc. of the National Conference on Artificial Intelligence, 2000.
[35] C. Zhai, A. Velivelli, and B. Yu. A cross-collection mixture model for comparative text mining. In Proc. of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 743-748, 2004.
[36] L. Zhuang, F. Jing, and X. Zhu. Movie review mining and summarization. In Proc. of the 15th ACM International Conference on Information and Knowledge Management (CIKM), pages 43-50, 2006.