Contextual Advertising by Combining Relevance with Click Feedback

Deepayan Chakrabarti, Deepak Agarwal, Vanja Josifovski
Yahoo! Research, 701 First Ave, Sunnyvale, CA 94089
deepay@yahoo-inc.com, dagarwal@yahoo-inc.com, vanjaj@yahoo-inc.com

ABSTRACT
Contextual advertising supports much of the Web's ecosystem today. User experience and revenue (shared by the site publisher and the ad network) depend on the relevance of the displayed ads to the page content. As with other document retrieval systems, relevance is provided by scoring the match between individual ads (documents) and the content of the page where the ads are shown (query). In this paper we show how this match can be improved significantly by augmenting the ad-page scoring function with extra parameters from a logistic regression model on the words in the pages and ads. A key property of the proposed model is that it can be mapped to standard cosine similarity matching and is suitable for efficient and scalable implementation over inverted indexes. The model parameter values are learnt from logs containing ad impressions and clicks, with shrinkage estimators being used to combat sparsity. To scale our computations to an extremely large training corpus consisting of several gigabytes of data, we parallelize our fitting algorithm in a Hadoop [10] framework. Experimental evaluation shows improved click prediction over a holdout set of impression and click events from a large-scale real-world ad placement engine. Our best model achieves a 25% lift in precision relative to a traditional information retrieval model based on cosine similarity, for recalling 10% of the clicks in our test data.

Categories and Subject Descriptors: H.4.m [Information Systems]: Miscellaneous

General Terms: Algorithms, Experimentation, Measurements

Keywords: Clickthrough rate, Modeling, Interaction Effects

1. INTRODUCTION
Web advertising provides financial support for a large portion of today's Internet ecosystem, catering to a diverse set of sites like blogs, news, and reviews. Spurred by the tremendous growth in traffic in terms of volume, number of users, user engagement, and content diversity, the last few years have seen a tremendous growth in spending on web advertising. According to eMarketer [9], total internet advertiser spending in 2007 will reach almost 20 billion US dollars. This establishes the Web as one of the top three advertisement mediums, along with TV and print media.

A major part of the advertising on the web falls into the category of textual ads: short textual messages usually marked as "sponsored links" or similar. There are two main types of textual ads on the web today:

1. Sponsored Search (SS) or Paid Search advertising places ads on the result pages of a web search engine based on the search query. All major current web search engines support such ads and act simultaneously as a search engine and an ad agency.

2. Contextual advertising or Context Match (CM) advertising places ads within the content of a generic, third-party web page. There is usually a commercial intermediary, called an ad-network, in charge of optimizing the ad selection with the twin goals of increasing revenue (shared between publisher and ad-network) and improving user experience. Here also the main players are the major search engines, though there are many smaller players as well.

While the methods proposed in this paper could be adapted for both SS and CM advertising, we focus our analysis and experiments on the CM scenario. Studies have shown that displaying ads that are closely related to the content of the page (ads can also be behaviorally targeted; without loss of generality, we focus on contextual features) provides a better user experience and increases the probability of clicks [4, 21].
This intuition is analogous to that in conventional publishing, where there are very successful magazines (e.g., Vogue) in which a majority of the content is topical advertising (fashion, in the case of Vogue). Hence, estimating the relevance of an ad to a page is critical in serving ads at run-time.

Previously published approaches estimated ad relevance based on co-occurrence of the same words or phrases within the ad and within the page (see [18, 13, 3] and the related work section for more details). The model used in this body of work translates the ad search into a similarity search in a vector space. Each ad is represented as a vector of features such as unigrams, phrases and classes [3]. The page is also translated into a vector in the same space as the ads. The search for the best ads then becomes the search for the ad vectors that are closest to the page vector. To make the search efficient and scalable to hundreds of millions of ads and billions of requests per day, we can use an inverted index and an efficient similarity search algorithm such as the one reported in [2]. A drawback of this method is that it relies on a-priori information and does not use the feedback (a-posteriori) information that is collected in the form of ad impressions (displays) and clicks.

Another line of work uses click data to produce a CTR estimate for an ad, independent of the page (or of the query, in the Sponsored Search scenario [19, 17]). The CTR is estimated based on features extracted from the ads, which are then used in a learning framework to build models for estimating the CTR of unseen ads. In this approach the assumption is that the ads are selected by a deterministic method, by matching the bid phrase to a phrase from the page (or to the query in Sponsored Search), and therefore to select the most clickable ads we only need to estimate the CTR of the ads with the matching bid phrase. This simplifying assumption about the matching process is an obvious drawback of these approaches. Another drawback is that these methods do not account for differential click probabilities on different pages: if some pages in the corpus attract an audience that clicks on ads significantly more than average, then the learning of feature weights for ads will be biased towards ads that were (only by circumstance) shown on such pages.

Contributions. In this work we combine the two lines of work and propose a relevance-based approach that is augmented with click data to produce a CTR estimate.
We adapt the vector space similarity formulation and keep the format of the scoring formulas, so that they remain suitable for inverted index evaluation with low cost and high scalability. To incorporate the feedback, we modify the scoring formulas with correction parameters that are learned from the click data. We examine several scoring formulas in order of increasing complexity and propose learning the parameters by statistical learning methods. We overcome the problem of learning from data with low click rates by a sampling method that equalizes the sizes of the clicked and non-clicked portions of the data, and we then address the (page, ad) sparsity problem by explaining click propensities (scores) of ads on pages in terms of the words that occur in both. In summary, our main contributions are:

· We propose a set of intuitive models of click propensities in terms of the words that occur in the webpage and ad. The basic model takes into consideration the location of the words (title, main body, sidebar, etc.), while the most complex model also considers word synonyms, the tf-idf values of the words and relevance measures, among others.

· The models were carefully chosen to extend scoring formulas in current use, so that they can be easily implemented on top of existing systems. Thus, they can be immediately used for efficient ad selection from a very large corpus of ads.

· We describe a fast method for fitting the parameters of these models, and give prescriptions for picking the right model given the dataset size and runtime execution constraints.

· Extensive experiments on real-world datasets convincingly demonstrate the accuracy of our models.

2. RELATED WORK
Online advertising is an upcoming area of research and the published literature is sparse. A study presented in [21] confirms the intuition that ads need to be relevant to the user's interest to avoid degrading the user's experience and to increase the probability of reaction. IR methods used in web search engines are among the most prominent techniques for ensuring relevance of search results. A pioneering report by Ribeiro-Neto et al. on contextual matching [18] examines a number of strategies to match pages to ads based on extracted keywords. Here also the ads and pages are represented as vectors in a vector space. The work first presents five strategies that use cosine similarity to match the page and the ad vectors. The authors explore using different ad sections (bid phrase, title, body) as a basis for the ad vector. The winning strategy among the first five requires the bid phrase to appear on the page and then ranks all such ads by the cosine similarity between the union of all the ad sections and the page vector. While both pages and ads are mapped to the same space, there is a discrepancy (impedance mismatch) between the vocabulary used in the ads and in the pages. Furthermore, since in the vector model the dimensions are determined by the number of unique words, plain cosine similarity does not take synonyms into account. To solve this problem, Ribeiro-Neto et al. expand the page vocabulary with terms from other similar pages, weighted by the overall similarity of the origin page to the matched page, and show improved matching precision. A follow-up work [13] proposes a method that learns the impact of individual features using genetic programming to produce a matching function, represented as a tree.
Arithmetic operators and the log function are internal nodes of the tree, while different numerical features of the query and ad terms can be its leaves. The results show that genetic programming finds matching functions that significantly improve the matching compared to the best method (without page-side expansion) reported in [18].

In our previous work [3] we expand the search for ads in a vector space by adding classes as additional dimensions. This allows the ads to topically match the content of the page. We used tf-idf to weight the unigram and phrase features, and the class confidence score returned by the classifier as the weight of the class features. While the class weights are trained over a labeled set of ads, there is no direct use of the click data to improve the ad selection.

All of these approaches share the use of similarity in a feature space to relate ads to pages. However, none of them uses statistical learning techniques to factor in the implicit relevance feedback, in the form of click data, to learn how to better match pages to ads.

Another approach to contextual advertising is to reduce it to the problem of sponsored search advertising by extracting phrases from the page and matching them with the bid phrases of the ads. In [22] a system for phrase extraction is described that uses a variety of features to determine the importance of page phrases for advertising purposes. The system is trained with pages that have been hand-annotated with important phrases. The learning algorithm takes into account features based on tf-idf, HTML meta data and query logs to detect the most important phrases. During evaluation, each page phrase up to length 5 is considered as a potential result and evaluated against a trained classifier. In our work we also use phrases as features, and we learn parameters that determine the relative importance of a particular phrase compared to other phrases and unigrams. Another key difference with respect to this work is that we use multiple features to select the ads.

Another line of research attempts to predict the click-through rate of ads using similar tools (clustering, keyword matching and classification) for search advertising [17]. In this work, the ads are clustered by their bid phrases. The click-through rate is averaged over each cluster. The CTR estimate for new ads is obtained by finding the nearest cluster and assuming that cluster's CTR. Results show that this improves over naive global CTR priors and over CTR estimates based on bid phrase deciles.
3. METHOD
Our proposed method for matching relevant ads to pages is based on logistic regression, a popular technique in statistics and machine learning [14]. The regression enables us to combine click feedback and semantic information available from both pages and ads to determine relevance. This is more general than a pure relevance-based approach that does not use click feedback in any form. Indeed, our experiments convincingly demonstrate the usefulness of using click feedback to find more relevant ads. There has been recent work on using regression models for determining relevant ads [19]. While it has the same flavor as our work, only ad-specific features are learnt, which is only a subset of the features we consider. In particular, in addition to page- and ad-specific features, we learn features that capture interactions between pages and ads. Furthermore, we combine word-based features with traditional relevance measures to enhance the matching of relevant ads to pages. Our models are more granular and can incorporate a larger number of features, which reduces bias in the CTR estimates and leads to better performance. However, reduced bias comes at the price of increased variance, which can become a serious problem if the models become too granular and start overfitting the training data. To balance these two issues, we use a two-pronged strategy. First, we use a relatively large but specially selected set of features, where the selection mechanism ensures that the features have reasonable support. We also provide a mechanism based on prior probabilities to down-weight features that are too sparse. The second strategy we use to prevent overfitting is to train our models on an extremely large corpus (billions of records, several thousand features), which automatically increases the support of a large number of features. Fortunately, data is plentiful, especially for big ad-networks that serve a large number of publishers and advertisers. However, the increased training size poses a difficult computational challenge: scaling logistic regression to web-scale data. We overcome this with an approximation based on a "divide and conquer" strategy, i.e., we randomly split our training corpus into several pieces and fit a separate logistic regression to each piece. The final result is obtained by combining the estimates from all the pieces. Our computation is carried out in the MapReduce [12] software framework, which supports large-scale parallel computations on a cluster of commodity personal computers.

Roughly speaking, our method consists of three broad steps: (a) feature extraction, (b) feature selection, and (c) coefficient estimation for the features through a logistic regression. We provide a detailed description of each below.

3.1 Feature Extraction
Pages and ads are treated as being composed of several regions. For instance, a page is composed of page title, page metadata, page body, page URL, etc. Similarly, an ad is composed of ad title, ad body, etc. Within each region, we extract a set of words/phrases after stop word removal. We associate with each word a score (e.g., region-specific tf or tf-idf) that measures its importance in a given region. For a given (page, ad) region combination, our model has three sets of features, described below.

Page region specific main effects. Web pages are usually composed of multiple regions with different visibility and prominence. The impact of each region on the ad selection can thus vary. We follow the intuition of our previous work [3] and learn the effect of each region separately. For a word $w$ in page region $p(r)$ with score $t_{p(r)w}$, the region-specific main effect is defined as $M_{p(r)w} = 1(w \in p(r)) \cdot t_{p(r)w}$; that is, if the word is present in the page region $p(r)$, the feature contributes its score, else it does not contribute. These features provide an estimate of word popularity. They are not useful at the time of selecting relevant ads for a given page, but they help in getting better estimates of the other terms in the model after adjusting for the effect of popular words on a page. For instance, if "camera" pages are popular in terms of click-through rates and 90% of our corpus consists of camera pages, then "camera" ads, which were the ones mostly shown on camera pages, would tend to become popular even on "soccer" pages, which constitute only 1% of the total corpus. By incorporating page words in the model, we adjust for this effect and get the correct matching ads for "soccer" pages.

Ad region specific main effects. Ads are also composed of multiple regions, some visible to the user (title, abstract) and some used only in the ad selection (bid phrase, targeting attributes). As with the page regions, the ad regions can have different impact on the ad selection. For a word $w$ in ad region $a(r)$ with score $t_{a(r)w}$, this effect is defined as $M_{a(r)w} = 1(w \in a(r)) \cdot t_{a(r)w}$. Unlike the page-specific main effects, it does play an important role when selecting relevant ads for a given page, giving more weight to popular ads.

Interaction effects between page and ad regions. For a word $w_1$ in page region $p(r_1)$ and a word $w_2$ in ad region $a(r_2)$ with score $f(t_{p(r_1)w_1}, t_{a(r_2)w_2})$ for some function $f$, this is given as

$I_{p(r_1)w_1, a(r_2)w_2} = 1(w_1 \in p(r_1), w_2 \in a(r_2)) \cdot f(t_{p(r_1)w_1}, t_{a(r_2)w_2})$   (1)

In this paper, we confine ourselves to the case where $w_1 = w_2$ (i.e., the feature "fires" only if the same word occurs in both the corresponding page and ad regions), but in general one can consider co-occurrences of synonyms or related words. Examples of $f$ include the product function $t_{p(r_1)w_1} \times t_{a(r_2)w_2}$, the geometric mean $\sqrt{t_{p(r_1)w_1} \times t_{a(r_2)w_2}}$, and so on. Interaction effects are important components of our method and help in matching relevant ads to a given page. For instance, the occurrence of the word "camera" in the ad body is a strong indication of the ad being relevant for a page whose title contains the word "camera", with the degree of relevance being determined by the regression.
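To make the three feature sets above concrete, the following is a minimal sketch (in Python, with hypothetical region inputs and toy scores) of how the main-effect and interaction features could be assembled for one (page, ad) region pair, restricted to shared words as in our experiments; it is an illustration of the definitions, not the production feature pipeline.

```python
from math import sqrt

def extract_features(page_region_scores, ad_region_scores, f=lambda tp, ta: tp * ta):
    """Build main-effect and interaction features for one (page, ad) region pair.

    page_region_scores, ad_region_scores: dicts mapping word -> importance score
    (e.g. region-specific tf or tf-idf) after stop-word removal.
    f: combining function for interaction scores (product, geometric mean, ...).
    """
    features = {}
    # Page-region main effects: M_{p(r)w} = 1(w in p(r)) * t_{p(r)w}
    for w, t in page_region_scores.items():
        features[("page_main", w)] = t
    # Ad-region main effects: M_{a(r)w} = 1(w in a(r)) * t_{a(r)w}
    for w, t in ad_region_scores.items():
        features[("ad_main", w)] = t
    # Interaction effects, restricted to the same word occurring in both
    # regions (w1 = w2): I = 1(w in both) * f(t_page, t_ad)
    for w in page_region_scores.keys() & ad_region_scores.keys():
        features[("interaction", w)] = f(page_region_scores[w], ad_region_scores[w])
    return features

# Toy usage for the page-title / ad-title region combination.
page_title = {"digital": 0.4, "camera": 0.9}
ad_title = {"camera": 0.8, "sale": 0.3}
feats = extract_features(page_title, ad_title, f=lambda tp, ta: sqrt(tp * ta))
```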
3.2 Feature Selection
For any given (page, ad) region combination, a large number of words occur in the training data. Using them all as features might make the logistic regression ill-conditioned and inflate the variance of the coefficient estimates. Hence we take recourse to variable selection techniques, which select a subset of important words to be used in our regression. Variable selection in the context of regression is a well-studied area with a rich literature. Stepwise backward-forward automated variable selection algorithms are widely used for large-scale applications, but these methods have drawbacks, especially when features are correlated [8]. The general recommendation is to use as much domain knowledge as possible instead of an automated procedure to select relevant variables. However, in large-scale settings such as ours, some level of automation is necessary. For reasons of scalability, we explore a two-stage approach. In the first stage, we conservatively prune non-informative features using simple measures that can be computed in only a few passes over the training corpus. In the second stage, we fit a regression to all the features selected in the first stage, but down-weight them through a specially constructed prior that pools data from all the features while placing more weight on those that are less sparse. The latter approach is described in more detail in the next subsection; we discuss our variable selection methods next. We selected the variables using two methods, the first based on clicks and views, and the second based on relevance scores of words that are independent of any click feedback.
In the first approach (data-based), we rank words based on a measure that quantifies the interaction between words occurring in the page and ad regions. For a word $w$, the interaction measure is defined as

$i_w = \dfrac{CTR_w^{both}}{CTR_w^{page} \cdot CTR_w^{ad}}$   (2)

where $CTR_w^{both}$ denotes the click-through rate (CTR) when $w$ occurred both on the page region and on the ad region of an ad displayed on that page, and $CTR_w^{page}$ and $CTR_w^{ad}$ denote the marginal CTRs when $w$ is shown on the page and ad regions respectively. Higher values of the ratio indicate a stronger interaction induced by the presence of the word, which in turn should enhance the quality of the matching of ads to pages. We also tried a variation of the measure above with a square root in the denominator, with no significant impact.

In the second approach (relevance-based), words are ranked by computing average tf-idf scores across the entire page and ad corpus for the respective regions under consideration. Here, we experiment with two measures: (a) create a single corpus by treating page and ad regions as documents and compute a single average tf-idf score for each word, and (b) treat the page and ad regions as different corpora and use the geometric mean of the tf-idf scores computed separately from the page and ad regions for each word. For both measures, we picked the top 1000 words and used them in the logistic regression. To avoid noisy estimates of the CTRs in the ratio, we only consider words that were shown simultaneously on the ad and page regions at least 10 times and had non-zero marginal probabilities. As our experiments will show later, the data-based approach gives better results for the same number of words.

Figure 1: Distribution of (page, ad) impressions. Plots are on a log-log scale but ticks are on the original scale.
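As an illustration, the data-based ranking of Eq. (2) can be computed from aggregated impression and click counts along the following lines. This is a sketch with assumed count fields; the thresholds mirror the ones stated above, and the exact layout of the marginal counts is an assumption.

```python
def rank_words_by_interaction(word_stats, min_both_impressions=10, top_k=1000):
    """Rank words by i_w = CTR_w^both / (CTR_w^page * CTR_w^ad), Eq. (2).

    word_stats: dict mapping word -> dict of counts, e.g.
      {"camera": {"page_imp": 10000, "page_clk": 120,   # w on the page region
                  "ad_imp": 8000,    "ad_clk": 90,      # w on the ad region
                  "both_imp": 500,   "both_clk": 15}}   # w on both simultaneously
    """
    scores = {}
    for w, c in word_stats.items():
        if c["both_imp"] < min_both_impressions:
            continue  # too few co-occurrences for a reliable CTR ratio
        ctr_page = c["page_clk"] / c["page_imp"] if c["page_imp"] else 0.0
        ctr_ad = c["ad_clk"] / c["ad_imp"] if c["ad_imp"] else 0.0
        if ctr_page == 0.0 or ctr_ad == 0.0:
            continue  # require non-zero marginal probabilities
        ctr_both = c["both_clk"] / c["both_imp"]
        scores[w] = ctr_both / (ctr_page * ctr_ad)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```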
3.3 Approximate Logistic Regression
Let $y_{ij}$ denote the binary click outcome (1 for click, 0 for no click) when ad $j$ is shown on page $i$. We assume $y_{ij}$ has a Bernoulli distribution with CTR $p_{ij}$, i.e., the probability distribution of $y_{ij}$ is given by $P(y_{ij}) = p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}$. To determine relevant ads for a given page $i$, we need to estimate the $p_{ij}$'s, with higher values indicating more relevant ads. For ads that are shown a large number of times on a page, the CTR can be estimated empirically as clicks per impression. However, in our application a large fraction of page-ad pairs have a small number of impressions (Figure 1). In fact, since the CTRs are typically low (0.1%-20%, with substantial right skewness in the distribution), the number of impressions required to get precise empirical estimates is high. For instance, to estimate a 5% CTR, we need 1K impressions to be even 85% confident that our estimate is within 1% of the true CTR. Thus, we take recourse to feature-based models, i.e., $p_{ij}$ is a function of features extracted from the page and ad regions as discussed in Section 3.1. To allow arbitrary real-valued coefficients for the features, it is routine to map $p_{ij}$ onto the real line via a monotonically increasing function. The most widely used such function is the logit, which maps $p_{ij}$ to $\mathrm{logit}(p_{ij}) = \log[p_{ij}/(1 - p_{ij})]$. We assume that $\mathrm{logit}(p_{ij})$ is a linear function of the features representing the main effects and interaction effects discussed in Section 3.1. For simplicity, consider a single (page, ad) region combination $(p(r_1), a(r_2))$. The linear function in the logistic regression is given by:

$\mathrm{logit}(p_{ij}) = \mathrm{logit}(q_{ij}) + \sum_w \alpha_w M_{p(r_1)w} + \sum_w \beta_w M_{a(r_2)w} + \sum_w \gamma_{w,r_1,r_2} I_{p(r_1)w, a(r_2)w}$   (3)

where the $\alpha_w$, $\beta_w$ and $\gamma_{w,r_1,r_2}$ are unknown feature coefficients to be estimated by logistic regression, and $\mathrm{logit}(q_{ij})$ are known prior log-odds that could have been derived from a different model. For instance, a uniform prior would assume $q_{ij} = \hat{p}$, where $\hat{p}$ is the average CTR on the entire training corpus. Another possibility we explore is to derive the prior log-odds $q_{ij}$ by combining relevance scores with click feedback. To add a new (page, ad) region combination, we only need to augment equation (3) with the appropriate linear terms for the page main and ad main effects. For the interaction effects, we re-parametrize our model to facilitate indexing. We explain our re-parametrization here and discuss the connection to indexing later, in Section 4. For each (page, ad) region combination $(r_1, r_2)$, a word $w$ that occurs in both $r_1$ and $r_2$ has a coefficient $\gamma_{w,r_1,r_2}$ which depends on the word, the page region and the ad region. We assume the following parametrization:

$\gamma_{w,r_1,r_2} = \gamma_w \cdot \gamma_{p(r_1)} \cdot \gamma_{a(r_2)}$   (4)

i.e., the interaction of a word for a given page and ad region combination is factored into word-specific, page-specific and ad-specific components. Thus, for $M$ words, $R_1$ page regions and $R_2$ ad regions, the number of parameters equals $M + R_1 + R_2$, as opposed to $M \cdot R_1 \cdot R_2$ in the original model.

The coefficient estimates are obtained by maximizing the log-likelihood of the data, given by

$\sum_{ij} \left( y_{ij} \log(p_{ij}) + (1 - y_{ij}) \log(1 - p_{ij}) \right)$   (5)

where $p_{ij}$ is given by equation (3).

The optimization problem described above may become ill-conditioned and lead to high-variance estimates if features tend to be correlated, sparse, or both. For exact technical details on necessary and sufficient conditions that ensure convergence, we refer the reader to [15]. This is a drawback in our scenario, where feature sparsity and correlations are routine. To provide a robust solution, we put additional constraints on the coefficients in the form of priors. A $N(0, \sigma^2)$ prior means that the parameter estimates are pinned down to the range $(-3\sigma, 3\sigma)$ with 99% probability a-priori. In the absence of enough information about a coefficient from the data, this ensures that the coefficient estimate does not diverge to the boundaries and cause numerical instability. To put more stringent constraints on sparse features, we down-weight the prior variance $\sigma^2$ by a measure of relative sparsity, which we define to be the variance of the feature occurrence process relative to the average feature occurrence variance. The feature occurrence variance is given by $s(1-s)$, where $s$ is the fraction of times the feature occurs. In particular, we assume:

$\alpha_w \sim N\!\left(0,\ \sigma^2 \cdot \frac{s_p(w)(1 - s_p(w))}{\bar{s}_p(1 - \bar{s}_p)}\right)$, $\quad \beta_w \sim N\!\left(0,\ \sigma^2 \cdot \frac{s_a(w)(1 - s_a(w))}{\bar{s}_a(1 - \bar{s}_a)}\right)$, $\quad \gamma_w \sim N\!\left(0,\ \sigma^2 \cdot \frac{s_I(w)(1 - s_I(w))}{\bar{s}_I(1 - \bar{s}_I)}\right)$

Note that separate averages are used for the page main effects, the ad main effects, and the interaction effects (indicated by the subscripts $p$, $a$ and $I$). In all our experiments, we fix $\sigma^2 = 9$; experiments with several other values in the range of 3 to 20 did not yield much difference. Now the optimization problem reduces to estimating the coefficients by maximizing the log-posterior, which is the sum of the log-likelihood (Eq. 5) and the log-prior of the coefficients, as discussed above.

Next, we discuss the optimization process itself. Several approaches to optimizing our objective function exist in the literature. Among the ones that have been used in large-scale applications are iterative scaling [20], nonlinear conjugate gradient, quasi-Newton (in particular, limited memory BFGS) [7], iteratively-reweighted least squares [14], truncated Newton [16], and trust-region Newton [5]. All of these methods are iterative and generate a sequence of estimates that converge to the optimal solution. For all methods except iterative scaling, the cost per iteration is high but convergence is fast. For iterative scaling, which updates one component at a time, the cost per iteration is low but convergence is slower. For our application, the training corpus typically has several million data points and several thousand features, making it extremely slow to fit the model using these approaches on a single machine. To scale our computations, we adopt a simple parallelization approach that randomly splits the data into several parts, fits a logistic regression separately to each part, and then combines the estimates obtained from the pieces. For convenience, we perform our computation in a MapReduce framework [12]. MapReduce is a programming model for processing large data sets. It runs on a large cluster of commodity machines; it is highly scalable, processing several gigabytes of data on thousands of machines, and easy to use. The run-time system automatically takes care of the details of partitioning the data, scheduling jobs across machines, handling failures and managing inter-machine communication.

To fit a logistic regression on a given piece, we use a simple iterative scaling (also known as conditional maximization) approach. The algorithm is as follows: we initialize the coefficients $\alpha_w$, $\beta_w$ and $\gamma_w$ to 0, and the $\gamma_{p(\cdot)}$'s and $\gamma_{a(\cdot)}$'s to 1. We update the value of each coefficient one at a time, holding the others fixed at their current values, by maximizing the likelihood through a Newton-Raphson method. This completes a single iteration. The procedure is continued through several iterations until convergence. The method is guaranteed to converge since every step can only increase the likelihood.
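Before turning to how the per-partition fits are combined, here is a minimal sketch of how a fitted model of this form could score a single (page, ad) pair under the factorization of Eq. (4); the coefficient layout and names are illustrative, and the prior offset is optional.

```python
import math

def predict_ctr(page_feats, ad_feats, inter_feats, coef, q=None):
    """Compute p_ij from Eq. (3) using the factorized interaction of Eq. (4).

    page_feats:  {(r1, w): M_{p(r1)w}}   page-region main-effect scores
    ad_feats:    {(r2, w): M_{a(r2)w}}   ad-region main-effect scores
    inter_feats: {(r1, r2, w): I_{p(r1)w, a(r2)w}}  interaction scores
    coef: coefficient tables: 'alpha' and 'beta' keyed by (region, word),
          'gamma_w' keyed by word, 'gamma_p' and 'gamma_a' keyed by region.
    q: optional prior CTR supplying the offset logit(q_ij); None means offset 0.
    """
    z = math.log(q / (1.0 - q)) if q is not None else 0.0
    for (r1, w), s in page_feats.items():
        z += coef["alpha"].get((r1, w), 0.0) * s
    for (r2, w), s in ad_feats.items():
        z += coef["beta"].get((r2, w), 0.0) * s
    for (r1, r2, w), s in inter_feats.items():
        # gamma_{w,r1,r2} = gamma_w * gamma_{p(r1)} * gamma_{a(r2)}   (Eq. 4)
        g = (coef["gamma_w"].get(w, 0.0)
             * coef["gamma_p"].get(r1, 1.0)
             * coef["gamma_a"].get(r2, 1.0))
        z += g * s
    return 1.0 / (1.0 + math.exp(-z))  # inverse logit recovers p_ij
```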
Along with a coefficient estimate, the Newton-Raphson procedure provides an estimate of the negative Hessian, the inverse of which gives an estimate of the variance of the coefficient from maximum likelihood theory. The results on the various data partitions are combined using a weighted average of the individual estimates, where the weight assigned to a partition-specific estimate is its relative precision, obtained from the negative Hessian values. This weighting scheme is the best way to combine estimates through a linear function [6]. Figure 2 describes our fitting procedure in detail.
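A plain-Python rendering of this combination step (corresponding to the CombineCoeffs routine in Figure 2 below) might look as follows; it assumes each split returns its coefficient estimates together with the per-coefficient negative second derivatives.

```python
def combine_estimates(per_split_results):
    """Precision-weighted combination of per-partition coefficient estimates.

    per_split_results: list of (coeffs, neg_second_deriv) pairs, one per split,
    where coeffs[i] is the estimate for feature i on that split and
    neg_second_deriv[i] is f''(i); |f''(i)| acts as the relative precision
    (inverse variance) of that estimate.
    """
    weighted_sum, weight_total = {}, {}
    for coeffs, neg_second_deriv in per_split_results:
        for i, c in coeffs.items():
            w = abs(neg_second_deriv[i])
            weighted_sum[i] = weighted_sum.get(i, 0.0) + w * c
            weight_total[i] = weight_total.get(i, 0.0) + w
    return {i: weighted_sum[i] / weight_total[i]
            for i in weighted_sum if weight_total[i] > 0}
```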
Algorithm CondMax
    Initialize coeff_i = 0 for all word features i, and coeff_i = 1 for the page-region
        and ad-region specific parameters of the re-parametrized model of equation (4).
    Iterate until convergence:
        Iterate i over the set of features:
            Set f'(i) = f''(i) = loglik = 0
            Iterate over all impressions, with the data providing score_i for feature i:
                x = sum_i coeff_i * score_i
                p = 1.0 / (1.0 + exp(-x))
                f'(i) = f'(i) + (click_or_not - p) * score_i
                f''(i) = f''(i) - p * (1 - p) * score_i^2
                loglik = loglik + log(p) - x * (1 - click_or_not)
            f'(i) = f'(i) - coeff_i / sigma_i^2
            f''(i) = f''(i) - 1.0 / sigma_i^2
            coeff_i = coeff_i - f'(i) / f''(i)
        Stop if loglik is lower than in the previous iteration
    Return (coeff_i, f''(i)) for all i

Algorithm CombineCoeffs
    Randomly split the data
    Call CondMax on each split
    Iterate over all features i:
        C(i) = sum_{split j} coeff_i * |f''(i)|  /  sum_{split j} |f''(i)|
    Return C(i) as the final coefficient value for feature i

Figure 2: Implementation of Logistic Regression.

4. AD SEARCH PROTOTYPE
A key feature of the model proposed in this paper is that it is suitable for efficient evaluation over an inverted index. In this section we outline an implementation of a prototype ad search engine based on the WAND [2] algorithm and on inverted indexing of the ads using the Hadoop distributed computing framework [10]. Our method allows any kind of feature to be used in the ad search. In our prototype we use unigrams, phrases and classes as features.

Figure 3: Ad search prototype architecture.

The inverted index is composed of one postings list for each feature, which has one entry (posting) for each ad that contains this feature. The ads are represented by adIDs, unique numeric identifiers assigned to each ad. Figure 3 shows the architecture of our ad search prototype. We produce the inverted index over a grid of machines running the Hadoop framework [10]. The indexing starts by extracting features from the ads. Each feature is represented by a unique numeric featureID. The resulting data file is sorted by <adID, featureID>. Next we invert this file by sorting it with <featureID, adID> as the key. Finally we write the delta-compressed posting lists used at runtime to evaluate the query.

There are a few important differences in the ad search engine problem that require a different approach compared to web search engines. First, in web search, the queries are short and the documents are long. In the ad search case, the number of features per ad is usually lower than the number of features extracted from a web page, which in our case represents the ad space query. So it is almost never the case that an ad contains all the features of the ad search query. Therefore the ad search engine performs similarity search in the vector space with a long query and relatively short ad vectors. In contrast, for the majority of web queries there are many pages that contain all the query words, and one of the key issues is how to rank the pages containing the query.

Features are extracted from the input query. The query is a bag of pairs <featureID, weight>. For each query feature, WAND opens a cursor over the posting list of this feature. During the evaluation the cursors are moved forward, examining the documents as they are encountered. WAND is a document-at-a-time algorithm [1] that finds the next cursor to be moved based on an upper bound of the score for the documents at which the cursors are currently positioned. The algorithm keeps a heap of current candidates. The invariant of the algorithm is that the heap contains the best matches (highest scores) among the documents (ads) with IDs less than the document pointed to by the current minimum cursor. Cursors pointing at documents with upper bounds smaller than the minimum score among the candidate documents are candidates for a move.
To find the upper bound for a document, the algorithm assumes that all cursors that are before the current one will hit this document (i.e., that the document contains all the terms represented by cursors before or at that document). It has been shown that WAND can be used with any scoring function that is monotonic with respect to the number of matching terms in the document. It can also be easily shown that some non-monotonic scoring functions can be used as well, as long as we can find a mechanism to estimate the score upper bounds. One family of such functions is the set of functions in which a fixed subset of the features (known a priori) always decreases the score. In such cases, the upper bound estimates simply assume that these features do not appear in the ad. An example of such a function is a cosine similarity where some of the query coefficients are negative. The scoring function proposed in this paper might have such coefficients and fits well within the WAND framework.

Incorporating the logistic-regression based model in this framework is simple. The scoring equation (3) is modified to exclude the page main effects and used as the WAND scoring formula. Figure 3 shows the stages at which the learned parameters are factored into the evaluation framework. $M_{a(r)}$ is used at indexing time to calculate a static score for each individual ad. We use this score to assign adIDs to the ads in decreasing order of ad score. This allows estimating upper bounds for the ads that are skipped by using the score of the ad pointed to by the preceding cursor in the sorted cursor list. After the page is parsed and its features are extracted along with their tf-idf scores, we apply the reweighting based on the table $I_w$. The $M_{p(r)}$ table is not used in the ad selection, but only to adjust the final scores in order to calculate the probabilities according to equation (3).
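To illustrate how the learned parameters enter the ad-selection score, the sketch below scores a single candidate ad against a page query using Eq. (3) without the page main effects, as described above. It assumes the product form of f and precomputed per-region scores; it scores one ad directly rather than implementing the WAND cursor machinery, and all names are illustrative.

```python
def ad_selection_score(page_features, ad_postings, coef):
    """WAND-style score of one ad for a page query (page main effects excluded).

    page_features: {(page_region, word): page-side score t_{p(r1)w}}
    ad_postings:   {(ad_region, word): ad-side score t_{a(r2)w}} for one ad
    coef: 'beta' keyed by (ad_region, word), plus the factorized interaction
          tables 'gamma_w', 'gamma_p', 'gamma_a' from Eq. (4).
    """
    # Static, page-independent part: ad main effects, computable at index time.
    score = sum(coef["beta"].get((r2, w), 0.0) * s
                for (r2, w), s in ad_postings.items())
    # Query-dependent part: interactions for words shared by page and ad regions.
    for (r1, w), t_page in page_features.items():
        for (r2, w2), t_ad in ad_postings.items():
            if w2 != w:
                continue
            g = (coef["gamma_w"].get(w, 0.0)
                 * coef["gamma_p"].get(r1, 1.0)
                 * coef["gamma_a"].get(r2, 1.0))
            score += g * t_page * t_ad  # product form of f for the interaction
    return score
```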
5. EXPERIMENTS
We now provide a detailed discussion of all our experiments, conducted on a large-scale real-world dataset. After a brief description of the data, we present results for the different word selection schemes, followed by comparisons of several variants of our model with a baseline model that does not utilize any click feedback and is based solely on relevance. The results convincingly demonstrate the improved performance of the models that use click feedback.

5.1 Data
To demonstrate the efficacy of our approach, we conduct a series of experiments on retrospective data collected from some back-end servers of a large contextual advertising system. We would like to emphasize that the servers were chosen subjectively and are not representative of the actual performance of the entire advertising system. We conducted our experiments on 30 days of data. The first 15 days of data were used as our training corpus, while the results were evaluated on the remaining days. Every display of an ad on a page is called an impression. Thus, a single page view by a user generates multiple impressions, equal to the number of ads shown on the page. There are approximately .5B-1B page views per day in the portion of the data that we examined. Defining CTR as the number of clicks per impression, the overall CTRs for (page, ad) combinations are low, approximately in the range of .1%-20%, with extreme right skew. Labeling the clicks as 1 and non-clicks as 0, we have a learning problem with extreme class imbalance. Logistic regression is known to provide biased estimates of the weights in such scenarios [11]. A widely used strategy in such settings reduces the imbalance by sampling from the majority class. We adopt a similar strategy, with the following sampling scheme: we retain all impressions associated with clicked page views and take a 1/2000 random sample of the page views associated with non-clicked page views. All results reported in this paper are on data sampled in this fashion.

5.2 Word selection schemes
We conducted experiments with several (page, ad) region combinations, but provide detailed analysis for experiments conducted with one region combination, viz., page title (call it pti) and ad title (call it oti). In the training corpus, we obtained approximately 112K unique words after stop word removal. To avoid introducing extremely sparse words into our regression, we truncated our word list to consider only those words that had at least 100 impressions in which the word occurred both on the page and on the ad. We further truncated our list by considering only words that had at least 5 clicks separately when shown on pages and on ads. The truncation left us with approximately 3.4K words to choose from. We implemented one data-based and two relevance-based feature selection schemes. The data-based feature selection criterion was described in Section 3.2 and is given by the measure $i_w$ defined in equation (2). We conduct two experiments, one with the top 1K words and the other with all 3.4K words. Similar experiments were also conducted on words selected using the two relevance-based measures described previously in Section 3.2. Briefly, the first relevance-based measure assigns the average tf-idf score to a word; it is computed from a single corpus which is the union of the page title and ad title regions. The second relevance-based measure is the geometric mean of the average tf-idf scores of a word computed separately from the page region and ad region corpora. We also conducted a set of experiments with a randomly selected set of 1K words from our list, which performed very poorly; hence its results are not reported.

5.3 Hybrid model
A pure relevance-based model finds relevance by using semantic information. More specifically, the relevance of an ad for a page is quantified by a score which is a function of the page and ad content. We tested two such scores for the region combination (pti, oti), viz.

$Relevance_1 = \sum_w \text{tf-idf}_{w,pti} \cdot \text{tf-idf}_{w,oti}$

and

$Relevance_2 = \dfrac{Relevance_1}{\sqrt{\sum_w \text{tf-idf}^2_{w,pti} \cdot \sum_w \text{tf-idf}^2_{w,oti}}}$

$Relevance_2$ is the cosine of the angle between the tf-idf vectors for the page and ad regions.
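A direct rendering of these two scores (a sketch assuming precomputed per-region tf-idf weights) is:

```python
from math import sqrt

def relevance_scores(page_tfidf, ad_tfidf):
    """Relevance1 (dot product) and Relevance2 (cosine) between the tf-idf
    vectors of a page region and an ad region, e.g. page title and ad title.

    page_tfidf, ad_tfidf: dicts mapping word -> tf-idf weight.
    """
    rel1 = sum(page_tfidf[w] * ad_tfidf[w]
               for w in page_tfidf.keys() & ad_tfidf.keys())
    norm_page = sqrt(sum(v * v for v in page_tfidf.values()))
    norm_ad = sqrt(sum(v * v for v in ad_tfidf.values()))
    rel2 = rel1 / (norm_page * norm_ad) if norm_page and norm_ad else 0.0
    return rel1, rel2
```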
The distribution of $Relevance_1$ was extremely right skewed with several outliers, so we conduct all our experiments with $Relevance_2$. In addition to a model based solely on $Relevance_2$, we explore a hybrid model which combines our word-based logistic regression with the relevance-based score model. We explore two options: (a) use $Relevance_2$ as an additional feature in the logistic regression, without adding substantial overhead to the indexing algorithm, and (b) use $Relevance_2$ to derive the priors $q_{ij}$ in equation (3).

Both these options require understanding the relationship between the log-odds of CTR and $Relevance_2$; the relevance score needs to be transformed onto the scale of log-odds of probability. We studied this relationship empirically by plotting empirically estimated log-odds against $Relevance_2$ scores as follows: sort all positive $Relevance_2$ values in the training data and create 100 bins; for each bin, compute the empirical log-odds and the average $Relevance_2$ score. Figure 4 shows the relationship. Approximately 90% of impressions had a $Relevance_2$ value of zero. Excluding those, the curve clearly reveals a quadratic relationship. Thus, for scheme (a) above, we add the quadratic terms $a + b\,Relevance_2 + c\,Relevance_2^2$ to the logistic regression and estimate the parameters $a$, $b$ and $c$. For scheme (b) above, we derive the $q_{ij}$'s by using the approximate quadratic curve shown in Figure 4: $\mathrm{logit}(q_{ij}) = -1.75 + 2.52\,Relevance_2 - 1.56\,Relevance_2^2$.

Figure 4: Relationship between the log-odds of CTR and $Relevance_2$.

5.4 Evaluation Criteria
Our test data had a total of 14.48M impressions. Of these, 12.8M had at least one word from our 1K word list or a positive tf-idf score when considering only words in pti and oti. We only report results on this pool. We use the precision-recall metric to measure the performance of our algorithms. In our scenario, recall for a given threshold on the model score is the fraction of all clicks that were present (recalled) in the test cases above that threshold. Precision for a given recall value is the CTR on the test cases above the threshold. Since our end goal is to rank ads for a given page, we are more interested in algorithms that recall the top fraction of clicks (e.g., 2%, 5%, 10%) with greater precision. Also note that, for a random classifier that classifies a test case as a click with a constant probability (the overall CTR in the training corpus), the precision equals that constant probability for every recall value.

Figure 5: Precision-recall curves for the three word selection schemes with the best hybrid model, which uses both words and the relevance measure as features.

5.5 Results
Our key results are as follows:

· Figure 5 shows the precision-recall curves for the three word selection schemes for our best model, viz., the hybrid model that uses both words and the relevance measure as features (results with other models were similar). Of the three word selection schemes we tried, the data-based one using the measure $i_w$ of equation (2) gave the best results, especially for low values of recall. Of the two relevance-based word selection criteria, the one that uses the geometric mean of the average tf-idf scores was slightly better than the one that used the average tf-idf from the union of the page and ad regions. However, both relevance schemes were inferior to the data-based one. In addition, we found no significant difference in going from the top 1000 words to the top 3400 words in the data-based scheme; hence, those results are not shown. This does indicate, however, that the top few words according to our data-based measure $i_w$ are enough to capture most of the signal in the data.

· We conducted experiments using both tf and tf-idf as scores for the features in equation (3). We consistently get better results with tf; all our subsequent results are reported with this measure.

· Figure 6 shows the precision-recall curves for the hybrid, word-only and relevance-based models. Both our word-only logistic regression using tf as the feature scores
Al l curves are reported for the best word selection scheme. The second curve was truncated at a recal l value of 20% to provide a better summary of performance at smal l recal l values. Statistical error in al l cases was less than 1% of the mean. in equation 3 and the hybrid model that uses b oth word-based and Relev ance2 based features outp erformed the pure relevance model that ranks ads solely based on Relev ance2 . In fact, the hybrid model is sup erior to the word only model; esp ecially for low recall values like 2%, 5% and 10%. The hybrid models that use Relev ance2 as features and prior have approximately similar p erformance. In fact, for a recall of 10%, the hybrid model provides 25% lift relative to a pure relevance-based model and 100% lift relative to random model in terms of precision. We notice sharp spikes in the precision-recall curves, esp ecially at low recall values. This is indicative of high click density regions in the test data when ranked by model scores. On further scrutiny we discovered some of these high density regions were caused by (page,ad) impressions that had extremely high click rates (e.g in the range of 30% - 60%). This can occur in our data due to a numb er of reasons: p opular ads, p opular pages, sudden burst of heavy clickers, etc. The 95% confidence intervals obtained by producing precision-recall curves on several indep endent splits of test data was within 1% of the mean; hence, all our results are statistically significant. · Figure 7 shows histograms of parameter estimates obtained from our b est hybrid model. The histograms show the distribution of coefficient estimate × averagetf for words for page main effects, ad main effects and page-ad interaction effects. Interestingly, the interaction effects are small for a ma jority of words except for a small fraction that have large p ositive values. This ties in with the observation that using the top 1000 words was as accurate as using the top 3400 words; if only a few words have significant interaction effects, then adding more words as features into the logistic regression will not help. 5.6 Summary of Results Based on an extensive set of large scale exp eriments, we demonstrated that using click feedback significantly improves the accuracy of matching relevant ads to pages compared to traditional relevance scoring models that are solely based on semantic similarity. We incorp orate click feedback through a logistic regression by using the words on pages and ads as features. We tested several variable selection schemes and found a simple data based scheme p erforms the b est. We also prop osed a hybrid model which uses words and relevance scores as features and significantly outp erforms the traditional relevance based models in our exp eriments. Our solution is scalable and can process several gigabytes of data through a Hadoop parallelization framework. In addition, it can b e easily incorp orated into existing ad-serving architectures. 6. CONCLUSIONS We prop osed a new class of models to combine relevance with click feedback for a contextual advertising system. Our model is based on a logistic regression and allows for a large numb er of granular features. The key feature of our modeling approach is the ability to model interactions that exist among words b etween page and ad regions in a way that is suitable for efficient evaluation over inverted indexes. 
In fact, we employ a multiplicative factorization to model the interaction effects for several (page, ad) regions in a parsimonious way that facilitates fast look-up of ads at run time. Through large-scale experiments, we convincingly demonstrate the advantage of combining relevance with click feedback; we achieve a 25% lift in precision at a recall value of 10% relative to a pure relevance-based model in our experiments. In the future, we want to enrich our models to discover interactions caused by word synonyms, using both supervised and unsupervised approaches. We are currently experimenting with a wide range of feature selection and model fitting schemes for the logistic regression proposed in the paper. We are also exploring the application of our method to other systems where there is implicit feedback in the form of clicks, such as web search and search advertising.

7. REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM, 1999.
[2] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM '03: Proc. of the twelfth intl. conf. on Information and knowledge management, pages 426-434, New York, NY, 2003. ACM.
[3] A. Z. Broder, M. Fontoura, V. Josifovski, and L. Riedel. A semantic approach to contextual advertising. In SIGIR, pages 559-566, 2007.
[4] P. Chatterjee, D. L. Hoffman, and T. P. Novak. Modeling the clickstream: Implications for web-based advertising efforts. Marketing Science, 22(4):520-541, 2003.
[5] C. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton methods for large-scale logistic regression. In International Conference on Machine Learning, 2007.
[6] C. R. Rao. Linear Statistical Inference and its Applications. Wiley-Interscience, 2002.
[7] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503-528, 1989.
[8] S. Derksen and H. J. Keselman. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45:265-282, 1992.
[9] Online ad spending to total $19.5 billion in 2007. eMarketer, February 2007. Available from http://www.emarketer.com/Article.aspx?id=1004635.
[10] Apache Foundation. Apache Hadoop project. lucene.apache.org/hadoop.
[11] G. King and L. Zeng. Logistic regression in rare events data. Political Analysis, 9:137-162, 2001.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Sixth Symposium on Operating System Design and Implementation, pages 137-150, 2004.
[13] A. Lacerda, M. Cristo, M. A. Goncalves, W. Fan, N. Ziviani, and B. Ribeiro-Neto. Learning to advertise. In SIGIR '06: Proc. of the 29th annual intl. ACM SIGIR conf., pages 549-556, New York, NY, 2006. ACM.
[14] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.
[15] M. J. Silvapulle. On the existence of maximum likelihood estimates for the binomial response models.
Journal of the Royal Statistical Society, Series B, 43:310-313, 1981.
[16] P. Komarek and A. W. Moore. Making logistic regression a core data mining tool with TR-IRLS. In International Conference on Data Mining, pages 685-688, 2005.
[17] M. Regelson and D. Fain. Predicting click-through rate using keyword clusters. In Proc. of the Second Workshop on Sponsored Search Auctions, 2006.
[18] B. Ribeiro-Neto, M. Cristo, P. B. Golgher, and E. S. de Moura. Impedance coupling in content-targeted advertising. In SIGIR '05: Proc. of the 28th annual intl. ACM SIGIR conf., pages 496-503, New York, NY, 2005. ACM.
[19] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In WWW, pages 521-530, 2007.
[20] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE PAMI, 19:380-393, 1997.
[21] C. Wang, P. Zhang, R. Choi, and M. D. Eredita. Understanding consumers attitude toward advertising. In Eighth Americas conf. on Information Systems, pages 1143-1148, 2002.
[22] W. Yih, J. Goodman, and V. R. Carvalho. Finding advertising keywords on web pages. In WWW '06: Proc. of the 15th intl. conf. on World Wide Web, pages 213-222, New York, NY, 2006. ACM.