WWW 2008 / Refereed Track: Internet Monetization - Online Advertising April 21-25, 2008 · Beijing, China Dynamic Cost-Per-Action Mechanisms and Applications to Online Advertising Hamid Nazerzadeh hamidnz@stanford.edu ABSTRACT We study the Cost-Per-Action or Cost-Per-Acquisition (CPA) charging scheme in online advertising. In this scheme, instead of paying p er click, the advertisers pay only when a user takes a sp ecific action (e.g. fills out a form) or completes a transaction on their websites. We focus on designing efficient and incentive compatible mechanisms that use this charging scheme. We describ e a mechanism based on a sampling-based learning algorithm that under suitable assumptions is asymptotically individually rational, asymptotically Bayesian incentive compatible and asymptotically ex-ante efficient. In particular, we demonstrate our mechanism for the case where the utility functions of the advertisers are indep endent and identically-distributed random variables as well as the case where they evolve like indep endent reflected Brownian motions. Stanford University Stanford, CA 94304 saberi@stanford.edu Stanford University Stanford, CA 94304 Amin Saberi r-vohra@kellogg.nwu.edu Nor thwestern University Evanston, IL 60208 Rakesh Vohra Categories and Subject Descriptors J.4 [Social and Behavioral Sciences]: Economics; F.2.0 [Analysis of Algorithms and Problem Complexity]: General; I.2.6 [Artificial Intelligence]: Learning General Terms Economics, Algorithm, Theory Keywords Mechanism Design, Cost-Per-Action, Internet Advertising 1. INTRODUCTION Currently, the main two charging models in the online advertising industry are cost-p er-impression (CPM) and costp er-click (CPC). In the CPM model, the advertisers pay the publisher for the impression of their ads. CPM is commonly used in traditional media (e.g. magazines and television) or banner advertising and is more suitable when the goal of the advertiser is to increase brand awareness. A more attractive and more p opular charging model in online advertising is the CPC model in which the advertisers pay the publisher only when a user clicks on their ads. In the last few years, there has b een a tremendous shift towards the CPC charging model. CPC is adopted by search engines Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. such as Google or Yahoo! for the placement of ads next to search results (also known as sp onsored search) and on the website of third-party publishers. In this pap er we will focus on another natural and widely advocated charging scheme known as the Cost-Per-Action or Cost-Per-Acquisition (CPA) model. In this model, instead of paying p er click, the advertiser pays only when a user takes a sp ecific action (e.g. fills out a form) or completes a transaction. Recently, several companies like Google, eBay, Amazon, Advertising.com, and Snap.com have started to sell advertising in this way. CPA models can b e the ideal charging scheme, esp ecially for small and risk averse advertisers. We will briefly describ e a few advantages of this charging scheme over CPC and refer the reader to [18] for a more detailed discussion. One of the drawbacks of the CPC scheme is that it requires the advertisers to submit their bids b efore observing the profits generated by the users clicking on their ads. Learning the exp ected value of each click, and therefore the right bid for the ad, is a prohibitively difficult task esp ecially in the context of sp onsored search in which the advertisers typically bid for thousands of keywords. CPA eliminates this problem b ecause it allows the advertisers to rep ort their payoff after observing the user's action. Another drawback of the CPC scheme is its vulnerability to click fraud. Click fraud refers to clicks generated by someone or something with no genuine interest in the advertisement. Such clicks can b e generated by the publisher of the content who has an interest in receiving a share of the revenue of the ad or by a rival who wishes to increase the cost of advertising for the advertiser. Click fraud is considered by many exp erts to b e the biggest challenge facing the online advertising industry [13, 10, 23, 20]. CPA schemes are less vulnerable b ecause generating a fraudulent action is typically more costly than generating a fraudulent click. For example, an advertiser can define the action as a sale and pay the publisher only when the ad yields profit1 . On the other hand, there is a fundamental difference b etween CPA and CPC charging models. A click on the ad can b e observed by b oth advertiser and publisher. However, the action of the user is hidden from the publisher and is observable only by the advertiser. Although the publisher can require the advertisers to install a software that will monitor actions that take place on their web site, even moderately sophisticated advertisers can find a way to manipulate the software if they find it sufficiently profitable. 1 CPA makes generating a fraudulent action a more costly enterprize, but not imp ossible (e.g., using a stolen credit). 179 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising Are the publishers exp osed to the manipulation or misrep orting of the advertisers in the CPA scheme? Does CPA create an incentive for the advertisers to misrep ort the numb er of actions or their payoffs for the actions? The main result of this pap er is to give a negative answer to these questions. We design a mechanism that, asymptotically and under reasonable assumptions, removes the incentives of the advertisers to misrep ort their payoffs. At the same time, our mechanism has the same asymptotic efficiency and hence revenue as the currently used CPC mechanisms. We will use techniques in learning and mechanism design to obtain this result. In the next section, we will formally describ e our model in mechanism design terminology (see [21].) We will refer to advertisers as agents and to the impression of an ad as an item. For simplicity of exp osition only, we assume only one advertisement slot p er page. In section 6 we outline how to extend our results to the case where more than one advertisement can b e displayed in each page. Although our work is essentially motivated by online advertising, we b elieve that the application of our mechanism is not limited this domain. April 21-25, 2008 · Beijing, China 1.1 Model We study the following problem: there are a numb er of self-interested agents comp eting for identical items sold rep eatedly at times t = 1, 2, · · · . At each time t, a mechanism allocates the item to one of the agents. Agents discover their utility for the good only if it is allocated to them. If agent i receives the good at time t, she discovers utility uit (denominated in money) for it and rep orts (not necessarily truthfully) the realized utility to the mechanism. Then, the mechanism determines how much the agent has to pay for receiving the item. We allow the utility of an agent to change over time. For this environment we are interested in auction mechanisms which have the following four prop erties. 1. The mechanism is individually rational in each p eriod. 2. Agents have an incentive to truthfully rep ort their realized utilities. 3. The efficiency (and revenue) is, in an appropriate sense, not too small compared to a second price auction. 4. The correctness of the mechanism does not dep end on an a-priori knowledge of the distribution of uit 's. This feature is motivated by the Wilson doctrine [24] 2 . The precise manner in which these prop erties are formalized is describ ed in section 2. We will build our mechanisms on a sampling-based learning algorithm. The learning algorithm is used to estimate the exp ected utility of the agents, and consists of two alternating phases: exploration and exploitation. During an exploration phase, the item is allocated for free to a randomly chosen agent. During an exploitation phase, the mechanism allocates the item to the agent with the highest estimated exp ected utility. After each allocation, the agent who has received the item, discovers her utility and rep orts it to the mechanism. Subsequently, the mechanism up dates the estimate of utilities and determines the payment. 2 Wilson criticizes relying too much on common-knowledge assumptions. We characterize a class of learning algorithms that ensure that the corresp onding mechanism has the four desired prop erties. The main difficulty in obtaining this result is the following: since there is uncertainty ab out the utilities, it is p ossible that in some p eriods the item is allocated to an agent who does not have the highest utility in that p eriod. Hence, the natural second-highest price payment rule would violate individual rationality. On the other hand, if the mechanism does not charge an agent b ecause her rep orted utility after the allocation is low, it gives her an incentive to shade her rep orted utility down. Our mechanism solves these problems by using an adaptive, cumulative pricing scheme. We illustrate our results by identifying simple mechanisms that have the desired prop erties. We demonstrate these mechanisms for the case in which the uit 's are indep endent and identically-distributed random variables as well as the case where their exp ected values evolve like indep endent reflected Brownian motions. In these cases the mechanism is actually ex-post individually rational. In our prop osed mechanism, the agents do not have to bid for the items. This is advantageous when the bidders themselves are unaware of their utility values. However, in some cases, an agent might have a b etter estimate of her utility for the item than our mechanism. For this reason, we describ e how we can slightly modify our mechanism to allow those agents to bid directly. 1.2 Related Work There is a large numb er of interesting results on using machine learning techniques in mechanism design. We only briefly survey the main techniques and ideas and compare them with the approach of this pap er. Most of these works, like [5, 8, ?, 17], consider one-shot games or rep eated auctions in which the agents leave the environment after they received an item. In our setting we may allocates items to an agent several times and hence, we need to consider the strategic b ehavior of the agents over time. There is also a big literature on regret minimization or exp ert algorithms. In our context, these algorithms are applicable even if the utilities of the agents are changing arbitrarily. However, the efficiency (and therefore the revenue) of these algorithms is comparable to the mechanisms that allocates the item to the single b est agent (exp ert) (e.g. see [16]). Our goal is more ambitious: our efficiency is close the most efficient allocation which might allocate the item to different agents at different times. On the other hand, we focus on utility values that change smoothly (e.g. like a Brownian motion). In a finitely rep eated version of the environment considered here, Athey and Segal [2] construct an efficient, budget balanced, mechanism where truthful revelation in each p eriod is Bayesian incentive compatible. Bapna and Web er [4] consider the infinite horizon version of [2] and prop ose a class of incentive compatible mechanisms based on the Gittins index (see [11]). Taking a different approach, Bergemann and V¨lim¨ki [6] and Cavallo et al. [9] prop ose an incentive comaa patible generalization of the Vickrey-Clark-Groves mechanism based on the marginal contribution of each agent for this environment. All these mechanisms need the exact solution of the underlying optimization problems, and therefore require complete information ab out the prior of the utilities 180 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising of the agents; also, they do not apply when the evolution of the utilities of the agents is not stationary over time. This violates the last of our desiderata. For a comprehensive survey in dynamic mechanism design literature see [22]. In the context of sp onsored search, attention has focused on ways of estimating click through rates. Gonen and Pavlov [12] give a mechanism which learns the click-through rates via sampling and show that truthful bidding is, with high probability, a (weakly) dominant strategy in this mechanism. Along this line, Wortman et al. [25] introduced an exploration scheme for learning advertisers' click-through rates in sp onsored search which maintains the equilibrium of the system. In these works, unlike ours, the distribution of the utilities of agents are assumed to b e fixed over time. Immorlica et al. [14], and later Mahdian and Tomak [18], examine the vulnerability of various procedures for estimating click through, and identify a class of click through learning algorithms in which fraudulent clicks cannot increase the exp ected payment p er impression by more than o(1). This is under the assumption that the slot of an agent is fixed and the bids of other agents remain constant overtime. In contrast, we study conditions which guarantee incentive compatibility and efficiency, while the utility of (all) agents may evolve over time. April 21-25, 2008 · Beijing, China M is asymptotical ly ex-ante individual ly rational if: T X lim inf E [ xit µit - pit ] 0. T t= 1 Incentive Compatibility: This prop erty implies that truthfulness defines an asymptotic Bayesian Nash equilibrium. Consider agent i and supp ose all agents except i are truthful. Let Ui (T ) b e the exp ected total profit of agent i, if agent i is truthful b etween time 1 and T . e Also, let Ui (T ) b e the maximum of exp ected profit of agent i under any other strategy. Asymptotic incentive compatibility requires that e Ui (T ) - Ui (T ) = o(Ui (T )). Efficiency: An ex-ante efficient mechanism allocates the item to an agent in argmaxi {µit } at each time t (and for each history). The total social welfare obtained byPn ex-ante efficient mechanism up to time T is a T E [ t=1 maxi {µit }]. Let W (T ) b e the exp ected welfare of mechanism M b etween time 1 and T , when all agents are truthful, i.e., T n XX W (T ) = E [ xit µit ] t=1 i=1 2. DEFINITIONS AND NOTATION Supp ose n agents comp eting in each p eriod for a single item. The item is sold rep eatedly at time t = 1, 2, · · · . Denote by uit the nonnegative utility of agent i for the item at time t. Utilities are denominated in a common monetary scale. The utilities of agents may evolve over time according to a stochastic process. We assume that for i = j , the evolution of uit and uj t are indep endent stochastic processes. We also define µit = E [uit |ui1 , · · · , ui,t-1 ]. Throughout this pap er, exp ectations are taken conditioned on the complete history. For simplicity of notation, we now omit those terms that denote such a conditioning. With notational convention, it follows, for example, that E [uit ] = E [µit ]. Here the second exp ectation is taken over all p ossible histories. Let M b e a mechanism used to sell the items. At each time, M allocates the item to one of the agents. Let i b e the agent who has received the item at time t. Define xit to b e the variable indicating the allocation of the item to i at time t. After the allocation, agent i observes her utility, uit , and then rep orts rit , as her utility for the item, to the mechanism. Note that we do not require an agent to know her utility for p ossessing the item in advance of acquiring it. The mechanism then determines the payment, denoted by pit . Definition 1. An agent i is truthful if rit = uit , for al l time xit = 1, t > 0. Our goal is to design a mechanism which has the following prop erties. We assume n, the numb er of agents, is constant. Individual Rationality: A mechanism is ex-post individually rational if for any time T > 0 and any agent 1 i n, the total payment of agent i does not exceed the sum of her rep orts: T X t= 1 Then, M is asymptotical ly ex-ante efficient if: T X max{µit }] - W (T ) = o(W (T )). E[ t= 1 i 3. PROPOSED MECHANISM We build our mechanism on top of a learning algorithm that estimates the exp ected utility of the agents. We refrain from an explicit description of the learning algorithm. Rather, we describ e sufficient conditions for a learning algorithm that can b e extended to a mechanism with all the prop erties we seek (see section 3.1). In section 4 and 5 we give two examples of environments where learning algorithms satisfying these sufficient conditions exist. The mechanism consists of two phases: explore and exploit. During the explore phase, with probability (t), : N [0, 1], the item is allocated for free to a randomly chosen agent. During the exploit phase, the mechanism allocates the item to the agent with the highest estimated exp ected utility. Afterwards, the agent rep orts her utility to the mechanism and the mechanism determines the payment. We first formalize our assumptions ab out the learning algorithm and then we discuss the payment scheme. The mechanism is given in Figure 1. The learning algorithm, samples uit 's at rate (t), and based on the history of the rep orts of agent i, returns an estimate of µit . Let µit (T ) b e the estimate of the algorithm b for µit conditional on the history of the rep orts up to time T . The history of the rep orts of agent i up to time T is the sequence of the rep orted values and times of observation of uit up to but not including time T . Note that we allow T > t. Thus, information at time T > t can b e used to revise an estimate of µit made at some earlier time. We assume that increasing the numb er of samples only increases the xit rit - pit > 0. 181 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising For t = 1, 2, . . . With probability (t), explore: April 21-25, 2008 · Beijing, China Uniformly at random, allocate the item to an agent i, 1 i n. pit 0 With probability 1 - (t), exploit: Randomly allocate the item to an agent i argmaxi {µit (t)}. b Pt-1 Pt-1 pit k=1 yik min{k (t), µik (k)} - k=1 pik b b rit the rep ort of agent i. pj t 0, j = i Figure 1: Mechanism M accuracy of the estimations, i.e. for any truthful agent i, and times T1 T2 : In the inequality ab ove, and in the rest of the pap er, the exp ectations of µit are taken over the evolution of uit 's and b the random choices of the mechanism. For simplicity of notation, we omit those terms that denote such a conditioning. To describ e the payments recall that t is the second highest µit and let t (T ) = maxj =i {µj t (T )}, where i is the agent b b who received the item at time t. We define yit to b e the indicator variable of the allocation of the item to agent i during an exploit phase. The payment of agent i at time t, denoted pit , is determined so that: t X E [|µit (T1 ) - µit |] b E [|µit (T2 ) - µit |]. b (1) pik = k =1 t -1 X We outline the proof first. As we prove in Lemma 2, by condition (C 1), the exp ected profit of a truthful agent up P 1 to time T is at least ( n - o(1))E [ T=1 (t)µit ]. Also, the t exp ected total error in the estiPates of the payments up m T to time T is b ounded by O(E [ t=1 t ]). We prove that the total utility an agent could obtain by deviating from the truthful strategy, b etween time 1 and T , is b ounded by P O(max1tT {µit } + E [ T=1 t ]). Hence, the claim follows t by condition (C 1). Similar to other applications of learning algorithms, we can observe a natural trade-off b etween exploitation and exploration rates in our context: higher exploration rates lead to more accurate estimates of the utilities of the agents, at the cost of efficiency. Condition (C 1) provides us with a lower bound on the exploration rate. k =1 An agent only pays for items that are allocated to her during the exploit phase, up to but not including time t. At time t, the payment of agent i for the item she received at time k < t is min{k (t), µik (k)}. The first term is the remb b iniscence of the second highest pricing scheme. The second term, under some reasonable conditions, leads to individually rationality. Since the estimations of learning algorithm for the utilities of agents b ecome more precise over time, our adaptive cumulative payment scheme allows it to correct for errors in the past. yik min{k (t), µik (k)}. b b Lemma 2. If condition (C 1) holds, then the expected profit of a truthful agent i up to time T , Ui (T ), is at least: T X 1 - o(1))E [ (t)µit ]. n t= 1 ( 3.1 Sufficient Conditions We start with a condition that guarantees asymptotic exante individual rationality and asymptotic incentive compatibility. Let lit b e the last time up to time t that the item is allocated to agent i within an exploit phase. If i has not b een allocated any item yet, lit is defined to b e zero. Also, define t = maxi {|µit (t) - µit |}, assuming all agents were b truthful up to time t. Theorem 1. If for the learning algorithm, for al l 1 i n, and T > 0: P P (C 1) E [max1tT {µit } + T=1 t ] = o(E [ T=1 (t)µit ]) t t Proof. The items that agent i receives during the explore phase are free. The exp ected total utility of agent i P 1 from these items up to time T is n E [ T=1 (t)µit ]. Let t CT = {t < liT |yit = 1, if i is truthful} b e the subset of p eriods that agent i is charged for the item she received within the p eriod. T X E[ xit uit - pit ] t= 1 Ui (T ) = = E[ tCT / then mechanism M is asymptotical ly ex-ante individual ly rational and incentive compatible. T 1X E[ (t)µit ] n t= 1 X +E [ (µit - min{t (T ), µit (t)})] b b tCT X xit uit ] + E [ tCT X uit - pit ] (2) 182 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising For t CT : E [(µit - min{t (T ), µit (t)})I (t CT )] b b E [(µit - µit (t))I (t CT )] b -E [|µit - µit (t)|] b -E [t ] sides, by (1): April 21-25, 2008 · Beijing, China E [min{t (T ), µit (t)}I (t DT CT )] b b E [(t - (t - bt (T ))+ - (t - t (t))+ )I (t DT CT )] b E [t I (t DT CT )] - E [2t ] (6) 2. If t DT \ CT , agent i cannot increase her "exp ected profit", µit - min{t (T ), µit (t)}, by more than O(t ): b b Substituting into inequality (2), by condition (C 1): Ui (T ) = T T X 1X t ] (t)µit ] - E [ E[ n t= 1 t= 1 T T X 1X E[ (t)µit ] - o(E [ (t)µit ]) n t= 1 t= 1 (3) µit - min{t (T ), µit (t)} µit - min{t (T ), bt (t)} b b b (µit - µit (t)) + (µit (t) - t ) b b + max{t - t (T ), t - bt (t)} b 2t + (t - bt (T ))+ + ( t - t ( t ) )+ b Taking exp ectation from b oth sides, by (1): E [(µit - min{t (T ), µit (t)})I (t DT - CT )] b b E [2t I (t DT - CT )] + E [((t - t (T ))+ + (t - bt (t))+ )I (t DT - CT )] b E [4t ] (7) Proof of Theorem 1: Lemma 2 yields asymptotic exante individual rationality. We show that truthfulness is asymptotically a b est resp onse when all other agents are truthful. Fix an agent i intending to deviate and let S b e the strategy she deviates to. Fixing the evolution of all uj t 's, 1 j n, and all random choices of the mechanism, i.e. the steps in the explore phase and the randomly chosen agents, let DT b e the times that i receives the item under strategy S during the exploit phase b efore time liT , i.e. DT = {t < liT |yit = 1, if the strategy of i is S }. Similarly, let CT = {t < liT |yit = 1, if i is truthful}. Also, let µit , and b t corresp ond to the estimates of the mechanism when the b strategy of i is S . We first b ound the exp ected profit of i, under strategy S , during the exploit phase: T X yit uit - pit ] E[ t= 1 Substituting inequalities (6) and (7) into (5): T T X X 6t ] + E [max{µit }] yit uit - pit ] E [ E[ t= 1 t= 1 +E [ tDT CT T X 6t ] + E [max{µit }] + E[ t= 1 X t T µit - t ] (8) µit - t ] E[ (4) E[ t DT E [max{µit }] t T X = E[ µit - min{t (T ), µit (t)}] + b b tDT \CT X tCT X t T µit - t ] - E [ tCT \DT X µit - min{t (T ), µit (t)}] + b b Substituting into (8): For t CT , since µit (t) t (t), we have: b b E [t - µit ] E [2t ] E[ tDT CT t T E [max{µit }] X µit - min{t (T ), µit (t)}] + b b T T X X t ] + E [max{µit }] yit uit - pit ] 8E [ E[ t= 1 t= 1 +E [ (5) tCT X t T µit - t ] With algebraic manipulation, using (1), we get: T T X X t ] + E [max{µit }]) yit uit - pit ] O(E [ E[ t= 1 The term E [maxtT {µit }] b ounds the outstanding payment of agent i; recall that the agent has not paid for the last allocated item. For time t 1, we examine two cases: 1. If t DT CT , then agent i, in exp ectation, cannot decrease the "current price", min{t (T ), µit (t)}, by more b b than O(t ): +E [ tCT X t= 1 t T By condition (C1), we get the inequality b elow which completes the proof: T T X X (t)µit ]) yit uit - pit ] o(E [ E[ t= 1 t= 1 µit - min{t (T ), µit (t)}] b b b b min{t (T ), µit (t)} min{t (T ), bt (t)} b t - max{t - t (T ), t - bt (t)} b t + t - ( t - (T ) ) - ( t - t (t ) )+ b b Recall that t (T ) = maxj =i {µit (T )} and all other agent b b are truthful. Hence, taking exp ectation from b oth where (z )+ = max{z , 0}. +E [ tCT X and the last inequality follows by (C 1). The exp ected utility of the truthful strategy and S during the explore phase b b µit - min{t (T ), µit (t)}] 183 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising is equal. Therefore, by Lemma 2, the mechanism is asymptotically incentive compatible. 2 In the next theorem we show if the loss in efficiency during exploration asymptotically goes to zero, then by Condition (C 1) the mechanism is asymptotically ex-ante efficient. Theorem 3. If for the learning algorithm, in addition to (C 1), the fol lowing condition holds P P (C 2) E [ T=1 (t) maxi {µit }] = o(E [ T=1 maxi {µit }]) t t Proof. M may fail to b e ex-ante efficient for two reasons. First one is the loss in welfare during the exploration when the item is allocated randomly to one of advertisers. The P exp ected loss in this case is equal to E [ T=1 (t) maxi {µit }]. t Another reason is the mistakes during exploitation. The error in estimation can lead to allocation to an agent who does not value the item the most. At time t, in the worst case, the item might b e allocated to an agent whose exp ected utility is at most 2t less than the highest exp ected utility. Therefore, the exp ected efficiency loss during exploration is P b ounded by O(E [ T=1 t ]). Since, for the exp ected welfare t of M b etween time 1 and T , denoted by W (T ), we have: T X E[ max{µit }] - W (T ) t= 1 i April 21-25, 2008 · Beijing, China 3.2 Allowing agents to bid In mechanism M no agent explicitly bids for an item. Whether an agent receives an item or not dep ends on the history of their rep orted utilities and the estimates that M forms from them. This may b e advantageous when the bidders themselves are unaware of what their utilities will b e. However, when agents may p osses a b etter estimate of their utilities we would like to make use of that. For this reason we describ e how to modify M so as to allow agents to bid for an item. If time t occurs during an exploit phase let Bt b e the set of the agents who bid at this time. The mechanism bids on the b ehalf of all agent i Bt . Denote by bit the bid of / agent i Bt for the item at time t. The modification of M sets bit = µit (t), for i B . Then, the item is allocated at b / random to one of the agents in arg maxi bit . If i is the agent who received the item at time t, let A = {bj t |j Bt } {µj t |, j Bt }. Define t as the sec/ ond highest value in A. Let t (T ) to b e equal to maxj =i bj k . b The payment of agent i will b e pit t -1 X k =1 then, M is asymptotical ly ex-ante efficient. To incorp orate the fact that bidders can bid for an item, we must modify the definition of truthfulness. Definition 2. Agent i is truthful if: (9) 1. rit = uit , for al l time xit = 1, t 1. 2. If i bids at time t, then E [|bit - µit |] E [|µit - µit |]. b yik min{k (t), bik } - b t -1 X pik . k =1 But, condition (C 1) implies: T X E[ t ] = t= 1 T X = O(E [ (t + (t) max{µit })]) t= 1 i n T XX (t)µit }]) o(E [ t=1 i=1 i = Plugging into (9): T X (E [ (t) max{µit }]) t= 1 Note that item 2 does not require that agent i bid their actual utility only that their bid b e closer to the mark than the estimate. With this modification in definition, Theorems 1 and 3 continue to hold. T T X X (t) max{µit }]) max{µit }] - W (T ) = O(E [ E[ t= 1 i t= 1 i 4. INDEPENDENT AND IDENTICALLY DISTRIBUTED UTILITIES In this section, we assume that for each i, uit 's are indep endent and identically-distributed random variables. For simplicity, we define µi = E [uit ], t > 0. Without loss of generality, we also assume 0 < µi 1. In this environment, the learning algorithm we use is an greedy algorithm for the multi-armed bandit problem3 . Let P nit = t-1 xit . For (0, 1), we define: k =1 nit = (t) = min{1, nt- ln1+ t} (P ( T=1 xik rik )/niT , k µit (T ) = b 0, t -1 X = o(W (T )) The last equality is followed by (C 2) and implies asymptotic ex-ante efficiency. While Condition (C 1) gives a lower b ound on the exploration rate, Condition (C 2) gives an upper bound. In the next section, we will show with two examples how conditions (C 1) and (C 2) can b e used to adjust the exploration rate of a learning algorithm in order to obtain efficiency and incentive compatibility. Remark 1. In Theorem 3 we showed that under some assumptions, the welfare obtained by the mechanism is asymptotical ly equivalent to efficient mechanism that every time al locates the item to the agent with the highest expected utility. We can give similar conditions to (C 2) to guarantee that the revenue of the mechanism is also asymptotical ly equal to the revenue of the efficient mechanism that every time charges the winning agent the second highest expected utility. To avoid repetition, we refrain from explaining this condition in details. xit k =1 niT > 0 niT = 0 Call the mechanism based on this learning algorithm M (iid). Lemma 4. If al l agents are truthful, then, under M (iid) E [t ] = O( 3 1 . t1- ) See [3] for a similar algorithm. 184 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising The proof of this lemma is given in app endix A. 1 We show that M (iid), for 3 , satisfies all the desired prop erties we discussed in the previous section. Moreover, it satisfies a stronger notion of individual rationality. M (iid) satisfies ex-post individual rationality if for any agent i, and for all T 1: T X t= 1 April 21-25, 2008 · Beijing, China exp ected value of the error of these estimates at time t is 1 o(t 6 ). We b egin analyzing the mechanism by stating some wellknown prop erties of reflected Brownian motions (see [7]). Proposition 6. Let [Wt , t 0] be a reflected Brownian motion with mean zero and variance 2 ; the reflection barrier is 0. Assume the value of Wt at time t is equal to y : (12) E [y ] = ( t 2 ) For T > 0, let z = Wt+T . For the probability density tion of z - y we have: r -x 2 2 Pr[(z - y ) dx] e 2T 2 T 2 r 8T 2 1 -x22 Pr[|z - y | x] e 2T x r 8T 2 -x2 E [|z - y |I (|z - y | x)] e 2T 2 func- pit T X t= 1 xit rit Theorem 5. M (iid) is ex-post individual ly rational. Also, for 0 1 , M (iid) is asymptotical ly incentive compati3 ble and ex-ante efficient. Proof. We first prove ex-p ost individual rationality. It is sufficient to prove it only for the p eriods that agent i has received the item within an exploit phase. For T , such that yiT = 1, we have: T X t= 1 T -1 X t= 1 T -1 X t= 1 (13) (14) (15) pit = yit min{t (T ), µit (t)} b b yit bt (T ) T -1 X t= 1 T -1 X t= 1 The third inequality follows b ecause the item is allocated b b to i at time T which implies µiT (T ) t (T ). We complete the proof by showing that conditions (C 1) and (C 2) hold. Note that µi 1. By lemma 4, for 1 : 3 E [1+ T -1 X t= 1 nit µiT (T ) = b xit rit yit µiT (T ) b Corollary 7. The expected value of the maximum of µiT , 1 i n, is ( T ). Note that in the corollary ab ove n and are constant. Now, similar to Lemma 4, we b ound E [T ]. The proof is given in app endix B. Lemma 8. Suppose under M (B) al l agents are truthful 2 until time T , then, E [T ] = O(T ). Now we are ready to prove the main theorem of this section: Theorem 9. M (B) is ex-post individual ly rational. Also, for 0 1 , M (B) is asymptotical ly incentive compatible 3 and ex-ante efficient. Proof. We first prove ex-p ost individual rationality. It is sufficient to prove it only for the p eriods that agent i has received the item within an exploit phase. For T , such that yiT = 1, we have: T X t= 1 T -1 X t= 1 T -1 X t= 1 t ] = O ( T 1+ 2 ) = o(T 1- ln1+ T ) = O( T X t= 1 (t)µi ). Therefore, (C 1) holds. The welfare of any mechanism b etP en time 1 and T is we - b ounded by T . For any > 0, E [1 + T=11 t + t ] = o(T ) t which implies (C 2). 5. BROWNIAN MOTION In this section, we assume for each i, 1 i n, the evolution of µit is a reflected Brownian motion with mean 2 zero and variance i ; the reflection barrier is 0. In addition, 2 we assume µi0 = 0, and i 2 , for some constant . The mechanism observes the values of µit at discrete times t = 1, 2, · · · . In this environment our learning algorithm estimates the reflected Brownian motion using a mean zero martingale. We define lit is defined as the last time up to time t that the item is allocated to agent i. This includes b oth explore and exploit phases. If i has not b een allocated any item yet, lit is zero. (t ) = µit (T ) = b min{1, nt- ln2+2 t} 8 trilit < r t=T > ili,t-1 :r t>T ili,T pit = yit min{t (T ), µit (t)} b b yit µit (t) = b T -1 X t= 1 yit rili,t-1 T X t= 1 xit rit. (10) (11) We complete the proof by showing the conditions (C 1) and (C 2) hold. By (12), the exp ected utility of each agent at time t from random exploration is 1 ( t 2 t- ln1+ t) = (t 2 - ln1+ t). Therefore, the exp ected utility up to time T from explo3 ration is (T 2 - ln1+ T ). By Lemma (8) and Corollary 7: E [max{µiT } + t T T -1 X t= 1 Call this mechanism M (B). For simplicity, we assume that the advertiser rep orts the exact value of µit . It is not difficult to verify that the results in this section hold as long as the t ] = O(T 1+ ). 2 For 13 , 32 - 1+ 2 this yields Condition(C 1). 185 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising By Corollary 7, the exp ected value of maxi {µiT } and T are ( T ). Therefore, the exp ected welfare of an efficient 3 mechanism b etween time 1 and T is (T 2 ). For any 0 < < 1, we have: (T 2 ) = (T 2 - ln1+ 3 3 April 21-25, 2008 · Beijing, China t + T 1+ ) 2 By condition (C 2), M (B) is asymptotically ex-ante efficient. To apply this model to sp onsored search we treat each item as a bundle of search queries. Each time step is defined by the arrival of m queries. The mechanism allocates all m queries to an advertiser and after that, the advertiser rep orts the average utility for these queries. The payment pit is now the price p er item, i.e. the advertiser pays mpit for the bundle of queries. The value of m is chosen such that µit can b e estimated with high accuracy. 6. DISCUSSION AND OPEN PROBLEMS In this section we discuss some extensions of the mechanisms. multiple slots we b orrow from Gonen and Pavlov [12], who assume there exist a set of conditional distributions which determine the conditional probability that the ad in slot j1 is clicked conditional on the ad in slot j2 b eing clicked. During the exploit phase, M allocates the slots to the advertisers with the highest exp ected utility, and the prices are determined according to Holmstrom's lemma ([19], see also [1]) The estimates of the utilities are up dated based on the rep orts, using the conditional distribution. Multiple Slots. To modify M so that it can accommodate Delayed Reports. In some applications, the value of receiving the item is realized at some later date. For example, a user clicks on an ad and visits the website of the advertiser. A couple of days later, she returns to the website and completes a transaction. It is not difficult to adjust the mechanism to accommodate this setting by allowing the advertiser to rep ort with a delay or change her rep ort later. Creating Multiple Identities. When a new advertiser joins the system, in order to learn her utility value our mechanism gives it a few items for free in the explore phase. Therefore our mechanism is vulnerable to advertisers who can create several identities and join the system. It is not clear whether creating a new identity is cheap in our context b ecause the traffic generated by advertising should eventually b e routed to a legitimate business. Still, one way to avoid this problem is to charge users without a reliable history using CPC. Acknowledgment. We would like to thank Arash Asadp our, Peter Glynn, Ashish Goel, Ramesh Johari, and Thomas Web er for fruitful discussions. The second author acknowledges the supp ort from NSF and a gift from Google. 7. [1] G. Aggarwal, A. Goel, and R. Motwani. Truthful auctions for pricing search keywords. Proceedings of ACM conference on Electronic Commerce, 2006. REFERENCES [2] S. Athey, and I. Segal. An Efficient Dynamic Mechanism. manuscript, 2007. [3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning archive, Volume 47 , Issue 2-3, 235-256, 2002. [4] A. Bapna, and T. Web er. Efficient Dynamic Allocation with Uncertain Valuations. Working Paper, 2006. [5] M. Balcan, A. Blum, J. Hartline, and Y. Mansour. Mechanism Design via Machine Learning. Proceedings of 46th Annual IEEE Symposium on Foundations of Computer Science, 2005. [6] D. Bergemann, and J. Valimaki. Efficient Dynamic ¨¨ Auctions. Proceedings of Third Workshop on Sponsored Search Auctions, 2007. [7] A. Borodin, and P. Salminen. Handb ook of Brownian Motion: Facts and Formulae. Springer, 2002. [8] A. Blum, V. Kumar, A. Rudra, and F. Wu. Online Learning in Online Auctions. Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete Algorithms, 2003. [9] R. Cavallo, D. Parkes, and S. Singh, Efficient Online Mechanism for Persistent, Periodically Inaccessible Self-Interested Agents. Working Pap er, 2007. [10] K. Crawford. Google CFO: Fraud A Big Threat. CNN/Money, Decemb er 2, 2004. [11] J. Gittins. Multi-Armed Bandit Allocation Indices. Wiley, New York, NY, 1989. [12] R. Gonen, and E. Pavlov. An Incentive-Compatible Multi-Armed Bandit Mechanism. Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, 2007. [13] B. Grow, B. Elgin, and M. Herbst. Click Fraud: The dark side of online advertising. BusinessWeek. Cover Story, Octob er 2, 2006. [14] N. Immorlica, K. Jain, M. Mahdian, and K. Talwar. Click Fraud Resistant Methods for Learning Click-Through Rates. Proceedings of the 1st Workshop on Internet and Network Economics, 2005. [15] B. Kitts, P. Laxminarayan, B. LeBlanc, and R. Meech. A Formal Analysis of Search Auctions Including Predictions on Click Fraud and Bidding Tactics. Workshop on Sponsored Search Auctions, 2005. [16] R. Kleinb erg. Online Decision Problems With Large Strategy Sets. Ph.D. Thesis, MIT, 2005. [17] S. Lahaie, and D. Parkes. Applying Learning Algorithms to Preference Elicitation. Proceedings of the 5th ACM conference on Electronic Commerce, 2004. [18] M. Mahdian, and K. Tomak. Pay-p er-action model for online advertising. Proceedings of the 3rd International Workshop on Internet and Network Economics, 549-557, 2007. [19] P. Milgrom, Putting Auction Theory to Work. Cambridge University Press, 2004. [20] D. Mitchell. Click Fraud and Halli-bloggers. New York Times, July 16, 2005. [21] N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, editors. Algorithmic Game Theory, Cambridge University Press, 2007. [22] D. Parkes. Online Mechanisms Algorithmic Game Theory (Nisan et al. eds.), 2007. 186 WWW 2008 / Refereed Track: Internet Monetization - Online Advertising [23] B. Stone. When Mice Attack: Internet Scammers Steal Money with "Click Fraud". Newsweek, January 24, 2005. [24] R. Wilson. Game-Theoretic Approaches to Trading Processes. Economic Theory: Fifth World Congress, ed. by T. Bewley, chap. 2, pp. 33-77, Cambridge University Press, Cambridge, 1987. [25] J. Wortman, Y. Vorob eychik, L. Li, and J. Langford. Maintaining Equilibria During Exploration in Sp onsored Search Auctions. Proceedings of the 3rd International Workshop on Internet and Network Economics, 2007. April 21-25, 2008 · Beijing, China b = Pr[|µi - µit (t)| 2e = o( E [nit ] 1 ] i nit 2 t1- µ E [nit ] 1 + Pr[|µi - µit (t)| b ] i nit < 2 t1- µ - 11 t1- ln1+ t µi t- 2d +e -t1- ln1+ t 8d 1 ), c > 0 tc APPENDIX A. PROOF OF LEMMA 4 Therefore, with probability 1 - o( 1 ), for all agents, t t 1 Since the maximum value of uit is 1, E [t ] = t1- . O( t1- ). 1 B. PROOF OF LEMMA 8 Proof. We prove the lemma by showing that for any agent i, Pr[|µi - µit (t)| b 1 t1- i] µ = o( 1 ), c > 0. tc Proof. Define Xit = |µi,T - µi,T -t |. We first prove 2 Pr[Xit > T ] = o( T1c ), c > 0. There exists a constant Td such that for any time T Td , the probability that i has not b een randomly allocated the item in the last t < Td step is at most: Pr[T - li,T -1 > t] < (1 - T - ln2+2 Let t = t 1 ln1+ T Tt First, we estimate E [nit ]. There exists a constant d such that: t -1 t -1 X X (k ) 1 = min{ , k- ln1+ E [nit ] n n k =1 k =1 T 2 . ) e -t ln2+2 T . T (16) By equation (14) and (16), 2 k 1 } > t1- ln1+ d Pr[Xit > T ] = Pr[Xit > T By the Chernoff-Hoeffding b ound: Pr[nit -t1- ln1+ E [nit ] 8d ]e 2 t . Inequality (1) and the Chernoff-Hoeffding b ound imply: Pr[|µi - µit (t)| b 1 t1- i] µ = + Pr[Xit > T T - li,T -1 > t] 1 = o( c ), c > 0. T Hence, with high probability, for all the n agents, Xit 2 2 T . If for some of the agents Xit T , then, by Corollary 7, the ep ected value of the maximum of µit over these x 2 agents is ( T ). Therefore, E [maxi {Xit }] = O(T ). The lemma follows b ecause E [T ] E [maxi {Xit }]. T - li,T -1 t] 2 187