{"title": "Repeated Contextual Auctions with Strategic Buyers", "book": "Advances in Neural Information Processing Systems", "page_first": 622, "page_last": 630, "abstract": "Motivated by real-time advertising exchanges, we analyze the problem of pricing inventory in a repeated posted-price auction. We consider both the cases of a truthful and surplus-maximizing buyer, where the former makes decisions myopically on every round, and the latter may strategically react to our algorithm, forgoing short-term surplus in order to trick the algorithm into setting better prices in the future. We further assume a buyer\u2019s valuation of a good is a function of a context vector that describes the good being sold. We give the first algorithm attaining sublinear (O(T^{2/3})) regret in the contextual setting against a surplus-maximizing buyer. We also extend this result to repeated second-price auctions with multiple buyers.", "full_text": "Repeated Contextual Auctions with Strategic Buyers\n\nKareem Amin\n\nUniversity of Pennsylvania\n\nakareem@cis.upenn.edu\n\nAfshin Rostamizadeh\n\nGoogle Research\n\nrostami@google.com\n\nUmar Syed\n\nGoogle Research\n\nusyed@google.com\n\nAbstract\n\nMotivated by real-time advertising exchanges, we analyze the problem of pricing\ninventory in a repeated posted-price auction. We consider both the cases of a truthful\nand surplus-maximizing buyer, where the former makes decisions myopically\non every round, and the latter may strategically react to our algorithm, forgoing\nshort-term surplus in order to trick the algorithm into setting better prices in the\nfuture. We further assume a buyer\u2019s valuation of a good is a function of a context\nvector that describes the good being sold. We give the first algorithm attaining\nsublinear ($\\tilde{O}(T^{2/3})$) regret in the contextual setting against a surplus-maximizing\nbuyer. 
We also extend this result to repeated second-price auctions with multiple\nbuyers.\n\n1 Introduction\n\nA growing fraction of Internet advertising is sold through automated real-time ad exchanges. In\na real-time ad exchange, after a visitor arrives on a webpage, information about that visitor and\nwebpage, called the context, is sent to several advertisers. The advertisers then compete in an auction\nto win the impression, or the right to deliver an ad to that visitor. One of the great advantages of\nonline advertising compared to advertising in traditional media is the presence of rich contextual\ninformation about the impression. Advertisers can be particular about whom they spend money\non, and are willing to pay a premium when the right impression comes along, a process known\nas targeting. Specifically, advertisers can use context to specify which auctions they would like to\nparticipate in, as well as how much they would like to bid. These auctions are most often second-price\nauctions, wherein the winner is charged either the second highest bid or a prespecified reserve\nprice (whichever is larger), and no sale occurs if the reserve price isn\u2019t cleared by one of the bids.\nOne side-effect of targeting, which has been studied only recently, is the tendency for such exchanges\nto generate many auctions that are rather uncompetitive or thin, in which few advertisers are willing\nto participate. Again, this stems from the ability of advertisers to examine information about the\nimpression before deciding to participate. While this selectivity is clearly beneficial for advertisers,\nit comes at a cost to webpage publishers. Many auctions in real-time ad exchanges ultimately involve\njust a single bidder, in which case the publisher\u2019s revenue is entirely determined by the selection of\nreserve price. 
Although a lone advertiser may have a high valuation for the impression, a low reserve\nprice will fail to extract this as revenue for the seller if the advertiser is the only participant in the\nauction.\nAs observed by [2], if a single buyer is repeatedly interacting with a seller, selecting revenue-maximizing\nreserve prices (for the seller) reduces to revenue-maximization in a repeated posted-price\nsetting: On each round, the seller offers a good to the buyer at a price. The buyer observes her\nvalue for the good, and then either accepts or rejects the offer. The seller\u2019s price-setting algorithm is\nknown to the buyer, and the buyer behaves to maximize her (time-discounted) cumulative surplus,\ni.e., the total difference between the buyer\u2019s value and the price on rounds where she accepts the\noffer. The goal of the seller is to extract nearly as much revenue from the buyer as would have been\npossible if the process generating the buyer\u2019s valuations for the goods had been known to the seller\nbefore the start of the game. In [2] this goal is called minimizing strategic regret.\nOnline learning algorithms are typically designed to minimize regret in hindsight, which is defined\nas the difference between the loss of the best action and the loss of the algorithm given the observed\nsequence of events. Furthermore, it is assumed that the observed sequence of events is generated\nadversarially. However, in our setting, the buyer behaves self-interestedly, which is not necessarily\nthe same as behaving adversarially, because the interaction between the buyer and seller is not\nzero-sum. A seller algorithm designed to minimize regret against an adversary can perform very\nsuboptimally. Consider an example from [2]: a buyer who has a large valuation $v$ for every good.\nIf the seller announces an algorithm that minimizes (standard) regret, then the buyer should respond\nby only accepting prices below some $\\epsilon \\ll v$. 
In hindsight, posting a price of $\\epsilon$ in every round would\nappear to generate the most revenue for the seller given the observed sequence of buyer actions,\nand therefore $\\epsilon T$ cumulative revenue is \u201cno-regret\u201d. However, the seller was tricked by the strategic\nbuyer; there was $(v - \\epsilon)T$ revenue left on the table. Moreover, this is a good strategy for the buyer\n(she must have won the good for nearly nothing on $\\Omega(T)$ rounds).\nThe main contribution of this paper is extending the setting described above to one where the buyer\u2019s\nvaluations in each round are a function of some context observed by both the buyer and seller.\nWhile [2] is motivated by the same application, they imagine an overly simplistic model wherein\nthe buyer\u2019s value is generated by drawing an independent $v_t$ from an unknown distribution $\\mathcal{D}$. This\nignores that $v_t$ will in reality be a function of contextual information $x_t$, information that is available\nto the seller, and the entire reason auctions are thin to begin with (without $x_t$ there would be no\ntargeting). We give the first algorithm that attains sublinear regret in the contextual setting, against a\nsurplus-maximizing buyer. We also note that in the non-contextual setting, regret is measured against\nthe revenue that could have been made if $\\mathcal{D}$ were known, and the single fixed optimal price were\nselected. Our comparator will be more challenging as we wish to compete with the best function (in\nsome class) from contexts $x_t$ to prices.\nThe rest of the paper is organized as follows. We first introduce a linear model by which values $v_t$ are\nderived from contexts $x_t$. We then demonstrate an algorithm based on stochastic gradient descent\n(SGD) which achieves sublinear regret against a truthful buyer (one that accepts price $p_t$ iff $p_t \\le v_t$\non every round $t$). The analysis for the truthful buyer uses preexisting high-probability bounds for\nSGD when minimizing strongly convex functions [22]. 
Our main result requires an extension of\nthis analysis to cases in which \u201cincorrect\u201d gradients are occasionally observed. This lets us study\na buyer that is allowed to best-respond to our algorithm, possibly rejecting offers that the truthful\nbuyer would not, in order to receive better offers on future rounds. We also adapt our algorithm\nto non-linear settings via a kernelized version of the algorithm. Finally, we extend our results to\nsecond-price auctions with multiple buyers.\nRelated Work: The pricing of digital goods in repeated auctions has been considered by many other\nauthors, including [2, 17, 4, 3, 6, 19]. However, most of these papers do not consider a buyer who\nbehaves strategically across rounds. Buyers either behave randomly [19], or only participate in a\nsingle round [17, 4, 3, 6], or participate in multiple rounds but only desire a single good [20, 12]\nand therefore, in each of these cases, are not incentivized to manipulate the seller\u2019s behavior on\nfuture rounds. In reality buyers repeatedly interact with the same seller. There is empirical evidence\nsuggesting that buyers are not myopic, and do in fact behave strategically to induce better prices in\nthe future [9], as well as literature studying different strategies for strategic buyers [5, 15, 16].\nRepeated posted-price auctions against the same strategic buyer have been considered in the economics\nliterature under the heading of behavior-based price discrimination (BBPD) by [13, 23, 1,\n11], and more recently by [8]. These works differ from ours in two key ways. First, all these works\nimagine that the buyer\u2019s type is drawn from some fixed publicly available distribution. Therefore\nlearning $\\mathcal{D}$ is not at issue. In contrast, we argue that access to an accurate prior is particularly problematic\nin these settings. 
After all, the seller cannot expect to reliably estimate $\\mathcal{D}$ from data when\nthe buyer is explicitly incentivized to hide its type (as illustrated in the Introduction; see also [2]).\nThis tension between learning and buyer truthfulness is in many ways central to our study.\nSecondly, given a fixed prior, the most common solution concept in the BBPD literature is a perfect\nBayes-Nash equilibrium, in which both the seller and buyer strategies are best responses to each\nother. However, in the context of Internet advertising, a seller must first deploy an algorithm which\nautomates the pricing strategy, and buyers subsequently react to the observed behavior of the pricing\nalgorithm. Any modifications the seller wishes to make to the pricing algorithm will typically\nrequire changes to the end-user licensing agreement, which the seller will not want to do too frequently.\nTherefore, in this paper, we make a commitment assumption on the seller: the seller acts\nfirst, announcing its pricing strategy, after which the buyer plays a best response strategy. Such\nStackelberg models of commitment [10] have sparked a great deal of recent interest due to their success\nin security games (see [7] and [18] for an overview), including practical deployment [21, 14].\n\n2 Preliminaries\n\nThroughout this work, we will consider a repeated auction where at every round a single seller\nprices an item to sell to a single buyer (extensions to multiple buyers are discussed in Section 5).\nThe good sold at step $t$ in the repeated auction is represented by a context (feature) vector $x_t \\in \\mathcal{X} =\n\\{x : \\|x\\|_2 \\le 1\\}$ and is drawn according to a fixed distribution $\\mathcal{D}$, which is unknown to the seller. The\ngood has a value $v_t$ that is a linear function of a parameter vector $w^*$, also unknown to the seller,\n$v_t = w^{*\\top} x_t$ (extensions to non-linear functions of the context are considered in Section 5). 
We\nassume that $w^* \\in \\mathcal{W} = \\{w : \\|w\\|_2 \\le 1\\}$ and also that $0 \\le w^{*\\top} x \\le 1$ with probability one with\nrespect to $\\mathcal{D}$.\nFor rounds $t = 1, \\ldots, T$ the repeated posted-price auction is defined as follows: (1) The buyer and\nseller both observe $x_t \\sim \\mathcal{D}$. (2) The seller offers a price $p_t$. (3) The buyer selects $a_t \\in \\{0, 1\\}$. (4)\nThe seller receives revenue $a_t p_t$.\nHere, $a_t$ is an indicator variable that represents whether or not the buyer accepted the offered price\n(1 indicates yes). The goal of the seller is to select a price $p_t$ in each round $t$ such that the expected\nregret $R(T) = \\mathbb{E}\\left[\\sum_{t=1}^{T} (v_t - a_t p_t)\\right]$ is $o(T)$. The choice of $a_t$ will depend on the buyer\u2019s behavior.\nWe will analyze two types of buyers in the subsequent sections of the paper: truthful and surplus-maximizing\nbuyers, and will attempt to minimize regret against the truthful buyer and regret against\nthe surplus-maximizing buyer. Note, the regret is the difference between the maximum revenue\npossible and the amount made by the algorithm that offers prices to the buyer.\n\n3 Truthful Buyer\n\nIn this section we introduce the Learn-Exploit Algorithm for Pricing (LEAP), which we show has\nregret of the form $O(T^{2/3}\\sqrt{\\log(T)})$ against a truthful buyer. A buyer is truthful if she accepts\nany offered price that gives a non-negative surplus, which is defined as the difference between the\nbuyer\u2019s value for the good and the price paid: $v_t - p_t$. Therefore, for a truthful buyer we define\n$a_t = \\mathbf{1}\\{p_t \\le v_t\\}$.\nAt this point, we note that the loss function $v_t - \\mathbf{1}\\{p_t \\le v_t\\} p_t$, which we wish to minimize over\nall rounds, is not convex, differentiable or even continuous. If the price is even slightly above the\ntruthful buyer\u2019s valuation it is rejected and the seller makes zero revenue. To circumvent this our\nalgorithm will attempt to learn $w^*$ directly by minimizing a surrogate loss function for which $w^*$\nis the minimizer. 
Our analysis hinges on recent results [22] which give optimal rates for gradient\ndescent when the function being minimized is strongly convex. Our key trick is to offer prices so\nthat, in each round, the buyer\u2019s behavior reveals the gradient of the surrogate loss at our current\nestimate for $w^*$. Below we define the LEAP algorithm (Algorithm 1), which we show addresses\nthese difficulties in the online setting.\nThe algorithm depends on input parameters $\\alpha$, $\\epsilon$ and $\\lambda$. The $\\alpha$ parameter determines what fraction\nof rounds are spent in the learning phase as opposed to the exploit phase. During the learning phase,\nuniform random prices are offered and the model parameters are updated as a function of the feedback\ngiven by the buyer. During the exploit phase, the model parameters are fixed and the offered\nprice is computed as a linear function of these parameters minus the value of the $\\epsilon$ parameter. The\n$\\epsilon$ parameter can be thought of as inversely proportional to our confidence in the fixed model parameters\nand is used to hedge against the possibility of over-estimating the value of a good. The $\\lambda$\nparameter is a learning-rate parameter set according to the minimum eigenvalue of the covariance\nmatrix, and is defined below in Assumption 1. In order to prove a regret bound, we first show that\nthe learning phase of the algorithm is minimizing a strongly convex surrogate loss and then show\nthat this implies the seller enjoys near optimal revenue during the exploit phase of the algorithm.\n\nAlgorithm 1 LEAP algorithm\n\n\u2022 Let $0 \\le \\alpha \\le 1$, $w_1 = 0 \\in \\mathcal{W}$, $\\epsilon \\ge 0$, $\\lambda > 0$, $T_\\alpha = \\lceil \\alpha T \\rceil$.\n\u2022 For $t = 1, \\ldots, T_\\alpha$ (Learning phase)\n\u2013 Offer $p_t \\sim U$, where $U$ is the uniform distribution on the interval $[0, 1]$.\n\u2013 Observe $a_t$.\n\u2013 $\\tilde{g}_t = 2(w_t \\cdot x_t - a_t) x_t$.\n\u2013 $w_{t+1} = \\Pi_{\\mathcal{W}}(w_t - \\frac{1}{\\lambda t} \\tilde{g}_t)$.\n\u2022 For $t = T_\\alpha + 1, \\ldots, T$ (Exploit phase)\n\u2013 Offer $p_t = w_{T_\\alpha+1} \\cdot x_t - \\epsilon$.\n\nLet $g_t = 2(w_t^\\top x_t - \\mathbf{1}\\{p_t \\le v_t\\}) x_t$ and $F(w) = \\mathbb{E}_{x \\sim \\mathcal{D}}\\left[(w^{*\\top} x - w^\\top x)^2\\right]$. Note that when the\nbuyer is truthful $\\tilde{g}_t = g_t$. Against a truthful buyer, $g_t$ is an unbiased estimate of the gradient of $F$.\nProposition 1. The random variable $g_t$ satisfies $\\mathbb{E}[g_t \\mid w_t] = \\nabla F(w_t)$. Also, $\\|g_t\\| \\le 4$ with\nprobability 1.\nProof. First note that $\\mathbb{E}[g_t \\mid w_t] = \\mathbb{E}_{x_t}\\left[2(w_t \\cdot x_t - \\mathbb{E}_{p_t}[\\mathbf{1}\\{p_t \\le v_t\\}]) x_t\\right] = \\mathbb{E}_{x_t}\\left[2(w_t \\cdot x_t - \\Pr_{p_t}(p_t \\le v_t)) x_t\\right]$.\nSince $p_t$ is drawn uniformly from $[0, 1]$ and $v_t$ is guaranteed to lie in $[0, 1]$ we have that\n$\\Pr(p_t \\le v_t) = \\int_0^1 \\mathbf{1}\\{p_t \\le v_t\\}\\,dp_t = v_t$. Plugging this back in and using $v_t = w^{*\\top} x_t$ gives us exactly the expression\nfor $\\nabla F(w_t)$. Furthermore, $\\|g_t\\| = 2|w_t^\\top x_t - \\mathbf{1}\\{p_t \\le v_t\\}|\\,\\|x_t\\| \\le 4$ since $|w_t^\\top x_t| \\le \\|w_t\\|\\|x_t\\| \\le 1$\nand $\\|x_t\\| \\le 1$.\nWe now introduce the notion of strong convexity. A twice-differentiable function $H(w)$ is $\\lambda$-strongly\nconvex if and only if the Hessian matrix $\\nabla^2 H(w)$ is full rank and the minimum eigenvalue\nof $\\nabla^2 H(w)$ is at least $\\lambda$. Note that the function $F$ is strongly convex if and only if the covariance\nmatrix of the data is full-rank, since $\\nabla^2 F(w) = 2\\mathbb{E}_x[xx^\\top]$. We make the following assumption.\nAssumption 1. The minimum eigenvalue of $2\\mathbb{E}_x[xx^\\top]$ is at least $\\lambda > 0$.\n\nNote that if this is not the case then there is redundancy in the features and the data can be projected\n(for example using PCA) into a lower dimensional feature space with a full-rank covariance\nmatrix and without any loss in information. 
The seller can compute an offline estimate of both this\nprojection and $\\lambda$ by collecting a dataset of context vectors before starting to offer prices to the buyer.\nThus, in view of Proposition 1 and the strong convexity assumption, we see the learning phase of\nthe LEAP algorithm is conducting a stochastic gradient descent to minimize the $\\lambda$-strongly convex\nfunction $F$, where at each time step we update $w_{t+1} = \\Pi_{\\mathcal{W}}(w_t - \\frac{1}{\\lambda t} \\tilde{g}_t)$ and $\\tilde{g}_t = g_t$ is an unbiased\nestimate of the gradient. We now make use of an existing bound ([22]) for stochastic gradient\ndescent on strongly convex functions.\nLemma 1 ([22] Proposition 1). Let $\\delta \\in (0, 1/e)$, $T_\\alpha \\ge 4$ and suppose $F$ is $\\lambda$-strongly convex over\nthe convex set $\\mathcal{W}$. Also suppose $\\mathbb{E}[g_t \\mid w_t] = \\nabla F(w_t)$ and $\\|g_t\\|^2 \\le G^2$ with probability 1. Then\nwith probability at least $1 - \\delta$ for any $t \\le T_\\alpha$ it holds that\n\n$$\\|w_t - w^*\\|^2 \\le \\frac{(624 \\log(\\log(T_\\alpha)/\\delta) + 1) G^2}{\\lambda^2 t},$$\n\nwhere $w^* = \\operatorname{argmin}_w F(w)$.\n\nThis guarantees that, with high probability, the distance between the learned parameter vector $w_t$\nand the target weight vector $w^*$ is bounded and decreasing as $t$ increases. This allows us to carefully\ntune the $\\epsilon$ parameter that is used in the exploit phase of the algorithm (see Lemma 6 in the appendix).\nWe are now equipped to prove a bound on the regret of the LEAP algorithm.\nTheorem 1. For any $T > 4$ and $0 < \\alpha < 1$, the LEAP algorithm\nwith $\\epsilon = \\sqrt{\\frac{(624 \\log(\\sqrt{T_\\alpha} \\log(T_\\alpha)) + 1) G^2}{\\lambda^2 T_\\alpha}}$, where $G = 4$, has regret against a truthful buyer at most\n\n$$R(T) \\le 2\\alpha T + 4\\sqrt{\\frac{T}{\\alpha}} \\sqrt{\\frac{(624 \\log(\\sqrt{T_\\alpha} \\log(T_\\alpha)) + 1) G^2}{\\lambda^2}},$$\n\nwhich implies for $\\alpha = T^{-1/3}$ a regret at most\n\n$$R(T) \\le 2T^{2/3} + 4T^{2/3} \\sqrt{\\frac{(624 \\log(T^{1/3} \\log(T^{2/3})) + 1) G^2}{\\lambda^2}} = O\\left(T^{2/3} \\sqrt{\\log(T)}\\right).$$\n\nProof. We first decompose the regret\n\n$$\\mathbb{E}\\left[\\sum_{t=1}^{T} (v_t - a_t p_t)\\right] = \\mathbb{E}\\left[\\sum_{t=1}^{T_\\alpha} (v_t - a_t p_t)\\right] + \\mathbb{E}\\left[\\sum_{t=T_\\alpha+1}^{T} (v_t - a_t p_t)\\right] \\le T_\\alpha + \\sum_{t=T_\\alpha+1}^{T} \\mathbb{E}[v_t - a_t p_t], \\quad (1)$$\n\nwhere we have used the fact $|v_t - a_t p_t| \\le 1$. Let $A$ denote the event that, for all $t \\in \\{T_\\alpha + 1, \\ldots, T\\}$,\n$a_t = 1 \\wedge v_t - p_t \\le \\epsilon$. Lemma 6 (see Appendix, Section A.1) proves that $A$ occurs with probability at\nleast $1 - T_\\alpha^{-1/2}$. For brevity let $N = (624 \\log(\\sqrt{T_\\alpha} \\log(T_\\alpha)) + 1) G^2 / \\lambda^2$, so that $\\epsilon = \\sqrt{N / T_\\alpha}$; then we can decompose\nthe expectation in the following way:\n\n$$\\mathbb{E}[v_t - a_t p_t] = \\Pr[A]\\,\\mathbb{E}[v_t - a_t p_t \\mid A] + (1 - \\Pr[A])\\,\\mathbb{E}[v_t - a_t p_t \\mid \\neg A] \\le \\Pr[A]\\,\\epsilon + (1 - \\Pr[A]) \\le \\epsilon + T_\\alpha^{-1/2} = \\sqrt{\\frac{N}{T_\\alpha}} + \\sqrt{\\frac{1}{T_\\alpha}} \\le 2\\sqrt{\\frac{N}{T_\\alpha}},$$\n\nwhere the inequalities follow from the definition of $A$, Lemma 6, and the fact that $|v_t - a_t p_t| \\le 1$.\nPlugging this back into equation (1) gives\n\n$$T_\\alpha + \\sum_{t=T_\\alpha+1}^{T} \\mathbb{E}[v_t - a_t p_t] \\le T_\\alpha + \\lceil (1 - \\alpha) T \\rceil \\frac{2\\sqrt{N}}{\\sqrt{T_\\alpha}} \\le 2\\alpha T + 4\\sqrt{\\frac{T}{\\alpha}} \\sqrt{N},$$\n\nproving the first result of the theorem. Setting $\\alpha = T^{-1/3}$ gives the final expression.\n\nIn the next section we consider the more challenging setting of a surplus-maximizing buyer, who\nmay accept/reject prices in a manner meant to lower the prices offered.\n\n4 Surplus-Maximizing Buyer\n\nIn the previous section we considered a truthful buyer who myopically accepts every price below\nher value, i.e., she sets $a_t = \\mathbf{1}\\{p_t \\le v_t\\}$ for every round $t$. Let $S(T) = \\mathbb{E}\\left[\\sum_{t=1}^{T} \\gamma_t a_t (v_t - p_t)\\right]$\nbe the buyer\u2019s cumulative discounted surplus, where $\\{\\gamma_t\\}$ is a decreasing discount sequence, with\n$\\gamma_t \\in (0, 1)$. 
When prices are offered by the LEAP algorithm, the buyer\u2019s decisions about which\nprices to accept during the learning phase have an influence on the prices that she is offered in the\nexploit phase, and so a surplus-maximizing buyer may be able to increase her cumulative discounted\nsurplus by occasionally behaving untruthfully. In this section we assume that the buyer knows the\npricing algorithm and seeks to maximize $S(T)$.\nAssumption 2. The buyer is surplus-maximizing, i.e., she behaves so as to maximize $S(T)$, given\nthe seller\u2019s pricing algorithm.\nWe say that a lie occurs in any round $t$ where $a_t \\ne \\mathbf{1}\\{p_t \\le v_t\\}$. Note that a surplus-maximizing\nbuyer has no reason to lie during the exploit phase, since the buyer\u2019s behavior during exploit rounds\nhas no effect on the prices offered. Let $\\mathcal{L} = \\{t : 1 \\le t \\le T_\\alpha \\wedge a_t \\ne \\mathbf{1}\\{p_t \\le v_t\\}\\}$ be the set of\nlearning rounds where the buyer lies, and let $L = |\\mathcal{L}|$ be the number of lies. Observe that $\\tilde{g}_t \\ne g_t$\nin any lie round (recall that $\\mathbb{E}[g_t \\mid w_t] = \\nabla F(w_t)$, i.e., $g_t$ is the stochastic gradient in round $t$).\nWe take a moment to note the necessity of the discount factor $\\gamma_t$. This essentially models the buyer\nas valuing surplus at the current time step more than in the future. Another way of interpreting this\nis that the seller is more \u201cpatient\u201d as compared to the buyer. In [2] the authors show a lower bound\non the regret against a surplus-maximizing buyer in the contextless setting of the form $\\Omega(T_\\gamma)$, where\n$T_\\gamma = \\sum_{t=1}^{T} \\gamma_t$. Thus, if no decreasing discount factor is used, i.e. $\\gamma_t = 1$, then sublinear regret is\nnot possible. Note, the lower bound of the contextless setting applies here as well, since the case of\na distribution $\\mathcal{D}$ that induces a fixed context $x^*$ on every round is a special case of our setting. 
In\nthat case the problem reduces to the fixed unknown value setting since on every round $v_t = w^{*\\top} x^*$.\nIn the rest of this section we prove an $O\\left(T^{2/3}\\sqrt{\\log(T)\\,(1 + 1/\\log(1/\\gamma))}\\right)$ bound on the seller\u2019s\nregret under the assumption that the buyer is surplus-maximizing and that her discount sequence is\n$\\gamma_t = \\gamma^{t-1}$ for some $\\gamma \\in (0, 1)$. The idea of the proof is to show that the buyer incurs a cost for\ntelling lies, and therefore will not tell very many, and thus the lies she does tell will not significantly\naffect the seller\u2019s estimate of $w^*$.\nBounding the cost of lies: Observe that in any learning round where the surplus-maximizing buyer\ntells a lie, she loses surplus in that round relative to the truthful buyer, either by accepting a price\nhigher than her value (when $a_t = 1$ and $v_t < p_t$) or by rejecting a price less than her value (when\n$a_t = 0$ and $v_t > p_t$). This observation can be used to show that lies result in a substantial loss of\nsurplus relative to the truthful buyer, provided that in most of the lie rounds there is a nontrivial gap\nbetween the buyer\u2019s value and the seller\u2019s price. Because prices are chosen uniformly at random\nduring the learning phase, this is in fact quite likely, and with high probability the surplus lost\nrelative to the truthful buyer during the learning phase grows exponentially with the number of lies.\nThe precise quantity is stated in the lemma below. A full proof appears in the appendix, Section A.3.\nLemma 2. Let the discount sequence be defined as $\\gamma_t = \\gamma^{t-1}$ for $0 < \\gamma < 1$ and assume the buyer\nhas told $L$ lies. Then for $\\delta > 0$ with probability at least $1 - \\delta$ the buyer loses a surplus of at least\n\n$$\\frac{\\gamma^{-L+3} - 1}{8 T_\\alpha \\log(1/\\delta)} \\left(\\frac{\\gamma^{T_\\alpha}}{1 - \\gamma}\\right)$$\n\nrelative to the truthful buyer during the learning phase.\n\nBounding the number of lies: Although we argued in the previous lemma that lies during the\nlearning phase cause the surplus-maximizing buyer to lose surplus relative to the truthful buyer,\nthose lies may result in lower prices offered during the exploit phase, and thus the overall effect of\nlying may be beneficial to the buyer. However, we show that there is a limit on how large that benefit\ncan be, and thus we have the following high-probability bound on the number of learning phase lies.\nLemma 3. Let the discount sequence be defined as $\\gamma_t = \\gamma^{t-1}$ for $0 < \\gamma < 1$. Then for $\\delta > 0$ with\nprobability at least $1 - \\delta$, the number of lies satisfies\n\n$$L \\le \\frac{\\log\\left(32 T_\\alpha \\frac{1}{\\delta} \\log\\left(\\frac{2}{\\delta}\\right) + 1\\right)}{\\log(1/\\gamma)}.$$\n\nThe full proof is found in the appendix (Section A.4), and we provide a proof sketch here. The\nargument proceeds by comparing the amount of surplus lost (compared to the truthful buyer) due to\ntelling lies in the learning phase to the amount of surplus that could hope to be gained (compared to\nthe truthful buyer) in the exploit phase. Due to the discount factor, the surplus lost will eventually\noutweigh the surplus gained as the number of lies increases, implying a limit to the number of lies a\nsurplus-maximizing buyer can tell.\nBounding the effect of lies: In Section 3 we argued that if the buyer is truthful then, in each\nlearning round $t$ of the LEAP algorithm, $\\tilde{g}_t$ is a stochastic gradient with expected value $\\nabla F(w_t)$.\nWe then used the analysis of stochastic gradient descent in [22] to prove that $w_{T_\\alpha+1}$ converges to $w^*$\n(Lemma 1). However, if the buyer can lie then $\\tilde{g}_t$ is not necessarily the gradient and Lemma 1 no\nlonger applies. Below we extend the analysis in Rakhlin et al. [22] to a setting where the gradient\nmay be corrupted by lies up to $L$ times.\nLemma 4. Let $\\delta \\in (0, 1/e)$, $T_\\alpha \\ge 2$. 
If the buyer tells $L$ lies then with probability at least $1 - \\delta$,\n\n$$\\|w_{T_\\alpha+1} - w^*\\|^2 \\le \\frac{1}{T_\\alpha + 1} \\left(\\frac{(624 \\log(\\log(T_\\alpha)/\\delta) + e^2) G^2}{\\lambda^2} + \\frac{4 e^2 L}{\\lambda}\\right).$$\n\nThe proof of the lemma is similar to that of Lemma 1, but with extra steps needed to bound the\nadditional error introduced due to the erroneous gradients. Due to space constraints, we present\nthe proof in the appendix, Section A.6. Note that, modulo constants, the bound only differs by the\nadditive term $L/T_\\alpha$. That is, there is an extra additive error term that depends on the ratio of lies to\nnumber of learning rounds. Thus, if no lies are told, then there is no additive error, while if many\nlies are told, e.g. $L = T_\\alpha$, then the bound becomes vacuous.\nMain result: We are now ready to prove an upper bound on the regret of the LEAP algorithm when\nthe buyer is surplus-maximizing.\nTheorem 2. For any $0 < \\alpha < 1$ (such that $T_\\alpha \\ge 4$), $0 < \\gamma < 1$ and assuming a surplus-maximizing\nbuyer with exponential discounting factor $\\gamma_t = \\gamma^{t-1}$, the LEAP algorithm using parameter\n\n$$\\epsilon = \\sqrt{\\frac{1}{T_\\alpha} \\left(\\frac{(624 \\log(2\\sqrt{T_\\alpha} \\log(T_\\alpha)) + e^2) G^2}{\\lambda^2} + \\frac{4 e^2 \\log(128 \\sqrt{T_\\alpha} \\log(4\\sqrt{T_\\alpha}) + 1)}{\\lambda \\log(1/\\gamma)}\\right)},$$\n\nwhere $G = 4$, has regret against a surplus-maximizing buyer at most\n\n$$R(T) \\le 2\\alpha T + 4\\sqrt{\\frac{T}{\\alpha}} \\sqrt{\\frac{(624 \\log(2\\sqrt{T_\\alpha} \\log(T_\\alpha)) + e^2) G^2}{\\lambda^2} + \\frac{4 e^2 \\log(128 \\sqrt{T_\\alpha} \\log(4\\sqrt{T_\\alpha}) + 1)}{\\lambda \\log(1/\\gamma)}},$$\n\nwhich for $\\alpha = T^{-1/3}$ implies $R(T) \\le O\\left(T^{2/3} \\sqrt{\\log(T) \\left(1 + \\frac{1}{\\log(1/\\gamma)}\\right)}\\right)$.\nProof. Taking the high probability statements of Lemma 3 and Lemma 4 with $\\delta/2 \\in [0, 1/e]$\ntells us that with probability at least $1 - \\delta$,\n\n$$\\|w_{T_\\alpha+1} - w^*\\|^2 \\le \\frac{1}{T_\\alpha} \\left(\\frac{(624 \\log(2 \\log(T_\\alpha)/\\delta) + e^2) G^2}{\\lambda^2} + \\frac{4 e^2 \\log\\left(64 T_\\alpha \\frac{1}{\\delta} \\log\\left(\\frac{4}{\\delta}\\right) + 1\\right)}{\\lambda \\log(1/\\gamma)}\\right).$$\n\nSince we assume $T_\\alpha \\ge 4$, if we set $\\delta = T_\\alpha^{-1/2}$ it implies $\\delta/2 = T_\\alpha^{-1/2}/2 \\le 1/e$, which is required\nfor Lemma 4 to hold. Thus, if we set the algorithm parameter $\\epsilon$ as indicated in the statement of the\ntheorem, we have that with probability at least $1 - T_\\alpha^{-1/2}$, for all $t \\in \\{T_\\alpha + 1, \\ldots, T\\}$, $a_t = 1$\nand $v_t - p_t \\le \\epsilon$, which follows from the same argument used for Lemma 6.\nFinally, the same steps as in the proof of Theorem 1 can be used to show the first inequality.\nSetting $\\alpha = T^{-1/3}$ shows the second inequality and completes the proof.\n\nNote that the bound shows that if $\\gamma \\to 1$ (i.e. no discounting) the bound becomes vacuous, which\nis to be expected since the $\\Omega(T_\\gamma)$ lower bound on regret demonstrates the necessity of a discounting\nfactor. If $\\gamma \\to 0$ (i.e. the buyer becomes myopic, thereby truthful), then we retrieve the truthful bound\nmodulo constants. Thus for any $\\gamma < 1$, we have shown the first sublinear bound on the seller\u2019s regret\nagainst a surplus-maximizing buyer in the contextual setting.\n\n5 Extensions\n\nDoubling trick: A drawback of Theorem 2 is that optimally tuning the parameters $\\epsilon$ and $\\alpha$ requires\nknowledge of the horizon $T$. The usual way of handling this problem in the standard online\nlearning setting is to apply the \u2018doubling trick\u2019: If a learning algorithm that requires knowledge\nof $T$ has regret $O(T^c)$ for some constant $c$, then running independent instances of the algorithm\nduring consecutive phases of exponentially increasing length (i.e., the $i$th phase has length $2^i$) will\nalso have regret $O(T^c)$. We can also apply the doubling trick to our strategic setting, but we must\nexercise caution and argue that running the algorithm in phases does not affect the behavior of a\nsurplus-maximizing buyer in a way that invalidates the proof of Theorem 2. 
We formally state and\nprove the relevant corollary in Section A.8 of the Appendix.\n\nKernelized Algorithm: In some cases, assuming that the value of a buyer is a linear function of\nthe context may not be accurate. In Section A.7 of the Appendix we describe a kernelized version\nof LEAP, which allows for a non-linear model of the buyer value as a function of the context $x$. At\nthe same time, the regret guarantees provided in the previous sections still apply since we can view\nthe model as a linear function of the induced features $\\phi(x)$, where $\\phi(\\cdot)$ is a non-linear map and the\nkernel function $K$ is used to compute the inner product in this induced feature space: $K(x, x') =\n\\phi(x)^\\top \\phi(x')$.\n\nMultiple Buyers: So far we have assumed that the seller is interacting with a single buyer across\nmultiple posted price auctions. Recall that the motivation for considering this setting was repeated\nsecond price auctions against a single buyer, a situation that happens often in online advertising\nbecause of targeting. One might nevertheless wonder whether the algorithm can be applied to a\nsetting where there can be multiple buyers, and whether it remains robust in such a setting. We\ndescribe a way in which the analysis for the posted-price setting can carry over to multiple buyers.\nFormally, suppose there are $K$ buyers, and on round $t$, buyer $k$ receives a valuation of $v_{k,t}$. We let\n$k_{val}(t) = \\arg\\max_k v_{k,t}$, $v_t^+ = v_{k_{val}(t),t}$, and $v_t^- = \\max_{k \\ne k_{val}(t)} v_{k,t}$: the buyer with the highest\nvaluation, the highest valuation itself, and the second-highest valuation respectively. In a second\nprice auction, each buyer also submits a bid $b_{k,t}$, and we define $k_{bid}(t)$, $b_t^+$ and $b_t^-$ analogously\nto $k_{val}(t)$, $v_t^+$, $v_t^-$, corresponding to the highest bidder, the largest bid, and the second-largest bid.\nAfter the seller announces a reserve price $p_t$, buyers submit their bids $\\{b_{k,t}\\}$, and the seller receives\nround $t$ revenue of $r_t = \\mathbf{1}\\{p_t \\le b_t^+\\} \\max\\{b_t^-, p_t\\}$. The goal of the seller is to minimize $R(T) =\n\\mathbb{E}\\left[\\sum_{t=1}^{T} (v_t^+ - r_t)\\right]$. We assume that buyers are surplus-maximizing, and select a strategy that maps\nprevious reserve prices $p_1, \\ldots, p_{t-1}$, $p_t$, and $v_{k,t}$ to a choice of bid on round $t$.\nWe call $v_t^+$ the market valuation for good $t$. The key to extending the LEAP algorithm to the multiple\nbuyer setting will be to treat market valuations in the same way we treated the individual buyer\u2019s\nvaluation in the single-buyer setting. In order to do so, we make an analogous modelling assumption\nto that of Section 2. Specifically, we assume that there is some $w^*$ such that $v_t^+ = w^{*\\top} x_t$.$^1$ Note\nthat we assume a model on the market price itself.\nAt first glance, this might seem like a strange assumption since $v_t^+$ is itself the result of a maximization\nover $v_{k,t}$. However, we argue that it\u2019s actually rather unrestrictive. In fact the individual\nvaluations $v_{k,t}$ can be generated arbitrarily so long as $v_{k,t} \\le w^{*\\top} x_t$ and equality holds for some $k$.\nIn other words, we can imagine that nature first computes the market valuation $v_t^+$, then arbitrarily\n(even adversarially) selects which buyer gets this valuation, and the other buyer valuations.\nNow we can define $a_t = \\mathbf{1}\\{p_t \\le b_t^+\\}$, whether the largest bid was greater than the reserve, and\nconsider running the LEAP algorithm, but with this choice of $a_t$. Notice that for any $t$, $a_t p_t \\le r_t$,\nthereby giving us the following key fact: $R(T) \\le R'(T) \\triangleq \\mathbb{E}\\left[\\sum_{t=1}^{T} (v_t^+ - a_t p_t)\\right]$. We also redefine\n$L$ to be the number of market lies: rounds $t \\le T_\\alpha$ where $a_t \\ne \\mathbf{1}\\{p_t \\le v_t^+\\}$. Note the market tells\na lie if either all valuations were below $p_t$, but somebody bid over $p_t$ anyway, or if some valuation\nwas above $p_t$ but no buyer decided to outbid $p_t$. 
With this choice of L, Lemma 4 holds exactly as\nwritten but in the multiple buyer setting.\nIt\u2019s well-known [24] that single-shot second-price auctions are strategy-proof. Therefore, during the\nexploit phase of the algorithm, all buyers are incentivized to bid truthfully. Thus, in order to bound\nR\u2032(T) and therefore R(T), we need only rederive Lemma 3 to bound the number of market lies. We\nbegin by partitioning the market lies. Let \u2112 = {t : t \u2264 T_\u03b1, 1{p_t \u2264 v_t^+} \u2260 1{p_t \u2264 b_t^+}}, while letting\n\u2112_k = {t : t \u2264 T_\u03b1, v_t^+ < p_t \u2264 b_t^+, k_bid(t) = k} \u222a {t : t \u2264 T_\u03b1, b_t^+ < p_t \u2264 v_t^+, k_val(t) = k}. In other\nwords, we attribute a lie to buyer k if (1) the reserve was larger than the market value, but buyer k\nwon the auction anyway, or (2) buyer k had the largest valuation, but nobody cleared the reserve.\nChecking that \u2112 = \u222a_k \u2112_k and letting L_k = |\u2112_k| tells us that L \u2264 \u2211_{k=1}^K L_k. Furthermore, we\ncan bound L_k using nearly identical arguments to the posted-price setting, giving us the subsequent\nlemma and corollary for the multiple buyer setting.\nLemma 5. Let the discount sequence be defined as \u03b3_t = \u03b3^{t\u22121} for 0 < \u03b3 < 1. Then for \u03b4 > 0, with\nprobability at least 1 \u2212 \u03b4, L_k \u2264 log(32 T_\u03b1/\u03b4 + 1) / log(1/\u03b3), and L \u2264 K L_k.\nProof. We first consider the surplus buyer k loses during learning rounds, compared to if he had\nbeen truthful. Suppose buyer k unilaterally switches to always bidding his value (i.e. b_{k,t} = v_{k,t}).\nFor a single-shot second-price auction, being truthful is a dominant strategy and so he would only\nincrease his surplus on learning rounds. Furthermore, on each round in \u2112_k he would increase his\n(undiscounted) surplus by at least |v_{k,t} \u2212 p_t|. Now the analysis follows as in Lemmas 2 and 3.\nCorollary 1. In the multiple surplus-maximizing buyers setting the LEAP algorithm with\n\u03b1 = T^{\u22121/3}, \u03b5 = sqrt( (1/T_\u03b1) [ (624 log(2 sqrt(T_\u03b1) log(T_\u03b1)) + e^2) G^2 / \u03bb^2 + 4 e^2 K log(128 sqrt(T_\u03b1) log(4 sqrt(T_\u03b1)) + 1) / (\u03bb log(1/\u03b3)) ] ), has regret\nR(T) \u2264 R\u2032(T) \u2264 O( T^{2/3} sqrt( log(T) + K log(T) / log(1/\u03b3) ) ).\n\n6 Conclusion\n\nIn this work, we have introduced the scenario of contextual auctions in the presence of surplus-maximizing\nbuyers and have presented an algorithm that is able to achieve sublinear regret in this\nsetting, assuming buyers receive a discounted surplus. Once again, we stress the importance of the\ncontextual setting, as it contributes to the rise of targeted bids that result in auctions with single high\nbidders, essentially reducing the auction to the posted-price scenario studied in this paper. Future\ndirections for extending this work include considering different surplus discount rates as well as\nunderstanding whether small modifications to standard contextual online learning algorithms can\nlead to no-strategic-regret guarantees.\n\n\u00b9 Note that we could also apply the kernelized LEAP algorithm in the multiple buyer setting.\n\nReferences\n\n[1] Alessandro Acquisti and Hal R Varian. Conditioning prices on purchase history. Marketing Science, 24(3):367\u2013381, 2005.\n\n[2] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Learning prices for repeated auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 1169\u20131177, 2013.\n\n[3] Ziv Bar-Yossef, Kirsten Hildrum, and Felix Wu. Incentive-compatible online auctions for digital goods. In Proceedings of Symposium on Discrete Algorithms, pages 964\u2013970. SIAM, 2002.\n\n[4] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. 
Online learning in online auctions. In Proceedings of Symposium on Discrete Algorithms, pages 202\u2013204. SIAM, 2003.\n\n[5] Matthew Cary, Aparna Das, Ben Edelman, Ioannis Giotis, Kurtis Heimerl, Anna R Karlin, Claire Mathieu, and Michael Schwarz. Greedy bidding strategies for keyword auctions. In Proceedings of the 8th ACM conference on Electronic commerce, pages 262\u2013271. ACM, 2007.\n\n[6] Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve prices in second-price auctions. In Proceedings of the Symposium on Discrete Algorithms. SIAM, 2013.\n\n[7] Vincent Conitzer and Tuomas Sandholm. Computing the optimal strategy to commit to. In Proceedings of the 7th ACM conference on Electronic commerce, pages 82\u201390. ACM, 2006.\n\n[8] Nikhil R Devanur, Yuval Peres, and Balasubramanian Sivan. Perfect bayesian equilibria in repeated sales. arXiv preprint arXiv:1409.3062, 2014.\n\n[9] Benjamin Edelman and Michael Ostrovsky. Strategic bidder behavior in sponsored search auctions. Decision support systems, 43(1):192\u2013198, 2007.\n\n[10] Drew Fudenberg and Jean Tirole. Game theory. MIT Press Books, 1, 1991.\n\n[11] Drew Fudenberg and J Miguel Villas-Boas. Behavior-based price discrimination and customer recognition. Handbook on economics and information systems, 1:377\u2013436, 2006.\n\n[12] Mohammad Taghi Hajiaghayi, Robert Kleinberg, and David C Parkes. Adaptive limited-supply online auctions. In Proceedings of the 5th ACM conference on Electronic commerce, pages 71\u201380. ACM, 2004.\n\n[13] Oliver D Hart and Jean Tirole. Contract renegotiation and coasian dynamics. The Review of Economic Studies, 55(4):509\u2013540, 1988.\n\n[14] Manish Jain, Jason Tsai, James Pita, Christopher Kiekintveld, Shyamsunder Rathi, Milind Tambe, and Fernando Ord\u00f3\u00f1ez. Software assistants for randomized patrol planning for the lax airport police and the federal air marshal service. 
Interfaces, 40(4):267\u2013290, 2010.\n\n[15] Brendan Kitts and Benjamin Leblanc. Optimal bidding on keyword auctions. Electronic Markets, 14(3):186\u2013201, 2004.\n\n[16] Brendan Kitts, Parameshvyas Laxminarayan, Benjamin Leblanc, and Ryan Meech. A formal analysis of search auctions including predictions on click fraud and bidding tactics. In Workshop on Sponsored Search Auctions, 2005.\n\n[17] Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Symposium on Foundations of Computer Science, pages 594\u2013605. IEEE, 2003.\n\n[18] Dmytro Korzhyk, Zhengyu Yin, Christopher Kiekintveld, Vincent Conitzer, and Milind Tambe. Stackelberg vs. nash in security games: An extended investigation of interchangeability, equivalence, and uniqueness. J. Artif. Intell. Res. (JAIR), 41:297\u2013327, 2011.\n\n[19] Andres Munoz Medina and Mehryar Mohri. Learning theory and algorithms for revenue optimization in second price auctions with reserve. In Proceedings of The 31st International Conference on Machine Learning, pages 262\u2013270, 2014.\n\n[20] David C Parkes. Online mechanisms. In Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay Vazirani, editors, Algorithmic Game Theory. Cambridge University Press, 2007.\n\n[21] James Pita, Manish Jain, Janusz Marecki, Fernando Ord\u00f3\u00f1ez, Christopher Portway, Milind Tambe, Craig Western, Praveen Paruchuri, and Sarit Kraus. Deployed armor protection: the application of a game theoretic model for security at the los angeles international airport. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems: industrial track, pages 125\u2013132. International Foundation for Autonomous Agents and Multiagent Systems, 2008.\n\n[22] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. 
arXiv preprint arXiv:1109.5647, 2011.\n\n[23] Klaus M Schmidt. Commitment through incomplete information in a simple repeated bargaining game. Journal of Economic Theory, 60(1):114\u2013139, 1993.\n\n[24] Hal R Varian and Jack Repcheck. Intermediate microeconomics: a modern approach, volume 6. WW Norton & Company New York, NY, 2010.\n", "award": [], "sourceid": 431, "authors": [{"given_name": "Kareem", "family_name": "Amin", "institution": "University of Pennsylvania"}, {"given_name": "Afshin", "family_name": "Rostamizadeh", "institution": "Google Research"}, {"given_name": "Umar", "family_name": "Syed", "institution": "Google Research"}]}