{"title": "Revenue Optimization with Approximate Bid Predictions", "book": "Advances in Neural Information Processing Systems", "page_first": 1858, "page_last": 1866, "abstract": "In the context of advertising auctions, finding good reserve prices is a notoriously challenging learning problem. This is due to the heterogeneity of ad opportunity types, and the non-convexity of the objective function. In this work, we show how to reduce reserve price optimization to the standard setting of prediction under squared loss, a well understood problem in the learning community. We further bound the gap between the expected bid and revenue in terms of the average loss of the predictor. This is the first result that formally relates the revenue gained to the quality of a standard machine learned model.", "full_text": "Revenue Optimization with Approximate Bid\n\nPredictions\n\nAndr\u00b4es Mu\u02dcnoz Medina\n\nGoogle Research\n\n76 9th Ave\n\nNew York, NY 10011\n\nSergei Vassilvitskii\nGoogle Research\n\n76 9th Ave\n\nNew York, NY 10011\n\nAbstract\n\nIn the context of advertising auctions, \ufb01nding good reserve prices is a notoriously\nchallenging learning problem. This is due to the heterogeneity of ad opportunity\ntypes, and the non-convexity of the objective function. In this work, we show how\nto reduce reserve price optimization to the standard setting of prediction under\nsquared loss, a well understood problem in the learning community. We further\nbound the gap between the expected bid and revenue in terms of the average loss\nof the predictor. This is the \ufb01rst result that formally relates the revenue gained to\nthe quality of a standard machine learned model.\n\n1\n\nIntroduction\n\nA crucial task for revenue optimization in auctions is setting a good reserve (or minimum) price. Set\nit too low, and the sale may yield little revenue, set it too high and there may not be anyone willing\nto buy the item. 
The celebrated work by Myerson [1981] shows how to optimally set reserves in second-price auctions, provided the value distribution of each bidder is known.\n\nIn practice there are two challenges that make this problem significantly more complicated. First, the value distribution is never known directly; rather, the auctioneer can only observe samples drawn from it. Second, in the context of ad auctions, the items for sale (impressions) are heterogeneous, and there are literally trillions of different types of items being sold. It is therefore likely that a specific type of item has never been observed previously, and no information about its value is known.\n\nA standard machine learning approach addressing the heterogeneity problem is to parametrize each impression by a feature vector, with the underlying assumption that bids observed from auctions with similar features will be similar. In online advertising, these features encode, for instance, the ad size, whether it is mobile or desktop, etc.\n\nThe question is, then, how to use the features to set a good reserve price for a particular ad opportunity. On the face of it, this sounds like a standard machine learning question: given a set of features, predict the value of the maximum bid. The difficulty comes from the shape of the loss function. Much of the machine learning literature is concerned with optimizing well-behaved loss functions, such as the squared loss or the hinge loss. The revenue function, on the other hand, is discontinuous and strongly non-concave, making a direct attack a challenging proposition.\n\nIn this work we take a different approach and reduce the problem of finding good reserve prices to a prediction problem under the squared loss. In this way we can rely upon many widely available and scalable algorithms developed to minimize this objective. 
We proceed by using the predictor to define a judicious clustering of the data, and then compute the empirically maximizing reserve price for each group. Our reduction is simple and practical, and directly ties the revenue gained by the algorithm to the prediction error.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1.1 Related Work\n\nOptimizing revenue in auctions has been a rich area of study, beginning with the seminal work of Myerson [1981], who introduced optimal auction design. Follow-up work by Chawla et al. [2007] and Hartline and Roughgarden [2009], among others, refined his results in increasingly complex settings, taking into account multiple items, diverse demand functions, and weaker assumptions on the shape of the value distributions.\n\nMost of the classical literature on revenue optimization focuses on the design of optimal auctions when the bidding distribution of buyers is known. More recent work has considered the computational and information-theoretic challenges in learning optimal auctions from data. A long line of work [Cole and Roughgarden, 2015, Devanur et al., 2016, Dhangwatnotai et al., 2015, Morgenstern and Roughgarden, 2015, 2016] analyzes the sample complexity of designing optimal auctions. The main contribution of this direction is to show that under fairly general bidding scenarios, a near-optimal auction can be designed knowing only a polynomial number of samples from bidders' valuations. Other authors [Leme et al., 2016, Roughgarden and Wang, 2016] have focused on the computational complexity of finding optimal reserve prices from samples, showing that even for simple mechanisms the problem is often NP-hard to solve directly.\n\nAnother well-studied approach to data-driven revenue optimization is that of online learning. Here, auctions occur one at a time, and the learning algorithm must compute prices as a function of the history of the algorithm. These algorithms generally make no distributional assumptions and measure their performance in terms of regret: the difference between the algorithm's performance and the performance of the best fixed reserve price in hindsight. Kleinberg and Leighton [2003] developed an online revenue optimization algorithm for posted-price auctions that achieves low regret. Their work was later extended to second-price auctions by Cesa-Bianchi et al. [2015].\n\nA natural approach in both of these settings is to attempt to predict an optimal reserve price, or equivalently the highest bid submitted by any of the buyers. While the problem of learning this reserve price is well understood for the simplistic model of buyers with i.i.d. valuations [Cesa-Bianchi et al., 2015, Devanur et al., 2016, Kleinberg and Leighton, 2003], the problem becomes much more challenging in practice, when the valuations of a buyer also depend on features associated with the ad opportunity (for instance user demographics and publisher information).\n\nThis problem is not nearly as well understood as its i.i.d. counterpart. Mohri and Medina [2014] provide learning guarantees and an algorithm based on DC programming to optimize revenue in second-price auctions with reserve. The proposed algorithm, however, does not easily scale to large auction data sets, as each iteration involves solving a convex optimization problem. A smoother version of this algorithm is given by Rudolph et al. [2016]. However, the problem being highly non-convex, neither algorithm provides a guarantee on the revenue attainable by the algorithm's output. Devanur et al. [2016] give sample complexity bounds on the design of optimal auctions with side information. However, the authors consider only cases where this side information is given by σ ∈ [0, 1]. 
More importantly, their proposed algorithm only works under the unverifiable assumption that the conditional distributions of bids given σ satisfy stochastic dominance.\n\nOur results. We show that given a predictor of the bid with squared loss of η^2, we can construct a reserve function r that extracts all but g(η) revenue, for a simple increasing function g. (See Theorem 2 for the exact statement.) To the best of our knowledge, this is the first result that ties the revenue one can achieve directly to the quality of a standard prediction task. Our algorithm for computing r is scalable, practical, and efficient.\n\nAlong the way we show what kinds of distributions are amenable to revenue optimization via reserve prices. We prove that when bids are drawn i.i.d. from a distribution F, the ratio between the mean bid and the revenue extracted with the optimal monopoly reserve scales as O(log Var(F)) (Theorem 5). This result refines the log h bound derived by Goldberg et al. [2001], and formalizes the intuition that reserve prices are more successful for low-variance distributions.\n\n2 Setup\n\nWe consider a repeated posted-price auction setup where every auction is parametrized by a feature vector x ∈ X and a bid b ∈ [0, 1]. Let D be a distribution over X × [0, 1]. Let h : X → [0, 1] be a bid prediction function, and denote by η^2 the squared loss incurred by h:\n\nE[(h(x) - b)^2] = η^2.\n\nWe assume h is given, and make no assumption on the structure of h or how it is obtained. Notice that while the existence of such an h is not guaranteed for all values of η, using historical data one could use one of multiple readily available regression algorithms to find the best hypothesis h.\n\nLet S = ((x_1, b_1), . . . , (x_m, b_m)) ∼ D be a set of m i.i.d. samples drawn from D, and denote by S_X = (x_1, . . . , x_m) its projection on X. 
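Concretely, such an h can be obtained by ordinary least squares on historical data; a minimal synthetic sketch (the linear data model, the noise level, and all names here are our own illustration, not the paper's):\n\n```python\nimport numpy as np\n\n# Synthetic stand-in for historical auction data: feature vectors x_i in\n# [0, 1]^d and bids b_i clipped to [0, 1], as in the setup above.\nrng = np.random.default_rng(0)\nm, d = 2000, 5\nX = rng.uniform(0.0, 1.0, size=(m, d))\nw_true = np.full(d, 0.2)  # hypothetical linear bidding model\nb = np.clip(X @ w_true + rng.normal(0.0, 0.05, size=m), 0.0, 1.0)\n\n# Fit the bid predictor h by ordinary least squares, one of the "readily\n# available" regression methods; eta^2 is then its average squared loss.\nw_hat, *_ = np.linalg.lstsq(X, b, rcond=None)\nh = np.clip(X @ w_hat, 0.0, 1.0)  # keep h : X -> [0, 1]\neta_sq = float(np.mean((h - b) ** 2))\n```\n\nAny regressor with small average squared loss can play the role of h; nothing below depends on it being linear.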
Given a price p, let Rev(p, b) = p·1_{b ≥ p} denote the revenue obtained when the bidder bids b. For a reserve price function r : X → [0, 1] we let\n\nR̂(r) = (1/m) Σ_{(x,b)∈S} Rev(r(x), b)   and   R(r) = E_{(x,b)∼D}[Rev(r(x), b)]\n\ndenote the empirical and expected revenue of reserve price function r.\n\nWe also let B = E[b] and B̂ = (1/m) Σ_{i=1}^m b_i denote the population and empirical mean bid, and S(r) = B - R(r), Ŝ(r) = B̂ - R̂(r) denote the expected and empirical separation between bid values and the revenue. Notice that for a given reserve price function r, S(r) corresponds to revenue left on the table. Our goal is, given S and h, to find a function r that maximizes R(r), or equivalently minimizes S(r).\n\n2.1 Generalization Error\n\nNote that in our setup we are only given samples from the distribution D, but aim to maximize the expected revenue. Understanding the difference between the empirical performance of an algorithm and its expected performance, also known as the generalization error, is a key tenet of learning theory.\n\nAt a high level, the generalization error is a function of the training set size (larger training sets lead to smaller generalization error) and of the inherent complexity of the learning algorithm (simple rules such as linear classifiers generalize better than more complex ones).\n\nIn this paper we characterize the complexity of a class G of functions by its growth function Π. The growth function corresponds to the maximum number of binary labelings that can be obtained by G over all possible samples S_X. 
It is closely related to the VC-dimension when G takes values in {0, 1}, and to the pseudo-dimension [Morgenstern and Roughgarden, 2015, Mohri et al., 2012] when G takes values in R.\n\nWe can give a bound on the generalization error associated with minimizing the empirical separation over a class of functions G. The following theorem is an adaptation of Theorem 1 of Mohri and Medina [2014] to our particular setup.\n\nTheorem 1. Let δ > 0. With probability at least 1 - δ over the choice of the sample S, the following bound holds uniformly for r ∈ G:\n\nS(r) ≤ Ŝ(r) + 2 √(log(1/δ)/(2m)) + 4 √(2 log Π(G, m)/m).   (1)\n\nTherefore, in order to minimize the expected separation S(r), it suffices to minimize the empirical separation Ŝ(r) over a class of functions G whose growth function scales polynomially in m.\n\n3 Warmup\n\nIn order to better understand the problem at hand, we begin by introducing a straightforward mechanism for transforming the hypothesis function h into a reserve price function r with guarantees on its achievable revenue.\n\nLemma 1. Let r : X → [0, 1] be defined by r(x) := max(h(x) - η^{2/3}, 0). The function r then satisfies S(r) ≤ η^{1/2} + 2η^{2/3}.\n\nThe proof is a simple application of Jensen's and Markov's inequalities and is deferred to Appendix B.\n\nThis surprisingly simple algorithm shows there are ways to obtain revenue guarantees from a simple regressor. To the best of our knowledge this is the first guarantee of its kind. The reader may be curious about the choice of η^{2/3} as the offset in our reserve price function. 
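The quantities of Section 2 and the warmup rule of Lemma 1 translate directly into code; a minimal numpy sketch (function names are ours):\n\n```python\nimport numpy as np\n\ndef empirical_revenue(reserve, bids):\n    # R_hat(r) = (1/m) * sum of Rev(r(x), b), with Rev(p, b) = p * 1[b >= p].\n    return float(np.mean(reserve * (bids >= reserve)))\n\ndef empirical_separation(reserve, bids):\n    # S_hat(r) = B_hat - R_hat(r): mean bid minus the revenue captured.\n    return float(np.mean(bids)) - empirical_revenue(reserve, bids)\n\ndef offset_reserve(h_pred, eta_sq):\n    # Warmup rule of Lemma 1: r(x) = max(h(x) - eta^(2/3), 0),\n    # where eta^(2/3) = (eta^2)^(1/3).\n    return np.maximum(h_pred - eta_sq ** (1.0 / 3.0), 0.0)\n```\n\nWith a perfect predictor (η = 0) the rule prices exactly at the bid and the separation vanishes; as η grows, the offset trades a lower price for a higher probability of sale.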
We will show that the dependence on η^{2/3} is not a simple artifact of our analysis, but a cost inherent to the problem of revenue optimization.\n\nMoreover, observe that this simple algorithm fixes a static offset, and does not distinguish between the parts of the feature space where the algorithm makes a low error and those where the error is relatively high. By contrast, our proposed algorithm partitions the space appropriately and calculates a different reserve for each partition. More importantly, we will provide a data-dependent bound on the performance of our algorithm that only in the worst-case scenario behaves like η^{2/3}.\n\n4 Results Overview\n\nIn principle, to maximize revenue we need to find a class of functions G with small complexity that nevertheless contains a function approximately minimizing the empirical separation. The challenge comes from the fact that the revenue function Rev is discontinuous and highly non-concave: a small change in the price p may lead to very large changes in revenue. This is the main reason why simply using the predictor h(x) as a proxy for a reserve function is a poor choice, even if its average error η^2 is small. For example, a function h that is just as likely to over-predict by η as to under-predict by η will have very small error, but lead to 0 revenue in half the cases.\n\nA solution on the other end of the spectrum would simply memorize the optimum prices from the sample S, setting r(x_i) = b_i. While this leads to optimal empirical revenue, a function class G containing r would satisfy Π(G, m) = 2^m, making the bound of Theorem 1 vacuous.\n\nIn this work we introduce a family G(h, k) of classes parameterized by k ∈ N. 
This family admits an approximate minimizer that can be computed in polynomial time, has low generalization error, and achieves provable guarantees on the overall revenue.\n\nMore precisely, we show that given S, and a hypothesis h with expected squared loss of η^2:\n\n• For every k ≥ 1 there exists a set of functions G(h, k) such that Π(G(h, k), m) = O(m^{2k}).\n• For every k ≥ 1, there is a polynomial time algorithm that outputs r_k ∈ G(h, k) such that in the worst-case scenario Ŝ(r_k) is bounded by O(1/k^{2/3} + η^{2/3} + 1/m^{1/6}).\n\nEffectively, we show how to transform any classifier h with low squared loss η^2 into a reserve price predictor that recovers all but O(η^{2/3}) revenue in expectation.\n\n4.1 Algorithm Description\n\nIn this section we give an overview of the algorithm that uses both the predictor h and the set of samples in S to develop a pricing function r. Our approach has two steps. First we partition the set of feasible prices, 0 ≤ p ≤ 1, into k partitions C_1, C_2, . . . , C_k. The exact boundaries between partitions depend on the samples S and their predicted values, as given by h. For each partition we find the price that maximizes the empirical revenue in the partition. We let r(x) return the empirically optimum price in the partition that contains h(x).\n\nFor a more formal description, let T_k be the set of k-partitions of the interval [0, 1], that is,\n\nT_k = {t = (t_0, t_1, . . . , t_{k-1}, t_k) | 0 = t_0 < . . . < t_k = 1}.\n\nWe define G(h, k) = {x ↦ Σ_{j=0}^{k-1} r_j 1_{t_j ≤ h(x) < t_{j+1}} | t ∈ T_k, r_0, . . . , r_{k-1} ∈ [0, 1]}, the class of reserve functions constant on each prediction bucket. For a partition C = {C_1, . . . , C_k} of X we measure its quality by the weighted variance Φ(C) = Σ_{j=1}^k √(Σ_{i,i': x_i, x_{i'} ∈ C_j} (b_i - b_{i'})^2). Algorithm 1 selects the threshold-induced partition C_h minimizing the analogous quantity Φ_h defined on the predictions (Section 6), and prices each cell at its empirically optimal reserve.\n\nTheorem 2. Let δ > 0 and let r_k denote the output of Algorithm 1. Then r_k ∈ G(h, k), and with probability at least 1 - δ over the samples S:\n\nŜ(r_k) ≤ (3B̂)^{1/3} (Φ(C_h)/(2m))^{2/3} ≤ (3B̂)^{1/3} (1/(2k) + 2(η^2 + √(log(1/δ)/(2m)))^{1/2})^{2/3}.\n\nNotice that our bound is data-dependent, and only in the worst-case scenario does it behave like η^{2/3}. In general it could be much smaller.\n\nWe also show that the complexity of G(h, k) admits a favorable bound. The proof is similar to that in [Morgenstern and Roughgarden, 2015]; we include it in Appendix E for completeness.\n\nTheorem 3. The growth function of the class G(h, k) can be bounded as Π(G(h, k), m) ≤ m^{2k-1}.\n\nWe can combine these results with Equation 1 and an easy bound on B̂ in terms of B to conclude:\n\nCorollary 1. Let δ > 0 and let r_k denote the output of Algorithm 1. Then r_k ∈ G(h, k), and with probability at least 1 - δ over the samples S:\n\nS(r_k) ≤ (3B̂)^{1/3} (Φ(C_h)/(2m))^{2/3} + O(√(k log m / m)) ≤ (12Bη^2)^{1/3} + O(1/k^{2/3}) + O((log(1/δ)/(2m))^{1/6}) + O(√(k log m / m)).\n\nSince B ∈ [0, 1], this implies that when k = Θ(m^{3/7}), the separation is bounded by 2.28 η^{2/3} plus additional error factors that go to 0 with the number of samples m as Õ(m^{-2/7}).\n\n5 Bounding Separation\n\nIn this section we prove the main bound motivating our algorithm. This bound relates the variance of the bid distribution to the maximum revenue that can be extracted when a buyer's bids follow such a distribution. 
It formally shows what makes a distribution amenable to revenue optimization.\n\nTo gain intuition for the kind of bound we are striving for, consider a bid distribution F. If the variance of F is 0, that is, F is a point mass at some value v, then setting a reserve price of v leads to no separation. On the other hand, consider the equal-revenue distribution, with F(x) = 1 - 1/x. Here any reserve price leads to revenue of 1. However, the distribution has unbounded expected bid and variance, so it is not too surprising that more revenue cannot be extracted. We make this connection precise, showing that after setting the optimal reserve price, the separation can be bounded by a function of the variance of the distribution.\n\nGiven any bid distribution F over [0, 1] we denote by G(r) = 1 - lim_{r'→r^-} F(r') the probability that a bid is greater than or equal to r. Finally, we will let R = max_r rG(r) denote the maximum revenue achievable when facing a bidder whose bids are drawn from distribution F. As before, we denote by B = E_{b∼F}[b] the mean bid and by S = B - R the expected separation of distribution F.\n\nTheorem 4. Let σ^2 denote the variance of F. Then σ^2 ≥ 2R^2 e^{S/R} - B^2 - R^2.\n\nThe proof of this theorem is highly technical and we present it in Appendix A.\n\nCorollary 2. The following bound holds for any distribution F: S ≤ (3R)^{1/3} σ^{2/3} ≤ (3B)^{1/3} σ^{2/3}.\n\nThe proof of this corollary follows immediately by an application of Taylor's theorem to the bound of Theorem 4. It is also easy to show that this bound is tight (see Appendix D).\n\n5.1 Approximating Maximum Revenue\n\nIn their seminal work, Goldberg et al. [2001] showed that when faced with a bidder drawing values from a distribution F on [1, M] with mean B, an auctioneer setting the optimal monopoly reserve would recover at least Ω(B/log M) revenue. We show how to adapt the result of Theorem 4 to refine this approximation ratio as a function of the variance of F. We defer the proof to Appendix B.\n\nTheorem 5. For any distribution F with mean B and variance σ^2, the maximum revenue with monopoly reserves, R, satisfies: B/R ≤ 4.78 + 2 log(1 + σ^2/B^2).\n\nNote that since σ^2 ≤ M^2, this always leads to a tighter bound on the revenue.\n\n5.2 Partition of X\n\nCorollary 2 suggests clustering points in such a way that the variance of the bids in each cluster is minimized. Given a partition C = {C_1, . . . , C_k} of X we denote by m_j = |S_X ∩ C_j|, B̂_j = (1/m_j) Σ_{i: x_i ∈ C_j} b_i, and σ̂_j^2 = (1/m_j) Σ_{i: x_i ∈ C_j} (b_i - B̂_j)^2. Let also r_j = argmax_{p>0} p·|{b_i > p | x_i ∈ C_j}| and R̂_j = r_j·|{b_i > r_j | x_i ∈ C_j}|.\n\nLemma 2. Let r(x) = Σ_{j=1}^k r_j 1_{x ∈ C_j}. Then Ŝ(r) ≤ (3B̂)^{1/3} (Φ(C)/(2m))^{2/3}.\n\nProof. Let Ŝ_j = B̂_j - R̂_j. Corollary 2 applied to the empirical bid distribution in C_j yields Ŝ_j ≤ (3B̂_j)^{1/3} σ̂_j^{2/3}. Multiplying by m_j/m, summing over all clusters, and using Hölder's inequality gives:\n\nŜ(r) = Σ_{j=1}^k (m_j/m) Ŝ_j ≤ (1/m) Σ_{j=1}^k (3B̂_j)^{1/3} σ̂_j^{2/3} m_j ≤ (Σ_{j=1}^k (3m_j/m) B̂_j)^{1/3} ((1/m) Σ_{j=1}^k m_j σ̂_j)^{2/3} = (3B̂)^{1/3} ((1/m) Σ_{j=1}^k m_j σ̂_j)^{2/3}.\n\n6 Clustering Algorithm\n\nIn view of Lemma 2, and since the quantity B̂ is fixed, we can find a function minimizing the empirical separation by finding a partition of X that minimizes the weighted variance Φ(C) defined in Section 4.1. From the definition of Φ, this problem resembles a traditional k-means clustering problem with distance function d(x_i, x_{i'}) = (b_i - b_{i'})^2. Thus, one could use one of several clustering algorithms to solve it. Nevertheless, in order to allocate a new point x ∈ X to a cluster, we would require access to the bid b, which at evaluation time is unknown. Instead, we show how to utilize the predictions of h to define an almost optimal clustering of X.\n\nFor any partition C = {C_1, . . . , C_k} of X define\n\nΦ_h(C) = Σ_{j=1}^k √(Σ_{i,i': x_i, x_{i'} ∈ C_j} (h(x_i) - h(x_{i'}))^2).\n\nNotice that (1/(2m)) Φ_h(C) is the function minimized by Algorithm 1. The following lemma, proved in Appendix B, bounds the cluster variance achieved by clustering bids according to their predictions.\n\nLemma 3. Let h be a function such that (1/m) Σ_{i=1}^m (h(x_i) - b_i)^2 ≤ η̂^2, and let C* denote the partition that minimizes Φ(C). If C_h minimizes Φ_h(C), then Φ(C_h) ≤ Φ(C*) + 4m η̂.\n\nCorollary 3. Let r_k be the output of Algorithm 1. If (1/m) Σ_{i=1}^m (h(x_i) - b_i)^2 ≤ η̂^2, then\n\nŜ(r_k) ≤ (3B̂)^{1/3} ((1/(2m)) Φ(C_h))^{2/3} ≤ (3B̂)^{1/3} ((1/(2m)) Φ(C*) + 2η̂)^{2/3}.   (3)\n\nProof. It is easy to see that the elements C_j^h of C_h are of the form C_j = {x | t_j ≤ h(x) ≤ t_{j+1}} for t ∈ T_k. Thus, if r_k is the hypothesis induced by the partition C_h, then r_k ∈ G(h, k). The result now follows by the definition of Φ and Lemmas 2 and 3.\n\nThe proof of Theorem 2 is now straightforward. Define a partition C by x_i ∈ C_j if b_i ∈ [(j-1)/k, j/k]. Since (b_i - b_{i'})^2 ≤ 1/k^2 for b_i, b_{i'} ∈ C_j, we have\n\nΦ(C) ≤ Σ_{j=1}^k √(m_j^2/k^2) = m/k.   (4)\n\nFurthermore, since E[(h(x) - b)^2] ≤ η^2, Hoeffding's inequality implies that with probability 1 - δ:\n\n(1/m) Σ_{i=1}^m (h(x_i) - b_i)^2 ≤ η^2 + √(log(1/δ)/(2m)).   (5)\n\nIn view of inequalities (4) and (5), as well as Corollary 3, we have\n\nŜ(r_k) ≤ (3B̂)^{1/3} ((1/(2m)) Φ(C) + 2(η^2 + √(log(1/δ)/(2m)))^{1/2})^{2/3} ≤ (3B̂)^{1/3} (1/(2k) + 2(η^2 + √(log(1/δ)/(2m)))^{1/2})^{2/3}.\n\nThis completes the proof of the main result. To implement the algorithm, note that the problem of minimizing Φ_h(C) reduces to finding a partition t ∈ T_k such that the sum of the variances within the partitions is minimized. It is clear that it suffices to consider points t_j in the set B = {h(x_1), . . . , h(x_m)}. 
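Both steps, partitioning the sorted predictions and computing the empirically optimal reserve in each cell, can be sketched as follows. This is our own minimal implementation, not the paper's Algorithm 1 verbatim: the dynamic program minimizes the within-segment sum of squared deviations of the predictions, a standard 1-D k-segmentation objective used here as a stand-in for Φ_h.\n\n```python\nimport numpy as np\n\ndef seg_cost(prefix, prefix_sq, i, j):\n    # Sum of squared deviations of values[i:j] from their mean,\n    # computed in O(1) from prefix sums.\n    n = j - i\n    s = prefix[j] - prefix[i]\n    sq = prefix_sq[j] - prefix_sq[i]\n    return sq - s * s / n\n\ndef cluster_by_predictions(h_sorted, k):\n    # O(k m^2) dynamic program: split the sorted predictions into k\n    # contiguous segments minimizing total within-segment squared deviation.\n    m = len(h_sorted)\n    prefix = np.concatenate([[0.0], np.cumsum(h_sorted)])\n    prefix_sq = np.concatenate([[0.0], np.cumsum(np.square(h_sorted))])\n    INF = float("inf")\n    cost = [[INF] * (m + 1) for _ in range(k + 1)]\n    back = [[0] * (m + 1) for _ in range(k + 1)]\n    cost[0][0] = 0.0\n    for j in range(1, k + 1):\n        for end in range(j, m + 1):\n            for start in range(j - 1, end):\n                c = cost[j - 1][start] + seg_cost(prefix, prefix_sq, start, end)\n                if c < cost[j][end]:\n                    cost[j][end], back[j][end] = c, start\n    bounds, end = [], m  # recover right-open segment ends\n    for j in range(k, 0, -1):\n        bounds.append(end)\n        end = back[j][end]\n    return sorted(bounds)\n\ndef optimal_reserve(bids):\n    # Empirically optimal monopoly reserve: maximize p * #{b_i >= p}\n    # over candidate prices p among the observed bids.\n    b = np.sort(bids)[::-1]\n    revenues = b * np.arange(1, len(b) + 1)\n    return float(b[int(np.argmax(revenues))])\n\ndef ric_h(h_pred, bids, k):\n    # Cluster by sorted predictions, then price each cluster at its\n    # empirically optimal reserve; returns thresholds and reserves.\n    order = np.argsort(h_pred)\n    h_sorted, b_sorted = h_pred[order], bids[order]\n    ends = cluster_by_predictions(h_sorted, k)\n    reserves, start = [], 0\n    for end in ends:\n        reserves.append(optimal_reserve(b_sorted[start:end]))\n        start = end\n    thresholds = [(h_sorted[e - 1] + h_sorted[e]) / 2 for e in ends[:-1]]\n    return thresholds, reserves\n```\n\nAt evaluation time a new point x is priced at the reserve of the segment its prediction h(x) falls into, so only h, the thresholds, and the k reserves need to be stored.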
With this observation, a simple dynamic program leads to a polynomial time algorithm with an O(km^2) running time (see Appendix C).\n\n7 Experiments\n\nWe now compare the performance of our algorithm against the following baselines:\n\n1. The offset algorithm presented in Section 3, where instead of using the theoretical offset η^{2/3} we find the optimal offset t maximizing the empirical revenue Σ_{i=1}^m (h(x_i) - t) 1_{h(x_i) - t ≤ b_i}.\n2. The DC algorithm introduced by Mohri and Medina [2014], which represents the state of the art in learning a revenue-optimal reserve price.\n\nSynthetic data. We begin by running experiments on synthetic data to demonstrate the regimes where each algorithm excels. We generate feature vectors x_i ∈ R^{10} with coordinates sampled from a mixture of lognormal distributions with means μ_1 = 0, μ_2 = 1, variances σ_1 = σ_2 = 0.5, and mixture parameter p = 0.5. Let 1 ∈ R^d denote the vector with all entries set to 1. Bids are generated according to two different scenarios:\n\nLinear Bids: b_i generated according to b_i = max(x_i^T 1 + β_i, 0), where β_i is a Gaussian random variable with mean 0 and standard deviation σ ∈ {0.01, 0.1, 1.0, 2.0, 4.0}.\nBimodal Bids: b_i generated according to the following rule: let s_i = max(x_i^T 1 + β_i, 0); if s_i > 30 then b_i = 40 + α_i, otherwise b_i = s_i. Here α_i has the same distribution as β_i.\n\nThe linear scenario demonstrates what happens when we have a good estimate of the bids. The bimodal scenario models a buyer who for the most part bids as a continuous function of the features, but who is interested in a particular set of objects (for instance, retargeting buyers in online advertisement) for which she is willing to pay a much higher price.\n\nFigure 1: (a) Mean revenue of the three algorithms on the linear scenario. 
(b) Mean revenue of the three algorithms on the bimodal scenario. (c) Mean revenue on auction data.\n\nFor each experiment we generated a training dataset S_train, a holdout set S_holdout, and a test set S_test, each with 16,000 examples. The function h used by RIC-h and the offset algorithm is found by training a linear regressor over S_train. For efficiency, we ran the RIC-h algorithm on quantizations of the predictions h(x_i). Quantized predictions belong to one of 1000 buckets over the interval [0, 50]. Finally, the choice of the hyperparameters γ for the Lipschitz loss and k for the clustering algorithm was done by selecting the best performing parameter over the holdout set. Following the suggestions in [Mohri and Medina, 2014] we chose γ ∈ {0.001, 0.01, 0.1, 1.0} and k ∈ {2, 4, . . . , 24}.\n\nFigure 1(a),(b) shows the average revenue of the three approaches across 20 replicas of the experiment as a function of the log of σ. Revenue is normalized so that the DC algorithm revenue is 1.0 when σ = 0.01. The error bars at one standard deviation are indistinguishable in the plot. It is not surprising to see that in the linear scenario, the DC algorithm of [Mohri and Medina, 2014] and the offset algorithm outperform RIC-h under low noise conditions. Both algorithms will recover a solution close to the true weight vector 1. In this case the offset is minimal, thus recovering virtually all revenue. On the other hand, even if we set the optimal reserve price for every cluster, the inherent variance of each cluster makes us leave some revenue on the table. Nevertheless, notice that as the noise increases, all three algorithms achieve roughly the same revenue. This is due to the fact that the variance in each cluster is comparable with the error in the prediction function h.\n\nThe results are reversed for the bimodal scenario, where RIC-h outperforms both algorithms under low noise. 
This is due to the fact that RIC-h recovers virtually all revenue obtained from high bids, while the offset and DC algorithms must set conservative prices to avoid losing revenue from lower bids.\n\nAuction data. In practice, however, neither of the synthetic regimes is fully representative of real bidding patterns. In order to fully evaluate RIC-h, we collected auction bid data from AdExchange for 4 different publisher-advertiser pairs. For each pair we sampled 100,000 examples with a set of discrete and continuous features. The final feature vectors are in R^d for d ∈ [100, 200], depending on the publisher-buyer pair. For each experiment, we extract a random training sample of 20,000 points, as well as a holdout and a test sample. We repeated this experiment 20 times and present the results in Figure 1(c), where we have normalized the data so that the performance of the DC algorithm is always 1. The error bars represent one standard deviation from the mean revenue lift. Notice that our proposed algorithm achieves on average up to a 30% improvement over the DC algorithm. Moreover, the simple offset strategy never outperforms the clustering algorithm, and in some cases achieves significantly less revenue.\n\n8 Conclusion\n\nWe provided a simple, scalable reduction of the problem of revenue optimization with side information to the well-studied problem of minimizing the squared loss. Our reduction provides the first polynomial time algorithm with a quantifiable bound on the achieved revenue. In the analysis of our algorithm we also provided the first variance-dependent lower bound on the revenue attained by setting optimal monopoly prices. Finally, we provided extensive empirical evidence of the advantages of RIC-h over the current state of the art.\n\nReferences\n\nNicol\u00f2 Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve prices in second-price auctions. IEEE Trans. 
Information Theory, 61(1):549\u2013564, 2015.\n\nShuchi Chawla, Jason D. Hartline, and Robert D. Kleinberg. Algorithmic pricing via virtual val-\nuations. In Proceedings 8th ACM Conference on Electronic Commerce (EC-2007), San Diego,\nCalifornia, USA, June 11-15, 2007, pages 243\u2013251, 2007. doi: 10.1145/1250910.1250946.\n\nRichard Cole and Tim Roughgarden. The sample complexity of revenue maximization. CoRR,\n\nabs/1502.00963, 2015.\n\nNikhil R. Devanur, Zhiyi Huang, and Christos-Alexandros Psomas. The sample complexity of\n\nauctions with side information. In Proceedings of STOC, pages 426\u2013439, 2016.\n\nPeerapong Dhangwatnotai, Tim Roughgarden, and Qiqi Yan. Revenue maximization with a single\n\nsample. Games and Economic Behavior, 91:318\u2013333, 2015.\n\nAndrew V. Goldberg, Jason D. Hartline, and Andrew Wright. Competitive auctions and digital\ngoods. In Proceedings of the Twelfth Annual Symposium on Discrete Algorithms, January 7-9,\n2001, Washington, DC, USA., pages 735\u2013744, 2001.\n\nJason D. Hartline and Tim Roughgarden. Simple versus optimal mechanisms. In Proceedings 10th\nACM Conference on Electronic Commerce (EC-2009), Stanford, California, USA, July 6\u201310,\n2009, pages 225\u2013234, 2009.\n\nRobert D. Kleinberg and Frank Thomson Leighton. The value of knowing a demand curve: Bounds\n\non regret for online posted-price auctions. In Proceedings of FOCS, pages 594\u2013605, 2003.\n\nRenato Paes Leme, Martin P\u00b4al, and Sergei Vassilvitskii. A \ufb01eld guide to personalized reserve prices.\nIn Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal,\nCanada, April 11 - 15, 2016, pages 1093\u20131102, 2016. doi: 10.1145/2872427.2883071.\n\nMehryar Mohri and Andres Mu\u02dcnoz Medina. Learning theory and algorithms for revenue optimiza-\n\ntion in second-price auctions with reserve. In Proceedings of ICML, pages 262\u2013270, 2014.\n\nMehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 
Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X, 9780262018258.\n\nJamie Morgenstern and Tim Roughgarden. On the pseudo-dimension of nearly optimal auctions. In Proceedings of NIPS, pages 136\u2013144, 2015.\n\nJamie Morgenstern and Tim Roughgarden. Learning simple auctions. In Proceedings of COLT, pages 1298\u20131318, 2016.\n\nR. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58\u201373, 1981.\n\nTim Roughgarden and Joshua R. Wang. Minimizing regret with multiple reserves. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC '16, Maastricht, The Netherlands, July 24-28, 2016, pages 601\u2013616, 2016. doi: 10.1145/2940716.2940792.\n\nMaja R. Rudolph, Joseph G. Ellis, and David M. Blei. Objective variables for probabilistic revenue maximization in second-price auctions with reserve. In Proceedings of WWW 2016, pages 1113\u20131122, 2016.\n", "award": [], "sourceid": 1160, "authors": [{"given_name": "Andres", "family_name": "Munoz", "institution": null}, {"given_name": "Sergei", "family_name": "Vassilvitskii", "institution": "Google"}]}