{"title": "Dynamic Incentive-Aware Learning: Robust Pricing in Contextual Auctions", "book": "Advances in Neural Information Processing Systems", "page_first": 9759, "page_last": 9769, "abstract": "Motivated by pricing in ad exchange markets, we consider the problem of robust learning of reserve prices against strategic buyers in repeated contextual second-price auctions. Buyers' valuations \\new{for} an item depend on the context that describes the item. However, the seller is not aware of the relationship between the context and buyers' valuations, i.e., buyers' preferences. The seller's goal is to design a learning policy to set reserve prices via observing the past sales data, and her objective is to minimize her regret for revenue, where the regret is computed against a clairvoyant policy that knows buyers' heterogeneous preferences. Given the seller's goal, utility-maximizing buyers have the incentive to bid untruthfully in order to manipulate the seller's learning policy. We propose two learning policies that are robust to such strategic behavior. These policies use the outcomes of the auctions, rather than the submitted bids, to estimate the preferences while controlling the long-term effect of the outcome of each auction on the future reserve prices. The first policy called Contextual Robust Pricing (CORP) is designed for the setting where the market noise distribution is known to the seller and achieves a T-period regret of $O(d\\log(Td) \\log (T))$, where $d$ is the dimension of {the} contextual information. The second policy, which is a variant of the first policy, is called Stable CORP (SCORP). This policy is tailored to the setting where the market noise distribution is unknown to the seller and belongs to an ambiguity set. 
We show that the SCORP policy has a T-period regret of $O(\\sqrt{d\\log(Td)}\\;T^{2/3})$.", "full_text": "Dynamic Incentive-aware Learning: Robust Pricing\n\nin Contextual Auctions\n\nNegin Golrezaei\n\nSloan School of Management\n\nMassachusetts Institute of Technology\n\nCambridge, MA\n\ngolrezae@mit.edu\n\nAdel Javanmard\n\nData Sciences and Operations Department\n\nUniversity of Southern California\n\nLos Angeles, CA\n\najavanma@usc.edu\n\nVahab Mirrokni\nGoogle Research\nNew York, NY\n\nmirrokni@google.com\n\nAbstract\n\nMotivated by pricing in ad exchange markets, we consider the problem of robust\nlearning of reserve prices against strategic buyers in repeated contextual second-\nprice auctions. Buyers\u2019 valuations for an item depend on the context that describes\nthe item. However, the seller is not aware of the relationship between the context\nand buyers\u2019 valuations, i.e., buyers\u2019 preferences. The seller\u2019s goal is to design\na learning policy to set reserve prices via observing the past sales data, and her\nobjective is to minimize her regret for revenue, where the regret is computed\nagainst a clairvoyant policy that knows buyers\u2019 heterogeneous preferences. Given\nthe seller\u2019s goal, utility-maximizing buyers have the incentive to bid untruthfully\nin order to manipulate the seller\u2019s learning policy. We propose two learning\npolicies that are robust to such strategic behavior. These policies use the outcomes\nof the auctions, rather than the submitted bids, to estimate the preferences while\ncontrolling the long-term effect of the outcome of each auction on the future reserve\nprices. The \ufb01rst policy called Contextual Robust Pricing (CORP) is designed for\nthe setting where the market noise distribution is known to the seller and achieves a\nT-period regret of O(d log(T d) log(T )), where d is the dimension of the contextual\ninformation. 
The second policy, which is a variant of the \ufb01rst policy, is called\nStable CORP (SCORP). This policy is tailored to the setting where the market\nnoise distribution is unknown to the seller and belongs to an ambiguity set. We\n\nshow that the SCORP policy has a T-period regret of O((cid:112)d log(T d) T 2/3).\n\n1\n\nIntroduction\n\nIn many online marketplaces, both sides of the market have access to rich dynamic contextual\ninformation about the products being sold over time. On the buy side, such information can in\ufb02uence\nthe willingness-to-pay of the buyers for the products, potentially in a heterogeneous way. On the sell\nside, the information can help the seller differentiate the products and set contextual and possibly\npersonalized prices. To do so, the seller needs to learn the impact of this information on buyers\u2019\nwillingness-to-pay. Such contextual learning can be challenging for the seller when there are repeated\ninteractions between the buy and the sell sides. With repeated interactions, the utility-maximizing\nbuyers may have the incentive to act strategically and trick the learning policy of the seller into\nlowering the prices. Motivated by this, our key research question is as follows: How can the seller\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fdynamically optimize (personalized) prices in a robust manner, taking into account the strategic\nbehavior of the buyers?\nOne of the online marketplaces that faces this problem is online advertising market. In this market,\na prevalent approach to sell online ads is via running real-time second-price auctions in which the\nadvertisers can use an abundance of detailed contextual information before deciding what to bid. In\nthis practice, advertisers can target Internet users based on their (heterogeneous) preferences and\ntargeting criteria. 
Targeting can create a thin and uncompetitive market in which few advertisers\nshow an interest in each auction. In such a thin market, it is crucial for the ad exchanges to effectively\noptimize the reserve prices in order to boost their revenue. However, learning the optimal reserve\nprices is rather dif\ufb01cult due to frequent interactions between advertisers and ad exchanges.\nInspired by this environment, we study a model in which a seller runs repeated second-price auctions\nwith reserve over time. In each auction, an item is being sold to at most one of the buyers. The\nvaluation (willingness-to-pay) of each buyer for the item in period t, which is his private information,\ndepends on an observable d-dimensional contextual information in that period and his preference\nvector. We focus on an important special case of this contextual-based valuation model in which\nthe buyer\u2019s value is a linear function of his preference vector and contextual information plus some\nrandom noise term, where the noise models the impact of contexts that are not measured/observed by\nthe seller. The preference vector, which is unknown to the seller, varies across the buyers. Thus, the\npreference vectors capture heterogeneity in buyers\u2019 valuation.\nThe seller\u2019s goal is to design a policy that dynamically learns/optimizes personalized reserve prices.\nThe buyers are fully aware of the learning policy used by the seller and act strategically in order\nto maximize their (time-discounted) cumulative utility. Dealing with such a strategic population of\nbuyers, the seller aims at extracting as much revenue as the clairvoyant policy that is cognizant of the\npreference vectors a priori. These vectors determine the relationship between the valuation of the\nbuyers and contextual information. 
Put differently, the seller would like to minimize her regret where\nthe regret is de\ufb01ned as the difference between the seller\u2019s revenue and that under the clairvoyant\npolicy. Note that the clairvoyant policy provides a strong benchmark because the policy posts the\noptimal personalized reserve prices based on the observed contexts.\nAs stated earlier, one of the main hurdles in designing a low-regret learning policy in this setting is\nthe frequent interactions between the seller and the buyers. Due to such interactions, the strategic\nbuyers might have the incentive to bid untruthfully. This way, they may sacri\ufb01ce their short-term\nutility in order to deceive the seller to post them lower future reserve prices. Thus, while a single shot\nsecond-price auction is a truthful mechanism, repeated second-price auctions in which the seller aims\nat dynamically learning optimal reserve prices of strategic and utility-maximizing buyers may not be\ntruthful. The untruthful bidding behavior of the buyers makes it hard for the seller to learn the optimal\nreserve prices, and this, in turn, can lead to her revenue loss. This highlights the necessity to design a\nrobust learning policy that reduces buyers\u2019 incentive to follow untruthful strategy. Beside this hurdle,\nthe availability of the dynamic contextual information requires the seller to change the reserve prices\ndynamically over time based on the contextual information. To do so, the seller needs to learn how\nbuyers react to such information and based on the reactions, posts (dynamic) personalized reserve\nprices.\nWe consider setting where the seller (\ufb01rm) is more patient than the buyer. We formalize it by\nconsidering time-discounted utility for the buyers. This is motivated by various applications. For\nexample, in online advertisement markets, the advertisers (buyers) who retarget Internet users prefer\nshowing their ads to the users who visited their website sooner rather than later. 
In this paper, we\npropose two learning policies. The \ufb01rst policy, that we call Contextual Robust Pricing (CORP), is\ntailored to a setting where the distribution of the noise term in buyers\u2019 valuation is known to the\nseller. We will refer to this noise as market or valuation noise. Our CORP policy gets the cumulative\nT-period regret of order O(d log(T d) log(T )), where the regret is computed against the clairvoyant\npolicy that knows the preference vectors as well as the market noise distribution. The second policy,\ncalled Stable CORP (SCORP), is a variant of the \ufb01rst policy. This policy lends itself to a setting\nwhere the distribution of the market noise is unknown and belongs to an ambiguity set. In this setting,\nthe seller does not have the intention of learning the market noise distribution. She instead would like\nto design a learning policy that is robust to the uncertainty in the noise distribution. SCORP achieves\n\nthe T-period regret of order O((cid:112)d log(T d) T 2/3). Here, we highlight two important aspects of these\n\npolicies. First, they have an episodic structure and update the estimate of preference vectors only at\n\n2\n\n\fthe beginning of each episode. Such design make the policy robust by restricting the future effect\nof the submitted bids. Speci\ufb01cally, bids in an episode are not used in choosing the reserve prices\nuntil the beginning of the next episode. Therefore, there is always a delay until a buyer observes\nthe effect of a bid on reserves. Then, considering the fact that buyers are impatient and discount the\nfuture, they are less incentivized to bid untruthfully. The second important aspect of the policies that\nensure robustness is their estimation method of the buyer\u2019s preference vectors. 
Rather than using the\nsubmitted bids to estimate the preference vectors, the policy simply uses the outcome of the auctions.\nBecause of this feature of the policy, bidding untruthfully does not always result in lower reserve\nprices; Instead, it can impact the future reserve prices of a buyer only when it leads to changing the\noutcome of an auction, i.e., when a buyer loses an auction due to underbidding or a buyer wins an\nauction due to overbidding.\n2 Related Work\nThere is a growing body of research on dynamic pricing with learning. Of necessity, we do not\nprovide a complete set of references, and instead refer the reader to [12] for an in-depth survey on\nthis area. In the following we discuss the literature that is closely related to our setting. Recently,\nseveral works considered the problem of dynamic pricing in a contextual setting, with non-strategic\nbuyers. [10] studied this problem when the demand function follows the logit model and proposed\nan ML-based learning algorithm. [24, 11], and [25] proposed a learning algorithm based on the\nbinary search method when the demand function is linear and deterministic. In their models, buyers\nhave homogenous preference vectors and are non-strategic. Hence, the problem reduces to a single\nbuyer setting, where the buyer acts myopically, i.e., the buyer does not consider the impact of the\ncurrent actions on the future prices. There is also a new line of literature that studied dynamic pricing\nwith demand learning when the contextual information is high dimensional (but sparse); see [19, 6].\nSimilar problems have been investigated in [18] (assuming varying coef\ufb01cient valuation models)\nand [20] (considering a setting where multiple products are offered at each round).\nAs mentioned earlier, in our setting, the seller repeatedly interacts with a small number of strategic and\nheterogeneous buyers. 
We note that [13] presented empirical evidence that showed buyers in online\nadvertising markets act strategically. The work [1, 29, 22] examined the problem of dynamic pricing\nwith strategic buyers in a non-contextual environment. In [1, 29], the seller repeatedly interacts with a\nsingle strategic buyer via a posted-price mechanism. Similar to our setting, the seller is more patient\nthan the buyer in a sense that the buyer discounts his future utility. [1] showed that no learning\nalgorithm can obtain a sub-linear regret when the buyer is as patient as the seller. In addition, via\ndesigning learning policies, they demonstrated that the seller can get a sub-linear regret bound when\nthe buyer is less patient. [22] studied dynamic pricing when a group of strategic buyers competes with\neach other in repeated non-contextual second-price auctions. Further, it is assumed that products to\nbe sold are ex-ante identical, and that buyers are homogenous and their valuations are all drawn from\na single distribution, which is unknown to the seller. With respect to the homogeneity assumption,\nwe point out that there exists empirical evidence that buyers are indeed heterogeneous [16, 21, 15].\nIt is not surprising that the heterogeneity in the markets makes the design of selling mechanisms\nmore dif\ufb01cult. In addition, such dif\ufb01culties get more severe when the seller needs to design dynamic\nselling mechanisms for a group of strategic buyers that compete with each other repeatedly.\nRecently, [26] studied a similar problem in a static non-contextual setting with strategic buyers.\nAssuming that the market power of each buyer is negligible, they design a mechanism that incentivizes\nthe buyers to be truthful in the \ufb01rst place, by using techniques from differential privacy [28]. Closer\n\u221a\nto the spirit of this paper, [2] studies the problem of pricing inventory in a repeated posted-price\nauction. 
The authors propose a pricing algorithm whose regret is in the order of O(\nlog T T 2/3) in a\ncontextual setting, against a strategic buyer. 1 We point out that our regret result improves upon [2] in\nthe following directions: (i) We allow for market noise in our model, whereas [2] considers noiseless\nsetting which posits that buyer\u2019s valuation is given as a linear function of features. By adding the\nnoise component, we make the model richer. When the noise distribution is known, our CORP\npolicy obtains a T-period regret of O(d log(T d) log(T )). In addition, when the noise distribution is\nunknown, our SCORP policy, which is doubly robust against strategic buyers and the uncertainty in\n\nthe noise distribution, obtains a T-period regret of O((cid:112)d log(T d) T 2/3). (ii) We consider a market\n\nof strategic buyers who participate in a second-price auction at each round, while [2], motivated by\ntargeting in online advertising, considers a single buyer case. Note that in case of a single buyer, there\n\n1Dependency on d is hidden in the big-O notation.\n\n3\n\n\fis no notion of bid, as the buyer only needs to decide if he is willing to get the item at the posted price.\nBy contrast, in a market of buyers, each submitted bid of a buyer can potentially affect the utility of\nthat buyer (instant and long-term utility), other buyers\u2019 utilities and the seller\u2019s revenue.2\n3 Model\nBefore we describe the model, we adopt some notation that will be used throughout the paper. For an\ninteger a, we write [a] = {1, 2, . . . , a}. In addition, for a vector v \u2208 Rd, we denote its jth coordinates\nj=1 ujvj\n\nby vj, for j \u2208 [d], and indicate its (cid:96)2 norm by (cid:107)v(cid:107). For two vectors v, u \u2208 Rd, (cid:104)u, v(cid:105) =(cid:80)d\n\nrepresents their inner product.\nWe consider a \ufb01rm who runs repeated second-price auctions with personalized reserve over a \ufb01nite\ntime horizon with length T . 
In each period t \u2265 1, the \ufb01rm would like to sell an item to one of N\nbuyers. The item in period t is represented by an observable feature (context) vector denoted by\nxt \u2208 Rd. We assume that the features are drawn independently from a \ufb01xed distribution D, with a\nbounded support X \u2286 Rd. Note that the length of the time horizon T and distribution D are unknown\nto the \ufb01rm. For the sake of normalization and without loss of generality, we assume that (cid:107)xt(cid:107) \u2264 1,\nand hence take X = {x \u2208 Rd : (cid:107)x(cid:107) \u2264 1}. We let \u03a3x = E[xtxT\nt ] be the second moment matrix of\ndistribution D, and assume that \u03a3x is a positive de\ufb01nite matrix, where \u03a3x is unknown to the \ufb01rm.\nFor buyers\u2019 valuations, we consider a feature-based model that captures heterogeneity among the\nbuyers. In the following, we discuss the speci\ufb01cs of the valuation model. Valuation of buyer i \u2208 [N ]\nfor an item in period t \u2265 1 depends on the feature vector xt and period t and is denoted by vit(xt).\nWe assume that vit(xt) is a linear function of a preference vector \u03b2i and the feature vector xt:\n\n(1)\nWhenever it is clear from the context, we may remove the dependency of valuation vit(xt) on the\nfeature vector xt and denote it by vit. Here, \u03b2i \u2208 Rd represents the buyer i\u2019s preference vector,\nand for the sake of normalization, we assume that (cid:107)\u03b2i(cid:107) \u2264 Bp, i \u2208 [N ], where Bp is a constant.\nThe terms zit\u2019s, i \u2208 [N ], t \u2265 1, which are independent of the feature vector xt, are idiosyncratic\nshocks and are referred to as noise. The noise terms are drawn independently and identically from a\nmean zero distribution F : [\u2212Bn, Bn] \u2192 [0, 1] with density f : [\u2212Bn, Bn] \u2192 R+, where Bn is a\nconstant.3 We assume that the \ufb01rm knows the distribution of the noise F . We relax this assumption\nlater in Section 5. 
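As a concrete illustration of the valuation model in Eq. (1), the following sketch draws features on the unit ball, preference vectors with norm at most $B_p$, and mean-zero uniform noise on $[-B_n, B_n]$. The uniform distribution is only one admissible choice of $F$, and the function name and numeric defaults here are ours, not the paper's:

```python
import numpy as np

def simulate_valuations(n_buyers, d, T, Bp=1.0, Bn=0.5, seed=0):
    """Simulate the linear valuation model v_it = <x_t, beta_i> + z_it (Eq. (1)).

    Features x_t lie in the unit ball (||x_t|| <= 1), preference vectors
    satisfy ||beta_i|| <= Bp, and the market noise z_it is mean-zero
    uniform on [-Bn, Bn]. All distributional choices are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Preference vectors, rescaled so that ||beta_i|| <= Bp.
    beta = rng.normal(size=(n_buyers, d))
    beta *= Bp / np.linalg.norm(beta, axis=1, keepdims=True)
    # Feature vectors, projected into the unit ball.
    x = rng.normal(size=(T, d))
    x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)
    # Idiosyncratic market noise, independent of the features.
    z = rng.uniform(-Bn, Bn, size=(T, n_buyers))
    v = x @ beta.T + z  # v[t, i] = <x_t, beta_i> + z_it
    return x, beta, v
```

By construction, every valuation is bounded by $B = B_p + B_n$, matching the normalization in the text.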
Note that the valuation of buyer i, vit, is not known to the \ufb01rm, as the preference\nvector \u03b2i and realization of the noise zit are not observable to her. In addition, by our normalization,\nvit(xt) \u2264 B, with B = Bp + Bn.\nWe make the following assumption on distribution of the noise F .\nAssumption 3.1 (Log-concavity). F (z) and 1 \u2212 F (z) are log-concave in z \u2208 [\u2212Bn, Bn].\nAssumption 3.1, which is prevalent in the economics literature [5], holds by several common proba-\nbility distributions including uniform, and (truncated) Laplace, exponential, and logistic distributions.\nA few remarks are in order regarding Assumption 3.1. If distribution F is log-concave and its density\nf is symmetric, i.e., f (z) = f (\u2212z), then 1 \u2212 F (z) = F (\u2212z) is also log-concave. Moreover, if\ndensity f is log-concave, the distribution F is also log-concave [8]. This implies that Assumption 3.1\nis satis\ufb01ed when density f is symmetric and log-concave. We also point out that if a distribution has\na monotone hazard rate (MHR), i.e., 1\u2212F (z)\nis decreasing in z, then 1 \u2212 F (z) is log-concave. This\npoint, in turn, shows that all MHR and symmetric distributions satisfy Assumption 3.1.\nWe next describe the repeated second-price auctions and discuss the \ufb01rm\u2019s problem. The goal of the\n\ufb01rm is to maximize the cumulative expected revenue in repeated second-price auctions. The \ufb01rm\ntries to achieve this by choosing reserves in a dynamic and personalized manner.\n\nf (z)\n\nvit(xt) = (cid:104)xt, \u03b2i(cid:105) + zit\n\ni \u2208 [N ], t \u2265 1 .\n\n3.1 Second-price Auctions with Dynamic Personalized Reserves\nBefore de\ufb01ning a second-price auction, we need to establish some notation. For buyer i \u2208 [N ] and\nperiod t \u2265 1, we let pit be the payment from buyer i in period t. 
Further, let $q_{it}$ be the allocation variable: $q_{it} = 1$ if the item in period $t$ is allocated to buyer $i$ and is zero otherwise. We also let $b_{it}$ be the bid submitted by buyer $i$ and $r_{it}$ be the reserve price posted by the firm for buyer $i$ in period $t$. We define $b_t = (b_{1t}, \dots, b_{Nt})$ and $r_t = (r_{1t}, \dots, r_{Nt})$ as the vectors of bids and reserves in period $t$, respectively. Moreover, we denote by $H_\tau$ the history set observed by the firm up to period $\tau$. This set includes buyers' bids and reserve prices for all $t < \tau$:

$H_\tau = \{(r_1, b_1), \dots, (r_{\tau-1}, b_{\tau-1})\}$.   (2)

Below, we explain the details of the second-price auction with reserve. In period $t \ge 1$,
• The firm observes the feature vector $x_t \sim D$. In addition, each buyer $i \in [N]$ learns his valuation $v_{it}$, defined in Eq. (1).
• For each buyer, the firm computes the reserve price $r_{it}$ as a function of the history set $H_t$.
• Each buyer $i \in [N]$ submits a bid of $b_{it}$.
• Let $i^\star = \arg\max_{i \in [N]} \{b_{it}\}$. If $b_{i^\star t} \ge r_{i^\star t}$, then the item is allocated to buyer $i^\star$, and we have $q_{i^\star t} = 1$. In case of a tie, the item is allocated uniformly at random to one of the buyers among those with the highest bid. For all buyers who do not get the item, we have $q_{it} = 0$.
• For each buyer $i$, if he gets the item ($q_{it} = 1$), then he pays $p_{it} = \max\{r_{it}, \max_{j \ne i}\{b_{jt}\}\}$. Otherwise, $p_{it} = 0$.

To lighten the notation, we henceforth use the following shorthands. For each period $t$, we let $b^+_t$ and $b^-_t$ respectively denote the highest and second highest bids. Likewise, we define $v^+_t$ and $v^-_t$ as the highest and second highest valuations in period $t$. We also let $r^+_t$ be the reserve price of the buyer with the highest bid. Therefore, $b_{i^\star t} = b^+_t$, $r_{i^\star t} = r^+_t$, and the firm receives a payment of $\max\{r^+_t, b^-_t\}$ if the item gets allocated and zero otherwise. We assume that for all periods $t$, $b^+_t \le M$ for some constant $M > 0$. In words, buyers submit bounded bids.

The firm's decision in any period $t \ge 1$ is to find optimal reserve prices $r_{it}$, $i \in [N]$, and her objective is to maximize her (cumulative) expected revenue. Note that the revenue of the firm is the total payment she collects from the buyers over the length of the time horizon. Let

$\mathrm{rev}_t = E\big[\sum_{i \in [N]} p_{it} q_{it}\big] = E\big[\max\{b^-_t, r^+_t\}\, I(b^+_t \ge r^+_t)\big]$   (3)

be the expected revenue of the firm in period $t \ge 1$, where the expectation is w.r.t. the noise distribution $F$, the feature distribution $D$, and any randomness in the bidding strategy of the buyers and the learning policy used by the firm. Then, the total revenue of the firm is given by $\sum_{t=1}^T \mathrm{rev}_t$. Maximizing the firm's revenue is equivalent to minimizing her regret, where the regret is defined as the difference between the firm's revenue and the maximum expected revenue that the firm could earn if she knew the preference vectors $\{\beta_i\}_{i \in [N]}$. In the next section, we will formally define the firm's regret.

2 Section 5 in [2] considers an extension to the multiple buyers case but assumes that the highest valuation in each period $t$ can be written as $\langle x_t, \beta \rangle$ for a fixed parameter vector $\beta$ and product feature (context) $x_t$, which we find to be a strong assumption.

3 The noise aims at capturing features that are not observed/measured by the firm.

3.2 Benchmark and Firm's Regret

When the preference vectors and the noise distribution $F$ are known, to set the optimal reserves $r_{it}$, the benchmark policy does not need any knowledge from the history set $H_t$.
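The allocation and payment rules above can be sketched for a single period as follows. Ties are broken by lowest index here, rather than uniformly at random as in the paper, and the function name is ours:

```python
def run_auction(bids, reserves):
    """One second-price auction with personalized reserves.

    Returns (winner, payment): winner is the index of the highest bidder
    if his bid clears his own reserve, else None (no allocation). The
    winner pays max{his reserve, highest competing bid}; all other
    buyers pay zero.
    """
    i_star = max(range(len(bids)), key=lambda i: bids[i])
    if bids[i_star] < reserves[i_star]:
        return None, 0.0  # item not allocated, firm collects nothing
    b_minus = max((b for j, b in enumerate(bids) if j != i_star), default=0.0)
    return i_star, max(reserves[i_star], b_minus)
```

The firm's per-period payment is thus $\max\{r^+_t, b^-_t\}$ whenever $b^+_t \ge r^+_t$, and zero otherwise, consistent with the expected revenue expression in Eq. (3).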
Thus, with the knowledge of the preference vectors, all buyers are incentivized to bid truthfully against the benchmark policy. This is the case because single-shot second-price auctions are strategy-proof [30]. We next characterize the benchmark policy. Let $r^\star_{it}$ be the reserve of buyer $i$ in period $t$ posted by the benchmark policy and, following our convention, we denote by $r^{\star+}_t$ the reserve price of the buyer with the highest bid.

Proposition 3.2 (Benchmark). If the firm knows the preference vectors $\{\beta_i\}_{i \in [N]}$, then the optimal reserve price of buyer $i \in [N]$ for a feature vector $x \in X$ is given by

$r^\star_i(x) = \arg\max_y \big\{ y \big(1 - F(y - \langle x, \beta_i \rangle)\big) \big\}, \quad i \in [N],\; x \in X$,   (4)

and hence $r^\star_{it} = r^\star_i(x_t)$. In addition, in any period $t \ge 1$, the benchmark expected revenue is given by

$\mathrm{rev}^\star_t = E\big[\max\{v^-_t, r^{\star+}_t\}\, I(v^+_t \ge r^{\star+}_t)\big]$,   (5)

where the expectation is w.r.t. the noise distribution $F$ and the feature distribution $D$.

We refer to Appendix E for the proof of Proposition 3.2. We remark that the benchmark revenue $\mathrm{rev}^\star_t$ is measured against truthful buyers, while the firm's revenue under our policy is measured against strategic buyers who may not necessarily follow the truthful strategy. Observe that the optimal reserve price of buyer $i$ in period $t$, denoted by $r^\star_{it}$, solves the following optimization problem

$r^\star_{it} = \arg\max_y \{ y \cdot P(v_{it}(x_t) \ge y) \} = \arg\max_y \{ y \cdot P(\langle x_t, \beta_i \rangle + z_{it} \ge y) \}$.

This shows that the optimal reserve price of buyer $i$ does not depend on the number of buyers participating in the auction or their preference vectors.
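The benchmark reserve in Eq. (4) can be approximated by a simple grid search over the posted price. The sketch below instantiates $F$ as the uniform CDF on $[-B_n, B_n]$, which is one admissible log-concave choice rather than anything prescribed by the paper; the function name and grid size are ours:

```python
import numpy as np

def optimal_reserve(m, Bn=1.0, grid=10001):
    """Grid-search the benchmark reserve of Eq. (4):
    r*(x) = argmax_y  y * (1 - F(y - m)),  with m = <x, beta_i>.

    F is taken to be the uniform CDF on [-Bn, Bn], an illustrative
    log-concave choice. For y outside [m - Bn, m + Bn], the expected
    revenue y * (1 - F(y - m)) is dominated by a point inside, so the
    search is restricted to that interval.
    """
    def F(u):
        return np.clip((u + Bn) / (2.0 * Bn), 0.0, 1.0)
    ys = np.linspace(m - Bn, m + Bn, grid)
    rev = ys * (1.0 - F(ys - m))  # expected revenue of posting reserve y
    return float(ys[np.argmax(rev)])
```

For this uniform choice of $F$ with $B_n = 1$, the objective $y(1 + m - y)/2$ is maximized at $y = (1 + m)/2$, which gives a quick sanity check on the grid search.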
In other words, in (lazy) second-price auctions,\nwhen the preference vectors are known to the \ufb01rm, the problem of optimizing reserve prices can be\ndecoupled. Because of this, the benchmark, de\ufb01ned in Proposition 3.2, has a simple structure: For\nany feature vector x \u2208 X , the optimal reserve price of buyer i, r(cid:63)\ni (x), only depends on \u03b2i and feature\nx, and is independent of \u03b2j, j (cid:54)= i.\nHaving de\ufb01ned the benchmark, we are now ready to formally de\ufb01ne the regret of a \ufb01rm\u2019s policy \u03c0.\nConsider a policy \u03c0 that posts a vector of reserve prices r\u03c0\nN t), as a function of history\nset Ht observed by the \ufb01rm. Suppose that the buyers submit bids of bt = (b1t, . . . , bN t), t \u2265 1,\nwhere bt may not be equal to the vector of valuations vt = (v1t, . . . , vN t). The submitted bid of\nbuyer i, bit, can depend on the learning policy used by the \ufb01rm, context xt, his valuation vit, and\nhistory Hit, where\n\n1t, . . . , r\u03c0\n\nt = (r\u03c0\n\nHit = {(vi1, bi1, qi1, pi1), . . . , (vi(t\u22121), bi(t\u22121), qi(t\u22121), pi(t\u22121))}.\n\nRecalling our notation, we write r\u03c0+\nto denote the reserve price, set by policy \u03c0, of the buyer with\nthe highest bid in period t. Then, the expected revenue of the \ufb01rm under policy \u03c0 in period t reads as\n(6)\nwhere expectation is w.r.t. to the noise distribution F , feature distribution D, and any randomness in\nbidding strategy of the buyers. Then, the worst-case cumulative regret of policy \u03c0 is de\ufb01ned by\n\nt = E(cid:2) max{b\u2212\n\nt \u2265 r\u03c0+\n\nt }I(b+\n\n)(cid:3) ,\n\nt , r\u03c0+\n\nrev\u03c0\n\nt\n\nt\n\nt ) : (cid:107)\u03b2i(cid:107) \u2264 Bp, for i \u2208 [N ], supp(D) \u2286 X(cid:111)\n\nt \u2212 rev\u03c0\n\n(rev(cid:63)\n\nReg\u03c0(T ) = max\n\n(7)\nNote that the regret of the policy \u03c0 is not a function of the feature distribution D and the feature\nvectors {\u03b2i}i\u2208[N ]. 
That is, we compute the regret of the policy \u03c0 against the worst feature distribution\nD and preference vectors {\u03b2i}i\u2208[N ]. In the next section, we discuss buyers\u2019 bidding behavior.\n\nt=1\n\n.\n\n(cid:110) T(cid:88)\n\nUi = (cid:80)\u221e\n\n3.3 Utility-maximizing Buyers\nWe assume that each buyer i \u2208 [N ] is risk neutral and aims at maximizing his (time-discounted)\ncumulative expected utility. The utility of buyer i in period t \u2265 1 with valuation vit is given by\nuit = vitqit\u2212pit. Note that through the allocation variables qit, utility uit, depends on the submitted\nbids of all the buyers, bt, and their reserve price rt used by the \ufb01rm.\nEach buyer i would like to maximize his time-discounted cumulative utility, which is de\ufb01ned as\nt=1 \u03b3tE[uit], where \u03b3 \u2208 (0, 1) is a discount factor. The discount factor highlights the fact\nthat the \ufb01rm is more patient than the buyers. For instance, in online advertising markets, advertisers\nare willing to show their ads to the users who just visited their websites.4 As another example,\nin cloud computing markets, the consumers would like to access enough capacity whenever they\nneed it [7]. We note that [1] showed that it is impossible to get a sub-linear regret when buyers are\nutility-maximizer and do not discount their future utilities.\nAll buyers fully know the learning policy that the \ufb01rm is using to set the reserves.5 Armed with this\nknowledge, buyers can potentially increase their future utility they earn via bidding untruthfully.\nParticularly, a buyer can underbid (shade) his bid by submitting bid bit < vit, or he can overbid by\nsubmitting bid bit > vit. Both shading and overbidding can potentially impact the \ufb01rms\u2019 estimate\nof preference vectors of the buyers and this, in turn, can hurt the \ufb01rms\u2019 revenue. 
However, shading\ncan lead to a utility loss in the current period, as by shading, the buyer may lose an auction that he\nwould have won by bidding truthfully. Similarly, overbidding can result in a utility loss in the current\nperiod, as by overbidding the buyer might end up paying more than his valuation.\n\n4Such a practice is known as retargeting [2, 15].\n5This assumption is inspired by the literature on the behavior-based pricing where it is shown that the \ufb01rm\n\ncan earn more revenue by committing to a pricing strategy [17, 31]. See also [3, 4] for a similar insight.\n\n6\n\n\fFigure 1: Schematic representation of the CORP policy. The dark blue rectangles show the random exploration periods.\n\n4 CORP: A Contextual Robust Pricing Policy\nIn this section, we present our learning policy. The description of the policy is provided in Table\n1. For reader\u2019s convenience, we also provide a schematic representation of CORP in Figure 1. The\npolicy works in an episodic manner. It tries to learn the preference vectors by using Maximum\nLikelihood Estimation (MLE) and meanwhile sets the reserve prices based on its current estimates\nof the preference vectors. Episodes are indexed by k = 1, 2, . . ., where the length of each episode,\ndenoted by (cid:96)k, is given by 2k\u22121. Thus, episode k starts in period (cid:96)k = 2k\u22121 and ends in period\n(cid:96)k+1 \u2212 1 = 2k \u2212 1. Note that the length of episodes increases exponentially with k. Throughout, we\nuse notation Ek to refer to periods in episode k, i.e., Ek \u2261 {(cid:96)k, . . . , (cid:96)k+1 \u2212 1}.\nAt the beginning of each episode k, we estimate the preference vectors of the buyers using the\noutcome of the auctions (qit\u2019s) in the pervious episode, i.e., episode k \u2212 1, and we do not change our\n\nestimates during episode k. Let(cid:98)\u03b2ik be the estimated preference vector of buyer i at the beginning of\nepisode k. 
Then, $\hat\beta_{ik}$ solves the following optimization problem:
$$\hat\beta_{ik} = \arg\min_{\|\beta\| \le B_p} L_{ik}(\beta)\,, \qquad i \in [N]\,, \quad (8)$$
where
$$L_{ik}(\beta) = -\frac{1}{\ell_{k-1}} \sum_{t \in E_{k-1}} \Big\{ q_{it} \log\big(1 - F(\max\{b^+_{-it}, r_{it}\} - \langle x_t, \beta\rangle)\big) + (1 - q_{it}) \log\big(F(\max\{b^+_{-it}, r_{it}\} - \langle x_t, \beta\rangle)\big) \Big\} \quad (9)$$
is the negative of the log-likelihood function. Here, $b^+_{-it}$ refers to the maximum bid of the buyers other than buyer $i$ in period $t$; that is, $b^+_{-it} = \max_{j \ne i} b_{jt}$. Thus, buyer $i$ wins the auction in period $t$ if and only if $b_{it} > \max\{b^+_{-it}, r_{it}\}$. Similarly, we define $v^+_{-it} = \max_{j \ne i} v_{jt}$. Note that $F(\max\{b^+_{-it}, r_{it}\} - \langle x_t, \beta\rangle)$ is the probability of the event $\langle x_t, \beta\rangle + z_{it} \le \max\{b^+_{-it}, r_{it}\}$, which is the probability that buyer $i$ does not win the item at time $t$ upon bidding truthfully. The log-likelihood function $L_{ik}(\beta)$ is computed after running the auctions in all the periods of episode $k-1$. Therefore, the firm has access to the required knowledge to compute the log-likelihood function $L_{ik}(\beta)$. Specifically, by the time the firm computes $L_{ik}(\beta)$, she has access to the submitted bids of the buyers in periods $t \in E_{k-1}$ as well as the reserve prices used in these periods. After estimating the preference vectors at the beginning of each episode $k$, the policy proceeds to use its estimates to set reserve prices. In particular, inspired by Proposition 3.2, the reserve price of buyer $i$ in period $t \in E_k$, denoted $r_{it}$, solves (11).
We now discuss some of the important features of our policy.
(i) In each episode $k$, every period $t$ is assigned to exploitation with probability $1 - 1/\ell_k$, and is assigned to exploration with probability $1/\ell_k$.
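The estimation step (8)–(9) and the pricing step (11) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes (purely for the sketch) a logistic market-noise CDF $F(z) = 1/(1+e^{-z})$, synthetic data, and a grid search for (11); note that only the binary outcomes $q_{it}$, the thresholds $\max\{b^+_{-it}, r_{it}\}$, and the features enter the likelihood, never the bid values themselves.

```python
import numpy as np
from scipy.optimize import minimize

# Assumption (for this sketch only): logistic market-noise CDF F.
F = lambda z: 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta, X, q, thresholds):
    """Eq. (9): uses only the auction outcomes q_it (win/lose), not the bids.
    thresholds[t] = max(b+_{-it}, r_it), the price buyer i had to beat."""
    p_lose = F(thresholds - X @ beta)            # P(<x_t, beta> + z_it <= threshold)
    p_lose = np.clip(p_lose, 1e-12, 1 - 1e-12)   # numerical safety
    return -np.mean(q * np.log(1.0 - p_lose) + (1.0 - q) * np.log(p_lose))

def estimate_beta(X, q, thresholds, B_p=10.0):
    """Eq. (8): beta_hat = argmin over ||beta|| <= B_p of L_ik(beta)."""
    cons = {"type": "ineq", "fun": lambda b: B_p - np.linalg.norm(b)}
    res = minimize(neg_log_likelihood, np.zeros(X.shape[1]),
                   args=(X, q, thresholds), constraints=[cons])
    return res.x

def reserve_price(x, beta_hat, grid=np.linspace(0.0, 10.0, 2001)):
    """Eq. (11): r = argmax_y  y * (1 - F(y - <x, beta_hat>)), via grid search."""
    return grid[np.argmax(grid * (1.0 - F(grid - x @ beta_hat)))]

# Synthetic check: outcomes generated under a known beta suffice to recover it.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -0.5])
X = rng.normal(size=(4000, 2))
thresholds = rng.uniform(0.0, 2.0, size=4000)
z = rng.logistic(size=4000)                         # market noise, matches F above
q = (X @ beta_true + z > thresholds).astype(float)  # truthful win/lose outcomes

beta_hat = estimate_beta(X, q, thresholds)
print(np.round(beta_hat, 2), reserve_price(X[0], beta_hat))
```

The key design point visible here is that replacing a bid by any other bid on the same side of the threshold leaves `q` unchanged, so such a deviation cannot move the estimate.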
In the exploration periods, the firm chooses one of the buyers at random and allocates the item to him if his submitted bid is above a reserve price $r \sim \text{uniform}(0, B)$, where $\text{uniform}(0, B)$ is the uniform distribution on the range $[0, B]$. In the exploitation periods, the firm exploits her current estimates of the preference vectors to set the reserve prices, where the estimates are obtained by applying the MLE method to the outcomes of the auctions in episode $k - 1$. The main purpose of setting reserve prices randomly in the exploration periods is to motivate the buyers to be truthful. Note that a buyer does not know whether the prices in a given period $t$ are set randomly. Thus, if he underbids in such a period, with positive probability he loses the opportunity to obtain a positive utility.

CORP: A Contextual Robust Pricing Policy
Initialization: For any $k \in \mathbb{Z}^+$, let $\ell_k = 2^{k-1}$ and $E_k = \{\ell_k, \ldots, \ell_{k+1} - 1\}$. Moreover, we let $r_{i1} = 0$ and $\hat\beta_{i1} = 0$ for any $i \in [N]$.
Updating Preference Vectors: At the start of each episode $k = 1, 2, \ldots$, i.e., at the beginning of period $t = \ell_k$, estimate the preference vectors, denoted by $\{\hat\beta_{ik}\}_{i\in[N]}$, as follows:
$$\hat\beta_{ik} = \arg\min_{\|\beta\| \le B_p} L_{ik}(\beta)\,, \qquad i \in [N]\,, \quad (10)$$
where $L_{ik}(\beta)$ is defined in Eq. (9).
Setting Reserves: In each episode $k = 1, 2, \ldots$, and for any period $t$ in this episode, i.e., $t \in E_k$:
- Exploration Phase: With probability $1/\ell_k$, choose one of the $N$ buyers uniformly at random and offer him the item at a price of $r \sim \text{uniform}(0, B)$, where $\text{uniform}(0, B)$ is the uniform distribution on the range $[0, B]$.
For other buyers, set their reserve prices to $\infty$.
- Exploitation Phase: With probability $1 - 1/\ell_k$, observe the feature vector $x_t$ and set the reserve of each buyer $i \in [N]$ to
$$r_{it} = \arg\max_y \big\{ y\big(1 - F(y - \langle x_t, \hat\beta_{ik}\rangle)\big) \big\}\,. \quad (11)$$

Table 1: CORP Policy

(ii) We highlight that the CORP policy does not use the submitted bids in estimating the preference vectors: it only uses the outcomes of the auctions, i.e., the $q_{it}$'s, to estimate these vectors; see the definition of the log-likelihood function in Equation (9). This makes the estimation procedure of the policy robust to untruthful bidding behavior of the buyers, as untruthful bidding does not necessarily lead to a different outcome. In addition, due to this feature of the learning policy, the buyers are incentivized to bid truthfully unless they are interested in changing the outcome of the auction at the expense of losing their current utility.
(iii) Other important factors that make the CORP policy robust are its episodic structure and the impatience of the buyers. In the CORP policy, submitted bids in episode $k$ are not used in setting reserve prices until the beginning of episode $k + 1$. Therefore, there is always a delay before buyers observe the effect of a bid on their reserves. Since buyers are impatient and maximize their discounted cumulative utility, they thus have less incentive to bid untruthfully. This is a salient property of the CORP policy that bounds the perpetual effect of each bid and, as we will see in the analysis, leads to robustness of the learning policy to the strategic behavior of the buyers.
Theorem 4.1 (Regret Bound: Known Market Noise Distribution). Suppose that Assumption 3.1 holds and the firm knows the market noise distribution $F$.
Then, the $T$-period worst-case regret of the CORP policy is at most $O(d \log(Td) \log(T))$, where the regret is computed against the benchmark defined in Proposition 3.2.

In Appendix A we give a proof sketch of Theorem 4.1, and we refer to Appendix C for a detailed proof.

5 SCORP: Stable CORP Policy
The CORP policy is assumed to know the market noise distribution $F$. In practice, however, it may very well be the case that the distribution $F$ is unknown or cannot be well approximated (e.g., it changes over time). To address this problem, we propose a variant of the CORP policy, called Stable Contextual Robust Pricing (SCORP), which is robust to the lack of precise knowledge of $F$. Specifically, we consider an ambiguity set $\mathcal{F}$ of possible probability distributions for the market noise and propose a policy that works well for every distribution in the ambiguity set.
Due to space constraints, we briefly explain SCORP here and refer to Appendix B for more details and a formal description of the policy. Similar to the CORP policy, SCORP has an episodic theme, with the length of episodes growing exponentially. As before, we denote the set of periods in episode $k$ by $E_k$, i.e., $E_k = \{\ell_k, \ldots, \ell_{k+1} - 1\}$, with $\ell_k = 2^{k-1}$. However, instead of randomized exploration, each episode $k$ starts with a pure exploration phase of length $\lceil \ell_k^{2/3} \rceil$. We use the notation $I_k$ to refer to the periods in the pure exploration phase of episode $k$, i.e., $I_k \equiv \{\ell_k, \ldots, \ell_k + \lceil \ell_k^{2/3} \rceil\}$. During each period in $I_k$, we choose one of the $N$ buyers uniformly at random and offer him the item at a price of $r \sim \text{uniform}(0, B)$. For other buyers, we set their reserve prices to $\infty$.
In the remaining periods of the episode (i.e., $E_k \setminus I_k$), we offer reserve prices based on the current estimates of the preference vectors, which are obtained by applying a least-squares estimator to the outcomes of the auctions in the pure exploration phase $I_k$. This is the exploitation phase, as we set reserves based on our best guess of the preference vectors. In the least-squares estimator, SCORP uses the outcomes of the auctions, not the submitted bids, which makes SCORP robust to strategic buyers. In addition, in the exploitation phase, SCORP chooses reserve prices so as to maximize the worst-case revenue over the ambiguity set $\mathcal{F}$, based on the current estimates of the preference vectors. In this sense, SCORP is also robust to the uncertainty in the noise distribution. Thus, SCORP is indeed doubly robust. In Theorem B.3 in Appendix B, we show that SCORP achieves a $T$-period worst-case regret of $O(\sqrt{d \log(Td)}\; T^{2/3})$.

6 Extension to nonlinear models

Although the paper focuses on linear valuation models, it is straightforward to generalize our analysis to some nonlinear valuation models. Specifically, consider the model
$$v_{it}(x_t) = \psi(\langle \phi(x_t), \beta_i \rangle + z_{it})\,, \qquad i \in [N]\,, \quad t \ge 1\,,$$
where $\phi: \mathbb{R}^d \mapsto \mathbb{R}^d$ is a mapping and $\psi: \mathbb{R} \mapsto \mathbb{R}$ is an increasing function. Then, by the change of variables $\tilde v_{it} = \psi^{-1}(v_{it})$ and $\tilde x_t = \phi(x_t)$, we arrive at the linear relation $\tilde v_{it} = \langle \tilde x_t, \beta_i \rangle + z_{it}$. By modifying the CORP policy for this relation, we can get a policy that also achieves logarithmic regret in these nonlinear settings. Some examples include the log-log model, the semi-log model, and the logistic model.
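For instance, under a log-log model the link is $\psi = \exp$ and the feature map $\phi$ is a componentwise logarithm, so the change of variables is a one-line transformation. The sketch below (with made-up parameter values, for positive features) only verifies the change of variables itself:

```python
import numpy as np

# Change of variables for a log-log valuation model (illustrative sketch):
# v = exp(<log(x), beta> + z)  maps to the linear model  log(v) = <log(x), beta> + z,
# so CORP can be run on the transformed data (x_tilde, v_tilde) = (log(x), log(v)).
psi = np.exp       # increasing link psi
psi_inv = np.log   # its inverse
phi = np.log       # feature map (componentwise log, requires positive features)

beta = np.array([0.7, 0.3])   # made-up preference vector
x = np.array([2.0, 5.0])      # made-up (positive) feature vector
z = 0.1                       # market-noise draw

v = psi(phi(x) @ beta + z)               # nonlinear valuation
v_tilde, x_tilde = psi_inv(v), phi(x)    # transformed valuation and features
print(np.isclose(v_tilde, x_tilde @ beta + z))  # prints True: linear relation recovered
```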
We refer to the long version of the paper [14] for further discussion on this matter.
While these models have been popular in some applications (the first two in hedonic pricing and the last in click-through-rate prediction), it remains an interesting direction to consider nonparametric models (similar to [27] and [9], for example), but this is beyond the scope of the current paper.

7 Conclusion

Motivated by online marketplaces with highly differentiated products, we formulated a dynamic pricing problem in the contextual setting. In this problem, a firm runs repeated second-price auctions with reserve, and the item to be sold in each period is described by a context (feature) vector. In our model, the contextual information of an item influences buyers' valuations of that item in a heterogeneous way, via buyers' preference vectors. Due to the repeated interaction of buyers with the firm, buyers have the incentive to game the firm's policy by bidding untruthfully. We proposed two pricing policies to set the reserve prices of buyers. These policies aim at learning the preference vectors of buyers in a way that is robust to strategic buyers while maximizing the firm's collected revenue.
The main insight behind the robustness property of our approach is that, by an episodic design, we limit the long-term effect of each bid on the firm's estimates of the preference vectors. Further, instead of using the bids (data), we use only the outcomes of the auctions (censored data) in estimating the preference vectors. Interestingly, we show that using this censored data does not hamper the learning rate while bringing in the robustness property. As the granularity of real-time data increases at an unprecedented rate, we believe the ideas of this work can serve as a starting point for other complex dynamic contextual learning and decision-making problems.

Acknowledgement

A.
Javanmard was supported in part by an Outlier Research in Business (iORB) grant from the USC Marshall School of Business, a Google Faculty Research Award, and the NSF CAREER Award DMS-1844481. A. Javanmard would also like to acknowledge the financial support of the Office of the Provost at the University of Southern California through the Zumberge Fund Individual Grant Program.

References
[1] K. Amin, A. Rostamizadeh, and U. Syed. Learning prices for repeated auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 1169–1177, 2013.
[2] K. Amin, A. Rostamizadeh, and U. Syed. Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 622–630, 2014.
[3] Y. Aviv and A. Pazgal. Optimal pricing of seasonal products in the presence of forward-looking consumers. Manufacturing & Service Operations Management, 10(3):339–359, 2008.
[4] Y. Aviv, M. M. Wei, and F. Zhang. Responsive pricing of fashion products: The effects of demand learning and strategic consumer behavior. Technical report, Working Paper, Washington University, 2015.
[5] M. Bagnoli and T. Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.
[6] G.-Y. Ban and N. B. Keskin. Personalized dynamic pricing with machine learning. 2017.
[7] C. Borgs, O. Candogan, J. Chayes, I. Lobel, and H. Nazerzadeh. Optimal multiperiod pricing with service guarantees. Management Science, 60(7):1792–1811, 2014.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] N. Chen and G. Gallego. Nonparametric learning and optimization with covariates. arXiv preprint arXiv:1805.01136, 2018.
[10] X. Chen, Z. Owen, C. Pixton, and D. Simchi-Levi. A statistical learning approach to personalization in revenue management. 2015.
[11] M. Cohen, I. Lobel, and R. Paes Leme.
Feature-based dynamic pricing. 2016.
[12] A. V. den Boer. Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18, 2015.
[13] B. Edelman and M. Ostrovsky. Strategic bidder behavior in sponsored search auctions. Decision Support Systems, 43(1):192–198, 2007.
[14] N. Golrezaei, A. Javanmard, and V. Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. Available at SSRN 3144034, 2018.
[15] N. Golrezaei, M. Lin, V. Mirrokni, and H. Nazerzadeh. Boosted second-price auctions for heterogeneous bidders. 2017.
[16] B. Guimaraes and K. D. Sheedy. Sales and monetary policy. American Economic Review, 101(2):844–876, 2011.
[17] O. D. Hart and J. Tirole. Contract renegotiation and Coasian dynamics. The Review of Economic Studies, 55(4):509–540, 1988.
[18] A. Javanmard. Perishability of data: Dynamic pricing under varying-coefficient models. Journal of Machine Learning Research, 18(53):1–31, 2017.
[19] A. Javanmard and H. Nazerzadeh. Dynamic pricing in high-dimensions. The Journal of Machine Learning Research, 20(1):315–363, 2019.
[20] A. Javanmard, H. Nazerzadeh, and S. Shao. Multi-product dynamic pricing in high-dimensions with heterogeneous price sensitivity. arXiv preprint arXiv:1901.01030, 2019.
[21] J. P. Johnson and D. P. Myatt. Multiproduct quality competition: Fighting brands and product line pruning. American Economic Review, 93(3):748–774, 2003.
[22] Y. Kanoria and H. Nazerzadeh. Dynamic reserve prices for repeated auctions: Learning from bids. 2017.
[23] C. Koufogiannakis and N. E. Young. A nearly linear-time PTAS for explicit fractional packing and covering linear programs. Algorithmica, 70(4):648–674, 2014.
[24] R. P. Leme and J. Schneider.
Contextual search via intrinsic volumes. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 268–282. IEEE, 2018.
[25] I. Lobel, R. P. Leme, and A. Vladu. Multidimensional binary search for contextual decision-making. arXiv preprint arXiv:1611.00829, 2016.
[26] M. Mahdian, V. Mirrokni, and S. Zuo. Incentive-aware learning for large markets. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
[27] J. Mao, R. Leme, and J. Schneider. Contextual pricing for Lipschitz buyers. In Advances in Neural Information Processing Systems, pages 5643–5651, 2018.
[28] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), pages 94–103, 2007.
[29] A. M. Medina and M. Mohri. Learning theory and algorithms for revenue optimization in second price auctions with reserve. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 262–270, 2014.
[30] R. B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58–73, 1981.
[31] S. W. Salant. When is inducing self-selection suboptimal for a monopolist? The Quarterly Journal of Economics, 104(2):391–397, 1989.
[32] J. Tropp. Freedman's inequality for matrix martingales. Electronic Communications in Probability, 16:262–270, 2011.
[33] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, pages 210–268. Cambridge Univ.
Press, Cambridge, 2012.