{"title": "Practical Differentially Private Top-k Selection with Pay-what-you-get Composition", "book": "Advances in Neural Information Processing Systems", "page_first": 3532, "page_last": 3542, "abstract": "We study the problem of top-k selection over a large domain universe subject to user-level differential privacy. Typically, the exponential mechanism or report noisy max are the algorithms used to solve this problem. However, these algorithms require querying the database for the count of each domain element. We focus on the setting where the data domain is unknown, which is different than the setting of frequent itemsets where an apriori type algorithm can help prune the space of domain elements to query. We design algorithms that ensures (approximate) differential privacy and only needs access to the true top-k' elements from the data for any chosen k' \u2265 k. This is a highly desirable feature for making differential privacy practical, since the algorithms require no knowledge of the domain. We consider both the setting where a user's data can modify an arbitrary number of counts by at most 1, i.e. unrestricted sensitivity, and the setting where a user's data can modify at most some small, fixed number of counts by at most 1, i.e. restricted sensitivity. Additionally, we provide a pay-what-you-get privacy composition bound for our algorithms. That is, our algorithms might return fewer than k elements when the top-k elements are queried, but the overall privacy budget only decreases by the size of the outcome set.", "full_text": "Practical Differentially Private Top-k Selection with\n\nPay-what-you-get Composition\n\nDavid Durfee1 and Ryan Rogers1\n\n1Data Science Applied Research, LinkedIn\n\nAbstract\n\nWe study the problem of top-k selection over a large domain universe subject to\nuser-level differential privacy. Typically, the exponential mechanism or report noisy\nmax are the algorithms used to solve this problem. 
However, these algorithms require querying the database for the count of each domain element. We focus on the setting where the data domain is unknown, which is different from the setting of frequent itemsets, where an Apriori-type algorithm can help prune the space of domain elements to query. We design algorithms that ensure (approximate) (ε, δ > 0)-differential privacy and only need access to the true top-k̄ elements from the data for any chosen k̄ ≥ k. We consider both the setting where a user's data can modify an arbitrary number of counts by at most 1, i.e. unrestricted sensitivity, and the setting where a user's data can modify at most some small, fixed number of counts by at most 1, i.e. restricted sensitivity. Additionally, we provide a pay-what-you-get privacy composition bound for our algorithms. That is, our algorithms might return fewer than k elements when the top-k elements are queried, but the overall privacy budget only decreases by the size of the outcome.

1 Introduction

Determining the top-k most frequent items from a massive dataset in an efficient way is one of the most fundamental problems in data science; see Ilyas et al. [17] for a survey of top-k processing techniques. However, it is important to consider users' privacy in the dataset, since results from data mining approaches can reveal sensitive information about a user's data [20]. Simple thresholding techniques, e.g. k-anonymity, do not provide formal privacy guarantees, since adversary background knowledge or linking other datasets may cause someone's data in a protected dataset to be revealed [24]. 
Our aim is to provide rigorous privacy techniques for determining the top-k so that it can be built on top of highly distributed, real-time systems that might already be in place.

Differential privacy has become the gold standard for rigorous privacy guarantees in data analytics. One of the primary benefits of differential privacy is that the privacy loss of a computation on a dataset can be quantified. Many companies have adopted differential privacy, including Google [15], Apple [1], Uber [18], Microsoft [9], and LinkedIn [21], as well as government agencies, like the U.S. Census Bureau [8]. In this work, we hope to extend the use of differential privacy in practical systems to allow analysts to compute the k most frequent elements in a given dataset. We are certainly not the first to explore this topic, yet the previous works require querying the count of every domain element, e.g. report noisy max [10] and the exponential mechanism [25], or require some structure on the large domain universe, e.g. frequent itemsets (see Related Work). We aim to design practical, (approximate) differentially private algorithms that do not require any structure on the data domain, which is typically the case in exploratory data analysis. Our algorithms work in the setting where data is preprocessed prior to running our algorithms, so that the differentially private computation only accesses a subset of the data while still providing user-level privacy in the full underlying dataset.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We design (ε, δ > 0)-differentially private algorithms that can return the top-k results by querying the counts of elements that only exist in the dataset. To ensure user-level privacy, where we want to protect the privacy of a user's entire local dataset that might consist of many data records, we consider two different settings. In the restricted sensitivity setting, we assume that a user can modify the counts by at most 1 across at most a fixed number Δ of elements in a data domain, which is assumed to be known. An example of such a setting would be computing the top-k countries where users have a certain skill set. Assuming a user can only be in one country, we have Δ = 1. In the more general setting, we consider unrestricted sensitivity, where a user can modify the counts by at most 1 across an arbitrary number of elements. An example of the unrestricted setting would be if we wanted to compute the top-k articles with distinct user engagement (liked, commented, shared, etc.). We design different algorithms for either setting so that the privacy parameter ε needs to scale with either ≈ Δ in the restricted sensitivity setting or ≈ √k in the unrestricted setting. Thus, our differentially private algorithms will ensure user-level privacy despite a user being able to modify the counts of an arbitrary number of elements.

The reason that our algorithms are approximate differentially private is that we want to allow our algorithms to not have to know the data domain, or any structure on it. For exploratory analyses, one would like to not have to provide the algorithm the full data domain beforehand. The mere presence of a domain element in the exploratory analysis might be the result of a single user's data. Hence, if we remove a user's data in a neighboring dataset, there are some outcomes that cannot occur. We design algorithms such that these events occur with very small δ probability. Simultaneously, we ensure that the private algorithms do not compromise the efficiency of existing systems.

As a byproduct of our analysis, we also include some results of independent interest. 
In particular, we give a composition theorem that essentially allows for pay-what-you-get privacy loss. Since our algorithms can potentially output fewer than k elements when asked for the top-k, we allow the analyst to ask more queries if the algorithms return fewer than k outcomes, up to some fixed bound. Further, we define a condition on differentially private algorithms that allows for better composition bounds than the general optimal composition bounds [19, 26]. Lastly, we show how we can achieve a one-shot differentially private algorithm that provides a ranked top-k result and has a privacy parameter that scales with √k, which uses a different noise distribution than work from Dwork et al. [14].

We see this work as bringing together multiple theoretical results in differential privacy to arrive at a practical privacy system that can be used on top of existing, real-time data analytics platforms for massive datasets distributed across multiple servers. Essentially, the algorithms allow for solving the top-k̄ problem first with the existing infrastructure for any chosen k̄ ≥ k, and then incorporate noise and a threshold to output the top-k, or fewer outcomes. In our approach, we can think of the existing system, such as online analytical processing (OLAP) systems, as a blackbox top-k solver, and without adjusting the input dataset or opening up the blackbox, we can still implement private algorithms.

1.1 Related Work

There are several works in differential privacy for discovering the most frequent elements in a dataset, e.g. top-k selection and heavy hitters. In the local privacy setting, there has been both academic work [3, 4] and industry solutions [1, 16] for identifying the heavy hitters. 
Note that these algorithms\nrequire some additional structure on the data domain, such as \ufb01xed length words, where the data can\nbe represented as a sequence of some known length and each element of the sequence belongs to\nsome known set. We will be working in the trusted curator model. There has been several works in\nthis model that estimate frequent itemsets subject to differential privacy, including [5, 23, 28, 22, 29].\nSimilar to our work, Bhaskar et al. [5] \ufb01rst solve the top-\u00afk problem nonprivately (but \u00afk \u2265 k can be\nthe full domain for certain databases) and then use the exponential mechanism to return an estimate\nfor the top-k. The primary difference between these works and ours is that the domain universe in\nour setting is unknown and not assumed to have any structure. For itemsets, one can iteratively build\nup the domain from smaller itemsets, as in the locally private algorithms.\nWe assume no structure on the domain, as one would assume without considering privacy restrictions.\nThis is a highly desirable feature for making differential privacy practical, since the algorithms can\nwork over arbitrary domains. Chaudhuri et al. [7] considers the problem of returning the argmax\nsubject to differential privacy, where their algorithm works in the range independent setting. That is,\ntheir algorithms can return domain elements that are unknown to the analyst querying the dataset.\nHowever, their large margin mechanism can run over the entire domain universe in the worst case.\n\n2\n\n\fThe algorithms in [7] and [5] share a similar approach in that both use the exponential mechanism on\nelements above a threshold. In order to obtain pure-differential privacy (\u03b4 = 0), Bhaskar et al. [5]\nsamples uniformly from elements below the threshold, whereas Chaudhuri et al. [7] never sample\nanything from this remaining set and thus satisfy approximate-differential privacy (\u03b4 > 0). 
Our\napproach will also follow this high-level idea, but we set the threshold based on an input parameter\nto ensure computational ef\ufb01ciency. To our knowledge, there are no top-k differentially private\nalgorithms for the unknown domain setting that never require iterating over the entire domain.\nThere have been several works bounding the total privacy loss of an (adaptive) sequence of dif-\nferentially private mechanisms, including basic composition [12, 10], advanced composition (with\nimprovements) [13, 11, 6], and optimal composition [19, 26]. There has also been work in bounding\nthe privacy loss when the privacy parameters themselves can be chosen adaptively \u2014 where the\nprevious composition theorems cannot be applied \u2014 with pay-as-you-go composition [27]. In this\nwork, we provide a pay-what-you-get composition theorem for our algorithms which allows the\nanalyst to only pay for the number of elements that were returned by our algorithms in the overall\nprivacy budget. Because our algorithms can return fewer than k elements when asked for the top-k,\nwe want to ensure the analyst can ask many more queries if fewer than k elements have been given.\n\n2 Preliminaries\nWe will represent the domain as [d] := {1,\u00b7\u00b7\u00b7 , d} and a user i\u2019s data as xi \u2208 2[d] =: X . We then\nwrite a dataset of n users as x = {x1,\u00b7\u00b7\u00b7 , xn}. We say that x, x(cid:48) are neighbors if they differ in the\naddition or deletion of one user\u2019s data, e.g. x = x(cid:48) \u222a {xi}. We now de\ufb01ne differential privacy [12].\nDe\ufb01nition 2.1 (Differential Privacy). 
An algorithm M that takes a collection of records in X to some arbitrary outcome set Y is (ε, δ)-differentially private (DP), or ε-DP if δ = 0, if for all neighbors x, x′ and for all outcome sets S ⊆ Y, we have

Pr[M(x) ∈ S] ≤ e^ε Pr[M(x′) ∈ S] + δ.

In this work, we want to select the top-k most frequent elements in a dataset x. Let h_j(x) ∈ N denote the number of users that have element j ∈ [d], i.e. h_j(x) = Σ_{i=1}^n 1{j ∈ x_i}. We then sort the counts and denote the ordering as h_{i(1)}(x) ≥ ··· ≥ h_{i(d)}(x) with corresponding elements i(1), ···, i(d) ∈ [d]. Hence, from dataset x, we seek to output i(1), ···, i(k), where we break ties in some arbitrary, data-independent way.

Note that for neighboring datasets x and x′, the corresponding neighboring histograms h = h(x) and h′ = h(x′) can differ in all d positions by at most 1, i.e. ||h − h′||∞ ≤ 1. In some instances, one user can only impact the count on at most a fixed number of coordinates. We then say that h, h′ are Δ-restricted sensitivity neighbors if ||h − h′||∞ ≤ 1 and ||h − h′||₀ ≤ Δ.

The algorithms we describe will only need access to a histogram h(x) = (h_1(x), ···, h_d(x)) ∈ N^d, where we drop x when it is clear from context. We will be analyzing the privacy loss of an individual user over many different top-k1, top-k2, ··· queries on a larger, overall dataset. Consider the example where we want to know the top-k1 articles that distinct users engaged with, then we want to know the top-k2 articles that distinct users engaged with in Germany, and so on. 
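As a concrete illustration of this setup, the histogram h(x) and the (non-private) true top-k with data-independent tie-breaking can be computed as follows; this is a minimal sketch, and the function names and example data are our own rather than anything from the paper:

```python
from collections import Counter

def histogram(x):
    """h_j(x) = number of users i whose set x_i contains element j.

    Each user contributes at most 1 to each count, so adding or removing
    one user changes every coordinate of h by at most 1.
    """
    h = Counter()
    for user_set in x:
        for j in user_set:
            h[j] += 1
    return h

def true_top_k(h, k):
    # Break ties in a fixed, data-independent way (here: by element name).
    ranked = sorted(h.items(), key=lambda kv: (-kv[1], kv[0]))
    return [j for j, _ in ranked[:k]]

x = [{"a", "b"}, {"a"}, {"a", "c"}]
h = histogram(x)
print(true_top_k(h, 2))  # "a" has count 3; "b" and "c" tie at 1, broken alphabetically
```

Note that the private algorithms in this paper never need the whole histogram: they only touch the top counts, which is what makes them practical over unknown domains.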
A user's data can be part of each input histogram, so we want to compose the privacy loss across many different queries.

In our algorithms, we will add noise to the histogram counts. The noise distributions we consider are from a Gumbel random variable or a Laplace random variable, where Gumbel(b) has density function pGumbel(z; b) and Lap(b) has density function pLap(z; b), with

pGumbel(z; b) = (1/b) · exp(−(z/b + e^{−z/b}))   and   pLap(z; b) = (1/(2b)) · exp(−|z|/b).   (1)

3 Main Algorithm and Results

We now present our main algorithm for reporting the top-k domain elements. The limited domain procedure LimitDom_{k,k̄} given in Algorithm 1 takes as input a histogram h ∈ N^d, parameter k, some cutoff k̄ ≥ k for the number of elements to consider, and privacy parameters (ε, δ). It then returns at most k indices in relative rank order. Our algorithm can be thought of as solving the top-k̄ problem with access to the true data; then, from this set of histogram counts, it adds noise to each count and sorts them to return at most k indices with counts that are above some data-dependent, noisy threshold. The noise that we add will be from a Gumbel distribution, given in (1), which has a nice connection with the exponential mechanism [25] (see Section 4). In later sections we will present a sketch of the analysis with some extensions. 
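For reference, the densities in (1) and a standard inverse-CDF sampler for the Gumbel noise used by the algorithm can be written as follows; the function names are our own:

```python
import math
import random

def gumbel_pdf(z, b):
    # p_Gumbel(z; b) = (1/b) * exp(-(z/b + e^{-z/b}))
    return (1.0 / b) * math.exp(-(z / b + math.exp(-z / b)))

def laplace_pdf(z, b):
    # p_Lap(z; b) = (1/(2b)) * exp(-|z|/b)
    return (1.0 / (2.0 * b)) * math.exp(-abs(z) / b)

def sample_gumbel(b, rng=random):
    # Inverse-CDF method: if U ~ Uniform(0, 1), then -b * ln(-ln U) ~ Gumbel(b).
    return -b * math.log(-math.log(rng.random()))
```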
We present the formal analysis in the supplementary file.

Algorithm 1 LimitDom_{k,k̄}: Top-k from the k̄ ≥ k limited domain

Input: Histogram h; privacy parameters ε, δ.
Output: Ordered set of indices.
Sort h_(1) ≥ h_(2) ≥ ···.
Set h_⊥ = h_(k̄+1) + 1 + ln(min{Δ, k̄, d − k̄}/δ)/ε.
Set v_⊥ = h_⊥ + Gumbel(1/ε).
for j ≤ k̄ do
  Set v_(j) = h_(j) + Gumbel(1/ε).
Sort {v_(j)} ∪ {v_⊥} and let v_{i(1)}, ..., v_{i(j)}, v_⊥ be the sorted list up until v_⊥.
Return {i(1), ..., i(j), ⊥} if j < k; otherwise return {i(1), ..., i(k)}.

We now state its privacy guarantee.

Theorem 1. Algorithm 1 is (ε′, δ + δ′)-DP for any δ′ ≥ 0, where

ε′ = min{ kε,  kε · (e^ε − 1)/(e^ε + 1) + ε√(2k ln(1/δ′)),  kε²/2 + ε√((1/2) k ln(1/δ′)) }.   (2)

Note that our algorithm is not guaranteed to output k indices, and this is key to obtaining our privacy guarantees. The primary difficulty here is that the indices within the true top-k̄ can change by adding or removing one person's data. The purpose of the threshold, ⊥, is then to ensure that the probability of outputting any index in the top-k̄ for histogram h but not in the top-k̄ for a neighboring histogram h′ is bounded by δ/min{Δ, k̄}. We give more high-level intuition in Section 3.2.

In order to maximize the probability of outputting k indices, we want to minimize our threshold value. In the unrestricted sensitivity setting, it becomes natural to consider how to set k̄, as there is a tradeoff: h_(k̄+1) is decreasing in k̄ whereas ln(k̄/δ)/ε is increasing in k̄. 
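Under our reading of the pseudocode above, Algorithm 1 can be sketched in a few lines of Python; the names `limit_dom` and `BOTTOM`, and the `sensitivity` parameter standing in for Δ, are our own, and this is an illustrative sketch rather than a production implementation:

```python
import math
import random

BOTTOM = "⊥"  # stand-in marker for the ⊥ outcome

def sample_gumbel(b, rng=random):
    # If U ~ Uniform(0, 1), then -b * ln(-ln U) ~ Gumbel(b).
    return -b * math.log(-math.log(rng.random()))

def limit_dom(h, k, k_bar, eps, delta, sensitivity=float("inf")):
    """Sketch of LimitDom_{k, k_bar}: only the top-(k_bar + 1) true counts are touched."""
    assert k_bar >= k
    ranked = sorted(h.items(), key=lambda kv: -kv[1])
    d = len(ranked)
    assert k_bar < d, "this sketch assumes the cutoff is below the domain size"
    h_next = ranked[k_bar][1]  # h_{(k_bar + 1)}
    h_bot = h_next + 1 + math.log(min(sensitivity, k_bar, d - k_bar) / delta) / eps
    v_bot = h_bot + sample_gumbel(1.0 / eps)  # noisy threshold v_⊥
    noisy = [(elem, cnt + sample_gumbel(1.0 / eps)) for elem, cnt in ranked[:k_bar]]
    noisy.sort(key=lambda ev: -ev[1])
    out = []
    for elem, v in noisy:
        if v <= v_bot:        # fell below the noisy threshold: stop and report ⊥
            return out + [BOTTOM]
        out.append(elem)
        if len(out) == k:     # at most k indices, in relative rank order
            return out
    return out + [BOTTOM]

random.seed(0)
h = {f"e{i}": 1000 - 100 * i for i in range(10)}
print(limit_dom(h, k=3, k_bar=6, eps=1.0, delta=1e-6))
```

With counts this well separated, the true top 3 are returned; in the restricted sensitivity setting one would pass `sensitivity` equal to Δ, which lowers the threshold.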
Ideally, we would set k̄ to be a point within the histogram at which we see a sudden drop, but setting it in such a data-dependent manner would violate privacy. Instead, we will simply consider the optimization problem of finding the index k̄ that minimizes h_(k̄+1) + 1 + ln(k̄/δ)/ε (and is computationally feasible), and we will solve this problem with standard DP techniques. We can then find a noisy estimate of the optimal parameter k̄ for a given histogram h, and the resulting procedure increases the privacy loss by substituting k + 1 for k in the guarantees in Theorem 1. See the supplementary file for more details.

Pay-what-you-get Composition

Algorithm 2 multiLimitDom_{k⋆,ℓ⋆}: Multiple queries to limited domain

Input: Adaptive stream h1, h2, ...; integers k⋆ and ℓ⋆; per-iterate privacy parameters ε, δ.
Output: Sequence of outputs (o1, ···, o_ℓ) for ℓ ≤ ℓ⋆.
while k⋆ > 0 and ℓ⋆ > 0 do
  Based on previous outcomes, select adaptive histogram h_i and parameters k_i, k̄_i.
  if k_i ≤ k⋆ then
    Let o_i = LimitDom_{k_i,k̄_i}(h_i) with privacy parameters ε and δ.
    k⋆ ← k⋆ − |o_i| and ℓ⋆ ← ℓ⋆ − 1.
Return o = (o1, o2, ···)

While the privacy loss for Algorithm 1 will be a function of k regardless of whether it outputs far fewer than k indices, we can actually show that in making multiple calls to this algorithm, we can instead bound the privacy loss in terms of the number of indices that are output. More specifically, we will take the length of the output for each call to Algorithm 1, which is not deterministic, and ensure that the sum of these lengths does not exceed some k⋆. 
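A minimal sketch of this budgeting loop, with `limit_dom_fn` standing in for a call to Algorithm 1 and the stream of queries given as a plain list rather than adaptively; charging the full outcome size |o_i| (including a ⊥ marker) against the budget follows our reading of the pseudocode:

```python
def multi_limit_dom(queries, k_star, l_star, eps, delta, limit_dom_fn):
    """Answer top-k queries until the element budget k_star or query budget l_star runs out.

    Each answered query is charged only |o_i|, the size of its output
    (pay-what-you-get), rather than the k that was asked for.
    """
    outputs = []
    for h, k, k_bar in queries:
        if k_star <= 0 or l_star <= 0:
            break
        if k <= k_star:  # only run queries that still fit in the element budget
            o = limit_dom_fn(h, k, k_bar, eps, delta)
            k_star -= len(o)
            l_star -= 1
            outputs.append(o)
    return outputs

def stub(h, k, k_bar, eps, delta):
    # Illustration only: always returns the k highest-count elements.
    return sorted(h, key=h.get, reverse=True)[:k]

qs = [({"a": 5, "b": 3, "c": 1}, 2, 3)] * 4
print(multi_limit_dom(qs, k_star=5, l_star=10, eps=1.0, delta=1e-6, limit_dom_fn=stub))
```

Here the first two queries each consume 2 units of the element budget; the remaining queries no longer fit and are not run.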
Additionally, we need to restrict how many top-k queries can be asked of our system, which we denote as ℓ⋆. Accordingly, the privacy loss will then be in terms of k⋆ and ℓ⋆. We detail the procedure multiLimitDom_{k⋆,ℓ⋆} in Algorithm 2.

From a practical perspective, this means that if we allowed a client to make multiple top-k queries with a total budget of k⋆, then whenever a top-k query was made, their total budget would only decrease by the size of the output, as opposed to k. We will further discuss in Section 3.1 how this property can in some ways actually provide higher utility than standard approaches that have access to the full histogram and must output k indices. We then have the following privacy statement.

Theorem 2. For any δ′ ≥ 0, multiLimitDom_{k⋆,ℓ⋆} in Algorithm 2 is (ε⋆, 2ℓ⋆δ + δ′)-DP, where

ε⋆ = min{ k⋆ε,  k⋆ε · (e^ε − 1)/(e^ε + 1) + ε√(2k⋆ ln(1/δ′)),  k⋆ε²/2 + ε√((1/2) k⋆ ln(1/δ′)) }.   (3)

In the supplementary file, we present other variants of our main algorithm that are better to use in specific settings. In particular, we consider alternatively adding Laplace noise in Algorithm 1, which allows our ε-privacy loss parameter to scale with Δ (rather than √k) in the restricted sensitivity setting, but unlike using Gumbel noise, it does not benefit from the pay-what-you-get composition.

Improved Advanced Composition

We also provide a result that may be of independent interest. 
In Section 4, we consider a slightly tighter characterization of pure (δ = 0) differential privacy, which we refer to as range-bounded, and show that it can improve the total privacy loss over a sequence of adaptively chosen private algorithms. In particular, we consider the exponential mechanism, which is known to be ε-DP, and show that it has even stronger properties that allow us to show it is ε-range-bounded under the same parameters. Accordingly, we can then give improved advanced composition bounds for the exponential mechanism compared to the optimal composition bounds for general ε-DP mechanisms given in [19, 26] (we show a comparison of these bounds in Section 4.3).

3.1 Accuracy Comparisons

In contrast to previous work in top-k selection subject to DP, our algorithms can return fewer than k indices. Typically, accuracy in this setting means returning a set of exactly k indices such that each returned index has a count that is at least the k-th ranked value minus some small amount. We then relax the utility statement to allow returning fewer than k indices.

Definition 3.1. Given histogram h along with non-negative integers k and α, we say that a subset of indices S ⊆ [d] ∪ {⊥} is (α, k)-accurate if for any i ∈ S such that i ≠ ⊥, we have h_i ≥ h_(k) − α.

For this definition, we can give asymptotically better accuracy guarantees than the exponential mechanism achieves, which are known to be tight [2], but it is important to mention that our definition does not require k indices to be output. Accordingly, we add a sufficient condition under which our algorithm will return k indices with a given probability. We defer the proof to the supplementary file.

Lemma 3.1. 
For any histogram h, with probability at least 1 \u2212 \u03b2 the output from Algorithm 1\nwith parameters k, \u00afk, \u03b5, \u03b4 is (\u03b1, k)-accurate where \u03b1 = ln(k\u00afk/\u03b2)/\u03b5. Additionally, we have that\nAlgorithm 1 will return k indices with probability at least 1 \u2212 \u03b2 if\n\nh(k) \u2265 h(\u00afk+1) + 1 + ln(min{\u2206, \u00afk}/\u03b4)/\u03b5 + ln(k/\u03b2)/\u03b5\n\nThe \ufb01rst statement is essentially equivalent to Theorem 6 in [2] which would have \u03b1 = ln(kd/\u03b2)/\u03b5 in\nthis setting because we incorporate advanced composition at the end of the analysis and we consider\nthe absolute counts (not normalized by the total number of users). Accordingly, our \u03b1 parameter\nswaps d with \u00afk as expected, and will improve the accuracy statement for the output indices.\nEven for histograms in which we are unlikely to return k indices, we see this as the primary advantage\nof our pay-what-you-get composition. The indices that are returned are normally the clear winners,\ni.e. indices with counts substantially above the (\u00afk + 1)th value, and then the \u22a5 value is informative\nthat the remaining values are approximately equal where the analyst only has to pay for this single\noutput as opposed to paying for the remaining outputs that are close to a random permutation.\n\n5\n\n\f3.2 Our Techniques\n\nThe main challenge with ensuring differential privacy in our setting is that preprocessing the data to\nthe true top-\u00afk indices will lead to different domains for neighboring histograms. More explicitly, the\nindices within the top-\u00afk can change by adding or removing one user\u2019s data, and this makes ensuring\npure differential privacy without knowing the domain impossible. However, the key insight will be\nthat only indices whose value is within 1 of h(\u00afk+1), i.e. the count of the (\u00afk + 1)th ranked index, can\ngo in or out of the top-\u00afk by adding or removing one user\u2019s data. 
Accordingly, the noisy threshold that we add will be explicitly set such that for indices with value within 1 of h_(k̄+1), the noisy estimate exceeding the noisy threshold will be a low-probability event. By restricting our output indices to those whose noisy estimates are in the top-k and exceed a noisy threshold, we ensure that indices in the top-k̄ for one histogram but not in a neighboring histogram will be output with probability at most δ/min{Δ, k̄}. A union bound over the possible indices that can change will then give our desired bound on these bad events. We now present the high-level reasoning behind the proof of privacy in Theorem 1 and defer the formal analysis to the supplementary material.

1. Adding Gumbel noise and taking the top-k at once is equivalent to selecting the argmax using the exponential mechanism, then removing that index and iterating; see Lemma 4.2.¹

2. To get around the fact that the domains can change in neighboring datasets, we define a variant of Algorithm 1 that takes a histogram and a domain as input. We then prove that this variant is DP for any input domain, and for a choice of domain that depends on the input histogram, it is the same as Algorithm 1.

3. Due to the choice of the count for element ⊥, we show that for any given neighboring datasets h, h′, the probability that Algorithm 1 evaluated on h can return any element that is not part of the domain with h′ is at most δ.

We now provide a sketch of the analysis for proving the pay-what-you-get composition bound in Theorem 2 and defer the formal proof to the supplementary material.

1. Because Algorithm 1 can be expressed as multiple iterations of the exponential mechanism, we can string together many calls to Algorithm 1 as an adaptive sequence of DP mechanisms.

2. 
With multiple calls to Algorithm 1, if we ever get a ⊥ outcome, we can simply start a new top-k query and hence a new sequence of exponential mechanism calls. Hence, we do not need to get k outcomes before we switch to a new query.

3. To get the improved constants in (3), compared to advanced composition given in Theorem 3 [13], we introduce a tighter range-bounded characterization, which the exponential mechanism satisfies and which enjoys better composition; see Lemma 4.4.

4 Existing DP Algorithms and Extensions

We now cover some existing differentially private algorithms and extensions to them. We start with the exponential mechanism [25], and show how it is equivalent to adding noise from a particular distribution and taking the argmax outcome. Next, we will present a stronger privacy condition than differential privacy which will in fact lead to a better composition bound than the optimal composition bounds [19, 26] for general DP mechanisms. Throughout, we will make use of the following composition theorem in differential privacy.

Theorem 3 (Composition [10, 13] with improvements by [19, 26]). 
Let M1, M2, ···, Mt each be (ε_i, δ_i)-DP, where M_i may depend on the previous outcomes of M1, ···, M_{i−1}. Then the composed algorithm M(x) = (M1(x), M2(x), ···, Mt(x)) is (ε′, Σ_{i=1}^t δ_i + δ′)-DP for any δ′ ≥ 0, where

ε′ = min{ Σ_{i=1}^t ε_i,  Σ_{i=1}^t ε_i · (e^{ε_i} − 1)/(e^{ε_i} + 1) + √(2 Σ_{i=1}^t ε_i² ln(1/δ′)) }.

¹Note that we could have alternatively written our algorithm in terms of iteratively applying the exponential mechanism for easier analysis, but instead adding Gumbel noise once is computationally more efficient.

4.1 Exponential Mechanism and Gumbel Noise

The exponential mechanism takes a quality score q that maps a dataset in D and an outcome to R, and can be thought of as evaluating how good q(x, y) is for an outcome y ∈ Y on dataset x ∈ D. For our setting, we will be using the quality score q(h, i) = h_i in the exponential mechanism.

Definition 4.1 (Exponential Mechanism). Let EM_q : D → Y be a randomized mapping where for all outputs y ∈ Y we have

Pr[EM_q(x) = y] ∝ exp(ε q(x, y)/Δ(q)),  where  Δ(q) := sup over neighboring x, x′ of sup_{y∈Y} |q(x, y) − q(x′, y)|.

We say that a quality score q(·, ·) is monotonic in the dataset if the addition of a data record can either increase (respectively, decrease) or leave unchanged the score for any outcome, e.g. q(x, y) ≤ q(x ∪ {x_i}, y) for any input and outcome y. Note that q(h, i) = h_i is monotonic. We then have the following privacy guarantee.

Lemma 4.1. The exponential mechanism EM_q is 2ε-DP. 
Further, if q is monotonic, then EM_q is ε-DP.

We point out that the exponential mechanism can be simulated by adding noise from Gumbel(Δ(q)/ε) to each quality score value and then reporting the outcome with the largest noisy count. This is similar to the report noisy max mechanism [10], except Gumbel noise is added rather than Laplace. We define pEM^k_q to be the iterative peeling algorithm that first samples the outcome with the largest quality score, then repeats on the remaining outcomes, and continues k times. We further define M^k_Gumbel(q(x)) to be the algorithm that adds Gumbel(Δ(q)/ε) to each q(x, y) for y ∈ Y and takes the k indices with the largest noisy counts. We then make the following connection between pEM^k and M^k_Gumbel, so that we can compute the top-k outcomes in one shot. We defer the proof to the supplementary file.

Lemma 4.2. For any input x ∈ X, the peeling exponential mechanism pEM^k_q(x) is equal in distribution to M^k_Gumbel(q(x)). That is, for any outcome vector (o1, ···, ok) ∈ [d]^k we have

Pr[pEM^k_q(x) = (o1, ···, ok)] = Pr[M^k_Gumbel(q(x)) = (o1, ···, ok)].

We next show that the one-shot noise addition is (≈ √k·ε, δ)-DP using Theorem 3. Dwork et al. [14] also considered a one-shot approach with Laplace noise addition, and in order to get the √k·ε factor on the privacy loss, their algorithm could not return the ranked list of indices. Using Gumbel noise allows us to return the ranked list of indices in one shot with the same privacy loss.

Corollary 4.1. 
For any δ ≥ 0, the one-shot M^k_Gumbel(q(·)) is (ε′, δ)-DP, and if q is monotonic in the dataset then M^k_Gumbel(q(·)) is (ε′′, δ)-DP, where

\[
\varepsilon' = 2 \cdot \min\left\{ k\varepsilon,\ k\varepsilon \left( \frac{e^{2\varepsilon} - 1}{e^{2\varepsilon} + 1} \right) + \varepsilon \sqrt{2 k \ln\left(\tfrac{1}{\delta}\right)} \right\}, \qquad
\varepsilon'' = \min\left\{ k\varepsilon,\ k\varepsilon \left( \frac{e^{\varepsilon} - 1}{e^{\varepsilon} + 1} \right) + \varepsilon \sqrt{2 k \ln\left(\tfrac{1}{\delta}\right)} \right\}.
\]

4.2 Bounded Range Composition

It turns out that we can actually improve on the total privacy loss for this algorithm, and for a wider class of algorithms in general. We first define a slightly stronger condition than (pure) differential privacy that can give a tighter characterization of the privacy loss for certain DP mechanisms.

Definition 4.2 (Range-Bounded). Given a mechanism M that takes a collection of records in X to outcome set Y, we say that M is ε-range-bounded if for any neighboring databases x, x′ we have

\[
\sup_{y \in \mathcal{Y}} \ln\left( \frac{\Pr[M(x) = y]}{\Pr[M(x') = y]} \right) - \inf_{y' \in \mathcal{Y}} \ln\left( \frac{\Pr[M(x) = y']}{\Pr[M(x') = y']} \right) \le \varepsilon.
\]

In the supplementary file, we show that the exponential mechanism achieves the same privacy parameters as in Lemma 4.1 under this stronger characterization.

Lemma 4.3. The exponential mechanism EM_q is 2ε-range-bounded. Further, if q is monotonic, then EM_q is ε-range-bounded.

We now show that we can achieve a better composition bound when we compose ε-range-bounded algorithms, as opposed to using Theorem 3, which applies to the composition of general DP algorithms. In fact, our composition bound for range-bounded algorithms improves on the optimal composition theorem for general DP algorithms [19, 26]. See the supplementary file for a comparison of the different bounds.
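To make Lemma 4.2 and Corollary 4.1 concrete, here is a minimal sketch (our own code, not the paper's implementation) of the one-shot Gumbel mechanism for the histogram quality score q(h, i) = h_i with Δ(q) = 1, together with the total privacy parameter from Corollary 4.1. The function names and the NumPy-based sampling are our own choices.

```python
import numpy as np

def one_shot_gumbel_top_k(counts, k, eps, sensitivity=1.0, rng=None):
    """M^k_Gumbel: add Gumbel(sensitivity/eps) noise to every count and
    release the k indices with the largest noisy counts, in ranked order.
    By Lemma 4.2 this matches k rounds of the peeling exponential mechanism."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.gumbel(loc=0.0, scale=sensitivity / eps, size=counts.shape)
    # Descending argsort gives the ranked top-k list that is released.
    return list(np.argsort(-noisy)[:k])

def corollary_privacy_loss(k, eps, delta, monotonic=True):
    """Corollary 4.1: total (eps', delta)-DP parameter for the ranked top-k.
    Each of the k releases is eps-DP when q is monotonic, 2*eps-DP otherwise."""
    e = eps if monotonic else 2.0 * eps
    advanced = (k * e * np.expm1(e) / (np.exp(e) + 1.0)
                + e * np.sqrt(2.0 * k * np.log(1.0 / delta)))
    return min(k * e, advanced)
```

The outer `min` matters in practice: for small k the basic composition term kε can be smaller than the advanced-composition term.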
We defer the proof, which largely follows a similar argument to [13].

Lemma 4.4. Let M_1, M_2, ··· , M_t each be ε_i-range-bounded, where the choice of M_i may depend on the previous outcomes of M_1, ··· , M_{i−1}. Then the composed algorithm M(x) = (M_1(x), M_2(x), ··· , M_t(x)) is (ε′, δ)-DP for any δ ≥ 0, where

\[
\varepsilon' = \min\left\{ \sum_{i=1}^{t} \varepsilon_i,\ \sum_{i=1}^{t} \varepsilon_i \cdot \left( \frac{e^{\varepsilon_i} - 1}{e^{\varepsilon_i} + 1} \right) + \sqrt{2 \sum_{i=1}^{t} \varepsilon_i^2 \ln\left(\tfrac{1}{\delta}\right)},\ \sum_{i=1}^{t} \frac{\varepsilon_i^2}{2} + \sqrt{\frac{1}{2} \sum_{i=1}^{t} \varepsilon_i^2 \ln\left(\tfrac{1}{\delta}\right)} \right\}. \tag{4}
\]

Note that in order to see an improvement over the advanced composition bound, we do not necessarily require that an ε-DP mechanism also be ε-range-bounded; the condition can be relaxed. More specifically, the significant term that is normally considered in advanced composition is √(2k ln(1/δ′))·ε, which can be replaced with √((α²/2)·k ln(1/δ′))·ε when composing αε-range-bounded mechanisms with α ≤ 2. Consequently, we believe that this could be useful for mechanisms beyond the exponential mechanism.

4.3 Comparison between Bounded Range DP Composition and Optimal DP Composition

Figure 1: Plotting the ratio of the optimal DP composition bound from [19], given in Lemma 4.5, to the bounded range DP composition bound from Lemma 4.4, for various ε values and δ < 10⁻⁶.

Here we compare the composition bound given in Lemma 4.4 and show that it can actually improve on the optimal bound for general DP algorithms, which we state here for the homogeneous case.

Lemma 4.5 (Optimal DP Composition [19]).
For any ε ≥ 0 and δ ∈ [0, 1], the composed mechanism of k adaptively chosen ε-DP algorithms is ((k − 2i)ε, δ_i)-DP for all i ∈ {0, 1, ··· , ⌊k/2⌋}, where

\[
\delta_i = \sum_{\ell=0}^{i-1} \binom{k}{\ell} \left( e^{(k-\ell)\varepsilon} - e^{(k-2i+\ell)\varepsilon} \right) \Big/ (1 + e^{\varepsilon})^k.
\]

In Figure 1, we plot, for various k and ε, the ratio between the composition bound for range-bounded DP algorithms and the optimal composition bound for general DP algorithms, where a ratio larger than 1 means that our bound is smaller. Due to the discrete formula for δ_i in the optimal composition result, we select the index i that produces the smallest (k − 2i)ε subject to δ_i ≤ 10⁻⁶. Frequently, the δ_i that is selected is much smaller than the threshold 10⁻⁶, so we use this same δ_i when we compare our bounds to the optimal composition bound. Note that the jaggedness in the plot arises because the optimal composition bound might be ((k − 2i)ε, δ ≪ 10⁻⁶)-DP at round k but ((k + 1 − 2(i + 1))ε, δ ≈ 10⁻⁶)-DP at round k + 1. Hence, plotting only the first privacy parameter can be non-monotonic in k.

5 Conclusion

We have presented a way to efficiently report the top-k elements in a dataset subject to differential privacy. Our approach does not require adjusting the input data to an existing system, nor does it require altering the non-private data analytics. Our algorithms can be seen as an additional layer on top of existing systems, allowing us to leverage highly efficient, scalable data analytics platforms in our private systems. Our algorithms can balance utility in terms of both privacy, via ε, and efficiency, via k̄.
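The comparison behind Figure 1 (Section 4.3) can be sketched numerically. The code below is our own reconstruction, not the paper's plotting code: it evaluates the homogeneous case of the bound in Lemma 4.4, taking the basic term kε, the advanced-composition term, and the range-bounded term kε²/2 + √((1/2)kε² ln(1/δ)), against the optimal DP bound of Lemma 4.5, choosing the largest index i with δ_i ≤ 10⁻⁶ as described above.

```python
import math

def bounded_range_bound(k, eps, delta):
    """Homogeneous case of Lemma 4.4: compose k eps-range-bounded mechanisms."""
    basic = k * eps
    advanced = (k * eps * math.expm1(eps) / (math.exp(eps) + 1.0)
                + math.sqrt(2.0 * k * eps**2 * math.log(1.0 / delta)))
    br = k * eps**2 / 2.0 + math.sqrt(0.5 * k * eps**2 * math.log(1.0 / delta))
    return min(basic, advanced, br)

def optimal_dp_bound(k, eps, delta_cap=1e-6):
    """Lemma 4.5 [19]: return the smallest (k - 2i)*eps, and its delta_i,
    over indices i with delta_i <= delta_cap (delta_i grows with i)."""
    best = (k * eps, 0.0)  # i = 0: the empty sum gives delta_0 = 0
    for i in range(1, k // 2 + 1):
        num = sum(math.comb(k, l) * (math.exp((k - l) * eps) - math.exp((k - 2 * i + l) * eps))
                  for l in range(i))
        delta_i = num / (1.0 + math.exp(eps)) ** k
        if delta_i > delta_cap:
            break
        best = ((k - 2 * i) * eps, delta_i)
    return best
```

A ratio `optimal_dp_bound(k, eps)[0] / bounded_range_bound(k, eps, delta)` above 1, as in Figure 1, means the bounded-range bound is the tighter of the two.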
Further, we have improved on the general composition bounds in differential privacy that can be applied in our setting to extract more utility under the same privacy budget, and we have provided a pay-what-you-get composition bound.

References

[1] Apple Differential Privacy Team. Learning with privacy at scale, 2017. Available at https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html.

[2] M. Bafna and J. Ullman. The price of selection in differential privacy. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 151–168. PMLR, 2017. URL http://proceedings.mlr.press/v65/bafna17a.html.

[3] R. Bassily and A. Smith. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 127–135. ACM, 2015. URL http://doi.acm.org/10.1145/2746539.2746632.

[4] R. Bassily, K. Nissim, U. Stemmer, and A. Guha Thakurta. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems 30, pages 2288–2296. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6823-practical-locally-private-heavy-hitters.pdf.

[5] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta.
Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 503–512. ACM, 2010. URL http://doi.acm.org/10.1145/1835804.1835869.

[6] M. Bun and T. Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference (TCC), pages 635–658, 2016.

[7] K. Chaudhuri, D. J. Hsu, and S. Song. The large margin mechanism for differentially private maximization. In Advances in Neural Information Processing Systems 27, pages 1287–1295. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5391-the-large-margin-mechanism-for-differentially-private-maximization.pdf.

[8] A. N. Dajani, A. D. Lauger, P. E. Singer, D. Kifer, J. P. Reiter, A. Machanavajjhala, S. L. Garfinkel, S. A. Dahl, M. Graham, V. Karwa, H. Kim, P. Leclerc, I. M. Schmutte, W. N. Sexton, L. Vilhuber, and J. M. Abowd. The modernization of statistical disclosure limitation at the U.S. Census Bureau. Available online at https://www2.census.gov/cac/sac/meetings/2017-09/statistical-disclosure-limitation.pdf, 2017.

[9] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. December 2017. URL https://www.microsoft.com/en-us/research/publication/collecting-telemetry-data-privately/.

[10] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014. URL http://dx.doi.org/10.1561/0400000042.

[11] C. Dwork and G. Rothblum. Concentrated differential privacy. arXiv:1603.01887 [cs.DS], 2016.

[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith.
Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Theory of Cryptography Conference, pages 265–284, 2006.

[13] C. Dwork, G. N. Rothblum, and S. P. Vadhan. Boosting and differential privacy. In 51st Annual Symposium on Foundations of Computer Science, pages 51–60, 2010.

[14] C. Dwork, W. Su, and L. Zhang. Private false discovery rate control. arXiv e-prints, arXiv:1511.03803, Nov 2015.

[15] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS '14, pages 1054–1067. ACM, 2014. URL http://doi.acm.org/10.1145/2660267.2660348.

[16] G. Fanti, V. Pihur, and Ú. Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies (PoPETS), issue 3, 2016.

[17] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv., 40(4):11:1–11:58, Oct. 2008. URL http://doi.acm.org/10.1145/1391729.1391730.

[18] N. Johnson, J. P. Near, and D. Song. Towards practical differential privacy for SQL queries. Proc. VLDB Endow., 11(5):526–539, Jan. 2018. URL https://doi.org/10.1145/3187009.3177733.

[19] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037–4049, June 2017.

[20] M. Kantarcıoğlu, J. Jin, and C. Clifton. When do data mining results violate privacy?
In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 599–604. ACM, 2004. URL http://doi.acm.org/10.1145/1014052.1014126.

[21] K. Kenthapadi and T. T. L. Tran. PriPeARL: A framework for privacy-preserving analytics and reporting at LinkedIn. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 2183–2191. ACM, 2018. URL http://doi.acm.org/10.1145/3269206.3272031.

[22] J. Lee and C. W. Clifton. Top-k frequent itemsets via differentially private FP-trees. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 931–940. ACM, 2014. URL http://doi.acm.org/10.1145/2623330.2623723.

[23] N. Li, W. H. Qardaji, D. Su, and J. Cao. PrivBasis: Frequent itemset mining with differential privacy. PVLDB, 5:1340–1351, 2012.

[24] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1(1), Mar. 2007. URL http://doi.acm.org/10.1145/1217299.1217302.

[25] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual Symposium on Foundations of Computer Science, 2007.

[26] J. Murtagh and S. Vadhan. The complexity of computing the optimal composition of differential privacy. In Theory of Cryptography - TCC 2016-A, volume 9562, pages 157–175. Springer-Verlag, 2016.
URL https://doi.org/10.1007/978-3-662-49096-9_7.

[27] R. M. Rogers, A. Roth, J. Ullman, and S. P. Vadhan. Privacy odometers and filters: Pay-as-you-go composition. In Advances in Neural Information Processing Systems 29, pages 1921–1929, 2016. URL http://papers.nips.cc/paper/6170-privacy-odometers-and-filters-pay-as-you-go-composition.

[28] C. Zeng, J. F. Naughton, and J.-Y. Cai. On differentially private frequent itemset mining. Proc. VLDB Endow., 6(1):25–36, Nov. 2012. URL http://dx.doi.org/10.14778/2428536.2428539.

[29] W. Zhu, P. Kairouz, H. Sun, B. McMahan, and W. Li. Federated heavy hitters discovery with differential privacy. CoRR, abs/1902.08534, 2019. URL http://arxiv.org/abs/1902.08534.