{"title": "A Necessary and Sufficient Stability Notion for Adaptive Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 11485, "page_last": 11494, "abstract": "We introduce a new notion of the stability of computations, which holds under post-processing and adaptive composition. We show that the notion is both necessary and sufficient to ensure generalization in the face of adaptivity, for any computations that respond to bounded-sensitivity linear queries while providing accuracy with respect to the data sample set. The stability notion is based on quantifying the effect of observing a computation's outputs on the posterior over the data sample elements. We show a separation between this stability notion and previously studied notions, and observe that all differentially private algorithms also satisfy this notion.", "full_text": "A necessary and sufficient stability notion for adaptive generalization

Katrina Ligett
School of Computer Science & Engineering
Hebrew University of Jerusalem
Jerusalem 91904, Israel
katrina@cs.huji.ac.il

Moshe Shenfeld
School of Computer Science & Engineering
Hebrew University of Jerusalem
Jerusalem 91904, Israel
moshe.shenfeld@cs.huji.ac.il

Abstract

We introduce a new notion of the stability of computations, which holds under post-processing and adaptive composition. We show that the notion is both necessary and sufficient to ensure generalization in the face of adaptivity, for any computations that respond to bounded-sensitivity linear queries while providing accuracy with respect to the data sample set. 
The stability notion is based on quantifying the effect of observing a computation's outputs on the posterior over the data sample elements. We show a separation between this stability notion and previously studied notions, and observe that all differentially private algorithms also satisfy this notion.

1 Introduction

A fundamental idea behind most forms of data-driven research and machine learning is the concept of generalization: the ability to infer properties of a data distribution by working only with a sample from that distribution. One typical approach is to invoke a concentration bound to ensure that, for a sufficiently large sample size, the evaluation of the function on the sample set will yield a result that is close to its value on the underlying distribution, with high probability. Intuitively, these concentration arguments ensure that, for any given function, most sample sets are good "representatives" of the distribution. Invoking a union bound, such a guarantee easily extends to the evaluation of multiple functions on the same sample set.

Of course, such guarantees hold only if the functions to be evaluated were chosen independently of the sample set. In recent years, grave concern has erupted in many data-driven fields that adaptive selection of computations is eroding the statistical validity of scientific findings [Ioa05, GL14]. Adaptivity is not an evil to be avoided; it constitutes a natural part of the scientific process, wherein previous findings are used to develop and refine future hypotheses. 
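Before formalizing the problem, a toy simulation (ours, not from the paper) makes the danger concrete: every feature below is independent of the label, yet a predictor assembled from the sample correlations of earlier queries, i.e., an adaptively chosen query, looks highly accurate on the sample while being worthless on the distribution.

```python
# Toy demonstration of adaptive overfitting (illustration only).
# Features and labels are independent, so every query has population value 0.
import random

random.seed(0)
n, d = 100, 1000  # sample size, number of candidate features

# Sample set: each element is (feature vector, label), all uniform and independent.
sample = [([random.choice((-1, 1)) for _ in range(d)], random.choice((-1, 1)))
          for _ in range(n)]

# Round 1 (non-adaptive): empirical correlation of each feature with the label.
corr = [sum(x[j] * y for x, y in sample) / n for j in range(d)]

# Round 2 (adaptive): combine exactly the features that looked positively
# correlated on this particular sample.
chosen = [j for j in range(d) if corr[j] > 0]

def q(x, y):
    s = sum(x[j] for j in chosen)
    return y * (1 if s > 0 else -1)   # does the combined vote agree with y?

sample_value = sum(q(x, y) for x, y in sample) / n

# A fresh holdout set stands in for the true distribution.
holdout = [([random.choice((-1, 1)) for _ in range(d)], random.choice((-1, 1)))
           for _ in range(2000)]
holdout_value = sum(q(x, y) for x, y in holdout) / len(holdout)

print(f"sample value ~ {sample_value:.2f}, holdout value ~ {holdout_value:.2f}")
```

On the sample the adaptively built query reports strong agreement, while on fresh data its value is near zero, exactly the gap between sample accuracy and distribution accuracy studied below.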
However, unchecked adaptivity can (and does, as demonstrated by, e.g., [DFH+15b] and [RZ16]) often lead one to evaluate overfitting functions: ones that return very different values on the sample set than on the distribution.

Traditional generalization guarantees do not necessarily guard against adaptivity; while generalization ensures that the response to a query on a sample set will be close to that of the same query on the distribution, it does not rule out the possibility that the probability of getting a specific response will be dramatically affected by the contents of the sample set. In the extreme, a generalizing computation could encode the whole sample set in the low-order bits of the output, while maintaining high accuracy with respect to the underlying distribution. Subsequent adaptive queries could then, by post-processing the computation's output, arbitrarily overfit to the sample set.

In recent years, an exciting line of work, starting with Dwork et al. [DFH+15b], has formalized this problem of adaptive data analysis and introduced new techniques to ensure guarantees of generalization in the face of an adaptively-chosen sequence of computations (what we call here adaptive generalization). One great insight of Dwork et al. and followup work was that techniques for ensuring the stability of computations (some of them originally conceived as privacy notions) can be powerful tools for providing adaptive generalization.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A number of papers have considered variants of stability notions, the relationships between them, and their properties, including generalization properties. Despite much progress in this space, one issue that has remained open is the limits of stability: how much can the stability notions be relaxed, and still imply generalization? 
It is this question that we address in this paper.

1.1 Our Contribution

We introduce a new notion of the stability of computations, which holds under post-processing (Theorem 2.3) and adaptive composition (Theorems 2.6 and 2.7), and show that the notion is both necessary (Theorem 3.6) and sufficient (Theorem 3.3) to ensure generalization in the face of adaptivity, for any computations that respond to bounded-sensitivity linear queries (see Definition 3.1) while providing accuracy with respect to the data sample set. This means (up to a small caveat)¹ that our stability definition is equivalent to generalization, assuming sample accuracy, for bounded linear queries. Linear queries form the basis for many learning algorithms, such as those that rely on gradients or on the estimation of the average loss of a hypothesis.

In order to formulate our stability notion, we consider a prior distribution over the database elements and the posterior distribution over those elements conditioned on the output of a computation. In some sense, harmful outputs are those that induce large statistical distance between this prior and posterior (Definition 2.1). Our new notion of stability, Local Statistical Stability (Definition 2.2), intuitively requires a computation to have only small probability of producing such a harmful output.

In Section 4, we directly prove that Differential Privacy, Max Information, Typical Stability, and Compression Schemes all imply Local Statistical Stability, which provides an alternative method to establish their generalization properties. We also provide a few separation results between the various definitions.

1.2 Additional Related Work

Most countermeasures to overfitting fall into one of a few categories. 
A long line of work bases generalization guarantees on some form of bound on the complexity of the range of the mechanism, e.g., its VC dimension (see [SSBD14] for a textbook summary of these techniques). Other examples include Bounded Description Length [DFH+15a] and compression schemes [LW86] (which additionally hold under post-processing and adaptive composition [DFH+15a, CLN+16]). Another line of work focuses on the algorithmic stability of the computation [BE02], which bounds the effects on the output of changing one element in the training set.

A different category of stability notions, which focus on the effect of a small change in the sample set on the probability distribution over the range of possible outputs, has recently emerged from the notion of Differential Privacy [DMNS06]. Work of [DFH+15b] established that Differential Privacy, interpreted as a stability notion, ensures generalization; it is also known (see [DR+14]) to be robust to adaptivity and to withstand post-processing. A number of subsequent works propose alternative stability notions that weaken the conditions of Differential Privacy in various ways while attempting to retain its desirable generalization properties. One example is Max Information [DFH+15a], which shares the guarantees of Differential Privacy. A variety of other stability notions ([RRST16, RZ16, RRT+16, BNS+16, FS17, EGI19]), unlike Differential Privacy and Max Information, only imply generalization in expectation. [XR17, Ala17, BMN+17] extend these guarantees to generalization in probability, under various restrictions.

[CLN+16] introduce the notion of post-hoc generalization, which captures robustness to post-processing, but it was recently shown not to hold under composition [NSS+18]. 
The challenges that the internal correlation of non-product distributions presents for stability have been studied in the context of Inferential Privacy [GK16] and Typical Stability [BF16].

¹In particular, our lower bound (Theorem 3.6) requires one more query than our upper bound (Theorem 3.3).

2 LS stability definition and properties

Let X be an arbitrary countable domain. Fixing some n ∈ ℕ, let D_{X^n} be some probability distribution defined over X^n.² Let Q, R be arbitrary countable sets, which we will refer to as queries and responses, respectively. Let a mechanism M : X^n × Q → R be a (possibly non-deterministic) function that, given a sample set s ∈ X^n and a query q ∈ Q, returns a response r ∈ R. Intuitively, queries can be thought of as questions the mechanism is asked about the sample set, usually representing functions from X^n to ℝ; the mechanism can be thought of as providing an estimate of the value of those functions, but we do not restrict the definitions, for reasons which will become apparent once we introduce the notion of adaptivity (Definition 2.4).

This setting involves two sources of randomness: the underlying distribution D_{X^n}, and the conditional distribution D^q_{R|X^n}(r | s), that is, the probability of getting r as the output of M(s, q). These in turn induce a set of distributions (formalized in Definition A.1): the marginal distribution over R, the joint distribution (denoted D^q_{(X^n,R)}) and product distribution (denoted D^q_{X^n ⊗ R}) over X^n × R, and the conditional distribution over X^n given r ∈ R. Note that even if D_{X^n} is a product distribution, this conditional distribution might not be a product distribution. 
Although the underlying distribution D_{X^n} is defined over X^n, it induces a natural probability distribution over X as well, by sampling one of the sample elements in the set uniformly at random.³ This in turn allows us to extend our definitions to several other distributions, which form a connection between R and X (formalized in Definition A.2): the marginal distribution over X, the joint distribution and product distribution over X × R, the conditional distribution over R given x ∈ X, and the conditional distribution over X given r ∈ R. We use our distribution notation to denote both the probability that a distribution places on a subset of its range and the probability placed on a single element of the range.

Notational conventions We use calligraphic letters to denote domains, lower case letters to denote elements of these domains, capital letters to denote random variables taking values in these domains, and bold letters to denote subsets of these domains. We omit subscripts and superscripts from some notation when they are clear from context.

2.1 Local Statistical Stability

Before observing any output from the mechanism, an outside observer knowing D but without other information about the sample set s holds prior D(x) that sampling an element of s would return a particular x ∈ X. Once an output r of the mechanism is observed, however, the observer's posterior becomes D(x | r). The difference between these two distributions is what determines the resulting degradation in stability. This difference could be quantified using a variety of distance measures (a partial list can be found in Appendix F); here we introduce a particular one, which we use to define our stability notion.

Definition 2.1 (Stability loss of a response). 
Given a distribution D_{X^n}, a query q, and a mechanism M : X^n × Q → R, the stability loss ℓ^q_{D_{X^n}}(r) of a response r ∈ R with respect to D_{X^n} and q is defined as the Statistical Distance (Definition F.1) between the prior distribution over X and the posterior induced by r. That is,

ℓ^q_{D_{X^n}}(r) := Σ_{x ∈ x⁺(r)} (D(x | r) − D(x)),

where x⁺(r) := {x ∈ X | D(x | r) > D(x)}, the set of all sample elements which have a posterior probability (given r) higher than their prior. Similarly, we define the stability loss ℓ(r) of a set of responses r ⊆ R as

ℓ(r) := (Σ_{r ∈ r} D(r) · ℓ(r)) / D(r).

Given 0 ≤ ε ≤ 1, a response will be called ε-unstable with respect to D_{X^n} and q if its loss is greater than ε. The set of all ε-unstable responses will be denoted r_ε^{D_{X^n},q} := {r ∈ R | ℓ(r) > ε}.

²Throughout the paper, X^n can either denote the family of sequences of length n or a multiset of size n; that is, the sample set s can be treated as an ordered or unordered set.
³It is worth noting that in the case where D_{X^n} is the product distribution of some distribution P_X over X, we get that the induced distribution over X is P_X.

We now introduce our notion of stability of a mechanism.

Definition 2.2 (Local Statistical Stability). Given 0 ≤ ε, δ ≤ 1, a distribution D_{X^n}, and a query q, a mechanism M : X^n × Q → R will be called (ε, δ)-Local-Statistically Stable with respect to D_{X^n} and q (or LS Stable, or LSS, for short) if for any r ⊆ R, D(r) · (ℓ(r) − ε) ≤ δ.

Notice that the maximal value of the left hand side is achieved for the subset r_ε. 
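To make Definitions 2.1 and 2.2 concrete, the following toy, exhaustive computation (our illustration; the counting mechanism and its noise are stand-ins, not objects from the paper) evaluates the stability loss of every response of a noisy counting mechanism over X = {0, 1} with n = 3 and a uniform product prior:

```python
# Exhaustive computation of the stability loss (Definition 2.1) for a toy mechanism.
from itertools import product

n = 3
prior_x = {0: 0.5, 1: 0.5}                    # induced prior over X = {0, 1}
samples = list(product((0, 1), repeat=n))     # X^n, uniform product distribution
p_s = {s: 0.5 ** n for s in samples}

# Mechanism: answer the counting query sum(s) plus noise uniform on {-1, 0, +1}.
def p_r_given_s(r, s):
    return 1 / 3 if abs(r - sum(s)) <= 1 else 0.0

responses = range(-1, n + 2)
p_r = {r: sum(p_s[s] * p_r_given_s(r, s) for s in samples) for r in responses}

def loss(r):
    # Posterior over a uniformly drawn element of S, given the response r.
    post_x = {0: 0.0, 1: 0.0}
    for s in samples:
        p_s_given_r = p_s[s] * p_r_given_s(r, s) / p_r[r]   # Bayes' rule
        for x in (0, 1):
            post_x[x] += p_s_given_r * s.count(x) / n
    # Statistical distance: sum over x whose posterior exceeds its prior.
    return sum(post_x[x] - prior_x[x] for x in (0, 1) if post_x[x] > prior_x[x])

for r in responses:
    print(r, round(loss(r), 4))
```

The extreme responses −1 and n+1 reveal the sample set exactly (loss 1/2), while middle responses barely move the posterior; an (ε, δ)-LSS bound caps the total probability mass that such revealing responses may carry.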
This stability\nde\ufb01nition can be extended to apply to a family of queries and/or a family of possible distributions.\nWhen there exists a family of queries Q and a family of distributions D such that a mechanism M\nis (\u0001, \u03b4)-LSS for all DX n \u2208 D and for all q \u2208 Q, then M will be called (\u0001, \u03b4)-LSS for D,Q. (This\nstability notion somewhat resembles Semantic Privacy as discussed by [KS14], though they use it to\ncompare different posterior distributions.)\n\nIntuitively, this can be thought of as placing a \u03b4 bound on the probability of observing an outcome\nwhose stability loss exceeds \u0001. This claim is formalized in Lemma B.1.\n\n2.2 Properties\n\nWe now turn to prove two crucial properties of LSS: post-processing and adaptive composition.\nPost-processing guarantees (known in some contexts as data processing inequalities) ensure that\nthe stability of a computation can only be increased by subsequent manipulations. This is a key\ndesideratum for concepts used to ensure adaptivity-proof generalization, since otherwise an adaptive\nsubsequent computation could potentially arbitrarily degrade the generalization guarantees.\nTheorem 2.3 (LSS holds under Post-Processing). Given 0 \u2264 \u0001, \u03b4 \u2264 1, a distribution DX n, and a\nquery q, if a mechanism M is (\u0001, \u03b4)-LSS with respect to DX n and q, then for any range U and any\narbitrary (possibly non-deterministic) function f : R \u2192 U, we have that f \u25e6 M : X n \u00d7 Q \u2192 U is\nalso (\u0001, \u03b4)-LSS with respect to DX n and q. An analogous statement also holds for mechanisms that\nare LSS with respect to a family of queries and/or a family of distributions.\n\nThe proof appears in Appendix B.1.\nIn order to formally de\ufb01ne adaptive learning and stability under adaptively chosen queries, we\nformalize the notion of an analyst who issues those queries.\nDe\ufb01nition 2.4 (Analyst and Adaptive Mechanism). 
An analyst over a family of queries Q is a (possibly non-deterministic) function A : R* → Q that receives a view (a finite sequence of responses) and outputs a query. We denote by A the family of all analysts, and write V_k := R^k and V := R*.

Illustrated below, the adaptive mechanism Adp_M̄ : X^n × A → V_k is a particular type of mechanism, which takes an analyst as its query and returns a view as its range type. It is parameterized by a sequence of sub-mechanisms M̄ = (M_i)_{i=1}^k where ∀i ∈ [k], M_i : X^n × Q → R. Given a sample set s and an analyst A as input, the adaptive mechanism iterates k times through the process where A sends a query to M_i and receives its response to that query on the sample set. The adaptive mechanism returns the resulting sequence of k responses v_k. Naturally, this requires A to match M such that M's range can be A's input, and vice versa.⁴ ⁵

For illustration, consider a gradient descent algorithm, where at each step the algorithm requests an estimate of the gradient at a given point, and chooses the next point at which the gradient should be evaluated based on the response it receives. For us, M evaluates the gradient at a given point, and A

⁴If the same mechanism appears more than once in M̄, it can also be stateful, which means it retains an internal record consisting of internal randomness, the history of sample sets and queries it has been fed, and the responses it has produced; its behavior may be a function of this internal record. We omit this from the notation for simplicity, but do refer to this when relevant. A stateful mechanism will be defined as LSS if it is LSS given any reachable internal record. 
A pedantic treatment might consider the probability that a particular internal state could be reached, and only require LSS when accounting for these probabilities.

⁵If A is randomized, we add one more step at the beginning, where Adp_M̄ randomly generates some bits c (A's "coin tosses"). In this case, v_k := (c, r_1, . . . , r_k) and A receives the coin tosses as an input as well. This addition turns q_{i+1} into a deterministic function of v_i for any i ∈ ℕ, a fact that will be used multiple times throughout the paper. In this situation, the randomness of Adp_M̄ results both from the randomness of the coin tosses and from that of the sub-mechanisms.

Adaptive Mechanism Adp_M̄
Input: s ∈ X^n, A ∈ A
Output: v_k ∈ V_k
  v_0 ← ∅ (or c, if A is randomized)
  for i ∈ [k]:
    q_i ← A(v_{i−1})
    r_i ← M_i(s, q_i)
    v_i ← (v_{i−1}, r_i)
  return v_k

determines the next point to be considered. The interaction between the two of them constitutes an adaptive learning process.

Definition 2.5 (k-LSS under adaptivity). Given 0 ≤ ε, δ ≤ 1, a distribution D_{X^n}, and an analyst A, a sequence of k mechanisms M̄ will be called (ε, δ)-local-statistically stable under k adaptive iterations with respect to D_{X^n} and A (or k-LSS for short), if Adp_M̄ is (ε, δ)-LSS with respect to D_{X^n} and A (in which case we will use v_ε^{k,A,D_{X^n}} to denote the set of ε-unstable views). This definition can be extended to a family of analysts and/or a family of possible distributions as well.

Adaptive composition is a key property of a stability notion, since it restricts the degradation of stability across multiple computations. A key observation is that the posterior D(s | v_k) is itself a distribution over X^n and q_{k+1} is a deterministic function of v_k. 
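Before turning to composition, the Adp loop above can be rendered directly in code (a sketch; the threshold analyst and the bounded-noise sub-mechanism are our stand-ins, not objects defined in the paper):

```python
# Direct rendering of the adaptive mechanism Adp: the analyst maps the view
# so far to the next query, and each sub-mechanism answers on the sample set.
import random

def adaptive_mechanism(s, analyst, sub_mechanisms):
    view = []
    for mech in sub_mechanisms:
        q = analyst(view)      # q_i <- A(v_{i-1})
        r = mech(s, q)         # r_i <- M_i(s, q_i)
        view.append(r)         # v_i <- (v_{i-1}, r_i)
    return view                # v_k

# Stand-in sub-mechanism: a linear query answered with bounded noise.
def noisy_mean(s, q, scale=0.05):
    return sum(q(x) for x in s) / len(s) + random.uniform(-scale, scale)

# Stand-in analyst: a threshold query whose cut-off adapts to the last response.
def analyst(view):
    t = view[-1] if view else 0.0
    return lambda x: 1.0 if x > t else 0.0

random.seed(1)
s = [random.gauss(0, 1) for _ in range(500)]
print(adaptive_mechanism(s, analyst, [noisy_mean] * 3))
```

Each round's query depends on all previous responses, which is exactly why the composition theorems below must reason about the posteriors induced by earlier rounds.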
Therefore, as long as each sub-mechanism is LSS with respect to any posterior that could have been induced by previous adaptive interaction, one can reason about the properties of the composition.

We first show that the stability loss of a view is bounded by the sum of losses of its responses with respect to the sub-mechanisms, which provides a linear bound on the degradation of the LSS parameters. Adding a bound on the expectation of the loss of the sub-mechanisms allows us to also invoke Azuma's inequality and prove a sub-linear bound.

Theorem 2.6 (LSS adaptively composes linearly). Given a family of distributions D over X^n, a family of queries Q, and a sequence of k mechanisms M̄ where ∀i ∈ [k], M_i : X^n × Q → R, we will denote D_{M_0,Q} := D, and for any i > 0, D_{M_i,Q} will denote the set of all posterior distributions induced by any response of M_i with non-zero probability with respect to D_{M_{i−1},Q} and Q (see Definition B.2). Given a sequence 0 ≤ ε_1, δ_1, . . . , ε_k, δ_k ≤ 1, if for all i, M_i is (ε_i, δ_i)-LSS with respect to D_{M_{i−1},Q} and Q, the sequence is (Σ_{i∈[k]} ε_i, Σ_{i∈[k]} δ_i)-k-LSS with respect to D and any analyst A over Q × R.

The proof appears in Appendix B.3.

One simple case is when D_{M_{i−1},Q} = D, and M_i is (ε_i, δ_i)-LSS with respect to D and Q, for all i.

Theorem 2.7 (LSS adaptively composes sub-linearly). Under the same conditions as Theorem 2.6, given 0 ≤ α_1, . . . 
, α_k ≤ 1, such that for all i and any D_{X^n} ∈ D_{M_{i−1},Q} and q ∈ Q, E_{S∼D_{X^n}, R∼M_i(s,q)}[ℓ(R)] ≤ α_i, then for any 0 ≤ δ′ ≤ 1, the sequence is (ε′, δ′ + Σ_{i∈[k]} δ_i/ε_i)-k-LSS with respect to D and any analyst A over Q × R, where

ε′ := √(8 ln(1/δ′) Σ_{i∈[k]} ε_i²) + Σ_{i∈[k]} α_i.

The theorem provides a better bound than the previous one in case α_i ≪ ε_i, in which case the dominating term is the first one, which is sub-linear in k. The proof appears in Appendix B.4.

3 LSS is Necessary and Sufficient for Generalization

Up until this point, queries and responses have been fairly abstract concepts. In order to discuss generalization and accuracy, we must make them concrete. As a result, in this section, we often consider queries in the family of functions q : X^n → ℝ, and consider responses which have some metric defined over them. We show our results for a fairly general class of functions known as bounded linear queries.⁶

Definition 3.1 (Linear queries). A function q : X^n → ℝ will be called a linear query if it is defined by a function q_1 : X → ℝ such that q(s) := (1/n) Σ_{i=1}^n q_1(s_i) (for simplicity we will denote q_1 simply as q throughout the paper). If q : X → [−∆, ∆] it will be called a ∆-bounded linear query. The set of ∆-bounded linear queries will be denoted Q_∆.

In this context, there is a "correct" answer the mechanism can produce for a given query, defined as the value of the function on the sample set or distribution, and its distance from the response provided by the mechanism can be thought of as the mechanism's error.

Definition 3.2 (Sample accuracy, distribution accuracy). Given 0 ≤ ε, 0 ≤ δ ≤ 1, a distribution D_{X^n}, and a query q, a mechanism M : X^n × Q → R will be called (ε, δ)-Sample Accurate with respect to D_{X^n} and q if

Pr_{S∼D_{X^n}, R∼M(S,q)}[|R − q(S)| > ε] ≤ δ.

Such a mechanism will be called (ε, δ)-Distribution Accurate with respect to D_{X^n} and q if

Pr_{S∼D_{X^n}, R∼M(S,q)}[|R − q(D_{X^n})| > ε] ≤ δ,

where q(D_{X^n}) := E_{S∼D_{X^n}}[q(S)]. When there exists a family of distributions D and a family of queries Q such that a mechanism M is (ε, δ)-Sample (Distribution) Accurate for all D ∈ D and for all q ∈ Q, then M will be called (ε, δ)-Sample (Distribution) Accurate with respect to D and Q.

A sequence of k mechanisms M̄ where ∀i ∈ [k] : M_i : X^n × Q → R which respond to a sequence of k (potentially adaptively chosen) queries q_1, . . . , q_k will be called (ε, δ)-k-Sample Accurate with respect to D_{X^n} and q_1, . . . , q_k if

Pr_{S∼D_{X^n}, R_i∼M_i(S,q_i)}[max_{i∈[k]} |R_i − q_i(S)| > ε] ≤ δ,

and (ε, δ)-k-Distribution Accurate with respect to D_{X^n} and q_1, . . . , 
q_k if

Pr_{S∼D_{X^n}, R_i∼M_i(S,q_i)}[max_{i∈[k]} |R_i − q_i(D_{X^n})| > ε] ≤ δ.

When considering an adaptive process, accuracy is defined with respect to the analyst, and the probabilities are taken also over the choice of the coin tosses by the adaptive mechanism.⁷

We denote by V the set of views consisting of responses in ℝ.

We now show that if a mechanism returns accurate results with respect to the sample set, then being LSS implies accuracy on the underlying distribution.

Theorem 3.3 (LSS implies generalization with high probability). Given 0 ≤ ε ≤ ∆, 0 ≤ δ ≤ 1, a distribution D_{X^n}, and an analyst A : V → Q_∆, if a sequence of k mechanisms M̄ where ∀i ∈ [k], M_i : X^n × Q_∆ → R is both (ε/(8∆), ε²δ/(4800∆²))-k-LSS and (ε/8, εδ/(600∆))-k-Sample Accurate with respect to D_{X^n} and A, then it is (ε, δ)-k-Distribution Accurate with respect to D_{X^n} and A.

The proof of this theorem consists of two stages, and follows the method introduced by [BNS+16]. First we show that a query returned by an LSS mechanism has expected value on the underlying distribution that is close to its value on the sample set that the mechanism received as input (Appendix C.1). We then proceed to lift this guarantee from expectation to high probability, using a thought experiment known as the Monitor Mechanism (Appendix C.2). 
Intuitively, it runs a large number of independent copies of an underlying mechanism, and exposes the results of the least-distribution-accurate copy as its output. If the expected error of even this least-accurate copy is relatively low, then the underlying mechanism generalizes with high probability (Appendix C.3).

⁶For simplicity, throughout the following section we choose R = ℝ, but all results extend to any metric space, in particular ℝ^d.
⁷If the adaptive mechanism invokes a stateful sub-mechanism multiple times, we specify that the mechanism is Sample (Distribution) Accurate if it is Sample (Distribution) Accurate given any reachable internal record. Again, a somewhat more involved treatment might consider the probability that a particular internal state of the mechanism could be reached.

We next show that a mechanism that is not LSS cannot be both Sample Accurate and Distribution Accurate. In order to prove this theorem, we show how to construct a "bad" query.

Definition 3.4 (Loss assessment query). Given a query q and a response r, we will define the Loss assessment query q̃_r as

q̃_r(x) = ∆ if D(x) > D(x | r), and q̃_r(x) = −∆ if D(x) ≤ D(x | r).

Intuitively, this function maximizes the difference between E_{X∼D_X}[q̃_r(X)] and E_{X∼D^q_{X|R}}[q̃_r(X) | r], and as a result, the potential to overfit.⁸

This function is used to lower bound the effect of the stability loss on the expected overfitting.

Lemma 3.5 (Loss assessment query overfits in expectation). 
Given 0 ≤ ε, δ ≤ 1, a distribution D_{X^n}, a query q, and a mechanism M, if D(r_ε) > δ, then there is a function f : R → Q_∆ such that

|E_{S∼D_{X^n}, Q′∼f∘M(S,q)}[Q′(D_{X^n}) − Q′(S)]| > 2ε∆δ.

Proof. Choosing f(r) = q̃_r we get that

|E_{S∼D_{X^n}, Q′∼f∘M(S,q)}[Q′(D_{X^n}) − Q′(S)]|
  =(1) |Σ_{q′∈Q_∆} D(q′) · Σ_{x∈X} (D(x) − D(x | q′)) · q′(x)|
  = |Σ_{r∈R} D(r) · Σ_{x∈X} (D(x) − D(x | r)) · q̃_r(x)|
  ≥(2) Σ_{r∈r_ε} D(r) · Σ_{x∈X} |D(x) − D(x | r)| · ∆
  >(3) 2ε∆δ,

where (1) is further justified in the proof of Theorem C.1, (2) results from the definition of the loss assessment query, and (3) from the definition of r_ε: the first factor Σ_{r∈r_ε} D(r) exceeds δ, while for each r ∈ r_ε the inner sum Σ_{x∈X} |D(x) − D(x | r)| equals 2ℓ(r) > 2ε.

We use this method for constructing an overfitting query for a non-LSS mechanism to show that LSS is necessary in order for a mechanism to be both Sample Accurate and Distribution Accurate.

Theorem 3.6 (Necessity of LSS for Generalization). 
Given 0 ≤ ε ≤ ∆, 0 ≤ δ ≤ 1, a distribution D_{X^n}, and an analyst A : V → Q_∆, if a sequence of k mechanisms M̄ where ∀i ∈ [k], M_i : X^n × Q_∆ → R is not (ε/∆, δ)-k-LSS, then it cannot be both (ε/5, εδ/(5∆))-(k+1)-Distribution Accurate and (ε/5, εδ/(5∆))-(k+1)-Sample Accurate.

The proof of this theorem, which appears in Appendix C.4, uses a similar method to the proof of Theorem 3.3, employing a variant of the Monitor Mechanism that outputs the loss assessment query with the highest level of overfitting.

4 Relationship to other notions of stability

In this section, we discuss the relationship between LSS and a few common notions of stability; definitions can be found in Appendix D.1. In order to do so, we introduce an additional new stability notion, which relaxes the Max Information (MI) (Definition D.2) notion by moving from the distribution over the sample sets to the distribution over the sample elements.

⁸The fact that we are able to define such a query is a result of the way the distance measure of LSS treats the x's and the fact that it is defined over X and not X^n.

Definition 4.1 (Local Max Information). Given 0 ≤ ε, 0 ≤ δ ≤ 1, a distribution D_{X^n} and a query q, a mechanism M will be said to satisfy (ε, δ)-Local-Max-Information with respect to D_{X^n} and q (or LMI, for short) if the joint distribution D_{(X,R)} and the product distribution D_{X⊗R} over X × R are (ε, δ)-indistinguishable. 
In other words, for any b ⊆ X × R,

D_{(X,R)}(b) ≤ e^ε · D_{X⊗R}(b) + δ and D_{X⊗R}(b) ≤ e^ε · D_{(X,R)}(b) + δ.

The definition can be extended to apply to a family of queries and/or a family of possible distributions.

4.1 Implications

Prior work ([DFH+15a] and [RRST16]) showed that bounded Differential Privacy (DP) (Definition D.1) implies bounded MI (Definition D.2). In the case of δ > 0, this holds only if the underlying distribution is a product distribution [De12]. Bounded MI is also implied by Typical Stability (TS) (Definition D.3) [BF16] and Bounded Maximal Leakage (ML) [EGI19]. We prove that DP, MI, and TS imply LMI (in the case of DP, only for product distributions). All proofs for this subsection can be found in Appendix D.2, where we also introduce a local version of ML and prove its relation to LMI.

Theorem 4.2 (Differential Privacy implies Local Max Information). Given 0 ≤ ε, 0 ≤ δ ≤ 1, a distribution D_X, and a query q, if a mechanism M is (ε, δ)-DP with respect to q, then it is (ε, δ)-LMI with respect to the same q and the product distribution over X^n induced by D_X.

Theorem 4.3 (Max Information implies Local Max Information). Given 0 ≤ ε, 0 ≤ δ ≤ 1, a distribution D_{X^n} and a query q, if a mechanism M has δ-approximate max-information of ε with respect to D_{X^n} and q, then it is (ε, δ)-LMI with respect to the same D_{X^n} and q.

Theorem 4.4 (Typical Stability implies Local Max Information). 
Given 0 ≤ ε, 0 ≤ δ, η ≤ 1, a distribution D_{X^n} and a query q, if a mechanism M is (ε, δ, η)-Typically Stable with respect to D_{X^n} and q then it is (ε, δ + 2η)-LMI with respect to the same D_{X^n} and q.

These three theorems follow naturally from the fact that LMI is a fairly direct relaxation of DP, MI, and TS.

We next show that LMI implies LSS.

Theorem 4.5 (Local Max Information implies Local Statistical Stability). Given 0 ≤ δ ≤ ε ≤ 1/3, a distribution D_{X^n} and a query q, if a mechanism M is (ε, δ)-LMI with respect to D_{X^n} and q, then it is (ε′, δ/ε)-LSS with respect to the same D_{X^n} and q, where ε′ = e^ε − 1 + ε.

We also prove that Compression Schemes (Definition D.6) imply LSS. This results from the fact that releasing information based on a restricted number of sample elements has a limited effect on the posterior distribution of one element of the sample set.

Theorem 4.6 (Compressibility implies Local Statistical Stability). Given 0 ≤ δ ≤ 1, an integer m ≤ n/(9 ln(2n/δ)), a distribution D_X, and a query q ∈ Q, if a mechanism M has a compression scheme of size m then it is (ε, δ)-LSS with respect to the same q and the product distribution over X^n induced by D_X, for any ε > 11·√(m ln(2n/δ)/n).⁹

4.2 Separations

Finally, we show that MI is a strictly stronger requirement than LMI, and LMI is a strictly stronger requirement than LSS. Proofs of these theorems appear in Appendix D.3.

Theorem 4.7 (Max Information is strictly stronger than Local Max Information).
For any 0 < ε, n ≥ 3, the mechanism which outputs the parity function of the sample set is (ε, 0)-LMI but not (1, 1/5)-MI.

Theorem 4.8 (Local Max Information is strictly stronger than Local Statistical Stability). For any 0 ≤ δ ≤ 1, n > max{2 ln(2/δ), 6}, a mechanism which uniformly samples and outputs one sample element is (11·√(ln(2n/δ)/n), δ)-LSS but is not (1, 1/2n)-LMI.

⁹In case g releases some side information, the number of bits required to describe this information is added to the m factor in the bound on ε.

5 Applications and Discussion

In order to make the LSS notion useful, we must identify mechanisms which manage to remain stable while maintaining sample accuracy. Fortunately, many such mechanisms have been introduced in the context of Differential Privacy. Two of the most basic Differentially Private mechanisms are based on noise addition, of either a Laplace or a Gaussian random variable. Careful tailoring of their parameters allows "masking" the effect of changing one element, while maintaining a limited effect on the sample accuracy. By Theorems 4.2 and 4.5, these mechanisms are guaranteed to be LSS as well. The definitions and properties of these mechanisms can be found in Appendix E.

In moving away from the study of worst-case data sets (as is common in previous stability notions) to averaging over sample sets and over data elements of those sets, we hope that the Local Statistical Stability notion will enable new progress in the study of generalization under adaptive data analysis. This averaging, potentially leveraging a sort of "natural noise" from the data sampling process, may enable the development of new algorithms that preserve generalization, and may also support tighter bounds on the implications of existing algorithms.
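As a concrete illustration of the noise-addition mechanisms mentioned above, the following minimal sketch answers a bounded-sensitivity linear query with Laplace noise in the style of [DMNS06]. It is not taken from the paper; the function and parameter names (e.g., `laplace_mechanism`, `delta_range`) are our own, and the calibration shown is the standard one for (ε, 0)-DP, which by Theorems 4.2 and 4.5 implies LSS as well.

```python
import numpy as np

def laplace_mechanism(sample, query, delta_range, epsilon, rng=None):
    """Answer a linear query on `sample` with Laplace noise (illustrative sketch).

    A linear query averages a bounded function `query` (whose range has width
    `delta_range`) over the sample set; changing one of the n elements moves
    the empirical mean by at most delta_range / n, which is its sensitivity.
    Adding Laplace noise with scale sensitivity / epsilon gives the standard
    (epsilon, 0)-DP guarantee for this single query.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = len(sample)
    empirical_mean = np.mean([query(x) for x in sample])
    sensitivity = delta_range / n
    return empirical_mean + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a {0,1}-valued query over a sample of size 1000, so the noise
# scale is (1/1000) / epsilon and the answer stays close to the sample mean.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=1000)
answer = laplace_mechanism(sample, lambda x: float(x), delta_range=1.0,
                           epsilon=0.5, rng=rng)
```

Note that the noise here masks a single element's influence on one query; handling k adaptively chosen queries requires composing the guarantee across rounds, as in the sequel above.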
One possible way this might be achieved is by limiting the family of distributions and queries, such that the empirical mean of the query lies within some confidence interval around the population mean, which would allow scaling the noise to the interval rather than to the full range (see, e.g., Concentrated Queries, as proposed by [BF16]).

One might also hope that realistic adaptive learning settings are not adversarial, and might therefore enjoy even better generalization guarantees. LSS may be a tool for understanding the generalization properties of algorithms of interest (as opposed to worst-case queries or analysts; see, e.g., [GK16], [ZH19]).

Acknowledgements This work was supported in part by Israel Science Foundation (ISF) grant 1044/16, the United States Air Force and DARPA under contract FA8750-16-C-0022, and the Federmann Cyber Security Center in conjunction with the Israel national cyber directorate. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force and DARPA.

References

[Ala17] Ibrahim Alabdulmohsin. An information-theoretic route from generalization in expectation to generalization in probability. In Artificial Intelligence and Statistics, pages 92–100, 2017.

[BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[BF16] Raef Bassily and Yoav Freund. Typical stability. arXiv preprint arXiv:1604.03336, 2016.

[BMN+17] Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information. arXiv preprint arXiv:1710.05233, 2017.

[BNS+16] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman.
Algorithmic stability for adaptive data analysis. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1046–1059. ACM, 2016.

[CLN+16] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In Conference on Learning Theory, pages 772–814, 2016.

[De12] Anindya De. Lower bounds in differential privacy. In Theory of Cryptography Conference, pages 321–338. Springer, 2012.

[DFH+15a] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015.

[DFH+15b] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126. ACM, 2015.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[DR+14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[EGI19] Amedeo Roberto Esposito, Michael Gastpar, and Ibrahim Issa. A new approach to adaptive data analysis and learning via maximal leakage. arXiv preprint arXiv:1903.01777, 2019.

[FS17] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. arXiv preprint arXiv:1712.07196, 2017.

[GK16] Arpita Ghosh and Robert Kleinberg. Inferential privacy guarantees for differentially private mechanisms. arXiv preprint arXiv:1603.01508, 2016.

[GL14] Andrew Gelman and Eric Loken.
The statistical crisis in science. American Scientist, 102(6):460, 2014.

[Ioa05] John P. A. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.

[IWK18] Ibrahim Issa, Aaron B. Wagner, and Sudeep Kamath. An operational approach to information leakage. arXiv preprint arXiv:1807.07878, 2018.

[KS14] Shiva P. Kasiviswanathan and Adam Smith. On the 'semantics' of differential privacy: A Bayesian formulation. Journal of Privacy and Confidentiality, 6(1), 2014.

[LW86] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. 1986.

[NSS+18] Kobbi Nissim, Adam Smith, Uri Stemmer, Thomas Steinke, and Jonathan Ullman. The limits of post-selection generalization. In Advances in Neural Information Processing Systems, pages 6402–6411, 2018.

[RRST16] Ryan Rogers, Aaron Roth, Adam Smith, and Om Thakkar. Max-information, differential privacy, and post-selection hypothesis testing. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 487–494. IEEE, 2016.

[RRT+16] Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In Information Theory Workshop (ITW), 2016 IEEE, pages 26–30. IEEE, 2016.

[RZ16] Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, pages 1232–1240, 2016.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[TV+15] Terence Tao, Van Vu, et al. Random matrices: universality of local spectral statistics of non-Hermitian matrices. The Annals of Probability, 43(2):782–874, 2015.

[XR17] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms.
In Advances in Neural Information Processing Systems, pages 2524–2533, 2017.

[ZH19] Tijana Zrnic and Moritz Hardt. Natural analysts in adaptive data analysis. arXiv preprint arXiv:1901.11143, 2019.