{"title": "More Supervision, Less Computation: Statistical-Computational Tradeoffs in Weakly Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4482, "page_last": 4490, "abstract": "We consider the weakly supervised binary classification problem where the labels are randomly flipped with probability $1-\alpha$. Although there exist numerous algorithms for this problem, it remains theoretically unexplored how the statistical accuracy and computational efficiency of these algorithms depend on the degree of supervision, which is quantified by $\alpha$. In this paper, we characterize the effect of $\alpha$ by establishing the information-theoretic and computational boundaries, namely, the minimax-optimal statistical accuracy that can be achieved by all algorithms, and by polynomial-time algorithms under an oracle computational model. For small $\alpha$, our result shows a gap between these two boundaries, which represents the computational price of achieving the information-theoretic boundary due to the lack of supervision. Interestingly, we also show that this gap narrows as $\alpha$ increases. In other words, having more supervision, i.e., more correct labels, not only improves the optimal statistical accuracy as expected, but also enhances the computational efficiency for achieving such accuracy.", "full_text": "More Supervision, Less Computation: Statistical-Computational Tradeoffs in Weakly Supervised Learning

Xinyang Yi†* Zhaoran Wang‡* Zhuoran Yang‡* Constantine Caramanis† Han Liu‡
†The University of Texas at Austin, ‡Princeton University
†{yixy,constantine}@utexas.edu, ‡{zhaoran,zy6,hanliu}@princeton.edu
(*: equal contribution)

Abstract

We consider the weakly supervised binary classification problem where the labels are randomly flipped with probability 1 − α.
Although there exist numerous algorithms for this problem, it remains theoretically unexplored how the statistical accuracy and computational efficiency of these algorithms depend on the degree of supervision, which is quantified by α. In this paper, we characterize the effect of α by establishing the information-theoretic and computational boundaries, namely, the minimax-optimal statistical accuracy that can be achieved by all algorithms, and by polynomial-time algorithms under an oracle computational model. For small α, our result shows a gap between these two boundaries, which represents the computational price of achieving the information-theoretic boundary due to the lack of supervision. Interestingly, we also show that this gap narrows as α increases. In other words, having more supervision, i.e., more correct labels, not only improves the optimal statistical accuracy as expected, but also enhances the computational efficiency for achieving such accuracy.

1 Introduction

Practical classification problems usually involve corrupted labels. Specifically, let {(x_i, z_i)}_{i=1}^n be n independent data points, where x_i ∈ R^d is the covariate vector and z_i ∈ {0, 1} is the uncorrupted label. Instead of observing {(x_i, z_i)}_{i=1}^n, we observe {(x_i, y_i)}_{i=1}^n, in which y_i is the corrupted label. In detail, with probability (1 − α), y_i is chosen uniformly at random over {0, 1}, and with probability α, y_i = z_i. Here α ∈ [0, 1] quantifies the degree of supervision: a larger α indicates more supervision, since we have more uncorrupted labels in this case.
In this paper, we are particularly interested in the effect of α on the statistical accuracy and computational efficiency of parameter estimation in this problem, particularly in the high-dimensional setting where the dimension d is much larger than the sample size n.

There exists a vast body of literature on binary classification with corrupted labels. In particular, the study of randomly perturbed labels dates back to [1] in the context of the random classification noise model; see, e.g., [12, 20] for a survey. Classification with missing labels is also extensively studied in the context of semi-supervised or weakly supervised learning by [14, 17, 21], among others. Despite the extensive study of this problem, its information-theoretic and computational boundaries remain theoretically unexplored. In a nutshell, the information-theoretic boundary refers to the optimal statistical accuracy achievable by any algorithm, while the computational boundary refers to the optimal statistical accuracy achievable by algorithms under a computational budget that is polynomial in the problem scale (d, n). Moreover, it remains unclear how these two boundaries vary with α. One interesting question to ask is how the degree of supervision affects the fundamental statistical and computational difficulties of this problem, especially in the high-dimensional regime.

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this paper, we sharply characterize both the information-theoretic and computational boundaries of the weakly supervised binary classification problem under the minimax framework. Specifically, we consider the Gaussian generative model where X|Z = z ~ N(μ_z, Σ) and z ∈ {0, 1} is the true label. Suppose {(x_i, z_i)}_{i=1}^n are n independent samples of (X, Z). We assume that {y_i}_{i=1}^n are generated from {z_i}_{i=1}^n in the aforementioned manner.
We focus on the high-dimensional regime, where d ≫ n and μ_1 − μ_0 is s-sparse, i.e., μ_1 − μ_0 has s nonzero entries. We are interested in estimating μ_1 − μ_0 from the observed samples {(x_i, y_i)}_{i=1}^n. By a standard reduction argument [24], the fundamental limits of this estimation task are captured by a hypothesis testing problem, namely,

H_0: μ_1 − μ_0 = 0 versus H_1: μ_1 − μ_0 is s-sparse and (μ_1 − μ_0)^⊤ Σ^{-1} (μ_1 − μ_0) := γ_n > 0,   (1.1)

where γ_n denotes the signal strength, which scales with n. Consequently, we focus on studying the fundamental limits of γ_n for solving this hypothesis testing problem.

[Figure 1 here: a phase diagram over α ∈ [0, 1] and γ_n, showing the "impossible" regime γ_n = o[√(s log d/n) ∧ (1/α² · s log d/n)], the "intractable" regime between the two boundaries, and the "efficient" regime γ_n = Ω[√(s²/n) ∧ (1/α² · s log d/n)].]

Figure 1: Computational-statistical phase transitions for weakly supervised binary classification. Here α denotes the degree of supervision, i.e., the label is corrupted to be uniformly random with probability 1 − α, and γ_n is the signal strength, which is defined in (1.1). Here a ∧ b denotes min{a, b}.

Our main results are illustrated in Figure 1.
Specifically, we identify the impossible, intractable, and efficient regimes for the statistical-computational phase transitions under certain regularity conditions.

(i) For γ_n = o[√(s log d/n) ∧ (1/α² · s log d/n)], any algorithm is asymptotically powerless in solving the hypothesis testing problem.

(ii) For γ_n = Ω[√(s log d/n) ∧ (1/α² · s log d/n)] and γ_n = o[√(s²/n) ∧ (1/α² · s log d/n)], any tractable algorithm that has a polynomial oracle complexity under an extension of the statistical query model [18] is asymptotically powerless. We will rigorously define the computational model in §2.

(iii) For γ_n = Ω[√(s²/n) ∧ (1/α² · s log d/n)], there is an efficient algorithm with a polynomial oracle complexity that is asymptotically powerful in solving the testing problem.

Here √(s log d/n) ∧ (1/α² · s log d/n) gives the information-theoretic boundary, while √(s²/n) ∧ (1/α² · s log d/n) gives the computational boundary. Moreover, by a reduction from the estimation problem to the testing problem, these boundaries for testing imply the corresponding ones for estimating μ_1 − μ_0 as well.

Consequently, there exists a significant gap between the computational and information-theoretic boundaries for small α. In other words, to achieve the information-theoretic boundary, one has to pay the price of intractable computation. As α tends to one, this gap between the computational and information-theoretic boundaries narrows and eventually vanishes. This indicates that having more supervision not only improves the statistical accuracy, as shown by the decay of the information-theoretic boundary in Figure 1, but, more importantly, enhances the computational efficiency by reducing the computational price of attaining information-theoretic optimality.
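To make the two boundaries concrete, here is a minimal numeric sketch (the parameter values s, d, n are our own illustrative choices, not from the paper) that evaluates both rates up to constants and checks that the gap is present for small α and vanishes for large α:

```python
import math

def boundaries(alpha, s, d, n):
    """Evaluate, up to constants, the information-theoretic and computational
    boundaries from Figure 1 at supervision level alpha."""
    shared = s * math.log(d) / (alpha ** 2 * n)          # 1/alpha^2 * s*log(d)/n
    info = min(math.sqrt(s * math.log(d) / n), shared)   # information-theoretic
    comp = min(math.sqrt(s ** 2 / n), shared)            # computational
    return info, comp

# illustrative scales with s = o(sqrt(d))
s, d, n = 1000, 10 ** 8, 10 ** 8
weak_info, weak_comp = boundaries(0.05, s, d, n)     # little supervision: gap
strong_info, strong_comp = boundaries(0.9, s, d, n)  # strong supervision: no gap
```

For small α the shared term 1/α² · s log d/n is large, so the two minima disagree and an intractable band opens up; once α exceeds (s log d/n)^{1/4}, both minima equal the shared term and the band closes.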
This phenomenon — "more supervision, less computation" — is observed for the first time in this paper.

1.1 More Related Work, Our Contribution, and Notation

Besides the aforementioned literature on weakly supervised learning and label corruption, our work is also connected to a recent line of work on statistical-computational tradeoffs [2-5, 8, 13, 15, 19, 26-28]. In comparison, we quantify the statistical-computational tradeoffs for weakly supervised learning for the first time. Furthermore, our results are built on an oracle computational model in [8] that slightly extends the statistical query model [18], and hence do not hinge on unproven conjectures on computational hardness such as the planted clique conjecture. Compared with our work, [8] focuses on the computational hardness of learning heterogeneous models, whereas we consider the interplay between supervision and statistical-computational tradeoffs. A similar computational model is used in [27] to study the structural normal mean model and principal component analysis, which exhibit different statistical-computational phase transitions. In addition, our work is related to sparse linear discriminant analysis and two-sample testing of sparse means, which correspond to our special cases of α = 1 and α = 0, respectively; see, e.g., [7, 23] for details. In contrast with their results, ours capture the effect of α on the statistical and computational tradeoffs.

In summary, the contribution of our work is two-fold:

(i) We characterize the computational and statistical boundaries of the weakly supervised binary classification problem for the first time. Compared with existing results for other models, our results do not rely on unproven conjectures.

(ii) Based on our theoretical characterization, we identify the "more supervision, less computation" phenomenon, which is observed for the first time.

Notation.
We denote the χ²-divergence between two distributions P, Q by D_{χ²}(P, Q). For two nonnegative sequences a_n, b_n indexed by n, we use a_n = o(b_n) as shorthand for lim_{n→∞} a_n/b_n = 0. We say a_n = Ω(b_n) if a_n/b_n ≥ c for some absolute constant c > 0 when n is sufficiently large. We use a ∨ b and a ∧ b to denote max{a, b} and min{a, b}, respectively. For any positive integer k, we denote {1, 2, . . . , k} by [k]. For v ∈ R^d, we denote by ‖v‖_p the ℓ_p-norm of v. In addition, we denote the operator norm of a matrix A by |||A|||_2.

2 Background

In this section, we formally define the statistical model for weakly supervised binary classification. We then introduce the statistical query model that connects computational complexity and statistical optimality.

2.1 Problem Setup

Consider the following Gaussian generative model for binary classification. For a random vector X ∈ R^d and a binary random variable Z ∈ {0, 1}, we assume

X|Z = 0 ~ N(μ_0, Σ), X|Z = 1 ~ N(μ_1, Σ),   (2.1)

where P(Z = 0) = P(Z = 1) = 1/2. Under this model, the optimal classifier given by the Bayes rule corresponds to Fisher's linear discriminant analysis (LDA) classifier. In this paper, we focus on the noisy label setting where the true label Z is replaced by a uniformly random label in {0, 1} with probability 1 − α. Hence, α characterizes the degree of supervision in the model. Specifically, if α = 1, we observe the true label Z, and the problem reduces to supervised learning. Whereas if α = 0, the observed label is completely random and contains no information about the model in (2.1).
This setting is thus equivalent to learning a Gaussian mixture model, which is an unsupervised problem. In the general setting with noisy labels, we denote the observed label by Y, which is linked to the true label Z via

P(Y = Z) = (1 + α)/2, P(Y = 1 − Z) = (1 − α)/2.   (2.2)

We consider the hypothesis testing problem of detecting whether μ_0 ≠ μ_1 given n i.i.d. samples {(y_i, x_i)}_{i=1}^n of (Y, X), namely

H_0: μ_0 = μ_1 versus H_1: μ_0 ≠ μ_1.   (2.3)

We focus on the high-dimensional and sparse regime, where d ≫ n and μ_0 − μ_1 is s-sparse, i.e., μ_0 − μ_1 ∈ B_0(s), where B_0(s) := {μ ∈ R^d : ‖μ‖_0 ≤ s}. Throughout this paper, we use the sample size n to drive the asymptotics. We introduce the shorthand θ := (μ_0, μ_1, Σ, α) for the parameters of the aforementioned model. Let P_θ be the joint distribution of (Y, X) under our statistical model with parameter θ, and P^n_θ be the corresponding product distribution of n i.i.d. samples. We denote the parameter spaces of the null and alternative hypotheses by G_0 and G_1, respectively. For any test function φ: {(y_i, x_i)}_{i=1}^n → {0, 1}, the classical testing risk is defined as the sum of the type-I and type-II errors, namely

R_n(φ; G_0, G_1) := sup_{θ∈G_0} P^n_θ(φ = 1) + sup_{θ∈G_1} P^n_θ(φ = 0).

The minimax risk is defined as the smallest testing risk over all possible test functions, that is,

R*_n(G_0, G_1) := inf_φ R_n(φ; G_0, G_1),   (2.4)

where the infimum is taken over all measurable test functions. Intuitively, the separation between the two Gaussian components under H_1 and the covariance matrix Σ together determine the hardness of detection.
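As a sanity check on the label-corruption mechanism, the following minimal simulation (pure Python; the identity covariance and all parameter values are illustrative assumptions, not from the paper) draws samples from (2.1)-(2.2) and verifies P(Y = Z) = (1 + α)/2 empirically:

```python
import random

def sample_weak(n, alpha, mu0, mu1, seed=0):
    """Draw n samples from the model (2.1)-(2.2) with identity covariance
    (an illustrative assumption); returns (observed y, covariate x, true z)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.randint(0, 1)                       # P(Z=0) = P(Z=1) = 1/2
        mu = mu1 if z == 1 else mu0
        x = [m + rng.gauss(0.0, 1.0) for m in mu]   # X | Z=z ~ N(mu_z, I)
        # with probability 1 - alpha the label is replaced by a fair coin flip
        y = z if rng.random() < alpha else rng.randint(0, 1)
        out.append((y, x, z))
    return out

# empirical check of (2.2): P(Y = Z) = (1 + alpha)/2
alpha = 0.6
data = sample_weak(20000, alpha, mu0=[0.0], mu1=[1.0])
agree = sum(y == z for y, x, z in data) / len(data)
```

With α = 0.6, the empirical agreement rate concentrates around (1 + 0.6)/2 = 0.8, matching (2.2).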
To characterize such dependence, we define the signal-to-noise ratio (SNR) as ρ(θ) := (μ_0 − μ_1)^⊤ Σ^{-1} (μ_0 − μ_1). For any nonnegative sequence {γ_n}_{n≥1}, let G_1(γ_n) := {θ : ρ(θ) ≥ γ_n} be a sequence of alternative parameter spaces with minimum separation γ_n. The following minimax rate characterizes the information-theoretic limits of the detection problem.

Definition 2.1 (Minimax rate). We say a sequence {γ*_n}_{n≥1} is a minimax rate if
• for any sequence {γ_n}_{n≥1} satisfying γ_n = o(γ*_n), we have lim_{n→∞} R*_n[G_0, G_1(γ_n)] = 1;
• for any sequence {γ_n}_{n≥1} satisfying γ_n = Ω(γ*_n), we have lim_{n→∞} R*_n[G_0, G_1(γ_n)] = 0.

The minimax rate in Definition 2.1 characterizes the statistical difficulty of the testing problem. However, it fails to shed light on the computational efficiency of possible testing algorithms, since it places no computational restriction on the test functions. The minimax risk in (2.4) might be attained only by test functions that have exponential computational complexity. This limitation of Definition 2.1 motivates us to study statistical limits under computational constraints.

2.2 Computational Model

Statistical query models [8-11, 18, 27] capture computational complexity by characterizing the total number of rounds in which an algorithm interacts with the data. In this paper, we consider the following statistical query model, which admits bounded query functions but allows the responses of the oracle to be unbounded.

Definition 2.2 (Statistical query model). In the statistical query model, an algorithm A is allowed to query an oracle T rounds, but not to access the data {(y_i, x_i)}_{i=1}^n directly.
At each round, A queries the oracle r with a query function q ∈ Q_A, where Q_A ⊆ {q : {0, 1} × R^d → [−M, M]} denotes the query space of A. The oracle r outputs a realization of a random variable Z_q ∈ R satisfying

P[⋂_{q∈Q_A} {|Z_q − E[q(Y, X)]| ≤ τ_q}] ≥ 1 − 2ξ, where
τ_q = [η(Q_A) + log(1/ξ)] · M/n ∨ √(2[η(Q_A) + log(1/ξ)] · (M² − {E[q(Y, X)]}²)/n).   (2.5)

Here τ_q > 0 is the tolerance parameter and ξ ∈ [0, 1) is the tail probability. The quantity η(Q_A) ≥ 0 in τ_q measures the capacity of Q_A in logarithmic scale, e.g., for countable Q_A, η(Q_A) = log(|Q_A|). The number T is defined as the oracle complexity. We denote by R[ξ, n, T, η(Q_A)] the set of oracles satisfying (2.5), and by A(T) the family of algorithms that query an oracle no more than T rounds.

This version of the statistical query model is used in [8], and reduces to the VSTAT model proposed in [9-11] by the transformation q̃(y, x) = q(y, x)/(2M) + 1/2 for any q ∈ Q_A. The computational model in Definition 2.2 enables us to handle query functions that are bounded by an unknown and fixed number M. Note that, by incorporating the tail probability ξ, the response Z_q is allowed to be unbounded. To understand the intuition behind Definition 2.2, we remark that (2.5) resembles Bernstein's inequality for bounded random variables [25]:

P[|n^{-1} ∑_{i=1}^n q(Y_i, X_i) − E[q(Y, X)]| ≥ t] ≤ 2 exp[−nt²/(2 Var[q(Y, X)] + Mt)].   (2.6)

We first replace Var[q(Y, X)] by its upper bound M² − {E[q(Y, X)]}², which is tight when q takes values in {−M, M}.
Then inequality (2.5) is obtained by replacing n^{-1} ∑_{i=1}^n q(Y_i, X_i) in (2.6) with Z_q and bounding the supremum over the query space Q_A. In the definition of τ_q in (2.5), we incorporate the effect of uniform concentration over the query space Q_A by adding the quantity η(Q_A), which measures the capacity of Q_A. In addition, under Definition 2.2, the algorithm A does not interact with the data directly. Such a restriction reflects the fact that in statistical problems, the effectiveness of an algorithm depends only on global statistical properties, not on individual data points. For instance, algorithms that rely only on the convergence of the empirical distribution to the population distribution are contained in the statistical query model, whereas algorithms that hinge on the first data point (y_1, x_1) are not allowed. This restriction captures a vast family of algorithms in statistics and machine learning, including gradient methods for likelihood maximization, matrix factorization algorithms, expectation-maximization algorithms, and sampling algorithms [9].

Based on the statistical query model, we study the minimax risk under oracle complexity constraints. For the testing problem (2.3), let A(T_n) be a class of testing algorithms under the statistical query model with oracle complexity no more than T_n, where {T_n}_{n≥1} is a sequence of positive integers depending on the sample size n. For any A ∈ A(T_n) and any oracle r ∈ R[ξ, n, T_n, η(Q_A)] that responds to A, let H(A, r) be the set of test functions that depend deterministically on A's queries to the oracle r and the corresponding responses.
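As a toy illustration of Definition 2.2, the sketch below implements the simplest oracle that is valid in this sense: it answers each query with the empirical mean, which satisfies a bound of the form (2.5) with high probability by Bernstein's inequality (2.6). The class name, toy data, and the query used are our own illustrative choices.

```python
import random

class EmpiricalMeanOracle:
    """A minimal valid statistical oracle in the sense of Definition 2.2:
    it answers a bounded query q with the empirical mean
    n^{-1} sum_i q(y_i, x_i), which satisfies a concentration bound of the
    form (2.5) with high probability. The querying algorithm never sees the
    raw data, only the responses Z_q."""

    def __init__(self, data):
        self._data = data  # hidden from the querying algorithm

    def query(self, q):
        return sum(q(y, x) for y, x in self._data) / len(self._data)

# toy data: Y is a fair coin, X in R^1 is standard normal (illustrative)
rng = random.Random(1)
data = [(rng.randint(0, 1), [rng.gauss(0.0, 1.0)]) for _ in range(50000)]
oracle = EmpiricalMeanOracle(data)

# a query bounded in [-1, 1]: q(y, x) = 2y - 1, so E[q(Y, X)] = 0 here
z_q = oracle.query(lambda y, x: 2 * y - 1)
```

Since E[q(Y, X)] = 0 for this query, the response z_q concentrates near zero at the 1/√n scale, as (2.5) requires.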
We use P_θ to denote the distribution of the random variables returned by the oracle r when the model parameter is θ. For a general hypothesis testing problem, namely H_0: θ ∈ G_0 versus H_1: θ ∈ G_1, the minimax testing risk with respect to an algorithm A and a statistical oracle r ∈ R[ξ, n, T_n, η(Q_A)] is defined as

R*_n(G_0, G_1; A, r) := inf_{φ∈H(A,r)} [sup_{θ∈G_0} P_θ(φ = 1) + sup_{θ∈G_1} P_θ(φ = 0)].   (2.7)

Compared with the classical minimax risk in (2.4), the new notion in (2.7) incorporates the computational budget via the oracle complexity. Specifically, we only consider test functions obtained by an algorithm that makes at most T_n queries to a statistical oracle. If T_n is polynomial in the dimension d, (2.7) characterizes the statistical optimality of computationally efficient algorithms. This motivates us to define the computationally tractable minimax rate, which contrasts with Definition 2.1.

Definition 2.3 (Computationally tractable minimax rate). Let G_1(γ_n) := {θ : ρ(θ) ≥ γ_n} be a sequence of model spaces with minimum separation γ_n, where ρ(θ) is the SNR.
A sequence {γ̄*_n}_{n≥1} is called a computationally tractable minimax rate if
• for any sequence {γ_n}_{n≥1} satisfying γ_n = o(γ̄*_n), any constant η > 0, and any A ∈ A(d^η), there exists an oracle r ∈ R[ξ, n, T_n, η(Q_A)] such that lim_{n→∞} R*_n[G_0, G_1(γ_n); A, r] = 1;
• for any sequence {γ_n}_{n≥1} satisfying γ_n = Ω(γ̄*_n), there exist a constant η > 0 and an algorithm A ∈ A(d^η) such that, for any oracle r ∈ R[ξ, n, T_n, η(Q_A)], we have lim_{n→∞} R*_n[G_0, G_1(γ_n); A, r] = 0.

3 Main Results

Throughout this paper, we assume that the covariance matrix Σ in (2.1) is known. Specifically, for some positive definite Σ ∈ R^{d×d}, the parameter spaces of the null and alternative hypotheses are defined as

G_0(Σ) := {θ = (μ, μ, Σ, α) : μ ∈ R^d},   (3.1)
G_1(Σ; γ_n) := {θ = (μ_0, μ_1, Σ, α) : μ_0, μ_1 ∈ R^d, μ_0 − μ_1 ∈ B_0(s), ρ(θ) ≥ γ_n}.   (3.2)

Accordingly, the testing problem of detecting whether μ_0 ≠ μ_1 is to distinguish

H_0: θ ∈ G_0(Σ) versus H_1: θ ∈ G_1(Σ; γ_n).   (3.3)

In §3.1, we present the minimax rate of the detection problem from an information-theoretic perspective. In §3.2, under the statistical query model introduced in §2.2, we provide a computational lower bound and a nearly matching upper bound that is achieved by an efficient testing algorithm.

3.1 Information-theoretic Limits

We now characterize the minimax rate given in Definition 2.1.
For the parameter spaces (3.1) and (3.2) with known Σ, we show that in the highly sparse setting where s = o(√d), we have

γ*_n = √(s log d/n) ∧ (1/α² · s log d/n).   (3.4)

To prove (3.4), we first present a lower bound which shows that the hypothesis testing problem in (3.3) is impossible if γ_n = o(γ*_n).

Theorem 3.1. For the hypothesis testing problem in (3.3) with known Σ, assume that there exists a small constant δ > 0 such that s = o(d^{1/2−δ}). Let γ*_n be defined in (3.4). For any sequence {γ_n}_{n≥1} such that γ_n = o(γ*_n), any hypothesis test is asymptotically powerless, namely,

lim_{n→∞} sup_Σ R*_n[G_0(Σ), G_1(Σ; γ_n)] = 1.

By Theorem 3.1, we observe a phase transition in the SNR necessary for powerful detection as α decreases from one to zero. Starting from the rate s log d/n in the supervised setting where α = 1, the required SNR gradually increases as label quality decreases. Finally, when α reaches zero, which corresponds to the unsupervised setting, powerful detection requires the SNR to be Ω(√(s log d/n)). It is worth noting that when α = (s log d/n)^{1/4}, we still have about (n³ s log d)^{1/4} uncorrupted labels. However, our lower bound (along with the upper bound in Theorem 3.2) indicates that the information contained in these uncorrupted labels is buried in the noise, and cannot essentially improve the detection quality compared with the unsupervised setting.

Next we establish a matching upper bound for the detection problem in (3.3). We denote the condition number of the covariance matrix Σ by κ, i.e., κ := λ_max(Σ)/λ_min(Σ), where λ_max(Σ) and λ_min(Σ) are the largest and smallest eigenvalues of Σ, respectively. Note that marginally Y is uniformly distributed over {0, 1}.
For ease of presentation, we assume that the sample size is 2n and that each class contains exactly n data points. (We can always discard samples from the larger class to equalize the two class sizes; by the law of large numbers, this does not affect the order of the sample complexity.)

Given 2n i.i.d. samples {(y_i, x_i)}_{i=1}^{2n} of (Y, X) ∈ {0, 1} × R^d, we define

w_i = Σ^{-1/2}(x_{2i} − x_{2i−1}), for all i ∈ [n].   (3.5)

In addition, we split the dataset {(y_i, x_i)}_{i=1}^{2n} into two disjoint parts {(0, x_i^{(0)})}_{i=1}^n and {(1, x_i^{(1)})}_{i=1}^n, and define

u_i = x_i^{(1)} − x_i^{(0)}, for all i ∈ [n].   (3.6)

We note that computing sample differences in (3.5) and (3.6) is critical for our problem: we focus on detecting the difference between μ_0 and μ_1, and taking differences avoids estimating E_{P_θ}(X), which might be dense. For any integer s ∈ [d], we define B_2(s) := B_0(s) ∩ S^{d−1} as the set of s-sparse vectors on the unit sphere in R^d. With {w_i}_{i=1}^n and {u_i}_{i=1}^n, we introduce two test functions

φ_1 := 1{ sup_{v∈B_2(s)} (1/n) ∑_{i=1}^n (v^⊤ Σ^{-1} w_i)² / (2 v^⊤ Σ^{-1} v) ≥ 1 + τ_1 },   (3.7)
φ_2 := 1{ sup_{v∈B_2(1)} (1/n) ∑_{i=1}^n ⟨v, diag(Σ)^{-1/2} u_i⟩ ≥ τ_2 },   (3.8)

where τ_1, τ_2 > 0 are algorithmic parameters that will be specified later. To provide some intuition, consider the case Σ = I. The test function φ_1 seeks a sparse direction that explains the most variance of the w_i; such a test is closely related to the sparse principal component detection problem [3]. The test function φ_2 simply selects the coordinate of n^{-1} ∑_{i=1}^n u_i with the largest magnitude and compares it with τ_2.
This test is closely related to detecting a sparse normal mean in high dimensions [16]. Based on these two ingredients, we construct our final test function φ as φ = φ_1 ∨ φ_2, i.e., if either φ_1 or φ_2 rejects, then φ rejects the null. The following theorem establishes a sufficient condition for the test function φ to be asymptotically powerful.

Theorem 3.2. Consider the testing problem (3.3), where Σ is known and has condition number κ. Consider the test functions φ_1 and φ_2 defined in (3.7) and (3.8) with parameters τ_1 and τ_2 given by

τ_1 = κ √(s log(ed/s)/n), τ_2 = √(8 log d/n).

We define the ultimate test function as φ = φ_1 ∨ φ_2. We assume that s ≤ C_s · d for some absolute constant C_s and that n ≥ 64 · s log(ed/s). Then, if

γ_n ≥ C′ · κ · [√(s log(ed/s)/n) ∧ (1/α² · s log d/n)],   (3.9)

where C′ is an absolute constant, the test function φ is asymptotically powerful. Specifically, we have

sup_{θ∈G_0(Σ)} P^n_θ(φ = 1) + sup_{θ∈G_1(Σ;γ_n)} P^n_θ(φ = 0) ≤ 20/d.   (3.10)

Theorem 3.2 provides a non-asymptotic guarantee. As n goes to infinity, (3.10) implies that the test function φ is asymptotically powerful. When s = o(√d) and κ is a constant, (3.9) yields γ_n = Ω[√(s log d/n) ∧ (1/α² · s log d/n)], which matches the lower bound given in Theorem 3.1. Thus we conclude that γ*_n defined in (3.4) is the minimax rate for the testing problem in (3.3). We remark that when s = Ω(d) and α = 1, i.e., in the standard (low-dimensional) setting of two-sample testing, the bound in (3.9) is suboptimal, as [22] shows that the SNR rate √d/n suffices for asymptotically powerful detection when n = Ω(√d).
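As an illustration, the polynomial-time component φ_2 can be sketched as follows for Σ = I (an illustrative assumption; the data-generating helper and all parameter values are ours, not the paper's). The exponential-time test φ_1 is omitted, since it requires a search over all sparse supports.

```python
import math
import random

def phi2(samples):
    """Sketch of the label-aware test phi_2 in (3.8), assuming Sigma = I:
    pair one sample from each observed class, average the differences u_i,
    and reject if some coordinate of the average reaches
    tau_2 = sqrt(8 log d / n), with n the number of pairs."""
    class0 = [x for y, x in samples if y == 0]
    class1 = [x for y, x in samples if y == 1]
    n = min(len(class0), len(class1))  # discard extras from the larger class
    d = len(class0[0])
    tau2 = math.sqrt(8 * math.log(d) / n)
    u_bar = [sum(class1[i][j] - class0[i][j] for i in range(n)) / n
             for j in range(d)]
    return 1 if max(u_bar) >= tau2 else 0

rng = random.Random(0)

def draw(n, d, alpha, delta):
    """n samples with mu_0 = 0, mu_1 = delta * e_1, Sigma = I (illustrative)."""
    out = []
    for _ in range(n):
        z = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        if z == 1:
            x[0] += delta
        y = z if rng.random() < alpha else rng.randint(0, 1)
        out.append((y, x))
    return out

reject_null = phi2(draw(4000, 50, alpha=0.8, delta=0.0))
reject_alt = phi2(draw(4000, 50, alpha=0.8, delta=1.0))
```

Under the alternative, the first coordinate of the averaged differences concentrates around α·δ, which dominates the threshold; under the null, all coordinates stay below it with high probability.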
It is thus worth noting that we focus on the highly sparse setting s = o(√d), and provide a sharp minimax rate for this regime. In the definition of φ_1 in (3.7), we search over the set B_2(s). Since B_2(s) contains (d choose s) distinct sets of supports, computing φ_1 requires exponential running time.

3.2 Computational Limits

In this section, we characterize the computationally tractable minimax rate γ̄*_n given in Definition 2.3. We again focus on the setting where Σ is known a priori and the parameter spaces for the null and alternative hypotheses are defined in (3.1) and (3.2), respectively. The main result is that, in the highly sparse setting where s = o(√d), we have

γ̄*_n = √(s²/n) ∧ (1/α² · s log d/n).   (3.11)

We first present the lower bound in the next result.

Theorem 3.3. For the testing problem in (3.3) with Σ known a priori, we make the same assumptions as in Theorem 3.1. For any sequence {γ_n}_{n≥1} such that

γ_n = o(γ*_n ∨ [√(s²/n) ∧ (1/α² · s/n)]),   (3.12)

where γ*_n is defined in (3.4), any computationally tractable test is asymptotically powerless under the statistical query model. That is, for any constant η > 0 and any A ∈ A(d^η), there exists an oracle r ∈ R[ξ, n, T_n, η(Q_A)] such that lim_{n→∞} R*_n[G_0(Σ), G_1(Σ; γ_n); A, r] = 1.

We remark that the lower bound in (3.12) differs from γ̄*_n in (3.11) by a logarithmic term when √(1/n) ≤ α² ≤ √(s log d/n).
We expect this gap to be eliminated by a more delicate analysis under the statistical query model.

Now, putting Theorems 3.1 and 3.3 together, we describe the "more supervision, less computation" phenomenon as follows.

(i) When 0 ≤ α ≤ (log² d/n)^{1/4}, the computational lower bound implies that the uncorrupted labels are unable to improve the quality of computationally tractable detection compared with the unsupervised setting. In addition, in this regime, the gap between γ*_n and γ̄*_n remains the same.

(ii) When (log² d/n)^{1/4} < α ≤ (s log d/n)^{1/4}, the information-theoretic lower bound shows that the uncorrupted labels cannot improve the quality of detection compared with the unsupervised setting. However, more uncorrupted labels improve the statistical performance of computationally tractable hypothesis tests by shrinking the gap between γ*_n and γ̄*_n.

(iii) When (s log d/n)^{1/4} < α ≤ 1, having more uncorrupted labels improves both the statistical optimality and the computational efficiency. Specifically, in this case, the gap between γ*_n and γ̄*_n vanishes and we have γ̄*_n = γ*_n = 1/α² · s log d/n.

Now we derive a nearly matching upper bound under the statistical query model, which, together with Theorem 3.3, establishes the computationally tractable minimax rate. We construct a computationally efficient testing procedure that combines two test functions, which yield the two parts of γ̄*_n respectively. Similar to φ_1 defined in (3.7), the first test function discards the information in the labels, and thus also works in the purely unsupervised setting where α = 0. For j ∈ [d], we denote by σ_j the j-th diagonal element of Σ.
Under the statistical query model, we consider the 2d query functions\n\nn = 1/\u03b12 \u00b7 s log d/n.\n\nn and \u03b3\u2217\nn.\n\nn = \u03b3\u2217\n\nqj(y, x) := xj/\u221a\u03c3j \u00b7 1{|xj/\u221a\u03c3j| \u2264 R \u00b7\ufffdlog d},\n\ufffdqj(y, x) := (x2\n\nj /\u03c3j \u2212 1) \u00b7 1{|xj/\u221a\u03c3j| \u2264 R \u00b7\ufffdlog d}, for all j \u2208 [d],\n\n(3.13)\n\n(3.14)\n\n7\n\n\fwhere R > 0 is an absolute constant. Here we apply truncation to the query functions to obtain\nbounded queries, which is speci\ufb01ed by the statistical query model in De\ufb01nition 2.2. We denote by zqj\nand z\ufffdqj the realizations of the random variables output by the statistical oracle for query functions qj\n\nand\ufffdqj , respectively. As for the second test function, similar to (3.8), we consider\nfor all v \u2208 B2(1). We denote by Zqv\nfunction qv. With these 4d query functions, we introduce test functions\n\nqv(y, x) = (2y \u2212 1) \u00b7 v\ufffddiag(\u03a3)\u22121/2x \u00b7 1\ufffd|v\ufffddiag(\u03a3)\u22121/2x| \u2264 R \u00b7\ufffdlog d\ufffd\nzqv \u2265 2\u03c4 2\ufffd,\n\nqj ) \u2265 C\u03c4 1\ufffd, \u03c62 := 1\ufffd sup\n\n\u03c61 := 1\ufffd sup\n\n(z\ufffdqj \u2212 z2\n\nv\u2208B2(1)\n\nj\u2208[d]\n\nthe output of the statistical oracle corresponding to query\n\n(3.15)\n\n(3.16)\n\nwhere \u03c4 1 and \u03c42 are positive parameters that will be speci\ufb01ed later and C is an absolute constant.\n\nTheorem 3.4. For the test functions \u03c61 and \u03c62 de\ufb01ned in (3.16) , we de\ufb01ne the ultimate test function\nas \u03c6 = \u03c61 \u2228 \u03c62. We set\n\n\u03c4 1 = R2 log d \u00b7\ufffdlog(4d/\u03be)/n, \u03c4 2 = R\ufffdlog d \u00b7\ufffdlog(4d/\u03be)/n,\n\n(3.17)\nwhere \u03be = o(1). For the hypothesis testing problem in (3.3), we further assume that \ufffd\u00b50\ufffd\u221e \u2228\n\ufffd\u00b51\ufffd\u221e \u2264 C0 for some constant C0 > 0. 
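To make the construction concrete, here is a minimal plug-in sketch of the combined test φ = φ₁ ∨ φ₂. It replaces the statistical-oracle responses z_{q_j}, z_{q̃_j}, z_{q_v} with empirical averages over a sample, and it approximates the supremum in φ₂ by the Euclidean norm of the truncated label-weighted mean (this equality is exact only without truncation, and the sketch truncates coordinatewise rather than along the direction v). The function name and the values of the absolute constants R and C are illustrative assumptions, not the paper's.

```python
import numpy as np

def weak_supervision_tests(X, y, sigma_diag, tau1, tau2, R=2.0, C=2.0):
    """Plug-in sketch of the tests phi_1, phi_2 in (3.16).

    The statistical-query oracle is replaced by empirical averages over the
    sample (an approximation; the paper's guarantees are stated for oracle
    responses).  R and C play the role of the absolute constants in
    (3.13)-(3.16); their values here are illustrative only.
    """
    n, d = X.shape
    Xs = X / np.sqrt(sigma_diag)                   # coordinatewise standardization
    keep = np.abs(Xs) <= R * np.sqrt(np.log(d))    # truncation events (coordinatewise)

    # Empirical versions of z_{q_j} and z_{q~_j} from (3.13)-(3.14).
    z_q = np.mean(Xs * keep, axis=0)
    z_qt = np.mean((Xs ** 2 - 1) * keep, axis=0)

    # phi_1: diagonal thresholding on the truncated second moments.
    phi1 = int(np.max(z_qt - z_q ** 2) >= C * tau1)

    # phi_2: sup over v in B_2(1) of z_{q_v}.  Ignoring truncation, the sup
    # equals the Euclidean norm of the label-weighted mean, used here as a
    # plug-in approximation.
    m = np.mean((2 * y - 1)[:, None] * Xs * keep, axis=0)
    phi2 = int(np.linalg.norm(m) >= 2 * tau2)

    return phi1 | phi2  # combined test phi = phi_1 OR phi_2
```

Note how φ₂ is the only part that uses the labels y: under pure noise (α = 0), the factor 2y − 1 has mean zero and the statistic concentrates around zero, so detection must come from φ₁ alone.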
Under the assumption that

    sup_{j∈[d]} (μ₀,ⱼ − μ₁,ⱼ)²/σ_j = Ω[ (1/α² · log² d · log(d/ξ)/n) ∧ (log d · √(log(d/ξ)/n)) ],    (3.18)

the risk of φ satisfies R*_n(φ) = sup_{θ∈G₀(Σ)} P_θ(φ = 1) + sup_{θ∈G₁(Σ,γ_n)} P_θ(φ = 0) ≤ 5ξ. Here we denote by μ₀,ⱼ and μ₁,ⱼ the j-th entries of μ₀ and μ₁, respectively.

If we set the tail probability of the statistical query model to be ξ = 1/d, (3.18) shows that φ is asymptotically powerful if sup_{j∈[d]} (μ₀,ⱼ − μ₁,ⱼ)²/σ_j = Ω[(1/α² · log³ d/n) ∧ (log³ d/n)^{1/2}]. When the energy of μ₀ − μ₁ is spread over its support, ‖μ₀ − μ₁‖∞ and ‖μ₀ − μ₁‖₂/√s are close. Under the assumption that the condition number κ of Σ is a constant, (3.18) is implied by

    γ_n ≳ (s² log³ d/n)^{1/2} ∧ (1/α² · s log³ d/n).

Compared with Theorem 3.3, the above upper bound matches the computational lower bound up to a logarithmic factor, and γ̄*_n is between √(s²/n) ∧ (1/α² · s log d/n) and (s² log³ d/n)^{1/2} ∧ (1/α² · s log³ d/n). Note that the truncation of the query functions in (3.13) and (3.14) yields an additional logarithmic term, which could be reduced to (s² log d/n)^{1/2} ∧ (1/α² · s log d/n) using more delicate analysis. Moreover, the test function φ₁ is essentially based on a diagonal thresholding algorithm performed on the covariance matrix of X. The work in [6] provides a more delicate analysis of this algorithm, which establishes the √(s²/n) rate.
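As a quick arithmetic companion to (3.17), the helper below computes the thresholds τ₁ and τ₂ for given dimension d, sample size n, and tail probability ξ. The value of the absolute constant R is an illustrative assumption, as is the function name.

```python
import math

def calibrate_thresholds(d, n, xi, R=2.0):
    """Thresholds tau_1, tau_2 from (3.17), for tail probability xi = o(1).

    R stands in for the absolute truncation constant in the query functions
    (3.13)-(3.15); its default value here is only illustrative.
    """
    base = math.sqrt(math.log(4 * d / xi) / n)   # common factor sqrt(log(4d/xi)/n)
    tau1 = R ** 2 * math.log(d) * base           # threshold for phi_1 (variance queries)
    tau2 = R * math.sqrt(math.log(d)) * base     # threshold for phi_2 (supervised queries)
    return tau1, tau2
```

With ξ = 1/d, both thresholds are of order polylog(d)/√n and shrink at the rate 1/√n, which is where the extra logarithmic factors in the upper bound above come from.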
Their algorithm can also be formulated under the statistical query model; we use the simpler version in (3.16) for ease of presentation. Therefore, with more sophisticated proof techniques, it can be shown that √(s²/n) ∧ (1/α² · s log d/n) is the critical threshold for asymptotically powerful detection with computational efficiency.

3.3 Implication for Estimation

Our aforementioned phase transition in the detection problems directly implies statistical-computational trade-offs in the problem of estimation. We consider the problem of estimating the parameter Δμ = μ₀ − μ₁ of the binary classification model in (2.1) and (2.2), where Δμ is s-sparse and Σ is known a priori. We assume that the signal-to-noise ratio satisfies ρ(θ) = ΔμᵀΣ⁻¹Δμ ≥ γ_n = o(γ̄*_n). For any constant η > 0 and any A ∈ A(T) with T = O(d^η), suppose we obtain an estimator Δμ̂ of Δμ by algorithm A under the statistical query model. If Δμ̂ converges to Δμ in the sense that

    (Δμ̂ − Δμ)ᵀΣ⁻¹(Δμ̂ − Δμ) = o[γ_n²/ρ(θ)],

then we have |Δμ̂ᵀΣ⁻¹Δμ̂ − ΔμᵀΣ⁻¹Δμ| = o(γ_n).
Thus the test function φ = 1{Δμ̂ᵀΣ⁻¹Δμ̂ ≥ γ_n/2} is asymptotically powerful, which contradicts the computational lower bound in Theorem 3.3. Therefore, there exists a constant C such that (Δμ̂ − Δμ)ᵀΣ⁻¹(Δμ̂ − Δμ) ≥ C·γ_n²/ρ(θ) for any estimator Δμ̂ constructed from a polynomial number of queries.

Acknowledgments

We would like to thank Vitaly Feldman for valuable discussions.

References

[1] Angluin, D. and Laird, P. (1988). Learning from noisy examples. Machine Learning, 2 343–370.

[2] Berthet, Q. and Rigollet, P. (2013). Computational lower bounds for sparse PCA. In Conference on Learning Theory.

[3] Berthet, Q. and Rigollet, P. (2013). Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41 1780–1815.

[4] Chandrasekaran, V. and Jordan, M. I. (2013). Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110 1181–1190.

[5] Chen, Y. and Xu, J. (2014). Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267.

[6] Deshpande, Y. and Montanari, A. (2014). Sparse PCA via covariance thresholding. In Advances in Neural Information Processing Systems.

[7] Fan, J., Feng, Y. and Tong, X. (2012). A road to classification in high dimensional space: The regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B, 74 745–771.

[8] Fan, J., Liu, H., Wang, Z. and Yang, Z. (2016).
Curse of heterogeneity: Computational barriers in sparse mixture models and phase retrieval. Manuscript.

[9] Feldman, V., Grigorescu, E., Reyzin, L., Vempala, S. and Xiao, Y. (2013). Statistical algorithms and a lower bound for detecting planted cliques. In ACM Symposium on Theory of Computing.

[10] Feldman, V., Guzman, C. and Vempala, S. (2015). Statistical query algorithms for stochastic convex optimization. arXiv preprint arXiv:1512.09170.

[11] Feldman, V., Perkins, W. and Vempala, S. (2015). On the complexity of random satisfiability problems with planted solutions. In ACM Symposium on Theory of Computing.

[12] Frénay, B. and Verleysen, M. (2014). Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25 845–869.

[13] Gao, C., Ma, Z. and Zhou, H. H. (2014). Sparse CCA: Adaptive estimation and computational barriers. arXiv preprint arXiv:1409.8565.

[14] García-García, D. and Williamson, R. C. (2011). Degrees of supervision. In Advances in Neural Information Processing Systems.

[15] Hajek, B., Wu, Y. and Xu, J. (2014). Computational lower bounds for community detection on random graphs. arXiv preprint arXiv:1406.6625.

[16] Johnstone, I. M. (1994). On minimax estimation of a sparse normal mean vector. The Annals of Statistics, 22 271–289.

[17] Joulin, A. and Bach, F. R. (2012). A convex relaxation for weakly supervised classifiers. In International Conference on Machine Learning.

[18] Kearns, M. (1993). Efficient noise-tolerant learning from statistical queries. In ACM Symposium on Theory of Computing.

[19] Ma, Z. and Wu, Y. (2014). Computational barriers in minimax submatrix detection. The Annals of Statistics, 43 1089–1116.

[20] Nettleton, D. F., Orriols-Puig, A. and Fornells, A. (2010).
A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 33 275–306.

[21] Patrini, G., Nielsen, F., Nock, R. and Carioni, M. (2016). Loss factorization, weakly supervised learning and label noise robustness. arXiv preprint arXiv:1602.02450.

[22] Ramdas, A., Singh, A. and Wasserman, L. (2016). Classification accuracy as a proxy for two sample testing. arXiv preprint arXiv:1602.02210.

[23] Tony Cai, T., Liu, W. and Xia, Y. (2014). Two-sample test of high dimensional means under dependence. Journal of the Royal Statistical Society: Series B, 76 349–372.

[24] Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer.

[25] Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

[26] Wang, T., Berthet, Q. and Samworth, R. J. (2014). Statistical and computational trade-offs in estimation of sparse principal components. arXiv preprint arXiv:1408.5369.

[27] Wang, Z., Gu, Q. and Liu, H. (2015). Sharp computational-statistical phase transitions via oracle computational model. arXiv preprint arXiv:1512.08861.

[28] Zhang, Y., Wainwright, M. J. and Jordan, M. I. (2014). Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory.