{"title": "Learning Sample-Specific Models with Low-Rank Personalized Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3575, "page_last": 3585, "abstract": "Modern applications of machine learning (ML) deal with increasingly heterogeneous datasets comprised of data collected from overlapping latent subpopulations. As a result, traditional models trained over large datasets may fail to recognize highly predictive localized effects in favour of weakly predictive global patterns. This is a problem because localized effects are critical to developing individualized policies and treatment plans in applications ranging from precision medicine to advertising. To address this challenge, we propose to estimate sample-specific models that tailor inference and prediction at the individual level. In contrast to classical ML models that estimate a single, complex model (or only a few complex models), our approach produces a model personalized to each sample. These sample-specific models can be studied to understand subgroup dynamics that go beyond coarse-grained class labels. Crucially, our approach does not assume that relationships between samples (e.g. a similarity network) are known a priori. Instead, we use unmodeled covariates to learn a latent distance metric over the samples. We apply this approach to financial, biomedical, and electoral data as well as simulated data and show that sample-specific models provide fine-grained interpretations of complicated phenomena without sacrificing predictive accuracy compared to state-of-the-art models such as deep neural networks.", "full_text": "Learning Sample-Speci\ufb01c Models with Low-Rank\n\nPersonalized Regression\n\nBenjamin Lengerich\n\nCarnegie Mellon University\nblengeri@cs.cmu.edu\n\nBryon Aragam\n\nUniversity of Chicago\n\nbryon@chicagobooth.edu\n\nEric P. 
Xing\n\nCarnegie Mellon University\n\nepxing@cs.cmu.edu\n\nAbstract\n\nModern applications of machine learning (ML) deal with increasingly heteroge-\nneous datasets comprised of data collected from overlapping latent subpopulations.\nAs a result, traditional models trained over large datasets may fail to recognize\nhighly predictive localized effects in favour of weakly predictive global patterns.\nThis is a problem because localized effects are critical to developing individualized\npolicies and treatment plans in applications ranging from precision medicine to\nadvertising. To address this challenge, we propose to estimate sample-speci\ufb01c\nmodels that tailor inference and prediction at the individual level. In contrast to\nclassical ML models that estimate a single, complex model (or only a few complex\nmodels), our approach produces a model personalized to each sample. These\nsample-speci\ufb01c models can be studied to understand subgroup dynamics that go\nbeyond coarse-grained class labels. Crucially, our approach does not assume that\nrelationships between samples (e.g. a similarity network) are known a priori.\nInstead, we use unmodeled covariates to learn a latent distance metric over the\nsamples. We apply this approach to \ufb01nancial, biomedical, and electoral data as\nwell as simulated data and show that sample-speci\ufb01c models provide \ufb01ne-grained\ninterpretations of complicated phenomena without sacri\ufb01cing predictive accuracy\ncompared to state-of-the-art models such as deep neural networks.\n\n1\n\nIntroduction\n\nThe scale of modern datasets allows an unprecedented opportunity to infer individual-level effects\nby borrowing power across large cohorts; however, principled statistical methods for accomplishing\nthis goal are lacking. Standard approaches for adapting to heterogeneity in complex data include\nrandom effects models, mixture models, varying coef\ufb01cients, and hierarchical models. 
Recent work\nincludes the network lasso [11], the pliable lasso [32], personalized multi-task learning [37], and the\nlocalized lasso [38]. Despite this long history, these methods either fail to estimate individual-level\n(i.e. sample-speci\ufb01c) effects, or require prior knowledge regarding the relation between samples\n(e.g. a network). At the same time, as datasets continue to increase in size and complexity, the\npossibility of inferring sample-speci\ufb01c phenomena by exploiting patterns in these large datasets has\ndriven interest in important scienti\ufb01c problems such as precision medicine [5, 24]. The relevance\nand potential impact of sample-speci\ufb01c inference has also been widely acknowledged in applications\nincluding psychology [9], education [12], and \ufb01nance [1].\nIn this paper, we explore a solution to this dilemma through the framework of \u201cpersonalized\u201d models.\nPersonalized modeling seeks to estimate a large collection of simple models in which each model is\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) Mixture model\n\n(b) Varying-Coef\ufb01cient\n\n(c) Deep Neural Net\n\n(d) Personalized\n\nFigure 1: Illustration of the bene\ufb01ts of personalized models. Each point represents the regression\nparameters for a sample. Black points indicate true effect sizes, while the red points are estimates.\nMixture models (a) estimate a limited number of models. The varying-coef\ufb01cients model (b)\nestimates sample-speci\ufb01c models but the non-linear structure of the true parameters violates the\nmodel assumptions, leading to a poor \ufb01t. The locally-linear models induced by a deep learning model\n(c) do not accurately recover the underlying effect sizes. In contrast, personalized regression (d)\naccurately recovers effect sizes.\n\ntailored\u2014or \u201cpersonalized\u201d\u2014to a single sample. 
This is in contrast to models that seek to estimate a single, complex model. To make this more precise, suppose we have $n$ samples $(X^{(i)}, Y^{(i)})$, where $Y^{(i)}$ denotes the response and $X^{(i)} \in \mathbb{R}^p$ are predictors. A traditional ML model would model the relationship between $Y^{(i)}$ and $X^{(i)}$ with a single function $f(X^{(i)}; \theta)$ parametrized by a complex parameter $\theta$ (e.g. a deep neural network). In a personalized model, we model each sample with its own function, allowing $\theta$ to be simple while varying with each sample. Thus, the model becomes $Y^{(i)} = f(X^{(i)}; \theta^{(i)})$. These models are estimated jointly with a single objective function, enabling statistical power to be shared between sub-populations.
The flexibility of using different parameter values for different samples enables us to use a simple model class (e.g. logistic regression) to produce models which are simultaneously interpretable and predictive for each individual sample. By treating each sample separately, it is also possible to capture heterogeneous effects within similar subgroups (an example of this will be discussed in Section 3.3). Finally, the parameters learned through our framework accurately capture underlying effect sizes, giving users confidence that sample-specific interpretations correspond to real phenomena (Fig 1).
Whereas previous work on personalized models either seeks only the population's distribution of parameters [34] or requires prior knowledge of the sample relationships [11, 37, 38], we develop a novel framework which estimates sample-specific parameters by adaptively learning relationships from the data. A Python implementation is available at http://www.github.com/blengerich/personalized_regression.

Motivating Example. 
Consider the problem of understanding election outcomes at the local level.\nFor example, given data on a particular candidate\u2019s views and policy proposals, we wish to predict the\nprobability that a particular locality (e.g. county, township, district, etc.) will vote for this candidate.\nIn this example we focus on counties for concreteness. More importantly, in addition to making\naccurate predictions, we are interested in understanding and explaining how different counties react to\ndifferent platforms. The latter information\u2014in addition to simple predictive measures\u2014is especially\nimportant to candidates and political consultants seeking advantages in major elections such as a\npresidential election. This information is also important to social and political scientists seeking\nto understand the characteristics of an electorate and how it is evolving. An application of this\nmotivating example using personalized regression can be found in in Section 3.4.\nOne approach would be to build individual models for each county, using historical data from previous\nelections. Immediately we encounter several practical challenges: 1) By building independent models\nfor each county, we fail to share information between related counties, resulting in a loss of statistical\npower, 2) Since elections are relatively infrequent, the amount of data on each county is limited,\nresulting in a further loss of power, and 3) To ensure that the models are able to explain the preferences\nof an electorate, we will be forced to use simple models (e.g. logistic regression or decision trees),\nwhich will likely have limited predictive power compared to more complex models. This simultaneous\nloss of power and predictive accuracy is characteristic of modeling large, heterogeneous datasets\narising from aggregating multiple subpopulations. Crucially, in this example the total number of\nsamples may be quite large (e.g. 
there are more than 3,000 US counties and there have been 58 US\npresidential elections), but the number of samples per subpopulaton is small. Furthermore, these\n\n2\n\n\fchallenges are in no way unique to this example: similar problems arise for examples in \ufb01nancial,\nbiological, and marketing applications.\nOne way to alleviate these challenges is to model the ith county using a regression model f (X; \u2713(i)),\nwhere the \u2713(i) are parameters that vary with each sample and are trained jointly using all of the\ndata. This idea of personalized modeling allows us to train accurate models using only a single\nsample from each county\u2014this is useful in settings where collecting more data may be expensive\n(e.g. biology and medicine) or impossible (e.g. elections and marketing). By allowing the parameter\n\u2713(i) to be sample-speci\ufb01c, there is no longer any need for f to be complex, and simple linear and\nlogistic regression models will suf\ufb01ce, providing useful and interpretable models for each sample.\n\nAlternative approaches and related work. One natural approach to heterogeneity is to use mix-\nture models, e.g. a mixture of regression [27] or mixture of experts model [10]. While mixture\nmodels present an intriguing way to increase power and borrow strength across the entire cohort,\nthey are notoriously dif\ufb01cult to train and are best at capturing coarse-grained heterogeneity in data.\nImportantly, mixture models do not capture individual, sample-speci\ufb01c effects and thus cannot model\nheterogeneity within subgroups.\nFurthermore, previous approaches to personalized inference [11, 20, 37, 38] assume that there is\na known network or similarity matrix that encodes how samples in a cohort are related to each\nother. A crucial distinction between our approach and these approaches is that no such knowledge\nis assumed. 
Recent work has also focused on estimating sample-speci\ufb01c parameters for structured\nmodels [14, 16, 18, 20, 35]; in these cases, prior knowledge of the graph structure enables ef\ufb01cient\ntesting of sample-speci\ufb01c deviations.\nMore classical approaches include varying-coef\ufb01cient (VC) models [8, 13, 30], where the parameter\n\u2713(i) = \u2713(U (i)) is allowed to depend on additional covariates U in some smooth way, and random\neffects models [15], where \u2713 is modeled as a random variable. More recently, the spirit of the\nVC model has been adapted to use deep neural networks as encoders for complex covariates like\nimages [2, 3] or domain adaptation [26, 29]. In contrast to our approach, which does not impose\nany regularity or structural assumptions on the model, these approaches typically require strong\nsmoothness (in the case of VC) or distributional (in the case of random effects) assumptions.\nFinally, locally-linear models estimated by recent work in model explanations [28] can be interpreted\nas sample-speci\ufb01c models. We make explicit comparisons to this approach in our experiments\n(Section 3), but we point out here that local explanations serve to interpret a black-box model\u2014which\nmay be incorrect\u2014and not the true mechanisms underlying the data. This is clearly illustrated in\nFig 1c, where local linear approximations do a good job of explaining the behaviour of the underlying\nneural network, but nonetheless fail to capture the true regression coef\ufb01cients. This tradeoff between\ninference and prediction is well-established in the literature.\n\n2 Learning sample-speci\ufb01c models\n\nFor clarity, we describe the main idea using a linear model for each personalized model; extension\nto arbitrary generalized linear models including logistic regression is straightforward. In Section 3,\nwe include experiments using both linear and logistic regression. 
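A practical upshot of this setup is that switching between personalized linear and personalized logistic regression only changes the per-sample loss $\ell$. The following is our own illustrative sketch of these two per-sample losses applied to a matrix of sample-specific parameters (all names and shapes are ours, not the released implementation):

```python
import numpy as np

def squared_loss(x, y, theta):
    """Least-squares loss for one sample of a personalized linear model."""
    return (y - x @ theta) ** 2

def logistic_loss(x, y, theta):
    """Logistic loss for one sample, with label y in {0, 1}."""
    z = x @ theta
    # -log sigmoid(z) when y = 1; the (1 - y) * z term covers the y = 0 case
    return np.log1p(np.exp(-z)) + (1 - y) * z

rng = np.random.default_rng(0)
n, p = 4, 3
X = rng.normal(size=(n, p))
Theta = rng.normal(size=(n, p))   # one simple parameter vector per sample
y = rng.integers(0, 2, size=n)
per_sample_losses = [logistic_loss(X[i], y[i], Theta[i]) for i in range(n)]
```

Each row of `Theta` is one sample's model; the joint objective of Section 2.3 couples these rows through shared structure and regularization.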
A traditional linear model would dictate $Y^{(i)} = \langle X^{(i)}, \theta \rangle + w^{(i)}$, where the $w^{(i)}$ are noise and the parameter $\theta \in \mathbb{R}^p$ is shared across different samples. We relax this model by allowing $\theta$ to vary with each sample, i.e.

$Y^{(i)} = \langle X^{(i)}, \theta^{(i)} \rangle + w^{(i)}.$   (1)

Clearly, without additional constraints, this model is overparametrized: there is a $(p-1)$-dimensional subspace of solutions to the equation $Y^{(i)} = \langle X^{(i)}, \theta^{(i)} \rangle$ in $\theta^{(i)}$ for each $i$. Thus, the key is to choose a solution $\theta^{(i)}$ that simultaneously leads to good generalization and accurate inferences about the $i$th sample. We propose two strategies for this: (a) a low-rank latent representation of the parameters $\theta^{(i)}$ and (b) a novel regularization scheme.

2.1 Low-rank representation
We constrain the matrix of personalized parameters $\Omega = [\theta^{(1)} | \cdots | \theta^{(n)}] \in \mathbb{R}^{p \times n}$ to be low-rank, i.e. $\theta^{(i)} = Q^T Z^{(i)}$ for some loadings $Z^{(i)} \in \mathbb{R}^q$ and some dictionary $Q \in \mathbb{R}^{q \times p}$. Letting $Z \in \mathbb{R}^{q \times n}$ denote the matrix of loadings, we have a low-rank representation $\Omega = Q^T Z$. The choice of $q$ is determined by the user's desired latent dimensionality; for $q \ll p$, using only $\Theta(q(n+p))$ parameters instead of the $\Theta(np)$ of a full-rank solution can greatly improve computational and statistical efficiency. In addition, the low-rank formulation enables us to use $\ell_2$ distance in $Z$ in Eq. (4) to restrict Euclidean distances between the $\theta^{(i)}$: after normalizing the columns of $Q$, we have

$\|\theta^{(i)} - \theta^{(j)}\| \le \sqrt{p}\, \|Z^{(i)} - Z^{(j)}\|.$   (2)

This illustrates that closeness in the loadings $Z^{(i)}$ implies closeness in the parameters $\theta^{(i)}$. This fact will be exploited to regularize the $\theta^{(i)}$ (Section 2.2).
This use of a dictionary $Q$ is common in multi-task learning [23] based on the assumption that tasks inherently use shared atomic representations. 
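As a quick numerical sanity check of the bound in Eq. (2), a random column-normalized dictionary satisfies it directly (our own sketch; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 20, 3                  # samples, predictors, latent dim (illustrative)

Z = rng.normal(size=(q, n))          # loadings, one column per sample
Q = rng.normal(size=(q, p))
Q /= np.linalg.norm(Q, axis=0)       # normalize the columns of Q
Omega = Q.T @ Z                      # personalized parameters: theta_i = Q^T Z_i

# Closeness in loadings bounds closeness in parameters:
#   ||theta_i - theta_j|| <= sqrt(p) * ||Z_i - Z_j||
for i, j in [(0, 1), (2, 7), (10, 45)]:
    lhs = np.linalg.norm(Omega[:, i] - Omega[:, j])
    rhs = np.sqrt(p) * np.linalg.norm(Z[:, i] - Z[:, j])
    assert lhs <= rhs + 1e-12
```

The $\sqrt{p}$ factor appears because a dictionary with $p$ unit-norm columns has Frobenius norm $\sqrt{p}$, which bounds its operator norm.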
Here, we make the analogous assumption that samples arise from combinations of shared processes, so sample-specific models based on a shared dictionary efficiently characterize sample heterogeneity. Sparsity in $\theta$ can be realized by sparsity in $Z, Q$; for instance, effect sizes which are consistently zero across all samples can be created by zero vectors in the columns of $Q$. The low-rank formulation also implicitly constrains the number of personalized sparsity patterns; this can be adjusted by changing the latent dimensionality $q$.

2.2 Distance-matching
Existing approaches [11, 20, 37, 38] assume that there is a known weighted network $(\delta_{ij})_{i,j=1}^{n}$ over samples such that $\|\theta^{(i)} - \theta^{(j)}\| \approx \delta_{ij}$. In other words, we have prior knowledge of which parameters should be similar. We avoid this strong assumption by instead assuming that we have additional covariates $U^{(i)} \in \mathbb{R}^k$ for which there exists some way to measure similarity that corresponds to similarity in the parameter space; however, we do not have advance knowledge of this metric. More specifically, we regularize the parameters $\theta^{(i)}$ by requiring that similarity in $\theta$ corresponds to similarity in $U$, i.e. $\|\theta^{(i)} - \theta^{(j)}\| \approx \rho(U^{(i)}, U^{(j)})$, where $\rho$ is an unknown, latent metric on the covariates $U$. In applications, the $U^{(i)}$ represent exogenous variables that we do not wish to directly model; for example, in our motivating example of an electoral analysis, this may include demographic information about the localities.
To promote similar structure in the parameters as in the covariates, we adapt a distance-matching regularization (DMR) scheme [17] to penalize the squared difference in implied distances. The covariate distances are modeled as a weighted sum:

$\rho(u, v) = \sum_{\ell=1}^{k} \phi_\ell\, d_\ell(u_\ell, v_\ell), \qquad \phi_\ell \ge 0,$   (3)

where each $d_\ell$ ($\ell = 1, \ldots, k$) is a metric for a covariate. 
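To make Eq. (3) concrete, here is a minimal sketch of such a weighted covariate distance, mixing an absolute-difference metric for a continuous covariate with the discrete metric for a categorical one (the covariate values, metrics, and weights are our illustrative choices, not from the paper):

```python
import numpy as np

def rho(u, v, phi, metrics):
    """Weighted covariate distance: rho(u, v) = sum_l phi_l * d_l(u_l, v_l)."""
    assert np.all(phi >= 0), "weights must be nonnegative"
    return sum(w * d(a, b) for w, d, a, b in zip(phi, metrics, u, v))

abs_diff = lambda a, b: abs(a - b)      # metric for a continuous covariate
discrete = lambda a, b: float(a != b)   # discrete metric for a categorical covariate

phi = np.array([2.0, 0.5])              # learned weights (fixed by hand here)
u = (0.3, "urban")
v = (0.8, "rural")
dist = rho(u, v, phi, [abs_diff, discrete])   # 2.0 * 0.5 + 0.5 * 1.0, approx. 1.5
```

In the full method the weights $\phi$ are learned jointly with the models via the composite objective (Eq. 6); here they are fixed by hand only to show the shape of the computation.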
The positive vector $\phi$ represents a linear transformation of these "simple" distances into more useful latent distance functions. By using a linear parametrization for $\rho$, we can interpret the learned effects by inspecting the weights assigned to each covariate.
By Eq. (2), in order for $\|\theta^{(i)} - \theta^{(j)}\| \approx \rho(U^{(i)}, U^{(j)})$, it suffices to require $\|Z^{(i)} - Z^{(j)}\| \approx \rho(U^{(i)}, U^{(j)})$. With this in mind, define the following distance-matching regularizer:

$D^{(i)}_\gamma(Z, \phi) = \frac{\gamma}{2} \sum_{j \in B_r(i)} \big( \rho(U^{(i)}, U^{(j)}) - \|Z^{(i)} - Z^{(j)}\|_2 \big)^2,$   (4)

where $B_r(i) = \{ j : \|Z^{(i)} - Z^{(j)}\|_2 < r \}$. This regularizer promotes imitating the structure of covariate values in the regression parameters. By using $Z$ instead of $\Omega$ in the regularizer, calculation of distances is much more efficient when $q \ll p$. A discussion of hyperparameter selection is contained in Section B.3 of the supplement.

2.3 Personalized Regression
Let $\ell(x, y, \theta)$ be a loss function, e.g. least-squares or logistic loss. For each sample $i$ of the training data, define a regularized, sample-specific loss by

$L^{(i)}(Z, Q, \phi) = \ell(X^{(i)}, Y^{(i)}, Q^T Z^{(i)}) + \psi(Q^T Z^{(i)}) + D^{(i)}_\gamma(Z, \phi),$   (5)

where $\psi$ is a regularizer such as the $\ell_1$ penalty and $D^{(i)}_\gamma$ is the distance-matching regularizer defined in Eq. (4). We learn $\Omega$ and $\phi$ by minimizing the following composite objective:

$L(Z, Q, \phi) = \sum_{i=1}^{n} L^{(i)}(Z, Q, \phi) + \lambda \|\phi - 1\|_2^2,$   (6)

where the second term regularizes the distance function $\rho$ with strength set by $\lambda$, and we recall that $\Omega = Q^T Z$. The hyperparameter $\gamma$ trades off sensitivity to prediction of the response variable against sensitivity to covariate structure.

Optimization. We minimize the composite objective $L(Z, Q, \phi)$ with subgradient descent combined with a specific initialization and learning rate schedule. An outline of the algorithm can be found in Alg. 1 below. In detail, we initialize $\Omega$ by setting $\theta^{(i)} \sim \mathcal{N}(\hat{\theta}_{\mathrm{pop}}, \epsilon I)$ for a population model $\hat{\theta}_{\mathrm{pop}}$ such as the Lasso or elastic net and then initialize $Z$ and $Q$ by factorizing $\Omega$ with PCA; $\epsilon$ is a very small value used only to enable factorization by the PCA algorithm. Each personalized estimator is endowed with a personalized learning rate $\alpha^{(i)}_t = \alpha_t / \|\hat{\theta}^{(i)}_t - \hat{\theta}_{\mathrm{pop}}\|_1$, which scales the global learning rate $\alpha_t$ according to how far the estimator has traveled. In addition to working well in practice, this scheme guarantees that the center of mass of the personalized regression coefficients does not deviate too far from the initialization $\hat{\theta}_{\mathrm{pop}}$, even though the coefficients $\hat{\theta}^{(i)}$ remain unconstrained. This property is discussed in more detail in Section 2.4.

Algorithm 1 Personalized Estimation
Require: $\hat{\theta}_{\mathrm{pop}}, \psi, \lambda, \gamma, \alpha_0, c$
1: $\theta^{(1)}, \ldots, \theta^{(n)} \leftarrow \hat{\theta}_{\mathrm{pop}}$
2: $\Omega \leftarrow [\theta^{(1)} | \ldots | \theta^{(n)}]$
3: $Z, Q \leftarrow \mathrm{PCA}(\Omega)$
4: $\phi \leftarrow 1$
5: $\alpha \leftarrow \alpha_0$
6: do
7:   $\tilde{Z}, \tilde{Q}, \tilde{\phi} \leftarrow Z, Q, \phi$
8:   $\phi \leftarrow \phi - \alpha\, \frac{\partial}{\partial \phi} \big[ \sum_{i=1}^{n} D^{(i)}_\gamma(\tilde{Z}, \tilde{\phi}) + \lambda \|\tilde{\phi} - 1\|_2^2 \big]$
9:   $Z^{(i)} \leftarrow Z^{(i)} - \frac{\alpha}{\|\theta^{(i)} - \hat{\theta}_{\mathrm{pop}}\|_1} \big[ \frac{\partial}{\partial Z^{(i)}} \sum_{j=1}^{n} D^{(j)}_\gamma(\tilde{Z}, \tilde{\phi}) + \tilde{Q}\, \partial \ell(X^{(i)}, Y^{(i)}, \theta^{(i)}) + \partial \psi(\theta^{(i)}) \big]$ for all $i \in [1, \ldots, n]$
10:  $Q \leftarrow Q - \alpha \big[ \frac{\partial}{\partial Q} \sum_{i=1}^{n} D^{(i)}_\gamma(\tilde{Z}, \tilde{\phi}) + \sum_{i=1}^{n} \tilde{Z}^{(i)} \big( \partial \ell(X^{(i)}, Y^{(i)}, \theta^{(i)}) + \partial \psi(\theta^{(i)}) \big)^T \big]$
11:  $\alpha \leftarrow \alpha c$
12:  $\theta^{(i)} \leftarrow Q^T Z^{(i)}$ for all $i \in [1, \ldots, n]$
13:  $\Omega \leftarrow [\theta^{(1)} | \ldots | \theta^{(n)}]$
14: while not converged
15: return $\Omega, Z, Q, \phi$

Prediction. 
Given a test point $(X, U)$, we form a sample-specific model by averaging the model parameters of the $k_n$ nearest training points, according to the learned distance metric $\rho$:

$\theta = \frac{1}{k_n} \sum_{j=1}^{k_n} \theta^{(\eta(\rho, U)[j])}, \qquad \eta(\rho, U) = \mathrm{argsort}_{1 \le i \le n}\; \rho(U, U^{(i)}).$   (7a)

Increasing $k_n$ drives the test models toward the population model to control overfitting. In our experiments, we use $k_n = 3$.
We have intentionally avoided using $X$ to select $\theta$ so that interpretation of $\theta$ is not confounded by $X$. In some cases, however, the sample predictors can provide additional insight into sample distances (e.g. [36]); we leave it to future work to examine how to augment estimation of sample distances by including distances between predictors.

Scalability. Naïvely, the distance-matching regularizer has $O(n^2)$ pairwise distances to calculate; however, this calculation can be made efficient as follows. First, the terms involving $d_\ell(U^{(i)}_\ell, U^{(j)}_\ell)$ remain unchanged during optimization, so their computation can be amortized. This allows the use of feature-wise distance metrics which are computationally intensive (e.g. the output of a deep learning model for image covariates). Furthermore, these values are never optimized, so the distance metrics $d_\ell$ need not be differentiable. This allows for a wide variety of distance metrics, such as the discrete metric for unordered categorical covariates. Second, we streamline the calculation of nearest neighbors in two ways: 1) storing $Z$ in a spatial data structure and 2) shrinking the hyperparameter $r$ used in (4). 
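The prediction rule in Eq. (7a) is straightforward to implement once the training parameters are recovered from the loadings. Below is our sketch (array shapes are illustrative, and a plain Euclidean metric stands in for the learned $\rho$, which would be supplied after training):

```python
import numpy as np

def predict_theta(U_test, U_train, Theta_train, rho, k_n=3):
    """Average the parameters of the k_n training samples closest to U_test
    under the covariate metric rho (Eq. 7a)."""
    dists = np.array([rho(U_test, U_i) for U_i in U_train])
    nearest = np.argsort(dists)[:k_n]        # indices of the k_n nearest samples
    return Theta_train[nearest].mean(axis=0)

rng = np.random.default_rng(1)
U_train = rng.normal(size=(10, 4))           # training covariates
Theta_train = rng.normal(size=(10, 5))       # one parameter vector per training sample
rho = lambda u, v: np.linalg.norm(u - v)     # stand-in for the learned metric
theta = predict_theta(U_train[0], U_train, Theta_train, rho)
```

As in the paper, $k_n = 3$ is the default; larger $k_n$ pulls the test-time model toward the population average.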
With these performance improvements, we are able to fit models to datasets with over 10,000 samples and 1000s of predictors on a Macbook Pro with 16GB RAM in under an hour.

2.4 Analysis
Initializing sample-specific models around a population estimate is convenient because the sample-specific estimates do not diverge from the population estimate unless they have strong reason to do so. Here, we analyze linear regression minimized by squared loss (i.e., $\ell(X^{(i)}, Y^{(i)}, \theta^{(i)}) = (Y^{(i)} - X^{(i)}\theta^{(i)})^2$), though the properties extend to any predictive loss function with a Lipschitz-continuous subgradient.
Theorem 1. Consider personalized linear regression with $\psi(x) = \lambda \|x\|_1$ (i.e. $\ell_1$ regularization). Let $X$ be normalized such that $\max_i \|X^{(i)}\|_\infty \le 1$, $\|X^{(i)}\|_1 = 1$. Define $\bar{\theta}_t := \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}^{(i)}_t$, where $\hat{\theta}^{(i)}_t$ is the current value of $\hat{\theta}^{(i)}$ after $t$ iterations. Let the overall learning rate follow a multiplicative decay such that $\alpha_t = \alpha_0 c^t$, where $\alpha_0$ is the initial learning rate and $c$ is a constant decay factor. Then at iteration $\tau$,

$\|\bar{\theta}_\tau - \hat{\theta}_{\mathrm{pop}}\|_1 \in O(\lambda).$   (8)

That is, the center of mass of the personalized regression coefficients does not deviate too far from the initialization $\hat{\theta}_{\mathrm{pop}}$, even though the coefficients $\hat{\theta}^{(i)}$ remain unconstrained. In addition, the distance-matching regularizer does not move the center of mass, and the update to the center of mass does not grow with the number of samples. Proofs of these claims are included in Appendix A of the supplement.

3 Experiments

We compare personalized regression (hereafter, PR) to four baselines: 1) Population linear or logistic regression, 2) A mixture regression (MR) model, 3) Varying coefficients (VC), 4) Deep neural networks (DNN). First, we evaluate each method's ability to recover the true parameters from simulated data. 
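One ingredient behind Theorem 1 can be checked numerically: with the multiplicative decay $\alpha_t = \alpha_0 c^t$ and bounded subgradients, the total distance any coefficient can travel is at most the geometric series $\sum_t \alpha_0 c^t = \alpha_0 / (1 - c)$. This is our own illustration of that fact, not the paper's proof, and the constants are arbitrary:

```python
import numpy as np

alpha0, c, T = 0.1, 0.9, 10_000            # initial rate, decay factor, iterations
step_sizes = alpha0 * c ** np.arange(T)

# Total travel of any coordinate is bounded by the sum of all step sizes,
# which the geometric series caps at alpha0 / (1 - c), independent of T.
total = step_sizes.sum()
assert total <= alpha0 / (1 - c) + 1e-9
```

The bound is independent of the number of iterations, which is why the center of mass stays near the initialization no matter how long training runs.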
Then we present three real data case studies, each progressively more challenging than the previous: 1) Stock prediction using financial data, 2) Cancer diagnosis from mass spectrometry data, and 3) Electoral prediction using historical election data. The results are summarized in Table 1 for easy reference. Details on all the algorithms and datasets used, as well as additional results and figures, can be found in Appendix B of the supplement.
We believe the out-of-sample prediction results provide strong evidence that any harmful overfitting of PR is outweighed by the benefit of personalized estimation. This agrees with famous results such as [31], where it is shown that optimal ensembles of linear models consist of overfitted atoms; see especially Eq. 12 and Fig. 2 therein.

3.1 Simulation Study

We first investigate the capacity of personalized regression to recover true effect sizes in a small-scale simulation study. We generate $X_j \sim \mathrm{Unif}(-1, 1)$ ($j = 1, 2$), $U \sim \mathrm{Unif}(0, 1)$, $\theta^{(i)} = [U^{(i)}, \mathbb{I}_{|U^{(i)}| > 0.5} + 0.1 \sin(U^{(i)})] \in \mathbb{R}^2$, and $Y^{(i)} = X^{(i)}\theta^{(i)} + w^{(i)}$, with $w^{(i)} \sim \mathcal{N}(0, 0.1)$. As shown in Fig. 1, this produces regression parameters with a discontinuous distribution. The algorithms are given both $X$ and $U$ as input during training, and we use LIME [28] to generate local linear approximations to the DNN in order to estimate parameters $\theta^{(i)}$ for each sample. In this setting, there

Table 1: Predictive performance on test sets. For continuous response variables, we report correlation coefficient (R2) and mean squared error (MSE) of the predictions. For classification tasks, we report area under the receiver operating characteristic curve (AUROC) and the accuracy (ACC). 
For the simulation, we also report recovery error of the true regression parameters in the training set, with (mean ± std) values calculated over 20 experiments with different values of $X, U, w$.

Model | Sim. $\|\hat{\Omega} - \Omega\|_2$ | Sim. R2 | Sim. MSE | Fin. R2 | Fin. MSE | Cancer AUROC | Cancer Acc | Elec. R2 | Elec. MSE
Pop.  | 24.76 ± 0.02 | 0.57 ± 0.03  | 0.133 ± 0.01 | 0.01 | 64144 | 0.794 | 0.962 | 0.00 | 0.019
MR    | 19.31 ± 0.87 | 0.83 ± 0.03  | 0.054 ± 0.01 | 0.74 | 16146 | 0.876 | 0.939 | 0.56 | 0.031
VC    | 24.88 ± 0.09 | 0.66 ± 0.02  | 0.106 ± 0.01 | 0.06 | 60694 | 0.430 | 0.863 | 0.00 | 0.019
DNN   | 30.29 ± 0.55 | 0.91 ± 0.03  | 0.028 ± 0.01 | 0.02 | 63028 | 0.901 | 0.955 | 0.00 | 0.019
PR    | 9.02 ± 2.53  | 0.936 ± 0.05 | 0.020 ± 0.01 | 0.86 | 4822  | 0.923 | 0.975 | 0.45 | 0.011

exists a discontinuous function which could output exactly the sample-specific regression models from the covariates that a neural network should be able to learn accurately. In this sense, the neural network is "correctly specified" for this dataset, testing how well locally-linear models approximate the true parameters. More extensive simulation experiments, with varying $n$ and $p$, are available in Sec. C.1 of the Supplement.

Results. The results are presented in Table 1 and visualized in Fig. 1. As expected, the recovery error is much lower for PR, while the DNN shows competitive predictive error. The population estimator successfully recovers the mean effect sizes, but this central model is not accurate for any individual, resulting in poor performance both in recovering $\Omega$ and in prediction. Similarly, both MR and VC perform poorly. As expected, the deep learning model excels at predictive error; however, the local linear approximations do not accurately recover the sample-specific linear models. 
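The data-generating process of Sec. 3.1 can be reproduced in a few lines. This is our sketch of the written description; $n$ is chosen arbitrarily, and we interpret $\mathcal{N}(0, 0.1)$ as a noise scale of 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))      # X_j ~ Unif(-1, 1), j = 1, 2
U = rng.uniform(0, 1, size=n)            # U ~ Unif(0, 1)
Theta = np.column_stack([
    U,                                            # first coefficient
    (np.abs(U) > 0.5).astype(float) + 0.1 * np.sin(U),  # discontinuous coefficient
])
w = rng.normal(0.0, 0.1, size=n)         # noise, scale interpretation assumed
Y = np.einsum("ij,ij->i", X, Theta) + w  # Y_i = <X_i, theta_i> + w_i
```

The indicator term is what makes the parameter distribution discontinuous, which is exactly the structure that smooth varying-coefficient models fail to fit in Fig. 1b.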
In contrast,\nPR exhibits both the \ufb02exibility and the structure to learn the true regression parameters while retaining\npredictive performance.\n\n3.2 Financial Prediction\n\nA common task in \ufb01nancial trading is to predict the price of a security at some point in the future.\nThis is a challenging task made more dif\ufb01cult by nonstationarity\u2014the interpretation of an event\nchanges over time, and different securities may respond to the same event differently. We built a\ndataset of security prices over a 30-year time frame by joining stock and ETF trading histories to a\ndatabase of global news headlines (details in supplement). The predictors X (i,t) consist of the trading\nhistory of the 24 securities over the previous 2 weeks as well as global news headlines from the same\ntime period. The covariates U (i,t) consist of the date and security characteristics (name, region, and\nindustry). The target Y (i,t) is the price of this security 2 weeks after t.\n\nResults. PR signi\ufb01cantly outperforms baseline methods to predict price movements (Table 1).\nIn contrast to standard models which average effects over long time periods and/or securities, PR\nsummarizes gradual shifts in attention. The estimated sample-speci\ufb01c models are visualized in Fig. 2.\nThe strongest clustering behavior is due to time (Fig. 2b). For instance, models \ufb01t to samples in the\nera of U.S. \u201cstag\ufb02ation\" (1973-1975) are overlaid on models for samples in the early 1990s U.S.\nrecession. In both of these cases, real equity prices declined against the background of high in\ufb02ation\nrates. In contrast, the recessions marked by structural problems such as the Great Financial Crisis of\n2008 are separated from the others. Within each time period, we also see that industries (Fig. S4a),\nregions (Fig. S4b), and securities (Fig. 
S4c) are strongly clustered (details in supplement).\n\n3.3 Cancer Analysis\n\nIn cancer analysis, the challenges of sample heterogeneity are paramount and well-known. Increasing\nbiomedical evidence suggests that patients do not fall into discrete clusters [5, 21], but rather each\npatient experiences a unique disease that should be approached from an individualized perspective [7].\nHere, we investigate the capacity of PR to distinguish malignant from benign skin lesions using a\ndataset of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) of a common\nskin cancer, basal cell carcinoma (BCC) [22] (details in supplement).\n\n7\n\n\f(a) Models colored by industry of the security.\n\n(b) Models colored by date of prediction.\n\nFigure 2: Personalized \ufb01nancial models using t-SNE [33] embedding. Each point represents a\nregression model for one security at a single date.\n\nResults. As shown in Table 1, PR produces the best predictions of tumor status amongst the\nmethods evaluated. The substantial improvement over competing methods is likely due to the long\ntail of the distribution of characteristic features\u2014we observe that the number of samples which\nassign the largest in\ufb02uence to each feature has a long tail (Fig. S7). By summing the most important\nfeatures for each instance, we can transform these sample-speci\ufb01c explanations into patient-speci\ufb01c\nexplanations (Table S4). These explanations depict a clustering of patients in which there are 8\ndistinct subtypes (visualized in Fig. S6). While we may hope that a mixture model could recover\nthese patient clusters, actual mixture components are less accurate in prediction (Table 1), likely due\nto their independent estimation and reduced statistical power. Furthermore, this clustering by patient\nis incomplete\u2014there is also signi\ufb01cant heterogeneity in the models for each patient (Fig. S6). 
This\nmay point to the \u201cmosaic\" view of tumors, under which single tumors are comprised of multiple cell\nlines [19]. This example underscores the bene\ufb01ts of treating sample heterogeneity as fundamental by\ndesigning algorithms to estimate sample-speci\ufb01c models.\n\n3.4 Presidential Election Analysis\n\nOur last experiment illustrates a practical use case for the example of modeling election outcomes\ndiscussed in Section 1. The goals are twofold: 1) To predict county-level election results, and\n2) To explore the use of distinct regression models as embeddings of samples in order to better\nunderstand voting preferences at the county (i.e. sample-speci\ufb01c) level. The data are from the 2012\nU.S. presidential election and consist of discrete representations of each candidate based on candidate\npositions while the outcomes are the county-level vote proportions (details in supplement). For the\ncovariates U, we used county demographic information from the 2010 U.S. Census. As the outcome\nvaries across samples but the predictors remain constant, the personalized regression models must\nencode sample heterogeneity by estimating different regression parameters for different samples, thus\ncreating county representations (\u201cembeddings\") which combine both voting and demographic data.\n\nResults. The out-of-sample predictive error is signi\ufb01cantly reduced by personalization (Table 1).\nFigs. 3, S8 depict embeddings of the Pennsylvania counties included in the training set. Generating\ncounty embeddings based solely on voting outcome constrains the embeddings near a one-dimensional\nmanifold (Fig. S8b), while demographics produce embeddings which do not strongly correspond to\nvoting patterns (Fig. 3a). In contrast, the personalized models produce a structure which interpolates\nbetween the two types of data (Fig. 3b). An interesting case is that of the Lackawanna and Allegheny\ncounties. 
While these counties had similar voting results in the 2012 election, their embeddings are far apart due to the difference in demographics between their major metropolitan areas. This indicates that the two county populations may be arriving at similar voting results for different reasons, a finding that is not discovered by jointly inspecting the demographic and voting data (Fig. S8e). Thus, sample-specific models can be used to understand the complexities of election results.

Figure 3: Embeddings of Pennsylvania counties. Each point represents a county, with color gradient corresponding to the 2012 election result (red for the Republican candidate, blue for the Democratic candidate). Due to space constraints, the name of each county has been abbreviated, with a key in Table S5 of the Supplement. (a) Demographic covariates, U: the raw covariates U lie near a low-dimensional manifold that does not correspond to voter outcome. (b) Personalized estimation, Ẑ: personalized regression models form embeddings (Ẑ) which interpolate between demographic and voting information.

4 Discussion and Future Work

We have presented a framework to estimate collections of models by matching structure in sample covariates to structure in regression parameters. We showed that this framework accurately recovers sample-specific parameters, enabling collections of simple models to surpass the predictive capacity of larger, uninterpretable models. Our framework also enables fine-grained analyses which can be used to understand sample heterogeneity, even within groups of similar samples. Beyond estimating sample-specific models, we also believe it would be possible to adapt these ideas to improve standard models. For instance, the distance-matching regularizer may be applied to augment standard mixture models.
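To make concrete how the distance-matching regularizer could be ported to other models, the sketch below penalizes mismatch between pairwise distances in parameter space and pairwise distances in covariate space. Note that the full method learns a latent distance metric over the covariates; the plain Euclidean distances and squared mismatch penalty here are simplifying assumptions for illustration only:

```python
import numpy as np

def distance_matching_penalty(Theta, U):
    """Toy distance-matching regularizer.

    Encourages the pairwise distances between per-sample parameters
    Theta[i] to mirror the pairwise distances between the corresponding
    sample covariates U[i].

    Theta: (n, p) array of per-sample regression parameters
    U:     (n, k) array of sample covariates
    """
    n = Theta.shape[0]
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_theta = np.linalg.norm(Theta[i] - Theta[j])
            d_u = np.linalg.norm(U[i] - U[j])
            # Penalize disagreement between parameter-space and
            # covariate-space distances for this pair of samples.
            penalty += (d_theta - d_u) ** 2
    return penalty
```

Added to a mixture model's objective, a term of this form would push components whose members have similar covariates toward similar parameters.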
It would also be interesting to consider extensions of this framework to more structured models such as personalized probabilistic graphical models. Overall, the success of these personalized models underscores the importance of directly treating sample heterogeneity rather than building increasingly complicated cohort-level models.

Acknowledgments

We thank Maruan Al-Shedivat, Gregory Cooper, and Rich Caruana for insightful discussion. This material is based upon work supported by NIH R01GM114311. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Institutes of Health.

References

[1] I. I. Ageenko, K. A. Doherty, and A. P. Van Cleave. Personalized lifetime financial planning tool, June 24 2010. US Patent App. 12/316,967.

[2] M. Al-Shedivat, A. Dubey, and E. P. Xing. Contextual explanation networks. arXiv preprint arXiv:1705.10301, 2017.

[3] M. Al-Shedivat, A. Dubey, and E. P. Xing. Personalized survival prediction with contextual explanation networks. arXiv preprint arXiv:1801.09810, 2018.

[4] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR), 2017.

[5] F. Buettner, K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology, 33(2):155, 2015.

[6] X. Ding, Y. Zhang, T. Liu, and J. Duan. Deep learning for event-driven stock prediction. In IJCAI, pages 2327–2333, 2015.

[7] H. K. Dressman, A. Berchuck, G. Chan, J. Zhai, A. Bild, R. Sayer, J. Cragun, J. Clarke, R. S. Whitaker, L. Li, et al.
An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer. Journal of Clinical Oncology, 25(5):517–525, 2007.

[8] J. Fan and W. Zhang. Statistical estimation in varying coefficient models. Annals of Statistics, pages 1491–1518, 1999.

[9] A. J. Fisher, J. D. Medaglia, and B. F. Jeronimus. Lack of group-to-individual generalizability is a threat to human subjects research. Proceedings of the National Academy of Sciences, 115(27):E6106–E6115, 2018. doi: 10.1073/pnas.1711978115. URL https://www.pnas.org/content/115/27/E6106.

[10] I. C. Gormley, T. B. Murphy, et al. A mixture of experts model for rank data with applications in election studies. The Annals of Applied Statistics, 2(4):1452–1477, 2008.

[11] D. Hallac, J. Leskovec, and S. Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 387–396. ACM, 2015.

[12] S. A. Hart. Precision education initiative: Moving toward personalized education. Mind, Brain, and Education, 10(4):209–211, 2016. doi: 10.1111/mbe.12109. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/mbe.12109.

[13] T. Hastie and R. Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society. Series B (Methodological), pages 757–796, 1993.

[14] F. Jabbari, S. Visweswaran, and G. F. Cooper. Instance-specific Bayesian network structure learning. In V. Kratochvíl and M. Studený, editors, Proceedings of the Ninth International Conference on Probabilistic Graphical Models, volume 72 of Proceedings of Machine Learning Research, pages 169–180, Prague, Czech Republic, 11–14 Sep 2018. PMLR. URL http://proceedings.mlr.press/v72/jabbari18a.html.

[15] J. Jiang. Linear and generalized linear mixed models and their applications.
Springer Science & Business Media, 2007.

[16] M. L. Kuijjer, M. G. Tung, G. Yuan, J. Quackenbush, and K. Glass. Estimating sample-specific regulatory networks. iScience, 14:226–240, 2019.

[17] B. J. Lengerich, B. Aragam, and E. P. Xing. Personalized regression enables sample-specific pan-cancer analysis. Bioinformatics, 34(13):i178–i186, 2018. doi: 10.1093/bioinformatics/bty250. URL http://dx.doi.org/10.1093/bioinformatics/bty250.

[18] X. Li, S. Xie, P. McColgan, S. J. Tabrizi, R. I. Scahill, D. Zeng, and Y. Wang. Learning subject-specific directed acyclic graphs with mixed effects structural equation models from observational data. Frontiers in Genetics, 9, 2018.

[19] C. Liu, J. C. Sage, M. R. Miller, R. G. Verhaak, S. Hippenmeyer, H. Vogel, O. Foreman, R. T. Bronson, A. Nishiyama, L. Luo, et al. Mosaic analysis with double markers reveals tumor cell of origin in glioma. Cell, 146(2):209–221, 2011.

[20] X. Liu, Y. Wang, H. Ji, K. Aihara, and L. Chen. Personalized characterization of diseases using sample-specific networks. Nucleic Acids Research, 44(22):e164–e164, 2016.

[21] S. Ma, S. Ogino, P. Parsana, R. Nishihara, Z. Qian, J. Shen, K. Mima, Y. Masugi, Y. Cao, J. A. Nowak, K. Shima, Y. Hoshida, E. L. Giovannucci, M. K. Gala, A. T. Chan, C. S. Fuchs, G. Parmigiani, C. Huttenhower, and L. Waldron. Continuity of transcriptomes among colorectal cancer subtypes based on meta-analysis. Genome Biology, 19(1):142, Sep 2018. doi: 10.1186/s13059-018-1511-4. URL https://doi.org/10.1186/s13059-018-1511-4.

[22] K. Margulis, A. S. Chiou, S. Z. Aasi, R. J. Tibshirani, J. Y. Tang, and R. N. Zare. Distinguishing malignant from benign microscopic skin lesions using desorption electrospray ionization mass spectrometry imaging. Proceedings of the National Academy of Sciences, 115(25):6347–6352, 2018. doi: 10.1073/pnas.1803733115.
URL https://www.pnas.org/content/115/25/6347.

[23] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In ICML (2), pages 343–351, 2013.

[24] K. Ng, J. Sun, J. Hu, and F. Wang. Personalized predictive modeling and risk factor identification using patient similarity. AMIA Summits on Translational Science Proceedings, 2015:132, 2015.

[25] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[26] E. A. Platanios, M. Sachan, G. Neubig, and T. Mitchell. Contextual parameter generation for universal neural machine translation. arXiv preprint arXiv:1808.08493, 2018.

[27] X. Puig and J. Ginebra. A Bayesian cluster analysis of election results. Journal of Applied Statistics, 41(1):73–94, 2014.

[28] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.

[29] A. Rozantsev, M. Salzmann, and P. Fua. Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):801–814, 2018.

[30] W. M. Shyu, E. Grosse, and W. S. Cleveland. Local regression models. In Statistical Models in S, pages 309–376. Routledge, 1991.

[31] P. Sollich and A. Krogh. Learning with ensembles: How overfitting can be useful. In Advances in Neural Information Processing Systems, pages 190–196, 1996.

[32] R. Tibshirani and J. Friedman. A pliable lasso. arXiv preprint arXiv:1712.00484, 2017.

[33] L. Van Der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.

[34] R. K. Vinayak, W. Kong, G.
Valiant, and S. Kakade. Maximum likelihood estimation for learning populations of parameters. In International Conference on Machine Learning, pages 6448–6457, 2019.

[35] S. Visweswaran and G. F. Cooper. Learning instance-specific predictive models. Journal of Machine Learning Research, 11(Dec):3333–3369, 2010.

[36] B. Wang, A. M. Mezlini, F. Demir, M. Fiume, Z. Tu, M. Brudno, B. Haibe-Kains, and A. Goldenberg. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11(3):333, 2014.

[37] J. Xu, J. Zhou, and P.-N. Tan. Formula: Factorized multi-task learning for task discovery in personalized medical models. In Proceedings of the 2015 International Conference on Data Mining. SIAM, 2015.

[38] M. Yamada, K. Takeuchi, T. Iwata, J. Shawe-Taylor, and S. Kaski. Localized lasso for high-dimensional regression. stat, 1050:20, 2016.

[39] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.