{"title": "A Framework for Individualizing Predictions of Disease Trajectories by Exploiting Multi-Resolution Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 748, "page_last": 756, "abstract": "For many complex diseases, there is a wide variety of ways in which an individual can manifest the disease. The challenge of personalized medicine is to develop tools that can accurately predict the trajectory of an individual's disease, which can in turn enable clinicians to optimize treatments. We represent an individual's disease trajectory as a continuous-valued continuous-time function describing the severity of the disease over time. We propose a hierarchical latent variable model that individualizes predictions of disease trajectories. This model shares statistical strength across observations at different resolutions--the population, subpopulation and the individual level. We describe an algorithm for learning population and subpopulation parameters offline, and an online procedure for dynamically learning individual-specific parameters. Finally, we validate our model on the task of predicting the course of interstitial lung disease, a leading cause of death among patients with the autoimmune disease scleroderma. We compare our approach against state-of-the-art and demonstrate significant improvements in predictive accuracy.", "full_text": "A Framework for Individualizing Predictions of Disease\nTrajectories by Exploiting Multi-Resolution Structure\n\nPeter Schulam\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21218\npschulam@jhu.edu\n\nSuchi Saria\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21218\n\nssaria@cs.jhu.edu\n\nAbstract\n\nFor many complex diseases, there is a wide variety of ways in which an indi-\nvidual can manifest the disease. 
The challenge of personalized medicine is to develop tools that can accurately predict the trajectory of an individual's disease, which can in turn enable clinicians to optimize treatments. We represent an individual's disease trajectory as a continuous-valued, continuous-time function describing the severity of the disease over time. We propose a hierarchical latent variable model that individualizes predictions of disease trajectories. This model shares statistical strength across observations at different resolutions: the population, subpopulation, and individual level. We describe an algorithm for learning population and subpopulation parameters offline, and an online procedure for dynamically learning individual-specific parameters. Finally, we validate our model on the task of predicting the course of interstitial lung disease, a leading cause of death among patients with the autoimmune disease scleroderma. We compare our approach against the state of the art and demonstrate significant improvements in predictive accuracy.

1 Introduction
In complex, chronic diseases such as autism, lupus, and Parkinson's, the way the disease manifests may vary greatly across individuals [1]. For example, in scleroderma, the disease we use as a running example in this work, individuals may be affected across six organ systems (the lungs, heart, skin, gastrointestinal tract, kidneys, and vasculature) to varying extents [2]. For any single organ system, some individuals may show rapid decline throughout the course of their disease, while others may show early decline but stabilize later on. Often in such diseases, the most effective drugs have strong side-effects. With tools that can accurately predict an individual's disease activity trajectory, clinicians can more aggressively treat those at greatest risk early, rather than waiting until the disease progresses to a high level of severity.
To monitor the disease, physicians use clinical markers to quantify severity. In scleroderma, for example, PFVC is a clinical marker used to measure lung severity. The task of individualized prediction of disease activity trajectories is that of using an individual's clinical history to predict the future course of a clinical marker; in other words, the goal is to predict a function representing a trajectory that is updated dynamically using an individual's previous markers and individual characteristics.
Predicting disease activity trajectories presents a number of challenges. First, there are multiple latent factors that cause heterogeneity across individuals. One such factor is the underlying biological mechanism driving the disease. For example, two different genetic mutations may trigger distinct disease trajectories (e.g. as in Figures 1a and 1b). If we could divide individuals into groups according to their mechanisms, or disease subtypes (see e.g. [3, 4, 5, 6]), it would be straightforward to fit separate models to each subpopulation. In most complex diseases, however, the mechanisms are poorly understood and clear definitions of subtypes do not exist. If subtype alone determined trajectory, then we could cluster individuals. However, other unobserved individual-specific factors such as behavior and prior exposures affect health and can cause different trajectories across individuals of the same subtype. For instance, a chronic smoker will typically have unhealthy lungs and so may have a trajectory that is consistently lower than a non-smoker's, which we must account for using individual-specific parameters. An individual's trajectory may also be influenced by transient factors, e.g. an infection unrelated to the disease that makes it difficult to breathe (similar to the "dips" in Figure 1c or the third row in Figure 1d).
This can cause marker values to temporarily drop,\nand may be hard to distinguish from disease activity. We show that these factors can be arranged in\na hierarchy (population, subpopulation, and individual), but that not all levels of the hierarchy are\nobserved. Finally, the functional outcome is a rich target, and therefore more challenging to model\nthan scalar outcomes. In addition, the marker data is observed in continuous-time and is irregularly\nsampled, making commonly used discrete-time approaches to time series modeling (or approaches\nthat rely on imputation) not well suited to this domain.\nRelated work. The majority of predictive models in medicine explain variability in the target out-\ncome by conditioning on observed risk factors alone. However, these do not account for latent\nsources of variability such as those discussed above. Further, these models are typically cross-\nsectional\u2014they use features from data measured up until the current time to predict a clinical marker\nor outcome at a \ufb01xed point in the future. As an example, consider the mortality prediction model by\nLee et al. [7], where logistic regression is used to integrate features into a prediction about the prob-\nability of death within 30 days for a given patient. To predict the outcome at multiple time points,\ntypically separate models are trained. Moreover, these models use data from a \ufb01xed-size window,\nrather than a growing history.\nResearchers in the statistics and machine learning communities have proposed solutions that ad-\ndress a number of these limitations. Most related to our work is that by Rizopoulos [8], where the\nfocus is on making dynamical predictions about a time-to-event outcome (e.g.\ntime until death).\nTheir model updates predictions over time using all previously observed values of a longitudinally\nrecorded marker. 
Besides conditioning on observed factors, they account for latent heterogeneity across individuals by allowing for individual-specific adjustments to the population-level model; e.g. for a longitudinal marker, deviations from the population baseline are modeled using random effects by sampling individual-specific intercepts from a common distribution. Other closely related work by Proust-Lima et al. [9] tackles a similar problem as Rizopoulos, but addresses heterogeneity using a mixture model.
Another common approach to dynamical predictions is to use Markov models such as order-p autoregressive models (AR-p), HMMs, state space models, and dynamic Bayesian networks (see e.g. [10]). Although such models naturally make dynamic predictions using the full history by forward-filtering, they typically assume discrete, regularly-spaced observation times. Gaussian processes (GPs) are a commonly used alternative for handling continuous-time observations; see Roberts et al. [11] for a recent review of GP time series models. Since Gaussian processes are nonparametric generative models of functions, they naturally produce functional predictions dynamically by using the posterior predictive conditioned on the observed data. Mixtures of GPs have been applied to model heterogeneity in the covariance structure across time series (e.g. [12]); however, as noted in Roberts et al., appropriate mean functions are critical for accurate forecasts using GPs. In our work, an individual's trajectory is expressed as a GP with a highly structured mean comprising population, subpopulation, and individual-level components, where some components are observed and others require inference.
More broadly, multi-level models have been applied in many fields to model heterogeneous collections of units that are organized within a hierarchy [13].
For example, in predicting student grades over time, individuals within a school may have parameters sampled from the school-level model, and the school-level model parameters in turn may be sampled from a county-specific model. In our setting, the hierarchical structure (which individuals belong to the same subgroup) is not known a priori. Similar ideas are studied in multi-task learning, where relationships between distinct prediction tasks are used to encourage similar parameters. This has been applied to modeling trajectories by treating predictions at each time point as a separate task and enforcing similarity between sub-models close in time [14]. This approach is limited, however, in that it models a finite number of times. Others, more recently, have developed models for disease trajectories (see [15, 16] and references within), but these focus on retrospective analysis to discover disease etiology rather than dynamical prediction. Schulam et al. [16] incorporate differences in trajectories due to subtypes and individual-specific factors. We build upon this work here. Finally, recommender systems also share information across individuals with the aim of tailoring predictions (see e.g. [17, 18, 19]), but the task is otherwise distinct from ours.

Figure 1: Plots (a-c) show example marker trajectories. Plot (d) shows adjustments to a population and subpopulation fit (row 1). Row 2 makes an individual-specific long-term adjustment. Row 3 makes short-term structured noise adjustments. Plot (e) shows the proposed graphical model. Levels in the hierarchy are color-coded. Model parameters are enclosed in dashed circles. Observed random variables are shaded.

Contributions.
We propose a hierarchical model of disease activity trajectories that directly addresses common sources of heterogeneity, both latent and observed, in complex, chronic diseases using three levels: the population level, subpopulation level, and individual level. The model discovers the subpopulation structure automatically, and infers individual-level structure over time when making predictions. In addition, we include a Gaussian process as a model of structured noise, which is designed to explain away temporary sources of variability that are unrelated to disease activity. Together, these four components allow individual trajectories to be highly heterogeneous while simultaneously sharing statistical strength across observations at different "resolutions" of the data. When making predictions for a given individual, we use Bayesian inference to dynamically update our posterior belief over individual-specific parameters given the clinical history, and use the posterior predictive to produce a trajectory estimate. Finally, we evaluate our approach by developing a state-of-the-art trajectory prediction tool for lung disease in scleroderma. We train our model using a large, national dataset containing individuals with scleroderma tracked over 20 years and compare our predictions against alternative approaches. We find that our approach yields significant gains in predictive accuracy of disease activity trajectories.

2 Disease Trajectory Model
We describe a hierarchical model of an individual's clinical marker values. The graphical model is shown in Figure 1e. For each individual i, we use N_i to denote the number of observed markers. We denote each individual observation using y_ij and its measurement time using t_ij, where j ∈ {1, . . . , N_i}. We use y_i ∈ R^{N_i} and t_i ∈ R^{N_i} to denote all of individual i's marker values and measurement times respectively.
In the following discussion, Φ(t_ij) denotes a column vector containing a basis expansion of the time t_ij, and we use Φ(t_i) = [Φ(t_i1), . . . , Φ(t_iN_i)]^⊤ to denote the matrix containing the basis expansion of the points in t_i in each of its rows. We model the jth marker value for individual i as a normally distributed random variable with a mean assumed to be the sum of four terms: a population component, a subpopulation component, an individual component, and a structured noise component:

    y_ij ∼ N( Φ_p(t_ij)^⊤ Λ x_ip + Φ_z(t_ij)^⊤ β_{z_i} + Φ_ℓ(t_ij)^⊤ b_i + f_i(t_ij), σ² ).   (1)

The four terms are, in order: (A) the population component, (B) the subpopulation component, (C) the individual component, and (D) the structured noise component. They serve two purposes. First, they allow a number of different sources of variation to influence the observed marker value, which allows for heterogeneity both across and within individuals. Second, they share statistical strength across different subsets of observations. The population component shares strength across all observations.
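As a minimal sketch of how the four-part mean of Eq. 1 combines at a single time point (the function and argument names below are ours, not from the authors' implementation):

```python
import numpy as np

def trajectory_mean(phi_p, phi_z, phi_l, Lam, x_ip, beta_z, b_i, f_i):
    """Sum the four mean components of Eq. 1 at one time point.

    phi_p, phi_z, phi_l: basis expansions of the time t_ij
    Lam:    population features-to-coefficients map (d_p x q_p)
    x_ip:   baseline features; beta_z: subtype coefficients
    b_i:    individual coefficients; f_i: structured-noise value at t_ij
    """
    population    = phi_p @ Lam @ x_ip   # (A) driven by observed covariates
    subpopulation = phi_z @ beta_z       # (B) subtype trajectory
    individual    = phi_l @ b_i          # (C) long-term personal adjustment
    return population + subpopulation + individual + f_i  # (D) transient noise
```

The marker value y_ij would then be this mean plus N(0, σ²) observation noise.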
The subpopulation component shares strength across observations belonging to subgroups of individuals. The individual component shares strength across all observations belonging to the same individual. Finally, the structured noise shares information across observations belonging to the same individual that are measured at similar times. Predicting an individual's trajectory involves estimating her subtype and individual-specific parameters as new clinical data becomes available¹. We describe each of the components in detail below.
Population level. The population model predicts aspects of an individual's disease activity trajectory using observed baseline characteristics (e.g. gender and race), which are represented using the feature vector x_ip. This sub-model is shown within the orange box in Figure 1e. Here we assume that this component is a linear model where the coefficients are a function of the features x_ip ∈ R^{q_p}. The predicted value of the jth marker of individual i measured at time t_ij is shown in Eq. 1 (A), where Φ_p(t) ∈ R^{d_p} is a basis expansion of the observation time and Λ ∈ R^{d_p×q_p} is a matrix used as a linear map from an individual's covariates x_ip to coefficients ρ_i ∈ R^{d_p}. At this level, individuals with similar covariates will have similar coefficients. The matrix Λ is learned offline.
Subpopulation level.
We model an individual's subtype using a discrete-valued latent variable z_i ∈ {1, . . . , G}, where G is the number of subtypes. We associate each subtype with a unique disease activity trajectory represented using B-splines, where the number and location of the knots and the degree of the polynomial pieces are fixed prior to learning. These hyper-parameters determine a basis expansion Φ_z(t) ∈ R^{d_z} mapping a time t to the B-spline basis function values at that time. Trajectories for each subtype are parameterized by a vector of coefficients β_g ∈ R^{d_z} for g ∈ {1, . . . , G}, which are learned offline. Under subtype z_i, the predicted value of marker y_ij measured at time t_ij is shown in Eq. 1 (B). This component explains differences such as those observed between the trajectories in Figures 1a and 1b. In many cases, features at baseline may be predictive of subtype. For example, in scleroderma, the types of antibody an individual produces (i.e. the presence of certain proteins in the blood) are correlated with certain trajectories. We can improve predictive performance by conditioning on baseline covariates to infer the subtype. To do this, we use a multinomial logistic regression to define feature-dependent marginal probabilities: z_i | x_iz ∼ Mult(π_{1:G}(x_iz)), where π_g(x_iz) ∝ exp(w_g^⊤ x_iz). We denote the weights of the multinomial regression using w_{1:G}, where the weights of the first class are constrained to be 0 to ensure model identifiability. The remaining weights are learned offline.
Individual level. This level models deviations from the population and subpopulation models using parameters that are learned dynamically as the individual's clinical history grows.
Here, we parameterize the individual component using a linear model with basis expansion Φ_ℓ(t) ∈ R^{d_ℓ} and individual-specific coefficients b_i ∈ R^{d_ℓ}. An individual's coefficients are modeled as latent variables with marginal distribution b_i ∼ N(0, Σ_b). For individual i, the predicted value of marker y_ij measured at time t_ij is shown in Eq. 1 (C). This component can explain, for example, differences in overall health due to an unobserved characteristic such as chronic smoking, which may cause atypically lower lung function than what is predicted by the population and subpopulation components. Such an adjustment is illustrated across the first and second rows of Figure 1d.
Structured noise. Finally, the structured noise component f_i captures transient trends. For example, an infection may cause an individual's lung function to temporarily appear more restricted than it actually is, which may cause short-term trends like those shown in Figure 1c and the third row of Figure 1d. We treat f_i as a function-valued latent variable and model it using a Gaussian process with zero-valued mean function and Ornstein-Uhlenbeck (OU) covariance function:

    K_OU(t_1, t_2) = a² exp{ −ℓ⁻¹ |t_1 − t_2| }.

The amplitude a controls the magnitude of the structured noise that we expect to see, and the length-scale ℓ controls the length of time over which we expect these temporary trends to occur. The OU kernel is ideal for modeling such deviations: it is mean-reverting, and draws from the corresponding stochastic process are only first-order continuous, which eliminates long-range dependencies between deviations [20].
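The OU covariance above is a one-liner over a matrix of pairwise time gaps. A minimal sketch, using the illustrative hyper-parameter values a = 6 and ℓ = 2 quoted later in the experiments (the function name is ours):

```python
import numpy as np

def k_ou(t1, t2, a=6.0, ell=2.0):
    """Ornstein-Uhlenbeck covariance K_OU(t1, t2) = a^2 exp(-|t1 - t2| / ell).

    Accepts vectors of observation times and returns the full Gram matrix.
    Defaults are the domain-knowledge values from the experiments section.
    """
    t1 = np.asarray(t1, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return a**2 * np.exp(-np.abs(t1 - t2) / ell)
```

Note the exponential decay: observations two length-scales apart are essentially decorrelated, which is what makes the deviations mean-reverting and short-lived.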
Applications in other domains may require different kernel structures motivated by properties of the noise in the trajectories.

¹ The model focuses on predicting the long-term trajectory of an individual when left untreated. In many chronic conditions, as is the case for scleroderma, drugs only provide short-term relief (accounted for in our model by the individual-specific adjustments). If treatments that alter the long-term course are available and commonly prescribed, then these should be included within the model as an additional component that influences the trajectory.

2.1 Learning
Objective function. To learn the parameters of our model Θ = {Λ, w_{1:G}, β_{1:G}, Σ_b, a, ℓ, σ²}, we maximize the observed-data log-likelihood (i.e. the probability of each individual's marker values y_i given measurement times t_i and features {x_ip, x_iz}). This requires marginalizing over the latent variables {z_i, b_i, f_i} for each individual, which yields a mixture of multivariate normals:

    P(y_i | X_i, Θ) = Σ_{z_i=1}^{G} π_{z_i}(x_iz) N( y_i | Φ_p(t_i) Λ x_ip + Φ_z(t_i) β_{z_i}, K(t_i, t_i) ),   (2)

where K(t_1, t_2) = Φ_ℓ(t_1)^⊤ Σ_b Φ_ℓ(t_2) + K_OU(t_1, t_2) + σ² I(t_1 = t_2). The observed-data log-likelihood for all individuals is therefore L(Θ) = Σ_{i=1}^{M} log P(y_i | X_i, Θ). A more detailed derivation is provided in the supplement.
Optimizing the objective. To maximize the observed-data log-likelihood with respect to Θ, we partition the parameters into two subsets. The first subset, Θ_1 = {Σ_b, a, ℓ, σ²}, contains values that parameterize the covariance function K(t_1, t_2) above.
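Assembling the marginal covariance K(t, t) from Eq. 2, after b_i and f_i have been integrated out, reduces to three matrix terms. A sketch under the stated model (names and default hyper-parameter values are illustrative):

```python
import numpy as np

def marginal_cov(t, Phi_l, Sigma_b, a=6.0, ell=2.0, sigma2=1.0):
    """Marginal covariance K(t, t) of Eq. 2 after integrating out b_i and f_i:
    Phi_l Sigma_b Phi_l^T  (individual level)
    + K_OU(t, t)           (structured noise)
    + sigma^2 I            (observation noise).
    Defaults match the illustrative values from the experiments section.
    """
    t = np.asarray(t, dtype=float)
    K_ou = a**2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)
    return Phi_l @ Sigma_b @ Phi_l.T + K_ou + sigma2 * np.eye(len(t))
```

This matrix is what both the learning objective (Eq. 2) and the E-step (Eq. 3) evaluate the multivariate normal density against.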
As is often done when designing the kernel of a Gaussian process, we use a combination of domain knowledge to choose candidate values and model selection, with the observed-data log-likelihood as a criterion for choosing among candidates [20]. The second subset, Θ_2 = {Λ, w_{1:G}, β_{1:G}}, contains values that parameterize the mean of the multivariate normal distribution in Equation 2. We learn these parameters using expectation maximization (EM) to find a local maximum of the observed-data log-likelihood.
Expectation step. All parameters related to b_i and f_i are limited to the covariance kernel and are not optimized using EM. We therefore only need to consider the subtype indicators z_i as unobserved in the expectation step. Because z_i is discrete, its posterior is computed by normalizing the joint probability of z_i and y_i. Let π*_ig denote the posterior probability that individual i has subtype g ∈ {1, . . . , G}; then we have

    π*_ig ∝ π_g(x_iz) N( y_i | Φ_p(t_i) Λ x_ip + Φ_z(t_i) β_g, K(t_i, t_i) ).   (3)

Maximization step. In the maximization step, we optimize the marginal probability of the soft assignments under the multinomial logistic regression model with respect to w_{1:G} using gradient-based methods. To optimize the expected complete-data log-likelihood with respect to Λ and β_{1:G}, we note that the mean of the multivariate normal for each individual is a linear function of these parameters. Holding Λ fixed, we can therefore solve for β_{1:G} in closed form, and vice versa.
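The E-step soft assignments of Eq. 3 amount to normalizing per-subtype Gaussian log-likelihoods weighted by the softmax prior π_g(x_iz); working in log space avoids underflow for long marker histories. A minimal sketch (array names are hypothetical; `means[g]` stands for Φ_p(t_i) Λ x_ip + Φ_z(t_i) β_g):

```python
import numpy as np
from scipy.stats import multivariate_normal

def subtype_prior(x_iz, W):
    """pi_g(x_iz) ∝ exp(w_g^T x_iz); row 0 of W is pinned to zeros
    for identifiability, as in the text."""
    s = W @ x_iz
    s = np.exp(s - s.max())          # shift for numerical stability
    return s / s.sum()

def subtype_posterior(y_i, means, K, prior):
    """E-step soft assignments pi*_ig of Eq. 3, computed in log space."""
    logp = np.array([np.log(p) + multivariate_normal.logpdf(y_i, m, K)
                     for p, m in zip(prior, means)])
    logp -= logp.max()               # normalize in log space first
    w = np.exp(logp)
    return w / w.sum()
```

The same quantities are reused at prediction time to form the expected subtype coefficients (Eq. 7).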
We use a block coordinate ascent approach, alternating between solving for Λ and β_{1:G} until convergence. Because the expected complete-data log-likelihood is concave with respect to all parameters in Θ_2, each maximization step is guaranteed to converge. We provide additional details in the supplement.

2.2 Prediction
Our prediction ŷ(t'_i) for the value of the trajectory at time t'_i is the expectation of the marker y'_i under the posterior predictive conditioned on the observed markers y_i measured at times t_i thus far. This requires evaluating the following expression:

    ŷ(t'_i) = Σ_{z_i=1}^{G} ∫_{R^{d_ℓ}} ∫_{R^{N_i}} E[ y'_i | z_i, b_i, f_i, t'_i ] P( z_i, b_i, f_i | y_i, X_i, Θ ) df_i db_i   (4)
            = E*_{z_i, b_i, f_i}[ Φ_p(t'_i)^⊤ Λ x_ip + Φ_z(t'_i)^⊤ β_{z_i} + Φ_ℓ(t'_i)^⊤ b_i + f_i(t'_i) ]   (5)
            = Φ_p(t'_i)^⊤ Λ x_ip + Φ_z(t'_i)^⊤ E*_{z_i}[β_{z_i}] + Φ_ℓ(t'_i)^⊤ E*_{b_i}[b_i] + E*_{f_i}[f_i(t'_i)],   (6)

where E* denotes an expectation conditioned on y_i, X_i, Θ. The four terms of Eq. 6 are the population, subpopulation, individual, and structured noise predictions; the three expectations are the expected subtype coefficients β*_i (Eq. 7), the expected individual coefficients b*_i (Eq. 10), and the expected structured noise f*_i(t'_i) (Eq. 12). In moving from Eq. 4 to 5, we have written the integral as an expectation and substituted the inner expectation with the mean of the normal distribution in Eq. 1. From Eq. 5 to 6, we use linearity of expectation. Eqs. 7, 10, and 12 below show how the expectations in Eq. 6 are computed. An expanded version of these steps is provided in the supplement.

Figure 2: Plots (a) and (c) show dynamic predictions using the proposed model for two individuals. Red markers are unobserved. Blue shows the trajectory predicted using the most likely subtype, and green shows the second most likely. Plot (b) shows dynamic predictions using the B-spline GP baseline. Plot (d) shows predictions made using the proposed model without individual-specific adjustments.

Computing the population prediction is straightforward, as all quantities are observed. To compute the subpopulation prediction, we need to compute the marginal posterior over z_i, which we used in the expectation step above (Eq. 3).
The expected subtype coefficients are therefore

    β*_i ≜ Σ_{z_i=1}^{G} π*_{i z_i} β_{z_i}.   (7)

To compute the individual prediction, note that by conditioning on z_i, the integral over the likelihood with respect to f_i and the prior over b_i form the likelihood and prior of a Bayesian linear regression. Let K_f = K_OU(t_i, t_i) + σ² I; then the posterior over b_i conditioned on z_i is:

    P( b_i | z_i, y_i, X_i, Θ ) ∝ N( b_i | 0, Σ_b ) N( y_i | Φ_p(t_i) Λ x_ip + Φ_z(t_i) β_{z_i} + Φ_ℓ(t_i) b_i, K_f ).   (8)

Just as in Eq. 2, we have integrated over f_i, moving its effect from the mean of the normal distribution to the covariance. Because the prior over b_i is conjugate to the likelihood on the right side of Eq. 8, the posterior can be written in closed form as a normal distribution (see e.g. [10]). The mean of the left side of Eq. 8 is therefore

    [ Σ_b⁻¹ + Φ_ℓ(t_i)^⊤ K_f⁻¹ Φ_ℓ(t_i) ]⁻¹ [ Φ_ℓ(t_i)^⊤ K_f⁻¹ ( y_i − Φ_p(t_i) Λ x_ip − Φ_z(t_i) β_{z_i} ) ].   (9)

To compute the unconditional posterior mean of b_i, we take the expectation of Eq. 9 with respect to the posterior over z_i. Eq. 9 is linear in β_{z_i}, so we can directly replace β_{z_i} with its mean (Eq. 7):

    b*_i ≜ [ Σ_b⁻¹ + Φ_ℓ(t_i)^⊤ K_f⁻¹ Φ_ℓ(t_i) ]⁻¹ [ Φ_ℓ(t_i)^⊤ K_f⁻¹ ( y_i − Φ_p(t_i) Λ x_ip − Φ_z(t_i) β*_i ) ].   (10)

Finally, to compute the structured noise prediction, note that conditioned on z_i and b_i, the GP prior and marker likelihood (Eq. 1) form a standard GP regression (see e.g. [20]). The conditional posterior of f_i(t'_i) is therefore a GP with mean

    K_OU(t'_i, t_i) [ K_OU(t_i, t_i) + σ² I ]⁻¹ ( y_i − Φ_p(t_i) Λ x_ip − Φ_z(t_i) β_{z_i} − Φ_ℓ(t_i) b_i ).   (11)

To compute the unconditional posterior expectation of f_i(t'_i), we note that the expression above is linear in β_{z_i} and b_i, and so their expectations can be plugged in to obtain

    f*_i(t'_i) ≜ K_OU(t'_i, t_i) [ K_OU(t_i, t_i) + σ² I ]⁻¹ ( y_i − Φ_p(t_i) Λ x_ip − Φ_z(t_i) β*_i − Φ_ℓ(t_i) b*_i ).   (12)

3 Experiments
We demonstrate our approach by building a tool to predict the lung disease trajectories of individuals with scleroderma.
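The individual-level and structured-noise posterior means used for prediction (Eqs. 10 and 12) each reduce to a few linear solves. A minimal sketch, assuming the residuals and design matrices have already been built; all names are ours, not the authors':

```python
import numpy as np

def individual_posterior_mean(y, Phi_l, K_f, Sigma_b, mean_pop_sub):
    """Posterior mean b*_i of the individual coefficients (Eq. 10).

    K_f = K_OU(t, t) + sigma^2 I; mean_pop_sub is the population plus
    expected subpopulation mean evaluated at the observed times.
    """
    r = y - mean_pop_sub                       # residual after levels A + B
    A = np.linalg.inv(Sigma_b) + Phi_l.T @ np.linalg.solve(K_f, Phi_l)
    return np.linalg.solve(A, Phi_l.T @ np.linalg.solve(K_f, r))

def noise_posterior_mean(t_new, t, k_ou_fn, sigma2, resid):
    """Posterior mean f*_i(t') of the structured noise (Eq. 12).

    resid is y minus all three mean components (A + B + C) at times t;
    k_ou_fn(s, t) returns the OU Gram matrix between two time vectors.
    """
    K_nt = k_ou_fn(t_new, t)
    K_tt = k_ou_fn(t, t) + sigma2 * np.eye(len(t))
    return K_nt @ np.linalg.solve(K_tt, resid)
```

With a nearly flat prior Σ_b, Eq. 10 collapses to generalized least squares on the residual, which is a useful sanity check.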
Lung disease is currently the leading cause of death among scleroderma patients, and is notoriously difficult to treat because there are few predictors of decline and there is tremendous variability across individual trajectories [21]. Clinicians track lung severity using percent of predicted forced vital capacity (PFVC), which is expected to drop as the disease progresses. In addition, demographic variables and molecular test results are often available at baseline to aid prognoses. We train and validate our model using data from the Johns Hopkins Scleroderma Center patient registry, which is one of the largest in the world. To select individuals from the registry, we use the following criteria. First, we include individuals who were seen at the clinic within two years of their earliest scleroderma-related symptom. Second, we exclude all individuals with fewer than two PFVC measurements after their first visit. Finally, we exclude individuals who received a lung transplant. The dataset contains 672 individuals and a total of 4,992 PFVC measurements.

For the population model, we use constant functions (i.e. observed covariates adjust an individual's intercept). The population covariates ($\vec{x}_{ip}$) are gender, African American race, and indicators of ACA and Scl-70 antibodies, two proteins believed to be connected to scleroderma-related lung disease. Note that all features are binary. For the subpopulation B-splines, we set boundary knots at 0 and 25 years (the maximum observation time in our data set is 23 years), use two interior knots that divide the time period from 0 to 25 years into three equally spaced chunks, and use quadratics as the piecewise components. These B-spline hyperparameters (knots and polynomial degree) are also used for all baseline models. We select G = 9 subtypes using BIC. The covariates in the subtype marginal model ($\vec{x}_{iz}$) are the same as those used in the population model.
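The B-spline configuration described above (boundary knots at 0 and 25 years, two interior knots splitting the range into three equal pieces, quadratic components) can be sketched with SciPy as follows; the helper name `bspline_design` is illustrative, not from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 2                        # quadratic piecewise components
interior = [25.0 / 3, 50.0 / 3]   # two interior knots: equal thirds of [0, 25]
# Clamped knot vector: boundary knots repeated degree + 1 times.
knots = np.concatenate([[0.0] * (degree + 1), interior, [25.0] * (degree + 1)])
n_basis = len(knots) - degree - 1  # 5 basis functions for this configuration

def bspline_design(times):
    """Design matrix Phi(t) with one column per B-spline basis function."""
    cols = []
    for j in range(n_basis):
        coef = np.zeros(n_basis)
        coef[j] = 1.0                       # pick out the j-th basis function
        cols.append(BSpline(knots, coef, degree)(times))
    return np.column_stack(cols)

Phi = bspline_design(np.array([0.0, 5.0, 12.5, 23.0]))
```

With clamped knots, the basis forms a partition of unity on [0, 25], so each row of the design matrix sums to one; observation times up to 23 years fall safely inside the domain.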
For the individual model, we use linear functions. For the hyper-parameters $\Theta_1 = \{\Sigma_b, \alpha, \ell, \sigma^2\}$, we set $\Sigma_b$ to be a diagonal covariance matrix with entries $[16, 10^{-2}]$ along the diagonal, which correspond to intercept and slope variances respectively. Finally, we set $\alpha = 6$, $\ell = 2$, and $\sigma^2 = 1$ using domain knowledge; we expect transient deviations to last around 2 years and to change PFVC by around $\pm 6$ units.

Baselines. First, to compare against typical approaches used in clinical medicine that condition on baseline covariates only (e.g. [22]), we fit a regression model conditioned on all covariates included in $\vec{x}_{iz}$ above. The mean is parameterized using B-spline bases ($\Phi(t)$) as:

$$
\hat{y} \mid \vec{x}_{iz} = \Phi(t)^\top \Big( \vec{\beta}_0 + \sum_{x_i \,\in\, \vec{x}_{iz}} x_i \vec{\beta}_i + \sum_{x_i, x_j \,\in\, \text{pairs of } \vec{x}_{iz}} x_i x_j \vec{\beta}_{ij} \Big). \quad (13)
$$

The second baseline is similar to [8] and [23] and extends the first baseline by accounting for individual-specific heterogeneity. The model has a mean function identical to the first baseline and individualizes predictions using a GP with the same kernel as in Equation 2 (using hyper-parameters as above). Another natural approach is to explain heterogeneity by using a mixture model similar to [9]. However, a mixture model cannot adequately explain away individual-specific sources of variability that are unrelated to subtype, and therefore fails to recover subtypes that capture canonical trajectories (we discuss this in detail in the supplemental section). The recovered subtypes from the full model do not suffer from this issue.
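The baseline mean in Eq. (13) expands the binary covariates into an intercept, main effects, and all pairwise interactions; a minimal sketch of that feature expansion follows (the helper name is hypothetical).

```python
import numpy as np
from itertools import combinations

def covariate_features(x):
    """Intercept, main effects, and all pairwise interactions of the
    binary baseline covariates, mirroring the expansion in Eq. (13)."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, pairs])

# Hypothetical individual: female, not African American, ACA negative,
# Scl-70 positive (the four binary covariates used in the experiments).
feats = covariate_features([1, 0, 0, 1])
# Each feature then gets its own B-spline coefficient vector, so a full
# design row at time t couples Phi(t) with these features.
```

For four binary covariates this yields 1 + 4 + 6 = 11 features per individual, each paired with a spline coefficient vector in the regression.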
To make the comparison fair and to understand the extent to which the individual-specific component contributes towards personalizing predictions, we create a mixture model (Proposed w/ no personalization) where the subtypes are fixed to be the same as those in the full model and the remaining parameters are learned. Note that this version does not contain the individual-specific component.

Evaluation. We make predictions after one, two, and four years of follow-up. Errors are summarized within four disjoint time periods: (1, 2], (2, 4], (4, 8], and (8, 25] years.² To measure error, we use the absolute difference between the prediction and a smoothed version of the individual's observed trajectory. We estimate mean absolute error (MAE) using 10-fold CV at the level of individuals (i.e. all of an individual's data is held out), and test for statistically significant reductions in error using a one-sided, paired t-test. For all models, we use the MAP estimate of the individual's trajectory. In the models that include subtypes, this means that we choose the trajectory predicted by the most likely subtype under the posterior. Although this discards information from the posterior, in our experience clinicians find this choice to be more interpretable.

Qualitative results. In Figure 2 we present dynamically updated predictions for two patients (one per row; dynamic updates move left to right). Blue lines indicate the prediction under the most likely subtype and green lines indicate the prediction under the second most likely.
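The evaluation protocol described above (10-fold CV at the level of individuals, with a one-sided paired t-test on errors) can be sketched as follows; the function names are illustrative, not from the paper.

```python
import numpy as np
from scipy import stats

def individual_level_folds(ids, n_folds=10, seed=0):
    """Assign each individual (not each visit) to a fold, so that all of
    an individual's measurements are held out together."""
    rng = np.random.default_rng(seed)
    unique = rng.permutation(np.unique(ids))
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    return np.array([fold_of[pid] for pid in ids])

def paired_one_sided_test(errors_model, errors_baseline):
    """One-sided paired t-test that the model's errors are lower."""
    t, p_two = stats.ttest_rel(errors_model, errors_baseline)
    p_one = p_two / 2 if t < 0 else 1 - p_two / 2
    return t, p_one

# One row per PFVC measurement; ids repeat across an individual's visits.
ids = np.array([1, 1, 2, 2, 3, 3, 4, 5])
folds = individual_level_folds(ids, n_folds=3)
```

Grouping the folds by individual rather than by measurement prevents leakage between training and held-out data for the same patient, which would otherwise inflate accuracy.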
The \ufb01rst individual\n\n2After the eighth year, data becomes too sparse to further divide this time span.\n\n7\n\n\fModel\nB-spline with Baseline Feats.\nB-spline + GP\nProposed\nProposed w/ no personalization\n\nB-spline with Baseline Feats.\nB-spline + GP\nProposed\nProposed w/ no personalization\n\nB-spline with Baseline Feats.\nB-spline + GP\nProposed\nProposed w/ no personalization\n\nPredictions using 1 year of data\n(2, 4] % Im.\n12.73\n7.70\n\u22177.04\n7.12\n\n(1, 2] % Im.\n12.78\n5.49\n5.26\n6.17\nPredictions using 2 years of data\n\n8.6\n\n12.73\n5.88\n\u22175.48\n6.00\n\n6.8\n\nPredictions using 4 years of data\n\n(4, 8] % Im.\n12.40\n9.67\n10.17\n9.38\n\n(8, 25] % Im.\n12.14\n10.71\n12.12\n12.85\n\n12.40\n8.65\n\u22177.95\n8.12\n\n12.40\n6.00\n\u22175.14\n5.75\n\n8.1\n\n14.3\n\n12.14\n10.02\n9.53\n11.39\n\n12.14\n8.88\n\u22177.58\n9.16\n\n14.3\n\nTable 1: MAE of PFVC predictions for the two baselines and the proposed model. Bold numbers indicate best\nperformance across models (\u2217 is stat. signi\ufb01cant). \u201c% Im.\u201d reports percent improvement over next best.\n\n(Figure 2a) is a 50-year-old, white woman with Scl-70 antibodies, which are thought to be associated\nwith active lung disease. Within the \ufb01rst year, her disease seems stable, and the model predicts this\ncourse with 57% con\ufb01dence. After another year of data, the model shifts 21% of its belief to a\nrapidly declining trajectory; likely in part due to the sudden dip in year 2. We contrast this with the\nbehavior of the B-spline GP shown in Figure 2b, which has limited capacity to express individualized\nlong-term behavior. We see that the model does not adequately adjust in light of the downward trend\nbetween years one and two. To illustrate the value of including individual-speci\ufb01c adjustments, we\nnow turn to Figures 2c and 2d (which plot predictions made by the proposed model with and without\npersonalization respectively). 
This individual is a 60-year-old, white man who is Scl-70 negative, which makes declining lung function less likely. Both models use the same set of subtypes, but whereas the model without individual-specific adjustment does not consider the recovering subtype to be likely until after year two, the full model shifts the recovering subtype trajectory downward towards the man's initial PFVC value and identifies the correct trajectory using a single year of data.

Quantitative results. Table 1 reports MAE for the baselines and the proposed model. We note that after observing two or more years of data, our model's errors are smaller than those of the two baselines (and statistically significantly so in all but one comparison). Although the B-spline GP improves over the first baseline, these results suggest that both subpopulation and individual-specific components enable more accurate predictions of an individual's future course as more data are observed. Moreover, by comparing the proposed model with and without personalization, we see that subtypes alone are not sufficient and that individual-specific adjustments are critical. These improvements also have clinical significance. For example, individuals who drop by more than 10 PFVC are candidates for aggressive immunosuppressive therapy. Out of the 7.5% of individuals in our data who decline by more than 10 PFVC, our model predicts such a decline at twice the true-positive rate of the B-spline GP (31% vs. 17%) and with a lower false-positive rate (81% vs. 90%).

4 Conclusion

We have described a hierarchical model for making individualized predictions of disease activity trajectories that accounts for both latent and observed sources of heterogeneity. We empirically demonstrated that using all elements of the proposed hierarchy allows our model to dynamically personalize predictions and reduce error as more data about an individual is collected.
Although our analysis focused on scleroderma, our approach is more broadly applicable to other complex, heterogeneous diseases [1]. Examples of such diseases include asthma [3], autism [4], and COPD [5]. There are several promising directions for further developing the ideas presented here. First, we observed that predictions are less accurate early in the disease course, when little data is available to learn the individual-specific adjustments. To address this shortcoming, it may be possible to leverage time-dependent covariates in addition to the baseline covariates used here. Second, the quality of our predictions depends upon the allowed types of individual-specific adjustments encoded in the model. More sophisticated models of individual variation may further improve performance. Moreover, approaches for automatically learning the class of possible adjustments would make it possible to apply our approach to new diseases more quickly.

References

[1] J. Craig. Complex diseases: Research and applications. Nature Education, 1(1):184, 2008.

[2] J. Varga, C.P. Denton, and F.M. Wigley. Scleroderma: From Pathogenesis to Comprehensive Management. Springer Science & Business Media, 2012.

[3] J. Lötvall et al. Asthma endotypes: A new approach to classification of disease entities within the asthma syndrome. Journal of Allergy and Clinical Immunology, 127(2):355-360, 2011.

[4] L.D. Wiggins, D.L. Robins, L.B. Adamson, R. Bakeman, and C.C. Henrich. Support for a dimensional view of autism spectrum disorders in toddlers. Journal of Autism and Developmental Disorders, 42(2):191-200, 2012.

[5] P.J. Castaldi et al. Cluster analysis in the COPDGene study identifies subtypes of smokers with distinct patterns of airway disease and emphysema. Thorax, 2014.

[6] S. Saria and A. Goldenberg. Subtyping: What it is and its role in precision medicine. IEEE Intelligent Systems, 30, 2015.

[7] D.S.
Lee, P.C. Austin, J.L. Rouleau, P.P. Liu, D. Naimark, and J.V. Tu. Predicting mortality among patients hospitalized for heart failure: Derivation and validation of a clinical model. JAMA, 290(19):2581-2587, 2003.

[8] D. Rizopoulos. Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to-event data. Biometrics, 67(3):819-829, 2011.

[9] C. Proust-Lima et al. Joint latent class models for longitudinal and time-to-event data: A review. Statistical Methods in Medical Research, 23(1):74-90, 2014.

[10] K.P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[11] S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550, 2013.

[12] J.Q. Shi, R. Murray-Smith, and D.M. Titterington. Hierarchical Gaussian process mixtures for regression. Statistics and Computing, 15(1):31-41, 2005.

[13] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

[14] H. Wang et al. High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer's disease progression prediction. In Advances in Neural Information Processing Systems, pages 1277-1285, 2012.

[15] J. Ross and J. Dy. Nonparametric mixture of Gaussian processes with constraints. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1346-1354, 2013.

[16] P.F. Schulam, F.M. Wigley, and S. Saria. Clustering longitudinal clinical marker trajectories from electronic health data: Applications to phenotyping and endotype discovery. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[17] B.M. Marlin.
Modeling user rating profiles for collaborative filtering. In Advances in Neural Information Processing Systems, 2003.

[18] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734-749, 2005.

[19] D. Sontag, K. Collins-Thompson, P.N. Bennett, R.W. White, S. Dumais, and B. Billerbeck. Probabilistic models for personalizing web search. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pages 433-442. ACM, 2012.

[20] C.E. Rasmussen and C.K. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[21] Y. Allanore et al. Systemic sclerosis. Nature Reviews Disease Primers, 2015.

[22] D. Khanna et al. Clinical course of lung physiology in patients with scleroderma and interstitial lung disease: Analysis of the Scleroderma Lung Study placebo group. Arthritis & Rheumatism, 63(10):3078-3085, 2011.

[23] J.Q. Shi, B. Wang, E.J. Will, and R.M. West. Mixed-effects Gaussian process functional regression models with application to dose-response curve prediction. Statistics in Medicine, 31(26):3165-3177, 2012.