{"title": "Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors", "book": "Advances in Neural Information Processing Systems", "page_first": 1845, "page_last": 1853, "abstract": "An accurate model of patient survival time can help in the treatment and care of cancer patients. The common practice of providing survival time estimates based only on population averages for the site and stage of cancer ignores many important individual differences among patients. In this paper, we propose a local regression method for learning patient-specific survival time distribution based on patient attributes such as blood tests and clinical assessments. When tested on a cohort of more than 2000 cancer patients, our method gives survival time predictions that are much more accurate than popular survival analysis models such as the Cox and Aalen regression models. Our results also show that using patient-specific attributes can reduce the prediction error on survival time by as much as 20% when compared to using cancer site and stage only.", "full_text": "Learning Patient-Speci\ufb01c Cancer Survival\n\nDistributions as a Sequence of Dependent Regressors\n\nChun-Nam Yu, Russell Greiner, Hsiu-Chin Lin\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nEdmonton, AB T6G 2E8\n\n{chunnam,rgreiner,hsiuchin}@ualberta.ca\n\nVickie Baracos\n\nDepartment of Oncology\n\nUniversity of Alberta\n\nEdmonton, AB T6G 1Z2\n\nvickie.baracos@ualberta.ca\n\nAbstract\n\nAn accurate model of patient survival time can help in the treatment and care\nof cancer patients. The common practice of providing survival time estimates\nbased only on population averages for the site and stage of cancer ignores many\nimportant individual differences among patients. In this paper, we propose a local\nregression method for learning patient-speci\ufb01c survival time distribution based\non patient attributes such as blood tests and clinical assessments. 
When tested on a cohort of more than 2000 cancer patients, our method gives survival time predictions that are much more accurate than popular survival analysis models such as the Cox and Aalen regression models. Our results also show that using patient-specific attributes can reduce the prediction error on survival time by as much as 20% when compared to using cancer site and stage only.

1 Introduction

When diagnosed with cancer, most patients ask about their prognosis: “how long will I live”, and “what is the success rate of each treatment option”. Many doctors provide patients with statistics on cancer survival based only on the site and stage of the tumor. Commonly used statistics include the 5-year survival rate and median survival time; e.g., a doctor can tell a specific patient with early-stage lung cancer that s/he has a 50% 5-year survival rate.

In general, today’s cancer survival rates and median survival times are estimated from a large group of cancer patients; while these estimates do apply to the population in general, they are not particularly accurate for individual patients, as they do not include patient-specific information such as age and general health conditions. While doctors can make adjustments to their survival time predictions based on these individual differences, it is better to incorporate these important factors explicitly in the prognostic models, e.g., the clinical information, such as blood tests and performance status assessments [1], that doctors collect during the diagnosis and treatment of cancer. These data reveal important information about the state of the immune system and organ functioning of the patient, and therefore are very useful for predicting how well a patient will respond to treatments and how long s/he will survive. 
In this work, we develop machine learning techniques that incorporate this wealth of healthcare information to learn a more accurate prognostic model based on patient-specific attributes. With improved prognostic models, cancer patients and their families can make more informed decisions on treatments, lifestyle changes, and sometimes end-of-life care.

In survival analysis [2], the Cox proportional hazards model [3] and other parametric survival distributions have long been used to fit the survival time of a population. Researchers and clinicians usually apply these models to compare the survival time of two populations or to test for significant risk factors affecting survival; n.b., these models are not designed for the task of predicting survival time for individual patients. Also, as these models work with the hazard function instead of the survival function (see Section 2), they might not give well-calibrated predictions on survival rates for individuals. In this work we propose a new method, multi-task logistic regression (MTLR), to learn patient-specific survival distributions. MTLR directly models the survival function by combining multiple local logistic regression models in a dependent manner. This allows it to handle censored observations and the time-varying effects of features naturally. Compared to survival regression methods such as the Cox and Aalen regression models, MTLR gives significantly more accurate predictions on survival rates over several datasets, including a large cohort of more than 2000 cancer patients. MTLR also reduces the prediction error on survival time by 20% when compared to the common practice of using the median survival time based on cancer site and stage.

Section 2 surveys basic survival analysis and related work. Section 3 introduces our method for learning patient-specific survival distributions. 
Section 4 evaluates our learned models on a large cohort of cancer patients, and also provides additional experiments on two other datasets.

2 Survival Time Prediction for Cancer Patients

In most regression problems, we know both the covariates and “outcome” values for all individuals. By contrast, it is typical not to know many of the outcome values in survival data. In many medical studies, the event of interest (death, disease recurrence) might not have occurred for many individuals within the fixed period of study. In addition, other subjects could move out of town or decide to drop out at any time. Here we know only the date of the final visit, which provides a lower bound on the survival time. We refer to the time recorded as the “event time”, whether it is the true survival time or just the time of the last visit (censoring time). Such datasets are considered censored.

Survival analysis provides many tools for modeling the survival time T of a population, such as a group of stage-3 lung cancer patients. A basic quantity of interest is the survival function S(t) = P(T ≥ t), which is the probability that an individual within the population will survive longer than time t. Given the survival times of a set of individuals, we can plot the proportion of surviving individuals against time, as a way to visualize S(t). The plot of this empirical survival distribution is called the Kaplan-Meier curve [4] (Figure 1(left)).

This is closely related to the hazard function λ(t), which describes the instantaneous rate of failure at time t:

λ(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t)/Δt,   and   S(t) = exp(−∫_0^t λ(u) du).

2.1 Regression Models in Survival Analysis

One of the most well-known regression models in survival analysis is Cox’s proportional hazards model [3]. 
It assumes the hazard function λ(t) depends multiplicatively on a set of features x:

λ(t | x) = λ_0(t) exp(θ · x).

It is called the proportional hazards model because the hazard rates of two individuals with features x_1 and x_2 differ by a ratio exp(θ · (x_1 − x_2)). The function λ_0(t), called the baseline hazard, is usually left unspecified in Cox regression. The regression coefficients θ are estimated by maximizing a partial likelihood objective, which depends only on the relative ordering of the survival times of individuals, not on their actual values. Cox regression is mostly used for identifying important risk factors associated with survival in clinical studies. It is typically not used to predict survival time, since the hazard function is incomplete without the baseline hazard λ_0. Although we can fit a non-parametric survival function for λ_0(t) after the coefficients of Cox regression are determined [2], this requires a cumbersome 2-step procedure. Another weakness of the Cox model is its proportional hazards assumption, which restricts the effect of each feature on survival to be constant over time.

There are alternatives to the Cox model that avoid the proportional hazards restriction, including the Aalen additive hazards model [5] and other time-varying extensions to the Cox model [6]. The Aalen linear hazards model assumes the hazard function has the form

λ(t | x) = θ(t) · x.   (1)

Figure 1: (Left) Kaplan-Meier curve: each point (x, y) means proportion y of the patients are alive at time x. 
A vertical line separates those who have died from those who survive at t = 20 months. (Middle) Example binary encoding for patient 1 (uncensored) with survival time 21.3 months and for patient 2 (censored), with last visit time at 21.3 months. (Right) Example discrete survival function for a single patient predicted by MTLR.

While there are now many estimation techniques, goodness-of-fit tests, and hypothesis tests for these survival regression models, they are rarely evaluated on the task of predicting the survival time of individual patients. Moreover, it is not easy to choose between the various assumptions imposed by these models, such as whether the hazard rate should be a multiplicative or additive function of the features. In this paper we will test our MTLR method, which directly models the survival function, against Cox regression and Aalen regression as representatives of these survival analysis models.

In machine learning, there are a few recently proposed regression techniques for survival prediction [7, 8, 9, 10]. These methods attempt to optimize specific loss functions or performance measures, which usually involves modifying the common regression loss functions to handle censored data. For example, Shivaswamy et al. [7] modified the support vector regression (SVR) loss function from

max{|y − θ · x| − ε, 0}   to   max{(y − θ · x) − ε, 0},

where y is the time of censoring and ε is a tolerance parameter. In this way any prediction θ · x above the censoring time y is deemed consistent with the observation and is not penalized. This class of direct regression methods usually gives very good results on the particular loss functions they optimize over, but could fail if the loss function is non-convex or difficult to optimize. 
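To make the modified loss concrete, the one-sided censored loss described above can be sketched as follows (our own illustrative Python, not code from [7]; the function name and scalar interface are hypothetical):

```python
def censored_svr_loss(pred, y, censored, eps=1.0):
    # Epsilon-insensitive SVR loss adapted for censored targets.
    # For an uncensored patient the usual symmetric loss applies.
    # For a censored patient, y is only a lower bound on the survival
    # time, so any prediction above y is consistent with the observation
    # and incurs zero loss: only under-predictions are penalized.
    if censored:
        return max((y - pred) - eps, 0.0)
    return max(abs(y - pred) - eps, 0.0)
```

For example, predicting 30 months for a patient censored at 20 months incurs no loss under this objective, while predicting 10 months is penalized just as for an uncensored patient.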
Moreover, these methods only predict a single survival time value (a real number) without an associated confidence on the prediction, which is a serious drawback in clinical applications.

Our MTLR model below is closely related to local regression models [11] and varying coefficient models [12] in statistics. Hastie and Tibshirani [12] described a very general class of regression models that allow the coefficients to change with another set of variables called “effect modifiers”; they also discussed an application of their model to overcome the proportional hazards assumption in Cox models. While we focus on predicting survival time, they instead focused on evaluating the time-varying effect of prognostic factors and worked with the rank-based partial likelihood objective.

3 Survival Distribution Modeling via a Sequence of Dependent Regressors

Consider the simpler classification task of predicting whether an individual will survive for more than t months. A common approach for this classification task is the logistic regression model [13], where we model the probability of surviving more than t months as:

P_θ(T ≥ t | x) = (1 + exp(θ · x + b))^{−1}.

The parameter vector θ describes how the features x affect the chance of survival, with threshold b. This task corresponds to a specific time point on the Kaplan-Meier curve, which attempts to discriminate those who survive against those who have died, based on the features x (Figure 1(left)). Equivalently, the logistic regression model can be seen as modeling the individual survival probabilities of cancer patients at the time snapshot t.

Taking this idea one step further, consider modeling the probability of survival of patients at each of a vector of time points τ = (t1, t2, . . .
, tm), e.g., τ could be the 60 monthly intervals from 1 month up to 60 months. We can set up a series of logistic regression models, one for each of these time points:

P_{θi}(T ≥ ti | x) = (1 + exp(θi · x + bi))^{−1},   1 ≤ i ≤ m,   (2)

where θi and bi are time-specific parameter vectors and thresholds. The input features x stay the same for all these classification tasks, but the binary labels yi = [T ≥ ti] can change depending on the threshold ti. This particular setup allows us to answer queries about the survival probability of individual patients at each of the time snapshots {ti}, getting close to our goal of modeling a personal survival time distribution for individual patients. The use of time-specific parameter vectors naturally allows us to capture the effect of time-varying covariates, similar to many dynamic regression models [14, 12].

However, the outputs of these logistic regression models are not independent, as a death event at or before time ti implies death at all subsequent time points tj for all j > i. MTLR enforces the dependency of the outputs by predicting the survival status of a patient at each of the time snapshots ti jointly instead of independently. We encode the survival time s of a patient as a binary sequence y = (y1, y2, . . .
, ym), where yi ∈ {0, 1} denotes the survival status of the patient at time ti, so that yi = 0 (no death event yet) for all i with ti < s, and yi = 1 (death) for all i with ti ≥ s (see Figure 1(middle)). We denote such an encoding of the survival time s as y(s), and let yi(s) be the value at its ith position. Here there are m + 1 possible legal sequences of the form (0, 0, . . . , 1, 1, . . . , 1), including the sequence of all ‘0’s and the sequence of all ‘1’s. The probability of observing the survival status sequence y = (y1, y2, . . . , ym) can be represented by the following generalization of the logistic regression model:

P_Θ(Y = (y1, y2, . . . , ym) | x) = exp(Σ_{i=1}^m yi(θi · x + bi)) / Σ_{k=0}^m exp(f_Θ(x, k)),

where Θ = (θ1, . . . , θm), and f_Θ(x, k) = Σ_{i=k+1}^m (θi · x + bi) for 0 ≤ k ≤ m is the score, before taking the logistic transform, of the sequence with the event occurring in the interval [tk, tk+1), with the boundary case f_Θ(x, m) = 0 being the score for the sequence of all ‘0’s. This is similar to the objective of conditional random fields [15] for sequence labeling, where the labels at each node are scored and predicted jointly.

Therefore the log likelihood of a set of uncensored patients with survival times s1, s2, . . . , sn and feature vectors x1, x2, . . .
, xn is

Σ_{i=1}^n [ Σ_{j=1}^m yj(si)(θj · xi + bj) − log Σ_{k=0}^m exp f_Θ(xi, k) ].

Instead of directly maximizing this log likelihood, we solve the following optimization problem:

min_Θ  (C1/2) Σ_{j=1}^m ‖θj‖² + (C2/2) Σ_{j=1}^{m−1} ‖θj+1 − θj‖² − Σ_{i=1}^n [ Σ_{j=1}^m yj(si)(θj · xi + bj) − log Σ_{k=0}^m exp f_Θ(xi, k) ].   (3)

The first regularizer, over ‖θj‖², bounds the norm of the parameter vectors to prevent overfitting. The second regularizer, ‖θj+1 − θj‖², ensures the parameters vary smoothly across consecutive time points, and is especially important for controlling the capacity of the model when the time points become dense. The regularization constants C1 and C2, which control the amount of smoothing in the model, can be estimated via cross-validation. As the above optimization problem is convex and differentiable, optimization algorithms such as Newton’s method or quasi-Newton methods can be applied to solve it efficiently. Since we model the survival distribution as a series of dependent prediction tasks, we call this model multi-task logistic regression (MTLR). Figure 1(right) shows an example survival distribution predicted by MTLR for a test patient.

3.1 Handling Censored Data

Our multi-task logistic regression model can handle censoring naturally by marginalizing over the unobserved variables in a survival status sequence (y1, y2, . . . , ym). 
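The sequence scores and the resulting discrete distribution can be sketched numerically as follows (an illustrative NumPy rendering of f_Θ(x, k) and the generalized logistic model, with our own variable names; this is not the authors' implementation, which fits Θ by Newton or quasi-Newton methods on Eq (3)):

```python
import numpy as np

def sequence_scores(Theta, b, x):
    # f(x, k) for k = 0..m: the score of the legal label sequence whose
    # death event falls in [t_k, t_{k+1}); f(x, m) = 0 corresponds to the
    # all-zero sequence.  Theta is an (m, d) array of time-specific weight
    # vectors theta_i, and b an (m,) array of thresholds b_i.
    node = Theta @ x + b                       # theta_i . x + b_i, i = 1..m
    # f(x, k) = sum_{i=k+1}^{m} of the node scores: right-to-left cumsum
    tail_sums = np.cumsum(node[::-1])[::-1]
    return np.concatenate([tail_sums, [0.0]])  # length m + 1

def mtlr_event_prob(Theta, b, x, k):
    # P(event in [t_k, t_{k+1}) | x): a softmax over the m + 1 legal sequences
    f = sequence_scores(Theta, b, x)
    f = f - f.max()                            # shift for numerical stability
    p = np.exp(f)
    return p[k] / p.sum()
```

Summing mtlr_event_prob over k = 0..m recovers 1, so the model is a proper discrete distribution over event intervals, from which the survival function in Figure 1(right) follows by cumulation.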
For example, suppose a patient with features x is censored at time sc, and tj is the closest time point after sc. Then all the sequences y = (y1, y2, . . . , ym) with yi = 0 for i < j are consistent with this censored observation (see Figure 1(middle)). The likelihood of this censored patient is

P_Θ(T ≥ tj | x) = Σ_{k=j}^m exp(f_Θ(x, k)) / Σ_{k=0}^m exp(f_Θ(x, k)),   (4)

where the numerator is the sum over all consistent sequences. While the sum in the numerator makes the log-likelihood non-concave, we can still learn the parameters effectively using EM or gradient descent with suitable initialization.

Table 1: Left: number of cancer patients for each site and stage in the cancer registry dataset. Right: features used in learning survival distributions.

site \ stage         4     3     2     1
Bronchus & Lung     390   186    44    61
Colorectal          545   233   157    15
Head and Neck       206    14     8     6
Esophagus            63     1     1     0
Pancreas            134     0     3     1
Stomach             128     1     0     0
Other Digestive      77     0     1     0
Misc                123     3     0     1

basic: age, sex, weight gain/loss, BMI, cancer site, cancer stage
general wellbeing: no appetite, nausea, sore mouth, taste funny, constipation, pain, dental problem, dry mouth, vomit, diarrhea, performance status
blood test: granulocytes, LDH-serum, HGB, lymphocytes, platelet, WBC count, calcium-serum, creatinine, albumin

In summary, the proposed MTLR model holds several advantages over classical regression models in survival analysis for survival time prediction. First, it directly models the more intuitive survival function rather than the hazard function (the conditional rate of failure/death), avoiding the difficulties of choosing between different forms of hazards. 
Second, by modeling the survival distribution as the joint output of a sequence of dependent local regressors, we can capture the time-varying effects of features and handle censored data easily and naturally. Third, we will see that our model can give more accurate predictions on survival and better calibrated probabilities (see Section 4), which are important in clinical applications.

Our goal here is not to replace these tried-and-tested models in survival analysis, which are very effective for hypothesis testing and prognostic factor discovery. Instead, we want a tool that can accurately and effectively predict an individual’s survival time.

3.2 Relations to Other Machine Learning Models

The objective of our MTLR model is of the same form as a general CRF [15], but there are several important differences from typical applications of CRFs for sequence labeling. First, MTLR has no transition features (edge potentials) (Eq (3)); instead, the dependencies between labels in the sequence are enforced implicitly by only allowing a linear number (m + 1) of legal labelings. Second, in most sequence labeling applications of CRFs, the weights for the node potentials are shared across nodes to share statistical strength and improve generalization. Instead, MTLR uses a different weight vector θi at each node to capture the time-varying effects of the input features. Unlike in typical sequence labeling problems, the sequence construction of our model might be better viewed as a device to obtain a flexible discrete approximation of the survival distribution of individual patients.

Our approach can also be seen as an instance of multi-task learning [16], where the prediction of individual survival status at each time snapshot tj can be regarded as a separate task. The smoothing penalty ‖θj − θj+1‖² is used by many multi-task regularizers to encourage weight sharing between related tasks. 
However, unlike typical multi-task learning problems, in our model the outputs of the different tasks are dependent, in order to satisfy the monotonicity of a survival function.

4 Experiments

Our main dataset comes from the Alberta Cancer Registry, obtained through the Cross Cancer Institute at the University of Alberta, and includes 2402 cancer patients with tumors at different sites. About one third of the patients have censored survival times. Table 1 shows the groupings of cancer patients in the dataset and the patient-specific attributes used for learning survival distributions. All these measurements are taken before the first chemotherapy.

In all experiments we report five-fold cross validation (5CV) results, where MTLR’s regularization parameters C1 and C2 are selected by another 5CV within the training fold, based on log likelihood. We pick the set of time points τ in these experiments to be the 100 points from the 1st percentile up to the 100th percentile of the event time (true survival time or censoring time) over all patients. Since all the datasets contain censored data, we first train an MTLR model using the event time (survival/censoring) as regression targets (no hidden variables). The trained model is then used as the initial weights in the EM procedure for Eq (4) to train the final model.

The Cox proportional hazards model is trained using the survival package in R, followed by the fitting of the baseline hazard λ_0(t) using the Kalbfleisch-Prentice estimator [2]. The Aalen linear hazards model is trained using the timereg package. Both the Cox and the Aalen models are trained using the same set of 25 features. 
As a baseline for this cancer registry dataset, we also provide a prediction based on the median survival time and survival probabilities of the subgroup of patients with cancer at a specific site and at a specific stage, estimated from the training fold.

4.1 Survival Rate Prediction

Our first evaluation focuses on the classification accuracy and calibration of predicted survival probabilities at different time thresholds. In addition to giving a binary prediction on whether a patient would survive beyond a certain time period, say 2 years, it is very useful to give an associated confidence of the prediction in terms of probabilities (survival rate). We use mean square error (MSE), also called the Brier score in this setting [17], to measure the quality of probability predictions. Previous work [18] showed that MSE can be decomposed into two components, one measuring calibration and one measuring the discriminative power (i.e., classification accuracy) of the probability predictions.

Table 2 shows the classification accuracy and MSE of the predicted probabilities of the different models at 5, 12, and 22 months, which correspond to the 25% lower quantile, median, and 75% upper quantile of the survival time of all the cancer patients in the dataset. Our MTLR models produce predictions on survival status and survival probability that are much more accurate than the Cox and Aalen regression models. This shows the advantage of directly modeling the survival function instead of going through the hazard function when predicting survival probabilities. The Cox model and the Aalen model have classification accuracies and MSE that are similar to one another on this dataset. 
All regression models (MTLR, Cox, Aalen) beat the baseline prediction using the median survival time based on cancer stage and site only, indicating that there is a substantial advantage in employing extra clinical information to improve the survival time predictions given to cancer patients.

4.2 Visualization

Figure 2 visualizes the MTLR, Cox and Aalen regression models for two patients on a test fold. Patient 1 is a short survivor who lives for only 3 months from diagnosis, while patient 2 is a long survivor whose survival time is censored at 46 months. All three regression models (correctly) give a poor prognosis for patient 1 and a good prognosis for patient 2, but there are a few interesting differences when we examine the plots. The MTLR model is able to produce smooth survival curves of different shapes for the two patients (one convex, the other slightly concave), while the Cox model always predicts survival curves of similar shapes because of the proportional hazards assumption. Indeed, it is well known that the survival curves of two individuals never cross under a Cox model. For the Aalen model, we observe that the survival function is not (locally) monotonically decreasing. This is a consequence of the linear hazards assumption (Eq (1)), which allows the hazard to become negative and therefore the survival function to increase. This problem is less common when predicting survival curves at the population level, but can be more frequent for individual survival distribution predictions.

4.3 Survival Time Predictions Optimizing Different Loss Functions

Our third evaluation of the predicted survival distributions involves applying them to make predictions that minimize different clinically-relevant loss functions. For example, if the patient is interested in knowing whether s/he has weeks, months, or years to live, then measuring errors in terms of the logarithm of the survival time can be appropriate. 
In this case we can measure the loss using the absolute error (AE) over the log survival time,

l_{AE-log}(p, t) = |log p − log t|,   (5)

where p and t are the predicted and true survival times respectively.

In other scenarios, we might be more concerned about the difference between the predicted and true survival times. For example, as the cost of hospital stays and medication scales linearly with the survival time, the AE loss on the survival time could be appropriate, i.e.,

l_AE(p, t) = |p − t|.   (6)

We also consider an error measure called the relative absolute error (RAE):

l_RAE(p, t) = min{|(p − t)/p|, 1},   (7)

which is essentially AE scaled by the predicted survival time p, since p is known at prediction time in clinical applications. The loss is truncated at 1 to prevent large penalizations for small predicted survival times.

Table 2: Classification accuracy and MSE of survival probability predictions on the cancer registry dataset (standard error of 5CV shown in brackets). Bold numbers indicate significance with a paired t-test at the p = 0.05 level (this applies to all subsequent tables).

Accuracy    5 month       12 month      22 month
MTLR        86.5 (0.7)    76.1 (0.9)    74.5 (1.3)
Cox         74.5 (0.9)    59.3 (1.1)    62.8 (3.5)
Aalen       73.3 (1.2)    61.0 (1.7)    59.6 (3.6)
Baseline    69.2 (0.3)    56.2 (2.0)    57.0 (1.4)

MSE         5 month        12 month       22 month
MTLR        0.101 (0.005)  0.158 (0.004)  0.170 (0.007)
Cox         0.196 (0.009)  0.270 (0.008)  0.232 (0.016)
Aalen       0.198 (0.004)  0.278 (0.008)  0.288 (0.020)
Baseline    0.227 (0.012)  0.299 (0.011)  0.243 (0.012)

Figure 2: Predicted survival function for two patients in the test set: MTLR (left), Cox (center), Aalen (right). Patient 1 lives for 3 months while patient 2 has survival time censored at 46 months.

Knowing that the average RAE of a predictor is 0.3 means we can expect the true survival time to be within 30% of the predicted time.

Given any of the loss models l above, we can make a point prediction h_l(x) of the survival time for a patient with features x using the survival distribution P_Θ estimated by our MTLR model:

h_l(x) = argmin_{p ∈ {t1, . . . , tm}} Σ_{k=0}^m l(p, tk) P_Θ(Y = y(tk) | x),   (8)

where y(tk) is the survival time encoding defined in Section 3.

Table 3 shows the results of optimizing the three proposed loss functions using the individual survival distribution learned with MTLR, against other methods. For this particular evaluation, we also implemented the censored support vector regression (CSVR) proposed in [7, 8]. We train two CSVR models, one using the survival time and the other using the logarithm of the survival time as regression targets, which correspond to minimizing the AE and AE-log loss functions. For RAE we report the best result from linear and log-scale CSVR in the table, since this non-convex loss is not minimized by either of them. As we do not know the true survival time for censored patients, we adopt the approach of not penalizing a prediction p for a patient with censoring time t if p > t, i.e., l(p, t) = 0 for the loss functions defined in Eqs (5) to (7) above. This is exactly the same censored training loss used in CSVR. Note that it is undesirable to test on uncensored patients only, as the survival time distributions are very different for censored and uncensored patients. 
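The decision rule in Eq (8) amounts to a one-dimensional search over the grid of time points for the candidate minimizing the expected loss under the predicted distribution. A simplified sketch (our own illustrative Python; the function names are hypothetical, and probs[k] stands in for P_Θ(Y = y(t_k) | x), ignoring the boundary all-zero sequence):

```python
import numpy as np

def point_prediction(times, probs, loss):
    # Eq (8), simplified: restrict the candidate prediction p to the grid
    # t_1..t_m and pick the one minimizing the expected loss under the
    # predicted discrete survival distribution probs.
    expected = [sum(loss(p, t) * q for t, q in zip(times, probs))
                for p in times]
    return times[int(np.argmin(expected))]

# Loss models from Eqs (6) and (7)
def l_ae(p, t):
    return abs(p - t)

def l_rae(p, t):
    return min(abs((p - t) / p), 1.0)
```

Different loss models can give different point predictions from the same distribution; minimizing expected AE, for instance, yields a weighted-median-like prediction rather than the mean, which a long right tail would pull upward.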
For the Cox and Aalen models we report results using predictions based on the median, as optimizing for different loss functions using Eq (8) with the distributions predicted by the Cox and Aalen models gives inferior results.

The results in Table 3 show that, although CSVR has the advantage of optimizing the loss function directly during training, our MTLR model is still able to make predictions that improve on CSVR, sometimes significantly.

Table 3: Results on optimizing different loss functions on the cancer registry dataset

          MTLR          Cox            Aalen          CSVR          Baseline
AE        9.58 (0.11)   10.76 (0.12)   19.06 (2.04)   9.96 (0.32)   11.73 (0.62)
AE-log    0.56 (0.02)   0.61 (0.02)    0.76 (0.06)    0.56 (0.02)   0.70 (0.05)
RAE       0.40 (0.01)   0.44 (0.02)    0.44 (0.02)    0.44 (0.03)   0.53 (0.02)

Table 4: (Top) MSE of survival probability predictions on SUPPORT2 (left) and RHC (right). (Bottom) Results on optimizing different loss functions: SUPPORT2 (left), RHC (right)

SUPPORT2   14 day          58 day          252 day
MTLR       0.102 (0.002)   0.162 (0.002)   0.189 (0.004)
Cox        0.152 (0.003)   0.213 (0.004)   0.199 (0.006)
Aalen      0.141 (0.003)   0.195 (0.004)   0.195 (0.008)

RHC        8 day           27 day          163 day
MTLR       0.121 (0.002)   0.175 (0.005)   0.201 (0.004)
Cox        0.180 (0.005)   0.239 (0.004)   0.223 (0.004)
Aalen      0.176 (0.004)   0.229 (0.006)   0.221 (0.006)

SUPPORT2   AE             AE-log        RAE
MTLR       11.74 (0.35)   1.19 (0.03)   0.49 (0.01)
Cox        14.08 (0.49)   1.35 (0.03)   0.53 (0.01)
Aalen      14.61 (0.66)   1.28 (0.04)   0.54 (0.01)
CSVR       11.62 (0.15)   1.18 (0.02)   0.58 (0.01)

RHC        AE            AE-log        RAE
MTLR       2.90 (0.09)   1.07 (0.02)   0.53 (0.01)
Cox        3.08 (0.09)   1.10 (0.02)   0.71 (0.01)
Aalen      3.55 (0.85)   1.10 (0.06)   0.65 (0.01)
CSVR       2.96 (0.07)   1.09 (0.02)   0.65 (0.01)

Moreover, MTLR is able to make survival time predictions with improved RAE, which is difficult for CSVR to optimize directly. MTLR also beats the Cox and Aalen models on all three loss functions. When compared to the baseline of predicting the median survival time by cancer site and stage, MTLR is able to employ extra clinical features to reduce the absolute error on survival time from 11.73 months to 9.58 months, and the error ratio between true and predicted survival time from being off by exp(0.70) ≈ 2.01 times to exp(0.56) ≈ 1.75 times. Both error measures are reduced by about 20%.

4.4 Evaluation on Other Datasets

As additional evaluations, we also tested our model on the SUPPORT2 and RHC datasets (available at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets), which record the survival time for patients hospitalized with severe illnesses. SUPPORT2 contains over 9000 patients (32% censored), while RHC contains over 5000 patients (35% censored).

Table 4 (top) shows the MSE on survival probability prediction over the SUPPORT2 and RHC datasets (we omit classification accuracy due to lack of space). The thresholds are again chosen at the 25% lower quantile, median, and 75% upper quantile of the population survival time. The MTLR model again produces significantly more accurate probability predictions when compared against the Cox and Aalen regression models. Table 4 (bottom) shows the results on optimizing different loss functions for SUPPORT2 and RHC. The results are consistent with the cancer registry dataset, with MTLR beating the Cox and Aalen regressions while tying with CSVR on AE and AE-log.

5 Conclusions

We have presented a new method for learning patient-specific survival distributions. Experiments on a large cohort of cancer patients show that our model gives much more accurate predictions of survival rates when compared to the Cox or Aalen survival regression models. Our results demonstrate that incorporating patient-specific features can significantly improve the accuracy of survival prediction over just using cancer site and stage, with prediction errors reduced by as much as 20%.

We plan to extend our model to an online system that can update survival predictions with new measurements. Our current data come from measurements taken when cancers are first diagnosed; it would be useful to be able to update survival predictions for patients incrementally, based on new blood tests or physician's assessments.

Acknowledgments

This work is supported by the Alberta Innovates Centre for Machine Learning (AICML) and NSERC. We would also like to thank the Alberta Cancer Registry for the datasets used in this study.

References

[1] M.M. Oken, R.H. Creech, D.C. Tormey, J. Horton, T.E. Davis, E.T. McFadden, and P.P. Carbone. Toxicity and response criteria of the Eastern Cooperative Oncology Group. American Journal of Clinical Oncology, 5(6):649, 1982.

[2] J.D. Kalbfleisch and R.L. Prentice. The Statistical Analysis of Failure Time Data. Wiley, New York, 1980.

[3] D.R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological), 34(2):187–220, 1972.

[4] E.L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481, 1958.

[5] O.O. Aalen. A linear regression model for the analysis of life times. Statistics in Medicine, 8(8):907–925, 1989.

[6] T. Martinussen and T.H. Scheike. Dynamic Regression Models for Survival Data. Springer Verlag, 2006.

[7] P.K. Shivaswamy, W. Chu, and M. Jansche. A support vector approach to censored targets.
In ICDM 2007, pages 655–660. IEEE, 2008.

[8] A. Khosla, Y. Cao, C.C.Y. Lin, H.K. Chiu, J. Hu, and H. Lee. An integrated machine learning approach to stroke prediction. In KDD, pages 183–192. ACM, 2010.

[9] V. Raykar, H. Steck, B. Krishnapuram, C. Dehing-Oberije, and P. Lambin. On ranking in survival analysis: Bounds on the concordance index. NIPS, 20, 2007.

[10] G.C. Cawley, N.L.C. Talbot, G.J. Janacek, and M.W. Peck. Sparse Bayesian kernel survival analysis for modeling the growth domain of microbial pathogens. IEEE Transactions on Neural Networks, 17(2):471–481, 2006.

[11] W.S. Cleveland and S.J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.

[12] T. Hastie and R. Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society, Series B (Methodological), 55(4):757–796, 1993.

[13] B. Efron. Logistic regression, survival analysis, and the Kaplan-Meier curve. Journal of the American Statistical Association, 83(402):414–425, 1988.

[14] D. Gamerman. Dynamic Bayesian models for survival data. Applied Statistics, 40(1):63–79, 1991.

[15] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.

[16] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[17] G.W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

[18] M.H. DeGroot and S.E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society,
Series D (The Statistician), 32(1):12–22, 1983.