{"title": "Robust Feature-Sample Linear Discriminant Analysis for Brain Disorders Diagnosis", "book": "Advances in Neural Information Processing Systems", "page_first": 658, "page_last": 666, "abstract": "A wide spectrum of discriminative methods is increasingly used  in diverse applications for classification or regression tasks. However, many existing discriminative methods assume that the input data is nearly noise-free, which limits their applications to solve real-world problems. Particularly for disease diagnosis, the data acquired by the neuroimaging devices are always prone to different sources of noise. Robust discriminative models are somewhat scarce and only a few attempts have been made to make them robust against noise or outliers. These methods focus on detecting either the sample-outliers or feature-noises. Moreover, they usually use unsupervised de-noising procedures, or separately de-noise the training and the testing data. All these factors may induce biases in the learning process, and thus limit its performance. In this paper, we propose a classification method based on the least-squares formulation of linear discriminant analysis, which simultaneously detects the sample-outliers and feature-noises. The proposed method operates under a semi-supervised setting, in which both labeled training and unlabeled testing data are incorporated to form the intrinsic geometry of the sample space. Therefore, the violating samples or feature values are identified  as sample-outliers or feature-noises, respectively. We test our algorithm on one synthetic and two brain neurodegenerative databases (particularly for Parkinson's disease and Alzheimer's disease). 
The results demonstrate that our method outperforms all baseline and state-of-the-art methods, in terms of both accuracy and the area under the ROC curve.", "full_text": "Robust Feature-Sample Linear Discriminant Analysis\n\nfor Brain Disorders Diagnosis\n\nEhsan Adeli-Mosabbeb, Kim-Han Thung, Le An, Feng Shi, Dinggang Shen, for the ADNI\u2217\n\nDepartment of Radiology and BRIC\n\nUniversity of North Carolina at Chapel Hill, NC, 27599, USA\n\n{eadeli,khthung,le_an,fengshi,dgshen}@med.unc.edu\n\nAbstract\n\nA wide spectrum of discriminative methods is increasingly used in diverse appli-\ncations for classi\ufb01cation or regression tasks. However, many existing discrimi-\nnative methods assume that the input data is nearly noise-free, which limits their\napplications to solve real-world problems. Particularly for disease diagnosis, the\ndata acquired by the neuroimaging devices are always prone to different sources\nof noise. Robust discriminative models are somewhat scarce and only a few at-\ntempts have been made to make them robust against noise or outliers. These\nmethods focus on detecting either the sample-outliers or feature-noises. More-\nover, they usually use unsupervised de-noising procedures, or separately de-noise\nthe training and the testing data. All these factors may induce biases in the learn-\ning process, and thus limit its performance. In this paper, we propose a classi\ufb01-\ncation method based on the least-squares formulation of linear discriminant anal-\nysis, which simultaneously detects the sample-outliers and feature-noises. The\nproposed method operates under a semi-supervised setting, in which both labeled\ntraining and unlabeled testing data are incorporated to form the intrinsic geometry\nof the sample space. Therefore, the violating samples or feature values are iden-\nti\ufb01ed as sample-outliers or feature-noises, respectively. 
We test our algorithm on\none synthetic and two brain neurodegenerative databases (particularly for Parkin-\nson\u2019s disease and Alzheimer\u2019s disease). The results demonstrate that our method\noutperforms all baseline and state-of-the-art methods, in terms of both accuracy\nand the area under the ROC curve.\n\n1\n\nIntroduction\n\nDiscriminative methods pursue a direct mapping from the input to the output space for a classi-\n\ufb01cation or a regression task. As an example, linear discriminant analysis (LDA) aims to \ufb01nd the\nmapping that reduces the input dimensionality, while preserving the most class discriminatory in-\nformation. Discriminative methods usually achieve good classi\ufb01cation results compared to the gen-\nerative models, when there are enough number of training samples. But they are limited when there\nare small number of labeled data, as well as when the data is noisy. Various efforts have been made\nto add robustness to these methods. For instance, [17] and [9] proposed robust Fisher/linear discrim-\ninant analysis methods, and [19] introduced a worst-case LDA, by minimizing the upper bound of\nthe LDA cost function. These methods are all robust to sample-outliers. On the other hand, some\nmethods were proposed to deal with the intra-sample-outliers (or feature-noises), such as [12, 15].\n\u2217Parts of the data used in preparation of this article were obtained from the Alzheimer\u2019s Disease Neuroimag-\ning Initiative (ADNI) database (http://adni.loni.ucla.edu). The investigators within the ADNI con-\ntributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or\nwriting of this paper. 
A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/howtoapply/ADNIAcknowledgementList.pdf.\n\nIn many previous works, the training and the testing data are de-noised separately. This might induce a bias or inconsistency in the whole learning process. Besides, for many real-world applications, it is cumbersome to acquire enough training samples to perform a proper discriminative analysis. Hence, we propose to take advantage of the available unlabeled testing data to build a more robust classifier. To this end, we introduce a semi-supervised discriminative classification model, which, unlike previous works, jointly estimates the noise model (both sample-outliers and feature-noises) on the whole labeled training and unlabeled testing data and simultaneously builds a discriminative model upon the de-noised training data.\n\nIn this paper, we introduce a novel classification model based on LDA, which is robust against both sample-outliers and feature-noises, and is hence called robust feature-sample linear discriminant analysis (RFS-LDA). LDA finds the mapping between the sample space and the label space through a linear transformation matrix, maximizing the so-called Fisher discriminant ratio [17]. In practice, the major drawback of the original LDA is the small sample size problem, which arises when the number of available training samples is less than the dimensionality of the feature space [18]. A reformulation of LDA based on the reduced-rank least-squares problem (known as LS-LDA) [10] tackles this problem. 
LS-LDA finds the mapping $\\boldsymbol{\\beta} \\in \\mathbb{R}^{l \\times d}$ by solving the following problem$^1$:\n\n$$\\min_{\\boldsymbol{\\beta}} \\; \\|\\mathbf{H}(\\mathbf{Y}_{tr} - \\boldsymbol{\\beta}\\mathbf{X}_{tr})\\|_F^2, \\qquad (1)$$\n\nwhere $\\mathbf{Y}_{tr} \\in \\mathbb{R}^{l \\times N_{tr}}$ is a binary class label indicator matrix, for $l$ different classes (or labels), and $\\mathbf{X}_{tr} \\in \\mathbb{R}^{d \\times N_{tr}}$ is the matrix containing $N_{tr}$ $d$-dimensional training samples. $\\mathbf{H}$ is a normalization factor defined as $\\mathbf{H} = (\\mathbf{Y}_{tr}\\mathbf{Y}_{tr}^\\top)^{-1/2}$ that compensates for the different number of samples in each class [10]. As a result, the mapping $\\boldsymbol{\\beta}$ is a reduced-rank transformation matrix [10, 15], which could be used to project a test sample $\\mathbf{x}_{tst} \\in \\mathbb{R}^{d \\times 1}$ onto an $l$-dimensional space. The class label could therefore be simply determined using a k-NN strategy.\n\nTo make LDA robust against noisy data, Fidler et al. [12] proposed to construct a basis, which contains complete discriminative information for classification. In the testing phase, the estimated basis identifies the outliers in samples (images in their case) and then is used to calculate the coefficients using a subsampling approach. On the other hand, Huang et al. [15] proposed a general formulation for robust regression (RR) and classification (robust LDA or RLDA). In the training stage, they de-noise the feature values using a strategy similar to robust principal component analysis (RPCA) [7] and build the above LS-LDA model using the de-noised data. In the testing stage, they de-noise the data by performing a locally compact representation of the testing samples from the de-noised training data. This separate de-noising procedure cannot effectively form the underlying geometry of the sample space to de-noise the data. Huang et al. [15] only account for feature-noise by imposing a sparse noise model constraint on the feature matrix. On the other hand, the data fitting term in (1) is vulnerable to large sample-outliers. 
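As a point of reference for the least-squares baseline in (1), the following is a minimal sketch of a full-rank, ridge-regularised least-squares mapping followed by 1-NN in the projected space. It is an illustration under assumptions, not the implementation of [10]: the function names, the ridge weight gamma, the omission of the reduced-rank constraint and the toy data are all hypothetical choices made here. The normaliser H is computed only to show that, for one-hot labels, it reduces to a diagonal matrix of inverse square-root class counts.

```python
import numpy as np


def class_normaliser(Y_tr):
    # H = (Y_tr Y_tr^T)^(-1/2). For one-hot labels Y_tr Y_tr^T is diagonal and
    # holds the class counts, so H is diag(1 / sqrt(count_c)).
    counts = np.diag(Y_tr @ Y_tr.T)
    return np.diag(1.0 / np.sqrt(counts))


def fit_ls_mapping(X_tr, Y_tr, gamma=1e-3):
    # Ridge-regularised least squares: beta = Y X^T (X X^T + gamma I)^(-1).
    # gamma is an assumed small ridge weight; the reduced-rank constraint of
    # LS-LDA is not modelled in this sketch.
    d = X_tr.shape[0]
    gram = X_tr @ X_tr.T + gamma * np.eye(d)
    return np.linalg.solve(gram, X_tr @ Y_tr.T).T  # shape (l, d)


def predict_1nn(beta, X_tr, labels_tr, X_tst):
    # Project both sets with beta, then give each test sample the label of
    # its nearest projected training sample (k-NN with k = 1).
    Z_tr, Z_tst = beta @ X_tr, beta @ X_tst
    dist2 = ((Z_tst[:, :, None] - Z_tr[:, None, :]) ** 2).sum(axis=0)
    return labels_tr[dist2.argmin(axis=1)]


# Toy data (hypothetical): two well-separated Gaussian classes in 5 dimensions.
rng = np.random.default_rng(0)
d, n_tr, n_tst = 5, 40, 10


def sample(n, sign):
    # Samples are stored as columns, matching the d x N convention of the text.
    return sign * 3.0 + rng.standard_normal((d, n))


X_tr = np.hstack([sample(n_tr, +1), sample(n_tr, -1)])
X_tst = np.hstack([sample(n_tst, +1), sample(n_tst, -1)])
labels_tr = np.repeat([0, 1], n_tr)
labels_tst = np.repeat([0, 1], n_tst)
Y_tr = np.eye(2)[:, labels_tr]  # l x N_tr one-hot indicator matrix

H = class_normaliser(Y_tr)
beta = fit_ls_mapping(X_tr, Y_tr)
accuracy = (predict_1nn(beta, X_tr, labels_tr, X_tst) == labels_tst).mean()
```

On clean, well-separated data such a squared-loss fit is adequate; the concern raised in the text is what happens to it once gross sample-outliers enter the training set.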
Recently, in robust statistics, it has been found that $\\ell_1$ loss functions are able to make more reliable estimations [2] than $\\ell_2$ least-squares fitting functions. This has been adopted in many applications, including robust face recognition [28] and robust dictionary learning [22]. Reformulating the objective in (1) using this idea yields the following problem:\n\n$$\\min_{\\boldsymbol{\\beta}} \\; \\|\\mathbf{H}(\\mathbf{Y}_{tr} - \\boldsymbol{\\beta}\\mathbf{X}_{tr})\\|_1. \\qquad (2)$$\n\nWe incorporate this fitting function in our formulation to deal with the sample-outliers by iteratively re-weighting each single sample, while simultaneously de-noising the data from feature-noises. This is done through a semi-supervised setting to take advantage of all labeled and unlabeled data to build the structure of the sample space more robustly. Semi-supervised learning [8, 34] has long been of great interest in different fields, because it can make use of unlabeled or poorly labeled data. For instance, Joulin and Bach [16] introduced a convex relaxation and used the model in different semi-supervised learning scenarios. In another work, Cai et al. [5] proposed a semi-supervised discriminant analysis, where the separation between different classes is maximized using the labeled data points, while the unlabeled data points estimate the structure of the data. In contrast, we incorporate the unlabeled testing data to form the intrinsic geometry of the sample space and de-noise the data, whilst building the discriminative model.\n\n$^1$Bold capital letters denote matrices (e.g., $\\mathbf{D}$). All non-bold letters denote scalar variables. $d_{ij}$ is the scalar in row $i$ and column $j$ of $\\mathbf{D}$. $\\langle \\mathbf{d}_1, \\mathbf{d}_2 \\rangle$ denotes the inner product between $\\mathbf{d}_1$ and $\\mathbf{d}_2$. $\\|\\mathbf{d}\\|_2^2$ and $\\|\\mathbf{d}\\|_1$ represent the squared Euclidean norm and the $\\ell_1$ norm of $\\mathbf{d}$, respectively. 
$\\|\\mathbf{D}\\|_F^2 = \\mathrm{tr}(\\mathbf{D}^\\top\\mathbf{D}) = \\sum_{ij} d_{ij}^2$ and $\\|\\mathbf{D}\\|_*$ designate the squared Frobenius norm and the nuclear norm (sum of singular values) of $\\mathbf{D}$, respectively.\n\n[Figure 1 (diagram): the data matrix $\\mathbf{X} = [\\mathbf{X}_{tr}\\ \\mathbf{X}_{tst}] \\in \\mathbb{R}^{d \\times N}$ is written as the sum of $\\mathbf{D} = [\\mathbf{D}_{tr}\\ \\mathbf{D}_{tst}] \\in \\mathbb{R}^{d \\times N}$ and $\\mathbf{E} \\in \\mathbb{R}^{d \\times N}$; the mapping $\\boldsymbol{\\beta}$ links $\\mathbf{D}_{tr}$ to $\\mathbf{Y}_{tr} \\in \\mathbb{R}^{l \\times N_{tr}}$.]\n\nFigure 1: Outline of the proposed method: The original data matrix, $\\mathbf{X}$, is composed of both labeled training and unlabeled testing data. Our method decomposes this matrix into a de-noised data matrix, $\\mathbf{D}$, and an error matrix, $\\mathbf{E}$, to account for feature-noises. 
Simultaneously, we learn a mapping from the de-noised training samples in $\\mathbf{D}$ ($\\mathbf{D}_{tr}$) through a robust $\\ell_1$ fitting function, dealing with the sample-outliers. The same learned mapping on the testing data, $\\mathbf{D}_{tst}$, leads to the test labels.\n\nWe apply our method to the diagnosis of neurodegenerative brain disorders. The term neurodegenerative disease is an umbrella term for debilitating and incurable conditions related to progressive degeneration or death of the cells in the brain nervous system. Although neurodegenerative diseases manifest with diverse pathological features, the cellular level processes resemble similar structures. For instance, Parkinson's disease (PD) mainly affects the basal ganglia region and the substantia nigra sub-region of the brain, leading to a decline in the generation of a chemical messenger, dopamine. Lack of dopamine leads to loss of the ability to control body movements, along with some non-motor problems (e.g., depression, anxiety) [35]. In Alzheimer's disease (AD), deposits of tiny protein plaques lead to brain damage and progressive loss of memory [26]. These diseases are often incurable and thus, early diagnosis and treatment are crucial to slow down the progression of the disease in its initial stages. In this study, we use two popular databases: PPMI and ADNI. The former aims at investigating PD and its related disorders, while the latter is designed for diagnosing AD and its prodromal stage, known as mild cognitive impairment (MCI).\n\nContributions: The contributions of this paper are multi-fold: (1) We propose an approach to deal with the sample-outliers and feature-noises simultaneously, and build a robust discriminative classification model. 
The sample-outliers are penalized through an (cid:96)1 \ufb01tting function,\nby re-weighing the samples based on their prediction power, while discarding the feature-noises.\n(2) Our proposed model operates under a semi-supervised setting, where the whole data (labeled\ntraining and unlabeled testing samples) are incorporated to build the intrinsic geometry of the sam-\nple space, which leads to better de-noising the data. (3) We further select the most discriminative\nfeatures for the learning process through regularizing the weights matrix with an (cid:96)1 norm. This is\nspeci\ufb01cally of great interest for the neurodegenerative disease diagnosis, where the features from\ndifferent regions of the brain are extracted, but not all the regions are associated with a certain dis-\nease. Therefore, the most discriminative regions in the brain that utmost affect the disease would be\nidenti\ufb01ed, leading to a more reliable diagnosis model.\n\n2 Robust Feature-Sample Linear Discriminant Analysis (RFS-LDA)\n\nLet\u2019s assume we have Ntr training and Ntst testing samples, each with a d-dimensional feature\nvector, which leads to a set of N = Ntr + Ntst total samples. Let X \u2208 Rd\u00d7N denote the set of all\nsamples (both training and testing), in which each column indicates a single sample, and yi \u2208 R1\u00d7N\ntheir corresponding ith labels. In general, with l different labels, we can de\ufb01ne Y \u2208 Rl\u00d7N . Thus,\nX and Y are composed by stacking up the training and testing data as: X = [Xtr Xtst] and\nY = [Ytr Ytst]. Our goal is to determine the labels of the test samples, Ytst \u2208 Rl\u00d7Ntst.\nFormulation: An illustration of the proposed method is depicted in Fig 1. First, all the samples\n(labeled or unlabeled) are arranged into a matrix, X. We are interested in de-noising this matrix.\nFollowing [14, 21], this could be done by assuming that X can be spanned on a low-rank subspace\nand therefore should be rank-de\ufb01cient. 
This assumption supports the fact that samples from the same classes should be more correlated [14, 15]. Therefore, the original matrix $\\mathbf{X}$ is decomposed into two counterparts, $\\mathbf{D}$ and $\\mathbf{E}$, which represent the de-noised data matrix and the error matrix, respectively, similar to RPCA [7]. The de-noised data matrix shall hold the low-rank assumption and the error matrix is considered to be sparse. But this process of de-noising does not incorporate the label information and is therefore unsupervised. Nevertheless, note that we also seek a mapping between the de-noised training samples and their respective labels. So, matrix $\\mathbf{D}$ should be spanned on a low-rank subspace, which also leads to a good classification model of its sub-matrix, $\\mathbf{D}_{tr}$.\n\nTo ensure the rank-deficiency of the matrix $\\mathbf{D}$, as in many previous works [7, 14, 21], we approximate the rank function using the nuclear norm (the sum of the singular values of the matrix). The noise is modeled using the $\\ell_1$ norm of the matrix, which ensures a sparse noise model on the feature values. Accordingly, the objective function for RFS-LDA under a semi-supervised setting would be:\n\n$$\\min_{\\boldsymbol{\\beta}, \\mathbf{D}, \\hat{\\mathbf{D}}, \\mathbf{E}} \\; \\frac{\\eta}{2}\\|\\mathbf{H}(\\mathbf{Y}_{tr} - \\boldsymbol{\\beta}\\hat{\\mathbf{D}})\\|_1 + \\|\\mathbf{D}\\|_* + \\lambda_1\\|\\mathbf{E}\\|_1 + \\lambda_2 R(\\boldsymbol{\\beta}), \\quad \\text{s.t.} \\;\\; \\mathbf{X} = \\mathbf{D} + \\mathbf{E}, \\;\\; \\hat{\\mathbf{D}} = [\\mathbf{D}_{tr}; \\mathbf{1}^\\top], \\qquad (3)$$\n\nwhere the first term is the $\\ell_1$ regression model introduced in (2). This term only operates on the de-noised training samples from matrix $\\mathbf{D}$, with a row of all 1s added to it, to ensure an appropriate linear classification model. The second and the third terms together with the first constraint are similar to the RPCA formulation [7]. 
They de-noise the labeled training and unlabeled testing data together. In combination with the first term, we ensure that the de-noised data also provides a favorable regression/classification model. The last term is a regularization on the learned mapping coefficients to ensure the coefficients do not get trivial or unexpectedly large values. The parameters $\\eta$, $\\lambda_1$ and $\\lambda_2$ are constant regularization parameters, which are discussed in more detail later.\n\nThe regularization on the coefficients could be posed as a simple norm of the $\\boldsymbol{\\beta}$ matrix. But in many applications like ours (disease diagnosis), many of the features in the feature vectors are redundant. In practice, features from different brain regions are often extracted, but not all the regions contribute to a certain disease. Therefore, it is desirable to determine which features (regions) are the most relevant and the most discriminative to use. Following [11, 26, 28], we are looking for a sparse set of weights that incorporates the fewest and the most discriminative features. We propose a regularization on the weights matrix as a combination of the $\\ell_1$ and Frobenius norms:\n\n$$R(\\boldsymbol{\\beta}) = \\|\\boldsymbol{\\beta}\\|_1 + \\gamma\\|\\boldsymbol{\\beta}\\|_F. \\qquad (4)$$\n\nEvidently, the solution to the objective function in (3) is not easy to achieve, since the first term contains a quadratic term and minimization of the $\\ell_1$ fitting function is not straightforward (because of its non-differentiability). To this end, we formalize the solution with a strategy similar to iteratively re-weighted least squares (IRLS) [2]. The $\\ell_1$ minimization problem is approximated by a conventional $\\ell_2$ least-squares problem, in which each of the samples in the $\\hat{\\mathbf{D}}$ matrix is weighted with the inverse of its regression residual. 
Therefore, the new problem would be:\n\n$$\\min_{\\boldsymbol{\\beta}, \\mathbf{D}, \\hat{\\mathbf{D}}, \\mathbf{E}} \\; \\frac{\\eta}{2}\\|\\mathbf{H}(\\mathbf{Y}_{tr} - \\boldsymbol{\\beta}\\hat{\\mathbf{D}})\\hat{\\boldsymbol{\\alpha}}\\|_F^2 + \\|\\mathbf{D}\\|_* + \\lambda_1\\|\\mathbf{E}\\|_1 + \\lambda_2 R(\\boldsymbol{\\beta}), \\quad \\text{s.t.} \\;\\; \\mathbf{X} = \\mathbf{D} + \\mathbf{E}, \\;\\; \\hat{\\mathbf{D}} = [\\mathbf{D}_{tr}; \\mathbf{1}^\\top], \\qquad (5)$$\n\nwhere $\\hat{\\boldsymbol{\\alpha}}$ is a diagonal matrix, the $i$th diagonal element of which is the $i$th sample's weight:\n\n$$\\hat{\\alpha}_{ii} = 1/\\sqrt{(\\mathbf{y}_i - \\boldsymbol{\\beta}\\hat{\\mathbf{d}}_i)^2 + \\delta}, \\qquad \\hat{\\alpha}_{ij} = 0, \\quad \\forall\\, i, j \\in \\{0, \\ldots, N_{tr} - 1\\},\\ i \\neq j, \\qquad (6)$$\n\nwhere $\\delta$ is a very small positive number (equal to 0.0001 in our experiments). In the next subsection, we introduce an algorithm to solve this optimization problem.\n\nOur work is closely related to the RR and RLDA formulations in [15], where the authors impose a low-rank assumption on the training data feature values and an $\\ell_1$ assumption on the noise model. The discriminant model is learned similarly to LS-LDA, as illustrated in (1), while a sample-weighting strategy is employed to achieve a more robust model. On the other hand, our model operates under a semi-supervised learning setting, where both the labeled training and the unlabeled testing samples are de-noised simultaneously. Therefore, the geometry of the sample space is better modeled on the low-dimensional subspace, by interweaving both labeled training and unlabeled testing data. 
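The re-weighting idea behind (5) and (6) can be illustrated on a toy regression. The sketch below is a generic IRLS loop for an l1 fit, not the authors' code: it follows the common IRLS convention in which the weight 1/sqrt(r_i^2 + delta) multiplies the squared residual of sample i, it uses a plain relative-change stopping rule, and the helper names, the tiny ridge term and the toy line y = 2x + 1 with five gross outliers are all assumptions made for illustration.

```python
import numpy as np


def weighted_ls(D_hat, y, w, ridge=1e-8):
    # Solve min_beta sum_i w_i * (y_i - beta . d_i)^2 via the normal equations.
    # ridge is an assumed tiny stabiliser, not a parameter from the paper.
    DW = D_hat * w  # scales column i of D_hat by w_i
    p = D_hat.shape[0]
    return np.linalg.solve(DW @ D_hat.T + ridge * np.eye(p), DW @ y)


def irls_l1(D_hat, y, delta=1e-4, max_iter=100, tol=1e-3):
    # Approximate the l1 fit by repeatedly solving re-weighted least squares.
    beta = weighted_ls(D_hat, y, np.ones(y.size))  # ordinary least-squares start
    for _ in range(max_iter):
        residual = y - beta @ D_hat
        w = 1.0 / np.sqrt(residual ** 2 + delta)  # small residual -> large weight
        beta_new = weighted_ls(D_hat, y, w)
        change = np.linalg.norm(beta_new - beta)
        beta = beta_new
        if change <= tol * np.linalg.norm(beta):
            break
    return beta


# Toy problem (hypothetical): samples in columns, last row of ones for the bias.
x = np.linspace(0.0, 1.0, 50)
D_hat = np.vstack([x, np.ones_like(x)])
beta_true = np.array([2.0, 1.0])
y = beta_true @ D_hat
y[::10] += 20.0  # five gross sample-outliers

beta_ls = weighted_ls(D_hat, y, np.ones(y.size))
beta_l1 = irls_l1(D_hat, y)
err_ls = np.linalg.norm(beta_ls - beta_true)
err_l1 = np.linalg.norm(beta_l1 - beta_true)
```

In this toy setting the ordinary least-squares estimate is pulled towards the outliers, whereas the re-weighted estimate stays close to the generating coefficients, because samples with large residuals receive small weights.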
In addition, our model further selects the most discriminative features to learn the regression/classification model, by regularizing the mapping weights matrix and enforcing a sparsity condition on it.\n\nAlgorithm 1 RFS-LDA optimization algorithm.\nInput: $\\mathbf{X} = [\\mathbf{X}_{tr}\\ \\mathbf{X}_{tst}]$, $\\mathbf{Y}_{tr}$, parameters $\\eta$, $\\lambda_1$, $\\lambda_2$, $\\rho$ and $\\gamma$.\nInitialization: $\\mathbf{D}^0 = [\\mathbf{X}_{tr}\\ \\mathbf{X}_{tst}]$, $\\hat{\\mathbf{D}}^0 = [\\mathbf{X}_{tr}; \\mathbf{1}^\\top]$, $\\boldsymbol{\\beta}^0 = \\mathbf{Y}_{tr}(\\hat{\\mathbf{D}}^0)^\\top(\\hat{\\mathbf{D}}^0(\\hat{\\mathbf{D}}^0)^\\top + \\gamma\\mathbf{I})^{-1}$, $\\mathbf{E}^0 = \\mathbf{0}$, $\\mathbf{L}_1^0 = \\mathbf{X}/\\|\\mathbf{X}\\|_2$, $\\mathbf{L}_2^0 = \\mathbf{X}_{tr}/\\|\\mathbf{X}_{tr}\\|_2$, $\\mathbf{L}_3^0 = \\boldsymbol{\\beta}^0/\\|\\boldsymbol{\\beta}^0\\|_2$, $\\mu_1 = dN/(4\\|\\mathbf{X}\\|_1)$, $\\mu_2 = dN_{tr}/(4\\|\\mathbf{X}_{tr}\\|_1)$, $\\mu_3 = dc/(4\\|\\boldsymbol{\\beta}^0\\|_1)$.\n1: $k \\leftarrow 0$\n2: repeat  $\\triangleright$ Main optimization loop\n3:   $t \\leftarrow 0$, $\\hat{\\boldsymbol{\\beta}}^0 = \\boldsymbol{\\beta}^k$  $\\triangleright$ Update $\\boldsymbol{\\beta}$\n4:   repeat\n5:     $\\forall\\, i, j \\in \\{0, \\ldots, N_{tr} - 1\\},\\ i \\neq j$: $\\hat{\\alpha}_{ij} \\leftarrow 0$ and $\\hat{\\alpha}_{ii} \\leftarrow 1/\\sqrt{(\\mathbf{y}_i - \\hat{\\boldsymbol{\\beta}}^t\\hat{\\mathbf{d}}_i^k)^2 + 0.0001}$\n6:     $\\hat{\\boldsymbol{\\beta}}^{t+1} \\leftarrow \\big(\\mathbf{Y}_{tr}\\hat{\\boldsymbol{\\alpha}}\\hat{\\boldsymbol{\\alpha}}^\\top(\\hat{\\mathbf{D}}^k)^\\top + \\mu_3(\\mathbf{B}^k - \\mathbf{L}_3^k)\\big)\\big(\\hat{\\mathbf{D}}^k\\hat{\\boldsymbol{\\alpha}}\\hat{\\boldsymbol{\\alpha}}^\\top(\\hat{\\mathbf{D}}^k)^\\top + \\gamma\\mathbf{I}\\big)^{-1}$, $t \\leftarrow t + 1$\n7:   until $\\|\\hat{\\boldsymbol{\\beta}}^{t-1} - \\hat{\\boldsymbol{\\beta}}^t\\|_F/(\\|\\hat{\\boldsymbol{\\beta}}^{t-1}\\|_F \\times \\|\\hat{\\boldsymbol{\\beta}}^t\\|_F) < 0.001$ or $t > 100$\n8:   $\\boldsymbol{\\beta}^{k+1} \\leftarrow \\hat{\\boldsymbol{\\beta}}^t$\n9:   $\\hat{\\mathbf{D}}^{k+1} \\leftarrow \\big(\\eta\\hat{\\boldsymbol{\\alpha}}^\\top(\\boldsymbol{\\beta}^{k+1})^\\top\\boldsymbol{\\beta}^{k+1}\\hat{\\boldsymbol{\\alpha}} + \\mu_2^k\\mathbf{I}\\big)^{-1}\\big(\\eta\\hat{\\boldsymbol{\\alpha}}^\\top(\\boldsymbol{\\beta}^{k+1})^\\top\\mathbf{Y}_{tr} - \\mathbf{L}_2^k + \\mu_2^k[\\mathbf{D}_{tr}^k; \\mathbf{1}^\\top]\\big)$  $\\triangleright$ Update $\\hat{\\mathbf{D}}$\n10:  $\\mathbf{D}^{k+1} \\leftarrow \\mathcal{D}_{1/(\\mu_1^k + \\mu_2^k)}\\big(\\mathbf{L}_1^k + \\mu_1^k(\\mathbf{X} - \\mathbf{E}^k) + \\big[[\\mathbf{L}_2^k + \\mu_2^k\\hat{\\mathbf{D}}^{k+1}]_{(1:N_{tr},:)}\\ \\mathbf{0}\\big]\\big)$  $\\triangleright$ Update $\\mathbf{D}$\n11:  $\\mathbf{E}^{k+1} \\leftarrow \\mathcal{S}_{\\lambda_1/\\mu_1^k}(\\mathbf{X} - \\mathbf{D}^{k+1} + \\mathbf{L}_1^k/\\mu_1^k)$  $\\triangleright$ Update $\\mathbf{E}$\n12:  $\\mathbf{B}^{k+1} \\leftarrow \\mathcal{S}_{\\lambda_2/\\mu_3^k}(\\boldsymbol{\\beta}^{k+1} + \\mathbf{L}_3^k)$  $\\triangleright$ Update $\\mathbf{B}$\n13:  $\\mathbf{L}_1^{k+1} \\leftarrow \\mathbf{L}_1^k + \\mu_1^k(\\mathbf{X} - \\mathbf{D}^{k+1} - \\mathbf{E}^{k+1})$  $\\triangleright$ Update multipliers and parameters\n14:  $\\mathbf{L}_2^{k+1} \\leftarrow \\mathbf{L}_2^k + \\mu_2^k(\\hat{\\mathbf{D}} - [\\mathbf{D}_{tr}^{k+1}; \\mathbf{1}^\\top])$, $\\mathbf{L}_3^{k+1} \\leftarrow \\mathbf{L}_3^k + \\mu_3^k(\\boldsymbol{\\beta} - \\mathbf{B})$\n15:  $\\mu_1^{k+1} \\leftarrow \\min(\\rho\\mu_1^k, 10^9)$, $\\mu_2^{k+1} \\leftarrow \\min(\\rho\\mu_2^k, 10^9)$, $\\mu_3^{k+1} \\leftarrow \\min(\\rho\\mu_3^k, 10^9)$\n16:  $k \\leftarrow k + 1$\n17: until $\\|\\mathbf{X} - \\mathbf{D}^k - \\mathbf{E}^k\\|_F/\\|\\mathbf{X}\\|_F < 10^{-8}$ and $\\|\\hat{\\mathbf{D}}^k - [\\mathbf{D}_{tr}^k; \\mathbf{1}^\\top]\\|_F/\\|\\hat{\\mathbf{D}}^k\\|_F < 10^{-8}$ and $\\|\\boldsymbol{\\beta}^k - \\mathbf{B}^k\\|_F/\\|\\boldsymbol{\\beta}^k\\|_F < 10^{-8}$\nOutput: $\\boldsymbol{\\beta}$, $\\mathbf{D}$, $\\mathbf{E}$ and $\\mathbf{Y}_{tst} = \\boldsymbol{\\beta}\\mathbf{X}_{tst}$.\n\nOptimization: Problem (5) could be efficiently solved using the augmented Lagrangian multipliers (ALM) approach. Hence, we introduce the Lagrangian multipliers, $\\mathbf{L}_1 \\in \\mathbb{R}^{d \\times N}$, $\\mathbf{L}_2 \\in \\mathbb{R}^{(d+1) \\times N_{tr}}$ and $\\mathbf{L}_3 \\in \\mathbb{R}^{l \\times (d+1)}$, an auxiliary variable, $\\mathbf{B} \\in \\mathbb{R}^{l \\times (d+1)}$, and write the Lagrangian function as:\n\n$$\\mathcal{L}(\\boldsymbol{\\beta}, \\mathbf{B}, \\mathbf{D}, \\hat{\\mathbf{D}}, \\mathbf{E}) = \\frac{\\eta}{2}\\|\\mathbf{H}(\\mathbf{Y}_{tr} - \\boldsymbol{\\beta}\\hat{\\mathbf{D}})\\hat{\\boldsymbol{\\alpha}}\\|_F^2 + \\|\\mathbf{D}\\|_* + \\lambda_1\\|\\mathbf{E}\\|_1 + \\lambda_2(\\|\\mathbf{B}\\|_1 + \\gamma\\|\\boldsymbol{\\beta}\\|_F) + \\langle\\mathbf{L}_1, \\mathbf{X} - \\mathbf{D} - \\mathbf{E}\\rangle + \\frac{\\mu_1}{2}\\|\\mathbf{X} - \\mathbf{D} - \\mathbf{E}\\|_F^2 + \\langle\\mathbf{L}_2, \\hat{\\mathbf{D}} - [\\mathbf{D}_{tr}; \\mathbf{1}^\\top]\\rangle + \\frac{\\mu_2}{2}\\|\\hat{\\mathbf{D}} - [\\mathbf{D}_{tr}; \\mathbf{1}^\\top]\\|_F^2 + \\langle\\mathbf{L}_3, \\boldsymbol{\\beta} - \\mathbf{B}\\rangle + \\frac{\\mu_3}{2}\\|\\boldsymbol{\\beta} - \\mathbf{B}\\|_F^2, \\qquad (7)$$\n\nwhere $\\mu_1$, $\\mu_2$ and $\\mu_3$ are penalty parameters. There are five variables ($\\boldsymbol{\\beta}$, $\\mathbf{B}$, $\\mathbf{D}$, $\\hat{\\mathbf{D}}$ and $\\mathbf{E}$) contributing to the problem. We alternately optimize for each variable, while fixing the others. Except for the matrix $\\boldsymbol{\\beta}$, all the variables have straightforward or closed-form solutions. $\\boldsymbol{\\beta}$ is calculated through IRLS [2], by iteratively calculating the weights in $\\hat{\\boldsymbol{\\alpha}}$ and solving the conventional least-squares problem, until convergence.\n\nThe detailed optimization steps are given in Algorithm 1. The normalization factor $\\mathbf{H}$ is omitted in this algorithm, for easier readability. In this algorithm, $\\mathbf{I}$ is the identity matrix and the operators $\\mathcal{D}_\\tau(\\cdot)$ and $\\mathcal{S}_\\kappa(\\cdot)$ are defined in the following. 
$\\mathcal{D}_\\tau(\\mathbf{A}) = \\mathbf{U}\\mathcal{D}_\\tau(\\boldsymbol{\\Sigma})\\mathbf{V}^*$ applies the singular value thresholding algorithm [6] on the intermediate matrix $\\boldsymbol{\\Sigma}$, as $\\mathcal{D}_\\tau(\\boldsymbol{\\Sigma}) = \\mathrm{diag}(\\{(\\sigma_i - \\tau)_+\\})$, where $\\mathbf{U}\\boldsymbol{\\Sigma}\\mathbf{V}^*$ is the singular value decomposition (SVD) of $\\mathbf{A}$ and the $\\sigma_i$ are the singular values. Additionally, $\\mathcal{S}_\\kappa(a) = (a - \\kappa)_+ - (-a - \\kappa)_+$ is the soft thresholding operator or the proximal operator for the $\\ell_1$ norm [3]. Note that $s_+$ is the positive part of $s$, defined as $s_+ = \\max(0, s)$.\n\nAlgorithm analysis: The subproblem for each of the matrices $\\mathbf{B}$, $\\mathbf{D}$, $\\hat{\\mathbf{D}}$, $\\mathbf{E}$ is convex when all the other variables are fixed. For $\\boldsymbol{\\beta}$, the solution is achieved via the IRLS approach, in an iterative manner. Both the $\\ell_1$ fitting function and the approximated re-weighted least-squares functions are convex. We only need to ensure that the minimization of the latter is numerically more tractable than the minimization of the former. This is discussed in depth and the convergence is proved in [2].\n\nTo estimate the computational complexity of the algorithm, we need to investigate the complexity of its sub-procedures. The two most computationally expensive steps in the loop are the iterative update of $\\boldsymbol{\\beta}$ (Algorithm 1, Steps 4-7) and the SVT operation (Algorithm 1, Step 10). The former includes solving a least-squares problem iteratively, which is $O(d^2 N)$ in each iteration, and the latter has the SVD operation as the most computationally intensive operation, which is $O(d^2 N + N^3)$. By considering the maximum number of iterations for the first sub-procedure equal to $t_{max} = 100$, the overall computational complexity of the algorithm in each iteration would be $O(100 d^2 N + N^3)$. The number of iterations of the whole algorithm until convergence depends on the choice of the $\\{\\mu\\}$s. 
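The two operators defined above can be written in a few lines of NumPy. This is a minimal sketch with hypothetical function names (soft_threshold and svt), intended only to make the definitions concrete; it is not taken from the authors' code.

```python
import numpy as np


def soft_threshold(A, kappa):
    # Element-wise S_kappa(a) = (a - kappa)_+ - (-a - kappa)_+,
    # the proximal operator of the l1 norm.
    return np.maximum(A - kappa, 0.0) - np.maximum(-A - kappa, 0.0)


def svt(A, tau):
    # Singular value thresholding: shrink every singular value of A by tau,
    # clip at zero, and rebuild the matrix from the thin SVD.
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vh


shrunk = soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0)
low_rank = svt(np.diag([3.0, 1.0]), 2.0)
```

Entries with magnitude below the threshold are set to zero by the first operator, and singular values below the threshold are removed by the second, which is how sparsity of the error matrix and low rank of the de-noised matrix are encouraged.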
If the $\\mu$ penalty parameters increase smoothly in each iteration (as in Step 15, Algorithm 1), the overall algorithm would be Q-linearly convergent. A reasonable choice for the sequence of all $\\{\\mu\\}$s leads to a decrease in the number of required SVD operations [1, 21].\n\n3 Experiments\n\nWe compare our method with several baseline and state-of-the-art methods in three different scenarios. The first experiment is on synthetic data, which highlights how the proposed method is robust against sample-outliers or feature-noises, separately or when they occur at the same time. The next two experiments are conducted for neurodegenerative brain disorders diagnosis. We use two popular databases, one for Parkinson's disease (PD) and the other for Alzheimer's disease (AD).\n\nWe compare our results with different baseline methods, including: conventional LS-LDA [10], RLDA [15], RPCA on the $\\mathbf{X}$ matrix separately to de-noise and then LS-LDA for the classification (denoted as RPCA+LS-LDA) [15], linear support vector machines (SVM), and sparse feature selection with SVM (SFS+SVM) or with RLDA (SFS+RLDA). Except for RPCA+LS-LDA, the other methods in comparison do not incorporate the testing data. In order to have a fair set of comparisons, we also compare against the transductive matrix completion (MC) approach [14]. Additionally, to evaluate the effect of the regularization on matrix $\\boldsymbol{\\beta}$, we report results for RFS-LDA when regularized by only $\\gamma\\|\\boldsymbol{\\beta}\\|_F$ (denoted as RFS-LDA\u2217), instead of the term introduced in (4). Moreover, we also train our proposed RFS-LDA in a fully supervised setting, i.e., not involving any testing data in the training process, to show the effect of the established semi-supervised learning framework in our proposed method. This is simply done by replacing variable $\\mathbf{X}$ in (3) with $\\mathbf{X}_{tr}$ and solving the problem correspondingly. 
This method, referred to as S-RFS-LDA, only uses the training data to form the geometry of the sample space and, therefore, only cleans the training feature-noises.\n\nFor the choice of parameters, the best parameters are selected through an inner 10-fold cross validation on the training data, for all the competing methods. For the proposed method, the parameters are set with the same strategy as in [15]: $\\lambda_1 = \\Lambda_1/\\sqrt{\\min(d, N)}$, $\\lambda_2 = \\Lambda_2/\\sqrt{d}$, $\\eta^k = \\Lambda_3\\|\\mathbf{X}\\|_*/\\|\\mathbf{Y}_{tr} - \\boldsymbol{\\beta}^k\\hat{\\mathbf{D}}^k\\|_F^2$, and $\\rho$ (controlling the $\\{\\mu\\}$s in the algorithm) is set to 1.01. We have set $\\Lambda_1$, $\\Lambda_2$, $\\Lambda_3$ and $\\gamma$ through inner cross validation, and found that setting all of them to 1 yields reasonable results across all datasets.\n\nSynthetic Data: We construct two independent 100-dimensional subspaces, with bases $\\mathbf{U}_1$ and $\\mathbf{U}_2$ (same as described in [21]). $\\mathbf{U}_1 \\in \\mathbb{R}^{100 \\times 100}$ is a random orthogonal matrix and $\\mathbf{U}_2 = \\mathbf{T}\\mathbf{U}_1$, in which $\\mathbf{T}$ is a random rotation matrix. Then, 500 vectors are sampled from each subspace through $\\mathbf{X}_i = \\mathbf{U}_i\\mathbf{Q}_i$, $i = \\{1, 2\\}$, with $\\mathbf{Q}_i$ a $100 \\times 500$ matrix, independent and identically distributed (i.i.d.) from $\\mathcal{N}(0, 1)$. This leads to a binary classification problem. We gradually add additional noisy samples and features to the data, drawn i.i.d. from $\\mathcal{N}(0, 1)$, and evaluate our proposed method. The accuracy means and standard deviations of three different runs are illustrated in Fig. 2. This experiment is conducted under three settings: (1) First, we analyze the behavior of the method against gradually added noise to some of the features (feature-noises), illustrated in Fig. 2a. (2) We randomly add some noisy samples to the aforementioned noise-free samples and evaluate the methods in the sole presence of sample-outliers. Results are depicted in Fig. 2b. (3) Finally, we simultaneously add noisy features and samples. Fig. 2c shows the mean\u00b1std accuracy as a function of the additional number of noisy features and samples. Note that all the reported results are obtained through 10-fold cross-validation. As can be seen, our method is able to select a better subset of features and samples and achieve superior results compared to RLDA and conventional LS-LDA approaches. Furthermore, our method is more robust against the increase in the noise factor.\n\n[Figure 2 (plots): accuracy (%) of RFS-LDA, RLDA and LS-LDA versus the number (0 to 200) of (a) added noisy features, (b) added noisy samples, (c) added noisy samples and features.]\n\nFigure 2: Results comparisons on synthetic data, for three different runs (mean\u00b1std).\n\nTable 1: The accuracy (ACC) and area under ROC curve (AUC) of the PD/NC classification on the PPMI database, compared to the baseline methods.\n\nRFS-LDA RFS-LDA\u2217\n84.1\n0.87\n\n78.3\n0.81\n\nACC\nAUC\n\nMethod\n\nS-RFS-LDA RLDA SFS+RLDA RPCA+LS-LDA LS-LDA\n56.6\n0.59\n\n73.4\n0.80\n\n75.8\n0.80\n\n71.0\n0.79\n\n59.4\n0.64\n\nSVM SFS+SVM\n55.2\n61.5\n0.59\n0.56\n\nMC\n61.5\n68.8\n\nBrain neurodegenerative disease diagnosis databases: The first set of data used in this paper is obtained from the Parkinson's progression markers initiative (PPMI) database (http://www.ppmi-info.org/data) [23]. PPMI is the first substantial study for identifying the PD progression biomarkers to advance the understanding of the disease. In this research, we use the MRI data acquired by the PPMI study, in which a T1-weighted, 3D sequence (e.g., MPRAGE or SPGR) is acquired for each subject using 3T SIEMENS MAGNETOM TrioTim syngo scanners. We use subjects scanned using the MPRAGE sequence to minimize the effect of different scanning 
protocols. The T1-weighted images were acquired for\n176 sagittal slices with the following parameters: repetition time = 2300 ms, echo time = 2.98 ms,\n\ufb02ip angle = 9\u25e6, and voxel size = 1 \u00d7 1 \u00d7 1 mm3. All the MR images were preprocessed by skull\nstripping [29], cerebellum removal, and then segmented into white matter (WM), gray matter (GM),\nand cerebrospinal \ufb02uid (CSF) tissues [20]. The anatomical automatic labeling atlas [27], parcellated\nwith 90 prede\ufb01ned regions of interest (ROI), was registered using HAMMER3 [25, 30] to each\nsubject\u2019s native space. We further added 8 more ROIs in basal ganglia and brainstem regions, which\nare clinically important ROIs for PD. We then computed WM, GM and CSF tissue volumes in each\nof the 98 ROIs as features. 56 PD and 56 normal control (NC) subjects are used in our experiments.\nThe second dataset is from Alzheimer\u2019s disease neuroimaging initiative (ADNI) study4, including\nMRI and FDG-PET data. For this experiment, we used 93 AD patients, 202 MCI patients and 101\nNC subjects. To process the data, same tools employed in [29] and [32] are used, including spatial\ndistortion, skull-stripping, and cerebellum removal. The FSL package [33] was used to segment\neach MR image into three different tissues, i.e., GM, WM, and CSF. Then, 93 ROIs are parcellated\nfor each subject [25] with atlas warping. The volume of GM tissue in each ROI was calculated as\nthe image feature. For FDG-PET images, a rigid transformation was employed to align it to the\ncorresponding MR image and the mean intensity of each ROI was calculated as the feature. All\nthese features were further normalized in a similar way, as in [32].\nResults: The \ufb01rst experiment is set up on the PPMI database. Table 1 shows the diagnosis accuracy\nof the proposed technique (RFS-LDA) in comparisons with different baseline and state-of-the-art\nmethods, using a 10-fold cross-validation strategy. 
As can be seen, the proposed method outperforms all others. This could be because our method deals with both feature-noises and sample-outliers. Note that subjects and their corresponding feature vectors extracted from MRI data are quite prone to noise, because of many possible sources of noise (e.g., the patient\u2019s body movements, RF emission due to thermal motion, the overall MR scanner measurement chain, or preprocessing artifacts). Therefore, some samples might not be useful (sample-outliers) and some might be contaminated by some amount of noise (feature-noises). Our method deals with both types and achieves good results.\nThe goal of the experiments on the ADNI database is to discriminate both MCI and AD patients from NC subjects, separately. Therefore, NC subjects form our negative class, while the positive class is defined as AD in one experiment and MCI in the other. The diagnosis results of the AD vs. NC and MCI vs. NC experiments are reported in Table 2. As can be seen, in comparison with the state-of-the-art, our method achieves good results in terms of both accuracy and the area under the ROC curve. We attribute this to successfully discarding the sample-outliers and detecting the feature-noises.\n\n\u00b3 Can be downloaded at http://www.nitrc.org/projects/hammerwml\n\u2074 http://www.loni.ucla.edu/ADNI\n\nTable 2: The accuracy (ACC) and the area under ROC curve (AUC) of the Alzheimer\u2019s disease classification on ADNI database, compared to the baseline methods.\n\nAD/NC ACC\nAUC\nMCI/NC ACC\nAUC\n\nRFS-LDA RFS-LDA\u2217\n91.8\n0.98\n89.8\n0.93\n\n89.1\n0.96\n85.6\n0.90\n\nMethod\n\nS-RFS-LDA RLDA SFS+RLDA RPCA+LS-LDA LS-LDA\n70.9\n0.81\n68.9\n0.75\n\n90.1\n0.98\n88.1\n0.92\n\n87.6\n0.93\n84.5\n0.87\n\n86.3\n0.95\n84.5\n0.90\n\n88.7\n0.96\n85.0\n0.87\n\nSVM SFS+SVM\n76.3\n72.1\n0.83\n0.80\n76.1\n70.1\n0.79\n0.80\n\nMC\n78.2\n0.82\n74.3\n0.78\n\nFigure 3: The top selected ROIs for AD vs. NC (left) and MCI vs. 
NC (right) classification problems.\n\nDiscussions: In medical imaging applications, many sources of noise (e.g., patient\u2019s movement, radiation and limitations of imaging devices, preprocessing artifacts) contribute to the acquired data [13], and therefore methods that deal with noise and outliers are of great interest. Our method benefits from a single optimization objective that simultaneously suppresses sample-outliers and feature-noises and, compared to the competing methods, exhibits good performance. One of the interesting properties of the proposed method is the regularization of the mapping coefficients with the \u2113_1 norm, which selects a compact set of features to contribute to the learned mapping. The magnitude of each coefficient shows the level of contribution of the corresponding feature to the learned model. In our application, features are extracted from regions across the whole brain, but only a small number of regions are associated with the disease (e.g., AD, MCI or PD). Using this strategy, we can determine which brain regions are highly associated with a certain disease.\nFig. 3 shows the top regions selected by our algorithm in the AD vs. NC and MCI vs. NC classification scenarios. These regions, including the middle temporal gyrus, medial front-orbital gyrus, postcentral gyrus, caudate nucleus, cuneus, and amygdala, have been reported to be associated with AD and MCI in the literature [24, 26]. The figures show the union of regions selected for both MRI and FDG-PET features. The most frequently used regions for the PD/NC experiment are the substantia nigra (left and right), putamen (right), middle frontal gyrus (right), and superior temporal gyrus (left), which are also consistent with the literature [4, 31]. 
This selection of brain regions could be further incorporated for future clinical analysis.\nThe semi-supervised setting of the proposed method is also of great interest in the diagnosis of patients. When new patients first arrive and are to be diagnosed, the previous patients with no certain diagnosis so far (not labeled yet) could still be used to build a more reliable classifier. In other words, the current testing samples could contribute to the diagnosis of future subjects, as unlabeled samples.\n\n4 Conclusion\n\nIn this paper, we proposed an approach for discriminative classification, which is robust against both sample-outliers and feature-noises. Our method operates in a semi-supervised setting, where all the labeled training and the unlabeled testing data are used to detect outliers and are de-noised simultaneously. We have applied our method to the problem of neurodegenerative brain disease diagnosis, specifically to the diagnosis of Parkinson\u2019s and Alzheimer\u2019s diseases. The results show that our method outperforms all competing methods. As a direction for future work, one can develop a multi-task learning reformulation of the proposed method to incorporate multiple modalities for the subjects, or extend the method to the incomplete data case.\n\nReferences\n[1] E. Adeli-Mosabbeb and M. Fathy. Non-negative matrix completion for action detection. Image Vision Comput., 39:38\u201351, 2015.\n[2] N. Bissantz, L. D\u00fcmbgen, A. Munk, and B. Stratmann. Convergence analysis of generalized iteratively reweighted least squares algorithms on convex function spaces. SIAM Optimiz., 19(4):1828\u20131845, 2009.\n[3] S. Boyd et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1\u2013122, 2011.\n[4] H. Braak, K. Tredici, U. Rub, R. de Vos, E. Jansen Steur, and E. Braak. 
Staging of brain pathology related to sporadic Parkinson\u2019s disease. Neurobio. of Aging, 24(2):197\u2013211, 2003.\n[5] D. Cai, X. He, and J. Han. Semi-supervised discriminant analysis. In CVPR, 2007.\n[6] J.-F. Cai, E. Cand\u00e8s, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Optimiz., 20(4):1956\u20131982, 2010.\n[7] E. Cand\u00e8s, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3), 2011.\n[8] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.\n[9] C. Croux and C. Dehon. Robust linear discriminant analysis using s-estimators. Canadian J. of Statistics, 29(3):473\u2013493, 2001.\n[10] F. De la Torre. A least-squares framework for component analysis. IEEE TPAMI, 34(6):1041\u20131055, 2012.\n[11] E. Elhamifar and R. Vidal. Robust classification using structured sparse representation. In CVPR, 2011.\n[12] S. Fidler, D. Skocaj, and A. Leonardis. Combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling. IEEE TPAMI, 28(3):337\u2013350, 2006.\n[13] V. Fritsch, G. Varoquaux, B. Thyreau, J.-B. Poline, and B. Thirion. Detecting outliers in high-dimensional neuroimaging datasets with robust covariance estimators. Med. Image Anal., 16(7):1359\u20131370, 2012.\n[14] A. Goldberg, X. Zhu, B. Recht, J.-M. Xu, and R. Nowak. Transduction with matrix completion: Three birds with one stone. In NIPS, pages 757\u2013765, 2010.\n[15] D. Huang, R. Cabral, and F. De la Torre. Robust regression. In ECCV, pages 616\u2013630, 2012.\n[16] A. Joulin and F. Bach. A convex relaxation for weakly supervised classifiers. In ICML, 2012.\n[17] S. Kim, A. Magnani, and S. Boyd. Robust Fisher discriminant analysis. In NIPS, pages 659\u2013666, 2005.\n[18] H. Li, T. Jiang, and K. Zhang. Efficient and robust feature extraction by maximum margin criterion. 
In NIPS, pages 97\u2013104, 2003.\n[19] H. Li, C. Shen, A. van den Hengel, and Q. Shi. Worst-case linear discriminant analysis as scalable semidefinite feasibility problems. IEEE TIP, 24(8), 2015.\n[20] K.O. Lim and A. Pfefferbaum. Segmentation of MR brain images into cerebrospinal fluid spaces, white and gray matter. J. of Computer Assisted Tomography, 13:588\u2013593, 1989.\n[21] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE TPAMI, 35(1):171\u2013184, 2013.\n[22] C. Lu, J. Shi, and J. Jia. Online robust dictionary learning. In CVPR, pages 415\u2013422, June 2013.\n[23] K. Marek et al. The Parkinson progression marker initiative (PPMI). Prog. Neurobiol., 95(4):629\u2013635, 2011.\n[24] B. Pearce, A. Palmer, D. Bowen, G. Wilcock, M. Esiri, and A. Davison. Neurotransmitter dysfunction and atrophy of the caudate nucleus in Alzheimer\u2019s disease. Neurochem Pathol., 2(4):221\u201332, 1985.\n[25] D. Shen and C. Davatzikos. HAMMER: Hierarchical attribute matching mechanism for elastic registration. IEEE TMI, 21:1421\u20131439, 2002.\n[26] K.-H. Thung, C.-Y. Wee, P.-T. Yap, and D. Shen. Neurodegenerative disease diagnosis using incomplete multi-modality data via matrix shrinkage and completion. NeuroImage, 91:386\u2013400, 2014.\n[27] N. Tzourio-Mazoyer et al. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 15(1):273\u2013289, 2002.\n[28] A. Wagner, J. Wright, A. Ganesh, Z. Zhou, and Y. Ma. Towards a practical face recognition system: Robust registration and illumination by sparse representation. In CVPR, pages 597\u2013604, 2009.\n[29] Y. Wang, J. Nie, P.-T. Yap, G. Li, F. Shi, X. Geng, L. Guo, D. Shen, ADNI, et al. 
Knowledge-guided robust MRI brain extraction for diverse large-scale neuroimaging studies on humans and non-human primates. PLOS ONE, 9(1):e77810, 2014.\n[30] Y. Wang, J. Nie, P.-T. Yap, F. Shi, L. Guo, and D. Shen. Robust deformable-surface-based skull-stripping for large-scale studies. In MICCAI, volume 6893, pages 635\u2013642, 2011.\n[31] A. Worker et al. Cortical thickness, surface area and volume measures in Parkinson\u2019s disease, multiple system atrophy and progressive supranuclear palsy. PLOS ONE, 9(12), 2014.\n[32] D. Zhang, Y. Wang, L. Zhou, H. Yuan, D. Shen, ADNI, et al. Multimodal classification of Alzheimer\u2019s disease and mild cognitive impairment. NeuroImage, 55(3):856\u2013867, 2011.\n[33] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE TMI, 20(1):45\u201357, 2001.\n[34] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.\n[35] D. Ziegler and J. Augustinack. Harnessing advances in structural MRI to enhance research on Parkinson\u2019s disease. Imaging Med., 5(2):91\u201394, 2013.\n", "award": [], "sourceid": 462, "authors": [{"given_name": "Ehsan", "family_name": "Adeli-Mosabbeb", "institution": "UNC-Chapel Hill"}, {"given_name": "Kim-Han", "family_name": "Thung", "institution": "UNC-Chapel Hill"}, {"given_name": "Le", "family_name": "An", "institution": "UNC-Chapel Hill"}, {"given_name": "Feng", "family_name": "Shi", "institution": "UNC-Chapel Hill"}, {"given_name": "Dinggang", "family_name": "Shen", "institution": "UNC-Chapel Hill"}]}