{"title": "Heterogeneous Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1097, "page_last": 1104, "abstract": null, "full_text": "Heterogeneous Component Analysis\n\nShigeyuki Oba1, Motoaki Kawanabe2, Klaus-Robert Müller3,2, and Shin Ishii4,1\n\n1. Graduate School of Information Science, Nara Institute of Science and Technology, Japan\n\n2. Fraunhofer FIRST.IDA, Germany\n\n3. Department of Computer Science, Technical University Berlin, Germany\n\n4. Graduate School of Informatics, Kyoto University, Japan\n\nshige-o@is.naist.jp\n\nAbstract\n\nIn bioinformatics it is often desirable to combine data from various measurement sources, so the feature vectors to be analyzed are structured and possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, and effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction that helps to better understand the factors underlying such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based both on a probabilistic model with block-wise residual variance terms and on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept, extracting sparse heterogeneous structure by obtaining components common to several blocks and components specific to each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept.\n\n1 Introduction\n\nMicroarray and other high-throughput measurement devices have been applied to examine specimens, such as cancer tissues, of biological and/or clinical interest. The next step is to move towards combinatorial studies in which tissues measured by two or more such devices are analyzed simultaneously. 
However, such combinatorial studies inevitably suffer from differences in experimental conditions or, even more complex, from different measurement technologies. Also, when concatenating data from different measurement sources, we often observe systematic missing parts in the resulting dataset (e.g., Fig. 3A). Moreover, the noise levels may vary among different experiments. All of these induce a heterogeneous structure in the data that needs to be treated appropriately. Our work contributes exactly to this topic by proposing a Bayesian method for feature subspace extraction called heterogeneous component analysis (HCA, Sections 2 and 3). HCA performs linear feature extraction based on matrix factorization in order to obtain a sparse and structured representation. After relating HCA to previous methods (Section 4), we apply it to toy data and, more interestingly, to neuroblastoma data from different measurement techniques (Section 5). We obtain interesting factors that may be a first step towards better biological model building.\n\n2 Formulation of the HCA problem\n\nLet a matrix Y = {y_ij}_{i=1:M, j=1:N} denote a set of N observations of M-dimensional feature vectors, where y_ij ∈ R is the j-th observation of the i-th feature. In a heterogeneous situation, we assume the M-dimensional feature vector is decomposed into L disjoint blocks. Let I(l) denote the set of feature indices included in the l-th block, so that I(1) ∪ ··· ∪ I(L) = I and I(l) ∩ I(l′) = ∅ for l ≠ l′.\n\nFigure 1: An illustration of a typical dataset and the result of HCA. The observation matrix Y consists of multiple samples j = 1, . . . , N with high-dimensional features i ∈ I. The features consist of multiple blocks, in this case I(1) ∪ I(2) ∪ I(3) = I. There are many missing observations whose distribution is highly structured, depending on each block. 
HCA optimally factorizes the matrix Y so that the factor-loading matrix U has structural sparseness; it includes regions of zero elements according to the block structure of the observed data. Each factor may affect all the features within a block or none of them, and each block need not load on all the factors. Therefore, the rank of the factor-loading sub-matrix for each block (or any set of blocks) can differ from the others. The resulting block-wise sparse matrix reflects a characteristic heterogeneity of features over blocks.\n\nWe assume that the matrix Y ∈ R^{M×N} is a noisy observation of a matrix of true values X ∈ R^{M×N} whose rank is K (< min(M, N)) and which has a factorized form:\n\nY = X + E,  X = U V^T,  (1)\n\nwhere E ∈ R^{M×N}, U ∈ R^{M×K}, and V ∈ R^{N×K} are matrices of residuals, factor loadings, and factors, respectively. The superscript T denotes matrix transpose. There may be missing or unmeasured observations, indicated by a matrix W ∈ {0, 1}^{M×N}: observation y_ij is missing if w_ij = 0 and exists otherwise (w_ij = 1).\n\nFigure 1 illustrates the concept of HCA. In this example, the observed data matrix (left panel) is made up of three blocks of features. The blocks show block-wise variation in effective dimensionalities, missing rates, observation noise levels, and so on, which we collectively call heterogeneity. Such heterogeneity affects the effective rank of the observation sub-matrix corresponding to each block, and hence naturally leads to different ranks of the factor-loading sub-matrices between blocks. 
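The generative model of eq. (1) with a missing-entry mask can be sketched as follows (a minimal NumPy sketch; the block sizes, rank, noise levels, and variable names here are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: three disjoint feature blocks I(l), K latent factors, N samples.
blocks = {1: range(0, 20), 2: range(20, 70), 3: range(70, 170)}
M, N, K = 170, 100, 9

U = rng.normal(size=(M, K))        # factor loadings (block-wise sparse in HCA)
V = rng.normal(size=(N, K))        # factors, standard Gaussian prior
sigma = {1: 0.1, 2: 0.5, 3: 0.3}   # block-wise residual std deviations (illustrative)

# Y = U V^T + E, with a common noise level within each block (heteroscedastic across blocks).
E = np.vstack([sigma[l] * rng.normal(size=(len(blocks[l]), N)) for l in blocks])
Y = U @ V.T + E

# W indicates observed entries (w_ij = 1) vs. missing entries (w_ij = 0).
W = (rng.random((M, N)) > 0.2).astype(int)
```

The mask W is what later restricts the likelihood to observed entries only.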
In addition, there can exist block-wise patterns of missing values (shaded rectangular regions in the left panel); such a situation occurs, for example in bioinformatics, when some particular genes have been measured in one assay (constituting one block) but not in another assay (constituting another block).\n\nTo better understand the data via feature extraction by matrix factorization, we assume a block-wise sparse factor-loading matrix U (right panel in Fig. 1). Namely, the effective rank of the observation sub-matrix corresponding to a block is reflected by the number of non-zero components in the corresponding rows of U. Assuming such a block-wise sparse structure decreases the model's effective complexity, describes the data better, and therefore leads to better generalization ability, e.g., for missing value prediction.\n\n3 A probabilistic model for HCA\n\nModel For each element of the residual matrix, e_ij ≡ y_ij − Σ_{k=1}^K u_ik v_jk, we assume a Gaussian distribution with a common variance σ²_l for every feature i in the same block I(l):\n\n\ln p(e_{ij} \mid \sigma^2_{l(i)}) = -\frac{1}{2} \sigma^{-2}_{l(i)} e_{ij}^2 - \frac{1}{2} \ln \sigma^2_{l(i)} - \frac{1}{2} \ln 2\pi,  (2)\n\nwhere l(i) denotes the pre-determined block index to which the i-th feature belongs. For the factor matrix V, we assume a Gaussian prior:\n\n\ln p(V) = \sum_{j=1}^{N} \sum_{k=1}^{K} \left( -\frac{1}{2} v_{jk}^2 - \frac{1}{2} \ln 2\pi \right).  (3)\n\nThe above two assumptions are exactly the same as those of probabilistic PCA, which is a special case of HCA with a single active block. Another special case, in which each block contains only one active feature, is probabilistic factor analysis (FA). 
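The block-wise residual model of eq. (2) is straightforward to evaluate numerically. A minimal sketch (the function and argument names are mine, not from the paper):

```python
import numpy as np

def residual_loglik(E, block_of_feature, sigma2):
    """Sum of eq.-(2) terms: Gaussian log-density of each residual e_ij,
    using a common variance sigma2[l] for all features i in block l."""
    ll = 0.0
    for i, row in enumerate(E):
        s2 = sigma2[block_of_feature[i]]
        ll += np.sum(-0.5 * row**2 / s2 - 0.5 * np.log(s2) - 0.5 * np.log(2 * np.pi))
    return ll
```

With all residuals zero and unit variance, the result reduces to the normalizing constant alone.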
Namely, maximum likelihood (ML) estimation based on the following log-likelihood includes both PCA and FA as special settings of the blocks:\n\n\ln p(Y, V \mid U, \sigma^2) = \frac{1}{2} \sum_{ij} w_{ij} \left( -\sigma^{-2}_{l(i)} e_{ij}^2 - \ln \sigma^2_{l(i)} - \ln 2\pi \right) + \frac{1}{2} \sum_{jk} \left( -v_{jk}^2 - \ln 2\pi \right).  (4)\n\nHere σ² = (σ²_l)_{l=1,...,L} is the vector of variances of all blocks. Since w_ij = 0 iff y_ij is missing, the summation Σ_ij is actually taken over all observed values in Y.\n\nAnother characteristic of the HCA model is the block-wise sparse factor-loading matrix, which is implemented by a prior for U given by\n\n\ln p(U \mid T) = \sum_{ik} t_{ik} \left( -\frac{1}{2} u_{ik}^2 - \frac{1}{2} \ln 2\pi \right),  (5)\n\nwhere T = {t_ik} is a block-wise mask matrix that defines the block-wise sparse structure; if t_ik = 0, then u_ik = 0 with probability 1. Each column vector of the mask matrix takes one of the possible block-wise mask patterns: a binary pattern vector whose dimensionality is the same as that of the factor-loading vector, and whose values are constant, either 0 or 1, within each block. When there are L blocks, each column vector of T can take one of 2^L possible patterns including the zero vector, and hence the matrix T with K columns can take one of 2^{LK} possible patterns.\n\nParameter estimation We estimated the model parameters U and V by maximum a posteriori (MAP) estimation, and σ by ML estimation; that is, the log-joint L := log p(Y, U, V | σ, T) was maximized w.r.t. U, V, and σ.\n\nMaximization of the log-joint L w.r.t. U, V, and σ was performed by the conjugate gradient algorithm available in the NETLAB toolbox [1]. The stationary condition w.r.t. the variance, ∂L/∂(σ²_l) = 0, was solved in closed form as a function of U and V:\n\n\tilde{\sigma}^2_l(U, V) := \mathrm{mean}_{(i,j|l)}[e_{ij}^2],  (6)\n\nwhere mean_{(i,j|l)}[·] 
is the average over all pairs (i, j) not missing in the l-th block. Redefining the objective function with the closed-form solution plugged in,\n\n\tilde{L}(U, V) := L(U, V, \tilde{\sigma}^2(U, V)),  (7)\n\nconjugate gradient ascent on \tilde{L} w.r.t. U and V led to faster and more stable optimization than naive maximization of L w.r.t. U, V, and σ².\n\nModel selection The mask matrix T was determined by maximizing the log-marginal likelihood ln ∫ p(Y, U, V | σ, T) dU dV, which was calculated by a Laplace approximation around the MAP estimator:\n\nE(T) := L - \frac{1}{2} \ln \det H,  (8)\n\nwhere H := \frac{\partial^2 L}{\partial \theta \partial \theta^T} is the Hessian of the log-joint w.r.t. all elements θ of the parameters U and V. The log-Hessian term, ln det H, which works as a penalty against maintaining non-zero elements in the factor-loading matrix, was simplified to make the calculation tractable. Namely, independence in the log-joint was assumed:\n\n\frac{\partial^2 L}{\partial u_{ik} \partial v_{jk'}} \approx 0, \quad \frac{\partial^2 L}{\partial u_{ik} \partial u_{ik'}} \approx 0, \quad \frac{\partial^2 L}{\partial v_{jk} \partial v_{jk'}} \approx 0,  (9)\n\nwhich enabled a tractable computation similar to variational Bayes (VB) and was expected to produce satisfactory results.\n\nTo avoid searching through an exponentially large number of possibilities, we implemented a greedy search, called the HCA-greedy algorithm, that optimizes the column vectors one at a time. In each step of the HCA-greedy algorithm, factor-loading and factor vectors are estimated for each of the 2^L possible block-wise mask vectors, and we accept the one achieving the maximum log-marginal. The algorithm terminates when the zero vector is accepted as the best mask vector.\n\nHCA with ARD The greedy search still examines 2^L possibilities per factor, a computation that grows exponentially with the number of blocks L. 
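One HCA-greedy step, enumerating the 2^L block-wise mask patterns for the next factor-loading column, can be sketched as follows. This is a schematic: the `score` callback stands in for fitting U and V under the candidate mask and evaluating the Laplace-approximated log-marginal of eq. (8), and all names here are mine:

```python
from itertools import product

def greedy_mask_step(L, score):
    """Try all 2^L block-wise mask patterns for one new factor-loading
    column and return the best-scoring pattern with its score."""
    best_pattern, best_score = None, float("-inf")
    for pattern in product([0, 1], repeat=L):
        s = score(pattern)
        if s > best_score:
            best_pattern, best_score = pattern, s
    return best_pattern, best_score
```

The outer loop (not shown) repeats this step column by column and stops as soon as the all-zero pattern wins, matching the termination rule described above.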
Automatic relevance determination (ARD) is a hierarchical Bayesian approach for selecting relevant bases, which has been applied to component analyzers since its first introduction in Bayesian PCA (BPCA) [2].\n\nThe prior for U is given by\n\n\ln p(U \mid \alpha) = \frac{1}{2} \sum_{l=1}^{L} \sum_{k=1}^{K} \sum_{i \in I(l)} \left( -\alpha_{lk} u_{ik}^2 + \ln \alpha_{lk} - \ln 2\pi \right),  (10)\n\nwhere α_lk is an ARD hyper-parameter for the l-th block of the k-th column of U, and α is the vector of all elements α_lk, l = 1, . . . , L, k = 1, . . . , K. With this prior, the log-joint probability density function becomes\n\n\ln p(Y, U, V \mid \sigma^2, \alpha) = \frac{1}{2} \sum_{ij} w_{ij} \left( -\sigma^{-2}_{l(i)} e_{ij}^2 - \ln \sigma^2_{l(i)} - \ln 2\pi \right) + \frac{1}{2} \sum_{jk} \left( -v_{jk}^2 - \ln 2\pi \right) + \frac{1}{2} \sum_{ik} \left( -\alpha_{l(i)k} u_{ik}^2 + \ln \alpha_{l(i)k} - \ln 2\pi \right).  (11)\n\nIn this ARD approach, α is updated by the conjugate gradient-based optimization simultaneously with U and V. In each step of the optimization, α was updated until the stationary condition of the log-marginal w.r.t. α approximately held.\n\nIn HCA with ARD, called HCA-ARD, the initial values of U and V were obtained by SVD. We also examined an ARD-based procedure with another initialization, starting from the result obtained by HCA-greedy, which we denote HCA-g+ARD.\n\n4 Related work\n\nIn this work, ideas from probabilistic modeling of linear component analyzers and from sparse matrix factorization are combined into an analytical tool for data with underlying heterogeneous structures.\n\nWeighted low-rank matrix factorization (WLRMF) [3] has been proposed as minimization of the weighted error\n\n\min_{U,V} \sum_{i,j} w_{ij} \left( y_{ij} - \sum_k u_{ik} v_{jk} \right)^2,  (12)\n\nwhere w_ij is a weight for the element y_ij of the observation matrix Y. 
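The weighted objective of eq. (12) is simple to evaluate: missing entries receive zero weight and thus contribute nothing to the loss. A minimal sketch (function name is mine):

```python
import numpy as np

def wlrmf_objective(Y, W, U, V):
    """Weighted squared reconstruction error of eq. (12); entries with
    w_ij = 0 (missing observations) do not contribute to the loss."""
    R = Y - U @ V.T          # residual matrix E = Y - U V^T
    return np.sum(W * R**2)
```

Minimizing this over U and V (e.g., by alternating least squares or gradient descent) gives the WLRMF solution discussed below.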
The weight is set as w_ij = 0 if the corresponding y_ij is missing and w_ij > 0 otherwise. This objective function is equivalent to the (negative) log-likelihood of a probabilistic generative model in which each element of the residual matrix obeys a Gaussian distribution with variance 1/w_ij. The WLRMF objective function is equivalent to our log-likelihood function (4) if the weight is set to the estimated inverse noise variance for each (i, j)-th element. Although the prior term ln p(V) = −(1/2) Σ_{jk} v²_{jk} + const. has been added in eq. (4), it merely resolves the linear indeterminacy between U and V, and hence the resulting low-rank matrix U V^T is identical to that of WLRMF.\n\nBayesian PCA [2] is also a matrix factorization procedure, which includes a characteristic prior density over factor-loading vectors, ln p(U | α) = −(1/2) Σ_{ik} α_k u²_{ik} + const. This is equivalent to the HCA-ARD prior (eq. (10)) if we assume only a single block. Although this prior term is simply an L2 norm, as in regularized WLRMF, it also includes the hyper-parameter α, which constitutes a different regularization term and leads to automatic model (intrinsic dimensionality) selection when α is determined by the evidence criterion.\n\nFigure 2: Experimental results when applied to an artificial data matrix. (A) Missing pattern of the observation matrix. Vertical and horizontal axes correspond to rows (typically genes) and columns (typically samples) of the matrix. Red cells signify missing elements. (B) True factor-loading matrix. The horizontal axis denotes factors; color and its intensity denote element values, and white cells denote zero elements. Panels (C) to (H) show the factor-loading matrices estimated by SVD, WLRMF, BPCA, HCA-greedy, HCA-ARD, and HCA-g+ARD, respectively. The vertical line in panel (F) denotes the automatically determined number of components. Panel (I) shows missing value prediction performance obtained by the three HCA algorithms and the other methods. The vertical and horizontal axes denote normalized root mean square test error (NRMSE) and the dimensionality of factors (K), respectively.\n\nComponent analyzers with sparse factor loadings have recently been investigated as sparse PCA (SPCA). In the well-established context of SPCA studies (e.g., [4]), a tradeoff is solved between understandability (sparsity of the factor loadings) and reproducibility of the covariance matrix from the sparsified factor loadings. In our HCA, the block-wise sparse factor-loading matrix is useful not only for understandability but also for generalization ability. The latter merit comes from the assumption that the observations include uncertainty due to small sample size, large noise, and missing observations, which have not been considered sufficiently in SPCA.\n\n5 Experiments\n\nExperiment 1: an artificial dataset We prepared an artificial data set with an underlying block structure. For this we generated a 170 × 9 factor-loading matrix U that included a pre-determined block structure (white vs. colored in Fig. 2(B)), and a 100 × 9 factor matrix V by applying orthogonalization to factors sampled from a standard Gaussian distribution. 
The observation matrix Y was produced as U V^T + E, where each element of E was generated from a standard Gaussian. Then, missing values were artificially introduced according to the pre-determined block structure (Fig. 2(A)).\n\n• Block 1 consisted of 20 features with 10% of entries missing at random.\n• Block 2 consisted of 50 features in which 50% of the columns were completely missing and the remaining columns contained 50% of entries missing at random.\n• Block 3 consisted of 100 features in which 20% of the columns were completely missing and the remaining columns contained 20% of entries missing at random.\n\nWe applied three HCA algorithms (HCA-greedy, HCA-ARD, and HCA-g+ARD) and three existing matrix factorization algorithms (SVD, WLRMF, and BPCA).\n\nSVD SVD calculated for a matrix whose missing values are imputed with zeros.\n\nWLRMF [3] The weights were set to 1 for observed entries and 0 for missing entries.\n\nBPCA WLRMF with an ARD prior, called BPCA here, which is equivalent to HCA-ARD except that all features are in a single active block (i.e., colored in Fig. 2(B)). We confirmed that this method exhibited almost the same performance as the VB-EM-based algorithm [5].\n\nThe generalization ability was evaluated on the basis of the estimation performance for the artificially introduced missing values. The estimated factor-loading matrices and missing value estimation accuracies are shown in Figure 2. The factor-loading matrices produced by WLRMF and BPCA were almost the same as that of SVD, because these three methods do not assume any sparsity in the factor-loading matrix.\n\nThe HCA-greedy algorithm terminated at K = 10. The factor-loading matrix estimated by HCA-greedy showed a sparse structure identical to that of the top five factors in the true factor loadings. 
The sixth factor, in the second block, was not extracted, possibly because the second block lacked information due to its large rate of missing values. The algorithm also happened to extract one factor not included in the original factor loadings, as the tenth factor in the first block.\n\nAlthough the HCA-ARD and HCA-g+ARD algorithms extracted good factors as the top three and four factors, respectively, they failed to completely reconstruct the sparsity structure in the other factors. As shown in panel (I), however, this poorly extracted structure did not increase the generalization error, implying that the essential structure underlying the data was extracted well by all three HCA-based algorithms.\n\nReconstruction of missing values was evaluated by the normalized root mean square error, NRMSE := \sqrt{\mathrm{mean}[(y - \tilde{y})^2] / \mathrm{var}[y]}, where y and \tilde{y} denote true and estimated values, respectively; the mean is taken over all missing entries and the variance over all entries of the matrix.\n\nFigure 2(I) shows the generalization ability of the missing value predictions. SVD and WLRMF, which incur no penalty for extracting a large number of factors, exhibited their best results around K = 9 but got worse as K increased, due to over-fitting. HCA-g+ARD showed its best performance at K = 9, which was better than that of all the other methods. HCA-greedy, HCA-ARD, and BPCA exhibited comparable performance at K = 9. At K = 2, . . . , 8, the HCA algorithms performed better than BPCA; that is, the sparse structure in the factor loadings tended to yield better performance. HCA-ARD performed less effectively than the other two HCA algorithms at K > 13 because of convergence to local solutions. This explanation is supported by the fact that HCA-g+ARD, which employs good initialization by HCA-greedy, exhibited the best performance among all the HCA algorithms. 
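The NRMSE criterion used above can be computed directly from its definition (a sketch; argument names are mine): the squared error is averaged over the held-out missing entries, while the normalizing variance is taken over all entries of the true matrix.

```python
import numpy as np

def nrmse(y_true, y_est, missing_mask):
    """NRMSE = sqrt( mean over missing entries of (y - y~)^2  /  var over all entries of y )."""
    se = (y_true - y_est)[missing_mask] ** 2   # squared errors at held-out entries
    return np.sqrt(se.mean() / y_true.var())
```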
Accordingly, HCA showed better generalization ability with a smaller number of effective parameters than the existing methods.\n\nFigure 3: Analysis of an NBL dataset. Vertical axes denote high-dimensional features. Features measured by array CGH technology are sorted in chromosomal order; microarray features are sorted by correlation to the sample's prognosis (dead or alive at the end of clinical follow-up). (A) Missing pattern in the NBL dataset (blocks: array CGH, Microarray 1, Microarray 2). White and red denote observed and missing entries in the data matrix, respectively. (B) and (C) Factor-loading matrices estimated by the HCA-greedy and WLRMF algorithms, respectively.\n\nExperiment 2: a cross-analysis of neuroblastoma data We next applied HCA to a neuroblastoma (NBL) dataset consisting of three data blocks obtained with three kinds of high-throughput genomic measurement technologies.\n\nArray CGH Chromosomal changes of 2340 DNA segments (using 2340 probes) were measured for each of 230 NBL tumors using array comparative genomic hybridization (array CGH) technology. Data for 1000 probes were arbitrarily selected from the whole dataset.\n\nMicroarray 1 Expression levels of 5340 genes were measured for 136 tumors from NBL patients. We selected the 1000 genes showing the largest variance over the 136 tumors.\n\nMicroarray 2 Gene expression levels in 25 of the 136 tumors were also measured by a small-sized microarray technology harboring 448 probes.\n\nThe dataset Microarray 1 was the one used in a previous study [6], and the other two datasets, array CGH and Microarray 2, were provided by the same research group for this study. 
As seen in Figure 3(A), the sets of measured samples were quite different in the three experiments, leading to pronounced block-wise missing observations. We normalized the data matrix so that the block-wise variances became unity. We further added 10% missing entries at random to the observed entries in order to evaluate missing value prediction performance.\n\nWhen HCA-greedy was applied to this dataset, it terminated at K = 23, but we continued to obtain further factors up to K = 80. Figure 3(B) shows the factor-loading matrix for K = 0 to 23. HCA-greedy extracted one factor relating all three measurement devices and three factors relating aCGH and Microarray 1. The other factors accounted for either aCGH or Microarray 1 alone. The first factor was strongly correlated with patient prognosis, as clearly shown by the color code in the Microarray 1 and 2 parts; note that the features in these two datasets are ordered by correlation to the prognosis. This suggests that the dataset Microarray 2 did not include factors other than the first that were strongly related to the prognosis. On the other hand, WLRMF extracted a first factor identical to that of HCA-greedy, but extracted many more factors concerning Microarray 2, not all of which may be trustworthy because the number of samples observed in Microarray 2 was as small as 25.\n\nFigure 4: Missing value prediction performance of the six algorithms. 
The vertical axis denotes normalized root mean square training error (A) or test error (B and C). The horizontal axis denotes the number of factors (A and B) or the number of non-zero elements in the factor-loading matrices (C). Each curve corresponds to one of the six algorithms.\n\nWe also applied SVD, WLRMF, BPCA, and the other two HCA algorithms to the NBL dataset. For WLRMF, BPCA, HCA-ARD, and HCA-g+ARD, the initial numbers of factors were set to K = 5, 10, 20, . . . , 70, and 80. Missing value prediction performance in terms of NRMSE was used as the measure of generalization performance. Note that the original data matrix included many missing values, but we evaluated the performance using the artificially introduced missing values. Figure 4 shows the results.\n\nTraining errors decreased almost monotonically as the number of factors increased (Fig. 4A), indicating the stability of the algorithms. The only exception was HCA-ARD, whose error increased from K = 30 to K = 40; this was due to a local solution, because HCA-g+ARD, which employs the same algorithm but starts from a different initialization, showed consistent improvement in performance.\n\nTest errors did not show monotonic profiles, except that HCA-greedy exhibited monotonically better results for larger K values (Fig. 4B and C). SVD and WLRMF exhibited their best performance at K = 22 and K = 60, respectively, and got worse as the number of factors increased, due to over-fitting.\n\nOverall, the variants of our new HCA concept showed good generalization performance as measured on missing values, much like existing methods such as WLRMF. We would like to emphasize, however, that HCA yields a clearer factor structure that is more easily interpretable from the biological point of view.\n\n6 Conclusion\n\nComplex structured data are ubiquitous in practice. 
For instance, when we must integrate data derived from different measurement devices, it becomes critically important to combine the information in each single source optimally; otherwise no gain can be achieved beyond the individual analyses. Our Bayesian HCA model allows us to take into account such structured feature vectors that possess different intrinsic blocking characteristics. The new probabilistic structured matrix factorization framework was applied to toy data and to neuroblastoma data collected by multiple high-throughput measurement devices, which had block-wise missing structures due to different experimental designs. HCA achieved a block-wise sparse factor-loading matrix, simultaneously representing the amount of information contained in each block of the dataset. While HCA provided missing value prediction performance better than or similar to existing methods such as BPCA or WLRMF, it captured the heterogeneous structure underlying the problem much more clearly. Furthermore, the derived HCA factors are an interesting representation that may ultimately lead to better modeling of the neuroblastoma data (see Section 5).\n\nIn the current HCA implementation, block structures were assumed to be known, as for the neuroblastoma data. Future work will address fully automatic estimation of the structure from measured multi-modal data and the respective model selection techniques to achieve this goal.\n\nClearly there is an increasing need for methods that can reliably extract factors from multi-modal structured data with heterogeneous features. Our future effort will therefore strive towards applications beyond bioinformatics and towards the design of novel structured spatio-temporal decomposition methods in applications such as electroencephalography (EEG), image, and audio analyses.\n\nAcknowledgement This work was supported by a Grant-in-Aid for Young Scientists (B) No. 19710172 from MEXT Japan.\n\nReferences\n\n[1] I. Nabney and C. Bishop. Netlab neural network software. http://www.ncrg.aston.ac.uk/netlab/, 1995.\n\n[2] C. M. Bishop. Bayesian PCA. In Advances in Neural Information Processing Systems 11, pages 382–388. MIT Press, Cambridge, MA, USA, 1999.\n\n[3] N. Srebro and T. Jaakkola. Weighted low-rank matrix approximations. In Proceedings of the 20th International Conference on Machine Learning, pages 720–727, 2003.\n\n[4] A. d'Aspremont, F. R. Bach, and L. El Ghaoui. Full regularization path for sparse principal component analysis. In Proceedings of the 24th International Conference on Machine Learning, 2007.\n\n[5] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, and S. Ishii. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16):2088–2096, 2003.\n\n[6] M. Ohira, S. Oba, Y. Nakamura, E. Isogai, S. Kaneko, A. Nakagawa, T. Hirata, H. Kubo, T. Goto, S. Yamada, Y. Yoshida, M. Fuchioka, S. Ishii, and A. Nakagawara. Expression profiling using a tumor-specific cDNA microarray predicts the prognosis of intermediate risk neuroblastomas. Cancer Cell, 7(4):337–350, Apr 2005.", "award": [], "sourceid": 440, "authors": [{"given_name": "Shigeyuki", "family_name": "Oba", "institution": null}, {"given_name": "Motoaki", "family_name": "Kawanabe", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Shin", "family_name": "Ishii", "institution": null}]}