{"title": "Feature-aware Label Space Dimension Reduction for Multi-label Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1529, "page_last": 1537, "abstract": "Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing ones to LSDR across many real-world datasets.", "full_text": "Feature-aware Label Space Dimension Reduction for\n\nMulti-label Classi\ufb01cation\n\nYao-Nan Chen\n\nDepartment of Computer Science\n\n& Information Engineering,\nNational Taiwan University\n\nr99922008@csie.ntu.edu.tw\n\nHsuan-Tien Lin\n\nDepartment of Computer Science\n\n& Information Engineering,\nNational Taiwan University\n\nhtlin@csie.ntu.edu.tw\n\nAbstract\n\nLabel space dimension reduction (LSDR) is an ef\ufb01cient and effective paradigm\nfor multi-label classi\ufb01cation with many classes. Existing approaches to LSDR,\nsuch as compressive sensing and principal label space transformation, exploit only\nthe label part of the dataset, but not the feature part. 
In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing LSDR approaches across many real-world datasets.

1 Introduction

The multi-label classification problem is an extension of the traditional multiclass classification problem. In contrast to the multiclass problem, which associates only a single label with each instance, the multi-label classification problem allows multiple labels for each instance. General solutions to this problem meet the demands of many real-world applications that classify instances into multiple concepts, including categorization of text [1], scenes [2], genes [3], and so on. Given the wide range of such applications, the multi-label classification problem has been attracting much attention from researchers in machine learning [4, 5, 6].
Label space dimension reduction (LSDR) is a new paradigm in multi-label classification [4, 5]. By viewing the set of multiple labels as a high-dimensional vector in some label space, LSDR approaches use certain assumed or observed properties of the vectors to “compress” them. The compression step transforms the original multi-label classification problem (with many labels) into a small number of learning tasks.
If the compression, learning, and de-compression steps are efficient and effective, LSDR approaches can be useful for multi-label classification because of the appropriate use of the joint information within the labels [5]. For instance, a representative LSDR approach is the principal label space transformation [PLST; 5], which takes advantage of the key linear correlations between labels to build a small number of regression tasks.
LSDR approaches are homologous to feature space dimension reduction (FSDR) approaches and share similar advantages: saving computational power and storage without much loss of prediction accuracy, and improving performance by removing irrelevant, redundant, or noisy information [7]. There are two types of FSDR approaches: unsupervised and supervised. Unsupervised FSDR considers only the feature information during reduction, while supervised FSDR considers the additional label information. A typical instance of unsupervised FSDR is principal component analysis [PCA; 8], which transforms the features into a small number of uncorrelated variables. On the other hand, supervised FSDR approaches include supervised principal component analysis [9], sliced inverse regression [10], and kernel dimension reduction [11]. In particular, for multi-label classification, a leading supervised FSDR approach is canonical correlation analysis [CCA; 6, 12], which is based on linear projections in both the feature space and the label space. In general, well-tuned supervised FSDR approaches can perform better than unsupervised ones because of the additional label information.
PLST can be viewed as the counterpart of PCA in the label space [5] and is feature-unaware. That is, it considers only the label information during reduction.
Motivated by the superiority of supervised FSDR over unsupervised approaches, we are interested in studying feature-aware LSDR: LSDR that considers feature information.
In this paper, we propose a novel feature-aware LSDR approach, conditional principal label space transformation (CPLST). CPLST combines the concepts of PLST (LSDR) and CCA (supervised FSDR) and improves PLST through the addition of feature information. We derive CPLST by minimizing an upper bound of the popular Hamming loss and show that CPLST can be accomplished by a simple use of singular value decomposition. Moreover, CPLST can be flexibly extended by the kernel trick with suitable regularization, thereby allowing the use of sophisticated feature information to assist LSDR. The experimental results on real-world datasets confirm that CPLST can reduce the number of learning tasks without loss of prediction performance. In particular, CPLST is usually better than PLST and other related LSDR approaches.
The rest of this paper is organized as follows. In Section 2, we define the multi-label classification problem and review related works. Then, in Section 3, we derive the proposed CPLST approach. Finally, we present the experimental results in Section 4 and conclude our study in Section 5.

2 Label Space Dimension Reduction

The multi-label classification problem aims at finding a classifier from the input vector x to a label set Y, where x ∈ R^d, Y ⊆ {1, 2, ..., K}, and K is the number of classes.
The label set Y is often conveniently represented as a label vector, y ∈ {0, 1}^K, where y[k] = 1 if and only if k ∈ Y. Given a dataset D = {(x_n, y_n)}_{n=1}^N, which contains N training examples (x_n, y_n), a multi-label classification algorithm uses D to find a classifier h: X → 2^{1, 2, ..., K}, anticipating that h predicts y well on any future (unseen) test example (x, y).
There are many existing algorithms for solving multi-label classification problems. The simplest and most intuitive one is binary relevance [BR; 13]. BR decomposes the original dataset D into K binary classification datasets, D_k = {(x_n, y_n[k])}_{n=1}^N, and learns K independent binary classifiers, each of which is learned from D_k and is responsible for predicting whether the label set Y includes label k. When K is small, BR is an efficient and effective baseline algorithm for multi-label classification. However, when K is large, the algorithm can be costly in training, prediction, and storage.
Facing these challenges, LSDR offers a potential solution by compressing the K-dimensional label space before learning. LSDR transforms D into M datasets D_m = {(x_n, t_n[m])}_{n=1}^N, m = 1, 2, ..., M, with M ≪ K, such that the multi-label classification problem can be tackled efficiently without significant loss of prediction performance. In particular, LSDR involves solving, predicting with, and storing the models for only M, instead of K, learning tasks.
For instance, compressive sensing [CS; 4], a precursor of LSDR, is based on the assumption that the label set vector y is sparse (i.e., contains few ones); CS “compresses” y to a shorter code vector t by projecting y onto M random directions v_1, ..., v_M, where M ≪ K can be determined according to the assumed sparsity level.
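To make this compression step concrete, here is a minimal Python sketch using numpy (a toy illustration: the sizes, the sparsity level, and the random data are our own assumptions, not taken from the paper's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, N = 20, 6, 5                             # labels, compressed dimension, examples
Y = (rng.random((N, K)) < 0.15).astype(float)  # sparse 0/1 label vectors, one per row

V = rng.standard_normal((M, K))                # M random projection directions v_1..v_M
T = Y @ V.T                                    # code vectors t_n = V y_n, one per row

print(T.shape)                                 # (5, 6): each K-dim label vector becomes M numbers
```

Each original K-dimensional label vector is thus replaced by an M-dimensional code, and only M regression tasks remain to be learned.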
CS transforms the original multi-label classification problem into M regression tasks with D_m = {(x_n, t_n[m])}_{n=1}^N, where t_n[m] = v_m^T y_n. After obtaining a multi-output regressor r(x) for predicting the code vector t, CS decodes r(x) to the optimal label set vector by solving an optimization problem for each input instance x under the sparsity assumption, which can be time-consuming.

2.1 Principal Label Space Transformation

Principal label space transformation [PLST; 5] is another approach to LSDR. PLST first shifts each label set vector y to z = y - ȳ, where ȳ = (1/N) Σ_{n=1}^N y_n is the estimated mean of the label set vectors. Then, PLST takes a matrix V that linearly maps z to the code vector t by t = Vz. Unlike CS, however, PLST takes the principal directions v_m (to be introduced next) rather than random ones, and does not need to solve an optimization problem during decoding.
In particular, PLST considers only a matrix V with orthogonal rows, and decodes r(x) to the predicted labels by h(x) = round(V^T r(x) + ȳ), which is called round-based decoding. Tai and Lin [5] prove that when using round-based decoding and a linear transformation V that contains orthogonal rows, the common Hamming loss for evaluating multi-label classifiers [14] is bounded by

    Training Hamming loss ≤ c ( ||r(X) - Z V^T||_F^2 + ||Z - Z V^T V||_F^2 ),    (1)

where r(X) contains r(x_n)^T as rows, Z contains z_n^T as rows, and c is a constant that depends on K and N. The matrix Z V^T then contains the code vectors t_n^T as rows.
The bound can be divided into two parts. The first part is ||r(X) - Z V^T||_F^2, which represents the prediction error from the regressor r(x_n) to the desired code vectors t_n. The second part is
The second part is\n(cid:107)Z \u2212 ZVT V(cid:107)2\nF , which stands for the encoding error for projecting zn into the closest vector in\nspan{v1,\u00b7\u00b7\u00b7 , vM}, which is VT tn.\nPLST is derived by minimizing the encoding error [5] and \ufb01nds the optimal M by K matrix V\nby applying the singular value decomposition on Z and take the M right-singular vectors vm that\ncorrespond to the M largest singular values. The M right-singular vectors are called the principal\ndirections for representing zn.\nPLST can be viewed as a linear case of the kernel dependency estimation (KDE) algorithm [15].\nNevertheless, the general nonlinear KDE must solve a computationally expensive pre-image prob-\nlem for each test input x during the prediction phase. The linearity of PLST avoids the pre-image\nproblem and enjoys ef\ufb01cient round-based decoding. In this paper, we will focus on the linear case\nin order to design ef\ufb01cient algorithms for LSDR during both the training and prediction phases.\n2.2 Canonical Correlation Analysis\n\nA related technique that we will consider in this paper is canonical correlation analysis [CCA;\n6], a well-known statistical technique for analyzing the linear relationship between two multi-\ndimensional variables. Traditionally, CCA is regarded as a FSDR approach in multi-label classi-\n\ufb01cation [12]. In this subsection, we discuss whether CCA can also be viewed as an LSDR approach.\nn (assumed to be zero mean) as\nFormally, given an N by d matrix X with the n-th row being xT\nn (assumed to be zero mean), CCA aims at\nwell as an N by K matrix Z with the n-th row being zT\nz ,\u00b7\u00b7\u00b7 ), such that the correlation\nx , w(2)\n\ufb01nding two lists of basis vectors, (w(1)\ncoef\ufb01cient between the canonical variables c(i)\nis maximized, under the\nfor 1 \u2264 j < i. 
Kettenring [16] showed\nconstraint that c(i)\nthat CCA is equivalent to simultaneously solving the following constrained optimization problem:\n\nx ,\u00b7\u00b7\u00b7 ) and (w(1)\nx = Xw(i)\nx and c(j)\nz\n\nx is uncorrelated to all other c(j)\n\nz = Zw(i)\nz\n\nx and c(i)\n\nz , w(2)\n\n(cid:13)(cid:13)XWT\n\nx \u2212 ZWT\n\nz\n\n(cid:13)(cid:13)2\n\nF\n\nmin\n\nWx,Wz\n\nsubject to WxXT XWT\n\nx = WzZT ZWT\n\nz = I,\n\n(2)\n\nT\n\nT\n, and Wz is the matrix with the i-th row (w(i)\nz )\n\nn as rows and Z is the shifted label matrix that contains the mean-shifted yT\n\nwhere Wx is the matrix with the i-th row (w(i)\n.\nx )\nWhen CCA is considered in the context of multi-label classi\ufb01cation, X is the matrix that contains the\nmean-shifted xT\nn as rows.\nTraditionally, CCA is used as a supervised FSDR approach that discards Wz and uses only Wx to\nproject features onto a lower-dimension space before learning with binary relevance [12, 17].\nOn the other hand, due to the symmetry between X and Z, we can also view CCA as an ap-\nproach to feature-aware LSDR. In particular, CCA is equivalent to \ufb01rst seeking projection direc-\ntions Wz of Z, and then performing a multi-output linear regression from xn to Wzzn, under the\nconstraints WxXT XWT\nx = I, to obtain Wx. However, it has not been seriously studied how to use\nCCA for LSDR because Wz does not contain orthogonal rows. That is, unlike PLST, round-based\ndecoding cannot be used and it remains to be an ongoing research issue for designing a suitable\ndecoding scheme with CCA [18].\n3 Proposed Algorithm\nInspired by CCA, we \ufb01rst design a variant that involves an appropriate decoding step. As suggested\nin Section 2.2, CCA is equivalent to \ufb01nding a projection that minimizes the squared prediction error\nz = I. 
If we drop the constraint on Wx in order\nunder the constraints WxXT XWT\nto further decrease the squared prediction error and change WzZT ZWT\nz = I in\n\nz = I to WzWT\n\nx = WzZT ZWT\n\n3\n\n\forder to enable round-based decoding, we obtain\n\n(cid:13)(cid:13)XWT\n\nx \u2212 ZWT\n\nz\n\n(cid:13)(cid:13)2\n\nF\n\nmin\n\nWx,Wz\n\nsubject to WzWT\n\nz = I\n\n(3)\n\nz , OCCA minimizes (cid:107)r(x) \u2212 ZWT\n\nProblem (3) preserves the original objective function of CCA and speci\ufb01es that Wz must con-\ntain orthogonal rows for applying round-based decoding. We call this algorithm orthogonally\nconstrained CCA (OCCA). Then, using the Hamming loss bound (1), when V = Wz and\nz (cid:107) in (1) with the hope that the Hamming loss\nr(x) = XWT\nis also minimized. In other words, OCCA is employed for the orthogonal directions V that are\n\u201ceasy to learn\u201d (of low prediction error) in terms of linear regression.\nFor every \ufb01xed Wz = V in (3), the optimization problem for Wx is simply a linear regression from\nx = X\u2020ZVT ,\nX to ZVT . Then, the optimal Wx can be computed by a closed-form solution WT\nwhere X\u2020 is the pseudo inverse of X. When the optimal Wx is inserted back into (3), the optimiza-\ntion problem becomes min\nVVT =I\n\n(cid:13)(cid:13)XX\u2020ZVT \u2212 ZVT(cid:13)(cid:13)2\n\nF which is equivalent to\n\ntr(cid:0)VZT (I \u2212 H) ZVT(cid:1) .\n\nmin\n\nVVT =I\n\n(4)\nThe matrix H = XX\u2020 is called the hat matrix for linear regression [19]. Similar to PLST, by Eckart-\nYoung theorem [20], we can solve problem (4) by considering the eigenvectors that correspond to\nthe largest eigenvalues of ZT (H \u2212 I)Z.\n3.1 Conditional Principal Label Space Transformation\n\nFrom the previous discussions, OCCA captures the input-output relation to minimize the prediction\nerror in bound (1) with the \u201ceasy\u201d directions. In contrast, PLST minimizes the encoding error in\nbound (1) with the \u201cprincipal\u201d directions. 
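The OCCA directions derived above can be sketched in a few lines of numpy (a toy illustration on synthetic data; `occa_directions` is our own helper name, not from the paper): the rows of V are the eigenvectors of Z^T (H - I) Z with the largest eigenvalues.

```python
import numpy as np

def occa_directions(X, Z, M):
    """Top-M "easy to learn" directions: eigenvectors of Z^T (H - I) Z
    with the largest eigenvalues, where H = X X^+ is the hat matrix."""
    H = X @ np.linalg.pinv(X)
    S = Z.T @ (H - np.eye(len(X))) @ Z
    S = (S + S.T) / 2                       # symmetrize against round-off
    w, U = np.linalg.eigh(S)                # eigenvalues in ascending order
    V = U[:, np.argsort(w)[::-1][:M]].T     # rows of V: top-M eigenvectors
    return V

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))
Z = rng.standard_normal((50, 4))
V = occa_directions(X, Z, M=2)
print(np.allclose(V @ V.T, np.eye(2)))      # True: orthogonal rows, as (3) requires
```

Because eigenvectors of a symmetric matrix are orthonormal, the constraint V V^T = I holds by construction.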
Now, we combine the benefits of the two algorithms and minimize the two error terms simultaneously with the “conditional principal” directions. We begin by continuing our derivation of OCCA, which obtains r(x) by a linear regression from X to Z V^T. If we minimize both terms in (1) together with such a linear regression, the optimization problem becomes

    min_{W, V V^T = I} c ( ||X W^T - Z V^T||_F^2 + ||Z - Z V^T V||_F^2 )    (5)
    ⇒ min_{V V^T = I} tr( V Z^T (I - H) Z V^T ) - tr( V^T V Z^T Z ) - tr( Z^T Z V^T V ) + tr( V^T V Z^T Z V^T V )
    ⇒ max_{V V^T = I} tr( V Z^T H Z V^T ).    (6)

Problem (6) is derived by cyclic permutations that eliminate a pair of V and V^T and combine the last three terms of (5). The problem can again be solved by taking the eigenvectors with the largest eigenvalues of Z^T H Z as the rows of V. Such a matrix V minimizes the prediction error term and the encoding error term simultaneously. The resulting algorithm is called conditional principal label space transformation (CPLST), as shown in Algorithm 1.

Algorithm 1 Conditional Principal Label Space Transformation
1: Let Z = [z_1 ... z_N]^T with z_n = y_n - ȳ.
2: Perform SVD on Z^T H Z to obtain Z^T H Z = A Σ B with σ_1 ≥ σ_2 ≥ ... . Let V_M contain the top M rows of B.
3: Encode {(x_n, y_n)}_{n=1}^N to {(x_n, t_n)}_{n=1}^N, where t_n = V_M z_n.
4: Learn a multi-dimension regressor r(x) from {(x_n, t_n)}_{n=1}^N.
5: Predict the label-set of an instance x by h(x) = round(V_M^T r(x) + ȳ).

CPLST balances the prediction error with the encoding error and is closely related to bound (1). Moreover, in contrast with PLST, which uses the key unconditional correlations, CPLST is feature-aware and allows the capture of conditional correlations [14].
We summarize the three algorithms in Table 1, and we will compare them empirically in Section 4. The three algorithms are similar: they all operate with an SVD (or eigenvalue decomposition) on a K by K matrix. PLST focuses on the encoding error and does not consider the features during LSDR, i.e., it is feature-unaware. On the other hand, CPLST and OCCA are feature-aware approaches, which consider features during LSDR. When using linear regression as the multi-output

Table 1: Summary of three LSDR algorithms

Algorithm  Matrix for SVD   LSDR             Relation to bound (1)
PLST       Z^T Z            feature-unaware  minimizes the encoding error
OCCA       Z^T (H - I) Z    feature-aware    minimizes the prediction error
CPLST      Z^T H Z          feature-aware    minimizes both

regressor, CPLST simultaneously minimizes the two terms in bound (1), while OCCA minimizes only one term of the bound.
In contrast to PLST, the two feature-aware approaches OCCA and CPLST must calculate the matrix H and are thus slower than PLST when the dimension d of the input space is large.

3.2 Kernelization and Regularization

Kernelization, which extends a linear model to a nonlinear one using the kernel trick [21], and regularization are two important techniques in machine learning. The former expands the power of linear models, while the latter regularizes the complexity of the learning model.
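Before moving on to the kernel extension, Algorithm 1 with linear regression as the regressor can be sketched as follows (a toy illustration with numpy on synthetic data; the helper names are ours, and the eigendecomposition of the symmetric matrix Z^T H Z plays the role of the SVD in step 2):

```python
import numpy as np

def cplst_fit(X, Y, M):
    """Sketch of Algorithm 1 coupled with linear regression."""
    y_bar = Y.mean(axis=0)
    Z = Y - y_bar                           # step 1: shifted label matrix
    H = X @ np.linalg.pinv(X)               # hat matrix of linear regression
    S = Z.T @ H @ Z
    S = (S + S.T) / 2
    w, U = np.linalg.eigh(S)
    V = U[:, np.argsort(w)[::-1][:M]].T     # step 2: top-M eigenvectors of Z^T H Z
    T = Z @ V.T                             # step 3: code vectors t_n = V z_n
    W = np.linalg.pinv(X) @ T               # step 4: multi-output linear regression
    return V, W, y_bar

def cplst_predict(X, V, W, y_bar):
    return np.round(X @ W @ V + y_bar)      # step 5: round-based decoding

rng = np.random.default_rng(2)
X = rng.standard_normal((80, 10))
Y = (rng.random((80, 6)) < 0.3).astype(float)
V, W, y_bar = cplst_fit(X, Y, M=3)
Y_hat = cplst_predict(X, V, W, y_bar)
print(Y_hat.shape)                          # (80, 6)
```

The decoded predictions live back in the original K-dimensional label space even though only M regression tasks were learned.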
In this subsection, we show that kernelization and regularization can be applied to CPLST (and OCCA).
In Section 3.1, we derived CPLST by using linear regression as the underlying multi-output regression method. Next, we replace linear regression by its kernelized form with ℓ2 regularization, kernel ridge regression [22], as the underlying regression algorithm. Kernel ridge regression considers a feature mapping Φ: X → F before performing regularized linear regression. According to Φ, the kernel function k(x, x') = Φ(x)^T Φ(x') is defined as the inner product in the space F. When applying kernel ridge regression with a regularization parameter λ to map from X to Z V^T, if Φ(x) can be explicitly computed, it is known that the closed-form solution is [22]

    W = Φ^T (λI + Φ Φ^T)^{-1} Z V^T = Φ^T (λI + K)^{-1} Z V^T,    (7)

where Φ is the matrix containing Φ(x_n)^T as rows, and K is the matrix with K_ij = k(x_i, x_j) = Φ(x_i)^T Φ(x_j); that is, K = Φ Φ^T, and K is called the kernel matrix of X.
Now, we derive kernel-CPLST by inserting the optimal W into the Hamming loss bound (1). Substituting (7) into the minimization of the loss bound (1) with r(X) = Φ W, and letting Q = (λI + K)^{-1},

    min_{V V^T = I} c ( ||Φ Φ^T Q Z V^T - Z V^T||_F^2 + ||Z - Z V^T V||_F^2 )
    ⇒ min_{V V^T = I} ( ||K Q Z V^T - Z V^T||_F^2 + ||Z - Z V^T V||_F^2 )
    ⇒ max_{V V^T = I} tr( V Z^T (2KQ - QKKQ) Z V^T ).    (8)

Notice that in equation (8), kernel-CPLST does not need to explicitly compute the matrix Φ; it only needs the kernel matrix K, which can be computed through the kernel function k.
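A minimal numpy sketch of kernel-CPLST following these formulas (synthetic data; the Gaussian kernel, the parameter values, and the helper names are our own assumptions for illustration):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_cplst_fit(X, Y, M, lam=1.0, gamma=0.5):
    y_bar = Y.mean(axis=0)
    Z = Y - y_bar
    Km = gaussian_kernel(X, X, gamma)               # kernel matrix K
    Q = np.linalg.inv(lam * np.eye(len(X)) + Km)    # Q = (lambda I + K)^{-1}
    S = Z.T @ (2 * Km @ Q - Q @ Km @ Km @ Q) @ Z    # matrix of problem (8)
    S = (S + S.T) / 2
    w, U = np.linalg.eigh(S)
    V = U[:, np.argsort(w)[::-1][:M]].T             # rows of V: top-M eigenvectors
    A = Q @ Z @ V.T                                 # dual coefficients of kernel ridge
    return V, A, y_bar

def kernel_cplst_predict(Xtr, Xte, V, A, y_bar, gamma=0.5):
    r = gaussian_kernel(Xte, Xtr, gamma) @ A        # kernel ridge prediction of codes
    return np.round(r @ V + y_bar)                  # round-based decoding

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
Y = (rng.random((40, 4)) < 0.3).astype(float)
V, A, y_bar = kernel_cplst_fit(X, Y, M=2)
print(kernel_cplst_predict(X, X, V, A, y_bar).shape)   # (40, 4)
```

As in the derivation, the feature map Φ never appears explicitly: training and prediction touch only kernel evaluations.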
Therefore, a high- or even infinite-dimensional feature transform can be used to assist LSDR in kernel-CPLST through a suitable kernel function. Problem (8) can again be solved by taking the eigenvectors with the largest eigenvalues of Z^T (2KQ - QKKQ) Z as the rows of V.

4 Experiment

In this section, we conduct experiments on eight real-world datasets, downloaded from Mulan [23], to validate the performance of CPLST and other LSDR approaches. Table 2 shows the number of labels of each dataset. Because kernel ridge regression, and hence kernel-CPLST, needs to invert an N by N matrix, we can only afford to conduct a fair comparison on mid-sized datasets. In each run of the experiment, we randomly sample 80% of the dataset for training and reserve the rest for testing. All the results are reported with the mean and the standard error over 100 different random runs.

Table 2: The number of labels of each dataset

Dataset        bib.  cor.  emo.  enr.  gen.  med.  sce.  yea.
# Labels (K)   159   374   6     53    27    45    6     14

We take PLST, OCCA, CPLST, and kernel-CPLST in our comparison. We do not include compressive sensing [4] in the comparison because earlier work [24] has shown that the algorithm is more sophisticated while being inferior to PLST. We conducted some side experiments on CCA [6] for LSDR (see Subsection 2.2) and found that it is at best comparable to OCCA. Given the space constraints, we decided to report only the results on OCCA. In addition to these LSDR approaches, we also consider a simple baseline approach [24], partial binary relevance (PBR). PBR randomly selects M labels from the original label set during training and learns only those M binary classifiers for prediction.

Figure 1: yeast: test results of the LSDR algorithms when coupled with linear regression. (a) Hamming loss; (b) encoding error; (c) prediction error; (d) loss bound.
For the other labels, PBR directly predicts -1 without any training, to match the sparsity assumption exploited by compressive sensing [4].

4.1 Label Space Dimension Reduction with Linear Regression

In this subsection, we couple PBR, OCCA, PLST, and CPLST with linear regression. The yeast dataset reveals clear differences between the four LSDR approaches and is hence taken for presentation here; similar differences have been observed on the other datasets as well. Figure 1(a) shows the test Hamming loss with respect to M, the number of dimensions used. It is clear that CPLST is better than the other three approaches. PLST can reach performance similar to CPLST only at a larger M. The other two algorithms, OCCA and PBR, are both significantly worse than CPLST.
To understand the cause of the performance differences, we plot the (test) encoding error ||Z - Z V^T V||_F^2, the prediction error ||X W^T - Z V^T||_F^2, and the loss bound (1) in Figure 1. Figure 1(b) shows the encoding error on the test set, which matches the design of PLST. Regardless of the approach used, the encoding error decreases to 0 when using all 14 dimensions, because the {v_m} can then span the whole label space. As expected, PLST achieves the lowest encoding error across every number of dimensions. CPLST partially minimizes the encoding error in its objective function, and hence also achieves a decent encoding error. On the other hand, OCCA is blind to, and hence worst at, the encoding error. In particular, its encoding error is even worse than that of the baseline PBR.
Figure 1(c) shows the prediction error ||X W^T - Z V^T||_F^2 on the test set, which matches the design of OCCA. First, OCCA indeed achieves the lowest prediction error across all numbers of dimensions. PLST, which is blind to the prediction error, reaches the highest prediction error, and is even worse than PBR.
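The two error terms plotted in the figures can be computed directly from bound (1); a small numpy sketch (synthetic data and our own helper name, for illustration only):

```python
import numpy as np

def bound_terms(X, Z, V):
    """Prediction error ||X W^T - Z V^T||_F^2 (with the linear-regression
    optimum W) and encoding error ||Z - Z V^T V||_F^2 from bound (1)."""
    T = Z @ V.T                                   # code vectors for the given V
    W_T = np.linalg.pinv(X) @ T                   # optimal linear regression of the codes
    pred_err = np.linalg.norm(X @ W_T - T) ** 2
    enc_err = np.linalg.norm(Z - T @ V) ** 2
    return pred_err, enc_err

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 7))
Z = rng.standard_normal((60, 5)); Z -= Z.mean(axis=0)
Vfull, _, _ = np.linalg.svd(rng.standard_normal((5, 5)))  # a full orthonormal basis
pe, ee = bound_terms(X, Z, Vfull[:3])             # an arbitrary orthonormal V with M = 3
print(pe >= 0 and ee >= 0)                        # True
```

With the full basis (M = K), the encoding error vanishes, matching the observation that all approaches reach zero encoding error at 14 dimensions on yeast.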
The results further reveal the trade-off between the encoding error and the prediction error: a more efficient encoding of the label space is harder to predict. PLST takes efficient encoding to the extreme and suffers in prediction error; OCCA, on the other hand, is better in terms of the prediction error, but leads to the least efficient encoding.
Figure 1(d) shows the scaled upper bound (1) of the Hamming loss, which equals the sum of the encoding error and the prediction error. CPLST is designed to knock down this bound, which explains its behavior in Figure 1(d) and echoes its superior performance in Figure 1(a). In fact, Figure 1(d) shows that the bound (1) is quite indicative of the performance differences in Figure 1(a). The results demonstrate that CPLST explores the trade-off between the encoding error and the prediction error in an optimal manner to reach the best performance for label space dimension reduction.
The results of PBR and OCCA are consistently inferior to those of PLST and CPLST across most of the datasets in our experiments [25] and are not reported here because of space constraints.

Table 3: Test Hamming loss of PLST and CPLST with linear regression (those within one standard error of the lower one are in bold)

Dataset    Algorithm  M = 20%K         40%              60%              80%              100%
bibtex     PLST       0.0129 ± 0.0000  0.0125 ± 0.0000  0.0124 ± 0.0000  0.0123 ± 0.0000  0.0123 ± 0.0000
           CPLST      0.0127 ± 0.0000  0.0124 ± 0.0000  0.0123 ± 0.0000  0.0123 ± 0.0000  0.0123 ± 0.0000
corel5k    PLST       0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000
           CPLST      0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000
emotions   PLST       0.2207 ± 0.0020  0.2064 ± 0.0023  0.1982 ± 0.0022  0.2013 ± 0.0020  0.2040 ± 0.0022
           CPLST      0.2189 ± 0.0019  0.2059 ± 0.0022  0.1990 ± 0.0022  0.2015 ± 0.0021  0.2040 ± 0.0022
enron      PLST       0.0728 ± 0.0004  0.0860 ± 0.0005  0.0946 ± 0.0006  0.1006 ± 0.0007  0.1028 ± 0.0007
           CPLST      0.0729 ± 0.0004  0.0864 ± 0.0005  0.0943 ± 0.0006  0.1006 ± 0.0007  0.1028 ± 0.0007
genbase    PLST       0.0169 ± 0.0004  0.0040 ± 0.0002  0.0012 ± 0.0001  0.0009 ± 0.0001  0.0007 ± 0.0001
           CPLST      0.0168 ± 0.0004  0.0041 ± 0.0002  0.0012 ± 0.0001  0.0008 ± 0.0001  0.0007 ± 0.0001
medical    PLST       0.0346 ± 0.0004  0.0407 ± 0.0005  0.0472 ± 0.0005  0.0490 ± 0.0005  0.0497 ± 0.0006
           CPLST      0.0346 ± 0.0004  0.0406 ± 0.0005  0.0471 ± 0.0005  0.0490 ± 0.0005  0.0497 ± 0.0006
scene      PLST       0.1809 ± 0.0004  0.1718 ± 0.0006  0.1566 ± 0.0007  0.1321 ± 0.0008  0.1106 ± 0.0008
           CPLST      0.1744 ± 0.0004  0.1532 ± 0.0005  0.1349 ± 0.0005  0.1209 ± 0.0007  0.1106 ± 0.0008
yeast      PLST       0.2150 ± 0.0008  0.2052 ± 0.0009  0.2033 ± 0.0009  0.2020 ± 0.0009  0.2022 ± 0.0009
           CPLST      0.2069 ± 0.0008  0.2041 ± 0.0009  0.2024 ± 0.0009  0.2020 ± 0.0009  0.2022 ± 0.0009

Table 4: Test Hamming loss of LSDR algorithm with M5P (those with the lowest mean are marked with *; those within one standard error of the lowest one are in bold)

Dataset    Algorithm  M = 20%K          40%               60%               80%               100%
bibtex     PLST       0.0130 ± 0.0001   0.0128 ± 0.0001*  0.0128 ± 0.0001   0.0127 ± 0.0001*  0.0127 ± 0.0001*
           CPLST      0.0129 ± 0.0001*  0.0128 ± 0.0001*  0.0127 ± 0.0001*  0.0127 ± 0.0001*  0.0127 ± 0.0001*
corel5k    PLST       0.0094 ± 0.0000*  0.0094 ± 0.0000*  0.0094 ± 0.0000*  0.0094 ± 0.0000*  0.0094 ± 0.0000*
           CPLST      0.0094 ± 0.0000*  0.0094 ± 0.0000*  0.0094 ± 0.0000*  0.0094 ± 0.0000*  0.0094 ± 0.0000*
emotions   PLST       0.2213 ± 0.0030   0.2109 ± 0.0030   0.2039 ± 0.0029   0.2051 ± 0.0029   0.2063 ± 0.0030
           CPLST      0.2209 ± 0.0031*  0.2085 ± 0.0032*  0.2004 ± 0.0031*  0.2020 ± 0.0031*  0.2046 ± 0.0031*
enron      PLST       0.0490 ± 0.0002   0.0488 ± 0.0002*  0.0489 ± 0.0002*  0.0490 ± 0.0002*  0.0490 ± 0.0002*
           CPLST      0.0489 ± 0.0003*  0.0489 ± 0.0003   0.0490 ± 0.0003   0.0490 ± 0.0003*  0.0490 ± 0.0003*
genbase    PLST       0.0215 ± 0.0004*  0.0202 ± 0.0004*  0.0195 ± 0.0003*  0.0194 ± 0.0003*  0.0194 ± 0.0003*
           CPLST      0.0215 ± 0.0004*  0.0202 ± 0.0004*  0.0195 ± 0.0003*  0.0195 ± 0.0003   0.0195 ± 0.0003
medical    PLST       0.0127 ± 0.0002   0.0099 ± 0.0002*  0.0097 ± 0.0002   0.0097 ± 0.0002   0.0097 ± 0.0002
           CPLST      0.0126 ± 0.0002*  0.0099 ± 0.0002*  0.0096 ± 0.0002*  0.0096 ± 0.0002*  0.0096 ± 0.0002*
scene      PLST       0.1802 ± 0.0005   0.1688 ± 0.0007   0.1540 ± 0.0008   0.1396 ± 0.0011   0.1281 ± 0.0008
           CPLST      0.1674 ± 0.0005   0.1538 ± 0.0006*  0.1428 ± 0.0007*  0.1289 ± 0.0007*  0.1268 ± 0.0008*
yeast      PLST       0.2162 ± 0.0008   0.2082 ± 0.0009   0.2071 ± 0.0009   0.2064 ± 0.0009*  0.2067 ± 0.0009
           CPLST      0.2083 ± 0.0009*  0.2064 ± 0.0009*  0.2063 ± 0.0009*  0.2064 ± 0.0009*  0.2066 ± 0.0009*
The test Hamming loss achieved by PLST and CPLST on the other datasets with different percentages of used labels is reported in Table 3. On most datasets, CPLST is at least as effective as PLST; on bibtex, scene, and yeast, CPLST performs significantly better than PLST.

Note that on the medical and enron datasets, both PLST and CPLST overfit when using many dimensions. That is, the performance of both algorithms would be better when using fewer dimensions (than the full binary relevance, which is provably equivalent to either PLST or CPLST with M = K when using linear regression). These results demonstrate that LSDR approaches, like their feature space dimension reduction counterparts, can potentially help resolve the issue of overfitting.

4.2 Coupling Label Space Dimension Reduction with the M5P Decision Tree

CPLST is designed by assuming a specific regression method. Next, we demonstrate that the input-output relationship captured by CPLST is not restricted to coupling with linear regression, but can be effective for other regression methods in the learning stage (step 4 of Algorithm 1). We do so by coupling the LSDR approaches with the M5P decision tree [26], a nonlinear regression method. We take the implementation of M5P from WEKA [27] with the default parameter setting.

The experimental results are shown in Table 4. The relations between PLST and CPLST when coupled with M5P are similar to those when coupled with linear regression. In particular, on yeast, scene, and emotions, CPLST outperforms PLST. The results demonstrate that the captured input-output relation is also effective for regression methods other than linear regression.

4.3 Label Space Dimension Reduction with Kernel Ridge Regression

In this subsection, we conduct experiments demonstrating the performance of kernelization and regularization. For kernel-CPLST, we use the Gaussian kernel k(x_i, x_j) = exp(-γ ||x_i - x_j||^2) during LSDR and take kernel ridge regression, with the same kernel and the same regularization parameter, as the underlying multi-output regression method. We also couple PLST with kernel ridge regression for a fair comparison. We select the Gaussian kernel parameter γ and the regularization parameter λ with a grid search on (log2 λ, log2 γ) using 5-fold cross-validation on the sum of the Hamming loss across all dimensions. The details of the grid search can be found in the Master's thesis of the first author [25].

Table 5: Test Hamming loss of LSDR algorithms with kernel ridge regression

Dataset   Algorithm      M = 20%K         40%              60%              80%              100%
bibtex    PLST           0.0151 ± 0.0000  0.0151 ± 0.0000  0.0151 ± 0.0000  0.0151 ± 0.0000  0.0151 ± 0.0000
          kernel-CPLST   0.0127 ± 0.0000  0.0123 ± 0.0000  0.0121 ± 0.0000  0.0120 ± 0.0000  0.0120 ± 0.0000
corel5k   PLST           0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000  0.0094 ± 0.0000
          kernel-CPLST   0.0092 ± 0.0000  0.0092 ± 0.0000  0.0092 ± 0.0000  0.0092 ± 0.0000  0.0092 ± 0.0000
emotions  PLST           0.2218 ± 0.0020  0.2074 ± 0.0023  0.1983 ± 0.0026  0.2000 ± 0.0025  0.2002 ± 0.0025
          kernel-CPLST   0.2231 ± 0.0020  0.2071 ± 0.0024  0.1981 ± 0.0025  0.1973 ± 0.0027  0.1988 ± 0.0027
enron     PLST           0.0460 ± 0.0002  0.0462 ± 0.0002  0.0466 ± 0.0002  0.0468 ± 0.0002  0.0469 ± 0.0002
          kernel-CPLST   0.0453 ± 0.0002  0.0454 ± 0.0002  0.0455 ± 0.0002  0.0455 ± 0.0002  0.0456 ± 0.0002
genbase   PLST           0.0169 ± 0.0004  0.0039 ± 0.0002  0.0014 ± 0.0001  0.0010 ± 0.0001  0.0008 ± 0.0001
          kernel-CPLST   0.0170 ± 0.0004  0.0040 ± 0.0002  0.0013 ± 0.0001  0.0009 ± 0.0001  0.0008 ± 0.0001
medical   PLST           0.0136 ± 0.0002  0.0106 ± 0.0002  0.0103 ± 0.0002  0.0102 ± 0.0002  0.0102 ± 0.0002
          kernel-CPLST   0.0131 ± 0.0002  0.0098 ± 0.0002  0.0096 ± 0.0002  0.0096 ± 0.0002  0.0096 ± 0.0002
scene     PLST           0.1713 ± 0.0004  0.1468 ± 0.0006  0.1173 ± 0.0008  0.0932 ± 0.0011  0.0731 ± 0.0007
          kernel-CPLST   0.1733 ± 0.0004  0.1470 ± 0.0006  0.1179 ± 0.0007  0.0905 ± 0.0007  0.0717 ± 0.0007
yeast     PLST           0.2030 ± 0.0008  0.1913 ± 0.0009  0.1892 ± 0.0009  0.1882 ± 0.0009  0.1881 ± 0.0009
          kernel-CPLST   0.2018 ± 0.0008  0.1904 ± 0.0009  0.1875 ± 0.0009  0.1869 ± 0.0009  0.1868 ± 0.0009

(those within one standard error of the lower one are in bold)

When coupled with kernel ridge regression, the comparison between PLST and kernel-CPLST in terms of the Hamming loss is shown in Table 5. kernel-CPLST performs well for LSDR and outperforms the feature-unaware PLST in most cases. In particular, on five out of the eight datasets, kernel-CPLST is significantly better than PLST regardless of the number of dimensions used. In addition, on the medical and enron datasets, the overfitting problem is eliminated with regularization (and parameter selection); hence kernel-CPLST not only performs better than PLST with kernel ridge regression, but is also better than the (unregularized) linear regression results in Table 3.

From the previous comparisons between CPLST and PLST, CPLST is at least as good as, and usually better than, PLST.
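The kernel ridge regression used above as the underlying multi-output regressor admits a closed-form dual solution. A minimal sketch with the Gaussian kernel k(x_i, x_j) = exp(-γ ||x_i - x_j||^2); the helper names and the toy data are our own illustration, not the paper's implementation:

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for all pairs
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, T, gamma, lam):
    """Dual-form kernel ridge regression: alpha = (K + lam I)^{-1} T.
    T may have multiple columns, giving one regressor per output."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), T)

def kernel_ridge_predict(X_train, alpha, X_test, gamma):
    # prediction is a kernel expansion over the training points
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

# toy usage: four points, one target column (an XOR-like pattern)
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
alpha = kernel_ridge_fit(X, T, gamma=1.0, lam=1e-8)
pred = kernel_ridge_predict(X, alpha, X, gamma=1.0)
```

With a very small λ the fit nearly interpolates the training targets; in the experiments above, γ and λ are instead selected by the grid search with cross-validation.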
The difference between CPLST and PLST is small but consistent, and suggests that CPLST is a better choice for label space dimension reduction. The results provide practical insights on the two types of label correlation [14]: unconditional correlation (feature-unaware) and conditional correlation (feature-aware). The unconditional correlation, exploited by PLST and other LSDR algorithms, readily leads to promising performance in practice. On the other hand, there is room for some (albeit small) improvement when the conditional correlation is exploited properly, as in CPLST.

5 Conclusion

In this paper, we studied feature-aware label space dimension reduction (LSDR) approaches, which utilize the feature information during LSDR and can be viewed as the counterpart of supervised feature space dimension reduction. We proposed a novel feature-aware LSDR algorithm, conditional principal label space transformation (CPLST), which utilizes the key conditional correlations for dimension reduction. CPLST enjoys a theoretical guarantee of balancing the prediction error and the encoding error when minimizing the Hamming loss bound. In addition, we extended CPLST to a kernelized version for capturing more sophisticated relations between features and labels. We conducted experiments comparing CPLST and its kernelized version with other LSDR approaches. The experimental results demonstrated that CPLST is the best among the LSDR approaches when coupled with linear regression or kernel ridge regression. In particular, CPLST is consistently better than its feature-unaware precursor, PLST. Moreover, the input-output relation captured by CPLST can be utilized by regression methods other than linear regression.

Acknowledgments

We thank the anonymous reviewers of the conference and members of the Computational Learning Laboratory at National Taiwan University for valuable suggestions.
This work was partially supported by the National Science Council of Taiwan via grant NSC 101-2628-E-002-029-MY2.

References

[1] I. Katakis, G. Tsoumakas, and I. Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2008 Discovery Challenge, 2008.
[2] M. Boutell, J. Luo, X. Shen, and C. Brown. Learning multi-label scene classification. Pattern Recognition, 2004.
[3] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, 2001.
[4] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In Advances in Neural Information Processing Systems 22, 2009.
[5] F. Tai and H.-T. Lin. Multi-label classification with principal label space transformation. Neural Computation, 2012.
[6] H. Hotelling. Relations between two sets of variates. Biometrika, 1936.
[7] M. Wall, A. Rechtsteiner, and L. Rocha. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis, 2003.
[8] I. Jolliffe. Principal Component Analysis. Springer, second edition, 2002.
[9] E. Barshan, A. Ghodsi, Z. Azimifar, and M. Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 2011.
[10] K.-C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 1991.
[11] K. Fukumizu, F. Bach, and M. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 2004.
[12] L. Sun, S. Ji, and J. Ye. Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
[13] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer US, 2010.
[14] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 2012.
[15] J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel dependency estimation. In Advances in Neural Information Processing Systems 15, 2002.
[16] J. Kettenring. Canonical analysis of several sets of variables. Biometrika, 1971.
[17] S. Yu, K. Yu, V. Tresp, and H.-P. Kriegel. Multi-output regularized feature projection. IEEE Transactions on Knowledge and Data Engineering, 2006.
[18] Y. Zhang and J. Schneider. Multi-label output codes using canonical correlation analysis. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[19] D. Hoaglin and R. Welsch. The hat matrix in regression and ANOVA. The American Statistician, 1978.
[20] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1936.
[21] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, first edition, 2002.
[22] G. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
[23] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A Java library for multi-label learning. Journal of Machine Learning Research, 2011.
[24] B. Datta. Numerical Linear Algebra and Applications. SIAM, second edition, 2010.
[25] Y.-N. Chen. Feature-aware label space dimension reduction for multi-label classification problem. Master's thesis, National Taiwan University, 2012.
[26] Y. Wang and I. Witten. Induction of model trees for predicting continuous classes. In Poster Papers of the Ninth European Conference on Machine Learning, 1997.
[27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 2009.