{"title": "Maximum Covariance Unfolding: Manifold Learning for Bimodal Data", "book": "Advances in Neural Information Processing Systems", "page_first": 918, "page_last": 926, "abstract": "We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming.", "full_text": "Maximum Covariance Unfolding: Manifold Learning for Bimodal Data

Vijay Mahadevan, Department of ECE, University of California, San Diego, La Jolla, CA 92093, vmahadev@ucsd.edu
Chi Wah Wong, Department of Radiology, University of California, San Diego, La Jolla, CA 92093, cwwong@ieee.org
Jose Costa Pereira, Department of ECE, University of California, San Diego, La Jolla, CA 92093, josecp@ucsd.edu
Thomas T. Liu, Department of Radiology, University of California, San Diego, La Jolla, CA 92093, ttliu@ucsd.edu
Nuno Vasconcelos, Department of ECE, University of California, San Diego, La Jolla, CA 92093, nvasconcelos@ucsd.edu
Lawrence K. Saul, Department of CSE, University of California, San Diego, La Jolla, CA 92093, saul@cs.ucsd.edu

Abstract

We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming.

1 Introduction

Recent advances in manifold learning and nonlinear dimensionality reduction have led to powerful, new methods for the analysis and visualization of high dimensional data [14, 1, 20, 24, 16]. These methods have roots in nonparametric statistics, spectral graph theory, convex optimization, and multidimensional scaling. 
Notwithstanding individual differences in motivation and approach, these\nmethods share certain features that account for their overall popularity: (i) they generally involve\nfew tuning parameters; (ii) they make no strong distributional assumptions; (iii) ef\ufb01cient algorithms\nexist to compute the global minima of their cost functions.\n\n1\n\n\fAll these methods solve variants of the same basic underlying problem: given high dimensional\ninputs, {x1, x2, . . . , xn}, compute low dimensional outputs {y1, y2, . . . , yn} that preserve certain\nnearness relations (e.g., local distances). Solutions to this problem have found applications in many\nareas of science and engineering. However, many real-world applications do not map neatly into this\nframework. For instance, in certain applications, aligned data is acquired from two different modal-\nities \u2014 we refer to such data as bimodal \u2014 and the goal is to \ufb01nd low dimensional representations\nthat capture their interdependencies.\nIn this paper, we investigate the use of maximum variance unfolding (MVU) [24] for the simultane-\nous dimensionality reduction of data from different input modalities. Though the original algorithm\ndoes not solve this problem, we show that it can be adapted to provide a compelling solution. In its\noriginal formulation, MVU computes a low dimensional embedding that maximizes the variance of\nits outputs, subject to constraints that preserve local distances. We explore a modi\ufb01cation of MVU\nthat computes a joint embedding of high dimensional inputs from different data sources. In this\njoint embedding, our goal is to discover a common low dimensional representation of just those\ndegrees of variability that are correlated across different modalities. To achieve this goal, we design\nthe embedding to maximize the inter-source correlation between aligned outputs while preserving\nthe local, intra-source distances. 
By analogy to MVU, we call our approach maximum covariance\nunfolding (MCU).\nThe optimization for MCU inherits the basic form of the optimization for MVU. In particular, it can\nbe cast as a semide\ufb01nite program (SDP). For applications to large datasets, we can also exploit the\nsame strategies behind recent, much faster implementations of MVU [25]. In particular, using these\nsame strategies, we show how to reformulate the optimization for MCU as a semide\ufb01nite quadratic\nlinear program (SQLP). In addition, for one of our applications\u2014the analysis of EEG-fMRI data\u2014\nwe show how to extend the basic optimization of MCU to visualize the high dimensional correlations\nbetween different input modalities. This is done by adding extra variables to the original SDP; these\nvariables can be viewed as performing a type of metric learning in the high dimensional voxel space.\nIn particular, they indicate which fMRI voxels (in the high dimensional space of fMRI images)\ncorrelate most strongly with observed changes in the EEG recordings.\nAs related work, we mention several other studies that have proposed SDPs to achieve different\nobjectives than those of the original algorithm for MVU. Bowling et al [4, 5] developed a related\napproach known as action-respecting embedding for problems in robot localization. Song et al [18]\nreinterpreted the optimization criterion of MVU, then proposed an extension of the original algo-\nrithm that computes low dimensional embeddings subject to class labels or other side information.\nFinally, Shaw and Jebara [15, 16] have explored related SDPs to produce minimum-volume and\nstructure-preserving embeddings; these SDPs yield much more sensible visualizations of social net-\nworks and large graphs that do not necessarily resemble a discretized manifold. 
Our work builds\non the successes of these earlier studies and further extends the applicability of SDPs for nonlinear\ndimensionality reduction.\n\n2 Maximum Covariance Unfolding\nWe propose a novel adaptation of MVU, termed maximum covariance unfolding or MCU to perform\nnon-linear correlation between two aligned datasets whose points have a one-to-one correspondence.\nMCU embeds the two datasets, of different dimensions, into a single low dimensional manifold such\nthat the two resulting embeddings are maximally correlated. As in MVU, the embeddings are such\nthat local distances are preserved. The problem formulation is described in detail next.\n\ni=1, x1i \u2208 Rp1 and {x2i}n\n\ni=1, y1i \u2208 Rd and {y2i}n\n\ni=1, x2i \u2208 Rp2 be two aligned datasets belonging to two dif-\ni=1, y2i \u2208 Rd be the corresponding low\n\n2.1 Formulation\nLet {x1i}n\nferent input spaces, and {y1i}n\ndimensional representations (in the output space), with d (cid:191) p1 and d (cid:191) p2.\nAs in MVU [21], we need to \ufb01nd a low dimensional mapping such that the Euclidean distance\nbetween pairs of points in a local neighborhood are preserved. For each dataset s \u2208 {1, 2}, if\npoints xsj and xsk are neighbors or are common neighbors of another point, we denote an indicator\nvariable \u03b7sij = 1. The neighborhood constraints can then be written as\n\n||ysi \u2212 ysj||2 = ||xsi \u2212 xsj||2\n\nif \u03b7sij = 1\n\n(1)\n\n2\n\n\f(cid:189)\n\ni \u2264 n\ni > n\n\ny1i\ny2(i\u2212n)\n\ni=1 containing 2n points, zi =\n\nTo simplify the notation, we concatenate the output points from both datasets into one large set,\n{zi}2n\nWe also de\ufb01ne the inner-product matrix for {zi}, Kij = zi.zj. 
This allows us to formulate the MCU\nvery similarly to the MVU formulation of [21], and so we omit the details for the sake of brevity.\nThe distance constraint of (1) is written in the matrix form as:\n\nKii \u2212 2Kij + Kjj = D1ij , {(i, j) : i, j \u2264 n and \u03b71ij = 1}\nKii \u2212 2Kij + Kjj = D2(i\u2212n)(j\u2212n), {(i, j) : i, j > n and \u03b72(i\u2212n)(j\u2212n) = 1}\n(cid:80)\ni ysi = 0,\u2200s \u2208 {1, 2}. The equivalent matrix constraints are,\n\nThe centering constraint to ensure that the output points of both datasets are centered at the origin\nrequires that\n\n(2)\n(3)\n\n(cid:88)\n\n(cid:88)\n\nKij = 0, \u2200i, j \u2264 n\n\nKij = 0, \u2200i, j > n\n\nij\n\nij\n\n(4)\n\n(6)\n\nThe objective function is to maximize the covariance between the low dimensional representations\nof the two datasets. We can use the trace of the covariance matrix as a measure of how strongly the\ntwo outputs are correlated. The average covariance can be written as:\n\ntr(cov(y1, y2)) = tr(E(y1yT\n\n2 )) = E(tr(y1yT\n\n2 )) = E(y1.y2) \u2248 1\n\ny1i.y2i\n\n(5)\n\n(cid:88)\n\nn\n\ni\n\nCombining all the constraints together with the objective function, we can write the optimization as:\n\n(cid:184)\n\n(cid:183)\n\nWijKij, with W =\n\n(cid:88)\nKii \u2212 2Kij + Kjj = D1ij , {(i, j) : i, j \u2264 n and \u03b71ij = 1}\nKii \u2212 2Kij + Kjj = D2(i\u2212n)(j\u2212n) , {(i, j) : i, j > n and \u03b72(i\u2212n)(j\u2212n) = 1}\nK (cid:186) 0,\n\nKij = 0, \u2200i, j \u2264 n,\n\nKij = 0, \u2200i, j > n\n\n(cid:88)\n\n(cid:88)\n\n0\nIn\n\nIn\n0\n\nij\n\nMaximize:\n\nsubject to:\n\nij\n\nij\n\nAs in the original MVU formulation [21], this is a semi-de\ufb01nite program (SDP) and can be solved\nusing general-purpose toolboxes such as SeDuMi [19]. 
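The role of the matrix W in (6) is pure bookkeeping: with the identity blocks off the diagonal, the linear objective Σ_ij Wij Kij recovers twice the sum of cross-modal inner products Σ_i y1i · y2i. A minimal numpy check of this identity, using small random embeddings purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 8, 2
Y1 = rng.randn(n, d)            # embedding of modality 1
Y2 = rng.randn(n, d)            # embedding of modality 2

Z = np.vstack([Y1, Y2])         # concatenated outputs z_i
K = Z @ Z.T                     # Gram matrix K_ij = z_i . z_j

I = np.eye(n)
W = np.block([[np.zeros((n, n)), I],
              [I, np.zeros((n, n))]])

objective = (W * K).sum()            # sum_ij W_ij K_ij
covariance = 2 * (Y1 * Y2).sum()     # 2 * sum_i y1_i . y2_i
# the two quantities agree, so maximizing tr(WK) maximizes cross covariance
```

Since K is by construction a Gram matrix, it is automatically positive semidefinite here; in the SDP, K ⪰ 0 is imposed as a constraint because K itself is the optimization variable.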
The solution returned by the SDP can be used to find the coordinates in the low-dimensional embedding, {y1i} and {y2i}, i = 1...n, using the spectral decomposition method described in [21].

One shortcoming of the MCU formulation is that it provides no means to visualize the results. While the low-dimensional embeddings of the two datasets may be well correlated, there is no way to identify which dimensions or covariates of the data points in one modality contribute to high correlation with the points in the other modality. To address this issue, we include a novel metric learning framework in the MCU formulation, as described in the next section.

2.2 Metric Learning for Visualization

For each dimension in one dataset, we need to compute a measure of how much it contributes to the correlation between the datasets. This can be done using a metric learning step applied to data of one or both modalities within the MCU formulation. In this work we describe this approach for the situation where metric learning is applied only to {x1i}.

The MCU formulation of Section 2 assumes that the distances between points are Euclidean. So in the computation of nearest neighbor distances, each of the p1 dimensions of {x1i} receives the same weight, as shown in (1). However, inspired by recently proposed ideas in metric learning [22], we use a more general distance metric, obtained by applying a linear transformation T1 of size p1 × p1 in the input space and then performing MCU on the transformed points T1 x1i. This allows some distances to shrink or expand if that helps to increase the correlation with {x2i}.

For the sake of simplicity, we choose a diagonal weight matrix T1, whose diagonal entries are {σi}, i = 1...p1, with σi ≥ 0, ∀i. This allows us to weight each dimension of the input space separately. In order to find the weight vector that produces the maximal correlation between the two datasets, these p1 new variables can be learned within the MCU framework by adding them to the optimization problem. As each dimension has a corresponding weight, the optimal weight vector returned is a map over the dimensions, indicating how strongly each is correlated with {x2i}.

To modify the MCU formulation to include these new variables, we replace all Euclidean distance measurements for the data points in the first dataset in (2) with the weighted distance D1ij = Σ_m σm (xim − xjm)^2. This adds a linear function of the new weight variables to the existing distance constraints of (2). However, if we had to define the neighborhood of a data point itself using this weighted distance, the formulation would become non-convex. So we assume that the neighborhood is composed of points that are closest in time. An alternative is to use neighbors as computed in the original space using the un-weighted distance. We also add constraints to make the weights positive and sum to p1. The objective function of (6) does not change, but we now maximize the objective over the p1 weight variables as well. The problem still remains an SDP and can be solved as before. The new formulation, denoted MCU-ML, is written as:

Maximize:  Σ_{ij} Wij Kij,   with  W = [ 0  In ; In  0 ]
subject to:
Kii − 2Kij + Kjj − Σ_m σm (xim − xjm)^2 = 0,   {(i, j) : i, j ≤ n and η1ij = 1}
Kii − 2Kij + Kjj = D2(i−n)(j−n),   {(i, j) : i, j > n and η2(i−n)(j−n) = 1}
Σ_{ij} Kij = 0, ∀ i, j ≤ n,   Σ_{ij} Kij = 0, ∀ i, j > n
K ⪰ 0,   σk ≥ 0, ∀ k ∈ {1 ... p1},   Σ_k σk = p1   (7)

We next describe how these formulations for MCU can be applied to find optimal representations for high dimensional EEG-fMRI data.

3 Resting-state EEG-fMRI Data

In the absence of an explicit task, temporal synchrony of the blood oxygenation level dependent (BOLD) signal is maintained across distinct brain regions. Taking advantage of this synchrony, resting-state fMRI has been used to study connectivity. fMRI datasets have high spatial resolution, of the order of a few millimeters, but offer poor temporal resolution, as they measure the delayed haemodynamic response to neural activity. In addition, changes in resting-state BOLD connectivity measures are typically interpreted as changes in coherent neural activity across the respective brain regions. However, this interpretation may be misleading because the BOLD signal is a complex function of neural activity, oxygen metabolism, cerebral blood flow (CBF), and cerebral blood volume (CBV) [3]. To address these shortcomings, simultaneous acquisition of electroencephalographic (EEG) data during functional magnetic resonance imaging (fMRI) is becoming more popular in brain imaging [13]. The EEG recording provides high temporal resolution of neural activity (5 kHz), but poor spatial resolution, due to electric signal distortion by the skull and scalp and the limitations on the number of electrodes that can be placed on the scalp. 
Therefore the goal of simultaneous acquisition of EEG and fMRI is to exploit the complementary nature of the two imaging modalities to obtain spatiotemporally resolved neural signal and metabolic state information [13]. Specifically, using high temporal resolution EEG data, we are able to examine dynamic changes and non-stationary properties of neural activity at different frequency bands. By correlating the EEG data with the high resolution BOLD data, we are able to examine the corresponding spatial regions in which neural activity occurs.

Conventional approaches to analyzing joint EEG-fMRI data have relied on linear methods. Most often, a simple voxel-wise correlation of the fMRI data with the EEG power time series in a specific frequency band is performed [13]. But this technique does not exploit the rich spatial dependencies of the fMRI data. To address this issue, more sophisticated linear methods, such as canonical correlation analysis (CCA) [7] and the partial least squares method [11], have been proposed. However, all linear approaches have a fundamental shortcoming: the space of images, which is highly non-linear and thought to form a manifold, may not be well represented by a linear subspace. Therefore, linear approaches to correlate the fMRI data with the EEG data may not capture any low dimensional manifold structure.

To address these limitations we propose the use of MCU to learn low dimensional manifolds for both the fMRI and EEG data such that the output embeddings are maximally correlated. In addition, we learn a metric in the fMRI input space to identify which voxels of the fMRI correlate most strongly with observed changes in the EEG recordings. We first describe the methods used to acquire the EEG-fMRI dataset.

3.1 Method for Data Acquisition

One 5 minute simultaneous EEG-fMRI resting state run was recorded and processed with eyes closed (EC). 
Data were acquired using a 3 Tesla GE HDX system and a 64 channel EEG system\nsupplied by Brain Products. EEG signals were recorded at 5kHz sampling rate. Impedances of the\nelectrodes were kept below 20k\u2126. Recorded EEG data were pre-processed using Vision Analyzer\n2.0 software (Brain Products). Subtraction-based MR-gradient and Cardio-ballistic artifact removal\nwere applied. A low pass \ufb01lter with cut off frequency 30Hz was applied to all channels and the\nprocessed signals were down-sampled to 250Hz. fMRI data were acquired with the following pa-\nrameters: echo planar imaging with 150 volumes, 30 slices, 3.438 \u00d7 3.438 \u00d7 5mm3 voxel size,\n64 \u00d7 64 matrix size, TR=2s, TE=30ms. fMRI data were pre-processed using an in-house developed\npackage. The 5 frequency channels of the EEG data were averaged to produce a 63 dimensional\ntime series of 145 time points. The fMRI data consisted of a 122880 (64 \u00d7 64 \u00d7 30) dimensional\ntime series with 145 time points.\n\n3.2 Results on EEG-fMRI Dataset\nThe EEG and fMRI data points described in the previous section are extremely high dimensional.\nHowever, both EEG and fMRI signals are the result of sparse neuronal activity. Therefore, attempts\nto embed these points, especially the fMRI data, into a low dimensional manifold have been made\nusing non-linear dimensionality reduction techniques such as Laplacian eigenmaps [17]. While such\ntechniques may be used to \ufb01nd manifold embeddings for fMRI and EEG data separately, they are\nnot useful for \ufb01nding patterns of correlation between the two. We demonstrate how MCU can be\napplied to this setting below.\nDue to the very high dimensionality of the fMRI dataset, we pre-processed the data as follows. An\nanatomical region of interest mask was used, followed by PCA to project the fMRI samples to a\nsubspace of dimension p1 = 145 (which represented all of the energy of the samples, because there\nare only 145 time points). 
The EEG data was not subject to any pre-processing, and p2 remained 63. We applied the MCU-ML approach to learn a visualization map and a joint low dimensional embedding for the EEG-fMRI dataset. We compared the results to two other techniques: the voxel-wise correlation and the linear CCA approach inspired by [7]. The MCU-ML solution directly returned a weight vector of length 145. For CCA, the average of the canonical directions (weighted using the canonical correlations) was used as the weight vector. In both cases, the 145 dimensional weight vector was projected back to the fMRI voxel space using the principal components of the PCA step.

Two types of voxel-wise correlation maps were computed to assess the performance of MCU-ML. First, a naive correlation map was generated, where each voxel was separately correlated with the average EEG power time course in the alpha band (8-12 Hz) (which is known to be correlated with the fMRI resting-state network [13]) from all the 63 electrodes. Second, a functional connectivity map was generated using the knowledge that at rest state (during which the dataset was recorded), the Posterior Cingulate Cortex (PCC) is known to be active [8] and is correlated with the Default Mode Network (DMN) while anti-correlated with the Task Positive Network (TPN). To achieve this, a seed region of interest (ROI) was first selected from the PCC. The averaged fMRI signal from the ROI was then correlated with the whole brain to obtain a voxel-wise correlation map. Therefore, voxels in the PCC region should have high correlation with the EEG data. This information provides a “sanity-check” version of the fMRI correlation map.

The results for the anatomically significant slice 18, within which both the DMN and the TPN are located, are shown in Figure 1. 
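The back-projection used to build these maps is a single linear map: a weight vector learned over the 145 retained PCA dimensions is carried back to voxel space through the principal component loadings. A minimal numpy sketch of this step; the shapes and the random weight vector are illustrative stand-ins, not the actual fMRI data or the MCU-ML solution:

```python
import numpy as np

rng = np.random.RandomState(0)
p_voxels, n_time = 500, 145          # illustrative stand-in dimensions
X = rng.randn(n_time, p_voxels)      # fMRI series: one row per time point

# PCA via SVD of the centered data: rows of Vt are principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # Vt: (n_time, p_voxels)

# a weight vector over the retained PCA dimensions
# (in the paper this comes from MCU-ML or CCA; here it is random)
w_pca = np.abs(rng.randn(n_time))

# project the weights back through the loadings: one weight per voxel
w_voxel = Vt.T @ w_pca               # length p_voxels
```

The resulting per-voxel weights are what get overlaid on the anatomical images to form the correlation maps.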
The functional connectivity map is shown in Figure 2(a), and the correlation map obtained using MCU-ML, overlaid with the relevant anatomical regions, appears in Figure 2(b). The MCU-ML map shows the activation of the Default Mode Network (DMN) and a suppression of the Task Positive Network (TPN). From the results, it is clear that the MCU-ML approach produces the best match, showing well localized regions of positive correlation in the DMN and regions of negative correlation in the TPN. The correlation maps for 12 slices, overlaid on a high-resolution T1-weighted image, for the proposed MCU-ML approach are shown in Figure 3(b).

Figure 1: Comparison of results on the EEG-fMRI dataset. (a) naive correlation map (b) using only PCA (c) using CCA (d) using MCU-ML

Figure 2: (a) the functional connectivity map, and (b) the MCU-ML correlation map overlaid with information about the anatomical regions relevant during rest state.

Figure 3: (a) The plot showing the normalized weights for the 145 dimensions for CCA, MCU-ML and PCA. (b) A montage showing the recovered weights for each voxel in the 12 anatomically significant slices, with the MCU weights overlaid on a high-resolution T1-weighted image.

To compare the weights learned using MCU-ML and CCA, we plot the normalized importance of each of the 145 dimensions in Figure 3(a). We also plot the eigenvalues for the 145 dimensions obtained using PCA. It is seen that the weights produced by the MCU-ML approach have fewer significant components (around 20) than those of CCA. 
It is also interesting to see that the weights that produce maximal correlation with the EEG dataset are very different from the eigenvalues of PCA themselves, indicating that the dimensions that are important for correlation are not necessarily the ones with maximum variance.

4 Fast MCU

One of the primary limitations of the SDP based formulation for MCU in Section 2.1, shared with MVU, is its inability to scale to problems involving a large number of data points [23]. To address this issue, Weinberger et al. [23] modified the original formulation using graph Laplacian regularization to reduce the size of the SDP. However, recent work has shown that even this reduced formulation of MVU can be solved more efficiently by reframing it as a semidefinite quadratic linear program (SQLP) [25]. In this section, we show how a fast version of MCU, denoted Fast-MCU, can be implemented using a similar approach.

Let L1 and L2 denote the graph Laplacians [6] of the two sets of points, {y1i} and {y2i}, respectively. The graph Laplacian depends only on nearest neighbor relations, and in MCU these are assumed to be unchanged as the points are embedded from the original space to the low dimensional manifold. Therefore, L1 and L2 can be obtained from the graphs of the data points {x1i} and {x2i} in the original space. Let Q1, Q2 ∈ R^{n×m} contain the bottom m eigenvectors of L1 and L2. 
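Computing such a basis is a standard spectral-graph-theory step: build the unnormalized Laplacian L = D − A of the symmetrized k-NN graph and keep the eigenvectors with the smallest eigenvalues, which are the smoothest functions on the graph. A generic numpy sketch under arbitrary illustrative data and k (not the authors' implementation):

```python
import numpy as np

def bottom_eigenbasis(X, k=4, m=5):
    """Return Q (n x m): the bottom-m eigenvectors of the unnormalized
    graph Laplacian of the symmetrized k-NN graph of the rows of X."""
    n = len(X)
    # pairwise squared Euclidean distances
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # adjacency of the k-NN graph (skip index 0, the point itself)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(D2[i])[1:k + 1]] = 1
    A = np.maximum(A, A.T)                 # symmetrize the graph
    L = np.diag(A.sum(1)) - A              # unnormalized Laplacian
    _, V = np.linalg.eigh(L)               # eigenvalues in ascending order
    return V[:, :m]                        # smoothest functions on the graph

rng = np.random.RandomState(1)
X1 = rng.randn(50, 3)
Q1 = bottom_eigenbasis(X1, k=4, m=5)
# each embedding vector is then parametrized as y_1i = sum_a Q1[i, a] u_1a,
# so the SDP variable shrinks from an n x n block to an m x m block
```

The payoff is entirely in the variable count: the unknowns become the m-dimensional coefficient vectors rather than the n output points themselves.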
Then we can write the 2n vectors {y1i} and {y2i} in terms of two new sets of m unknown vectors, {u1α} and {u2α}, α = 1...m, with m ≪ n, using the approximation:

y1i ≈ Σ_{α=1}^{m} Q1iα u1α   and   y2i ≈ Σ_{α=1}^{m} Q2iα u2α   (8)

As in Section 2, we concatenate the vectors from both datasets into one larger set, {ui}, i = 1...2m, containing 2m points:

ui = u1i if i ≤ m,   ui = u2(i−m) if i > m   (9)

We define m × m inner product matrices (Ujk)αβ = ujα · ukβ, ∀ j, k ∈ {1, 2}, ∀ α, β ∈ {1 ... m}, and a 2m × 2m matrix Uαβ = uα · uβ, ∀ α, β ∈ {1 ... 2m}. Therefore, U = [ U11  U12 ; U21  U22 ]. The 2n × 2n inner product matrix K can therefore be approximated in terms of the much smaller 2m × 2m matrix U:

K ≈ [ Q1 U11 Q1^T   Q1 U12 Q2^T ; Q2 U21 Q1^T   Q2 U22 Q2^T ]   (10)

To formulate MCU as an SQLP, we first rewrite (6) by bringing the distance constraints into the objective function using regularization parameters ν1, ν2 > 0:

Maximize:  Σ_{ij} Wij Kij − ν1 Σ_{i∼j, i,j≤n} (Kii − 2Kij + Kjj − D1ij)^2 − ν2 Σ_{i∼j, i,j>n} (Kii − 2Kij + Kjj − D2ij)^2
subject to:  K ⪰ 0,   Σ_{ij} Kij = 0, ∀ i, j ≤ n,   Σ_{ij} Kij = 0, ∀ i, j > n   (11)

By using (10) in (11), and by noting that the centering constraint is automatically satisfied [23], we get the modified formulation in terms of U:

Maximize:  2 tr(Q1 U12 Q2^T) − Σ_{k∈{1,2}} νk Σ_{i∼k j} ( (Qk Ukk Qk^T)ii − 2(Qk Ukk Qk^T)ij + (Qk Ukk Qk^T)jj − Dkij )^2
subject to:  U ⪰ 0   (12)

where i ∼k j for k ∈ {1, 2} encodes the neighborhood relationships of the kth dataset.

This SDP is similar to the formulation proposed by [23]. In order to obtain a further simplification, let vec(U) ∈ R^{4m^2} be the concatenation of the columns of U. Then, (12) can be reformulated by collecting the coefficients of all quadratic terms in the objective function in a positive semi-definite matrix A ∈ R^{4m^2 × 4m^2}, and those of the linear terms, including the trace term, in a vector b ∈ R^{4m^2}:

Minimize:  vec(U)^T A vec(U) + b^T vec(U)
subject to:  U ⪰ 0   (13)

This minimization problem can be solved using the SQLP approach of [25]. From the solution of the SQLP, the vectors {u1i} and {u2i}, i = 1...m, can be obtained using the spectral decomposition method described in [21], followed by the low dimensional coordinates {y1i} and {y2i}, i = 1...n, using (8). Finally, these coordinates are refined using gradient-based improvement of the original objective function of (11), following the procedure described in [23].

5 Results

We apply the Fast-MCU algorithm to n = 1000 points generated from two “Swiss rolls” in three dimensions, with m set to 20. Figure 4 shows the embeddings of this data generated by CCA and by Fast-MCU. While CCA discovers two significant dimensions, Fast-MCU accurately extracts the low dimensional manifold, where the embeddings lie in a narrow strip.

Figure 4: (a) Two “Swiss rolls” consisting of 1000 points each in 3D, with aligned pairs of points shown in the same color. (b) The 2D embedding obtained using CCA. (c) Low dimensional manifolds obtained using Fast-MCU, before and after the gradient-based improvement step. (Best viewed in color.)

To further test the proposed Fast-MCU on real data, we use the recently proposed Wikipedia dataset composed of text and image pairs [12]. 
The dataset consists of 2866 text - image pairs, each belong-\ning to one of 10 semantic categories. The corpus is split into a training set with 2173 documents, and\na test set with 693 documents. The retrieval task consists of two parts. In the \ufb01rst, each image in the\ntest set is used as a query, and the goal is to rank all the texts in the test set based on their match to\nthe query image. In the second, a text query is used to rank the images. In both parts, performance\nis measured using the mean average precision (MAP). The MAP score is the average precision at\nthe ranks where recall changes.\nThe experimental evaluation was similar to that of [12]. We \ufb01rst represented the text using an LDA\nmodel [2] with 20 topics, and the image using a histogram over a SIFT [10] codebook of 4096\ncodewords. The common low dimensional manifold was learned from the text-image pairs of the\ntraining set using the SQLP based formulation of (13), with m = 20, followed by a gradient ascent\nstep as described in the previous section. To compare the performance of Fast-MCU, we also used\nCCA and kernel CCA (kCCA) to learn the maximally correlated joint spaces from the training set.\nFor kCCA we used a Gaussian kernel and implemented it using code from the authors of [9].\nGiven a test sample (image or text), it is \ufb01rst projected into the learned subspace or manifold. For\nCCA, this involves a linear transformation to the low dimensional subspace, while for kCCA this\nis achieved by evaluating a linear combination of the kernel functions of the training points [9].\nFor Fast-MCU, the nearest neighbors of the test point among the training samples in the original\nspace are used to obtain a mapping of the point as a weighted combination of these neighbors. The\nsame mapping is then applied to the projection of the neighbors in the learned low dimensional joint\nmanifold to compute the projection of the test point. 
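This out-of-sample step can be sketched directly: reconstruct the test point as a weighted combination of its nearest training points in the input space, then apply the same weights to those points' embedded coordinates. The inverse-distance weighting below is one simple, assumed choice; the paper does not specify its exact weighting scheme (LLE-style reconstruction weights are another option):

```python
import numpy as np

def project_out_of_sample(x_test, X_train, Y_train, k=5):
    """Map a test point into a learned embedding: express it as a weighted
    combination of its k nearest training points in the input space, then
    apply the same weights to their embedded coordinates.
    (Inverse-distance weights are an illustrative choice, not the paper's.)"""
    d = np.linalg.norm(X_train - x_test, axis=1)
    nbrs = np.argsort(d)[:k]
    w = 1.0 / (d[nbrs] + 1e-12)
    w /= w.sum()                       # weights sum to one
    return w @ Y_train[nbrs]           # same combination in the embedding

rng = np.random.RandomState(0)
X_train = rng.randn(100, 20)           # training points, input space
Y_train = rng.randn(100, 3)            # their low dimensional embedding
x_test = X_train[0] + 0.01 * rng.randn(20)
y_test = project_out_of_sample(x_test, X_train, Y_train, k=5)
# a test point near a training point maps close to that point's embedding
```

The key property is locality: a query that sits near the training manifold inherits coordinates from its neighborhood, so no re-solving of the SQLP is needed at test time.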
To perform retrieval, all the test points of both modalities, image and text, are projected to the joint space learned using the training set. For a given test point of one modality, its distances to all the projected test points of the other modality are computed, and these are then ranked. In this work, we used the normalized correlation distance, which was shown to be the best performing distance metric in [12]. A retrieved sample is considered to be correct if it belongs to the same category as the query.

The results of the retrieval task are shown in Table 1. The performance of a random retrieval scheme is also shown to indicate the baseline chance level. It is clear that Fast-MCU outperforms both CCA and kCCA in both image-to-text and text-to-image retrieval tasks. In addition, Fast-MCU produced a significantly lower number of significant dimensions for the embeddings: CCA produced 19 significant dimensions, compared to just 3 for Fast-MCU.

Table 1: MAP scores for image-text retrieval tasks

Query          Random   CCA     KCCA    Fast-MCU
Text - Image   0.118    0.193   0.170   0.264
Image - Text   0.118    0.154   0.172   0.198

6 Conclusions

In this paper, we describe an adaptation of MVU to analyze correlations of high-dimensional aligned data such as EEG-fMRI data and image-text corpora. 
[Figure: synthetic bimodal inputs (N = 1000) and the corresponding outputs of CCA and Fast-MCU, before and after gradient-based fine-tuning]

Our results on EEG-fMRI data show that the proposed approach is able to make anatomically significant predictions about which voxels of the fMRI are most correlated with changes in EEG signals. Likewise, the results on the Wikipedia set demonstrate the ability of MCU to discover the correlations between images and text. In both these applications, it is important to realize that MCU is not only revealing the correlated degrees of variability from different input modalities, but also pruning away the uncorrelated ones. This ability makes MCU much more broadly applicable, because in general we expect inputs from truly different modalities to have many independent degrees of freedom: e.g., there are many ways in text to describe a single, particular image, just as there are many ways in pictures to illustrate a single, particular word.

7 Acknowledgements

This work was supported by NSF award CCF-0830535, NIH Grant R01NS051661 and ONR MURI Award No. N00014-10-1-0072.

References
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[3] R. Buxton, K. Uludağ, D. Dubowitz, and T. T. Liu. Modeling the hemodynamic response to brain activation. Neuroimage, 23(1):220–233, 2004.
[4] M. Bowling, A. Ghodsi, and D. Wilkinson. Action respecting embedding.
In ICML, pages 65–72, 2005.
[5] M. Bowling, D. Wilkinson, A. Ghodsi, and A. Milstein. Subjective localization with action respecting embedding. In ISRR, 2005.
[6] F. Chung. Spectral graph theory. Amer. Mathematical Society, 1997.
[7] N. Correa, T. Eichele, T. Adalı, Y. Li, and V. Calhoun. Multi-set canonical correlation analysis for the fusion of concurrent single trial ERP and functional MRI. NeuroImage, 2010.
[8] M. Greicius, B. Krasnow, A. Reiss, and V. Menon. Functional connectivity in the resting brain: a network analysis of the default mode hypothesis. PNAS, 100(1):253, 2003.
[9] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
[10] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[11] E. Martinez-Montes, P. Valdés-Sosa, F. Miwakeichi, R. Goldman, and M. Cohen. Concurrent EEG/fMRI analysis by multiway partial least squares. NeuroImage, 22(3):1023–1034, 2004.
[12] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In ACM Multimedia, pages 251–260, 2010.
[13] P. Ritter and A. Villringer. Simultaneous EEG-fMRI. Neuroscience & Biobehavioral Reviews, 30(6):823–838, 2006.
[14] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[15] B. Shaw and T. Jebara. Minimum volume embedding. In AISTATS, pages 460–467, San Juan, Puerto Rico, 2007.
[16] B. Shaw and T. Jebara. Structure preserving embedding. In ICML, 2009.
[17] X. Shen and F. Meyer. Low-dimensional embedding of fMRI datasets. Neuroimage, 41(3):886–902, 2008.
[18] L. Song, A. Smola, K. Borgwardt, and A. Gretton. Colored maximum variance unfolding. In NIPS, 2008.
[19] J.
Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1):625–653, 1999.
[20] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[21] K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite programming. IJCV, 70(1):77–90, 2006.
[22] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
[23] K. Weinberger, F. Sha, Q. Zhu, and L. Saul. Graph Laplacian regularization for large-scale semidefinite programming. NIPS, 19:1489, 2007.
[24] K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML, 2004.
[25] X. Wu, A. So, Z. Li, and S. Li. Fast graph Laplacian regularized kernel learning via semidefinite–quadratic–linear programming. NIPS, 22:1964–1972, 2009.