{"title": "Multimodal Learning with Deep Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 2222, "page_last": 2230, "abstract": "We propose a Deep Boltzmann Machine for learning a generative model of multimodal data. We show how to use the model to extract a meaningful representation of multimodal data. We find that the learned representation is useful for classification and information retreival tasks, and hence conforms to some notion of semantic similarity. The model defines a probability density over the space of multimodal inputs. By sampling from the conditional distributions over each data modality, it possible to create the representation even when some data modalities are missing. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that our model can significantly outperform SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves significant gains.", "full_text": "Multimodal Learning with Deep Boltzmann Machines\n\nNitish Srivastava\n\nRuslan Salakhutdinov\n\nDepartment of Computer Science\n\nDepartment of Statistics and Computer Science\n\nUniversity of Toronto\n\nnitish@cs.toronto.edu\n\nUniversity of Toronto\n\nrsalakhu@cs.toronto.edu\n\nAbstract\n\nA Deep Boltzmann Machine is described for learning a generative model of data\nthat consists of multiple and diverse input modalities. The model can be used\nto extract a uni\ufb01ed representation that fuses modalities together. We \ufb01nd that\nthis representation is useful for classi\ufb01cation and information retrieval tasks. 
The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent, by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.

1 Introduction

Information in the real world comes through multiple input channels. Images are associated with captions and tags, videos contain visual and audio signals, and sensory perception includes simultaneous inputs from visual, auditory, motor and haptic pathways. Each modality is characterized by very distinct statistical properties, which make it difficult to ignore the fact that they come from different input channels. Useful representations can be learned from such data by fusing the modalities into a joint representation that captures the real-world ‘concept’ that the data corresponds to. For example, we would like a probabilistic model to correlate the occurrence of the words ‘beautiful sunset’ and the visual properties of an image of a beautiful sunset and represent them jointly, so that the model assigns high probability to one conditioned on the other.

Before we describe our model in detail, it is useful to note why such a model is required. Different modalities typically carry different kinds of information. 
For example, people often caption an image to say things that may not be obvious from the image itself, such as the name of the person, place, or a particular object in the picture. Unless we do multimodal learning, it would not be possible to discover a lot of useful information about the world (for example, ‘what do beautiful sunsets look like?’). We cannot afford to have discriminative models for each such concept and must extract this information from unlabeled data.

In a multimodal setting, data consists of multiple input modalities, each modality having a different kind of representation and correlational structure. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented using pixel intensities or outputs of feature extractors, which are real-valued and dense. This makes it much harder to discover relationships across modalities than relationships among features in the same modality. There is a lot of structure in the input, but it is difficult to discover the highly non-linear relationships that exist between low-level features across different modalities. Moreover, these observations are typically very noisy and may have missing values.

Figure 1: Left: Examples of text generated from a DBM by sampling from P(v_txt|v_img, \theta). Right: Examples of images retrieved using features generated from a DBM by sampling from P(v_img|v_txt, \theta).

A good multimodal learning model must satisfy certain properties. The joint representation must be such that similarity in the representation space implies similarity of the corresponding ‘concepts’. It is also desirable that the joint representation be easy to obtain even in the absence of some modalities. It should also be possible to fill in missing modalities given the observed ones. 
In addition, it is desirable that the extracted representation be useful for discriminative tasks.

Our proposed multimodal Deep Boltzmann Machine (DBM) model satisfies the above desiderata. DBMs are undirected graphical models with bipartite connections between adjacent layers of hidden units [1]. The key idea is to learn a joint density model over the space of multimodal inputs. Missing modalities can then be filled in by sampling from the conditional distributions over them given the observed ones. For example, we use a large collection of user-tagged images to learn a joint distribution over images and text P(v_img, v_txt|\theta). By drawing samples from P(v_txt|v_img, \theta) and from P(v_img|v_txt, \theta) we can fill in missing data, thereby doing image annotation and image retrieval respectively, as shown in Fig. 1.

There have been several approaches to learning from multimodal data. In particular, Huiskes et al. [2] showed that using captions, or tags, in addition to standard low-level image features significantly improves classification accuracy of Support Vector Machine (SVM) and Linear Discriminant Analysis (LDA) models. A similar approach by Guillaumin et al. [3], based on a multiple kernel learning framework, further demonstrated that an additional text modality can improve the accuracy of SVMs on various object recognition tasks. However, all of these approaches are discriminative by nature and cannot make use of large amounts of unlabeled data or deal easily with missing input modalities.

On the generative side, Xing et al. [4] used dual-wing harmoniums to build a joint model of images and text, which can be viewed as a linear RBM model with Gaussian hidden units together with Gaussian and Poisson visible units. However, various data modalities will typically have very different statistical properties, which makes it difficult to model them using shallow models. 
Most similar to our work is the recent approach of Ngiam et al. [5], which used a deep autoencoder for speech and vision fusion. There are, however, several crucial differences. First, in this work we focus on integrating together very different data modalities: sparse word count vectors and real-valued dense image features. Second, we develop a Deep Boltzmann Machine as a generative model as opposed to unrolling the network and fine-tuning it as an autoencoder. While both approaches have led to interesting results in several domains, using a generative model is important for the applications we consider in this paper, as it allows our model to naturally handle missing data modalities.

2 Background: RBMs and Their Generalizations

Restricted Boltzmann Machines (RBMs) have been used effectively in modeling distributions over binary-valued data. Recent work on Boltzmann machine models and their generalizations to exponential family distributions has allowed these models to be successfully used in many application domains. In particular, the Replicated Softmax model [6] has been shown to be effective in modeling sparse word count vectors, whereas Gaussian RBMs have been used for modeling real-valued inputs for speech and vision tasks. In this section we briefly review these models, as they will serve as our building blocks for the multimodal model.

2.1 Restricted Boltzmann Machines

A Restricted Boltzmann Machine is an undirected graphical model with stochastic visible units v \in \{0, 1\}^D and stochastic hidden units h \in \{0, 1\}^F, with each visible unit connected to each hidden unit. The model defines the following energy function E : \{0, 1\}^{D+F} \to \mathbb{R}:

E(v, h; \theta) = -\sum_{i=1}^{D} \sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j,

where \theta = \{a, b, W\} are the model parameters. The joint distribution over the visible and hidden units is defined by:

P(v, h; \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta)).   (1)

2.2 Gaussian RBM

Consider modeling visible real-valued units v \in \mathbb{R}^D, and let h \in \{0, 1\}^F be binary stochastic hidden units. The energy of the state \{v, h\} of the Gaussian RBM is defined as follows:

E(v, h; \theta) = \sum_{i=1}^{D} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{i=1}^{D} \sum_{j=1}^{F} \frac{v_i}{\sigma_i} W_{ij} h_j - \sum_{j=1}^{F} a_j h_j,   (2)

where \theta = \{a, b, W, \sigma\} are the model parameters.

2.3 Replicated Softmax Model

The Replicated Softmax Model is useful for modeling sparse count data, such as word count vectors in a document. Let v \in \mathbb{N}^K be a vector of visible units where v_k is the number of times word k occurs in the document with the vocabulary of size K. Let h \in \{0, 1\}^F be binary stochastic hidden topic features. The energy of the state \{v, h\} is defined as follows:

E(v, h; \theta) = -\sum_{k=1}^{K} \sum_{j=1}^{F} v_k W_{kj} h_j - \sum_{k=1}^{K} b_k v_k - M \sum_{j=1}^{F} a_j h_j,   (3)

where \theta = \{a, b, W\} are the model parameters and M = \sum_k v_k is the total number of words in a document. We note that this replicated softmax model can also be interpreted as an RBM model that uses a single visible multinomial unit with support \{1, ..., K\} which is sampled M times.

For all of the above models, exact maximum likelihood learning is intractable. In practice, efficient learning is performed using Contrastive Divergence (CD) [7].

3 Multimodal Deep Boltzmann Machine

A Deep Boltzmann Machine (DBM) is a network of symmetrically coupled stochastic binary units. It contains a set of visible units v \in \{0, 1\}^D, and a sequence of layers of hidden units h^{(1)} \in \{0, 1\}^{F_1}, h^{(2)} \in \{0, 1\}^{F_2}, ..., h^{(L)} \in \{0, 1\}^{F_L}. 
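The energies above share a common bilinear visible-hidden coupling, and each induces a factorized conditional over the hidden units. As a concrete illustration, here is a minimal NumPy sketch of the binary RBM energy and the hidden-unit conditional it induces; the function and variable names are ours, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_energy(v, h, W, b, a):
    # E(v, h) = -sum_ij v_i W_ij h_j - sum_i b_i v_i - sum_j a_j h_j
    return -v @ W @ h - b @ v - a @ h

def hidden_conditional(v, W, a):
    # P(h_j = 1 | v) = sigmoid(a_j + sum_i v_i W_ij); this follows from the
    # energy because the hidden units are conditionally independent given v.
    return sigmoid(a + v @ W)
```

Sampling h from this conditional (and v from the symmetric visible conditional) is the basic step behind the Contrastive Divergence learning mentioned above.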
There are connections only between hidden units in adjacent layers. Let us first consider a DBM with two hidden layers. The energy of the joint configuration \{v, h\} is defined as (ignoring bias terms):

E(v, h; \theta) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)},

where h = \{h^{(1)}, h^{(2)}\} represents the set of hidden units, and \theta = \{W^{(1)}, W^{(2)}\} are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms. Similar to RBMs, this binary-binary DBM can be easily extended to modeling dense real-valued or sparse count data, which we discuss next.

Figure 2: Left: Image-specific two-layer DBM that uses a Gaussian model for the distribution over real-valued image features. Middle: Text-specific two-layer DBM that uses a Replicated Softmax model for its distribution over word count vectors. Right: A Multimodal DBM that models the joint distribution over image and text inputs.

We illustrate the construction of a multimodal DBM using an image-text bi-modal DBM as our running example. Let v_m \in \mathbb{R}^D denote an image input and v_t \in \mathbb{N}^K denote a text input. Consider modeling each data modality using separate two-layer DBMs (Fig. 2). The image-specific two-layer DBM assigns probability to v_m that is given by (ignoring bias terms on the hidden units for clarity):

P(v_m; \theta) = \sum_{h^{(1)}, h^{(2)}} P(v_m, h^{(1)}, h^{(2)}; \theta)
             = \frac{1}{Z(\theta)} \sum_{h^{(1)}, h^{(2)}} \exp\left( -\sum_{i=1}^{D} \frac{(v_{mi} - b_i)^2}{2\sigma_i^2} + \sum_{i=1}^{D} \sum_{j=1}^{F_1} \frac{v_{mi}}{\sigma_i} W^{(1)}_{ij} h^{(1)}_j + \sum_{j=1}^{F_1} \sum_{l=1}^{F_2} h^{(1)}_j W^{(2)}_{jl} h^{(2)}_l \right).   (4)

Note that we borrow the visible-hidden interaction term from the Gaussian RBM (Eq. 2) and the hidden-hidden one from the binary RBM (Eq. 1). Similarly, the text-specific DBM uses terms from the Replicated Softmax model for the visible-hidden interactions (Eq. 3) and the hidden-hidden ones from the binary RBM (Eq. 1).

To form a multimodal DBM, we combine the two models by adding an additional layer of binary hidden units on top of them. The resulting graphical model is shown in Fig. 2, right panel. The joint distribution over the multi-modal input can be written as:

P(v_m, v_t; \theta) = \sum_{h^{(2)}_m, h^{(2)}_t, h^{(3)}} P(h^{(2)}_m, h^{(2)}_t, h^{(3)}) \left( \sum_{h^{(1)}_m} P(v_m, h^{(1)}_m, h^{(2)}_m) \right) \left( \sum_{h^{(1)}_t} P(v_t, h^{(1)}_t, h^{(2)}_t) \right).

3.1 Approximate Learning and Inference

Exact maximum likelihood learning in this model is intractable, but efficient approximate learning can be carried out by using mean-field inference to estimate data-dependent expectations, and an MCMC-based stochastic approximation procedure to approximate the model's expected sufficient statistics [1]. In particular, during the inference step, we approximate the true posterior P(h|v; \theta), where v = \{v_m, v_t\}, with a fully factorized approximating distribution over the five sets of hidden units \{h^{(1)}_m, h^{(2)}_m, h^{(1)}_t, h^{(2)}_t, h^{(3)}\}:

Q(h|v; \mu) = \left( \prod_{j=1}^{F_1} q(h^{(1)}_{mj}|v) \right) \left( \prod_{l=1}^{F_2} q(h^{(2)}_{ml}|v) \right) \left( \prod_{j=1}^{F_1} q(h^{(1)}_{tj}|v) \right) \left( \prod_{l=1}^{F_2} q(h^{(2)}_{tl}|v) \right) \prod_{k=1}^{F_3} q(h^{(3)}_k|v),   (5)

where \mu = \{\mu^{(1)}_m, \mu^{(2)}_m, \mu^{(1)}_t, \mu^{(2)}_t, \mu^{(3)}\} are the mean-field parameters with q(h^{(l)}_i = 1) = \mu^{(l)}_i for l = 1, 2, 3. Learning proceeds by finding the value of \mu that maximizes the variational lower bound for the current value of model parameters \theta, which results in a set of mean-field fixed-point equations. Given the variational parameters \mu, the model parameters \theta are then updated to maximize the variational bound using an MCMC-based stochastic approximation [1, 8, 9].

To initialize the model parameters to good values, we use a greedy layer-wise pretraining strategy by learning a stack of modified RBMs (for details see [1]).

Figure 3: Different ways of combining multimodal inputs: (a) RBM; (b) Multimodal DBN; (c) Multimodal DBM.

3.2 Salient Features

A Multimodal DBM can be viewed as a composition of unimodal undirected pathways. Each pathway can be pretrained separately in a completely unsupervised fashion, which allows us to leverage a large supply of unlabeled data. Any number of pathways, each with any number of layers, could potentially be used. The type of the lower-level RBMs in each pathway could be different, accounting for different input distributions, as long as the final hidden representations at the end of each pathway are of the same type.

The intuition behind our model is as follows. 
Each data modality has very different statistical properties, which makes it difficult for a single-hidden-layer model (such as Fig. 3a) to directly find correlations across modalities. In our model, this difference is bridged by putting layers of hidden units between the modalities. The idea is illustrated in Fig. 3c, which is just a different way of displaying Fig. 2. Compared to the simple RBM (Fig. 3a), where the hidden layer h directly models the distribution over v_t and v_m, the first layer of hidden units h^{(1)}_m in a DBM has an easier task to perform: modeling the distribution over v_m and h^{(2)}_m. Each layer of hidden units in the DBM contributes a small part to the overall task of modeling the distribution over v_m and v_t. In the process, each layer learns successively higher-level representations and removes modality-specific correlations. Therefore, the middle layer in the network can be seen as a (relatively) “modality-free” representation of the input, as opposed to the input layers, which were “modality-full”.

Another way of using a deep model to combine multimodal inputs is to use a Multimodal Deep Belief Network (DBN) (Fig. 3b), which consists of an RBM followed by directed belief networks leading out to each modality. We emphasize that there is an important distinction between this model and the DBM model of Fig. 3c. In a DBN model the responsibility of the multimodal modeling falls entirely on the joint layer. In the DBM, on the other hand, this responsibility is spread out over the entire network. The modality fusion process is distributed across all hidden units in all layers. 
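In practice, the joint representation is computed by iterating the mean-field fixed-point updates of Sec. 3.1. The following toy NumPy sketch shows the update schedule for a single two-layer pathway with bias terms dropped; all names are ours, and this illustrates only the structure of the updates, not the paper's full multimodal implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=10):
    # Coordinate ascent on the variational bound for a 2-layer DBM:
    #   mu1_j <- sigmoid(sum_i v_i W1_ij + sum_l W2_jl mu2_l)
    #   mu2_l <- sigmoid(sum_j mu1_j W2_jl)
    # Note that mu1 receives input from BOTH the layer below and the
    # layer above; this top-down term distinguishes DBM inference from
    # a purely feed-forward DBN pass.
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2
```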
From the generative perspective, states of low-level hidden units in one pathway can influence the states of hidden units in other pathways through the higher-level layers, which is not the case for DBNs.

3.3 Modeling Tasks

Generating Missing Modalities: As argued in the introduction, many real-world applications will often have one or more modalities missing. The Multimodal DBM can be used to generate such missing data modalities by clamping the observed modalities at the inputs and sampling the hidden modalities from the conditional distribution by running the standard alternating Gibbs sampler [1]. For example, consider generating text conditioned on a given image v_m (generating image features conditioned on text can be done in a similar way). The observed modality v_m is clamped at the inputs and all hidden units are initialized randomly. P(v_t|v_m) is a multinomial distribution over the vocabulary, and alternating Gibbs sampling can be used to sample words from it. Fig. 1 shows examples of words that have high probability under the conditional distributions.

Inferring Joint Representations: The model can also be used to generate a fused representation of multiple data modalities. This fused representation is inferred by clamping the observed modalities and doing alternating Gibbs sampling to sample from P(h^{(3)}|v_m, v_t) (if both modalities are present) or from P(h^{(3)}|v_m) (if text is missing). A faster alternative, which we adopt in our experimental results, is to use variational inference (see Sec. 3.1) to approximate the posterior Q(h^{(3)}|v_m, v_t) or Q(h^{(3)}|v_m). The activation probabilities of hidden units h^{(3)} constitute the joint representation of the inputs.

This representation can then be used to do information retrieval for multimodal or unimodal queries. Each data point in the database (whether missing some modalities or not) can be mapped to this latent space. 
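The fill-in procedure above can be sketched for a toy model with one hidden layer and binary text units; this is a simplified analogue of the alternating Gibbs sampler (the real model has multinomial text units and deeper pathways), and all names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fill_in_text(v_img, W_img, W_txt, n_steps=50, seed=0):
    # Clamp the observed image v_img; initialize the missing text randomly,
    # then alternate: sample hidden units given (v_img, v_txt), then
    # resample v_txt given the hidden units. Bias terms omitted.
    rng = np.random.default_rng(seed)
    v_txt = rng.random(W_txt.shape[0]) < 0.5
    for _ in range(n_steps):
        p_h = sigmoid(v_img @ W_img + v_txt @ W_txt)
        h = rng.random(p_h.shape) < p_h
        p_txt = sigmoid(h @ W_txt.T)
        v_txt = rng.random(p_txt.shape) < p_txt
    return v_txt
```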
Queries can also be mapped to this space, and an appropriate distance metric can be used to retrieve results that are close to the query.

Discriminative Tasks: Classifiers such as SVMs can be trained with these fused representations as inputs. Alternatively, the model can be used to initialize a feed-forward network which can then be finetuned [1]. In our experiments, logistic regression was used to classify the fused representations. Unlike finetuning, this ensures that all learned representations that we compare (DBNs, DBMs and Deep Autoencoders) use the same discriminative model.

4 Experiments

4.1 Dataset and Feature Extraction

The MIR Flickr Data set [10] was used in our experiments. The data set consists of 1 million images retrieved from the social photography website Flickr along with their user-assigned tags. Among the 1 million images, 25,000 have been annotated for 24 topics, including object categories such as bird, tree, and people, and scene categories such as indoor, sky, and night. A stricter labeling was done for 14 of these classes, where an image was annotated with a category only if that category was salient. This leads to a total of 38 classes, where each image may belong to several classes. The unlabeled 975,000 images were used only for pretraining. We use 15,000 images for training and 10,000 for testing, following Huiskes et al. [2]. Mean Average Precision (MAP) is used as the performance metric. Results are averaged over 5 random splits of training and test sets.

Each text input was represented using a vocabulary of the 2000 most frequent tags. The average number of tags associated with an image is 5.15, with a standard deviation of 5.13. There are 128,501 images which do not have any tags, out of which 4,551 are in the labeled set. Hence about 18% of the labeled data has images but is missing text. 
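For reference, per-class Average Precision (whose mean over classes and splits gives the MAP used throughout these experiments) can be computed as in this short sketch; the function name and argument conventions are ours:

```python
import numpy as np

def average_precision(scores, labels):
    # scores: predicted relevance scores; labels: binary ground truth.
    # AP = mean, over the relevant items, of the precision at each
    # relevant item's rank in the score-sorted list.
    order = np.argsort(-scores)
    rel = labels[order].astype(float)
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return (precision_at_k * rel).sum() / max(rel.sum(), 1.0)
```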
Images were represented by 3857-dimensional features that were extracted by concatenating Pyramid Histogram of Words (PHOW) features [11], Gist [12] and MPEG-7 descriptors [13] (EHD, HTD, CSD, CLD, SCD). Each dimension was mean-centered and normalized to unit variance. PHOW features are bags of image words obtained by extracting dense SIFT features over multiple scales and clustering them. We used publicly available code [14, 15] for extracting these features.

4.2 Model Architecture and Learning

The image pathway consists of a Gaussian RBM with 3857 visible units followed by 2 layers of 1024 hidden units. The text pathway consists of a Replicated Softmax Model with 2000 visible units followed by 2 layers of 1024 hidden units. The joint layer contains 2048 hidden units. Each layer of weights was pretrained using PCD for initializing the DBM model. When learning the DBM model, all word count vectors were scaled so that they sum to 5. This avoids running separate Markov chains for each word count to get the model distribution's sufficient statistics.

Each pathway was pretrained using a stack of modified RBMs. Each Gaussian visible unit has unit variance that was kept fixed. For discriminative tasks, we perform 1-vs-all classification using logistic regression on the joint hidden layer representation. We further split the 15K training set into 10K for training and 5K for validation.

4.3 Classification Tasks

Multimodal Inputs: Our first set of experiments evaluates the DBM as a discriminative model for multimodal data. For each model that we trained, the fused representation of the data was extracted and fed to a separate logistic regression for each of the 38 topics. The text input layer in the DBM was left unclamped when the text was missing. Fig. 
4 summarizes the Mean Average Precision (MAP) and precision@50 (precision at top 50 predictions) obtained by different models. Linear Discriminant Analysis (LDA) and Support Vector Machines (SVMs) [2] were trained using the labeled data on concatenated image and text features that did not include SIFT-based features. Hence, to make a fair comparison, our model was first trained using only labeled data with a similar set of features (i.e., excluding our SIFT-based features). We call this model DBM-Lab. Fig. 4 shows that the DBM-Lab model already outperforms its competitor SVM and LDA models. DBM-Lab achieves a MAP of 0.526, compared to 0.475 and 0.492 achieved by the SVM and LDA models.

Multimodal Inputs
Model                      | MAP   | Prec@50
Random                     | 0.124 | 0.124
LDA [2]                    | 0.492 | 0.754
SVM [2]                    | 0.475 | 0.758
DBM-Lab                    | 0.526 | 0.791
DBM-Unlab                  | 0.585 | 0.836
DBN                        | 0.599 | 0.867
Autoencoder (based on [5]) | 0.600 | 0.875
DBM                        | 0.609 | 0.873

Unimodal Inputs
Model                      | MAP   | Prec@50
Image-SVM [2]              | 0.375 | -
Image-DBN                  | 0.463 | 0.801
Image-DBM                  | 0.469 | 0.803
DBM-ZeroText               | 0.522 | 0.827
DBM-GenText                | 0.531 | 0.832

Figure 4: Classification Results. Left: Mean Average Precision (MAP) and precision@50 obtained by different models. Right: MAP using representations from different layers of multimodal DBMs and DBNs.

To measure the effect of using unlabeled data, a DBM was trained using all the unlabeled examples that had both modalities present. We call this model DBM-Unlab. The only difference between the DBM-Unlab and DBM-Lab models is that DBM-Unlab used unlabeled data during its pretraining stage. The input features for both models remained the same. Not surprisingly, the DBM-Unlab model significantly improved upon DBM-Lab, achieving a MAP of 0.585. Our third model, DBM, was trained using additional SIFT-based features. 
Adding these features improves the MAP to 0.609. We compared our model to two other deep learning models: a Multimodal Deep Belief Network (DBN) and a deep Autoencoder model [5]. These models were trained with the same number of layers and hidden units as the DBM. The DBN achieves a MAP of 0.599 and the autoencoder gets 0.600. Their performance was comparable but slightly worse than that of the DBM. In terms of precision@50, the autoencoder performs marginally better than the rest. We also note that the Multiple Kernel Learning approach proposed by Guillaumin et al. [3] achieves a MAP of 0.623 on the same dataset. However, they used a much larger set of image features (37,152 dimensions).

Unimodal Inputs: Next, we evaluate the ability of the model to improve classification of unimodal inputs by filling in other modalities. For multimodal models, the text input was only used during training. At test time, all models were given only image inputs.

Fig. 4 compares the Multimodal DBM model with an SVM over image features alone (Image-SVM) [2], a DBN over image features (Image-DBN) and a DBM over image features (Image-DBM). All deep models had the same depth and the same number of hidden units in each layer. Results are reported for two different settings for the Multimodal DBM at test time. In one case (DBM-ZeroText), the state of the joint hidden layer was inferred keeping the missing text input clamped at zero. In the other case (DBM-GenText), the text input was not clamped and the model was allowed to update the state of the text input layer when performing mean-field updates. In doing so, the model effectively filled in the missing text modality (some examples of which are shown in Fig. 
1). These two settings helped to ascertain the contribution to the improvement that comes from filling in the missing modality.

The DBM-GenText model performs better than all other models, showing that the DBM is able to generate meaningful text that serves as a plausible proxy for missing data. Interestingly, the DBM-ZeroText model does better than any unimodal model. This suggests that learning multimodal features helps even when some modalities are absent at test time. Having multiple modalities probably regularizes the model and makes it learn much better features. Moreover, this means that we do not need to learn separate models to handle each possible combination of missing data modalities. One joint model can be deployed at test time and used for any situation that may arise.

Each layer of the DBM provides a different representation of the input. Fig. 4, right panel, shows the MAP obtained by using each of these representations for classification using logistic regression. The input layers, shown at the extreme ends, are not very good at representing useful features. As we go deeper into the model from either input layer towards the middle, the internal representations get better. The joint layer in the middle serves as the most useful feature representation. Observe that the performance of any DBM layer is always better than the corresponding DBN layer, though they get close at the joint layer.

Figure 5: Precision-Recall curves for retrieval tasks: (a) multimodal queries; (b) unimodal queries.

Figure 6: Retrieval results for multimodal queries from the DBM model.

4.4 Retrieval Tasks

Multimodal Queries: The next set of experiments was designed to evaluate the quality of the learned joint representations. A database of images was created by randomly selecting 5000 image-text pairs from the test set. We also randomly selected a disjoint set of 1000 images to be used as queries. 
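Retrieval with the joint representation then amounts to nearest-neighbour search under a similarity measure in the h^{(3)} space, along these lines (a sketch with our own names; the h3 vectors would come from the inference procedure of Sec. 3.1):

```python
import numpy as np

def retrieve(query_h3, database_h3, k=4):
    # Rank database items by cosine similarity to the query in the
    # joint (h3) representation space and return the top-k indices.
    q = query_h3 / np.linalg.norm(query_h3)
    D = database_h3 / np.linalg.norm(database_h3, axis=1, keepdims=True)
    return np.argsort(-(D @ q))[:k]
```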
Each query contained both image and text modalities. Binary relevance labels were created by assuming that if any of the 38 class labels overlapped between a query and a data point, then that data point was considered relevant to the query.

Fig. 5a shows the precision-recall curves for the DBM, DBN, and Autoencoder models (averaged over all queries). For each model, all queries and all points in the database were mapped to the joint hidden representation under that model. The cosine similarity function was used to match queries to data points. The DBM model performs the best among the compared models, achieving a MAP of 0.622. The autoencoder and DBN models perform worse, with MAPs of 0.612 and 0.609, respectively. Fig. 6 shows some examples of multimodal queries and the top 4 retrieved results. Note that even though there is little overlap in terms of text, the model is able to perform well.

Unimodal Queries: The DBM model can also be used to query for unimodal inputs by filling in the missing modality. Fig. 5b shows the precision-recall curves for the DBM model along with other unimodal models, where each model received the same image queries as input. By effectively inferring the missing text, the DBM model was able to achieve far better results than any unimodal method (MAP of 0.614 as compared to 0.587 for an Image-DBM and 0.578 for an Image-DBN).

5 Conclusion

We proposed a Deep Boltzmann Machine model for learning multimodal data representations. Large amounts of unlabeled data can be effectively utilized by the model. Pathways for each modality can be pretrained independently and “plugged in” together for doing joint training. The model fuses multiple data modalities into a unified representation. This representation captures features that are useful for classification and retrieval. 
It also performs well when some modalities are absent, improving upon models trained on only the observed modalities.

Acknowledgments: This research was supported by OGS, NSERC and an Early Researcher Award.

References

[1] R. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
[2] Mark J. Huiskes, Bart Thomee, and Michael S. Lew. New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In Multimedia Information Retrieval, pages 527-536, 2010.
[3] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 902-909, June 2010.
[4] Eric P. Xing, Rong Yan, and Alexander G. Hauptmann. Mining associated text and images with dual-wing harmoniums. In UAI, pages 633-641. AUAI Press, 2005.
[5] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In International Conference on Machine Learning (ICML), Bellevue, USA, June 2011.
[6] Ruslan Salakhutdinov and Geoffrey E. Hinton. Replicated softmax: an undirected topic model. In NIPS, pages 1607-1614. Curran Associates, Inc., 2009.
[7] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[8] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML. ACM, 2008.
[9] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, March 2000.
[10] Mark J. Huiskes and Michael S. Lew. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, 2008. ACM.
[11] A. Bosch, Andrew Zisserman, and X. Munoz. Image classification using random forests and ferns. In IEEE 11th International Conference on Computer Vision, pages 1-8, 2007.
[12] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145-175, 2001.
[13] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and A. Yamada. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703-715, 2001.
[14] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
[15] Muhammet Bastan, Hayati Cam, Ugur Gudukbay, and Ozgur Ulusoy. BilVideo-7: An MPEG-7-compatible video indexing and retrieval system. IEEE Multimedia, 17:62-73, 2010.
", "award": [], "sourceid": 1105, "authors": [{"given_name": "Nitish", "family_name": "Srivastava", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}