{"title": "Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling", "book": "Advances in Neural Information Processing Systems", "page_first": 12136, "page_last": 12145, "abstract": "Tensor-based multimodal fusion techniques have exhibited great predictive performance. However, one limitation is that existing approaches only consider bilinear or trilinear pooling, which fails to unleash the complete expressive power of multilinear fusion with restricted orders of interactions. More importantly, simply fusing features all at once ignores the complex local intercorrelations, leading to the deterioration of prediction. In this work, we first propose a polynomial tensor pooling (PTP) block for integrating multimodal features by considering high-order moments, followed by a tensorized fully connected layer. Treating PTP as a building block, we further establish a hierarchical polynomial fusion network (HPFN) to recursively transmit local correlations into global ones. By stacking multiple PTPs, the expressivity capacity of HPFN enjoys an exponential growth w.r.t. the number of layers, which is shown by the equivalence to a very deep convolutional arithmetic circuits. Various experiments demonstrate that it can achieve the state-of-the-art performance.", "full_text": "Deep Multimodal Multilinear Fusion with\n\nHigh-order Polynomial Pooling\n\nMing Hou1,\u2217, Jiajia Tang2,1,\u2217, Jianhai Zhang2, Wanzeng Kong2, Qibin Zhao1,\u2020\n1 Tensor Learning Unit, Center for Advanced Intelligence Project, RIKEN, Japan\n\njhzhang@hdu.edu.cn, kongwanzeng@hdu.edu.cn, qibin.zhao@riken.jp\n\n2 College of Computer Science, Hangzhou Dianzi University, China\n\nming.hou@riken.jp, hdutangjiajia@163.com\n\nAbstract\n\nTensor-based multimodal fusion techniques have exhibited great predictive perfor-\nmance. 
However, one limitation is that existing approaches only consider bilinear or trilinear pooling, which fails to unleash the complete expressive power of multilinear fusion with restricted orders of interactions. More importantly, simply fusing features all at once ignores the complex local intercorrelations, leading to the deterioration of prediction. In this work, we first propose a polynomial tensor pooling (PTP) block for integrating multimodal features by considering high-order moments, followed by a tensorized fully connected layer. Treating PTP as a building block, we further establish a hierarchical polynomial fusion network (HPFN) to recursively transmit local correlations into global ones. By stacking multiple PTPs, the expressive capacity of HPFN enjoys an exponential growth w.r.t. the number of layers, which is shown by its equivalence to a very deep convolutional arithmetic circuit. Various experiments demonstrate that it achieves state-of-the-art performance.\n\n1 Introduction\n\nMultimodal representation learning has been an actively growing research field in artificial intelligence and human communication analysis. Its applications have proliferated across human multimodal tasks such as emotion recognition [2], personality traits recognition [22] and sentiment analysis [18]. The multimodal signals collected from diverse modalities (spoken language, visual and acoustic signals) exhibit properties of consistency and complementarity [28]. Extensive studies are dedicated to modelling the multiple modalities and their complex interactions [28, 14, 15, 13]. These interactions are hard to model due to factors like non-trivial multimodal alignment and unreliable or contradictory information among modalities. 
It remains a major challenge to improve the generalization ability of a model by exploiting the heterogeneous properties of multimodal data.\nThe key step of multimodal modelling is referred to as multimodal fusion, which aims at integrating features of multiple modalities to yield more robust predictions. Typically, multimodal feature fusion can be categorized as early, late and hybrid fusion [1]. Among those, early fusion utilizes the concatenated signals from different sources as the model input [7]. Late fusion, on the other hand, models each modality separately and merges them at the decision level, either by voting or averaging [19, 25]. In hybrid fusion, the output depends on both the unimodal predictions and the early fusion. Despite being simple, the aforementioned conventional fusion techniques are all restricted to the concatenation, averaging or, more generally, linear combination of multimodal features. Such linear modelling may not be sufficient to capture the complicated intercorrelations.\n\n∗The authors contributed equally\n†The corresponding author\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: The scheme of a 5th-order polynomial tensor pooling (PTP) block for fusing z1 and z2.\n\nBy leveraging tensor product representations, recent fusion models [16, 27] are geared towards modelling bilinear/trilinear cross-modal interactions and boost the performance significantly. Nevertheless, such representations suffer from an exponential growth in feature dimensions, with regard to both the unimodal dimensionality and the number of modalities, producing a tremendous number of parameters. 
To tackle this, the work of [17] efficiently reduces the number of fusion parameters by learning low-rank tensor factors, while preserving the capacity to express trimodal (trilinear) interactions. However, their model fails to unleash the full representational power of multilinear feature intercorrelations by restricting the order of interactions. In other words, the interaction is linear w.r.t. each modality, e.g., only up to trilinear interactions for three modalities. More importantly, their framework fuses multimodal features all at once, ignoring the local dynamics of interactions that are crucial to the final prediction. The evolving temporal-modality correlations thus cannot be grasped, which may lead to a deteriorated prediction, especially when long time series are involved.\nIn this work, we start by proposing a polynomial tensor pooling (PTP) block that can fuse locally mixed temporal-modality features. PTP allows for higher-order moments to capture complex nonlinear multimodal correlations. Building upon the basic PTP block, we further establish a hierarchical architecture that recursively integrates and transmits the local temporal-modality correlations into global ones. In this way, fusing multimodal time series data becomes feasible. We refer to the proposed framework as the hierarchical polynomial fusion network (HPFN). HPFN brings dual benefits: 1) the local interactions can be grasped at a much finer granularity, and the dominant local correlations can be efficiently transmitted to the global scale; 2) an exponential growth of the expressive capacity can be achieved by stacking PTPs into multiple layers, which is shown by a connection of HPFN to a very deep convolutional arithmetic circuit. We verify the superior performance of HPFN on two multimodal tasks.\n\n2 Preliminaries\n\nWe refer to multiway arrays of real numbers as tensors [12]. 
We denote a P-order tensor as W ∈ R^{I_1×···×I_P} with P modes. The (i_1, ..., i_P)-th entry of W is denoted as W_{i_1,...,i_P} with i_p ∈ [I_p] for all p ∈ [P], in which the expression [P] represents the set {1, 2, ..., P}. The tensor product, denoted as ⊗, is a fundamental operator in tensor analysis. Given two tensors A ∈ R^{I_1×···×I_P} and B ∈ R^{I_{P+1}×···×I_{P+Q}}, the tensor product produces a (P+Q)-order tensor A ⊗ B ∈ R^{I_1×···×I_{P+Q}} as\n\n(A ⊗ B)_{i_1,...,i_{P+Q}} = A_{i_1,...,i_P} · B_{i_{P+1},...,i_{P+Q}}.   (1)\n\nThe tensor product reduces to the standard outer product for vector inputs. A tensor product of P vectors w^{(p)} ∈ R^{I_p} for p ∈ [P] yields a rank-1 tensor A = w^{(1)} ⊗ ··· ⊗ w^{(P)}. The CANDECOMP/PARAFAC (CP) decomposition [3] of W can be written as a sum of rank-1 tensors, W = Σ_{r=1}^{R} w_r^{(1)} ⊗ ··· ⊗ w_r^{(P)}, where R is defined as the tensor rank. Tensor networks (TNs) [4] generalize tensor decompositions by factorizing a higher-order tensor into a set of sparsely interconnected lower-order tensors. The TN representation greatly diminishes the effect of the curse of dimensionality related to high-order dense tensors. 
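For concreteness, the outer product and the CP format described above can be illustrated numerically. The following is a minimal numpy sketch; the tensor shape and rank are arbitrary choices for illustration, not values used in the paper:

```python
import numpy as np

# A 3-order tensor of shape (4, 5, 6) in rank-R CP form:
# W = sum_r w_r^(1) ⊗ w_r^(2) ⊗ w_r^(3).
R, shape = 2, (4, 5, 6)
rng = np.random.default_rng(0)
factors = [rng.standard_normal((R, I)) for I in shape]  # w_r^(p) for each mode p

# Dense reconstruction via outer (tensor) products of the factor vectors.
W = np.zeros(shape)
for r in range(R):
    W += np.einsum('i,j,k->ijk', factors[0][r], factors[1][r], factors[2][r])

# The CP format stores R*(4+5+6) numbers instead of the 4*5*6 dense entries.
n_cp = R * sum(shape)
n_dense = int(np.prod(shape))
print(n_cp, n_dense)  # 30 120
```

Any unfolding of a rank-R CP tensor has matrix rank at most R, which is one way to see how the format sidesteps the curse of dimensionality.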
TNs include a number of special cases such as the CP, Tucker [26], tensor train (TT) [21] and tensor ring (TR) [32] formats.\n\n3 Methodology\n\nWe start this section by presenting a product pooling strategy named polynomial tensor pooling (PTP) that serves as a basic building block for our hierarchical polynomial fusion framework (HPFN).\n\nFigure 2: (a) An illustrative example of a fusion network with a single PTP block, whose receptive 'window' size is [8 × 3]. (b) An example of a two-layer HPFN. For the input layer, the overlapped 'window' has size [4 × 3] with a stride of 2 along the time dimension. For the hidden layer, the 'window' with size [3 × 1] covers all the intermediate features from the previous layer. H1-1 stands for the '1st' column index of feature nodes in the '1st' hidden layer.\n\nThe motivations for PTP are twofold: 1) it explicitly models high-order nonlinear intra-modal and cross-modal interactions; 2) for multimodal time series, it can directly model local interactions within a scanning receptive 'window' across both the temporal and modality dimensions.\n\n3.1 High-order polynomial tensor pooling (PTP)\n\nThe objective of a PTP block is to efficiently merge a collection of features {z_m}_{m=1}^{M} into a joint compact representation z by exploiting the explicit interactions of high-order moments. Figure 1 depicts the flowchart of operations in a PTP block. 
More specifically, a set of M feature vectors {z_m}_{m=1}^{M} are first concatenated into a long feature vector f:\n\nf^T = [1, z_1^T, z_2^T, ···, z_M^T].   (2)\n\nThen, a degree-P polynomial feature tensor F is formulated using a P-order tensor product of the concatenated feature vector f as\n\nF = f ⊗ f ⊗ ··· ⊗ f   (P-order),   (3)\n\nwhere ⊗ is the tensor product operator. Notice that F is capable of representing all possible polynomial expansions up to order P due to the incorporation of the constant term '1' in (2). The effect of the order-P polynomial interaction between features is transformed by a pooling weight tensor W = [W^1, ..., W^h, ..., W^H] as\n\nz_h = Σ_{i_1,i_2,···,i_P} W^h_{i_1 i_2 ··· i_P} · F_{i_1 i_2 ··· i_P},   (4)\n\nwhere z_h indicates the h-th element of the H-dimensional fused vector z, while i_p indexes the high-order terms in the p-th mode. Unfortunately, the number of parameters of W^h in (4) grows exponentially with the polynomial order P. To tackle this issue, we adopt low-rank TNs to efficiently approximate W^h. Suppose W^h admits a rank-R CP format; then (4) becomes\n\nz_h = Σ_{i_1,i_2,···,i_P} ( Σ_{r=1}^{R} a^h_r Π_{p=1}^{P} w^{h(p)}_{r;i_p} ) ( Π_{p=1}^{P} f_{i_p} ) = Σ_{r=1}^{R} a^h_r Π_{p=1}^{P} ( Σ_{i_p=1}^{I} w^{h(p)}_{r;i_p} f_{i_p} ).   (5)\n\nSince the explicitly constructed feature tensor is super-symmetric, it then makes sense to assume w^h_r = w^{h(p)}_r for all p ∈ [P]. 
Hence, {a^h_r, w^h_r}_{r=1}^{R} for h ∈ [H] are the collection of fusion parameters to estimate. If W^h admits a TR format, then the following formula can be derived from (4):\n\nz_h = Σ_{i_1,···,i_P} Σ_{r_1,···,r_P} ( Π_{p=1}^{P} G^{h(p)}_{r_p;i_p;r_{p+1}} ) ( Π_{p=1}^{P} f_{i_p} ) = Σ_{r_1,···,r_P} Π_{p=1}^{P} ( Σ_{i_p} G^{h(p)}_{r_p;i_p;r_{p+1}} f_{i_p} ) = Σ_{r_1,···,r_P} Π_{p=1}^{P} G̃^{h(p)}_{r_p;r_{p+1}} = Trace( Π_{p=1}^{P} G̃^{h(p)} ),   (6)\n\nwhere the 3rd-order core tensors {G^{h(p)}}_{p=1}^{P} for h ∈ [H] are the fusion parameters, and {r_p}_{p=1}^{P} are defined as the TR-ranks with r_{P+1} = r_1. It is also reasonable to assume a shared G^h = G^{h(p)} for all p ∈ [P]. In this manner, the fusion computations can be efficiently carried out along each dimension implicitly, thus avoiding the curse of dimensionality on both the feature and weight tensors.\n\nFigure 3: An example of a three-layer HPFN.\n\n3.2 Hierarchical polynomial fusion network (HPFN)\n\nHaving introduced our basic pooling block, we move on to present the general framework for fusing multimodal data. Generally, if we rearrange multimodal time series as a '2D feature map', the patterns of correlations may manifest themselves in a receptive 'window' covering a local mixture of temporal-modality features across both dimensions. Then, interactions can be gauged by associating a single PTP block with that local 'window'. 
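Under the shared-factor (symmetry) assumption above, the CP form of (5) reduces to z_h = Σ_r a^h_r (w_r^{hT} f)^P, which never materializes the I^P-entry feature tensor F. The following is a minimal numpy sketch of this computation; the function name and array shapes are our own illustration, not the authors' released code:

```python
import numpy as np

def ptp_cp(feats, a, w, P):
    """Polynomial tensor pooling with rank-R CP weights, cf. Eq. (5).

    feats: list of modality feature vectors z_m
    a:     (H, R) mixing coefficients a_r^h
    w:     (H, R, I) shared per-mode factors w_r^h (symmetry assumption)
    P:     polynomial order
    """
    # f = [1, z_1^T, ..., z_M^T]^T; the leading '1' keeps all lower-order terms.
    f = np.concatenate([[1.0]] + [np.asarray(z, dtype=float) for z in feats])
    proj = w @ f                         # (H, R) inner products w_r^{hT} f
    return (a * proj ** P).sum(axis=1)   # z_h = sum_r a_r^h (w_r^{hT} f)^P
```

As a sanity check, for P = 3 this agrees with the explicit contraction z_h = ⟨W^h, f ⊗ f ⊗ f⟩ where W^h = Σ_r a^h_r w^h_r ⊗ w^h_r ⊗ w^h_r, while touching only O(HRI) parameters.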
Using a hierarchical architecture, the local temporal-modality patterns of correlations can be recursively integrated via stacking PTPs in multiple layers. In the end, significant correlations are identified and transmitted to the global scale.\nFigure 2 (a) shows a simple one-layer fusion network, with a single PTP operating on one receptive 'window' that covers features across all 8 time steps and 3 modalities. In this way, PTP makes it feasible to capture the high-order nonlinear interactions among the 24 mixed features within the 'window'. We observe that a PTP naturally characterizes local correlations if it is linked to a small receptive 'window', and several PTP blocks can be placed on the local 'windows' of mixed features at distinct locations in a '2D feature map'. It is then straightforward to distribute the fusion process into a number of layers by attaching PTP blocks to small 'windows' at each layer. In fact, a fused node in a higher layer corresponds to a larger effective receptive 'window' of features at the lower layer. As a result, more expressive local and global correlations can be efficiently modelled with great flexibility. The proposed framework is termed the hierarchical polynomial fusion network (HPFN).\nFigure 3 displays an instance of a three-layer HPFN. At the first hidden layer, each PTP attempts to model local interactions in a 'window' of 2 time steps and 2 modalities. For instance, the audio and video features spanning times T1 and T2 are merged into the resulting hidden node H1-1 at time T2; likewise, the hidden node H1-3 at time T2 is produced by fusing the audio and text features of T1 and T2. The second hidden layer is fed with the intermediate features of the previous layer. 
At the output layer, the final feature is obtained by applying PTP to the intermediate features of the 3 modalities in the second hidden layer at times T4 and T8.\nDue to the flexibility of our HPFN, various choices of architecture design are possible. In principle, adding more intermediate layers leads to more complicated and higher-order interactions within a much larger effective receptive 'window'. More complex interactions can also be modelled by allowing the 'windows' to overlap. Figure 2 (b) demonstrates an architecture of a two-layer HPFN where the fusing 'windows' of size [4 × 3] are overlapped at a stride of 2 along the time dimension. More variations can be realized by drawing an analogy between our PTP and a convolution filter. Just like in a CNN, a PTP operator can be viewed as a 'fusion filter'. In this way, our HPFN may borrow some benefits from the architecture of a regular CNN. More precisely, at each layer the PTP 'fusion filter' could be shared as the scanning 'window' slides along the time dimension, so as to catch important patterns of correlations repeated in the time series. Furthermore, associating several PTP 'fusion filters' with one 'window' at the same time makes it possible to capture multiple patterns of correlations existing in that 'window'.\nThe empirical success of densely connected networks (DenseNets) [11] serves as another inspiration to extend the HPFN architecture. 
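The layer-wise scanning just described can be sketched as follows. The window sizes, the toy second-order 'fusion filter', and all names below are illustrative stand-ins of our own, not the paper's implementation (a real HPFN layer would place a PTP block, as in (5), where `toy_fuse` is):

```python
import numpy as np

def scan_windows(seq, win, stride, fuse):
    """Slide a temporal 'window' over a sequence of per-step feature vectors
    and fuse each window into one node, as one HPFN layer scans its input."""
    out = []
    for start in range(0, len(seq) - win + 1, stride):
        window = np.concatenate(seq[start:start + win])  # mixed temporal-modality features
        out.append(fuse(window))
    return out

rng = np.random.default_rng(0)
proj = rng.standard_normal((4, 7))  # hypothetical: 6 window dims + bias -> 4 hidden dims

def toy_fuse(x):
    # Toy second-order 'fusion filter': append the constant 1, project, square.
    return (proj @ np.concatenate([[1.0], x])) ** 2

seq = [rng.standard_normal(3) for _ in range(8)]              # 8 time steps, 3 features each
layer1 = scan_windows(seq, win=2, stride=2, fuse=toy_fuse)    # 4 fused hidden nodes
layer2 = scan_windows(layer1, win=4, stride=4,
                      fuse=lambda x: x.mean(keepdims=True))   # one global node
print(len(layer1), len(layer2))  # 4 1
```

Stacking such scans is what lets a node in a higher layer cover a larger effective receptive 'window' of the input.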
The incorporation of dense connectivity enhances the expressive capacity of the fusion model.\n\nFigure 4: An example of a densely connected four-layer HPFN with growth rate k = 1.\n\nAdding dense inter-connections could be beneficial in dealing with sequential signals. Specifically, dense connectivity is realized via the direct inclusion of the features from previous layers into the current layer. The number of previous layers k ∈ N involved in the connections is defined as the growth rate. Figure 4 depicts an instance of a dense HPFN with growth rate k = 1.\n\n3.3 Connections to convolutional arithmetic circuits\n\nIt is interesting to observe that equation (5) suggests PTP actually conducts a combined operation of convolution, pooling and linear transformation. This is quite analogous to convolutional arithmetic circuits (ConvACs) [5], which can be seen as special variants of CNNs. Rather than rectifier activations and average/max pooling, ConvACs are equipped with linear activations and product pooling layers. The authors of [5] analyze the expressive capacity of deep ConvACs by deriving their equivalence with the hierarchical Tucker decomposition (HTD) [9]. It has been proved that deep ConvACs enjoy a greater expressive power than regular rectifier-based CNNs [5]. In fact, a single PTP block corresponds to a shallow ConvAC if the CP format is utilized, and further corresponds to a deep ConvAC if the HTD is adopted for the pooling weight tensor. 
The major difference between a ConvAC and PTP is that the product pooling of the standard ConvAC is conducted over the locations of features, whereas the product pooling of PTP is over the polynomial orders of the concatenated features. Stacking PTP blocks into multiple layers is essentially equivalent to employing multiple HTDs in a recursive manner, resulting in a correspondence of our HPFN to an even deeper ConvAC. As a consequence, more flexible higher-order local and global intercorrelations can be explicitly and implicitly captured by HPFN, whose great expressive power is implied by the connection of HPFN to a very deep ConvAC.\n\n3.4 Model complexity\n\nThis section compares the model complexity of HPFN with two other tensor-based models: TFN [27] and LMF [17]. As for PTP, exploiting the symmetry property of the feature tensor, the number of parameters in the weight tensor is independent of the order P, and scales linearly with the length of the concatenated mixed features in a 'window'. For an L-layer HPFN, the number of parameters is linearly related to the total number of PTP 'windows' Σ_{l=1}^{L} N_l, where N_l is the number of 'windows' at layer l ∈ [L]. In practice, N_l is usually small and decreases along the layers, e.g. N_1 > N_2 > ··· > N_L. Adopting the sharing strategy along the time dimension makes N_l even smaller. In principle, as shown in Table 1, the parameter count of HPFN is larger than or comparable to that of LMF, but significantly less than that of TFN.\n\nTable 1: Model complexity comparisons of TFN, LMF and our HPFN. I_y is the output feature length. M is the number of modalities. R is the tensor rank. For PTP and HPFN, [T, S] is the local 'window' size with S ≤ M. 
I_{t,m} is the dimension of features from modality m at time t.\n\nModel: TFN [non-temporal] — Param.: O(I_y Π_{m=1}^{M} I_m)\nModel: LMF [non-temporal] — Param.: O(I_y R Σ_{m=1}^{M} I_m)\nModel: PTP [temporal] — Param.: O(I_y R Σ_{t=1}^{T} Σ_{m=1}^{S} I_{t,m})\nModel: HPFN (L layers) [temporal] — Param.: O(I_y R (Σ_{l=1}^{L} N_l)(Σ_{t=1}^{T} Σ_{m=1}^{S} I_{t,m}))\n\n4 Related work\n\nThere exist two major lines of multimodal fusion research. Non-temporal models summarize the observations of each modality by averaging the features along the temporal dimension. These models have found their utility in the early work of multimodal sentiment analysis [18, 31]. Recently, the tensor fusion network (TFN) [27] exploits the tensor product to model non-temporal unimodal, bimodal and trimodal interactions between modalities. To handle the curse of dimensionality issue, the low-rank multimodal fusion network (LMF) [17] further enhances the scalability of non-temporal fusion with modality-specific low-rank factors. All those approaches, with the averaged statistics of features, attempt to identify the correlations all at once without using temporal information. Although simple, they are unable to learn the intra-modal and cross-modal dynamics evolving along the time sequence, and thus suffer an accuracy loss in prediction.\nMultimodal temporal models, on the other hand, handle multimodal interactions at a much finer granularity along the time dimension. The long short-term memory (LSTM) [10] has been extensively used in the sequential multimodal setting.\n\nTable 2: Specifications of the network architectures for the non-temporal version of HPFN. [-] indicates the configuration of a specific layer. PTP^k_m denotes the m-th fused feature node in layer k.\n\nHPFN: [PTP^o_1(a, v, l)]\nHPFN-L2: [PTP^{h1}_1(a, v), PTP^{h1}_2(v, l), PTP^{h1}_3(a, l)] – [PTP^o_1(PTP^{h1}_1, PTP^{h1}_2, PTP^{h1}_3)]\nHPFN-L2-S1: [PTP^{h1}_1(a, v), PTP^{h1}_2(v, l), PTP^{h1}_3(a, l)] – [PTP^o_1(PTP^{h1}_1, a, v, l)]\nHPFN-L2-S2: [PTP^{h1}_1(a, v), PTP^{h1}_2(v, l), PTP^{h1}_3(a, l)] – [PTP^o_1(PTP^{h1}_1, PTP^{h1}_2, PTP^{h1}_3, a, v, l)]\nHPFN-L3: [PTP^{h1}_1(a, v), PTP^{h1}_2(v, l), PTP^{h1}_3(a, l)] – [PTP^{h2}_1(PTP^{h1}_1, PTP^{h1}_2), PTP^{h2}_2(PTP^{h1}_1, PTP^{h1}_3), PTP^{h2}_3(PTP^{h1}_2, PTP^{h1}_3)] – [PTP^o_1(PTP^{h2}_1, PTP^{h2}_2, PTP^{h2}_3)]\nHPFN-L4: [PTP^{h1}_1(a, v), PTP^{h1}_2(v, l), PTP^{h1}_3(a, l)] – [PTP^{h2}_1(PTP^{h1}_1, PTP^{h1}_2), PTP^{h2}_2(PTP^{h1}_1, PTP^{h1}_3), PTP^{h2}_3(PTP^{h1}_2, PTP^{h1}_3)] – [PTP^{h3}_1(PTP^{h2}_1, PTP^{h2}_2), PTP^{h3}_2(PTP^{h2}_1, PTP^{h2}_3), PTP^{h3}_3(PTP^{h2}_2, PTP^{h2}_3)] – [PTP^o_1(PTP^{h3}_1, PTP^{h3}_2, PTP^{h3}_3)]\n\n
Among them, the multi-view LSTM (MV-LSTM) [24] partitions the memory cell by modality to capture both view-specific and cross-view interactions; the bidirectional contextual LSTM (BC-LSTM) [23] is proposed to conduct context-dependent sentiment analysis and emotion recognition on multimodal time series. The memory fusion network (MFN) [28] stores cross-modal and intra-modal interactions along the time domain with a multi-view gated memory, while the multi-attention recurrent network (MARN) [29] employs a multi-attention block to discover cross-modality dynamics with attention coefficients. More recently, the recurrent multistage fusion network (RMFN) [14] decomposes the fusion into multiple stages, each focusing on a subset of signals whose fusion outcomes build upon the intermediate representations of previous stages. However, compared with tensor-based multimodal fusion, all the above approaches are limited to modelling only linear interactions, and are unable to identify complicated multimodal correlations.\n\n5 Experiments\n\n5.1 Experiment setups\n\nDatasets. The CMU-MOSI dataset [30] consists of 2,199 opinion video clips from YouTube movie reviews. Each clip is assigned a sentiment in the range [−3, 3], from highly negative to highly positive. There are 1,284 segments in the train set, 229 segments in the validation set and 686 segments in the test set. The IEMOCAP dataset [2] contains a total of 302 videos. The segments from the videos are annotated with discrete emotions (neutral, fear, happy, angry, disappointed, sad, frustrated, excited, surprised), as well as dominance, valence and arousal. The division of the train, validation and test sets is 6,373, 1,775 and 1,807 segments, respectively. The splits of both datasets are speaker-independent, ensuring that a given speaker can only belong to one of the three sets.\nFeatures. 
For IEMOCAP, we adopt the preprocessed non-temporal inputs following the work of LMF [17], in which the acoustic and visual features are obtained by averaging out the time dimension. For CMU-MOSI, temporal features are utilized in the same way as in MFN [28], where the extracted features of the three modalities are synchronized at the word level in accordance with the text modality.\nComparisons. We compare HPFN with the following cutting-edge tensor- and non-tensor-based models: the memory fusion network (MFN) [28], multi-attention recurrent network (MARN) [29], tensor fusion network (TFN) [27] and low-rank multimodal fusion network (LMF) [17], as well as some other baselines. We report the mean absolute error (MAE), Pearson correlation, accuracy and F1 measure. For our HPFN, the evaluations are repeated 5 times for the optimal settings.\nModel architectures. The architectures of HPFN adopted in our experiments are described in Table 2, including the two-layer densely connected variants HPFN-L2-S1 and HPFN-L2-S2.\n\nTable 3: Results for sentiment analysis on CMU-MOSI and emotion recognition on IEMOCAP. CMU-MOSI columns: MAE, Corr, Acc-2, F1, Acc-7; IEMOCAP columns: F1-Happy, F1-Sad, F1-Angry, F1-Neutral.\n\nSVM [6]: 1.864, 0.057, 50.2, 50.1, 17.5 | 81.5, 78.8, 82.4, 64.9\nDF [20]: 1.143, 0.518, 72.3, 72.1, 26.8 | 81.0, 81.2, 65.4, 44.0\nBC-LSTM [23]: 1.079, 0.581, 73.9, 73.9, 28.7 | 81.7, 81.7, 84.2, 64.1\nMV-LSTM [24]: 1.019, 0.601, 73.9, 74.0, 33.2 | 81.3, 74.0, 84.3, 66.7\nMARN [29]: 0.968, 0.625, 77.1, 77.0, 34.7 | 83.6, 81.2, 84.2, 65.9\nMFN [28]: 0.965, 0.632, 77.4, 77.3, 34.1 | 84.0, 82.1, 83.7, 69.2\nTFN [27]: 0.970, 0.633, 73.9, 73.4, 32.1 | 83.6, 82.8, 84.2, 65.4\nLMF [17]: 0.912, 0.668, 76.4, 75.7, 32.8 | 85.8, 85.9, 89.0, 71.7\nHPFN, P=[4] (audio): 1.404, 0.223, 57.3, 57.4, 19.0 | 79.4, 81.8, 84.9, 63.6\nHPFN, P=[4] (video): 1.409, 0.221, 57.0, 57.1, 20.6 | 83.2, 73.2, 72.3, 58.5\nHPFN, P=[4] (text): 0.975, 0.634, 76.4, 76.4, 35.1 | 85.3, 83.0, 85.6, 70.8\nHPFN, P=[4]: 0.965, 0.650, 77.5, 77.4, 36.0 | 85.7, 86.4, 88.3, 72.1\nHPFN, P=[8]: 0.968, 0.648, 77.2, 77.2, 36.9 | 85.7, 86.5, 87.9, 71.8\nHPFN-L2, P=[2, 2]: 0.945, 0.672, 77.5, 77.4, 36.7 | 86.2, 86.6, 88.8, 72.5\n\nImplementation details. Following LMF [17], we use the CP format as the 'workhorse' low-rank TN for weight compression in PTP. The candidate CP ranks are {1, 4, 8, 16}. Other TN variants will be investigated in future work. Since HPFN involves high-order moments when calculating the element-wise multiplications, the values of the intermediate features may vary drastically and hence lead to unstable predictions. To make the model numerically more stable, similar to [8], we optionally apply power normalization (element-wise signed square root) or l2 normalization.\n\n5.2 Experimental results\n\nPerformance comparison with state-of-the-art models. We first compare with the baselines and the cutting-edge models on the tasks of sentiment analysis and emotion recognition. The bottom rows of Table 3 record the performance of our models. We see that ours (on multimodal data) outperform the competitors on most of the metrics. In particular, on the sentiment task, our HPFN at 8th order exceeds the previous best, MARN, on 'Acc-7' by a margin of 2.2%. 
The overall best results are achieved by HPFN-L2, which implies the superior expressive power and efficacy of hierarchical fusion structures. It is also interesting to note that, even when fed with unimodal input (text), our HPFN of order 4 obtains much better 'Acc-7' (35.1) and 'F1-Neutral' (70.8) scores than almost all other methods, indicating the benefits brought by modelling high-order interactions.\n\nFigure 5: Results of the effect of the order of polynomial interactions on IEMOCAP and CMU-MOSI.\n\nEffect of the order of polynomial fusion. As high-order moments play a critical role in our fusion strategy, we are interested in examining how distinct orders affect the predictive performance. For simplicity, we directly apply HPFN with power normalization to the non-temporal multimodal features (obtained by averaging out the time dimension). The order P varies from 1 to 10. In Figure 5, HPFN achieves fairly good accuracies across the tested orders. In particular, we can see that HPFN maximizes its predictions at order 4 in the case of CMU-MOSI. For IEMOCAP, we observe relatively high performance peaks at orders 3 and 4 for the 'neutral' and 'angry' emotions. For the remaining emotions, the desirable orders range from 5 to 8. These observations signify the necessity and effectiveness of exploring high-order interactions in fusing multimodal features.\nEffect of the depth and dense connectivity. In this part, we investigate the impact of various architecture designs, i.e., depth and dense connectivity, on the predictive performance. To focus on the change of depth, we apply the architectures to non-temporal multimodal features. 
For the depth variants, we validate on HPFN, HPFN-L2, HPFN-L3 and HPFN-L4. We also compare with two densely connected variants: HPFN-L2-S1 and HPFN-L2-S2.\n\nTable 4: Results of HPFN on non-temporal multimodal features w.r.t. the depth and dense connectivity. IEMOCAP columns: F1-Happy, F1-Sad, F1-Angry, F1-Neutral; CMU-MOSI columns: MAE, Corr, Acc-2, F1, Acc-7.\n\nHPFN, P=[2]: 85.7, 86.2, 87.8, 71.9 | 0.973, 0.635, 77.1, 77.0, 35.9\nHPFN-L2, P=[2, 2]: 86.2, 86.6, 88.8, 72.5 | 0.958, 0.652, 77.1, 77.1, 36.3\nHPFN-L2-S1, P=[2, 2]: 86.2, 86.7, 88.9, 72.6 | 0.959, 0.654, 77.3, 77.2, 36.5\nHPFN-L2-S2, P=[2, 2]: 86.2, 86.7, 89.0, 72.7 | 0.957, 0.656, 77.3, 77.3, 36.5\nHPFN-L3, P=[2, 2, 1]: 86.1, 86.8, 88.3, 72.7 | 0.960, 0.651, 76.8, 76.8, 36.0\nHPFN-L4, P=[2, 2, 2, 1]: 85.8, 86.4, 88.1, 72.5 | 0.992, 0.634, 76.6, 76.5, 34.6\n\nTable 5: Results on the modelling of locally mixed temporal-modality features (CMU-MOSI). Columns: MAE, Corr, Acc-2, F1, Acc-7.\n\nHPFN-L2, P=[2, 2] (non-temporal): 0.958, 0.652, 77.1, 77.1, 36.3\nHPFN-L2, P=[2, 2] (temporal-overlapped, audio): 1.407, 0.229, 57.4, 56.2, 20.1\nHPFN-L2, P=[2, 2] (temporal-overlapped, video): 1.358, 0.183, 61.2, 61.3, 20.3\nHPFN-L2, P=[2, 2] (temporal-overlapped, text): 0.933, 0.677, 76.7, 76.6, 35.4\nHPFN-L2, P=[2, 2] (temporal-overlapped): 0.944, 0.678, 77.5, 77.4, 36.7\nHPFN-L2, P=[2, 2] (weight-shared): 0.955, 0.667, 77.0, 76.9, 35.7\n\nIn Table 4, we find that the two-layer and three-layer architectures attain better overall results than both their one-layer and four-layer counterparts. In particular, HPFN-L2-S2 reaches the best precision on both datasets. 
The single-layer HPFN is too simple to learn the complex interactions, while HPFN-L4, containing too many intermediate nodes, is likely to overfit under this specific architecture design. Allowing skip connections further enhances the performance of HPFN-L2, which may be due to the incorporation of guidance from the more discriminative unimodal signals without adding more intermediate layers.
Effect of modelling mixed temporal-modality features. Being able to deal with a local mixture of temporal-modality features is one desirable property of our model. In this test, we examine how the model behaves when considering both the temporal and modality domains. We adapt HPFN-L2 to the temporal context with a ‘window’ size of [4 × 2] for the input layer, and set the stride to 2 along the time dimension. The non-temporal HPFN-L2 considers only the modality domain, by averaging out the time dimension. Table 5 indicates the superiority of the temporal HPFN-L2 over the non-temporal one. We further attempt to share the PTPs by scanning the ‘window’ along the temporal direction. It turns out that sharing a single PTP unit across multiple windows does not bring extra performance gain in this setting. Figure 6 displays the predictions w.r.t. the ‘window’ size in the temporal domain. For the non-weight-shared case, moderate ‘windows’ (sizes of 5 and 10) reach the peak performance. In contrast, the weight-shared model attains its relatively high performance under the largest window size (20). This again implies that sharing a single PTP may not be able to capture the local, evolving dynamics of the interactions.

Figure 6: Results on predictions w.r.t. the ‘window’ size along the time domain. The left two figures: non-weight-shared model. The right two figures: weight-shared model.

6 Conclusion

In this paper, we proposed a high-order polynomial multilinear pooling block for multimodal feature fusion.
Based on this, we established a hierarchical polynomial fusion network (HPFN), which can flexibly fuse mixed features across both the time and modality domains. The proposed model is effective in capturing complex temporal-modality correlations from local to global scale. Various experiments on real multimodal fusion tasks validate the superior performance of the proposed model. For future work, we would like to further examine how architecture designs affect the prediction performance, for example, by attaching multiple PTP blocks to a single ‘window’ and sharing those multiple PTP ‘fusion filters’ along the time dimension to model more complex patterns of correlations.

Acknowledgments

This work was partially supported by JSPS KAKENHI (Grant No. 17K00326), the National Key Research and Development Program Intergovernmental International Science and Technology Innovation Cooperation Project (MOST-RIKEN) under Grant 2017YFE0116800, and the National Natural Science Foundation of China (Grant No. 61633010).

References
[1] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2019.

[2] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335, 2008.

[3] J Douglas Carroll and Jih-Jie Chang.
Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition. Psychometrika, 35(3):283–319, 1970.

[4] Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao, Danilo P Mandic, et al. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249–429, 2016.

[5] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.

[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[7] Sidney K D’Mello and Jacqueline Kory. A review and meta-analysis of multimodal affect detection systems. ACM Computing Surveys (CSUR), 47(3):43, 2015.

[8] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 457–468, 2016.

[9] Wolfgang Hackbusch and Stefan Kühn. A new scheme for the tensor representation. Journal of Fourier Analysis and Applications, 15(5):706–722, 2009.

[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[11] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[12] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications.
SIAM Review, 51(3):455–500, 2009.

[13] Paul Pu Liang, Yao Chong Lim, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov, and Louis-Philippe Morency. Strong and simple baselines for multimodal utterance embeddings. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pages 2599–2609, 2019.

[14] Paul Pu Liang, Ziyin Liu, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Multimodal language analysis with recurrent multistage fusion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 150–161, 2018.

[15] Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Multimodal local-global ranking fusion for emotion recognition. In Proceedings of the International Conference on Multimodal Interaction, pages 472–476. ACM, 2018.

[16] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.

[17] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2247–2256, 2018.

[18] Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards multimodal sentiment analysis: harvesting opinions from the web. In Proceedings of the International Conference on Multimodal Interfaces, pages 169–176. ACM, 2011.

[19] Emilie Morvant, Amaury Habrard, and Stéphane Ayache. Majority vote of diverse classifiers for late fusion. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 153–162.
Springer, 2014.\n\n[20] Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltru\u0161aitis, and Louis-Philippe\nMorency. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the ACM International\nConference on Multimodal Interaction, pages 284\u2013288. ACM, 2016.\n\n[21] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scienti\ufb01c Computing, 33(5):2295\u20132317,\n\n2011.\n\n[22] Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. Com-\nputational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction\napproach. In Proceedings of the International Conference on Multimodal Interaction, pages 50\u201357. ACM,\n2014.\n\n[23] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe\nMorency. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the Annual\nMeeting of the Association for Computational Linguistics, pages 873\u2013883, 2017.\n\n[24] Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrusaitis, and Roland Goecke. Extending\nlong short-term memory for multi-view structured learning. In European Conference on Computer Vision,\npages 338\u2013353. Springer, 2016.\n\n[25] Ekaterina Shutova, Douwe Kiela, and Jean Maillard. Black holes and white rabbits: Metaphor identi\ufb01cation\nwith visual features. In Proceedings of the Conference of the North American Chapter of the Association\nfor Computational Linguistics: Human Language Technologies, pages 160\u2013170, 2016.\n\n[26] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279\u2013311,\n\n1966.\n\n[27] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion\nnetwork for multimodal sentiment analysis. 
In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1103–1114, 2017.

[28] Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Memory fusion network for multi-view sequential learning. In AAAI Conference on Artificial Intelligence, 2018.

[29] Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. Multi-attention recurrent network for human communication comprehension. In AAAI Conference on Artificial Intelligence, 2018.

[30] Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259, 2016.

[31] Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intelligent Systems, 31(6):82–88, 2016.

[32] Qibin Zhao, Masashi Sugiyama, Longhao Yuan, and Andrzej Cichocki. Learning efficient tensor representations with ring-structured networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8608–8612. IEEE, 2019.