{"title": "Generalised Coupled Tensor Factorisation", "book": "Advances in Neural Information Processing Systems", "page_first": 2151, "page_last": 2159, "abstract": "We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie's distributions corresponding to $\\beta$-divergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is, then extended easily to address the problems when multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem.", "full_text": "Generalised Coupled Tensor Factorisation\n\nY. Kenan Y\u0131lmaz\n\nA. Taylan Cemgil\n\nUmut S\u00b8ims\u00b8ekli\n\nDepartment of Computer Engineering\nBo\u02d8gazic\u00b8i University, Istanbul, Turkey\n\nkenan@sibnet.com.tr, {taylan.cemgil, umut.simsekli}@boun.edu.tr\n\nAbstract\n\nWe derive algorithms for generalised tensor factorisation (GTF) by building upon\nthe well-established theory of Generalised Linear Models. Our algorithms are\ngeneral in the sense that we can compute arbitrary factorisations in a message\npassing framework, derived for a broad class of exponential family distribu-\ntions including special cases such as Tweedie\u2019s distributions corresponding to \u03b2-\ndivergences. By bounding the step size of the Fisher Scoring iteration of the GLM,\nwe obtain general updates for real data and multiplicative updates for non-negative\ndata. 
The GTF framework is then easily extended to address problems in which multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem.

1 Introduction

A fruitful modelling approach for extracting meaningful information from highly structured multivariate datasets is based on matrix factorisations (MFs). In fact, many standard data processing methods of machine learning and statistics, such as clustering, source separation, independent components analysis (ICA), nonnegative matrix factorisation (NMF) and latent semantic indexing (LSI), can be expressed and understood as MF problems. These MF models also have well-understood probabilistic interpretations as probabilistic generative models. Indeed, many of the standard algorithms mentioned above can be derived as maximum likelihood or maximum a-posteriori parameter estimation procedures. It is also possible to do a full Bayesian treatment for model selection [1].

Tensors appear as a natural generalisation of matrix factorisation, when observed data and/or a latent representation have several semantically meaningful dimensions. Before giving a formal definition, consider the following motivating example

X1^{i,j,k} ≈ Σ_r Z1^{i,r} Z2^{j,r} Z3^{k,r},    X2^{j,p} ≈ Σ_r Z2^{j,r} Z4^{p,r},    X3^{j,q} ≈ Σ_r Z2^{j,r} Z5^{q,r}    (1)

where X1 is an observed 3-way array and X2, X3 are 2-way arrays, while Zα for α = 1 . . . 5 are the latent 2-way arrays. Here, the 2-way arrays are just matrices, but this can be easily extended to objects having an arbitrary number of indices. As the term 'N-way array' is awkward, we prefer the more convenient term tensor. Here, Z2 is a shared factor, coupling all models.
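Concretely, the shared-factor structure of (1) can be sketched in a few lines with `numpy.einsum`; the shapes and the rank below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative sizes only: |i| = |j| = |k| = |p| = |q| = 4, rank |r| = 3.
rng = np.random.default_rng(0)
I, J, K, P, Q, R = 4, 4, 4, 4, 4, 3
Z1, Z2, Z3 = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
Z4, Z5 = rng.random((P, R)), rng.random((Q, R))

# Model (1): a CP factorisation and two MFs, all sharing the factor Z2.
X1 = np.einsum('ir,jr,kr->ijk', Z1, Z2, Z3)  # CP (Parafac)
X2 = np.einsum('jr,pr->jp', Z2, Z4)          # MF, equals Z2 @ Z4.T
X3 = np.einsum('jr,qr->jq', Z2, Z5)          # MF, equals Z2 @ Z5.T
```

The einsum subscripts make the coupling explicit: the index r and the factor Z2 appear in all three products.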
As the first model is a CP (Parafac) while the second and the third are MFs, we call the combined factorisation the CP/MF/MF model. Such models are of interest when one can obtain different 'views' of the same piece of information (here Z2) under different experimental conditions. Singh and Gordon [2] focused on a similar problem, called collective matrix factorisation (CMF) or multi-matrix factorisation, for relational learning, but only for matrix factors and observations. In addition, their generalised Bregman divergence minimisation procedure assumes matching link and loss functions. For coupled matrix and tensor factorisation (CMTF), [3] recently proposed a gradient-based all-at-once optimisation method as an alternative to alternating least squares (ALS) optimisation, and demonstrated their approach on a CP/MF coupled model. Similar models are used for protein-protein interaction (PPI) problems in gene regulation [4].

The main motivation of the current paper is to construct a general and practical framework for the computation of tensor factorisations (TF), by extending the well-established theory of Generalised Linear Models (GLM). Our approach is also partially inspired by probabilistic graphical models: our computation procedures for a given factorisation have a natural message passing interpretation. This provides a structured and efficient approach that enables very easy development of application-specific custom models, priors or error measures, as well as algorithms for joint factorisations where an arbitrary set of tensors can be factorised simultaneously. Well-known models of multiway analysis (Parafac, Tucker [5]) appear as special cases, and novel models and their associated inference algorithms can be developed automatically.
In [6], the authors take a similar approach to tensor factorisations as ours, but that work is limited to KL and Euclidean costs, generalising the MF models of [7] to the tensor case. It is possible to generalise this line of work to β-divergences [8], but none of these works addresses the coupled factorisation case, and they consider only a restricted class of cost functions.

2 Generalised Linear Models for Matrix/Tensor Factorisation

To set the notation and our approach, we briefly review GLMs, following closely the original notation of [9, ch 5]. A GLM assumes that a data vector x has conditionally independently drawn components x_i according to an exponential family density

x_i ∼ exp( (x_i γ_i − b(γ_i))/τ² − c(x_i, τ) ),    ⟨x_i⟩ = x̂_i = ∂b(γ_i)/∂γ_i,    var(x_i) = τ² ∂²b(γ_i)/∂γ_i²    (2)

Here, the γ_i are canonical parameters and τ² is a known dispersion parameter. ⟨x_i⟩ is the expectation of x_i and b(·) is the log partition function, enforcing normalisation. The canonical parameters are not directly estimated; instead, one assumes a link function g(·) that 'links' the mean of the distribution x̂_i via g(x̂_i) = l_i⊤ z, where l_i⊤ is the ith row vector of a known model matrix L and z is the parameter vector to be estimated (A⊤ denotes the matrix transpose of A). The model is linear in the sense that a function of the mean is linear in the parameters, i.e., g(x̂) = Lz. A Linear Model (LM) is a special case of GLM that assumes normality, i.e. x_i ∼ N(x_i; x̂_i, σ²), as well as linearity, which implies the identity link function g(x̂_i) = x̂_i = l_i⊤ z, assuming the l_i are known. Log-linear (Poisson) regression assumes a log link, g(x̂_i) = log x̂_i = l_i⊤ z; here log x̂_i and z have a linear relationship [9].

The goal in classical GLM is to estimate the parameter vector z. This is typically achieved via a Gauss-Newton method (Fisher Scoring). The necessary objects for this computation are the log likelihood, its derivative and the Fisher Information (the expected value of the negative Hessian). These are easily derived as

L = Σ_i [x_i γ_i − b(γ_i)]/τ² − Σ_i c(x_i, τ)    (3)

∂L/∂z = (1/τ²) Σ_i (x_i − x̂_i) w_i g_x̂(x̂_i) l_i = (1/τ²) L⊤ D G (x − x̂),    −⟨∂²L/∂z²⟩ = (1/τ²) L⊤ D L    (4)

where w is a vector with elements w_i, and D and G are the diagonal matrices D = diag(w) and G = diag(g_x̂(x̂_i)), with

w_i = ( v(x̂_i) g_x̂²(x̂_i) )⁻¹,    g_x̂(x̂_i) = ∂g(x̂_i)/∂x̂_i    (5)

with v(x̂_i) being the variance function, related to the observation variance by var(x_i) = τ² v(x̂_i). Via Fisher Scoring, the general update equation in matrix form is

z ← z + (L⊤ D L)⁻¹ L⊤ D G (x − x̂)    (6)

Although this formulation is somewhat abstract, it covers a very broad range of model classes used in practice. For example, an important special case appears when the variance function has the form v(x̂) = x̂^p. The settings p = {0, 1, 2, 3} correspond to the Gaussian, Poisson, Exponential/Gamma and Inverse Gaussian distributions [10, pp.30]; the exponential family members obtained for general p are known as Tweedie's family [11].
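As a concrete illustration of the Fisher Scoring update (6), the following is a minimal sketch for a Poisson GLM with log link: there v(x̂) = x̂ and g_x̂ = 1/x̂, so w_i = x̂_i, DG = I, and the update simplifies to z ← z + (L⊤ diag(x̂) L)⁻¹ L⊤ (x − x̂). All sizes and values are illustrative assumptions:

```python
import numpy as np

# Fisher scoring (6) for a Poisson GLM with log link: D = diag(xhat), G = diag(1/xhat),
# so D @ G = I and the weighted residual L' D G (x - xhat) reduces to L' (x - xhat).
rng = np.random.default_rng(1)
n, d = 200, 3
L = rng.random((n, d))                      # known model matrix
z_true = np.array([0.5, -0.3, 1.0])
x = rng.poisson(np.exp(L @ z_true)).astype(float)

z = np.zeros(d)                             # initial guess
for _ in range(25):
    xhat = np.exp(L @ z)                    # mean under the log link
    z = z + np.linalg.solve(L.T @ (xhat[:, None] * L), L.T @ (x - xhat))
```

Since the Poisson log-likelihood is concave in z under the log link, this iteration is exactly Newton's method and converges in a handful of steps.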
Those for p = {0, 1, 2}, in turn, correspond to the EU, KL and IS cost functions often used for NMF decompositions [12, 7].

2.1 Tensor Factorisations (TF) as GLM's

The key observation for expressing a TF model as a GLM is to identify the multilinear structure and to use an alternating optimisation approach. To hide the notational complexity, we will give an example with a simple matrix factorisation model; the extension to tensors requires heavier notation, but is otherwise conceptually straightforward. Consider an MF model

g(X̂) = Z1 Z2,    in scalar form    g(X̂)^{i,j} = Σ_r Z1^{i,r} Z2^{j,r}    (7)

where Z1, Z2 and g(X̂) are matrices of compatible sizes. Indeed, by applying the vec operator (vectorisation, stacking the columns of a matrix to obtain a vector) to both sides of (7), we obtain two equivalent representations of the same system

vec(g(X̂)) = (I_{|j|} ⊗ Z1) vec(Z2) = (∂(Z1Z2)/∂Z2) vec(Z2) = (∂g(X̂)/∂Z2) vec(Z2) ≡ ∇2 ~Z2    (8)

where I_{|j|} denotes the |j| × |j| identity matrix, ⊗ denotes the Kronecker product [13], and vec Z ≡ ~Z. Clearly, this is a GLM where ∇2 plays the role of a model matrix and ~Z2 is the parameter vector. By alternating between Z1 and Z2, we can maximise the log-likelihood iteratively; indeed, this alternating maximisation is standard for solving matrix factorisation problems. In the sequel, we will show that a much broader range of algorithms can be readily derived in the GLM framework.

2.2 Generalised Tensor Factorisation

We define a tensor Λ as a multiway array with an index set V = {i1, i2, . . . , i_{|α|}}, where each index i_n for n = 1 . . . |α| runs as i_n = 1 . . . |i_n|. An element of the tensor Λ is a scalar that we denote by Λ(i1, i2, . . . , i_{|α|}), or Λ^{i1,i2,...,i_{|α|}}, or, as a shorthand notation, by Λ(v) with v being a particular configuration.
|v| denotes the number of all distinct configurations for V; e.g., if V = {i1, i2} then |v| = |i1||i2|. We call the form Λ(v) element-wise; the notation [·] yields a tensor by enumerating all the indices, i.e., Λ = [Λ^{i1,i2,...,i_{|α|}}] or Λ = [Λ(v)]. For any two tensors X and Y of compatible order, X ◦ Y is an element-wise multiplication and, unless explicitly stressed otherwise, X/Y is an element-wise division. 1 is an object of all ones whose order depends on the context where it is used.

A generalised tensor factorisation problem is specified by an observed tensor X (with possibly missing entries, to be treated later), a collection of latent tensors to be estimated, Z_{1:|α|} = {Zα} for α = 1 . . . |α|, and an exponential family of the form (2). The index set of X is denoted by V0 and the index set of each Zα by Vα. The set of all model indices is V = ∪_{α=1}^{|α|} Vα. We use vα (or v0) to denote a particular configuration of the indices for Zα (or X), while v̄α denotes a configuration of the complement V̄α = V/Vα. The goal is to find the latent Zα that maximise the likelihood p(X|Z_{1:α}), where ⟨X⟩ = X̂ is given via

g(X̂(v0)) = Σ_{v̄0} Π_α Zα(vα)    (9)

To clarify our notation with an example, we express the CP (Parafac) model, defined as X̂(i, j, k) = Σ_r Z1(i, r) Z2(j, r) Z3(k, r). In our notation, we take the identity link g(X̂) = X̂ and the index sets V = {i, j, k, r}, V0 = {i, j, k}, V̄0 = {r}, V1 = {i, r}, V2 = {j, r} and V3 = {k, r}. Our notation deliberately follows that of graphical models; the reader might find it useful to associate indices with discrete random variables and factors with probability tables [14].
Obviously, while a TF model does not represent a discrete probability measure, the algebraic structure is nevertheless analogous.

To extend the discussion in Section 2.1 to the tensor case, we need the equivalent of the model matrix when updating Zα. This is obtained by summing over the product of all remaining factors

g(X̂(v0)) = Σ_{v̄0 ∩ vα} Zα(vα) Σ_{v̄0 ∩ v̄α} Π_{α′≠α} Z_{α′}(v_{α′}) = Σ_{v̄0 ∩ vα} Zα(vα) Lα(oα)

Lα(oα) = Σ_{v̄0 ∩ v̄α} Π_{α′≠α} Z_{α′}(v_{α′})    with    oα ≡ (v0 ∪ vα) ∩ (v̄0 ∪ v̄α)

One quantity related to Lα is the derivative of the tensor g(X̂) with respect to the latent tensor Zα, denoted ∇α and defined as (following the convention of [13, pp 196])

∇α = ∂g(X̂)/∂Zα = I_{|v0 ∩ vα|} ⊗ Lα    with    Lα ∈ R^{|v0 ∩ v̄α| × |v̄0 ∩ vα|}    (10)

The importance of Lα is that all the update rules can be formulated by a product and subsequent contraction of Lα with another tensor Q having exactly the same index set as the observed tensor X. As a notational abstraction, it is useful to formulate the following function.

Definition 1. The tensor valued function ∆^ε_α(Q) : R^{|v0|} → R^{|vα|} is defined as

∆^ε_α(Q) = [ Σ_{v0 ∩ v̄α} Q(v0) Lα(oα)^ε ]    (11)

with ∆α(Q) being an object of the same order as Zα and oα ≡ (v0 ∪ vα) ∩ (v̄0 ∪ v̄α). Here, on the right-hand side, the nonnegative integer ε denotes the element-wise power, not to be confused with an index. On the left, it should be interpreted as a parameter of the ∆ function.
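For a CP model, Definition 1 takes a particularly simple form: L1(j, k, r) = Z2(j, r) Z3(k, r) and the contraction runs over V0 ∩ V̄1 = {j, k}. This can be sketched with a single einsum; the sizes below are illustrative assumptions:

```python
import numpy as np

# Sketch of Definition 1 for the CP model Xhat(i,j,k) = sum_r Z1 Z2 Z3:
# Delta^eps_1(Q) contracts Q (same index set {i,j,k} as X) with the
# element-wise eps-power of L_1(j,k,r) = Z2(j,r) * Z3(k,r) over {j, k}.
rng = np.random.default_rng(2)
I, J, K, R = 5, 6, 7, 3
Z2, Z3 = rng.random((J, R)), rng.random((K, R))
Q = rng.random((I, J, K))

def delta_1(Q, eps=1):
    # (Z2**eps)[j,r] * (Z3**eps)[k,r] is the element-wise power of L_1
    return np.einsum('ijk,jr,kr->ir', Q, Z2**eps, Z3**eps)
```

The result has the index set {i, r} of Z1, as Definition 1 requires.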
Arguably, the ∆ function abstracts away all the tedious reshape and unfolding operations [5]. This abstraction also has an important practical facet: the computation of ∆ is algebraically (almost) equivalent to the computation of marginal quantities on a factor graph, for which efficient message passing algorithms exist [14].

Example 1. TUCKER3 is defined as X̂^{i,j,k} = Σ_{p,q,r} A^{i,p} B^{j,q} C^{k,r} G^{p,q,r} with V = {i, j, k, p, q, r}, V0 = {i, j, k}, VA = {i, p}, VB = {j, q}, VC = {k, r}, VG = {p, q, r}. Then, for the first factor A, the objects LA and ∆^ε_A(·) are computed as follows

L_A = [ Σ_{q,r} B^{j,q} C^{k,r} G^{p,q,r} ] = [ ((C ⊗ B) G⊤)^p_{k,j} ]    (12)

∆^ε_A(Q) = [ Σ_{j,k} Q^{k,j} ((L_A)^p_{k,j})^ε ] = [ (Q L^ε_A)^p_i ]    (13)

The index sets marginalised out for LA and ∆A are V̄0 ∩ V̄A = {p, q, r} ∩ {j, q, k, r} = {q, r} and V0 ∩ V̄A = {i, j, k} ∩ {j, q, k, r} = {j, k}. We also verify the order of the gradient ∇A in (10) as I^i_i ⊗ (L_A)^p_{k,j} = ∇^{i,p}_{i,k,j}, which conforms to the matrix derivation convention of [13, pp.196].

2.3 Iterative Solution for GTF

As we have now established a one-to-one relationship between GLM and GTF objects, namely the observation x ≡ vec X, the mean (and model estimate) x̂ ≡ vec X̂, the model matrix L ≡ Lα and the parameter vector z ≡ vec Zα, we can write directly from (6)

~Zα ← ~Zα + (∇α⊤ D ∇α)⁻¹ ∇α⊤ D G (~X − ~X̂),    with    ∇α = ∂g(X̂)/∂Zα    (14)

There are at least two ways in which this update can be further simplified.
We may assume an identity link function, or alternatively we may choose matching link and loss functions such that they cancel each other smoothly [2]. In the sequel we consider the identity link g(X̂) = X̂, which results in g_X̂(X̂) = 1. This implies that G is the identity, i.e. G = I. We define a tensor W that plays the same role as w in (5); it becomes simply the precision (inverse variance function), i.e. W = 1/v(X̂), where for the Gaussian, Poisson, Exponential and Inverse Gaussian distributions we have simply W = X̂^{−p} with p = {0, 1, 2, 3} [10, pp 30]. Then the update (14) reduces to

~Zα ← ~Zα + (∇α⊤ D ∇α)⁻¹ ∇α⊤ D (~X − ~X̂)    (15)

After this simplification we obtain two update rules for GTF, one for non-negative and one for real data.

The update (15) can be used to derive the multiplicative update rules (MUR) popularised by [15] for nonnegative matrix factorisation (NMF). MUR equations ensure non-negative parameter updates provided the initial values are non-negative.

Theorem 1. The update equation (15) for nonnegative GTF reduces to the multiplicative form

Zα ← Zα ◦ ∆α(W ◦ X) / ∆α(W ◦ X̂)    s.t.
Zα(vα) > 0    (16)

(Proof sketch) Due to space limitations we omit the full details; the idea is that the inverse of H = ∇⊤D∇ is identified as a step size, which, using the results of the Perron-Frobenius theorem [16, pp 125], we further bound as

η = ~Zα / (∇⊤D ~X̂) < 2 ~Zα / (∇⊤D ~X̂) ≤ 2 / λmax(∇⊤D∇),    since    λmax(H) ≤ max_{vα} (H ~Zα)(vα) / Zα(vα)    (17)

For the special case of the Tweedie family, where the precision is a function of the mean, W = X̂^{−p} for p = {0, 1, 2, 3}, the update (15) reduces to

Zα ← Zα ◦ ∆α(X̂^{−p} ◦ X) / ∆α(X̂^{1−p})    (18)

For example, to update Z2 for the NMF model X̂ = Z1 Z2, ∆2 is ∆2(Q) = Z1⊤ Q. Then for the Gaussian (p = 0) this reduces to NMF-EU as Z2 ← Z2 ◦ (Z1⊤ X)/(Z1⊤ X̂). For the Poisson (p = 1) it reduces to NMF-KL as Z2 ← Z2 ◦ (Z1⊤ (X/X̂))/(Z1⊤ 1) [15].

By dropping the non-negativity requirement we obtain the following update equation:

Theorem 2. The update equation for GTF with real data can be expressed as

Zα ← Zα + (2/λ_{α/0}) ∆α(W ◦ (X − X̂)) / ∆²α(W)    with    λ_{α/0} = |vα ∩ v̄0|    (19)

(Proof sketch) Again skipping the full details: as part of the proof we set Zα = 1 in (17) specifically, and replacing the matrix multiplication ∇⊤D∇1 by ∇^{⊤2} D 1 λ_{α/0} completes the proof. Here the multiplier λ_{α/0} is a cardinality arising from the fact that only λ_{α/0} elements are non-zero in a row of ∇⊤D∇.
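Returning to Theorem 1, the p = 1 case of (18) for the two-factor model X̂ = Z1 Z2 gives the familiar NMF-KL updates, whose monotone decrease of the KL cost can be checked numerically; the shapes and rank below are illustrative assumptions:

```python
import numpy as np

# Sketch of the multiplicative update (18) with p = 1 (KL cost) for the NMF
# model Xhat = Z1 @ Z2, where Delta_2(Q) = Z1.T @ Q and Delta_1(Q) = Q @ Z2.T.
rng = np.random.default_rng(3)
m, n, r = 20, 30, 4
X = rng.random((m, n)) + 0.1                 # strictly positive data
Z1 = rng.random((m, r)) + 0.1
Z2 = rng.random((r, n)) + 0.1

def kl(X, Xhat):
    return float(np.sum(X * np.log(X / Xhat) - X + Xhat))

costs = []
for _ in range(100):
    Xhat = Z1 @ Z2
    costs.append(kl(X, Xhat))
    Z2 *= (Z1.T @ (X / Xhat)) / (Z1.T @ np.ones_like(X))   # Delta_2 ratio
    Xhat = Z1 @ Z2
    Z1 *= ((X / Xhat) @ Z2.T) / (np.ones_like(X) @ Z2.T)   # Delta_1 ratio
```

The positive initialisation keeps every iterate positive, as required by the constraint Zα(vα) > 0.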
As an example for λ_{α/0}: if Vα ∩ V̄0 = {p, q}, then λ_{α/0} = |p||q|, which is the number of all distinct configurations for the index set {p, q}.

Missing data can be handled easily by dropping the missing data terms from the likelihood [17]. The net effect of this is the addition of an indicator variable m_i to the gradient

∂L/∂z = τ⁻² Σ_i (x_i − x̂_i) m_i w_i g_x̂(x̂_i) l_i

with m_i = 1 if x_i is observed and m_i = 0 otherwise. Hence we simply define a mask tensor M having the same order as the observation X, where the element M(v0) is 1 if X(v0) is observed and zero otherwise. In the update equations, we merely replace W with W ◦ M.

3 Coupled Tensor Factorisation

Here we address the problem where multiple observed tensors Xν for ν = 1 . . . |ν| are factorised simultaneously. Each observed tensor Xν now has a corresponding index set V_{0,ν}, and a particular configuration will be denoted by v_{0,ν} ≡ uν. Next, we define a |ν| × |α| coupling matrix R, where

R^{ν,α} = 1 if Xν and Zα are connected, and 0 otherwise;    X̂ν(uν) = Σ_{ūν} Π_α Zα(vα)^{R^{ν,α}}    (20)

For the coupled factorisation, we get the following expression as the derivative of the log likelihood

∂L/∂Zα(vα) = Σ_ν R^{ν,α} Σ_{uν ∩ v̄α} ( Xν(uν) − X̂ν(uν) ) Wν(uν) ∂X̂ν(uν)/∂Zα(vα)    (21)

where Wν ≡ W(X̂ν(uν)) are the precisions. Then, proceeding as in Section 2.3 (i.e.
getting the Hessian and finding the Fisher Information), we arrive at the update rule in vector form

~Zα ← ~Zα + ( Σ_ν R^{ν,α} ∇_{α,ν}⊤ Dν ∇_{α,ν} )⁻¹ ( Σ_ν R^{ν,α} ∇_{α,ν}⊤ Dν (~Xν − ~X̂ν) )    (22)

Figure 1: (Left) Coupled factorisation structure, where an arrow indicates the influence of a latent tensor Zα on an observed tensor Xν. (Right) The CP/MF/MF coupled factorisation problem in (1).

where ∇_{α,ν} = ∂g(X̂ν)/∂Zα. The update equations for the coupled case are quite intuitive; we calculate the ∆_{α,ν} functions defined as

∆^ε_{α,ν}(Q) = [ Σ_{uν ∩ v̄α} Q(uν) ( Π_{α′≠α} Z_{α′}(v_{α′})^{R^{ν,α′}} )^ε ]    (23)

for each submodel and add the results:

Lemma 1. Update for non-negative CTF:

Zα ← Zα ◦ ( Σ_ν R^{ν,α} ∆_{α,ν}(Wν ◦ Xν) ) / ( Σ_ν R^{ν,α} ∆_{α,ν}(Wν ◦ X̂ν) )    (24)

In the special case of a Tweedie family, i.e. for distributions whose precision is Wν = X̂ν^{−p}, the update is Zα ← Zα ◦ ( Σ_ν R^{ν,α} ∆_{α,ν}(X̂ν^{−p} ◦ Xν) ) / ( Σ_ν R^{ν,α} ∆_{α,ν}(X̂ν^{1−p}) ).

Lemma 2. General update for CTF:

Zα ← Zα + (2/λ_{α/0}) ( Σ_ν R^{ν,α} ∆_{α,ν}(Wν ◦ (Xν − X̂ν)) ) / ( Σ_ν R^{ν,α} ∆²_{α,ν}(Wν) )    (25)

For the special case of the Tweedie family we plug in Wν = X̂ν^{−p} and get the related formula.

4 Experiments

Here we want to solve the CTF problem introduced in (1), which is a coupled CP/MF/MF problem

X̂1^{i,j,k} = Σ_r A^{i,r} B^{j,r} C^{k,r},    X̂2^{j,p} = Σ_r B^{j,r} D^{p,r},    X̂3^{j,q} = Σ_r B^{j,r} E^{q,r}    (26)

where we employ the symbols A : E for the latent tensors instead of Zα. This factorisation problem has the following R matrix, with |α| = 5 and |ν| = 3

R = [ 1 1 1 0 0 ; 0 1 0 1 0 ; 0 1 0 0 1 ]    with    X̂1 = Σ A¹B¹C¹D⁰E⁰,  X̂2 = Σ A⁰B¹C⁰D¹E⁰,  X̂3 = Σ A⁰B¹C⁰D⁰E¹    (27)

We want to use the general update equation (25). This requires the derivation of ∆^ε_{α,ν}(·) for ν = 1 (CP) and ν = 2 (MF), but not for ν = 3, since ∆_{α,3}(·) has the same shape as ∆_{α,2}(·). Here we show the computation for B, i.e. for Z2, which is the common factor

∆^ε_{B,1}(Q) = [ Σ_{i,k} Q^{i,j,k} (A^{i,r} C^{k,r})^ε ] = Q_{(1)} (C^ε ⊙ A^ε)    (28)

∆^ε_{B,2}(Q) = [ Σ_p Q^{j,p} (D^{p,r})^ε ] = Q D^ε    (29)

with Q_{(n)} being the mode-n unfolding operation that turns a tensor into a matrix [5].
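The summation over submodels in Lemma 1 can be sketched for the KL case (p = 1), updating only the shared factor B with the other factors held fixed; the sizes, rank and initialisation below are illustrative assumptions:

```python
import numpy as np

# Sketch of the coupled multiplicative update (Lemma 1, p = 1) for the shared
# factor B of the CP/MF/MF model (26); the other factors are held fixed.
rng = np.random.default_rng(4)
I = J = K = P = Q = 10
r = 4
A, C, D, E = (rng.random((s, r)) + 0.1 for s in (I, K, P, Q))
B_true = rng.random((J, r)) + 0.1
X1 = np.einsum('ir,jr,kr->ijk', A, B_true, C)   # observed tensors, noise-free
X2, X3 = B_true @ D.T, B_true @ E.T

B = rng.random((J, r)) + 0.1                    # random restart for B only
B0 = B.copy()
for _ in range(200):
    Xh1 = np.einsum('ir,jr,kr->ijk', A, B, C)
    Xh2, Xh3 = B @ D.T, B @ E.T
    # Delta_{B,nu}(Q) contracts Q with the remaining factors of submodel nu;
    # Lemma 1 sums the three contributions in numerator and denominator.
    num = (np.einsum('ijk,ir,kr->jr', X1 / Xh1, A, C)
           + (X2 / Xh2) @ D + (X3 / Xh3) @ E)
    den = (np.einsum('ijk,ir,kr->jr', np.ones_like(X1), A, C)
           + np.ones_like(X2) @ D + np.ones_like(X3) @ E)
    B *= num / den
```

Each einsum here plays the role of one ∆_{B,ν}, and the shared factor B receives additive contributions from all three observations, exactly as the coupling matrix R prescribes.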
In addition, for ν = 1 the required scalar value λ_{B/0} is |r| here, since VB ∩ V̄0 = {j, r} ∩ {r} = {r}; the value of λ_{B/0} is the same for ν = 2, 3. The simulated data size for the observables is |i| = |j| = |k| = |p| = |q| = 30, while the latent dimension is |r| = 5. The number of iterations is 1000 with the Euclidean cost; the experiment produced similar results for the KL cost, as shown in Figure 2.

Figure 2: The figure compares the original, the initial (start-up) and the final (estimated) factors for Zα = A, B, C, D, E. Only the first column, i.e. Zα(1:10, 1), is plotted. Note that the CP factorisation is unique up to permutation and scaling [5], while the MF factorisation is not unique; when coupled with CP, however, it recovers the original data as shown in the figure. For visualisation, to find the correct permutation, for each Zα the matching permutation between the original and the estimate is found by solving an orthogonal Procrustes problem [18, pp 601].

4.1 Audio Experiments

In this section, we illustrate a real-data application of our approach, where we reconstruct missing parts of an audio spectrogram X(f, t) that represents the STFT coefficient magnitude at frequency bin f and time frame t of a piano piece; see the top left panel of Fig. 3. This is a difficult matrix completion problem: as entire time frames (columns of X) are missing, low rank reconstruction techniques are likely to be ineffective. Yet such missing data patterns arise often in practice, e.g., when packets are dropped during digital communication. We will develop here a novel approach, expressed as a coupled TF model.
In particular, the reconstruction will be aided by an approximate musical score, not necessarily belonging to the played piece, and by spectra of isolated piano sounds.

The pioneering work of [19] demonstrated that, when an audio spectrogram of music is decomposed using NMF as X1(f, t) ≈ X̂(f, t) = Σ_i D(f, i) E(i, t), the computed factors D and E tend to be semantically meaningful and correlate well with the intuitive notions of spectral templates (harmonic profiles of musical notes) and a musical score (reminiscent of a piano-roll representation such as a MIDI file). However, as time frames are modelled conditionally independently, it is impossible to reconstruct audio with this model when entire time frames are missing.

In order to restore the missing parts of the audio, we form a model that incorporates musical information about chord structures and how they evolve in time. To achieve this, we hierarchically decompose the excitation matrix E as a convolution of some basis matrices and their weights: E(i, t) = Σ_{k,τ} B(i, τ, k) C(k, t − τ). Here the basis tensor B encapsulates both vertical and temporal information about the notes that are likely to be used in a musical piece; the musical piece to be reconstructed will share B, possibly played at different times or tempi, as modelled by G. After replacing E with the decomposed version, we get the following model

X̂1(f, t) = Σ_{i,τ,k,d} D(f, i) B(i, τ, k) C(k, d) Z(d, t, τ)    (test file)    (30)

X̂2(i, n) = Σ_{τ,k,m} B(i, τ, k) G(k, m) Y(m, n, τ)    (MIDI file)    (31)

X̂3(f, p) = Σ_i D(f, i) F(i, p) T(i, p)    (merged training files)    (32)

Here we have introduced new dummy indices d and m, and new (fixed) factors Z(d, t, τ) = δ(d − t + τ) and Y(m, n, τ) = δ(m − n + τ), to express this model in our framework.
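The role of the fixed factor Z(d, t, τ) = δ(d − t + τ) can be checked numerically: contracting with Z turns the convolution E(i, t) = Σ_{k,τ} B(i, τ, k) C(k, t − τ) into the multilinear form used in (30). All sizes below are illustrative assumptions:

```python
import numpy as np

# Sketch: the fixed tensor Z(d, t, tau) = delta(d - t + tau) encodes the
# time shift, so the convolutional excitation E(i, t) becomes multilinear.
rng = np.random.default_rng(5)
Ni, Nk, Nt, Ntau = 3, 2, 8, 4
B = rng.random((Ni, Ntau, Nk))       # basis B(i, tau, k)
C = rng.random((Nk, Nt))             # weights C(k, d), dummy index d

Z = np.zeros((Nt, Nt, Ntau))         # Z(d, t, tau) = 1 iff d = t - tau
for t in range(Nt):
    for tau in range(Ntau):
        if t - tau >= 0:
            Z[t - tau, t, tau] = 1.0

# Multilinear form, as in (30): E(i, t) = sum_{k,tau,d} B(i,tau,k) C(k,d) Z(d,t,tau)
E_tensor = np.einsum('iuk,kd,dtu->it', B, C, Z)

# Direct convolution E(i, t) = sum_{k,tau} B(i,tau,k) C(k, t - tau)
E_direct = np.zeros((Ni, Nt))
for t in range(Nt):
    for tau in range(min(Ntau, t + 1)):
        E_direct[:, t] += B[:, tau, :] @ C[:, t - tau]
```

The two computations agree exactly, which is the point of introducing the dummy index d: the model stays within the multilinear GCTF form while still expressing a convolution.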
In eq (32), while forming X3 we concatenate isolated recordings corresponding to different notes. In addition, T is a 0-1 matrix, where T(i, p) = 1 (0) if note i is played (not played) during time frame p, and F models the time-varying amplitudes of the training data. The R matrix for this model is defined as

R = [ 1 1 1 1 0 0 0 0 ; 0 1 0 0 1 1 0 0 ; 1 0 0 0 0 0 1 1 ]    with    X̂1 = Σ D¹B¹C¹Z¹G⁰Y⁰F⁰T⁰,  X̂2 = Σ D⁰B¹C⁰Z⁰G¹Y¹F⁰T⁰,  X̂3 = Σ D¹B⁰C⁰Z⁰G⁰Y⁰F¹T¹    (33)

Figure 3 illustrates the performance of the model, using the KL cost (W = X̂⁻¹), on a 30-second piano recording where 70% of the data is missing; we get about 5 dB SNR improvement, gracefully degrading from 10% to 80% missing data. The results are encouraging, as quite long portions of audio are missing; see the bottom right panel of Fig. 3.

Figure 3: Top row, left to right: Observed matrices X1: spectrum of the piano performance, darker colours implying higher magnitude (missing data (70%) shown in white); X2: a piano roll obtained from a musical score of the piece; X3: spectra of 88 isolated notes from a piano.
Bottom row: Reconstructed X1, the ground truth, and the SNR results with increasing missing data. Here, the initial SNR is computed by substituting 0 for the missing values.

5 Discussion

This paper establishes a link between GLMs and TFs and provides a general solution for the computation of arbitrary coupled TFs, using message passing primitives. The current treatment focused on ML estimation; as immediate future work, the probabilistic interpretation is to be extended to full Bayesian inference with appropriate priors and inference methods. A powerful aspect, which we have not been able to summarise here, is assigning different cost functions, i.e. distributions, to different observation tensors in a coupled factorisation model. This requires only minor modifications to the update equations. We believe that, as a whole, the GCTF framework covers a broad range of models that can be useful in many different application areas beyond audio processing, such as network analysis, bioinformatics or collaborative filtering.

Acknowledgements: This work is funded by the TÜBİTAK grant number 110E292, Bayesian matrix and tensor factorisations (BAYTEN), and the Boğaziçi University research fund BAP5723. Umut Şimşekli is also supported by a Ph.D. scholarship from TÜBİTAK. We would also like to thank Evrim Acar for the fruitful discussions.

References

[1] A. T. Cemgil, Bayesian inference for nonnegative matrix factorisation models, Computational Intelligence and Neuroscience 2009 (2009) 1-17.

[2] A. P. Singh, G. J. Gordon, A unified view of matrix factorization models, in: ECML PKDD'08, Part II, no. 5212, Springer, 2008, pp. 358-373.

[3] E. Acar, T. G. Kolda, D. M. Dunlavy, All-at-once optimization for coupled matrix and tensor factorizations, CoRR abs/1105.3422. arXiv:1105.3422.

[4] Q. Xu, E. W. Xiang, Q.
Yang, Protein-protein interaction prediction via collective matrix factorization, in: Proc. of the IEEE International Conference on BIBM, 2010, pp. 62-67.

[5] T. G. Kolda, B. W. Bader, Tensor decompositions and applications, SIAM Review 51 (3) (2009) 455-500.

[6] Y. K. Yılmaz, A. T. Cemgil, Probabilistic latent tensor factorization, in: Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA'10, Springer-Verlag, 2010, pp. 346-353.

[7] C. Fevotte, A. T. Cemgil, Nonnegative matrix factorisations as probabilistic inference in composite models, in: Proc. 17th EUSIPCO, 2009.

[8] Y. K. Yılmaz, A. T. Cemgil, Algorithms for probabilistic latent tensor factorization, Signal Processing (2011), doi:10.1016/j.sigpro.2011.09.033.

[9] C. E. McCulloch, S. R. Searle, Generalized, Linear, and Mixed Models, Wiley, 2001.

[10] P. McCullagh, J. A. Nelder, Generalized Linear Models, 2nd Edition, Chapman and Hall, 1989.

[11] R. Kaas, Compound Poisson distributions and GLM's, Tweedie's distribution, Tech. rep., Lecture, Royal Flemish Academy of Belgium for Science and the Arts, 2005.

[12] A. Cichocki, R. Zdunek, A. H. Phan, S. Amari, Nonnegative Matrix and Tensor Factorization, Wiley, 2009.

[13] J. R. Magnus, H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd Edition, Wiley, 2007.

[14] M. Wainwright, M. I. Jordan, Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning 1 (2008) 1-305.

[15] D. D. Lee, H. S. Seung, Algorithms for non-negative matrix factorization, in: NIPS, Vol. 13, 2001, pp. 556-562.

[16] M. Marcus, H. Minc, A Survey of Matrix Theory and Matrix Inequalities, Dover, 1992.

[17] R. Salakhutdinov, A. Mnih, Probabilistic matrix factorization, in: Advances in Neural Information Processing Systems, Vol.
20, 2008.

[18] G. H. Golub, C. F. Van Loan, Matrix Computations, 3rd Edition, Johns Hopkins UP, 1996.

[19] P. Smaragdis, J. C. Brown, Non-negative matrix factorization for polyphonic music transcription, in: WASPAA, 2003, pp. 177-180.
", "award": [], "sourceid": 1189, "authors": [{"given_name": "Kenan", "family_name": "Y\u0131lmaz", "institution": null}, {"given_name": "Ali", "family_name": "Cemgil", "institution": null}, {"given_name": "Umut", "family_name": "Simsekli", "institution": null}]}