{"title": "Learning Disentangled Representations for Recommendation", "book": "Advances in Neural Information Processing Systems", "page_first": 5711, "page_last": 5722, "abstract": "User behavior data in recommender systems are driven by the complex interactions of many latent factors behind the users\u2019 decision making processes. The factors are highly entangled, and may range from high-level ones that govern user intentions, to low-level ones that characterize a user\u2019s preference when executing an intention. Learning representations that uncover and disentangle these latent factors can bring enhanced robustness, interpretability, and controllability. However, learning such disentangled representations from user behavior is challenging, and remains largely neglected by the existing literature. In this paper, we present the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations from user behavior. Our approach achieves macro disentanglement by inferring the high-level concepts associated with user intentions (e.g., to buy a shirt or a cellphone), while capturing the preference of a user regarding the different concepts separately. A micro-disentanglement regularizer, stemming from an information-theoretic interpretation of VAEs, then forces each dimension of the representations to independently reflect an isolated low-level factor (e.g., the size or the color of a shirt). Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. 
We further demonstrate that the learned representations are interpretable and controllable, which can potentially lead to a new paradigm for recommendation where users are given fine-grained control over targeted aspects of the recommendation lists.", "full_text": "Learning Disentangled Representations for Recommendation

Jianxin Ma1,2*, Chang Zhou1*, Peng Cui2, Hongxia Yang1, Wenwu Zhu2
1Alibaba Group, 2Tsinghua University
majx13fromthu@gmail.com, ericzhou.zc@alibaba-inc.com, cuip@tsinghua.edu.cn, yang.yhx@alibaba-inc.com, wwzhu@tsinghua.edu.cn

Abstract

User behavior data in recommender systems are driven by the complex interactions of many latent factors behind the users' decision-making processes. The factors are highly entangled, and may range from high-level ones that govern user intentions, to low-level ones that characterize a user's preference when executing an intention. Learning representations that uncover and disentangle these latent factors can bring enhanced robustness, interpretability, and controllability. However, learning such disentangled representations from user behavior is challenging, and remains largely neglected by the existing literature. In this paper, we present the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations from user behavior. Our approach achieves macro disentanglement by inferring the high-level concepts associated with user intentions (e.g., to buy a shirt or a cellphone), while capturing the preference of a user regarding the different concepts separately. A micro-disentanglement regularizer, stemming from an information-theoretic interpretation of VAEs, then forces each dimension of the representations to independently reflect an isolated low-level factor (e.g., the size or the color of a shirt). Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. We further demonstrate that the learned representations are interpretable and controllable, which can potentially lead to a new paradigm for recommendation where users are given fine-grained control over targeted aspects of the recommendation lists.

1 Introduction

Learning representations that reflect users' preference, based chiefly on user behavior, has been a central theme of research on recommender systems. Despite their notable success, the existing user behavior-based representation learning methods, such as the recent deep approaches [49, 32, 31, 52, 11, 18], generally neglect the complex interaction among the latent factors behind the users' decision-making processes. In particular, the latent factors can be highly entangled, and range from macro ones that govern the intention of a user during a session, to micro ones that describe at a granular level a user's preference when implementing a specific intention. The existing methods fail to disentangle the latent factors, and the learned representations are consequently prone to mistakenly preserve the confounding of the factors, leading to non-robustness and low interpretability.

Disentangled representation learning, which aims to learn factorized representations that uncover and disentangle the latent explanatory factors hidden in the observed data [3], has recently gained much attention. Not only can disentangled representations be more robust, i.e., less sensitive to the misleading correlations present in the limited training data, but the enhanced interpretability also finds

*Equal contribution. Work done at Alibaba.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Our framework.
Macro disentanglement is achieved by learning a set of prototypes, based on which the user intention associated with each item is inferred, and then capturing a user's preference regarding the different intentions separately. Micro disentanglement is achieved by magnifying the KL divergence, from which a term that penalizes total correlation can be separated, with a factor of $\beta$.

direct application in recommendation-related tasks, such as transparent advertising [33], customer-relationship management, and explainable recommendation [51, 17]. Moreover, the controllability exhibited by many disentangled representations [19, 14, 10, 8, 9, 25] can potentially bring a new paradigm for recommendation, by giving users explicit control over the recommendation results and providing a more interactive experience. However, the existing efforts on disentangled representation learning are mainly from the field of computer vision [28, 15, 20, 30, 53, 14, 10, 39, 19].

Learning disentangled representations based on user behavior data, a kind of discrete relational data that is fundamentally different from the well-researched image data, is challenging and largely unexplored. Specifically, it poses two challenges. First, the co-existence of macro and micro factors requires us to separate the two levels when performing disentanglement, in a way that preserves the hierarchical relationships between an intention and the preference about the intention. Second, the observed user behavior data, e.g., user-item interactions, are discrete and sparse in nature, while the learned representations are continuous. This implies that the majority of the points in the high-dimensional representation space will not be associated with any behavior, which is especially problematic when one attempts to investigate the interpretability of an isolated dimension by varying the value of the dimension while keeping the other dimensions fixed.

In this paper, we propose the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations based on user behavior. Our approach explicitly models the separation of macro and micro factors, and performs disentanglement at each level. Macro disentanglement is achieved by identifying the high-level concepts associated with user intentions, and separately learning the preference of a user regarding the different concepts. A regularizer for micro disentanglement, derived by interpreting VAEs [27, 44] from an information-theoretic perspective, is then strengthened so as to force each individual dimension to reflect an independent micro factor. A beam-search strategy, which handles the conflict between sparse discrete observations and dense continuous representations by finding a smooth trajectory, is then proposed for investigating the interpretability of each isolated dimension. Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. The learned disentangled representations are also demonstrated to be interpretable and controllable.

2 Method

In this section, we present our approach for learning disentangled representations from user behavior.

2.1 Notations and Problem Formulation

A user behavior dataset D consists of the interactions between N users and M items.
The interaction between the u-th user and the i-th item is denoted by $x_{u,i} \in \{0, 1\}$, where $x_{u,i} = 1$ indicates that user u explicitly adopts item i, whereas $x_{u,i} = 0$ means there is no recorded interaction between the two. For convenience, we use $x_u = \{x_{u,i} : x_{u,i} = 1\}$ to represent the items adopted by user u. The goal is to learn user representations $\{z_u\}_{u=1}^N$ that achieve both macro and micro disentanglement. We use $\theta$ to denote the set that contains all the trainable parameters of our model.

Macro disentanglement. Users may have very diverse interests, and interact with items that belong to many high-level concepts, e.g., product categories. We aim to achieve macro disentanglement by learning a factorized representation of user u, namely $z_u = [z_u^{(1)}; z_u^{(2)}; \ldots; z_u^{(K)}] \in \mathbb{R}^{d'}$, where $d' = Kd$, assuming that there are K high-level concepts. The k-th component $z_u^{(k)} \in \mathbb{R}^d$ is for capturing the user's preference regarding the k-th concept. Additionally, we infer a set of one-hot vectors $C = \{c_i\}_{i=1}^M$ for the items, where $c_i = [c_{i,1}; c_{i,2}; \ldots; c_{i,K}]$. If item i belongs to concept k, then $c_{i,k} = 1$ and $c_{i,k'} = 0$ for any $k' \neq k$. We jointly infer $\{z_u\}_{u=1}^N$ and C unsupervisedly.

Micro disentanglement. High-level concepts correspond to the intentions of a user, e.g., to buy clothes or a cellphone. We are also interested in disentangling a user's preference at a more granular level regarding the various aspects of an item. For example, we would like the different dimensions of $z_u^{(k)}$ to individually capture the user's preferred sizes, colors, etc., if concept k is clothing.

2.2 Model

We start by proposing a generative model that encourages macro disentanglement. For a user u, our generative model assumes that the observed data are generated from the following distribution:

$p_\theta(x_u) = \mathbb{E}_{p_\theta(C)} \left[ \int p_\theta(x_u \mid z_u, C)\, p_\theta(z_u)\, dz_u \right],$  (1)

$p_\theta(x_u \mid z_u, C) = \prod_{x_{u,i} \in x_u} p_\theta(x_{u,i} \mid z_u, C).$  (2)

The meanings of $x_u$, $z_u$, and C are described in the previous subsection. We have assumed $p_\theta(z_u) = p_\theta(z_u \mid C)$ in the first equation, i.e., $z_u$ and C are generated by two independent sources. Note that $c_i = [c_{i,1}; c_{i,2}; \ldots; c_{i,K}]$ is one-hot, since we assume that item i belongs to exactly one concept. And $p_\theta(x_{u,i} \mid z_u, C) = Z_u^{-1} \sum_{k=1}^K c_{i,k} \cdot g_\theta^{(i)}(z_u^{(k)})$ is a categorical distribution over the M items, where $Z_u = \sum_{i=1}^M \sum_{k=1}^K c_{i,k} \cdot g_\theta^{(i)}(z_u^{(k)})$, and $g_\theta^{(i)} : \mathbb{R}^d \to \mathbb{R}^+$ is a shallow neural network that estimates how much a user with a given preference is interested in item i. We use sampled softmax [23] to estimate $Z_u$ based on a few sampled items when M is very large.

Macro disentanglement. We assume above that the user representation $z_u$ is sufficient for predicting how the user will interact with the items. And we further assume that using the k-th component $z_u^{(k)}$ alone is already sufficient if the prediction is about an item from concept k. This design explicitly encourages $z_u^{(k)}$ to capture preference regarding only the k-th concept, as long as the inferred concept assignment matrix C is meaningful. We will describe later the implementation details of $p_\theta(C)$, $p_\theta(z_u)$, and $g_\theta^{(i)}(z_u^{(k)})$.
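For illustration, the concept-gated categorical distribution over items can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's code: array names and shapes are ours, and we instantiate $g_\theta^{(i)}$ with the cosine-based form $\exp(\mathrm{COSINE}(z_u^{(k)}, h_i)/\tau)$ that the implementation section later adopts, rather than a general shallow network.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def item_scores(z, H, c, tau=0.1):
    """Categorical distribution over M items for one user.

    z : (K, d) user preference, one row per concept component z_u^(k)
    H : (M, d) item representations h_i
    c : (M,)   hard concept assignment of each item (index of the 1 in c_i)

    Each item is scored only by the user component of its own concept,
    mirroring p(x_ui | z_u, C) proportional to sum_k c_ik * g_i(z_u^(k)).
    """
    Zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    cos = Hn @ Zn.T                            # (M, K) cosine similarities
    logits = cos[np.arange(len(H)), c] / tau   # pick each item's own concept
    return softmax(logits)                     # normalization plays the role of Z_u
```

In a real model with very large M, the full normalization above would be replaced by sampled softmax, as noted in the text.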
Nevertheless, we note that $p_\theta(C)$ requires careful design to prevent mode collapse, i.e., the degenerate case where almost all items are assigned to a single concept.

Variational inference. We follow the variational auto-encoder (VAE) paradigm [27, 44], and optimize $\theta$ by maximizing a lower bound of $\sum_u \ln p_\theta(x_u)$, where $\ln p_\theta(x_u)$ is bounded as follows:

$\ln p_\theta(x_u) \geq \mathbb{E}_{p_\theta(C)} \left[ \mathbb{E}_{q_\theta(z_u \mid x_u, C)}[\ln p_\theta(x_u \mid z_u, C)] - D_{\mathrm{KL}}(q_\theta(z_u \mid x_u, C) \,\|\, p_\theta(z_u)) \right].$  (3)

See the supplementary material for the derivation of the lower bound. Here we have introduced a variational distribution $q_\theta(z_u \mid x_u, C)$, whose implementation also encourages macro disentanglement and will be presented later. The two expectations, i.e., $\mathbb{E}_{p_\theta(C)}[\cdot]$ and $\mathbb{E}_{q_\theta(z_u \mid x_u, C)}[\cdot]$, are intractable, and are therefore estimated using the Gumbel-Softmax trick [22, 41] and the Gaussian re-parameterization trick [27], respectively. Once the training procedure is finished, we use the mode of $p_\theta(C)$ as C, and the mode of $q_\theta(z_u \mid x_u, C)$ as the representation of user u.

Micro disentanglement. A natural strategy to encourage micro disentanglement is to force statistical independence between the dimensions, i.e., to force $q_\theta(z_u^{(k)} \mid C) \approx \prod_{j=1}^d q_\theta(z_{u,j}^{(k)} \mid C)$, so that each dimension describes an isolated factor. Fortunately, the Kullback–Leibler (KL) divergence term in the lower bound above does provide a way to encourage independence. Specifically, the KL term of our model can be rewritten as:

$\mathbb{E}_{p_{\mathrm{data}}(x_u)} \left[ D_{\mathrm{KL}}(q_\theta(z_u \mid x_u, C) \,\|\, p_\theta(z_u)) \right] = I_q(x_u; z_u) + D_{\mathrm{KL}}(q_\theta(z_u \mid C) \,\|\, p_\theta(z_u)),$  (4)

where $q_\theta(z_u \mid C) = \int q_\theta(z_u \mid x_u, C)\, p_{\mathrm{data}}(x_u)\, dx_u$. See the supplementary material for the proof. Similar decomposition of the KL term has been noted for the original VAEs previously [1, 25, 9]. Penalizing the latter KL term would encourage independence between the dimensions, if we choose a prior that satisfies $p_\theta(z_u) = \prod_{j=1}^{d'} p_\theta(z_{u,j})$. On the other hand, the former term $I_q(x_u; z_u)$ is the mutual information between $x_u$ and $z_u$ under $q_\theta(z_u \mid x_u, C) \cdot p_{\mathrm{data}}(x_u)$. Penalizing $I_q(x_u; z_u)$ is equivalent to applying the information bottleneck principle [47, 2], which encourages $z_u$ to ignore as much noise in the input as it can and to focus on merely the essential information. We therefore follow $\beta$-VAE [19], and strengthen these two regularization terms by a factor of $\beta \gg 1$, which brings us to the following training objective:

$\mathbb{E}_{p_\theta(C)} \left[ \mathbb{E}_{q_\theta(z_u \mid x_u, C)}[\ln p_\theta(x_u \mid z_u, C)] - \beta \cdot D_{\mathrm{KL}}(q_\theta(z_u \mid x_u, C) \,\|\, p_\theta(z_u)) \right].$  (5)

2.3 Implementation

In this section, we describe the implementation of $p_\theta(C)$, $p_\theta(x_{u,i} \mid z_u, C)$ (the decoder), $p_\theta(z_u)$ (the prior), and $q_\theta(z_u \mid x_u, C)$ (the encoder), and propose an efficient strategy to combat mode collapse. The parameters $\theta$ of our implementation include: K concept prototypes $\{m_k\}_{k=1}^K \in \mathbb{R}^{K \times d}$, M item representations $\{h_i\}_{i=1}^M \in \mathbb{R}^{M \times d}$ used by the decoder, M context representations $\{t_i\}_{i=1}^M \in \mathbb{R}^{M \times d}$ used by the encoder, and the parameters of a neural network $f_{\mathrm{nn}} : \mathbb{R}^d \to \mathbb{R}^{2d}$.
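As a concrete illustration of the $\beta$-weighted objective, the following minimal sketch computes the (negated) per-user objective for a diagonal Gaussian posterior and the $\mathcal{N}(0, \sigma_0^2 I)$ prior, using the closed-form Gaussian KL. The default values of $\beta$ and $\sigma_0$ are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def kl_diag_gauss(mu, sigma, sigma0=0.1):
    """KL( N(mu, diag(sigma^2)) || N(0, sigma0^2 I) ), summed over dimensions.

    Closed form for two diagonal Gaussians:
      sum_j [ log(sigma0/sigma_j) + (sigma_j^2 + mu_j^2) / (2 sigma0^2) - 1/2 ]
    """
    return np.sum(np.log(sigma0 / sigma) + (sigma**2 + mu**2) / (2 * sigma0**2) - 0.5)

def neg_objective(log_lik, mu, sigma, beta=20.0, sigma0=0.1):
    """Negative of the beta-weighted lower bound: minimize
    -E[log p(x | z, C)] + beta * KL(q || p).
    log_lik stands in for the (Monte Carlo estimated) reconstruction term."""
    return -log_lik + beta * kl_diag_gauss(mu, sigma, sigma0)
```

Note that the KL vanishes exactly when the posterior matches the prior ($\mu = 0$, $\sigma = \sigma_0$), so the $\beta$ factor only penalizes posteriors that carry information about $x_u$, which is the information-bottleneck reading given above.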
We optimize $\theta$ to maximize the training objective (see Equation 5) using Adam [26].

Prototype-based concept assignment. A straightforward approach would be to assume $p_\theta(C) = \prod_{i=1}^M p(c_i)$ and parameterize each categorical distribution $p(c_i)$ with its own set of $K - 1$ parameters. This approach, however, would result in over-parameterization and low sample efficiency. We instead propose a prototype-based implementation. To be specific, we introduce K concept prototypes $\{m_k\}_{k=1}^K$ and reuse the item representations $\{h_i\}_{i=1}^M$ from the decoder. We then assume $c_i$ is a one-hot vector drawn from the following categorical distribution $p_\theta(c_i)$:

$s_{i,k} = \mathrm{COSINE}(h_i, m_k)/\tau, \quad c_i \sim \mathrm{CATEGORICAL}(\mathrm{SOFTMAX}([s_{i,1}; s_{i,2}; \ldots; s_{i,K}])),$  (6)

where $\mathrm{COSINE}(a, b) = a^\top b / (\|a\|_2 \|b\|_2)$ is the cosine similarity, and $\tau$ is a hyper-parameter that scales the similarity from $[-1, 1]$ to $[-\frac{1}{\tau}, \frac{1}{\tau}]$. We set $\tau = 0.1$ to obtain a more skewed distribution.

Preventing mode collapse. We use cosine similarity, instead of the inner product similarity adopted by most existing deep learning methods [32, 31, 18]. This choice is crucial for preventing mode collapse. In fact, with inner product, the majority of the items are highly likely to be assigned to a single concept $m_{k'}$ that has an extremely large norm, i.e., $\|m_{k'}\|_2 \to \infty$, even when the items $\{h_i\}_{i=1}^M$ correctly form K clusters in the high-dimensional Euclidean space. And we observe empirically that this phenomenon does occur frequently with inner product (see Figure 2e). In contrast, cosine similarity avoids this degenerate case due to the normalization.
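The prototype-based assignment of Equation 6, together with a Gumbel-Softmax surrogate for sampling the one-hot $c_i$, can be sketched as follows. This is an illustrative sketch: the random seed and the annealing temperature are our choices, and a real implementation would use a framework's differentiable Gumbel-Softmax rather than raw NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_logits(H, prototypes, tau=0.1):
    """Temperature-scaled cosine similarities s_ik between items and
    prototypes (Equation 6). Because both sides are L2-normalized, no
    prototype can dominate the assignment through a large norm, which is
    what prevents mode collapse here."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    Pn = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (Hn @ Pn.T) / tau                      # (M, K), bounded by 1/tau

def gumbel_softmax_sample(logits, anneal=1.0):
    """Relaxed sample of the categorical c_i via the Gumbel-Softmax trick:
    add Gumbel noise, then take a softmax; lower anneal gives harder samples."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / anneal
    y = np.exp(y - y.max(axis=1, keepdims=True))
    return y / y.sum(axis=1, keepdims=True)       # rows sum to 1, near one-hot
```

At convergence, taking the mode (row-wise argmax of the logits) recovers the hard assignment C used after training.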
Moreover, cosine similarity is related to the Euclidean distance on the unit hypersphere, and the Euclidean distance is a proper metric that is more suitable for inferring the cluster structure than inner product.

Decoder. The decoder predicts which item out of the M ones is most likely to be clicked by a user, when given the user's representation $z_u = [z_u^{(1)}; z_u^{(2)}; \ldots; z_u^{(K)}]$ and the one-hot concept assignments $\{c_i\}_{i=1}^M$. We assume that $p_\theta(x_{u,i} \mid z_u, C) \propto \sum_{k=1}^K c_{i,k} \cdot g_\theta^{(i)}(z_u^{(k)})$ is a categorical distribution over the M items, and define $g_\theta^{(i)}(z_u^{(k)}) = \exp(\mathrm{COSINE}(z_u^{(k)}, h_i)/\tau)$. This design implies that $\{h_i\}_{i=1}^M$ will be micro-disentangled if $\{z_u^{(k)}\}_{u=1}^N$ is micro-disentangled, as the two's dimensions are aligned.

Prior & Encoder. The prior $p_\theta(z_u)$ needs to be factorized in order to achieve micro disentanglement. We therefore set $p_\theta(z_u)$ to $\mathcal{N}(0, \sigma_0^2 I)$. The encoder $q_\theta(z_u \mid x_u, C)$ is for computing the representation of a user when given the user's behavior data $x_u$. The encoder maintains an additional set of context representations $\{t_i\}_{i=1}^M$, rather than reusing the item representations $\{h_i\}_{i=1}^M$ from the decoder, which is a common practice in the literature [32]. We assume $q_\theta(z_u \mid x_u, C) = \prod_{k=1}^K q_\theta(z_u^{(k)} \mid x_u, C)$, and represent each $q_\theta(z_u^{(k)} \mid x_u, C)$ as a multivariate normal distribution with a diagonal covariance matrix $\mathcal{N}(\mu_u^{(k)}, [\mathrm{diag}(\sigma_u^{(k)})]^2)$, where the mean and the standard deviation are parameterized by a neural network $f_{\mathrm{nn}} : \mathbb{R}^d \to \mathbb{R}^{2d}$:

$(a_u^{(k)}, b_u^{(k)}) = f_{\mathrm{nn}}\left( \frac{\sum_{i: x_{u,i}=1} c_{i,k} \cdot t_i}{\sqrt{\sum_{i: x_{u,i}=1} c_{i,k}^2}} \right), \quad \mu_u^{(k)} = \frac{a_u^{(k)}}{\|a_u^{(k)}\|_2}, \quad \sigma_u^{(k)} \leftarrow \sigma_0 \cdot \exp\left(-\frac{1}{2} b_u^{(k)}\right).$  (7)

The neural network $f_{\mathrm{nn}}(\cdot)$ captures nonlinearity, and is shared across the K components. We normalize the mean, so as to be consistent with the use of cosine similarity, which projects the representations onto a unit hypersphere. Note that $\sigma_0$ should be set to a small value, e.g., around 0.1, since the learned representations are now normalized.

2.4 User-Controllable Recommendation

The controllability enabled by the disentangled representations can bring a new paradigm for recommendation. It allows a user to interactively search for items that are similar to an initial item except for some controlled aspects, or to explicitly adjust the disentangled representation of his/her preference, learned by the system from his/her past behaviors, to actually match the current preference. Here we formalize the task of user-controllable recommendation, and illustrate a possible solution.

Task definition. Let $h_* \in \mathbb{R}^d$ be the representation to be altered, which can be initialized as either an item representation or a component of a user representation. The task is to gradually alter its j-th dimension $h_{*,j}$, while retrieving items whose representations are similar to the altered representation. This task is nontrivial, since usually no item will have exactly the same representation as the altered one, especially when we want the transition to be smooth, monotonic, and thus human-understandable.

Solution. Here we illustrate our approach to this task. We first probe the suitable range $(a, b)$ for $h_{*,j}$. Let us assume that prototype $k_*$ is the prototype closest to $h_*$.
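One way to probe such a range endpoint by binary search can be sketched as follows. This is a toy sketch under assumptions of ours: the nearest-prototype assignment is assumed to flip at most once along dimension j within the search interval, and the search cap `hi` is an arbitrary constant.

```python
import numpy as np

def nearest_prototype(h, prototypes):
    """Index of the prototype with the largest cosine similarity to h."""
    hn = h / np.linalg.norm(h)
    pn = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(pn @ hn))

def probe_upper_endpoint(h, prototypes, j, hi=10.0, iters=50):
    """Binary-search the largest value b of dimension j for which the
    nearest prototype of h is unchanged; the lower endpoint a is found
    symmetrically by searching downward."""
    k_star = nearest_prototype(h, prototypes)
    lo, up = h[j], hi
    for _ in range(iters):
        mid = (lo + up) / 2
        trial = h.copy()
        trial[j] = mid
        if nearest_prototype(trial, prototypes) == k_star:
            lo = mid   # assignment unchanged: the boundary is further up
        else:
            up = mid   # assignment flipped: the boundary is below mid
    return lo
```

For example, with prototypes along the two axes of a 2-D space and $h_* = (1, 0.1)$, raising the second dimension keeps the first prototype nearest until the two coordinates are equal, so the probed endpoint is 1.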
The range $(a, b)$ is decided such that prototype $k_*$ remains the prototype closest to $h_*$ if and only if $h_{*,j} \in (a, b)$. We can decide each endpoint of the range using binary search. We then divide the range $(a, b)$ into B subranges, $a = a_0 < a_1 < a_2 < \ldots < a_B = b$. We ensure that the subranges contain roughly the same number of items from concept $k_*$ when dividing $(a, b)$. Finally, we aim to retrieve B items $\{i_t\}_{t=1}^B \in \{1, 2, \ldots, M\}^B$ that belong to concept $k_*$, each from one of the B subranges, i.e., $h_{i_t,j} \in (a_{t-1}, a_t]$. We thus decide the B items by maximizing

$\sum_{1 \le t \le B} \mathrm{COSINE}(h_{i_t,-j}, h_{*,-j}) + \gamma \sum_{1 \le t < t' \le B} \mathrm{COSINE}(h_{i_t,-j}, h_{i_{t'},-j}),$

where $h_{i,-j} = [h_{i,1}; h_{i,2}; \ldots; h_{i,j-1}; h_{i,j+1}; \ldots; h_{i,d}] \in \mathbb{R}^{d-1}$ and $\gamma$ is a hyper-parameter. We approximately solve this maximization problem sequentially using beam search [36]. Intuitively, selecting items from the B subranges ensures that the items change monotonically in terms of the j-th dimension. On the other hand, the first term in the maximization problem forces the retrieved items to be similar to the initial item in terms of the dimensions other than j, while the second term encourages any two retrieved items to be similar in terms of the dimensions other than j.

highest confidence levels. The confidence of component k is defined as $\sum_{i: x_{u,i} > 0} c_{i,k}$, where $c_{i,k}$ is the value inferred by our model, rather than the ground-truth. The results are shown in Figure 2.

(a) Items and users. Item i is colored according to $\arg\max_k c_{i,k}$, i.e., the inferred category. Each component of a user is treated as an individual point, and the k-th component is colored according to k. (b) Users only, colored in the same way as Figure 2a. (c) Items only, colored in the same way as Figure 2a. (d) Items only, colored according to their ground-truth categories. (e) Items, obtained by training a new model that uses inner product instead of cosine, colored according to the value of $\arg\max_k c_{i,k}$.

Figure 2: The discovered clusters of items (see Figure 2c), learned unsupervisedly, align well with the ground-truth categories (see Figure 2d, where the color order is chosen such that the connections between the ground-truth categories and the learned clusters are easy to verify). Figure 2e highlights the importance of using cosine similarity, rather than inner product, to combat mode collapse.

(a) Bag size. (b) Bag color. (c) Styles of phone cases. (d) Bag size. The same dimension as Figure 3a. (e) Bag color. The same dimension as Figure 3b. (f) Chicken → beef → mutton → seafood.

Figure 3: Starting from an item representation, we gradually alter the value of a target dimension, and list the items that have representations similar to the altered representations (see Subsection 2.4).

Interpretability. Figure 2c, which shows the clusters inferred based on the prototypes, is rather similar to Figure 2d that shows the ground-truth categories, despite the fact that our model is trained without the ground-truth category labels. This demonstrates that our approach is able to discover and disentangle the macro structures underlying the user behavior data in an interpretable way. Moreover, the components of the user representations are near the correct cluster centers (see Figure 2a and Figure 2b), and are hence likely capturing the users' separate preferences for different categories.

Cosine vs. inner product. To highlight the necessity of using cosine similarity instead of the more commonly used inner product similarity, we additionally train a new model that uses inner product in place of cosine, and visualize the learned item representations in Figure 2e. With inner product, the majority of the items are assigned to the same prototype (see Figure 2e). In comparison, all seven prototypes learned by the cosine-based model are assigned a significant number of items (see Figure 2c). This finding supports our claim that a proper metric space, such as the one implied by the cosine similarity, is important for preventing mode collapse.

3.4 Micro Disentanglement

Figure 4: Micro disentanglement vs. recommendation performance. $(d, d')$ indicates d-dimensional item representations and d'-dimensional user representations. Note that $d' = Kd$. We observe that (1) our approach outperforms the baselines in terms of both performance and micro disentanglement, and (2) macro disentanglement benefits micro disentanglement, as K = 7 is better than K = 1.

Independence. We vary the hyper-parameters related to micro disentanglement ($\beta$ and $\sigma_0$ for our approach, $\beta$ for $\beta$-MultVAE), and plot in Figure 4 the relationship between the level of independence achieved and the recommendation performance. Each method is evaluated with 2,000 randomly sampled configurations on ML-100k. We quantify the level of independence achieved by a set of d-dimensional representations using 1 − 2
1≤i