{"title": "Joint Optimization of Tree-based Index and Deep Model for Recommender Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 3971, "page_last": 3980, "abstract": "Large-scale industrial recommender systems are usually confronted with computational problems due to the enormous corpus size. To retrieve and recommend the most relevant items to users under response time limits, resorting to an efficient index structure is an effective and practical solution. The previous work Tree-based Deep Model (TDM) \\cite{zhu2018learning} greatly improves recommendation accuracy using tree index. By indexing items in a tree hierarchy and training a user-node preference prediction model satisfying a max-heap like property in the tree, TDM provides logarithmic computational complexity w.r.t. the corpus size, enabling the use of arbitrary advanced models in candidate retrieval and recommendation.\n\n In tree-based recommendation methods, the quality of both the tree index and the user-node preference prediction model determines the recommendation accuracy for the most part. We argue that the learning of tree index and preference model has interdependence. Our purpose, in this paper, is to develop a method to jointly learn the index structure and user preference prediction model. In our proposed joint optimization framework, the learning of index and user preference prediction model are carried out under a unified performance measure. Besides, we come up with a novel hierarchical user preference representation utilizing the tree index hierarchy. Experimental evaluations with two large-scale real-world datasets show that the proposed method improves recommendation accuracy significantly. 
Online A/B test results at a display advertising platform also demonstrate the effectiveness of the proposed method in production environments.", "full_text": "Joint Optimization of Tree-based Index and Deep Model for Recommender Systems

Han Zhu1, Daqing Chang1, Ziru Xu1,2*, Pengye Zhang1
1Alibaba Group
2School of Software, Tsinghua University
Beijing, China
{zhuhan.zh, daqing.cdq, ziru.xzr, pengye.zpy}@alibaba-inc.com

Xiang Li, Jie He, Han Li, Jian Xu, Kun Gai
Alibaba Group
Beijing, China
{yushi.lx, jay.hj, lihan.lh, xiyu.xj, jingshi.gk}@alibaba-inc.com

Abstract

Large-scale industrial recommender systems are usually confronted with computational problems due to the enormous corpus size. To retrieve and recommend the most relevant items to users under response time limits, resorting to an efficient index structure is an effective and practical solution. The previous work Tree-based Deep Model (TDM) [34] greatly improves recommendation accuracy using a tree index. By indexing items in a tree hierarchy and training a user-node preference prediction model satisfying a max-heap like property in the tree, TDM provides logarithmic computational complexity w.r.t. the corpus size, enabling the use of arbitrary advanced models in candidate retrieval and recommendation.

In tree-based recommendation methods, the quality of both the tree index and the user-node preference prediction model determines the recommendation accuracy for the most part. We argue that the learning of the tree index and the preference model are interdependent. Our purpose, in this paper, is to develop a method to jointly learn the index structure and the user preference prediction model. In our proposed joint optimization framework, the learning of the index and the user preference prediction model is carried out under a unified performance measure.
Besides, we come up with a novel hierarchical user preference representation utilizing the tree index hierarchy. Experimental evaluations with two large-scale real-world datasets show that the proposed method improves recommendation accuracy significantly. Online A/B test results at a display advertising platform also demonstrate the effectiveness of the proposed method in production environments.

1 Introduction

The recommendation problem is essentially to retrieve a set of the most relevant or preferred items for each user request from the entire corpus. In the practice of large-scale recommendation, the algorithm design should strike a balance between accuracy and efficiency. In a corpus with tens or hundreds of millions of items, methods that need to linearly scan each item's preference score for each single user request are not computationally tractable. To solve the problem, an index structure is commonly used to accelerate the retrieval process. In early recommender systems, item-based collaborative filtering (Item-CF) along with the inverted index was a popular solution to overcome the calculation barrier [18]. However, the scope of the candidate set is limited, because only those items similar to the user's historical behaviors can be ultimately recommended.

*The work was done when she was a student intern at Alibaba Group.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In recent years, vector representation learning methods [27, 16, 26, 5, 22, 2] have been actively researched. This kind of method learns user and item vector representations, the inner product of which represents the user-item preference.
For systems that use vector representation based methods, recommendation set generation is equivalent to the k-nearest neighbor (kNN) search problem. Quantization-based indexes [19, 14] for approximate kNN search are widely adopted to accelerate the retrieval process. However, in the above solution, the vector representation learning and the kNN search index construction are optimized towards different objectives individually. The objective divergence leads to suboptimal vector representations and index structure [4]. An even more important problem is that the dependence on the vector kNN search index requires an inner-product form of user preference modeling, which limits the model capability [10]. Models like Deep Interest Network [32], Deep Interest Evolution Network [31] and xDeepFM [17], which have been proven to be effective in user preference prediction, cannot be used to generate candidates in recommendation.

In order to break the inner-product form limitation and make arbitrary advanced user preference models computationally tractable for retrieving candidates from the entire corpus, the previous work Tree-based Deep Model (TDM) [34] creatively uses a tree structure as the index and greatly improves recommendation accuracy. TDM uses a tree index to organize items, and each leaf node in the tree corresponds to an item. Like a max-heap, TDM assumes that each user-node preference equals the largest preference among the user's preferences over all children of this node. In the training stage, a user-node preference prediction model is trained to fit the max-heap like preference distribution. Unlike vector kNN search based methods, where the index structure requires an inner-product form of user preference modeling, there is no restriction on the form of the preference model in TDM. In prediction, preference scores given by the trained model are used to perform layer-wise beam search in the tree index to retrieve the candidate items.
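The layer-wise beam search described above can be sketched as follows. This is a minimal illustration, assuming a complete binary tree numbered heap-style and a hypothetical `score(user, node)` function standing in for the trained preference model; it is not the paper's implementation.

```python
def beam_search(score, user, num_levels, k):
    """Layer-wise top-k retrieval in a complete binary tree index.

    Nodes are numbered heap-style: the root is 0 and the children of
    node i are 2i+1 and 2i+2. `score(user, node)` may be any user-node
    preference model (no inner-product form is required). Only the
    children of the current top-k nodes are scored at each level, so
    the cost is logarithmic in the corpus size.
    """
    candidates = [0]  # start from the root
    for _ in range(num_levels):
        # expand the children of the current candidate nodes
        children = [c for n in candidates for c in (2 * n + 1, 2 * n + 2)]
        # keep only the k highest-scoring nodes in this level
        children.sort(key=lambda n: score(user, n), reverse=True)
        candidates = children[:k]
    return candidates  # leaf nodes, i.e., the recommended items
```

With a toy scoring function that prefers smaller node ids, two levels of search with k = 2 descend the left side of the tree.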
The time complexity of beam search in the tree index is logarithmic w.r.t. the corpus size, and no restriction is imposed on the model structure, which is a prerequisite for making advanced user preference models feasible for candidate retrieval in recommendation.

The index structure plays different roles in kNN search based methods and tree-based methods. In kNN search based methods, the user and item vector representations are learned first, and the vector search index is built afterwards. In tree-based methods, however, the tree index's hierarchy also affects the retrieval model training. Therefore, how to jointly learn the tree index and the user preference model is an important problem. Tree-based methods are also an active research topic in the literature of extreme classification [29, 1, 24, 11, 8, 25], which is sometimes considered the same as recommendation [12, 25]. In existing tree-based methods, the tree structure is learned for a better hierarchy in the sample or label space. However, the objective of the sample or label partitioning task in the tree learning stage is not fully consistent with the ultimate target, i.e., accurate recommendation. The inconsistency between the objectives of index learning and prediction model training leads the overall system to a suboptimal status.
To address this challenge and facilitate better cooperation of the tree index and the user preference prediction model, we focus on developing a way to simultaneously learn the tree index and the user preference prediction model by optimizing a unified performance measure. The main contributions of this paper are: 1) We propose a joint optimization framework to learn the tree index and the user preference prediction model in tree-based recommendation, where a unified performance measure, i.e., the accuracy of user preference prediction, is optimized; 2) We demonstrate that the proposed tree learning algorithm is equivalent to the weighted maximum matching problem of a bipartite graph, and give an approximate algorithm to learn the tree; 3) We propose a novel method that makes better use of the tree index to generate hierarchical user representations, which can help learn a more accurate user preference prediction model; 4) We show that both the tree index learning and the hierarchical user representation can improve recommendation accuracy, and that these two modules can even mutually improve each other to achieve a more significant performance gain.

Figure 1: Tree-based deep recommendation model. (a) User preference prediction model. We firstly abstract the user behaviors hierarchically with nodes in the corresponding levels. Then the abstracted user behaviors and the target node, together with other features such as the user profile, are used as the input of the model. (b) Tree hierarchy. Each item is firstly assigned to a different leaf node with a projection function π(·). In the retrieval stage, items assigned to the red nodes in the leaf level are selected as the candidate set.

2 Joint Optimization of Tree-based Index and Deep Model

In this section, we firstly give a brief review of TDM [34] to make this paper self-contained. Then we propose the joint learning framework of the tree-based index and deep model.
In the last subsection, we specify the hierarchical user preference representation used in model training.

2.1 Tree-based Deep Recommendation Model

In recommender systems with a large-scale corpus, how to retrieve candidates effectively and efficiently is a challenging problem. TDM uses a tree as the index and proposes a max-heap like probability formulation in the tree, where the user preference for each non-leaf node n in level l is derived as

p^{(l)}(n|u) = max_{n_c ∈ {n's children in level l+1}} p^{(l+1)}(n_c|u) / α^{(l)},   (1)

where p^{(l)}(n|u) is the ground truth probability that the user u prefers the node n, and α^{(l)} is a level normalization term. The above formulation means that the ground truth user-node probability on a node equals the maximum user-node probability among its children divided by a normalization term. Therefore, the top-k nodes in level l must be contained among the children of the top-k nodes in level l − 1, and the retrieval of the top-k leaf items can be restricted to recursive top-down top-k node retrieval in each level without losing accuracy. Based on this, TDM turns the recommendation task into a hierarchical retrieval problem, where the candidate items are selected gradually from coarse to fine. The candidate generating process of TDM is shown in Fig 1.

Each item is firstly assigned to a leaf node in the tree hierarchy T. A layer-wise beam search strategy is carried out as shown in Fig 1(b). For level l, only the children of the nodes with top-k probabilities in level l − 1 are scored and sorted to pick the k candidate nodes in level l. This process continues until k leaf items are reached. User features combined with the candidate node are used as the input of the prediction model M (e.g. fully-connected networks) to get the preference probability, as shown in Fig 1(a). With the tree index, the overall retrieval complexity for a user request is reduced from linear to logarithmic w.r.t.
the corpus size, and there is no restriction on the preference model structure. This makes TDM break the inner-product form restriction on user preference modeling brought by the vector kNN search index and enables arbitrary advanced deep models to retrieve candidates from the entire corpus, which greatly raises the recommendation accuracy.

2.2 Joint Optimization Framework

Denote the training set of n samples as {(u^{(i)}, c^{(i)})}_{i=1}^{n}, in which the i-th pair (u^{(i)}, c^{(i)}) means that the user u^{(i)} is interested in the target item c^{(i)}. For (u^{(i)}, c^{(i)}), the tree hierarchy T determines the path that the prediction model M should select to reach c^{(i)} for u^{(i)}. We propose to jointly learn M and T under a global loss function. As we will see in the experiments, jointly optimizing M and T can improve the ultimate recommendation accuracy.

Algorithm 1: Joint learning framework of the tree index and deep model
Input: Loss function L(θ, π), initial deep model M and initial tree T
1: for t = 0, 1, 2, ... do
2:    Solve min_θ L(θ, π) by optimizing the model M.
3:    Solve max_π −L(θ, π) by optimizing the tree hierarchy with Algorithm 2.
4: end for
Output: Learned model M and tree T

Given a user-item pair (u, c), denote p(π(c)|u; π) as user u's preference probability over the leaf node π(c), where π(·) is a projection function that projects an item to a leaf node in T. Note that π(·) completely determines the tree hierarchy T, as shown in Fig 1(b). Optimizing T is actually optimizing π(·).
The model M estimates the user-node preference p̂(π(c)|u; θ, π), where θ denotes the model parameters. If the pair (u, c) is a positive sample, we have the ground truth preference p(π(c)|u; π) = 1, following the multi-class setting [5, 2]. According to the max-heap property, the user preference probabilities of all of π(c)'s ancestor nodes, i.e., {p(b_j(π(c))|u; π)}_{j=0}^{l_max}, should also be 1, in which b_j(·) is the projection from a node to its ancestor node in level j and l_max is the max level in T. To fit such a user-node preference distribution, the global loss function is formulated as

L(θ, π) = − Σ_{i=1}^{n} Σ_{j=0}^{l_max} log p̂( b_j(π(c^{(i)})) | u^{(i)}; θ, π ),   (2)

where we sum up the negative logarithm of the predicted user-node preference probability over all the positive training samples and their ancestor user-node pairs as the global empirical loss.

Optimizing π(·) is a combinatorial optimization problem, which can hardly be optimized simultaneously with θ using gradient-based algorithms. To conquer this, we propose a joint learning framework as shown in Algorithm 1. It alternately optimizes the loss function (2) with respect to the user preference model and the tree hierarchy. The consistency of the training loss between model training and tree learning promotes the convergence of the framework. In fact, Algorithm 1 is guaranteed to converge if both the model training and the tree learning decrease the value of (2), since {L(θ_t, π_t)} is then a decreasing sequence lower bounded by 0. In model training, min_θ L(θ, π) is to learn a user-node preference model for all levels, which can be solved by popular optimization algorithms for neural networks such as SGD [3] and Adam [15].
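The global loss of Eq. (2) sums the negative log preference over each positive leaf node and all of its ancestors. A minimal sketch, assuming heap-style node numbering and a hypothetical `p_hat(user, node)` stand-in for the trained model:

```python
import math

def ancestors(leaf):
    """Path from the root (node 0) down to a heap-indexed leaf,
    i.e., b_0, b_1, ..., b_{l_max} in the paper's notation."""
    path = [leaf]
    while leaf > 0:
        leaf = (leaf - 1) // 2  # parent in a complete binary tree
        path.append(leaf)
    return path[::-1]  # level 0 ... leaf level

def global_loss(p_hat, samples):
    """Eq. (2): sum of -log p_hat(node | user) over each positive
    sample's leaf node and all of its ancestors.

    `samples` is a list of (user, leaf_node) pairs; `p_hat` is a
    hypothetical user-node preference predictor returning (0, 1].
    """
    loss = 0.0
    for user, leaf in samples:
        for node in ancestors(leaf):
            loss -= math.log(p_hat(user, node))
    return loss
```

For one sample whose leaf path has three nodes and a constant predicted probability of 0.5, the loss is 3 log 2, matching a by-hand evaluation of Eq. (2).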
In the normalized user preference setting [5, 2], since the number of nodes increases exponentially with the level, noise-contrastive estimation [7] is an alternative for estimating p̂(b_j(π(c))|u; θ, π) that avoids calculating the normalization term through a sampling strategy. The task of tree learning is to solve max_π −L(θ, π) given θ. max_π −L(θ, π) is equal to the maximum weighted matching problem of a bipartite graph that consists of the items in the corpus C and the leaf nodes of T.² The detailed proof is shown in the supplementary material.

Traditional algorithms for the assignment problem, such as the classic Hungarian algorithm, are hard to apply to a large corpus because of their high complexities. Even for the naive greedy algorithm that repeatedly chooses the unassigned edge with the largest weight, a big weight matrix needs to be computed and stored in advance, which is not acceptable. To conquer this issue, we propose a segmented tree learning algorithm.

Instead of assigning items directly to leaf nodes, we achieve this step-by-step from the root node to the leaf level. Given a projection π and the k-th item c_k in the corpus, denote

L^{s,e}_{c_k}(π) = Σ_{(u,c) ∈ A_k} Σ_{j=s}^{e} log p̂( b_j(π(c)) | u; θ, π ),

where A_k = {(u^{(i)}, c^{(i)}) | c^{(i)} = c_k}_{i=1}^{n} is the set of training samples whose target item is c_k, and s and e are the start and end levels respectively. We firstly maximize Σ_{k=1}^{|C|} L^{1,d}_{c_k}(π) w.r.t. π, which is equivalent to assigning all the items to nodes in level d. For a complete binary tree T with max level l_max, each node in level d is assigned no more than 2^{l_max−d} items. This is also a maximum matching problem, which can be efficiently solved by a greedy algorithm, since the number of possible locations for each item is largely decreased if d is well chosen (e.g. for d = 7, the number is 2^d = 128). Denote the found optimal projection in this step as π*. Then, we successively maximize Σ_{k=1}^{|C|} L^{d+1,2d}_{c_k}(π) under the constraint that ∀c ∈ C, b_d(π(c)) = b_d(π*(c)), which means keeping each item's corresponding ancestor node in level d unchanged. The recursion continues until each item is assigned to a leaf node.

Algorithm 2: Tree learning algorithm
Input: Gap d, max tree level l_max, original projection π_old
Output: Optimized projection π_new
1: Set current level l ← d, initialize π_new ← π_old
2: while d > 0 do
3:    for each node n_i in level l − d do
4:       Denote C_{n_i} as the item set such that ∀c ∈ C_{n_i}, b_{l−d}(π_new(c)) = n_i
5:       Find π* that maximizes Σ_{c ∈ C_{n_i}} L^{l−d+1,l}_c(π), s.t. ∀c ∈ C_{n_i}, b_{l−d}(π*(c)) = n_i
6:       Update π_new: ∀c ∈ C_{n_i}, π_new(c) ← π*(c)
7:    end for
8:    d ← min(d, l_max − l)
9:    l ← l + d
10: end while

²For convenience, we assume T is a given complete binary tree. It is worth mentioning that the proposed algorithm can be naturally extended to multi-way trees.
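The per-segment sub-problem reduces to assigning items to a small set of candidate nodes under a capacity constraint. The sketch below is a simplified, capacity-constrained greedy pass over the assumed precomputed weights (a stand-in for L^{l−d+1,l}_c); the paper's actual greedy-with-rebalance strategy is described afterwards and detailed in its supplementary material.

```python
def greedy_assign(items, nodes, weight, capacity):
    """Greedily assign each item to one candidate node, taking the
    largest-weight (item, node) edge first, subject to a per-node
    capacity (2^(l_max - l) in the paper).

    `weight(item, node)` is a hypothetical stand-in for the segment
    log-likelihood L^{l-d+1,l}_c. If every admissible node fills up,
    an item can remain unassigned; the paper's rebalance process
    handles such cases.
    """
    edges = [(weight(c, n), c, n) for c in items for n in nodes]
    edges.sort(key=lambda e: e[0], reverse=True)  # heaviest edges first
    load = {n: 0 for n in nodes}
    assignment = {}
    for w, c, n in edges:
        if c not in assignment and load[n] < capacity:
            assignment[c] = n
            load[n] += 1
    return assignment
```

Because each item only competes for the children of one ancestor node, the candidate set per item stays small (e.g. 2^d = 128 for d = 7), which is what makes the greedy pass tractable.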
The proposed algorithm is detailed in Algorithm 2. In line 5 of Algorithm 2, we use a greedy algorithm with a rebalance strategy to solve the sub-problem. Each item c ∈ C_{n_i} is firstly assigned to the child of n_i in level l with the largest weight L^{l−d+1,l}_c(·). Then, a rebalance process is applied to ensure that each child is assigned no more than 2^{l_max−l} items. The detailed implementation of Algorithm 2 is given in the supplementary material.

2.3 Hierarchical User Preference Representation

As shown in Section 2.1, TDM is a hierarchical retrieval model that generates the candidate items hierarchically from coarse to fine. In retrieval, a layer-wise top-down beam search is carried out through the tree index by the user preference prediction model M. Therefore, M's tasks in different levels are heterogeneous. Based on this, a level-specific input of M is necessary to raise the recommendation accuracy.

A series of related works [30, 6, 18, 16, 32, 33, 34] has shown that the user's historical behaviors play a key role in predicting the user's interests. In our tree-based approach, we can exploit this key role in a novel and effective way. Given a user behavior sequence c = {c_1, c_2, ..., c_m}, where c_i is the i-th item the user interacts with, we propose to use c^l = {b_l(π(c_1)), b_l(π(c_2)), ..., b_l(π(c_m))} as the user's behavior feature in level l. c^l, together with the target node and other possible features such as the user profile, is used as the input of M in level l to predict the user-node preference, as shown in Fig 1(a). In addition, since each node or item is a one-hot ID feature, we follow the common way of embedding them into a continuous feature space. In this way, the ancestor nodes of the items the user interacts with are used as the hierarchical user preference representation.
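The level-l behavior feature c^l above is just the behavior sequence mapped through b_l(π(·)). A minimal sketch, assuming heap-style node numbering (the embedding lookup that follows is omitted):

```python
def ancestor_at_level(leaf, level, max_level):
    """Ancestor of a heap-indexed node at the given level (b_l in the
    paper); root is node 0 at level 0, children of i are 2i+1, 2i+2."""
    node = leaf
    for _ in range(max_level - level):
        node = (node - 1) // 2  # move up one level
    return node

def hierarchical_behaviors(behavior_leaves, level, max_level):
    """Map a user's behavior sequence (leaf nodes of the interacted
    items) to their level-l ancestors, giving the level-specific
    behavior feature c^l used as input to M."""
    return [ancestor_at_level(c, level, max_level) for c in behavior_leaves]
```

In a tree of max level 3 (leaves 7-14), the same behavior sequence yields a coarser feature at level 1 than at the leaf level, which is exactly the coarse-to-fine description used during layer-wise retrieval.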
Generally, the hierarchical representation brings two main benefits:

1. Level independence. Sharing item embeddings between different levels, as in the common way, brings noise into the training of the user preference prediction model M, because the targets differ across levels. An explicit solution is to attach to each item an independent embedding for each level. However, this would greatly increase the number of parameters and make the system hard to optimize and apply. The proposed hierarchical representation uses the node embeddings in the corresponding level as the input of M, which achieves level independence in training without increasing the number of parameters.

2. Precise description. M generates the candidate items hierarchically through the tree. With the increase of the retrieval level, the candidate nodes in each level describe the ultimate recommended items from coarse to fine until the leaf level is reached. The proposed hierarchical user preference representation grasps the nature of the retrieval process and gives a precise description of user behaviors with nodes in the corresponding level, which improves the predictability of user preference by reducing the confusion brought by overly detailed or overly coarse descriptions. For example, M's task in upper levels is to coarsely select a candidate set, and the user behaviors are accordingly coarsely described with homogeneous node embeddings in the same upper levels in both training and prediction.

Experimental studies in both Section 3 and the supplementary material will show the significant effectiveness of the proposed hierarchical representation.

3 Experimental Study

We study the performance of the proposed method both offline and online in this section. We firstly compare the overall performance of the proposed method with other baselines. Then we conduct experiments to verify the contribution of each part and the convergence of the framework.
At last, we show the performance of the proposed method in an online display advertising platform with real traffic.

3.1 Experiment Setup

The offline experiments are conducted with two large-scale real-world datasets: 1) Amazon Books³ [20, 9], a user-book review dataset made up of product reviews from Amazon; here we use its largest subset, Books; 2) UserBehavior⁴ [34], a subset of Taobao user behavior data. These two datasets both contain millions of items, and the data is organized in user-item interaction form: each user-item interaction consists of a user ID, item ID, category ID and timestamp. For the above two datasets, only users with no less than 10 interactions are kept.

To evaluate the performance of the proposed framework, we compare the following methods:

• Item-CF [28] is a basic collaborative filtering method and is widely used for personalized recommendation, especially for large-scale corpora [18].

• YouTube product-DNN [5] is a practical method used in YouTube video recommendation. It is the representative work of vector kNN search based methods. The inner product of the learned user and item vector representations reflects the preference, and we use exact kNN search to retrieve candidates in prediction.

• HSM [21] is the hierarchical softmax model. It adopts the multiplication of layer-wise conditional probabilities to get the normalized item preference probability.

• TDM [34] is the tree-based deep model for recommendation. It enables arbitrary advanced models to retrieve user interests using the tree index. We use the proposed basic DNN version of TDM without tree learning and attention.

• DNN is a variant of TDM without the tree index.
The only difference is that it directly learns a user-item preference model and linearly scans all items to retrieve the top-k candidates in prediction. It is computationally intractable in an online system but a strong baseline for offline comparison.

• JTM is the proposed joint learning framework of the tree index and the user preference prediction model. JTM-J and JTM-H are two variants. JTM-J jointly optimizes the tree index and the user preference prediction model without the proposed hierarchical representation in Section 2.3, while JTM-H adopts the hierarchical representation but uses the fixed initial tree index without tree learning.

Following TDM [34], we split users into training, validation and testing sets disjointly. Each user-item interaction in the training set is a training sample, and the user's behaviors before the interaction are the corresponding features. For each user in the validation and testing sets, we take the first half of the behaviors along the timeline as known features and the latter half as the ground truth.

Taking advantage of TDM's open source work⁵, we implement all methods in Alibaba's deep learning platform X-DeepLearning (XDL). HSM, DNN and JTM adopt the same user preference prediction model as TDM. We deploy negative sampling for all methods except Item-CF and use the same negative sampling ratio: 100 negative items in Amazon Books and 200 in UserBehavior are sampled for each training sample. HSM, TDM and JTM require an initial tree in advance of the training process. Following TDM, we use category information to initialize the tree structure, where items from the same category aggregate in the leaf level.

³http://jmcauley.ucsd.edu/data/amazon
⁴http://tianchi.aliyun.com/dataset/dataDetail?dataId=649&userId=1
⁵http://github.com/alibaba/x-deeplearning/tree/master/xdl-algorithm-solution/TDM
More details and code for data pre-processing and training are listed in the supplementary material.

Precision, Recall and F-Measure are three general metrics, and we use them to evaluate the performance of the different methods. For a user u, suppose P_u (|P_u| = M) is the recalled set and G_u is the ground truth set. The metrics are defined as

Precision@M(u) = |P_u ∩ G_u| / |P_u|,   Recall@M(u) = |P_u ∩ G_u| / |G_u|,

F-Measure@M(u) = 2 · Precision@M(u) · Recall@M(u) / (Precision@M(u) + Recall@M(u)).

The results of each metric are averaged across all users in the testing set, and the listed values are the average of five different runs.

3.2 Comparison Results

Table 1 exhibits the results of all methods on the two datasets. It clearly shows that our proposed JTM outperforms all other baselines on all metrics. Compared with the previous best model, DNN, JTM achieves 45.3% and 9.4% relative recall lift in Amazon Books and UserBehavior respectively.

Table 1: Comparison results of different methods in Amazon Books and UserBehavior (M = 200).

Method               | Amazon Books                   | UserBehavior
                     | Precision  Recall   F-Measure  | Precision  Recall   F-Measure
Item-CF              | 0.52%      8.18%    0.92%      | 1.56%      6.75%    2.30%
YouTube product-DNN  | 0.53%      8.26%    0.93%      | 2.25%      10.15%   3.36%
HSM                  | 0.42%      6.22%    0.72%      | 1.80%      8.62%    2.71%
TDM                  | 0.50%      7.49%    0.88%      | 2.23%      10.84%   3.40%
DNN                  | 0.56%      8.57%    0.98%      | 2.81%      13.45%   4.23%
JTM-J                | 0.51%      7.60%    0.89%      | 2.48%      11.72%   3.73%
JTM-H                | 0.68%      10.45%   1.19%      | 2.66%      12.93%   4.02%
JTM                  | 0.79%      12.45%   1.38%      | 3.11%      14.71%   4.68%

As mentioned before, though computationally intractable in an online system, DNN is a significantly strong baseline for offline comparison.
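The per-user evaluation metrics above can be computed directly from the recalled set and the ground truth set, as in this minimal sketch:

```python
def precision_recall_f(recalled, ground_truth):
    """Precision@M, Recall@M and F-Measure@M for one user, following
    the definitions above; `recalled` is the top-M recommended set."""
    hits = len(set(recalled) & set(ground_truth))
    precision = hits / len(recalled)
    recall = hits / len(ground_truth)
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

In the paper's protocol, these per-user values are then averaged over all test users and over five runs.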
Comparison results of DNN and the other methods give insights in many aspects.

Firstly, the gap between YouTube product-DNN and DNN shows the limitation of the inner-product form. The only difference between these two methods is that YouTube product-DNN uses the inner product of the user and item vectors to calculate the preference score, while DNN uses a fully-connected network. Such a change brings an apparent improvement, which verifies the effectiveness of advanced neural networks over the inner-product form.

Next, TDM performs worse than DNN with an ordinary but not optimized tree hierarchy. The tree hierarchy takes effect in both the training and prediction processes. User-node samples are generated along the tree to fit the max-heap like preference distribution, and layer-wise beam search is deployed in the tree index during prediction. Without a well-defined tree hierarchy, the user preference prediction model may converge to a suboptimal version because of confusingly generated samples, and targets may be lost in the non-leaf levels so that inaccurate candidate sets may be returned. Especially in a sparse dataset like Amazon Books, the learned embedding of each node in the tree hierarchy is not distinguishable enough, so that TDM does not perform better than the other baselines. This phenomenon illustrates the influence of the tree and the necessity of tree learning. Additionally, HSM gets much worse results than TDM, which is consistent with the results reported in TDM [34]. When dealing with a large corpus, as a result of layer-wise probability multiplication and beam search, HSM cannot guarantee that the final recalled set is optimal.

By jointly learning the tree index and the user preference model, JTM outperforms DNN on all metrics in both datasets with much lower retrieval complexity.
A more precise user preference prediction model and a better tree hierarchy are obtained in JTM, which leads to a better item set selection. The hierarchical user preference representation alleviates the data sparsity problem in the upper levels, because the feature space of the user behavior feature is much smaller there while having the same number of samples, and it helps model training in a layer-wise way by reducing the propagation of noise between levels. Besides, tree hierarchy learning makes similar items aggregate in the leaf level, so that the models for the internal levels can get training samples with a more consistent and unambiguous distribution. Benefiting from these two factors, JTM provides better results than DNN.

The results in Table 1 below the dashed line indicate the contribution of each part and their joint performance in JTM. Take the recall metric as an example. Compared to TDM in UserBehavior, tree learning and the hierarchical representation of user preference bring 0.88% and 2.09% absolute gain separately. Furthermore, a 3.87% absolute recall gain is achieved by the cooperation of both optimizations under a unified objective. Similar gains are observed in Amazon Books. The above results clearly show the effectiveness of the hierarchical representation and tree learning, as well as the joint learning framework.

Convergence of Iterative Joint Learning. The tree hierarchy determines sample generation and search paths, so a suitable tree benefits model training and inference a great deal. Fig 2 gives the comparison between the clustering-based tree learning algorithm proposed in TDM [34] and our proposed joint learning approach.
For fairness, both methods adopt the hierarchical user representation. Since the proposed tree learning algorithm shares its objective with the user preference prediction model, two merits emerge from the results: 1) it converges stably to a good tree; 2) the final recommendation accuracy is higher than that of the clustering-based method. From Fig 2 we can see that the results improve iteratively on all three metrics, and that the model converges stably on both datasets, while the clustering-based approach ultimately overfits. These results empirically demonstrate the effectiveness and convergence of iterative joint learning. Careful readers may notice that the clustering algorithm outperforms JTM in the first few iterations; the reason is that the tree learning algorithm in JTM adopts a lazy strategy, i.e., it tries to limit the degree of tree structure change in each iteration (details are given in the supplementary material).

Figure 2: Results of iterative joint learning on two datasets (M = 200). Panels 2(a) Precision, 2(b) Recall, and 2(c) F-Measure are results on Amazon Books; 2(d) Precision, 2(e) Recall, and 2(f) F-Measure show the performance on UserBehavior. The horizontal axis of each figure represents the number of iterations.

3.3 Online Results

We also evaluate the proposed JTM in a production environment: the display advertising scenario of the Guess What You Like column of the Taobao App homepage.
We use click-through rate (CTR) and revenue per mille (RPM), the key performance indicators, to measure performance. They are defined as:

CTR = (# of clicks) / (# of impressions),   RPM = (Ad revenue / # of impressions) × 1000

In the platform, advertisers bid on many granularities such as ad clusters, items, and shops. Several recommendation approaches run simultaneously at all granularities; each produces a candidate set, and their combination is passed to subsequent stages such as CTR prediction [32, 31, 23] and ranking [33, 13]. The comparison baseline is this combination of all running recommendation methods. To assess the effectiveness of JTM, we deploy it to replace Item-CF, one of the major item-granularity candidate-generation approaches in the platform; TDM is evaluated in the same way. The corpus contains tens of millions of items. Each comparison bucket receives 2% of the online traffic, which is large enough considering the overall page-view volume. Table 2 lists the improvements on the two main online metrics. The 11.3% growth in CTR shows that more precise items are recommended with JTM; the 12.9% improvement in RPM indicates that JTM brings more revenue for the platform.

Table 2: Online results from Jan 21 to Jan 27, 2019.

Metric   Baseline   TDM     JTM
CTR      0.0%       +5.4%   +11.3%
RPM      0.0%       +7.6%   +12.9%

4 Conclusion

Recommender systems play a key role in various applications such as video streaming and e-commerce.
In this paper, we address an important problem in large-scale recommendation, i.e., how to optimize the user representation, the user preference prediction model, and the index structure under a global objective. To the best of our knowledge, JTM is the first work to propose a unified framework integrating the optimization of these three key factors. Within this framework, a joint learning approach for the tree index and the user preference prediction model is introduced: the tree index and the deep model are alternately optimized under a global loss function, with a novel hierarchical user representation based on the tree index. Both online and offline experimental results show the advantages of the proposed framework over other large-scale recommendation models.

Acknowledgements

We deeply appreciate Jingwei Zhuo, Mingsheng Long, and Jin Li for their helpful suggestions and discussions. We thank Huimin Yi, Yang Zheng, and Xianteng Wu for implementing key components of the training and inference platform, and Yin Yang, Liming Duan, Yao Xu, Guan Wang, and Yue Gao for their support of online serving.

References
[1] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages. In WWW, pages 13–24, 2013.
[2] A. Beutel, P. Covington, S. Jain, C. Xu, J. Li, V. Gatto, and E. H. Chi. Latent cross: Making use of context in recurrent recommender systems. In WSDM, pages 46–54, 2018.
[3] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177–186, 2010.
[4] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen. Deep quantization network for efficient image retrieval. In AAAI, pages 3457–3463, 2016.
[5] P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In RecSys, pages 191–198, 2016.
[6] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. V. Vleet, U.
Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath. The youtube video recommendation system. In RecSys, pages 293–296, 2010.
[7] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304, 2010.
[8] L. Han, Y. Huang, and T. Zhang. Candidates vs. noises estimation for large multi-class classification problem. In ICML, pages 1885–1894, 2018.
[9] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pages 507–517, 2016.
[10] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua. Neural collaborative filtering. In WWW, pages 173–182, 2017.
[11] H. D. III, N. Karampatziakis, J. Langford, and P. Mineiro. Logarithmic time one-against-some. In ICML, pages 923–932, 2017.
[12] H. Jain, Y. Prabhu, and M. Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In KDD, pages 935–944, 2016.
[13] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang. Real-time bidding with multi-agent reinforcement learning in display advertising. In CIKM, pages 2193–2201, 2018.
[14] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[16] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[17] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In KDD, pages 1754–1763, 2018.
[18] G. Linden, B. Smith, and J. York.
Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
[19] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An investigation of practical approximate nearest neighbor algorithms. In NeurIPS, pages 825–832, 2004.
[20] J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In SIGIR, pages 43–52, 2015.
[21] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In AISTATS, 2005.
[22] S. Okura, Y. Tagami, S. Ono, and A. Tajima. Embedding-based news recommendation for millions of users. In KDD, pages 1933–1942, 2017.
[23] Q. Pi, W. Bian, G. Zhou, X. Zhu, and K. Gai. Practice on long sequential user behavior modeling for click-through rate prediction. In KDD, pages 2671–2679, 2019.
[24] Y. Prabhu and M. Varma. Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, pages 263–272, 2014.
[25] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW, pages 993–1002, 2018.
[26] S. Rendle. Factorization machines. In ICDM, pages 995–1000, 2010.
[27] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NeurIPS, pages 1257–1264, 2007.
[28] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, pages 285–295, 2001.
[29] J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. In ICML, pages 181–189, 2013.
[30] S. Zhang, L. Yao, and A. Sun. Deep learning based recommender system: A survey and new perspectives. arXiv preprint arXiv:1707.07435, 2017.
[31] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai.
Deep interest evolution network for click-through rate prediction. arXiv preprint arXiv:1809.03672, 2018.
[32] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai. Deep interest network for click-through rate prediction. In KDD, pages 1059–1068, 2018.
[33] H. Zhu, J. Jin, C. Tan, F. Pan, Y. Zeng, H. Li, and K. Gai. Optimized cost per click in taobao display advertising. In KDD, pages 2191–2200, 2017.
[34] H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, and K. Gai. Learning tree-based deep model for recommender systems. In KDD, pages 1079–1088, 2018.