{"title": "Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora", "book": "Advances in Neural Information Processing Systems", "page_first": 2143, "page_last": 2150, "abstract": "We propose Dirichlet-Bernoulli Alignment (DBA), a generative model for corpora in which each pattern (e.g., a document) contains a set of instances (e.g., paragraphs in the document) and belongs to multiple classes. By casting predefined classes as latent Dirichlet variables (i.e., instance level labels), and modeling the multi-label of each pattern as Bernoulli variables conditioned on the weighted empirical average of topic assignments, DBA automatically aligns the latent topics discovered from data to human-defined classes. DBA is useful for both pattern classification and instance disambiguation, which are tested on text classification and named entity disambiguation for web search queries respectively.", "full_text": "Dirichlet-Bernoulli Alignment: A Generative Model\nfor Multi-Class Multi-Label Multi-Instance Corpora\n\nShuang-Hong Yang\nCollege of Computing\n\nGeorgia Tech\n\nHongyuan Zha\n\nCollege of Computing\n\nGeorgia Tech\n\nBao-Gang Hu\n\nNLPR & LIAMA\n\nChinese Academy of Sciences\n\nshy@gatech.edu\n\nzha@cc.gatech.edu\n\nhubg@nlpr.ia.ac.cn\n\nAbstract\n\nWe propose Dirichlet-Bernoulli Alignment (DBA), a generative model for cor-\npora in which each pattern (e.g., a document) contains a set of instances (e.g.,\nparagraphs in the document) and belongs to multiple classes. By casting prede-\n\ufb01ned classes as latent Dirichlet variables (i.e., instance level labels), and modeling\nthe multi-label of each pattern as Bernoulli variables conditioned on the weighted\nempirical average of topic assignments, DBA automatically aligns the latent top-\nics discovered from data to human-de\ufb01ned classes. 
DBA is useful for both pattern classification and instance disambiguation, which are tested on text classification and named entity disambiguation in web search queries, respectively.\n\n1 Introduction\n\nWe consider multi-class, multi-label and multi-instance classification (M3C), the task of learning decision rules from corpora in which each pattern consists of multiple instances1 and is associated with multiple classes. M3C finds applications in many fields. For example, in web page classification, a web page (pattern) typically comprises different entities (instances) (e.g., texts, pictures and videos) and is usually associated with several different topics (e.g., finance, sports and politics). In such tasks, a pattern usually consists of a set of instances, and the possible instances may be too diverse in nature (e.g., of different structures or types, described by different features) to be represented in a universal space. What makes the problem more complicated and challenging is that the pattern is usually ambiguous, i.e., it can belong to several different classes simultaneously. Traditional classification algorithms are typically incapable of handling such complications.\n\nEven for corpora consisting of relatively homogeneous data, treating the task as M3C might still be advantageous, since it enables us to explore the inner structures and the ambiguity of the data simultaneously. For example, in text classification, a document usually comprises several separate semantic parts (e.g., paragraphs), and several different topics evolve along these parts. Since the class labels are often only locally tied to the document (e.g., paragraphs are often far more topic-focused than the whole document), basing the classification on the whole document would incur too much noise and in turn harm the performance. 
In addition, treating the task as M3C also offers a natural way to track the topic evolution along paragraphs, a task that is otherwise difficult to handle.\n\nM3C also arises naturally when the acquisition of labeled data is expensive. For example, in scene classification, a picture usually contains several objects (e.g., cat, desk, man) belonging to several different classes (e.g., animal, furniture, human). Ideal annotation requires a skilled expert to specify both the exact location and the class label of each object in the image, which, though not completely impossible, involves too much human effort, especially for large image repositories. The annotation burden would be greatly relieved if each image were labeled as a whole (e.g., with a caption indicating what is in the image), which, however, requires the learning system to be capable of handling M3C tasks.\n\n1A \u201cpattern\u201d or \u201cexample\u201d is a typical sample in a data collection, and an \u201cinstance\u201d is a part of a \u201cpattern\u201d.\n\nRecently, the Latent Dirichlet Allocation (LDA, [4]) model has been established for automatic extraction of topical structures from large repositories of documents. LDA is a highly-modularized probabilistic model with various variations and extensions (e.g., [2, 3]). By modeling a document as a mixture over topics, LDA allows each document to be associated with multiple topics in different proportions, and thus provides a promising way to capture the heterogeneity/ambiguity in the data. However, the topics discovered by LDA are implicit (i.e., each topic is expressed as a distribution over words, comprehensible interpretation of which requires human expertise), and cannot be easily aligned to the topics of human interest. In addition, the standard LDA does not model the multi-instance structure of a pattern. 
Hence, LDA and its variants cannot be directly applied to M3C.\n\nIn this paper, by taking advantage of the LDA building blocks, we present a new probabilistic generative model for multi-class, multi-label and multi-instance corpora, referred to as Dirichlet-Bernoulli Alignment (DBA). DBA assumes a tree structure for the data, i.e., each multi-labeled pattern is a bag of single-labeled instances. In DBA, each pattern is modeled as a mixture over the set of predefined classes, an instance is then generated independently conditioned on a sampled class label, and the label of a pattern is generated from a Bernoulli distribution conditioned on all the sampled labels used for generating its instances. DBA is essentially a topic model similar to LDA, except that (1) an instance rather than a single feature is generated conditioned on each sampled topic; and (2) instead of using implicit topics for dimensionality reduction as in LDA, DBA casts each class as an explicit topic to gain discriminative power from the data. Through likelihood maximization, DBA automatically aligns the topics discovered from the data to the predefined classes of interest. DBA can be naturally tailored to M3C tasks for both pattern classification and instance disambiguation. In this paper, we apply the DBA model to text classification tasks and to an interesting real-world problem, named entity disambiguation for web search queries. The experiments confirm the usefulness of the proposed DBA model.\n\nThe rest of this paper is organized as follows. Section 2 briefly reviews related work and Section 3 presents the formal description of the corpora used in M3C and the basic assumptions of our model. Section 4 introduces the detailed DBA model. In Section 5, we establish algorithms for inference and parameter estimation for DBA. In Section 6, we apply the DBA model to text classification and query disambiguation tasks. 
Finally, Section 7 presents concluding remarks.\n\n2 Related Work\n\nTraditional classification largely focuses on the single-label, single-instance framework (i.e., i.i.d. patterns associated with exclusive/disjoint classes). However, the real world is more like a web of (sub-)patterns connected with a web of classes that they belong to. Clearly, M3C reflects more of this reality. Recently, two partial solutions, multi-instance classification (MIC) [7, 11, 1] and multi-label classification (MLC) [10, 8, 5], were investigated. MIC assumes that each pattern consists of multiple instances but belongs to a single class, whereas MLC studies single-instance patterns associated with multiple classes. Although both MLC and MIC have drawn increasing attention in the literature, neither of them can handle cases where multiple instances and multiple labels are simultaneously present. Perhaps the first work investigating M3C is [13], in which the authors proposed an indirect solution: convert an M3C task into several MIC or MLC sub-tasks, each of which is then divided into single-label, single-instance classification problems and solved by discriminative algorithms such as AdaBoost or SVM. A practical challenge of this approach is its complexity, i.e., the number of sub-tasks can be huge, making the training data extremely sparse for each sub-classifier and the computation cost unacceptably high in both training and testing. Recently, Cour et al. proposed a discriminative framework [6] based on convex surrogate loss minimization for classifying ambiguously labeled images, and Xu et al. established a hybrid generative/discriminative approach (i.e., a heuristically regularized LDA classifier) [12] for mining named entities from web search click-through data. In this paper, we present a generative approach to M3C.\n\nOur proposed DBA model can be viewed as a supervised version of topic models. 
A widely used topic model for categorical data is the LDA model [4]. By modeling a pattern as a random mixture over latent topics and a topic as a multinomial distribution over features in a dictionary, LDA is effective in discovering implicit topics from a corpus. The supervised LDA (sLDA) model [2], by linking the empirical topics to the label of each pattern, is able to learn classifiers using generalized linear models. However, both LDA and sLDA are in essence dimensionality reduction techniques, and cannot be employed directly for M3C tasks.\n\nFigure 1: (a) Tree structure of a multi-class multi-label multi-instance corpus. (b) A graphical representation of the DBA model with the multinomial bag-of-feature instance model.\n\n3 Problem Formalization\n\nIntuitively, we can think of a pattern as a document, an instance as a paragraph, and a feature as a word. In M3C, we are interested in inferring class labels for both the document and its paragraphs.\n\nFormally, let X \u2282 R^D denote the instance space (e.g., a vector space), Y = {1, 2, . . . , C} (C > 2) the set of class labels, and F = {f1, f2, . . . , fD} the dictionary of features. A multi-class, multi-label multi-instance corpus D consists of a set of input patterns {Xn}n=1,2,...,N along with the corresponding labels {Yn}n=1,2,...,N, where each pattern Xn = {xmn}m=1,2,...,Mn contains a set of instances xmn \u2208 X, and Yn \u2282 Y consists of a set of class labels. The goal of M3C is to find a decision rule Y = \u03d5(X) : 2^X \u2192 2^Y, where 2^A denotes the power set of a set A. 
For simplicity, we make the following assumptions.\n\nAssumption 1 [Exchangeability]: A corpus is a bag of patterns, and each pattern is a bag of instances.\n\nAssumption 2 [Distinguishability]: Each pattern can belong to several classes, but each instance belongs to a single class.\n\nThese assumptions are equivalent to assuming a tree structure for the corpus (Figure 1(a)).\n\n4 Dirichlet-Bernoulli Alignment\n\nIn this section, we present Dirichlet-Bernoulli Alignment (DBA), a probabilistic generative model for the multi-class, multi-label and multi-instance corpus described in Section 3. In DBA, each pattern X in a corpus D is assumed to be generated by the following process:\n\n1. Sample \u03b8 \u223c Dir(a).\n2. For each of the M instances in X:\n\u22b2 Choose a class z \u223c Mult(\u03b8);\n\u22b2 Generate an instance x \u223c p(x|z, B).\n3. Generate the label y \u223c p(y|z1:M, \u03bb).\n\nWe assume the total number of predefined classes, C, is known and fixed. In DBA, a = [a1, . . . , aC]\u22a4 with ac > 0, c = 1, . . . , C, is a C-vector prior parameter for a Dirichlet distribution Dir(a), which is defined on the (C-1)-simplex: \u03b8c > 0, \u2211_{c=1}^{C} \u03b8c = 1. z is a class indicator, i.e., a binary C-vector with the 1-of-C code: zc = 1 if the c-th class is chosen, and zi = 0 for all i \u2260 c. y = [y1, . . . , yC]\u22a4 is also a binary C-vector, with yc = 1 if the pattern X belongs to the c-th class and yc = 0 otherwise.\n\nIn this paper, we assume the label of a pattern is generated by a cost-sensitive voting process according to the labels of the instances in it, which is intuitively reasonable. As a result, yc (c = 1, . . . , C) is generated from a Bernoulli distribution, i.e., p(yc|\u03c0c) = (\u03c0c)^{yc} (1 \u2212 \u03c0c)^{(1 \u2212 yc)}, where \u03c0 is a probability vector based on the weighted empirical average \u03bb\u22a4\u00afz of the topic assignments, and \u00afz = [\u00afz1, . . . , \u00afzC]\u22a4 is the average of z1, . . 
. , zM: \u00afzc = (1/M) \u2211_{m=1}^{M} zmc. For example, \u03c0 can be drawn from a Dirichlet distribution, \u03c0 \u223c Dir(\u03bb1\u00afz1, . . . , \u03bbC\u00afzC). In this paper, we use a logistic model:\n\np(yc = 1|\u00afz, \u03bb) = exp(\u03bbc\u00afzc) / (1 + exp(\u03bbc\u00afzc)).    (1)\n\nIn practice, the set of possible instances can be quite diverse, such as pictures, texts, music and videos on a web page. Without loss of generality, we follow the convention of topic models to assume that each instance x is a bag of discrete features {f1, f2, . . . , fL} and use a multinomial distribution2:\n\np(x|z, B) = p({f1, . . . , fL}|z, B) \u221d (bc1)^{x1} (bc2)^{x2} . . . (bcD)^{xD} for zc = 1,\n\nwhere L is the total number of feature occurrences in x (e.g., the length of a paragraph), B = [b1, . . . , bD] is a C \u00d7 D matrix with the (c, d)-th entry bcd = p(fd = 1|zc = 1), and xd is the frequency of fd in x. The joint probability is then given by:\n\np(X, y, Z, \u03b8|a, B, \u03bb) = p(\u03b8|a) [\u220f_{m=1}^{M} p(zm|\u03b8) \u220f_{l=1}^{L} p(fml|B, zm)] p(y|\u00afz, \u03bb).    (2)\n\nThe graphical model for DBA is depicted in Figure 1(b). We can see that DBA has a diagram very similar to that of sLDA (Figure 1 in [2]). The key differences are: (1) Instead of using implicit topics for dimensionality reduction as in sLDA, DBA casts the predefined classes as explicit topics to discover the discriminative properties from the data; (2) A bag-of-feature instance rather than a single feature is generated conditioned on each sampled topic (class); (3) DBA models a multi-class, multi-label multi-instance corpus and can be applied directly to M3C, i.e., the classification of each pattern as well as the instances within it.\n\n5 Parameter Estimation and Inference\n\nBoth parameter estimation and inferential tasks in DBA involve intractable computation of marginal probabilities. 
We use variational methods to approximate those distributions.\n\n5.1 Variational Approximations\n\nWe use the following fully-factorized variational distribution to approximate the posterior distribution of the latent variables:\n\nq(Z, \u03b8|\u03b3, \u03a6) = q(\u03b8|\u03b3) \u220f_{m=1}^{M} q(zm|\u03c6m) = [\u0393(\u2211_{c=1}^{C} \u03b3c) / \u220f_{c=1}^{C} \u0393(\u03b3c)] \u220f_{c=1}^{C} \u03b8c^{\u03b3c \u2212 1} \u220f_{m=1}^{M} \u220f_{c=1}^{C} \u03c6mc^{zmc},    (3)\n\nwhere \u03b3 and \u03a6 = [\u03c61, . . . , \u03c6M] are variational parameters for a pattern X. We have:\n\nlog P(X, y|a, B, \u03bb) = log \u222b_\u03b8 \u2211_Z p(X, y, Z, \u03b8|a, B, \u03bb) d\u03b8 = L(\u03b3, \u03a6) + KL(q(Z, \u03b8|\u03b3, \u03a6) || p(Z, \u03b8|X, y, a, B, \u03bb)) \u2248 max_{\u03b3,\u03a6} L(\u03b3, \u03a6),    (4)\n\nwhere KL(q(x)||p(x)) = \u222b_x q(x) log[q(x)/p(x)] dx is the Kullback-Leibler (KL) divergence between two distributions q and p, and L(\u00b7) is the variational lower bound for the log-likelihood:\n\nL(\u03b3, \u03a6) = \u222b_\u03b8 \u2211_Z q(Z, \u03b8|\u03b3, \u03a6) log[p(X, y, Z, \u03b8|a, B, \u03bb) / q(Z, \u03b8|\u03b3, \u03a6)] d\u03b8 = Eq[log p(\u03b8|a)] + \u2211_{m=1}^{M} Eq[log p(zm|\u03b8)] + \u2211_{m=1}^{M} Eq[log p(xm|B, zm)] + Eq[log p(y|\u00afz, \u03bb)] + Hq.    (5)\n\n2This is only a simple special case of the instance model for DBA. It is quite straightforward to substitute other instance models such as Gaussian, Poisson, or more complicated models like Gaussian mixtures.\n\nThe first two terms and the fifth term (the entropy Hq of the variational distribution) on the right-hand side of Eq.(5) are identical to the corresponding terms in sLDA [2]. 
The third term, i.e., the variational expectation of the log-likelihood of the instance observations, is:\n\n\u2211_{m=1}^{M} Eq[log p(xm|B, zm)] = \u2211_{m=1}^{M} \u2211_{c=1}^{C} \u2211_{d=1}^{D} \u03c6mc xmd log bcd.    (6)\n\nThe fourth term on the right-hand side of Eq.(5) corresponds to the expected log-likelihood of observing the labels given the topic assignments:\n\nEq[log p(y|\u00afz, \u03bb)] = (1/M) \u2211_{m=1}^{M} \u2211_{c=1}^{C} (yc \u2212 1/2) \u03bbc \u03c6mc \u2212 \u2211_{c=1}^{C} Eq[log(exp(\u03bbc\u00afzc/2) + exp(\u2212\u03bbc\u00afzc/2))].    (7)\n\nWe bound the second term above by using the lower bound for the logistic function [9]:\n\n\u2212log(exp(\u03bbc\u00afzc/2) + exp(\u2212\u03bbc\u00afzc/2)) \u2265 \u2212log(1 + exp(\u2212\u03bec)) \u2212 \u03bec/2 + \u03c2c(\u03bbc^2 \u00afzc^2 \u2212 \u03bec^2) \u2248 \u2212log(1 + exp(\u2212\u03bec)) \u2212 \u03bec/2 + 2\u03c2c(\u03bbc\u00afzc\u03bec \u2212 \u03bec^2),    (8)\n\nwhere \u03be = [\u03be1, . . . , \u03beC]\u22a4 are variational parameters, \u03c2c = (1/(4\u03bec)) tanh(\u03bec/2), and the second-order residue term is omitted since the lower bound is exact when \u03bec = \u2212\u03bbc\u00afzc.\n\nObtaining an approximate posterior distribution for the latent variables is then reduced to optimizing the objective max L(q), or min KL(q||p), with respect to the variational parameters. By using Lagrange multipliers, we can easily derive the optimality condition, which can be achieved by iteratively updating the variational parameters according to the following formulas:\n\n\u03c6mc \u221d \u220f_{d=1}^{D} (bcd)^{xmd} exp(\u03a8(\u03b3c) + (\u03bbc/2M)[2yc \u2212 1 + tanh(\u03bec/2)]),\n\u03b3c = ac + \u2211_{m=1}^{M} \u03c6mc,\n\u03bec = \u2212\u03bbc (1/M) \u2211_{m=1}^{M} \u03c6mc,    (9)\n\nwhere \u03a8(\u00b7) is the digamma function. 
Note that instead of only one feature contributing to \u03c6mc as in LDA, all the features appearing in an instance now contribute. This property tends to make DBA more robust to data sparsity. DBA also makes use of the supervision information through the term \u2211_{c=1}^{C} \u03bbc\u00afzc(2yc \u2212 1) in the variational likelihood bound L. As L is optimized, this term is equivalent to maximizing the likelihood of sampling the classes to which the pattern belongs, {max \u03bbc \u2211_{m=1}^{M} zmc, if yc = 1}, and simultaneously minimizing the likelihood of sampling the classes to which the pattern does not belong, {min \u03bbc \u2211_{m=1}^{M} zmc, if yc = 0}. Here \u03bbc (\u2212\u03bbc) acts like a utility (cost) of assigning X to the c-th class. As a result, it tends to align the Dirichlet topics discovered from the data to the class labels (Bernoulli observations) y. This is why we coin the name Dirichlet-Bernoulli Alignment.\n\n5.2 Parameter Estimation\n\nThe maximum likelihood parameter estimation of DBA relies on the variational approximation procedure. Given a corpus D = {(Xn, yn)}n=1,...,N, the MLE can be formulated as:\n\na*, B*, \u03bb* = arg max_{a,B,\u03bb} log P(D|a, B, \u03bb) = arg max_{a,B,\u03bb} \u2211_{n=1}^{N} max_{\u03b3n,\u03a6n} L(\u03b3n, \u03a6n|a, B, \u03bb).    (10)\n\nTable 1: Characteristics of the data sets.\n\nData Set | #Train | #Test | D | C | |Y|avg | #(|Y| > 1) | Mavg | Mmin | Mmax\nText | 1200 | 679 | 500 | 10 | 1.4 | 721 (38.4%) | 8.2 | 1 | 36\nQuery | 300 | 100 | 2000 | 101 | 1.4 | 99 (24.8%) | 65 | 3 | 731\n\nFigure 2: Accuracies (%) of DBA, MNB, MIMLSVM, and MIMLBoost for text classification, per class (acq, corn, crude, earn, grain, interest, money, ship, trade, wheat) and overall.\n\nThe two-layer optimization in Eq.(10) involves two groups of parameters, corresponding to the DBA model and its variational approximation, respectively. 
Optimizing alternately between these two groups leads to a Variational Expectation Maximization (VEM) algorithm similar to the one used in LDA, where the E-step corresponds to the variational approximation for each pattern in the corpus, and the M-step in turn maximizes the objective in Eq.(10) w.r.t. the model parameters. These two steps are repeated until convergence.\n\n5.3 Inference\n\nDBA involves three types of inferential tasks. The first task is to infer the latent variables for a given pattern, which is straightforward after the variational approximation. The second task, pattern classification, addresses prediction of labels for a new pattern X: p(yc = 1|X; a, B, \u03bb) \u2248 exp(\u03bbc\u00af\u03c6c)/(1 + exp(\u03bbc\u00af\u03c6c)), where \u00af\u03c6c = (1/M) \u2211_{m=1}^{M} \u03c6mc and the term (\u03bbc/2M)[2yc \u2212 1 + tanh(\u03bec/2)] is removed when updating \u03c6 in Eq.(9), since y is unobserved. The third task, instance disambiguation, finds labels for each instance within a pattern: p(zm|X, y) = \u222b_\u03b8 p(zm, \u03b8|X, y) d\u03b8 \u2248 q(zm|\u03c6m), that is, p(zmc = 1|X, y) = \u03c6mc.\n\n6 Experiments\n\nIn this section, we conduct extensive experiments to test the DBA model as applied to pattern classification and instance disambiguation, respectively. We first apply DBA to text classification and compare its performance with state-of-the-art M3C algorithms. Then the instance disambiguation performance of DBA is tested on a novel real-world task, named entity disambiguation for web search queries. Table 1 summarizes the data sets used in our experiments.\n\n6.1 Text Classification\n\nThis experiment is conducted on the ModApte split of the Reuters-21578 text collection, which contains 10788 documents belonging to the most popular 10 classes. 
We use the top 500 words with the highest document frequency as features, and represent each document as a pattern with each of its paragraphs being an instance, in order to exploit the semantic structure of documents explicitly. After eliminating the documents that have an empty label set or fewer than 20 features, we obtain a subset of 1879 documents, among which 721 documents (about 38.4%) have multiple labels. The average number of labels per document is 1.4\u00b10.6 and the average number of instances (paragraphs) per pattern (document) is 8.2\u00b14.8. The data set is further randomly partitioned into a subset of 1200 documents for training and the rest for testing.\n\nFor comparison, we also test two state-of-the-art M3C algorithms, MIMLSVM and MIMLBoost [13], and use the Multinomial Na\u00efve Bayes (MNB) classifier trained on the vector-space model of the whole documents as the baseline. For a fair comparison, a linear kernel is used in both MIMLSVM and MIMLBoost, and all the hyper-parameters are tuned by 5-fold cross-validation prior to training. We use the Hamming-Accuracy [13] to evaluate the results. For DBA and MNB, the label is estimated by y = \u03b4(p(y = 1|X) > t), where the cut-off probability threshold t is also selected based on 5-fold cross-validation. Each experiment is repeated for 5 random runs and the average results are reported by a bar chart as depicted in Figure 2. 
We can see that: (1) for most classes, the three M3C algorithms outperform the MNB baseline; (2) the performance of DBA is at least comparable with MIMLBoost and MIMLSVM. For most classes, and overall, DBA performs the best, whereas for some classes MIMLBoost and MIMLSVM perform even slightly worse than MNB. A possible reason is that if the documents are very short, splitting them might introduce severe data sparseness and in turn harm the performance. We also observe that DBA is much more efficient than MIMLBoost and MIMLSVM: for training, DBA takes 42 minutes on average, in contrast to 557 minutes (MIMLSVM) and 806 minutes (MIMLBoost).\n\nTable 2: Accuracy@N (N = 1, 2, 3) and micro-averaged and macro-averaged F-measures of the DBA, MNB and SVM based disambiguation methods; each Gain column reports DBA's relative gain over that baseline.\n\nMethod | A@1 | Gain | A@2 | Gain | A@3 | Gain | Fmicro | Gain | Fmacro | Gain\nMNB-TF | 0.4154 | 30.4% | 0.4913 | 25.7% | 0.5168 | 25.4% | 0.4154 | 30.4% | 0.3144 | 47.0%\nMNB-TF-IDF | 0.4177 | 29.6% | 0.4918 | 25.6% | 0.5176 | 25.2% | 0.4177 | 29.6% | 0.2988 | 54.7%\nSVM-TF | 0.4927 | 9.9% | NA | - | NA | - | 0.4927 | 9.9% | 0.3720 | 24.2%\nSVM-TF-IDF | 0.4912 | 10.2% | NA | - | NA | - | 0.4912 | 10.2% | 0.3670 | 25.0%\nDBA | 0.5415 | - | 0.6175 | - | 0.6482 | - | 0.5415 | - | 0.4622 | -\n\nFigure 3: Precision and recall scores for each of the 101 classes using the DBA, MNB and SVM based methods.\n\n6.2 Named Entity Disambiguation\n\nQuery ambiguity is a fundamental obstacle for search engines in capturing users\u2019 search intentions. In this section, we employ DBA to disambiguate the named entities in web search queries. 
This is a very challenging problem because queries are usually very short (2 to 3 words on average), noisy (e.g., misspellings, abbreviations, little grammatical structure) and topic-distracted. A single named-entity query Q can be viewed as a combination of a single named entity e and a set of context words w (the remaining text in Q). By differentiating the possible meanings of the named entity in a query and identifying the most probable one, entity disambiguation can help search engines capture the precise information need of the user and in turn improve search by responding with the truly most relevant documents. For example, when a user inputs \u201cWhen are the casting calls for Harry Potter in USA?\u201d, the system should be able to identify that the ambiguous named entity \u201cHarry Potter\u201d (i.e., it can be a movie, a book or a game) really refers to a movie in this specific query.\n\nWe treat the ambiguity of e as a hidden class z over e and make use of the query log as a data source for mining the relationship among e, w and z. In particular, the query log can be viewed as a multi-class, multi-label and multi-instance corpus {(Xn, Yn)}n=1,2,...,N, in which each pattern X corresponds to a named entity e and is characterized by a set of instances {xm}m=1,2,...,M corresponding to all the contexts {wm}m=1,2,...,M that co-occur with e in queries, and the label Y contains all the ambiguities of e.\n\nOur data was based on a snapshot of answers.yahoo.com crawled in early 2008, containing 216563 queries from 101 classes. We manually collect 400 named entities and label them according to the labels of their co-occurring queries in Yahoo! CQA. A randomly chosen subset of 300 entities is used as training data and the other 100 are used for testing. 
We compare our DBA based method with baselines including the Multinomial Na\u00efve Bayes classifier using TF (MNB-TF) or TF-IDF (MNB-TF-IDF) as word attributes, and the SVM classifier using TF (SVM-TF) or TF-IDF (SVM-TF-IDF). For SVM, a scheme similar to MIMLSVM is used for learning M3C classifiers.\n\nTable 2 reports the Accuracy@N (N = 1, 2, 3) as well as the micro-averaged and macro-averaged F-measure scores of each disambiguation approach3. All the results are obtained through 5-fold cross-validation. From the table, we observe that DBA achieves significantly better performance than all the other methods. In particular, for Accuracy@1 scores, DBA achieves a gain of about 30% relative to the two MNB methods, and about 10% relative to the two SVM methods; for macro-averaged F-measures, DBA achieves a gain of about 50% over the MNB methods, and about 25% over the SVM methods. As a reference, Figure 3 illustrates the sorted precision and recall scores for each of the 101 classes. We can see that DBA slightly outperforms the baselines in terms of precision, and performs significantly better in terms of recall. In particular, for recall, DBA achieves a gain of more than 50% relative to the MNB and SVM baselines.\n\n3Since SVM only outputs hard class assignments, there is no Accuracy@2,3 for the SVM based methods.\n\n7 Concluding Remarks\n\nMulti-class, multi-label and multi-instance classification (M3C) is encountered in many applications. Even for a task that is not explicitly an M3C problem, it might still be advantageous to treat it as M3C so as to better explore its inner structures and effectively handle the ambiguities. M3C also naturally arises from the difficulty of acquiring finely-labeled data. In this paper, we have proposed a probabilistic generative model for M3C corpora. 
The proposed DBA model is useful for both pattern\nclassi\ufb01cation and instance disambiguation, as has been tested respectively in text classi\ufb01cation and\nnamed-entity disambiguation tasks.\n\nAn interesting observation in practice is that, although there might be a large number of\nclasses/topics, usually a pattern is only associated with a very limited number of them.\nIn our\nexperiment, we found that substantial improvement could be achieved by simply enforcing label\nsparsity, e.g., by using LASSO style regularization. In future, we will investigate such \u201cLabel Parsi-\nmoniousness\u201d in a principled way. Another meaningful investigation would be to explicitly capture\nor explore the class correlations by using, for example, the Logistic Normal distribution [3] rather\nthan Dirichlet.\nAcknowledgments\nHongyuan Zha is supported by NSF #DMS-0736328 and grant from Microsoft. Bao-Gang Hu is\nsupported by NSFC #60275025 and the MOST of China grant #2007DFC10740.\n\nReferences\n[1] Andrews S. and Hofmann T. (2003) Multiple Instance Learning via Disjunctive Programming\n\nBoosting, In Advances in Neural Information Processing Systems 17 (NIPS\u201903), MIT Press.\n\n[2] Blei D. and McAuliffe J. (2007) Supervised topic models. In Advances in Neural Information\n\nProcessing Systems 21 (NIPS\u201907), MIT Press.\n\n[3] Blei D. and Lafferty J. (2007) A correlated topic model of Science. Annals of Applied Statistics.\n\nVol. 1, No. 1, pp. 17\u201335, 2007.\n\n[4] Blei D., Ng A. and Jordan M. (2003) Latent Dirichlet Allocation. Journal of Machine Learning\n\nResearch, Vol. 3, pp.993\u20131022, Jan. 2003, MIT Press.\n\n[5] Boutell M. R., Luo J., Shen X. and Brown C. M. (2004) Learning Multi-Label Scene Classi\ufb01-\n\ncation. Pattern Recognition, 37(9), pp.1757\u20131771, 2004.\n\n[6] Cour T., Sapp B., Jordan C. and Taskar B. 
(2009) Learning from Ambiguously Labeled Images,\n\nIn the 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201909).\n\n[7] Dietterich T. G., Lathrop R. H., Lozano-Perez T. (1997) Solving the Multiple-Instance Problem\n\nwith Axis-Parallel Rectangles. Arti\ufb01cial Intelligence Journal, Vol. 89, pp.31\u201371, Jan.1997.\n\n[8] Ghamrawi N. and McCallum A. (2005) Collective Multi-Label Classi\ufb01cation, In ACM Interna-\n\ntional Conference On Information And Knowledge Management (CIKM\u201905), pp.195\u2013200.\n\n[9] Jaakkola, T. and Jordan M. I. (2000). Bayesian parameter estimation via variational methods.\n\nStatistics and Computing, Vol 10, Issue 1, pp. 25\u201337.\n\n[10] Ueda N. and Saito K. (2002) Parametric Mixture Models For Multi-Labeled Text. In Advances\n\nin Neural Information Processing Systems 15 (NIPS\u201902).\n\n[11] Viola P., Platt J. and Zhang C. (2006). Multiple Instance Boosting For Object Detection. In\nAdvances in Neural Information Processing Systems 20 (NIPS\u201906), pp.1419\u20131426, MIT Press.\n[12] Xu G., Yang S.-H. and Li H. (2009) Named Entity Mining from Click-Through Data Using\n\nWeakly Supervised LDA, In ACM Knowledge Discovery and Data Mining (KDD\u201909).\n\n[13] Zhou Z.-H. and Zhang M.-L. (2006) Multi-Instance Multi-Label Learning with Application to\n\nScene Classi\ufb01cation, In Advances in Neural Information Processing Systems 20 (NIPS\u201906).\n\n8\n\n\f", "award": [], "sourceid": 45, "authors": [{"given_name": "Shuang-hong", "family_name": "Yang", "institution": null}, {"given_name": "Hongyuan", "family_name": "Zha", "institution": null}, {"given_name": "Bao-gang", "family_name": "Hu", "institution": null}]}