{"title": "Estimating Accuracy from Unlabeled Data: A Probabilistic Logic Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 4361, "page_last": 4370, "abstract": "We propose an efficient method to estimate the accuracy of classifiers using only unlabeled data. We consider a setting with multiple classification problems where the target classes may be tied together through logical constraints. For example, a set of classes may be mutually exclusive, meaning that a data instance can belong to at most one of them. The proposed method is based on the intuition that: (i) when classifiers agree, they are more likely to be correct, and (ii) when the classifiers make a prediction that violates the constraints, at least one classifier must be making an error. Experiments on four real-world data sets produce accuracy estimates within a few percent of the true accuracy, using solely unlabeled data. Our models also outperform existing state-of-the-art solutions in both estimating accuracies, and combining multiple classifier outputs. The results emphasize the utility of logical constraints in estimating accuracy, thus validating our intuition.", "full_text": "Estimating Accuracy from Unlabeled Data:\n\nA Probabilistic Logic Approach\n\nEmmanouil A. Platanios\nCarnegie Mellon University\n\nPittsburgh, PA\n\ne.a.platanios@cs.cmu.edu\n\nTom M. Mitchell\n\nCarnegie Mellon University\n\nPittsburgh, PA\n\ntom.mitchell@cs.cmu.edu\n\nHoifung Poon\n\nMicrosoft Research\n\nRedmond, WA\n\nhoifung@microsoft.com\n\nEric Horvitz\n\nMicrosoft Research\n\nRedmond, WA\n\nhorvitz@microsoft.com\n\nAbstract\n\nWe propose an ef\ufb01cient method to estimate the accuracy of classi\ufb01ers using only\nunlabeled data. We consider a setting with multiple classi\ufb01cation problems where\nthe target classes may be tied together through logical constraints. 
For example, a\nset of classes may be mutually exclusive, meaning that a data instance can belong to\nat most one of them. The proposed method is based on the intuition that: (i) when\nclassi\ufb01ers agree, they are more likely to be correct, and (ii) when the classi\ufb01ers\nmake a prediction that violates the constraints, at least one classi\ufb01er must be making\nan error. Experiments on four real-world data sets produce accuracy estimates\nwithin a few percent of the true accuracy, using solely unlabeled data. Our models\nalso outperform existing state-of-the-art solutions in both estimating accuracies,\nand combining multiple classi\ufb01er outputs. The results emphasize the utility of\nlogical constraints in estimating accuracy, thus validating our intuition.\n\n1\n\nIntroduction\n\nEstimating the accuracy of classi\ufb01ers is central to machine learning and many other \ufb01elds. Accuracy\nis de\ufb01ned as the probability of a system\u2019s output agreeing with the true underlying output, and thus\nis a measure of the system\u2019s performance. Most existing approaches to estimating accuracy are\nsupervised, meaning that a set of labeled examples is required for the estimation. Being able to\nestimate the accuracies of classi\ufb01ers using only unlabeled data is important for many applications,\nincluding: (i) any autonomous learning system that operates under no supervision, as well as (ii)\ncrowdsourcing applications, where multiple workers provide answers to questions, for which the\ncorrect answer is unknown. Furthermore, tasks which involve making several predictions which are\ntied together by logical constraints are abundant in machine learning. 
As an example, we may have two classifiers in the Never Ending Language Learning (NELL) project [Mitchell et al., 2015] which predict whether noun phrases represent animals or cities, respectively, and we know that something cannot be both an animal and a city (i.e., the two categories are mutually exclusive). In such cases, it is not hard to observe that if the predictions of the system violate at least one of the constraints, then at least one of the system's components must be wrong. This paper extends this intuition and presents an unsupervised approach (i.e., only unlabeled data are needed) for estimating accuracies that is able to use the information provided by such logical constraints. Furthermore, the proposed approach can also use any available labeled data, and is thus applicable to semi-supervised settings as well.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: System overview diagram. The classifier outputs (corresponding to the function approximation outputs) and the logical constraints make up the system inputs. The representation of the logical constraints in terms of the function approximation error rates is described in section 3.2. In the logical constraints box, blue arrows represent subsumption constraints, and labels connected by a red dashed line represent a mutually exclusive set. Given the inputs, the first step is grounding (computing all feasible ground predicates and rules that the system will need to perform inference over), which is described in section 3.3.2. In the ground rules box, ∧, ¬, and → correspond to the logical AND, NOT, and IMPLIES. Then, inference is performed in order to infer the most likely truth values of the unobserved ground predicates, given the observed ones and the ground rules (described in detail in section 3.3). 
The results constitute the outputs of our system and they include: (i) the estimated error rates, and (ii) the most likely target function outputs (i.e., combined predictions).

We consider a “multiple approximations” problem setting in which we have several different approximations, f̂^d_1, . . . , f̂^d_{N^d}, to a set of target boolean classification functions, f^d : X ↦ {0, 1} for d = 1, . . . , D, and we wish to know the true accuracies of each of these different approximations, using only unlabeled data, as well as the response of the true underlying functions, f^d. Each value of d characterizes a different domain (or problem setting) and each domain can be interpreted as a class or category of objects. Similarly, the function approximations can be interpreted as classifying inputs as belonging or not to these categories. We consider the case where we may have a set of logical constraints defined over the domains. Note that, in contrast with related work, we allow the function approximations to provide soft responses in the interval [0, 1] (as opposed to only allowing binary responses; i.e., they can now return the probability of the response being 1), thus allowing modeling of their “certainty”. As an example of this setting, to which we will often refer throughout this paper, let us consider a part of NELL, where the input space of our functions, X, is the space of all possible noun phrases (NPs). Each target function, f^d, returns a boolean value indicating whether the input NP belongs to a category, such as “city” or “animal”, and these categories correspond to our domains. There also exist logical constraints between these categories that may be hard (i.e., strongly enforced) or soft (i.e., enforced in a probabilistic manner). 
For example, “city” and “animal” may be mutually exclusive (i.e., if an object belongs to “city”, then it is unlikely that it also belongs to “animal”). In this case, the function approximations correspond to different classifiers (potentially using a different set of features / different views of the input data), which may return a probability for a NP belonging to a class, instead of a binary value. Our goal is to estimate the accuracies of these classifiers using only unlabeled data. In order to quantify accuracy, we define the error rate of classifier j in domain d, for the binary case, as e^d_j ≜ P_D[f̂^d_j(X) ≠ f^d(X)], for j = 1, . . . , N^d, where D is the true underlying distribution of the input data. Note that accuracy is equal to one minus error rate. This definition may be relaxed for the case where f̂^d_j(X) ∈ [0, 1] represents a probability: e^d_j ≜ f̂^d_j(X) P_D[f^d(X) ≠ 1] + (1 − f̂^d_j(X)) P_D[f^d(X) ≠ 0], which resembles an expected probability of error. Even though our work is motivated by the use of logical constraints defined over the domains, we also consider the setting where there are no such constraints.

[Figure 1 graphic: example classifier outputs (instance–category probabilities), the logical constraints, the resulting ground predicates and ground rules, and the grounding and probabilistic-inference stages with their inputs and outputs, described in sections 3.3.2 and 3.3.]

2 Related Work

The literature covers many projects related to estimating accuracy from unlabeled data. The setting we are considering was previously explored by Collins and Singer [1999], Dasgupta et al. [2001], Bengio and Chapados [2003], Madani et al. [2004], Schuurmans et al. [2006], Balcan et al. [2013], and Parisi et al. [2014], among others. Most of their approaches made some strong assumptions, such as assuming that the classifier outputs are independent given the true output, or assuming knowledge of the true distribution of the outputs. None of the previous approaches incorporated knowledge in the form of logical constraints. 
Collins and Huynh [2014] review many methods that were proposed for estimating the\naccuracy of medical tests in the absence of a gold standard. This is effectively the same problem that\nwe are considering, applied to the domains of medicine and biostatistics. They present a method\nfor estimating the accuracy of tests, where these tests are applied in multiple different populations\n(i.e., different input data), while assuming that the accuracies of the tests are the same across the\npopulations, and that the test results are independent conditional on the true \u201coutput\u201d. These are\nsimilar assumptions to the ones made by several of the other papers already mentioned, but the idea\nof applying the tests to multiple populations is new and interesting. Platanios et al. [2014] proposed a\nmethod relaxing some of these assumptions. They formulated the problem of estimating the error\nrates of several approximations to a function as an optimization problem that uses agreement rates\nof these approximations over unlabeled data. Dawid and Skene [1979] were the \ufb01rst to formulate\nthe problem in terms of a graphical model and Moreno et al. [2015] proposed a nonparametric\nextension to that model applied to crowdsourcing. Tian and Zhu [2015] proposed an interesting\nmax-margin majority voting scheme for combining classi\ufb01er outputs, also applied to crowdsourcing.\nHowever, all of these approaches were outperformed by the models of Platanios et al. [2016], which\nare most similar to the work of Dawid and Skene [1979] and Moreno et al. [2015]. To the best of\nour knowledge, our work is the \ufb01rst to use logic for estimating accuracy from unlabeled data and, as\nshown in our experiments, outperforms all competing methods. 
Logical constraints provide additional\ninformation to the estimation method and this partially explains the performance boost.\n\n3 Proposed Method\n\nj , in terms of the error rates ed\n\nOur method consists of: (i) de\ufb01ning a set of logic rules for modeling the logical constraints between\nthe f d and the \u02c6f d\nj and the known logical constraints, and (ii) performing\nprobabilistic inference using these rules as priors, in order to obtain the most likely values of the\nj and the f d, which are not observed. The intuition behind the method is that if the constraints\ned\nare violated for the function approximation outputs, then at least one of these functions has to be\nmaking an error. For example, in the NELL case, if two function approximations respond that a\nNP belongs to the \u201ccity\u201d and the \u201canimal\u201d categories, respectively, then at least one of them has to\nbe making an error. We de\ufb01ne the form of the logic rules in section 3.2 and then describe how to\nperform probabilistic inference over them in section 3.3. An overview of our system is shown in\n\ufb01gure 1. In the next section we introduce the notion of probabilistic logic, which fuses classical logic\nwith probabilistic reasoning and that forms the backbone of our method.\n\n3.1 Probabilistic Logic\n\nIn classical logic, we have a set of predicates (e.g., mammal(x) indicating whether x is a mammal,\nwhere x is a variable) and a set of rules de\ufb01ned in terms of these predicates (e.g., mammal(x) \u2192\nanimal(x), where \u201c\u2192\u201d can be interpreted as \u201cimplies\u201d). We refer to predicates and rules de\ufb01ned for\na particular instantiation of their variables as ground predicates and ground rules, respectively (e.g.,\nmammal(whale) and mammal(whale) \u2192 animal(whale)). These ground predicates and rules take\nboolean values (i.e., are either true or false \u2014 for rules, the value is true if the rule holds). 
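As a toy rendering of this classical picture (using the predicate names from the example above; this is an illustration, not part of the method), ground predicates are boolean facts and a ground IMPLIES rule is violated only when its body is true and its head is false:

```python
# Classical (boolean) logic: ground predicates are true/false facts, and a
# ground IMPLIES rule is false only when its body is true and its head false.

def implies(body: bool, head: bool) -> bool:
    """Truth value of the ground rule body -> head."""
    return (not body) or head

# Ground predicates for the instantiation x = whale.
mammal_whale = True
animal_whale = True

# Ground rule: mammal(whale) -> animal(whale).
rule_holds = implies(mammal_whale, animal_whale)
print(rule_holds)  # True: the rule holds for this instantiation
```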
Our goal\n\n3\n\n\fis to infer the most likely values for a set of unobserved ground predicates, given a set of observed\nground predicate values and logic rules.\nIn probabilistic logic, we are instead interested in inferring the probabilities of these ground predicates\nand rules being true, given a set of observed ground predicates and rules. Furthermore, the truth\nvalues of ground predicates and rules may be continuous and lie in the interval [0, 1], instead of being\nboolean, representing the probability that the corresponding ground predicate or rule is true. In this\ncase, boolean logic operators, such as AND (\u2227), OR (\u2228), NOT (\u00ac), and IMPLIES (\u2192), need to be\nrede\ufb01ned. For the next section, we will assume their classical logical interpretation.\n\n3.2 Model\n\nAs described earlier, our goal is to estimate the true accuracies of each of the function approximations,\n\u02c6f d\n1 , . . . , \u02c6f d\nN d for d = 1, . . . , D, using only unlabeled data, as well as the response of the true\nunderlying functions, f d. We now de\ufb01ne the logic rules that we perform inference over in order to\nachieve that goal. The rules are de\ufb01ned in terms of the following predicates, for d = 1, . . . , D:\n\u2022 Function Approximation Outputs: \u02c6f d\n\nj (X), de\ufb01ned over all approximations j = 1, . . . , N d, and\ninputs X \u2208 X , for which the corresponding function approximation has provided a response. Note\nthat the values of these ground predicates lie in [0, 1] due to their probabilistic nature (i.e., they do\nnot have to be binary, as in related work), and some of them are observed.\n\u2022 Target Function Outputs: f d(X), de\ufb01ned over all inputs X \u2208 X . Note that, in the purely\nunsupervised setting, none of these ground predicate values are observed, in contrast with the\nsemi-supervised setting.\n\n\u2022 Function Approximation Error Rates: ed\n\nj , de\ufb01ned over all approximations j = 1, . . . , N d. 
Note that none of these ground predicate values are observed. The primary goal of this paper is to infer their values.

The goal of the logic rules we define is two-fold: (i) to combine the function approximation outputs in a single output value, and (ii) to account for the logical constraints between the domains. We aim to achieve both goals while accounting for the error rates of the function approximations. We first define a set of rules that relate the function approximation outputs with the true underlying function output. We call this set of rules the ensemble rules and we describe them in the following section. We then discuss how to account for the logical constraints between the domains.

3.2.1 Ensemble Rules

This first set of rules specifies a relation between the target function outputs, f^d(X), and the function approximation outputs, f̂^d_j(X), independent of the logical constraints:

f̂^d_j(X) ∧ ¬e^d_j → f^d(X),  and  ¬f̂^d_j(X) ∧ ¬e^d_j → ¬f^d(X),  (1)
f̂^d_j(X) ∧ e^d_j → ¬f^d(X),  and  ¬f̂^d_j(X) ∧ e^d_j → f^d(X),  (2)

for d = 1, . . . , D, j = 1, . . . , N^d, and X ∈ X. In words: (i) the first set of rules states that if a function approximation is not making an error, its output should match the output of the target function, and (ii) the second set of rules states that if a function approximation is making an error, its output should not match the output of the target function.

An interesting point to make is that the ensemble rules effectively constitute a weighted majority vote for combining the function approximation outputs, where the weights are determined by the error rates of the approximations. These error rates are implicitly computed based on agreement between the function approximations. This is related to the work of Platanios et al. [2014]. 
There,\nthe authors try to answer the question of whether consistency in the outputs of the approximations\nimplies correctness. They directly use the agreement rates of the approximations in order to estimate\ntheir error rates. Thus, there exists an interesting connection in our work in that we also implicitly\nuse agreement rates to estimate error rates, and our results, even though improving upon theirs\nsigni\ufb01cantly, reinforce their claim.\n\nIdenti\ufb01ability. Let us consider \ufb02ipping the values of all error rates (i.e., setting them to one minus\ntheir value) and the target function responses. Then, the ensemble logic rules would evaluate to\nthe same value as before (e.g., satis\ufb01ed or unsatis\ufb01ed). Therefore, the error rates and the target\nfunction values are not identi\ufb01able when there are no logical constraints. As we will see in the next\n\n4\n\n\fsection, the constraints may sometimes help resolve this issue as, often, the corresponding logic\nrules do not exhibit that kind of symmetry. However, for cases where that symmetry exists, we\ncan resolve it by assuming that most of the function approximations have error rates better than\nchance (i.e., < 0.5). This can be done by considering the two rules: (i) \u02c6f d\nj (X) \u2192 f d(X), and\n\u00ac \u02c6f d\nj (X) \u2192 \u00acf d(X), for d = 1, . . . , D, j = 1, . . . , N d, and X \u2208 X . Note that all that these rules\nimply is that \u02c6f d\nj (X) = f d(X) (i.e., they represent the prior belief that function approximations are\ncorrect). As will be discussed in section 3.3, in probabilistic frameworks where rules are weighted\nwith a real value in [0, 1], these rules will be given a weight that represents their signi\ufb01cance or\nstrength. 
In such a framework, we can consider using a smaller weight for these prior belief rules, compared to the remainder of the rules, which would simply correspond to a regularization weight. This weight can be a tunable or even learnable parameter.

3.2.2 Constraints

The space of possible logical constraints is huge; we do not deal with every possible constraint in this paper. Instead, we focus our attention on two types of constraints that are abundant in structured prediction problems in machine learning, and which are motivated by the use of our method in the context of NELL:
• Mutual Exclusion: If domains d1 and d2 are mutually exclusive, then f^{d1} = 1 implies that f^{d2} = 0. For example, in the NELL setting, if a NP belongs to the “city” category, then it cannot also belong to the “animal” category.
• Subsumption: If d1 subsumes d2, then if f^{d2} = 1, we must have that f^{d1} = 1. For example, in the NELL setting, if a NP belongs to the “cat” category, then it must also belong to the “animal” category.
This set of constraints is sufficient to model most ontology constraints between categories in NELL, as well as a big subset of the constraints more generally used in practice.

Mutual Exclusion Rule. We first define the predicate ME(d1, d2), indicating that domains d1 and d2 are mutually exclusive¹. This predicate has value 1 if domains d1 and d2 are mutually exclusive, and value 0 otherwise, and its truth value is observed for all values of d1 and d2. Furthermore, note that it is symmetric, meaning that if ME(d1, d2) is true, then ME(d2, d1) is also true. We define the mutual exclusion logic rule as:

ME(d1, d2) ∧ f̂^{d1}_j(X) ∧ f^{d2}(X) → e^{d1}_j,  (3)

for d1 ≠ d2 = 1, . . . , D, j = 1, . . . , N^{d1}, and X ∈ X. In words, this rule says that if f^{d2}(X) = 1 and domains d1 and d2 are mutually exclusive, then f̂^{d1}_j(X) must be equal to 0, as it is an approximation to f^{d1}(X) and ideally we want that f̂^{d1}_j(X) = f^{d1}(X). If that is not the case, then f̂^{d1}_j must be making an error.

Subsumption Rule. We first define the predicate SUB(d1, d2), indicating that domain d1 subsumes domain d2. This predicate has value 1 if domain d1 subsumes domain d2, and 0 otherwise, and its truth value is always observed. Note that, unlike mutual exclusion, this predicate is not symmetric. We define the subsumption logic rule as:

SUB(d1, d2) ∧ ¬f̂^{d1}_j(X) ∧ f^{d2}(X) → e^{d1}_j,  (4)

for d1, d2 = 1, . . . , D, j = 1, . . . , N^{d1}, and X ∈ X. In words, this rule says that if f^{d2}(X) = 1 and d1 subsumes d2, then f̂^{d1}_j(X) must be equal to 1, as it is an approximation to f^{d1}(X) and ideally we want that f̂^{d1}_j(X) = f^{d1}(X). If that is not the case, then f̂^{d1}_j must be making an error.

Having defined all of the logic rules that comprise our model, we now describe how to perform inference under such a probabilistic logic model, in the next section. Inference in this case comprises determining the most likely truth values of the unobserved ground predicates, given the observed predicates and the set of rules that comprise our model.

¹ A set of mutually exclusive domains can be reduced to pairwise ME constraints for all pairs in that set.

3.3 Inference

In section 3.1 we introduced the notion of probabilistic logic and we defined our model in terms of probabilistic predicates and rules. In this section we discuss in more detail the implications of using probabilistic logic, and the way in which we perform inference in our model. There exist various probabilistic logic frameworks, each making different assumptions. 
In what is arguably the\nmost popular such framework, Markov Logic Networks (MLNs) [Richardson and Domingos, 2006],\ninference is performed over a constructed Markov Random Field (MRF) based on the model logic\nrules. Each potential function in the MRF corresponds to a ground rule and takes an arbitrary positive\nvalue when the ground rule is satis\ufb01ed and the value 0 otherwise (the positive values are often called\nrule weights and can be either \ufb01xed or learned). Each variable is boolean-valued and corresponds\nto a ground predicate. MLNs are thus a direct probabilistic extension to boolean logic. It turns out\nthat due to the discrete nature of the variables in MLNs, inference is NP-hard and can thus be very\ninef\ufb01cient. Part of our goal in this paper is for our method to be applicable at a very large scale (e.g.,\nfor systems like NELL). We thus resorted to Probabilistic Soft Logic (PSL) [Br\u00f6cheler et al., 2010],\nwhich can be thought of as a convex relaxation of MLNs.\nNote that the model proposed in the previous section, which is also the primary contribution of this\npaper, can be used with various probabilistic logic frameworks. Our choice, which is described in\nthis section, was motivated by scalability. One could just as easily perform inference for our model\nusing MLNs, or any other such framework.\n\n3.3.1 Probabilistic Soft Logic (PSL)\n\nIn PSL, models, which are composed of a set of logic rules, are represented using hinge-loss\nMarkov random \ufb01elds (HL-MRFs) [Bach et al., 2013]. In this case, inference amounts to solving a\nconvex optimization problem. Variables of the HL-MRF correspond to soft truth values of ground\npredicates. Speci\ufb01cally, a HL-MRF, f, is a probability density over m random variables, Y =\n{Y1, . . . , Ym} with domain D = [0, 1]m, corresponding to the unobserved ground predicate values.\nLet X = {X1, . . . 
, Xn} be an additional set of variables with known values in the domain [0, 1]^n, corresponding to observed ground predicate values. Let φ = {φ1, . . . , φk} be a finite set of k continuous potential functions of the form φj(X, Y) = (max{ℓj(X, Y), 0})^{pj}, where ℓj is a linear function of X and Y, and pj ∈ {1, 2}. We will soon see how these functions relate to the ground rules of the model. Given the above, for a set of non-negative free parameters λ = {λ1, . . . , λk} (i.e., the equivalent of MLN rule weights), the HL-MRF density is defined as:

f(Y) = (1/Z) exp(−∑_{j=1}^k λj φj(X, Y)),  (5)

where Z is a normalizing constant so that f is a proper probability density function. Our goal is to infer the most probable explanation (MPE), which consists of the values of Y that maximize the likelihood of our data². This is equivalent to solving the following convex problem:

min_{Y ∈ [0,1]^m} ∑_{j=1}^k λj φj(X, Y).  (6)

Each variable Xi or Yi corresponds to a soft truth value (i.e., Yi ∈ [0, 1]) of a ground predicate. Each function ℓj corresponds to a measure of the distance to satisfiability of a logic rule. The set of rules used is what characterizes a particular PSL model. The rules represent prior knowledge we might have about the problem we are trying to solve. For our model, these rules were defined in section 3.2. As mentioned above, variables are allowed to take values in the interval [0, 1]. We thus need to define what we mean by the truth value of a rule and its distance to satisfiability. 
For the logical operators AND (∧), OR (∨), NOT (¬), and IMPLIES (→), we use the definitions from Łukasiewicz Logic [Klir and Yuan, 1995]: P ∧ Q ≜ max{P + Q − 1, 0}, P ∨ Q ≜ min{P + Q, 1}, ¬P ≜ 1 − P, and P → Q ≜ min{1 − P + Q, 1}. Note that these operators are a simple continuous relaxation of the corresponding boolean operators, in that for boolean-valued variables, with 0 corresponding to FALSE and 1 to TRUE, they are equivalent. By writing all logic rules in the form B1 ∧ B2 ∧ ··· ∧ Bs → H1 ∨ H2 ∨ ··· ∨ Ht, it is easy to observe that the distance to satisfiability (i.e., 1 minus the truth value) of a rule evaluates to max{0, ∑_{i=1}^s Bi − ∑_{j=1}^t Hj + 1 − s}. Note that any set of rules of first-order predicate logic can be represented in this form [Bröcheler et al., 2010], and that minimizing this quantity amounts to making the rule “more satisfied”.

² As opposed to performing marginal inference, which aims to infer the marginal distribution of these values.

Figure 2: Illustration of the NELL-11 data set constraints. Each box represents a label, each blue arrow represents a subsumption constraint, and each set of labels connected by a red dashed line represents a mutually exclusive set of labels. For example, Animal subsumes Vertebrate, and Bird, Fish, and Mammal are mutually exclusive.

In order to complete our system description we need to describe: (i) how to obtain a set of ground rules and predicates from a set of logic rules of the form presented in section 3.2 and a set of observed ground predicates, and define the objective function of equation 6, and (ii) how to solve the optimization problem of that equation to obtain the most likely truth values for the unobserved ground predicates. 
These two steps are described in the following two sections.\n\n3.3.2 Grounding\n\nGrounding is the process of computing all possible groundings of each logic rule to construct the\ninference problem variables and the objective function. As already described in section 3.3.1, the\nvariables X and Y correspond to ground predicates and the functions (cid:96)j correspond to ground rules.\nThe easiest way to ground a set of logic rules would be to go through each one and create a ground\nrule instance of it, for each possible value of its arguments. However, if a rule depends on n variables\nand each variable can take m possible values, then mn ground rules would be generated. For example,\nthe mutual exclusion rule of equation 3 depends on d1, d2, j, and X, meaning that D2\u00d7N d1\u00d7|X|\nground rule instances would be generated, where |X| denotes the number of values that X can\ntake. The same applies to predicates; \u02c6f d1\nj (X) would result in D\u00d7 N d1 \u00d7|X| ground instances,\nwhich would become variables in our optimization problem. This approach would thus result in a\nhuge optimization problem rendering it impractical when dealing with large scale problems such as\nNELL. The key to scaling up the grounding procedure is to notice that many of the possible ground\nrules are always satis\ufb01ed (i.e., have distance to satis\ufb01ability equal to 0), irrespective of the values\nof the unobserved ground predicates that they depend upon. These ground rules would therefore\nnot in\ufb02uence the optimization problem solution and can be safely ignored. 
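This pruning criterion can be made concrete with the Łukasiewicz semantics of section 3.3.1 (a minimal sketch of the idea, not the paper's actual grounding algorithm, which is given in the supplementary material): the distance to satisfiability of a ground rule B1 ∧ ··· ∧ Bs → H1 ∨ ··· ∨ Ht is max{0, ∑Bi − ∑Hj + 1 − s}, and a ground rule whose distance is zero even in the worst case over its unobserved predicates can never influence the objective:

```python
# Łukasiewicz relaxation (truth values in [0, 1]): the distance to
# satisfiability of a ground rule B1 ∧ ... ∧ Bs → H1 ∨ ... ∨ Ht.

def distance_to_satisfiability(body, head):
    """max{0, sum(B) - sum(H) + 1 - s}, i.e., 1 minus the rule's truth value."""
    return max(0.0, sum(body) - sum(head) + 1.0 - len(body))

def can_influence(observed_body, n_unobserved_body=0, n_unobserved_head=0):
    """Worst case over unobserved predicates (body terms at 1, head terms at
    0): if the distance is still 0, the ground rule is always satisfied and
    need not be grounded at all."""
    body = list(observed_body) + [1.0] * n_unobserved_body
    head = [0.0] * n_unobserved_head
    return distance_to_satisfiability(body, head) > 0.0

# Ground ME rule (eq. 3): ME(d1,d2) ∧ f̂(X) ∧ f(X) → e, with ME and f̂ observed.
# A classifier output of 0 makes the rule trivially satisfied, so it is pruned;
# a confident positive output can violate the rule, so it is kept.
print(can_influence([1.0, 0.0], n_unobserved_body=1, n_unobserved_head=1))   # False
print(can_influence([1.0, 0.95], n_unobserved_body=1, n_unobserved_head=1))  # True
```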
Since in our model we\nare only dealing with a small set of prede\ufb01ned logic rule forms, we devised a heuristic grounding\nprocedure that only generates those ground rules and predicates that may in\ufb02uence the optimization.\nOur grounding algorithm is shown in the supplementary material and is based on the idea that a\nground rule is only useful if the function approximation predicate that appears in its body is observed.\nIt turns out that this approach is orders of magnitude faster than existing state-of-the-art solutions\nsuch as the grounding solution used by Niu et al. [2011].\n\n3.3.3 Solving the Optimization Problem\n\nFor large problems, the objective function of equation 6 will be a sum of potentially millions of\nterms, each one of which only involving a small set of variables. In PSL, the method used to solve\nthis optimization problem is based on the consensus Alternating Directions Method of Multipliers\n(ADMM). The approach consists of handling each term in that sum as a separate optimization\nproblem using copies of the corresponding variables, while adding the constraint that all copies of\neach variable must be equal. This allows for solving the subproblems completely in parallel and\nis thus scalable. The algorithm is summarized in the supplementary material. More details on this\nalgorithm and on its convergence properties can be found in the latest PSL paper [Bach et al., 2015].\nWe propose a stochastic variation of this consensus ADMM method that is even more scalable.\nDuring each iteration, instead of solving all subproblems and aggregating their solutions in the\nconsensus variables, we sample K << k subproblems to solve. 
The probability of sampling each subproblem is proportional to the distance of its variable copies from the respective consensus variables. The intuition and motivation behind this approach is that, at the solution of the optimization problem, all variable copies should be in agreement with the consensus variables. Therefore, prioritizing subproblems whose variables are in greater disagreement with the consensus variables might facilitate faster convergence. Indeed, this modification to the inference algorithm allowed us to apply our method to the NELL data set and obtain results within minutes instead of hours.

[Figure 2: Constraint structure relating the categories Animal, Vertebrate, Invertebrate, Bird, Fish, Mammal, Arthropod, Mollusk, Location, River, Lake, City, and Country.]

Table 1: Mean absolute deviation (MAD) of the error rate rankings and the error rate estimates (lower MAD is better), and area under the curve (AUC) of the label estimates (higher AUC is better). The best results for each experiment, across all methods, are shown in bolded text and the results for our proposed method are highlighted in blue.

                   NELL-7                         NELL-11
           MAD_err_rank  MAD_err  AUC_tgt  MAD_err_rank  MAD_err  AUC_tgt
MAJ            7.71       0.238    0.372       7.54       0.303    0.447
AR-2          12.0        0.261    0.378      10.8        0.350    0.455
AR            11.4        0.260    0.374      11.1        0.350    0.477
BEE            6.00       0.231    0.314       5.69       0.291    0.368
CBEE           6.00       0.232    0.314       5.69       0.291    0.368
HCBEE          5.03       0.229    0.452       5.14       0.324    0.462
LEE            3.71       0.152    0.508       4.77       0.180    0.615

              uNELL-All (×10⁻²)             uNELL-10% (×10⁻²)
           MAD_err_rank  MAD_err  AUC_tgt  MAD_err_rank  MAD_err  AUC_tgt
MAJ           23.3        0.47     99.9       33.3        0.54     87.7
GIBBS-SVM    102.0        2.05     28.6      101.7        2.15     28.2
GD-SVM        26.7        0.42     71.3       93.3        1.90     67.8
DS           170.0        7.08     12.1      180.0        6.96     12.3
AR-2          48.3        2.63     96.7       50.0        2.56     96.4
AR            48.3        2.60     96.7       48.3        2.52     96.4
BEE           40.0        0.60     99.8       31.7        0.64     79.5
CBEE          40.0        0.61     99.8      118.0       45.40     55.4
HCBEE         81.7        2.53     99.4       81.7        2.45     84.9
LEE           30.0        0.37     96.5       30.0        0.43     97.3

              uBRAIN-All (×10⁻¹)            uBRAIN-10% (×10⁻¹)
           MAD_err_rank  MAD_err  AUC_tgt  MAD_err_rank  MAD_err  AUC_tgt
MAJ            8.76       0.57     8.49        1.52       0.68     7.84
GIBBS-SVM      7.77       0.43     4.65        1.51       0.66     5.28
GD-SVM         7.60       0.44     5.24        1.50       0.68     8.56
DS             7.77       0.44     8.76        1.32       0.63     4.59
AR-2          16.40       0.87     9.71        2.28       0.97     9.89
BEE            7.98       0.40     9.32        1.38       0.63     9.35
CBEE          10.90       0.43     9.34        1.77       0.89     9.30
HCBEE         28.10       0.85     9.20        3.25       0.97     9.37
LEE            7.60       0.38     9.95        1.32       0.47     9.98

4 Experiments

Our implementation as well as the experiment data sets are available at https://github.com/eaplatanios/makina.

Data Sets. First, we considered the following two data sets with logical constraints:
• NELL-7: Classify noun phrases (NPs) as belonging to a category or not (categories correspond to domains in this case). The categories considered for this data set are Bird, Fish, Mammal, City, Country, Lake, and River. The only constraint considered is that all these categories are mutually exclusive.
• NELL-11: Perform the same task, but with the categories and constraints illustrated in figure 2.
For both of these data sets, we have a total of 553,940 NPs and 6 classifiers, which act as our function approximations and are described in [Mitchell et al., 2015]. Not all of the classifiers provide a response for every input NP.
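Returning briefly to the stochastic variant of consensus ADMM from section 3.3.3, the subproblem-sampling step can be sketched as follows (hypothetical data structures; this is a minimal illustration rather than our actual implementation, and it samples with replacement for simplicity):

```python
import random

# Sketch of the stochastic subproblem-sampling step (hypothetical code):
# each iteration samples K of the k subproblems, with probability
# proportional to how far each subproblem's local variable copies are
# from the current consensus variables.

def sample_subproblems(local_copies, consensus, K, rng=random):
    """Return indices of K subproblems to solve this iteration.

    local_copies[i] maps variable name -> local copy value for
    subproblem i; consensus maps variable name -> consensus value.
    """
    # l1 distance of each subproblem's copies from the consensus.
    distances = [sum(abs(copies[v] - consensus[v]) for v in copies)
                 for copies in local_copies]
    total = sum(distances)
    if total == 0.0:
        # All copies already agree with the consensus: sample uniformly
        # (without replacement in this degenerate case).
        return rng.sample(range(len(local_copies)), K)
    weights = [d / total for d in distances]
    # Weighted sampling with replacement, for simplicity.
    return rng.choices(range(len(local_copies)), weights=weights, k=K)
```

Subproblems whose copies already match the consensus receive zero weight, which is exactly the prioritization of disagreeing subproblems described above.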
In order to show the applicability of our method in cases where there are no logical constraints between the domains, we also replicated the experiments of Platanios et al. [2014]:
• uNELL: Same task as NELL-7, but without considering the constraints and using 15 categories, 4 classifiers, and about 20,000 NPs per category.
• uBRAIN: Classify which of two 40-second-long story passages corresponds to an unlabeled 40-second time series of Functional Magnetic Resonance Imaging (fMRI) neural activity. 11 classifiers were used, and the domain in this case is defined by 11 different locations in the brain, for each of which we have 924 examples. Additional details can be found in [Wehbe et al., 2014].

Methods. Some of the methods we compare against do not explicitly estimate error rates. Rather, they combine the classifier outputs to produce a single label. For these methods, we produce an estimate of the error rate using these labels and compare against this estimate.
1. Majority Vote (MAJ): This is the most intuitive method, and it consists of taking the most common output among the provided function approximation responses as the combined output.
2. GIBBS-SVM/GD-SVM: Methods of Tian and Zhu [2015].
3. DS: Method of Dawid and Skene [1979].
4. Agreement Rates (AR): This is the method of Platanios et al. [2014]. It estimates error rates but does not infer the combined label. To that end, we use a weighted majority vote, where the classifiers' predictions are weighted according to their error rates in order to produce a single output label. We also compare against a method denoted by AR-2 in our experiments, which is the same method, except that only pairwise function approximation agreements are considered.
5. BEE/CBEE/HCBEE: Methods of Platanios et al. [2016].
In the results, LEE stands for Logic Error Estimation and refers to the proposed method of this paper.

Evaluation. 
We compute the sample error rate estimates using the true target function labels (which are always available for evaluation purposes), and we then compute three metrics for each domain and average over domains:
• Error Rank MAD: We rank the function approximations by our estimates and by the sample estimates to produce two rank vectors. We then compute the mean absolute deviation (MAD) between the two vectors, where by MAD we mean the ℓ1 norm of the vectors' difference.
• Error MAD: MAD between the vector of our estimates and the vector of the sample estimates, where each vector is indexed by the function approximation index.
• Target AUC: Area under the precision-recall curve for the inferred target function values, relative to the true function values that are observed.

Results. First, note that the largest execution time of our method across all data sets was about 10 minutes, using a 2013 15-inch MacBook Pro. The second best performing method, HCBEE, required about 100 minutes. This highlights the scalability of our approach. Results are shown in table 1.
1. NELL-7 and NELL-11 Data Sets: In this case we have logical constraints, and thus this set of results is most relevant to the central research claims of this paper (our method was motivated by the use of such logical constraints). It is clear that our method outperforms all existing methods, including the state-of-the-art, by a significant margin. Both the MADs of the error rate estimates and the AUCs of the target function response estimates are significantly better.
2. uNELL and uBRAIN Data Sets: In this case there exist no logical constraints between the domains. Our method still almost always outperforms the competing methods and, more specifically, it always does so in terms of error rate estimation MAD. 
This set of results makes it clear that our method can also be used effectively in cases where there are no logical constraints.

Acknowledgements

We would like to thank Abulhair Saparov and Otilia Stretcu for the useful feedback they provided on early versions of this paper. This research was performed during an internship at Microsoft Research, and was also supported in part by NSF under award IIS1250956, and in part by a Presidential Fellowship from Carnegie Mellon University.

References

S. H. Bach, B. Huang, B. London, and L. Getoor. Hinge-Loss Markov Random Fields: Convex Inference for Structured Prediction. In Conference on Uncertainty in Artificial Intelligence, 2013.

S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-Loss Markov Random Fields and Probabilistic Soft Logic. CoRR, abs/1505.04406, 2015. URL http://dblp.uni-trier.de/db/journals/corr/corr1505.html#BachBHG15.

M.-F. Balcan, A. Blum, and Y. Mansour. Exploiting Ontology Structures and Unlabeled Data for Learning. In International Conference on Machine Learning, pages 1112–1120, 2013.

Y. Bengio and N. Chapados. Extensions to Metric-Based Model Selection. Journal of Machine Learning Research, 3:1209–1227, 2003.

M. Bröcheler, L. Mihalkova, and L. Getoor. Probabilistic Similarity Logic. In Conference on Uncertainty in Artificial Intelligence, pages 73–82, 2010.

J. Collins and M. Huynh. Estimation of Diagnostic Test Accuracy Without Full Verification: A Review of Latent Class Methods. Statistics in Medicine, 33(24):4141–4169, June 2014.

M. Collins and Y. Singer. Unsupervised Models for Named Entity Classification. In Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

S. Dasgupta, M. L. Littman, and D. McAllester. PAC Generalization Bounds for Co-training. In Neural Information Processing Systems, pages 375–382, 2001.

A. P. Dawid and A. M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28, 1979.

G. J. Klir and B. Yuan. Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995. ISBN 0-13-101171-5.

O. Madani, D. Pennock, and G. Flake. Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms. In Neural Information Processing Systems, 2004.

T. Mitchell, W. W. Cohen, E. Hruschka Jr, P. Pratim Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. A. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-Ending Learning. In Association for the Advancement of Artificial Intelligence, 2015.

P. G. Moreno, A. Artés-Rodríguez, Y. W. Teh, and F. Perez-Cruz. Bayesian Nonparametric Crowdsourcing. Journal of Machine Learning Research, 16, 2015.

F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up Statistical Inference in Markov Logic Networks Using an RDBMS. Proc. VLDB Endow., 4(6):373–384, Mar. 2011. ISSN 2150-8097. doi: 10.14778/1978665.1978669. URL http://dx.doi.org/10.14778/1978665.1978669.

F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and Combining Multiple Predictors Without Labeled Data. Proceedings of the National Academy of Sciences, 2014.

E. A. Platanios, A. Blum, and T. M. Mitchell. Estimating Accuracy from Unlabeled Data. In Conference on Uncertainty in Artificial Intelligence, 2014.

E. A. Platanios, A. Dubey, and T. M. Mitchell. Estimating Accuracy from Unlabeled Data: A Bayesian Approach. In International Conference on Machine Learning, pages 1416–1425, 2016.

M. Richardson and P. Domingos. Markov Logic Networks. Machine Learning, 62(1-2):107–136, 2006.

D. Schuurmans, F. Southey, D. Wilkinson, and Y. Guo. Metric-Based Approaches for Semi-Supervised Regression and Classification. In Semi-Supervised Learning. 2006.

T. Tian and J. Zhu. Max-Margin Majority Voting for Learning from Crowds. In Neural Information Processing Systems, 2015.

L. Wehbe, B. Murphy, P. Talukdar, A. Fyshe, A. Ramdas, and T. Mitchell. Predicting Brain Activity During Story Processing. In review, 2014.