{"title": "Robust Conditional Probabilities", "book": "Advances in Neural Information Processing Systems", "page_first": 6359, "page_last": 6368, "abstract": "Conditional probabilities are a core concept in machine learning. For example, optimal prediction of a label $Y$ given an input $X$ corresponds to maximizing the conditional probability of $Y$ given $X$. A common approach to inference tasks is learning a model of conditional probabilities. However, these models are often based on strong assumptions (e.g., log-linear models), and hence their estimate of conditional probabilities is not robust and is highly dependent on the validity of their assumptions. Here we propose a framework for reasoning about conditional probabilities without assuming anything about the underlying distributions, except knowledge of their second order marginals, which can be estimated from data. We show how this setting leads to guaranteed bounds on conditional probabilities, which can be calculated efficiently in a variety of settings, including structured-prediction. Finally, we apply them to semi-supervised deep learning, obtaining results competitive with variational autoencoders.", "full_text": "Robust Conditional Probabilities\n\nSchool of Computer Science and Engineering\n\nThe Balvatnik School of Computer Science\n\nAmir Globerson\n\nTel-Aviv University\ngamir@mail.tau.ac.il\n\nYoav Wald\n\nHebrew University\n\nyoav.wald@mail.huji.ac.il\n\nAbstract\n\nConditional probabilities are a core concept in machine learning. For example,\noptimal prediction of a label Y given an input X corresponds to maximizing the\nconditional probability of Y given X. A common approach to inference tasks is\nlearning a model of conditional probabilities. 
However, these models are often\nbased on strong assumptions (e.g., log-linear models), and hence their estimate of\nconditional probabilities is not robust and is highly dependent on the validity of\ntheir assumptions.\nHere we propose a framework for reasoning about conditional probabilities without\nassuming anything about the underlying distributions, except knowledge of their\nsecond order marginals, which can be estimated from data. We show how this\nsetting leads to guaranteed bounds on conditional probabilities, which can be calcu-\nlated ef\ufb01ciently in a variety of settings, including structured-prediction. Finally, we\napply them to semi-supervised deep learning, obtaining results competitive with\nvariational autoencoders.\n\n1\n\nIntroduction\n\nIn classi\ufb01cation tasks the goal is to predict a label Y for an object X. Assuming that the joint\ndistribution of these two variables is p\u2217(x, y) then optimal prediction1 corresponds to returning\nthe label y that maximizes the conditional probability p\u2217(y|x). Thus, being able to reason about\nconditional probabilities is fundamental to machine learning and probabilistic inference.\nIn the fully supervised setting, one can sidestep the task of estimating conditional probabilities by\ndirectly learning a classi\ufb01er in a discriminative fashion. However, in unsupervised or semi-supervised\nsettings, a reliable estimate of the conditional distributions becomes important. For example, consider\na self-training [17, 31] or active learning setting. In both scenarios, the learner has a set of unlabeled\nsamples and it needs to choose which ones to tag. 
Given an unlabeled sample x, if we could reliably\nconclude that p\u2217(y|x) is close to 1 for some label y, we could easily decide whether to tag x or not.\nIntuitively, an active learner would prefer not to tag x, while a self-training algorithm would tag it.\nThere are of course many approaches to \u201cmodelling\u201d conditional distributions, from logistic regression\nto conditional random fields. However, these do not come with any guarantees of approximation\nto the true underlying conditional distributions of p\u2217, and thus cannot be used to reliably reason\nabout them. This is because such models make assumptions about the conditionals (e.g.,\nconditional independence or parametric form), which are unlikely to be satisfied in practice.\nAs an illustrative example of our motivation and setup, consider a set of n binary variables\nX1, ..., Xn whose distribution we are interested in. Suppose we have enough data to obtain\nthe joint marginals, P [Xi = xi, Xj = xj], of pairs i, j in a set E. If (1, 2) \u2208 E and we con-\ncluded that P [X1 = 1|X2 = 1] = 1, this lets us reason about many other probabilities. For ex-\nample, we know that P [X1 = 1|X2 = 1, . . . , Xn = xn] = 1 for any setting of the x3, . . . , xn\n\n1In the sense of minimizing prediction error.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fvariables. This is a simple but powerful observation, as it translates knowledge about prob-\nabilities over small subsets to robust estimates of conditional probability over large subsets.\nNow, what happens when P [X1 = 1|X2 = 1] = 0.99? In other words, what can we say about\nP [X1 = 1|X2 = 1, . . . , Xn = xn] given information about the probabilities P [Xi = xi, Xj = xj]? 
As\nwe show here, it is still possible to reason about such conditional probabilities even under this partial\nknowledge.\nMotivated by the above, we propose a novel model-free approach for reasoning about conditional\nprobabilities. Specifically, we shall show how conditional probabilities can be lower bounded when\nthe only assumption made is that certain low-order marginals of the distribution are known. One of the\nsurprising outcomes of our analysis is that these lower bounds can be calculated efficiently, and often\nhave an elegant closed form. Finally, we show how these bounds can be used in a semi-supervised\nsetting, obtaining results that are competitive with variational autoencoders [11].\n\n2 Problem Setup\n\nWe begin by defining notation to be used in what follows. Let X denote a vector of random variables\nX1, . . . , Xn which are the features and Y denote labels. If we have a single label we will denote it by\nY ; otherwise, a multivariate label will be denoted by Y1, . . . , Yr. X, Y are generated by an unknown\nunderlying distribution p\u2217(X, Y ). All variables are discrete (i.e., can take on a finite set of values).\nHere we will assume that although we do not know p\u2217, we have access to some of its low order\nmarginals, such as those of a single feature and a label:\n\n\u00b5i(xi, y) = \u2211_{\u00afx1,...,\u00afxn : \u00afxi=xi} p\u2217(\u00afx1, . . . , \u00afxn, y).\n\nSimilarly we may have access to the set of pairwise marginals \u00b5ij(xi, xj, y) for all i, j \u2208 E, where\nthe set E corresponds to edges of a graph G (see also [7]). Denote the set of all such marginals by \u00b5.\nFor simplicity we assume the marginals are exact. Generally they are of course only approximate, but\nconcentration bounds can be used to quantify this accuracy as a function of data size. 
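As a concrete illustration of the setup, the pairwise marginals \u00b5ij(xi, xj, y) can be estimated from a labeled sample by simple counting. The sketch below is our own illustrative code (names and storage layout are ours, not from the paper):

```python
import numpy as np

def pairwise_marginals(X, y, edges, n_vals, n_labels):
    """Empirical pairwise marginals mu_ij(x_i, x_j, y) from labeled data.

    X: (m, n) integer feature matrix, y: (m,) integer labels,
    edges: list of (i, j) index pairs from the graph G.
    All names here are illustrative, not taken from the paper's code.
    """
    m = X.shape[0]
    mu = {}
    for (i, j) in edges:
        counts = np.zeros((n_vals, n_vals, n_labels))
        for s in range(m):
            counts[X[s, i], X[s, j], y[s]] += 1
        mu[(i, j)] = counts / m   # empirical estimate of p*(x_i, x_j, y)
    return mu
```

With enough data these counts concentrate around the true marginals, which is the sense in which the exactness assumption is a simplification.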
Furthermore,\nmost of the methods described here can be extended to inexact marginals (e.g., see [6] for an approach\nthat can be applied here).\nSince \u00b5 does not uniquely specify a distribution p\u2217, we will be interested in the set of all distributions\nthat attain these marginals. Denote this set by P(\u00b5), namely:\n\nP(\u00b5) = { p \u2208 \u2206 : \u2211_{\u00afx1,...,\u00afxn : \u00afxi=xi} p(\u00afx1, . . . , \u00afxn, y) = \u00b5i(xi, y) \u2200i, xi, y },    (1)\n\nwhere \u2206 is the probability simplex of the appropriate dimension.\nMore generally, one may consider some vector function f : X, Y \u2192 R^d and its expected value\naccording to p\u2217, denoted by a = Ep\u2217 [f (X, Y )]. Then the corresponding set of distributions is:\n\nP(a) = {p \u2208 \u2206 : Ep [f (X, Y )] = a} .\n\nSince marginals are expectations of indicator random variables [30], this generalizes the notation given above.\n\n2.1 The Robust Conditionals Problem\n\nOur approach is to reason about conditional distributions using only the fact that p\u2217 \u2208 P(\u00b5). Our\nkey goal is to lower bound these conditionals, since this will allow us to conclude that certain labels\nare highly likely in cases where the lower bound is large. We shall also be interested in upper and\nlower bounding joint probabilities, since these will play a key role in bounding the conditionals.\nOur goal is thus to solve the following optimization problems:\n\nmin_{p\u2208P(\u00b5)} p(x, y),    max_{p\u2208P(\u00b5)} p(x, y),    min_{p\u2208P(\u00b5)} p(y | x).    (2)\n\nIn all three problems, the constraint set is linear in p. However, note that p is specified by an\nexponential number of variables (one per assignment x1, . . . , xn, y) and thus it is not feasible to plug\nthese constraints into an LP solver. In terms of objective, the min and max problems are linear, and\nthe conditional is fractional linear. 
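To make Eq. (2) concrete: for a toy number of variables, the first problem can be solved directly by enumerating all assignments and handing the linear program to an off-the-shelf solver. The sketch below is our own sanity check (using singleton marginals \u00b5i(xi, y) only; names are ours, not the paper's):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def min_joint_prob(mu, n, n_vals, n_labels, x_star, y_star):
    """Brute-force min_{p in P(mu)} p(x*, y*) from Eq. (2), here with
    singleton marginals mu[i][v, y] only. Enumerates every assignment,
    so it is feasible only for toy n; names are ours."""
    assignments = [a + (yv,)
                   for a in itertools.product(range(n_vals), repeat=n)
                   for yv in range(n_labels)]
    idx = {a: k for k, a in enumerate(assignments)}
    A_eq, b_eq = [], []
    # marginal-matching constraints defining P(mu)
    for i in range(n):
        for v in range(n_vals):
            for yv in range(n_labels):
                row = np.zeros(len(assignments))
                for a in assignments:
                    if a[i] == v and a[-1] == yv:
                        row[idx[a]] = 1.0
                A_eq.append(row)
                b_eq.append(mu[i][v, yv])
    # p must lie on the probability simplex
    A_eq.append(np.ones(len(assignments)))
    b_eq.append(1.0)
    c = np.zeros(len(assignments))
    c[idx[tuple(x_star) + (y_star,)]] = 1.0   # objective: p(x*, y*)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
    return res.fun
```

This brute force clearly cannot scale past a handful of variables, which is exactly what motivates the compact formulations developed next.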
In what follows we show how all three problems can be solved\nefficiently for tree shaped graphs.\n\n2\n\n\f3 Related Work\n\nThe problem of reasoning about a distribution based on its expected values has a long history, with\nmany beautiful mathematical results. An early example is the classical Chebyshev inequality, which\nbounds the tail of a distribution given its first and second moments. This was significantly extended\nin the Chebyshev-Markov-Stieltjes inequality [2]. More recently, various generalized Chebyshev\ninequalities have been developed [4, 22, 27] and some further results tying moments with bounds on\nprobabilities have been shown (e.g. [18]). A typical statement of these is that several moments are\ngiven, and one seeks the minimum measure of some set S under any distribution that agrees with\nthe moments. As [4] notes, most of these problems are NP hard, with isolated cases of tractability.\nSuch inequalities have been used to obtain minimax optimal linear classifiers in [14]. The moment\nproblems we consider here are very different from those considered previously, in terms of the finite\nsupport we require and our focus on bounding probabilities and conditional probabilities of assignments.\nThe above approaches consider worst case bounds on probabilities of events for distributions in P(a).\nA different approach is to pick a particular distribution in P(a) as an approximation (or model) of p\u2217.\nThe most common choice here is the maximum entropy distribution in P(a). Such log-linear models\nhave found widespread use in statistics and machine learning. In particular, most graphical models\ncan be viewed as distributions of this type (e.g., see [12, 13]). However, probabilities given by these\nmodels cannot be related to the true probabilities in any sense (e.g., upper or lower bound). This is\nwhere our approach markedly differs from entropy based methods. 
Another approach to reduce\nmodeling assumptions is robust optimization, where data and certain model parameters are assumed\nnot to be known precisely, and optimality is sought in a worst case adversarial setting. This approach\nhas been applied to machine learning in various settings (e.g, see [32, 16]), establishing close links to\nregularization. None of these approaches considers bounding probabilities as is our focus here.\nFinally, another elegant moment approach is that based on kernel mean embedding [23, 24]. In this\napproach, one maps a distribution into a set of expected values of a set of functions (possibly in\ufb01nite).\nThe key observation is that this mean embedding lies in an RKHS, and hence many operations can be\ndone implicitly. Most of the applications of this idea assume that the set of functions is rich enough to\nfully specify the distribution (i.e., characteristic kernels [25]). The focus is thus different from ours,\nwhere moments are not assumed to be fully informative, and the set P(a) contains many possible\ndistributions. It would however be interesting to study possible uses of RKHS in our setting.\n\n4 Calculating Robust Conditional Probabilities\n\nThe optimization problems in Eq. (2) are linear programs (LP) and fractional LPs, where the number\nof variables scales exponentially with n. Yet, as we show in this section and Section 5, it turns\nout that in many non-trivial cases, they can be ef\ufb01ciently solved. Our focus below is on the case\nwhere the set of edges E corresponding to the pairwise marginals forms a tree structured graph.\nThe tree structure assumption is common in literature on Graphical Models, only here we do not\nmake an inductive assumption on the generating distribution (i.e., we make none of the conditional\nindependence assumptions that are implied by tree-structured graphical models). In the following\nsections we study solutions of robust conditional probabilities under the tree assumption. 
We will\nalso discuss some extensions to the cyclic case. Finally, note that although the derivations here are\nfor pairwise marginals, these can be extended to the non-pairwise case by considering clique-trees\n[e.g., see 30]. Pairs are used here to allow a clearer presentation.\nIn what follows, we show that the conditional lower bound has a simple structure as stated in Theorem\n4.1. This result does not immediately suggest an ef\ufb01cient algorithm since its denominator includes an\nexponentially sized LP. Next, in Section 4.2 we show how this LP can be reduced to polynomial sized,\nresulting in an ef\ufb01cient algorithm for the lower bound. Finally, in Section 5 we show that in certain\ncases there is no need to use a general purpose LP solver and the problem can be solved either in\nclosed form or via combinatorial algorithms. Detailed proofs are provided in the supplementary \ufb01le.\n\n4.1 From Conditional Probabilities To Maximum Probabilities with Exclusion\n\nThe main result of this section will reduce calculation of the robust conditional probability for\np(y | x), to one of maximizing the probability of all labels other than y. This reduction by itself will\nnot allow for ef\ufb01cient calculation of the desired conditional probabilities, as the new problem is also\n\n3\n\n\fa large LP that needs to be solved. Still the result will take us one step further towards a solution, as\nit reveals the probability mass a minimizing distribution p will assign to x, y.\nThis part of the solution is related to a result from [8], where the authors derive the solution of\nminp\u2208P(\u00b5) p(x, y). 
They prove that under the tree assumption this problem has a simple closed form\nsolution, given by the functional I(x, y ; \u00b5):\n\nI(x, y ; \u00b5) = [ \u2211_i (1 \u2212 di) \u00b5i(xi, y) + \u2211_{ij\u2208E} \u00b5ij(xi, xj, y) ]_+ .    (3)\n\nHere [\u00b7]_+ denotes the ReLU function [z]_+ = max{z, 0} and di is the degree of node i in G.\nIt turns out that robust conditional probabilities will assign the event x, y its minimal possible\nprobability as given in Eq. (3). Moreover, they will assign all other labels their maximum possible\nprobability. This is indeed the behaviour that may be expected from a robust bound; we formalize it in\nthe main result for this part:\n\nTheorem 4.1 Let \u00b5 be a vector of tree-structured pairwise marginals, then\n\nmin_{p\u2208P(\u00b5)} p(y | x) = I(x, y ; \u00b5) / ( I(x, y ; \u00b5) + max_{p\u2208P(\u00b5)} \u2211_{\u00afy\u2260y} p(x, \u00afy) ).    (4)\n\nThe proof of this theorem is rather technical and we leave it for the supplementary material.\nWe note that the above result also applies to the \u201cstructured-prediction\u201d setting where y is multivariate\nand we also assume knowledge of marginals \u00b5(yi, yj). In this case, the expression for I(x, y ; \u00b5)\nwill also include edges between yi variables, and incorporate their degrees in the graph.\nThe important implication of Theorem 4.1 is that it reduces the minimum conditional problem to that\nof probability maximization with an assignment exclusion, namely:\n\nmax_{p\u2208P(\u00b5)} \u2211_{\u00afy\u2260y} p(x, \u00afy).    (5)\n\nAlthough this is still a problem with an exponential number of variables, we show in the next section\nthat it can be solved efficiently.\n\n4.2 Minimizing and Maximizing Probabilities\n\nTo provide an efficient solution for Eq. 
(5), we turn to a class of joint probability bounding problems.\nAssume we constrain each variable Xi and Yj to a subset \u00afXi, \u00afYj of its domain and would like to\nreason about the probability of this constrained set of joint assignments:\n\nU = { x, y | xi \u2208 \u00afXi, yj \u2208 \u00afYj \u2200i \u2208 [n], j \u2208 [r] } .    (6)\n\nUnder this setting, an efficient algorithm for solving\n\nmax_{p\u2208P(\u00b5)} \u2211_{u\u2208U\\(x,y)} p(u),\n\nwill also solve Eq. (5). By the results of the last section, we will then also have an algorithm that calculates\nrobust conditional probabilities. To see this is indeed the case, assume we are given an assignment\n(x, y). Then setting \u00afXi = {xi} for all features and \u00afYj = {1, . . . , |Yj|} for labels (i.e. U does not\nrestrict labels), gives exactly Eq. (5).\nTo derive the algorithm, we will find a compact representation of the LP, with a polynomial number of\nvariables and constraints. The result is obtained by using tools from the literature on Graphical Models.\nIt shows how to formulate probability maximization problems over U as problems constrained by the\nlocal marginal polytope [30]. Its definition in our setting slightly deviates from its standard definition,\nas it does not require that probabilities sum up to 1:\n\nDefinition 1 The set of locally consistent pseudo-marginals over U is defined as:\n\nML(U) = { \u02dc\u00b5 | \u2211_{xi\u2208\u00afXi} \u02dc\u00b5ij(xi, xj) = \u02dc\u00b5j(xj) \u2200(i, j) \u2208 E, xj \u2208 \u00afXj }.\n\nThe partition function of \u02dc\u00b5, Z(\u02dc\u00b5), is given by \u2211_{xi\u2208\u00afXi} \u02dc\u00b5i(xi).\n\n4\n\n\fThe following theorem states that solving Eq. (5) is equivalent to solving an LP over ML(U) with\nadditional constraints.\n\nTheorem 4.2 Let U be a universe of assignments as defined in Eq. 
(6), x \u2208 U and \u00b5 a vector of\ntree-structured pairwise marginals. Then the values of the following problems:\n\nmax_{p\u2208P(\u00b5)} \u2211_{u\u2208U} p(u),    max_{p\u2208P(\u00b5)} \u2211_{u\u2208U\\(x,y)} p(u),\n\nare equal (respectively) to:\n\nmax_{\u02dc\u00b5\u2208ML(U), \u02dc\u00b5\u2264\u00b5} Z(\u02dc\u00b5),    max_{\u02dc\u00b5\u2208ML(U), \u02dc\u00b5\u2264\u00b5, I(x,y ; \u02dc\u00b5)\u22640} Z(\u02dc\u00b5).    (7)\n\nThese LPs involve a polynomial number of constraints and variables, and thus can be solved efficiently.\n\nProofs of this result can be obtained by exploiting properties of functions that decompose over trees.\nIn the supplementary material, we provide a proof similar to that given in [30] to show equality of the\nmarginal and local-marginal polytopes in tree models.\nTo conclude this section, we restate the main result: the robust conditional probability problem Eq. (2)\ncan be solved in polynomial time by combining Theorems 4.1 and 4.2. As a by-product of this\nderivation we also presented efficient tools for bounding answers to a large class of probabilistic\nqueries. While this is not the focus of the current paper, these tools may be useful in probabilistic\nmodelling, where we often combine estimates of low order marginals with assumptions on the data\ngenerating process. Bounds like the ones presented in this section give a quantitative estimate of the\nuncertainty that is induced by the data and circumvented by our assumptions.\n\n5 Closed Form Solutions and Combinatorial Algorithms\n\nThe results of the previous section imply that the minimum conditional can be found by solving a\npoly-sized LP. Although this results in polynomial runtime, it is interesting to improve as much as\npossible on the complexity of this calculation. One reason is that application of the bounds might\nrequire solving them repeatedly within some larger learning problem. 
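The compact LP of Theorem 4.2 (without the additional I(x, y ; \u02dc\u00b5) \u2264 0 constraint) can be written down directly with a generic solver. The sketch below is our own reading of Definition 1: we impose edge-to-node consistency in both directions, take Z(\u02dc\u00b5) as the sum over node 0 (on a connected graph the consistency constraints make this choice immaterial), and all names are ours:

```python
import numpy as np
from scipy.optimize import linprog

def max_restricted_mass(mu_pair, domains, edges):
    """LP of Theorem 4.2 (first problem): maximize Z(mu_tilde) over
    locally consistent pseudo-marginals bounded above by the given
    pairwise marginals mu_pair[(i, j)][vi, vj]. `domains[i]` is the
    restricted set Xbar_i. Imposing consistency in both edge directions
    is our reading of Definition 1; names are ours."""
    n = len(domains)
    var_idx, k = {}, 0
    for i in range(n):                      # node pseudo-marginals
        for v in domains[i]:
            var_idx[('n', i, v)] = k; k += 1
    for (i, j) in edges:                    # edge pseudo-marginals
        for vi in domains[i]:
            for vj in domains[j]:
                var_idx[('e', i, j, vi, vj)] = k; k += 1
    A_eq, b_eq = [], []
    for (i, j) in edges:
        for vj in domains[j]:   # sum_{vi} mu_ij(vi, vj) = mu_j(vj)
            row = np.zeros(k)
            for vi in domains[i]:
                row[var_idx[('e', i, j, vi, vj)]] = 1.0
            row[var_idx[('n', j, vj)]] = -1.0
            A_eq.append(row); b_eq.append(0.0)
        for vi in domains[i]:   # and the symmetric direction
            row = np.zeros(k)
            for vj in domains[j]:
                row[var_idx[('e', i, j, vi, vj)]] = 1.0
            row[var_idx[('n', i, vi)]] = -1.0
            A_eq.append(row); b_eq.append(0.0)
    ub = np.ones(k)             # mu_tilde <= mu on the edge variables
    for (i, j) in edges:
        for vi in domains[i]:
            for vj in domains[j]:
                ub[var_idx[('e', i, j, vi, vj)]] = mu_pair[(i, j)][vi, vj]
    c = np.zeros(k)             # maximize Z = sum of node-0 variables
    for v in domains[0]:
        c[var_idx[('n', 0, v)]] = -1.0   # linprog minimizes
    res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=list(zip(np.zeros(k), ub)))
    return -res.fun
```

With all domains restricted to singletons this recovers max_{p\u2208P(\u00b5)} p(x), which on a tree equals the minimum edge marginal, as used in Section 5.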
For instance, in classification\ntasks it may be necessary to solve Eq. (4) for each sample in the dataset. An even more demanding\nprocedure will come up in our experimental evaluation, where we learn features that result in high\nconfidence under our bounds. There, we need to solve Eq. (4) over mini-batches of training data\nonly to calculate a gradient at each training iteration. Since using an LP solver in these scenarios is\nimpractical, we next derive more efficient solutions for some special cases of Eq. (4).\n\n5.1 Closed Form for Multiclass Problems\n\nThe multiclass setting is a special case of Eq. (4) where y is a single label variable (e.g., a digit label\nin MNIST with values y \u2208 {0, . . . , 9}). The solution of course depends on the type of marginals\nprovided in P(\u00b5). Here we will assume that we have access to joint marginals of the label y and pairs\nof features xi, xj corresponding to edges ij \u2208 E of a graph G. We note that we can obtain similar\nresults for the case where some additional \u201cunlabeled\u201d statistics \u00b5ij(xi, xj) are known.\nIt turns out that in both cases Eq. (5) has a simple solution. Here we write it for the case without\nunlabeled statistics. The following lemma is based on a result that states max_{p\u2208P(\u00b5)} p(x) =\nmin_{ij} \u00b5ij(xi, xj), which we prove in the supplementary material.\n\nLemma 5.1 Let x \u2208 X and \u00b5 a vector of tree-structured pairwise marginals, then\n\nmin_{p\u2208P(\u00b5)} p(y | x) = I(x, y ; \u00b5) / ( I(x, y ; \u00b5) + \u2211_{\u00afy\u2260y} min_{ij} \u00b5ij(xi, xj, \u00afy) ).    (8)\n\n5.2 Combinatorial Algorithms and Connection to Maximum Flow Problems\n\nIn some cases, fast algorithms for the optimization problem in Eq. (5) can be derived by exploiting a\ntight connection of our problems to the Max-Flow problem. The problems are also closely related\n\n5\n\n\fto the weighted Set-Cover problem. 
To observe the connection to the latter, consider an instance of\nSet-Cover defined as follows. The universe is all assignments x. Sets are defined for each i, j, xi, xj\nand are denoted by Sij,xi,xj. The set Sij,xi,xj contains all assignments \u00afx whose values at i, j are\nxi, xj. Moreover, the set Sij,xi,xj has weight w(Sij,xi,xj) = \u00b5ij(xi, xj). Note that the number of\nitems in each set is exponential, but the number of sets is polynomial. Now consider using these sets\nto cover some set of assignments U with the minimum possible weight. It turns out that under the\ntree structure assumption, this problem is closely related to the problem of maximizing probabilities.\n\nLemma 5.2 Let U be a set of assignments and \u00b5 a vector of tree-structured marginals. Then:\n\nmax_{p\u2208P(\u00b5)} \u2211_{u\u2208U} p(u)    (9)\n\nhas the same value as the standard LP relaxation [28] of the Set-Cover problem above.\n\nThe connection to Set-Cover may not give a path to efficient algorithms, but it does illuminate some\nof the results presented earlier. It is simple to verify that min_{ij} \u00b5ij(xi, xj, \u00afy) is the weight of a cover\nof x, \u00afy, while Eq. (3) equals one minus the weight of a set that covers all assignments but x, y. A\nconnection that we may exploit to obtain more efficient algorithms is to Max-Flow. When the graph\ndefined by E is a chain, we show in the supplementary material that the value of Eq. (9) can be found\nby solving a flow problem on a simple network. We note that using the same construction, Eq. (5)\nturns out to be Max-Flow under a budget constraint [1]. This may prove very beneficial for our goals,\nas it allows for efficient calculation of the robust conditionals we are interested in. We conjecture\nthat this connection goes beyond chain graphs, but leave this for exploration in future work. 
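Putting Eq. (3) and Lemma 5.1 together gives a solver-free bound for the multiclass case. The following sketch computes Eq. (8); the code, names, and array layout (marginals indexed by values and label) are our own illustrative choices:

```python
import numpy as np

def robust_conditional_lower_bound(x, y, mu_single, mu_pair, edges, n_labels):
    """Closed-form multiclass bound of Eq. (8). mu_single[i][v, yv] and
    mu_pair[(i, j)][vi, vj, yv] hold the tree-structured marginals;
    storage layout and names are our own choices."""
    n = len(x)
    deg = [0] * n
    for (i, j) in edges:
        deg[i] += 1
        deg[j] += 1
    # I(x, y; mu) of Eq. (3): [sum_i (1 - d_i) mu_i + sum_ij mu_ij]_+
    I_xy = sum((1 - deg[i]) * mu_single[i][x[i], y] for i in range(n))
    I_xy += sum(mu_pair[(i, j)][x[i], x[j], y] for (i, j) in edges)
    I_xy = max(I_xy, 0.0)
    # max_p p(x, ybar) = min_ij mu_ij(x_i, x_j, ybar), summed over rivals
    rival = sum(min(mu_pair[(i, j)][x[i], x[j], yb] for (i, j) in edges)
                for yb in range(n_labels) if yb != y)
    return I_xy / (I_xy + rival) if I_xy + rival > 0 else 0.0
```

This runs in time linear in the number of edges and labels, which is what makes per-sample (and per-minibatch) evaluation practical in the experiments.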
The\nproofs for results in this section may also be found in the supplementary material.\n\n6 Experiments\n\nTo evaluate the utility of our bounds, we consider their use in settings of semi-supervised deep\nlearning and structured prediction. For the bounds to be useful, the marginal distributions need to\nbe sufficiently informative. In some datasets, the raw features already provide such information, as\nwe show in Section 6.3. In other cases, such as images, a single raw feature (i.e., a pixel) does not\nprovide sufficient information about the label. These cases are addressed in Section 6.1, where we\nshow how to learn new features which do result in meaningful bounds. Using deep networks to learn\nthese features turns out to be an effective method for semi-supervised settings, reaching results close\nto those demonstrated by Variational Autoencoders [11]. It would be interesting to use such feature\nlearning methods for structured prediction too; however this requires incorporation of the max-flow\nalgorithm into the optimization loop, and we defer this to future work.\n\n6.1 Deep Semi-Supervised Learning\n\nA well known approach to semi-supervised learning is to optimize an empirical loss, while adding\nanother term that measures prediction confidence on unlabeled data [9, 10]. Let us describe one such\nmethod and how to adapt it to use our bounds.\nEntropy Regularizer: Consider training a deep neural network where the last layer has n neurons\nz1, . . . , zn connected to a softmax layer of size |Y| (i.e. the number of labels), and the loss we use is\na cross entropy loss. Denote the weights of the softmax layer by W \u2208 R^{n\u00d7|Y|}. Given an input x,\ndefine the softmax distribution at the output of the network as:\n\n\u02dcpy = softmaxy(\u27e8Wy, z\u27e9),    (10)\n\nwhere Wy is the y\u2019th row of W. 
The min-entropy regularizer [9] adds an entropy term \u03b2H(\u02dcpy) to\nthe loss for each unlabeled x in the training set.\nPlugging in Robust Conditional Probabilities: We suggest a simple adaptation of this method that\nuses our bounds. Let us remove the softmax layer and set the activations of the neurons z1, . . . , zn to\na sigmoid activation. Let Z1, . . . , Zn denote random variables that take on the values of the output\nneurons; these variables will be used as features in our bounds (in previous sections we referred to\nfeatures as Xi; here we switch to Zi since Xi are understood as the raw features of the problem, e.g.,\nthe pixel values in the image). Since our bounds apply to discrete variables, while z1, . . . , zn are real\nvalued, we use a smoothed version of our bounds.\n\n6\n\n\fLoss Function and Smoothed Bounds: A smoothed version of the marginals \u00b5 is calculated by\nconsidering Zi as an indicator variable (e.g., the probability p(Zi = 1) would just be the average of\nthe Zi values). Then the smoothed marginal \u00af\u00b5(zi = 1, y) is the average of the zi values over all training\ndata labeled with y. In our experiments we used all the labeled data to estimate \u00af\u00b5 at each iteration.\nThe smoothed version of I(z, y ; \u00b5), which we shall call \u00afI(z, y ; \u00b5), is then calculated with Eq. (3),\nswitching \u00b5 with \u00af\u00b5 and the ReLU operator with a softplus.\nTo define a loss function we take a distribution over all labels:\n\n\u02dcpy = softmaxy( \u00afI(z, y ; \u00af\u00b5) / ( \u00afI(z, y ; \u00af\u00b5) + \u2211_{\u00afy\u2260y} min_{ij} \u00af\u00b5ij(zi, zj, \u00afy) ) ).    (11)\n\nThis is very similar to the standard distribution taken in a neural net, but it uses our bounds to make a\nmore robust estimate of the conditionals. 
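Viewed at a high level, Eq. (11) is simply a softmax over the robust lower bounds. Assuming the smoothed numerators \u00afI(z, y ; \u00af\u00b5) and the rival masses \u2211_{\u00afy\u2260y} min_{ij} \u00af\u00b5ij(zi, zj, \u00afy) have been precomputed per label, the layer reduces to the following (our illustrative sketch, not the authors' released code):

```python
import numpy as np

def robust_softmax(I_bar, rival_mass):
    """The distribution of Eq. (11): a softmax over robust conditional
    lower bounds. I_bar[y] is the smoothed bound Ibar(z, y; mu_bar) and
    rival_mass[y] is sum_{ybar != y} min_ij mu_bar_ij(z_i, z_j, ybar);
    both are assumed precomputed from the network's sigmoid features.
    A sketch of the layer, not the authors' released code."""
    I_bar = np.asarray(I_bar, dtype=float)
    rival_mass = np.asarray(rival_mass, dtype=float)
    # the robust lower bound per label (Eq. (8) form), used as a logit
    logits = I_bar / np.maximum(I_bar + rival_mass, 1e-12)
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

In training, every operation above is differentiable (given the softplus smoothing of Eq. (3)), so the layer can sit at the end of the network and be trained with the usual cross entropy plus entropy regularization.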
Then we use the exact same loss as the entropy regularizer:\na cross entropy loss for labeled data with an added entropy term for unlabeled instances.\n\n6.1.1 Algorithm Settings and Baselines\n\nWe implemented the min-entropy regularizer and our proposed method using a multilayer perceptron\n(MLP) with fully connected layers and a ReLU activation at each layer (except a sigmoid at the last\nlayer for our method). In our experiments we used hidden layers of sizes 1000, 500, 50 (so we learn\n50 features Z1, . . . , Z50). We also add \u21132 regularization on the weights of the soft-max layer for the\nentropy regularizer, since otherwise entropy can always be driven to zero in the separable case. We\nalso experimented with adding a hinge loss as a regularizer (as in Transductive SVM [10]), but omit\nit from the comparison because it did not yield significant improvement over the entropy regularizer.\nWe also compare our results with those obtained by Variational Autoencoders and Ladder Networks.\nAlthough we do not expect to reach accuracies as high as these methods, obtaining comparable numbers\nwith a simple regularizer like the one we suggest (compared to the elaborate techniques used in these\nworks) shows that the use of our bounds results in a very powerful method.\n\n6.2 MNIST Dataset\n\nWe trained the above models on the MNIST dataset, using 100 and 1000 labeled samples (see [11]\nfor a similar setup). We set the two regularization parameters required for the entropy regularizer and\nthe one required for our minimum probability regularizer with five fold cross validation. We used\n10% of the training data as a validation set and compared error rates on the 10^4 samples of the test set.\nResults are shown in Figure 1. They show that on the 1000 sample case we are slightly outperformed\nby VAE, and for 100 samples we lose by 1%. 
Ladder networks outperform the other baselines.\n\nN | Ladder [21] | VAE [11] | Robust Probs | Entropy | MLP+Noise\n100 | 1.06 (\u00b10.37) | 3.33 (\u00b10.14) | 4.44 (\u00b10.22) | 18.93 (\u00b10.54) | 21.74 (\u00b11.77)\n1000 | 0.84 (\u00b10.08) | 2.40 (\u00b10.02) | 2.48 (\u00b10.03) | 3.15 (\u00b10.03) | 5.70 (\u00b10.20)\n\nFigure 1: Error rates of several semi-supervised learning methods on the MNIST dataset with few\ntraining samples.\n\nAccuracy vs. Coverage Curves: In self-training and co-training methods, a classifier adds its\nmost confident predictions to the training set and then repeats training. A crucial factor in the success\nof such methods is the error in the predictions we add to the training pool. Classifiers that use\nconfidence over unlabelled data as a regularizer are natural choices for base classifiers in such a\nsetting. Therefore an interesting comparison to make is the accuracy we would get over the unlabeled\ndata, had the classifier been required to choose its k most confident predictions.\nWe plot this curve as a function of k for the entropy regularizer and our min-probabilities regularizer.\nSamples in the unlabelled training data are sorted in descending order according to confidence.\nConfidence for a sample in the entropy regularized MLP is calculated based on the value of the logit that\nthe predicted label received in the output layer. For the robust probabilities classifier, the confidence\nof a sample is the minimum conditional probability the predicted label received. As can be observed\nin Figure 2, our classifier ranks its predictions better than the entropy based method. We attribute\nthis to our classifier being trained to give robust bounds under minimal assumptions.\n\n7\n\n\fFigure 2: Accuracy for k most confident samples in unlabelled data. Blue curve shows results for\nthe Robust Probabilities Classifier, green for the Entropy Regularizer. 
Confidence is measured by\nconditional probabilities and logits, respectively.\n\n6.3 Multilabel Structured Prediction\n\nAs mentioned earlier, in the structured prediction setting it is more difficult to learn features that\nyield high certainty. We therefore provide a demonstration of our method on a dataset where the raw\nfeatures are relatively informative. The Genbase dataset, taken from [26], is a multilabel protein\nclassification dataset. It has 662 instances, divided into a training set of 463 samples and a test set of\n199; each sample has 1185 binary features and 27 binary labels. We ran a structured-SVM algorithm,\ntaken from [19], to obtain a classifier that outputs a labelling \u02c6y for each x in the dataset (the error\nof the resulting classifier was 2%). We then used our probabilistic bounds to rank the classifier\u2019s\npredictions by their robust conditional probabilities. The bounds were calculated based on the set of\nmarginals \u00b5ij(xi, yj), estimated from the data for each pair of a feature and a label Xi, Yj. The graph\ncorresponding to these marginals is not a tree and we handled it as discussed in Section 7. The value\nof our bounds was above 0.99 for 85% of the samples, indicating high certainty that the classifier is\ncorrect. Indeed only 0.59% of these 85% were actually errors. The remaining errors made by the\nclassifier were assigned a robust probability of 0 by our bounds, indicating a low level of certainty.\n\n7 Discussion\n\nWe presented a method for bounding conditional probabilities of a distribution based only on\nknowledge of its low order marginals. Our results can be viewed as a new type of moment problem,\nbounding a key component of machine learning systems, namely the conditional distribution. 
As we show, calculating these bounds raises many challenging optimization questions, which surprisingly admit closed-form solutions in some cases.
While our results were limited to the tree-structured case, some of the methods have natural extensions to the cyclic case that still yield robust estimates. For instance, the local marginal polytope in Eq. (7) can be taken over a cyclic structure and still give a lower bound on maximum probabilities. Also, in the presence of cycles, it is possible to find the spanning tree that induces the best bound on Eq. (3) using a maximum spanning tree algorithm. Plugging these solutions into Eq. (4) results in a tighter approximation, which we used in our experiments.
Our method can be extended in many interesting directions. Here we addressed the case of discrete random variables, although our experiments also showed how continuous features can be handled in this framework. It would be interesting to calculate bounds on conditional probabilities given expected values of continuous random variables. In this case, sums-of-squares characterizations play a key role [15, 20, 3], and their extension to the conditional case is an exciting challenge. It would also be interesting to study how these bounds can be used in the context of unsupervised learning. One natural approach would be to learn constraint functions such that the lower bound is maximized. Finally, we plan to study the implications of our approach for diverse learning settings, from self-training to active learning and safe reinforcement learning.

Acknowledgments: This work was supported by the ISF Centers of Excellence grant 2180/15, and by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

[1] R. K. Ahuja and J. B. Orlin. 
A capacity scaling algorithm for the constrained maximum flow problem. Networks, 25(2):89–98, 1995.

[2] N. I. Akhiezer. The classical moment problem: and some related questions in analysis, volume 5. Oliver & Boyd, 1965.

[3] A. Benavoli, A. Facchini, D. Piga, and M. Zaffalon. SOS for bounded rationality. In Proceedings of the Tenth International Symposium on Imprecise Probability: Theories and Applications, 2017.

[4] D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780–804, 2005.

[5] R. G. Cowell, P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks. Springer Science & Business Media, 2006.

[6] M. Dudík, S. J. Phillips, and R. E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8(Jun):1217–1260, 2007.

[7] E. Eban, E. Mezuman, and A. Globerson. Discrete Chebyshev classifiers. In Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR Workshop and Conference Proceedings Volume 32, pages 1233–1241, 2014.

[8] M. Fromer and A. Globerson. An LP view of the M-best MAP problem. In NIPS, volume 22, pages 567–575, 2009.

[9] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pages 529–536, 2005.

[10] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27-30, 1999, pages 200–209, 1999.

[11] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. 
In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 3581–3589, 2014.

[12] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.

[14] G. R. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3(Dec):555–582, 2002.

[15] J. B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.

[16] R. Livni, K. Crammer, and A. Globerson. A simple geometric interpretation of SVM using stochastic adversaries. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AI-STATS), pages 722–730. JMLR: W&CP, 2012.

[17] D. McClosky, E. Charniak, and M. Johnson. Effective self-training for parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 152–159. Association for Computational Linguistics, 2006.

[18] E. Miranda, G. De Cooman, and E. Quaeghebeur. The Hausdorff moment problem under finite additivity. Journal of Theoretical Probability, 20(3):663–693, 2007.

[19] A. C. Müller and S. Behnke. pystruct - learning structured prediction in Python. Journal of Machine Learning Research, 15:2055–2060, 2014.

[20] P. A. Parrilo. Semidefinite programming relaxations for semialgebraic problems. 
Mathematical Programming, 96(2):293–320, 2003.

[21] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3546–3554, 2015.

[22] J. E. Smith. Generalized Chebychev inequalities: Theory and applications in decision analysis. Operations Research, 43(5):807–825, 1995.

[23] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.

[24] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

[25] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, July 2011.

[26] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A Java library for multi-label learning. Journal of Machine Learning Research, 12:2411–2414, 2011.

[27] L. Vandenberghe, S. Boyd, and K. Comanor. Generalized Chebyshev bounds via semidefinite programming. SIAM Review, 49(1):52–64, 2007.

[28] V. V. Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.

[29] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree consistency and bounds on the performance of the max-product algorithm and its generalizations. Statistics and Computing, 14(2):143–166, 2004.

[30] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. 
Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[31] D. Weiss, C. Alberti, M. Collins, and S. Petrov. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China, July 2015. Association for Computational Linguistics.

[32] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485–1510, December 2009.