{"title": "Submodular Hamming Metrics", "book": "Advances in Neural Information Processing Systems", "page_first": 3141, "page_last": 3149, "abstract": "We show that there is a largely unexplored class of functions (positive polymatroids) that can define proper discrete metrics over pairs of binary vectors and that are fairly tractable to optimize over. By exploiting submodularity, we are able to give hardness results and approximation algorithms for optimizing over such metrics. Additionally, we demonstrate empirically the effectiveness of these metrics and associated algorithms on both a metric minimization task (a form of clustering) and also a metric maximization task (generating diverse k-best lists).", "full_text": "Submodular Hamming Metrics\n\nJennifer Gillenwater\u2020, Rishabh Iyer\u2020, Bethany Lusch\u2217, Rahul Kidambi\u2020, Jeff Bilmes\u2020\n\n\u2020 University of Washington, Dept. of EE, Seattle, U.S.A.\n\n\u2217 University of Washington, Dept. of Applied Math, Seattle, U.S.A.\n\n{jengi, rkiyer, herwaldt, rkidambi, bilmes}@uw.edu\n\nAbstract\n\nWe show that there is a largely unexplored class of functions (positive polyma-\ntroids) that can de\ufb01ne proper discrete metrics over pairs of binary vectors and\nthat are fairly tractable to optimize over. By exploiting submodularity, we are\nable to give hardness results and approximation algorithms for optimizing over\nsuch metrics. Additionally, we demonstrate empirically the effectiveness of these\nmetrics and associated algorithms on both a metric minimization task (a form of\nclustering) and also a metric maximization task (generating diverse k-best lists).\n\n1 Introduction\n\ni=1\n\ni=1 1A(cid:52)B(i) =(cid:80)n\n\nthen: dH (A, B) = |A(cid:52)B| =(cid:80)n\n\nA good distance metric is often the key to an effective machine learning algorithm. For instance,\nwhen clustering, the distance metric largely de\ufb01nes which points end up in which clusters. 
Similarly,\nin large-margin learning, the distance between different labelings can contribute as much to the\nde\ufb01nition of the margin as the objective function itself. Likewise, when constructing diverse k-best\nlists, the measure of diversity is key to ensuring meaningful differences between list elements.\nWe consider distance metrics d : {0, 1}n \u00d7 {0, 1}n \u2192 R+ over binary vectors, x \u2208 {0, 1}n. If\nwe de\ufb01ne the set V = {1, . . . , n}, then each x = 1A can seen as the characteristic vector of a\nset A \u2286 V , where 1A(v) = 1 if v \u2208 A, and 1A(v) = 0 otherwise. For sets A, B \u2286 V , with\n(cid:52) representing the symmetric difference, A(cid:52)B (cid:44) (A \\ B) \u222a (B \\ A), the Hamming distance is\n1(1A(i) (cid:54)= 1B(i)). A Hamming distance\nbetween two vectors assumes that each entry difference contributes value one. Weighted Hamming\ndistance generalizes this slightly, allowing each entry a unique weight. The Mahalanobis distance\nfurther extends this. For many practical applications, however, it is desirable to have entries interact\nwith each other in more complex and higher-order ways than Hamming or Mahalanobis allow. Yet,\narbitrary interactions would result in non-metric functions whose optimization would be intractable.\nIn this work, therefore, we consider an alternative class of functions that goes beyond pairwise\ninteractions, yet is computationally feasible, is natural for many applications, and preserves metricity.\nGiven a set function f : 2V \u2192 R, we can de\ufb01ne a distortion between two binary vectors as\nfollows: df (A, B) = f (A(cid:52)B). By asking f to satisfy certain properties, we will arrive at a class\nof discrete metrics that is feasible to optimize and preserves metricity. 
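To make the construction df(A, B) = f(A△B) concrete, here is a minimal sketch that spot-checks the metric axioms by brute force on a small ground set; the square-root-of-cardinality f below is just one simple positive polymatroid chosen for illustration, not the only choice:

```python
from itertools import combinations

def sqrt_card(Y):
    """A simple positive polymatroid: f(Y) = sqrt(|Y|)."""
    return len(Y) ** 0.5

def d_f(A, B, f=sqrt_card):
    """Submodular Hamming distance d_f(A, B) = f(A symmetric-difference B)."""
    return f(set(A) ^ set(B))

# Spot-check the metric axioms over all subsets of a small ground set.
V = {1, 2, 3, 4}
subsets = [set(S) for r in range(len(V) + 1) for S in combinations(sorted(V), r)]
for A in subsets:
    assert d_f(A, A) == 0                     # identity
    for B in subsets:
        assert d_f(A, B) == d_f(B, A)         # symmetry
        for C in subsets:
            # triangle inequality (small slack for float rounding)
            assert d_f(A, C) <= d_f(A, B) + d_f(B, C) + 1e-12
```

Swapping in any other positive normalized monotone subadditive f passes the same checks, which is the content of Theorem 3.1 below.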
We say that f is positive if f(A) > 0 whenever A ≠ ∅; f is normalized if f(∅) = 0; f is monotone if f(A) ≤ f(B) for all A ⊆ B ⊆ V; f is subadditive if f(A) + f(B) ≥ f(A ∪ B) for all A, B ⊆ V; f is modular if f(A) + f(B) = f(A ∪ B) + f(B ∩ A) for all A, B ⊆ V; and f is submodular if f(A) + f(B) ≥ f(A ∪ B) + f(B ∩ A) for all A, B ⊆ V. If we assume that f is positive, normalized, monotone, and subadditive then df(A, B) is a metric (see Theorem 3.1), but without useful computational properties. If f is positive, normalized, monotone, and modular, then we recover the weighted Hamming distance. In this paper, we assume that f is positive, normalized, monotone, and submodular (and hence also subadditive). These conditions are sufficient to ensure the metricity of df, but allow for a significant generalization over the weighted Hamming distance. Also, thanks to the properties of submodularity, this class yields efficient optimization algorithms with guarantees

Table 1: Hardness for SH-min and SH-max. UC stands for unconstrained, and Card stands for cardinality-constrained. The entry "open" implies that the problem is potentially poly-time solvable.

               SH-min                                                      SH-max
        homogeneous                   heterogeneous                 homogeneous   heterogeneous
UC      Open                          4/3                           3/4           3/4
Card    Ω(√n / (1+(√n−1)(1−κf)))      Ω(√n / (1+(√n−1)(1−κf)))      1 − 1/e       1 − 1/e

Table 2: Approximation guarantees of algorithms for SH-min and SH-max. '-' implies that no guarantee holds for the corresponding pair. 
BEST-B only works for the homogeneous case, while all other algorithms work in both cases.

            UNION-SPLIT (UC)   UNION-SPLIT (Card)   BEST-B (UC)   MAJOR-MIN (Card)       RAND-SET (UC)
SH-min      2                  -                    2 − 2/m       n / (1+(n−1)(1−κf))    -
SH-max      1/4                1/(2e)               -             -                      1/8

for practical machine learning problems. In what follows, we will refer to normalized monotone submodular functions as polymatroid functions; all of our results will be concerned with positive polymatroids. We note here that despite the restrictions described above, the polymatroid class is in fact quite broad; it contains a number of natural choices of diversity and coverage functions, such as set cover, facility location, saturated coverage, and concave-over-modular functions.

Given a positive polymatroid function f, we refer to df(A, B) = f(A△B) as a submodular Hamming (SH) distance. We study two optimization problems involving these metrics (each fi is a positive polymatroid, each Bi ⊆ V, and C denotes a combinatorial constraint):

SH-min: min_{A∈C} Σ_{i=1}^m fi(A△Bi),   and   SH-max: max_{A∈C} Σ_{i=1}^m fi(A△Bi).   (1)

We will use F as shorthand for the sequence (f1, . . . , fm), B for the sequence (B1, . . . , Bm), and F(A) for the objective function Σ_{i=1}^m fi(A△Bi). We will also make a distinction between the homogeneous case where all fi are the same function, and the more general heterogeneous case where each fi may be distinct. In terms of constraints, in this paper's theory we consider only the unconstrained (C = 2^V) and the cardinality-constrained (e.g., |A| ≥ k, |A| ≤ k) settings. 
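On a toy ground set, the two objectives in (1) can be evaluated by brute force; the √-cardinality fi below are illustrative stand-ins for arbitrary positive polymatroids:

```python
from itertools import combinations

def sqrt_card(Y):
    return len(Y) ** 0.5

def F(A, B_seq, f_seq):
    """F(A) = sum_i f_i(A symmetric-difference B_i)."""
    return sum(f(A ^ B) for f, B in zip(f_seq, B_seq))

V = {1, 2, 3, 4}
B_seq = [{1, 2}, {2, 3}]
f_seq = [sqrt_card, sqrt_card]      # homogeneous case: identical f_i

subsets = [set(S) for r in range(len(V) + 1) for S in combinations(sorted(V), r)]
sh_min = min(subsets, key=lambda A: F(A, B_seq, f_seq))   # centroid-finding
sh_max = max(subsets, key=lambda A: F(A, B_seq, f_seq))   # diversification
```

As expected, the SH-min solution sits "between" the Bi (here {1, 2}, at cost √2), while the SH-max solution is far from both.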
In general though, C could express more complex concepts such as knapsack constraints, or that solutions must be an independent set of a matroid, or a cut (or spanning tree, path, or matching) in a graph.

Intuitively, the SH-min problem can be thought of as a centroid-finding problem; the minimizing A should be as similar to the Bi's as possible, since a penalty of fi(A△Bi) is paid for each difference. Analogously, the SH-max problem can be thought of as a diversification problem; the maximizing A should be as distinct from all Bi's as possible, as fi(A△Bi) is awarded for each difference. Given modular fi (the weighted Hamming distance case), these optimization problems can be solved exactly and efficiently for many constraint types. For the more general case of submodular fi, we establish several hardness results and offer new approximation algorithms, as summarized in Tables 1 and 2. Our main contribution is to provide (to our knowledge) the first systematic study of the properties of submodular Hamming (SH) metrics, by showing metricity, describing potential machine learning applications, and providing optimization algorithms for SH-min and SH-max.

The outline of this paper is as follows. In Section 2, we offer further motivation by describing several applications of SH-min and SH-max to machine learning. In Section 3, we prove that for a positive polymatroid function f, the distance df(A, B) = f(A△B) is a metric. 
Then, in Sections 4 and 5 we give hardness results and approximation algorithms, and in Section 6 we demonstrate the practical advantage that submodular metrics have over modular metrics for several real-world applications.

2 Applications

We motivate SH-min and SH-max by showing how they occur naturally in several applications.

Clustering: Many clustering algorithms, including for example k-means [1], use distance functions in their optimization. If each item i to be clustered is represented by a binary feature vector bi ∈ {0, 1}^n, then counting the disagreements between bi and bj is one natural distance function. Defining sets Bi = {v : bi(v) = 1}, this count is equivalent to the Hamming distance |Bi△Bj|. Consider a document clustering application where V is the set of all features (e.g., n-grams) and Bi is the set of features for document i. Hamming distance has value 2 both when Bi△Bj = {"submodular", "synapse"} and when Bi△Bj = {"submodular", "modular"}. Intuitively, however, a smaller distance seems warranted in the latter case since the difference is only in one rather than two distinct concepts. The submodular Hamming distances we propose in this work can easily capture this type of behavior. Given feature clusters W, one can define a submodular function as: f(Y) = Σ_{W∈W} √|Y ∩ W|. Applying this with Y = Bi△Bj, if the documents' differences are confined to one cluster, the distance is smaller than if the differences occur across several word clusters. In the case discussed above, the distances are 2 and √2. If this submodular Hamming distance is used for k-means clustering, then the mean-finding step becomes an instance of the SH-min problem. That is, if cluster j contains documents Cj, then its mean takes exactly the following SH-min form: µj ∈ argmin_{A⊆V} Σ_{i∈Cj} f(A△Bi).

Structured prediction: Structured support vector machines (SVMs) typically rely on Hamming distance to compare candidate structures to the true one. The margin required between the correct structure score and a candidate score is then proportional to their Hamming distance. Consider the problem of segmenting an image into foreground and background. Let Bi be image i's true set of foreground pixels. Then Hamming distance between Bi and a candidate segmentation with foreground pixels A counts the number of mis-labeled pixels. However, both [2] and [3] observe poor performance with Hamming distance and recent work by [4] shows improved performance with richer distances that are supermodular functions of A. One potential direction for further enriching image segmentation distance functions is thus to consider non-modular functions from within our submodular Hamming metrics class. These functions have the ability to correct for the over-penalization that the current distance functions may suffer from when the same kind of difference happens repeatedly. For instance, if Bi differs from A only in the pixels local to a particular block of the image, then current distance functions could be seen as over-estimating the difference. Using a submodular Hamming function, the "loss-augmented inference" step in SVM optimization becomes an SH-max problem. More concretely, if the segmentation model is defined by a submodular graph cut g(A), then we have: max_{A⊆V} g(A) + f(A△Bi). (Note that g(A) = g(A△∅).) 
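A brute-force sketch of this loss-augmented inference objective on a tiny "image" of four pixels; the modular g below is a hypothetical stand-in for a real segmentation score (not an actual graph cut), and f(Y) = √|Y| is one illustrative polymatroid loss:

```python
from itertools import combinations

def f(Y):
    return len(Y) ** 0.5          # submodular Hamming loss f(Y) = sqrt(|Y|)

def g(A):
    # Toy stand-in for a segmentation model score (assumed, for illustration).
    scores = {1: 0.8, 2: 0.6, 3: -0.2, 4: -0.5}
    return sum(scores[v] for v in A)

B = {1, 2}                         # true foreground pixels
V = {1, 2, 3, 4}
subsets = [set(S) for r in range(len(V) + 1) for S in combinations(sorted(V), r)]

# Loss-augmented inference: find the candidate that is simultaneously
# high-scoring under g and far from the truth B under f.
A_star = max(subsets, key=lambda A: g(A) + f(A ^ B))
```

The maximizer trades model score against distance from B, which is exactly what makes the problem an instance of SH-max rather than plain monotone maximization.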
In fact, [5] observes superior results with this type of loss-augmented inference using a special case of a submodular Hamming metric for the task of multi-label image classification.

Diverse k-best: For some machine learning tasks, rather than finding a model's single highest-scoring prediction, it is helpful to find a diverse set of high-quality predictions. For instance, [6] showed that for image segmentation and pose tracking a diverse set of k solutions tended to contain a better predictor than the top k highest-scoring solutions. Additionally, finding diverse solutions can be beneficial for accommodating user interaction. For example, consider the task of selecting 10 photos to summarize the 100 photos that a person took while on vacation. If the model's best prediction (a set of 10 images) is rejected by the user, then the system should probably present a substantially different prediction on its second try. Submodular functions are a natural model for several summarization problems [7, 8]. Thus, given a submodular summarization model g, and a set of existing diverse summaries A1, A2, . . . , A(k−1), one could find a kth summary to present to the user by solving: Ak = argmax_{A⊆V, |A|=ℓ} g(A) + Σ_{i=1}^{k−1} f(A△Ai). If f and g are both positive polymatroids, then this constitutes an instance of the SH-max problem.

3 Properties of the submodular Hamming metric

We next show several interesting properties of the submodular Hamming distance. Proofs for all theorems and lemmas can be found in the supplementary material. We begin by showing that any positive polymatroid function of A△B is a metric. In fact, we show the more general result that any positive normalized monotone subadditive function of A△B is a metric. This result is known (see for instance Chapter 8 of [9]), but we provide a proof (in the supplementary material) for completeness.

Theorem 3.1. 
Let f : 2^V → R be a positive normalized monotone subadditive function. Then df(A, B) = f(A△B) is a metric on A, B ⊆ V.

While these subadditive functions are metrics, their optimization is known to be very difficult. The simple subadditive function example in the introduction of [10] shows that subadditive minimization is inapproximable, and Theorem 17 of [11] states that no algorithm exists for subadditive maximization that has an approximation factor better than Õ(√n). By contrast, submodular minimization is poly-time in the unconstrained setting [12], and a simple greedy algorithm from [13] gives a 1 − 1/e-approximation for maximization of positive polymatroids subject to a cardinality constraint. Many other approximation results are also known for submodular function optimization subject to various other types of constraints. Thus, in this work we restrict ourselves to positive polymatroids.

Corollary 3.1.1. Let f : 2^V → R+ be a positive polymatroid function. Then df(A, B) = f(A△B) is a metric on A, B ⊆ V.

This restriction does not entirely resolve the question of optimization hardness though. Recall that the optimization in SH-min and SH-max is with respect to A, but that the fi are applied to the sets A△Bi. Unfortunately, the function gB(A) = f(A△B), for a fixed set B, is neither necessarily submodular nor supermodular in A. The next example demonstrates this violation of submodularity.

Example 3.1.1. To be submodular, the function gB(A) = f(A△B) must satisfy the following condition for all sets A1, A2 ⊆ V: gB(A1) + gB(A2) ≥ gB(A1 ∪ A2) + gB(A1 ∩ A2). 
Consider the positive polymatroid function f(Y) = √|Y| and let B consist of two elements: B = {b1, b2}. Then for A1 = {b1} and A2 = {c} (with c ∉ B): gB(A1) + gB(A2) = 1 + √3 < 2√2 = gB(A1 ∪ A2) + gB(A1 ∩ A2).

Although gB(A) = f(A△B) can be non-submodular, we are interestingly still able to make use of the fact that f is submodular in A△B to develop approximation algorithms for SH-min and SH-max.

4 Minimization of the submodular Hamming metric

In this section, we focus on SH-min (the centroid-finding problem). We consider the four cases from Table 1: the constrained (A ∈ C ⊂ 2^V) and unconstrained (A ∈ C = 2^V) settings, as well as the homogeneous case (where all fi are the same function) and the heterogeneous case. Before diving in, we note that in all cases we assume not only the natural oracle access to the objective function F(A) = Σ_{i=1}^m fi(A△Bi) (i.e., the ability to evaluate F(A) for any A ⊆ V), but also knowledge of the Bi (the B sequence). Theorem 4.1 shows that without knowledge of B, SH-min is inapproximable. In practice, requiring knowledge of B is not a significant limitation; for all of the applications described in Section 2, B is naturally known.

Theorem 4.1. Let f be a positive polymatroid function. Suppose that the subset B ⊆ V is fixed but unknown and gB(A) = f(A△B). If we only have an oracle for gB, then there is no poly-time approximation algorithm for minimizing gB, up to any polynomial approximation factor.

4.1 Unconstrained setting

Submodular minimization is poly-time in the unconstrained setting [12]. Since a sum of submodular functions is itself submodular, at first glance it might then seem that the sum of fi in SH-min can be minimized in poly-time. However, recall from Example 3.1.1 that the fi's are not necessarily submodular in the optimization variable, A. 
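The violation in Example 3.1.1 is straightforward to verify numerically:

```python
import math

def f(Y):
    return math.sqrt(len(Y))       # positive polymatroid f(Y) = sqrt(|Y|)

B = {"b1", "b2"}
g = lambda A: f(A ^ B)             # g_B(A) = f(A symmetric-difference B)

A1, A2 = {"b1"}, {"c"}             # c is not in B
lhs = g(A1) + g(A2)                # = 1 + sqrt(3)
rhs = g(A1 | A2) + g(A1 & A2)      # = sqrt(2) + sqrt(2)
assert lhs < rhs                   # submodularity of g would require lhs >= rhs
```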
This means that the hardness of SH-min, even in the unconstrained setting, is an open question. Theorem 4.2 resolves this question for the heterogeneous case, showing that it is NP-hard and that no algorithm can do better than a 4/3-approximation guarantee. The question of hardness in the homogeneous case remains open.

Theorem 4.2. The unconstrained and heterogeneous version of SH-min is NP-hard. Moreover, no poly-time algorithm can achieve an approximation factor better than 4/3.

Since unconstrained SH-min is NP-hard, it makes sense to consider approximation algorithms for this problem. We first provide a simple 2-approximation, UNION-SPLIT (see Algorithm 1). This algorithm splits f(A△B) = f((A \ B) ∪ (B \ A)) into f(A \ B) + f(B \ A), then applies standard submodular minimization (see e.g. [14]) to the split function. Theorem 4.3 shows that this algorithm is a 2-approximation for SH-min. It relies on Lemma 4.2.1, which we state first.

Lemma 4.2.1. Let f be a positive monotone subadditive function. Then, for any A, B ⊆ V:

f(A△B) ≤ f(A \ B) + f(B \ A) ≤ 2f(A△B).   (2)

Algorithm 1 UNION-SPLIT
Input: F, B
Define f′i(Y) = fi(Y \ Bi) + fi(Bi \ Y)
Define F′(Y) = Σ_{i=1}^m f′i(Y)
Output: SUBMODULAR-OPT(F′)

Algorithm 2 BEST-B
Input: F, B
A ← B1
for i = 2, . . . , m do
    if F(Bi) < F(A): A ← Bi
Output: A

Algorithm 3 MAJOR-MIN
Input: F, B, C
A ← ∅
repeat
    c ← F(A)
    Set wF̂ as in Equation 3
    A ← MODULAR-MIN(wF̂, C)
until F(A) = c
Output: A

Theorem 4.3. UNION-SPLIT is a 2-approximation for unconstrained SH-min.

Restricting to the homogeneous setting, we can provide a different algorithm that has a better approximation guarantee than UNION-SPLIT. 
This algorithm simply checks the value of F(A) = Σ_{i=1}^m f(A△Bi) for each A = Bi and returns the minimizing Bi. We call this algorithm BEST-B (Algorithm 2). Theorem 4.4 gives the approximation guarantee for BEST-B. This result is known [15], as the proof of the guarantee only makes use of metricity and homogeneity (not submodularity), and these properties are common to much other work. We provide the proof in our notation for completeness though.

Theorem 4.4. For m = 1, BEST-B exactly solves unconstrained SH-min. For m > 1, BEST-B is a (2 − 2/m)-approximation for unconstrained homogeneous SH-min.

4.2 Constrained setting

In the constrained setting, the SH-min problem becomes more difficult. Essentially, all of the hardness results established in existing work on constrained submodular minimization apply to the constrained SH-min problem as well. Theorem 4.5 shows that, even for a simple cardinality constraint and identical fi (homogeneous setting), not only is SH-min NP-hard, but also it is hard to approximate with a factor better than Ω(√n).

Theorem 4.5. Homogeneous SH-min is NP-hard under cardinality constraints. Moreover, no algorithm can achieve an approximation factor better than Ω(√n / (1 + (√n − 1)(1 − κf))), where κf = 1 − min_{j∈V} f(j | V \ j) / f(j) denotes the curvature of f. This holds even when m = 1.

We can also show similar hardness results for several other combinatorial constraints including matroid constraints, shortest paths, spanning trees, cuts, etc. [16, 17]. Note that the hardness established in Theorem 4.5 depends on a quantity κf, which is also called the curvature of a submodular function [18, 16]. Intuitively, this factor measures how close a submodular function is to a modular function. 
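The curvature κf = 1 − min_{j∈V} f(j | V \ j)/f(j) can be computed directly from value-oracle access; a small sketch:

```python
import math

def curvature(f, V):
    """kappa_f = 1 - min_j f(j | V \\ {j}) / f({j})."""
    V = set(V)
    return 1 - min(
        (f(V) - f(V - {j})) / f({j})    # gain of j at V \ {j}, over its singleton value
        for j in V
    )

sqrt_card = lambda Y: math.sqrt(len(Y))
modular   = lambda Y: float(len(Y))

assert curvature(modular, {1, 2, 3}) == 0.0     # modular functions have zero curvature
k = curvature(sqrt_card, {1, 2, 3})             # = 1 - (sqrt(3) - sqrt(2))
```

The closer κf is to 0, the closer f is to modular and, per Theorem 4.5, the weaker the inapproximability bound.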
The result suggests that the closer the function is to being modular, the easier it is to optimize. This makes sense, since with a modular function, SH-min can be exactly minimized under several combinatorial constraints. To see this for the cardinality-constrained case, first note that for modular fi, the corresponding F-function is also modular. Lemma 4.5.1 formalizes this.

Lemma 4.5.1. If the fi in SH-min are modular, then F(A) = Σ_{i=1}^m fi(A△Bi) is also modular.

Given Lemma 4.5.1, from the definition of modularity we know that there exists some constant C and vector wF ∈ R^n, such that F(A) = C + Σ_{j∈A} wF(j). From this representation it is clear that F can be minimized subject to the constraint |A| ≥ k by choosing as the set A the items corresponding to the k smallest entries in wF. Thus, for modular fi, or fi with small curvature κfi, such constrained minimization is relatively easy.

Having established the hardness of constrained SH-min, we now turn to considering approximation algorithms for this problem. Unfortunately, the UNION-SPLIT algorithm from the previous section requires an efficient algorithm for submodular function minimization, and no such algorithm exists in the constrained setting; submodular minimization is NP-hard even under simple cardinality constraints [19]. Similarly, the BEST-B algorithm breaks down in the constrained setting; its guarantees carry over only if all the Bi are within the constraint set C. Thus, for the constrained SH-min problem we instead propose a majorization-minimization algorithm. Theorem 4.6 shows that this algorithm has an O(n) approximation guarantee, and Algorithm 3 formally defines the algorithm.

Essentially, MAJOR-MIN proceeds by iterating the following two steps: constructing F̂, a modular upper bound for F at the current solution A, then minimizing F̂ to get a new A. 
F̂ consists of superdifferentials [20, 21] of F's component submodular functions. We use the superdifferentials defined as "grow" and "shrink" in [22]. Defining sets S, T as S = V \ j, T = A△Bi for "grow", and S = (A△Bi) \ j, T = ∅ for "shrink", the wF̂ vector that represents the modular F̂ can be written:

wF̂(j) = Σ_{i=1}^m { fi(j | S) if j ∈ A△Bi;  fi(j | T) otherwise },   (3)

where f(Y | X) = f(Y ∪ X) − f(X) is the gain in f-value when adding Y to X. We now state the main theorem characterizing algorithm MAJOR-MIN's performance on SH-min.

Theorem 4.6. MAJOR-MIN is guaranteed to improve the objective value, F(A) = Σ_{i=1}^m fi(A△Bi), at every iteration. Moreover, for any constraint over which a modular function can be exactly optimized, it has a max_i ( |A*△Bi| / (1 + (|A*△Bi| − 1)(1 − κfi(A*△Bi))) ) approximation guarantee, where A* is the optimal solution of SH-min.

While MAJOR-MIN does not have a constant-factor guarantee (which is possible only in the unconstrained setting), the bounds are not too far from the hardness of the constrained setting. For example, in the cardinality case, the guarantee of MAJOR-MIN is n / (1 + (n − 1)(1 − κf)), while the hardness shown in Theorem 4.5 is Ω(√n / (1 + (√n − 1)(1 − κf))).

5 Maximization of the submodular Hamming metric

We next characterize the hardness of SH-max (the diversification problem) and describe approximation algorithms for it. We first show that all versions of SH-max, even the unconstrained homogeneous one, are NP-hard. Note that this is a non-trivial result. Maximization of a monotone function such as a polymatroid is not NP-hard; the maximizer is always the full set V. 
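Composed with a symmetric difference, however, monotonicity in the optimization variable is lost; a quick check with the cardinality function (modular, hence a positive polymatroid):

```python
# f is monotone in its argument Y, yet g(A) = f(A symmetric-difference B)
# is not monotone in A:
f = len                      # cardinality function
B = {1, 2, 3}
g = lambda A: f(A ^ B)

assert g(set()) == 3         # A = empty set: A symdiff B = B
assert g({1}) == 2           # growing A toward B *decreases* g
assert g({1, 2, 3}) == 0     # A = B: distance zero
assert g({1, 2, 3, 4}) == 1  # growing A past B increases g again
```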
But, for SH-max, despite the fact that the fi are monotone with respect to their argument A△Bi, they are not monotone with respect to A itself. This makes SH-max significantly harder. After establishing that SH-max is NP-hard, we show that no poly-time algorithm can obtain an approximation factor better than 3/4 in the unconstrained setting, and a factor of (1 − 1/e) in the constrained setting. Finally, we provide a simple approximation algorithm which achieves a factor of 1/4 for all settings.

Theorem 5.1. All versions of SH-max (constrained or unconstrained, heterogeneous or homogeneous) are NP-hard. Moreover, no poly-time algorithm can obtain a factor better than 3/4 for the unconstrained versions, or better than 1 − 1/e for the cardinality-constrained versions.

We turn now to approximation algorithms. For the unconstrained setting, Lemma 5.1.1 shows that simply choosing a random subset A ⊆ V provides a 1/8-approximation in expectation.

Lemma 5.1.1. A random subset is a 1/8-approximation for SH-max in the unconstrained (homogeneous or heterogeneous) setting.

An improved approximation guarantee of 1/4 can be shown for a variant of UNION-SPLIT (Algorithm 1), if the call to SUBMODULAR-OPT is a call to a SUBMODULAR-MAX algorithm. Theorem 5.2 makes this precise for both the unconstrained case and a cardinality-constrained case. It might also be of interest to consider more complex constraints, such as matroid independence and base constraints, but we leave the investigation of such settings to future work.

Theorem 5.2. Maximizing F̄(A) = Σ_{i=1}^m (fi(A \ Bi) + fi(Bi \ A)) with a bi-directional greedy algorithm [23, Algorithm 2] is a linear-time 1/4-approximation for maximizing F(A) = Σ_{i=1}^m fi(A△Bi), in the unconstrained setting. 
Under the cardinality constraint |A| ≤ k, using the randomized greedy algorithm [24, Algorithm 1] provides a 1/(2e)-approximation.

Table 3: mV-ROUGE averaged over the 14 datasets (± standard deviation).

HM             SP             TP
0.38 ± 0.14    0.43 ± 0.20    0.50 ± 0.26

Table 4: # of wins (out of 14 datasets).

HM    SP    TP
1     3     10

6 Experiments

To demonstrate the effectiveness of the submodular Hamming metrics proposed here, we apply them to a metric minimization task (clustering) and a metric maximization task (diverse k-best).

6.1 SH-min application: clustering

We explore the document clustering problem described in Section 2, where the groundset V is all unigram features and Bi contains the unigrams of document i. We run k-means clustering and at each iteration find the mean for cluster Cj by solving: µj ∈ argmin_{A:|A|≥ℓ} Σ_{i∈Cj} f(A△Bi). The constraint |A| ≥ ℓ requires the mean to contain at least ℓ unigrams, which helps k-means to create richer and more meaningful cluster centers. We compare using the submodular function f(Y) = Σ_{W∈W} √|Y ∩ W| (SM) to using Hamming distance (HM). The problem of finding µj above can be solved exactly for HM, since it is a modular function. In the SM case, we apply MAJOR-MIN (Algorithm 3). As an initial test, we generate synthetic data consisting of 100 "documents" assigned to 10 "true" clusters. We set the number of "word" features to n = 1000, and partition the features into 100 word classes (the W in the submodular function). Ten word classes are associated with each true document cluster, and each document contains one word from each of these word classes. That is, each word is contained in only one document, but documents in the same true cluster have words from the same word classes. 
We set the minimum cluster center size to ℓ = 100. We use k-means++ initialization [25] and average over 10 trials. Within the k-means optimization, we enforce that all clusters are of equal size by assigning a document to the closest center whose current size is < 10. With this setup, the average accuracy of HM is 28.4% (±2.4), while SM is 69.4% (±10.5). The HM accuracy is essentially the accuracy of a random assignment of documents to clusters; this makes sense, as no documents share words, rendering the Hamming distance useless. In real-world data there would likely be some word overlap though; to better model this, we let each document contain a random sampling of 10 words from the word clusters associated with its document cluster. In this case, the average accuracy of HM is 57.0% (±6.8), while SM is 88.5% (±8.4). The results for SM are even better if randomization is removed from the initialization (we simply choose the next center to be one with greatest distance from the current centers). In this case, the average accuracy of HM is 56.7% (±7.1), while SM is 100% (±0.0). This indicates that as long as the starting point for SM contains one document from each cluster, the SM optimization will recover the true clusters.

Moving beyond synthetic data, we applied the same method to the problem of clustering NIPS papers. The initial set of documents that we consider consists of all NIPS papers1 from 1987 to 2014. We filter the words of a given paper by first removing stopwords and any words that don't appear at least 3 times in the paper. We further filter by removing words that have small tf-idf value (< 0.001) and words that occur in only one paper or in more than 10% of papers. We then filter the papers themselves, discarding any that have fewer than 25 remaining words and for each other paper retaining only its top (by tf-idf score) 25 words. 
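The word-filtering pipeline just described can be sketched as follows; this is a simplified sketch (stopword removal and the document-frequency cutoffs are omitted, and the tf-idf weighting shown is one standard variant, not necessarily the exact one used here):

```python
import math
from collections import Counter

def top_tfidf_words(papers, min_count=3, top_k=25):
    """For each paper (a list of word tokens): keep words appearing at least
    min_count times in that paper, then retain the paper's top_k words by tf-idf."""
    n_docs = len(papers)
    df = Counter()                                  # document frequency per word
    for words in papers:
        df.update(set(words))
    result = []
    for words in papers:
        tf = {w: c for w, c in Counter(words).items() if c >= min_count}
        tfidf = {w: c / len(words) * math.log(n_docs / df[w]) for w, c in tf.items()}
        result.append(set(sorted(tfidf, key=tfidf.get, reverse=True)[:top_k]))
    return result
```

Each resulting word set then plays the role of a Bi in the SH-min clustering objective.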
Each of the 5,522 remaining papers defines a Bi set. Among the Bi there are 12,262 unique words. To get the word clusters W, we first run the WORD2VEC code of [26], which generates a 100-dimensional real-valued vector of features for each word, and then run k-means clustering with Euclidean distance on these vectors to define 100 word clusters. We set the center size cardinality constraint to ℓ = 100 and set the number of document clusters to k = 10. To initialize, we again use k-means++ [25], with k = 10. Results are averaged over 10 trials. While we do not have groundtruth labels for NIPS paper clusters, we can use within-cluster distances as a proxy for cluster goodness (lower values, indicating tighter clusters, are better). Specifically, we compute: k-means-score = Σ_{j=1}^k Σ_{i∈Cj} g(µj△Bi). With Hamming for g, the average ratio of HM's k-means-score to SM's is 0.916 ± 0.003. This indicates that, as expected, HM does a better job of optimizing the Hamming loss. However, with the submodular function for g, the average ratio of HM's k-means-score to SM's is 1.635 ± 0.038. Thus, SM does a significantly better job optimizing the submodular loss.

1 Papers were downloaded from http://papers.nips.cc/.

6.2 SH-max application: diverse k-best

In this section, we explore a diverse k-best image collection summarization problem, as described in Section 2. For this problem, our goal is to obtain k summaries, each of size ℓ, by selecting from a set consisting of n ≫ ℓ images. The idea is that either: (a) the user could choose from among these k summaries the one that they find most appealing, or (b) a (more computationally expensive) model could be applied to re-rank these k summaries and choose the best. 
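The sequential scheme described above (each new summary trades off quality against distance from earlier summaries) can be sketched as follows; a toy sketch in which a random similarity matrix stands in for real image features, and plain greedy selection stands in for the paper's solvers:

```python
import random

def facility_location(A, S):
    """g(A) = sum_i max_{j in A} S[i][j] (0 for the empty set)."""
    if not A:
        return 0.0
    return sum(max(S[i][j] for j in A) for i in range(len(S)))

def greedy_summary(V, S, prev, ell):
    """Greedily build one size-ell summary maximizing
    g(A) + sum over previous summaries P of g(A symmetric-difference P)."""
    def objective(A):
        return facility_location(A, S) + sum(facility_location(A ^ P, S) for P in prev)
    A = set()
    while len(A) < ell:
        A.add(max(V - A, key=lambda v: objective(A | {v})))
    return A

random.seed(0)
n = 8
S = [[random.random() for _ in range(n)] for _ in range(n)]   # toy similarities
V = set(range(n))
summaries = []
for _ in range(3):                 # k = 3 summaries of size ell = 2
    summaries.append(greedy_summary(V, S, summaries, 2))
```

Using g itself as the diversity term f corresponds to the SM variant below; replacing it with symmetric-difference cardinality recovers the HM baseline.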
As is described in Section 2, we obtain the kth summary A_k, given the first k − 1 summaries A_{1:k−1}, via: A_k = argmax_{A ⊆ V, |A| = ℓ} [ g(A) + Σ_{i=1}^{k−1} f(A △ A_i) ].

For g we use the facility location function: g(A) = Σ_{i∈V} max_{j∈A} S_ij, where S_ij is a similarity score for images i and j. We compute S_ij by taking the dot product of the ith and jth feature vectors, which are the same as those used by [8]. For f we compare two different functions: (1) f(A △ A_i) = |A △ A_i|, the Hamming distance (HM), and (2) f(A △ A_i) = g(A △ A_i), the submodular facility location distance (SM). For HM we optimize via the standard greedy algorithm [13]; since the facility location function g is monotone submodular, this implies an approximation guarantee of (1 − 1/e). For SM, we experiment with two algorithms: (1) standard greedy [13], and (2) UNION-SPLIT (Algorithm 1) with standard greedy as the SUBMODULAR-OPT function. We will refer to these two cases as "single part" (SP) and "two part" (TP). Note that neither of these optimization techniques has a formal approximation guarantee, though the latter would if, instead of standard greedy, we used the bi-directional greedy algorithm of [23]. We opt to use standard greedy, though, as it typically performs much better in practice. We employ the image summarization dataset from [8], which consists of 14 image collections, each of which contains n = 100 images. For each image collection, we seek k = 15 summaries of size ℓ = 10. For evaluation, we employ the V-ROUGE score developed by [8]; the mean V-ROUGE (mV-ROUGE) of the k summaries provides a quantitative measure of their goodness.

Figure 1: An example photo montage (zoom in to see detail) showing 15 summaries of size 10 (one per row) from the HM approach (left) and the TP approach (right), for image collection #6.
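The SP variant above simply runs standard greedy on the combined objective g(A) + Σ_i g(A △ A_i). A minimal Python sketch, assuming a precomputed similarity matrix S (names are illustrative, and for brevity this omits the lazy-evaluation speedups one would use in practice; the TP variant additionally applies the UNION-SPLIT decomposition and is not sketched here):

```python
import numpy as np

def facility_location(S, A):
    """g(A) = sum_i max_{j in A} S[i, j]; by convention g(empty) = 0."""
    if not A:
        return 0.0
    return float(S[:, sorted(A)].max(axis=1).sum())

def greedy_kth_summary(S, prev, ell):
    """Greedily build A_k maximizing g(A) + sum_i g(A ^ A_i) with |A| = ell,
    where prev holds the earlier summaries A_1, ..., A_{k-1} (as sets)."""
    V = set(range(S.shape[0]))
    obj = lambda A: facility_location(S, A) + sum(
        facility_location(S, A ^ Ai) for Ai in prev)
    A = set()
    for _ in range(ell):
        # add the element with greatest combined-objective value
        v_star = max(V - A, key=lambda v: obj(A | {v}))
        A.add(v_star)
    return A
```

Calling this repeatedly for k = 1, ..., 15, with `prev` accumulating the returned summaries, produces the diverse k-best list: the Σ_i g(A △ A_i) term rewards choosing images outside the earlier summaries.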
V-ROUGE scores are normalized such that a score of 0 corresponds to randomly generated summaries, while a score of 1 is on par with human-generated summaries.

Table 3 shows that SP and TP outperform HM in terms of mean mV-ROUGE, providing support for the idea of using submodular Hamming distances in place of (modular) Hamming for diverse k-best applications. TP also outperforms SP, suggesting that the objective-splitting used in UNION-SPLIT is of practical significance. Table 4 provides additional evidence of TP's superiority, indicating that for 10 out of the 14 image collections, TP has the best mV-ROUGE score of the three approaches. Figure 1 provides some qualitative evidence of TP's goodness. Notice that the images in the green rectangle tend to be more redundant with images from the previous summaries in the HM case than in the TP case; the HM solution contains many images with a "sky" theme, while TP contains more images with other themes. This shows that the HM solution lacks diversity across summaries. The quality of the individual summaries also tends to become poorer for the later HM sets; considering the images in the red rectangles overlaid on the montage, the HM sets contain many images of tree branches here. By contrast, the TP summary quality remains good even for the last few summaries.

7 Conclusion

In this work we defined a new class of distance functions: submodular Hamming metrics. We established hardness results for the associated SH-min and SH-max problems, and provided approximation algorithms. Further, we demonstrated the practicality of these metrics for several applications. There remain several open theoretical questions (e.g., the tightness of the hardness results and the NP-hardness of SH-min), as well as many opportunities for applying submodular Hamming metrics to other machine learning problems (e.g., the prediction application from Section 2).

References

[1] S. Lloyd.
Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[2] T. Hazan, S. Maji, J. Keshet, and T. Jaakkola. Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions. In NIPS, 2013.
[3] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs Using Graph Cuts. In ECCV, 2008.
[4] A. Osokin and P. Kohli. Perceptually Inspired Layout-Aware Losses for Image Segmentation. In ECCV, 2014.
[5] J. Yu and M. Blaschko. Learning Submodular Losses with the Lovász Hinge. In ICML, 2015.
[6] D. Batra, P. Yadollahpour, A. Guzman, and G. Shakhnarovich. Diverse M-Best Solutions in Markov Random Fields. In ECCV, 2012.
[7] H. Lin and J. Bilmes. A Class of Submodular Functions for Document Summarization. In ACL, 2011.
[8] S. Tschiatschek, R. Iyer, H. Wei, and J. Bilmes. Learning Mixtures of Submodular Functions for Image Collection Summarization. In NIPS, 2014.
[9] P. Halmos. Measure Theory. Springer, 1974.
[10] S. Jegelka and J. Bilmes. Approximation Bounds for Inference using Cooperative Cuts. In ICML, 2011.
[11] M. Bateni, M. Hajiaghayi, and M. Zadimoghaddam. Submodular Secretary Problem and Extensions. Technical report, MIT, 2010.
[12] W. H. Cunningham. On Submodular Function Minimization. Combinatorica, 3:185–192, 1985.
[13] G. Nemhauser, L. Wolsey, and M. Fisher. An Analysis of Approximations for Maximizing Submodular Set Functions I. Mathematical Programming, 14(1):265–294, 1978.
[14] S. Fujishige. Submodular Functions and Optimization. Elsevier, 2nd edition, 2005.
[15] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[16] R. Iyer, S. Jegelka, and J. Bilmes. Curvature and Efficient Approximation Algorithms for Approximation and Minimization of Submodular Functions. In NIPS, 2013.
[17] G. Goel, C. Karande, P. Tripathi, and L. Wang.
Approximability of Combinatorial Problems with Multi-Agent Submodular Cost Functions. In FOCS, 2009.
[18] J. Vondrák. Submodularity and Curvature: The Optimal Algorithm. RIMS Kokyuroku Bessatsu, 23, 2010.
[19] Z. Svitkina and L. Fleischer. Submodular Approximation: Sampling-Based Algorithms and Lower Bounds. In FOCS, 2008.
[20] S. Jegelka and J. Bilmes. Submodularity Beyond Submodular Energies: Coupling Edges in Graph Cuts. In CVPR, 2011.
[21] R. Iyer and J. Bilmes. The Submodular Bregman and Lovász-Bregman Divergences with Applications. In NIPS, 2012.
[22] R. Iyer, S. Jegelka, and J. Bilmes. Fast Semidifferential-Based Submodular Function Optimization. In ICML, 2013.
[23] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. A Tight Linear Time (1/2)-Approximation for Unconstrained Submodular Maximization. In FOCS, 2012.
[24] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. Submodular Maximization with Cardinality Constraints. In SODA, 2014.
[25] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. In SODA, 2007.
[26] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, 2013.