{"title": "Optimizing affinity-based binary hashing using auxiliary coordinates", "book": "Advances in Neural Information Processing Systems", "page_first": 640, "page_last": 648, "abstract": "In supervised binary hashing, one wants to learn a function that maps a high-dimensional feature vector to a vector of binary codes, for application to fast image retrieval. This typically results in a difficult optimization problem, nonconvex and nonsmooth, because of the discrete variables involved. Much work has simply relaxed the problem during training, solving a continuous optimization, and truncating the codes a posteriori. This gives reasonable results but is quite suboptimal. Recent work has tried to optimize the objective directly over the binary codes and achieved better results, but the hash function was still learned a posteriori, which remains suboptimal. We propose a general framework for learning hash functions using affinity-based loss functions that uses auxiliary coordinates. This closes the loop and optimizes jointly over the hash functions and the binary codes so that they gradually match each other. The resulting algorithm can be seen as an iterated version of the procedure of optimizing first over the codes and then learning the hash function. Compared to this, our optimization is guaranteed to obtain better hash functions while being not much slower, as demonstrated experimentally in various supervised datasets. In addition, our framework facilitates the design of optimization algorithms for arbitrary types of loss and hash functions.", "full_text": "Optimizing Af\ufb01nity-Based Binary Hashing\n\nUsing Auxiliary Coordinates\n\nRamin Raziperchikolaei\n\nMiguel \u00b4A. 
Carreira-Perpi \u02dcn\u00b4an\n\nEECS, University of California, Merced\nrraziperchikolaei@ucmerced.edu\n\nEECS, University of California, Merced\nmcarreira-perpinan@ucmerced.edu\n\nAbstract\n\nIn supervised binary hashing, one wants to learn a function that maps a high-\ndimensional feature vector to a vector of binary codes, for application to fast im-\nage retrieval. This typically results in a dif\ufb01cult optimization problem, nonconvex\nand nonsmooth, because of the discrete variables involved. Much work has simply\nrelaxed the problem during training, solving a continuous optimization, and trun-\ncating the codes a posteriori. This gives reasonable results but is quite suboptimal.\nRecent work has tried to optimize the objective directly over the binary codes and\nachieved better results, but the hash function was still learned a posteriori, which\nremains suboptimal. We propose a general framework for learning hash functions\nusing af\ufb01nity-based loss functions that uses auxiliary coordinates. This closes the\nloop and optimizes jointly over the hash functions and the binary codes so that\nthey gradually match each other. The resulting algorithm can be seen as an iter-\nated version of the procedure of optimizing \ufb01rst over the codes and then learning\nthe hash function. Compared to this, our optimization is guaranteed to obtain bet-\nter hash functions while being not much slower, as demonstrated experimentally\nin various supervised datasets. In addition, our framework facilitates the design of\noptimization algorithms for arbitrary types of loss and hash functions.\n\nInformation retrieval arises in several applications, most obviously web search. For example, in\nimage retrieval, a user is interested in \ufb01nding similar images to a query image. 
Computationally,\nthis essentially involves de\ufb01ning a high-dimensional feature space where each relevant image is\nrepresented by a vector, and then \ufb01nding the closest points (nearest neighbors) to the vector for the\nquery image, according to a suitable distance. For example, one can use features such as SIFT or\nGIST [23] and the Euclidean distance for this purpose. Finding nearest neighbors in a dataset of\nN images (where N can be millions), each a vector of dimension D (typically in the hundreds)\nis slow, since exact algorithms run essentially in time O(N D) and space O(N D) (to store the\nimage dataset). In practice, this is approximated, and a successful way to do this is binary hashing\n[12]. Here, given a high-dimensional vector x \u2208 RD, the hash function h maps it to a b-bit vector\nz = h(x) \u2208 {\u22121, +1}b, and the nearest neighbor search is then done in the binary space. This\nnow costs O(N b) time and space, which is orders of magnitude faster because typically b < D\nand, crucially, (1) operations with binary vectors (such as computing Hamming distances) are very\nfast because of hardware support, and (2) the entire dataset can \ufb01t in (fast) memory rather than slow\nmemory or disk.\n\nThe disadvantage is that the results are inexact, since the neighbors in the binary space will not be\nidentical to the neighbors in the original space. However, the approximation error can be controlled\nby using suf\ufb01ciently many bits and by learning a good hash function. This has been the topic of\nmuch work in recent years. The general approach consists of de\ufb01ning a supervised objective that has\na small value for good hash functions and minimizing it. Ideally, such an objective function should\nbe minimal when the neighbors of any given image are the same in both original and binary spaces.\nPractically, in information retrieval, this is often evaluated using precision and recall. 
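The XOR-and-popcount search in Hamming space described above can be sketched in a few lines. This is a minimal NumPy illustration, not code from the paper; packing the ±1 codes into machine words is what lets the hardware bit operations pay off.

```python
import numpy as np

def pack_codes(Z):
    """Pack an (N, b) array of {-1, +1} codes into bytes (8 bits per byte)."""
    bits = (Z > 0).astype(np.uint8)          # map {-1, +1} -> {0, 1}
    return np.packbits(bits, axis=1)         # (N, ceil(b/8)) uint8 words

def hamming_search(packed_db, packed_query, k):
    """Return indices of the k database codes closest in Hamming distance."""
    xor = np.bitwise_xor(packed_db, packed_query)    # differing bits
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row
    return np.argsort(dists, kind="stable")[:k], dists

# Tiny example: 4 database codes with b = 8 bits each.
Z = np.array([[ 1,  1,  1,  1, -1, -1, -1, -1],
              [ 1,  1,  1,  1,  1, -1, -1, -1],
              [-1, -1, -1, -1,  1,  1,  1,  1],
              [ 1, -1,  1, -1,  1, -1,  1, -1]])
db = pack_codes(Z)
query = pack_codes(Z[:1])                    # query equal to the first code
idx, dists = hamming_search(db, query, k=2)
```

On packed codes the distance computation is a handful of bitwise operations per word, which is why Hamming search scales to millions of codes held in fast memory.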
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

However, this ideal objective cannot be easily optimized over hash functions, and one uses approximate objectives instead. Many such objectives have been proposed in the literature. We focus here on affinity-based loss functions, which directly try to preserve the original similarities in the binary space. Specifically, we consider objective functions of the form

\min_h L(h) = \sum_{n,m=1}^N L(h(x_n), h(x_m); y_{nm}) \qquad (1)

where X = (x_1, ..., x_N) is the high-dimensional dataset of feature vectors, \min_h means minimizing over the parameters of the hash function h (e.g. over the weights of a linear SVM), and L(·) is a loss function that compares the codes for two images (often through their Hamming distance ‖h(x_n) − h(x_m)‖) with the ground-truth value y_nm that measures the affinity in the original space between the two images x_n and x_m (distance, similarity or other measure of neighborhood; [12]). The sum is often restricted to a subset of image pairs (n, m) (for example, within the k nearest neighbors of each other in the original space), to keep the runtime low. Examples of these objective functions (described below) include models developed for dimension reduction, be they spectral such as Laplacian Eigenmaps [2] and Locally Linear Embedding [24], or nonlinear such as the Elastic Embedding [4] or t-SNE [26]; as well as objective functions designed specifically for binary hashing, such as Supervised Hashing with Kernels (KSH) [19], Binary Reconstructive Embeddings (BRE) [14] or Sequential Projection Learning Hashing (SPLH) [29].

If the hash function h were a continuous function of its input x and its parameters, one could simply apply the chain rule to compute derivatives over the parameters of h of the objective function (1) and then apply a nonlinear optimization method such as gradient descent.
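For concreteness, an objective of the form (1) is straightforward to evaluate for a given hash function. The sketch below uses the KSH-style pair loss and an illustrative linear hash function; the pair subset and the affinities are synthetic assumptions, not data from the paper.

```python
import numpy as np

def ksh_loss(Zn, Zm, y, b):
    """KSH-style pair loss: (z_n^T z_m - b*y_nm)^2, summed over pairs."""
    return ((np.sum(Zn * Zm, axis=1) - b * y) ** 2).sum()

def affinity_objective(h, X, pairs, y, b):
    """Eq.-(1)-style objective: sum of pair losses on the codes h(X)."""
    Z = h(X)                                  # (N, b) codes in {-1, +1}
    n, m = pairs[:, 0], pairs[:, 1]
    return ksh_loss(Z[n], Z[m], y, b)

# Toy setup with a linear hash function h(x) = sign(Wx).
rng = np.random.default_rng(0)
b, D, N = 4, 8, 20
W = rng.standard_normal((b, D))
h = lambda X: np.where(X @ W.T >= 0, 1, -1)
X = rng.standard_normal((N, D))
pairs = np.array([(n, m) for n in range(N) for m in range(n + 1, N)])
y = rng.choice([-1, 1], size=len(pairs))      # illustrative affinities
L = affinity_objective(h, X, pairs, y, b)
```

Because h thresholds its output, this objective is piecewise constant in W, which is exactly the nonsmoothness discussed in the text.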
This would be guaranteed to\nconverge to an optimum under mild conditions (for example, Wolfe conditions on the line search),\nwhich would be global if the objective is convex and local otherwise [21]. Hence, optimally learning\nthe function h would be in principle doable (up to local optima), although it would still be slow\nbecause the objective can be quite nonlinear and involve many terms.\n\nIn binary hashing, the optimization is much more dif\ufb01cult, because in addition to the previous is-\nsues, the hash function must output binary values, hence the problem is not just generally nonconvex,\nbut also nonsmooth. In view of this, much work has sidestepped the issue and settled on a simple\nbut suboptimal solution. First, one de\ufb01nes the objective function (1) directly on the b-dimensional\ncodes of each image (rather than on the hash function parameters) and optimizes it assuming con-\ntinuous codes (in Rb). Then, one binarizes the codes for each image. Finally, one learns a hash\nfunction given the codes. Optimizing the af\ufb01nity-based loss function (1) can be done using spec-\ntral methods or nonlinear optimization as described above. Binarizing the codes has been done in\ndifferent ways, from simply rounding them to {\u22121, +1} using zero as threshold [18, 19, 30, 33],\nto optimally \ufb01nding a threshold [18], to rotating the continuous codes so that thresholding intro-\nduces less error [11, 32]. Finally, learning the hash function for each of the b output bits can\nbe considered as a binary classi\ufb01cation problem, where the resulting classi\ufb01ers collectively give\nthe desired hash function, and can be solved using various machine learning techniques. Several\nworks (e.g. 
[16, 17, 33]) have used this approach, which does produce reasonable hash functions\n(in terms of retrieval measures such as precision/recall).\n\nIn order to do better, one needs to take into account during the optimization (rather than after the\noptimization) the fact that the codes are constrained to be binary. This implies attempting directly the\ndiscrete optimization of the af\ufb01nity-based loss function over binary codes. This is a daunting task,\nsince this is usually an NP-complete problem with N b binary variables altogether, and practical\napplications could make this number as large as millions or beyond. Recent works have applied\nalternating optimization (with various re\ufb01nements) to this, where one optimizes over a usually small\nsubset of binary variables given \ufb01xed values for the remaining ones [16, 17], and this did result in\nvery competitive precision/recall compared with the state-of-the-art. This is still slow and future\nwork will likely improve it, but as of now it provides an option to learn better binary codes.\n\nOf the three-step suboptimal approach mentioned (learn continuous codes, binarize them, learn hash\nfunction), these works manage to join the \ufb01rst two steps and hence learn binary codes [16, 17]. Then,\none learns the hash function given these binary codes. Can we do better? Indeed, in this paper we\nshow that all elements of the problem (binary codes and hash function) can be incorporated in a\nsingle algorithm that optimizes jointly over them. Hence, by initializing it from binary codes from\nthe previous approach, this algorithm is guaranteed to achieve a lower error and learn better hash\nfunctions. 
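The suboptimal pipeline described above (relax, binarize, fit a hash function) can be sketched end to end. Everything here is an illustrative stand-in, not the paper's code: a PCA projection plays the role of the relaxed continuous optimizer, and a ridge regression plays the role of the per-bit binary classifier (e.g. an SVM).

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, b = 50, 10, 4
X = rng.standard_normal((N, D))

# Step 1 (relax): continuous codes from a continuous optimizer.
# A PCA projection stands in for the relaxed solution here.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_cont = Xc @ Vt[:b].T                       # (N, b) continuous codes

# Step 2 (binarize): round to {-1, +1} with zero as threshold.
Z = np.where(Z_cont >= 0, 1, -1)

# Step 3 (fit the hash function): one classifier per bit; ridge
# regression stands in for a binary classifier such as an SVM.
lam = 1e-3
W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(D), Xc.T @ Z)   # (D, b)
h = lambda Xq: np.where((Xq - X.mean(axis=0)) @ W >= 0, 1, -1)

train_bit_accuracy = (h(X) == Z).mean()
```

The point the text makes is visible even in this toy: the codes Z are fixed before the hash function is fitted, so nothing feeds the fitting error back into the codes.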
Our framework can be seen as an iterated version of the two-step approach: learn binary\ncodes given the current hash function, learn hash functions given codes, iterate (note the emphasis).\n\n2\n\n\fThe key to achieve this in a principled way is to use a recently proposed method of auxiliary coor-\ndinates (MAC) for optimizing \u201cnested\u201d systems, i.e., consisting of the composition of two or more\nfunctions or processing stages. MAC introduces new variables and constraints that cause decoupling\nbetween the stages, resulting in the mentioned alternation between learning the hash function and\nlearning the binary codes. Section 1 reviews af\ufb01nity-based loss functions, section 2 describes our\nMAC-based proposed framework, section 3 evaluates it in several supervised datasets, using linear\nand nonlinear hash functions, and section 4 discusses implications of this work.\n\nRelated work Although one can construct hash functions without training data [1, 15], we fo-\ncus on methods that learn the hash function given a training set, since they perform better, and our\nemphasis is in optimization. The learning can be unsupervised [5, 11], which attempts to preserve\ndistances in the original space, or supervised, which in addition attempts to preserve label similarity.\nMany objective functions have been proposed to achieve this and we focus on af\ufb01nity-based ones.\nThese create an af\ufb01nity matrix for a subset of training points based on their distances (unsupervised)\nor labels (supervised) and combine it with a loss function [14, 16, 17, 19, 22]. Some methods opti-\nmize this directly over the hash function. For example, Binary Reconstructive Embeddings [14] use\nalternating optimization over the weights of the hash functions. Supervised Hashing with Kernels\n[19] learns hash functions sequentially by considering the difference between the inner product of\nthe codes and the corresponding element of the af\ufb01nity matrix. 
Although many approaches exist, a common theme is to apply a greedy approach where one first finds codes using an affinity-based loss function, and then fits the hash functions to them (usually by training a classifier). The codes can be found by relaxing the problem and binarizing its solution [18, 30, 33], or by approximately solving for the binary codes using some form of alternating optimization (possibly combined with GraphCut), as in two-step hashing [10, 16, 17], or by using relaxation in other ways [19, 22].

1 Nonlinear embedding and affinity-based loss functions for binary hashing

The dimensionality reduction literature has developed a number of objectives of the form (1) (often called "embeddings") where the low-dimensional projection z_n ∈ R^b of each high-dimensional data point x_n ∈ R^D is a free, real-valued parameter. The neighborhood information is encoded in the y_nm values (using labels in supervised problems, or distance-based affinities in unsupervised problems). An example is the elastic embedding [4], where L(z_n, z_m; y_nm) has the form:

y^+_{nm} \|z_n - z_m\|^2 + \lambda y^-_{nm} \exp(-\|z_n - z_m\|^2), \quad \lambda > 0 \qquad (2)

where the first term tries to project true neighbors (having y^+_{nm} > 0) close together, while the second repels all non-neighbors' projections (having y^-_{nm} > 0) from each other. Laplacian Eigenmaps [2] and Locally Linear Embedding [24] result from replacing the second term above with a constraint that fixes the scale of Z, which results in an eigenproblem rather than a nonlinear optimization, but also produces more distorted embeddings. Other objectives exist, such as t-SNE [26], that do not separate into functions of pairs of points. Optimizing nonlinear embeddings is quite challenging, but much progress has been made recently [4, 6, 25, 27, 28, 31].
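As a small worked example, the elastic embedding pair term of eq. (2) can be written directly (illustrative code, not from the paper):

```python
import numpy as np

def ee_loss(zn, zm, y_pos, y_neg, lam):
    """Elastic embedding pair loss, eq. (2):
    y+_nm * ||zn - zm||^2 + lam * y-_nm * exp(-||zn - zm||^2)."""
    d2 = np.sum((zn - zm) ** 2)
    return y_pos * d2 + lam * y_neg * np.exp(-d2)

# Attraction: a true neighbor pair is penalized more the farther apart it is.
near = ee_loss(np.array([0., 0.]), np.array([0.1, 0.]), y_pos=1, y_neg=0, lam=1.0)
far  = ee_loss(np.array([0., 0.]), np.array([3.0, 0.]), y_pos=1, y_neg=0, lam=1.0)
```

The repulsive exponential term behaves oppositely: it is largest when a non-neighbor pair coincides and decays quickly with distance, which is what keeps the embedding from collapsing.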
Although these models were developed to produce continuous projections, they have been successfully used for binary hashing too by truncating their codes [30, 33] or using the two-step approach of [16, 17].

Other loss functions have been developed specifically for hashing, where now z_n is a b-bit vector (where binary values are in {−1, +1}). For example (see a longer list in [16]), for Supervised Hashing with Kernels (KSH), L(z_n, z_m; y_nm) has the form

(z_n^T z_m - b\, y_{nm})^2 \qquad (3)

where y_nm is 1 if x_n, x_m are similar and −1 if they are dissimilar. Binary Reconstructive Embeddings [14] uses ((1/b)‖z_n − z_m‖² − y_nm)², where y_nm = ½‖x_n − x_m‖². The exponential variant of SPLH [29] proposed by Lin et al. [16] (which we call eSPLH) uses exp(−(1/b) y_nm z_n^T z_m). Our approach can be applied to any of these loss functions, though we will mostly focus on the KSH loss for simplicity. When the variables Z are binary, we will call these optimization problems binary embeddings, in analogy to the more traditional continuous embeddings for dimension reduction.

2 Learning codes and hash functions using auxiliary coordinates

The optimization of the loss L(h) in eq. (1) is difficult because of the thresholded hash function, which appears as the argument of the loss function L. We use the recently proposed method of auxiliary coordinates (MAC) [7, 8], which is a meta-algorithm to construct optimization algorithms for nested functions. This proceeds in 3 stages. First, we introduce new variables (the "auxiliary coordinates") as equality constraints into the problem, with the goal of unnesting the function. We can achieve this by introducing one binary vector z_n ∈ {−1, +1}^b for each point. This transforms the original, unconstrained problem into the following equivalent, constrained problem:

\min_{h,Z} \sum_{n,m=1}^N L(z_n, z_m; y_{nm}) \quad \text{s.t.}
z_1 = h(x_1), \dots, z_N = h(x_N). \qquad (4)

We recognize the objective function as the "embedding" form of the loss function, except that the "free" parameters z_n are in fact constrained to be the deterministic outputs of the hash function h.

Second, we solve the constrained problem using a penalty method, such as the quadratic-penalty or augmented Lagrangian [21]. We discuss here the former for simplicity. We solve the following minimization problem (unconstrained again, but dependent on μ) while progressively increasing μ, so the constraints are eventually satisfied:

\min L_P(h, Z; \mu) = \sum_{n,m=1}^N L(z_n, z_m; y_{nm}) + \mu \sum_{n=1}^N \|z_n - h(x_n)\|^2 \quad \text{s.t.} \quad z_1, \dots, z_N \in \{-1,+1\}^b. \qquad (5)

‖z_n − h(x_n)‖² is proportional to the Hamming distance between the binary vectors z_n and h(x_n).

Third, we apply alternating optimization over the binary codes Z and the parameters of the hash function h. This results in iterating the following two steps (described in detail later):

Z step: Optimize the binary codes z_1, ..., z_N given h (hence, given the output binary codes h(x_1), ..., h(x_N) for each of the N images). This can be seen as a regularized binary embedding, because the projections Z are encouraged to be close to the hash function outputs h(X). Here, we try two different approaches [16, 17] with some modifications.

h step: Optimize the hash function h given the binary codes Z. This simply means training b binary classifiers using X as inputs and Z as labels.

This is very similar to the two-step (TSH) approach of Lin et al. [16], except that the latter learns the codes Z in isolation, rather than given the current hash function, so iterating the two-step approach would change nothing, and it does not optimize the loss L. More precisely, TSH corresponds to optimizing L_P for μ → 0+.
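The quadratic-penalty alternation can be sketched end to end on a toy problem. Everything below is an illustrative assumption rather than the paper's algorithm: the Z step is brute-force coordinate descent over points (feasible only for tiny b), the h step is a least-squares stand-in for the b binary classifiers, and the μ schedule is arbitrary.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N, D, b = 12, 5, 3
X = rng.standard_normal((N, D))
y = np.where(rng.random((N, N)) < 0.5, 1, -1)    # illustrative affinities
y = np.triu(y, 1); y = y + y.T                    # symmetric, zero diagonal

def pair_loss(Z):                  # KSH-style loss over all pairs
    K = Z @ Z.T
    mask = ~np.eye(N, dtype=bool)
    return ((K - b * y) ** 2)[mask].sum()

def penalized(Z, H, mu):           # the penalized objective L_P of eq. (5)
    return pair_loss(Z) + mu * ((Z - H) ** 2).sum()

codes = np.array(list(product([-1, 1], repeat=b)))   # all 2^b codes

def z_step(Z, H, mu):              # brute-force coordinate descent over points
    for n in range(N):
        best = min(codes, key=lambda c: penalized(
            np.vstack([Z[:n], c, Z[n + 1:]]), H, mu))
        Z[n] = best
    return Z

def h_step(Z):                     # least-squares stand-in for b classifiers
    return np.linalg.lstsq(X, Z, rcond=None)[0]

Z = np.where(rng.random((N, b)) < 0.5, 1, -1)    # init (e.g. from TSH codes)
mu = 0.1
for it in range(6):
    W = h_step(Z)
    H = np.where(X @ W >= 0, 1, -1)              # current hash outputs h(X)
    Z = z_step(Z, H, mu)
    if np.array_equal(Z, H):                     # constraints satisfied: stop
        break
    mu *= 2.0                                    # increase the penalty
```

The structure mirrors the text: as μ grows, the Z step is pulled toward codes the current hash function can actually produce, until Z = h(X) and the algorithm stops.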
In practice, we start from a very small value of \u00b5 (hence, initialize MAC\nfrom the result of TSH), and increase \u00b5 slowly while optimizing LP , until the equality constraints\nare satis\ufb01ed, i.e., zn = h(xn) for n = 1, . . . , N . The supplementary material gives the overall\nMAC algorithm to learn a hash function by optimizing an af\ufb01nity-based loss function.\n\nTheoretical results We can prove the following under the assumption that the Z and h steps are\nexact (suppl. mat.). 1) The MAC algorithm stops after a \ufb01nite number of iterations, when Z = h(X)\nin the Z step, since then the constraints are satis\ufb01ed and no more changes will occur to Z or h. 2)\nThe path over the continuous penalty parameter \u00b5 \u2208 [0, \u221e) is in fact discrete. The minimizer (h, Z)\nof LP for \u00b5 \u2208 [0, \u00b51] is identical to the minimizer for \u00b5 = 0, and the minimizer for \u00b5 \u2208 [\u00b52, \u221e)\nis identical to the minimizer for \u00b5 \u2192 \u221e, where 0 < \u00b51 < \u00b52 < \u221e. Hence, it suf\ufb01ces to take an\ninitial \u00b5 no smaller than \u00b51 and keep increasing it until the algorithm stops. Besides, the interval\n[\u00b51, \u00b52] is itself partitioned in a \ufb01nite set of intervals so that the minimizer changes only at interval\nboundaries. Hence, theoretically the algorithm needs only run for a \ufb01nite set of \u00b5 values (although\nthis set can still be very big). In practice, we increase \u00b5 more aggressively to reduce the runtime.\n\nThis is very different from the quadratic-penalty methods in continuous optimization [21], which\nwas the setting considered in the original MAC papers [7, 8]. 
There, the minimizer varies continuously with μ, which must be driven to infinity to converge to a stationary point, and in so doing it gives rise to ill-conditioning and slow convergence.

2.1 h step: optimization over the parameters of the hash function, given the binary codes

Given the binary codes z_1, ..., z_N, since h does not appear in the first term of L_P, this simply involves finding a hash function h that minimizes

\min_h \sum_{n=1}^N \|z_n - h(x_n)\|^2 = \sum_{i=1}^b \min_{h_i} \sum_{n=1}^N (z_{ni} - h_i(x_n))^2

where z_ni ∈ {−1, +1} is the ith bit of the binary vector z_n. Hence, we can find b one-bit hash functions in parallel and concatenate them into the b-bit hash function. Each of these is a binary classification problem using the number of misclassified patterns as loss. This allows us to use a regular classifier for h, and even to use a simpler surrogate loss (such as the hinge loss), since this will also enforce the constraints eventually (as μ increases). For example, we can fit an SVM by optimizing the margin plus the slack and using a high penalty for misclassified patterns. We discuss other classifiers in the experiments.

2.2 Z step: optimization over the binary codes, given the hash function

Although the MAC technique has significantly simplified the original problem, the step over Z is still complex. This involves finding the binary codes given the hash function h, and it is an NP-complete problem in Nb binary variables. Fortunately, some recent works have proposed practical approaches for this problem based on alternating optimization: a quadratic surrogate method [16], and a GraphCut method [17]. In both methods, the starting point is to apply alternating optimization over the ith bit of all points given the remaining bits are fixed for all points (for i = 1, ..., b), and to solve the optimization over the ith bit approximately.
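The per-bit alternation can be illustrated with a simplified stand-in for the approximate solvers: exact coordinate descent on the binary quadratic objective for a single bit. The matrix A below is synthetic; the actual methods relax the problem or use GraphCut instead of this sweep.

```python
import numpy as np

def zstep_bit(A, h_i, z, mu, sweeps=10):
    """Approximately minimize  z^T A z + mu * ||z - h_i||^2  over z in
    {-1, +1}^N by exact coordinate descent, one binary variable at a time."""
    N = len(z)
    z = z.copy()
    for _ in range(sweeps):
        changed = False
        for n in range(N):
            # Linear coefficient of z_n with all other bits held fixed
            # (z_n^2 = 1 and h_n^2 = 1 contribute only constants).
            c = 2.0 * (A[n] @ z - A[n, n] * z[n] - mu * h_i[n])
            zn = -1 if c > 0 else 1
            if zn != z[n]:
                z[n] = zn
                changed = True
        if not changed:
            break
    return z

def objective(A, h_i, z, mu):
    return z @ A @ z + mu * np.sum((z - h_i) ** 2)

rng = np.random.default_rng(3)
N = 30
A = rng.standard_normal((N, N)); A = (A + A.T) / 2   # illustrative symmetric A
h_i = rng.choice([-1, 1], size=N)                     # current hash outputs
z0 = rng.choice([-1, 1], size=N)
z = zstep_bit(A, h_i, z0, mu=1.0)
```

Each coordinate update is a strict non-increase of the objective, so the sweep terminates at a local minimum of the one-bit subproblem; the regularization term visibly pulls the solution toward the current hash outputs as μ grows.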
This would correspond to the first step in the two-step hashing of Lin et al. [16]. These methods, in their original form, can be applied to the loss function over binary codes, i.e., the first term in L_P. Here, we explain briefly our modification to these methods to make them work with our Z step objective (the regularized loss function over codes, i.e., the complete L_P). The full explanation can be found in the supplementary material.

Solution using a quadratic surrogate method [16]. This is based on the fact that any loss function that depends on the Hamming distance of two binary variables can be equivalently written as a quadratic function of those two binary variables. We can then write the first term in L_P as a binary quadratic problem using a certain matrix A ∈ R^{N×N} (computed using the fixed bits), and the second term (on μ) is also quadratic. The optimization for the ith bit can then be equivalently written as

\min_{z_{(i)}} \; z_{(i)}^T A z_{(i)} + \mu \|z_{(i)} - h_i(X)\|^2 \quad \text{s.t.} \quad z_{(i)} \in \{-1,+1\}^N \qquad (6)

where h_i(X) = (h_i(x_1), ..., h_i(x_N))^T and z_(i) are vectors of length N (one bit per data point). This is still an NP-complete problem (except in special cases), and we approximate it by relaxing it to a continuous quadratic program (QP) over z_(i) ∈ [−1, 1]^N, minimizing it using L-BFGS-B [34] and binarizing its solution.

Solution using a GraphCut algorithm [17]. To optimize L_P over the ith bit of each image (given all the other bits are fixed), we have to minimize the NP-complete problem of eq. (6) over N bits. We can apply the GraphCut algorithm [3], as proposed by the FastHash algorithm of Lin et al. [17]. This proceeds as follows. First, we assign all the data points to different, possibly overlapping groups (blocks).
Then, we minimize the objective function over the binary codes of the same block, while all the other binary codes are fixed, then proceed with the next block, etc. (that is, we do alternating optimization of the bits over the blocks). Specifically, to optimize over the bits in block B, ignoring the constants, we can rewrite equation (6) in the standard form for the GraphCut algorithm as:

\min_{z_{(i,B)}} \sum_{n \in B} \sum_{m \in B} v_{nm} z_{ni} z_{mi} + \sum_{n \in B} u_n z_{ni}

where v_nm = a_nm and u_n = 2 \sum_{m \notin B} a_{nm} z_{mi} − μ h_i(x_n). To minimize the objective function using the GraphCut algorithm, the blocks have to define a submodular function. In our case, this can be easily achieved by putting points with the same label in one block ([17] give a simple proof of this).

3 Experiments

We have tested our framework with several combinations of loss function, hash function, number of bits, datasets, and comparing with several state-of-the-art hashing methods (see suppl. mat.). We report a representative subset to show the flexibility of the approach. We use the KSH (3) [19] and eSPLH [29] loss functions. We test quadratic surrogate and GraphCut methods for the Z step in MAC. As hash functions (for each bit), we use linear SVMs (trained with LIBLINEAR; [9]) and kernel SVMs (with 500 basis functions).

We use the following labeled datasets: (1) CIFAR [13] contains 60 000 images in 10 classes. We use D = 320 GIST features [23] from each image. We use 58 000 images for training and 2 000 for test. (2) Infinite MNIST [20]. We generated, using elastic deformations of the original MNIST handwritten digit dataset, 1 000 000 images for training and 2 000 for test, in 10 classes. We represent each image by a D = 784 vector of raw pixels.
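Supervised affinities of the kind used in these experiments can be built by sampling, for each training point, a few same-label (positive) and different-label (negative) neighbors. The helper below is hypothetical (its name and the sizes chosen are illustrative, not from the paper):

```python
import numpy as np

def sample_affinities(labels, k_pos, k_neg, rng):
    """For each point, sample up to k_pos same-label and k_neg different-label
    neighbors; return (n, m, y_nm) triplets with y_nm = +1 or -1."""
    N = len(labels)
    triplets = []
    for n in range(N):
        same = np.flatnonzero(labels == labels[n])
        same = same[same != n]                       # exclude the point itself
        diff = np.flatnonzero(labels != labels[n])
        for m in rng.choice(same, size=min(k_pos, len(same)), replace=False):
            triplets.append((n, m, 1))
        for m in rng.choice(diff, size=min(k_neg, len(diff)), replace=False):
            triplets.append((n, m, -1))
    return np.array(triplets)

rng = np.random.default_rng(4)
labels = rng.integers(0, 10, size=200)       # e.g. 10 classes, as in CIFAR
pairs = sample_affinities(labels, k_pos=5, k_neg=10, rng=rng)
```

Restricting the pair sum to such a sampled subset is what keeps the runtime of affinity-based training manageable, as noted for eq. (1).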
Because of the computational cost of affinity-based methods, previous work has used training sets limited to a few thousand points [14, 16, 19, 22]. We train the hash functions in a subset of 10 000 points of the training set, and report precision and recall by searching for a test query on the entire dataset (the base set).

Figure 1: Loss function L and precision for k retrieved points for KSH and eSPLH loss functions on the CIFAR dataset, using b = 48 bits. (Panels, left to right: KSH loss, eSPLH loss, KSH precision, eSPLH precision.)

We report precision (and precision/recall in the suppl. mat.) for the test set queries using as ground truth (set of true neighbors in original space) all the training points with the same label. The retrieved set contains the k nearest neighbors of the query point in the Hamming space. We report precision for different values of k to test the robustness of different algorithms.

The main comparison points are the quadratic surrogate and GraphCut methods of Lin et al. [16, 17], which we denote in this section as quad and cut, respectively, regardless of the hash function that fits the resulting codes. Correspondingly, we denote the MAC version of these as MACquad and MACcut, respectively. We use the following schedule for the penalty parameter μ in the MAC algorithm (regardless of the hash function type or dataset). We initialize Z with μ = 0, i.e., the result of quad or cut.
Starting from \u00b51 = 0.3 (MACcut) or 0.01 (MACquad), we multiply \u00b5 by 1.4\nafter each iteration (Z and h step).\n\nOur experiments show our MAC algorithm indeed \ufb01nds hash functions with a signi\ufb01cantly and con-\nsistently lower objective value than rounding or two-step approaches (in particular, cut and quad);\nand that it outperforms other state-of-the-art algorithms on different datasets, with MACcut beating\nMACquad most of the time. The improvement in precision makes using MAC well worth the rela-\ntively small extra runtime and minimal additional implementation effort it requires. In all our plots,\nthe vertical arrows indicate the improvement of MACcut over cut and of MACquad over quad.\n\nThe MAC algorithm \ufb01nds better optima The goal of this paper is not to introduce a new af\ufb01nity-\nbased loss or hash function, but to describe a generic framework to construct algorithms that opti-\nmize a given combination thereof. We illustrate its effectiveness here with the CIFAR dataset, with\ndifferent sizes of retrieved neighbor sets, and using 16 to 48 bits. We optimize two loss functions\n(KSH from eq. (3) and eSPLH), and two hash functions (linear and kernel SVM). In all cases, the\nMAC algorithm achieves a better hash function both in terms of the loss and of the precision/recall.\nWe compare 4 ways of optimizing the loss function: quad [16], cut [17], MACquad and MACcut.\n\nFor each point xn in the training set, we use \u03ba+ = 100 positive and \u03ba\u2212 = 500 negative neighbors,\nchosen at random to have the same or a different label as xn, respectively. Fig. 
1 (panels 1 and 3)\nshows the KSH loss function for all the methods (including the original KSH method in [19]) over\niterations of the MAC algorithm (KSH, quad and cut do not iterate), as well as precision and recall.\nIt is clear that MACcut (red lines) and MACquad (magenta lines) reduce the loss function more than\ncut (blue lines) and quad (black lines), respectively, as well as the original KSH algorithm (cyan), in\nall cases: type of hash function (linear: dashed lines, kernel: solid lines) and number of bits b = 16\nto 48 (suppl. mat.). Hence, applying MAC is always bene\ufb01cial. Reducing the loss nearly always\ntranslates into better precision and recall (with a larger gain for linear than for kernel hash functions,\nusually). The gain of MACcut/MACquad over cut/quad is signi\ufb01cant, often comparable to the gain\nobtained by changing from the linear to the kernel hash function within the same algorithm.\n\nWe usually \ufb01nd cut outperforms quad (in agreement with [17]), and correspondingly MACcut out-\nperforms MACquad. Interestingly, MACquad and MACcut end up being very similar even though\nthey started very differently. This suggests it is not crucial which of the two methods to use in\nthe MAC Z step, although we still prefer cut, because it usually produces somewhat better optima.\nFinally, \ufb01g. 1 (panels 2 and 4) also shows the MACcut results using the eSPLH loss. All settings\nare as in the \ufb01rst KSH experiment. As before, MACcut outperforms cut in both loss function and\nprecision/recall using either a linear or a kernel SVM.\n\nWhy does MAC learn better hash functions? 
In both the two-step and MAC approaches, the starting point is the "free" binary codes obtained by minimizing the loss over the codes without them being the output of a particular hash function. That is, minimizing (4) without the "z_n = h(x_n)" constraints:

\min_Z E(Z) = \sum_{n,m=1}^N L(z_n, z_m; y_{nm}), \quad z_1, \dots, z_N \in \{-1,+1\}^b. \qquad (7)

The resulting free codes try to achieve good precision/recall independently of whether a hash function can actually produce such codes. Constraining the codes to be realizable by a specific family of hash functions (say, linear), means the loss E(Z) will be larger than for free codes.

Figure 2: Panels 1–2: like fig. 1 but showing the value of the error function E(Z) of eq. (7) for the "free" binary codes, and for the codes produced by the hash functions learned by cut (the two-step method) and MACcut, with linear and kernel hash functions. Panel 3: illustration of free codes, two-step codes and optimal codes realizable by a hash function, in the space {−1, +1}^{b×N}.

How difficult is it for a hash function to produce the free codes? Fig. 2 (panels 1–2) plots the loss function for the free codes, the two-step codes from cut, and the codes from MACcut, for both linear and kernel hash functions in the same experiment as in fig. 1.
It is clear that the free codes have a very low loss E(Z), which is far from what a kernel function can produce, and even farther from what a linear function can produce. Both of these are relatively smooth functions that cannot represent the presumably complex structure of the free codes. This could be improved by using a very flexible hash function (e.g. a kernel function with many centers), which could better approximate the free codes, but 1) a very flexible function would likely not generalize well, and 2) we require fast hash functions for fast retrieval anyway. Given our linear or kernel hash functions, what the two-step cut optimization does is fit the hash function directly to the free codes. This is not guaranteed to find the best hash function in terms of the original problem (1), and indeed it produces a rather suboptimal function. In contrast, MAC gradually optimizes both the codes and the hash function so that they eventually match, and finds a better hash function for the original problem (although it is still not guaranteed to find the globally optimal function of problem (1), which is NP-complete).\n\nFig. 2 (right) shows this conceptually. It shows the space of all possible binary codes, the contours of E(Z) (green) and the set of codes that can be produced by (say) linear hash functions h (gray), which is the feasible set {Z ∈ {−1, +1}^{b×N}: Z = h(X) for linear h}. The two-step codes “project” the free codes onto the feasible set, but these are not the codes of the optimal hash function h.\n\nRuntime The runtime per iteration for our 10 000-point training sets, with b = 48 bits and κ+ = 100 and κ− = 500 neighbors, is 2 minutes on a laptop for both MACcut and MACquad. They stop after 10–20 iterations. Each iteration is comparable to a single cut or quad run, since the Z step dominates the computation. 
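That dominant Z step can be illustrated in miniature. The sketch below (our names, not the paper's; the actual solvers use quadratic or GraphCut surrogates) updates one point's code by exhaustive search over {−1, +1}^b, trading the KSH-style pairwise terms against the MAC regularizer µ‖z − h(xn)‖² of eq. (5) that pulls the code toward the current hash function's output; exhaustive search is feasible only for tiny b:

```python
import numpy as np
from itertools import product

def z_step_single_point(Z, n, Y, h_xn, mu):
    """Toy Z-step update for point n (cf. eq. (5)): exhaustive minimization over
    z in {-1,+1}^b of its KSH-style pairwise terms plus mu * ||z - h(x_n)||^2.
    Z: current N x b codes; Y: N x N affinities; h_xn: hash output for x_n."""
    b = Z.shape[1]
    best_z, best_e = None, np.inf
    for z in product((-1, 1), repeat=b):
        z = np.array(z)
        e = sum((z @ Z[m] / b - Y[n, m]) ** 2
                for m in range(Z.shape[0]) if m != n)
        e += mu * np.sum((z - h_xn) ** 2)   # pull toward the hash output
        if e < best_e:
            best_z, best_e = z, e
    return best_z
```

As µ → 0 this recovers the free-code update, and as µ → ∞ it returns the current hash output itself, so the schedule over µ interpolates between free codes and codes realizable by h.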
The iterations after the first one are faster because they are warm-started.\n\nComparison with binary hashing methods Fig. 3 shows results on CIFAR and Infinite MNIST. We create affinities ynm for all methods using the dataset labels as before, with κ+ = 100 similar neighbors and κ− = 500 dissimilar neighbors. We compare MACquad and MACcut with Two-Step Hashing (quad) [16], FastHash (cut) [17], Hashing with Kernels (KSH) [19], Iterative Quantization (ITQ) [11], Binary Reconstructive Embeddings (BRE) [14] and Self-Taught Hashing (STH) [33]. MACquad, MACcut, quad and cut all use the KSH loss function (3). The results show that MACcut (and MACquad) generally outperform all other methods, often by a large margin, in nearly all situations (dataset, number of bits, size of retrieved set). In particular, MACcut and MACquad are the only ones to beat ITQ, as long as one uses sufficiently many bits.\n\n[Figure 3 appears here: four precision-vs-k panels (CIFAR with b = 16 and b = 64; Infinite MNIST with b = 16 and b = 64) comparing MACcut, MACquad, cut, quad, KSH, ITQ, BRE and STH.]\n\nFigure 3: Comparison with binary hashing methods on CIFAR (left) and Infinite MNIST (right), using a linear hash function, with b = 16 to 64 bits (suppl. mat.). Each plot shows the precision for k retrieved points, for a range of k.\n\n4 Discussion\n\nThe two-step approach of Two-Step Hashing [16] and FastHash [17] is a significant advance in finding good codes for binary hashing, but it also causes a maladjustment between the codes and the hash function, since the codes were learned without knowledge of what kind of hash function would use them. Ignoring the interaction between the loss and the hash function limits the quality of the results. For example, a linear hash function will have a harder time than a nonlinear one at learning such codes. In our algorithm, this tradeoff is enforced gradually (as µ increases) in the Z step as a regularization term (eq. (5)): it finds the best codes according to the loss function, but makes sure they are close to being realizable by the current hash function. Our experiments demonstrate that significant, consistent gains are achieved in both the loss function value and the precision/recall in image retrieval over the two-step approach. Note that the objective (5) is not an ad hoc combination of a loss over the hash function and a loss over the codes; it follows by applying MAC to the well-defined top-level problem (1), and it solves it in the limit of large µ (up to local optima).\n\nWhat is the best type of hash function to use? The answer is not unique, as it depends on application-specific factors: quality of the codes produced (to retrieve the correct images), time to compute the codes on high-dimensional data (since, after all, the reason to use binary hashing is to speed up retrieval), ease of implementation within a given hardware architecture and software libraries, etc. Our MAC framework facilitates this choice considerably, because training different types of hash functions simply involves reusing an existing classification algorithm within the h step, with no changes to the Z step.\n\n5 Conclusion\n\nWe have proposed a general framework for optimizing binary hashing using affinity-based loss functions. 
It improves over previous two-step approaches based on learning the binary codes first and then learning the hash function. Instead, it optimizes jointly over the binary codes and the hash function in alternation, so that the binary codes eventually match the hash function, resulting in a better local optimum of the affinity-based loss. This is made possible by introducing auxiliary variables that conditionally decouple the codes from the hash function, and by gradually enforcing the corresponding constraints. Our framework makes it easy to design an optimization algorithm for a new choice of loss function or hash function: one simply reuses existing software that optimizes each in isolation. The resulting algorithm is not much slower than the two-step approach (it is comparable to iterating the latter a few times) and is well worth the improvement in precision/recall.\n\nThe step over the hash function is essentially a solved problem when using a classifier, since classifiers can be learned in an accurate and scalable way with existing machine learning techniques. The most difficult and time-consuming part of our approach is the optimization over the binary codes, which is NP-complete and involves many binary variables and terms in the objective. Although some techniques exist [16, 17] that produce practical results, designing algorithms that reliably find good local optima and scale to large training sets is an important topic for future research.\n\nAnother direction for future work involves learning more sophisticated hash functions that go beyond mapping image features onto output binary codes using simple classifiers such as SVMs. This is possible because the optimization over the hash function parameters is confined to the h step and takes the form of a supervised classification problem, so we can apply an array of techniques from machine learning and computer vision. 
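Concretely, the h step decomposes into b independent binary classification problems, one per bit of the code matrix Z. The sketch below trains a linear perceptron per bit as a minimal stand-in for the SVMs used in the paper (the function name and training details are ours):

```python
import numpy as np

def h_step_perceptron(X, Z, epochs=50, lr=0.1):
    """Toy h step: fit one linear perceptron per code bit, so that the learned
    hash function is h(x) = sign(W.T @ x). X: N x d inputs; Z: N x b target
    codes in {-1, +1} produced by the Z step."""
    d, b = X.shape[1], Z.shape[1]
    W = np.zeros((d, b))
    for _ in range(epochs):
        for j in range(b):
            pred = np.sign(X @ W[:, j])
            pred[pred == 0] = 1                      # break sign ties
            wrong = pred != Z[:, j]
            if wrong.any():
                # perceptron update on the misclassified points for bit j
                W[:, j] += lr * (X[wrong].T @ Z[wrong, j])
    return W
```

Swapping in a different classifier (kernel SVM, decision tree, etc.) changes only this step, which is exactly the modularity the MAC framework exploits.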
For example, it may be possible to learn image features that work better with hashing than standard features such as SIFT, or to learn transformations of the input to which the binary codes should be invariant, such as translation, rotation or alignment.\n\nAcknowledgments Work supported by NSF award IIS-1423515.\n\nReferences\n\n[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Comm. ACM, 51(1):117–122, Jan. 2008.\n\n[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.\n\n[3] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE PAMI, 26(9):1124–1137, Sept. 2004.\n\n[4] M. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. ICML, 2010.\n\n[5] M. Carreira-Perpiñán and R. Raziperchikolaei. Hashing with binary autoencoders. CVPR, 2015.\n\n[6] M. Carreira-Perpiñán and M. Vladymyrov. A fast, universal algorithm to learn parametric nonlinear embeddings. NIPS, 2015.\n\n[7] M. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. arXiv:1212.5921 [cs.LG], Dec. 24 2012.\n\n[8] M. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. AISTATS, 2014.\n\n[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, Aug. 2008.\n\n[10] T. Ge, K. He, and J. Sun. Graph cuts for supervised binary coding. ECCV, 2014.\n\n[11] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. IEEE PAMI, 35(12):2916–2929, Dec. 2013.\n\n[12] K. Grauman and R. Fergus. 
Learning binary hash codes for large-scale image search. In R. Cipolla, S. Battiato, and G. Farinella, editors, Machine Learning for Computer Vision, pages 49–87. Springer-Verlag, 2013.\n\n[13] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, U. Toronto, 2009.\n\n[14] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. NIPS, 2009.\n\n[15] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE PAMI, 34(6):1092–1104, 2012.\n\n[16] G. Lin, C. Shen, D. Suter, and A. van den Hengel. A general two-step approach to learning-based hashing. ICCV, 2013.\n\n[17] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. CVPR, 2014.\n\n[18] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. ICML, 2011.\n\n[19] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. CVPR, 2012.\n\n[20] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, 2007.\n\n[21] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, second edition, 2006.\n\n[22] M. Norouzi and D. Fleet. Minimal loss hashing for compact binary codes. ICML, 2011.\n\n[23] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Computer Vision, 42(3):145–175, May 2001.\n\n[24] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 22 2000.\n\n[25] L. J. P. van der Maaten. Barnes-Hut-SNE. In Int. Conf. Learning Representations (ICLR 2013), 2013.\n\n[26] L. J. P. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, Nov. 2008.\n\n[27] M. 
Vladymyrov and M. Carreira-Perpiñán. Partial-Hessian strategies for fast learning of nonlinear embeddings. ICML, 2012.\n\n[28] M. Vladymyrov and M. Carreira-Perpiñán. Linear-time training of nonlinear low-dimensional embeddings. AISTATS, 2014.\n\n[29] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large scale search. IEEE PAMI, 2012.\n\n[30] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. NIPS, 2009.\n\n[31] Z. Yang, J. Peltonen, and S. Kaski. Scalable optimization for neighbor embedding for visualization. ICML, 2013.\n\n[32] S. X. Yu and J. Shi. Multiclass spectral clustering. ICCV, 2003.\n\n[33] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. SIGIR, 2010.\n\n[34] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: FORTRAN subroutines for large-scale bound-constrained optimization. ACM Trans. Mathematical Software, 23(4):550–560, 1997.\n", "award": [], "sourceid": 347, "authors": [{"given_name": "Ramin", "family_name": "Raziperchikolaei", "institution": "UC Merced"}, {"given_name": "Miguel", "family_name": "Carreira-Perpinan", "institution": "UC Merced"}]}