{"title": "Learning a Metric Embedding for Face Recognition using the Multibatch Method", "book": "Advances in Neural Information Processing Systems", "page_first": 1388, "page_last": 1389, "abstract": "This work is motivated by the engineering task of achieving a near state-of-the-art face recognition on a minimal computing budget running on an embedded system. Our main technical contribution centers around a novel training method, called Multibatch, for similarity learning, i.e., for the task of generating an invariant ``face signature'' through training pairs of ``same'' and ``not-same'' face images. The Multibatch method first generates signatures for a mini-batch of $k$ face images and then constructs an unbiased estimate of the full gradient by relying on all $k^2-k$ pairs from the mini-batch. We prove that the variance of the Multibatch estimator is bounded by $O(1/k^2)$, under some mild conditions. In contrast, the standard gradient estimator that relies on random $k/2$ pairs has a variance of order $1/k$. The smaller variance of the Multibatch estimator significantly speeds up the convergence rate of stochastic gradient descent. Using the Multibatch method we train a deep convolutional neural network that achieves an accuracy of $98.2\\%$ on the LFW benchmark, while its prediction runtime takes only $30$msec on a single ARM Cortex A9 core. Furthermore, the entire training process took only 12 hours on a single Titan X GPU.", "full_text": "A Appendix: Proof of Theorem 1\n\nWe first show that the estimate is unbiased. Indeed, for every $i \\neq j$ we can rewrite $L(z)$ as $\\mathbb{E}_\\pi\\, \\ell_{\\pi(i),\\pi(j)}(z)$. Therefore,\n\n$$L(z) = \\frac{1}{k^2-k} \\sum_{i \\neq j \\in [k]} L(z) = \\frac{1}{k^2-k} \\sum_{i \\neq j \\in [k]} \\mathbb{E}_\\pi\\, \\ell_{\\pi(i),\\pi(j)}(z) = \\mathbb{E}_\\pi\\, L_\\pi(z),$$\n\nwhich proves that the multibatch estimate is unbiased.\n\nNext, we turn to analyze the variance of the multibatch estimate. Let $I \\subset [k]^4$ be the set of all indices $i, j, s, t$ such that\n
$i \\neq j$ and $s \\neq t$, and we partition $I$ into $I_1 \\cup I_2 \\cup I_3$, where $I_1$ is the set where $i = s$ and $j = t$, $I_2$ is the set where all indices are different, and $I_3$ is the set where $i = s$ and $j \\neq t$, or $i \\neq s$ and $j = t$. Then:\n\n$$\\begin{aligned} \\mathbb{E}_\\pi \\|\\nabla L_\\pi(z) - \\nabla L(z)\\|^2 &= \\frac{1}{(k^2-k)^2}\\, \\mathbb{E}_\\pi \\sum_{(i,j,s,t) \\in I} (\\nabla \\ell_{\\pi(i),\\pi(j)}(z) - \\nabla L(z)) \\cdot (\\nabla \\ell_{\\pi(s),\\pi(t)}(z) - \\nabla L(z)) \\\\\\\\ &= \\sum_{r=1}^{d} \\frac{1}{(k^2-k)^2} \\sum_{q=1}^{3} \\sum_{(i,j,s,t) \\in I_q} \\mathbb{E}_\\pi\\, (\\nabla_r \\ell_{\\pi(i),\\pi(j)}(z) - \\nabla_r L(z))\\, (\\nabla_r \\ell_{\\pi(s),\\pi(t)}(z) - \\nabla_r L(z)) \\end{aligned}$$\n\nFor every $r$, denote by $A^{(r)}$ the matrix with $A^{(r)}_{i,j} = \\nabla_r \\ell_{i,j}(z) - \\nabla_r L(z)$. Observe that for every $r$, $\\mathbb{E}_{i \\neq j} A^{(r)}_{i,j} = 0$, and that\n\n$$\\sum_r \\mathbb{E}_{i \\neq j} (A^{(r)}_{i,j})^2 = \\mathbb{E}_{i \\neq j} \\|\\nabla \\ell_{i,j}(z) - \\nabla L(z)\\|^2.$$\n\nTherefore,\n\n$$\\mathbb{E}_\\pi \\|\\nabla L_\\pi(z) - \\nabla L(z)\\|^2 = \\sum_{r=1}^{d} \\frac{1}{(k^2-k)^2} \\sum_{q=1}^{3} \\sum_{(i,j,s,t) \\in I_q} \\mathbb{E}_\\pi\\, A^{(r)}_{\\pi(i),\\pi(j)} A^{(r)}_{\\pi(s),\\pi(t)}.$$\n\nLet us momentarily fix $r$ and omit the superscript from $A^{(r)}$. We consider the value of $\\mathbb{E}_\\pi A_{\\pi(i),\\pi(j)} A_{\\pi(s),\\pi(t)}$ according to the value of $q$.\n\n\u2022 For $q = 1$: we obtain $\\mathbb{E}_\\pi A^2_{\\pi(i),\\pi(j)}$, which is the variance of the random variable $\\nabla_r \\ell_{i,j}(z) - \\nabla_r L(z)$.\n\u2022 For $q = 2$: When we fix $i, j, s, t$ which are all different, and take expectation over $\\pi$, all products of off-diagonal elements of $A$ appear the same number of times in $\\mathbb{E}_\\pi A_{\\pi(i),\\pi(j)} A_{\\pi(s),\\pi(t)}$. Therefore, this quantity is proportional to $\\sum_{p \\neq r} v_p v_r$, where $v$ is the vector of all non-diagonal entries of $A$. Since $\\sum_p v_p = 0$, we obtain (using Lemma 1) that $\\sum_{p \\neq r} v_p v_r \\leq 0$, which means that the entire sum for this case is non-positive.\n\u2022 For $q = 3$: Let us consider the case when $i = s$ and $j \\neq t$; the derivation for the case when $i \\neq s$ and $j = t$ is analogous. The expression we obtain is $\\mathbb{E}_\\pi A_{\\pi(i),\\pi(j)} A_{\\pi(i),\\pi(t)}$. This is like first sampling a row and then sampling, without replacement, two indices from the row (while not allowing the diagonal element to be taken).\n
So, we can rewrite the expression as:\n\n$$\\mathbb{E}_\\pi A_{\\pi(i),\\pi(j)} A_{\\pi(s),\\pi(t)} = \\mathbb{E}_{i \\sim [m]}\\, \\mathbb{E}_{j,t \\in [m] \\setminus \\{i\\} : j \\neq t} A_{i,j} A_{i,t} \\leq \\mathbb{E}_{i \\sim [m]} \\left( \\mathbb{E}_{j \\neq i} A_{i,j} \\right)^2 = \\mathbb{E}_{i \\sim [m]} (\\bar{A}_i)^2, \\tag{5}$$\n\nwhere we denote $\\bar{A}_i = \\mathbb{E}_{j \\neq i} A_{i,j}$ and in the inequality we used again Lemma 1.\n\nFinally, the bound on the variance follows by observing that the number of summands in $I_1$ is $k^2-k$ and the number of summands in $I_3$ is $O(k^3)$. This concludes our proof.\n\nLemma 1 Let $v \\in \\mathbb{R}^n$ be any vector. Then,\n\n$$\\mathbb{E}_{s \\neq t} [v_s v_t] \\leq (\\mathbb{E}_i [v_i])^2.$$\n\nIn particular, if $\\mathbb{E}_i[v_i] = 0$ then $\\sum_{s \\neq t} v_s v_t \\leq 0$.\n\nProof For simplicity, we use $\\mathbb{E}[v]$ for $\\mathbb{E}_i[v_i]$ and $\\mathbb{E}[v^2]$ for $\\mathbb{E}_i[v_i^2]$. Then:\n\n$$\\begin{aligned} \\mathbb{E}_{s \\neq t} v_s v_t &= \\frac{1}{n^2-n} \\sum_{s=1}^{n} \\sum_{t=1}^{n} v_s v_t - \\frac{1}{n^2-n} \\sum_{s=1}^{n} v_s^2 \\\\\\\\ &= \\frac{1}{n^2-n} \\sum_{s=1}^{n} v_s \\sum_{t=1}^{n} v_t - \\frac{1}{n^2-n} \\sum_{s=1}^{n} v_s^2 \\\\\\\\ &= \\frac{n^2}{n^2-n}\\, \\mathbb{E}[v]^2 - \\frac{n}{n^2-n}\\, \\mathbb{E}[v^2] \\\\\\\\ &= \\frac{n}{n^2-n} (\\mathbb{E}[v]^2 - \\mathbb{E}[v^2]) + \\frac{n^2-n}{n^2-n}\\, \\mathbb{E}[v]^2 \\\\\\\\ &\\leq 0 + \\mathbb{E}[v]^2 \\end{aligned}$$", "award": [], "sourceid": 765, "authors": [{"given_name": "Oren", "family_name": "Tadmor", "institution": "OrCam"}, {"given_name": "Tal", "family_name": "Rosenwein", "institution": "Orcam"}, {"given_name": "Shai", "family_name": "Shalev-Shwartz", "institution": "OrCam"}, {"given_name": "Yonatan", "family_name": "Wexler", "institution": "OrCam"}, {"given_name": "Amnon", "family_name": "Shashua", "institution": "OrCam"}]}