0, which depends on α, such that
\[
\frac{1}{2}\,\mathrm{vec}(\Delta)^\top \nabla^2 L(\Theta)\,\mathrm{vec}(\Delta) \;\ge\; \frac{1}{2}\,\mathbb{E}\Bigl[\, m_\alpha\bigl(\langle M, \Theta^*\rangle, \langle M', \Theta^*\rangle\bigr) \cdot \langle M - M', \Delta\rangle^2 \Bigr] \;\ge\; \frac{\beta}{d_1 d_2}\cdot \|\Delta\|_F^2, \tag{4.4}
\]
where ∇² denotes taking the second-order derivative with respect to vec(Θ) ∈ R^{d1d2}, and the expectation is taken with respect to the joint distribution of M and M′.

Proof. See §B.7 for a detailed proof.

As shown in Proposition 4.2, the first term in (4.4) is the quadratic form associated with the population Hessian matrix evaluated at Θ. This quadratic form is lower bounded, uniformly over Θ, by the second term, which does not depend on Θ and is in turn lower bounded by β/(d1d2) · ‖∆‖²_F. In other words, under the conditions of Proposition 4.2, the population loss function defined in (4.3) is strongly convex.

Recall that {y_i}_{i∈[n]} consists of the n independent observed responses of the pairwise measurements. The following regularity condition concerns the boundedness of y_i (i ∈ [n]).

Assumption 4.3 (Boundedness). There exists ν > 0 such that max_{i∈[n]} |y_i| ≤ ν.

This boundedness condition is imposed to simplify our subsequent discussion. Even when the response of a pairwise measurement has unbounded support, Assumption 4.3 holds with high probability for a properly chosen ν, which may depend on the sample size n. For example, when Y is a sub-Gaussian random variable, a union bound shows that it suffices to set ν = C√(log n) for a sufficiently large absolute constant C > 0. Furthermore, we remark that this assumption is required only for establishing concentration results; our contrastive estimator remains valid even when the assumption is violated.
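As a numerical sanity check of this sub-Gaussian choice of ν (a minimal sketch; the sample size, the standard normal responses, and the constant C = 3 are illustrative assumptions, not taken from the analysis above):

```python
import numpy as np

# Illustrative setup: n responses drawn from a standard normal
# distribution, a canonical sub-Gaussian random variable.
rng = np.random.default_rng(0)
n = 10_000
y = rng.standard_normal(n)

# By a union bound, max_i |y_i| <= C * sqrt(log n) with high
# probability for a large enough constant; C = 3 is illustrative.
nu = 3.0 * np.sqrt(np.log(n))
print(np.max(np.abs(y)) <= nu)  # True
```

Here the bound holds with overwhelming probability: a violation would require a single draw more than nine standard deviations from the mean.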
Moreover, in this case, a truncation argument can be used to establish a similar rate of convergence.

We are now ready to present the main theorem. Recall that Θ̂ is the contrastive estimator defined in (3.2) and Θ* is the true score matrix with rank r. The following theorem upper bounds the difference between Θ̂ and Θ* in Frobenius norm.

Theorem 4.4. Assume that Assumptions 4.1 and 4.3 hold and that n ≤ C/(α⁴β²) · (d1d2)^{2/3} · log(d1 + d2). Let the regularization parameter in (3.2) be
\[
\lambda = C'\nu \cdot \Bigl( 1 \big/ \sqrt{n \cdot \min\{d_1, d_2\}} + 1/n \Bigr) \cdot \log(d_1 + d_2).
\]
Then the following inequality holds with probability at least 1 − C/(d1 + d2):
\[
\frac{1}{\sqrt{d_1 d_2}} \cdot \bigl\| \widehat{\Theta} - \Theta^* \bigr\|_F \;\le\; C'' \max\bigl\{ 1/(\alpha\beta),\, \nu/\beta \bigr\} \cdot \Bigl( \sqrt{\frac{r \max\{d_1, d_2\}}{n}} + \frac{\sqrt{r\, d_1 d_2}}{n} \Bigr) \cdot \log(d_1 + d_2). \tag{4.5}
\]
Here β is the constant specified in Proposition 4.2.

Proof Sketch. See §A of the appendix for a detailed proof. Our proof consists of two key steps. First, by combining concentration inequalities with a peeling argument, we establish the strong convexity of the contrastive loss function defined in (3.4) within a restricted subset of the parameter space. The major challenge is that the contrastive loss function is a sample average over the virtual dataset V defined in (3.1), in which the data points are dependent. This dependency prohibits us from applying standard concentration inequalities. To overcome this difficulty, we decouple the dependent sample average into a hierarchical average that involves permutations of the data points.
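As a toy illustration of this decoupling idea (a minimal sketch under simplified assumptions — the actual argument works through moment generating functions, see §B.3): the dependent average of a pair function over all ordered pairs equals a hierarchical average, over permutations, of averages over disjoint, hence independent, pairs.

```python
import itertools
import math

def pairwise_average(xs, f):
    """Dependent sample average of f over all ordered pairs (i, j), i != j."""
    n = len(xs)
    return sum(f(xs[i], xs[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

def decoupled_average(xs, f):
    """Hierarchical average: for each permutation, average f over disjoint
    consecutive pairs (which involve distinct data points), then average
    over all permutations."""
    n = len(xs)
    pairs_per_perm = n // 2
    perms = list(itertools.permutations(range(n)))
    outer = 0.0
    for perm in perms:
        inner = sum(f(xs[perm[2 * k]], xs[perm[2 * k + 1]])
                    for k in range(pairs_per_perm)) / pairs_per_perm
        outer += inner
    return outer / len(perms)

xs = [0.3, -1.2, 2.0, 0.7]
f = lambda a, b: (a - b) ** 2
# The two averages coincide exactly, so concentration tools for
# independent pairs can be applied inside the permutation average.
print(math.isclose(pairwise_average(xs, f), decoupled_average(xs, f)))  # True
```

The equality holds by a symmetry count: each ordered pair occupies a given consecutive slot in exactly (n − 2)! of the n! permutations.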
This allows us to eliminate the dependency and establish concentration inequalities via moment generating functions. See §B.1 for the proof of the strong convexity and §B.3 for the details of the decoupling method. Second, we upper bound the gradient of the contrastive loss function at the true parameter in spectral norm. The major challenge here is that the aforementioned decoupling of the dependent sample average prohibits us from applying the standard matrix Bernstein inequality directly. Instead, we resort to a Taylor expansion of the moment generating function and apply the matrix Bernstein inequality to each order of moment. See §B.2 for more details. Finally, by combining the two steps with a geometric analysis, we establish the rate of convergence of the contrastive estimator.

Theorem 4.4 establishes the statistical rate of convergence of the contrastive estimator Θ̂ defined in (3.2). As specified by the parameter space defined in (3.3), roughly speaking, ‖Θ*‖_F scales with √(d1d2). Hence, on the left-hand side of (4.5), we scale the estimation error in Frobenius norm by 1/√(d1d2) for proper normalization. Our result shows that, with a suitable choice of the regularization parameter λ, the rescaled estimation error is of the order √(r max{d1, d2}/n) + √(r d1d2)/n up to a logarithmic factor. In particular, for n ≫ min{d1, d2}, the leading term of the error is √(r max{d1, d2}/n). Note that, since Θ* ∈ R^{d1×d2} is rank-r, the total number of unknown parameters in Θ* is of the order r max{d1, d2}. Hence, our result in (4.5) achieves the optimal rate of convergence for estimating Θ*, up to a logarithmic factor.

It is worth mentioning that the upper bound on the sample size n required by Theorem 4.4 may not be necessary.
It arises from the peeling argument in the proof, which also appears in the previous literature (Lu and Negahban, 2015; Negahban et al., 2017), and could be eliminated by a more delicate analysis. However, for simplicity of discussion, we do not pursue this direction in this paper.

5 Numerical Experiments

In this section we lay out the simulation results and demonstrate the accuracy of the contrastive estimator Θ̂ stated in Theorem 4.4. Throughout our numerical experiments, for the sake of simplicity, we always use an equal number of users and items, that is, we set d1 = d2 = d. We also use the Bernoulli distribution as the distribution of the pairwise measurements, which is specified in (2.1). For fixed d and r, in order to generate an underlying score matrix Θ* ∈ R^{d×d} of rank at most r, we first generate two matrices U ∈ R^{d×r} and V ∈ R^{r×d}, whose entries are independent and identically distributed standard normal random variables. Note that the sum of each row of the underlying score matrix needs to be zero for the sake of identifiability. Therefore, for each row of V, we subtract the sample mean of that row from each entry, that is, we compute Ṽ with Ṽ_ij = V_ij − 1/d · Σ_{l=1}^{d} V_il. Finally, we set Θ* = UṼ. Then, we generate n pairwise measurements following the mechanism specified in §2.

Note that the optimization problem in (3.2) is a convex problem with nuclear norm regularization, which we solve using the proximal gradient method (Cai et al., 2010). For the regularization parameter λ in (3.2), we set λ = 0.5 · 1/√(nd) · log(2d), as suggested in Theorem 4.4.

Recall that Theorem 4.4 implies that, with high probability, the dimension-rescaled estimation error 1/d · ‖Θ̂ − Θ*‖_F is of order 1/√n when r and d are fixed, and increases with both r and d. To support the conclusion of Theorem 4.4, we perform two sets of experiments, whose results are plotted in Figure 1. In the first set, the rank is fixed at r = 2 and different dimensions d ∈ {5, 10, 15, 20} are tested. In the second set, the dimension is fixed at d = 20, and different ranks r ∈ {2, 4, 6, 8} are tested. In both cases, we run simulations with sample size n ∈ {200, 400, . . . , 3000}. In Figure 1 (a)-(b), we plot the rescaled estimation error 1/d · ‖Θ̂ − Θ*‖_F against 1/√n for different choices of (r, d). As shown in the figures, for each fixed choice of (r, d), the points approximately lie on a straight line through the origin, implying that the rescaled estimation error is approximately proportional to 1/√n. Moreover, the slope increases as r or d increases, which is consistent with Theorem 4.4.

Figure 1: Plot of the rescaled estimation error 1/d · ‖Θ̂ − Θ*‖_F against 1/√n, where n is the sample size. In (a), the rank r is fixed at r = 2, and the dimension d is taken as d = 5, 10, 15, 20. In (b), the dimension d is fixed at d = 20, and the rank is taken as r = 2, 4, 6, 8.

6 Conclusion

In this paper, we consider the problem of learning from pairwise measurements where the distribution of the measurements follows a natural exponential family distribution whose base measure is assumed unknown.
For such a semiparametric model, we use a data augmentation technique to construct a contrastive estimator that is consistent and achieves a nearly optimal statistical rate of convergence. Moreover, our estimation procedure is agnostic to the base measure of the exponential family distribution, thus achieving invariance to model misspecification within the natural exponential family.

References

BOUCHERON, S., LUGOSI, G. and MASSART, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.

BRADLEY, R. A. and TERRY, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39 324–345.

CAI, J.-F., CANDÈS, E. J. and SHEN, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 1956–1982.

CANDÈS, E. J. and RECHT, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 717.

CAO, Y. and XIE, Y. (2015). Poisson matrix completion. In Information Theory (ISIT), 2015 IEEE International Symposium on. IEEE.

CHAN, K. C. G. (2012). Nuisance parameter elimination for proportional likelihood ratio models with nonignorable missingness and random truncation. Biometrika 100 269–276.

DIAO, G., NING, J. ET AL. (2012). Maximum likelihood estimation for semiparametric density ratio model. The International Journal of Biostatistics 8 1–29.

GEER, S. A. (2000). Empirical Processes in M-estimation, vol. 6. Cambridge University Press.

GOPALAN, P. K., CHARLIN, L. and BLEI, D. (2014). Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems.

GUIVER, J. and SNELSON, E. (2009). Bayesian inference for Plackett–Luce ranking models. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM.

GUNASEKAR, S., RAVIKUMAR, P. and GHOSH, J. (2014). Exponential family matrix completion under structural constraints. In International Conference on Machine Learning.

JAIN, L., JAMIESON, K. G. and NOWAK, R. (2016). Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems.

KESHAVAN, R. H., MONTANARI, A. and OH, S. (2010). Matrix completion from noisy entries. Journal of Machine Learning Research 11 2057–2078.

KHETAN, A. and OH, S. (2016). Computational and statistical tradeoffs in learning to rank. In Advances in Neural Information Processing Systems (NIPS).

KLOPP, O., LOUNICI, K. and TSYBAKOV, A. B. (2017). Robust matrix completion. Probability Theory and Related Fields 169 523–564.

LAFOND, J. (2015). Low rank matrix completion with exponential family noise. In Conference on Learning Theory.

LEDOUX, M. (2005). The concentration of measure phenomenon, vol. 89. American Mathematical Society.

LIANG, K.-Y. and QIN, J. (2000). Regression analysis under non-standard situations: A pairwise pseudolikelihood approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 773–786.

LU, Y. and NEGAHBAN, S. N. (2015). Individualized rank aggregation using nuclear norm regularization. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on. IEEE.

LUCE, R. D. (2005). Individual choice behavior: A theoretical analysis. Courier Corporation.

NEGAHBAN, S., OH, S. and SHAH, D. (2012). Iterative ranking from pairwise comparisons. In Advances in Neural Information Processing Systems.

NEGAHBAN, S., OH, S. and SHAH, D. (2016). Rank centrality: Ranking from pairwise comparisons. Operations Research 65 266–287.

NEGAHBAN, S., OH, S., THEKUMPARAMPIL, K. K. and XU, J. (2017). Learning from comparisons and choices. arXiv preprint arXiv:1704.07228.

NING, Y., ZHAO, T., LIU, H. ET AL. (2017). A likelihood ratio framework for high-dimensional semiparametric regression. The Annals of Statistics 45 2299–2327.

PLACKETT, R. L. (1975). The analysis of permutations. Applied Statistics 193–202.

RAJKUMAR, A. and AGARWAL, S. (2014). A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning.

SAMBASIVAN, A. V. and HAUPT, J. D. (2018). Minimax lower bounds for noisy matrix completion under sparse factor models. IEEE Transactions on Information Theory.

SHAH, N., BALAKRISHNAN, S., BRADLEY, J., PAREKH, A., RAMCHANDRAN, K. and WAINWRIGHT, M. (2015). Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. In Artificial Intelligence and Statistics.

SOUFIANI, H. A., CHEN, W., PARKES, D. C. and XIA, L. (2013). Generalized method-of-moments for rank aggregation. In Advances in Neural Information Processing Systems.

TROPP, J. A. ET AL. (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning 8 1–230.

YANG, Z., NING, Y. and LIU, H. (2014). On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697.