
Submitted by Assigned_Reviewer_15
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper provides an application of the tensor decomposition method on a crowdsourcing problem. The setup is the following: there are m workers labeling n items for a set of k labels. We observe the labels z_{i, j} in {1...k} for i=1...m and j=1...n. We want to infer the true labels y_{j} for j=1...n, where the true labels are generated by a set of multiview models corresponding to the workers.
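For concreteness, the generative process under review can be sketched as follows (an illustrative toy simulation written by me, not code from the paper; the variable names and sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 200, 3  # workers, items, label classes (toy sizes)

# True labels y_j and per-worker confusion matrices C_i, where
# C_i[c, l] = P(worker i reports label l | true label is c).
y = rng.integers(0, k, size=n)
C = np.stack([
    0.6 * np.eye(k) + 0.4 / k  # mostly-correct workers with uniform noise
    for _ in range(m)
])

# Observed labels z[i, j] drawn from the row of C_i selected by y[j].
z = np.array([[rng.choice(k, p=C[i, y[j]]) for j in range(n)]
              for i in range(m)])
```

The inference task is then to recover y (and the C_i) from z alone.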
The proposed recovery of the underlying model parameters is a straightforward application of the tensor decomposition method for multiview models (Anandkumar et al., 2012). The authors also perform a local search via EM following the spectral method: initializing EM with spectral estimates has been shown to be effective (Chaganty and Liang, 2013), as acknowledged by the authors.
Notably, however, the authors analyze the maximum-likelihood estimates of EM in this setting and show that they recover the true labels whp. They also show the optimality of the convergence rate with a minimax argument.
To my knowledge, an attempt to analyze the behavior of EM initialized with spectral estimates is novel, and I think it's a fruitful and worthwhile direction.
Specific comments:
- It's probably helpful to more explicitly account for why you need the EM step for the results in Theorem 2. Wouldn't it be possible to argue similar results directly on the spectral estimates? If not, why?
- On a related note, the experiments are missing an obvious baseline: using the output of the spectral stage. It's probably much worse than following up with EM, but it's an important experiment.
- Change "(l, c)th" to "(c, l)th" in the Introduction.
- As for the threshold delta, smoothing might be a cleaner solution and may give similar effects.
- A conclusion is needed.
Post-feedback comments:
- I understand the spectral method depends on various problem-specific constants, but it's still not clear to me why "its convergence rate is suboptimal". Please clarify.
- I was suggesting an experiment of performing only the E-step on the spectral estimates. I'm assuming that Opt-D&S: 1st iteration in Figure 1 still performs an entire iteration of EM followed by another E-step?
Q2: Please summarize your review in 1-2 sentences
This paper provides an application of the tensor decomposition method and EM to a crowdsourcing problem. It also presents an analysis of estimation consistency and convergence optimality.

Submitted by Assigned_Reviewer_17
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Thank you to the authors for their thoughtful feedback. I agree that there are pros and cons to both method-of-moments and optimization methods, and a detailed discussion of them is not needed in this paper (which, after all, aims to unite them!). But since the goal is to analyze an algorithm for a specific, practical problem, I think it is important to at least make sure the assumptions (the main one being that the Dawid-Skene model is the process actually generating the data) are clear so that a reader can decide whether they are plausible for a given application.
Finally, though the dependence on m may sometimes be more important than the dependence on n, I think it is important to explain this limitation when discussing optimality in the abstract and introduction, where the claims currently seem overstated. Even weakened in this way, they will still be interesting claims backed up by strong results.
previous review below
In this work the authors propose a two-stage method for extracting consensus labels from the noisy ones assigned by possibly unreliable annotators. In the first stage, the method applies spectral techniques under the Dawid-Skene model of label noise to obtain estimates of each annotator's label confusion matrix. It then uses these estimates to initialize an EM algorithm, which alternately updates the label estimates and the confusion matrices to improve likelihood. The idea to initialize EM with spectral methods is not new, but in this paper the authors claim to show that the result of the two-stage process is near-optimal in an asymptotic sense. Finally, experiments on several real-world datasets show that the proposed method performs comparably to the best existing method.
As noted in the introduction, EM algorithms (like the one by Dawid and Skene) generally set out to optimize non-convex likelihood functions, which, being hard, makes it difficult to obtain any kind of performance guarantees. The move to spectral methods allows for results like Theorem 1, which shows that, given enough data, parameter estimates will converge to the true parameters. However, there is a tradeoff here, which is not mentioned: the guarantee only applies if the model is correct. That is, if annotators have different confusion tendencies on different items, or if their performance drifts over time, or if they in any other way deviate from the Dawid-Skene model, then Theorem 1 will likely not hold (nor will Theorem 2, in fact). This problem is much less severe for optimization algorithms, like EM, that have the ability to try and fit an imperfect model to real data; they may not have global optimality guarantees, but their objectives continue to guide them in settings where spectral theory simply fails.
The absence of this consideration is at the core of my biggest concern about the paper. In attempting to prove the qualifications of a method for handling the unreliability and unpredictability of third-party annotators, the paper assumes a high degree of regularity in their behavior. To paraphrase a saying, some models are useful, but all of them are wrong; I have mixed feelings about a theory that assumes otherwise. At the very least, a thoughtful discussion of this issue would be a valuable addition to the paper.
Having said all of that, I am excited about the direction the authors of this paper have pursued. In practice, the spectral+EM strategy has become popular; presumably the idea is to get a "best of both worlds" kind of result, where the spectral method hopefully gives a good initialization for EM, saving time and improving the final result, but in any case is probably no worse than random initialization. Because of this I find the idea of theory for the combined algorithm highly interesting. I'm not sure, however, whether Theorem 2 really fulfills this promise. In particular, the assumption that the input C matrices are within alpha of the truth (for alpha that is small compared to the truth) seems to require that the first stage has done basically all of the work. In fact, the E-step used to obtain result (a) of Theorem 2 is just choosing the maximum-likelihood y using the given C. Arguably EM is not really needed at all for the theory.
On the other hand, the experiments almost make the case for the opposite possibility: that EM does most of the work. Whether initialized by the spectral method or simple majority voting, the two EM methods perform about the same. In some cases the spectral initialization leads to immediate convergence, suggesting it has found a good initialization, but even when the initialization is bad, 50 iterations of EM seem to bring it in line. Perhaps, then, the most practical use of this method is as a time-saving technique; it would be interesting to see the runtimes in that case.
I found Theorem 3 a bit hard to parse; presumably the \hat y and \hat \mu in the infima are supposed to be mappings from the observed data to the quantity in question? At first glance they appear to just be concrete values, which makes the claims seem a little strange. In any case, I found the theorem itself interesting, but I'm not sure I agree that it shows that the convergence rate is "near optimal" since it doesn't address n, which is usually the default convergence clock and arguably has the more dubious scaling in Equation 9, with large powers of k and a hard-to-interpret singular value. In my opinion this makes the optimality claims in the abstract and intro somewhat overstated.
Q2: Please summarize your review in 1-2 sentences
Overall, I have mixed feelings about this paper. On the one hand, the work tackles an interesting (and difficult) problem, managing to successfully bring together several different types of theory as well as fairly extensive empirical analysis. On the other hand, I'm not convinced that the theory is as useful as it is made out to be, both because the assumptions are strong, and because the lower bounds do not consider n. The experiments suggest that the biggest benefit might be in runtime, not error rate.

Submitted by Assigned_Reviewer_37
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a two-stage algorithm for estimating the parameters of a Dawid-Skene estimator (used for inferring true labels from noisy crowdsourced data). The algorithm relies on initializing EM using the method-of-moments (MoM), where the MoM portion of the estimation is performed via spectral (i.e., tensor) decomposition. To my knowledge, this is the first work to provide (relatively) strong optimality guarantees on such a combination. Work by Chaganty et al in ICML 2014, for example, provides weaker guarantees on combining *composite* likelihood EM with spectral learning (though that work considers a more general latent-variable context). The authors show not only that the MoM-initialized EM algorithm converges to a globally optimal solution, but also that the convergence rate is optimal in a minimax sense. These results do rely on a rather restrictive assumption (see comment 1 below), and the "problem-specific constants" in the bound (i.e., those that are ignored in the comparison to the optimal minimax rates) are quite large (see comment 2 below). Nevertheless, the theoretical analysis is quite thorough, potentially providing insights for how MoM-EM hybrids can be analyzed in more general contexts, and the empirical results demonstrate that the algorithm is competitive with other state-of-the-art techniques, outperforming the alternatives in a significant proportion of cases.
Overall, this is a well-written paper and is a significant contribution to both the crowdsourcing and spectral learning communities. At a high level, I would like to see more discussion of what aspects of this particular problem made the elucidated bounds obtainable (i.e., what specific properties of this problem were crucial in proving the optimality of the MoM-EM combination), as this combination is not provably optimal in general latent-variable settings, but I understand that space does not necessarily permit such a discussion.
Comments:
1. The theoretical results assume that all entries of the worker confusion matrices are greater than some positive constant, $\rho$. This seems to be a *very* restrictive assumption in that it assumes that the workers always have some positive probability of confusing any two labels, and thus seems to imply that the elucidated convergence rates will only hold in relatively "noisy" labeling problems (i.e., where all pairs of labels may be "mixed up" by workers). I think that the results are still compelling, despite this strong assumption; however, I think that the implications of the assumption should be made more explicit via a brief comment in the primary text. For example, what proportion of crowdsourcing applications can be expected to satisfy this assumption? Do the authors think that the presented real-data experiments satisfy this assumption? (If not, then this is strong evidence that the assumption is not strictly necessary for the empirical performance of the algorithm.)
2. In the bound on $n$ in equation (9), the sample complexity increases rapidly as $\sigma_L$ decreases, and the constant $\rho$ also appears in the denominator. I think that a brief mention of the impact of these terms in the primary text would be useful. Essentially, it seems that the bounds require that $S_{a,b}$ is quite well-conditioned (otherwise $\sigma_L^{13}$ could be extremely small) and that workers are likely to mix up all pairwise label combinations, e.g. that there are not two labels that are almost never mixed up (otherwise $\rho$ could be extremely small). A brief comment on these terms could simultaneously address the question posed in Comment 1 as well. Even a simple intuitive explanation of what sorts of crowdsourcing applications can be expected to be "well-behaved" with respect to these constants would be informative.
Minor Comments:
- Line 260: Missing "$\min$" in the definition of $\pi_{\min}$.
- In Table 2, for the RTE data, the entry for OPT-D&S is bolded but the value is the same as for MV-D&S. Both or neither should be bolded.
Post-Feedback Comments:
W.r.t comment 1 above, the authors pointed out that this problem can be alleviated by adding a small amount of noise to the workers' labelings. This is certainly true, though it leads to a somewhat odd situation where adding noise is necessary to make the theory work by ensuring a positive $\rho$ but also degrades performance by decreasing $\bar{D}$. Essentially there is a tradeoff between $\rho$ and $\bar{D}$ (as larger $\rho$'s improve the bound for $n$) and it would be interesting if the authors could comment on a heuristic choice for the amount of noise to add, given this tradeoff. Also, is this perturbation actually necessary in practice (e.g., did the authors perform such a perturbation in the empirical analysis)?
W.r.t comment 2, the authors have agreed to address this issue via some more comments in the final draft.
As a final note, an important point that came up during discussion of the paper was that the claims of optimality, though well-founded, somewhat obscure an important issue. Specifically, the optimality of convergence is w.r.t. $m$ and not $n$, whereas it is perhaps more standard for $n$ to be used. The authors gave reasonable justification for this in their response, but the paper would be strengthened by the addition of such an explicit, brief justification.
Q2: Please summarize your review in 1-2 sentences
This well-written paper presents a compelling application of the spectral method of moments with thorough empirical and theoretical analyses. It should be of interest to researchers in both the fields of crowdsourcing and spectral learning.

Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.

We thank all reviewers for their careful reading and constructive comments.
Assigned_Reviewer_15
The EM step is essential here to achieve the optimal rate. As shown in Theorem 1, the spectral method provides a consistent estimator, but its sample complexity depends on a large problem-specific constant such that its convergence rate is suboptimal. In our algorithm, we first use the spectral method to obtain an $\alpha$-accurate initialization (Eq. (8)), and then employ EM to boost it to optimal (Theorem 2b).
In addition, it is worth noting that the spectral method only outputs an estimate of the confusion matrices. We need post-processing to obtain an estimate of the true labels. Here we perform an E-step, which corresponds to calculating the posterior distribution of labels. As shown in Theorem 2 and Theorem 3, this procedure results in an optimal estimate of the true labels. An empirical evaluation of this procedure is plotted in Figure 1 as "Opt-D&S: 1st iteration".
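This E-step post-processing can be sketched as follows (a generic implementation of the Dawid-Skene posterior computation, not the authors' code; the function name, array shapes, and toy prior are my own):

```python
import numpy as np

def e_step_labels(z, C_hat, pi_hat):
    """Posterior over true labels under the Dawid-Skene model, given
    confusion-matrix estimates C_hat and a class prior pi_hat.

    z: (m, n) observed labels; C_hat: (m, k, k) with
    C_hat[i][c, l] = P(worker i reports l | true label c); pi_hat: (k,).
    Returns the (n, k) posterior and the MAP label estimate per item.
    """
    m, n = z.shape
    # Accumulate log P(y_j = c) + sum_i log C_hat[i][c, z_ij],
    # working in log space for numerical stability.
    log_post = np.tile(np.log(pi_hat), (n, 1))       # (n, k)
    for i in range(m):
        log_post += np.log(C_hat[i][:, z[i]]).T      # columns picked by z[i]
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return post, post.argmax(axis=1)
```

With accurate C_hat, the argmax of this posterior is exactly the maximum-likelihood label choice discussed in the reviews.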
Assigned_Reviewer_17
We may have a somewhat different perspective on the spectral method and EM than is taken by the reviewer. Inferentially, we note that the spectral method is the method of moments (MoM) and EM is the method of maximum likelihood (ML). Both of these general approaches to inference require models for their analysis, in particular for analysis of consistency and rates of convergence, the focus of our paper, and the focus of much literature in theoretical statistics. Another kind of analysis is performance under model misspecification, which we agree is of significant interest, but which is beyond our current scope. Suffice it to say that the MoM and ML have advantages and disadvantages under various kinds of model misspecification and it will require a major effort of its own to explore these issues.
We want to clarify that both the spectral method and EM are necessary in the theoretical perspective we have presented. Without the spectral initialization, the EM output is heuristic. Without the EM algorithm, the spectral method cannot attain the optimal convergence rate. When they are combined, the success of the EM algorithm only requires a roughly accurate initialization, which overcomes the limitations of both. Please refer to the feedback to Reviewer 15 for more detail on this issue.
Empirically, we are happy to report the runtime of our algorithm in our revision.
As for the lower bound, both $\hat y$ and $\hat \mu$ are estimators that map the dataset to the target values. The infimum chooses the best possible estimator. Theorem 3 shows that both the required number of workers m and the error of estimating the confusion matrix are nearly unimprovable, for **any** value of n. The second part of constraint (9) (involving n) might be improvable. In real datasets we usually have n >> m, so we think that the optimality in m is more important than that in n.
Assigned_Reviewer_37
We will provide a more intuitive explanation of the impact of the problem-specific constants, and discuss when the conditions are satisfied by real problems.
The assumption that all confusion entries must be nonzero can be addressed by a simple argument. Suppose that the underlying model doesn't satisfy the assumption. Given any observed label, we replace it by a random label in {1,...,k} with probability $\rho$. In this modified generative model, every entry of the confusion matrix is lower bounded by $\rho/k$. If we run the proposed algorithm on this perturbed dataset, then the theory goes through and Theorem 2 holds. The random perturbation will make the constant $\bar D$ smaller than its original value, but the change is minor for small $\rho$.
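This mixture argument is easy to check numerically (an illustrative sketch of my own, not from the paper): replacing a reported label by a uniform one with probability $\rho$ turns each confusion matrix C into (1-\rho)C + (\rho/k)11^T, whose entries are bounded below by \rho/k.

```python
import numpy as np

rho, k = 0.1, 3
C = np.eye(k)  # an extreme worker who never confuses any two labels

# Effective confusion matrix after the random-relabeling perturbation:
# with prob. 1-rho keep the reported label, with prob. rho report
# a uniform label, contributing rho/k to every entry.
C_pert = (1 - rho) * C + rho / k

assert C_pert.min() >= rho / k              # every entry is now positive
assert np.allclose(C_pert.sum(axis=1), 1.0)  # rows remain distributions
```

The diagonal shrinks from 1 to 1 - rho + rho/k, which is the (minor, for small $\rho$) degradation of $\bar D$ mentioned above.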