Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
- The authors propose a way to perform structured output prediction when the dependence between 'parts' are localized. The model is learned by breaking the structure into parts and performing kernel ridge regression on the parts. They show elaborate convergence rate analysis in the estimation. The theoretical analysis is the strong part of this paper. - One of my concern for this paper is whether the locality assumption made in this paper is valid in general. In a lot of computer vision and NLP applications the latest research is about capturing long range dependencies. The correlation in Figure 1 is highly concentrated at the central patch because it's the average of many different images, but on individual images the correlation patten can be very different. But the learning method does not capture these dependencies by breaking the learning problem as part-based regression. It would be useful to test this assumption on more real-world applications. - One difficulty with doing this learning by parts is inference, i.e., how to put the parts back together to make one single prediction during test time. The authors suggest using stochastic gradient in the evaluation of (7). But given the non-differentiability of most structured prediction problem wrt to the output Z, I am not sure this general scheme would perform well. Also, Algorithm 3 given in the Appendix involves projection onto the set of feasible structures Z, which can be very difficult. - Another concern for this paper is the range of applicability of this model. The method is evaluated on a single fingerprint direction prediction problem, and compared against several kernel ridge regression based methods. It would be more interesting to test on more traditional problems such as sequence learning, segmentation with MRF, etc, to see if the locality assumption used in the new method actually leads to any improvements. Given this is described as a general structured prediction method, more than one example application should be included to illustrate the model. For example, the authors can quite easily implement Example 2 suggested in their paper. - Given the proposed method is quite new compared to traditional structured prediction approaches, I believe more examples and experiments need to be given to make sure the model is not a just method with nice theoretical properties but limited real-world applications.
Strengths: The paper is very clear and the small parts remaining dark (such as what the authors refer to when dealing with unclearly defined parts or losses decomposable by part in general) are studied in detail in the supplementary. This works is a natural follow up to the previous work by Ciliberto et al which introduced a learning bound similar to Theorem 4 in a more general setting. Showing how to take advantage of the specificities of the data and characterising this in terms of improvement on statistical guarantees is very appreciated in this domain. Weaknesses: Despite the shown results and the details added in the appendix K, I think that the experimental part remains the weak part of this paper. The results displayed are convincing but I am disappointed that the authors did not tried their approach on more popular problems mentioned in the supplementary such as hierarchical classification. Even if this could be improved (in order to be at the level of the theoretical treatment), the proposed content is already solid and does not change my decision concerning the quality of this work. Remark: - The use of the sequence example at different step of the paper is really useful, however I'm a bit surprised that you mention in Example 2 a 'common' practice in the context of CRF corresponding to using as a scoring loss the Hamming distance over entire parts of the sequence. I've never seen this type of approach and am only aware of works reporting the hamming loss defined node wise. It would be great if you could point out some references there. - After reading the paper a few times, I still think that the notation $\Delta(z,y|x)$ is a bit strange and I would have preferred something of the form $\Delta(z,y)$ since in practice the losses you mention never takes into account the input data and $z$ is already a function of $x$. Maybe this is only personal taste and will be contradicted by the other reviewers. Minor remarks : missing brackets [ ] in theorem 4.
Originality ------------ The authors propose a new setting for structured prediction, and relates it well to the state of the art. By developing their new framework, the authors are also able to explain the efficiency of some existing structured prediction method (restriction kernel). Quality --------- Although I didn't go through all the proofs in the appendix, the paper looks technically sound. All the claims presented are proven in detail. The authors provide a very complete piece of work, and the appendix extend even more the proposed framework. Clarity -------- The paper is clearly written and organized. The authors provide all the elements to understand the paper, and it should be enough to reproduce the results. Figure 3, the Left plot is very difficult to read in black and white. Significance ---------------. The presented results are significant as it address a difficult task by exploiting structure in the data, and improve over existing methods. The proposed algorithm is proven to be consistent and the authors provide and study generalization rates, justifying the usefulness of the method. I have read the author response and my opinion remains the same.