NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 75
Title: Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition

Reviewer 1

Summary

The paper presents an interesting extension to CNN training, employing an incremental boosting technique to train CNNs for facial Action Unit recognition. The method is evaluated on two publicly available datasets, and the paper includes a reasonably thorough evaluation of the algorithm. The results look promising and improve on CNN-based state-of-the-art performance for AU recognition.

Qualitative Assessment

The need for incremental boosting rather than regular boosting arises from the use of mini-batch training. However, the authors do not discuss the option of not using batching: would not using batching lead to the same algorithm for the boosted CNN and the incrementally boosted CNN? This should be clarified more in the paper (a sketch of the distinction I have in mind is given at the end of these comments).

- Is the number of epochs always 1? From Algorithm 1 it appears that only mini-batches are used. If there are no epochs, do the same batches ever get repeated? This should be clarified in the paper.
- Why does the paper use Equation 1 to simulate the sigmoid function rather than using the sigmoid directly? How is \lambda_j determined? This should be clarified.
- Step 6 in Algorithm 1 is not explained in detail: how are the weights alpha computed? How are the active features selected by boosting? It would be interesting to see how many features are selected by boosting, how this is affected by the number of iterations, and whether the number differs a lot across different AUs.
- It is interesting that IB seems to be more robust to the learning rate and to the number of input neurons. Why do the authors think that is the case? A discussion would be nice.
- Would it be possible to extend the approach to multi-task learning? This is especially relevant for AU recognition, as training each individual AU is very time consuming.
- Why was the BP4D dataset not used for evaluation alongside DISFA and SEMAINE? It is the largest publicly available annotated dataset of AUs, and it would be interesting to see how incremental boosting would work on it.
- For reproducibility it is important to know more details about the type of warping applied to the images: which landmarks are selected, and how the triangulation and warping are performed. Is it an affine warp, a piece-wise affine warp, or some sort of frontalization?
- For training the CNN, how was validation performed? Was early stopping implemented?
- It would be interesting to see the analysis of the parameter eta repeated on the DISFA dataset: does the same trend appear there?
- The results in Tables 1 and 2 would possibly be clearer if the ratio of occurrence for each AU were shown, especially if boosting helps specifically for rare AUs.
- The paper mentions that AU-coded training images are limited, giving the numbers of images in the datasets. While the numbers of actual images may appear high, the number of people and the general diversity of the data are actually quite low. This fact could be expanded upon in the motivation.
- In order to claim significant improvements in performance, statistical tests are needed (Section 4.6).

Trivia:
- line 18, unites -> units
- line 77, "the learning based", should it be "the feature learning based"?
- a citation is not a noun; instead of writing "Different from [24]" you should write "Different from Asthana et al. [24]"
- Fig 1. -> Figure 1
- line 109, the images patches -> the image patches
- The spacing above Equation 9 is too small, which makes it difficult to read
- Avoid using words like "dramatically" when talking about performance improvements
- There seems to be a spacing issue around Table 1
- Table 2 spacing is too small
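To make the batching question above concrete, here is a minimal toy sketch of the distinction I have in mind between re-fitting the booster on each mini-batch and accumulating it across batches. All names and details (boost_one_batch, the stump fitting, the running average, the hypothetical mini_batches iterable) are my own illustration, not the authors' Algorithm 1:

    # Toy illustration only: contrasts per-batch boosting (B-CNN style), where the
    # booster fitted on batch t replaces the previous one, with an incremental
    # running combination across batches (IB-CNN style, as I read it).
    import numpy as np

    def boost_one_batch(feats, labels, n_rounds=5):
        """AdaBoost-style stump selection on one mini-batch.
        feats: (n, d) float array; labels: (n,) array of +1/-1.
        Returns a per-feature weight vector alpha (0 for unselected features)."""
        n, d = feats.shape
        w = np.full(n, 1.0 / n)
        alpha = np.zeros(d)
        for _ in range(n_rounds):
            candidates = []
            for j in range(d):
                pred = np.where(feats[:, j] > feats[:, j].mean(), 1, -1)
                err = float(np.sum(w * (pred != labels)))
                candidates.append((err, j, pred))
            err, j, pred = min(candidates, key=lambda c: c[0])
            err = min(max(err, 1e-6), 1 - 1e-6)
            a = 0.5 * np.log((1 - err) / err)
            alpha[j] += a                    # weight of the selected stump/neuron
            w *= np.exp(-a * labels * pred)  # re-weight the samples in the batch
            w /= w.sum()
        return alpha

    # With a single full batch (no mini-batching) the two schemes coincide,
    # which is exactly my question above. With mini-batches, the incremental
    # variant keeps a running average instead of discarding earlier selections.
    alpha_acc = None
    for t, (feats, labels) in enumerate(mini_batches, start=1):  # mini_batches: hypothetical
        alpha_t = boost_one_batch(feats, labels)
        alpha_acc = alpha_t if alpha_acc is None else ((t - 1) * alpha_acc + alpha_t) / t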

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

This paper proposes a classifier boosting approach with weak classifiers chosen as convolutional neural networks. Typical backpropagation learning is extended to the boosting layer, and feature selection is performed with the help of the strong and weak classifiers. A variety of experiments have been conducted on two facial AU databases. A linear-combination L_2 boosting error is applied for empirical minimization. The proposed IB-CNN approach differs from the B-CNN by forming the hypothesis of the strong classifier incrementally. A comprehensive model selection procedure is carried out to justify the promise of the proposed architecture.

Qualitative Assessment

The incremental boosting idea for better feature selection and enhanced classification makes sense, but the boosting idea could be better motivated as an alternative to building deeper and deeper networks. In that case, a comparison with a deep network, showing that boosting is an alternative with some theoretical guarantees, would be justified. The IB-CNN can be claimed to have intermediate complexity between the CNN and the B-CNN (incrementality ensures smoothness in the boosting model space).

The authors claim (line 56, line 117) to obtain more complex decision boundaries and to alleviate overfitting. However, overfitting usually arises from complex boundaries. Please clarify this in detail. Furthermore, it would be nice to see sample boundaries by visualizing them in 2D.

Although the results look promising, could you please discuss why the B-CNN performs poorly on the 2AFC (SEMAINE) and F1 (DISFA) measures compared to the base CNN? This discussion could also continue with why the IB-CNN works. Is the same setting used for the other CNN baselines? It would be helpful to indicate this. How does variance in the mini-batch size affect the performance of incremental boosting? Is 100 chosen heuristically or experimentally?

The IB-CNN is shown to perform better than the CNN and the B-CNN; however, a comparison to the state of the art in facial AU recognition is not given. Such comparisons should be included in the paper to show the effectiveness of the proposed approach. The words "limited" and "insufficient" are very much alike; it is better to use only one of them. Especially in the introduction, the repetitions are confusing.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

This paper presents an Incremental Boosting CNN (IB-CNN) for recognizing Action Units. The system is particularly appropriate for training CNNs when only a small amount of labeled data is available. The authors compare their architecture with a traditional CNN architecture, showing an improvement in Action Unit recognition when using the proposed IB-CNN.

Qualitative Assessment

The paper proposes a new CNN framework integrating boosting with an incremental learning approach. The presented results show that the proposed approach performs better than a traditional CNN. The results also show an improvement over other state-of-the-art CNN methods on the problem of Action Unit recognition. The paper presents interesting ideas that can be used for other problems with small training datasets.

In my opinion there are some aspects of this paper that need more clarification or a deeper discussion. For instance, the weak classifier h seems more complicated than one would expect. I am sure there is a justification for using a one-level decision tree and for using the parameter eta as described to control the slope of the function, but these can look like arbitrary choices if they are not better justified (an illustrative form is sketched at the end of these comments). Actually, the way theta is defined is a little bit confusing.

On the other hand, deeper discussion and interpretation in Section 4.5 would be nice to see. For example, I see that the parameter eta does affect the F1 score significantly when it is varied from 0.5 to 16, while the range from 0.1 to 0.5 shows a much bigger impact. What is the intuition behind that? The same happens with the analysis of the number of input units in the IB layer. From the experiments it is clear that it affects the result, but I think the authors should explore why in more depth. Perhaps visualizing somehow what the units are doing could be an interesting strategy.
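For illustration, one standard way to realise a one-level decision tree with a tunable slope (my own example to make the question concrete, not necessarily the paper's Equation 1) is a scaled sigmoid around a per-neuron threshold T_j:

    h(x_j; λ_j) = 2 / (1 + exp(-η (x_j - T_j))) - 1

As η grows, this approaches a hard ±1 stump, while small η gives a nearly linear response. If the paper's formulation behaves similarly, that would also explain why the F1 score is most sensitive in the small-η range, and spelling this out would make the design choice look much less arbitrary.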

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

The paper presents and validates a new method to improve classification of facial action units (AUs) from single frames. The challenges and applications of this problem are discussed. The presented method is based on a traditional convolutional neural network (CNN) with a boosting algorithm on top of fully connected features (B-CNN). The authors suggest an incremental B-CNN (IB-CNN), in which the weights of the boosting classifiers are trained in a soft way, better accumulating information from all batches. The theoretical formulation and a description of the method are presented. In the experimental part, classification results on two quite popular datasets are presented and compared with baseline methods (CNN and B-CNN) and with previous works; the improvements are noticeable. Additional experiments analyzing properties of the developed method are also presented.

Qualitative Assessment

Some answers in the rebuttal were useful for me, so I keep the same scores.

“Technical quality”
The overall quality of the experiments is good.
- Ideally, it would also be great to present results on the CK+ database, to compare this submission with [21] (which is very related) and to better compare it to [28].
- Line 249: “Data analysis of the parameter η”: the interpretation and analysis of the result of this experiment is missing. Maybe it is not necessary to present the results of this experiment?
- It would be interesting to see how many active neurons (i.e., those with α_j > 0) there were in each case in the experiment corresponding to Fig. 5. Is the proposed IB-CNN more robust to the number of input neurons because the number of active neurons is always the same after training?
- For some AUs the results are worse than using a CNN or a B-CNN on both datasets. Do the authors have any explanation?
- Also, how did the authors choose β (beta) in (2) and (8)?

“Novelty/originality”
The core idea of incrementally updating the weights of the classifiers (Eq. (5)) seems to be novel. However, little is said about the work on “Facial Expression Recognition via a Boosted Deep Belief Network” (ref [21] in the submission), which seems to be more related than the other works discussed. For instance, the definition of h(x_ij, λ_j) in (1) of the submission is analogous to (7) in [21], whereas (2) is analogous to (8) in [21], and (4) is very similar to (9) in [21]. So [21] should be better credited.

“Potential impact or usefulness”
The paper presents results for facial action classification. Experimentally, the proposed method is shown to be useful for this task only. But the method is quite general and can potentially be useful in any task where the number of (positive) training samples is small, i.e., it can be applied to other (image, video) datasets. As the authors say in lines 288-289: “The IB-CNN is a general machine learning method and can be adopted by other learning tasks, especially those with limited and insufficient training data”. Ideally, this conclusion would be supported in the paper by testing the method on other databases, not necessarily facial expressions, to extend its impact. For instance, most video datasets with human actions (KTH, HMDB51) have a relatively small number of samples per class. To sum up, the paper could have a greater impact (i.e., oral level) if it were more general from the experimental perspective, something like “Incremental Boosting Convolutional Neural Network for Limited Training Data”. However, admittedly, methods particularly suited for facial action recognition can indeed be very useful in practice.

“Clarity and presentation”
Overall, the paper is well written, except for the following issues, which can be fixed in a short time.
- There must be more space between paragraphs, headings, tables and figures. So the authors need to remove something less significant (e.g., "Data analysis of the parameter η") from the final version of the paper. Headings of sections, subsections and paragraphs should have consistent case. Please consult the NIPS style file.
- For Tables 1 and 2, three digits after the decimal point do not look good; consider “30.4±6.5” instead of “0.304±0.065”, and then also replace “F1” with “F1x100”. This will free some space, which can be used to add columns describing what AU2, AU12, etc. are and, more interestingly, their frequencies in the datasets. For instance, the authors say in lines 287-288: “Furthermore, the IB-CNN is more effective in recognizing infrequent AUs with insufficient training data” and in lines 279-280: “To deal with insufficient positive samples in a mini-batch.” In Tables 1 and 2 it would be great to provide some numbers relating to the frequencies of occurrence to support these statements.
- The paragraph starting at line 83 is a little bit confusing. The authors say that “A few approaches developed new objective functions to improve recognition performance...”. Then, in the same paragraph, they say that “Hinton et al. [23] utilized a dropout ...”, which is not really related to the objective functions nor to making decisions (dropout is not used during inference, so it is still the inner product). Consider polishing this paragraph. It is probably better to say “A few works developed improved training procedures...”.
- Fig. 1 has descriptions for (a) and (b), but there is no (a) and (b) in the figure.
- The authors should follow punctuation rules for expressions that are part of sentences (in fact, all expressions (1)-(10) in the paper are missing commas and periods).
- Line 142: “However, the information captured previously, e.g., the weights and thresholds of the active neurons, is discarded for a new batch.” And the related line 170: “Compared to the B-CNN, the IB-CNN exploits the information from all mini-batches.” Although Fig. 2 compares a B-CNN and an IB-CNN, in the text these sentences may sound confusing. Could the authors specify more concretely what is discarded and what is not, i.e., which weights and thresholds? Otherwise, it sounds as if all information from the previous batch is discarded, and it is then not clear how the network can be trained.
- Can the authors resolve the issue with the references to make them hyperlinked?

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

A novel method for an incremental boosting CNN is proposed. It selects discriminative neurons in the fully connected layer via incremental boosting on successive data mini-batches. It is argued that this approach improves the generalisation of CNNs in small-sample-size problems. In experiments on facial AU recognition using two benchmarks, the proposed method shows improvements over baselines and two state-of-the-art methods.

Qualitative Assessment

Below are some questions and comments, followed by my overall review opinion.

It is not discussed whether or not the proposed boosting-based method is able to handle multi-class problems, or only binary-class problems. Defining the loss function as the summation of a strong-classifier loss and weak-classifier losses is quite empirical; experimental evidence and/or more justification is needed (see the sketch below). Should an incremental boosting CNN necessarily be better at generalisation? An attempt to do more optimisation rather than randomisation, as e.g. in [23], generally leads to more overfitting in my opinion.

The proposed method (idea and formulation) is fairly straightforward from existing boosting theory and incremental boosting. I am not certain it is a convincing idea to treat the fully connected layer as weak classifiers and apply an adaptive feature selection strategy: what if we applied this to the convolutional layers, or to both? Overall, the methodological (or formulation-wise) contributions are not major.

In the experiments, was the pre-processing (66 landmarks using the state-of-the-art [25], and face alignment and warping using the landmarks) common to all compared methods? Or was it exploited more to help obtain better accuracies for the proposed method? It is somewhat predictable to see better accuracies compared to the boosting CNN, which does not utilise all data mini-batches together. On average, the improvements obtained over standard CNNs are about 5% in F1 and 2AFC on the two benchmarks. Tables 3 and 4 show comparisons with state-of-the-art CNN methods [11, 28], where the proposed method shows a fair improvement. However, it is less clear how significant these improvements are, whether they correspond to the intuitions behind the method-wise differences, and whether the same or similar pre-processing was used in all methods. The parameter analyses are fairly sufficient (some aspects are still lacking) and help support the proposed method.

In summary, the method-wise contributions are marginal; the ideas and formulations follow straightforwardly from boosting algorithms, and the intuitions behind the proposed method (avoiding overfitting, small data size, etc.) are not very certain. The obtained improvements seem good, but not that impressive.
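For reference, the kind of objective I read the paper as combining (my generic paraphrase for the sake of the argument, not the authors' exact Eq. (2)/(8)) is of the form

    E = E_strong(H(x), y) + β · Σ_j E_weak(h(x_j; λ_j), y),

i.e., a strong-classifier term plus a sum of weak-classifier terms traded off by β. My concern is that this particular additive combination, and the choice of β, appear to be justified mainly by the final accuracies rather than by analysis or ablation.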

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 6

Summary

In this paper, the Incremental Boosting CNN model (IB-CNN) is proposed for facial action unit recognition. The IB-CNN combines boosting, a Convolutional Neural Network (CNN) and incremental learning. Specifically, it selects features (from the fully connected layer) of the CNN to train weak classifiers, which are combined into a strong classifier in a boosting manner. A novel loss function that considers the losses of both the strong and the weak classifiers is proposed. In addition, instead of training the Boosting CNN model using standard back-propagation with mini-batches, the parameters of the weak and strong classifiers are updated incrementally across mini-batches. Experiments are performed on two benchmark databases.

Qualitative Assessment

The proposed method is interesting in that it combines the boosting, CNN and incremental learning ideas into a unified model. There are a few issues with the proposed method.

First, the motivation of the proposed method is unclear. What is the benefit of combining boosting with a CNN? The CNN features are already discriminative. What is the benefit of adding the incremental learning idea? Some justification is needed.

Second, the algorithm is not clearly described. a) How is Eq. (6) obtained from Eq. (5)? Should H_I^{t-1}(.) depend on h^{t-1}(x_{ij}; \lambda_j) and other earlier models, rather than on h^{t}(x_{ij}; \lambda_j)? It seems that h^{t-1}(.) and the other earlier models are gone in Eq. (6). b) In the incremental boosting method, the features and classifiers selected in all iterations contribute to the final result equally. However, the feature selection and the classifiers learned in the previous iterations rely on the initial CNN parameters, which may not be accurate.

Third, there are some issues with the experiments. a) The comparisons of the proposed method with other state-of-the-art works are limited; there is only one baseline method. More algorithms should be included (e.g., [7], [19], [A-3]). b) There should be an evaluation on the CK+ database [A-1][A-2], which is a widely used AU recognition database. c) Why is the B-CNN even worse than the CNN in Tables 1 and 2? This contradicts the justification in the paper that there is a benefit in combining boosting with a CNN.

In summary, the proposed method combines a CNN and boosting with incremental learning. Some details of the proposed method are not well justified, and the experimental evaluation is inadequate.

[A-1] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), Grenoble, France, pages 46-53, 2000.
[A-2] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The Extended Cohn-Kanade Dataset (CK+): A complete expression dataset for action unit and emotion-specified expression. In Proceedings of the Third International Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB 2010), San Francisco, USA, pages 94-101, 2010.
[A-3] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.

-----------------------------------------------------------------

After reading the rebuttal, I have a better understanding of the approach: only the strong classifier with parameters \alpha is incrementally learned, while the weak classifiers and features are learned through the typical gradient descent method. I think it is better to clarify the notation for H^t(.). Therefore, I have improved my rating. I strongly encourage the authors to include more experimental comparisons with other state-of-the-art works on AU recognition on benchmark databases (e.g., CK+) in the revised version.
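To record the notational point that confused me in a form the authors may wish to make explicit: if the incremental strong classifier is defined recursively as, say (my paraphrase, not necessarily the exact coefficients of Eq. (5)),

    H_I^t(x) = ((t-1) · H_I^{t-1}(x) + H_B^t(x)) / t,   with   H_B^t(x) = Σ_j α_j^t · h^t(x_ij; λ_j),

then unrolling the recursion gives H_I^t(x) = (1/t) · Σ_{s=1}^{t} H_B^s(x), so the earlier weak classifiers h^s(.) for s < t are retained implicitly inside H_I^{t-1} rather than dropped. Stating this explicitly around Eq. (6) would remove the ambiguity.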

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)