Paper ID: 727
Title: Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Overview:
The authors propose the Gibbs error criterion for active learning: seek the samples that maximize the expected Gibbs error under the current posterior. They propose a greedy algorithm that maximizes this criterion (Max-GEC). The objective reduces to maximizing a specific instance of the Tsallis entropy of the predictive distribution, which is very similar to Maximum Entropy Sampling (MES), which instead uses the Shannon entropy of the predictive distribution. They consider the non-adaptive, adaptive, and batch settings separately, and in each setting they use submodularity results to prove that the greedy approach achieves near-maximal performance relative to the optimal policy. They show how to implement their fully adaptive policy (approximately) in CRFs, with application to named entity recognition, and implement the batch algorithm with a Naive Bayes classifier, with application to a text classification task.

Quality:
Their algorithm appears to be a sensible approach, with the Gibbs error being closely related to the Bayes error. Although it appears very similar to a number of approaches in the literature (MES, Query by Committee), it exhibits some useful theoretical and practical properties. Primarily, the authors are able to show near-optimal bounds in the adaptive greedy setting with probabilistic predictions; previous approaches have been shown to be optimal only in the non-adaptive or noiseless case. A second advantage is that their objective permits a simpler computation with CRFs when approximating integrals over the posterior with samples; it would be interesting to see a discussion of whether this rearrangement extends to other models.

However, I have a concern about the practicality of the algorithm. In particular, in the non-adaptive/batch case, a sum over the product space of all possible labelings of the batch (S) is required (Eqns. 2, 4). When applying batch Max-GEC to the NB classifier, the authors approximate this sum using Gibbs sampling. Given how large the space of possible labelings of the batch can be, a very large number of samples may be required to get a reasonable estimate; the need to compute/estimate this sum seems to restrict the batch algorithm to either small batches or models in which computing predictions is cheap.

The experiment section is fairly convincing: using two tasks and a number of different datasets, the authors show Max-GEC usually outperforms a number of baselines, although the improvement over LeastConf seems marginal at best. A concern I have is that for the CRF model, SegEnt/LeastConf produce almost as good results as Max-GEC, which is perhaps unsurprising given the similarity between the algorithms. However, for SegEnt/LeastConf the authors use only the MAP hypothesis to compute uncertainties, whereas for Max-GEC they show improvement by integrating over parameters (using samples). They should also compare against SegEnt and LeastConf with parameter integration; these approaches may be more sensitive to accurate estimation of uncertainties, and I am unconvinced that there would necessarily be a performance difference after accounting for parameter uncertainty.

Clarity:
The paper is largely clearly written; however, although the authors do define the notation, I find some of the notational choices somewhat confusing. In particular, the use of a generic p to denote the posterior, as opposed to the prior p_0, is a bit unclear (perhaps it could be subscripted by the data used to train the model). Also, the use of the symbol traditionally reserved for conditioning (as in y_{A|X}) to denote the labeling of X according to A makes the paper harder to read. The use of E(\rho) to denote the set of unlabeled examples looks a lot like an expectation. Regarding expectations, it would also be useful to subscript them with the distribution over which the expectation is taken, e.g., for an expectation under the prior, E_y -> E_{p_0(y)}.

Significance:
Although the optimality proofs provide an interesting insight into this particular criterion and constitute a decent theoretical contribution, the algorithm itself is sufficiently similar to a number of previously proposed approaches that this paper does not represent a very significant practical contribution to the field of active learning.

Q2: Please summarize your review in 1-2 sentences
Maximizing the Gibbs error is intuitively a sensible approach to active learning, and the optimality guarantees presented in the paper support this; furthermore, the experiments with CRFs and NB classifiers show reasonable performance. However, the algorithm itself is very similar to a large number of proposed approaches based on version-space reduction/entropy sampling, and I also have some concerns about the practicality/extensibility of the batch algorithm due to the requirement to compute a sum over an exponentially large space of labelings.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)


The authors consider an active learning policy of choosing examples with the highest Gibbs error. They consider three settings: (1) non-adaptive, (2) adaptive, and (3) batch. The implication is that by choosing the examples with the highest Gibbs error, the overall error probability will be minimized. While this has intuitive appeal, I am unaware of any formal proofs that show this to be the case. Certainly, SVMs rely on a similar strategy (and there are formal proofs). In any event, a reference to any analysis of this choice would greatly improve the paper.

Regarding the non-adaptive policy, the authors state that this is equivalent to selecting examples prior to labeling. However, the greedy selection of equation (2) at least implicitly includes the labels. Otherwise, how is it possible to compute the sequential Tsallis entropy term?

In the adaptive policy, the authors show (in the supplementary material) that the greedy criterion satisfies an adaptive monotone submodular property and hence may exploit known bounds on such reward functions.

The batch setting (a bit misnamed) intersperses posterior updates with greedy selections of small batches of data. It is a compromise between the non-adaptive and adaptive approaches.

One strong criticism is that the notation is constantly being redefined. For example, the Gibbs error of line 110 is defined and then apparently discarded in section 2.1 in favor of \epsilon_g (line 134), which is then discarded in favor of g_{p_0} (line 137). I can appreciate trying to simplify the notation, but these changes make it difficult to follow the arguments.

The authors propose a sampling approach for approximating the Gibbs error in exponential models (e.g., CRFs) and batch Gibbs sampling for Naive Bayes models. Experiments are performed on two tasks (named entity recognition and text classification) using the two models and the associated estimators of the Gibbs error.

Providing the related work section at the end seems an odd choice.


line 108: y_U is described as the "true labeling" of the data. It later appears as a subscript in Eq. 1, implying that it is a random variable. Just after Eq. 1, "for a fixed labeling" also refers to y_U. Please clarify.

line 109 "For a prior p_o..." prior what? This same symbol is used to describe the posterior over mappings given data+labelings in line 098. They do not appear to be the same thing.

line 134: The steps from E_{y_S} to the Tsallis entropy are not obvious (at least to me).

minor comments (i.e. no need to rebut):

Is the set of mappings finite or countably infinite? Please clarify... perhaps it doesn't matter.

line 089: Using p[h] for p_0[h|D] is a bit distracting, as the same notation is used to denote a marginal event probability in the very next sentence.



Q2: Please summarize your review in 1-2 sentences
The authors exploit submodular properties of the Gibbs error and its relation to the Tsallis entropy to establish guarantees for greedy active learning methods.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper considers pool-based active learning and batch-mode active learning using a greedy algorithm that selects examples to label so as to maximize a quantity called the policy Gibbs error. The proposed algorithms can be seen as generalizations of prior work on version-space reduction algorithms, and they benefit from similar (constant-factor) approximation guarantees (based on the adaptive submodularity of the policy Gibbs error).

The method is flexible, and can be used whenever the policy Gibbs error can be computed in practice. The authors evaluate their algorithm with two applications -- entity identification with CRFs and Bayesian transductive naive Bayes models -- with modest improvements on prior work. Overall this is a nice paper in an important area.

The exposition is good, though section 2 could stand to be improved. A sentence defining the Gibbs classifier would be nice.

The authors claim their work does not require an explicit noise model, in contrast to earlier work. It would be nice to point out what noise models their methods can handle (i.e., that the noise model is implicit in the probabilistic model and is limited by computational concerns).

There appear to be some interesting connections between the Tsallis entropy and the progress measures in prior work (where terms like 1 - \sum_i p_i^2 often appear).
Q2: Please summarize your review in 1-2 sentences
See the second paragraph above.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Review 1

1. Models other than CRF
We discussed a conditional model in section 3.1 and a generative model in section 3.2. The conditional model in 3.1 is the exponential model, so it covers a wide class of models, including the linear-chain CRF, the semi-Markov CRF, and the sparse higher-order semi-Markov CRF. As can be seen from the last set of equations in 3.1, computing our criterion requires only (1) the partition function and (2) samples of \lambda. If we further use the MAP estimate of \lambda, then only (1) is needed. Hence our criterion is applicable to any class of models for which the partition function can be computed efficiently, e.g., tree CRFs.
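As an illustration, here is a minimal sketch of this computation for a flat multiclass exponential model, where the partition function is a direct sum over labels (for a chain CRF, log_partition would instead be computed by forward-backward); the names feats, pool, and feature_map are illustrative placeholders, not our actual code:

    import numpy as np

    def log_partition(lam, feats):
        # feats: (K, d) array holding one feature vector f(x, y) per label y.
        # log Z(lam) = log sum_y exp(lam . f(x, y)), computed stably.
        return np.logaddexp.reduce(feats @ lam)

    def gibbs_error(lams, feats):
        # Gibbs error 1 - sum_y p[y|x]^2 of the posterior-averaged predictive
        # distribution p[y|x] = (1/m) sum_i p_{lam_i}[y|x], using the identity
        # sum_y p_{lam_i}[y|x] p_{lam_j}[y|x] = Z(lam_i + lam_j) / (Z(lam_i) Z(lam_j)),
        # so that only partition-function evaluations are needed.
        m = len(lams)
        logZ = [log_partition(lam, feats) for lam in lams]
        total = 0.0
        for i in range(m):
            for j in range(m):
                total += np.exp(log_partition(lams[i] + lams[j], feats)
                                - logZ[i] - logZ[j])
        return 1.0 - total / (m * m)

    # One Max-GEC step: query the unlabeled example with the largest Gibbs error,
    # e.g., x_star = max(pool, key=lambda x: gibbs_error(lams, feature_map(x)))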

2. Large number of samples for large batch sizes and restriction to small batch sizes
We agree this is a limitation of the batch algorithm. However, in some real problems, we may prefer small batches to large ones. For example, if we have a small number of annotators and labeling one example takes a long time, we may want to select a batch size that matches the number of annotators. In this case, the annotators can label the examples concurrently while we can make use of the labeled examples as soon as they are available. If we choose a large batch, it would take a long time to label the whole batch and we cannot use the labeled examples until all the examples in the batch are labeled.

3. Different bounding constant in Theorem 3
The batch algorithm has a different bounding constant because it uses two levels of approximation to compute the batch policy: At each iteration, it approximates the optimal batch by greedily choosing one example at a time using equation 4 (1st approximation). Then it uses these chosen batches to approximate the optimal batch policy (2nd approximation). In the fully adaptive and non-adaptive approaches, we only need to make one approximation. In the fully adaptive case, the batch size is 1, so we can always choose the optimal batch at each iteration. Thus, we only need the 2nd approximation. In the non-adaptive case, we only choose 1 batch. So, we only need the 1st approximation.
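Schematically, the control flow looks as follows (a sketch only; score, annotate, and update stand in for the model-specific pieces and are not our actual interfaces):

    def batch_max_gec(pool, score, annotate, update, posterior, batch_size, rounds):
        pool = list(pool)
        for _ in range(rounds):
            batch = []
            for _ in range(batch_size):
                # 1st approximation: grow the batch greedily, one example at a
                # time, according to the criterion of equation 4.
                x = max(pool, key=lambda cand: score(posterior, batch, cand))
                batch.append(x)
                pool.remove(x)
            # 2nd approximation: the greedily chosen batches stand in for the
            # optimal batch policy; the posterior is updated only after the
            # whole batch has been labeled.
            posterior = update(posterior, batch, annotate(batch))
        return posterior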

4. Parameter integration for SegEnt/LeastConf (using samples)
In the Bayesian CRF, although we can sample a finite set of models from the posterior, as far as we know there is currently no simple or efficient way to compute the SegEnt/LeastConf criteria from the samples, other than using only the MAP estimate. The main difficulty is computing a summation (a minimization for the LeastConf criterion) over all the outputs y in these complex structured models. For Max-GEC, the summation can be rearranged to obtain the partition functions, but we cannot do so for SegEnt/LeastConf. This is an advantage of Max-GEC, since it can be computed efficiently from the samples using known inference algorithms.
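Concretely, for the exponential model p_\lambda[y|x] = exp(\lambda \cdot f(x,y)) / Z(\lambda) with posterior samples \lambda_1, ..., \lambda_m, the rearrangement is essentially the identity (writing Z(\lambda) for the partition function at input x):

    \sum_y ( (1/m) \sum_i p_{\lambda_i}[y|x] )^2
        = (1/m^2) \sum_{i,j} \sum_y exp((\lambda_i + \lambda_j) \cdot f(x,y)) / (Z(\lambda_i) Z(\lambda_j))
        = (1/m^2) \sum_{i,j} Z(\lambda_i + \lambda_j) / (Z(\lambda_i) Z(\lambda_j)),

so each term needs only partition-function evaluations, which standard inference algorithms provide. For SegEnt, the logarithm inside the summation over y blocks this rearrangement.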

Review 2

1. Proof that our criterion minimizes the overall error probability
We are also unaware of any formal proof of this. However, in the Bayesian setting, the Gibbs error upper bounds the Bayes error, and this serves as a motivation for us to investigate this criterion.
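For completeness, the bound follows in one line: for a predictive distribution p, \sum_y p[y]^2 \le (\max_y p[y]) \sum_y p[y] = \max_y p[y], so the Bayes error 1 - \max_y p[y] is at most the Gibbs error 1 - \sum_y p[y]^2.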

2. Computing the sequential Tsallis entropy in non-adaptive policy
In equation 2, we do not know the true labeling of S_i, so we take a summation over all the possible labelings of S_i \cup {x}. For example, if there are two labels, this summation is over 2^{|S_i|+1} labelings. In other words, we compute the Tsallis entropy of the distribution p_0[y_{S_i \cup {x}}; S_i \cup {x}] over these 2^{|S_i|+1} labelings. Since the support of this distribution (the number of labelings) is exponentially large, we approximate the Tsallis entropy using Algorithm 2.
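The reason sampling suffices here is that \sum_{y}(p_0[y; S])^2 = E_{y \sim p_0}[p_0[y; S]], so an unbiased estimate needs only draws from p_0 and pointwise evaluations of p_0. A minimal sketch (sample_labeling and prob_of are placeholder hooks, not the actual interface of Algorithm 2):

    def tsallis_sampled(sample_labeling, prob_of, n_samples, rng):
        # Estimate 1 - sum_y p(y)^2 = 1 - E_{y ~ p}[p(y)] without enumerating
        # the exponentially many labelings: draw y ~ p and average p(y).
        acc = 0.0
        for _ in range(n_samples):
            y = sample_labeling(rng)  # one labeling drawn from p
            acc += prob_of(y)         # p(y) for that labeling
        return 1.0 - acc / n_samples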

3. On redefinition of Gibbs error
The Gibbs error of line 110 is different from the definition at lines 134/137. The error of line 110 is the error along one path of the policy tree, while the error at lines 134/137 is the average error over the whole (non-adaptive) policy tree. We will improve the clarity of the notation.

4. Clarification for y_U at line 108
For the unlabeled examples, we do not know their true labeling. So, we can think of the true labeling as a random variable, whose probability is determined by the model. Given a model, which is any distribution on the hypothesis space (p_0 in this case), the probability of a labeling y_U can be computed using the equation at line 090 and is equal to p_0[y_U; U] (defined at line 105 for any distribution p).
If a fixed y_U is really the true labeling, we can compute the error of a Gibbs classifier on the set selected by a policy \pi (this set corresponds to exactly one path from the root to a leaf of the policy tree of \pi). This error is defined as \mathcal{E}(...) at line 110. However, we do not know which y_U is really the true labeling. So, we have to take an expectation over all the possible y_U in the definition of the policy Gibbs error (equation 1).
The sentence below equation 1 explains the fact above: when a particular y_U is really the true labeling, the policy selects examples along exactly one path down the policy tree.

5. Clarification for p_0 at line 109
The prior p_0 is a probability distribution over the hypothesis space \mathcal{H} (line 088). For any probability distribution on the hypothesis space (including both the prior p_0 and the posterior p), we can induce a probability for any event A using the equation at line 090.
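In symbols (reconstructing the equation at line 090 from this description): for any distribution p over \mathcal{H}, an event A \subseteq \mathcal{H} gets probability p[A] = \sum_{h \in A} p[h]; in particular, p[y_U; U] = \sum_{h \in \mathcal{H}} p[h] 1[h(x) = y_x for all x \in U].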

6. The steps from E_{y_s} to Tsallis entropy at line 134
Since E_{y_S}[.] is with respect to the distribution p_0[y_S;S], we have:
E_{y_S}[1 - p_0[y_S;S]] = 1 - E_{y_S}[p_0[y_S;S]] = 1 - \sum_{y_S}(p_0[y_S;S] * p_0[y_S;S]) = 1 - \sum_{y_S}(p_0[y_S;S])^2.
This is the Tsallis entropy for the distribution p_0[y_S;S] over all the possible labelings of S.

Review 3
We agree with most points from reviewer 3 and will make the suggested changes.

Notes: The notation has changed in the final paper to increase clarity.