Paper ID:1294
Title:Optimal Teaching for Limited-Capacity Human Learners
Current Reviews

Submitted by Assigned_Reviewer_16

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper focuses on how to choose optimal training examples for people learning to discriminate categories. The authors develop an optimal teacher model that selects training examples in order to minimize generalization error, assuming that people make classification decisions in accordance with the GCM, a widely used categorization model. They test their model with an experiment and find that the best teacher is one that assumes that people have a limited memory capacity that only allows them to retrieve a few previous examples to compare to a new item. This teacher chooses "idealized" training sets rather than representative ones.

Overall, this is an interesting and well-written paper that accomplishes what it sets out to do: provide a normative basis for why idealized training sets improve human learning in categorization tasks. The analysis appears technically sound, and the experiment is well designed. Moreover, because this paper makes contact with both the machine learning and cognitive science literatures, it is likely to be of interest to a broad audience at NIPS. Therefore, I recommend acceptance.

My biggest criticism is that this paper is framed in a fairly narrow way. Namely, the paper seems to be motivated by the fact that previous categorization research has only used heuristics rather than rigorous methods to create idealized categories. This is an adequate motivation, but it seems that this topic has broader implications -- perhaps for curriculum design. I think the paper would benefit from some discussion of how the analysis the authors develop could be applied in more real-world contexts. I think the authors should also cite the work on pedagogical reasoning by Patrick Shafto and Noah Goodman and discuss how it is related. Shafto and Goodman's work is also about choosing optimal training examples for learners, although their goals are somewhat different.

Other comments
- The empirical test of the model here is not very compelling. The results show that the optimal teacher does better than a random teacher, but not by much (and a random teacher is a pretty low bar to begin with). I don't think this is something the authors must address, but if there is a sensible alternative model to include besides a random model, it would make the test more compelling.
- Figure 1: Related to the previous comment, why did you use this category distribution instead of some other one? Because you only considered one distribution, it's not clear if your results will generalize to other situations.
- Lines 43-46: I think this is a poor opening paragraph for the paper. It sets up the false expectation that this paper is about how people make category judgments but the paper seems to be primarily about how to best choose training examples to help people learn categories.
- Line 171: "problem (5)" should be "equation (5)"?
- Line 176: "We now removed" --> "We now remove"?
Q2: Please summarize your review in 1-2 sentences
An interesting and rigorous paper likely to be of interest to a broad audience at NIPS.

Submitted by Assigned_Reviewer_21

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The work presented here is solid and innovative from a cognitive science
perspective, and the manuscript is well written and for the most part clear. I
have some issues and questions for the authors, which I suspect can be
addressed.

Major comments:

On line 031 of the abstract, the authors conclude, "We find that the optimal
teacher recommends idealized training sets." I believe that the idealization
of training sets is a presumption of the 'optimal' teacher, so the teacher is
really an 'optimal idealized-training-set' teacher. I would be happy if the
authors simply stated at the outset that they were going to limit their
exploration to a search for idealized training sets, based on past
psychological research. Because they seem unwilling to make such a statement,
they introduce idealization in a slippery backdoor kind of way (see line 175,
the sentence beginning "We do not have evidence...")

It is true that the various idealized training sets outperform one particular
nonidealized set (the random set), but I don't think the right comparisons
have been done to conclude that idealization is the optimal strategy for
limited-capacity human learners.

It would be useful to say a few words about what relevance idealization has to
training real-world tasks. Idealization is a different idea than simply using
prototypes: idealization involves labeling nondeterministic stimuli with their
predominant classification. But with naturalistic tasks, one may not be able
to ascertain the degree of ambiguity of a stimulus. Although the paper is not
about idealization per se, it presumes that the teacher can idealize stimuli.

The authors report that the optimal teacher selects clump-far examples
for a limited-capacity learner (line 222). I would like to know more about
these clump-far sets. Figure 2A shows that the 'far' clump is part way but not
all the way to the stimulus dimension extrema. Why not stimuli further
from the boundary? What's the trade off that lands the 'far' stimuli at around
.15 and .85? Is the optimal set really 5 identical examples adjacent to
5 identical examples, or is that simply how the figure is drawn?

I didn't go back and look at the Nosofsky & Palmeri Psych Rev paper (reference
[17]), but it's difficult to intuit the claim that \gamma relates to memory
capacity (line 110). My intuitions are further shot by the comment (lines
227-299): "The high-capacity GCM...is sensitive only to the placement of
training items adjacent to the decision boundary". This doesn't sound like a
high capacity model.

Until the last paragraph of the article (line 429), the authors did not
touch on an issue that bothered me from the outset: that order
effects are ignored. To the degree that human memory is capacity limited,
it is also recency biased, and the fact that recency doesn't come into play
further weakens my acceptance of the low-lambda GCM as a model of
limited-capacity instance-based learning.

MINOR COMMENTS:

line 025: In the abstract and the introduction, the authors comment that
idealization runs counter to ML practice, where one aims to match statistics of
the training and test sets. However, with limited training set size, my hunch
is that idealization would benefit ML classifiers as well on the particular
task studied in this paper. For example, comparing a 'spread' idealized
training set to a 'spread' non-idealized set, it seems likely that the
variability in the inferred decision boundary will be smaller with the
idealized set.

057: What does "idealized in an ad hoc or heuristic fashion" mean?

192, "discuss the experiment with human learners": There's ambiguity in the
prepositional phrase attachment. You might want to re-word.

205: I'm lost on the notation. Where are the empirical probabilities shown?
Is the $y_j = 1$ term in the equation for $f^{(cond)}$ the observed outcome
of an experimental trial?

Figure 4: It might be a good idea to zoom in to focus on the 50-75%
performance range.
Q2: Please summarize your review in 1-2 sentences
Interesting ideas and innovative approach, but only a small step into a set of complex issues. The results obtained -- both theoretical and experimental -- are sensible but hardly surprising.

Submitted by Assigned_Reviewer_27

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
In this paper the authors consider how to optimally train humans to make categorical judgments. In particular, they are interested in finding the optimal fixed training set that maximizes test accuracy for a specific cognitive model (the Generalized Context Model, GCM). From this model, they derive an approximate objective function and minimize it to find optimal training sets for models with large and small capacity constraints. They then test the efficacy of these different training sets on 600 mTurkers and show that the “Clump-far” and “Spread” training sets, which are optimal for small-capacity learners, are the best for training humans.

This is a really nice paper. It’s well written and the results are novel and interesting. It’s very appropriate for NIPS.

I have no major comments.

Minor comments:

Was the study IRB approved? I would assume so but this should be stated. In the unlikely event that the study was not IRB approved the paper should not be published until approval is obtained.

I wasn’t completely sure how the “optimal classifier” was defined. From the text this just sounds like it corresponds to the “true” class labels? Is that correct? If so, perhaps a simpler way to report the results (e.g. in Figure 3) would be to simply say they correspond to the accuracy with which humans classify the stimuli. If it’s something different from this – perhaps closer to an ideal observer making judgments under some perceptual noise – then that should be more explicitly stated.
Q2: Please summarize your review in 1-2 sentences
The authors study the problem of optimally training humans to perform visual categorizations.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all three reviewers for their positive and constructive comments.

Reviewers 16 and 21 raise the concern that the empirical model comparisons might not be adequate and that the random set is a low-bar comparison.
-We would like to clarify that we compared low- and high-capacity versions of the same model (GCM) that differ only in the \gamma parameter (lines 206-210). We believe this is a rigorous comparison in which human performance closely matches the performance of the low-capacity GCM (Figure 4, lines 396-405). Furthermore, our aim here was not to test whether idealized sets outperform the random set. We applied the optimal teacher to see its recommendations for low- and high-capacity models, and to see whether idealization emerges when the optimal teacher is applied to the low-capacity model (lines 58-61, Figure 2). Our empirical finding is that humans do well with the low-capacity training set compared to the high-capacity and random sets (lines 396-405, Figure 4).
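For intuition, here is a minimal sketch (Python/NumPy) of a standard GCM choice rule in which \gamma appears as the response-scaling parameter; the exponential similarity function, parameter values, and function/variable names below are illustrative assumptions, not the exact formulation or fitted values used in the paper.

import numpy as np

def gcm_prob_A(x, train_x, train_y, c=10.0, gamma=1.0):
    # Probability of responding "A" to a probe x under a standard GCM sketch.
    # train_x: 1-D array of stored exemplar locations; train_y: labels (1 = A, 0 = B).
    # c is the similarity gradient; gamma is the response-scaling parameter
    # that the rebuttal relates to memory capacity.
    sim = np.exp(-c * np.abs(np.asarray(train_x) - x))  # similarity to each stored exemplar
    s_a = sim[np.asarray(train_y) == 1].sum()           # summed similarity to category A
    s_b = sim[np.asarray(train_y) == 0].sum()           # summed similarity to category B
    return s_a**gamma / (s_a**gamma + s_b**gamma)

# Example: the same (hypothetical) training set evaluated with a low and a high gamma.
train_x = np.array([0.12, 0.13, 0.87, 0.88])
train_y = np.array([0, 0, 1, 1])
print(gcm_prob_A(0.6, train_x, train_y, gamma=1.0))  # graded response probability
print(gcm_prob_A(0.6, train_x, train_y, gamma=5.0))  # near-deterministic response

In this sketch, larger \gamma sharpens the choice probabilities toward the category with the greater summed similarity, while all other components of the model are held fixed.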

Reviewers 16 and 21 ask how the framework developed here applies to real-world problems.
- Real-world stimuli do present new challenges, as the reviewers point out, but we believe that they can be addressed within our framework. We tried to communicate this by citing a recently published article, Hornsby and Love (2014), which shows an application of idealization to mammogram classification (line 50).

Our clarifications of some specific comments by the reviewers are as follows.

Reviewer 16
----------------
We agree with the reviewer that this work might have broader implications and that the paper would benefit from a broader introduction. We also agree that reformulating the opening paragraph would make the intentions of the paper clearer. Finally, we thank Reviewer 16 for pointing out the interesting work by Shafto and Goodman, which we agree is relevant.

Why did you use this category distribution instead of some other one? It's not clear if your results will generalize to other situations.
-We used this test distribution because it contains representative items across the whole range. Using other distributions is an interesting idea and could be pursued as a future research direction. This can be included in the discussion (line 428), where we touch on a variety of other problems to be tackled.

Reviewer 21
----------------
The teacher is really an 'optimal idealized-training-set' teacher and ‘idealization involves labeling nondeterministic stimuli with their predominant classification’.
- We agree that labeling stimuli with their predominant classification is one way to idealize. But we also state another way to idealize: “...minimize the saliency of ambiguous cases during training” (lines 50-53, Giguere and Love 2013). In this sense, we do not limit our search to idealized training sets; rather, idealization emerges as a recommendation of the optimal teacher, since the low-capacity optimal set (Clump-Far, Figure 2) is placed away from the boundary. We will make an effort to make this clearer.

Details on the Clump-Far set, and whether the stimuli are identical.
-The stimuli obtained from the optimization procedure are not exactly identical but are very close to each other within each clump. We show all stimuli here (grouped by clump, from left to right):
0.1191143, 0.1191683, 0.1192339
0.1348548, 0.1351363, 0.1351442, 0.1351473, 0.135148, 0.1351491, 0.1352406
0.8647532, 0.8648722, 0.8648736, 0.8648761, 0.864884, 0.8648982, 0.8649009
0.8808524, 0.8808894, 0.8809059
Rounding the stimuli to 3 decimals causes an insignificant change in the error rate (0.2448796 to 0.2448797); thus, the stimuli within a clump can be considered identical in terms of the optimization score.
We thank the reviewer for pointing out the drawing issue: the numbers of items in the Clump-Far set in Figure 2 are wrong. We apologize for this mistake and will correct the figure; the numbers of items in Clump-Far, from left to right, should be 3, 7, 7, 3. Please note that our results and conclusions are not affected by this.
The location of the clumps is determined by the \gamma parameter obtained by fitting empirical data (Section 4.1, line 207). Moving the clumps away from the optimal position increases the error rate of the low-capacity GCM.

Whether the \gamma parameter represents memory capacity, and whether the high-capacity model is really high capacity.
-The high-capacity GCM indeed performs well on all the training sets and is thus not sensitive to the location of the items (Table 1 on line 253). The high-capacity model performs poorly only when the stimuli get infinitesimally close to the boundary \theta. Thus our results suggest that the \gamma parameter indeed reflects memory capacity.

Order effects are ignored
-We agree that order effects can affect learning in interesting ways. We point this out in the discussion (line 429). Addressing them is the next challenge, and our framework can be extended to do so using appropriate learning models.

What does "idealized in an ad hoc or heuristic fashion" mean?
- By “ad hoc or heuristic fashion” we mean guided only by the intuitions of the experimenters, in contrast to a rigorous, systematic approach.

Where are the empirical probabilities shown? Is the $y_j = 1$ term in the equation for $f^{(cond)}$ the observed outcome of an experimental trial?
-The empirical probabilities were obtained from Experiment 2 of Giguere and Love 2013 (line 196). They are not shown here but can be found in Giguere and Love 2013, Figure 4. The reviewer is correct in observing that the $y_j = 1$ term in the equation for $f^{(cond)}$ is the observed outcome of a human experimental trial.

Reviewer 27
----------------
-The study was approved by an ethics committee.

-The optimal classifier assigns “correct labels” as $\hat{y} = \mathrm{sign}(x - \theta^*)$ (line 350). We agree that clarifying the caption of Figure 3 might improve readability.
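For clarity, a minimal sketch of this labeling rule in Python; the function and variable names are illustrative only, and theta_star denotes the true category boundary from the equation above.

import numpy as np

def optimal_label(x, theta_star):
    # Optimal classifier: y_hat = sign(x - theta_star)
    return np.sign(x - theta_star)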