NeurIPS 2020

Gibbs Sampling with People


Review 1

Summary and Contributions: The manuscript introduces Gibbs Sampling with People (GSP), a method for exploring people's perceptual spaces. The approach is an extension of Markov Chain Monte Carlo with People (MCMCP) in which binary decisions are relaxed to the selection of a value along a continuous dimension. The new approach is compared with MCMCP, showing distinct improvements. Results further demonstrate GSP on three additional domains, and in combination with image synthesis (StyleGAN) and interpretable networks (GANSpace) to illustrate exploration of high-dimensional latent spaces.
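As a rough illustration of the idea (my sketch, not the authors' implementation), GSP can be simulated with a toy "participant" who, on each trial, sets a slider along one stimulus dimension while the others are held fixed — that is, a draw from the conditional distribution, exactly the Gibbs update. A minimal sketch, assuming an axis-aligned 2D Gaussian as the underlying perceptual distribution:

```python
import random

# Toy "perceptual distribution": an axis-aligned 2D Gaussian.
# (Purely illustrative values; not from the paper.)
MU = [0.6, -0.3]
SIGMA = [0.5, 0.2]

def simulated_participant(dim, state):
    """Stand-in for one human trial: the current stimulus is shown with
    dimension `dim` attached to a slider, and the 'participant' returns
    a draw from the conditional p(x_dim | x_-dim). For an axis-aligned
    Gaussian this conditional is just that dimension's marginal."""
    return random.gauss(MU[dim], SIGMA[dim])

def gsp_chain(n_iters, init, seed=0):
    """Run a GSP-style chain: cycle through stimulus dimensions,
    replacing one coordinate per trial (the Gibbs sampling update)."""
    random.seed(seed)
    state = list(init)
    samples = []
    for _ in range(n_iters):
        for dim in range(len(state)):
            state[dim] = simulated_participant(dim, state)
        samples.append(list(state))
    return samples

samples = gsp_chain(2000, init=[0.0, 0.0])
mean0 = sum(s[0] for s in samples) / len(samples)
mean1 = sum(s[1] for s in samples) / len(samples)
```

With enough trials, the chain's sample means recover the modes of the toy distribution; in the real experiment the conditional draw comes from a human response rather than a known Gaussian.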

Strengths: Exploring human perceptual representations is an important problem that arises in many contexts. Existing methods are limited by the guess-and-check nature of eliciting people's judgements. MCMCP was a promising approach because it relieved the experimenter of having to systematically explore the space, instead creating a Markov chain of decisions. However, simple binary judgements and the continued need to guess and check (via the proposal distribution) limited its applicability. The generalizations presented here make the approach considerably more general, as demonstrated in the experiments. The empirical evaluation is impressive in scale, scope and results.

Weaknesses: Of the theoretical questions, it was particularly nice to see the probability-matching interpretation of human decisions engaged directly. One potential challenge for the utility-theoretic formulation is the well-documented phenomenon in which goodness or typicality (more generally, utility judgments) varies across categories. On the one hand, I suppose one could argue the approach is to estimate exactly that; on the other, the result no longer necessarily reflects the underlying perceptual space. For example, taller trees are deemed more typical, even though they are statistical outliers. It is hard to say for sure to what degree this is a feature (reflecting quirks of human judgments) or a bug (doing so in a way that misses important statistical and representational aspects of experience). It would be nice to see aggregation applied to MCMCP in the main text as well. The results do seem to suggest that GSP is a bit better, but the real power seems to come from aggregating. As noted in the text, parameterizing the space seems important to the effectiveness of the method. It would be nice to have demonstrations of the consequences of different parameterizations and some thoughts on how this might be done well in general (is PCA really effective?).
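On the parameterization question, one simple baseline (a hypothetical sketch — the paper's own pipeline may differ) is to run PCA on a sample of latent vectors and expose the top principal axes as the GSP slider dimensions:

```python
import numpy as np

# Hypothetical stand-in for stimulus embeddings (e.g. GAN latents):
# 500 vectors in 16 dimensions with sharply decaying per-axis scale.
rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 16)) * (2.0 ** -np.arange(16))

mean = latents.mean(axis=0)
# SVD-based PCA: rows of Vt are orthonormal candidate slider directions,
# ordered by the variance they explain.
U, S, Vt = np.linalg.svd(latents - mean, full_matrices=False)
explained = S**2 / (S**2).sum()

def to_sliders(x, k=4):
    """Project a latent vector onto the top-k principal axes, giving a
    low-dimensional slider parameterization for a GSP experiment."""
    return (x - mean) @ Vt[:k].T
```

Whether a handful of variance-maximizing axes actually align with perceptually meaningful directions is exactly the open question the review raises; this sketch only shows the mechanics.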

Correctness: The analysis seems correct and complete.

Clarity: The paper is very well written.

Relation to Prior Work: The paper reviews prior work (primarily MCMCP, but also others) and clearly demonstrates that the proposed method is an improvement.

Reproducibility: Yes

Additional Feedback: === Post response === I have no further updates to my review. I don't know if this will be surfaced to the authors, but the paper was flagged as a potential ethics concern. I do not concur with that opinion, but it would be helpful if the authors could sharpen the point that the method is aimed at understanding biases rather than at creating datasets for training machine learning algorithms.


Review 2

Summary and Contributions:
- The authors propose Gibbs Sampling with People (GSP), an alternative to the 'MCMC with people' idea for sampling participants' mental representations.
- A comparison is made between MCMCP and GSP through several experiments, with follow-up experiments examining specific aspects of the method such as aggregation.

Strengths:
- This is a very nice idea, backed up by a series of experiments that each expand on the ideas in the paper.
- The writing is very good and the figures clear.
- The background is well described, and the theory behind the method is described in a good amount of detail.
- This is certainly very relevant to the NeurIPS community and very novel.

Weaknesses: - Some details of the experiments are missing, but that is to be expected given the number of experiments. The appendix clarified my questions

Correctness: - This all looks perfectly correct.

Clarity: - Clear and well written

Relation to Prior Work: - Nicely builds on previous work (MCMCP)

Reproducibility: Yes

Additional Feedback:
- One question that still remained after reading the main text (although partly answered by the appendix) concerned the total number of responses for GSP and aggregated GSP. If subjects have to make 10 responses per iteration, is any per-response benefit lost?
- Related: how do the experimenters avoid subjects merely making the same response 10 times? Just by varying the starting position of the slider?
- Devil's advocate: it might be good to talk about how this differs from, e.g., a multidimensional method of adjustment in psychophysics.


Review 3

Summary and Contributions: This paper describes a new method, Gibbs Sampling with People (GSP), for exploring people's semantic representations. The technique builds on MCMC with people but uses a continuous choice, rather than a binary choice, for selecting the next stimulus in the chain. Theoretically, there are two main contributions: (a) reframing MCMCP in a way that doesn't require that people be probability matching (rather than maximizing), and (b) developing the GSP approach. Within both of these, participants are assumed to extract a utility value, made up of an actual utility and a noise component, and the behavior of the sampler is contingent on the scale of the noise component. The paper considers several methods of aggregating samples in order to reduce noise and make it easier to estimate the mode of a distribution. To test GSP, the paper presents a number of studies (4 in the main paper, plus additional variations in the supplemental materials), and establishes that GSP is more effective at mode-seeking than MCMCP, especially with aggregation, and can uncover interesting information about people's semantic representations of categories and continuous perceptual spaces.

Strengths: One of the major strengths of this paper is the breadth of the empirical results, which both serve to illustrate the strengths of the new methodology and apply that methodology in interesting ways. By looking at four different domains, ranging from the relatively simple color stimuli to the high dimensional face stimuli, the experiments are able to go beyond simply showing that GSP has advantages over MCMCP to showing what sort of insights might be gained by exploring complex representations in this way and offering some suggestive evidence for being able to uncover cross-cultural differences in representations and stereotypes. A second strength is in the reframing of MCMCP to not require the assumption that people probability match. Finally, I believe GSP is likely to be adopted by psychology and cognitive science researchers interested in representations, leading to a potentially large impact of this work. While the material is in some ways different from the "typical" NeurIPS paper, I think there is likely to be substantial interest by those who are interested in representations generally as well as those interested in cognitive science specifically.

Weaknesses: Overall, I thought this was a strong paper. The main concerns I had were as follows: (1) Mode-seeking versus showing the distribution: The aggregated results in the first experiment seem to show much more homogeneity than the results for GSP or MCMCP. It seems like one limitation of this approach might be that there is limited exploration of the space, perhaps making it hard to move between modes and also making it more difficult to see the full shape of the distribution, which I have often taken to be a goal in work using MCMCP. The tension between optimization and seeking a distribution is discussed to some extent in the paper, but I would be interested in seeing this discussed more (and perhaps whether GSP without aggregation is likely to lead to more optimization than MCMCP). In the author response, they have shown additional information suggesting that GSP is more mode-seeking but also does a better job of capturing the distribution. While this doesn't completely get at cases that are more multimodal than colors, it does go a long way to addressing this concern and I look forward to reading more details about the new experiment in the supplement. (2) Possible conflation of participant representations and representations in the stimuli space: In the final experiment, the faces are generated via a GAN and manipulated by automatically generated axes in the space of faces generated by the GAN. The stimuli space thus seems far more subjective and dataset-dependent than in the other cases, and to draw conclusions about people's representations of faces from this, I would want to see additional chains from a GAN trained with a different dataset of images (and perhaps to get some sense of how similar the dimensions are across different training sets).
That isn't the main point of this paper, and thus I don't see it as a prohibitive weakness, but it does make me concerned about interpreting the results of that final experiment and whether it would be possible to really gain insights into people's representations about the relevant categories. (And these problems seem likely to be made worse by being in an optimization regime, as that experiment is, due to the aggregation; the conclusion that there isn't much multimodality around these concepts also seems surprising, and understanding how the results across chains were compared would be helpful.) The author response reframes this experiment a bit and also adds an experiment with an additional dataset, which is helpful. Their response makes clear that they are planning to reframe the discussion of the experiment in the paper a bit as well, and I would encourage them to be clear about limitations in terms of assumptions of a single representation for particular concepts: it seems unlikely that people monolithically have a single distribution over concepts like "attractive", and to the extent that Turkers tend to under-represent Black and Latinx populations in the US (at least based on 2016 Pew Research results - I'm not sure about follow-up work), it seems like the favoring of Caucasian faces for particular concepts should be placed in a broader context.

Correctness: The claims and method in the paper appear to be correct.

Clarity: Overall, the paper was clearly written. Some of the information about aggregation was a little hard to understand in an initial read through, but the supplementary materials provide additional information that made these parts more clear.

Relation to Prior Work: The paper made clear how this was related to prior work and described what contributions were new.

Reproducibility: Yes

Additional Feedback: My main feedback is summarized above. I thought the work was very interesting, and while the difference in setup is relatively simple, this is a good methodological innovation. As noted above, the face experiment is the one that I find the most dubious, and that's both in terms of the reliance on the underlying dataset and using the results to draw conclusions about people's representations. While I know space is limited, some discussion about what conclusions can be drawn from the results (beyond "here's an example of a trustworthy/intelligent/etc face") would I think be really helpful for illustrating the benefits of the proposed approach for studying high dimensional stimuli to researchers in cognitive science (of which I consider myself one). As expanded upon a bit above, this is also a place where I think there is potential for misunderstanding that perpetuates societal biases; I appreciate the attention to these issues in the Broader Impact statement, and I think bringing some of these ideas into the main paper when discussing this experiment and/or limitations more generally would be very helpful for making clear that these representations are not speaking to intrinsic qualities of the stimuli but about the representations of the people in the study (and that the focus is on the modes, which may mean that representations of subpopulations in the study are even less likely to be represented in the final distribution). While I recognize that any study can be misconstrued, I think making the specifics here very clear could diminish the possibility of this being a paper that is used to reinforce racial and gender stereotypes.