Text-to-Image Diffusion Models are Zero Shot Classifiers

Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Main Conference Track

Bibtex Paper Supplemental


Kevin Clark, Priyank Jaini


The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data.However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks.We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers.The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood.We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and comparing them with CLIP's zero-shot abilities. They perform competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot.Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.