All Things ViTs: Understanding and Interpreting Attention in Vision

Hila Chefer¹ and Sayak Paul²
¹Tel Aviv University, Google   ²Hugging Face
CVPR 2023 Tutorial

In this tutorial, we explore the use of attention in vision. From left to right: (i) attention can be used to explain a model's predictions (e.g., CLIP for an image-text pair); (ii) attention-based models can be probed to study what they learn; (iii) by manipulating attention-based explainability maps, one can enforce that predictions are made for the right reasons (e.g., foreground rather than background); (iv) the cross-attention maps of multi-modal models can be used to guide generative models (e.g., mitigating neglect in Stable Diffusion).

Abstract

The attention mechanism has revolutionized deep learning research across many disciplines, starting from NLP and expanding to vision, speech, and more. Unlike other mechanisms, attention is elegant and general: it adapts easily across domains and eliminates modality-specific inductive biases. As attention becomes increasingly popular, it is crucial to develop tools that allow researchers to understand and explain its inner workings, facilitating better and more responsible use. This tutorial focuses on understanding and interpreting attention in vision and multi-modal settings. We present state-of-the-art research on representation probing, interpretability, and attention-based semantic guidance, alongside hands-on demos to encourage interactivity. Additionally, we discuss open questions arising from recent work and future research directions.

Tutorial Outline

The following is an outline of the topics we will cover in the tutorial. A detailed description can be found in this document.

Interpreting Attention
  • Brief history of Interpretability for DNNs
  • Attention vs. Convolutions
  • Using attention as an explanation (see the sketch after this outline)

Probing Attention
  • Depth and Breadth of Attention Layers
  • Representational Similarities between CNNs and Transformers
  • Probing cross-attention

Leveraging Attention as Explanation
  • Model debiasing
  • Attention-based guidance
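
To make the "using attention as an explanation" item above concrete, below is a minimal sketch of attention rollout [4] applied to a pretrained ViT via the Hugging Face transformers library. The checkpoint name and the example image path are illustrative assumptions, not part of the tutorial materials.

```python
# A minimal sketch of attention rollout (Abnar et al. [4]) on a pretrained ViT.
# The checkpoint and image path below are illustrative assumptions.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")
model.eval()

def attention_rollout(attentions):
    """Aggregate per-layer attention maps, adding identity for the residual path."""
    rollout = None
    for attn in attentions:                            # each: (batch, heads, tokens, tokens)
        attn = attn.mean(dim=1)                        # average over heads
        attn = attn + torch.eye(attn.size(-1))         # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)   # re-normalize rows
        rollout = attn if rollout is None else attn @ rollout
    return rollout                                     # (batch, tokens, tokens)

image = Image.open("example.jpg")                      # replace with your own image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

rollout = attention_rollout(outputs.attentions)
cls_relevance = rollout[0, 0, 1:]                      # CLS-token row over the 196 patch tokens
heatmap = cls_relevance.reshape(14, 14)                # upsample and overlay on the image to inspect
```

Raw rollout only aggregates attention weights; relevance-based approaches such as [1] and [5] refine this picture with gradient information.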

Tutorial Logistics

Our tutorial will be conducted in a hybrid manner on June 18, 2023, from 9:00 AM onwards, and we aim to conclude by 12:00 PM. Due to visa issues, Sayak will not be able to present in person and will instead join and present virtually; Hila will present in person. The tutorial will take place at the Vancouver Convention Center.

References

[1] Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, Chefer et al.
[2] Do Vision Transformers See Like Convolutional Neural Networks?, Raghu et al.
[3] What do Vision Transformers Learn? A Visual Exploration, Ghiasi et al.
[4] Quantifying Attention Flow in Transformers, Abnar et al.
[5] Optimizing Relevance Maps of Vision Transformers Improves Robustness, Chefer et al.