PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Authors

  • Xiaoyi Dong, University of Science and Technology of China
  • Jianmin Bao, Microsoft Research Asia
  • Ting Zhang, Microsoft Research Asia
  • Dongdong Chen, Microsoft Cloud + AI
  • Weiming Zhang, University of Science and Technology of China
  • Lu Yuan, Microsoft Cloud + AI
  • Dong Chen, Microsoft Research Asia
  • Fang Wen, Microsoft Research Asia
  • Nenghai Yu, University of Science and Technology of China
  • Baining Guo, Microsoft Research Asia

DOI:

https://doi.org/10.1609/aaai.v37i1.25130

Keywords:

CV: Representation Learning for Vision, CV: Object Detection & Categorization, CV: Segmentation

Abstract

This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perceptual judgments. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. Surprisingly, we find a simple yet effective idea: enforcing perceptual similarity during dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for computing perceptual similarity. We demonstrate that the learned visual tokens indeed exhibit better semantic meaning and help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve 84.5% top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same number of pre-training epochs. Our approach also yields significant improvements on COCO object detection and segmentation and on ADE20K semantic segmentation. Equipped with a larger ViT-H backbone, we achieve state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.
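
To make the core idea concrete, the sketch below shows one way to add a perceptual-similarity term, computed on deep features from a frozen feature extractor, to the standard dVAE pixel reconstruction loss. This is a minimal PyTorch sketch under stated assumptions, not the authors' released code: the names peco_dvae_loss, dvae, feature_extractor, and the weight lambda_percep are hypothetical placeholders, and in the paper the features would come from a self-supervised transformer rather than the toy modules used here.

    import torch
    import torch.nn.functional as F

    def peco_dvae_loss(dvae, feature_extractor, images, lambda_percep=1.0):
        # Pixel-level reconstruction term of the dVAE.
        recon = dvae(images)
        pixel_loss = F.mse_loss(recon, images)

        # Perceptual term: match deep features of the reconstruction to those
        # of the original. Target features are computed without gradients,
        # so the extractor stays frozen; gradients flow to the dVAE through
        # the reconstruction branch only.
        with torch.no_grad():
            target_feats = feature_extractor(images)
        recon_feats = feature_extractor(recon)
        percep_loss = F.mse_loss(recon_feats, target_feats)

        return pixel_loss + lambda_percep * percep_loss

    # Toy usage with stand-in modules (1x1-conv "dVAE", tiny conv "extractor");
    # a real setup would use a trained dVAE and a self-supervised transformer.
    dvae = torch.nn.Conv2d(3, 3, kernel_size=1)
    feature_extractor = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3),
        torch.nn.AdaptiveAvgPool2d(1),
    )
    loss = peco_dvae_loss(dvae, feature_extractor, torch.randn(2, 3, 32, 32))

The design point is that the extra term pulls perceptually similar images toward nearby reconstructions in deep-feature space, so the codebook learned by the dVAE becomes a more semantic prediction target for BERT-style pre-training.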

Published

2023-06-26

How to Cite

Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., Chen, D., Wen, F., Yu, N., & Guo, B. (2023). PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers. Proceedings of the AAAI Conference on Artificial Intelligence, 37(1), 552-560. https://doi.org/10.1609/aaai.v37i1.25130

Issue

Vol. 37 No. 1 (2023)

Section

AAAI Technical Track on Computer Vision I