February 15, 2024
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks, without adaptation of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
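At its core, the objective is to predict, in representation space, the features of masked spatio-temporal regions of a video from the visible context, with targets produced by a stop-gradient, exponential-moving-average copy of the encoder and compared under an L1 loss. The PyTorch sketch below illustrates this idea under stated assumptions; the function signatures, masking interface, and momentum value are illustrative placeholders, not the paper's exact implementation.

```python
# Minimal sketch of a JEPA-style feature prediction objective in PyTorch.
# `encoder`, `target_encoder`, and `predictor` are assumed callables; the
# masking interface and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def feature_prediction_loss(encoder, target_encoder, predictor,
                            video, ctx_mask, tgt_mask):
    """L1 loss between predicted and target patch features.

    video:    (B, N, D) flattened spatio-temporal patch tokens
    ctx_mask: boolean mask (B, N) selecting visible context tokens
    tgt_mask: boolean mask (B, N) selecting masked target tokens
    """
    # Encode only the visible context region of the video.
    ctx_feats = encoder(video, ctx_mask)
    # Targets come from an EMA copy of the encoder; the stop-gradient
    # (no_grad) is what prevents a trivial collapsed solution.
    with torch.no_grad():
        tgt_feats = target_encoder(video, tgt_mask)
    # Predict the features of the masked region from the context features.
    pred = predictor(ctx_feats, tgt_mask)
    return F.l1_loss(pred, tgt_feats)

@torch.no_grad()
def ema_update(encoder, target_encoder, momentum=0.998):
    # The target encoder slowly tracks the online encoder via an
    # exponential moving average of its parameters.
    for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
        tp.mul_(momentum).add_(p, alpha=1.0 - momentum)
```

Because there is no pixel reconstruction, text alignment, or negative-pair contrast, the entire training signal comes from this feature-space regression; the frozen-backbone results above are obtained by training only a lightweight probe on top of the resulting representations.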
Written by
Adrien Bardes
Quentin Garrido
Nicolas Ballas
Jean Ponce
Publisher
arXiv
Research Topics
Core Machine Learning