Computer Vision

Q&A with Abhinav Gupta, winner of the J.K. Aggarwal Prize

April 13, 2021

We’re pleased to congratulate Facebook AI’s Abhinav Gupta on receiving the International Association for Pattern Recognition’s J.K. Aggarwal Prize for his work in unsupervised and self-supervised learning.

Gupta is a research manager at Facebook AI Research and an associate professor at Carnegie Mellon University. He has received several accolades, including the Office of Naval Research (ONR) Young Investigator Award, the IEEE Pattern Analysis and Machine Intelligence Young Researcher Award, the Sloan Research Fellowship, and the Okawa Foundation Grant.

Currently, Gupta is based in Pittsburgh, and his focus is on building AI systems and robots that learn the same way humans do. That includes applying past experience to novel problems, understanding the visual world, deciding what actions to take next, and essentially developing what we’d call common sense. He took a moment to tell us more about the impact his work has had on computer vision and robotics, why AI algorithms should be more like babies, and where he plans to go from here. Here’s an excerpt, edited and condensed for clarity:

Q: What was the research about?

Abhinav Gupta: The visual world is rich yet structured. So we came up with the idea of leveraging the redundancy in visual data to act as supervision for training convolutional neural networks. For context, most techniques use supervised learning, which relies on manually labeled examples to teach computers how to learn specific tasks. However, supervised learning is different from how we learn as humans — and manually labeling the data is a significant bottleneck for engineers and researchers.

We published two papers, “Unsupervised visual representation learning by context prediction” and “Unsupervised learning of visual representations using videos.” Our research proposed a new method to train convolutional neural networks with self-supervised learning instead. Our technique was able to train neural networks without relying on any of the labeled examples that supervised learning requires, enabling AI models to learn concepts that are hard to capture just from labeled data sets. This could help us create more accurate and reliable systems in the future.
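To make the idea concrete, here is a minimal sketch of a context-prediction pretext task in the spirit of the first paper. Given an unlabeled image, we cut out a center patch and one of its eight neighbors; the neighbor's position index becomes the training label, so the supervision comes from the image layout itself rather than from human annotation. The function name and parameters are our own illustration, not the paper's actual implementation.

```python
import numpy as np

def sample_context_pair(image, patch=8, gap=2, rng=None):
    """Sample a (center, neighbor, label) triple from an unlabeled image.

    The label is the neighbor's position index (0-7), so no manual
    labels are needed. Illustrative sketch only; names and defaults
    are hypothetical, not taken from the published method.
    """
    rng = rng or np.random.default_rng()
    step = patch + gap                      # spacing between patch origins
    h, w = image.shape[:2]
    # Pick a center-patch origin that leaves room for all 8 neighbors.
    y = int(rng.integers(step, h - step - patch + 1))
    x = int(rng.integers(step, w - step - patch + 1))
    offsets = [(-step, -step), (-step, 0), (-step, step),
               (0, -step),                 (0, step),
               (step, -step),  (step, 0),  (step, step)]
    label = int(rng.integers(8))            # which neighbor to cut out
    dy, dx = offsets[label]
    center = image[y:y + patch, x:x + patch]
    neighbor = image[y + dy:y + dy + patch, x + dx:x + dx + patch]
    return center, neighbor, label

# A network would then be trained to predict `label` from the patch pair.
img = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
center, neighbor, label = sample_context_pair(img, rng=np.random.default_rng(0))
```

Solving this prediction task well forces the network to learn useful visual representations, which is the point: the structure of the data stands in for the labels.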

Q: How does this relate to robotics research?

AG: The results of our work inspired me personally to think beyond just data sets and benchmark tasks. How can we build systems that learn the way humans do? To me, the most significant difference between machines and humans is embodiment. Humans use physical interactions with the world to understand how the world works. As babies, we push objects, throw things, and put things in our mouths. These interactions are tactile on the surface, but they teach us far more about how the world works.

However, computer vision algorithms have to learn about the world from passive data alone. As the next step, we focused on building robots that use self-supervised learning (and lots and lots of data) for robotics tasks and visual representation learning.
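One way to picture self-supervision in a robotics setting (an illustrative sketch under our own assumptions, not the actual pipeline from this research): the robot attempts an action, and the outcome of that attempt, e.g. whether the gripper closed on an object, becomes the training label, so the robot labels its own data through interaction with no human in the loop.

```python
import random

def attempt_grasp(x, y, angle):
    """Stand-in for a real robot trial; here we fake the outcome.
    On a real robot, success would come from gripper sensor feedback."""
    return random.random() < 0.3   # assume ~30% of random grasps succeed

def collect_self_supervised_data(n_trials, seed=0):
    """Each trial yields ((x, y, angle), success): the interaction
    itself provides the label, with no manual annotation."""
    random.seed(seed)
    data = []
    for _ in range(n_trials):
        x, y = random.uniform(0, 1), random.uniform(0, 1)
        angle = random.uniform(0, 180)
        data.append(((x, y, angle), attempt_grasp(x, y, angle)))
    return data

dataset = collect_self_supervised_data(1000)
```

At scale, a data set collected this way can train a model to predict which grasps will succeed, which is the sense in which "lots and lots of data" substitutes for human labels.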

Q: What happened when these papers were originally published, and how were they received by the AI community?

AG: They were eye-opening for the community due to the performance of the approaches we proposed. They were competitive when compared with their supervised counterparts, without requiring manual labels to learn. Before this, everyone had assumed that the magic was in the labels. For the first time, we demonstrated that the magic is actually in the visual data.

The robotics paper in ICRA 2016 was even more of a surprise — for both the community and myself! The robotics community had always relied on smaller data sets than the computer vision community. The use of 50,000 examples was unheard of. Our data consisted of 700 hours of robot-object interactions and was 10x bigger than other contemporary data sets. Our results demonstrated that generalization is possible with extensive amounts of data.

Q: How has this work developed since your initial research? Is it used in any work we see today?

AG: Self-supervised learning has become an important topic in both computer vision and robotics. In the last couple of years, we have seen improvements in the performance of these approaches, advancing the field of AI. Some recent Facebook papers like MoCo, PIRL, and SwAV have demonstrated that self-supervised learning can even outperform supervised learning in a few cases.

As for real-world use cases, self-supervised learning has become important in advancing the use of AI to detect misinformation.

Q: What is your current focus?

AG: In the spirit of building agents that learn like humans, I’ve been looking at cognitive development theories for inspiration. One of the critical features of learning is the different stages. For example, in our early stages as babies, we focus just on passive visual learning — meaning we are not yet interacting with the physical world. As we develop, we use learned models to become curious and perform interactions out of curiosity. How can we build robots that are curious like us? What experiments should we perform so they continue to grow and develop? I am also interested in how we ensure that our research is accessible — for example, PyRobot — and can be reproduced by other researchers.

You can read more about Gupta and his research here.

Written By

Zoe Mara Talamantes

Technology Communications, Editorial Extern