June 23, 2021
We’re pleased to congratulate Facebook AI’s Georgia Gkioxari on receiving the 2021 Pattern Analysis and Machine Intelligence (PAMI) Young Researcher Award, which is given by the Technical Committee on Pattern Analysis and Machine Intelligence (TCPAMI) of the IEEE Computer Society. Gkioxari is a Facebook AI Research Scientist based in our Menlo Park headquarters. Her focus is on endowing machines with the ability to see the same way humans do. She builds systems that perceive scenes, objects, people, and their actions by processing visual inputs alone.
Gkioxari won the Marr Prize for Mask R-CNN in 2017. She has also been named one of the top women in AI by Re-Work. Gkioxari graduated from UC Berkeley with a PhD in EECS under the supervision of Jitendra Malik. She is originally from Athens, Greece.
She took a moment to talk about her work, its impact, and her future research plans.
Georgia Gkioxari: It all started during my undergraduate studies in Greece, with an introduction to computer vision course. That is when I started developing an interest in computer vision and AI, and I have been working on computer vision ever since! I am mostly intrigued by perception, namely the ability to recognize and understand the world from visual inputs. Perception is a field that combines signal processing, machine learning, and pattern recognition, and aspires to mimic the human brain, or at least match its competence.
GG: I have been working on computer vision for a while now. Even though I have worked on a variety of topics, the focus of my research, and of my interests, has been visual understanding: recognizing objects, people, and scenes from images or videos.
It’s a very challenging and multifaceted problem: it requires capturing low-level cues, such as how pixels are organized and grouped on the 2D plane, as well as top-down understanding of entities, attributes, relations, and geometry. To achieve perfect performance on this task, models will likely need to reason beyond the 2D plane, even if their predictions are planar. Images are 2D, but the world is 3D, and to understand the world we need to understand its 3D nature through its 2D views.
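The loss of information in going from 3D to 2D can be illustrated with a toy pinhole-camera projection (a minimal NumPy sketch; the camera and the points here are made up for illustration):

```python
import numpy as np

def project(points, f=1.0):
    """Project Nx3 world points onto the image plane of a pinhole
    camera at the origin: (x, y, z) -> (f*x/z, f*y/z)."""
    points = np.asarray(points, dtype=float)
    return f * points[:, :2] / points[:, 2:3]

# Two distinct 3D points on the same viewing ray...
near = np.array([[0.5, 0.5, 1.0]])
far = np.array([[1.0, 1.0, 2.0]])  # twice as far, twice as big

# ...land on the exact same pixel: depth is lost in the 2D view.
print(project(near))  # [[0.5 0.5]]
print(project(far))   # [[0.5 0.5]]
```

This ambiguity is why a model that reasons only on the image plane can struggle with occlusion and scale, and why recovering the 3D structure behind the 2D view helps.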
I am increasingly curious about all these aspects of recognition and have worked on projects that propose ways to recognize objects in 2D and 3D, outline a person’s pose, understand their actions and interactions with the world around them, and more. All these projects, even though they tackle different tasks, study the same underlying problem: how a group of red, green, and blue pixels can lead to concepts.
GG: I have worked on many exciting projects throughout my career, and with some fantastic collaborators. I will single out perhaps the projects at the early stages of my career, during my graduate studies. I remember my first project, which was on human pose estimation. It was a very intense and dense learning experience. I would come out of the project meetings with my PhD adviser, Jitendra Malik, and I remember being amazed because I had learned something new every time. Working on a research project is something that no textbook can teach you. I was lucky enough to have a great PhD adviser to walk me through it. Since then, the projects I enjoy most are on problems I am passionate about and with people I have fun working with.
GG: Currently, I have been focusing more on visual recognition via 3D reasoning. Understanding the world in 3D from one or a few 2D images is something I really believe in. It is how humans learn, and it is what can allow machines to better explain patterns in 2D images, such as occlusions. It’s a very challenging direction for two reasons. First, there are no large-scale annotated data sets for 3D. Unlike 2D annotations, 3D annotations are expensive, and sometimes impossible, to collect, and scaling annotation pipelines to millions of images and thousands of objects is just not feasible. So we have to devise new ways to learn; one way is through videos. Second, tools to manipulate and learn with 3D data are not as advanced or developed as those for 2D, and efficient tools are important for 3D deep learning. I have been working on inventing new ways to learn 3D from images or videos, and also on building useful tools for research in this direction. PyTorch3D is the materialization of the latter effort.
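One example of the kind of primitive such 3D tooling provides is the chamfer distance, a loss commonly used to compare point clouds in 3D learning (PyTorch3D ships an efficient version). The following is a toy NumPy sketch for intuition only, not PyTorch3D's actual implementation:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between two point clouds (Nx3, Mx3).

    For each point in one cloud, take the squared distance to its
    nearest neighbor in the other cloud; average both directions.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Corners of a unit cube, and the same cube shifted slightly.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                dtype=float)
shifted = cube + 0.1

print(chamfer_distance(cube, cube))     # 0.0 -- identical clouds
print(chamfer_distance(cube, shifted))  # small positive value (~0.06)
```

Because it needs no point-to-point correspondences, a loss like this lets a network's predicted 3D shape be compared against a target shape directly during training.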
GG: One of the most impactful projects I have worked on, along with FAIR researchers Kaiming He, Piotr Dollar, and Ross Girshick, is Mask R-CNN. Mask R-CNN is an end-to-end model that detects object instances and marks their silhouettes from a single image. Mask R-CNN is used extensively within Facebook and has since formed the basis for many model extensions, both internally and externally. It currently powers the Smart Camera system in Portal, for example.
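A key component Mask R-CNN introduced is RoIAlign, which samples features at exact, real-valued locations inside a region of interest via bilinear interpolation instead of rounding to the feature grid. A toy sketch of that sampling step (not the paper's implementation; the function name here is made up):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a real-valued (y, x) location."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    # Blend the four surrounding grid values by their distances.
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

feat = np.arange(16, dtype=float).reshape(4, 4)
# A fractional location -- RoIAlign never rounds it to the grid.
print(bilinear_sample(feat, 1.5, 2.5))  # 8.5
```

Avoiding that rounding keeps region features pixel-accurately aligned with the input, which is what makes the predicted masks trace object silhouettes so well.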
GG: The field of CV has changed drastically. The advent of the deep learning wave in 2012 has definitely had its impact. In my view, the most important change was the shift to large data, namely learning from large corpora. This has revealed new and exciting applications in CV, with models that generalize and work in the wild.
GG: Moving forward, I think the big question we will have to answer in research is the importance of supervised learning and how to move beyond it. With the shift to large data, collecting annotations for billions of examples is just not possible. Also, as we try to solve more complicated tasks, we often cannot annotate directly for the task at hand. This means we have to find ways to leverage limited annotations while also benefiting from much larger pools of unannotated data.
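One simple illustration of mixing limited labels with unlabeled data is pseudo-labeling: a model trained on the small labeled set predicts on unlabeled examples, and only its confident predictions are kept as extra training labels. A toy NumPy sketch of the selection step (the threshold and predictions below are made up for illustration):

```python
import numpy as np

def pseudo_labels(probs, threshold=0.9):
    """Keep only predictions the model is confident about.

    probs: (N, num_classes) class probabilities on unlabeled data.
    Returns indices of confident examples and their hard labels.
    """
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

# Made-up predictions on four unlabeled images (two classes).
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],   # too uncertain -- dropped
                  [0.02, 0.98],
                  [0.70, 0.30]])  # too uncertain -- dropped
keep, labels = pseudo_labels(probs)
print(keep, labels)  # [0 2] [0 1]
```

The kept examples are then folded back into training, letting the unannotated pool contribute supervision the annotators never provided.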
In addition to the technical challenges, it is now more evident than ever that we need to think about the impact of our work on society at a global scale. The ethical repercussions of our work are something every AI researcher needs to think about, understand, and deeply analyze in terms of their own work and their responsibility to the community.