April 18, 2019
This work introduces a new approach to object recognition that uses a single neural network to simultaneously recognize distinct foreground objects, such as animals or people (a task called instance segmentation), while also labeling pixels in the image background with classes, such as road, sky, or grass (a task called semantic segmentation). While previous research has mainly explored these two segmentation tasks separately, using different types of network architectures, our work shows that both tasks can be addressed in one unified architecture. The new approach is efficient in both memory and computation, and it establishes strong baseline performance for the recently introduced panoptic segmentation task, which merges semantic and instance segmentation into a combined task.
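To make the combined task concrete, here is a minimal sketch of what a panoptic output looks like: every pixel gets a class label, and pixels that belong to a foreground object also get an instance id. The label ids, the `merge_panoptic` helper, and the overlap rule (higher-scoring instances claim contested pixels first) are assumptions chosen for illustration, not the post-processing used in this research.

```python
# Hypothetical label ids, chosen only for this illustration.
SKY, ROAD, PERSON = 0, 1, 2


def merge_panoptic(semantic, instances):
    """Combine a per-pixel semantic map with instance masks.

    semantic:  2D list of class ids, one per pixel.
    instances: list of (class_id, score, mask), where mask is a 2D list of 0/1.
    Returns a 2D list of (class_id, instance_id) pairs; instance_id is None
    for background pixels that no instance covers.
    """
    height, width = len(semantic), len(semantic[0])
    # Start from the semantic labels: every pixel is background by default.
    panoptic = [[(semantic[y][x], None) for x in range(width)]
                for y in range(height)]
    claimed = [[False] * width for _ in range(height)]
    # Visit instances from highest to lowest score so that the more
    # confident instance wins any pixel that two masks both cover.
    ranked = sorted(enumerate(instances), key=lambda item: -item[1][1])
    for instance_id, (class_id, _score, mask) in ranked:
        for y in range(height):
            for x in range(width):
                if mask[y][x] and not claimed[y][x]:
                    panoptic[y][x] = (class_id, instance_id)
                    claimed[y][x] = True
    return panoptic


# A 2x3 toy image: sky on top, road below, two overlapping people.
semantic = [[SKY, SKY, SKY],
            [ROAD, ROAD, ROAD]]
instances = [
    (PERSON, 0.9, [[0, 1, 0], [0, 1, 0]]),
    (PERSON, 0.6, [[0, 1, 1], [0, 0, 1]]),
]
result = merge_panoptic(semantic, instances)
```

In this toy example the two person masks overlap at one pixel, and the instance with the higher score keeps it, while the leftmost column stays labeled as sky and road with no instance id.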
The new architecture endows Mask R-CNN, a widely used system for instance segmentation that was developed by Facebook researchers in 2017, with a semantic segmentation branch using a shared feature pyramid network (FPN) backbone. This architecture, called a Panoptic FPN, can generate semantic and instance segmentations in parallel, with accuracy equivalent to that of two separately trained single-task models while cutting the overall computation roughly in half. Our tests also demonstrate that when Panoptic FPN is given the same computational resources as two separate networks, it significantly outperforms the individual networks on the COCO and Cityscapes image recognition benchmarks.
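The "roughly in half" figure follows from simple accounting if the backbone dominates the cost: two separate networks each pay for a backbone, while the shared design pays for it once. The sketch below illustrates that arithmetic. The `relative_cost` helper and the cost values are hypothetical, in arbitrary units, and are not measurements from this work.

```python
def relative_cost(backbone_cost, instance_head_cost, semantic_head_cost):
    """Cost of one shared-backbone network relative to two separate networks."""
    # Two single-task networks: each runs its own backbone plus one head.
    separate = (backbone_cost + instance_head_cost) \
        + (backbone_cost + semantic_head_cost)
    # Shared design: the backbone runs once and feeds both heads.
    shared = backbone_cost + instance_head_cost + semantic_head_cost
    return shared / separate


# Hypothetical costs in arbitrary units, with the backbone dominating.
ratio = relative_cost(backbone_cost=100.0,
                      instance_head_cost=10.0,
                      semantic_head_cost=5.0)
```

The ratio approaches one half as the heads become cheap relative to the backbone, and it moves toward one as the heads take up a larger share of the total.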
In addition to showing that a single-network approach to panoptic segmentation is both effective and easy to implement, this work establishes a baseline for future research. Reducing the memory and computational overhead for rich and coherent image segmentation could have widespread implications for image recognition systems that have to make sense of cluttered real-world environments, where objects move and overlap. Segmenting foreground objects together with background is important for understanding entire scenes and for performing related actions, such as navigating through the dynamic features within a scene.