Check Out Our Paper on Instance Segmentation
A few months back, I released a research paper on arXiv, together with my friend Jonathan Mitchell, detailing some of the research we’d done on fast person instance segmentation.
The idea is straightforward. It is a “bottom-up” approach to instance segmentation: we first segment the image by class (i.e., assign a “person” or “background” label to each pixel), then group the person pixels into instances. To do this, we use a standard backbone network, such as ResNet50 or MobileNet (for deployment on mobile devices), together with the decoder module(s) from DeepLabv3+. But instead of producing only the single (per-class) mask output used in regular semantic segmentation, we add a second, four-channel output of the same resolution that encodes the bounding box of the instance to which each pixel belongs. Based on the predicted bounding box for each pixel, we can easily and efficiently group together pixels from the same instance.
This approach is inspired by single-stage object detectors like SSD and YOLO, which make per-region bounding box predictions. Our version can essentially be thought of as a very “dense” object detector: each pixel predicts an object class (in the first output of the network) as well as a bounding box for the object present. Just as in SSD, the bounding boxes are encoded as offsets from “prior” or “anchor” bounding boxes, which are centered at each pixel.
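To make the anchor encoding concrete, here is a minimal sketch of how a four-channel prediction at one pixel could be decoded into a box. It assumes the common SSD-style parameterization (center offsets scaled by the anchor size, log-scale width and height) with a single anchor per pixel. The function name, the anchor size, and the omission of SSD's variance scaling are all assumptions made for illustration and may differ from what the paper uses.

```python
import math

def decode_box(px, py, offsets, anchor_w=64.0, anchor_h=128.0):
    """Decode SSD-style offsets for the anchor centered at pixel (px, py).

    offsets is (dx, dy, dw, dh). The anchor size is a hypothetical choice.
    Returns the box as (x_min, y_min, x_max, y_max).
    """
    dx, dy, dw, dh = offsets
    # Center offsets are expressed as fractions of the anchor size.
    cx = px + dx * anchor_w
    cy = py + dy * anchor_h
    # Width and height are predicted in log space relative to the anchor.
    w = anchor_w * math.exp(dw)
    h = anchor_h * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# All-zero offsets reproduce the anchor itself, centered at the pixel.
print(decode_box(100, 200, (0.0, 0.0, 0.0, 0.0)))
```

With this kind of encoding, every pixel on the same person should decode to roughly the same box, no matter where on the body the pixel lies.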
Using the output of the network, a simple algorithm involving a non-max suppression step and an IoU measure for each person pixel is all that’s necessary to determine the instance masks. This algorithm is fast and easy to implement, even on mobile devices.
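The grouping step might look roughly like the sketch below, written in plain Python for readability. It assumes that non-max suppression over the per-pixel boxes yields one representative box per instance, and that each person pixel is then assigned to the representative whose box has the highest IoU with the pixel's own predicted box. The function names, thresholds, and this exact assignment rule are illustrative assumptions; the paper's procedure may differ in its details.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-max suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep

def group_pixels(boxes, scores, nms_thresh=0.5, min_iou=0.5):
    """Assign each person pixel to an instance.

    boxes[i] is the box predicted at person pixel i, scores[i] its confidence.
    Returns (instance_boxes, labels), where labels[i] is an index into
    instance_boxes, or -1 if no instance matches well enough.
    """
    instances = [boxes[k] for k in nms(boxes, scores, nms_thresh)]
    labels = []
    for box in boxes:
        ious = [iou(box, inst) for inst in instances]
        best = max(range(len(ious)), key=lambda j: ious[j]) if ious else -1
        labels.append(best if best >= 0 and ious[best] >= min_iou else -1)
    return instances, labels

# Four person pixels whose predicted boxes fall into two clusters.
boxes = [(0, 0, 10, 20), (1, 0, 11, 20), (30, 0, 40, 20), (29, 0, 39, 20)]
scores = [0.9, 0.8, 0.95, 0.7]
instances, labels = group_pixels(boxes, scores)
print(instances, labels)
```

This loop is quadratic in the worst case and is meant only to show the logic. A practical version would likely run suppression on a subsampled set of pixels and vectorize the IoU computation.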
Please check out our paper, “Bounding Box Embedding for Single Shot Person Instance Segmentation”, on arXiv, and contact me with any questions or suggestions.