LaVG first discovers the object masks in an image without any text information (panoptic cut), and then assigns a class from the text descriptions to each object via cross-modal similarity (object grounding).
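To make the second stage concrete, below is a minimal sketch of object grounding, assuming mask discovery has already produced binary object masks; the tensor shapes, helper name `ground_objects`, and pooling scheme are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of object grounding: assign each discovered mask the
# best-matching class via cross-modal (image-text) cosine similarity.
import torch
import torch.nn.functional as F

def ground_objects(patch_feats, masks, text_embeds):
    """patch_feats: (HW, D) patch embeddings from a vision-language encoder
    masks:       (K, HW) binary object masks from the mask-discovery stage
    text_embeds: (C, D) text embeddings of the class descriptions
    returns:     (K,) predicted class index per object"""
    # Average-pool the patch features inside each mask: one embedding per object.
    obj_embeds = masks.float() @ patch_feats / masks.sum(1, keepdim=True).clamp(min=1)
    # Cosine similarity between every object and every class description.
    sim = F.normalize(obj_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    return sim.argmax(dim=-1)

# Toy usage with random tensors standing in for real features and masks.
K, HW, D, C = 5, 196, 512, 20
labels = ground_objects(torch.randn(HW, D), torch.rand(K, HW) > 0.5, torch.randn(C, D))
```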
Results
Quantitative comparison
Performance (mIoU, %) comparison of state-of-the-art open-vocabulary segmentation methods on seven public benchmarks.
Models marked with † do not use CLIP.
Models marked with ✓ involve additional training on top of the pretrained backbone.
The numerical results for [12,62,70,81,112,121] are taken from SCLIP [97].
Comparison of computational cost
Comparison of computational cost, measured on an Nvidia RTX 3090.
Columns marked with † are taken from [21].
The outstanding advantage of training-free models is computational efficiency: they avoid the time, data, and compute costs of additional training. Compared to another training-free open-vocabulary segmentation method [98], which builds on diffusion models and thus relies on heavyweight image/caption generation models, ours uses far more lightweight feature embedding models. However, the iterative panoptic cut procedure does incur a longer inference time, as illustrated in the sketch below.
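The latency trade-off comes from discovering objects one at a time: each object requires solving a Normalized Cut, i.e., an eigendecomposition over the remaining patches. The sketch below illustrates this loop under simplified assumptions (a plain cosine-affinity graph and a fixed object budget); it is not the paper's implementation.

```python
# Illustrative sketch of iterative Normalized Cut: one eigendecomposition per
# discovered object, which is the source of the longer inference time.
import torch
import torch.nn.functional as F

def ncut_bipartition(feats):
    """Split patches in two via the Fiedler vector of the normalized Laplacian."""
    f = F.normalize(feats, dim=-1)
    A = (f @ f.T).clamp(min=0)                       # cosine-affinity graph
    d_inv_sqrt = A.sum(dim=1).rsqrt()
    L = torch.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = torch.linalg.eigh(L)                   # eigendecomposition (the bottleneck)
    fiedler = vecs[:, 1]                             # second-smallest eigenvector
    return fiedler > fiedler.mean()

def panoptic_cut(feats, max_objects=5):
    """Iteratively carve object masks out of the yet-unassigned patches."""
    n = feats.shape[0]
    masks, remaining = [], torch.arange(n)
    for _ in range(max_objects):                     # one Normalized Cut per object
        fg = ncut_bipartition(feats[remaining])
        mask = torch.zeros(n, dtype=torch.bool)
        mask[remaining[fg]] = True
        masks.append(mask)
        remaining = remaining[~fg]                   # recurse on the leftover region
        if remaining.numel() < 2:
            break
    return torch.stack(masks)

masks = panoptic_cut(torch.randn(196, 384))          # e.g., 14x14 grid of patch features
```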