LaVG first discovers the object masks in an image without any text information (panoptic cut), and then assigns a class from the text descriptions to each object via cross-modal similarity (object grounding).
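To make the second stage concrete, below is a minimal sketch of object grounding, assuming mask discovery has already produced binary object masks; the tensor shapes, helper name `ground_objects`, and pooling scheme are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of object grounding: assign each discovered mask the
# best-matching class via cross-modal (image-text) cosine similarity.
import torch
import torch.nn.functional as F

def ground_objects(patch_feats, masks, text_embeds):
    """patch_feats: (HW, D) patch embeddings from a vision-language encoder
    masks:       (K, HW) binary object masks from the mask-discovery stage
    text_embeds: (C, D) text embeddings of the class descriptions
    returns:     (K,) predicted class index per object"""
    # Average-pool the patch features inside each mask: one embedding per object.
    obj_embeds = masks.float() @ patch_feats / masks.sum(1, keepdim=True).clamp(min=1)
    # Cosine similarity between every object and every class description.
    sim = F.normalize(obj_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).T
    return sim.argmax(dim=-1)

# Toy usage with random tensors standing in for real features and masks.
K, HW, D, C = 5, 196, 512, 20
labels = ground_objects(torch.randn(HW, D), torch.rand(K, HW) > 0.5, torch.randn(C, D))
```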
Results
Quantitative comparison
Performance (mIoU, %) comparison of state-of-the-art open-vocabulary segmentation methods on seven public benchmarks.
Models marked with † do not use CLIP.
Models marked with ✓ involve additional training on top of the pretrained backbone.
The numerical results for [12,62,70,81,112,121] are taken from SCLIP [97].
Comparison of computational cost
Comparison of computational cost, measured on an Nvidia RTX 3090.
Columns marked with † are taken from [21].
The outstanding advantage of training-free models is computational efficiency: they avoid the time, data, and compute costs of additional training. Compared to another training-free open-vocabulary segmentation method [98], which builds on diffusion models and thus relies on heavyweight image/caption generation models, ours uses far more lightweight feature embedding models. However, the iterative panoptic cut procedure does incur a longer inference time, as illustrated in the sketch below.
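The latency trade-off comes from discovering objects one at a time: each object requires solving a Normalized Cut, i.e., an eigendecomposition over the remaining patches. The sketch below illustrates this loop under simplified assumptions (a plain cosine-affinity graph and a fixed object budget); it is not the paper's implementation.

```python
# Illustrative sketch of iterative Normalized Cut: one eigendecomposition per
# discovered object, which is the source of the longer inference time.
import torch
import torch.nn.functional as F

def ncut_bipartition(feats):
    """Split patches in two via the Fiedler vector of the normalized Laplacian."""
    f = F.normalize(feats, dim=-1)
    A = (f @ f.T).clamp(min=0)                       # cosine-affinity graph
    d_inv_sqrt = A.sum(dim=1).rsqrt()
    L = torch.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = torch.linalg.eigh(L)                   # eigendecomposition (the bottleneck)
    fiedler = vecs[:, 1]                             # second-smallest eigenvector
    return fiedler > fiedler.mean()

def panoptic_cut(feats, max_objects=5):
    """Iteratively carve object masks out of the yet-unassigned patches."""
    n = feats.shape[0]
    masks, remaining = [], torch.arange(n)
    for _ in range(max_objects):                     # one Normalized Cut per object
        fg = ncut_bipartition(feats[remaining])
        mask = torch.zeros(n, dtype=torch.bool)
        mask[remaining[fg]] = True
        masks.append(mask)
        remaining = remaining[~fg]                   # recurse on the leftover region
        if remaining.numel() < 2:
            break
    return torch.stack(masks)

masks = panoptic_cut(torch.randn(196, 384))          # e.g., 14x14 grid of patch features
```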