Is CLIP the main roadblock for fine-grained open-world perception?

CBMI 2024 Best Paper Award

¹ ISTI-CNR   ² University of Pisa

Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time — a task known as open-vocabulary object detection.

Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Although these backbones perform well on generic queries, recent studies have highlighted limitations in their fine-grained recognition capabilities in open-vocabulary settings, i.e., in distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. We therefore investigate whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time, for example because the cosine similarity matching function is unsuitable and may discard important object characteristics.

Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details.

Research questions

Q1. Are open-vocabulary detectors (OVDs) failing primarily due to CLIP, or does localization (bounding box prediction) also play a role?

We began our analysis by examining the relationship between open-vocabulary object detectors and their vision-language backbone, specifically focusing on CLIP.

CLIP vs. OWL-based detectors

Our results suggest that localization plays only a marginal role in the limitations observed in fine-grained open-vocabulary object detection. This indicates that the primary problem lies in the interaction between vision and language within the shared latent space.
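As a rough illustration of how localization can be factored out, the sketch below scores ground-truth object crops directly with a frozen CLIP model against one positive caption and several hard negatives, so any errors come from recognition rather than box prediction. The ViT-B/32 checkpoint, the open_clip package, the image path, and the captions are illustrative assumptions, not the exact benchmark protocol.

    import torch
    import open_clip
    from PIL import Image

    # Load a frozen CLIP backbone (ViT-B/32 chosen purely as an example).
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="openai"
    )
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model.eval()

    @torch.no_grad()
    def rank_captions(crop: Image.Image, captions: list[str]) -> int:
        """Return the index of the caption CLIP deems most similar to the crop."""
        image = preprocess(crop).unsqueeze(0)      # (1, 3, H, W)
        text = tokenizer(captions)                 # (N, 77)
        v = model.encode_image(image)
        t = model.encode_text(text)
        v = v / v.norm(dim=-1, keepdim=True)       # L2-normalize both embeddings
        t = t / t.norm(dim=-1, keepdim=True)
        sims = (v @ t.T).squeeze(0)                # cosine similarities, shape (N,)
        return int(sims.argmax())

    # A ground-truth crop matched against a positive caption and hard negatives
    # differing only in a fine-grained attribute (path and captions are placeholders).
    crop = Image.open("object_crop.jpg")
    captions = ["a red leather bag", "a blue leather bag", "a red fabric bag"]
    print("predicted:", captions[rank_captions(crop, captions)])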

Q2. If the issue is with CLIP, is it struggling to encode fine-grained information (from either the image or text), or is the image-text matching not functioning correctly?

Assuming that fine-grained knowledge is present within the CLIP latent space, we hypothesize that the matching scheme used to compare the representations, i.e., the typical cosine similarity, is insufficient to extract this specific information. To explore this possibility, we learn a customized similarity function S that takes as input the two embeddings v and t produced by the frozen visual and textual encoders. Since S must recognize nuanced object properties based only on the embedded information, success in this setting implies that the embeddings inherently encode fine-grained knowledge.
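A minimal sketch of this idea follows: a small MLP head S operates only on the frozen embeddings v and t and is trained to tell matching captions from hard negatives. The specific architecture, feature combination, and loss below are assumptions for illustration, not the exact design used in the paper.

    import torch
    import torch.nn as nn

    class LearnedSimilarity(nn.Module):
        """Illustrative similarity head S(v, t) over frozen CLIP embeddings.

        The two-layer MLP is an assumption for this sketch: the only requirement
        is that S operates solely on the frozen embeddings v and t.
        """

        def __init__(self, dim: int = 512, hidden: int = 1024):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(4 * dim, hidden),   # [v, t, |v - t|, v * t] as joint features
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
            v = nn.functional.normalize(v, dim=-1)
            t = nn.functional.normalize(t, dim=-1)
            joint = torch.cat([v, t, (v - t).abs(), v * t], dim=-1)
            return self.mlp(joint).squeeze(-1)    # unnormalized match score

    # Training-step sketch: v and t come from the *frozen* CLIP encoders; only S is updated.
    sim = LearnedSimilarity(dim=512)
    optimizer = torch.optim.AdamW(sim.parameters(), lr=1e-4)
    criterion = nn.BCEWithLogitsLoss()

    v = torch.randn(32, 512)                      # frozen image embeddings (placeholder batch)
    t = torch.randn(32, 512)                      # frozen text embeddings (positives + hard negatives)
    labels = torch.randint(0, 2, (32,)).float()   # 1 = matching caption, 0 = hard negative

    loss = criterion(sim(v, t), labels)
    loss.backward()
    optimizer.step()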

Examples

These results show that a more expressive similarity function can be learned on top of the frozen representations, and therefore that nuanced information is indeed present in CLIP embeddings. While fine-grained information exists within the CLIP latent space, the representation is heavily biased towards coarse-grained concepts. This bias places similar concepts too closely within the latent space, making it difficult to detect nuanced differences with plain cosine similarity.
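The latent-space re-projections mentioned in the abstract can be pictured along the same lines. The sketch below applies a single learned linear map to the frozen embeddings before the usual cosine matching and trains it with a margin loss against hard negatives; the projection shape, margin value, and training objective are illustrative assumptions rather than the paper's exact setup.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentReprojection(nn.Module):
        """Minimal sketch of a CLIP latent-space re-projection.

        A learned linear map is applied to the frozen embeddings before the
        usual cosine matching; the actual projection used may differ.
        """

        def __init__(self, dim: int = 512):
            super().__init__()
            self.proj = nn.Linear(dim, dim, bias=False)

        def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
            v = F.normalize(self.proj(v), dim=-1)
            t = F.normalize(self.proj(t), dim=-1)
            return (v * t).sum(dim=-1)            # cosine similarity in the re-projected space

    # Contrastive-style training sketch: pull matching (v, t) pairs together and
    # push hard negatives apart so fine-grained attributes become separable.
    reproj = LatentReprojection(dim=512)
    optimizer = torch.optim.AdamW(reproj.parameters(), lr=1e-4)

    v = torch.randn(32, 512)                      # frozen image embeddings (placeholder)
    t_pos = torch.randn(32, 512)                  # matching captions
    t_neg = torch.randn(32, 512)                  # hard negatives (e.g., wrong color/material)

    margin = 0.2
    loss = F.relu(margin - reproj(v, t_pos) + reproj(v, t_neg)).mean()
    loss.backward()
    optimizer.step()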

BibTeX

    @misc{bianchi2024clip,
      title={Is CLIP the main roadblock for fine-grained open-world perception?},
      author={Lorenzo Bianchi and Fabio Carrara and Nicola Messina and Fabrizio Falchi},
      year={2024},
      eprint={2404.03539},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

Acknowledgements

This work has received financial support from the Horizon Europe Research & Innovation Programme under Grant Agreement No. 101092612 (Social and hUman ceNtered XR — SUN project).

This work has received financial support from the European Union — Next Generation EU, Mission 4 Component 1, CUP B53D23026090001 (a MUltimedia platform for Content Enrichment and Search in audiovisual archives — MUCES, PRIN 2022 PNRR P2022BW7CW).