Recent advancements in large vision-language models have enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference.
In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation, which tests whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute a benchmark suite of increasing difficulty that probes different properties such as color, pattern, and material. We evaluate several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details.
We conclude the paper by highlighting the limitations of current methodologies and outlining promising research directions to overcome these shortcomings.
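To make the protocol concrete, the following is a minimal sketch of dynamic vocabulary generation with attribute-swapping hard negatives. The attribute pools, caption template, and function names are illustrative assumptions, not the exact implementation used in the paper.

import random

# Illustrative attribute pools; the benchmark's actual lists may differ.
ATTRIBUTES = {
    "color":    ["red", "blue", "green", "black", "white"],
    "pattern":  ["striped", "plaid", "dotted", "plain"],
    "material": ["cotton", "leather", "denim", "wool"],
}

def caption(attrs, noun):
    return f"a {attrs['color']} {attrs['pattern']} {attrs['material']} {noun}"

def make_vocabulary(positive_attrs, noun, num_negatives, rng=random):
    # The vocabulary holds the positive caption plus hard negatives that
    # each differ from the positive in exactly one attribute value.
    vocab = [caption(positive_attrs, noun)]
    while len(vocab) < num_negatives + 1:
        attr = rng.choice(list(ATTRIBUTES))
        alternatives = [v for v in ATTRIBUTES[attr] if v != positive_attrs[attr]]
        negative = dict(positive_attrs, **{attr: rng.choice(alternatives)})
        cap = caption(negative, noun)
        if cap not in vocab:
            vocab.append(cap)
    return vocab

# A 5-caption vocabulary for a red striped cotton shirt (1 positive + 4 negatives).
print(make_vocabulary(
    {"color": "red", "pattern": "striped", "material": "cotton"},
    "shirt", num_negatives=4))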
We evaluated several state-of-the-art models on our benchmark suite. In the first two rows of graphs, the y-axis reports the models' mean Average Precision (mAP) and the x-axis the number of negative captions in the vocabulary. While every model detects objects correctly in the absence of negative captions, performance degrades in every configuration as the number of fine-grained negative captions grows. This does not happen in the Trivial benchmark, where positive and negative captions do not differ in fine-grained details. This suggests that the main difficulty lies in correctly classifying an object's attributes rather than in localizing an object described by a complex natural-language query. Among the attribute types we tested, colors are the easiest to discern, likely because they appear more frequently in common datasets.
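The classification step behind these numbers can be illustrated with a toy example: a detection is credited only when the positive caption outscores every negative for the matched box, so each added hard negative is another chance for the detector to be wrong. The score values and the index-0-is-positive convention below are synthetic assumptions for illustration.

import numpy as np

def top1_assignments(box_caption_scores):
    # For each detected box, assign the vocabulary caption with the highest
    # score; evaluation credits the box only if this matches the positive.
    return box_caption_scores.argmax(axis=1)

# Synthetic scores: 3 boxes over a 5-caption vocabulary, index 0 = positive.
scores = np.array([
    [0.71, 0.40, 0.35, 0.52, 0.30],   # positive wins -> correct
    [0.55, 0.61, 0.33, 0.20, 0.48],   # hard negative wins -> misclassified
    [0.49, 0.47, 0.50, 0.44, 0.21],   # hard negative wins -> misclassified
])
print(top1_assignments(scores))  # [0 1 2]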
Alongside mAP, we report the Median Rank of the correct caption, computed over all detected objects in the benchmark by sorting each object's vocabulary by descending score. We introduced this metric because mAP considers only the maximally activated label; the median rank instead quantifies how confidently each detector ranks the correct label against the other choices available in the vocabulary. For example, in the Hard benchmark with eight or more negatives, even the top-performing models place the correct caption in the third position or lower for half of the objects.
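Under the same synthetic setup as above, the median rank can be computed as follows; the array layout (index 0 = positive caption) is again an assumption made for illustration.

import numpy as np

def median_rank(box_caption_scores, positive_index=0):
    # Sort each box's vocabulary by descending score, locate the 1-based
    # rank of the positive caption, and report the median over boxes.
    order = np.argsort(-box_caption_scores, axis=1)
    ranks = (order == positive_index).argmax(axis=1) + 1
    return float(np.median(ranks))

scores = np.array([
    [0.71, 0.40, 0.35, 0.52, 0.30],
    [0.55, 0.61, 0.33, 0.20, 0.48],
    [0.49, 0.47, 0.50, 0.44, 0.21],
])
print(median_rank(scores))  # 2.0: half the boxes rank the positive 2nd or worse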
@inproceedings{bianchi2024devil,
  title={The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding},
  author={Bianchi, Lorenzo and Carrara, Fabio and Messina, Nicola and Gennaro, Claudio and Falchi, Fabrizio},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22520--22529},
  year={2024}
}
This work has received financial support from the Horizon Europe Research & Innovation Programme under Grant Agreement No. 101092612 (Social and hUman ceNtered XR - SUN project).
This work has received financial support from the European Union - Next Generation EU, Mission 4 Component 1, CUP B53D23026090001 (a MUltimedia platform for Content Enrichment and Search in audiovisual archives - MUCES, PRIN 2022 PNRR P2022BW7CW).