About CaptionQA
CaptionQA evaluates whether image captions can stand in for images in downstream tasks. We measure utility: how well captions preserve visual information needed for real-world applications.
Benchmark Stats:
• 33,027 questions over 657 images
• 4 domains (Natural, Document, E-commerce, Embodied AI)
• 25 top-level + 69 subcategories
• 50.3 questions per image
Scoring Methodology
The leaderboard ranks models by a Score metric. Each question is scored 1.0 if answered correctly, 0.0 if answered incorrectly, and 1/K + 0.05 if the answer is "Cannot answer" (where K is the number of answer choices). The final Score is the average over all questions, reported as a percentage.
This favors precision over hallucination: saying less but avoiding wrong information scores higher than confidently misleading captions.
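The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the outcome labels (`"correct"`, `"incorrect"`, `"cannot_answer"`) and the function names are assumptions made for this example. Because a uniform random guess among K choices earns 1/K in expectation, abstaining (1/K + 0.05) is worth slightly more than guessing blindly.

```python
from typing import Iterable, Tuple

# Hypothetical outcome labels, chosen for this sketch only.
CORRECT = "correct"
INCORRECT = "incorrect"
CANNOT_ANSWER = "cannot_answer"


def question_score(outcome: str, num_choices: int) -> float:
    """Score one question: 1.0 if correct, 0.0 if incorrect, 1/K + 0.05 if abstained."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    if outcome == CORRECT:
        return 1.0
    if outcome == INCORRECT:
        return 0.0
    if outcome == CANNOT_ANSWER:
        return 1.0 / num_choices + 0.05
    raise ValueError(f"unknown outcome: {outcome!r}")


def final_score(results: Iterable[Tuple[str, int]]) -> float:
    """Average the per-question scores and return the result as a percentage."""
    scores = [question_score(outcome, k) for outcome, k in results]
    if not scores:
        raise ValueError("no results to score")
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Three 4-choice questions: one correct, one incorrect, one abstention.
    demo = [(CORRECT, 4), (INCORRECT, 4), (CANNOT_ANSWER, 4)]
    print(round(final_score(demo), 2))  # (1.0 + 0.0 + 0.30) / 3 -> 43.33
```

For a 4-choice question, an abstention earns 0.30, compared with an expected 0.25 for a random guess and 0.0 for a confident wrong answer.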
Submission
1. Download the dataset from HuggingFace
2. Generate captions with your model
3. (Optional) Evaluate on validation set
4. Email captions to captionqa.team@gmail.com
5. We evaluate and email you the Score results (see Scoring Methodology above)
6. Open a PR that adds a table row to the leaderboard repo
Privacy First: Email us only your captions (simple image_id: caption JSON), no personal info needed.
Easy PR: Copy our HTML template, fill in your scores, and add one table row to index.html.
See the detailed submission guide for the complete process.
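As a rough sketch of the caption file mentioned above (a simple `image_id: caption` JSON), the snippet below writes a flat mapping from image IDs to captions. The file name, the example image ID, and the helper `write_captions` are hypothetical; the exact schema should be taken from the detailed submission guide.

```python
import json
from typing import Dict


def write_captions(captions: Dict[str, str], path: str) -> None:
    """Write a flat {image_id: caption} mapping to a UTF-8 JSON file."""
    for image_id, caption in captions.items():
        # Reject empty entries early so they are caught before submission.
        if not isinstance(caption, str) or not caption.strip():
            raise ValueError(f"empty or non-string caption for image {image_id!r}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(captions, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Hypothetical image ID and caption, for illustration only.
    example = {"img_0001": "A red ceramic mug on a wooden desk next to a laptop."}
    write_captions(example, "captions.json")
```

Keeping the file to IDs and captions only matches the "no personal info" request in the submission steps.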
Citation
@misc{yang2025captionqacaptionusefulimage,
title={CaptionQA: Is Your Caption as Useful as the Image Itself?},
author={Shijia Yang and Yunong Liu and Bohan Zhai and Ximeng Sun and
Zicheng Liu and Emad Barsoum and Manling Li and Chenfeng Xu},
year={2025},
eprint={2511.21025},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.21025}
}
| Rank | Date | Model | Type | Size | Overall | Natural | Document | E-comm | Embodied |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2025-Nov-19 | GPT-5 (OpenAI) | Proprietary | - | 90.28 | 88.78 | 90.81 | 94.73 | 86.82 |
| 2 | 2025-Nov-19 | Gemini 2.5 Flash | Proprietary | - | 89.64 | 88.95 | 88.97 | 95.73 | 84.89 |
| 3 | 2025-Nov-19 | Gemini 2.5 Pro | Proprietary | - | 88.98 | 87.89 | 88.66 | 93.91 | 85.45 |
| 4 | 2025-Nov-19 | o4-mini (OpenAI) | Proprietary | - | 87.23 | 84.66 | 88.14 | 93.18 | 82.94 |
| 5 | 2025-Nov-19 | Qwen3-VL (Alibaba) | Open-Source | 30B-A3B | 87.02 | 86.14 | 85.89 | 93.9 | 82.15 |
| 6 | 2025-Nov-19 | Qwen3-VL (Alibaba) | Open-Source | 8B | 86.2 | 85.25 | 85.85 | 93.35 | 80.37 |
| 7 | 2025-Nov-19 | Qwen3-VL (Alibaba) | Open-Source | 4B | 86.01 | 84.73 | 84.99 | 93.77 | 80.56 |
| 8 | 2025-Nov-19 | GPT-4o (OpenAI) | Proprietary | - | 84.56 | 82.69 | 82.55 | 91.4 | 81.61 |
| 9 | 2025-Nov-19 | GLM-4.1V (Zhipu AI) | Open-Source | 9B | 84.28 | 81.67 | 87.86 | 92.04 | 75.56 |
| 10 | 2025-Nov-19 | Qwen2.5-VL (Alibaba) | Open-Source | 32B | 81.2 | 78.35 | 82.67 | 90.81 | 72.98 |
| 11 | 2025-Nov-19 | InternVL3.5 (Shanghai AI Lab) | Open-Source | 38B | 79.58 | 78.26 | 78.91 | 86.47 | 74.68 |
| 12 | 2025-Nov-19 | Qwen2.5-VL (Alibaba) | Open-Source | 72B | 79.12 | 75.26 | 80.56 | 89.07 | 71.6 |
| 13 | 2025-Nov-19 | Claude Sonnet 4.5 (Anthropic) | Proprietary | - | 78.94 | 76.56 | 83.09 | 88.86 | 67.27 |
| 14 | 2025-Nov-19 | InternVL3 (Shanghai AI Lab) | Open-Source | 8B | 77.84 | 76.46 | 75.83 | 87.01 | 72.07 |
| 15 | 2025-Nov-19 | InternVL3.5 (Shanghai AI Lab) | Open-Source | 30B-A3B | 76.96 | 74.58 | 77.72 | 85.79 | 69.75 |
| 16 | 2025-Nov-19 | InternVL3.5 (Shanghai AI Lab) | Open-Source | 8B | 76.34 | 72.97 | 78.56 | 86.6 | 67.24 |
| 17 | 2025-Nov-19 | InternVL3 (Shanghai AI Lab) | Open-Source | 14B | 76.06 | 74.16 | 74.17 | 86.17 | 69.75 |
| 18 | 2025-Nov-19 | Qwen2.5-VL (Alibaba) | Open-Source | 7B | 75.31 | 71.64 | 75.85 | 85.38 | 68.36 |
| 19 | 2025-Nov-19 | NVLM-D (NVIDIA) | Open-Source | 72B | 71.79 | 73.13 | 65.25 | 78.46 | 70.31 |
| 20 | 2025-Nov-19 | InternVL3.5 (Shanghai AI Lab) | Open-Source | 1B | 71.51 | 70.82 | 68.08 | 82.69 | 64.46 |
| 21 | 2025-Nov-19 | LLaVA-OneVision (ByteDance) | Open-Source | 7B | 66.03 | 66.56 | 61.45 | 75.09 | 61.01 |
| 22 | 2025-Nov-19 | LLaVA-1.5 (UW-Madison) | Open-Source | 7B | 46.96 | 52.51 | 36.48 | 49.0 | 49.84 |
| 23 | 2025-Nov-19 | InternVL3 (Shanghai AI Lab) | Open-Source | 78B | 36.46 | 38.86 | 34.19 | 38.47 | 34.32 |
| 24 | 2025-Nov-19 | Mistral Small 3.1 (Mistral AI) | Proprietary | 24B | 33.76 | 35.91 | 30.81 | 34.52 | 33.78 |