CaptionQA

Is Your Caption as Useful as the Image Itself?

About CaptionQA

CaptionQA evaluates whether image captions can stand in for images in downstream tasks. We measure utility: how well captions preserve visual information needed for real-world applications.

Benchmark Stats:
• 33,027 questions over 657 images
• 4 domains (Natural, Document, E-commerce, Embodied AI)
• 25 top-level + 69 subcategories
• ~50.3 questions per image on average

Scoring Methodology

The leaderboard ranks models by a single Score metric. For each question, a caption-based answer earns 1.0 if correct, 0.0 if incorrect, and 1/K + 0.05 for an explicit "Cannot answer" (where K is the number of answer choices). The final score is the average over all questions, reported as a percentage.

This rewards precision over hallucination: a caption that says less but avoids wrong information scores higher than one that is confidently misleading.
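The scoring rule above can be sketched in a few lines. This is an illustrative implementation, not the official evaluation code; the record fields ("pred", "answer", "num_choices") are assumed names for this sketch.

```python
# Sketch of the Score metric: 1.0 if correct, 0.0 if wrong,
# 1/K + 0.05 for an explicit abstention (K = number of choices).
def question_score(pred: str, answer: str, num_choices: int) -> float:
    if pred == "Cannot answer":
        return 1.0 / num_choices + 0.05
    return 1.0 if pred == answer else 0.0

def overall_score(records) -> float:
    """Average per-question score across the benchmark, as a percentage."""
    scores = [question_score(r["pred"], r["answer"], r["num_choices"])
              for r in records]
    return 100.0 * sum(scores) / len(scores)

# Toy example with hypothetical predictions:
records = [
    {"pred": "B", "answer": "B", "num_choices": 4},              # correct  -> 1.0
    {"pred": "Cannot answer", "answer": "C", "num_choices": 4},  # abstain  -> 0.30
    {"pred": "A", "answer": "D", "num_choices": 4},              # wrong    -> 0.0
]
print(overall_score(records))
```

Note that for K = 4, abstaining yields 0.30: better than guessing wrong, worse than answering correctly, which is what drives the precision-over-hallucination incentive.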

Resources

📝 Blog Posts:

English Blog

Rethinking Multimodality from an Industry Perspective

Chinese Blog / 中文博客

从产业视角重新审视多模态

Submission

1. Download the dataset from HuggingFace

2. Generate captions with your model

3. (Optional) Evaluate on validation set

4. Email captions to captionqa.team@gmail.com

5. We evaluate and email you the Score results (see Scoring Methodology above)

6. Open a PR adding a table row to the leaderboard repo

Privacy First: Email us only your captions (a simple image_id: caption JSON); no personal information is needed.
Easy PR: Copy our HTML template, fill in your scores, and add one table row to index.html.
See the detailed submission guide for the complete process.
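The submission file described above is a flat image_id-to-caption JSON map. A minimal sketch of producing one (the image IDs and captions below are made up for illustration; use the IDs from the downloaded dataset):

```python
import json

# Hypothetical captions keyed by dataset image ID.
captions = {
    "img_0001": "A red bicycle leaning against a brick wall next to a green door.",
    "img_0002": "A product page showing a stainless-steel kettle priced at $29.99.",
}

# Write the submission file; ensure_ascii=False preserves any non-ASCII text.
with open("captions.json", "w", encoding="utf-8") as f:
    json.dump(captions, f, ensure_ascii=False, indent=2)
```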

Citation

@misc{yang2025captionqacaptionusefulimage,
  title={CaptionQA: Is Your Caption as Useful as the Image Itself?},
  author={Shijia Yang and Yunong Liu and Bohan Zhai and Ximeng Sun and
          Zicheng Liu and Emad Barsoum and Manling Li and Chenfeng Xu},
  year={2025},
  eprint={2511.21025},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.21025}
}
Overall Leaderboard - Caption Utility Score (%)
All entries evaluated on 2025-Nov-19.

Rank  Model              Org              Type         Size     Overall  Natural  Document  E-comm  Embodied
1     GPT-5              OpenAI           Proprietary  -        90.28    88.78    90.81     94.73   86.82
2     Gemini 2.5 Flash   Google           Proprietary  -        89.64    88.95    88.97     95.73   84.89
3     Gemini 2.5 Pro     Google           Proprietary  -        88.98    87.89    88.66     93.91   85.45
4     o4-mini            OpenAI           Proprietary  -        87.23    84.66    88.14     93.18   82.94
5     Qwen3-VL           Alibaba          Open-Source  30B-A3B  87.02    86.14    85.89     93.90   82.15
6     Qwen3-VL           Alibaba          Open-Source  8B       86.20    85.25    85.85     93.35   80.37
7     Qwen3-VL           Alibaba          Open-Source  4B       86.01    84.73    84.99     93.77   80.56
8     GPT-4o             OpenAI           Proprietary  -        84.56    82.69    82.55     91.40   81.61
9     GLM-4.1V           Zhipu AI         Open-Source  9B       84.28    81.67    87.86     92.04   75.56
10    Qwen2.5-VL         Alibaba          Open-Source  32B      81.20    78.35    82.67     90.81   72.98
11    InternVL3.5        Shanghai AI Lab  Open-Source  38B      79.58    78.26    78.91     86.47   74.68
12    Qwen2.5-VL         Alibaba          Open-Source  72B      79.12    75.26    80.56     89.07   71.60
13    Claude Sonnet 4.5  Anthropic        Proprietary  -        78.94    76.56    83.09     88.86   67.27
14    InternVL3          Shanghai AI Lab  Open-Source  8B       77.84    76.46    75.83     87.01   72.07
15    InternVL3.5        Shanghai AI Lab  Open-Source  30B-A3B  76.96    74.58    77.72     85.79   69.75
16    InternVL3.5        Shanghai AI Lab  Open-Source  8B       76.34    72.97    78.56     86.60   67.24
17    InternVL3          Shanghai AI Lab  Open-Source  14B      76.06    74.16    74.17     86.17   69.75
18    Qwen2.5-VL         Alibaba          Open-Source  7B       75.31    71.64    75.85     85.38   68.36
19    NVLM-D             NVIDIA           Open-Source  72B      71.79    73.13    65.25     78.46   70.31
20    InternVL3.5        Shanghai AI Lab  Open-Source  1B       71.51    70.82    68.08     82.69   64.46
21    LLaVA-OneVision    ByteDance        Open-Source  7B       66.03    66.56    61.45     75.09   61.01
22    LLaVA-1.5          UW-Madison       Open-Source  7B       46.96    52.51    36.48     49.00   49.84
23    InternVL3          Shanghai AI Lab  Open-Source  78B      36.46    38.86    34.19     38.47   34.32
24    Mistral Small 3.1  Mistral AI       Proprietary  24B      33.76    35.91    30.81     34.52   33.78