An open-source evaluation benchmark plus a unified judge: T2I finally has a scoring system that holds up. But realism and creativity are still the dividing line between models.

BlockBeatNews
Alibaba open-sources T2I evaluation benchmark Qwen-Image-Bench; GPT Image 2 takes first place with all-around strength across five categories
Alibaba's Qwen team has open-sourced Qwen-Image-Bench, an image-generation evaluation benchmark, and Q-Judger, a unified visual judge, for assessing text-to-image (T2I) capabilities. The benchmark covers five dimensions (image quality, aesthetics, image alignment, realism, and creativity) with 23 sub-skills and 56 indicators, and includes 1,000 bilingual prompts in Chinese and English. Eighty professional reviewers conducted blind reviews, producing more than 130,000 pairwise annotations, and the agreement between the judge and human scores reached 92%. Among the first batch of 18 models evaluated, GPT Image 2 ranks first, though even the top models still show a gap in the realism and creativity dimensions. Details such as drawing style, gravity, and lighting effects remain common bottlenecks.