Blows everything out of the water on LM Arena.
https://arena.ai/leaderboard/text-to-image
242 Elo points clear of the next best model, with a 93% win rate against randomly matched opponents (96% against nano banana), while Gemini 3.1 (second best) sits at 67%. That's quite the leap.
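For a sense of scale, here's a rough back-of-the-envelope, assuming the standard logistic Elo formula (note the 93% figure is across all opponents, not just the runner-up):

    # Head-to-head win probability implied by a 242-point Elo gap
    # under the standard logistic Elo model: P = 1 / (1 + 10^(-diff/400))
    p = 1 / (1 + 10 ** (-242 / 400))
    print(round(p, 3))  # ~0.801, i.e. ~80% expected wins vs. the runner-up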
How is text-to-image even scored? It seems like a subjective measurement...
Users get two completions for their prompt and pick the one they prefer. From those pairwise preferences you can fit a Bradley-Terry model to get Elo-style scores per model.
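If you're curious what that fitting step looks like, here's a minimal sketch in Python (the model names and win counts below are made up, and LM Arena's real pipeline likely also handles ties and reports confidence intervals):

    import math

    # Hypothetical pairwise results: wins[a][b] = times model a beat model b.
    wins = {
        "model_a": {"model_b": 70, "model_c": 60},
        "model_b": {"model_a": 30, "model_c": 55},
        "model_c": {"model_a": 40, "model_b": 45},
    }
    models = list(wins)

    # Bradley-Terry assumes P(a beats b) = p_a / (p_a + p_b).
    # Fit the strengths p with the classic minorization-maximization update.
    p = {m: 1.0 for m in models}
    for _ in range(200):
        new_p = {}
        for a in models:
            total_wins = sum(wins[a].values())
            denom = sum(
                (wins[a].get(b, 0) + wins[b].get(a, 0)) / (p[a] + p[b])
                for b in models if b != a
            )
            new_p[a] = total_wins / denom
        scale = len(models) / sum(new_p.values())  # rescaling doesn't change the probabilities
        p = {m: v * scale for m, v in new_p.items()}

    # Convert strengths to an Elo-like scale: a 400-point gap ~ 10:1 odds.
    mean_log = sum(math.log10(v) for v in p.values()) / len(models)
    elo = {m: 1000 + 400 * (math.log10(v) - mean_log) for m, v in p.items()}
    for m, rating in sorted(elo.items(), key=lambda kv: -kv[1]):
        print(f"{m}: {rating:.0f}")

Since multiplying all the strengths by a constant changes nothing, only rating differences are meaningful, which is why leaderboards anchor the absolute scale arbitrarily.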
LM Arena is a particularly bad comparison site, too. The prompts users submit there are usually incredibly generic, like "A digital render of a sleek, futuristic motorcycle racing through a neon-lit cityscape."
I actually built GenAI Showdown a while back because I was deeply unsatisfied with LM Arena and other purported comparison tables, which either (A) relied solely on visual fidelity (a far less interesting benchmark than prompt adherence, IMHO) and/or (B) relied on extremely simplistic and banal prompts.
[dupe] https://news.ycombinator.com/item?id=47853000