Testing Models

Notes from the arena

What the head-to-head runs actually show, how the methodology works, and takes you can disagree with by voting.

Release reports

Fable 5 is back — and the arena receipts are worth reading

July 2, 2026 · 6 min

Anthropic’s Mythos-class model is usable again after the launch crunch. What 18 one-shot coding challenges, 3 compiled Godot games, and the community votes actually say about it.

Method notes

Why we run AI model outputs live, not screenshots

July 2, 2026 · 5 min

Every model comparison you see on social media is a screenshot of the best run out of many. Here are the four rules this arena uses instead, and what live execution catches that images hide.

Takes & analysis

Does thinking effort actually matter? Same model, low vs max

July 2, 2026 · 5 min

Opus 4.8 and Sonnet ship at up to six effort levels each. The arena lets you blind-compare a model against itself — and the output data says effort is not the dial you think it is.

The same prompt, every model, real outputs — compared side-by-side, judged blind, voted on by everyone.

54 challenges · 53 variants · $189 est. output spend
counted at build, 2026-07-05

Arenas
  • Coding arena (live)
  • Writing arena (live)
  • Images (guide)
  • Videos (guide)
  • Voice (guide)
  • Music (guide)
Data
  • Leaderboard
  • Method & changelog
  • Blog
Author
  • Patryk Raba on LinkedIn
  • raba.pl
Built by Patryk Raba. Cookieless Umami analytics are always on; PostHog runs only with consent.