How we test models
The entire site reduces to one rule: same words in, real outputs out, you judge. This page is the fine print — citable, versioned, and honest about limitations.
Prompt policy
Every challenge is one brief, written once, handed verbatim to every model. No per-model tuning, no system-prompt tricks, no quiet retries to make a favorite look good. When a generation crashes with zero output, we re-run it and mark the entry as regenerated — the original brief never changes.
The full prompt is published on every challenge — open any task in the arena and hit ℹ Details. If you think a brief favors a model, the receipts are right there.
Runs & harness
Single-file tasks are one-shot: the model gets the brief and returns one self-contained HTML file — no follow-ups, no fixing. That file is served unmodified as the artifact you interact with. Where a model family exposes thinking-effort levels, we run the same brief at each level and publish every variant separately — seeing what extra reasoning actually buys is half the point.
The three Godot tasks are the exception and are labeled as such: they run through an agentic pipeline (godforge) in which the model iterates (write engine code, compile a WebAssembly export, smoke-test, fix) until the build passes or 40 turns run out.
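The iterate-until-pass loop described for the Godot tasks can be sketched roughly as follows. This is not godforge's actual code: `run_agentic_task`, `BuildResult`, `Outcome`, and the two callbacks are hypothetical names, and returning the last attempt marked as failed once the turns run out is an assumption, since the page does not say how exhausted runs are handled.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_TURNS = 40  # turn limit stated on this page


@dataclass
class BuildResult:
    passed: bool  # True when the export compiled and the smoke test succeeded
    log: str      # compiler / smoke-test output fed back to the model


@dataclass
class Outcome:
    passed: bool
    turns_used: int
    code: str


def run_agentic_task(
    brief: str,
    propose: Callable[[str, Optional[str]], str],   # hypothetical: model call
    build_and_test: Callable[[str], BuildResult],   # hypothetical: compile + smoke-test
    max_turns: int = MAX_TURNS,
) -> Outcome:
    """Loop: write code, build, test, feed errors back, stop on pass or turn limit."""
    feedback: Optional[str] = None  # no feedback on the first turn
    code = ""
    turn = 0
    for turn in range(1, max_turns + 1):
        code = propose(brief, feedback)
        result = build_and_test(code)
        if result.passed:
            return Outcome(passed=True, turns_used=turn, code=code)
        feedback = result.log  # the next turn sees what went wrong
    # Assumption: an exhausted run keeps its last attempt, marked as failed.
    return Outcome(passed=False, turns_used=turn, code=code)
```

The brief is passed unchanged on every turn; only the build feedback varies, which matches the rule that the original brief never changes.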
Currently on the stand: 54 challenges, 53 model variants across 13 families, 1,130 artifacts. Counted from the manifest at build, 2026-07-05.
Cost estimation
Every artifact is a single file, so we estimate output tokens from file size and price them at published per-model output rates:
tokens ≈ characters ÷ 4
cost = tokens × published $/1M output
These are estimates, not invoices. Input and reasoning tokens are not counted, so true cost is higher, especially at high thinking effort. Models without published pricing show no cost. Wall-clock generation time is recorded for runs from 2026-07-02 onward and includes queue and throttling waits. Total estimated output spend across the site so far: $179.
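As a minimal sketch of the estimate above, the arithmetic could look like this in Python. The function names are hypothetical, and the $15 per 1M output tokens used in the example is an illustrative rate, not any model's published price.

```python
from typing import Optional

CHARS_PER_TOKEN = 4  # the characters-per-token heuristic stated on this page


def estimate_output_tokens(char_count: int) -> float:
    """Approximate output tokens from the artifact's character count."""
    return char_count / CHARS_PER_TOKEN


def estimate_cost_usd(
    char_count: int, usd_per_million_output: Optional[float]
) -> Optional[float]:
    """Estimated output cost in USD, or None when no price is published."""
    if usd_per_million_output is None:
        return None  # models without published pricing show no cost
    tokens = estimate_output_tokens(char_count)
    return tokens * usd_per_million_output / 1_000_000


# Illustrative only: a 48,000-character file at a hypothetical $15 per 1M output tokens.
example_tokens = estimate_output_tokens(48_000)   # 12,000 tokens
example_cost = estimate_cost_usd(48_000, 15.0)    # about $0.18
```

Input and reasoning tokens are left out of this sketch on purpose, mirroring the estimate's stated limitation.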
Blind voting
First-time visitors judge blind by default: labels hidden, panes shuffled. Voting triggers the reveal — who you picked, and how the community split. Votes land in a shared database keyed per task; one vote per task per browser, changeable any time. The tally you see is everyone’s votes combined, and it is public on the leaderboard.
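The shuffle and the one-vote-per-task-per-browser rule can be sketched as below. This is an in-memory illustration under assumptions, not the site's database schema: `VoteStore`, `shuffled_panes`, and the `browser_id` key are hypothetical names.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple


def shuffled_panes(
    variant_ids: Sequence[str], rng: Optional[random.Random] = None
) -> List[str]:
    """Return the variants in random order so position gives no hint of identity."""
    rng = rng or random.Random()
    panes = list(variant_ids)  # copy, so the caller's sequence is left untouched
    rng.shuffle(panes)
    return panes


class VoteStore:
    """Keeps at most one vote per (task, browser); a new vote overwrites the old one."""

    def __init__(self) -> None:
        self._votes: Dict[Tuple[str, str], str] = {}

    def cast(self, task_id: str, browser_id: str, variant_id: str) -> None:
        # Writing to the same key replaces the earlier choice,
        # which is what makes a vote changeable at any time.
        self._votes[(task_id, browser_id)] = variant_id

    def tally(self, task_id: str) -> Counter:
        """Combined votes for one task across all browsers."""
        return Counter(
            variant for (task, _), variant in self._votes.items() if task == task_id
        )
```

Keying on the (task, browser) pair means a changed vote moves one count from the old variant to the new one instead of adding a second vote.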
Heavy tasks (WebGL, large simulations, Godot exports) start as posters and load all panes on one click. This is partly to save your laptop and partly a matter of method: no output should win votes just by loading first.
What this is not
Not a lab benchmark. No pass@k, no held-out test sets, no significance testing. This is a like-for-like showcase of one-shot generations judged by people. It answers a different question than MMLU does: given the same brief, whose output would you actually ship, play, or read? Community votes measure preference, and preference has biases. We mitigate them with blind-first judging and fair loading, not by pretending the votes are science.
Changelog
- 2026-07-05: Blind-first onboarding, tournament mode, shareable matchup links, fairness posters for heavy tasks.
- 2026-07-03: W11 translation brief added to the writing arena; leaderboard page shipped.
- 2026-07-02: Wall-clock generation time recorded for all new runs (includes queue and throttling waits).
- 2026-06-28: Fable 5 post-ban rerun published as a separate max variant next to the pre-ban originals.
- 2026-06-24: Coding arena expanded to 43 tasks; flash-era classics and simulation briefs added.
Disagree with a verdict?
Good. The arena is the argument: open a challenge, judge it blind, and vote.
Open the coding arena