Methodology

How we test models

The entire site reduces to one rule: same words in, real outputs out, you judge. This page is the fine print — citable, versioned, and honest about limitations.

Prompt policy

Every challenge is one brief, written once, handed verbatim to every model. No per-model tuning, no system-prompt tricks, no quiet retries to make a favorite look good. When a generation crashes with zero output, we re-run it and mark the entry as regenerated — the original brief never changes.
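As a rough illustration of the policy above, here is a minimal Python sketch. It is not the site's actual harness: `Entry`, `run_brief`, and the `generate` callable are hypothetical names, a crash with zero output is modeled as an empty string, and the single re-run is an assumption, since the text does not say how many re-runs are allowed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Entry:
    """One published result for one model (hypothetical record shape)."""
    model: str
    output: str
    regenerated: bool = False


def run_brief(brief: str, model: str, generate: Callable[[str, str], str]) -> Entry:
    """Send the unmodified brief to one model.

    `generate(model, brief)` is a stand-in for whatever API call produces
    the output. An empty string models "crashed with zero output".
    """
    output = generate(model, brief)
    if output:
        return Entry(model, output)
    # Zero output: re-run with the exact same brief and flag the entry.
    # (Assumption: one re-run; the source does not state a limit.)
    return Entry(model, generate(model, brief), regenerated=True)
```

The point of the sketch is that the brief is passed through untouched on both attempts, and the only thing that changes on a re-run is the `regenerated` flag.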

The full prompt is published on every challenge — open any task in the arena and hit ℹ Details. If you think a brief favors a model, the receipts are right there.

Runs & harness

Single-file tasks are one-shot: the model gets the brief and returns one self-contained HTML file — no follow-ups, no fixing. That file is served unmodified as the artifact you interact with. Where a model family exposes thinking-effort levels, we run the same brief at each level and publish every variant separately — seeing what extra reasoning actually buys is half the point.
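One way to picture the one-shot, per-effort-level runs described above is the sketch below. Everything named here is an assumption for illustration: `run_variants`, the `generate` callable, the effort-level strings, and the `model-level` naming scheme are not taken from the site.

```python
from typing import Callable, Dict, Optional, Sequence


def run_variants(
    brief: str,
    model: str,
    effort_levels: Sequence[str],
    generate: Callable[[str, str, Optional[str]], str],
) -> Dict[str, str]:
    """Run the same brief once per thinking-effort level.

    Returns a mapping from variant name to the HTML file the model returned.
    Models without effort levels get a single variant named after the model.
    """
    levels: Sequence[Optional[str]] = list(effort_levels) or [None]
    artifacts: Dict[str, str] = {}
    for level in levels:
        # Hypothetical naming scheme: "model-level", or just "model".
        name = model if level is None else f"{model}-{level}"
        # Exactly one call per variant: no follow-ups, no fixing.
        artifacts[name] = generate(model, brief, level)
    return artifacts
```

Each returned file would then be served as-is, so every variant appears as its own artifact rather than as a best-of selection.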

The three Godot tasks are the exception, and are labeled as such: an agentic pipeline (godforge) where the model iterates (write engine code, compile a WebAssembly export, smoke-test, fix) until the build passes or 40 turns run out.
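The loop described above can be sketched roughly as follows. This is not godforge's real interface, which is not documented here: `agentic_build`, `write_code`, and `build_and_smoke_test` are hypothetical names, and the compile and smoke-test steps are collapsed into one callable for brevity. Only the 40-turn cap comes from the text.

```python
from typing import Callable, Tuple

MAX_TURNS = 40  # turn cap stated in the methodology


def agentic_build(
    brief: str,
    write_code: Callable[[str, str], str],
    build_and_smoke_test: Callable[[str], Tuple[bool, str]],
    max_turns: int = MAX_TURNS,
) -> Tuple[bool, int]:
    """Iterate write -> build -> test -> fix until success or the turn cap.

    `write_code(brief, feedback)` stands in for the model call and
    `build_and_smoke_test(code)` returns (passed, log). Both are hypothetical.
    Returns (passed, turns_used).
    """
    feedback = ""  # nothing to fix on the first turn
    for turn in range(1, max_turns + 1):
        code = write_code(brief, feedback)
        passed, log = build_and_smoke_test(code)
        if passed:
            return True, turn
        feedback = log  # the failure log feeds the next attempt
    return False, max_turns
```

Unlike the single-file tasks, the model here sees its own build failures, which is why these results are not directly comparable with one-shot artifacts.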

Currently on the stand: 54 challenges, 53 model variants across 13 families, 1,130 artifacts. Counted from the manifest at build, 2026-07-05.

Cost estimation

Every artifact is a single file, so we estimate output tokens from file size and price them at published per-model output rates:

tokens ≈ characters ÷ 4
cost   ≈ (tokens ÷ 1,000,000) × published $ per 1M output tokens

These are estimates, not invoices. Input and reasoning tokens are not counted, so true cost is higher, especially at high thinking effort. Models without published pricing show no cost. Wall-clock generation time is recorded for runs from 2026-07-02 onward and includes queue and throttling waits. Total estimated output spend across the site so far: $179.
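For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the estimate. The function names are hypothetical, and the $10 per 1M output tokens used in the usage note is an invented example rate, not any model's real price.

```python
from typing import Optional

CHARS_PER_TOKEN = 4  # the rule of thumb stated above


def estimate_tokens(char_count: int) -> float:
    """Approximate output tokens from the artifact's character count."""
    return char_count / CHARS_PER_TOKEN


def estimate_cost(char_count: int, usd_per_million_output: Optional[float]) -> Optional[float]:
    """Estimated output cost in USD, or None when no price is published.

    Input and reasoning tokens are deliberately left out, matching the
    caveat above, so this is a lower bound on what a run really cost.
    """
    if usd_per_million_output is None:
        return None
    return estimate_tokens(char_count) * usd_per_million_output / 1_000_000
```

For example, a 40,000-character HTML file works out to about 10,000 tokens, so at a made-up rate of $10 per 1M output tokens `estimate_cost(40_000, 10.0)` comes to roughly $0.10.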

What this is not

Not a lab benchmark. No pass@k, no held-out test sets, no statistical significance. This is a like-for-like showcase of one-shot generations judged by people. It answers a different question than MMLU does: given the same brief, whose output would you actually ship, play, or read? Community votes measure preference, and preference has biases — we mitigate with blind-first judging and fair loading, not by pretending the votes are science.

Changelog

2026-07-05: Blind-first onboarding, tournament mode, shareable matchup links, fairness posters for heavy tasks.
2026-07-03: W11 translation brief added to the writing arena; leaderboard page shipped.
2026-07-02: Wall-clock generation time recorded for all new runs (includes queue and throttling waits).
2026-06-28: Fable 5 post-ban rerun published as a separate max variant next to the pre-ban originals.
2026-06-24: Coding arena expanded to 43 tasks; flash-era classics and simulation briefs added.

Disagree with a verdict?

Good. The arena is the argument: open a challenge, judge it blind, and vote.
