
Why we run AI model outputs live, not screenshots

Most model comparisons you see on social media are screenshots of the best run out of many. Here are the four rules this arena uses instead, and what live execution catches that images hide.


Model comparisons on social media follow a script. Someone runs a prompt a handful of times, keeps the best result, crops it into a screenshot, and writes a caption declaring a winner. The screenshot might be real. What it cannot show you is the run that crashed before it, the buttons that do nothing when clicked, or the “smooth 60fps” animation that is actually a slideshow.

This arena exists because I wanted the opposite of that. Four rules, applied across all 43 coding challenges, with one documented exception described at the end.

Rule 1: the same prompt, verbatim, one shot

Every model gets the identical brief wrapped in the identical boilerplate: one fully self-contained HTML file, all CSS and JS inline, zero external resources, no second chances. The exact prompt — boilerplate included — is published under the Details panel of every task. If you think a prompt is unfair to some model, you can read it and say so.
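
The "zero external resources" requirement can be checked mechanically. The following is a minimal Python sketch of one way to do that, not a description of how the arena validates entries. The names `ExternalResourceFinder` and `find_external_resources` are hypothetical, and the check is deliberately narrow: it looks only at `src` attributes and `<link href>`, and does not inspect CSS `url(...)` or `@import` rules inside inline styles.

```python
from html.parser import HTMLParser


class ExternalResourceFinder(HTMLParser):
    """Collects attribute values that would make the browser fetch another file."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            # src on any tag (script, img, iframe, ...) and href on <link>
            # trigger a fetch. href on <a> is only navigation, so it is ignored.
            loads_resource = name == "src" or (tag == "link" and name == "href")
            # data: URIs and same-document fragments stay inside the file.
            if loads_resource and not value.startswith(("data:", "#")):
                self.external.append((tag, value))


def find_external_resources(html_text):
    """Return a list of (tag, url) pairs for resources outside the file."""
    finder = ExternalResourceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

An empty list means the file passed this narrow check. A non-empty list names each tag and URL that breaks the single-file rule, such as a `<script src>` pointing at a CDN.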

Rule 2: the artifact runs live in your browser

What you see in each pane is not an image or a video — it is the model’s actual output file executing in an iframe on your machine. You can play the games, click the dashboards, resize the layouts, and open any entry full-screen. Live execution is brutal in a way screenshots are not:

  • A landing page that looks polished in a thumbnail falls apart the moment you scroll it.
  • A game with beautiful art but broken collision reveals itself in three seconds of play.
  • The raw-WebGL challenge bans every 3D library — most models ship a black screen. A screenshot comparison would simply skip those runs; the arena shows them.
  • The CHIP-8 emulator challenge grades itself: a test ROM draws a checkmark or an X per CPU instruction, on screen, in front of you.
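
For readers curious what "executing in an iframe" can look like, here is a minimal Python sketch that wraps an output file in an `<iframe>` element. It assumes the pane is built with the standard `srcdoc` and `sandbox` attributes; the arena may well load entries differently (for example via `src`), and `build_pane` is a hypothetical helper name.

```python
import html


def build_pane(output_html, title):
    """Return an <iframe> element that runs the given HTML document inline."""
    # srcdoc carries the whole document as an attribute value, so it must be
    # escaped; the browser un-escapes it before parsing the inner document.
    srcdoc = html.escape(output_html, quote=True)
    safe_title = html.escape(title, quote=True)
    # sandbox="allow-scripts" lets the entry's JS run but keeps it in an opaque
    # origin, away from the host page's cookies and storage.
    return (
        f'<iframe sandbox="allow-scripts" title="{safe_title}" '
        f'srcdoc="{srcdoc}"></iframe>'
    )
```

The design point is that nothing is rendered ahead of time: the browser parses and runs the document itself, so a crash, a dead button, or a black canvas shows up exactly as the model wrote it.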

Rule 3: judge blind, then see the labels

Model names are a bias machine. Blind mode hides the labels and shuffles the panes, and it switches on automatically when you change the lineup, so you form an opinion about the output before you know whose it is. Labels are revealed when you vote. It is uncomfortable how often the reveal surprises me, and I built the thing.
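
The hide-and-shuffle step is simple enough to sketch. This Python example is an illustration under assumptions, not the arena's code: `make_blind_lineup` and the "Pane A" style labels are made up, and the sketch supports at most 26 panes.

```python
import random
import string


def make_blind_lineup(model_names, rng=None):
    """Shuffle the models and return a mapping of neutral pane label -> model."""
    rng = rng or random.Random()
    order = list(model_names)  # copy so the caller's list is not reordered
    rng.shuffle(order)
    # Pane labels A, B, C... carry no information about the model behind them.
    panes = [f"Pane {string.ascii_uppercase[i]}" for i in range(len(order))]
    return dict(zip(panes, order))
```

The viewer would see only the keys ("Pane A", "Pane B", and so on). The values stay hidden until a vote is cast, at which point the mapping is the reveal.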

Rule 4: votes are public and shared

Every vote lands in one shared database, and the tallies you see are everyone’s votes combined — not my editorial opinion, not a curated leaderboard. You can vote, unvote, and change your mind. When a model’s entry was regenerated because the original run crashed with no output, the Details panel says so explicitly.
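
Vote, unvote, and change-your-mind fit a small data structure. The Python sketch below assumes one vote per voter per task, which this post does not state; `VoteBoard` and its method names are hypothetical, and an in-memory dict stands in for the shared database.

```python
from collections import Counter


class VoteBoard:
    """Keeps one current choice per (voter, task) and tallies them on demand."""

    def __init__(self):
        self._choice = {}  # (voter_id, task_id) -> model name

    def vote(self, voter_id, task_id, model):
        # Overwriting the previous choice is what "changing your mind" means here.
        self._choice[(voter_id, task_id)] = model

    def unvote(self, voter_id, task_id):
        # Removing a vote that does not exist is a no-op.
        self._choice.pop((voter_id, task_id), None)

    def tally(self, task_id):
        """Count every voter's current choice for one task."""
        return Counter(m for (_, t), m in self._choice.items() if t == task_id)
```

Because the tally is computed from each voter's current choice rather than from a running counter, a changed or withdrawn vote can never be counted twice.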

The one documented exception

Three challenges are compiled Godot 4 games from an agentic pipeline, not single-file one-shots — the model iterates on real engine code until its build passes. They are clearly marked, the pipeline is described in the Details panel, and only the models that ran the pipeline appear on those tasks.

None of this makes the arena a formal benchmark, and it does not pretend to be one. It is something I find more useful: the raw material for your own judgment, with nothing between you and the output. Pick a challenge and judge it yourself.

Don’t take the post’s word for it

The arena runs every model’s real output live. Pick a challenge, go blind, and cast a vote that counts in the public tally.
