Does thinking effort actually matter? Same model, low vs max
Opus 4.8 and the Sonnet models ship at up to six effort levels each. The arena lets you blind-compare a model against itself, and the output data says effort is not the dial you think it is.
The most underused comparison in this arena is not Claude vs GLM. It is a model against itself. Opus 4.8 ships here at six effort levels (low, medium, high, xhigh, max, ultracode), Sonnet 4.6 at four, Sonnet 5 at six. The model picker happily lets you put Opus·low in one pane and Opus·max in the other, hit Compare blind, and find out whether all that extra thinking bought anything you can actually see.
What the output sizes say
Each pane shows an estimated output-token count for its entry (file size divided by four — approximate, and labeled as such). Comparing low vs max across our challenges, three patterns show up:
- Sometimes effort buys real substance. On the maze tower defense, Opus·low shipped ~19,700 output tokens and Opus·max ~26,900 — and the max entry tends to have more of the required mechanics actually implemented. The CHIP-8 emulator moves from ~8,900 to ~12,300 the same way.
- Sometimes it buys almost nothing. On the landing-page challenge, Opus·low and Opus·max land within a few percent of each other (~16,000 vs ~16,700 tokens). The cave-copter game is flat to slightly negative.
- Sometimes more thinking means less output. Sonnet 4.6 on the brick-breaker: low produced ~9,700 tokens, max produced ~8,300. Thinking harder and writing less is not automatically worse — but it is not automatically better either.
A bigger file can mean more features, or more bloat. That is exactly why the panes run live and the votes are blind: size tells you where to look; playing the entry tells you what is true.
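To make the arithmetic behind these comparisons concrete, here is a minimal Python sketch. It assumes the pane estimate is file size in bytes divided by four (the post only says "file size"), and the function names are hypothetical, not the arena’s own code. The token counts are the approximate figures quoted in the list above.

```python
def estimate_tokens(file_size_bytes: int) -> int:
    """Rough output-token estimate: file size divided by four.

    Assumption: size is measured in bytes. This is an approximation,
    not a real tokenizer count.
    """
    return file_size_bytes // 4


def relative_change(low_tokens: int, max_tokens: int) -> float:
    """Percent change in output size going from low effort to max effort."""
    return (max_tokens - low_tokens) / low_tokens * 100


# Approximate low vs max output-token figures quoted in the post.
pairs = {
    "maze tower defense (Opus)": (19_700, 26_900),
    "CHIP-8 emulator (Opus)": (8_900, 12_300),
    "landing page (Opus)": (16_000, 16_700),
    "brick-breaker (Sonnet 4.6)": (9_700, 8_300),
}

for name, (low, high) in pairs.items():
    print(f"{name}: {relative_change(low, high):+.1f}%")
```

On these figures the two Opus coding tasks grow by a bit over a third, the landing page by about 4%, and the Sonnet 4.6 brick-breaker shrinks by about 14%.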
What effort costs
The estimated cost under each pane is the output-token estimate times the model’s published output price. It undercounts by design, because reasoning tokens and input tokens are not in the file. The practical read: the invisible thinking overhead grows with the effort level, so the true cost gap between low and max is larger than the visible one. If max does not visibly beat low on your task, low was the better buy by more than the sticker difference.
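A small Python sketch of why the visible gap understates the real one. Everything numeric here except the two output-token figures is a placeholder: the price per million tokens and the hidden reasoning-token counts are hypothetical, and the sketch assumes reasoning tokens are billed at the output rate. Check the model’s published pricing before relying on any of it.

```python
def visible_cost(output_tokens: int, price_per_million: float) -> float:
    """Cost as shown under a pane: output tokens times the output price."""
    return output_tokens / 1_000_000 * price_per_million


def cost_with_reasoning(output_tokens: int, reasoning_tokens: int,
                        price_per_million: float) -> float:
    """Output cost including hidden reasoning tokens.

    Assumption: reasoning tokens are billed at the output rate.
    Input tokens are still left out.
    """
    total = output_tokens + reasoning_tokens
    return total / 1_000_000 * price_per_million


PRICE = 10.0               # hypothetical USD per million output tokens
LOW_OUT, MAX_OUT = 19_700, 26_900    # maze tower defense figures from the post
LOW_THINK, MAX_THINK = 2_000, 40_000  # hypothetical reasoning-token counts

visible_gap = visible_cost(MAX_OUT, PRICE) - visible_cost(LOW_OUT, PRICE)
true_gap = (cost_with_reasoning(MAX_OUT, MAX_THINK, PRICE)
            - cost_with_reasoning(LOW_OUT, LOW_THINK, PRICE))

print(f"visible gap: ${visible_gap:.3f}")
print(f"gap with reasoning: ${true_gap:.3f}")
```

The exact numbers are made up, but the shape holds whenever max spends more hidden reasoning tokens than low: the second gap is always wider than the first.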
My read, after a lot of blind pairs
This is opinion, not measurement: effort pays most where correctness is binary — the emulator that grades itself, the spreadsheet formula engine, chess rules where one illegal move fails the task. It pays least where the judgment is aesthetic — landing pages, the news front page — where low-effort runs regularly win blind votes against max-effort ones. The community tallies on same-model pairs are still thin, which is a polite way of saying: I want more of your votes before I trust my own hunch.
Try the experiment that started this post: open the arena, set pane A to Opus 4.8 · low and pane B to Opus 4.8 · max, go blind, and vote on three tasks. If you can reliably spot max, you have learned something about the model. If you cannot — you have learned something about your token budget.
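One way to put a number on "reliably" is to ask how often pure guessing would do as well. The sketch below assumes each blind pair is an independent coin flip for someone who cannot tell the entries apart. The function name is hypothetical, and nothing here is computed by the arena itself.

```python
from math import comb


def chance_by_guessing(correct: int, trials: int, p: float = 0.5) -> float:
    """Probability of getting at least `correct` right out of `trials`
    blind pairs when each guess succeeds with probability `p`."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Spotting max on all three tasks versus on nine of ten.
print(chance_by_guessing(3, 3))
print(chance_by_guessing(9, 10))
```

Under that assumption, guessing alone gets three out of three right one time in eight, so three tasks is a starting point and a longer run of pairs gives a firmer answer.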
Don’t take the post’s word for it
The arena runs every model’s real output live. Pick a challenge, go blind, and cast a vote that counts in the public tally.