Methodology
How we chart it
This publication picks axes for the reader, not for the convention. Three rules govern every chart that runs here; three short examples explain why each rule earns its keep.
The rules
- Time is always linear. The reader's intuition of "later" is calendar-time, not log-time. Every chart that uses a year axis runs it linearly — even when the events being plotted are accelerating.
- Magnitude gets log only when the range demands it. The breakpoint is roughly 100×: when a number ranges over two orders of magnitude or more, linear axes squash the early movement into the floor and the reader stops being able to read the chart. Below that, linear keeps the reader's number-sense intact.
- Capability is always linear. Percentages, accuracy scores, benchmark grades — the reader interprets them as linear quantities. Putting a percentage on a log y axis distorts the question of "how good is the model now"; it answers a question the reader didn't ask.
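The three rules above can be written down as a small decision function. This is a minimal sketch, not the publication's actual tooling: the function name `choose_scale`, the axis labels `"time"`, `"capability"` and `"magnitude"`, and the `threshold` parameter are hypothetical names introduced for illustration. The 100× breakpoint comes from the second rule.

```python
def choose_scale(kind, values=None, threshold=100.0):
    """Return 'linear' or 'log' for an axis, following the three rules.

    kind      -- 'time', 'capability', or 'magnitude' (hypothetical labels)
    values    -- the plotted numbers; only consulted for 'magnitude'
    threshold -- max/min ratio at which log takes over (roughly 100x)
    """
    # Rules 1 and 3: time and capability axes are always linear.
    if kind in ("time", "capability"):
        return "linear"
    if kind != "magnitude":
        raise ValueError(f"unknown axis kind: {kind!r}")

    # Rule 2: log only when the range spans about two orders of magnitude.
    positive = [v for v in (values or []) if v > 0]
    if len(positive) < 2:
        return "linear"  # when in doubt, default to linear
    ratio = max(positive) / min(positive)
    return "log" if ratio >= threshold else "linear"


print(choose_scale("time", [2020, 2025]))        # linear
print(choose_scale("capability", [44, 70, 86]))  # linear
print(choose_scale("magnitude", [1, 50]))        # linear
print(choose_scale("magnitude", [0.01, 10]))     # log
```

The function falls back to linear whenever the range cannot be computed, which matches the page's default.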
Example 1 — when linear reveals saturation
MMLU is hitting a ceiling. From GPT-3 in 2020 (44%) through GPT-3.5 (70%) and GPT-4 (86%), the top score climbed through three obvious eras of progress. From mid-2023 onward, the frontier cluster (Claude, GPT-4o, Llama 3.1, Gemini) sits between 88% and 90%. Stanford's AI Index 2025 calls MMLU saturated, and the data agrees.
The linear-y panel below makes the saturation visible: the curve bends toward an asymptote near 90% and stays there. The log-y panel, plotting the same points, compresses the upper range and reads as if MMLU were still climbing steadily. Same data, different axes, different stories — and only one of them is honest about the ceiling. GPQA Diamond (the volt-colored series) shows what a benchmark with headroom looks like in the same view.
MMLU progression, linear y
Schematic — after Hendrycks et al. 2021 (arXiv:2009.03300) + Stanford HAI AI Index 2025 ch. 2. Linear y reveals the saturating-S flattening at ~90%. GPQA Diamond (volt) still climbing.
Same MMLU data, log y
Schematic — same MMLU points, log y. Saturation is compressed; the curve reads like steady growth. Wrong axis for this question.
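The compression in the log-y panel can be put in numbers. The sketch below computes how much of each panel's height the MMLU curve occupies. The axis limits (0 to 100 for linear, 1 to 100 for log) and the 89% frontier value (a midpoint of the 88% to 90% cluster) are assumptions for illustration, not the limits of the published charts.

```python
import math

# MMLU scores quoted in the text; 89.0 is an assumed midpoint of the
# 88-90% frontier cluster.
scores = [44.0, 70.0, 86.0, 89.0]


def linear_pos(value, lo=0.0, hi=100.0):
    """Fractional height of `value` on a linear axis from lo to hi."""
    return (value - lo) / (hi - lo)


def log_pos(value, lo=1.0, hi=100.0):
    """Fractional height of `value` on a log axis from lo to hi."""
    return (math.log10(value) - math.log10(lo)) / (
        math.log10(hi) - math.log10(lo)
    )


# Share of the panel height covered by the whole 44% -> 89% climb.
linear_span = linear_pos(scores[-1]) - linear_pos(scores[0])
log_span = log_pos(scores[-1]) - log_pos(scores[0])

print(f"linear panel: {linear_span:.0%} of the axis height")
print(f"log panel:    {log_span:.0%} of the axis height")
```

Under these assumed limits the climb fills about 45% of the linear panel but only about 15% of the log panel, so the whole curve, including the flattening near 90%, is squeezed into a thin band at the top.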
Example 2 — when log reveals geometry
Cottier et al.'s 2025 inference-price work plots cost per million tokens at fixed capability across six benchmarks. The decline is multiplicative: 9× to 900× per year depending on the benchmark, with a median around 50×. At the GPT-4 capability floor (the cheapest model that hits a given threshold), the decline is roughly tenfold per year.
A multiplicative process reads as a straight line on log y, and that is exactly what the chart below shows. The same data on linear y would crash to a floor in the first six months and render every subsequent improvement invisible against the post-collapse baseline. When the story IS multiplicative, log reveals what linear would erase.
Inference cost at fixed capability, log y
Schematic — after Cottier, Snodin, Owen, Adamczewski 2025, “LLM inference prices have fallen rapidly but unequally across tasks,” Epoch AI Data Insights, March 12 2025. Geometric decline reads as a straight line on log y. Range 9×–900×/yr across six benchmarks; ~10×/yr at the GPT-4 capability floor.
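The straight-line behaviour follows from simple arithmetic. The sketch below is an illustration with made-up inputs: the starting price is normalised to 1.0 (an assumption, not a figure from the source) and the annual factor uses the roughly tenfold rate quoted for the GPT-4 capability floor. The name `price_after` is hypothetical.

```python
import math


def price_after(years, start_price=1.0, annual_factor=10.0):
    """Price after `years` of a constant multiplicative decline."""
    return start_price / annual_factor ** years


# Four yearly snapshots of a normalised price (start_price is assumed).
prices = [price_after(t) for t in range(4)]

# Year-over-year change as drawn on a linear axis (absolute drop)...
linear_drops = [prices[i] - prices[i + 1] for i in range(3)]
# ...and as drawn on a log axis (change in log10 of the price).
log_steps = [math.log10(prices[i + 1]) - math.log10(prices[i]) for i in range(3)]

for year, (drop, step) in enumerate(zip(linear_drops, log_steps), start=1):
    print(f"year {year}: linear drop {drop:.3f}, log10 step {step:.2f}")
```

Each year's log10 step is the same size (one decade down), which is why the series is a straight line on log y. On a linear axis the first year uses up 90% of the original height, leaving the later tenfold drops with 9% and then under 1% of it.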
Example 3 — when log-log is right (and when it misleads)
Kaplan et al. 2020 plot test loss as a function of training compute and find a power law: loss falls roughly as compute raised to the power −0.05. The relationship spans six orders of magnitude on the compute axis, and on log-log axes it renders as a straight line whose slope is that exponent. The convention serves the literature because the underlying relationship IS linear in log-log space; the chart isn't massaging the data, it's matching it.
The convention misleads readers who care about real-world capability rather than next-token loss. Two-tenths of a nat at the end of the curve is the difference between "unimpressive" and "frontier"; the log-log treatment makes them look adjacent. The scaling-laws community charts the math it cares about. This publication charts the question the reader is asking.
Kaplan 2020 scaling law, log-log
Schematic — after Kaplan et al. 2020, “Scaling Laws for Neural Language Models,” arXiv:2001.08361, Figure 1. Test loss falls as a power law of compute; log-log axes turn the relationship into a straight line.
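A short numerical check makes the log-log claim concrete. This is a sketch under stated assumptions: it uses only the −0.05 exponent quoted above, sets the constant of proportionality to 1.0 (hypothetical, so the absolute loss values mean nothing), and generates synthetic points rather than using data from the paper. The helper names `loss` and `loglog_slope` are introduced for illustration.

```python
import math

ALPHA = 0.05  # exponent quoted in the text


def loss(compute, scale=1.0):
    """Synthetic power law: loss = scale * compute ** -ALPHA (scale is assumed)."""
    return scale * compute ** -ALPHA


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# Seven points covering six orders of magnitude of (synthetic) compute.
compute = [10.0 ** k for k in range(7)]
losses = [loss(c) for c in compute]

slope = loglog_slope(compute, losses)
ratio = losses[-1] / losses[0]

print(f"log-log slope: {slope:.3f}")
print(f"loss ratio across six orders of magnitude: {ratio:.2f}")
```

The fitted slope recovers the exponent, and the ratio shows that with this exponent a millionfold increase in compute roughly halves the loss. That is why small differences at the end of the curve carry so much weight.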
When in doubt: linear
Default to linear. The reader's intuition is linear; log is a tool the writer reaches for when the math demands it. Most stories on this page aren't multiplicative. When they are, the rules above will say so.