PX-bench measures product experience: what the product an agent ships is actually like to use, form and function together. It drops an agent inside a complete, opinionated application, asks it to add a feature the way a product team would, and scores the result across eight categories. Every run produces a number, a composite across the eight.
A number invites comparison. If one agent scores 0.72 and another 0.69, the natural read is that the first did better. That read holds only if the two numbers are far enough apart to mean something, and they may not be, because the same agent on the same task does not score the same twice.
The score drifts with nothing changed
We ran one agent on one task ten times, with nothing changed between runs. The composite landed anywhere from 0.66 to 0.79. The spread is the eval's own noise.
Two things move it. The scoring layer leans on language-model judges, and even handed identical finished work they wobble a little. The agent is the bigger source: asked to build the same feature twice, it builds genuinely different things, an empty state here, a keyboard shortcut skipped there, and all of it moves the composite.
- Instrument noise is the scorers' share, how much the number moves when the work under it is held fixed.
- Total noise is how much it moves when everything is free to vary.
Total noise includes instrument noise, and whatever it carries beyond that traces back to the agent.
Why we measure it
Two reasons, both about not fooling ourselves.
The first is a guardrail on every comparison we publish. If a score would drift by some amount with nothing changed, then a gap between two agents smaller than that drift is not a finding. It is noise dressed up as signal. So the rule: a gap smaller than about twice the total noise is not readable from single runs. That distance is the noise floor. The rule runs in one direction. Below the floor, a gap is noise and we will not read it. Above the floor, it has earned a look, nothing more; when nothing real separates two agents, drift alone clears the floor about one time in six.
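The rule and the one-in-six figure can be checked with a minimal Python sketch. It assumes run-to-run drift is roughly normal and that the two runs are independent, neither of which the eval guarantees. The function names are ours, for illustration, and 0.05 is a stand-in for the total noise rather than a measured value.

```python
import math

def noise_floor(total_noise):
    """Single-run floor: about twice the total noise."""
    return 2.0 * total_noise

def clears_floor(score_a, score_b, total_noise):
    """True when the gap between two single-run scores is at least the floor."""
    return abs(score_a - score_b) >= noise_floor(total_noise)

def chance_drift_clears_floor():
    """Chance that two identical agents differ by more than the floor.

    Assumes normal, independent drift. The difference of two runs then has
    spread sqrt(2) * noise, so the floor sits sqrt(2) spreads from zero.
    """
    z = 2.0 / math.sqrt(2.0)
    # Two-sided normal tail: P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

# 0.05 is a stand-in for total noise, not a measured value.
print(clears_floor(0.72, 0.69, total_noise=0.05))
print(round(chance_drift_clears_floor(), 3))
```

Under those assumptions the 0.72 versus 0.69 gap does not clear the floor, and the chance of drift alone clearing it comes out near 0.157, close to one in six.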
The second is operational. The results are published as a grid, where a cell is one agent on one task, and total noise decides how many times each cell runs. Averaging N runs shrinks the spread of the average by a factor of √N. Say the noise is 0.05: two single runs cannot resolve a gap under 0.1, while four runs of each agent halve that floor to 0.05. We pick the smallest N that puts the floor below the gaps we need to read, and that averaged floor, not the single-run one, is what we print beside the numbers.
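The arithmetic above can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes the runs in a cell are independent, so that averaging N of them divides the spread by √N.

```python
import math

def averaged_floor(total_noise, n_runs):
    """Floor for a comparison where each agent's score is a mean of n_runs."""
    return 2.0 * total_noise / math.sqrt(n_runs)

def runs_needed(total_noise, smallest_gap):
    """Smallest N whose averaged floor is no larger than smallest_gap."""
    if smallest_gap <= 0:
        raise ValueError("smallest_gap must be positive")
    n = 1
    while averaged_floor(total_noise, n) > smallest_gap:
        n += 1
    return n

# The worked example from the text: noise of 0.05.
print(averaged_floor(0.05, 1))   # single runs
print(averaged_floor(0.05, 4))   # four runs per agent
print(runs_needed(0.05, 0.05))   # runs needed to read a 0.05 gap
```

With a noise of 0.05, the single-run floor is 0.1, four runs bring it to 0.05, and `runs_needed(0.05, 0.05)` returns 4, matching the example.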
How we measure it
We measure the two noises separately, because a single spread cannot say which source moved it.
- For instrument noise, we freeze one agent's finished build and re-score only that, several times. Anything that moves is the instrument, since the work beneath it is byte-identical.
- For total noise, we run the whole eval end to end several times. That captures everything the instrument does plus everything the agent did differently from one build to the next.
Both are the ordinary sample standard deviation across the repeats.
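A minimal sketch of the two measurements, in Python. The scores below are invented for illustration and are not PX-bench results. The `agent_noise` helper is our own addition: it assumes the scorers' wobble and the agent's variation are independent, so that their variances add, which the text does not claim.

```python
import math
import statistics

def noise(scores):
    """Sample standard deviation (n - 1 denominator) across repeated scores."""
    return statistics.stdev(scores)

def agent_noise(total, instrument):
    """Agent's share of the spread, assuming independent sources (variances add)."""
    return math.sqrt(max(total ** 2 - instrument ** 2, 0.0))

# Hypothetical composites, invented for illustration only.
rescored_frozen_build = [0.72, 0.73, 0.72, 0.72, 0.73]  # same build, re-scored
end_to_end_runs = [0.66, 0.74, 0.79, 0.70, 0.72]        # fresh build each run

instrument = noise(rescored_frozen_build)
total = noise(end_to_end_runs)

print(f"instrument noise: {instrument:.3f}")
print(f"total noise:      {total:.3f}")
print(f"agent share:      {agent_noise(total, instrument):.3f}")
```

When instrument noise is small next to total noise, the agent's share is almost the whole of the total, which is why the subtraction is rarely worth publishing on its own.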
On our first scenario the split was lopsided. Instrument noise was negligible: most checks score identically on every re-score, and the residual lives in a couple of genuinely subjective judge calls. Total noise came out roughly seven times larger, so the agent rebuilding the feature differently, not the scorers disagreeing, is what actually moves a PX-bench number between runs.
Measuring the floor carefully also surfaces things a single run would miss. The first was the number of runs. A spread estimated from too few of them is unreliable, so we fix the count up front rather than stopping once the figure looks settled, and we publish the floor as a round number rather than a precise-looking one.
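To put a rough size on "unreliable": for roughly normal data, a standard rule of thumb puts the relative standard error of a sample standard deviation at about 1/√(2(n − 1)). The sketch below applies that approximation. It is our illustration, not a figure from the eval, and the approximation is itself loose at very small n.

```python
import math

def relative_se_of_spread(n_runs):
    """Approximate relative standard error of a sample standard deviation.

    Rule of thumb for roughly normal data: 1 / sqrt(2 * (n - 1)).
    """
    if n_runs < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return 1.0 / math.sqrt(2.0 * (n_runs - 1))

for n in (3, 5, 10, 30):
    print(n, f"{relative_se_of_spread(n):.0%}")
```

By this estimate a spread taken from ten runs is still uncertain by roughly a quarter of its value, which is the case for printing the floor as a round number.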
The second was a bug in disguise. One accessibility check kept wobbling across re-scores of identical files, which should be impossible for a deterministic check. It counted an element that repeats once per row of a list, and the scoring pass adds rows as it exercises the app, so the check was tracking the row count, not the build. That is a determinism defect, not noise, and it had been hiding inside what we would otherwise have filed as scorer disagreement. Re-scoring a frozen build is the cleanest test we have that a "deterministic" scorer really is deterministic.
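The frozen-build test can be sketched as a small Python check. The data layout, check names, and scores here are hypothetical, chosen only to show the shape of the test: any check labelled deterministic that returns more than one distinct value across re-scores of the same files is flagged.

```python
def wobbling_checks(rescores, deterministic_checks):
    """Return the deterministic checks whose score changed across re-scores.

    rescores: one dict per re-score of the same frozen build, mapping
    check name -> score.
    """
    flagged = []
    for name in sorted(deterministic_checks):
        distinct = {scores[name] for scores in rescores}
        if len(distinct) > 1:
            flagged.append(name)
    return flagged

# Hypothetical re-scores of one frozen build; names and values are invented.
rescores = [
    {"contrast_ratio": 1.0, "row_label_count": 0.80, "judge_polish": 0.70},
    {"contrast_ratio": 1.0, "row_label_count": 0.75, "judge_polish": 0.65},
    {"contrast_ratio": 1.0, "row_label_count": 0.70, "judge_polish": 0.70},
]
deterministic = {"contrast_ratio", "row_label_count"}

print(wobbling_checks(rescores, deterministic))
```

The judge-based check is allowed to vary and is ignored; only the supposedly deterministic check that moved is reported.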
What this buys, and what it doesn't
The floor does one concrete thing: it stops us reading meaning into gaps the eval cannot resolve. A reader can take any two published numbers, measure the distance between them against the floor, and throw out the comparisons that are just drift. That is the point of publishing the number rather than keeping it as an internal sanity check.
The floor has limits worth naming.
- It is partly a property of the subject: a weaker agent's messier output may be noisier to score, so a floor measured on one agent does not automatically transfer to another.
- It is partly a property of the task, since a tightly scoped feature leaves the agent less room to build differently.
- It covers the composite only; single categories rest on fewer checks, swing harder, and need wider floors.
- It is tied to the version of the scoring layer that measured it, so we re-measure it as the scorers evolve.
The floor reduces overconfidence; it does not abolish it. It tells you when a gap is too small to trust, not that every gap above it means what you think it does, and a grid of many pairwise comparisons will produce the occasional above-floor gap by chance alone. What it removes is the easiest way to be wrong, which is to read a leaderboard's third decimal as though the eval could see that far.
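A back-of-envelope Python sketch of the grid effect. It assumes the agents being compared on one task are truly indistinguishable, share the same noise, and drift normally, so each pair clears the floor by chance at the rate implied by the two-times-noise rule. The function names and the six-agent example are ours, for illustration.

```python
import math

def pairwise_comparisons(n_agents):
    """Number of distinct agent pairs on one task."""
    return math.comb(n_agents, 2)

def expected_chance_clearances(n_agents, false_clear_rate=math.erfc(1.0)):
    """Expected count of above-floor gaps when no agent is really better.

    The default rate, erfc(1), is the normal-theory chance that two equal
    agents differ by more than twice the noise (about 0.157).
    """
    return pairwise_comparisons(n_agents) * false_clear_rate

# Hypothetical column of six equally good agents on one task.
print(pairwise_comparisons(6))
print(round(expected_chance_clearances(6), 2))
```

Six such agents give 15 pairs and, on these assumptions, a little over two above-floor gaps expected from drift alone.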
Version 1.0 — June 2026. Reach us at hello@chordio.com.