Scoring the scorers

PX-bench measures product experience: what the product an agent ships is actually like to use, form and function together. It drops an agent inside a complete, opinionated application, asks it to add a feature the way a product team would, and scores the result across eight categories. A few scores come from scripts. Most come from scoring agents of our own, which grade the finished work against ground truth written in advance.

A scoring agent recognizes a defensible implementation a regex never would, but it can also be wrong, and a wrong scorer does not look wrong; it posts the same confident number a right one does. The host app can be clean and the result still worthless if the instrument misread the work. So a score has to carry three guarantees. A 1.0 means the agent did the thing well, not that a check matched while the feature is broken. A 0.0 means it did the thing badly, not that the scorer never found the work. A missing score means it didn't attempt the item, not that the scorer was blind to a defensible alternative.

The sources of a wrong score

The scoring layer is built from parts, and each part encodes an assumption that can fail.

The rubric. Written too narrowly, an item rewards one implementation where the brief permits several: it names a drawer, the agent ships a defensible dialog, and real work scores as a miss. Written too loosely, grading collapses back into opinion.

The scorer's instructions. A scoring agent told to grade from the code alone will pattern-match. Code that contains all the pieces of a working feature reads like one, whether or not it runs.

The scorer's tools. Under the judgment sits plumbing: helper scripts, regexes, DOM probes, the harness that boots the app. Each assumes a shape for work it hasn't seen.

The host app. Scorers are written against the app as it shipped; the agent is free to edit it. Rename a component a probe selects on, and the probe comes back empty on work that exists.

And under all of it, drift. The scoring layer leans on models, so the same finished work does not score identically twice. An error smaller than that drift can't be seen at all.
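Of these sources, the stale probe is the easiest to show in miniature. The sketch below is illustrative only: the component names and the `probe_component` helper are hypothetical, not taken from PX-bench's harness. It shows a probe keyed to the name a component had when the app shipped coming back empty once the agent renames it.

```python
import re

def probe_component(source: str, name: str) -> list[str]:
    """Return every JSX-style opening tag for `name` found in `source`.

    Hypothetical probe: it assumes the component keeps the name it
    had when the host app shipped.
    """
    return re.findall(rf"<{re.escape(name)}\b[^>]*>", source)

# The app as shipped, and the same feature after the agent renamed it.
shipped = "<SavedViewsDrawer open={open} />"
edited = "<SavedViewsPanel open={open} />"

print(probe_component(shipped, "SavedViewsDrawer"))
# ['<SavedViewsDrawer open={open} />']
print(probe_component(edited, "SavedViewsDrawer"))
# []  (the work exists; the probe just cannot see it)
```

An empty result here says something about the probe, not about the work, which is why it should not be read as a score on its own.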

The first check happens before any run

The rubric and its assumptions about the host app get checked first, before any agent runs. Each item is reviewed against the host app as actually built, not as specified. Does the app genuinely carry the convention the item rewards? Does the item have exactly one home among the categories, so the same miss isn't punished twice? Does the scorer's code test what the item's words say? Each item comes out ready, usable with a footnote, or in need of a fix before it can score anything.

Three scorers from three labs

Every item that calls for judgment is scored three times, by the same scoring agent on models from three different providers, and the published score is their median, not their average. An average lets one scorer drag the number; a median ignores the outlier, so one provider's bad read, or its habit of going easy on work that looks like its own, cannot move the result.
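The difference between the two aggregates is easy to check by hand. A minimal sketch, with made-up scores chosen for illustration (they are not results from any run):

```python
from statistics import mean, median

# Three scorers grade the same item; one reads it very differently.
# These values are illustrative, not measured.
scores = [0.75, 0.75, 0.0]

print(mean(scores))    # 0.5   (the outlier drags the number down)
print(median(scores))  # 0.75  (the outlier is ignored)
```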

The median has a failure mode of its own, and one run showed it plainly. The item graded whether saved views (a filter setup the user names and reapplies) could be created, edited, and deleted end to end. Two providers read the code and scored 1.0. The third exercised the running app, watched the create request fail with an HTTP 400, and scored it down. The median of [1.0, 1.0, 0.75] is 1.0. A median doesn't average the minority down; it discards it.

Two fixes followed. A scorer that exercised the app and landed well below the consensus now raises a flag that puts the item in front of a person; the dissent can still be wrong, so it doesn't overwrite the median, but it can no longer disappear. And the item itself now requires every flow verified against the running app. Re-scored, all three providers land on 0.75.
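The first fix can be sketched as follows. This is an assumption-laden illustration, not the production scorer: the `Reading` record, the provider labels, and the `margin` that defines "well below the consensus" are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Reading:
    provider: str         # hypothetical label for one of the three providers
    score: float
    exercised_app: bool   # True if this scorer ran the app, not just read the code

def aggregate(readings, margin=0.2):
    """Return (published score, providers whose dissent needs a human look).

    `margin` is an assumed threshold; the source does not state one.
    The dissent never overwrites the median, it only raises a flag.
    """
    consensus = median([r.score for r in readings])
    flagged = [
        r.provider
        for r in readings
        if r.exercised_app and r.score <= consensus - margin
    ]
    return consensus, flagged

# The saved-views item as first scored: two code readers, one app runner.
before = [
    Reading("provider_a", 1.0, exercised_app=False),
    Reading("provider_b", 1.0, exercised_app=False),
    Reading("provider_c", 0.75, exercised_app=True),
]
print(aggregate(before))  # (1.0, ['provider_c'])

# After the item requires every flow verified against the running app.
after = [Reading(p, 0.75, exercised_app=True)
         for p in ("provider_a", "provider_b", "provider_c")]
print(aggregate(after))   # (0.75, [])
```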

A second reading of every item

Consensus guards against one provider's bias. It cannot catch a mistake all three scorers share or a rubric item that is itself wrong. For that, every run gets audited.

The auditor is an agent with read-only access to the run's artifacts: the code the agent wrote, the screens that actually rendered, and the case each scorer files alongside its number (what it expected, what it found, why). It re-grades every item from scratch and only then compares. Its independence is structural. It shares no scripts or probes with the scorers, and the audit runs several times, from different agent harnesses on different labs' models, with the disagreements that every audit pass raises ranked first.

A disagreement is a finding, not a verdict. Each gets attributed to a cause from a closed list, and each cause is fixed in a different place. The cleanest example is the missing score, which hides three situations: the agent never attempted the feature, it wrote code but never wired it in, or it built the thing and the scorer failed to find it. Only the third is a scorer bug, and the bare number can't say which is which. The auditor can.
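The missing-score case can be written down as a small decision over evidence. A sketch under stated assumptions: the two boolean inputs stand in for whatever the auditor actually establishes from the artifacts, and the names are hypothetical.

```python
from enum import Enum

class MissingCause(Enum):
    NOT_ATTEMPTED = "agent never attempted the feature"
    NOT_WIRED = "agent wrote code but never wired it in"
    SCORER_MISS = "agent built it and the scorer failed to find it"

def attribute_missing(code_present: bool, reachable_in_app: bool) -> MissingCause:
    """Attribute a missing score to one of the three situations.

    Inputs are assumed to come from the audit: whether relevant code
    exists in the diff, and whether the feature is reachable in the
    running app.
    """
    if not code_present:
        return MissingCause.NOT_ATTEMPTED
    if not reachable_in_app:
        return MissingCause.NOT_WIRED
    return MissingCause.SCORER_MISS

def is_scorer_bug(cause: MissingCause) -> bool:
    # Only the third situation is fixed in the scorer.
    return cause is MissingCause.SCORER_MISS

print(attribute_missing(code_present=True, reachable_in_app=True))
# MissingCause.SCORER_MISS
```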

Run, attribute, fix, run again

Run the eval, compare scorer to auditor, attribute every disagreement, fix what the attribution names, run again. Sometimes the attribution names the scorer, sometimes the rubric, sometimes the auditor itself. Sometimes it names nothing, because the call is genuinely ambiguous and belongs with human graders.

Every fix in that loop starts with a person reading the run. Attributing a disagreement means seeing what the agent built, what each scorer looked at, and what the auditor found, and digging that out of raw logs made every iteration slow. So we built Lens, a web app that lays a run out for inspection: the agent's transcript, the diff it shipped, the screens that rendered, and each scorer's evidence beside its score, down to where the three providers split. Most iterations begin there, with a person looking at a number that feels wrong.

The auditing is scaffolding, and it comes off against criteria set in advance: scorer and auditor agreeing on at least 95% of items where both produce a signal, a frozen build re-scoring within measured scorer noise, and auditors agreeing with each other at least 90% of the time (a yardstick that wobbles more than what it measures is not a yardstick). Then the audit thins to a sampled fraction of runs. The evidence records stay for good; they are what lets a reader check our work without re-running it.
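The three criteria combine into a single gate. The sketch below is one possible reading, not the actual check: how two scores count as "agreeing" is an assumption (here, equal within a tolerance `tol`), and `None` is used as a hypothetical marker for "no signal".

```python
def agreement_rate(pairs, tol=0.0):
    """Share of items where scorer and auditor agree, counted only
    over items where both produced a signal (neither is None)."""
    both = [(s, a) for s, a in pairs if s is not None and a is not None]
    if not both:
        return 0.0
    agree = sum(1 for s, a in both if abs(s - a) <= tol)
    return agree / len(both)

def audit_can_thin(pairs, rescore_delta, scorer_noise, auditor_agreement, tol=0.0):
    """True when all three criteria from the text are met."""
    return (
        agreement_rate(pairs, tol) >= 0.95      # scorer vs. auditor
        and abs(rescore_delta) <= scorer_noise  # frozen build within noise
        and auditor_agreement >= 0.90           # auditors vs. each other
    )

# Illustrative data only: 39 agreements, 1 disagreement, 2 items with no signal.
pairs = [(1.0, 1.0)] * 39 + [(1.0, 0.5)] + [(None, 1.0), (0.5, None)]
print(agreement_rate(pairs))  # 0.975
print(audit_can_thin(pairs, rescore_delta=0.005,
                     scorer_noise=0.01, auditor_agreement=0.92))  # True
```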

The payoff, and the limits

In the first end-to-end runs, the same finished build, scored twice, could come back 0.2 apart. After the loop had been through the scorers, the rubric, and the tools, re-scoring a frozen build moves the composite by about 0.007. That distance is the point. None of this machinery makes a scorer right; it makes a wrong one visible, and a visible error can be attributed and fixed.
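Why the distance matters can be put in a few lines. In this sketch the two spreads, 0.2 and 0.007, are the figures quoted above; the 0.05 error and the function names are illustrative assumptions.

```python
def rescore_spread(composites):
    """Spread of composite scores from re-scoring one frozen build."""
    return max(composites) - min(composites)

def is_visible(error, spread):
    """An error can only be seen if it is larger than the re-score spread."""
    return abs(error) > spread

hypothetical_error = 0.05  # illustrative size of a scoring mistake

print(is_visible(hypothetical_error, 0.2))    # False (lost in the drift)
print(is_visible(hypothetical_error, 0.007))  # True  (now stands out)
```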

Agreement is not truth: the scorers and the auditor are all models, and a mistake every lab's models make would sail through both layers. That is why the most contested items go to people, and why each published score carries its evidence. The last line of defense is not our loop; it is a reader who can check the case we filed.

A calibrated rubric, a host app whose conventions carry real signal, scoring agents that verify what happened rather than infer it. None of it came out right on the first pass. Each piece is the product of reviews and runs that caught it being wrong. Most of the work of building the benchmark turned out to be exactly this.

Version 1.0 — June 2026. Reach us at hello@chordio.com.