Chordio

PX-bench

Long-horizon product experience benchmark for coding agents.

Why it matters.

Coding agents have swallowed the design step - and the product decisions baked into it. Hand one a half-formed description and it returns working frontend code, but on the way there it has quietly decided where the feature lives, which states exist, and how the copy reads. It's making those calls whether it's good at them or not. Usually it isn't: the craft is there, but applying it inside a real product is where agents fall short.

  1. No reward for product.

    What made these models strong engineers was training on rewards you can verify: the tests pass, the build goes green. Nothing like that exists for whether an empty state is any good. Product judgment was never something the training optimized for, so you get the average answer. For product, the average answer is usually wrong.

  2. The stop signal is the wrong one.

    Teams stack on review passes, critic agents, and verification chains - and the engineering does get more reliable. But none of them changes what the model treats as done. The cancel path, the error and empty states, the layout at 375px with real data: each gets handled only if something thought to check it, and nothing in the loop rewards that. So the agent stops at a working demo, however many passes it took.

  3. Consistent with what?

    Words like "consistent," "on-pattern," and "in the house style" only mean something relative to the app around it. A senior designer carries the whole product in their head and knows which details have to stay in sync. For an agent, seeing the codebase isn't the hard part - judging what matters is. So it reinvents components that already exist, lets terminology drift from screen to screen, and formats the same value two different ways.

PX-bench measures what the agent ships: the product it hands you, judged the way a senior product designer would judge it.

What it measures.

The brief settles what to build; PX-bench measures how well the agent realizes it in a real product. The seven categories of product experience form a ladder, ordered the way a senior product designer faces them when adding a feature to an app that already exists. The deepest, most ambiguous calls sit at the top; the most mechanical checks at the bottom. Get the intent right first; clear the accessibility bar last.

01
Intent fidelity
Rubrics include: every requested capability present and behaviorally correct, no material feature omitted, and no unrequested complexity.
02
Product fit
Rubrics include: container and pattern choice, entry-point placement, and action consolidation over fragmentation across views.
03
Convention adherence
Rubrics include: component reuse over duplication, design tokens over hardcoded values, visual hierarchy and spacing, and date and number formatting in the house style.
04
Pathway completeness
Rubrics include: cancel, undo, and error-recovery paths with no dead-ends, and loading, empty, error, and pending states present.
05
Content & language
Rubrics include: label and error-message quality, empty-state copy, and microcopy in the product's voice.
06
Resilience
Rubrics include: long-content overflow, responsive layout across breakpoints, rendering under API failure, and performance under real network conditions.
07
Accessibility
Rubrics include: axe-core violations, color contrast, keyboard operability, and correct focus order.

The taxonomy is v1 and will change; we publish each revision with a stated diff against the previous version.
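
As a rough illustration of the ladder, the seven categories can be written as an ordered enumeration, so that findings sort from the deepest calls to the most mechanical checks. This is a sketch only: the names `Category` and `order_findings` are hypothetical and not part of PX-bench.

```python
from enum import IntEnum


class Category(IntEnum):
    """The seven v1 categories; a lower value sits higher on the ladder."""
    INTENT_FIDELITY = 1
    PRODUCT_FIT = 2
    CONVENTION_ADHERENCE = 3
    PATHWAY_COMPLETENESS = 4
    CONTENT_AND_LANGUAGE = 5
    RESILIENCE = 6
    ACCESSIBILITY = 7


def order_findings(findings):
    """Sort (category, note) pairs so top-of-ladder findings come first."""
    return sorted(findings, key=lambda finding: finding[0])
```

Because the sort is stable, findings within the same category keep the order in which they were recorded.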

How it works.

PX-bench is a capability evaluation in the tradition of METR and the UK AI Safety Institute, applied to product experience. It rests on three deliberate choices, with scoring that runs automatically but answers to senior product designers.

Three deliberate choices
  1. Reference apps, not prompts

    Instead of building from a blank prompt, agents add a feature to a real, multi-screen app that already has its own conventions - say, a saved-views feature in an existing task tracker. That's what makes consistency and pattern choice scorable.

  2. Failure modes with a known answer

    Each app presents real product situations a senior product designer would recognize: an implied screen that doesn't exist, state that could be lost on navigation, an ambiguous primary action. We map them in advance, so the agent's choice is scored against a known-good outcome.

  3. Quasi-objective rubrics

    Items are scoped to where senior product designers agree; any item that can't clear an agreement threshold is reworked or dropped.
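
One way to picture the agreement threshold for rubric items is a pairwise-agreement check across designer verdicts. The sketch below is an assumption, not the PX-bench method: the 0.8 threshold, the pairwise measure, and the names `pairwise_agreement` and `triage` are all hypothetical, and the source states neither the statistic nor the bar.

```python
from itertools import combinations

AGREEMENT_THRESHOLD = 0.8  # hypothetical bar; the source does not state one


def pairwise_agreement(verdicts):
    """Fraction of designer pairs that gave the same verdict on one item."""
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        raise ValueError("need at least two verdicts")
    return sum(a == b for a, b in pairs) / len(pairs)


def triage(items, threshold=AGREEMENT_THRESHOLD):
    """Split rubric items into kept and flagged (rework or drop)."""
    kept, flagged = [], []
    for name, verdicts in items.items():
        if pairwise_agreement(verdicts) >= threshold:
            kept.append(name)
        else:
            flagged.append(name)
    return kept, flagged
```

With four designers, an item rated pass by all four is kept. An item split two against two agrees on only 2 of 6 pairs and is flagged.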

Scoring
01
Agent scoring
A scoring agent reads what the coding agent built and judges it against the app's existing patterns and conventions - not against one fixed set of expected answers. So a defensible alternative still scores on its merits.
02
Script checks
axe-core and structural-diff checks settle the questions a script can answer outright: accessibility violations, hardcoded values, structural regressions.
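
To make "hardcoded values" concrete, here is a minimal sketch of a script check that flags hex color literals used outside design-token definitions. It is an illustration under assumptions, not the PX-bench implementation: the function name `find_hardcoded_colors` is hypothetical, tokens are assumed to be CSS custom properties, and a regex this simple would also flag an ID selector made up entirely of hex digits.

```python
import re

# Matches hex color literals such as #fff or #3366ff.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{3,8}\b")


def find_hardcoded_colors(css):
    """Return (line_number, literal) for hex colors used outside token definitions."""
    hits = []
    for number, line in enumerate(css.splitlines(), start=1):
        if line.strip().startswith("--"):
            continue  # a custom-property line is where a token is declared
        hits.extend((number, match.group()) for match in HEX_COLOR.finditer(line))
    return hits
```

A rule that reads `color: var(--color-accent)` passes. A rule that repeats the literal `#3366ff` is reported with its line number.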

Senior product designers set the ground truth. A scorer - script or agent - earns its place on an item only by matching that judgment at a set agreement bar. Where a call still needs a trained eye, a designer makes it directly. And where even the experts disagree, we publish the disagreement instead of forcing a score.
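
The rule that a scorer earns an item only by matching designer judgment can be sketched as a per-item comparison against ground truth. Everything specific here is assumed for illustration: the 0.9 bar, the simple match rate, and the name `admitted_items` do not come from the source.

```python
AGREEMENT_BAR = 0.9  # hypothetical; the source does not state the bar


def admitted_items(scorer, designers, bar=AGREEMENT_BAR):
    """Return rubric items on which the scorer matches designer verdicts often enough.

    Both arguments map item name -> list of verdicts, aligned by scenario.
    """
    admitted = []
    for item, truth in designers.items():
        guesses = scorer.get(item, [])
        if not truth or len(guesses) != len(truth):
            continue  # no comparable data, so the scorer has not earned the item
        matches = sum(g == t for g, t in zip(guesses, truth))
        if matches / len(truth) >= bar:
            admitted.append(item)
    return admitted
```

Items that fall below the bar stay with a designer, in line with the paragraph above.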

The harness is Inspect AI, the open-source evaluation framework from the UK AI Security Institute (formerly the AI Safety Institute), so any scenario we publish can be independently rerun.

Once a benchmark is public, models train on it, and it starts measuring exposure as much as capability. PX-bench keeps its scored scenarios held out and rotating, separate from anything we publish, so a score reflects capability rather than familiarity with the test.

References: METR · UK AISI · Inspect AI

Run a private PX-bench eval.

We score your coding agent on a reference app and send back the full report.
