Chordio

PX-bench

Long-horizon product experience benchmark for coding agents.

Latest publication

A taxonomy of product experience capability for AI agents

Most discussions of AI "design capability" talk past each other because "design" points at three different things. Instead, we name the capability product experience and set it out as eight categories, with the exclusions stated explicitly.

Read the publication → v1.0

Why it matters.

Coding agents now make product design decisions as they write frontend code. Given an incomplete brief, an agent must decide where a feature belongs, which states exist, and how the interface explains itself. PX-bench evaluates those decisions inside held-out reference apps because product quality fails in ways ordinary coding checks do not catch.

  1. Product quality is hard to reward.

    Models improve fastest where outcomes are easy to verify: tests pass, builds complete, errors disappear. Product decisions are harder to score. An empty state can render correctly and still be unhelpful; a feature can work and still belong in the wrong place. PX-bench makes those decisions observable.

  2. Working demos miss edge paths.

    Most agent loops treat a working implementation as done. Review passes can catch broken code, but they do not guarantee the agent checked cancel paths, error states, long content, or mobile layouts. PX-bench scores the shipped product, including the paths a demo often misses.

  3. Consistency depends on context.

    Many product decisions only make sense relative to the surrounding app. The right component, term, format, or entry point is usually the one the product already uses. PX-bench tests agents in held-out reference apps so consistency is judged against local conventions.

PX-bench measures what the agent ships: the product it hands you, judged the way a senior product designer would judge it.

What it measures.

The brief settles what to build; PX-bench measures how well the agent realizes it in a held-out reference app. The eight categories of product experience form a ladder, ordered the way a senior product designer faces them when adding a feature to an app with established conventions. The deepest, most ambiguous calls sit at the top; the most mechanical checks at the bottom. Get the intent right first; clear the accessibility bar last.

01
Intent fidelity
Rubrics include: every requested capability present and working on its core path, no material feature omitted, and no unrequested complexity.
02
Product fit
Rubrics include: container and pattern choice, entry-point placement, and action consolidation over fragmentation across views.
03
Visual craft
Rubrics include: visual hierarchy and emphasis, spacing rhythm and alignment, and type-scale use that guides the eye to the primary thing first.
04
Convention adherence
Rubrics include: component reuse over duplication, design tokens over hardcoded values, naming and file conventions, and date and number formatting in the house style.
05
Pathway completeness
Rubrics include: cancel, undo, and error-recovery paths with no dead-ends, and loading, empty, error, and pending states present.
06
Content & language
Rubrics include: label and error-message quality, empty-state copy, and microcopy in the product's voice.
07
Resilience
Rubrics include: long-content overflow, responsive layout across breakpoints, rendering under API failure, and slow-network performance.
08
Accessibility
Rubrics include: axe-core violations, color contrast, keyboard operability, and correct focus order.

The taxonomy is v1 and will change; we publish revisions with the diff stated.
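
One way to picture the ladder is as an ordered list that a score report walks from the top rung down. The sketch below is illustrative only: the `LADDER` list, the `report` helper, and the 0-to-1 score scale are assumptions made for this example, not PX-bench's published report format.

```python
# Illustrative only: category names come from the taxonomy above;
# the report format and the 0-1 score scale are assumptions.
LADDER = [
    "Intent fidelity",
    "Product fit",
    "Visual craft",
    "Convention adherence",
    "Pathway completeness",
    "Content & language",
    "Resilience",
    "Accessibility",
]


def report(scores: dict) -> list:
    """Render per-category scores in ladder order, top rung first."""
    unknown = set(scores) - set(LADDER)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    lines = []
    for rung, name in enumerate(LADDER, start=1):
        value = scores.get(name)
        # Categories without a score are listed rather than silently skipped.
        shown = "not scored" if value is None else f"{value:.2f}"
        lines.append(f"{rung:02d} {name}: {shown}")
    return lines
```

Keeping the order in one list means a revised taxonomy changes the report in a single place.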

How it works.

PX-bench is a capability evaluation in the tradition of METR and the UK AI Safety Institute, applied to product experience. It rests on three deliberate choices, with scoring that runs automatically but answers to senior product designers.

Three deliberate choices
  1. Held-out reference apps

    Instead of building from a blank prompt, agents add a feature to a held-out, multi-screen reference app with its own conventions. That's what makes consistency and pattern choice scorable.

  2. Failure modes with a known answer

    Each app presents product situations a senior product designer would recognize: an implied screen that doesn't exist, state that could be lost on navigation, an ambiguous primary action. We map them in advance, so the agent's choice is scored against a known-good outcome.

  3. Quasi-objective rubrics

    Items are scoped to where senior product designers agree; any item that can't clear an agreement threshold is reworked or dropped.
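
The agreement filter in the third choice can be sketched as follows. Everything specific here is an assumption for illustration: the pass/fail judgment format, the pairwise-agreement statistic, and the 0.8 threshold. The page does not state which statistic or bar PX-bench uses.

```python
from itertools import combinations

# Hypothetical bar; the actual PX-bench threshold is not stated on this page.
AGREEMENT_THRESHOLD = 0.8


def pairwise_agreement(judgments: list) -> float:
    """Fraction of designer pairs that gave the same pass/fail judgment."""
    pairs = list(combinations(judgments, 2))
    if not pairs:
        raise ValueError("need at least two judgments")
    return sum(a == b for a, b in pairs) / len(pairs)


def split_items(items: dict, threshold: float = AGREEMENT_THRESHOLD):
    """Split rubric items into those that clear the bar and those to rework or drop."""
    kept, flagged = [], []
    for name, judgments in items.items():
        if pairwise_agreement(judgments) >= threshold:
            kept.append(name)
        else:
            flagged.append(name)
    return kept, flagged
```

For example, an item that four designers all pass is kept, while one they split two against two on is flagged for rework.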

Scoring
01
Agent scoring
A scoring agent reads what the coding agent built and judges it against the app's local patterns and conventions. An alternative implementation can earn credit if it follows those conventions and solves the product problem.
02
Script checks
axe-core and structural-diff checks for the questions a script can settle outright: accessibility violations, hardcoded values, structural regressions.
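
As an illustration of a question a script can settle outright, the sketch below flags hardcoded hex colors in a stylesheet where a design token would be expected. The regular expression and the CSS custom-property token convention are assumptions for this example, not PX-bench's actual checks.

```python
import re

# Matches 3-, 4-, 6-, or 8-digit hex colors such as #fff or #1a2b3c.
# Limitation of this sketch: an id selector like "#fab" would also match.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b")


def find_hardcoded_colors(css: str) -> list:
    """Return (line number, value) for each hex color outside a token definition."""
    hits = []
    for lineno, line in enumerate(css.splitlines(), start=1):
        # Lines that define a custom property (the token itself) are allowed.
        if line.strip().startswith("--"):
            continue
        hits.extend((lineno, m.group(0)) for m in HEX_COLOR.finditer(line))
    return hits
```

A real check would likely parse the stylesheet rather than scan lines, but the pass/fail shape is the same: any hit is a hardcoded value to report.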

Senior product designers set the ground truth. A scorer, whether script or agent, earns its place on an item only by matching that judgment at a set agreement bar. Where a call still needs a trained eye, a designer makes it directly. Where even the experts disagree, we publish the disagreement instead of forcing a score.
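
The routing rule in this paragraph can be sketched as a small decision function. The agreement figures, the 0.8 bar, the field names, and the preference for a script over an agent when both qualify are all hypothetical; the page does not specify how agreement is measured or how ties are resolved.

```python
from dataclasses import dataclass
from typing import Optional

AGREEMENT_BAR = 0.8  # hypothetical bar, not a published PX-bench value


@dataclass
class ItemStats:
    designer_consensus: float          # agreement among designers, 0-1
    script_agreement: Optional[float]  # script vs. designer ground truth; None if no script
    agent_agreement: Optional[float]   # scoring agent vs. designer ground truth; None if untested


def route(stats: ItemStats, bar: float = AGREEMENT_BAR) -> str:
    """Decide who scores a rubric item."""
    if stats.designer_consensus < bar:
        # Experts disagree: report the disagreement instead of forcing a score.
        return "publish disagreement"
    if stats.script_agreement is not None and stats.script_agreement >= bar:
        return "script"  # assumption: prefer the script when both scorers qualify
    if stats.agent_agreement is not None and stats.agent_agreement >= bar:
        return "agent"
    # No automated scorer matches designer judgment closely enough.
    return "designer"
```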

The harness is Inspect AI, the UK AI Safety Institute's framework, so any scenario we publish can be independently rerun.

Once a benchmark is public, models train on it, and it starts measuring exposure as much as capability. PX-bench keeps its scored scenarios held out and rotating, separate from anything we publish, so a score reflects capability rather than familiarity with the test.

References: METR · UK AISI · Inspect AI

Run a private PX-bench eval.

We score your coding agent on a reference app and send back the full report.

Get in touch