PX-bench

Long-horizon product experience benchmark for coding agents.

Why it matters.

Models improve fastest where the correctness of outcomes is easy to verify at scale. Product decisions are hard to score. An empty state can render correctly and still be unhelpful; a feature can work and still belong in the wrong place.

PX-bench makes agent product decisions observable and verifiable without requiring humans to review every output. It helps agent builders find quality gaps, catch regressions, and reduce cost without sacrificing quality.

What it measures.

The brief settles what to build; PX-bench measures how well the agent realizes it in a held-out host app. We score the result across eight categories of product experience, each naming one kind of decision a senior product designer makes when adding a feature to an app with established conventions.

01
Intent fidelity
Rubrics include: every requested capability present and working on its core path, no material feature omitted, and no unrequested complexity.
02
Product fit
Rubrics include: container and pattern choice, entry-point placement, and action consolidation over fragmentation across views.
03
Visual craft
Rubrics include: visual hierarchy and emphasis, spacing rhythm and alignment, and type-scale use that guides the eye to the primary thing first.
04
Convention adherence
Rubrics include: component reuse over duplication, design tokens over hardcoded values, naming and file conventions, and date and number formatting in the house style.
05
Pathway completeness
Rubrics include: cancel, undo, and error-recovery paths with no dead-ends, and loading, empty, error, and pending states present.
06
Content & language
Rubrics include: label and error-message quality, empty-state copy, and microcopy in the product's voice.
07
Resilience
Rubrics include: long-content overflow, responsive layout across breakpoints, rendering under API failure, and slow-network performance.
08
Accessibility
Rubrics include: axe-core violations, color contrast, keyboard operability, and correct focus order.

The taxonomy is v1 and will change; we publish revisions with the diff stated.
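As a rough illustration of how per-item rubric results could roll up into the eight categories, here is a minimal sketch. The category names are taken from the list above; the equal-weight pass fraction, the function name `category_scores`, and the input shape are assumptions made for illustration, not PX-bench's published scoring method.

```python
from collections import defaultdict

# Category names come from the taxonomy above; everything else is hypothetical.
CATEGORIES = [
    "Intent fidelity", "Product fit", "Visual craft", "Convention adherence",
    "Pathway completeness", "Content & language", "Resilience", "Accessibility",
]


def category_scores(results):
    """Aggregate rubric results into one score per category.

    results: iterable of (category, passed) pairs, one pair per rubric item.
    Returns {category: fraction of items passed}, with None for any
    category that had no scored items (assumed convention).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, ok in results:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        total[category] += 1
        passed[category] += int(ok)
    return {
        c: (passed[c] / total[c] if total[c] else None)
        for c in CATEGORIES
    }
```

Reporting `None` instead of zero for unscored categories keeps "not measured in this scenario" distinct from "failed every item".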

How it works.

PX-bench is a capability evaluation in the tradition of METR and the UK AI Safety Institute, applied to product experience. It rests on three deliberate choices, each designed to make product judgment measurable with automatic scoring anchored in expert-defined ground truth.

Three deliberate choices
  1. Held-out host apps

    Instead of building from a blank prompt, agents add a feature to a held-out, multi-screen host app with its own conventions. That's what makes consistency and pattern choice scorable.

  2. Failure modes with a known answer

    Each app presents product situations a senior product designer would recognize: an implied screen that doesn't exist, state that could be lost on navigation, an ambiguous primary action. We map them in advance, so the agent's choice is scored against a known-good outcome.

  3. Quasi-objective rubrics

    Items are scoped to where senior product designers agree; any item that can't clear an agreement threshold is reworked or dropped.
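The agreement filter in the third choice can be pictured with a short sketch. It assumes each rubric item receives independent pass/fail verdicts from several designers and uses simple pairwise agreement. The metric, the 0.8 threshold, and the function names are hypothetical; the page does not state which agreement statistic or cutoff PX-bench uses.

```python
from itertools import combinations


def pairwise_agreement(ratings):
    """Fraction of rater pairs that gave the same pass/fail verdict."""
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("need at least two ratings")
    return sum(a == b for a, b in pairs) / len(pairs)


def split_by_agreement(item_ratings, threshold=0.8):
    """Separate rubric items that clear the agreement threshold.

    item_ratings: {item_id: [bool, ...]} with one verdict per rater.
    threshold: hypothetical cutoff, not a published PX-bench value.
    Returns (kept, flagged); flagged items would be reworked or dropped.
    """
    kept, flagged = [], []
    for item_id, ratings in item_ratings.items():
        if pairwise_agreement(ratings) >= threshold:
            kept.append(item_id)
        else:
            flagged.append(item_id)
    return kept, flagged
```

A chance-corrected statistic such as Cohen's or Fleiss' kappa would be a stricter alternative to raw pairwise agreement, since raw agreement looks high when nearly every rater passes nearly every item.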

The harness is Inspect AI, the UK AI Safety Institute's open-source evaluation framework, so any scenario we publish can be independently rerun.

References  //  METR  ·  UK AISI  ·  Inspect AI

Run a private PX-bench eval.

Put your coding agent through the same bench and get back a complete scored report: where its product experience holds up, where it breaks, and what it costs to ship.

Get in touch