PX-bench

Mapping the AI design evaluation landscape

Design benchmarks judge outputs with no product context. Coding benchmarks have the context and never judge the experience. Product-in-context evaluation puts the halves together. A map of the six families of AI design measurement, and when to use each.

Validity · Jun 10, 2026

Scoring the scorers

A feature whose create flow returned an error scored a perfect 1.0. Two scoring models read the code and inferred it worked; the one that ran the app was outvoted. How do you catch a scorer that is confidently wrong?

Validity · Jun 9, 2026

The noise floor

Run the same agent on the same task ten times, change nothing, and the score swings from 0.66 to 0.79. So when one agent edges out another, how do you know the gap is real?

The need for PX evals at scale.

Product experience is hard to score, and judging it well usually needs human reviewers. That is expensive and slow, and it never keeps up with the rate of model updates. So the questions that decide cost and quality go unanswered. Which model is most cost-efficient for this task? Does your harness still hold up after an update? Without answers, regressions creep in and good product experience stays a costly, uncertain part of every agent build.

PX-bench answers those questions without a human reviewing every output. In the demo run, GPT-5.5 builds the right feature and scores 99 on intent fidelity. It then builds a custom modal for create and edit where the app already uses a drawer, and scores 73 on product fit. The report names gaps like that, catches regressions, and prices every run.

The decisions we score.

The brief settles what to build; PX-bench measures how well the agent realizes it in a held-out host app. We score the result across eight categories of product experience, each naming one kind of decision a senior product designer makes when adding a feature to an app with established conventions.

Intent fidelity

Rubrics include: every requested capability present and working on its core path, no material feature omitted, and no unrequested complexity.

Product fit

Rubrics include: container and pattern choice, entry-point placement, and action consolidation over fragmentation across views.

Visual craft

Rubrics include: visual hierarchy and emphasis, spacing rhythm and alignment, and type-scale use that guides the eye to the primary thing first.

Convention adherence

Rubrics include: component reuse over duplication, design tokens over hardcoded values, naming and file conventions, and date and number formatting in the house style.

Pathway completeness

Rubrics include: cancel, undo, and error-recovery paths with no dead ends, and loading, empty, error, and pending states present.

Content & language

Rubrics include: label and error-message quality, empty-state copy, and microcopy in the product's voice.

Resilience

Rubrics include: long-content overflow, responsive layout across breakpoints, rendering under API failure, and slow-network performance.

Accessibility

Rubrics include: axe-core violations, color contrast, keyboard operability, and correct focus order.

The taxonomy is v1 and will change; we publish revisions with the diff stated.
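
As a rough illustration of how rubric items could roll up into per-category scores like the intent fidelity 99 and product fit 73 quoted earlier, the sketch below averages pass/fail items within each category and scales the result to 0-100. The `RubricResult` shape, the item names, the equal weighting, and the `category_scores` helper are all assumptions made for illustration; they are not PX-bench's published scoring method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricResult:
    """One scored rubric item (hypothetical shape, not PX-bench's schema)."""
    category: str  # one of the eight categories, e.g. "Product fit"
    item: str      # short rubric item name
    passed: bool   # whether the agent's output satisfied the item


def category_scores(results: list[RubricResult]) -> dict[str, int]:
    """Return a 0-100 score per category: the share of items passed.

    Equal weighting is an assumption; a real rubric may weight items.
    """
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in results:
        total[r.category] += 1
        passed[r.category] += int(r.passed)
    return {cat: round(100 * passed[cat] / total[cat]) for cat in total}


# Illustrative data only: four made-up item results in two categories.
results = [
    RubricResult("Intent fidelity", "core path works", True),
    RubricResult("Intent fidelity", "no unrequested complexity", True),
    RubricResult("Product fit", "container and pattern choice", False),
    RubricResult("Product fit", "entry-point placement", True),
]
scores = category_scores(results)
print(scores)
```

Keeping the categories separate rather than collapsing them into one number is what lets a report say that an agent built the right feature but chose the wrong container for it.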

How it works.

PX-bench is a capability evaluation in the tradition of METR and the UK AI Safety Institute, applied to product experience. Scoring is automatic, anchored in expert-defined ground truth. Three deliberate choices make product judgment measurable.

Three deliberate choices

01 Held-out host apps
Instead of building from a blank prompt, agents add a feature to a held-out host app, a multi-screen app the agent has never seen, with its own conventions. That's what makes consistency and pattern choice scorable.

02 Failure modes with a known answer
Each app presents product situations a senior product designer would recognize: an implied screen that doesn't exist, state that could be lost on navigation, an ambiguous primary action. We map them in advance, so the agent's choice is scored against a known-good outcome.

03 Quasi-objective rubrics
Items are scoped to where senior product designers agree; any item that can't clear an agreement threshold is reworked or dropped.
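
The agreement check behind the third choice can be sketched as follows. The source does not say how agreement is measured or where the threshold sits, so the metric here (mean pairwise percent agreement between raters), the 0.8 cutoff, the helper names, and the sample ratings are all assumptions for illustration.

```python
from itertools import combinations


def pairwise_agreement(ratings: list[list[bool]]) -> float:
    """Mean fraction of outputs on which each pair of raters agrees.

    ratings[r][i] is rater r's pass/fail verdict on output i.
    """
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    per_pair = [
        sum(a == b for a, b in zip(r1, r2)) / len(r1)
        for r1, r2 in pairs
    ]
    return sum(per_pair) / len(per_pair)


def split_by_agreement(
    items: dict[str, list[list[bool]]], threshold: float = 0.8
) -> tuple[list[str], list[str]]:
    """Split rubric items into (kept, flagged) by rater agreement.

    Flagged items would be reworked or dropped; 0.8 is an assumed cutoff.
    """
    kept, flagged = [], []
    for name, ratings in items.items():
        target = kept if pairwise_agreement(ratings) >= threshold else flagged
        target.append(name)
    return kept, flagged


# Made-up verdicts from three raters on four outputs per item.
items = {
    "uses existing drawer pattern": [
        [True, True, False, True],
        [True, True, False, True],
        [True, True, False, False],
    ],
    "feels polished": [
        [True, False, True, False],
        [False, False, True, True],
        [True, True, False, False],
    ],
}
kept, flagged = split_by_agreement(items)
print("kept:", kept)
print("flagged:", flagged)
```

Percent agreement is the simplest choice but does not correct for chance; a chance-corrected statistic such as Cohen's kappa or Krippendorff's alpha would be a reasonable substitute in the same structure.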

The harness is Inspect AI, the open-source evaluation framework from the UK AI Security Institute (formerly the AI Safety Institute), so any scenario we publish can be independently rerun.

References: METR · UK AISI · Inspect AI

Run a private PX-bench eval.

Send your coding agent and harness. It runs against the same held-out host apps that scored GPT-5.5. You get back all eight category scores: where product experience holds up, where it breaks, and what it costs to ship.

Get in touch
