Measure evaluator accuracy against human ground truth and calibrate automated evaluators through the alignment workflow.

Benchmarking & Alignment

Benchmarking measures how well an automated evaluator agrees with human reviewers on a fixed set of labeled examples. Unlike experiments, which compare different model configurations against each other, benchmarks hold the evaluation criteria constant and ask whether the evaluator reliably reproduces human judgment.

Benchmarks vs. Experiments

	Experiments	Benchmarks
Purpose	Compare model/prompt configurations	Validate evaluator accuracy
What varies	Prompt, model, dataset	Evaluator version / parameters
Ground truth	Evaluator scores the output	Human reviewers score the output
Output	Score comparisons across runs	Alignment rate, discrepancy analysis

Alignment Concept

An evaluator is aligned when its scores closely match human-provided scores on the same items. The platform normalizes all scores to a 0–100 scale before comparing:

Aligned: normalized delta between human and eval score < 1 point
Discrepancy: normalized delta ≥ 20 points (20% of the 0–100 scale)

Score normalization by type:

Numeric / Boolean: ((value − min) / (max − min)) × 100
Categorical: equal-width buckets across categories, mapped to bucket midpoints
Text: cannot be compared numerically
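
The normalization and threshold rules above can be sketched in a few lines of Python. This is an illustration, not the platform's code: the function names are hypothetical, categories are assumed to be listed in ascending order, and the "unlabeled" result for deltas between 1 and 20 is an assumption, since the text defines only the two outer cases.

```python
from typing import Sequence


def normalize_numeric(value: float, min_value: float, max_value: float) -> float:
    """Map a numeric or boolean score onto the 0-100 scale."""
    if max_value == min_value:
        raise ValueError("max_value must differ from min_value")
    return (value - min_value) / (max_value - min_value) * 100


def normalize_categorical(label: str, categories: Sequence[str]) -> float:
    """Map a label to the midpoint of its equal-width bucket on 0-100.

    Assumes `categories` is ordered from lowest to highest.
    """
    index = list(categories).index(label)  # ValueError for unknown labels
    width = 100 / len(categories)
    return (index + 0.5) * width


def classify(human: float, evaluator: float) -> str:
    """Classify a pair of normalized scores by their absolute delta."""
    delta = abs(human - evaluator)
    if delta < 1:
        return "aligned"
    if delta >= 20:
        return "discrepancy"
    return "unlabeled"  # assumption: the source names no label for this range
```

For example, a human score of 3 on a 1–5 scale normalizes to 50, and the middle of three categories also lands at the bucket midpoint 50, so the two would count as aligned.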

Benchmarking is accessed through the evaluator editor. Open any evaluator and look for the Benchmarks section in the sidebar.

Creating a Benchmark

A benchmark links a set of human-scored examples to an evaluator. There are four ways to supply that data.

Open the evaluator

Navigate to Evals → Templates and open an existing evaluator (or create a new one). The Benchmarks panel appears in the right sidebar.

Click "Link Benchmark Dataset"

This opens the benchmark setup dialog. Choose one of four source types.

Select a source type

Option 1 — Upload a file

Upload a CSV or JSON file (max 10 MB) containing pre-labeled examples. Map columns to the required fields:

Field	Required	Description
`input`	Yes	The LLM input sent to the model
`output`	No	The model's response
`humanScore`	No	Ground-truth score assigned by a human reviewer
`humanReasoning`	No	Reviewer comment or explanation

Use drag-to-map column matching to align your file's headers to these fields. A preview shows how the data will be parsed before you submit.
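
If you generate the upload file programmatically, a sketch like the following produces a CSV whose headers already match the four fields. The helper name is hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption; the platform may count the limit differently.

```python
import csv
import io

FIELDS = ["input", "output", "humanScore", "humanReasoning"]
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" interpreted as 10 MiB


def build_benchmark_csv(rows: list[dict]) -> str:
    """Serialize pre-labeled examples to CSV text with the expected headers."""
    for i, row in enumerate(rows):
        if not row.get("input"):
            raise ValueError(f"row {i} is missing the required 'input' field")

    buffer = io.StringIO()
    # Optional fields that are absent are written as empty cells.
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)

    text = buffer.getvalue()
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("file exceeds the 10 MB upload limit")
    return text
```

Write the returned text to a `.csv` file and upload it; the column mapping step should then need no manual adjustment.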

Option 2 — Select from Human Review

Use completed annotation queue results as ground truth.

Select an annotation queue that has at least one completed run.
Select the queue run containing the labeled items.
Select a score config that defines the scoring scale used during review.

The platform maps each reviewed trace to a benchmark item, carrying over the human score and reasoning.

Option 3 — Select Dataset Run + Review Queue

Link an existing dataset run to an annotation queue to gather human scores.

Select a dataset run as the source of model outputs.
Select or create an annotation queue run where reviewers will score those outputs.
Select a score config.

The platform populates the queue with the dataset items. Human reviewers score them through the annotation queue UI, and their scores are synced into the benchmark automatically.

Option 4 — Create Dataset Run + Review Queue

Start from scratch: select a dataset and a system prompt, trigger a new dataset run, and route the outputs to a new annotation queue for review.

Select a dataset to use as test cases.
Select a system prompt version.
The platform creates a dataset run and a linked annotation queue.
Reviewers score the outputs through the queue; the benchmark is populated once reviews are complete.

Submit

Click Create. The benchmark run is queued for processing. The evaluator runs on each item in the background and results appear once it completes.

Benchmark Run Status

Each benchmark run has one of four statuses:

Status	Description
`PENDING`	Queued, waiting for the worker to pick it up
`RUNNING`	Evaluator is actively scoring items
`COMPLETED`	All items have been scored
`FAILED`	Processing encountered an unrecoverable error

For benchmarks created with Option 4 (CREATE_RUN), the run stays PENDING until the underlying dataset run finishes and its items are synced in.
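
The four statuses form a small lifecycle with two terminal states. The sketch below models that lifecycle and polls until a run finishes. How you obtain the status is not described here, so `fetch_status` is a hypothetical callable you would supply yourself; nothing in this block is a documented platform API.

```python
import time
from enum import Enum
from typing import Callable


class BenchmarkStatus(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# A run stops changing once it is COMPLETED or FAILED.
TERMINAL = {BenchmarkStatus.COMPLETED, BenchmarkStatus.FAILED}


def wait_for_run(
    fetch_status: Callable[[], str],  # hypothetical: returns the current status string
    interval_s: float = 5.0,
    max_polls: int = 120,
) -> BenchmarkStatus:
    """Poll until the run reaches a terminal status or the poll budget runs out."""
    for _ in range(max_polls):
        status = BenchmarkStatus(fetch_status())
        if status in TERMINAL:
            return status
        time.sleep(interval_s)
    raise TimeoutError("benchmark run did not reach a terminal status")
```

A generous `max_polls` is worth setting for runs that wait on human review, since those can stay PENDING for as long as the review takes.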

Viewing Results

Open a benchmark run from the evaluator's Benchmarks sidebar to see the full results view.

Header stats

The top of the results page shows aggregate metrics for the run:

Metric	Description
Total items	Number of examples in the benchmark
Human reviewed	% of items with a human score
Evaluated	% of items the evaluator has scored
Aligned	% of evaluated items where human ≈ eval score
Discrepancies	% of items with a normalized delta ≥ 20%
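
To make the denominators concrete, here is a rough sketch of how these metrics could be derived from a list of items. The `Item` type and function name are hypothetical, scores are assumed to be already normalized to 0–100, and the choice of denominators (evaluated items for Aligned, all items for Discrepancies) follows the wording of the table rather than any confirmed implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    human: Optional[float] = None      # normalized human score, None if not reviewed
    evaluator: Optional[float] = None  # normalized eval score, None if not evaluated


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, or 0.0 when `whole` is empty."""
    return 100 * part / whole if whole else 0.0


def header_stats(items: list[Item]) -> dict:
    total = len(items)
    reviewed = sum(i.human is not None for i in items)
    evaluated = sum(i.evaluator is not None for i in items)
    # Only items with both scores can be compared.
    deltas = [
        abs(i.human - i.evaluator)
        for i in items
        if i.human is not None and i.evaluator is not None
    ]
    aligned = sum(d < 1 for d in deltas)
    discrepant = sum(d >= 20 for d in deltas)
    return {
        "total_items": total,
        "human_reviewed_pct": pct(reviewed, total),
        "evaluated_pct": pct(evaluated, total),
        "aligned_pct": pct(aligned, evaluated),      # share of evaluated items
        "discrepancies_pct": pct(discrepant, total),  # share of all items
    }
```

Because the two percentages use different denominators under these assumptions, Aligned and Discrepancies need not sum to 100.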

Item table

Each row shows one example with side-by-side scores:

Input — the LLM input (expandable code viewer)
Output — the model response
Human Score — reviewer's score (color-coded)
Human Reasoning — reviewer's comment
Eval Score — evaluator's score (color-coded)
Eval Reasoning — evaluator's explanation or error
Discrepancy — warning icon when the delta ≥ 20%

Filters

Use the filter panel to drill into specific subsets:

Filter	Options
Discrepancy	Show only discrepant items
Human review status	Completed / Pending
Eval status	Completed / Pending
Human score range	Numeric range or categorical values
Eval score range	Numeric range

Discrepancy breakdown

Below the table, the platform shows how disagreements are distributed:

Eval scored higher — count of items where the evaluator was more generous than the human
Human scored higher — count of items where the human was more generous than the evaluator
Equal — items where scores matched
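
The three buckets amount to a sign check on each pair of scores. A minimal sketch, assuming the scores are already normalized and that "Equal" means exact equality (the platform may instead apply a tolerance); the function name is hypothetical:

```python
from typing import Iterable, Tuple


def discrepancy_breakdown(pairs: Iterable[Tuple[float, float]]) -> dict:
    """Count which side scored higher for each (human, evaluator) pair."""
    counts = {"eval_higher": 0, "human_higher": 0, "equal": 0}
    for human, evaluator in pairs:
        if evaluator > human:
            counts["eval_higher"] += 1
        elif human > evaluator:
            counts["human_higher"] += 1
        else:
            counts["equal"] += 1
    return counts
```

A breakdown that leans heavily to one side suggests a systematic bias in the evaluator rather than random noise.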

Alignment Workflow

The alignment workflow is how you iteratively improve an evaluator until its scores reliably match human judgment.

Gather human ground truth

Create a benchmark using any source type that includes human scores. Ensure reviewers use a consistent score config so the scale is comparable.

Run the evaluator

Once the benchmark is created and human scores are in place, the evaluator runs automatically on each item. Check the run status until it reaches COMPLETED.

Analyze discrepancies

Open the benchmark results and filter to discrepant items. Look for patterns:

Does the evaluator consistently score higher or lower than humans?
Are discrepancies concentrated on specific input types?
Does the evaluator's reasoning reflect misunderstanding of the task?
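
The first two questions can be approached numerically if you export the results and tag each item yourself. The sketch below computes the mean signed delta (eval minus human) per group: a positive value means the evaluator scores higher than reviewers for that group. The group label is an assumption introduced for illustration; the platform does not necessarily store one.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Tuple


def bias_by_group(items: Iterable[Tuple[str, float, float]]) -> dict:
    """Mean signed delta per group.

    Each item is (group, human_score, eval_score) with normalized scores.
    """
    deltas = defaultdict(list)
    for group, human, evaluator in items:
        deltas[group].append(evaluator - human)
    return {group: mean(values) for group, values in deltas.items()}
```

A large positive or negative mean in one group and near zero elsewhere points to an input type the evaluator prompt or logic handles poorly.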

Tune the evaluator

Based on the analysis, adjust the evaluator:

LLM judge: edit the evaluator prompt to address the patterns you identified.
Code evaluator: update the scoring logic.
Parameters: adjust configurable parameters (temperature, thresholds, etc.) defined in the evaluator's parameter settings.

Score Configs

Score Config Type	Human input	Comparable to eval?
Numeric	Number within min–max range	Yes
Boolean	True / False	Yes
Categorical	Label from a predefined list	Yes (via bucket normalization)
Text	Free-text comment	No — excluded from alignment metrics

If the score types are incompatible (e.g., categorical human score vs. numeric eval score), the platform marks those items as cannotCompare and excludes them from alignment calculations.
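
The comparability rule can be expressed as a small check. This sketch assumes that the human and eval score types must match exactly and that any text score is never comparable; the source confirms only the categorical-versus-numeric case and the text exclusion, so the treatment of other mixed pairs (for example boolean versus numeric) is a guess. The function name and lowercase type strings are hypothetical.

```python
COMPARABLE_TYPES = {"numeric", "boolean", "categorical"}


def comparison_status(human_type: str, eval_type: str) -> str:
    """Return "comparable" or "cannotCompare" for a pair of score types."""
    if human_type not in COMPARABLE_TYPES or eval_type not in COMPARABLE_TYPES:
        return "cannotCompare"  # text scores have no numeric form
    if human_type != eval_type:
        return "cannotCompare"  # assumption: mixed types are never compared
    return "comparable"
```

Running a check like this over a benchmark before analysis shows how many items will silently drop out of the alignment rate.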
