Benchmarking & Alignment
Measure evaluator accuracy against human ground truth and calibrate automated evaluators through the alignment workflow.
Benchmarking measures how well an automated evaluator agrees with human reviewers on a fixed set of labeled examples. Unlike experiments — which compare different model configurations against each other — benchmarks hold the evaluation criteria constant and ask: does this evaluator reliably reproduce human judgment?
Benchmarks vs. Experiments
| | Experiments | Benchmarks |
|---|---|---|
| Purpose | Compare model/prompt configurations | Validate evaluator accuracy |
| What varies | Prompt, model, dataset | Evaluator version / parameters |
| Ground truth | Evaluator scores the output | Human reviewers score the output |
| Output | Score comparisons across runs | Alignment rate, discrepancy analysis |
Alignment Concept
An evaluator is aligned when its scores closely match human-provided scores on the same items. The platform normalizes all scores to a 0–100 scale before comparing:
- Aligned: normalized delta between human and eval score < 1 point
- Discrepancy: normalized delta ≥ 20 points (20% of the scale)
Score normalization by type:
- Numeric / Boolean: ((value − min) / (max − min)) × 100
- Categorical: equal-width buckets across categories, mapped to bucket midpoints
- Text: cannot be compared numerically
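
The normalization rules and thresholds above can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of the platform. The categorical mapping assumes the categories are ordered and that category i of n maps to the midpoint of the i-th of n equal-width buckets, which is one reading of the rule. The `neither` label for deltas between the two thresholds is also an assumption, since the text defines only the two outcomes.

```python
from typing import Sequence

ALIGNED_BELOW = 1.0    # aligned when the normalized delta is < 1
DISCREPANCY_AT = 20.0  # discrepancy when the normalized delta is >= 20


def normalize_numeric(value: float, min_value: float, max_value: float) -> float:
    """Map a numeric score onto the 0-100 scale."""
    if max_value == min_value:
        raise ValueError("max_value must differ from min_value")
    return (value - min_value) / (max_value - min_value) * 100


def normalize_boolean(value: bool) -> float:
    """Treat False/True as a numeric score with min 0 and max 1."""
    return normalize_numeric(1.0 if value else 0.0, 0.0, 1.0)


def normalize_categorical(label: str, categories: Sequence[str]) -> float:
    """Assumed reading: split 0-100 into equal buckets, return the bucket midpoint."""
    width = 100 / len(categories)
    return (categories.index(label) + 0.5) * width


def classify(human: float, evaluator: float) -> str:
    """Compare two already-normalized scores."""
    delta = abs(human - evaluator)
    if delta < ALIGNED_BELOW:
        return "aligned"
    if delta >= DISCREPANCY_AT:
        return "discrepancy"
    return "neither"  # hypothetical label for deltas between the thresholds


# A human score of 4 and an eval score of 3 on a 1-5 scale normalize to 75 and 50.
print(classify(normalize_numeric(4, 1, 5), normalize_numeric(3, 1, 5)))  # discrepancy
```

On a 1–5 scale a single-point disagreement is already a 25-point normalized delta, so coarse scales will tend to surface more discrepancies than fine-grained ones.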
Benchmarking is accessed through the evaluator editor. Open any evaluator and look for the Benchmarks section in the sidebar.
Creating a Benchmark
A benchmark links a set of human-scored examples to an evaluator. There are four ways to supply that data.
Open the evaluator
Navigate to Evals → Templates and open an existing evaluator (or create a new one). The Benchmarks panel appears in the right sidebar.
Click "Link Benchmark Dataset"
This opens the benchmark setup dialog. Choose one of four source types.
Select a source type
Option 1 — Upload a file
Upload a CSV or JSON file (max 10 MB) containing pre-labeled examples. Map columns to the required fields:
| Field | Required | Description |
|---|---|---|
| `input` | Yes | The LLM input sent to the model |
| `output` | No | The model's response |
| `humanScore` | No | Ground-truth score assigned by a human reviewer |
| `humanReasoning` | No | Reviewer comment or explanation |
Use drag-to-map column matching to align your file's headers to these fields. A preview shows how the data will be parsed before you submit.
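
Before uploading, it can help to sanity-check the file locally. The sketch below is one possible pre-flight check, not a platform tool: `check_benchmark_csv` is a hypothetical helper, it assumes the CSV headers already use the field names from the table (so no column mapping is needed), and it assumes 10 MB means 10 × 1024 × 1024 bytes.

```python
import csv
import io

REQUIRED_FIELDS = {"input"}
MAX_BYTES = 10 * 1024 * 1024  # assumption: 10 MB counted in binary megabytes


def check_benchmark_csv(text: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("file exceeds 10 MB")

    reader = csv.DictReader(io.StringIO(text))
    headers = set(reader.fieldnames or [])
    missing = sorted(REQUIRED_FIELDS - headers)
    if missing:
        problems.extend(f"missing required column: {name}" for name in missing)
        return problems  # row checks are meaningless without the required column

    for record_no, row in enumerate(reader, start=1):
        if not (row.get("input") or "").strip():
            problems.append(f"record {record_no}: empty input")
    return problems


sample = (
    "input,output,humanScore,humanReasoning\n"
    "What is 2+2?,4,1,Correct answer\n"
    "Capital of France?,Lyon,0,Wrong city\n"
)
print(check_benchmark_csv(sample))  # []
```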
Option 2 — Select from Human Review
Use completed annotation queue results as ground truth.
- Select an annotation queue that has at least one completed run.
- Select the queue run containing the labeled items.
- Select a score config that defines the scoring scale used during review.
The platform maps each reviewed trace to a benchmark item, carrying over the human score and reasoning.
Option 3 — Select Dataset Run + Review Queue
Link an existing dataset run to an annotation queue to gather human scores.
- Select a dataset run as the source of model outputs.
- Select or create an annotation queue run where reviewers will score those outputs.
- Select a score config.
The platform populates the queue with the dataset items. Human reviewers score them through the annotation queue UI, and their scores are synced into the benchmark automatically.
Option 4 — Create Dataset Run + Review Queue
Start from scratch: select a dataset and a system prompt, trigger a new dataset run, and route the outputs to a new annotation queue for review.
- Select a dataset to use as test cases.
- Select a system prompt version.
- The platform creates a dataset run and a linked annotation queue.
- Reviewers score the outputs through the queue; the benchmark is populated once reviews are complete.
Submit
Click Create. The benchmark run is queued for processing. The evaluator runs on each item in the background and results appear once it completes.
Benchmark Run Status
Each benchmark run has one of four statuses:
| Status | Description |
|---|---|
| `PENDING` | Queued, waiting for the worker to pick it up |
| `RUNNING` | Evaluator is actively scoring items |
| `COMPLETED` | All items have been scored |
| `FAILED` | Processing encountered an unrecoverable error |
For CREATE_RUN benchmarks, the run stays PENDING until the underlying dataset run finishes and items are synced in.
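
If you track run status from a script, the four statuses can be modeled as an enum with a simple polling loop. This is only a sketch: the document does not describe a status API, so `fetch_status` is a hypothetical callable that you would supply, and treating `COMPLETED` and `FAILED` as the only final statuses is an assumption drawn from the descriptions above.

```python
import time
from enum import Enum
from typing import Callable


class BenchmarkRunStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumption: these two statuses never change once reached.
TERMINAL_STATUSES = {BenchmarkRunStatus.COMPLETED, BenchmarkRunStatus.FAILED}


def wait_for_run(
    fetch_status: Callable[[], str],  # hypothetical: returns the current status string
    interval_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> BenchmarkRunStatus:
    """Poll until the run reaches a terminal status or max_polls is exhausted."""
    for _ in range(max_polls):
        status = BenchmarkRunStatus(fetch_status())
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_seconds)
    raise TimeoutError("benchmark run did not reach a terminal status in time")
```

The `max_polls` cap matters because a run can stay `PENDING` for a long time while human reviews or an underlying dataset run are still outstanding.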
Viewing Results
Open a benchmark run from the evaluator's Benchmarks sidebar to see the full results view.
Header stats
The top of the results page shows aggregate metrics for the run:
| Metric | Description |
|---|---|
| Total items | Number of examples in the benchmark |
| Human reviewed | % of items with a human score |
| Evaluated | % of items the evaluator has scored |
| Aligned | % of evaluated items where human ≈ eval score |
| Discrepancies | % of items with a normalized delta ≥ 20% |
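
To make the header metrics concrete, here is a small sketch that derives them from a list of items with already-normalized scores. `BenchmarkItem` and `header_stats` are hypothetical names. The denominators follow the wording in the table: Aligned is taken over evaluated items, and Discrepancies over all items, which is an assumption where the table is not explicit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BenchmarkItem:
    human: Optional[float]      # normalized 0-100 human score, None if not reviewed
    evaluator: Optional[float]  # normalized 0-100 eval score, None if not evaluated


def _pct(count: int, total: int) -> float:
    return 100.0 * count / total if total else 0.0


def header_stats(items: Sequence[BenchmarkItem]) -> dict:
    total = len(items)
    reviewed = sum(1 for i in items if i.human is not None)
    evaluated = sum(1 for i in items if i.evaluator is not None)
    # Only items with both scores can be compared.
    deltas = [
        abs(i.human - i.evaluator)
        for i in items
        if i.human is not None and i.evaluator is not None
    ]
    aligned = sum(1 for d in deltas if d < 1)
    discrepant = sum(1 for d in deltas if d >= 20)
    return {
        "total_items": total,
        "human_reviewed_pct": _pct(reviewed, total),
        "evaluated_pct": _pct(evaluated, total),
        "aligned_pct": _pct(aligned, evaluated),      # of evaluated items
        "discrepancies_pct": _pct(discrepant, total),  # assumption: of all items
    }
```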
Item table
Each row shows one example with side-by-side scores:
- Input — the LLM input (expandable code viewer)
- Output — the model response
- Human Score — reviewer's score (color-coded)
- Human Reasoning — reviewer's comment
- Eval Score — evaluator's score (color-coded)
- Eval Reasoning — evaluator's explanation or error
- Discrepancy — warning icon when the delta ≥ 20%
Filters
Use the filter panel to drill into specific subsets:
| Filter | Options |
|---|---|
| Discrepancy | Show only discrepant items |
| Human review status | Completed / Pending |
| Eval status | Completed / Pending |
| Human score range | Numeric range or categorical values |
| Eval score range | Numeric range |
Discrepancy breakdown
Below the table, the platform shows how disagreements are distributed:
- Eval scored higher — count of items where the evaluator was more generous than the human
- Human scored higher — count of items where the human was more generous than the evaluator
- Equal — items where scores matched
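
The breakdown can be reproduced from pairs of normalized scores, as in the sketch below. `discrepancy_breakdown` is a hypothetical helper. It assumes that "Equal" means the normalized scores are exactly equal and that every comparable pair is counted, since the text does not say whether only discrepant items are included.

```python
from typing import Dict, Iterable, Tuple


def discrepancy_breakdown(pairs: Iterable[Tuple[float, float]]) -> Dict[str, int]:
    """Count direction of disagreement for (human, evaluator) normalized score pairs."""
    counts = {"eval_scored_higher": 0, "human_scored_higher": 0, "equal": 0}
    for human, evaluator in pairs:
        if evaluator > human:
            counts["eval_scored_higher"] += 1
        elif human > evaluator:
            counts["human_scored_higher"] += 1
        else:
            counts["equal"] += 1
    return counts


# Illustrative (human, evaluator) pairs; the values are made up.
pairs = [(50.0, 80.0), (70.0, 40.0), (60.0, 60.0), (20.0, 45.0)]
print(discrepancy_breakdown(pairs))
# {'eval_scored_higher': 2, 'human_scored_higher': 1, 'equal': 1}
```

A heavily one-sided split suggests a systematic bias in the evaluator, which is usually easier to correct than scattered disagreement.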
Alignment Workflow
The alignment workflow is how you iteratively improve an evaluator until its scores reliably match human judgment.
Gather human ground truth
Create a benchmark using any source type that includes human scores. Ensure reviewers use a consistent score config so the scale is comparable.
Run the evaluator
Once the benchmark is created and human scores are in place, the evaluator runs automatically on each item. Check the run status until it reaches COMPLETED.
Analyze discrepancies
Open the benchmark results and filter to discrepant items. Look for patterns:
- Does the evaluator consistently score higher or lower than humans?
- Are discrepancies concentrated on specific input types?
- Does the evaluator's reasoning reflect misunderstanding of the task?
Tune the evaluator
Based on the analysis, adjust the evaluator:
- LLM judge: edit the evaluator prompt to address the patterns you identified.
- Code evaluator: update the scoring logic.
- Parameters: adjust configurable parameters (temperature, thresholds, etc.) defined in the evaluator's parameter settings.
Rerun the benchmark
Click Rerun on the benchmark. This creates a new benchmark version (auto-incremented) using the same items and source configuration. The previous run is preserved for comparison.
Compare versions
The evaluator's benchmark sidebar lists all versions chronologically. Compare the aligned % and discrepancy % across versions to confirm the evaluator is improving.
Deploy
Once the evaluator reaches an acceptable alignment rate, deploy it for production use via Online Evaluations or as part of an Experiment.
Rerunning a benchmark always creates a new version — it does not overwrite the original run. This lets you track improvement over time without losing historical baselines.
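
When comparing versions, it is the change between consecutive runs that shows whether a tuning step helped. The sketch below computes those changes from figures you have noted down from the sidebar. The dictionary keys and the `version_deltas` helper are hypothetical, and the numbers are made up for illustration.

```python
from typing import Dict, List


def version_deltas(versions: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return the change in aligned % and discrepancy % between consecutive versions."""
    ordered = sorted(versions, key=lambda v: v["version"])
    changes = []
    for prev, curr in zip(ordered, ordered[1:]):
        changes.append(
            {
                "version": curr["version"],
                "aligned_change": curr["aligned_pct"] - prev["aligned_pct"],
                "discrepancy_change": curr["discrepancy_pct"] - prev["discrepancy_pct"],
            }
        )
    return changes


# Made-up figures for two benchmark versions.
history = [
    {"version": 1, "aligned_pct": 60.0, "discrepancy_pct": 25.0},
    {"version": 2, "aligned_pct": 72.0, "discrepancy_pct": 15.0},
]
print(version_deltas(history))
# [{'version': 2, 'aligned_change': 12.0, 'discrepancy_change': -10.0}]
```

Because a rerun uses the same items and source configuration, a change between versions can be attributed to the evaluator edit rather than to different data.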
Score Configs
Score configs define the scale used when comparing human and evaluator scores. Choose a score config that matches the evaluator's output type.
| Score Config Type | Human input | Comparable to eval? |
|---|---|---|
| Numeric | Number within min–max range | Yes |
| Boolean | True / False | Yes |
| Categorical | Label from a predefined list | Yes (via bucket normalization) |
| Text | Free-text comment | No — excluded from alignment metrics |
If the score types are incompatible (e.g., categorical human score vs. numeric eval score), the platform marks those items as cannotCompare and excludes them from alignment calculations.
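
The comparability rule can be expressed as a small check. This sketch assumes the human and evaluator score types must match exactly and that text is never comparable. The document only gives categorical vs. numeric as an example of incompatibility, so the exact-match rule is an assumption, and the `comparable` label and function name are hypothetical.

```python
def comparison_status(human_type: str, eval_type: str) -> str:
    """Return 'cannotCompare' for text or mismatched score types, else 'comparable'."""
    if human_type == "text" or eval_type == "text":
        return "cannotCompare"  # free text has no numeric normalization
    if human_type != eval_type:
        return "cannotCompare"  # assumption: types must match exactly
    return "comparable"         # hypothetical label for the compatible case


print(comparison_status("categorical", "numeric"))  # cannotCompare
print(comparison_status("numeric", "numeric"))      # comparable
```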