BrowserStack AI Evals
Evaluation

Benchmarking & Alignment

Measure evaluator accuracy against human ground truth and calibrate automated evaluators through the alignment workflow.

Benchmarking measures how well an automated evaluator agrees with human reviewers on a fixed set of labeled examples. Unlike experiments — which compare different model configurations against each other — benchmarks hold the evaluation criteria constant and ask: does this evaluator reliably reproduce human judgment?

Benchmarks vs. Experiments

| | Experiments | Benchmarks |
| --- | --- | --- |
| Purpose | Compare model/prompt configurations | Validate evaluator accuracy |
| What varies | Prompt, model, dataset | Evaluator version / parameters |
| Ground truth | Evaluator scores the output | Human reviewers score the output |
| Output | Score comparisons across runs | Alignment rate, discrepancy analysis |

Alignment Concept

An evaluator is aligned when its scores closely match human-provided scores on the same items. The platform normalizes all scores to a 0–100 scale before comparing:

  • Aligned: normalized delta between human and eval score is less than 1 point on the 0–100 scale
  • Discrepancy: normalized delta is at least 20 points (20% of the scale)
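
As a rough illustration of the two thresholds, the following Python sketch labels a pair of already-normalized scores. It assumes the delta is the absolute difference between the two scores; the function name and the "neither" label for deltas between the thresholds are placeholders for this example, not platform identifiers.

```python
ALIGNED_BELOW = 1.0      # aligned when the delta is under 1 point (0-100 scale)
DISCREPANCY_FROM = 20.0  # discrepancy when the delta is 20 points or more


def classify(human: float, evaluator: float) -> str:
    """Label a pair of normalized (0-100) scores.

    Assumption: the delta is the absolute difference of the two scores.
    """
    delta = abs(human - evaluator)
    if delta < ALIGNED_BELOW:
        return "aligned"
    if delta >= DISCREPANCY_FROM:
        return "discrepancy"
    return "neither"  # hypothetical label for the band in between


print(classify(80.0, 80.5))  # aligned
print(classify(80.0, 55.0))  # discrepancy
print(classify(80.0, 70.0))  # neither
```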

Score normalization by type:

  • Numeric / Boolean: ((value − min) / (max − min)) × 100
  • Categorical: equal-width buckets across categories, mapped to bucket midpoints
  • Text: cannot be compared numerically
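
The normalization rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the platform's implementation: booleans are treated as numeric values with min 0 and max 1, categories are assumed to be listed in order from lowest to highest, and all function names are made up for the example.

```python
from typing import Sequence


def normalize_numeric(value: float, lo: float, hi: float) -> float:
    """Apply ((value - min) / (max - min)) * 100."""
    if hi <= lo:
        raise ValueError("max must be greater than min")
    return (value - lo) / (hi - lo) * 100


def normalize_boolean(value: bool) -> float:
    """Assumption: a boolean is a numeric score with min=0 and max=1."""
    return normalize_numeric(1.0 if value else 0.0, 0.0, 1.0)


def normalize_categorical(label: str, categories: Sequence[str]) -> float:
    """Map a label to the midpoint of its equal-width bucket on 0-100.

    Assumption: `categories` is ordered from lowest to highest.
    """
    index = list(categories).index(label)  # ValueError for unknown labels
    width = 100 / len(categories)
    return (index + 0.5) * width


print(normalize_numeric(4, 1, 5))                                  # 75.0
print(normalize_boolean(True))                                     # 100.0
print(normalize_categorical("good", ["poor", "fair", "good", "great"]))  # 62.5
```

With four categories each bucket is 25 points wide, so the midpoints are 12.5, 37.5, 62.5, and 87.5.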

Benchmarking is accessed through the evaluator editor. Open any evaluator and look for the Benchmarks section in the sidebar.


Creating a Benchmark

A benchmark links a set of human-scored examples to an evaluator. There are four ways to supply that data.

Open the evaluator

Navigate to Evals → Templates and open an existing evaluator (or create a new one). The Benchmarks panel appears in the right sidebar.

Starting a new benchmark from this panel opens the benchmark setup dialog. Choose one of four source types.

Select a source type

Option 1 — Upload a file

Upload a CSV or JSON file (max 10 MB) containing pre-labeled examples. Map your file's columns to the following fields; only input is required:

| Field | Required | Description |
| --- | --- | --- |
| input | Yes | The LLM input sent to the model |
| output | No | The model's response |
| humanScore | No | Ground-truth score assigned by a human reviewer |
| humanReasoning | No | Reviewer comment or explanation |

Use drag-to-map column matching to align your file's headers to these fields. A preview shows how the data will be parsed before you submit.
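
To check a file locally before uploading, a small script along these lines may help. It is a sketch only: the column mapping is passed as a plain dictionary, the size limit assumes 1 MB = 1024 × 1024 bytes, the rule that every row needs a non-empty input is an assumption, and `parse_examples` is a hypothetical helper, not part of the platform.

```python
import csv
import io

MAX_BYTES = 10 * 1024 * 1024  # 10 MB limit; byte definition is an assumption
FIELDS = ("input", "output", "humanScore", "humanReasoning")


def parse_examples(raw: bytes, mapping: dict) -> list:
    """Parse CSV bytes into rows keyed by benchmark field names.

    `mapping` maps a benchmark field (e.g. "input") to a header in the file.
    """
    if len(raw) > MAX_BYTES:
        raise ValueError("file exceeds 10 MB")
    if "input" not in mapping:
        raise ValueError("the required field 'input' must be mapped")
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    rows = []
    for record in reader:
        row = {f: record.get(mapping[f]) for f in FIELDS if f in mapping}
        if not row["input"]:
            # Assumption: a row without an input is not a usable example.
            raise ValueError("every row needs a non-empty input")
        rows.append(row)
    return rows


sample = b"question,answer,rating\nWhat is 2+2?,4,5\n"
mapping = {"input": "question", "output": "answer", "humanScore": "rating"}
print(parse_examples(sample, mapping))
# [{'input': 'What is 2+2?', 'output': '4', 'humanScore': '5'}]
```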

Option 2 — Select from Human Review

Use completed annotation queue results as ground truth.

  1. Select an annotation queue that has at least one completed run.
  2. Select the queue run containing the labeled items.
  3. Select a score config that defines the scoring scale used during review.

The platform maps each reviewed trace to a benchmark item, carrying over the human score and reasoning.

Option 3 — Select Dataset Run + Review Queue

Link an existing dataset run to an annotation queue to gather human scores.

  1. Select a dataset run as the source of model outputs.
  2. Select or create an annotation queue run where reviewers will score those outputs.
  3. Select a score config.

The platform populates the queue with the dataset items. Human reviewers score them through the annotation queue UI, and their scores are synced into the benchmark automatically.

Option 4 — Create Dataset Run + Review Queue

Start from scratch: select a dataset and a system prompt, trigger a new dataset run, and route the outputs to a new annotation queue for review.

  1. Select a dataset to use as test cases.
  2. Select a system prompt version.
  3. The platform creates a dataset run and a linked annotation queue.
  4. Reviewers score the outputs through the queue; the benchmark is populated once reviews are complete.

Submit

Click Create. The benchmark run is queued for processing. The evaluator runs on each item in the background and results appear once it completes.


Benchmark Run Status

Each benchmark run has one of four statuses:

| Status | Description |
| --- | --- |
| PENDING | Queued, waiting for the worker to pick it up |
| RUNNING | Evaluator is actively scoring items |
| COMPLETED | All items have been scored |
| FAILED | Processing encountered an unrecoverable error |

For CREATE_RUN benchmarks (the source type that creates a new dataset run, Option 4), the run stays PENDING until the underlying dataset run finishes and items are synced in.
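
If you track run status in your own tooling, the four values can be modeled as a small enum. The sketch below is illustrative: the class and helper names are hypothetical, and treating COMPLETED and FAILED as the only final states is a reading of the table above rather than documented behavior.

```python
from enum import Enum


class BenchmarkStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumption: a run no longer changes once it is COMPLETED or FAILED.
FINAL_STATES = {BenchmarkStatus.COMPLETED, BenchmarkStatus.FAILED}


def is_finished(status: str) -> bool:
    """Return True when the status string names a final state.

    Raises ValueError for a string that is not one of the four statuses.
    """
    return BenchmarkStatus(status) in FINAL_STATES


print(is_finished("PENDING"))    # False
print(is_finished("COMPLETED"))  # True
```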


Viewing Results

Open a benchmark run from the evaluator's Benchmarks sidebar to see the full results view.

Header stats

The top of the results page shows aggregate metrics for the run:

| Metric | Description |
| --- | --- |
| Total items | Number of examples in the benchmark |
| Human reviewed | % of items with a human score |
| Evaluated | % of items the evaluator has scored |
| Aligned | % of evaluated items where human ≈ eval score |
| Discrepancies | % of items with a normalized delta ≥ 20% |
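
To make the metric definitions concrete, here is a sketch that derives them from a list of items with normalized scores. The denominators follow the descriptions literally (aligned over evaluated items, the rest over all items), which is an interpretation; the item layout, field names, and function name are assumptions for the example.

```python
def header_stats(items: list) -> dict:
    """items: dicts with optional normalized 'human' and 'eval' scores (0-100)."""
    total = len(items)
    reviewed = [i for i in items if i.get("human") is not None]
    evaluated = [i for i in items if i.get("eval") is not None]
    # A delta only exists where both scores are present.
    deltas = [abs(i["human"] - i["eval"])
              for i in evaluated if i.get("human") is not None]

    def pct(part: int, whole: int) -> float:
        return round(100 * part / whole, 1) if whole else 0.0

    return {
        "total_items": total,
        "human_reviewed_pct": pct(len(reviewed), total),
        "evaluated_pct": pct(len(evaluated), total),
        "aligned_pct": pct(sum(d < 1 for d in deltas), len(evaluated)),
        "discrepancy_pct": pct(sum(d >= 20 for d in deltas), total),
    }


items = [
    {"human": 75, "eval": 75},     # aligned
    {"human": 50, "eval": 100},    # discrepancy
    {"human": 25, "eval": None},   # not yet evaluated
    {"human": None, "eval": 40},   # not yet reviewed
]
print(header_stats(items))
# {'total_items': 4, 'human_reviewed_pct': 75.0, 'evaluated_pct': 75.0,
#  'aligned_pct': 33.3, 'discrepancy_pct': 25.0}
```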

Item table

Each row shows one example with side-by-side scores:

  • Input — the LLM input (expandable code viewer)
  • Output — the model response
  • Human Score — reviewer's score (color-coded)
  • Human Reasoning — reviewer's comment
  • Eval Score — evaluator's score (color-coded)
  • Eval Reasoning — evaluator's explanation or error
  • Discrepancy — warning icon when the delta ≥ 20%

Filters

Use the filter panel to drill into specific subsets:

| Filter | Options |
| --- | --- |
| Discrepancy | Show only discrepant items |
| Human review status | Completed / Pending |
| Eval status | Completed / Pending |
| Human score range | Numeric range or categorical values |
| Eval score range | Numeric range |

Discrepancy breakdown

Below the table, the platform shows how disagreements are distributed:

  • Eval scored higher — count of items where the evaluator was more generous than the human
  • Human scored higher — count of items where the human was more generous than the evaluator
  • Equal — items where scores matched
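
The same three counts can be reproduced offline from exported scores. The sketch below assumes each item is a (human, eval) pair of normalized scores and that items missing either score have already been dropped; the function and key names are illustrative only.

```python
def discrepancy_breakdown(pairs) -> dict:
    """Count which side scored higher across (human, eval) score pairs."""
    counts = {"eval_higher": 0, "human_higher": 0, "equal": 0}
    for human, evaluator in pairs:
        if evaluator > human:
            counts["eval_higher"] += 1
        elif human > evaluator:
            counts["human_higher"] += 1
        else:
            counts["equal"] += 1
    return counts


pairs = [(50, 100), (75, 25), (60, 60), (40, 90)]
print(discrepancy_breakdown(pairs))
# {'eval_higher': 2, 'human_higher': 1, 'equal': 1}
```

A count heavily skewed toward one side suggests a systematic bias in the evaluator rather than random noise.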

Alignment Workflow

The alignment workflow is how you iteratively improve an evaluator until its scores reliably match human judgment.

Gather human ground truth

Create a benchmark using any source type that includes human scores. Ensure reviewers use a consistent score config so the scale is comparable.

Run the evaluator

Once the benchmark is created and human scores are in place, the evaluator runs automatically on each item. Check the run status until it reaches COMPLETED.

Analyze discrepancies

Open the benchmark results and filter to discrepant items. Look for patterns:

  • Does the evaluator consistently score higher or lower than humans?
  • Are discrepancies concentrated on specific input types?
  • Does the evaluator's reasoning reflect misunderstanding of the task?

Tune the evaluator

Based on the analysis, adjust the evaluator:

  • LLM judge: edit the evaluator prompt to address the patterns you identified.
  • Code evaluator: update the scoring logic.
  • Parameters: adjust configurable parameters (temperature, thresholds, etc.) defined in the evaluator's parameter settings.

Rerun the benchmark

Click Rerun on the benchmark. This creates a new benchmark version (auto-incremented) using the same items and source configuration. The previous run is preserved for comparison.

Compare versions

The evaluator's benchmark sidebar lists all versions chronologically. Compare the aligned % and discrepancy % across versions to confirm the evaluator is improving.
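
If you record the aligned % and discrepancy % of each version yourself, a helper like the following can show the change between consecutive versions. The numbers below are made-up sample values, and the record layout and function name are assumptions for this sketch.

```python
def version_changes(runs: list) -> list:
    """Return the change in aligned % and discrepancy % between versions."""
    ordered = sorted(runs, key=lambda r: r["version"])
    return [
        {
            "from": a["version"],
            "to": b["version"],
            "aligned_change": round(b["aligned_pct"] - a["aligned_pct"], 1),
            "discrepancy_change": round(
                b["discrepancy_pct"] - a["discrepancy_pct"], 1
            ),
        }
        for a, b in zip(ordered, ordered[1:])
    ]


# Illustrative values only, not real benchmark results.
runs = [
    {"version": 2, "aligned_pct": 71.5, "discrepancy_pct": 12.0},
    {"version": 1, "aligned_pct": 62.0, "discrepancy_pct": 18.0},
]
print(version_changes(runs))
# [{'from': 1, 'to': 2, 'aligned_change': 9.5, 'discrepancy_change': -6.0}]
```

A rising aligned % together with a falling discrepancy % indicates the tuning step moved the evaluator closer to human judgment.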

Deploy

Once the evaluator reaches an acceptable alignment rate, deploy it for production use via Online Evaluations or as part of an Experiment.

Rerunning a benchmark always creates a new version — it does not overwrite the original run. This lets you track improvement over time without losing historical baselines.


Score Configs

Score configs define the scale used when comparing human and evaluator scores. Choose a score config that matches the evaluator's output type.

| Score Config Type | Human input | Comparable to eval? |
| --- | --- | --- |
| Numeric | Number within min–max range | Yes |
| Boolean | True / False | Yes |
| Categorical | Label from a predefined list | Yes (via bucket normalization) |
| Text | Free-text comment | No (excluded from alignment metrics) |

If the score types are incompatible (e.g., categorical human score vs. numeric eval score), the platform marks those items as cannotCompare and excludes them from alignment calculations.
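
A compatibility check of this kind might look like the sketch below. It assumes that two scores are comparable only when both have the same non-text type; the document confirms the categorical-versus-numeric case, while other mixed pairs (for example boolean versus numeric) are a guess. The function name and lowercase type strings are placeholders.

```python
COMPARABLE_TYPES = {"numeric", "boolean", "categorical"}


def can_compare(human_type: str, eval_type: str) -> bool:
    """Return False for pairs that would be marked cannotCompare.

    Assumption: types must match exactly, and text is never comparable.
    """
    if human_type not in COMPARABLE_TYPES or eval_type not in COMPARABLE_TYPES:
        return False
    return human_type == eval_type


print(can_compare("numeric", "numeric"))      # True
print(can_compare("categorical", "numeric"))  # False
print(can_compare("text", "text"))            # False
```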