BrowserStack AI Evals

Evaluation

Overview of the AI Evaluation framework — datasets, experiments, evaluators, prompts, and tools.

The AI Evaluation framework gives you a structured way to measure the quality of an LLM pipeline. It is available in the TypeScript, Python, and Java SDKs.

Core Concepts

Concept          Description
Datasets         Collections of input/expected-output pairs used as test cases
Evaluators       Metrics that score LLM outputs (faithfulness, relevance, accuracy, etc.)
Evaluator Lists  Named groups of evaluators with their parameter configurations
Experiments      A dataset + prompt + evaluator list combination run to produce scores
Prompts          Version-controlled prompt templates fetched at runtime
Tools            Versioned LLM tool schemas stored in a central registry
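
To make "version-controlled templates fetched at runtime" concrete, here is a minimal in-memory sketch of the idea. It is not the SDK's API: `PromptRegistry`, `publish`, and `fetch` are hypothetical names chosen for illustration, and the real registry is a hosted service rather than a local dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Hypothetical in-memory stand-in for a versioned prompt store."""

    _versions: dict[str, list[str]] = field(default_factory=dict)

    def publish(self, name: str, template: str) -> int:
        """Store a new version of a template and return its 1-based version number."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name])

    def fetch(self, name: str, version: int | None = None) -> str:
        """Return the requested version, or the latest when no version is given."""
        versions = self._versions[name]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]


registry = PromptRegistry()
registry.publish("summarize", "Summarize: {text}")
registry.publish("summarize", "Summarize in one sentence: {text}")

latest = registry.fetch("summarize")             # newest template
pinned = registry.fetch("summarize", version=1)  # reproducible, pinned template
```

Pinning a version keeps an experiment reproducible, while fetching the latest lets application code pick up prompt changes without a redeploy.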

How They Work Together

Dataset (test cases)
    +
Prompt (versioned template)
    +
Evaluator List (metrics)
    ↓
Experiment Run
    ↓
Scores per item → aggregate results

  1. Build a dataset with representative inputs and expected outputs.
  2. Create an evaluator list selecting which metrics to apply.
  3. Create an experiment tying together a dataset, a prompt, and an evaluator list.
  4. Start a run: the platform runs each dataset item through the prompt and scores the output.
  5. Review results in the dashboard or via the SDK.
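
The steps above can be sketched locally in plain Python. This is a rough model of the flow under stated assumptions, not the platform's SDK: `DatasetItem`, `run_experiment`, the two evaluators, and `fake_model` (a canned stand-in for a real LLM call) are all hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# An evaluator maps (output, expected) to a score between 0 and 1.
Evaluator = Callable[[str, str], float]


@dataclass
class DatasetItem:
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """1.0 when the output equals the expected text, ignoring case and padding."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def fake_model(prompt: str) -> str:
    """Placeholder for a real LLM call: returns canned answers."""
    canned = {"2 + 2": "4", "capital of France": "The capital is Paris."}
    for key, answer in canned.items():
        if key in prompt:
            return answer
    return ""


def run_experiment(dataset, template, evaluators, model):
    """Run every item through the prompt, score it, then average each metric."""
    per_item = []
    for item in dataset:
        output = model(template.format(input=item.input))
        per_item.append(
            {name: fn(output, item.expected) for name, fn in evaluators.items()}
        )
    aggregate = {name: mean(row[name] for row in per_item) for name in evaluators}
    return per_item, aggregate


dataset = [
    DatasetItem("What is 2 + 2?", "4"),
    DatasetItem("What is the capital of France?", "Paris"),
]
template = "Answer concisely: {input}"
evaluators = {"exact_match": exact_match, "contains_expected": contains_expected}

per_item, aggregate = run_experiment(dataset, template, evaluators, fake_model)
print(aggregate)
```

The second item shows why an evaluator list usually holds more than one metric: the answer is correct but verbose, so it fails the exact match and passes the containment check.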

You can also run evaluations inline (without an experiment) to spot-check individual LLM responses during development.
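
As an illustration of an inline spot check, the sketch below scores one response against one expected answer with no dataset or experiment involved. The metric is a simple token-recall heuristic swapped in for illustration, not one of the platform's evaluators, and `token_recall`, `spot_check`, and the 0.8 threshold are assumptions.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_recall(response: str, expected: str) -> float:
    """Fraction of the expected answer's tokens that appear in the response."""
    expected_tokens = _tokens(expected)
    if not expected_tokens:
        return 1.0  # nothing to recall, so trivially satisfied
    return len(expected_tokens & _tokens(response)) / len(expected_tokens)


def spot_check(response: str, expected: str, threshold: float = 0.8) -> dict:
    """Score a single response and flag whether it clears the threshold."""
    score = token_recall(response, expected)
    return {"score": score, "passed": score >= threshold}


good = spot_check("Paris is the capital of France.", "The capital of France is Paris")
bad = spot_check("I am not sure.", "Paris")
```

A check like this fits naturally in a unit test or a debugging session, where setting up a full experiment would be more than the situation needs.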

Pages in This Section