Evaluation
Overview of the AI Evaluation framework — datasets, experiments, evaluators, prompts, and tools.
The AI Evaluation framework gives you a structured way to measure LLM pipeline quality. It is available in the TypeScript, Python, and Java SDKs.
Core Concepts
| Concept | Description |
|---|---|
| Datasets | Collections of input/expected-output pairs used as test cases |
| Evaluators | Metrics that score LLM outputs (faithfulness, relevance, accuracy, etc.) |
| Evaluator Lists | Named groups of evaluators with their parameter configurations |
| Experiments | A combination of a dataset, a prompt, and an evaluator list that is run to produce scores |
| Prompts | Version-controlled prompt templates fetched at runtime |
| Tools | Versioned LLM tool schemas stored in a central registry |
How They Work Together
Dataset (test cases)
+
Prompt (versioned template)
+
Evaluator List (metrics)
↓
Experiment Run
↓
Scores per item → aggregate results
- Build a dataset with representative inputs and expected outputs.
- Create an evaluator list selecting which metrics to apply.
- Create an experiment tying together a dataset, a prompt, and an evaluator list.
- Start a run: the platform runs each dataset item through the prompt and scores the output with the evaluator list.
- Review results in the dashboard or via the SDK.
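The flow in the steps above can be modeled in a few lines of plain Python. This is a conceptual sketch only, not the SDK's API: `DatasetItem`, `run_experiment`, `exact_match`, `contains_expected`, and `fake_llm` are hypothetical names, the model call is a stub so the example runs offline, and the `{{question}}` placeholder is filled with a simple string replace rather than a real template engine.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types that mirror the concepts on this page.
# They are NOT the SDK's classes or method names.


@dataclass
class DatasetItem:
    input: str
    expected_output: str


# An evaluator maps (output, expected_output) to a score in [0, 1].
Evaluator = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """1.0 when output equals the expected text, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected.strip().lower() in output.lower() else 0.0


def run_experiment(
    dataset: List[DatasetItem],
    prompt_template: str,
    evaluators: Dict[str, Evaluator],
    llm: Callable[[str], str],
) -> dict:
    """Run every dataset item through the prompt, score it, then aggregate."""
    per_item = []
    for item in dataset:
        prompt = prompt_template.replace("{{question}}", item.input)
        output = llm(prompt)
        scores = {
            name: fn(output, item.expected_output)
            for name, fn in evaluators.items()
        }
        per_item.append({"input": item.input, "output": output, "scores": scores})
    # Aggregate result: mean score per evaluator across all items.
    aggregate = {
        name: mean(row["scores"][name] for row in per_item) for name in evaluators
    }
    return {"items": per_item, "aggregate": aggregate}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without network access."""
    canned = {"capital of France": "Paris", "2 + 2": "The answer is 4"}
    for key, answer in canned.items():
        if key in prompt:
            return answer
    return "I don't know"


dataset = [
    DatasetItem("What is the capital of France?", "Paris"),
    DatasetItem("What is 2 + 2?", "4"),
]
evaluator_list = {"exact_match": exact_match, "contains_expected": contains_expected}

result = run_experiment(dataset, "Answer briefly: {{question}}", evaluator_list, fake_llm)
print(result["aggregate"])
```

The second item shows why an evaluator list usually holds more than one metric: "The answer is 4" fails a strict exact match but passes the containment check, so the two aggregate scores differ.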
You can also run evaluations inline (without an experiment) to spot-check individual LLM responses during development.
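As a rough illustration of an inline spot check, the sketch below scores a single response against a couple of ad hoc metrics, with no dataset or experiment involved. The function names (`evaluate_inline`, `keyword_coverage`) and the metrics themselves are assumptions made for this example; the SDK's actual inline evaluation call is documented on the Evaluators page.

```python
from typing import Callable, Dict, Sequence


def keyword_coverage(output: str, required_keywords: Sequence[str]) -> float:
    """Fraction of required keywords found in the output (case-insensitive)."""
    if not required_keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)


def evaluate_inline(
    output: str, evaluators: Dict[str, Callable[[str], float]]
) -> Dict[str, float]:
    """Score one response with each evaluator; hypothetical helper, not an SDK call."""
    return {name: fn(output) for name, fn in evaluators.items()}


response = "Refunds are processed within 5 business days to the original payment method."

scores = evaluate_inline(
    response,
    {
        # Does the answer mention everything the support policy requires?
        "keyword_coverage": lambda out: keyword_coverage(
            out, ["refund", "business days", "email"]
        ),
        # Simple length guard: 1.0 when the answer is 30 words or fewer.
        "is_concise": lambda out: 1.0 if len(out.split()) <= 30 else 0.0,
    },
)
print(scores)
```

Here the response misses the "email" keyword, so `keyword_coverage` comes out at two thirds while `is_concise` passes. A quick check like this is useful while iterating on a prompt, before committing to a full experiment run.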
Pages in This Section
Datasets
Create datasets, add test items, and track dataset runs.
Experiments
Create experiments, start runs, and poll for results.
Evaluators
Build evaluator lists and run inline evaluations.
Online Evaluations
Continuously score live production traces with evaluator lists.
Automation Rules
Route traces to annotation queues or datasets based on conditions.
Human Review
Set up annotation queues, score items, and share queues with external reviewers.
Benchmarking & Alignment
Measure evaluator accuracy against human ground truth and calibrate through the alignment workflow.
Playground
Interactively test models, prompts, and tool calls without writing code.
Evaluator Types
Reference for LLM-as-a-Judge, code-based, aggregate, and tool call evaluators.
Prompts
Fetch versioned prompts and compile Mustache templates.
Tools
Register and compile LLM tool schemas across providers.