Overview of the AI Evaluation framework — datasets, experiments, evaluators, prompts, and tools.

Evaluation

The AI Evaluation framework gives you a structured way to measure the quality of LLM pipelines. It is available in the TypeScript, Python, and Java SDKs.

Core Concepts

Concept	Description
Datasets	Collections of input/expected-output pairs used as test cases
Evaluators	Metrics that score LLM outputs (faithfulness, relevance, accuracy, etc.)
Evaluator Lists	Named groups of evaluators with their parameter configurations
Experiments	A combination of a dataset, a prompt, and an evaluator list that is run to produce scores
Prompts	Version-controlled prompt templates fetched at runtime
Tools	Versioned LLM tool schemas stored in a central registry
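
To make the relationships in the table concrete, here is a minimal sketch of the concepts as plain Python dataclasses. Every class and field name below is an assumption chosen for illustration; none of them is taken from the actual SDKs, whose types may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical data model: these names illustrate the concepts only
# and are NOT the SDK's real classes.


@dataclass
class DatasetItem:
    """One test case: an input and the output expected for it."""
    input: str
    expected_output: str


@dataclass
class Dataset:
    name: str
    items: list[DatasetItem] = field(default_factory=list)


@dataclass
class EvaluatorConfig:
    """A metric name plus the parameters it should run with."""
    metric: str
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class EvaluatorList:
    name: str
    evaluators: list[EvaluatorConfig] = field(default_factory=list)


@dataclass
class Prompt:
    """A prompt template pinned to a specific version."""
    name: str
    version: int
    template: str


@dataclass
class Experiment:
    """Ties one dataset, one prompt version, and one evaluator list together."""
    dataset: Dataset
    prompt: Prompt
    evaluator_list: EvaluatorList


experiment = Experiment(
    dataset=Dataset("capital-cities", [DatasetItem("Capital of France?", "Paris")]),
    prompt=Prompt(name="qa-basic", version=2, template="Answer briefly: {question}"),
    evaluator_list=EvaluatorList(
        "qa-metrics", [EvaluatorConfig("accuracy", {"threshold": 0.8})]
    ),
)
```

Tools are left out of this sketch; under the same assumptions they would be another versioned record alongside `Prompt`.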

Dataset (test cases)
    +
Prompt (versioned template)
    +
Evaluator List (metrics)
    ↓
Experiment Run
    ↓
Scores per item → aggregate results

1. Build a dataset with representative inputs and expected outputs.
2. Create an evaluator list selecting which metrics to apply.
3. Create an experiment tying together a dataset, a prompt, and an evaluator list.
4. Start a run: the platform executes each dataset item through the prompt and scores the output.
5. Review results in the dashboard or via the SDK.
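
The steps above can be simulated locally to show what a run computes. This is a sketch under stated assumptions, not the platform's implementation: `run_experiment`, `fake_model`, and both evaluators are hypothetical names, and the model is a lookup table standing in for a real LLM call.

```python
from statistics import mean

# All names here are hypothetical stand-ins, not SDK functions.


def exact_match(output: str, expected: str) -> float:
    """1.0 when output and expected are equal, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """1.0 when the expected answer appears anywhere in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def fake_model(prompt: str) -> str:
    """Placeholder for an LLM call: looks the answer up in a tiny table."""
    answers = {"France": "Paris", "Japan": "The capital is Tokyo."}
    for country, answer in answers.items():
        if country in prompt:
            return answer
    return "I don't know."


def run_experiment(dataset, template, evaluators, model):
    """Run every dataset item through the prompt, then score each output."""
    per_item = []
    for item in dataset:
        output = model(template.format(**item["input"]))
        scores = {name: fn(output, item["expected"]) for name, fn in evaluators.items()}
        per_item.append({"output": output, "scores": scores})
    # Aggregate by averaging each metric over all items.
    aggregate = {
        name: mean(row["scores"][name] for row in per_item) for name in evaluators
    }
    return per_item, aggregate


# Step 1: dataset of inputs and expected outputs.
dataset = [
    {"input": {"country": "France"}, "expected": "Paris"},
    {"input": {"country": "Japan"}, "expected": "Tokyo"},
    {"input": {"country": "Peru"}, "expected": "Lima"},
]
# Step 2: evaluator list. Step 3: combine it with a prompt template.
evaluators = {"exact_match": exact_match, "contains_expected": contains_expected}
template = "What is the capital of {country}?"

# Step 4: start the run. Step 5: inspect per-item and aggregate scores.
per_item, aggregate = run_experiment(dataset, template, evaluators, fake_model)
```

In this toy run the two metrics disagree on the Japan item, where the answer is correct but wrapped in a sentence. That is the kind of difference per-item scores make visible before you read the aggregate.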

You can also run evaluations inline (without an experiment) to spot-check individual LLM responses during development.
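
As a rough illustration of an inline spot check, the sketch below scores a single response against a reference without any dataset or experiment. `evaluate_inline` and `token_overlap` are hypothetical helpers written for this example; the SDK's inline evaluation call and its built-in metrics are not shown here and may differ.

```python
# Hypothetical helpers for illustration; not the SDK's inline evaluation API.


def token_overlap(output: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the output.

    A crude relevance proxy: whitespace tokenization, case-insensitive.
    """
    ref_tokens = set(reference.lower().split())
    out_tokens = set(output.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & out_tokens) / len(ref_tokens)


def evaluate_inline(output, reference, evaluators):
    """Score one response with each evaluator and return a name -> score map."""
    return {name: fn(output, reference) for name, fn in evaluators.items()}


scores = evaluate_inline(
    output="The capital is Paris",
    reference="Paris is the capital of France",
    evaluators={"token_overlap": token_overlap},
)
print(scores)
```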