Experiments
Systematically evaluate AI pipeline outputs against a dataset using prompts, APIs, or existing runs.
Experiments let you systematically evaluate the outputs of an AI pipeline against a dataset, using one or more evaluators. Each run produces scored results you can compare and analyze in the dashboard.
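To make the scoring model concrete, here is a minimal local sketch of what an experiment does conceptually: each output is paired with its dataset item and scored by every evaluator, and the per-evaluator scores are then averaged. The item fields (`input`, `expected`), the two evaluators, and `run_experiment` are hypothetical names chosen for illustration; they are not the platform's schema or built-in evaluators.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Assumption: an evaluator maps (output, expected) to a score in [0, 1].
Evaluator = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 only when the output equals the expected text."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """Score 1.0 when the expected text appears anywhere in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_experiment(
    dataset: List[dict],
    outputs: List[str],
    evaluators: Dict[str, Evaluator],
) -> Tuple[List[dict], Dict[str, float]]:
    """Score every output with every evaluator and average per evaluator."""
    rows = []
    for item, output in zip(dataset, outputs):
        scores = {
            name: fn(output, item["expected"]) for name, fn in evaluators.items()
        }
        rows.append({"input": item["input"], "output": output, "scores": scores})
    summary = {name: mean(row["scores"][name] for row in rows) for name in evaluators}
    return rows, summary


# Hypothetical two-item dataset and the outputs a pipeline produced for it.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "Capital of France", "expected": "Paris"},
]
outputs = ["4", "The capital is Paris."]
rows, summary = run_experiment(
    dataset,
    outputs,
    {"exact_match": exact_match, "contains_expected": contains_expected},
)
```

In this sketch `summary` holds one averaged score per evaluator, which is roughly the kind of per-run aggregate you would compare across experiments.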
How to Provide Outputs
Every experiment needs outputs to score. The platform supports three ways to provide them:
- Prompt + Dataset — the platform runs the given prompt against each dataset item to generate outputs, then evaluates them. Use when you want to test a prompt end-to-end.
- API (Dashboard only) — configure an HTTP endpoint that receives each dataset item and returns an output. The platform calls your API, collects the response, and evaluates it. Use when your AI pipeline isn't just a single prompt (e.g., RAG, agents, custom workflows).
- Dataset Run Tag — point the experiment at an existing tagged dataset run that already has outputs (generated from your own code or trace pipeline). Use when you've already run the pipeline and just want to evaluate the results.
All three options use the same evaluator lists and produce results in the same dashboard view.
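For the API option, the endpoint you configure could look something like the sketch below, written with Python's standard library only. The request and response shapes are assumptions: this page does not specify the payload, so the `input` and `output` field names, the POST method, and the `generate_output` helper are hypothetical placeholders to adapt to whatever the dashboard configuration actually sends and expects.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate_output(item: dict) -> dict:
    """Placeholder for your pipeline (RAG, agent, custom workflow).

    Assumption: the dataset item carries its text under "input" and the
    platform reads the result from "output".
    """
    question = str(item.get("input", ""))
    return {"output": f"echo: {question}"}


class ExperimentHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read and parse the JSON body sent for one dataset item.
        length = int(self.headers.get("Content-Length", 0))
        try:
            item = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "Request body must be JSON")
            return
        if not isinstance(item, dict):
            self.send_error(400, "Request body must be a JSON object")
            return

        # Run the pipeline and return its output as JSON.
        body = json.dumps(generate_output(item)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the server; this call blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ExperimentHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. Keeping the pipeline logic in `generate_output` means you can unit-test it without starting a server.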
Where to Create Experiments
- Dashboard UI — configure and launch from Evaluation > Experiments. Supports all three output options above, including the API-based experiment.
- SDK — TypeScript, Python, and Java SDKs support Prompt + Dataset and Dataset Run Tag via client.experiments.create() / client.experimentRuns.create().
- REST API — direct HTTP access to the same endpoints. See the Experiments API reference.
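The support matrix described above can be summarized in a small lookup, which may help when deciding where to create a given experiment. This is only an illustrative sketch: the enum and function names are hypothetical, and the REST API row is an inference from "the same endpoints" as the SDK rather than something this page states explicitly.

```python
from enum import Enum
from typing import List


class OutputSource(Enum):
    PROMPT_DATASET = "Prompt + Dataset"
    API = "API"
    DATASET_RUN_TAG = "Dataset Run Tag"


class Interface(Enum):
    DASHBOARD = "Dashboard UI"
    SDK = "SDK"
    REST_API = "REST API"


SUPPORTED = {
    # The dashboard supports all three output options.
    Interface.DASHBOARD: {
        OutputSource.PROMPT_DATASET,
        OutputSource.API,
        OutputSource.DATASET_RUN_TAG,
    },
    # The SDKs support two of them; the API option is dashboard only.
    Interface.SDK: {OutputSource.PROMPT_DATASET, OutputSource.DATASET_RUN_TAG},
    # Assumption: the REST API mirrors the SDK because it exposes the same endpoints.
    Interface.REST_API: {OutputSource.PROMPT_DATASET, OutputSource.DATASET_RUN_TAG},
}


def interfaces_for(source: OutputSource) -> List[Interface]:
    """Return the interfaces that can create an experiment with this output source."""
    return [iface for iface in Interface if source in SUPPORTED[iface]]
```

For example, `interfaces_for(OutputSource.API)` returns only the dashboard, while the other two sources can be created from any interface.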