The key concepts behind BrowserStack AI Evals — traces, observations, evaluators, datasets, and more.

Core Concepts

This page explains the data model behind BrowserStack AI Evals. Understanding these concepts will help you instrument your application effectively and make the most of the platform's evaluation and experimentation features.

Concept Map

Session
└── Trace (one per request / agent run)
    ├── Observation: Span (non-LLM step)
    │   └── Observation: Generation (LLM call)
    │       ├── Input messages + model params
    │       ├── Output message
    │       ├── Token usage + cost + latency
    │       ├── Tool calls (if any)
    │       └── Score(s)
    └── Score(s)

Evaluator ──── scores ──────► Trace / Observation
Prompt    ──── used by ─────► Generation
Dataset   ──── used by ─────► Experiment
Experiment ─── runs ────────► Traces → Scores

Traces

A trace is a complete record of a single end-to-end request through your AI application. Every user interaction, background job, or test case execution produces one trace.

Traces capture:

The root input (e.g., the user's question)
The final output (e.g., the assistant's response)
A tree of all intermediate observations (LLM calls, retrievals, processing steps)
Timestamps, latency, metadata, and custom attributes

Traces are the primary unit you work with in the dashboard. You can search, filter, score, and annotate them.
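
To make the shape of a trace concrete, here is a minimal sketch in plain Python. The `Trace` and `Observation` classes, their field names, and the `count_observations` helper are illustrative assumptions that mirror the list above; they are not the BrowserStack AI Evals SDK's actual types.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Observation:
    """Hypothetical stand-in for one step inside a trace."""
    name: str
    kind: str  # "generation", "span", or "event"
    children: list[Observation] = field(default_factory=list)


@dataclass
class Trace:
    """Hypothetical stand-in for one end-to-end request."""
    input: Any
    output: Any = None
    observations: list[Observation] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def count_observations(nodes: list[Observation]) -> int:
    """Count every observation in the tree, including nested children."""
    return sum(1 + count_observations(node.children) for node in nodes)


# A retrieval span with one nested LLM call, as in the concept map.
retrieval = Observation("retrieve-docs", "span")
retrieval.children.append(Observation("answer-question", "generation"))

trace = Trace(input="What is our refund policy?", observations=[retrieval])
trace.output = "Refunds are available within 30 days."
```

Here `count_observations(trace.observations)` walks the whole tree, so nested steps are counted along with top-level ones.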

Observations

An observation is a single step within a trace. Observations form a parent-child tree that mirrors the execution flow of your application. There are three types:

Generations

A generation is an LLM call. It is the most important observation type.

A generation captures:

input — the messages or prompt sent to the model
output — the response from the model
model — the model name (e.g., gpt-4o, claude-3-5-sonnet)
usage — token counts (prompt, completion, total)
cost — estimated cost in USD
latency — time to first token and total duration
model_parameters — temperature, max_tokens, etc.
tool_calls — any tool/function calls made during the response
prompt_name / prompt_version — if a managed prompt was used
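
As an illustration of how the usage and cost fields relate, the sketch below derives a total token count and an estimated cost from per-token prices. The `Usage` class, the `estimate_cost` helper, and the prices are made-up values for this example; the platform's own cost calculation may work differently.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical token-usage record for one generation."""
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def estimate_cost(usage: Usage, prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Estimate cost in USD from token counts and per-1,000-token prices."""
    prompt_cost = usage.prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = usage.completion_tokens / 1000 * completion_price_per_1k
    return round(prompt_cost + completion_cost, 6)


usage = Usage(prompt_tokens=1200, completion_tokens=300)
# Illustrative prices only, not real vendor pricing.
cost = estimate_cost(usage, prompt_price_per_1k=0.005, completion_price_per_1k=0.015)
```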

Spans

A span is a timed, non-LLM operation. Use spans to instrument any step you want to track — vector database lookups, document retrieval, pre/post-processing, external API calls, or custom business logic.

Spans have a name, start_time, end_time, input, output, and optional metadata. They can be nested to represent sub-steps.
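
One way to picture a span is as a context manager that stamps a start and end time around a block of code. The `span` helper and the `recorded_spans` list below are hypothetical and written only for illustration; the real SDK may expose a different interface.

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from typing import Any, Iterator

recorded_spans: list[dict[str, Any]] = []


@contextmanager
def span(name: str, **metadata: Any) -> Iterator[dict[str, Any]]:
    """Record a timed, non-LLM step (hypothetical helper, not an SDK call)."""
    record: dict[str, Any] = {"name": name, "metadata": metadata, "start_time": time.time()}
    try:
        yield record
    finally:
        # end_time is set even if the wrapped step raises.
        record["end_time"] = time.time()
        recorded_spans.append(record)


with span("vector-search", index="docs") as s:
    s["input"] = "refund policy"
    s["output"] = ["doc-12", "doc-47"]  # pretend retrieval result
```

Because the end time is written in a `finally` clause, a step that fails still produces a complete span record.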

Events

An event is a point-in-time marker within a trace. Unlike spans, events have no duration — they record that something happened at a specific moment.

Use events for things like: user messages arriving, cache hits, errors, decision points, or any discrete occurrence worth recording.

Sessions

A session groups multiple traces that belong to the same logical interaction — for example, all turns of a multi-turn conversation with a chatbot.

Sessions let you:

View the full conversation history in the dashboard
Evaluate conversational quality (coherence, context retention)
Track per-session metrics like total cost and message count

A trace is associated with a session by passing a session_id when creating it.
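
The per-session metrics mentioned above amount to a group-by on session_id. A rough sketch, assuming each trace record carries a `session_id` and a `cost` field (the record shape and the `session_metrics` helper are hypothetical):

```python
from collections import defaultdict

# Hypothetical trace records; only the fields needed for the rollup are shown.
traces = [
    {"session_id": "chat-1", "cost": 0.004},
    {"session_id": "chat-1", "cost": 0.006},
    {"session_id": "chat-2", "cost": 0.010},
]


def session_metrics(trace_records):
    """Group traces by session_id and total their cost and count."""
    totals = defaultdict(lambda: {"trace_count": 0, "total_cost": 0.0})
    for record in trace_records:
        entry = totals[record["session_id"]]
        entry["trace_count"] += 1
        entry["total_cost"] += record["cost"]
    return dict(totals)


metrics = session_metrics(traces)
```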

Scores

A score is a quality assessment attached to a trace or an individual observation. Scores have:

name — the metric being measured (e.g., faithfulness, toxicity, correctness)
value — a numeric value (typically 0–1) or a categorical label
source — who or what produced the score: evaluator, human, or api
comment — optional free-text explanation

Scores are how evaluation results are stored and surfaced. They appear on the trace detail page and are aggregated in experiment results.
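
The four score fields map naturally onto a small record type. The `Score` class below is a sketch for illustration, not the platform's schema; the validation of `source` against the three values listed above is an assumption about how strictly that field is enforced.

```python
from dataclasses import dataclass
from typing import Optional, Union

ALLOWED_SOURCES = {"evaluator", "human", "api"}


@dataclass(frozen=True)
class Score:
    """Hypothetical score record mirroring the fields listed above."""
    name: str
    value: Union[float, str]  # numeric value or categorical label
    source: str
    comment: Optional[str] = None

    def __post_init__(self) -> None:
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")


numeric = Score(name="faithfulness", value=0.92, source="evaluator")
label = Score(name="tone", value="polite", source="human", comment="Friendly and concise.")
```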

Evaluators

An evaluator is the logic that produces scores. Evaluators run against traces and observations and attach score records to them.

BrowserStack AI Evals supports several evaluator types:

LLM-as-judge: uses a language model to evaluate output quality against a rubric. Suitable for open-ended outputs where rule-based checks are insufficient. You define the rubric; the platform handles the scoring prompt and parsing.

RAGAS metrics: pre-built RAG evaluation metrics from the RAGAS framework, including:

faithfulness — does the answer only use information from the retrieved context?
answer_relevancy — is the answer relevant to the question?
context_precision — is the context actually useful?
context_recall — does the context contain the needed information?

Code-based (rule-based): custom Python functions that take the trace input/output and return a score. Use these for deterministic checks: JSON schema validation, keyword presence, regex matching, format verification, or any business-rule assertion.

Human review: routes traces to a review queue where annotators can score responses manually. Human scores feed into the same scoring system as automated evaluators.

Evaluators can run in two modes:

Online — automatically scores new traces as they arrive (configured per project)
Offline — scores a dataset or a specific set of traces on demand (used in experiments)
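
A code-based evaluator of the kind described above can be sketched as an ordinary Python function. The function name, its signature, and the shape of the returned dictionary are assumptions made for this example; how such a function is registered with the platform is not shown here.

```python
import json


def valid_json_with_keys(output: str, required_keys: set) -> dict:
    """Score 1.0 if output is a JSON object containing every required key, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"name": "json_schema", "value": 0.0, "comment": f"invalid JSON: {exc.msg}"}
    if not isinstance(parsed, dict):
        return {"name": "json_schema", "value": 0.0, "comment": "top-level value is not an object"}
    missing = sorted(required_keys - set(parsed))
    if missing:
        return {"name": "json_schema", "value": 0.0, "comment": f"missing keys: {missing}"}
    return {"name": "json_schema", "value": 1.0, "comment": "ok"}


good = valid_json_with_keys('{"answer": "42", "sources": []}', {"answer", "sources"})
bad = valid_json_with_keys('{"answer": "42"}', {"answer", "sources"})
```

Returning a comment alongside the value makes a failing score easier to diagnose when you review the trace later.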

Datasets

A dataset is an ordered collection of test cases. Each test case (called a dataset item) typically contains:

input — the prompt or inputs to send to your AI application
expected_output — the ideal response (used as a reference for evaluators)
metadata — optional labels, tags, or context

Datasets are used as the input for experiments. They also serve as regression suites — run the same dataset before and after a change to measure the impact.

Start with a dataset of 20–50 representative examples. Focus on edge cases, hard queries, and known failure modes rather than easy "happy path" examples.
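
For a sense of what dataset items look like, here is a small sketch using the three fields described above. The example items are invented, and storing them as JSON Lines is an assumption chosen for illustration rather than a format the platform requires.

```python
import json

# Hypothetical dataset items using the three fields described above.
dataset = [
    {
        "input": "How do I reset my password?",
        "expected_output": "Use the 'Forgot password' link on the sign-in page.",
        "metadata": {"category": "account", "difficulty": "easy"},
    },
    {
        "input": "Reset pasword plz!!",  # edge case: typo and terse phrasing
        "expected_output": "Use the 'Forgot password' link on the sign-in page.",
        "metadata": {"category": "account", "difficulty": "hard"},
    },
]


def to_jsonl(items):
    """Serialize items as one JSON object per line, preserving order."""
    return "\n".join(json.dumps(item, sort_keys=True) for item in items)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of items."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


roundtrip = from_jsonl(to_jsonl(dataset))
```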

Experiments

An experiment runs a dataset through your AI pipeline and scores every output. Use experiments to make data-driven decisions about prompt changes, model upgrades, or architectural changes.

An experiment defines:

Dataset — the test cases to run
Pipeline — the function or endpoint to call for each test case
Evaluators — which scoring logic to apply to each output

After an experiment completes, you can:

Compare scores across runs side-by-side
View per-item breakdowns to find which test cases improved or regressed
Export results for further analysis

Run a baseline experiment before making any changes. This gives you a benchmark to compare against.
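
Conceptually, an experiment is a loop over dataset items followed by an aggregation of scores. The sketch below runs a baseline and a candidate pipeline over the same toy dataset; `run_experiment`, `exact_match`, and both pipelines are hypothetical stand-ins, not SDK functions.

```python
from statistics import mean


def exact_match(output, expected):
    """Toy evaluator: 1.0 when the output matches the reference exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(dataset, pipeline, evaluators):
    """Run each item through the pipeline and apply every evaluator."""
    rows = []
    for item in dataset:
        output = pipeline(item["input"])
        scores = {name: fn(output, item["expected_output"]) for name, fn in evaluators.items()}
        rows.append({"input": item["input"], "output": output, "scores": scores})
    summary = {name: mean(row["scores"][name] for row in rows) for name in evaluators}
    return rows, summary


dataset = [
    {"input": "2 + 2", "expected_output": "4"},
    {"input": "3 * 3", "expected_output": "9"},
]


def baseline(question):
    return "4"  # stand-in pipeline that always answers "4"


def candidate(question):
    answers = {"2 + 2": "4", "3 * 3": "9"}  # stand-in for an improved pipeline
    return answers.get(question, "")


evaluators = {"exact_match": exact_match}
baseline_rows, baseline_summary = run_experiment(dataset, baseline, evaluators)
candidate_rows, candidate_summary = run_experiment(dataset, candidate, evaluators)
```

Comparing `baseline_summary` with `candidate_summary` gives the side-by-side view, while the per-item rows show which test cases changed.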

Prompts

A prompt is a version-controlled template stored in BrowserStack AI Evals. Managing prompts through the platform lets you:

Track prompt changes over time with a full version history
A/B test prompt variants using experiments
Roll back to a previous version instantly
Share prompts across applications and environments

Prompts support variable substitution (e.g., {{user_name}}, {{context}}) and are fetched at runtime using the SDK.

When a managed prompt is used in a generation, the generation automatically records the prompt_name and prompt_version for traceability.
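
Variable substitution of the {{name}} form can be sketched with a regular expression. This is a standalone illustration, not the SDK's rendering code: the `render` function, its tolerance for whitespace inside the braces, and its choice to raise on a missing variable are all assumptions.

```python
import re

# Matches {{name}} and tolerates optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder; raise KeyError if a variable is missing."""
    def substitute(match):
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"missing prompt variable: {key}")
        return str(variables[key])

    return PLACEHOLDER.sub(substitute, template)


template = "Hello {{user_name}}, answer using only this context: {{context}}"
prompt = render(template, {"user_name": "Ada", "context": "Refunds take 5 days."})
```

Failing loudly on a missing variable is a design choice made here so that an incomplete prompt never reaches the model unnoticed.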

Tools

Tools (also called functions) are structured definitions that tell an LLM which external functions it can call. When the model invokes one, BrowserStack AI Evals captures the tool call details within the generation:

The tool name and input arguments
The tool response
Whether the tool call was successful

Tool call data is stored as part of the generation observation, so you can inspect exactly what the model requested and what it received.
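
As a rough picture of tool call data inside a generation, the sketch below stores each call with its arguments, response, and success flag, then lists the calls that failed. The record layout, the tool names, and the `failed_tool_calls` helper are hypothetical.

```python
# Hypothetical generation record; the exact field names are assumptions.
generation = {
    "model": "gpt-4o",
    "tool_calls": [
        {
            "name": "get_weather",
            "arguments": {"city": "Paris"},
            "response": {"temp_c": 18},
            "success": True,
        },
        {
            "name": "get_stock",
            "arguments": {"ticker": "??"},
            "response": {"error": "unknown ticker"},
            "success": False,
        },
    ],
}


def failed_tool_calls(gen):
    """Return the names of tool calls that did not succeed."""
    return [call["name"] for call in gen.get("tool_calls", []) if not call["success"]]


failures = failed_tool_calls(generation)
```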

How It All Fits Together

A typical production flow looks like this:

1. A user sends a message to your application
2. Your SDK creates a trace (and optionally associates it with a session)
3. Your code runs: retrieval becomes a span, each LLM call becomes a generation
4. The trace is sent to BrowserStack AI Evals
5. Configured evaluators run automatically and attach scores to the trace
6. You review low-scoring traces in the dashboard and add them to a dataset
7. You run an experiment with that dataset to reproduce and measure improvements
8. You update your prompt version and run the experiment again to verify the fix

This cycle — instrument, observe, evaluate, improve — is the core workflow of BrowserStack AI Evals.
