Core Concepts
The key concepts behind BrowserStack AI Evals — traces, observations, evaluators, datasets, and more.
Core Concepts
This page explains the data model behind BrowserStack AI Evals. Understanding these concepts will help you instrument your application effectively and make the most of the platform's evaluation and experimentation features.
Concept Map
Session
└── Trace (one per request / agent run)
├── Observation: Span (non-LLM step)
│ └── Observation: Generation (LLM call)
│ ├── Input messages + model params
│ ├── Output message
│ ├── Token usage + cost + latency
│ ├── Tool calls (if any)
│ └── Score(s)
└── Score(s)
Evaluator ──── scores ──────► Trace / Observation
Prompt ──── used by ─────► Generation
Dataset ──── used by ─────► Experiment
Experiment ─── runs ────────► Traces → ScoresTraces
A trace is a complete record of a single end-to-end request through your AI application. Every user interaction, background job, or test case execution produces one trace.
Traces capture:
- The root input (e.g., the user's question)
- The final output (e.g., the assistant's response)
- A tree of all intermediate observations (LLM calls, retrievals, processing steps)
- Timestamps, latency, metadata, and custom attributes
Traces are the primary unit you work with in the dashboard. You can search, filter, score, and annotate them.
Observations
An observation is a single step within a trace. Observations form a parent-child tree that mirrors the execution flow of your application. There are three types:
Generations
A generation is an LLM call. It is the most important observation type.
A generation captures:
input— the messages or prompt sent to the modeloutput— the response from the modelmodel— the model name (e.g.,gpt-4o,claude-3-5-sonnet)usage— token counts (prompt, completion, total)cost— estimated cost in USDlatency— time to first token and total durationmodel_parameters— temperature, max_tokens, etc.tool_calls— any tool/function calls made during the responseprompt_name/prompt_version— if a managed prompt was used
Spans
A span is a timed, non-LLM operation. Use spans to instrument any step you want to track — vector database lookups, document retrieval, pre/post-processing, external API calls, or custom business logic.
Spans have a name, start_time, end_time, input, output, and optional metadata. They can be nested to represent sub-steps.
Events
An event is a point-in-time marker within a trace. Unlike spans, events have no duration — they record that something happened at a specific moment.
Use events for things like: user messages arriving, cache hits, errors, decision points, or any discrete occurrence worth recording.
Sessions
A session groups multiple traces that belong to the same logical interaction — for example, all turns of a multi-turn conversation with a chatbot.
Sessions let you:
- View the full conversation history in the dashboard
- Evaluate conversational quality (coherence, context retention)
- Track per-session metrics like total cost and message count
A trace is associated with a session by passing a session_id when creating it.
Scores
A score is a quality assessment attached to a trace or an individual observation. Scores have:
name— the metric being measured (e.g.,faithfulness,toxicity,correctness)value— a numeric value (typically 0–1) or a categorical labelsource— who or what produced the score:evaluator,human, orapicomment— optional free-text explanation
Scores are how evaluation results are stored and surfaced. They appear on the trace detail page and are aggregated in experiment results.
Evaluators
An evaluator is automated scoring logic that produces scores. Evaluators run against traces and observations and attach score records.
BrowserStack AI Evals supports several evaluator types:
LLM-as-judge Uses a language model to evaluate output quality against a rubric. Suitable for open-ended questions where rule-based checks are insufficient. You define the rubric; the platform handles the scoring prompt and parsing.
RAGAS metrics Pre-built RAG evaluation metrics from the RAGAS framework, including:
faithfulness— does the answer only use information from the retrieved context?answer_relevancy— is the answer relevant to the question?context_precision— is the context actually useful?context_recall— does the context contain the needed information?
Code-based (rule-based) Custom Python functions that take the trace input/output and return a score. Use these for deterministic checks: JSON schema validation, keyword presence, regex matching, format verification, or any business-rule assertion.
Human review Routes traces to a review queue where annotators can score responses manually. Human scores feed into the same scoring system as automated evaluators.
Evaluators can run in two modes:
- Online — automatically scores new traces as they arrive (configured per project)
- Offline — scores a dataset or a specific set of traces on demand (used in experiments)
Datasets
A dataset is an ordered collection of test cases. Each test case (called a dataset item) typically contains:
input— the prompt or inputs to send to your AI applicationexpected_output— the ideal response (used as a reference for evaluators)metadata— optional labels, tags, or context
Datasets are used as the input for experiments. They also serve as regression suites — run the same dataset before and after a change to measure the impact.
Start with a dataset of 20–50 representative examples. Focus on edge cases, hard queries, and known failure modes rather than easy "happy path" examples.
Experiments
An experiment runs a dataset through your AI pipeline and scores every output. Use experiments to make data-driven decisions about prompt changes, model upgrades, or architectural changes.
An experiment defines:
- Dataset — the test cases to run
- Pipeline — the function or endpoint to call for each test case
- Evaluators — which scoring logic to apply to each output
After an experiment completes, you can:
- Compare scores across runs side-by-side
- View per-item breakdowns to find which test cases improved or regressed
- Export results for further analysis
Run a baseline experiment before making any changes. This gives you a benchmark to compare against.
Prompts
A prompt is a version-controlled template stored in BrowserStack AI Evals. Managing prompts through the platform lets you:
- Track prompt changes over time with a full version history
- A/B test prompt variants using experiments
- Roll back to a previous version instantly
- Share prompts across applications and environments
Prompts support variable substitution (e.g., {{user_name}}, {{context}}) and are fetched at runtime using the SDK.
When a managed prompt is used in a generation, the generation automatically records the prompt_name and prompt_version for traceability.
Tools
Tools (also called function calls) are structured definitions that tell an LLM what external functions it can call. BrowserStack AI Evals captures tool call details within generations:
- The tool name and input arguments
- The tool response
- Whether the tool call was successful
Tool call data is stored as part of the generation observation, so you can inspect exactly what the model requested and what it received.
How It All Fits Together
A typical production flow looks like this:
- A user sends a message to your application
- Your SDK creates a trace (and optionally associates it with a session)
- Your code runs: retrieval becomes a span, each LLM call becomes a generation
- The trace is sent to BrowserStack AI Evals
- Configured evaluators run automatically and attach scores to the trace
- You review low-scoring traces in the dashboard and add them to a dataset
- You run an experiment with that dataset to reproduce and measure improvements
- You update your prompt version and run the experiment again to verify the fix
This cycle — instrument, observe, evaluate, improve — is the core workflow of BrowserStack AI Evals.