BrowserStack AI Evals
Getting Started

Core Concepts

The key concepts behind BrowserStack AI Evals — traces, observations, evaluators, datasets, and more.

Core Concepts

This page explains the data model behind BrowserStack AI Evals. Understanding these concepts will help you instrument your application effectively and make the most of the platform's evaluation and experimentation features.

Concept Map

Session
└── Trace (one per request / agent run)
    ├── Observation: Span (non-LLM step)
    │   └── Observation: Generation (LLM call)
    │       ├── Input messages + model params
    │       ├── Output message
    │       ├── Token usage + cost + latency
    │       ├── Tool calls (if any)
    │       └── Score(s)
    └── Score(s)

Evaluator ──── scores ──────► Trace / Observation
Prompt    ──── used by ─────► Generation
Dataset   ──── used by ─────► Experiment
Experiment ─── runs ────────► Traces → Scores

Traces

A trace is a complete record of a single end-to-end request through your AI application. Every user interaction, background job, or test case execution produces one trace.

Traces capture:

  • The root input (e.g., the user's question)
  • The final output (e.g., the assistant's response)
  • A tree of all intermediate observations (LLM calls, retrievals, processing steps)
  • Timestamps, latency, metadata, and custom attributes

Traces are the primary unit you work with in the dashboard. You can search, filter, score, and annotate them.


Observations

An observation is a single step within a trace. Observations form a parent-child tree that mirrors the execution flow of your application. There are three types:

Generations

A generation is an LLM call. It is the most important observation type.

A generation captures:

  • input — the messages or prompt sent to the model
  • output — the response from the model
  • model — the model name (e.g., gpt-4o, claude-3-5-sonnet)
  • usage — token counts (prompt, completion, total)
  • cost — estimated cost in USD
  • latency — time to first token and total duration
  • model_parameters — temperature, max_tokens, etc.
  • tool_calls — any tool/function calls made during the response
  • prompt_name / prompt_version — if a managed prompt was used

Spans

A span is a timed, non-LLM operation. Use spans to instrument any step you want to track — vector database lookups, document retrieval, pre/post-processing, external API calls, or custom business logic.

Spans have a name, start_time, end_time, input, output, and optional metadata. They can be nested to represent sub-steps.

Events

An event is a point-in-time marker within a trace. Unlike spans, events have no duration — they record that something happened at a specific moment.

Use events for things like: user messages arriving, cache hits, errors, decision points, or any discrete occurrence worth recording.


Sessions

A session groups multiple traces that belong to the same logical interaction — for example, all turns of a multi-turn conversation with a chatbot.

Sessions let you:

  • View the full conversation history in the dashboard
  • Evaluate conversational quality (coherence, context retention)
  • Track per-session metrics like total cost and message count

A trace is associated with a session by passing a session_id when creating it.


Scores

A score is a quality assessment attached to a trace or an individual observation. Scores have:

  • name — the metric being measured (e.g., faithfulness, toxicity, correctness)
  • value — a numeric value (typically 0–1) or a categorical label
  • source — who or what produced the score: evaluator, human, or api
  • comment — optional free-text explanation

Scores are how evaluation results are stored and surfaced. They appear on the trace detail page and are aggregated in experiment results.


Evaluators

An evaluator is automated scoring logic that produces scores. Evaluators run against traces and observations and attach score records.

BrowserStack AI Evals supports several evaluator types:

LLM-as-judge Uses a language model to evaluate output quality against a rubric. Suitable for open-ended questions where rule-based checks are insufficient. You define the rubric; the platform handles the scoring prompt and parsing.

RAGAS metrics Pre-built RAG evaluation metrics from the RAGAS framework, including:

  • faithfulness — does the answer only use information from the retrieved context?
  • answer_relevancy — is the answer relevant to the question?
  • context_precision — is the context actually useful?
  • context_recall — does the context contain the needed information?

Code-based (rule-based) Custom Python functions that take the trace input/output and return a score. Use these for deterministic checks: JSON schema validation, keyword presence, regex matching, format verification, or any business-rule assertion.

Human review Routes traces to a review queue where annotators can score responses manually. Human scores feed into the same scoring system as automated evaluators.

Evaluators can run in two modes:

  • Online — automatically scores new traces as they arrive (configured per project)
  • Offline — scores a dataset or a specific set of traces on demand (used in experiments)

Datasets

A dataset is an ordered collection of test cases. Each test case (called a dataset item) typically contains:

  • input — the prompt or inputs to send to your AI application
  • expected_output — the ideal response (used as a reference for evaluators)
  • metadata — optional labels, tags, or context

Datasets are used as the input for experiments. They also serve as regression suites — run the same dataset before and after a change to measure the impact.

Start with a dataset of 20–50 representative examples. Focus on edge cases, hard queries, and known failure modes rather than easy "happy path" examples.


Experiments

An experiment runs a dataset through your AI pipeline and scores every output. Use experiments to make data-driven decisions about prompt changes, model upgrades, or architectural changes.

An experiment defines:

  • Dataset — the test cases to run
  • Pipeline — the function or endpoint to call for each test case
  • Evaluators — which scoring logic to apply to each output

After an experiment completes, you can:

  • Compare scores across runs side-by-side
  • View per-item breakdowns to find which test cases improved or regressed
  • Export results for further analysis

Run a baseline experiment before making any changes. This gives you a benchmark to compare against.


Prompts

A prompt is a version-controlled template stored in BrowserStack AI Evals. Managing prompts through the platform lets you:

  • Track prompt changes over time with a full version history
  • A/B test prompt variants using experiments
  • Roll back to a previous version instantly
  • Share prompts across applications and environments

Prompts support variable substitution (e.g., {{user_name}}, {{context}}) and are fetched at runtime using the SDK.

When a managed prompt is used in a generation, the generation automatically records the prompt_name and prompt_version for traceability.


Tools

Tools (also called function calls) are structured definitions that tell an LLM what external functions it can call. BrowserStack AI Evals captures tool call details as a tool span:

  • The tool name and input arguments
  • The tool response
  • Whether the tool call was successful

Tool call data is stored as tool span, so you can inspect exactly which tool was called, along with input arguments and output .


How It All Fits Together

A typical production flow looks like this:

  1. A user sends a message to your application
  2. Your SDK creates a trace (and optionally associates it with a session)
  3. Your code runs: retrieval becomes a span, each LLM call becomes a generation
  4. The trace is sent to BrowserStack AI Evals
  5. Configured evaluators run automatically and attach scores to the trace
  6. You review low-scoring traces in the dashboard and add them to a dataset
  7. You run an experiment with that dataset to reproduce and measure improvements
  8. You update your prompt version and run the experiment again to verify the fix

This cycle — instrument, observe, evaluate, improve — is the core workflow of BrowserStack AI Evals.