Manual Tracing

Create traces, spans, generations, events, and scores to instrument any AI pipeline.

Manual tracing lets you instrument any code path — whether or not it uses a supported LLM provider. Use it to structure multi-step pipelines, capture inputs and outputs, and add evaluation scores.

Hierarchy

Trace
├── Generation  (LLM call)
├── Span        (sub-step: retrieval, tool call, etc.)
│   └── Event   (point-in-time observation)
└── Score       (evaluation result)
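
The hierarchy can be pictured as a small tree of records. The sketch below models it with a plain Python dataclass; the `Node` class, its fields, and the `render` helper are illustrative assumptions, not the SDK's own types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One entry in the tree: a trace, generation, span, event, or score."""
    kind: str
    name: str
    children: List["Node"] = field(default_factory=list)

    def add(self, kind: str, name: str) -> "Node":
        """Attach a child node and return it so calls can be chained."""
        child = Node(kind, name)
        self.children.append(child)
        return child


def render(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, parents before children."""
    lines = ["  " * depth + f"{node.kind}: {node.name}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Mirror the diagram: a trace with a generation, a span holding an event, and a score.
trace = Node("Trace", "qa-pipeline")
trace.add("Generation", "openai-call")
span = trace.add("Span", "vector-search")
span.add("Event", "cache-miss")
trace.add("Score", "relevance")

print("\n".join(render(trace)))
```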

Creating a Trace

A trace is the top-level container for a single request or pipeline run.

Node.js:

import { AISDK } from '@browserstack/ai-sdk';

const testOps = new AISDK({
  publicKey: process.env.AISDK_PUBLIC_KEY,
  secretKey: process.env.AISDK_SECRET_KEY,
});

const trace = testOps.trace({
  name: 'qa-pipeline',
  userId: 'user-123',
  sessionId: 'session-456',
  input: { question: 'What is the capital of France?' },
  metadata: { source: 'api' },
  tags: ['production', 'v2'],
});

Python:

import os
from browserstack_ai_sdk import AISDK

client = AISDK(
    public_key=os.environ["AISDK_PUBLIC_KEY"],
    secret_key=os.environ["AISDK_SECRET_KEY"],
)

trace = client.trace(
    name="rag-pipeline",
    user_id="user-123",
    session_id="session-abc",
    tags=["rag", "production"],
    metadata={"version": "2.0"},
)

Java:

import com.browserstack.aisdk.TestOps;
import com.browserstack.aisdk.tracing.TraceManager;
import com.browserstack.aisdk.tracing.model.TraceBody;

TestOps sdk = TestOps.fromEnv();
TraceManager tm = sdk.traceManager();

var trace = tm.trace(TraceBody.builder()
    .name("answer-question")
    .input("What causes Northern Lights?")
    .userId("user-42")
    .sessionId("session-abc")
    .environment("production")
    .build());

Trace Options

Node.js:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Display name for the trace. |
| userId | string | No | ID of the user that triggered this trace. |
| sessionId | string | No | Group multiple traces into a session. |
| input | any | No | Input data (shown in the dashboard). |
| output | any | No | Output data. Usually set when the trace completes. |
| metadata | Record<string, any> | No | Arbitrary key-value metadata. |
| tags | string[] | No | Labels for filtering in the dashboard. |
| id | string | No | Custom trace ID (auto-generated if omitted). |

Python:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Display name for the trace. |
| user_id | str | No | ID of the user that triggered this trace. |
| session_id | str | No | Group multiple traces into a session. |
| input | any | No | Input data (shown in the dashboard). |
| output | any | No | Output data. Usually set via trace.update(). |
| metadata | dict | No | Arbitrary key-value metadata. |
| tags | list[str] | No | Labels for filtering in the dashboard. |
| public | bool | No | Whether the trace is publicly visible. |
| id | str | No | Custom trace ID (auto-generated if omitted). |
| expected_output | any | No | Expected output for evaluation. |

Java (all TraceBody fields):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | String | No | Display name for the trace |
| input | Object | No | Root input (String, Map, or any serializable object) |
| output | Object | No | Final output |
| userId | String | No | User identifier |
| sessionId | String | No | Session identifier for grouping traces |
| environment | String | No | Environment name (e.g., "production") |
| release | String | No | Release version |
| tags | List<String> | No | Arbitrary tags |
| metadata | Map<String, Object> | No | Custom key-value metadata |
| isPublic | Boolean | No | Whether the trace is publicly visible |
| id | String | No | Custom trace ID (auto-generated if omitted) |

Custom trace IDs

You can attach a business or external identifier to a trace — useful for idempotent retries (the same request ID always maps to the same trace) and for looking up traces later by an ID your system already owns (an order ID, a request UUID, a user-message UUID).

There are two ways to do this. Both produce the same result:

  1. Pass any string as id when creating the trace. The SDK detects whether the value is a 32-character hex string and either preserves it or hashes it deterministically.
  2. Pre-compute the trace ID with generateTraceId(customId) / generate_trace_id(custom_id) and pass the result as id. Useful when you need the trace ID before the trace is created (for example, to return it in an API response).

How the SDK resolves the id you pass:

| Input | Stored trace ID | customId field |
| --- | --- | --- |
| null / undefined / empty / non-string | random UUID (no dashes) | not set |
| 32-character hex (with or without dashes, any case) | normalised lowercase hex | the original string, if it differed from the normalised form |
| any other string | SHA-256 of the input, truncated to 32 chars | the original string |
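
The resolution rules in the table can be sketched in a few lines of Python. This is an illustration of the rules as described, not the SDK's implementation: the function name `resolve_trace_id` is hypothetical, and two details are assumptions (the input is hashed as UTF-8 bytes, and "truncated to 32 chars" means the first 32 characters of the hex digest). IDs produced here are therefore not guaranteed to match the ones the SDK generates.

```python
import hashlib
import re
import uuid
from typing import Optional, Tuple

_HEX32 = re.compile(r"[0-9a-f]{32}")


def resolve_trace_id(value: object) -> Tuple[str, Optional[str]]:
    """Return (stored_trace_id, custom_id) for the id passed by the caller."""
    # Row 1: missing, empty, or non-string input gets a random dashless UUID.
    if not isinstance(value, str) or not value:
        return uuid.uuid4().hex, None

    # Row 2: 32-character hex (dashes and case ignored) is normalised and kept.
    normalised = value.replace("-", "").lower()
    if _HEX32.fullmatch(normalised):
        custom_id = value if value != normalised else None
        return normalised, custom_id

    # Row 3: any other string is hashed deterministically.
    # Assumption: UTF-8 encoding, first 32 hex characters of the digest.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return digest[:32], value


trace_id, custom_id = resolve_trace_id("request-abc-123")
print(len(trace_id), custom_id)
```

Calling the function twice with the same non-empty string returns the same trace ID, which is the property the idempotency guarantee relies on.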

Once a trace exists, read the original custom string via trace.customId (Node) / trace.custom_id (Python). The dashboard surfaces it on the trace detail page and lets you filter the Traces list by Custom ID.

Node.js:

import { AISDK } from '@browserstack/ai-sdk';

const testOps = new AISDK({
  publicKey: process.env.AISDK_PUBLIC_KEY,
  secretKey: process.env.AISDK_SECRET_KEY,
});

// Option 1 — pass the custom string directly
const trace = testOps.trace({
  id: 'request-abc-123',
  name: 'qa-pipeline',
});

console.log(trace.id);        // 32-char hex (deterministic for 'request-abc-123')
console.log(trace.customId);  // 'request-abc-123'

// Option 2 — pre-compute when you need the hex up front
const traceId = testOps.generateTraceId('request-abc-123');
const sameTrace = testOps.trace({ id: traceId, name: 'qa-pipeline' });
// sameTrace.id === trace.id

Python:

import os
from browserstack_ai_sdk import AISDK

client = AISDK(
    public_key=os.environ["AISDK_PUBLIC_KEY"],
    secret_key=os.environ["AISDK_SECRET_KEY"],
)

# Option 1 — pass the custom string directly
trace = client.trace(id="request-abc-123", name="qa-pipeline")

print(trace.id)         # 32-char hex (deterministic for 'request-abc-123')
print(trace.custom_id)  # 'request-abc-123'

# Option 2 — pre-compute when you need the hex up front
trace_id = client.generate_trace_id("request-abc-123")
same_trace = client.trace(id=trace_id, name="qa-pipeline")
# same_trace.id == trace.id

Java:

import com.browserstack.aisdk.TestOps;
import com.browserstack.aisdk.tracing.model.TraceBody;

TestOps sdk = TestOps.fromEnv();

// Pass the custom string directly as `id`
TraceBody body = TraceBody.builder()
    .id("request-abc-123")
    .name("qa-pipeline")
    .build();

var trace = sdk.traceManager().trace(body);
System.out.println(trace.getId());

The Java SDK does not expose a generateTraceId helper yet. Pass the custom string directly as id; the backend applies the same resolution rules.

Custom IDs are deterministic: the same input string always produces the same trace ID. This is what makes retries idempotent — a retried request with the same custom ID writes to the same trace rather than creating a duplicate.
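
To see why determinism makes retries idempotent, the sketch below uses an in-memory dictionary as a stand-in for the trace store. Everything in it is hypothetical: `TraceStore` is not an SDK class, and `trace_id_for` assumes the trace ID is the first 32 hex characters of a SHA-256 digest over the UTF-8 bytes of the custom ID.

```python
import hashlib
from typing import Any, Dict


def trace_id_for(custom_id: str) -> str:
    # Assumption: first 32 hex characters of SHA-256 over the UTF-8 bytes.
    return hashlib.sha256(custom_id.encode("utf-8")).hexdigest()[:32]


class TraceStore:
    """Hypothetical in-memory stand-in for the tracing backend."""

    def __init__(self) -> None:
        self.traces: Dict[str, Dict[str, Any]] = {}

    def upsert(self, custom_id: str, **fields: Any) -> str:
        trace_id = trace_id_for(custom_id)
        # Same custom ID -> same key, so a retry merges into the existing record.
        record = self.traces.setdefault(trace_id, {"custom_id": custom_id})
        record.update(fields)
        return trace_id


store = TraceStore()
first = store.upsert("request-abc-123", name="qa-pipeline", attempt=1)
retry = store.upsert("request-abc-123", name="qa-pipeline", attempt=2)
other = store.upsert("request-def-456", name="qa-pipeline", attempt=1)

print(first == retry, len(store.traces))
```

The retried request lands in the record created by the first attempt, while a different request ID produces a separate record.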

Updating a Trace

Node.js:

const trace = testOps.trace({ name: 'qa-pipeline', input: { q: 'Hello?' } });

// ... run your pipeline ...

trace.update({
  output: 'Paris is the capital of France.',
  metadata: { latencyMs: 320 },
});

Python:

trace = client.trace(name="qa-pipeline", input={"question": "What is the capital of France?"})

# ... run your pipeline ...

trace.update(output="Paris is the capital of France.")

Java:

trace.update(TraceBody.builder()
    .output("The Aurora Borealis is caused by solar wind particles...")
    .build());

Creating a Generation

A generation records a single LLM call. The examples below use OpenAI as the model provider.

Node.js:

import OpenAI from 'openai';

const openai = new OpenAI();

const generation = trace.generation({
  name: 'openai-call',
  model: 'gpt-4o',
  modelParameters: { temperature: 0.3, maxTokens: 512 },
  input: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' },
  ],
});

const result = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' },
  ],
});

generation.end({
  output: result.choices[0].message,
  usage: {
    input: result.usage?.prompt_tokens,
    output: result.usage?.completion_tokens,
  },
});

Generation Parameters

Node.js:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Display name. |
| model | string | No | Model identifier (e.g. gpt-4o). |
| modelParameters | Record<string, any> | No | Parameters passed to the model. |
| input | any | No | Prompt or messages sent to the model. |
| output | any | No | Model response. Usually set via generation.end(). |
| usage | { input?: number; output?: number; total?: number } | No | Token counts. |
| metadata | Record<string, any> | No | Arbitrary metadata. |
| startTime | Date | No | Override start timestamp. |
| endTime | Date | No | Override end timestamp. |

Python:

import os
import openai

openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

generation = trace.start_generation(
    name="generate-answer",
    model="gpt-4o",
    input=[
        {"role": "system", "content": "Answer based on the provided context."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer based on the provided context."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

answer = response.choices[0].message.content
generation.update(
    output=answer,
    usage_details={"input": response.usage.prompt_tokens, "output": response.usage.completion_tokens},
)
generation.end()
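
The Node.js example guards against a missing usage object with `?.`, while the Python example reads response.usage directly. A comparable guard in Python could look like the sketch below. The `to_usage_details` helper is hypothetical, not part of the SDK; it assumes only the input and output keys shown in the example above, and it uses a stand-in object so it runs without calling a provider.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def to_usage_details(usage: Optional[Any]) -> Dict[str, int]:
    """Map an OpenAI-style usage object to {"input": ..., "output": ...}.

    A missing object or missing counts are skipped instead of raising.
    """
    details: Dict[str, int] = {}
    if usage is None:
        return details
    prompt = getattr(usage, "prompt_tokens", None)
    completion = getattr(usage, "completion_tokens", None)
    if prompt is not None:
        details["input"] = prompt
    if completion is not None:
        details["output"] = completion
    return details


# Stand-in for response.usage, so the sketch runs without a provider call.
fake_usage = SimpleNamespace(prompt_tokens=120, completion_tokens=85)
print(to_usage_details(fake_usage))
print(to_usage_details(None))
```

The result could then be passed along as generation.update(output=answer, usage_details=to_usage_details(response.usage)).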

Generation Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Display name. |
| model | str | No | Model identifier (e.g. gpt-4o). |
| input | any | No | Prompt or messages sent to the model. |
| output | any | No | Model response. Set via generation.update(). |
| prompt | any | No | Alias for input. |
| response | any | No | Alias for output. |
| expected_output | any | No | Expected output for evaluation. |
| context | any | No | Context to attach to the generation. |
| usage_details | dict | No | Token counts (set via generation.update()). |
| metadata | dict | No | Arbitrary metadata. |

Java:

import com.browserstack.aisdk.tracing.model.GenerationBody;
import com.browserstack.aisdk.tracing.model.Usage;
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.*;
import java.util.List;
import java.util.Map;

OpenAIClient openai = OpenAIOkHttpClient.fromEnv();

long genStart = System.currentTimeMillis();

ChatCompletion response = openai.chat().completions().create(
    ChatCompletionCreateParams.builder()
        .model("gpt-4o")
        .addUserMessage("What causes Northern Lights?")
        .build()
);

// content() returns an Optional<String>; fall back to an empty string if absent
String answer = response.choices().get(0).message().content().orElse("");

trace.generation(GenerationBody.builder()
    .name("gpt-4o-call")
    .model("gpt-4o")
    .input(List.of(Map.of("role", "user", "content", "What causes Northern Lights?")))
    .output(answer)
    .usage(Usage.builder()
        .promptTokens(120)
        .completionTokens(85)
        .totalTokens(205)
        .build())
    .startTime(genStart)
    .endTime(System.currentTimeMillis())
    .modelParameters(Map.of("temperature", 0.7, "maxTokens", 512))
    .build());

Creating a Span

A span tracks a sub-step that is not an LLM call — for example a retrieval, a tool call, or a preprocessing step.

Node.js:

const trace = testOps.trace({ name: 'rag-pipeline' });

const retrievalSpan = trace.span({
  name: 'vector-search',
  input: { query: 'capital of France', topK: 5 },
});

const vectorStore = {
  search: async (query, { topK }) => [
    { id: 'doc1', text: 'Paris is the capital of France.', score: 0.92 },
    { id: 'doc2', text: 'France is a country in Europe.', score: 0.81 },
  ],
};

const docs = await vectorStore.search('capital of France', { topK: 5 });

retrievalSpan.end({
  output: docs,
  metadata: { source: 'pinecone' },
});

Python:

retrieval_span = trace.span(
    name="retrieve-documents",
    input={"query": "What is the capital of France?"},
)

documents = ["France is a country in Western Europe. Its capital is Paris."]

retrieval_span.update(output={"documents": documents})
retrieval_span.end()

Java:

import com.browserstack.aisdk.tracing.model.SpanBody;

long start = System.currentTimeMillis();

List<String> docs = vectorDb.search(query);

var retrieval = trace.span(SpanBody.builder()
    .name("vector-retrieval")
    .input(query)
    .output(docs)
    .startTime(start)
    .endTime(System.currentTimeMillis())
    .metadata(Map.of("collection", "knowledge-base", "topK", 5))
    .build());

In Java, call span.end() to record the current timestamp if you didn't set endTime in the builder:

var span = trace.span(SpanBody.builder().name("processing").build());
// ... do work ...
span.end(); // records endTime = now

Span Parameters

Node.js:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Display name. |
| input | any | No | Input data. |
| output | any | No | Output data. Usually set via span.end(). |
| metadata | Record<string, any> | No | Arbitrary metadata. |
| startTime | Date | No | Override start timestamp. |
| endTime | Date | No | Override end timestamp. |

Python:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Display name. |
| input | any | No | Input data. |
| expected_output | any | No | Expected output for evaluation. |
| context | any | No | Context to attach to the span. |
| metadata | dict | No | Arbitrary metadata. |

In Python, use span.update(output=...) to set output before calling span.end().

Java:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | String | No | Display name. |
| input | Object | No | Input data. |
| output | Object | No | Output data. |
| metadata | Map<String, Object> | No | Arbitrary metadata. |
| startTime | long | No | Override start timestamp (epoch ms). |
| endTime | long | No | Override end timestamp (epoch ms). |

Creating an Event

An event is a point-in-time observation within a trace or span.

Node.js:

const trace = testOps.trace({ name: 'agent-run' });

trace.event({
  name: 'tool-selected',
  input: { toolName: 'search_web' },
  metadata: { reasoning: 'User asked for current info' },
});

Python:

span = trace.start_span(name="agent-step")
span.create_event(
    name="cache-miss",
    input={"key": "query-123"},
)
span.end()

Java:

import com.browserstack.aisdk.tracing.model.EventBody;

trace.event(EventBody.builder()
    .name("cache-miss")
    .input(Map.of("key", query))
    .metadata(Map.of("reason", "ttl-expired"))
    .build());

Event Parameters

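Node.js:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | No | Display name. |
| input | any | No | Input data. |
| output | any | No | Output data. |
| metadata | Record<string, any> | No | Arbitrary metadata. |
| level | string | No | Observation level (e.g. "DEBUG", "WARNING", "ERROR"). |
| startTime | Date | No | Override timestamp. |

Python:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Display name. |
| input | any | No | Input data. |
| expected_output | any | No | Expected output for evaluation. |
| context | any | No | Context to attach to the event. |
| metadata | dict | No | Arbitrary metadata. |

Java:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | String | No | Display name. |
| input | Object | No | Input data. |
| metadata | Map<String, Object> | No | Arbitrary metadata. |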

Adding Scores

Scores attach evaluation results to a trace or observation.

Node.js:

// Score on a trace
testOps.score({
  traceId: trace.id,
  name: 'relevance',
  value: 0.92,
  comment: 'Answer was highly relevant to the question.',
});

// Score on a specific generation
testOps.score({
  traceId: trace.id,
  observationId: generation.id,
  name: 'faithfulness',
  value: 1.0,
});

Python:

trace.score(
    name="correctness",
    value=1.0,
    comment="Answer is correct",
)

Java:

import com.browserstack.aisdk.tracing.model.ScoreBody;

// Score a generation
gen.score(ScoreBody.builder()
    .name("faithfulness")
    .value(0.92)
    .comment("Answer stays within the provided context.")
    .build());

// Score a trace
trace.score(ScoreBody.builder()
    .name("overall-quality")
    .value(0.85)
    .dataType("NUMERIC")
    .build());

Score Parameters

Node.js:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Score name. |
| value | number or string | Yes | Score value. Numeric for NUMERIC/BOOLEAN, string for CATEGORICAL. |
| traceId | string | No | Trace to attach the score to. |
| observationId | string | No | Observation to attach the score to. |
| comment | string | No | Comment explaining the score. |
| dataType | string | No | "NUMERIC", "BOOLEAN", or "CATEGORICAL". Auto-inferred if omitted. |
| metadata | Record<string, any> | No | Arbitrary metadata. |

Python:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Score name. |
| value | float or str | Yes | Score value. Numeric for NUMERIC/BOOLEAN, string for CATEGORICAL. |
| comment | str | No | Comment explaining the score. |
| data_type | str | No | "NUMERIC", "BOOLEAN", or "CATEGORICAL". Auto-inferred if omitted. |
| observation_id | str | No | Observation to attach the score to. |
| metadata | dict | No | Arbitrary metadata. |

Java:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | String | Yes | Score name. |
| value | double | Yes | Score value. |
| comment | String | No | Comment explaining the score. |
| dataType | String | No | "NUMERIC", "BOOLEAN", or "CATEGORICAL". |
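
The score tables pair each data type with a value type: numbers for NUMERIC and BOOLEAN, strings for CATEGORICAL. A small pre-flight check along those lines can catch mismatches before a score is sent. The `resolve_data_type` function below is a hypothetical sketch, not an SDK function. In particular, the inference it applies when no data type is given (strings become CATEGORICAL, Python bools become BOOLEAN, other numbers become NUMERIC) is an assumption, since the SDK's own inference rules are not spelled out here.

```python
from typing import Optional, Union

ScoreValue = Union[int, float, str]


def resolve_data_type(value: ScoreValue, data_type: Optional[str] = None) -> str:
    """Return the data type for a score value, or raise if they do not match."""
    is_number = isinstance(value, (int, float))  # note: bool is a subclass of int

    if data_type is None:
        # Assumed inference rules; the SDK may infer differently.
        if isinstance(value, str):
            return "CATEGORICAL"
        if isinstance(value, bool):
            return "BOOLEAN"
        if is_number:
            return "NUMERIC"
        raise TypeError(f"unsupported score value: {value!r}")

    if data_type in ("NUMERIC", "BOOLEAN"):
        if not is_number:
            raise TypeError(f"{data_type} scores need a numeric value, got {value!r}")
    elif data_type == "CATEGORICAL":
        if not isinstance(value, str):
            raise TypeError(f"CATEGORICAL scores need a string value, got {value!r}")
    else:
        raise ValueError(f"unknown data type: {data_type!r}")
    return data_type


print(resolve_data_type(0.92))
print(resolve_data_type("good"))
print(resolve_data_type(1, "BOOLEAN"))
```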

Nesting Observations

Spans and generations can be nested inside other spans.

Node.js:

const trace = testOps.trace({ name: 'agent' });

const agentSpan = trace.span({ name: 'agent-step-1' });

// Nest a generation inside a span
const generation = agentSpan.generation({
  name: 'tool-decision',
  model: 'gpt-4o',
  input: [{ role: 'user', content: 'Which tool should I use?' }],
});

generation.end({ output: 'search_web' });
agentSpan.end({});

In Python, use start_span() and start_generation() to nest observations explicitly:

trace = client.trace(name="agent")

span = trace.start_span(
    name="agent-step-1",
    input={"query": "hello"},
)

gen = span.start_generation(
    name="llm-call",
    model="gpt-4o",
    input="Say hello in French",
)

gen.update(output="Bonjour!")
gen.end()
span.update(output="Bonjour!")
span.end()
client.flush()
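
If the work inside a span or generation raises an exception, the end() calls in the Python example above are skipped. One way to guard against that is a small context manager that always calls end(). The `ending` helper below is hypothetical, not an SDK feature; it relies only on the observation having an end() method, as the span and generation objects above do. `FakeSpan` is a stand-in so the sketch runs without the SDK.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List


@contextmanager
def ending(observation: Any) -> Iterator[Any]:
    """Yield the observation and call its end() even if the body raises."""
    try:
        yield observation
    finally:
        observation.end()


class FakeSpan:
    """Stand-in for a span or generation; records the calls it receives."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def end(self) -> None:
        self.log.append(f"end {self.name}")


calls: List[str] = []
try:
    with ending(FakeSpan("agent-step-1", calls)) as span:
        with ending(FakeSpan("llm-call", calls)) as gen:
            raise RuntimeError("model call failed")
except RuntimeError:
    pass

print(calls)
```

The inner observation is ended first and the outer one second, which matches the order used in the nesting example.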

In Java, use the withWorkflow() helper to wrap a callable in an auto-created trace:

String result = tm.withWorkflow("answer-question", () -> {
    // All tracing here is nested under the auto-created trace
    return generateAnswer(question);
});

Frameworks

Use the SDK directly: create a trace, record generations and spans around your LLM calls, then update the trace with the final output.

Node.js:

import { AISDK } from '@browserstack/ai-sdk';
import OpenAI from 'openai';

const testOps = new AISDK({
  publicKey: process.env.AISDK_PUBLIC_KEY,
  secretKey: process.env.AISDK_SECRET_KEY,
});
const openai = new OpenAI();

async function runRagPipeline(question) {
  const trace = testOps.trace({
    name: 'rag-pipeline',
    input: { question },
    tags: ['rag', 'production'],
  });

  // Step 1: Retrieve context
  const retrievalSpan = trace.span({
    name: 'retrieval',
    input: { query: question },
  });
  const context = await retrieveDocuments(question);
  retrievalSpan.end({ output: context });

  // Step 2: Generate answer
  const generation = trace.generation({
    name: 'answer-generation',
    model: 'gpt-4o',
    input: [
      { role: 'system', content: 'Answer using the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'Answer using the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  const answer = response.choices[0].message.content ?? '';
  generation.end({
    output: answer,
    usage: {
      input: response.usage?.prompt_tokens,
      output: response.usage?.completion_tokens,
    },
  });

  trace.update({ output: answer });
  await testOps.shutdown();
  return answer;
}

async function retrieveDocuments(question) {
  return 'Paris is the capital of France. France is a country in Europe.';
}

await runRagPipeline('What is the capital of France?');

Python:

import os
from browserstack_ai_sdk import AISDK
import openai

client = AISDK(
    public_key=os.environ["AISDK_PUBLIC_KEY"],
    secret_key=os.environ["AISDK_SECRET_KEY"],
)
openai_client = openai.OpenAI()

def run_rag_pipeline(question: str) -> str:
    trace = client.trace(
        name="rag-pipeline",
        user_id="user-123",
        session_id="session-abc",
        tags=["rag", "production"],
    )

    # Step 1: Retrieve context
    retrieval_span = trace.span(
        name="retrieve-documents",
        input={"query": question},
    )
    documents = ["France is a country in Western Europe. Its capital is Paris."]
    retrieval_span.update(output={"documents": documents})
    retrieval_span.end()

    # Step 2: Generate answer, passing the retrieved documents as context
    context = "\n".join(documents)
    messages = [
        {"role": "system", "content": "Answer based on the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    generation = trace.start_generation(
        name="generate-answer",
        model="gpt-4o",
        input=messages,
    )

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )

    # message.content can be None, so fall back to an empty string
    answer = response.choices[0].message.content or ""
    # Record the token counts reported by the provider instead of fixed numbers
    generation.update(
        output=answer,
        usage_details={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
        },
    )
    generation.end()
    trace.update(output=answer)

    trace.score(name="answer-quality", value=0.9)

    client.flush()
    return answer

result = run_rag_pipeline("What is the capital of France?")
print(result)

Java:

import com.browserstack.aisdk.TestOps;
import com.browserstack.aisdk.tracing.TraceManager;
import com.browserstack.aisdk.tracing.model.*;
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.*;
import java.util.List;
import java.util.Map;

public class TracingExample {
    public static void main(String[] args) throws Exception {
        TestOps sdk = TestOps.fromEnv();
        TraceManager tm = sdk.traceManager();

        String question = "What causes Northern Lights?";

        var trace = tm.trace(TraceBody.builder()
            .name("rag-pipeline")
            .input(question)
            .userId("user-42")
            .environment("production")
            .build());

        // Retrieve documents
        long t0 = System.currentTimeMillis();
        List<String> docs = List.of("Aurora borealis occurs when...", "Solar particles...");
        trace.span(SpanBody.builder()
            .name("retrieval")
            .input(question)
            .output(docs)
            .startTime(t0)
            .endTime(System.currentTimeMillis())
            .build());

        // Call the LLM
        OpenAIClient openai = OpenAIOkHttpClient.fromEnv();
        long genStart = System.currentTimeMillis();
        ChatCompletion response = openai.chat().completions().create(
            ChatCompletionCreateParams.builder()
                .model("gpt-4o")
                .addUserMessage(question)
                .build()
        );
        // content() returns an Optional<String>; fall back to an empty string if absent
        String answer = response.choices().get(0).message().content().orElse("");

        var gen = trace.generation(GenerationBody.builder()
            .name("gpt-4o-call")
            .model("gpt-4o")
            .input(List.of(Map.of("role", "user", "content", question)))
            .output(answer)
            .usage(Usage.builder().promptTokens(200).completionTokens(120).totalTokens(320).build())
            .startTime(genStart)
            .endTime(System.currentTimeMillis())
            .build());

        gen.score(ScoreBody.builder().name("faithfulness").value(0.95).build());
        trace.update(TraceBody.builder().output(answer).build());

        tm.flush();
        sdk.shutdown();
    }
}

Python: @observe Decorator

The Python SDK also provides an @observe decorator for automatic span wrapping:

import os
from browserstack_ai_sdk import observe, AISDK
import openai

client = AISDK(
    public_key=os.environ["AISDK_PUBLIC_KEY"],
    secret_key=os.environ["AISDK_SECRET_KEY"],
)

openai_client = openai.OpenAI()

@observe(name="summarize-article")
def summarize(article: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize the following article in 3 bullet points."},
            {"role": "user", "content": article},
        ],
    )
    return response.choices[0].message.content

result = summarize("Article text goes here...")
client.flush()  # flush before the script exits, as in the earlier Python examples
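
To make "automatic span wrapping" concrete, the sketch below shows how a decorator of this kind might work in principle. It is not the SDK's implementation: `observe_sketch` and the `RECORDED` list are hypothetical names, and the records stay in memory instead of being sent anywhere.

```python
import functools
import time
from typing import Any, Callable, Dict, List, Optional

RECORDED: List[Dict[str, Any]] = []  # in-memory stand-in for exported spans


def observe_sketch(name: Optional[str] = None) -> Callable:
    """Wrap a function so each call is recorded with input, output, and timing."""

    def decorator(func: Callable) -> Callable:
        span_name = name or func.__name__

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            record: Dict[str, Any] = {
                "name": span_name,
                "input": {"args": args, "kwargs": kwargs},
                "start": time.time(),
            }
            try:
                record["output"] = func(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                # Keep the failure on the record, then let it propagate.
                record["error"] = repr(exc)
                raise
            finally:
                record["end"] = time.time()
                RECORDED.append(record)

        return wrapper

    return decorator


@observe_sketch(name="summarize-article")
def summarize(article: str) -> str:
    # Placeholder for the LLM call in the example above.
    return article.upper()


result = summarize("Article text goes here...")
print(RECORDED[0]["name"], result)
```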