Tracing the Fingerprints of Thought: Nested Tracing in Action with LangSmith

What (What this article is about)

For an agent, tracing is not just "prettier logs"; it is a causal chain you can walk back through after the fact. It breaks a single task into a nested execution tree in which every model invocation, tool call, retry, timeout, and idempotent commit point has a dedicated place.

This article uses LangSmith as the concrete implementation and OpenTelemetry concepts to ground the abstractions. It covers:

What trace, span, attributes, and events each mean.
How to layer the agent's span tree so you can pinpoint which step led the system astray.
Which fields must be recorded to turn an incident from a guess into a chain of evidence (audit, observability).

Problem (The engineering problem to be solved)

The failure modes of agents often span multiple layers:

Model misreading: Misinterpreting a tool's output, taking the wrong path, and consequently writing the wrong file.
Tool exceptions: Timeouts, permission denials, truncated returns, triggering a retry storm (timeout, retry).
Repeated side-effects: Missing an idempotency key, causing retries to result in duplicate commits (idempotency).
Context pollution: A file read returns too much data, causing attention degradation in subsequent steps (degradation).

If you only have a "long console log", it is extremely difficult to answer three questions:

At which step did the error occur?
What input did the model see at that exact moment?
Did the system trigger timeout/retry/idempotency/permission gates at that time?

The goal of tracing is to make these three questions answerable.

Principle (Tracing semantics: a span is a causal segment, not a log line)

OpenTelemetry defines the fundamental objects of observability very clearly:

trace: The end-to-end causal chain of a single request/task.
span: An operational segment within a trace (having a start/end time), which is nestable.
attributes: Structured key-value fields used for aggregation and filtering.
events: Discrete events within a span (e.g., "retried once", "permission denied once").

The value of these semantics is that you stop locating problems by reading text and start locating them with aggregated fields, timelines, and parent-child relationships.
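To make the four terms concrete, here is a minimal sketch of a span record in plain Python. The Span class and its field names are illustrative assumptions, not an OpenTelemetry or LangSmith type; real SDKs provide their own span objects.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """Illustrative span record (hypothetical, not a real SDK type)."""
    name: str
    trace_id: str                      # shared by every span in one trace
    parent_id: Optional[str] = None    # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None
    attributes: dict[str, Any] = field(default_factory=dict)    # filterable key-values
    events: list[tuple[float, str]] = field(default_factory=list)  # timestamped points

    def child(self, name: str) -> "Span":
        # Nesting: the child keeps the trace_id and points back at its parent.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)

    def add_event(self, message: str) -> None:
        self.events.append((time.monotonic(), message))

    def finish(self) -> None:
        self.end = time.monotonic()


root = Span(name="task", trace_id=uuid.uuid4().hex)
tool = root.child("tool")
tool.attributes["tool_name"] = "read_file"
tool.add_event("retried once")
tool.finish()
root.finish()
```

The parent_id link is what turns a flat list of records into a tree, and attributes are what make that tree queryable without reading free text.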

Usage (How to use it: Drawing the correct span tree for your agent)

1) What a usable span tree should look like

Split the tree into at least four layers:

task (root span): The lifecycle of a single task.
plan (child span): A single reasoning/planning step.
tool (child span): A single tool invocation (including timeout/retry information).
commit (child span): A single side-effect commit (WAL record point, idempotency key generation point).

The key is to separate "side-effects" from "reasoning": the commit is the anchor of the evidence chain.

2) Minimum field specification (recommended mandatory)

Whether you use LangSmith, Langfuse, or OTel, the fields should be as consistent as possible to facilitate cross-system aggregation:

Association fields:
- task_id
- trace_id
- span_id
Tool fields:
- tool_name
- tool_timeout_ms (timeout)
- tool_attempt / retry_reason (retry)
Side-effect fields:
- idempotency_key (idempotency)
- resource_targets (collection of resources written to)
- commit_id / wal_id (audit)
Budget fields:
- input_tokens / output_tokens
- cache_read_tokens (if available)
Security fields:
- permission_scope
- redaction_applied (whether redacted)

The purpose of this set of fields is not to be "pretty", but to allow you to answer:

Where are timeouts most likely to occur?
Where are retries most likely to occur?
Are there idempotency conflicts?
Are there unauthorized access attempts?
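One way to keep the field set consistent is to check it at the point where a span is emitted. The sketch below assumes spans are plain dicts of attributes; the REQUIRED_FIELDS mapping and the choice of which fields are mandatory for each span kind are assumptions made for illustration.

```python
ASSOCIATION_FIELDS = {"task_id", "trace_id", "span_id"}

# Assumed split of the recommended fields by span kind.
REQUIRED_FIELDS = {
    "tool": ASSOCIATION_FIELDS | {"tool_name", "tool_timeout_ms", "tool_attempt"},
    "commit": ASSOCIATION_FIELDS | {"idempotency_key", "resource_targets", "wal_id"},
}


def missing_fields(span_kind: str, attributes: dict) -> set:
    """Return the required fields that are absent or None for this span kind."""
    required = REQUIRED_FIELDS.get(span_kind, ASSOCIATION_FIELDS)
    return {name for name in required if attributes.get(name) is None}


tool_attrs = {
    "task_id": "task-1",
    "trace_id": "trace-1",
    "span_id": "span-7",
    "tool_name": "read_file",
    "tool_attempt": 1,
}
print(missing_fields("tool", tool_attrs))
```

Here the check reports that tool_timeout_ms was never set, which is exactly the kind of gap that makes the timeout question unanswerable later.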

3) Minimum integration of LangSmith (Example)

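The following is illustrative only: wrap the key functions with @traceable and fill in attributes at the tool and commit layers. The _annotate helper, the _make_plan stub, and the "steps"/"change" keys are placeholders introduced to make the example runnable; they are not part of LangSmith.

import os

from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree


def _env(name: str) -> str:
    """Return a required environment variable, failing fast if it is unset."""
    v = os.environ.get(name)
    if not v:
        raise RuntimeError(f"missing env: {name}")
    return v


def _annotate(**fields) -> None:
    """Attach fields to the current span's metadata; no-op if tracing is off."""
    run = get_current_run_tree()
    if run is not None:
        run.metadata.update(fields)


class AgentEngine:
    """
    Agent Engine:
    Demonstrates how to turn the planning/tool/commit layers into traceable spans.
    """

    @traceable(run_type="chain", name="task")
    async def run_task(self, task: dict):
        # Root span: one per task.
        return await self._plan_and_execute(task)

    @traceable(run_type="chain", name="plan")
    async def _plan_and_execute(self, task: dict):
        plan = await self._make_plan(task)
        results = []
        for step in plan:
            results.append(await self._invoke_tool(step))
            if "change" in step:
                # Side effects get their own span, separate from the tool call.
                await self._commit(step["change"])
        return results

    async def _make_plan(self, task: dict) -> list:
        # Placeholder (assumption): a real engine would call the model here.
        return task.get("steps", [])

    @traceable(run_type="tool", name="tool")
    async def _invoke_tool(self, tool_call: dict):
        # Fill in here: tool_name / timeout / attempt / retry_reason
        _annotate(tool_name=tool_call.get("name"), tool_attempt=1)
        # And truncate/redact the tool output to prevent sensitive info from entering the trace directly
        return {"status": "not_implemented"}

    @traceable(run_type="tool", name="commit")
    async def _commit(self, change: dict):
        # Generate idempotency_key here, write to WAL, then perform the actual commit
        _annotate(idempotency_key=change.get("idempotency_key"))
        return {"status": "not_implemented"}


# Fail fast at startup if credentials are missing, for example:
# _env("LANGSMITH_API_KEY")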

4) Privacy and Leak Risks: Tracing itself is the data plane

The point most often overlooked: traces routinely store prompts, tool outputs, and file snippets.

So the tracing pipeline has to be designed as a system that holds sensitive data:

Redaction: API keys, tokens, emails, and connection strings must be redacted before writing (audit).
Permissions: Who can view the traces? Isolate by project/environment (permissions, isolation).
Retention: The retention period and deletion policy for production traces (compliance risk), combined with log rotation (resource release).
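Redaction before write can be sketched as a small filter that runs on every payload and reports whether it changed anything, which is what the redaction_applied field would record. The patterns below are deliberately simple assumptions for illustration; a production system needs a vetted, tested pattern set.

```python
import re

# Order matters: connection strings are replaced first so that the
# "password@host" part is not mistaken for an email address.
_PATTERNS = [
    (re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s/@:]+:[^\s/@]+@\S+"), "[CONNECTION_STRING]"),
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [TOKEN]"),
    (re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"), "[API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
]


def redact(text: str) -> tuple[str, bool]:
    """Return (redacted_text, redaction_applied)."""
    applied = False
    for pattern, replacement in _PATTERNS:
        text, count = pattern.subn(replacement, text)
        applied = applied or count > 0
    return text, applied


clean, redaction_applied = redact(
    "connect postgres://admin:s3cret@db.internal:5432/app as ops@example.com"
)
print(clean, redaction_applied)
```

Running the filter inside the tracing wrapper, rather than at each call site, makes it much harder for a new tool to bypass it by accident.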

Design (Design trade-offs: Why tracing is not logging)

log: Detailed free text, meant for humans to read.
trace: A causal tree, meant for systems to aggregate and replay.

You need both, but do not mix them:

Use trace for the critical path (span tree + attributes).
Use log for supplementary details (structured logs + associated trace_id).
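Linking the two can be as simple as stamping every structured log line with the identifiers of the span that was active when it was written. The sketch below uses only the standard logging module; the JsonFormatter class, the logger name, and the sample IDs are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace linkage."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


stream = io.StringIO()  # stands in for a real log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("agent.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info(
    "tool output truncated to 4096 bytes",
    extra={"trace_id": "t-123", "span_id": "s-456"},
)
line = json.loads(stream.getvalue())
print(line)
```

With the trace_id on every line, the span tree stays small and the detail is still one lookup away.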

Pitfall (Common pitfalls and error prevention)

Tracing only models, not tools: Ultimately, you still won't know why side-effects occurred.
Not recording timeouts/retries: Retry storms are one of the most common sources of production incidents (timeout, retry).
Not recording idempotency keys: You cannot distinguish between "retry" and "duplicate commit" (idempotency).
Writing sensitive info into traces: The observability system becomes a source of leaks (permissions, audit).
Uncontrollable tracing overhead: Sampling and field pruning must be configurable (degradation).
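To avoid the timeout/retry pitfall, the retry loop itself can emit one record per attempt. This is a minimal sketch with asyncio: call_with_retry and the record callback are hypothetical names, and backoff and jitter are omitted for brevity.

```python
import asyncio


async def call_with_retry(tool, *, timeout_s, max_attempts, record):
    """Run `tool` with a per-attempt timeout, recording every attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = await asyncio.wait_for(tool(), timeout=timeout_s)
        except asyncio.TimeoutError:
            record({"tool_attempt": attempt, "status": "error", "retry_reason": "timeout"})
            continue
        record({"tool_attempt": attempt, "status": "ok"})
        return result
    raise TimeoutError(f"tool timed out after {max_attempts} attempts")


calls = {"n": 0}


async def flaky_tool():
    # Slow on the first call only, to force exactly one retry.
    calls["n"] += 1
    if calls["n"] == 1:
        await asyncio.sleep(1)
    return "ok"


events = []
result = asyncio.run(
    call_with_retry(flaky_tool, timeout_s=0.05, max_attempts=3, record=events.append)
)
print(result, events)
```

In a real integration the record callback would write to the tool span's attributes or events, so a retry storm shows up as a rising tool_attempt count rather than as repeated log lines.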

Debug (Locating a real incident using tracing)

Recommended troubleshooting sequence:

Find the failed tool span starting from the root span.
Look at the timeout/attempt/retry_reason of the tool span.
If writing is involved, locate the commit span and check the idempotency_key and wal_id.
Compare the span distribution with similar tasks to find the degradation point (observability).
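The first three steps of this sequence can be sketched as a walk over an exported span tree. The dict layout used here (kind, status, attributes, children) is an assumed export format, not the LangSmith schema; the fourth step, comparing against similar tasks, needs data from more than one trace and is not shown.

```python
from typing import Iterator


def walk(span: dict) -> Iterator[dict]:
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.get("children", []):
        yield from walk(child)


def triage(root: dict) -> dict:
    spans = list(walk(root))
    # Step 1: the first failed tool span, searching from the root.
    failed = next(
        (s for s in spans if s["kind"] == "tool" and s.get("status") == "error"),
        None,
    )
    # Step 2: the timeout/retry fields of that span.
    failed_info = None
    if failed is not None:
        failed_info = {
            key: failed["attributes"].get(key)
            for key in ("tool_name", "tool_timeout_ms", "tool_attempt", "retry_reason")
        }
    # Step 3: every commit span with its idempotency_key and wal_id.
    commits = [
        (s["attributes"].get("idempotency_key"), s["attributes"].get("wal_id"))
        for s in spans
        if s["kind"] == "commit"
    ]
    return {"failed_tool": failed_info, "commits": commits}


trace = {
    "kind": "task", "attributes": {}, "children": [
        {"kind": "plan", "attributes": {}, "children": [
            {"kind": "tool", "status": "error", "attributes": {
                "tool_name": "write_file", "tool_timeout_ms": 5000,
                "tool_attempt": 3, "retry_reason": "timeout"}},
            {"kind": "commit", "status": "ok", "attributes": {
                "idempotency_key": "k-1", "wal_id": "wal-9"}},
        ]},
    ],
}
report = triage(trace)
print(report)
```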

Metrics and Alerts (Turning tracing into a runtime defense)

After integrating tracing, don't just use it as a "debugging UI". You should immediately produce alertable metrics:

Timeout rate (aggregated by tool_name) (timeout).
Retry count distribution (aggregated by retry_reason) (retry).
Idempotency conflict count (aggregated by idempotency_key) (idempotency).
Permission denial count (aggregated by permission_scope) (permissions).
P95 latency (aggregated by span type: plan/tool/commit).

These metrics exist to reveal that the system is drifting out of control before an incident happens, rather than leaving you to open the trace afterwards.
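Two of these metrics can be sketched over a batch of exported spans. The span dicts and the "timeout" error code are assumptions, and p95 uses the nearest-rank definition; a metrics backend may interpolate and give slightly different values.

```python
from collections import defaultdict


def timeout_rate_by_tool(spans):
    """Fraction of tool spans that ended in a timeout, per tool_name."""
    totals, timeouts = defaultdict(int), defaultdict(int)
    for s in spans:
        if s["kind"] != "tool":
            continue
        totals[s["tool_name"]] += 1
        if s.get("error_code") == "timeout":
            timeouts[s["tool_name"]] += 1
    return {name: timeouts[name] / totals[name] for name in totals}


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not values:
        raise ValueError("p95 of an empty sequence is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


spans = [
    {"kind": "tool", "tool_name": "read_file", "error_code": "timeout"},
    {"kind": "tool", "tool_name": "read_file"},
    {"kind": "tool", "tool_name": "read_file"},
    {"kind": "tool", "tool_name": "read_file"},
    {"kind": "tool", "tool_name": "shell"},
    {"kind": "commit"},
]
rates = timeout_rate_by_tool(spans)
latency_p95 = p95(list(range(1, 21)))  # latencies 1..20 ms
print(rates, latency_p95)
```

An alert then becomes a threshold on these numbers, for example a timeout rate for one tool_name that stays above an agreed level for several minutes.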

Sampling and Degradation (Tracing can also drag down the system)

Tracing itself has a cost. You need two layers of switches:

Sampling: Collect full traces for only a portion of tasks, and only collect critical spans (like commit) for others.
Field pruning: Truncate and redact prompts/tool outputs to avoid blasting large texts into the observability system.

When system pressure rises, degradation must be allowed:

Retain only audit-critical fields (commit span + wal_id).
Pause recording large payloads (e.g., full file texts).

This is also part of "the observability system must be observable" (degradation, resource release).
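Both switches can be sketched in a few lines. The sampling decision is derived from a hash of the task_id so that every span of a task gets the same answer; the function names, the ALWAYS_KEEP set, and the truncation marker are assumptions for illustration.

```python
import hashlib

ALWAYS_KEEP = {"commit"}  # audit-critical span kinds are never sampled out


def keep_full_trace(task_id: str, sample_rate: float) -> bool:
    """Deterministic per-task decision: the same task_id always agrees."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def should_record(span_kind: str, task_id: str, sample_rate: float) -> bool:
    return span_kind in ALWAYS_KEEP or keep_full_trace(task_id, sample_rate)


def prune(text: str, max_chars: int = 2000) -> str:
    """Truncate a large payload and say how much was dropped."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"...[truncated {len(text) - max_chars} chars]"


print(should_record("commit", "task-1", 0.0))  # commit spans survive a 0% rate
print(prune("a" * 10, max_chars=4))
```

Under pressure, degradation then amounts to lowering sample_rate and max_chars at runtime, while commit spans keep flowing.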

A Minimum "commit span" Checklist (Recommended to hardcode)

The commit span is the anchor of the evidence chain; you can treat it as a mandatory set of fields:

commit_id / wal_id
idempotency_key
resource_targets
result / error_code
approved_by (if there is a manual approval chain)

Even if you downsample, you must retain these fields, otherwise post-mortem accountability and rollback are impossible (audit, rollback).
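A hardcoded checklist might look like the sketch below: derive the idempotency key from the content of the change so that a retry reproduces it, and project the commit span down to the audit fields when degrading. The key derivation, the choice of wal_id and result as the mandatory identifiers, and the function names are assumptions; the source only lists the fields.

```python
import hashlib
import json

COMMIT_CHECKLIST = ("wal_id", "idempotency_key", "resource_targets", "result", "approved_by")
OPTIONAL = {"approved_by"}  # only present when a manual approval chain exists


def make_idempotency_key(task_id: str, resource_targets: list, payload: dict) -> str:
    """Same task, targets, and payload always yield the same key."""
    canonical = json.dumps(
        {"task_id": task_id, "targets": sorted(resource_targets), "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]


def audit_view(attributes: dict) -> dict:
    """Keep only checklist fields; refuse to emit a commit span without them."""
    missing = [
        k for k in COMMIT_CHECKLIST
        if k not in OPTIONAL and attributes.get(k) is None
    ]
    if missing:
        raise ValueError(f"commit span is missing audit fields: {missing}")
    return {k: attributes[k] for k in COMMIT_CHECKLIST if k in attributes}


key = make_idempotency_key("task-1", ["b.txt", "a.txt"], {"op": "write"})
retry_key = make_idempotency_key("task-1", ["a.txt", "b.txt"], {"op": "write"})
kept = audit_view({
    "wal_id": "wal-42",
    "idempotency_key": key,
    "resource_targets": ["a.txt", "b.txt"],
    "result": "ok",
    "prompt": "large payload that degradation may drop",
})
print(key == retry_key, sorted(kept))
```

Because the retry produces the same key, two commit spans with equal idempotency_key values are a retry, while two spans with different keys against the same resource_targets point to a genuine duplicate commit.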

Source (References)

LangSmith tracing: https://docs.smith.langchain.com/observability/how_to_guides/tracing
OpenTelemetry concepts: https://opentelemetry.io/docs/concepts/observability-primer/
OpenTelemetry trace spec: https://opentelemetry.io/docs/specs/otel/trace/
Langfuse docs: https://langfuse.com/docs