How Agents Go Off Track

When an AI Agent picks the wrong tool at step 7, the remaining 13 steps are pure waste — and traditional monitoring can't see it

2026-06-21 · Thinking · 19 min read

Deep Dive 3 / 3 · This article is the third deep dive into The Three Blind Spots of AI Observability, focusing on Agent decision path observability.

An 18-Step Task by a Data Analysis Agent

Imagine an enterprise data analysis Agent that receives the task: "Analyze the pricing changes of our three main competitors over the past 30 days. Find out who adjusted prices, by how much, and the likely reasons."

The Agent executes. 18 steps later, it outputs a report.

The report looks professional, well-structured, and data-rich. But it's wrong — because at step 4, the Agent confused the product lines of "Company A" and "Company B." The remaining 14 steps were all analyzing the wrong comparison.

Let's break down these 18 steps with an idealized Agent trace to see what happened at each step, why traditional monitoring can't see the problem, and what kind of observability could catch it.

Trace Overview

Run #agent-20260620-001
Goal: Analyze competitor pricing changes
Duration: 4m 32s
Steps: 18
Tokens: input 187,000 / output 12,400
Cost: $1.72 (estimated at Claude Opus pricing, including prompt caching: 80% input cached at 10% billing)
Status: ✅ Completed (but the result is wrong)

Traditional monitoring sees: Status = Completed, Duration = 272s, Cost = $1.72. All good.

Step-by-Step Breakdown

Step 1  [Thought]  Understand task: analyze 30-day pricing changes for 3 competitors
        Tool: web_search("Company A pricing 2026")
        Result: Returns 10 search results
        ✅ Correct

Step 2  [Thought]  Start with Company A
        Tool: web_search("Company A product line pricing page")
        Result: Returns Company A official pricing page link
        ✅ Correct

Step 3  [Thought]  Get Company A pricing details
        Tool: web_fetch("https://company-a.com/pricing")
        Result: Returns pricing page HTML, contains 3 product lines
        ✅ Correct

Step 4  [Thought]  Company A's Enterprise product line went from $49/user/mo to $59
        ❌ Semantic drift: Confused Company A's Team product line ($49) with
           Company B's Pro product line ($49).
           Reason: In Step 3's pricing page, Company A's product lines are
           named "Team," "Business," "Enterprise" — but the Agent
           incorrectly associated the "Pro $49" seen in a Company B ad
           from Step 2's search results.
        Tool: execute_sql("INSERT INTO pricing_data VALUES...")
        Result: Wrote incorrect data
        ⚠️ No error — write succeeded

From Step 4 onward, every analysis the Agent does is built on a wrong premise.

[Figure: 18-Step Agent Trace: Drift at Step 4, 81% Token Waste]

Step 5  [Thought]  Company A raised prices 20.4%, record this finding
        Tool: web_search("Company A price increase news")
        Result: No relevant news found
        ❌ Agent interpreted no news as "stealth price increase"

Step 6  [Thought]  Search for Company B pricing
        Tool: web_search("Company B pricing")
        Result: Returns Company B pricing page

Step 7  [Thought]  Get Company B pricing details
        Tool: web_fetch("https://company-b.com/pricing")
        Result: Returns pricing page

Step 8  [Thought]  Company B's Pro product line holds steady at $39
        ✅ The analysis of Company B itself is correct
        But due to Step 4's confusion, all subsequent comparisons
        are built on a wrong baseline

Step 9  [Thought]  Compare Company A and Company B
        Tool: execute_sql("SELECT ... FROM pricing_data")
        Result: Returns incorrect data (from Step 4)
        ❌ Comparison results are completely wrong

Step 10 [Thought]  Company A raised 20% vs Company B unchanged,
                   Company A is raising prices to capture margin
        ❌ Wrong conclusion based on wrong data

Step 11-15 [Thought]  Continue analyzing Company C, generate charts,
                      write analysis report
        All based on Step 4's incorrect data

Step 16 [Thought]  Check data quality
        Tool: execute_sql("SELECT COUNT(*) FROM pricing_data")
        Result: 3 records
        Agent judges: data complete ✅
        ❌ Checked completeness, not correctness

Step 17 [Thought]  Generate final report
        Tool: generate_report(...)
        Result: A perfectly formatted report with wrong content

Step 18 [Thought]  Task complete
        Output: "Analysis complete. Company A Enterprise went from $49 to $59..."

Waste Quantification

Total token consumption: 199,400
Effective tokens (Step 1-3 correct exploration): 38,000 (19%)
Wasted tokens (Step 4-18 wrong path): 161,400 (81%)
Total cost: $1.72
Wasted cost: $1.39 (81%)

Of the $1.72 this task cost, $1.39 was burned in the wrong direction. Traditional monitoring can't see this waste at all — every API call returned 200 OK.
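
The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming list prices of $15 per million input tokens and $75 per million output tokens (an assumption; the trace only says "Claude Opus pricing") and prorating waste by token share, which appears to be how the 81% figure was derived.

```python
# Assumed list prices in USD per million tokens (not stated in the article).
INPUT_PRICE = 15.00
OUTPUT_PRICE = 75.00

input_tokens, output_tokens = 187_000, 12_400
cached_share, cached_discount = 0.80, 0.10  # 80% of input billed at 10%


def run_cost() -> float:
    """Total run cost in USD under the assumed prices and caching split."""
    cached = input_tokens * cached_share * INPUT_PRICE * cached_discount
    uncached = input_tokens * (1 - cached_share) * INPUT_PRICE
    output = output_tokens * OUTPUT_PRICE
    return (cached + uncached + output) / 1_000_000


total_tokens = input_tokens + output_tokens   # 199,400
wasted_tokens = total_tokens - 38_000         # everything after Step 3
wasted_share = wasted_tokens / total_tokens
total_cost = run_cost()
wasted_cost = total_cost * wasted_share       # prorated by token share

print(f"total ${total_cost:.2f}, wasted ${wasted_cost:.2f} ({wasted_share:.0%})")
# total $1.72, wasted $1.39 (81%)
```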


The Data Model for Agent Traces

The 18-step breakdown above reveals a fundamental problem: Agent execution is not a linear request chain — it's a decision tree where every step has multiple possible branches, and the Agent's "choices" determine the path.

Traditional Trace vs. Agent Trace

Traditional distributed trace:

HTTP Request → Service A → Service B → DB Query → Service C → HTTP Response

This is a linear chain. Each span has start_time, duration, service_name. OpenTelemetry's data model expresses this perfectly.

Agent trace:

Goal → Thought ("I should search Company A first")
     → Tool Call (web_search) → Result
     → Thought ("Looks like Company A has 3 product lines")
     → Tool Call (web_fetch) → Result
     → Thought ("Enterprise went from $49 to $59")     ← Wrong inference
     → Tool Call (SQL INSERT) → Result                  ← Wrong data written
     → Thought ("Company A raised 20%")                 ← Based on wrong premise
     → Thought ("No news coverage, possibly stealthy")  ← Wrong attribution
     → ...

This isn't a linear chain — it's a tree with branches. Each node has:

  • Thought: The LLM's reasoning process (free text)
  • Tool Call: Tool name, parameters, return value
  • Decision: Why this path was chosen over alternatives
  • Assumption: What assumptions the Agent made at this step
  • Error: Explicit errors (exceptions, timeouts)
  • Semantic Drift: Implicit drift (no error but direction is off)

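Traditional OTel spans have no dedicated place for Thought and Decision. Both are free-text semantic information: they can be attached as ad hoc attributes, but no standard field describes them and no backend knows how to interpret them. This is why a dedicated Agent trace data model is needed.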

OpenInference Extensions

OpenInference (led by Arize) extends OpenTelemetry's semantic conventions with AI-specific fields:

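from opentelemetry import trace

tracer = trace.get_tracer("agent-demo")

# Traditional OTel span
with tracer.start_as_current_span("POST /api/chat") as span:
    span.set_attribute("http.method", "POST")
    span.set_attribute("http.url", "/api/chat")
    span.set_attribute("http.status_code", 200)

# OpenInference-style span
with tracer.start_as_current_span("llm.call") as span:
    span.set_attribute("llm.model_name", "glm-5.2")
    span.set_attribute("llm.token_count.prompt", 45000)
    span.set_attribute("llm.token_count.completion", 3200)
    # The keys below are illustrative custom attributes, not confirmed
    # OpenInference convention names
    span.set_attribute("llm.cost", 0.064)
    span.set_attribute("llm.tools_called", ["web_search", "execute_sql"])
    span.set_attribute("llm.decision_path", "search → fetch → analyze")
    span.set_attribute("llm.retrieval.context", "...")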

These fields let traces show not just "what was called" but also "what was consumed" and "what was chosen."

agent-run: A Purpose-Built Agent Standard

agent-run (builderz-labs) goes further, defining atomic units for Agent observation:

Run
  id: string
  goal: string
  status: running | completed | failed | abandoned
  started_at: timestamp
  finished_at: timestamp
  total_cost: float
  total_tokens: int
  Step[]
    id: string
    type: thought | tool_call | error | handoff
    content: string
    result: string
    duration_ms: int
    cost: float
    parent_step_id: string

This structure is better suited for Agent scenarios than OTel spans — it natively supports decision trees, thought text, tool call parameters, and return values. But it's not yet supported by mainstream OTel backends (Jaeger, Tempo, Datadog) and is currently used primarily in dedicated Agent observation tools (Atla, Langfuse's Agent mode).
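
To make the role of parent_step_id concrete, here is a minimal sketch of the Run/Step shape as Python dataclasses, with a helper that collects every step descending from a drifted step. The class layout follows the schema above; the `downstream_of` helper, the step IDs, and the per-step costs are hypothetical and are not part of agent-run.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    id: str
    type: str                      # thought | tool_call | error | handoff
    content: str
    cost: float = 0.0
    parent_step_id: Optional[str] = None


@dataclass
class Run:
    id: str
    goal: str
    steps: list[Step] = field(default_factory=list)

    def downstream_of(self, step_id: str) -> list[Step]:
        """Return the given step plus every step that descends from it."""
        tainted = {step_id}
        result = []
        # Steps are assumed to be stored in execution order, so a parent
        # always appears before its children.
        for step in self.steps:
            if step.id in tainted or step.parent_step_id in tainted:
                tainted.add(step.id)
                result.append(step)
        return result


run = Run(id="agent-20260620-001", goal="Analyze competitor pricing changes")
run.steps = [
    Step("s3", "tool_call", "web_fetch(company-a.com/pricing)", cost=0.10),
    Step("s4", "thought", "Enterprise went from $49 to $59", cost=0.05, parent_step_id="s3"),
    Step("s5", "thought", "Company A raised prices 20.4%", cost=0.08, parent_step_id="s4"),
    Step("s6", "tool_call", "web_search(Company B pricing)", cost=0.07, parent_step_id="s3"),
]
tainted = run.downstream_of("s4")
print([s.id for s in tainted], round(sum(s.cost for s in tainted), 2))
# ['s4', 's5'] 0.13
```

With parent links recorded, waste can be attributed to the subtree under the drifted step rather than to everything that happened later in wall-clock order.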


Detecting Semantic Drift

The hardest observability problem for Agents isn't detecting errors — errors have exceptions, error codes, and traditional methods catch them. The hardest part is detecting semantic drift: the Agent hasn't errored, but its direction is wrong.

Three Detection Methods

UC Berkeley's MASFT research (2025) systematically classified failure modes in multi-Agent systems: 14 failure modes across 3 categories (specification/system design failures, inter-agent misalignment, task validation and termination). In AppWorld tests, multi-Agent systems had a failure rate of 86.7% [1].

Method 1: Semantic Drift Detection

After each Agent thought step, use a lightweight evaluation model to judge "is the current thinking drifting from the original goal":

Evaluator input:
  Goal: "Analyze pricing changes for 3 competitors"
  Current Step (Step 4): "Company A's Enterprise product line went from $49 to $59"

Evaluator output:
  relevance_score: 0.85 (still relevant)
  confidence: 0.9

→ Step 4 is not flagged as drift

The problem: semantic drift detection is itself an LLM call with cost and latency. Doing it every step could double the cost. And for steps like Step 4 that are "wrong but appear relevant," the evaluator also struggles — because the Agent's description is coherent, just factually incorrect.

Method 2: Output Quality Regression

Set checkpoints at critical steps — after the Agent finishes data collection and before analysis, run a validator to check data quality:

Checkpoint (after Step 9):
  Validator checks: Is the data in pricing_data table reasonable?
    - Company A Enterprise $59 ← within reasonable range
    - Company B Pro $39 ← within reasonable range
    - Company C ... ← within reasonable range

  Validator cannot detect: Company A's Team product line mislabeled as Enterprise
  → Checkpoint passes, no interception

Data-level checkpoints can catch structural anomalies (null values, format errors, range violations), but not semantic errors — the data looks perfectly normal, just associated with the wrong entity.
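
A minimal sketch of such a checkpoint shows the limitation directly. The validator below performs only structural checks (required fields and a price range); the field names, the range, and the deliberately broken Company C row are assumptions for illustration.

```python
def validate_rows(rows, price_range=(1.0, 1000.0)):
    """Structural checks only: required fields present and price in range."""
    problems = []
    low, high = price_range
    for i, row in enumerate(rows):
        for key in ("company", "product_line", "price"):
            if row.get(key) in (None, ""):
                problems.append(f"row {i}: missing {key}")
        price = row.get("price")
        if isinstance(price, (int, float)) and not low <= price <= high:
            problems.append(f"row {i}: price {price} out of range")
    return problems


rows = [
    # Semantically wrong (price attributed to the wrong product line),
    # but structurally valid, so no check fires.
    {"company": "Company A", "product_line": "Enterprise", "price": 59},
    {"company": "Company B", "product_line": "Pro", "price": 39},
    # Hypothetical broken row, added to show what the validator does catch.
    {"company": "Company C", "product_line": None, "price": -5},
]
problems = validate_rows(rows)
print(problems)
# ['row 2: missing product_line', 'row 2: price -5 out of range']
```

The mislabeled Company A row passes untouched, which is exactly the gap described above.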

Method 3: Behavioral Pattern Anomaly

Instead of checking single-step correctness, detect whether the overall execution sequence pattern is abnormal:

Normal Agent execution pattern:
  search → fetch → extract → store → search → fetch → ...
  Tool call types alternate at each step

Abnormal pattern examples:
  1. Loop: search → fetch → search → fetch (same query)
     → May be stuck in an information retrieval loop

  2. Overlong chain: thought → thought → thought → thought → ... (no tool call)
     → May be in reasoning hallucination without real validation

  3. Skip: search → skip_validation → skip_analysis → generate_report
     → Skipped critical steps and jumped to conclusions

Behavioral pattern detection doesn't need to understand the Agent's semantics — it only analyzes statistical features of the tool call sequence. This approach is cheap (no extra LLM calls) but limited in precision — it can only detect structural anomalies, not whether each step is in the right direction.
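
The three abnormal patterns can be checked with plain counting over the step sequence. This is a sketch under stated assumptions: steps are (type, tool, argument) tuples, the thresholds are arbitrary, and "skip" is approximated as a report generated without any execute_sql call.

```python
from collections import Counter


def pattern_anomalies(steps, max_repeats=2, max_thought_run=3):
    """steps: list of (type, tool, arg) tuples in execution order."""
    anomalies = []

    # 1. Loop: the same tool called with the same argument too many times.
    calls = Counter((tool, arg) for kind, tool, arg in steps if kind == "tool_call")
    for (tool, arg), count in calls.items():
        if count > max_repeats:
            anomalies.append(f"loop: {tool}({arg!r}) called {count} times")

    # 2. Overlong chain: too many consecutive thoughts without a tool call.
    run = longest = 0
    for kind, _, _ in steps:
        run = run + 1 if kind == "thought" else 0
        longest = max(longest, run)
    if longest > max_thought_run:
        anomalies.append(f"overlong chain: {longest} consecutive thoughts")

    # 3. Skip: a report was generated although nothing was stored or queried.
    tools = [tool for kind, tool, _ in steps if kind == "tool_call"]
    if "generate_report" in tools and "execute_sql" not in tools:
        anomalies.append("skip: report generated without stored data")

    return anomalies


steps = [
    ("tool_call", "web_search", "Company A pricing"),
    ("tool_call", "web_fetch", "company-a.com/pricing"),
    ("tool_call", "web_search", "Company A price change 2026"),
    ("tool_call", "web_fetch", "company-a.com/pricing"),
    ("tool_call", "web_fetch", "company-a.com/pricing"),
    ("tool_call", "generate_report", ""),
]
anomalies = pattern_anomalies(steps)
print(anomalies)
```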

Current State: No Silver Bullet

All three methods have limitations:

| Method | What It Detects | What It Misses | Cost |
|---|---|---|---|
| Semantic drift detection | Topic-level drift | Factual errors | High (one LLM call per step) |
| Output quality regression | Structured data anomalies | Semantic correctness | Medium (checkpoint validation) |
| Behavioral pattern anomaly | Execution pattern deviations | Single-step correctness | Low (pure statistics) |

The practical best practice is combining all three: behavioral pattern detection for coarse filtering (low cost), output quality regression for key checkpoint validation, and semantic drift detection for full-chain auditing of high-value tasks.

[Figure: Three Drift Detection Methods Compared]

Loop Deadlock

The Phenomenon

An Agent gets stuck in a "search → read → search the same thing → read again" loop. Every step executes successfully (200 OK), but the entire chain is spinning in place.

Recent industry exploration (such as AgentDoG-style security diagnostic tools) analyzes complete Agent trajectories along three dimensions: Risk Source, Failure Mode, and Real-world Harm. The analysis looks beyond the final output to the entire execution path. AgentConductor (SJTU i-WiN + Meituan, 2026) reported a related finding: a 3B-parameter conductor Agent trained with RL dynamically generates interaction topologies, achieving a 14.6% accuracy improvement on competition-level programming tasks while reducing token cost by 68% [2]. This suggests that fixed-topology multi-Agent communication carries significant redundant information transfer.

A simplified version of a real pattern:

Step 1: search("Company A pricing")
Step 2: fetch(company-a.com/pricing)
Step 3: search("Company A price change 2026")
Step 4: fetch(company-a.com/pricing)        ← repeat
Step 5: search("Company A pricing plans")   ← nearly identical to Step 1
Step 6: fetch(company-a.com/pricing)        ← repeat again
Step 7: search("why Company A pricing")     ← another near-duplicate
Step 8: fetch(company-a.com/pricing)        ← third repeat

Traditional APM: Every API returns 200. Every page fetched successfully. No anomalies.

Agent trace: Search queries and fetch URLs are highly clustered in semantic space. Information gain between steps approaches zero.

Detection Methods

Query embedding similarity: Convert each step's search query into an embedding. Compute cosine similarity against all previous queries. If 3 consecutive steps have query similarity >0.85, flag as a loop.

Tool return information gain: Compare the overlap between current and historical tool returns. If fetching the same URL returns identical content (the page hasn't changed), information gain = 0.

Context growth rate analysis: During normal Agent execution, context length should grow steadily (new information added each step). If context growth rate suddenly approaches zero — net new information per step <100 tokens — the Agent is treading water.
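
The first and third checks can be sketched in a few lines. This is a minimal illustration, assuming query embeddings come from whatever embedding model is already in use; the 2-D vectors below are hand-made stand-ins, and the reading of "3 consecutive steps" (each step has a near-duplicate among earlier queries) is one possible interpretation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_loop(query_vectors, threshold=0.85, window=3):
    """Return the index of the step where a loop is flagged, else None.

    A step counts as a repeat when its query is more similar than
    `threshold` to any earlier query; `window` consecutive repeats
    flag a loop.
    """
    streak = 0
    for i, vec in enumerate(query_vectors):
        best = max((cosine(vec, prev) for prev in query_vectors[:i]), default=0.0)
        streak = streak + 1 if best > threshold else 0
        if streak >= window:
            return i
    return None


def is_stalled(context_sizes, min_gain=100):
    """True if the latest step added fewer than `min_gain` tokens of context."""
    return len(context_sizes) >= 2 and context_sizes[-1] - context_sizes[-2] < min_gain


# Hand-made 2-D vectors standing in for real query embeddings.
vectors = [[1, 0], [0, 1], [1, 0.1], [0.1, 1], [1, 0.05]]
print(detect_loop(vectors))                   # 4
print(is_stalled([40_000, 52_000, 52_030]))   # True
```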

Automated Intervention

Once a loop is detected, automated intervention is possible:

  • Prompt injection: Add to the Agent's next-step input: "You appear to be searching for the same information repeatedly. Try different keywords or a different approach."
  • Forced redirection: Prevent the Agent from using the same or semantically similar queries again
  • Human escalation: If the loop persists beyond 5 steps, pause the Agent and notify a human

Context Bloat and Reasoning Quality Degradation

The Problem

Every Agent step adds tool returns to the context. By step 15, the context might be 200K tokens — mostly historical tool returns (search results, SQL output, API responses).

But the model's effective attention isn't uniformly distributed. The "Lost in the Middle" study (Liu et al., 2023, arXiv:2307.03172) reported a disturbing finding: GPT-3.5-Turbo's accuracy when the key document sits in the middle of the context is about 54%, lower than its closed-book accuracy of 56.1% with no context at all [3]. Chroma's Context Rot report (2025.07, testing 18 models) further found that accuracy begins degrading beyond 32,000 tokens [4]. Attention across a long Agent context roughly breaks down as:

  • Start (system prompt, early steps): High attention
  • Middle (tool returns from historical steps): Low attention — the "blind spot"
  • End (latest step input): High attention

Quantifying the Impact

Research shows that as context grows from 4K to 128K:

  • Information retrieval accuracy at the start and end degrades by <5%
  • Information retrieval accuracy in the middle position degrades by 20–40%

Impact on Agents: The key information obtained at Step 3 (Company A's product lines are named "Team," "Business," "Enterprise") has sunk toward the middle of the context by the time the comparison happens at Step 10, so the model attends to it less. Given the Liu et al. result cited above, where mid-context accuracy fell below closed-book accuracy, providing more context can be actively harmful, which is a fundamental challenge for Agent architecture. The same mechanism plausibly contributed to Step 4's product line confusion, where a stray "Pro $49" from Step 2's search results competed with the pricing page fetched at Step 3.

Context Management

Three strategies for addressing context bloat:

Strategy 1: Sliding Window

  • Keep only the last N steps of full records
  • Earlier steps are replaced by summaries
  • Drawback: The summarization process may lose critical details

Strategy 2: Selective Retention

  • Assign an "importance score" to each step
  • Low-importance steps are discarded
  • Drawback: Determining importance is itself a hard problem

Strategy 3: External Memory

  • Store tool returns in external storage (vector database)
  • Retrieve relevant information from external storage at each step
  • Context retains only system prompt + retrieval results + most recent 2–3 steps
  • Drawback: Retrieval quality depends on embedding and indexing
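
Strategy 1 is the simplest to sketch. The function below keeps the last few steps verbatim and collapses everything earlier into one summary entry. The default summarizer is a placeholder that merely truncates each step; a real system would likely call a model there, which is where critical details can be lost.

```python
def build_context(system_prompt, steps, keep_last=3, summarize=None):
    """Keep the last `keep_last` steps verbatim; summarize everything earlier."""
    if summarize is None:
        # Placeholder summarizer (an assumption): truncate each older step.
        def summarize(older_steps):
            return " | ".join(step[:40] for step in older_steps)

    split = max(len(steps) - keep_last, 0)
    older, recent = steps[:split], steps[split:]

    context = [system_prompt]
    if older:
        context.append("Summary of earlier steps: " + summarize(older))
    context.extend(recent)
    return context


steps = [f"step {i}" for i in range(1, 7)]
context = build_context("You are a pricing analyst.", steps, keep_last=2)
for entry in context:
    print(entry)
```

With six steps and keep_last=2, the context holds the system prompt, one summary entry covering steps 1 to 4, and steps 5 and 6 in full.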

Information Loss in Multi-Agent Handoffs

The Scenario

In a multi-Agent system:

Agent A (data collection) → Agent B (data analysis) → Agent C (report generation)

Agent A spends 10 steps collecting data. Along the way it:

  • Considered but discarded 2 data sources (insufficient quality)
  • Made internal reliability judgments (Company B's data may be outdated)
  • Noticed an anomaly but decided to defer (Company C's price suddenly doubled — possibly a scraping error)

Agent A finishes and hands the "result" to Agent B. But what is the "result"?

Typically, what Agent A passes to Agent B is a structured dataset or a summary paragraph. The alternative approaches A considered, the data quality judgments, the observed anomalies — all lost. Agent B makes the next analytical step with a heavily compressed version of A's understanding.

Quantifying Information Loss

Agent A's complete cognitive state:
  - Factual data: ~5,000 tokens
  - Exploration process: ~15,000 tokens
  - Data quality judgments: ~2,000 tokens
  - Anomaly observations: ~1,000 tokens
  - Discarded alternatives: ~3,000 tokens
  Total: ~26,000 tokens

What Agent A passes to Agent B:
  - Structured dataset: ~5,000 tokens
  - A summary paragraph: ~500 tokens
  Total: ~5,500 tokens

Information retention rate: 5,500 / 26,000 ≈ 21%

Agent B receives only 21% of Agent A's cognitive state. The lost 79% — quality judgments, anomaly observations, discarded paths — may prove critical later.

Solution Directions

Rich Handoff: Pass not just the result but the full reasoning trace. Agent B can access all of Agent A's thoughts and tool call history. But this causes Agent B's context to balloon, triggering context bloat.

Structured Metadata Transfer: Along with results, Agent A attaches structured metadata — data quality scores, known anomaly lists, brief descriptions of discarded paths. This is more compact than a full trace but requires Agent A to proactively generate this metadata.

Shared Memory Layer: Agent A and Agent B share an external memory store. Agent B retrieves Agent A's reasoning details on demand. This avoids context bloat but introduces retrieval latency and accuracy issues.
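
Structured metadata transfer is the easiest of the three to prototype. Here is a minimal sketch, assuming a hypothetical `Handoff` container and using token counts as a rough proxy for information, as in the retention calculation above. The 600-token metadata size is an invented figure.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    dataset_tokens: int
    summary_tokens: int
    quality_notes: list[str] = field(default_factory=list)
    anomalies: list[str] = field(default_factory=list)
    discarded_paths: list[str] = field(default_factory=list)
    metadata_tokens: int = 0  # rough size of the three lists above

    def retention_ratio(self, source_tokens: int) -> float:
        """Share of the sender's cognitive state (in tokens) that is passed on."""
        passed = self.dataset_tokens + self.summary_tokens + self.metadata_tokens
        return passed / source_tokens


plain = Handoff(dataset_tokens=5_000, summary_tokens=500)
rich = Handoff(
    dataset_tokens=5_000,
    summary_tokens=500,
    quality_notes=["Company B's data may be outdated"],
    anomalies=["Company C's price suddenly doubled, possibly a scraping error"],
    discarded_paths=["2 data sources dropped for insufficient quality"],
    metadata_tokens=600,  # hypothetical
)
print(f"{plain.retention_ratio(26_000):.0%} -> {rich.retention_ratio(26_000):.0%}")
# 21% -> 23%
```

The token ratio barely moves, yet the three judgments Agent A would otherwise have dropped now reach Agent B, which suggests a raw token ratio is only a coarse proxy for what matters in a handoff.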


Security Observability: The Audit Imperative for Agents

Agents aren't read-only systems — they call APIs, spend money, modify files, and send messages. Every action needs auditing.

Deconvolute: The MCP Firewall

Deconvolute is a client-side runtime firewall that wraps around MCP (Model Context Protocol) tool calls:

Agent decides to call execute_sql("DELETE FROM users WHERE...")
  ↓
Deconvolute intercepts:
  1. Check if execute_sql is in the allowed tools list ✅
  2. Check if parameters match baseline definition ✅
  3. Check if SQL type is permitted by policy:
     - SELECT ✅
     - INSERT ⚠️ Requires confirmation
     - DELETE ❌ Blocked by policy
     - DROP ❌ Blocked by policy
  4. Block, write audit log, return error to Agent

Every tool call is:

  • Cryptographically signed and verified: Ensuring tool definitions haven't been tampered with
  • Policy-matched: Compared against a predefined policy baseline
  • Audit-logged: Call parameters, return values, and decisions (allow/deny) are written to an immutable log

This isn't observation — it's interception. But from an observability perspective, Deconvolute's audit log is the most reliable record of Agent behavior: because it's captured at the tool call layer, it's unaffected by the Agent's self-reporting. An Agent might say in its thought "I'm checking data quality," but if its tool call is DELETE FROM pricing_data, the audit log faithfully records the truth.
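
The interception flow described above can be approximated in a few lines. This sketch is not Deconvolute's API: the policy table, tool allowlist, and in-memory audit list are assumptions for illustration, and classifying SQL by its first keyword is deliberately naive (it ignores multi-statement strings and CTEs).

```python
import re

# Hypothetical policy mirroring the example above.
POLICY = {"SELECT": "allow", "INSERT": "confirm", "DELETE": "block", "DROP": "block"}
ALLOWED_TOOLS = {"execute_sql", "web_search", "web_fetch"}
audit_log = []  # stand-in for an append-only, immutable store


def check_tool_call(tool: str, sql: str) -> str:
    """Return 'allow', 'confirm', or 'block' and record the decision."""
    if tool not in ALLOWED_TOOLS:
        decision = "block"
    else:
        match = re.match(r"\s*([A-Za-z]+)", sql)
        verb = match.group(1).upper() if match else ""
        # Unknown statement types are blocked by default (an assumption).
        decision = POLICY.get(verb, "block")
    audit_log.append({"tool": tool, "params": sql, "decision": decision})
    return decision


print(check_tool_call("execute_sql", "DELETE FROM users WHERE id = 1"))   # block
print(check_tool_call("execute_sql", "SELECT * FROM pricing_data"))       # allow
print(check_tool_call("execute_sql", "INSERT INTO pricing_data VALUES (1)"))  # confirm
```

Because the log entry is written from the call parameters rather than from the Agent's thought text, it records what was attempted regardless of how the Agent described it.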


The Tooling Landscape

| Tool | Positioning | Strengths | Limitations |
|---|---|---|---|
| Langfuse | Open-source LLM engineering platform | Trace + Prompt management + Evaluation, ClickHouse backend, framework-agnostic | Agent-level thought/decision tracking requires manual instrumentation |
| Arize Phoenix | ML-engineer-oriented | Native OpenInference support, strong embedding analysis | Skews ML rather than Agent |
| Atla | Agent-specific | Real-time thought/tool call visualization, drift detection | New product, small ecosystem |
| agent-run | Agent trace open standard | Defines atomic units for Agent observation | Standard stage, no mature implementations |
| Deconvolute | MCP security firewall | Tool-call-level interception and auditing | Only covers MCP tools |
| Pydantic AI + Logfire | Python Agent + observability | Type-safe Agent definition + native tracing | Tied to Pydantic AI framework |
| LangSmith (LangChain) | LangChain ecosystem | Deep integration with LangChain/LangGraph | Locked into LangChain ecosystem |
| Salesforce Query-Driven Observability | Enterprise-grade Agent observability | Spark-based, validated on 600 users / 400M records | Internal system, not publicly available |

Salesforce's Playbook: From Two Weeks to One Day

Salesforce's Agentforce platform faced the ultimate Agent observability challenge: 60+ AI features, 600 users, 400+ million records, 800GB of data. When an Agent went wrong, engineers needed two weeks to locate the problem.

Their solution: Query-Driven Observability — not predefined dashboards, but letting engineers directly query production Agent behavior data using SQL:

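-- Find all Agent runs where token consumption spiked after Step 5
-- (a spike = more than 3x the run's average tokens per step)
SELECT run_id, step_num, token_count, tool_called
FROM (
  SELECT run_id, step_num, token_count, tool_called,
         AVG(token_count) OVER (PARTITION BY run_id) AS avg_token_count
  FROM agent_traces
) t
WHERE step_num > 5
  AND token_count > avg_token_count * 3
ORDER BY token_count DESC;

-- Find loop deadlock patterns
SELECT run_id, tool_called, COUNT(*) AS call_count
FROM agent_traces
GROUP BY run_id, tool_called
HAVING COUNT(*) > 5
ORDER BY call_count DESC;

-- Find cases where quality dropped after handoff
-- (a = the handing-off Agent's row, b = another Agent's row in the same run)
SELECT a.run_id,
       a.agent_name AS from_agent, b.agent_name AS to_agent,
       a.quality_score AS from_quality, b.quality_score AS to_quality
FROM agent_results a
JOIN agent_results b
  ON a.run_id = b.run_id
 AND a.agent_name <> b.agent_name
WHERE a.step_type = 'handoff'
  AND b.quality_score < a.quality_score * 0.7;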

Core architecture: Spark-based query engine + Notebook environment + 400M records of Agent trace data. Engineers can interactively explore Agent behavior patterns in a Notebook.

Result: debugging time dropped from two weeks to one day. The key isn't the tool's sophistication — it's giving engineers the ability to directly query Agent behavior data, moving from "looking at predefined dashboards" to "asking free-form questions."


Agent Runtime Observability: From Fleet to Individual

The preceding sections broke down Agent decision paths, drift detection, loop deadlocks, context bloat, and handoff information loss individually. But to connect these observation points into a coherent picture, you need a layered system — otherwise 14 failure modes scatter across different tools with no unified dashboard.

Below is an observation system organized into three layers: Global (Agent fleet), Coordination (multi-Agent interaction), and Individual (single-Agent execution). Each layer lists key observation points, metrics, recommended tools, and instrumentation locations.

Global Layer: Agent Fleet Management & Resource Scheduling

| Observation Point | Metric | Tool | Instrumentation |
|---|---|---|---|
| Agent instance health | active / queued / failed_agents | Prometheus | Scheduler |
| Resource consumption | CPU / mem / GPU utilization | cAdvisor | Container runtime |
| Task allocation fairness | per_agent_task_count | Custom analysis | Dispatcher logs |
| Global token burn rate | tokens/sec, burn_rate vs budget | Langfuse | API client aggregation |
| Fault propagation | cascade_depth, blast_radius | OTel trace | Inter-Agent call chain |

The global layer answers: Is the overall Agent fleet healthy? Are any Agents spinning idle? Is the token budget being consumed as expected? Is one Agent's failure propagating to others?

Coordination Layer: Multi-Agent Interaction Observation

| Observation Point | Metric | Tool | Instrumentation |
|---|---|---|---|
| Handoff quality | information_retention_ratio | Custom logger | Handoff point |
| Communication efficiency | messages_per_task, redundant_ratio | OTel spans | Message channel |
| Topology effectiveness | Actual vs optimal comm path | Orchestrator logs | Orchestrator decisions |
| Conflict detection | contradiction_rate | LLM-as-judge | Post-handoff checkpoint |
| Loop deadlock detection | repeat_query_similarity > 0.85 | Embedding similarity | Before each tool call |

The coordination layer answers: Is information between Agents lossless? Is communication redundant? Is the orchestration topology effective? Are any Agents contradicting each other? Are any Agents repeatedly invoking the same tool?

Individual Layer: Single-Agent Task Execution Observation

| Observation Point | Metric | Tool | Instrumentation |
|---|---|---|---|
| Step-level trace | step_count / duration / type | Langfuse / agent-run | ReAct loop |
| Decision quality | tool_selection_accuracy | Manual annotation + LLM eval | After thought |
| Drift detection | drift_score | Behavioral pattern analysis | Every N steps |
| Tool call audit | tool / params / result / duration | Deconvolute / OTel | Tool interception layer |
| Token consumption distribution | tokens_by_step | Langfuse | LLM usage field |
| Error & recovery | error_type / retry / recovery_success | Error logs | Catch block |

The individual layer answers: Is each step of a single Agent in the right direction? Are tool selections reasonable? Is token consumption concentrated in a few steps? Can the Agent auto-recover after errors?

Instrumentation Code Example

Using OpenTelemetry + OpenInference-style instrumentation, here's how step-level tracing and drift detection play out in the individual layer:

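from opentelemetry import trace

tracer = trace.get_tracer("agent.observability")

# assess_drift is a placeholder for your own evaluator (LLM judge or
# lightweight classifier); it must return an object with a .score field.
async def execute_step(agent_id, step, agent_history):
    with tracer.start_as_current_span("agent.step") as span:
        # Base attributes recorded on every step
        span.set_attribute("agent.id", agent_id)
        span.set_attribute("agent.step_num", step.num)
        span.set_attribute("agent.step_type", step.type)
        if step.type == "tool_call":
            span.set_attribute("tool.name", step.tool_name)
            span.set_attribute("tool.duration_ms", step.duration_ms)
            span.set_attribute("tool.success", step.success)
        # Run the (costly) drift check only on every 5th step
        if step.num % 5 == 0:
            drift = await assess_drift(step.task_goal, agent_history)
            span.set_attribute("agent.drift_score", drift.score)
            # A low score means the reasoning has moved away from the goal
            if drift.score < 0.5:
                span.add_event("drift_warning", {"step": step.num})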

Key design: every step records base attributes (agent id, step number, type), tool calls add tool name and duration, and drift detection fires every 5 steps — the frequency is tunable, representing the tradeoff between cost and precision.

Connection to DD1 / DD2

These three observation layers are not isolated — they form a closed loop with the first two articles in this series:

  • Agent-layer "abnormal step count increase" → DD1 cost alert: When the individual layer's step_count spikes, it maps directly to the token cost anomalies described in DD1. The individual layer is a leading indicator for DD1 cost alerts.
  • Agent-layer "reasoning quality degradation" → DD2 engine profiling: When drift_score consistently worsens, the root cause may be declining KV cache hit rates or batch scheduling issues in the underlying inference engine. DD2's engine-level metrics (GPU utilization, cache hit rate) provide the root-cause explanation for Agent quality degradation.
  • Agent-layer "handoff information loss" → architecture improvement signal: If the coordination layer's information_retention_ratio consistently falls below 30%, the multi-Agent architecture itself needs redesign — this isn't solvable by tuning parameters; it's an architectural decision.

From DD1's "where does the money go" to DD2's "why is the engine slow" to DD3's "why does the Agent drift," the three observation layers together form a complete AI observability system.


Tool Landscape and Challenges

Current state: Langfuse supports basic Agent tracing — thoughts and tool calls can be recorded via manual instrumentation, with a ClickHouse backend for queries. agent-run defines a purpose-built Agent trace schema (Run → Step → Thought/ToolCall/Handoff), but remains at the standard stage with no mature engineering implementation. Deconvolute approaches Agent observability from a security angle, providing policy-as-code auditing of MCP tool calls — the most reliable interception and logging source at the tool-call layer. Atla does real-time visualization of Agent thoughts, exploring drift detection. But most Agent frameworks (LangChain/AutoGen/CrewAI) offer very thin built-in observability — essentially "log printing" level tracing, lacking structured spans and semantic attributes.

Challenges:

  1. No de facto standard for Agent traces. Each framework uses its own format: LangChain's LangSmith trace, AutoGen's conversation log, CrewAI's execution log are structurally incompatible. Data doesn't interoperate — you can't use Langfuse to simultaneously view LangChain Agent and AutoGen Agent traces without manual adaptation.

  2. Thoughts are unstructured text that existing tools can only record, not analyze. An 18-step Agent task produces 18 thought segments totaling ~3,000–5,000 tokens. Automatically assessing thought quality (relevance, accuracy, logical coherence) requires an additional LLM call — expensive in both cost and latency. As a result, thought analysis only happens occasionally in post-hoc audits, not as real-time monitoring.

  3. Drift detection recall is too low. Behavioral pattern anomalies can only detect structural problems — loops (search → fetch → search repeated), spinning (consecutive thoughts with no tool calls), skipping (jumping to conclusions without validation). But semantic-level drift — like Step 4 confusing Company A's and Company B's product lines — shows completely normal behavioral patterns (tool call types and frequencies are unremarkable). Only semantic understanding of the Agent's reasoning can catch this. Such drift still requires manual review.

  4. Multi-Agent handoff information loss isn't quantified by any existing tool. When one Agent hands results to the next, current tools only record "a handoff occurred" and "what data was transferred" — not "what percentage of Agent A's cognitive state reached Agent B." Without an automated information_retention_ratio metric, information loss is an invisible problem.

Requirements for the tooling ecosystem:

  • An OTel-compatible Agent trace standard — enabling Agent traces to flow into existing observability infrastructure (Jaeger/Tempo/Datadog) rather than only being viewable in specialized tools. OpenInference is driving this direction, but Agent-specific thought/decision/handoff fields remain unstandardized
  • Low-cost thought quality assessment — using lightweight models (e.g., 1–3B parameter classifiers) for real-time thought quality scoring instead of invoking a full LLM every time. This requires finding the right balance between accuracy and cost
  • Automated handoff information fidelity metrics — automatically computing the information_retention_ratio from Agent A to Agent B, transforming information loss from an invisible problem into a trackable metric

Verdict: The Biggest Market Gap of 2026

Agent observability is the area in AI infrastructure with the most urgent demand and the least adequate supply.

LLM application observability (Langfuse/Phoenix) is relatively mature. Inference engine profiling (Graphsignal/vLLM metrics) is rapidly maturing. But Agent observability — thought tracking, drift detection, loop deadlock discovery, handoff information loss — is still in very early stages.

The standards war is the core battle. OpenInference is trying to bring Agent traces into the OpenTelemetry ecosystem. agent-run is defining its own Agent-specific schema. Langfuse uses its own trace format. If standards unify, data can flow between tools and tools can compose. If fragmentation persists, every Agent framework builds its own observability silo — just like every company building their own RPC tracing in the microservices era.

Who will win? My read:

  • Short-term (6 months): Langfuse, with open source + framework-agnostic + first-mover advantage, becomes the de facto standard for LLM/Agent observability. Similar to Jaeger in the microservices era.
  • Medium-term (1 year): OpenTelemetry + OpenInference will drive standardization. Big players (Datadog/New Relic) enter through acquisitions.
  • Long-term: Agent observability moves from "post-hoc analysis" to "real-time intervention" — auto-correcting detected drift, auto-injecting prompts for detected loops, auto-supplementing context for detected handoff information loss. Observability and execution are no longer separate.

For teams building AI Agents: start with Langfuse (or Phoenix) for basic tracing now, set checkpoints at critical nodes, and watch the agent-run standard's progress. Don't wait until Agents fail at scale in production to start building observability.

References

[1] Cemri et al., "Why Do Multi-Agent LLM Systems Fail?", arXiv:2503.13657 (2025). UC Berkeley & Intesa Sanpaolo.

[2] Wang et al., "AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation", arXiv:2602.17100 (2026). Shanghai Jiao Tong University i-WiN & Meituan.

[3] Liu et al., "Lost in the Middle: How Language Models Use Long Contexts", arXiv:2307.03172 (2023). Stanford, UC Berkeley, Samaya AI.

[4] Chroma, "Context Rot: How Increasing Input Tokens Impacts LLM Performance", 2025. See chromaDB research.

[5] Salesforce Engineering, "Query-Driven Observability for Agentforce", 2025. See Salesforce Engineering Blog. Spark-based query engine + Notebook environment, processing 400M agent trace records.


This article is the final installment of the AI Observability series. Full series: Overview: Three Blind Spots · DD1: Anatomy of an Inference Bill · DD2: Inside the Inference Engine · DD3: Agent Decision Paths