Example Input
- app_name: 'CodeReviewBot'
- claude_model: 'claude-opus-4-6'
- observability_backend: 'Arthur.ai'
- observability_goal: 'Understand why some tool call chains take 3x longer than others and identify which tools are responsible for cost spikes'
- language: 'python'
- debug_scenario: 'A user reports that their PR review request hung for 45 seconds before returning an empty result'
Example Output
## Step 1 — Instrumentation Architecture
For CodeReviewBot, the typical execution graph looks like: user message → LLM turn 1 (plan) → tool calls (read_file, search_code, list_diff) in parallel or sequence → LLM turn 2 (synthesize) → final response. Each LLM turn and each tool invocation is a discrete span worth capturing.
Metadata to capture per LLM span: model ID, input/output token counts, stop reason, latency, request ID from the Anthropic response header. Per tool span: tool name, input arguments (truncated at 2 KB), output size in bytes, wall-clock duration, exception type if failed.
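The two span shapes above can be modeled explicitly before wiring in any backend. A sketch; the class and field names here are illustrative assumptions, not an Arthur.ai schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMSpan:
    # One record per Anthropic API call.
    model: str
    input_tokens: int
    output_tokens: int
    stop_reason: str
    latency_ms: int
    request_id: str  # from the response's request-id header

@dataclass
class ToolSpan:
    # One record per tool invocation.
    tool_name: str
    input_args: str  # repr of the arguments, truncated at 2 KB
    output_bytes: int
    latency_ms: int
    exception_type: Optional[str] = None  # None if the call succeeded

span = LLMSpan(model='claude-opus-4-6', input_tokens=1200, output_tokens=340,
               stop_reason='tool_use', latency_ms=2100, request_id='req_123')
```

Keeping the schema explicit makes it easy to validate spans locally before they ever reach the observability backend.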
## Step 2 — Implementation
```python
import functools
import os
import time
import uuid

from anthropic import Anthropic
from arthur_bench.client import ArthurClient  # Arthur.ai SDK

arthur = ArthurClient(api_key=os.environ['ARTHUR_API_KEY'])

class TracedAnthropic(Anthropic):
    """Anthropic client that logs one Arthur span per LLM request."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._root_trace_id = None

    def new_session(self):
        # One root trace ID per user session groups all spans together.
        self._root_trace_id = str(uuid.uuid4())
        return self._root_trace_id

    def messages_create(self, **kwargs):
        trace_id = self._root_trace_id or str(uuid.uuid4())
        start = time.perf_counter()
        try:
            # Distinct method name, so calling self.messages.create does not recurse.
            response = self.messages.create(**kwargs)
            arthur.log_trace(
                trace_id=trace_id, span='llm_request',
                model=kwargs.get('model'),
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                latency_ms=round((time.perf_counter() - start) * 1000),
                stop_reason=response.stop_reason,
            )
            return response
        except Exception as e:
            arthur.log_trace(trace_id=trace_id, span='llm_request', error=str(e))
            raise

def traced_tool(fn):
    """Decorator: log one Arthur span per tool invocation."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            arthur.log_trace(span='tool_call', tool=fn.__name__,
                             latency_ms=round((time.perf_counter() - start) * 1000),
                             status='ok')
            return result
        except Exception as e:
            arthur.log_trace(span='tool_call', tool=fn.__name__,
                             status='error', error=str(e))
            raise
    return wrapper
```
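To see what the decorator pattern emits without an Arthur account, the same structure can be exercised against an in-memory list standing in for `arthur.log_trace`:

```python
import functools
import time

spans = []  # in-memory stand-in for arthur.log_trace

def traced_tool(fn):
    """Same shape as the production decorator, logging to a local list."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            spans.append({'span': 'tool_call', 'tool': fn.__name__,
                          'latency_ms': round((time.perf_counter() - start) * 1000),
                          'status': 'ok'})
            return result
        except Exception as e:
            spans.append({'span': 'tool_call', 'tool': fn.__name__,
                          'status': 'error', 'error': str(e)})
            raise
    return wrapper

@traced_tool
def read_file(path):
    return f'contents of {path}'

read_file('src/app.py')
print(spans[0]['tool'], spans[0]['status'])  # read_file ok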
## Step 3 — Dashboards
1. **Cost by Session** — group spans by root_trace_id, compute each span's cost as input_tokens × input rate + output_tokens × output rate for the model (the two rates differ), and plot session totals as a time series. Alert when a session's cost exceeds $0.50.
2. **Tool Latency p95** — histogram of tool_call latency_ms grouped by tool name. Identify which tool drives the long tail.
3. **LLM Request Anomaly** — rolling z-score on requests-per-minute; page when z > 3 for 5 consecutive minutes.
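The first two dashboards reduce to simple aggregations over logged spans. A sketch of the underlying math; the per-token rates are placeholders, not published pricing:

```python
from collections import defaultdict
from statistics import quantiles

# Placeholder dollar-per-token rates -- substitute the real model pricing.
INPUT_RATE, OUTPUT_RATE = 15e-6, 75e-6

def session_costs(llm_spans):
    """Dashboard 1: total dollar cost per root trace ID."""
    totals = defaultdict(float)
    for s in llm_spans:
        totals[s['trace_id']] += (s['input_tokens'] * INPUT_RATE
                                  + s['output_tokens'] * OUTPUT_RATE)
    return dict(totals)

def tool_p95(tool_spans, tool):
    """Dashboard 2: p95 latency for one tool (needs at least 2 samples)."""
    latencies = [s['latency_ms'] for s in tool_spans if s['tool'] == tool]
    return quantiles(latencies, n=20)[-1]  # last cut point = 95th percentile

costs = session_costs([
    {'trace_id': 'a', 'input_tokens': 10_000, 'output_tokens': 2_000},
    {'trace_id': 'a', 'input_tokens': 5_000, 'output_tokens': 1_000},
])
over_budget = {t for t, c in costs.items() if c > 0.50}
```

The anomaly dashboard (rolling z-score on request rate) is usually cheaper to build in the backend's query language than in application code, since the backend already buckets spans by minute.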
## Step 4 — Debug Walkthrough
To investigate the 45-second hang: filter Arthur traces where `root_trace_id` matches the affected session and sort spans by start_time. The trace will reveal a `read_file` tool span with latency_ms = 44,200 — the tool was reading a 180 MB binary asset instead of source files. The fix: add a file-size guard (raise if > 1 MB) in the `read_file` handler and log skipped files as warning spans.
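The file-size guard could look like the sketch below; `MAX_READ_BYTES`, the exception name, and the handler shape are assumptions about CodeReviewBot's internals, not code from the trace:

```python
import os

MAX_READ_BYTES = 1_000_000  # 1 MB guard from the walkthrough

class FileTooLargeError(ValueError):
    """Raised instead of blocking the review on a huge binary asset."""

def read_file(path: str) -> str:
    size = os.path.getsize(path)
    if size > MAX_READ_BYTES:
        # Callers should catch this and log a warning span rather than fail the review.
        raise FileTooLargeError(f'{path} is {size} bytes (> {MAX_READ_BYTES})')
    with open(path, encoding='utf-8', errors='replace') as f:
        return f.read()
```

Checking `os.path.getsize` before opening avoids ever buffering the 180 MB asset, which is what turned one tool span into a 44-second stall.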
**Remaining gaps:** Claude's internal chain-of-thought between tool calls is not observable via the SDK. Consider streaming responses and capturing partial text deltas as sub-spans to approximate reasoning latency per step.
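One way to approximate per-step latency is to time the gap before each streamed text delta: long gaps hint at tool execution or server-side processing between visible tokens. A minimal sketch that works over any chunk iterator (such as the `text_stream` from the Anthropic SDK's streaming helper); the pause threshold is an arbitrary assumption:

```python
import time

def time_chunks(chunks):
    """Yield (gap_ms, chunk): milliseconds elapsed before each chunk arrived."""
    last = time.perf_counter()
    for chunk in chunks:
        now = time.perf_counter()
        yield round((now - last) * 1000), chunk
        last = now

def stall_spans(chunks, pause_ms=500):
    """Collect the chunks preceded by a suspiciously long gap as sub-spans."""
    return [(gap, chunk) for gap, chunk in time_chunks(chunks) if gap >= pause_ms]
```

Because the function takes a plain iterator, it can be tested against a fake generator and later pointed at a live stream unchanged.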