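🦞 Agent Debugging & Tracing: Fitting Your Agents with a "Dashcam"

"At 4:33 a.m., a sub-agent died without a sound. No error, no log line, like a perfect crime. Then I opened Tracing and found it had been stuck calling an expired API for a full 23 minutes."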

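What Is Agent Debugging & Tracing?

Agent Debugging & Tracing is OpenClaw's observability system. It has three layers:

Logging: records an agent's key operations and events
Telemetry: collects an agent's runtime metrics (token usage, response time, success rate, and so on)
Tracing: records an agent's complete call chain, including tool calls, sub-agent calls, and API requests

Put simply: Logging is the "ledger", Telemetry is the "health checkup report", and Tracing is the "dashcam". Together they leave an evidence trail for everything an agent does.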

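Why Do You Need Debugging & Tracing?

At 妙趣AI, 21+ scheduled tasks run every day, covering:

SEO content generation (5-10 pages/day)
Competitor monitoring (6 times/day)
RSS aggregation (every 2 hours)
Discord daily report (once/day)
Glossary page generation (6 pages/day)

When a task "seems not to have run" or "ran but produced the wrong result", you need to answer:

✅ Did the task actually start?
✅ Which tools did it call?
✅ What did those tools return?
✅ Were there any errors or timeouts?
✅ How many tokens did it consume?

Without Debugging & Tracing, the only way to answer these questions is to guess.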

OpenClaw in Practice: Configuration and Queries

Configuring Logging

# openclaw.yaml logging configuration
agents:
  defaults:
    logging:
      # Log level: debug | info | warn | error
      level: info

      # Log outputs
      outputs:
        - type: console
          format: pretty  # pretty | json
        - type: file
          path: /var/log/openclaw/agent.log
          maxSize: 100MB
          maxFiles: 10
        - type: syslog  # Optional: send to the system log

      # Fields to record
      fields:
        - timestamp
        - level
        - agentId
        - sessionId
        - message
        - toolName
        - duration

Configuring Telemetry

# openclaw.yaml telemetry configuration
agents:
  defaults:
    telemetry:
      enabled: true

      # Metrics to collect
      metrics:
        - token_usage      # Token usage
        - response_time    # Response time
        - success_rate     # Success rate
        - tool_calls       # Number of tool calls
        - cost_estimate    # Estimated cost

      # Export to Prometheus (optional)
      prometheus:
        enabled: true
        port: 9090
        path: /metrics

      # Export to an OpenTelemetry Collector (optional)
      opentelemetry:
        enabled: false
        endpoint: "otel-collector:4317"
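
To make the metrics above more concrete, here is a minimal Python sketch of how per-run samples could be rolled up into token usage, average response time, and success rate, then rendered in the Prometheus text exposition format. It is not OpenClaw's implementation: `RunSample`, `summarize`, `to_prometheus_text`, and the `openclaw_agent_*` metric names are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunSample:
    """One finished agent run (hypothetical shape, not an OpenClaw type)."""
    agent_id: str
    tokens: int
    seconds: float
    ok: bool


def summarize(samples: List[RunSample]) -> Dict[str, float]:
    """Aggregate raw samples into the kind of metrics listed in the config."""
    total = len(samples)
    succeeded = sum(1 for s in samples if s.ok)
    return {
        "runs_total": total,
        "token_usage_total": sum(s.tokens for s in samples),
        "response_time_seconds_avg": (
            sum(s.seconds for s in samples) / total if total else 0.0
        ),
        "success_rate": succeeded / total if total else 0.0,
    }


def to_prometheus_text(agent_id: str, summary: Dict[str, float],
                       prefix: str = "openclaw_agent") -> str:
    """Render a summary snapshot as Prometheus text exposition lines.

    Every value is exposed as a gauge because this is a point-in-time
    snapshot; a real exporter would likely use counters for totals.
    """
    lines = []
    for name, value in summary.items():
        metric = f"{prefix}_{name}"  # metric names are illustrative
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f'{metric}{{agent="{agent_id}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    runs = [
        RunSample("miaoquai", tokens=1000, seconds=2.0, ok=True),
        RunSample("miaoquai", tokens=3000, seconds=4.0, ok=True),
        RunSample("miaoquai", tokens=500, seconds=1.0, ok=False),
    ]
    print(to_prometheus_text("miaoquai", summarize(runs)))
```

A drop in the success-rate value is then something an alert rule can watch, rather than something you discover by reading logs after the fact.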

Configuring Tracing

# openclaw.yaml tracing configuration
agents:
  defaults:
    tracing:
      enabled: true
      sampler:
        type: always_on  # always_on | probabilistic | rate_limiting
        rate: 1.0  # Sampling rate (1.0 = 100%)

      # Where trace data is sent
      exporter:
        type: jaeger  # jaeger | zipkin | otlp
        endpoint: "http://jaeger:14268/api/traces"

      # Operations to trace
      operations:
        - tool_call
        - session_start
        - session_end
        - subagent_spawn
        - cron_run
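
The `rate` setting is easier to reason about with a small example. The Python sketch below shows one common way a probabilistic sampler can decide whether to keep a trace: hash the trace ID into a 64-bit bucket and compare it with the rate. This is an assumption about the general technique, not a description of how OpenClaw's sampler works, and `should_sample` is a hypothetical helper.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Decide deterministically whether a trace is kept.

    The trace ID is hashed into a 64-bit bucket and kept when the bucket
    falls below rate * 2**64. The same trace ID always gets the same
    answer, so every span in a trace is kept or dropped together.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # 0 .. 2**64 - 1
    return bucket < rate * 2**64


if __name__ == "__main__":
    ids = [f"trace_{i}" for i in range(10_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces at rate 0.1")
```

With `rate: 1.0` every trace is kept, which is affordable at a few dozen scheduled tasks per day; a lower rate trades completeness for storage.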

Querying Trace Data

# List an agent's most recent traces
openclaw tracing list --agent miaoquai --limit 10

# Sample output
Trace ID: trace_abc123
  Agent: miaoquai
  Session: session_def456
  Start: 2026-05-18 04:00:00
  Duration: 48.2s
  Status: success
  Operations:
    ├── tool_call: web_search (2.1s) ✅
    ├── tool_call: write (0.3s) ✅
    ├── tool_call: web_fetch (1.2s) ✅
    └── subagent_spawn: seo-bot (44.6s) ✅
        ├── tool_call: web_search (5.3s) ✅
        └── tool_call: write (8.7s) ✅

# Show the details of a specific trace
openclaw tracing get trace_abc123 --format json

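How Debugging & Tracing Works

How Logging Works

Logging uses structured logs, where each entry is a set of named fields rather than free text:

// Example of a structured log entry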
{
  "timestamp": "2026-05-18T04:00:01.234Z",
  "level": "info",
  "agentId": "miaoquai",
  "sessionId": "session_def456",
  "message": "Tool executed successfully",
  "toolName": "web_search",
  "duration": 2100,
  "metadata": {
    "query": "OpenClaw tutorial",
    "resultCount": 5
  }
}
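
As a rough illustration of how an entry like this could be produced, here is a sketch using Python's standard `logging` module with a custom formatter. The field names come from the sample above; `JsonFormatter`, `build_logger`, and `EXTRA_FIELDS` are assumptions for this example and not part of OpenClaw.

```python
import json
import logging
from datetime import datetime, timezone

# Structured fields taken from the sample entry; anything else is ignored.
EXTRA_FIELDS = ("agentId", "sessionId", "toolName", "duration", "metadata")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Copy only the structured fields the caller passed via `extra`.
        for name in EXTRA_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name: str = "agent") -> logging.Logger:
    """Create a logger that writes JSON lines to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "Tool executed successfully",
        extra={
            "agentId": "miaoquai",
            "sessionId": "session_def456",
            "toolName": "web_search",
            "duration": 2100,
            "metadata": {"query": "OpenClaw tutorial", "resultCount": 5},
        },
    )
```

Because every entry carries the same keys, a question such as "which `web_search` calls took longer than two seconds" becomes a filter on `toolName` and `duration` instead of a text search.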

How Tracing Works

Tracing follows the OpenTelemetry standard and records the call chain as a tree of spans:

Root Span (session_start)
  ├── Span: tool_call (web_search)
  │     ├── Attribute: query = "OpenClaw tutorial"
  │     ├── Attribute: result_count = 5
  │     └── Status: ok (2100ms)
  ├── Span: tool_call (write)
  │     ├── Attribute: file = "/var/www/..."
  │     └── Status: ok (300ms)
  └── Span: subagent_spawn (seo-bot)
        ├── Child Span: tool_call (web_search)
        ├── Child Span: tool_call (write)
        └── Status: ok (44600ms)
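
A span tree like this can be modeled with a few lines of Python, which also shows how a trace helps locate a bottleneck. The sketch below is illustrative only: `Span`, `walk`, `self_time_ms`, and `slowest_by_self_time` are hypothetical names, the durations are taken from the sample CLI output earlier in this article, and the "self time" calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    """A node in the call tree (simplified, not the OpenTelemetry SDK type)."""
    name: str
    duration_ms: int
    status: str = "ok"
    children: List["Span"] = field(default_factory=list)


def walk(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its (sequential) children."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def slowest_by_self_time(root: Span) -> Span:
    """Return the span with the most time not explained by its children."""
    return max((s for _, s in walk(root)), key=self_time_ms)


def sample_trace() -> Span:
    """Rebuild the sample trace using the durations from the CLI output."""
    seo_bot = Span("subagent_spawn: seo-bot", 44600, children=[
        Span("tool_call: web_search", 5300),
        Span("tool_call: write", 8700),
    ])
    return Span("session_start", 48200, children=[
        Span("tool_call: web_search", 2100),
        Span("tool_call: write", 300),
        Span("tool_call: web_fetch", 1200),
        seo_bot,
    ])


if __name__ == "__main__":
    root = sample_trace()
    for depth, span in walk(root):
        print(f"{'  ' * depth}{span.name} ({span.duration_ms}ms, "
              f"self {self_time_ms(span)}ms)")
    print("bottleneck:", slowest_by_self_time(root).name)
```

On the sample data the sub-agent span has 30600 ms that its two tool calls do not account for, so that is where the investigation would start.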

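Real-World Scenarios

Scenario	Tool	What it gives you
Finding out why a task did not run	Logging	Check the cron logs to confirm whether it started
Analyzing abnormal token usage	Telemetry	Break down cost by time window or session
Debugging cascading sub-agent calls	Tracing	Visualize the call tree and find the bottleneck
Monitoring a drop in success rate	Telemetry + Logging	Pinpoint which tool or API is failing
Reproducing a user-reported problem	Tracing	Reconstruct the full call chain from the Session ID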

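妙趣 Wrap-Up

An agent system without Debugging & Tracing is like a car without a dashboard: you can drive it, but you do not know how much fuel is left, whether the engine is overheating, or which part is about to fail.

Logging tells you "what happened", Telemetry tells you "how things are doing", and Tracing tells you "why it happened". Together they turn your agent system from a "black box" into a "transparent engine".

For 妙趣AI, which runs 21 scheduled tasks, this is my "remote diagnostic tool": without leaving home, I know the health of every task.

📅 Updated 2026-05-18 · 妙趣AI · 🦞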