OpenClaw Logging and Diagnostics - When Something Breaks, Don't Panic, Read the Logs

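At 4:23 a.m., the production Agent suddenly started returning blank replies. I panicked at first, then opened the logs and saw right away where the problem was. Logs are the black box of your AI system; when something goes wrong, they are what you fall back on.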

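1. Log architecture

OpenClaw's multi-layer logging system:

logging:
  # Log layers
  levels:
    - system    # Gateway, node management
    - agent     # Agent lifecycle
    - tool      # Tool call details
    - model     # Model interactions
    - debug     # Verbose debugging information

  # Storage configuration
  storage:
    backend: loki  # recommended
    retention: 7d
    compression: gzip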

2. Standardizing the log format

Structured, machine-parseable logs matter more than logs that are easy on the human eye:

{ "timestamp": "2026-04-28T01:23:45.678Z", "level": "INFO", "agent_id": "miaoquai_ops", "session_id": "sess_abc123", "event": "tool_call", "tool": "web_search", "duration_ms": 234, "tokens": {"input": 150, "output": 420}, "trace_id": "trace_xyz789" }

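Key fields

trace_id: unique identifier for one complete request, carried through every component
session_id: user session ID, linking the multiple turns of one user's conversation
duration_ms: elapsed time in milliseconds, essential for performance analysis
tokens: token consumption, used for cost tracking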
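
To make the format concrete, here is a minimal sketch of emitting entries in that shape with Python's standard logging module. It is an illustration under assumptions, not OpenClaw's own logger: the JsonFormatter class, the new_trace_id helper, and the "trace_" plus random hex ID scheme are all hypothetical names introduced for this example.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Context fields copied from the record when the caller passed them via `extra=`.
    CONTEXT_FIELDS = ("agent_id", "session_id", "event", "tool",
                      "duration_ms", "tokens", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC with millisecond precision and a trailing "Z".
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry, ensure_ascii=False)


def new_trace_id() -> str:
    """Hypothetical ID scheme: 'trace_' followed by 12 random hex characters."""
    return "trace_" + uuid.uuid4().hex[:12]


if __name__ == "__main__":
    logger = logging.getLogger("openclaw.sketch")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("tool call finished", extra={
        "agent_id": "miaoquai_ops",
        "session_id": "sess_abc123",
        "event": "tool_call",
        "tool": "web_search",
        "duration_ms": 234,
        "tokens": {"input": 150, "output": 420},
        "trace_id": new_trace_id(),
    })
```

Generating the trace_id once at the entry point and passing it along with every later log call is what lets you stitch a request back together afterwards.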

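3. Log level configuration

logging:
  # Global level
  default_level: info

  # Fine-grained control per component
  levels:
    gateway: info
    agent_runtime: debug
    tool_executor: debug
    model_client: warn  # model calls are too frequent, log warnings only

  # Production environment
  production:
    default_level: warn
    rate_limit: 100/second  # keeps logs from flooding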
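
How the rate_limit setting is enforced internally is not shown here, so the following is only a sketch of one common approach: a fixed-window counter implemented as a standard logging.Filter. The RateLimitFilter class and its clock parameter are hypothetical names for this example.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Fixed-window limiter: pass at most `max_per_second` records per one-second window."""

    def __init__(self, max_per_second: int, clock=time.monotonic):
        super().__init__()
        self.max_per_second = max_per_second
        self.clock = clock              # injectable so the window can be tested
        self.window_start = clock()
        self.count = 0                  # records seen in the current window
        self.dropped = 0                # records rejected since start-up

    def filter(self, record: logging.LogRecord) -> bool:
        now = self.clock()
        if now - self.window_start >= 1.0:
            # A new window begins: reset the counter.
            self.window_start = now
            self.count = 0
        self.count += 1
        if self.count > self.max_per_second:
            self.dropped += 1
            return False                # returning False tells logging to drop the record
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RateLimitFilter(max_per_second=100))
    logging.getLogger("openclaw.sketch").addHandler(handler)
```

A fixed window is the simplest option and can let through short bursts at a window boundary; a token bucket smooths that out at the cost of a little more state. Keeping a dropped counter is worth it either way, so you know when the limiter is hiding something.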

4. Key event logs

4.1 Agent lifecycle

# Agent start
{"event": "agent_start", "agent_id": "xxx", "config": {...}}

# Session start
{"event": "session_start", "session_id": "xxx", "user_id": "xxx"}

# Task complete
{"event": "task_complete", "task": "xxx", "success": true, "duration_ms": 1234}

4.2 Tool calls

# Call start
{"event": "tool_call_start", "tool": "web_search", "params": {...}}

# Call complete
{"event": "tool_call_end", "tool": "web_search", "status": "success", "duration_ms": 234}

# Call failed
{"event": "tool_call_error", "tool": "web_search", "error": "timeout", "retry": true}
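
The three tool-call events above can be produced by one small wrapper. This is a sketch, assuming a hypothetical call_tool_logged helper and an emit callback that receives each event as a dict; the rule that only timeouts are marked retryable is also an assumption made for the example.

```python
import time
from typing import Any, Callable


def call_tool_logged(tool: str, fn: Callable[..., Any],
                     emit: Callable[[dict], None], **params: Any) -> Any:
    """Run fn(**params) and emit start, end, or error events around it."""
    emit({"event": "tool_call_start", "tool": tool, "params": params})
    started = time.perf_counter()
    try:
        result = fn(**params)
    except TimeoutError:
        # Assumption: timeouts are treated as transient and flagged for retry.
        emit({"event": "tool_call_error", "tool": tool,
              "error": "timeout", "retry": True})
        raise
    except Exception as exc:
        # Any other failure is reported by exception type and not retried.
        emit({"event": "tool_call_error", "tool": tool,
              "error": type(exc).__name__, "retry": False})
        raise
    duration_ms = int((time.perf_counter() - started) * 1000)
    emit({"event": "tool_call_end", "tool": tool,
          "status": "success", "duration_ms": duration_ms})
    return result


if __name__ == "__main__":
    # Stand-in tool for the demo; a real web_search would go here.
    call_tool_logged("web_search", lambda query: [query], print, query="openclaw logs")
```

The wrapper re-raises after logging, so the caller still decides what to do with the failure; the log entry only records that it happened.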

5. Diagnostic tools

5.1 Live tracing

# Trace a specific session
openclaw logs trace --session-id sess_abc123

# Trace a specific Agent
openclaw logs trace --agent-id miaoquai_ops --tail

5.2 Performance analysis

# Slow request analysis
openclaw analyze slow-requests --threshold 5s

# Token consumption ranking
openclaw analyze token-usage --top 10
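
If you only have the raw JSON log lines to hand, the same two questions can be answered with a few lines of Python. This sketch does not reproduce what the openclaw analyze commands do internally; the function names are hypothetical, and it assumes one JSON object per line with the duration_ms, tokens, and agent_id fields from section 2.

```python
import json
from collections import Counter
from typing import Iterable


def parse_lines(lines: Iterable[str]) -> list:
    """Parse JSON Lines, skipping blank lines, malformed lines, and non-object values."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            entries.append(value)
    return entries


def slow_requests(entries: list, threshold_ms: int) -> list:
    """Entries whose duration_ms is at or above the threshold (assumed inclusive)."""
    return [e for e in entries if e.get("duration_ms", 0) >= threshold_ms]


def top_token_usage(entries: list, top: int = 10) -> list:
    """Total input + output tokens per agent_id, largest first."""
    totals = Counter()
    for e in entries:
        tokens = e.get("tokens") or {}
        totals[e.get("agent_id", "unknown")] += tokens.get("input", 0) + tokens.get("output", 0)
    return totals.most_common(top)
```

Feeding it a file is as simple as parse_lines(open(path, encoding="utf-8")), where path points at an exported log file.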

5.3 Error aggregation

# Error statistics
openclaw analyze errors --group-by type --last 24h

# Sample output
# timeout_error: 234 occurrences (45%)
# rate_limit_error: 156 occurrences (30%)
# parsing_error: 89 occurrences (17%)
# other: 38 occurrences (8%)
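
The grouping behind a report like this is a straightforward count. Below is a sketch under assumptions: error_breakdown is a hypothetical helper, and it groups by the error field of any event whose name ends in "_error", which may not match how the CLI defines error types.

```python
from collections import Counter


def error_breakdown(entries: list) -> list:
    """Return (error_type, count, share) tuples, most frequent first."""
    counts = Counter(
        e["error"]
        for e in entries
        if str(e.get("event", "")).endswith("_error") and "error" in e
    )
    total = sum(counts.values())
    # With no errors the list is empty, so the division below never runs on zero.
    return [(name, n, n / total) for name, n in counts.most_common()]


if __name__ == "__main__":
    sample = [
        {"event": "tool_call_error", "error": "timeout"},
        {"event": "tool_call_error", "error": "timeout"},
        {"event": "tool_call_error", "error": "rate_limit"},
        {"event": "tool_call_end", "status": "success"},
    ]
    for name, n, share in error_breakdown(sample):
        print(f"{name}: {n} ({share:.0%})")
```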

6. Integrating with observability

observability:
  # Metrics to push
  metrics:
    - tool_call_duration
    - token_usage
    - error_rate
    - agent_latency

  # Alert rules
  alerts:
    - name: high_error_rate
      condition: error_rate > 0.05
      action: notify

    - name: slow_response
      condition: p99_latency > 10s
      action: notify
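
To show what the two conditions mean in practice, here is a sketch that evaluates them over a batch of log entries. The definitions are assumptions for illustration: error_rate is taken as the share of entries whose event name ends in "_error", and p99_latency as the nearest-rank 99th percentile of duration_ms. The function names are hypothetical, and a real metrics backend may compute both differently.

```python
def error_rate(entries: list) -> float:
    """Share of entries whose event name ends in '_error'; 0.0 for an empty batch."""
    if not entries:
        return 0.0
    errors = sum(1 for e in entries if str(e.get("event", "")).endswith("_error"))
    return errors / len(entries)


def p99(values: list) -> float:
    """Nearest-rank 99th percentile; 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def firing_alerts(entries: list) -> list:
    """Names of the alert rules above whose condition holds for this batch."""
    names = []
    if error_rate(entries) > 0.05:
        names.append("high_error_rate")
    latencies = [e["duration_ms"] for e in entries if "duration_ms" in e]
    if p99(latencies) > 10_000:  # 10s expressed in milliseconds
        names.append("slow_response")
    return names
```

Whatever the exact definitions, evaluate over a sliding time window rather than all history, otherwise one bad hour keeps the alert firing for days.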

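7. Logging best practices

✅ Use structured logs (JSON), not plain text
✅ Every request must carry a trace_id
✅ Redact sensitive information (passwords, tokens)
✅ Rate-limit logging in production to avoid log storms
✅ Set a sensible retention period
✅ Always record key events and errors
✅ Configure alerts so problems are found proactively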

⚠️ Never log passwords or API keys. Once a scanner picks them up from your logs, those credentials are compromised.
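
One way to enforce this is to scrub every entry just before it is written. The sketch below is an assumption-laden example rather than a built-in OpenClaw feature: the redact function, the SENSITIVE_KEYS list, and the "sk-" key pattern are all hypothetical and should be adapted to the secrets you actually handle.

```python
import re

# Hypothetical list of field names whose values are always masked.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "secret"}

# Hypothetical pattern for API keys embedded in free text ("sk-" plus 8 or more characters).
KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9]{8,}\b")


def redact(value):
    """Return a copy of `value` with sensitive fields and key-like strings masked."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return KEY_PATTERN.sub("[REDACTED]", value)
    return value
```

Key names are matched exactly, so the tokens field holding usage counts is left alone while a token field is masked. The function builds a new structure instead of mutating its input, which keeps the original params usable for the actual tool call.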

💡 Combined with time-travel debugging, the logs let you reconstruct the complete execution history.
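
At its simplest, rebuilding a history means collecting every entry that shares a trace_id and putting it in time order. A minimal sketch, assuming a hypothetical timeline helper and timestamps in the uniform UTC ISO 8601 format from section 2, which sort correctly as plain strings:

```python
def timeline(entries: list, trace_id: str) -> list:
    """Return the entries of one trace, ordered by timestamp."""
    matching = [e for e in entries if e.get("trace_id") == trace_id]
    # Uniform ISO 8601 UTC timestamps sort chronologically as strings.
    return sorted(matching, key=lambda e: e.get("timestamp", ""))


if __name__ == "__main__":
    sample = [
        {"timestamp": "2026-04-28T01:23:45.912Z", "trace_id": "trace_xyz789", "event": "tool_call_end"},
        {"timestamp": "2026-04-28T01:23:45.678Z", "trace_id": "trace_xyz789", "event": "tool_call_start"},
        {"timestamp": "2026-04-28T01:23:45.700Z", "trace_id": "trace_other", "event": "session_start"},
    ]
    for entry in timeline(sample, "trace_xyz789"):
        print(entry["timestamp"], entry["event"])
```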