📊 OpenClaw Observability and Monitoring Tutorial

Stop treating your Agent as a black box: every decision should leave a trail

📖 Overview

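At 4 a.m. I was staring at the "Unknown Error" my Agent had returned when a question hit me: what did it actually go through? Where did it get stuck? Why did it give up? Questions like these are a late-night fog: you can neither see through them nor get a grip on them.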

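Observability is the lamp that cuts through that fog. It makes every action, every decision, and every error of the Agent transparent and traceable. This is more than monitoring: monitoring can only tell you that something went wrong, while observability tells you why it went wrong.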

The Three Pillars

Pillar    Purpose                              Typical tools
Metrics   Numeric measures of system health    Prometheus, Grafana
Logs      Detailed records of events           ELK Stack, Loki
Traces    The full path of a request           Jaeger, Zipkin

📈 Agent Dashboard Preview: success rate 98.5% | average latency 1.2s | QPS 847 | active Agents 12

🚀 Usage

1. Metrics Collection

📊 Key metrics

# OpenClaw metrics collection config
observability:
  metrics:
    enabled: true
    exporter: "prometheus"
    port: 9090
    
    # Core metrics
    collect:
      # Agent health metrics
      - name: "agent_requests_total"
        type: "counter"
        labels: ["agent_name", "status"]
        
      - name: "agent_latency_seconds"
        type: "histogram"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
        
      - name: "agent_errors_total"
        type: "counter"
        labels: ["agent_name", "error_type"]
        
      # Cost metrics
      - name: "api_cost_dollars"
        type: "counter"
        labels: ["model", "provider"]
        
      - name: "tokens_used_total"
        type: "counter"
        labels: ["model", "type"]
        
      # Resource metrics
      - name: "memory_usage_bytes"
        type: "gauge"
        
      - name: "active_sessions"
        type: "gauge"
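
To make the `histogram` type and its `buckets` list concrete, here is a minimal Python sketch of how a Prometheus-style histogram counts observations. The bucket bounds are copied from the `agent_latency_seconds` entry above; the `LatencyHistogram` class and the sample latencies are made up for illustration and are not part of OpenClaw or of any Prometheus client library.

```python
from bisect import bisect_left

# Bucket upper bounds copied from the agent_latency_seconds config above.
BUCKETS = [0.1, 0.5, 1, 2, 5, 10]


class LatencyHistogram:
    """Illustrative Prometheus-style histogram (hypothetical helper)."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One slot per bound, plus a final slot for the implicit +Inf bucket.
        self.per_bucket = [0] * (len(self.bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, seconds):
        # bisect_left returns the index of the first bound >= seconds, i.e.
        # the smallest bucket whose upper bound still includes the value.
        self.per_bucket[bisect_left(self.bounds, seconds)] += 1
        self.total += seconds
        self.count += 1

    def cumulative(self):
        """Return {upper_bound: number of observations <= upper_bound}."""
        out, running = {}, 0
        for bound, n in zip(self.bounds + [float("inf")], self.per_bucket):
            running += n
            out[bound] = running
        return out


hist = LatencyHistogram(BUCKETS)
for latency in [0.05, 0.3, 0.3, 1.2, 4.0, 12.0]:  # made-up sample latencies
    hist.observe(latency)
print(hist.cumulative())
```

Because the counts are cumulative, the share of requests faster than a given bound is one division away, which is why the bucket bounds are worth choosing around the latencies you actually care about.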

2. Logging

More logs are not better: past a certain volume, the entries that matter get buried. Structured logging is what keeps them findable.

logging:
  enabled: true
  level: "info"  # debug/info/warn/error
  
  # Structured logging
  format: "json"

  # Log fields
  fields:
    - timestamp
    - agent_name
    - session_id
    - level
    - message
    - context
    
  # Per-level settings
  levels:
    debug:
      enabled: false  # disabled in production
    info:
      enabled: true
    warn:
      enabled: true
      alert: true
    error:
      enabled: true
      alert: immediate
      
  # Log storage
  storage:
    type: "loki"
    endpoint: "http://loki:3100"
    retention: "7d"
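
As a rough illustration of what `format: "json"` with the field list above could produce, the sketch below emits one JSON object per log line using only Python's standard `logging` module. The field names come from the config; `JsonFormatter`, `build_logger`, and the sample values are hypothetical, and shipping the lines to Loki is left out.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python's level names onto the names used in the config above.
LEVEL_NAMES = {"WARNING": "warn"}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (hypothetical helper)."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            # Fields passed through logging's `extra=` become attributes
            # on the record; fall back to None or {} when absent.
            "agent_name": getattr(record, "agent_name", None),
            "session_id": getattr(record, "session_id", None),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name="agent"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # mirrors level: "info"; debug stays off
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


log = build_logger()
log.warning(
    "skill call timed out",
    extra={
        "agent_name": "researcher",  # made-up values for illustration
        "session_id": "s-123",
        "context": {"skill": "web_search", "timeout_s": 30},
    },
)
```

With every line in this shape, a query such as "all warn entries for session s-123" becomes a field filter instead of a regular expression over free text.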

3. Distributed Tracing

A single request may pass through several Agents; tracing makes the whole path visible.
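
What ties the hops together is a trace ID that travels with the request. The config in this section selects W3C Trace Context for that, whose `traceparent` header has the form `version-traceid-parentid-flags`. The sketch below shows the idea in plain Python; the function names are made up, and how OpenClaw itself passes the header between Agents is not specified here.

```python
import re
import secrets

# traceparent (version 00): 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace id and span id."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Header for a downstream call: same trace id and flags, new span id."""
    match = TRACEPARENT_RE.match(incoming)
    if (
        match is None
        or set(match.group(1)) == {"0"}  # all-zero trace id is invalid
        or set(match.group(2)) == {"0"}  # all-zero parent id is invalid
    ):
        return new_traceparent()  # unusable header: start a new trace
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent()
hop = child_traceparent(root)   # e.g. Agent A calling Agent B
print(root)
print(hop)
```

Every hop shares the trace ID in the second field, so a backend such as Jaeger can stitch the spans from different Agents into one request path.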

tracing:
  enabled: true
  exporter: "jaeger"
  
  # Trace config
  sampling:
    rate: 0.1  # sample 10% of requests
    error_rate: 1.0  # trace 100% of failed requests

  # Span config
  spans:
    - name: "agent_execution"
      attributes: ["agent_name", "model", "duration"]
    - name: "skill_call"
      attributes: ["skill_name", "parameters"]
    - name: "api_call"
      attributes: ["provider", "model", "tokens"]
      
  # Context propagation
  propagation: "w3c_trace_context"
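
One way to read the two sampling rates is as a keep-or-drop decision made once the outcome of a request is known: failed requests are always kept, the rest are kept 10% of the time. The sketch below makes that decision deterministic by deriving it from the trace ID, so every service that sees the same trace agrees. The function name is hypothetical, and the assumption that spans are buffered until the request finishes is mine, not something the config states.

```python
def should_keep_trace(trace_id_hex, had_error, rate=0.1, error_rate=1.0):
    """Decide whether to keep a finished trace (illustrative sketch).

    trace_id_hex: 32 hex chars, as in a W3C traceparent header.
    had_error:    whether any span in the trace recorded an error.
    """
    effective = error_rate if had_error else rate
    # Treat the low 64 bits of the trace id as a uniform integer and compare
    # it against an integer threshold; integer math avoids float rounding
    # at the top of the range, so a rate of 1.0 really keeps everything.
    bucket = int(trace_id_hex[-16:], 16)
    threshold = int(effective * 2**64)
    return bucket < threshold


print(should_keep_trace("0" * 32, had_error=False))
print(should_keep_trace("f" * 32, had_error=False))
print(should_keep_trace("f" * 32, had_error=True))
```

Deciding after the fact is what allows `error_rate: 1.0` to work at all: a sampler that decides when the request starts cannot know yet whether it will fail.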

💡 Best Practices

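🎯 Observability design principles:
  • White box first: look at the process, not just the result
  • Structured logs: JSON format, easy to search and analyze
  • Trace the critical path: every important step gets a Span
  • Visible cost: monitor API call costs in real time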
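
The "every important step gets a Span" principle can be sketched as a small context manager that wraps a step, times it, and records whether it failed. This is a stand-in for a real tracing SDK, not one: `span` and `FINISHED_SPANS` are hypothetical names, and the span and attribute names are borrowed from the tracing config in section 3.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for an exporter such as Jaeger


@contextmanager
def span(name, **attributes):
    """Time one step and record it, whether it succeeds or raises."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = type(exc).__name__  # keep the failure visible
        raise
    finally:
        record["duration"] = time.perf_counter() - start
        FINISHED_SPANS.append(record)


# Made-up agent and skill names for illustration.
with span("agent_execution", agent_name="researcher", model="example-model"):
    with span("skill_call", skill_name="web_search"):
        time.sleep(0.01)  # placeholder for the real work

for finished in FINISHED_SPANS:
    print(finished["name"], round(finished["duration"], 3), finished["error"])
```

The inner span finishes first, so it is recorded first; the outer span's duration includes it. That nesting is what turns "the Agent was slow" into "the web_search skill was slow".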

Alerting

🚨 Example alert rules

alerting:
  enabled: true
  
  rules:
    # Error-rate alert
    - name: "high_error_rate"
      condition: "agent_errors_total / agent_requests_total > 0.05"
      severity: "warning"
      duration: "5m"
      channels: ["feishu", "slack"]
      
    # Latency alert
    - name: "high_latency"
      condition: "agent_latency_seconds > 5"
      severity: "warning"
      duration: "2m"
      
    # Cost alert
    - name: "cost_spike"
      condition: "api_cost_dollars increase > 10"
      severity: "critical"
      duration: "1h"
      
    # Agent status alert (assumes an active_agents gauge, which the collect list in step 1 does not define)
    - name: "agent_down"
      condition: "active_agents == 0"
      severity: "critical"
      immediate: true
      
  # Alert channels
  channels:
    feishu:
      webhook: "${FEISHU_WEBHOOK}"
    slack:
      channel: "#alerts"
    email:
      recipients: ["ops@miaoquai.com"]
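
How OpenClaw evaluates a `condition` string is not spelled out above, so the following is only one plausible reading of `high_error_rate`. Both metrics are counters that only ever grow, so dividing their lifetime totals would dilute a recent spike; the sketch instead takes the ratio of the increases between two scrapes and fires only after the threshold has been exceeded for the whole `duration`. All names and sample numbers are made up.

```python
def windowed_error_ratio(prev, curr):
    """Errors per request between two (requests_total, errors_total) scrapes."""
    d_requests = curr[0] - prev[0]
    d_errors = curr[1] - prev[1]
    return d_errors / d_requests if d_requests > 0 else 0.0


class SustainedRule:
    """Fire only when a value stays above the threshold for duration_s."""

    def __init__(self, threshold, duration_s):
        self.threshold = threshold
        self.duration_s = duration_s
        self.breach_started = None

    def update(self, now, value):
        if value <= self.threshold:
            self.breach_started = None  # condition cleared: reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.duration_s


rule = SustainedRule(threshold=0.05, duration_s=300)  # "> 0.05" for "5m"

# (time_s, agent_requests_total, agent_errors_total), scraped every 60 s.
# Made-up data: 100 new requests and 10 new errors in every interval.
samples = [(60 * i, 1000 + 100 * i, 10 + 10 * i) for i in range(9)]

fired_at = []
for (_, r0, e0), (t1, r1, e1) in zip(samples, samples[1:]):
    ratio = windowed_error_ratio((r0, e0), (r1, e1))
    if rule.update(t1, ratio):
        fired_at.append(t1)
print(fired_at)
```

In these samples every window runs at a 10% error ratio, yet the lifetime ratio never rises above 0.05, so a rule that divides the raw totals would stay silent throughout.
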
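
⚠️ Pitfalls to avoid:
  • Don't collect too many metrics; focus on the key ones (KPIs)
  • Keep log levels layered, and leave debug off in production
  • Tier your alerts so that minor issues don't interrupt you
  • Don't set the trace sampling rate too high, or storage costs will balloon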

🔧 Full Configuration Example

# OpenClaw observability: full config
observability:
  # Metrics
  metrics:
    enabled: true
    exporter: "prometheus"
    port: 9090
    collect:
      - agent_requests_total
      - agent_latency_seconds
      - agent_errors_total
      - api_cost_dollars
      - tokens_used_total
      
  # Logging
  logging:
    enabled: true
    level: "info"
    format: "json"
    storage:
      type: "loki"
      retention: "7d"
      
  # Tracing
  tracing:
    enabled: true
    exporter: "jaeger"
    sampling:
      rate: 0.1
      error_rate: 1.0
      
  # Alerting
  alerting:
    enabled: true
    rules:
      - high_error_rate
      - high_latency
      - cost_spike
    channels:
      - feishu
      - slack
      
  # Dashboard
  dashboard:
    enabled: true
    provider: "grafana"
    panels:
      - agent_health
      - latency_heatmap
      - cost_trend
      - error_breakdown

🔗 Related Links