OpenClaw Observability Monitoring Tutorial - Agent Metrics and Alert Configuration

📖 Overview

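It was 4 a.m. and I was staring at the "Unknown Error" an Agent had returned when a question hit me: what had it actually been through? Where did it get stuck? Why did it give up? Questions like these are fog in the dead of night: you cannot see through them or get a grip on them.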

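Observability is the lamp that cuts through that fog. It makes every action, every decision, and every error of an Agent transparent and traceable. This is not the same as monitoring: monitoring can only tell you that something went wrong, while observability can tell you why it went wrong.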

The Three Pillars

Pillar	Purpose	Typical tools
Metrics	Numeric measures of system health	Prometheus, Grafana
Logs	Detailed event records	ELK Stack, Loki
Traces	The full path of a request	Jaeger, Zipkin

📈 Agent Dashboard Preview

Success rate: 98.5% | Average latency: 1.2s | QPS: 847 | Active Agents

🚀 Usage

1. Metrics Collection Configuration

📊 Key Monitoring Metrics

# OpenClaw metrics collection configuration
observability:
  metrics:
    enabled: true
    exporter: "prometheus"
    port: 9090
    
    # Core metrics
    collect:
      # Agent health metrics
      - name: "agent_requests_total"
        type: "counter"
        labels: ["agent_name", "status"]
        
      - name: "agent_latency_seconds"
        type: "histogram"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
        
      - name: "agent_errors_total"
        type: "counter"
        labels: ["agent_name", "error_type"]
        
      # Cost metrics
      - name: "api_cost_dollars"
        type: "counter"
        labels: ["model", "provider"]
        
      - name: "tokens_used_total"
        type: "counter"
        labels: ["model", "type"]
        
      # Resource metrics
      - name: "memory_usage_bytes"
        type: "gauge"
        
      - name: "active_sessions"
        type: "gauge"
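
To make the counter and histogram types above concrete, here is a minimal standard-library sketch of how labeled counters and cumulative histogram buckets behave. It is an illustration only, not OpenClaw's or Prometheus's implementation; the `Counter` and `Histogram` classes and the "planner" agent name are hypothetical, and only the bucket bounds and label names come from the config.

```python
from bisect import bisect_left
from collections import defaultdict

# Bucket upper bounds copied from the agent_latency_seconds entry above.
BUCKETS = [0.1, 0.5, 1, 2, 5, 10]


class Counter:
    """Hypothetical monotonic counter keyed by a tuple of label values."""

    def __init__(self, label_names):
        self.label_names = tuple(label_names)
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters only go up")
        key = tuple(labels[name] for name in self.label_names)
        self.values[key] += amount


class Histogram:
    """Hypothetical histogram; each cumulative bucket counts values <= its bound."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One slot per bound plus a final +Inf slot.
        self.raw = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound >= value: the smallest bucket holding it.
        self.raw[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        out, running = {}, 0
        for bound, count in zip(self.bounds + [float("inf")], self.raw):
            running += count
            out[bound] = running
        return out


requests = Counter(["agent_name", "status"])
latency = Histogram(BUCKETS)

# Four made-up requests: (status, latency in seconds).
for status, seconds in [("ok", 0.3), ("ok", 1.2), ("error", 7.5), ("ok", 0.05)]:
    requests.inc(agent_name="planner", status=status)
    latency.observe(seconds)

cumulative = latency.cumulative()
```

Because the buckets are cumulative, a query such as "what fraction of requests finished within 2 seconds" is a single division of the bucket count for bound 2 by the +Inf count, which is why the bucket bounds should bracket the latency thresholds you plan to alert on.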

2. Logging Configuration

More logs are not better logs: with too many, the key information gets buried. Structured logging is the way to go.

logging:
  enabled: true
  level: "info"  # debug/info/warn/error
  
  # Structured logging
  format: "json"
  
  # Log fields
  fields:
    - timestamp
    - agent_name
    - session_id
    - level
    - message
    - context
    
  # Per-level settings
  levels:
    debug:
      enabled: false  # disabled in production
    info:
      enabled: true
    warn:
      enabled: true
      alert: true
    error:
      enabled: true
      alert: immediate
      
  # Log storage
  storage:
    type: "loki"
    endpoint: "http://loki:3100"
    retention: "7d"
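
As a rough illustration of what the JSON format and field list above produce, the following sketch uses Python's standard `logging` module to emit one JSON object per record. The `JsonFormatter` class, the `openclaw.demo` logger name, and the sample values are assumptions for the example, not part of OpenClaw.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields listed in the config."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "agent_name": getattr(record, "agent_name", None),
            "session_id": getattr(record, "session_id", None),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, ensure_ascii=False)


def build_logger(stream):
    logger = logging.getLogger("openclaw.demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)  # mirrors level: "info", so debug records are dropped
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.debug("not emitted at info level")
log.error(
    "skill call failed",
    # Extra keys become attributes on the record and are picked up by the formatter.
    extra={"agent_name": "planner", "session_id": "s-123", "context": {"skill": "search"}},
)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

With one object per line, a log store can filter on `session_id` or `agent_name` directly instead of pattern-matching free text.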

3. Distributed Tracing

A single request may pass through several Agents; tracing makes the whole path visible.

tracing:
  enabled: true
  exporter: "jaeger"
  
  # Trace settings
  sampling:
    rate: 0.1  # sample 10% of requests
    error_rate: 1.0  # trace 100% of failed requests
    
  # Span settings
  spans:
    - name: "agent_execution"
      attributes: ["agent_name", "model", "duration"]
    - name: "skill_call"
      attributes: ["skill_name", "parameters"]
    - name: "api_call"
      attributes: ["provider", "model", "tokens"]
      
  # Context propagation
  propagation: "w3c_trace_context"
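
The sampling rates and W3C Trace Context propagation configured above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, and the sampling decision is assumed to be made once the request outcome is known, which is how an error-aware rate would have to work in practice.

```python
import random
import re
import secrets

SAMPLE_RATE = 0.1        # sampling.rate from the config above
ERROR_SAMPLE_RATE = 1.0  # sampling.error_rate from the config above

# W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def should_sample(is_error, rng=random.random):
    """Keep every failed request and roughly SAMPLE_RATE of the rest."""
    rate = ERROR_SAMPLE_RATE if is_error else SAMPLE_RATE
    return rng() < rate


def new_traceparent(sampled):
    """Build a traceparent header for the first span of a trace."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent_header):
    """Keep the trace id and flags, and mint a new span id for the next hop."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# A failed request is always sampled; the next Agent in the chain reuses the trace id.
root = new_traceparent(sampled=should_sample(is_error=True))
hop = child_traceparent(root)
```

Every Agent that forwards the header with the same trace id lands in the same trace, which is what lets the backend stitch the hops into one path.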

💡 Best Practices

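🎯 Observability design principles:

White-box first: look at the process, not just the result
Structured logs: JSON format, easy to search and analyze
Trace the critical path: every important step gets its own Span
Visible cost: monitor API call cost in real time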

Alert Configuration

🚨 Alert Rule Examples

alerting:
  enabled: true
  
  rules:
    # Error rate alert
    - name: "high_error_rate"
      condition: "agent_errors_total / agent_requests_total > 0.05"
      severity: "warning"
      duration: "5m"
      channels: ["feishu", "slack"]
      
    # Latency alert
    - name: "high_latency"
      condition: "agent_latency_seconds > 5"
      severity: "warning"
      duration: "2m"
      
    # Cost alert
    - name: "cost_spike"
      condition: "api_cost_dollars increase > 10"
      severity: "critical"
      duration: "1h"
      
    # Agent status alert
    - name: "agent_down"
      condition: "active_agents == 0"
      severity: "critical"
      immediate: true
      
  # Alert channels
  channels:
    feishu:
      webhook: "${FEISHU_WEBHOOK}"
    slack:
      channel: "#alerts"
    email:
      recipients: ["ops@miaoquai.com"]
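
One plausible reading of the `duration` field in the rules above is that the condition must hold continuously for that long before the alert fires, similar to the `for` clause in Prometheus alerting rules. The sketch below illustrates that reading for `high_error_rate`; the `ErrorRateRule` class is hypothetical, and it assumes the error and request counts passed in cover a recent window rather than the whole process lifetime.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorRateRule:
    """Fires once the error ratio has stayed above threshold for hold_seconds."""

    threshold: float = 0.05     # from the high_error_rate condition
    hold_seconds: float = 300   # duration: "5m"
    breach_started: Optional[float] = None

    def evaluate(self, now, errors, requests):
        ratio = errors / requests if requests else 0.0
        if ratio <= self.threshold:
            self.breach_started = None  # condition cleared, reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now  # first evaluation above the threshold
        return now - self.breach_started >= self.hold_seconds


rule = ErrorRateRule()
# Made-up samples: (timestamp in seconds, errors in window, requests in window).
samples = [(0, 2, 100), (60, 8, 100), (240, 9, 100), (360, 10, 100), (420, 1, 100)]
fired = [rule.evaluate(t, e, r) for t, e, r in samples]
```

The hold period is what keeps a short burst of failures from paging anyone: the ratio crosses 5% at the second sample, but the rule only fires once it has stayed there for five minutes.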

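⚠️ Pitfalls to avoid:

Do not collect too many metrics; focus on the key ones (KPIs)
Keep log levels layered, and do not enable debug in production
Grade your alerts so that not every minor issue interrupts you
Do not set the trace sampling rate too high, or storage costs will explode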

🔧 Complete Configuration Example

# OpenClaw complete observability configuration
observability:
  # Metrics
  metrics:
    enabled: true
    exporter: "prometheus"
    port: 9090
    collect:
      - agent_requests_total
      - agent_latency_seconds
      - agent_errors_total
      - api_cost_dollars
      - tokens_used_total
      
  # Logging
  logging:
    enabled: true
    level: "info"
    format: "json"
    storage:
      type: "loki"
      retention: "7d"
      
  # Tracing
  tracing:
    enabled: true
    exporter: "jaeger"
    sampling:
      rate: 0.1
      error_rate: 1.0
      
  # Alerting
  alerting:
    enabled: true
    rules:
      - high_error_rate
      - high_latency
      - cost_spike
    channels:
      - feishu
      - slack
      
  # Dashboard
  dashboard:
    enabled: true
    provider: "grafana"
    panels:
      - agent_health
      - latency_heatmap
      - cost_trend
      - error_breakdown
