
🔭 OpenClaw Observability Engineering

Tags: OpenClaw, Observability, OpenTelemetry, Monitoring, Activity Tab

It is 1:42 a.m. and production is broken, but I have no idea where. That is the pain of running without observability.

What is observability engineering?

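Observability = monitoring (Metrics) + logs (Logs) + tracing (Traces). When an Agent is running unattended at 3 a.m., it lets you see what the Agent is doing, whether it has failed, and why it failed.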

1️⃣ OpenTelemetry integration

OpenClaw supports OpenTelemetry natively. A short configuration block turns on end-to-end tracing:

# otel-config.yaml
observability:
  enabled: true
  provider: "opentelemetry"
  
  otel:
    endpoint: "otel-collector:4317"
    protocol: "grpc"
    service_name: "miaoquai-openclaw"
    sampler:
      type: "trace_id_ratio"
      arg: 0.1  # sample 10% of traces
      
  exports:
    - type: "jaeger"
      endpoint: "http://jaeger:14268/api/traces"
    - type: "prometheus"
      endpoint: "http://prometheus:9090/metrics"
    - type: "loki"
      endpoint: "http://loki:3100/loki/api/v1/push"
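
To make the `trace_id_ratio` sampler setting more concrete, here is a minimal standard-library sketch of how ratio-based sampling is commonly done: the decision is derived from the trace ID itself, so every service that sees the same trace reaches the same verdict. The function name `should_sample` and the constant `TRACE_ID_LIMIT` are illustrative assumptions, not OpenClaw's or OpenTelemetry's actual code.

```python
# Hypothetical sketch of trace-ID ratio sampling (not OpenClaw's implementation).
import secrets

TRACE_ID_LIMIT = (1 << 64) - 1  # compare only the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Return True if this trace falls inside the sampled fraction."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Traces whose low 64 bits fall below this bound are kept.
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


if __name__ == "__main__":
    ids = [secrets.randbits(128) for _ in range(100_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

With `arg: 0.1`, roughly one trace in ten is kept. Because the decision depends only on the ID, a trace is either recorded in full or not at all, which avoids half-sampled call chains.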

2️⃣ Custom metrics

# custom-metrics.yaml
metrics:
  # Agent execution metrics
  agent_execution_duration:
    type: "histogram"
    labels: ["agent_name", "task_type", "status"]
    buckets: [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
    
  agent_cost_per_task:
    type: "counter"
    labels: ["agent_name", "model"]
    unit: "USD"
    
  skill_error_rate:
    type: "gauge"
    labels: ["skill_name", "version"]
    
  context_budget_utilization:
    type: "gauge"
    labels: ["session_id"]
    max: 1.0
    
  # Custom business metrics
  seo_page_generated:
    type: "counter"
    labels: ["quality_score_range"]
    
  shrimp_rate:  # the "shrimp rate" metric!
    type: "gauge"
    labels: ["task_type"]
    description: "Rate of correctly completed tasks"
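
The `buckets` list on `agent_execution_duration` is easier to reason about once you see how observations land in it. The sketch below is a simplified stand-in for a histogram with inclusive upper bounds and an overflow bucket; the `Histogram` class is an assumption for illustration and is not part of OpenClaw.

```python
# Hypothetical histogram sketch showing how bucket boundaries are applied.
from bisect import bisect_left

# Upper bounds in seconds, copied from the agent_execution_duration config.
BUCKETS = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]


class Histogram:
    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One counter per bound, plus a final overflow slot (+Inf).
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the first bound >= value, so upper bounds are inclusive.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, count_of_observations <= bound) pairs."""
        running = 0
        result = []
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            result.append((bound, running))
        return result


if __name__ == "__main__":
    hist = Histogram(BUCKETS)
    for seconds in (0.05, 0.5, 3.0, 45.0):
        hist.observe(seconds)
    for bound, count in hist.cumulative():
        print(f"le={bound}: {count}")
```

Anything slower than 30 seconds falls into the overflow bucket, so if many Agent runs take longer than that, the largest configured bucket is probably too small to say anything useful about tail latency.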

3️⃣ Activity Tab observability

The Activity Tab, introduced in OpenClaw v2026.5.25, is the built-in observability UI:

# activity-tab-config.yaml
activity_tab:
  enabled: true
  retention: "7d"
  
  views:
    - name: "agent_timeline"
      description: "Agent execution timeline"
      default: true

    - name: "cost_analytics"
      description: "Cost analytics"
      charts: ["cost_per_hour", "cost_per_agent", "cost_trend"]

    - name: "skill_health"
      description: "Skill health"
      metrics: ["error_rate", "latency_p99", "throughput"]
      
  # Live push updates
  live_updates:
    enabled: true
    websocket: true
    update_interval: 5000  # milliseconds (5 s)

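💡 Miaoquai in practice: the Activity Tab has saved me more times than I can count. At 3 a.m. I saw one Agent's agent_cost_per_task suddenly spike, stepped in manually, and found it was calling an API in an infinite loop. With the Activity Tab in place, you can sleep soundly even at 1:42 a.m. 👍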

4️⃣ Log aggregation and analysis

# logging-config.yaml
logging:
  level: "info"
  format: "json"
  
  outputs:
    - type: "file"
      path: "/var/log/openclaw/agent.log"
      rotation: "100MB"
      max_files: 10
      
    - type: "loki"
      endpoint: "http://loki:3100"
      labels:
        app: "openclaw"
        env: "production"
        
  # Structured log fields
  fields:
    - "session_id"
    - "agent_name"
    - "task_id"
    - "cost_usd"
    - "tokens_used"
    - "error_code"  # makes troubleshooting easier
  # Redact sensitive values
  redact:
    - "api_key"
    - "password"
    - "token"
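
To show what `format: "json"` combined with `redact` amounts to, here is a rough equivalent built only on Python's standard `logging` and `json` modules. `JsonFormatter`, `redact`, `build_logger`, and the `fields` key passed through `extra` are assumptions made for this sketch, not OpenClaw APIs.

```python
# Hypothetical sketch: JSON log lines with sensitive keys masked.
import json
import logging

REDACT_KEYS = {"api_key", "password", "token"}  # same keys as the redact list


def redact(fields: dict) -> dict:
    """Mask values whose key exactly matches a redacted name."""
    return {k: ("***" if k in REDACT_KEYS else v) for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(redact(getattr(record, "fields", {})))
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str = "openclaw-demo") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "task finished",
        extra={"fields": {"session_id": "s-1", "tokens_used": 1200, "api_key": "sk-secret"}},
    )
```

This sketch matches keys exactly, so `tokens_used` is logged while `token` is masked. Whether a real redaction rule matches exact keys or substrings is worth checking, because a substring match on `token` would also hide `tokens_used`.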

5️⃣ Alert configuration in practice

# alerts-config.yaml
alerts:
  # Cost alerts
  - name: "High Hourly Cost"
    condition: "sum(increase(cost_usd[1h])) > 50"  # assuming PromQL-style semantics: increase() totals the hour, rate() is per second
    severity: "warning"
    notify: ["slack:#alerts", "email:ops@miaoquai.com"]
    
  - name: "Budget Exhausted"
    condition: "budget_remaining < 0.1"
    severity: "critical"
    notify: ["pagerduty", "sms:+1234567890"]
    auto_action: "pause_new_tasks"
    
  # Quality alerts
  - name: "Low Shrimp Rate"
    condition: "shrimp_rate < 0.85"
    severity: "warning"
    notify: ["slack:#quality"]
    
  # System alerts
  - name: "Agent Stuck"
    condition: "agent_last_heartbeat > 10m"
    severity: "critical"
    auto_action: "restart_agent"
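
The threshold rules above can be read as simple predicates over the latest metric values. The sketch below evaluates three of them in plain Python; `AlertRule`, `evaluate`, and the `seconds_since_heartbeat` metric name are hypothetical, and the real rule engine will parse the condition strings rather than use lambdas.

```python
# Hypothetical sketch of threshold-based alert evaluation.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class AlertRule:
    name: str
    severity: str
    fires: Callable[[Dict[str, float]], bool]


RULES = [
    AlertRule("Budget Exhausted", "critical", lambda m: m["budget_remaining"] < 0.1),
    AlertRule("Low Shrimp Rate", "warning", lambda m: m["shrimp_rate"] < 0.85),
    # 10 minutes without a heartbeat, expressed in seconds.
    AlertRule("Agent Stuck", "critical", lambda m: m["seconds_since_heartbeat"] > 600),
]


def evaluate(metrics: Dict[str, float], rules: List[AlertRule] = RULES) -> List[Tuple[str, str]]:
    """Return (severity, name) for every rule whose condition holds."""
    fired = []
    for rule in rules:
        try:
            if rule.fires(metrics):
                fired.append((rule.severity, rule.name))
        except KeyError:
            # The metric has not been reported yet; skip the rule instead of crashing.
            continue
    return fired


if __name__ == "__main__":
    snapshot = {"budget_remaining": 0.05, "shrimp_rate": 0.9, "seconds_since_heartbeat": 30}
    for severity, name in evaluate(snapshot):
        print(f"[{severity}] {name}")
```

Skipping rules with missing metrics keeps the evaluator alive, but it also means a silent Agent produces no data and therefore no threshold alert, which is why a heartbeat rule such as "Agent Stuck" is a useful companion to the others.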

6️⃣ Distributed tracing example

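# Add tracing inside Agent code.
# web_search, write_content, seo_optimize and calculate_density are the
# pipeline's own helpers and are assumed to be defined elsewhere.
from openclaw.observability import trace, span

@span(name="content_generation_pipeline")
def generate_seo_content(keyword):
    # Each stage gets its own child span. The variable names differ from the
    # imported `span` decorator so it is not shadowed inside the function.
    with trace.start_as_current_span("research") as research_span:
        research_span.set_attribute("keyword", keyword)
        results = web_search(keyword)
        research_span.set_attribute("results_count", len(results))

    with trace.start_as_current_span("content_writing") as writing_span:
        content = write_content(results)
        writing_span.set_attribute("word_count", len(content.split()))

    with trace.start_as_current_span("seo_optimization") as seo_span:
        optimized = seo_optimize(content)
        seo_span.set_attribute("keyword_density", calculate_density(optimized))

    return optimized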
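
If you want to see the parent/child structure that nested spans produce without installing anything, the following standard-library sketch records each span's name, parent, attributes, and duration. `start_span` and `FINISHED` are illustrative stand-ins, not the `openclaw.observability` API.

```python
# Hypothetical stand-in for nested spans, using only the standard library.
import time
from contextlib import contextmanager
from contextvars import ContextVar

_current = ContextVar("current_span", default=None)  # name of the active span
FINISHED = []  # finished span records, in completion order


@contextmanager
def start_span(name):
    record = {"name": name, "parent": _current.get(), "attributes": {}}
    token = _current.set(name)
    started = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - started
        _current.reset(token)  # restore the enclosing span as current
        FINISHED.append(record)


if __name__ == "__main__":
    with start_span("content_generation_pipeline"):
        with start_span("research") as research:
            research["attributes"]["keyword"] = "observability"
        with start_span("content_writing"):
            pass
    for rec in FINISHED:
        print(rec["name"], "<-", rec["parent"], f"{rec['duration_s']:.6f}s")
```

Child spans finish before their parent, so the pipeline span appears last in `FINISHED`. A tracing backend uses the same parent links to draw the waterfall view of a request.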

📊 Observability maturity model

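Level    | Capability           | Status at miaoquai.com
Level 1  | Basic logging        | ✅ Implemented
Level 2  | Metrics monitoring   | ✅ Implemented
Level 3  | Distributed tracing  | ✅ Implemented
Level 4  | Intelligent alerting | ✅ Implemented
Level 5  | Automated remediation | 🔄 In progress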

💡 Miaoquai summary

It is 1:42 a.m. With observability in place, you can finally sleep soundly instead of staring blankly at logs 😴



© 2026 妙趣AI (miaoquai.com) 🤖