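📖 Feature Overview
At 4 a.m. I was staring at the "Unknown Error" an Agent had returned, and a question hit me: what had it actually been through? Where did it get stuck? Why did it give up? Questions like these are a late-night fog: you can neither see through it nor get a grip on it.
Observability is the lamp that cuts through that fog. It makes every action, decision, and error of an Agent transparent and traceable. This is not the same as monitoring: monitoring can only tell you that something went wrong, whereas observability tells you why it went wrong.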
The Three Pillars
| Pillar | Purpose | Typical tools |
|---|---|---|
| Metrics | Numeric measures of system health | Prometheus, Grafana |
| Logs | Detailed records of events | ELK Stack, Loki |
| Traces | The full path of a request | Jaeger, Zipkin |
📈 Agent Dashboard Preview
| Success rate | Average latency | QPS | Active Agents |
|---|---|---|---|
| 98.5% | 1.2s | 847 | 12 |
🚀 Usage
1. Metrics Collection
📊 Key Metrics
# OpenClaw metrics collection config
observability:
  metrics:
    enabled: true
    exporter: "prometheus"
    port: 9090
    # Core metrics
    collect:
      # Agent health metrics
      - name: "agent_requests_total"
        type: "counter"
        labels: ["agent_name", "status"]
      - name: "agent_latency_seconds"
        type: "histogram"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
      - name: "agent_errors_total"
        type: "counter"
        labels: ["agent_name", "error_type"]
      # Cost metrics
      - name: "api_cost_dollars"
        type: "counter"
        labels: ["model", "provider"]
      - name: "tokens_used_total"
        type: "counter"
        labels: ["model", "type"]
      # Resource metrics
      - name: "memory_usage_bytes"
        type: "gauge"
      - name: "active_sessions"
        type: "gauge"
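To make the `counter` and `histogram` types above concrete, here is a minimal standard-library sketch of how such metrics behave. It is not OpenClaw's implementation: the `Counter` and `Histogram` classes and the `demo_agent` label value are hypothetical, and the histogram follows the cumulative "less than or equal to the bound" convention used by Prometheus-style histograms. Only the bucket bounds come from the config.

```python
import bisect
from collections import defaultdict

# Bucket upper bounds copied from the agent_latency_seconds entry in the config.
BUCKETS = [0.1, 0.5, 1, 2, 5, 10]


class Counter:
    """Monotonic counter keyed by a tuple of label values (hypothetical helper)."""

    def __init__(self):
        self.values = defaultdict(float)

    def inc(self, labels, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.values[labels] += amount


class Histogram:
    """Histogram that reports cumulative bucket counts (hypothetical helper)."""

    def __init__(self, buckets):
        self.bounds = sorted(buckets)
        # One slot per bound plus a final +Inf slot for larger observations.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, number of observations <= upper_bound) pairs."""
        out, running = [], 0
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            running += n
            out.append((bound, running))
        return out


requests = Counter()
latency = Histogram(BUCKETS)
for seconds, status in [(0.08, "ok"), (0.5, "ok"), (1.2, "ok"), (12.0, "error")]:
    requests.inc(("demo_agent", status))  # labels: (agent_name, status)
    latency.observe(seconds)

print(latency.cumulative())
```

The 12-second request falls past the last bound and is only visible in the `+Inf` bucket, which is one reason to pick bucket bounds that bracket the latencies you actually expect.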
2. Logging
More logs are not better logs: with too many, the key information gets buried. Structured logging is the way to go.
logging:
  enabled: true
  level: "info"  # debug/info/warn/error
  # Structured logging
  format: "json"
  # Log fields
  fields:
    - timestamp
    - agent_name
    - session_id
    - level
    - message
    - context
  # Per-level settings
  levels:
    debug:
      enabled: false  # off in production
    info:
      enabled: true
    warn:
      enabled: true
      alert: true
    error:
      enabled: true
      alert: immediate
  # Log storage
  storage:
    type: "loki"
    endpoint: "http://loki:3100"
    retention: "7d"
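As an illustration of what one structured log line with these six fields could look like, the sketch below uses only Python's standard `logging` and `json` modules. It is an assumption-laden example rather than OpenClaw's logger: `JsonFormatter`, `build_logger`, and the sample values (`demo_agent`, `s-001`) are made up, and shipping the lines to Loki is left out entirely.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python's level numbers onto the level names used in the config.
LEVEL_NAMES = {
    logging.DEBUG: "debug",
    logging.INFO: "info",
    logging.WARNING: "warn",
    logging.ERROR: "error",
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the configured fields."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "agent_name": getattr(record, "agent_name", None),
            "session_id": getattr(record, "session_id", None),
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name="agent", level=logging.INFO):
    """Create a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.warning(
    "skill call timed out",
    # Values passed via `extra` become attributes on the log record.
    extra={
        "agent_name": "demo_agent",
        "session_id": "s-001",
        "context": {"skill": "search", "timeout_s": 5},
    },
)
```

Because every line is a single JSON object, a query on `session_id` can pull together everything one session did without pattern-matching free text.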
3. Distributed Tracing
A single request may pass through several Agents; tracing makes the whole path visible.
tracing:
  enabled: true
  exporter: "jaeger"
  # Trace settings
  sampling:
    rate: 0.1        # sample 10% of requests
    error_rate: 1.0  # trace 100% of failed requests
  # Span settings
  spans:
    - name: "agent_execution"
      attributes: ["agent_name", "model", "duration"]
    - name: "skill_call"
      attributes: ["skill_name", "parameters"]
    - name: "api_call"
      attributes: ["provider", "model", "tokens"]
  # Context propagation
  propagation: "w3c_trace_context"
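The two ideas in this config, sampling and context propagation, can be sketched in a few lines of standard-library Python. This is a rough illustration, not how OpenClaw or Jaeger implement them: the function names are hypothetical, and the sampling function assumes the keep/drop decision is made once the request's outcome is known, since that is the only point at which "keep every failed request" can be honored. The header layout follows the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`).

```python
import random
import re
import secrets

# traceparent, version 00: 32 hex chars of trace id, 16 of span id, 2 of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def should_keep(is_error, rate=0.1, error_rate=1.0, rng=random.random):
    """Keep failed requests with probability error_rate, others with rate."""
    threshold = error_rate if is_error else rate
    # rng() is in [0.0, 1.0), so a threshold of 1.0 always keeps the trace.
    return rng() < threshold


def new_traceparent(sampled):
    """Start a new trace at the first Agent that sees the request."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent_header):
    """Build the header for a downstream call: same trace id, new span id."""
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


root = new_traceparent(sampled=True)
downstream = child_traceparent(root)
print(root)
print(downstream)
```

The shared trace id is what lets a backend stitch the spans from several Agents back into one request path.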
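💡 Best Practices
🎯 Observability design principles:
- White-box first: look at the process, not just the result
- Structured logs: JSON format, easy to search and analyze
- Trace the critical path: every important step gets its own Span
- Visible cost: monitor API call costs in real time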
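For the cost principle, a small sketch of how per-call token usage could feed the `tokens_used_total` and `api_cost_dollars` counters. Everything specific here is an assumption: the prices are placeholders invented for the example (not any provider's real rates), the `input`/`output` values for the `type` label are a guess, and `record_usage` is a hypothetical helper.

```python
from collections import defaultdict

# PLACEHOLDER prices in USD per 1,000 tokens, invented for this example only.
PRICE_PER_1K = {
    ("example-model", "input"): 0.001,
    ("example-model", "output"): 0.002,
}

tokens_used_total = defaultdict(int)   # keyed by (model, type)
api_cost_dollars = defaultdict(float)  # keyed by (model, provider)


def record_usage(model, provider, input_tokens, output_tokens):
    """Add one API call's tokens and estimated cost to the running counters."""
    cost = 0.0
    for kind, count in (("input", input_tokens), ("output", output_tokens)):
        tokens_used_total[(model, kind)] += count
        cost += count / 1000 * PRICE_PER_1K[(model, kind)]
    api_cost_dollars[(model, provider)] += cost
    return cost


call_cost = record_usage("example-model", "example-provider", 2000, 500)
print(f"this call: ${call_cost:.4f}")
```

Recording cost at the moment of each call, rather than reconciling a bill later, is what makes a spike visible while it is still happening.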
Alerting
🚨 Example alert rules
alerting:
  enabled: true
  rules:
    # Error-rate alert
    - name: "high_error_rate"
      condition: "agent_errors_total / agent_requests_total > 0.05"
      severity: "warning"
      duration: "5m"
      channels: ["feishu", "slack"]
    # Latency alert
    - name: "high_latency"
      condition: "agent_latency_seconds > 5"
      severity: "warning"
      duration: "2m"
    # Cost alert
    - name: "cost_spike"
      condition: "api_cost_dollars increase > 10"
      severity: "critical"
      duration: "1h"
    # Agent status alert (note: active_agents is not among the metrics listed in the collection config)
    - name: "agent_down"
      condition: "active_agents == 0"
      severity: "critical"
      immediate: true
  # Alert channels
  channels:
    feishu:
      webhook: "${FEISHU_WEBHOOK}"
    slack:
      channel: "#alerts"
    email:
      recipients: ["ops@miaoquai.com"]
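How OpenClaw evaluates these condition strings is not shown here, so the following is only a sketch of one reasonable reading of `high_error_rate`: compute the error ratio from counter increases between successive samples (not from lifetime totals), and fire only after the ratio has stayed above the threshold for the whole `duration`. `Sample`, `windowed_error_ratio`, and `fires` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float         # sample time in seconds
    errors: float    # agent_errors_total at time t
    requests: float  # agent_requests_total at time t


def windowed_error_ratio(older, newer):
    """Error ratio over the interval between two counter samples."""
    d_requests = newer.requests - older.requests
    d_errors = newer.errors - older.errors
    if d_requests <= 0 or d_errors < 0:
        return 0.0  # no traffic, or a counter reset: nothing to judge
    return d_errors / d_requests


def fires(samples, threshold=0.05, hold_seconds=300):
    """True once the ratio has stayed above threshold for hold_seconds."""
    breach_start = None
    for older, newer in zip(samples, samples[1:]):
        if windowed_error_ratio(older, newer) > threshold:
            if breach_start is None:
                breach_start = older.t
            if newer.t - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # condition cleared, restart the clock
    return False


# A long healthy history followed by five bad minutes (10% errors per minute).
history = [Sample(0, 100, 100_000)]
for minute in range(1, 6):
    history.append(Sample(60 * minute, 100 + 10 * minute, 100_000 + 100 * minute))

lifetime_ratio = history[-1].errors / history[-1].requests
print(fires(history), round(lifetime_ratio, 4))
```

In this run the lifetime ratio stays far below 5% even though one request in ten has been failing for five minutes, which is why the sketch works on increases over a window.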
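⚠️ Pitfalls to avoid:
- Don't collect too many metrics; focus on the key ones (KPIs)
- Keep log levels layered, and leave debug off in production
- Grade your alerts so that not every small problem interrupts you
- Don't set the trace sampling rate too high, or storage costs will explode
🔧 Complete Configuration Example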
# OpenClaw complete observability config
observability:
  # Metrics
  metrics:
    enabled: true
    exporter: "prometheus"
    port: 9090
    collect:
      - agent_requests_total
      - agent_latency_seconds
      - agent_errors_total
      - api_cost_dollars
      - tokens_used_total
  # Logging
  logging:
    enabled: true
    level: "info"
    format: "json"
    storage:
      type: "loki"
      retention: "7d"
  # Tracing
  tracing:
    enabled: true
    exporter: "jaeger"
    sampling:
      rate: 0.1
      error_rate: 1.0
  # Alerting
  alerting:
    enabled: true
    rules:
      - high_error_rate
      - high_latency
      - cost_spike
    channels:
      - feishu
      - slack
  # Dashboard
  dashboard:
    enabled: true
    provider: "grafana"
    panels:
      - agent_health
      - latency_heatmap
      - cost_trend
      - error_breakdown