OpenClaw OpenTelemetry Observability Tutorial
Making AI systems transparent: distributed tracing, metrics monitoring, and log export in one place
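🎯 Why Observability?
There is a particular kind of pain called "the system is broken, but nobody knows where." AI Agent call chains are complex, and without monitoring you are groping in the dark.
"At 3 a.m. on a Friday, AI responses in production slowed down. After digging through logs for ages, I found a plugin quietly retrying. That was when I knew I needed full observability."
OpenClaw OpenTelemetry gives you:
- 🔍 Distributed tracing - follow the complete path of every Agent call
- 📊 Metrics monitoring - real-time statistics for token usage, response time, and error rate
- 📝 Log export - structured logs exported to ELK, Loki, and similar systems
- 🎯 Prometheus integration - native support for Prometheus metric scraping
- 🧪 QA-Lab verification - v2026.5.22 adds an OTEL smoke test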
🚀 Quick Start
1. Enable OpenTelemetry
// ~/.openclaw/config.json
{
  "observability": {
    "enabled": true,
    "provider": "opentelemetry",
    "config": {
      "serviceName": "openclaw-gateway",
      "serviceVersion": "2026.5.22",
      "exporter": {
        "type": "otlp",
        "endpoint": "http://localhost:4317",
        "protocol": "grpc"
      },
      "instrumentation": {
        "http": true,
        "grpc": true,
        "database": true
      }
    }
  }
}
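A common mistake in the exporter block is pairing the wrong port with the protocol. By convention OTLP uses port 4317 for gRPC and 4318 for HTTP. The sketch below checks an exporter block against that convention. `ExporterConfig` and `checkExporter` are hypothetical names for illustration and are not part of OpenClaw; the field names simply mirror the config above.

```typescript
// Hypothetical helper, not an OpenClaw API: sanity-check an OTLP exporter block
// against the conventional ports (4317 for gRPC, 4318 for HTTP).
type OtlpProtocol = "grpc" | "http";

interface ExporterConfig {
  type: string;
  endpoint: string;
  protocol: OtlpProtocol;
}

const CONVENTIONAL_PORTS: Record<OtlpProtocol, number> = { grpc: 4317, http: 4318 };

function checkExporter(cfg: ExporterConfig): string[] {
  let url: URL;
  try {
    url = new URL(cfg.endpoint);
  } catch {
    return [`endpoint is not a valid URL: ${cfg.endpoint}`];
  }

  const warnings: string[] = [];
  const otherProtocol: OtlpProtocol = cfg.protocol === "grpc" ? "http" : "grpc";

  if (url.port === "") {
    // No explicit port means the scheme default (80/443), which is rarely what an OTLP receiver uses.
    warnings.push(`endpoint has no explicit port; OTLP/${cfg.protocol} usually listens on ${CONVENTIONAL_PORTS[cfg.protocol]}`);
  } else if (Number(url.port) === CONVENTIONAL_PORTS[otherProtocol]) {
    warnings.push(`port ${url.port} is the conventional OTLP/${otherProtocol} port, but protocol is "${cfg.protocol}"`);
  }
  return warnings;
}

const exporterWarnings = checkExporter({
  type: "otlp",
  endpoint: "http://localhost:4317",
  protocol: "grpc",
});
console.log(exporterWarnings.length === 0 ? "exporter config looks consistent" : exporterWarnings.join("\n"));
```

A custom port is not an error, so the check only warns when the port is the conventional one for the other protocol or is missing entirely.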
2. Configure Prometheus Metrics
v2026.5.22 adds first-class Prometheus support and an observability smoke alias:
{
  "observability": {
    "metrics": {
      "enabled": true,
      "provider": "prometheus",
      "config": {
        "endpoint": "/metrics", // the Gateway exposes a /metrics endpoint
        "port": 9090,
        "collectDefaultMetrics": true,
        "prefix": "openclaw_"
      }
    }
  }
}
# Verify the Prometheus metrics
curl http://localhost:9090/metrics
# Sample output:
# openclaw_agent_requests_total{agent="main"} 1247
# openclaw_agent_latency_seconds_bucket{le="0.5"} 890
# openclaw_token_usage_total{model="gpt-4"} 125000
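Each line of the `/metrics` output follows the Prometheus text exposition format: a metric name, an optional label set in braces, and a numeric value. As a rough sketch of how a script could read those lines, the parser below handles the common case only. It does not unescape label values or handle timestamps, and `parseSample` and `parseMetrics` are illustrative names, not part of any library.

```typescript
// Minimal sketch of a Prometheus text-format line parser (common case only).
interface MetricSample {
  metric: string;
  labels: Record<string, string>;
  value: number;
}

function parseSample(line: string): MetricSample | null {
  const trimmed = line.trim();
  // Blank lines and "#" lines (comments, HELP, TYPE) carry no sample.
  if (trimmed === "" || trimmed.startsWith("#")) return null;

  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(trimmed);
  if (!match) return null;

  const labels: Record<string, string> = {};
  const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
  const labelText = match[2] || "";
  let m: RegExpExecArray | null;
  while ((m = labelRe.exec(labelText)) !== null) {
    labels[m[1]] = m[2];
  }
  return { metric: match[1], labels, value: Number(match[3]) };
}

function parseMetrics(text: string): MetricSample[] {
  const samples: MetricSample[] = [];
  for (const line of text.split("\n")) {
    const sample = parseSample(line);
    if (sample !== null) samples.push(sample);
  }
  return samples;
}

// The three sample lines from the output above, as they would appear in a raw response.
const metricsText = [
  "# sample values copied from the tutorial output",
  'openclaw_agent_requests_total{agent="main"} 1247',
  'openclaw_agent_latency_seconds_bucket{le="0.5"} 890',
  'openclaw_token_usage_total{model="gpt-4"} 125000',
].join("\n");

const samples = parseMetrics(metricsText);
for (const s of samples) {
  console.log(s.metric, JSON.stringify(s.labels), s.value);
}
```

For anything beyond a quick check, querying Prometheus itself is the better route; this only shows what the raw lines mean.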
3. Configure Log Export
{
  "observability": {
    "logs": {
      "enabled": true,
      "exporter": {
        "type": "otlp",
        "endpoint": "http://loki:4317"
      },
      "include": ["info", "warn", "error"],
      "excludeFields": ["apiKey", "password"] // filter out sensitive fields
    }
  }
}
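To make the effect of `excludeFields` concrete, here is a minimal sketch of dropping named fields from a structured log record before export. It assumes exact, case-sensitive key matching and that excluded fields are removed rather than masked. How OpenClaw actually applies the option is not specified here, and `dropFields` is a hypothetical helper.

```typescript
// Hypothetical sketch: recursively drop excluded keys from a structured log record.
type LogValue = string | number | boolean | null | LogValue[] | { [key: string]: LogValue };

function dropFields(value: LogValue, excluded: Set<string>): LogValue {
  if (Array.isArray(value)) {
    return value.map((item) => dropFields(item, excluded));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: LogValue } = {};
    for (const key of Object.keys(value)) {
      if (excluded.has(key)) continue; // assumption: excluded fields are removed entirely
      out[key] = dropFields(value[key], excluded);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const rawRecord: LogValue = {
  level: "info",
  message: "tool call finished",
  apiKey: "sk-example",
  context: { user: "alice", password: "hunter2" },
};

const cleanedRecord = dropFields(rawRecord, new Set(["apiKey", "password"]));
console.log(JSON.stringify(cleanedRecord));
```

Filtering by key name only catches secrets stored under known keys; a secret embedded in a free-text message would pass through untouched.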
💻 Practical Examples
Example 1: Tracing an Agent Call Chain
// OpenClaw automatically generates a trace for every Agent call
// No manual instrumentation is required
openclaw agent "帮我分析这个GitHub仓库的代码质量"
// Structure of the automatically generated trace:
// Trace ID: 4f8b2c3d1e5a4f6b8c9d0e1f2a3b4c5
//
// Span 1: agent.execute (root)
//   ├─ Span 2: tool.web_search
//   │   ├─ Span 3: http.request (to Brave API)
//   │   └─ Span 4: response.parse
//   ├─ Span 5: tool.read_file
//   │   └─ Span 6: fs.read
//   └─ Span 7: llm.completion (to GPT-4)
//       └─ Span 8: http.request (to OpenAI API)
// View the full call graph in Jaeger or Zipkin
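Exported spans arrive as a flat list; each span carries its own ID and the ID of its parent, and the backend rebuilds the tree from those links. The sketch below shows that reconstruction for the eight spans listed above. The short IDs such as `s1` are placeholders (real span IDs are 16 hex characters), and `SpanRecord` and `renderTree` are illustrative names.

```typescript
// Illustrative sketch: rebuild a span tree from flat parent links.
interface SpanRecord {
  spanId: string;
  parentSpanId: string | null; // null marks the root span
  operation: string;
}

function renderTree(spans: SpanRecord[]): string[] {
  // Group spans by parent so each node's children can be looked up directly.
  const childrenOf = new Map<string | null, SpanRecord[]>();
  for (const span of spans) {
    const siblings = childrenOf.get(span.parentSpanId) ?? [];
    siblings.push(span);
    childrenOf.set(span.parentSpanId, siblings);
  }

  const lines: string[] = [];
  const visit = (parent: string | null, depth: number): void => {
    for (const span of childrenOf.get(parent) ?? []) {
      lines.push("  ".repeat(depth) + span.operation);
      visit(span.spanId, depth + 1);
    }
  };
  visit(null, 0);
  return lines;
}

const flatSpans: SpanRecord[] = [
  { spanId: "s1", parentSpanId: null, operation: "agent.execute" },
  { spanId: "s2", parentSpanId: "s1", operation: "tool.web_search" },
  { spanId: "s3", parentSpanId: "s2", operation: "http.request" },
  { spanId: "s4", parentSpanId: "s2", operation: "response.parse" },
  { spanId: "s5", parentSpanId: "s1", operation: "tool.read_file" },
  { spanId: "s6", parentSpanId: "s5", operation: "fs.read" },
  { spanId: "s7", parentSpanId: "s1", operation: "llm.completion" },
  { spanId: "s8", parentSpanId: "s7", operation: "http.request" },
];

const treeLines = renderTree(flatSpans);
console.log(treeLines.join("\n"));
```

Spans whose parent never arrives (for example, because it was dropped) would be silently omitted by this sketch; real backends usually surface them as orphans.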
Example 2: Custom Metrics
// Add custom metrics inside a plugin
import { metrics } from '@openclaw/observability';

class MyPlugin {
  private counter = metrics.createCounter('my_plugin_operations_total', {
    description: 'Total operations executed by my plugin'
  });
  private histogram = metrics.createHistogram('my_plugin_latency_seconds', {
    description: 'Operation latency in seconds',
    boundaries: [0.1, 0.5, 1, 2, 5]
  });

  async executeTool(params: unknown): Promise<void> {
    // performance.now() is monotonic, so system clock adjustments do not skew the measurement
    const startTime = performance.now();
    try {
      // Business logic
      await this.doWork(params);
      this.counter.add(1, { status: 'success' });
    } catch (error) {
      this.counter.add(1, { status: 'error' });
      throw error;
    } finally {
      const latency = (performance.now() - startTime) / 1000; // milliseconds to seconds
      this.histogram.record(latency);
    }
  }

  // Placeholder for the plugin's real work
  private async doWork(params: unknown): Promise<void> {}
}
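The `boundaries` array decides how recorded latencies are grouped. When a histogram is exposed in Prometheus format, each `le` bucket is cumulative: it counts every observation less than or equal to that boundary, and `+Inf` counts all of them. The sketch below reproduces that arithmetic for a handful of made-up latencies; `cumulativeBuckets` is an illustrative helper, not an OpenClaw or OpenTelemetry API.

```typescript
// Illustrative only: how Prometheus-style cumulative histogram buckets are counted.
interface Bucket {
  le: string; // inclusive upper bound, as it appears in the le="..." label
  count: number;
}

function cumulativeBuckets(boundaries: number[], observations: number[]): Bucket[] {
  const sorted = boundaries.slice().sort((a, b) => a - b);
  const buckets: Bucket[] = sorted.map((bound) => ({
    le: String(bound),
    count: observations.filter((value) => value <= bound).length,
  }));
  // The +Inf bucket always equals the total number of observations.
  buckets.push({ le: "+Inf", count: observations.length });
  return buckets;
}

// Hypothetical latencies in seconds, chosen to land in different buckets.
const observedLatencies = [0.05, 0.3, 0.3, 0.8, 1.5, 7];
const latencyBuckets = cumulativeBuckets([0.1, 0.5, 1, 2, 5], observedLatencies);

for (const bucket of latencyBuckets) {
  console.log(`my_plugin_latency_seconds_bucket{le="${bucket.le}"} ${bucket.count}`);
}
```

With these samples the `le="0.5"` bucket holds 3 observations and the 7-second outlier appears only in `+Inf`, which is a hint that the largest boundary may be too low for the workload.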
Example 3: QA-Lab OpenTelemetry Smoke Test
v2026.5.22 extends the OpenTelemetry smoke harness to verify trace, metric, and log export:
# Run the OTEL smoke test
openclaw qa:smoke --suite opentelemetry
# Sample output:
# 🧪 OpenTelemetry Smoke Test
# ========================
# ✓ Trace export: test trace exported to the OTLP collector
# ✓ Metric export: test counter/histogram metrics exported
# ✓ Log export: structured test logs exported
# ✓ Prometheus endpoint: /metrics endpoint reachable
# ✓ First-class smoke alias: `openclaw qa:otel` available
#
# All tests passed! (5/5)
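📊 Observability Architecture
| Component | Description | Recommended tools |
|---|---|---|
| Trace collection | Collects distributed tracing data | Jaeger, Zipkin, Tempo |
| Metrics storage | Stores and queries metric data | Prometheus, VictoriaMetrics |
| Log aggregation | Centralized log storage and search | Loki, Elasticsearch, Splunk |
| Visualization | Visualizes metrics and traces | Grafana, Kibana |
| Collector | OTEL Collector as a single ingestion point | opentelemetry-collector |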
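🔧 OpenTelemetry Collector Configuration
Using the OTEL Collector as the single data collection point is recommended:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Recent Collector releases no longer ship the dedicated `loki` and `jaeger`
  # exporters; both backends accept OTLP directly, so OTLP exporters are used here.
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
service:
  pipelines:
    # memory_limiter goes first so it can apply back-pressure before batching
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]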
🎓 Best Practices
1. Filter Sensitive Information
💡 Tip: v2026.5.22 sanitizes OpenTelemetry log bodies in diagnostic telemetry and sanitizes bounded queue channel prefixes in Prometheus labels. It also scrubs session keys from OpenTelemetry and Prometheus labels.
{
  "observability": {
    "privacy": {
      "scrubFields": ["apiKey", "password", "token", "secret"],
      "scrubHeaders": ["Authorization", "X-Api-Key"],
      "hashSessionKeys": true // hash session keys before export
    }
  }
}
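Hashing a session key keeps telemetry for one session correlated without exposing the key, and header names have to be matched case-insensitively because HTTP header names are case-insensitive. The sketch below illustrates both ideas with Node's built-in `crypto` module. The function names, the truncation to 16 hex characters, and the `[REDACTED]` placeholder are assumptions for illustration, not OpenClaw's actual behavior.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: replace a session key with a stable, non-reversible label.
// The salt parameter and the 16-character truncation are illustrative choices.
function hashSessionKey(sessionKey: string, salt: string = ""): string {
  return createHash("sha256").update(salt + sessionKey).digest("hex").slice(0, 16);
}

// Hypothetical sketch: mask header values whose names appear in the scrub list.
function scrubHeaders(
  headers: Record<string, string>,
  scrubList: string[],
): Record<string, string> {
  const blocked = new Set(scrubList.map((header) => header.toLowerCase()));
  const out: Record<string, string> = {};
  for (const header of Object.keys(headers)) {
    // Compare in lower case: "authorization" and "Authorization" are the same header.
    out[header] = blocked.has(header.toLowerCase()) ? "[REDACTED]" : headers[header];
  }
  return out;
}

const sessionLabel = hashSessionKey("session-key-example", "per-deployment-salt");
const safeHeaders = scrubHeaders(
  { authorization: "Bearer abc123", Accept: "application/json" },
  ["Authorization", "X-Api-Key"],
);

console.log(sessionLabel.length, JSON.stringify(safeHeaders));
```

A per-deployment salt makes it harder to confirm a guessed key by hashing it offline, at the cost of labels that no longer match across deployments.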
2. Performance Tuning
{
  "observability": {
    "performance": {
      "sampleRate": 0.1, // 10% sampling rate (production)
      "maxQueueSize": 1000,
      "exportTimeoutMs": 5000,
      "disableInstrumentation": ["low-value-module"] // disable low-value instrumentation
    }
  }
}
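A sampling rate of 0.1 is usually applied per trace rather than per span, so that a trace is either kept whole or dropped whole. One common way to achieve this is to derive the decision from the trace ID itself, which makes every service reach the same verdict without coordination. The sketch below is a simplified illustration of that idea; it is not the exact algorithm used by OpenClaw or by the OpenTelemetry SDK samplers, and `shouldSample` is a hypothetical name.

```typescript
// Simplified sketch of trace-ID-based ratio sampling.
// Assumes trace IDs are hex strings whose low bits are uniformly distributed.
function shouldSample(traceId: string, sampleRate: number): boolean {
  if (sampleRate >= 1) return true;
  if (sampleRate <= 0) return false;

  // Interpret the last 8 hex characters as an unsigned 32-bit number.
  const low32 = parseInt(traceId.slice(-8), 16);
  if (Number.isNaN(low32)) return false; // malformed ID: drop rather than guess

  // Keep the trace when the number falls in the lowest `sampleRate` share of the 32-bit range.
  return low32 < sampleRate * 0x100000000;
}

// Two hypothetical trace IDs at opposite ends of the range.
const keptTrace = shouldSample("00000000000000000000000000000001", 0.1);
const droppedTrace = shouldSample("000000000000000000000000ffffffff", 0.1);
console.log(keptTrace, droppedTrace);
```

Because the decision depends only on the trace ID and the rate, the same trace gets the same answer on every call, which is what keeps sampled traces complete.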
3. Integrating with Grafana
# docker-compose.yml: a quick way to bring up the monitoring stack
version: '3'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    # Point the Collector at the mounted file; otherwise the image loads its built-in default config
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"
      - "8889:8889"
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
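📚 Related Resources
🎯 Fun tip: The most powerful thing about OpenTelemetry is its unified standard. Once you are on OTEL, you can switch backends (Jaeger, Zipkin, Tempo) freely without changing code. The QA-Lab smoke test in v2026.5.22 lets you adopt it with confidence!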