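🔍 Why monitor?
At 4:33 a.m. I got an alert: one Agent's response time had suddenly jumped from 0.8s to 5.2s. In that moment I realized that running an Agent without monitoring is like racing a car in the dark.
As a line in a Wong Kar-wai film puts it: "Some things can't be seen, but they are always there." Monitoring is how you get to see those "unseen things".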
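📋 The three pillars of monitoring
1. Logs
Record what the Agent did; used for debugging and for tracing a problem back to its cause.
2. Metrics
Quantify how the Agent performs; used for performance analysis and alerting.
3. Traces
Record the life cycle of a request; used for performance tuning and for locating bottlenecks.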
📝 Log management
1. Structured logging
Don't log bare strings; log structured data:
const { Skill } = require('@openclaw/core');

class LoggedSkill extends Skill {
  async execute(options) {
    const startTime = Date.now();
    const requestId = this.generateRequestId();

    // Log the start of the request
    this.logger.info('skill.execution.start', {
      requestId,
      skill: this.name,
      options,
      timestamp: new Date(startTime).toISOString()
    });

    try {
      const result = await this.process(options);
      const duration = Date.now() - startTime;

      // Log success; `?? null` keeps the size check from throwing
      // when process() returns undefined
      this.logger.info('skill.execution.success', {
        requestId,
        duration,
        resultSize: JSON.stringify(result ?? null).length
      });
      return result;
    } catch (error) {
      const duration = Date.now() - startTime;

      // Log the failure, then rethrow so the caller still sees the error
      this.logger.error('skill.execution.error', {
        requestId,
        duration,
        error: {
          message: error.message,
          stack: error.stack,
          code: error.code
        }
      });
      throw error;
    }
  }
}
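2. Log levels
✅ When to use each level
- ERROR: errors that prevent a function from completing
- WARN: potential problems that do not block execution
- INFO: important business events
- DEBUG: detailed debugging information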
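The level rules above can also be enforced in code. The following is a minimal, dependency-free sketch and says nothing about how OpenClaw's own `this.logger` works; `createLogger`, `LEVELS`, and the `sink` parameter are hypothetical names chosen for illustration.

```javascript
// Hypothetical minimal logger: one JSON object per line, filtered by level.
// Lower rank means more severe.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

function createLogger(minLevel = 'info', sink = console.log) {
  const threshold = LEVELS[minLevel];
  if (threshold === undefined) {
    throw new Error(`Unknown log level: ${minLevel}`);
  }

  // Emit the event only if it is at least as severe as the threshold.
  function log(level, event, fields = {}) {
    if (LEVELS[level] > threshold) return false;
    sink(JSON.stringify({
      ...fields,
      level,
      event,
      timestamp: new Date().toISOString(),
    }));
    return true;
  }

  return {
    error: (event, fields) => log('error', event, fields),
    warn: (event, fields) => log('warn', event, fields),
    info: (event, fields) => log('info', event, fields),
    debug: (event, fields) => log('debug', event, fields),
  };
}

// Usage: at "info", the DEBUG event is dropped and the other two are kept.
const lines = [];
const logger = createLogger('info', (line) => lines.push(line));
logger.debug('cache.lookup', { key: 'user:42' });
logger.warn('retry.scheduled', { attempt: 2 });
logger.error('skill.execution.error', { requestId: 'req-1' });
console.log(lines.length);
```

Running production at `info` and switching to `debug` only while investigating is one way to keep detailed logs from hurting performance.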
📈 Collecting metrics
1. Key metrics
Track these metrics and you will have a clear picture of the Agent's health:
⏱️ Performance metrics
- Response time
- Throughput
- Error rate
- Availability
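To make the performance metrics concrete, here is a rough sketch of how throughput, error rate, and response-time percentiles could be derived from raw request records. The record shape `{ durationMs, ok }` and the function names are assumptions made for this example, and the percentile uses the simple nearest-rank method rather than any particular monitoring system's algorithm.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize one time window of requests.
// Hypothetical record shape: { durationMs: number, ok: boolean }
function summarize(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs);
  const errors = requests.filter((r) => !r.ok).length;
  return {
    throughputPerSec: requests.length / windowSeconds,
    errorRate: requests.length === 0 ? 0 : errors / requests.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Ten requests over a 5-second window: 100ms, 200ms, ... 1000ms, one failure.
const requests = Array.from({ length: 10 }, (_, i) => ({
  durationMs: (i + 1) * 100,
  ok: i !== 9,
}));
const summary = summarize(requests, 5);
console.log(summary);
```

Percentiles are usually more informative than an average here, because a handful of very slow requests can hide behind a healthy-looking mean.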
💰 Cost metrics
- Token usage
- API calls
- Estimated cost
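The three cost metrics can be tracked together. The sketch below is only illustrative: the prices are placeholders and not any provider's real rates, and `UsageTracker` and `estimateCostUsd` are hypothetical names.

```javascript
// HYPOTHETICAL prices in USD per one million tokens. These are placeholders;
// substitute the actual rates of whichever model you use.
const PRICE_PER_MILLION_TOKENS = { input: 3, output: 15 };

function estimateCostUsd({ inputTokens, outputTokens }, prices = PRICE_PER_MILLION_TOKENS) {
  return (inputTokens * prices.input + outputTokens * prices.output) / 1e6;
}

// Accumulates token usage and call counts across API calls.
class UsageTracker {
  constructor() {
    this.apiCalls = 0;
    this.inputTokens = 0;
    this.outputTokens = 0;
  }

  // Call once per model/API call with the token counts it reported.
  record({ inputTokens, outputTokens }) {
    this.apiCalls += 1;
    this.inputTokens += inputTokens;
    this.outputTokens += outputTokens;
  }

  snapshot() {
    return {
      apiCalls: this.apiCalls,
      inputTokens: this.inputTokens,
      outputTokens: this.outputTokens,
      estimatedCostUsd: estimateCostUsd(this),
    };
  }
}

const usage = new UsageTracker();
usage.record({ inputTokens: 2000, outputTokens: 500 });
usage.record({ inputTokens: 1000, outputTokens: 1500 });
console.log(usage.snapshot());
```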
🔧 Resource metrics
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
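For a Node.js process, CPU and memory figures can be sampled with built-in APIs alone. This sketch covers only those two; disk I/O and network traffic usually come from the host or a dedicated agent and are left out. `sampleResources` and its field names are hypothetical.

```javascript
const os = require('node:os');

// Take one snapshot of this process's CPU and memory usage.
// Pass the previous process.cpuUsage() result to get CPU time spent
// since that sample; omit it to get CPU time since process start.
function sampleResources(previousCpu) {
  const cpu = process.cpuUsage(previousCpu); // values are in microseconds
  const mem = process.memoryUsage();         // values are in bytes
  return {
    cpuUserMs: cpu.user / 1000,
    cpuSystemMs: cpu.system / 1000,
    rssBytes: mem.rss,
    heapUsedBytes: mem.heapUsed,
    // Share of the whole machine's memory currently in use (0 to 1).
    systemMemoryUsedRatio: 1 - os.freemem() / os.totalmem(),
  };
}

const snapshot = sampleResources();
console.log(snapshot);
```

Calling this on a timer and feeding the values into gauges is one simple way to get resource trends next to the Agent's own metrics.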
2. Custom metrics
const { Skill } = require('@openclaw/core');
const { Metrics } = require('@openclaw/metrics');

class MeasuredSkill extends Skill {
  async init() {
    this.metrics = new Metrics('my_skill');

    // Define the metrics
    this.metrics.defineCounter('requests_total');
    this.metrics.defineHistogram('response_time_ms');
    this.metrics.defineGauge('cache_size');
  }

  async execute(options) {
    const startTime = Date.now();

    try {
      const result = await this.process(options);

      // Record the metrics
      this.metrics.increment('requests_total', { status: 'success' });
      this.metrics.observe('response_time_ms', Date.now() - startTime);
      return result;
    } catch (error) {
      this.metrics.increment('requests_total', { status: 'error' });
      throw error;
    }
  }
}
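To show what the three metric types mean, here is a small in-memory stand-in that mirrors the method names used above (`defineCounter`, `defineHistogram`, `defineGauge`, `increment`, `observe`). It is not the `@openclaw/metrics` package and makes no claim about its internals; the class name, the `set` method for gauges, and `counterValue` are assumptions added for the example.

```javascript
// Hypothetical in-memory stand-in, useful for unit tests or for reasoning
// about the semantics of counters, histograms, and gauges.
class InMemoryMetrics {
  constructor(namespace) {
    this.namespace = namespace;
    this.series = new Map(); // metric name -> { type, data }
  }

  defineCounter(name) { this.series.set(name, { type: 'counter', data: new Map() }); }
  defineHistogram(name) { this.series.set(name, { type: 'histogram', data: [] }); }
  defineGauge(name) { this.series.set(name, { type: 'gauge', data: 0 }); }

  // Look up a metric and fail loudly on typos or type mismatches.
  lookup(name, type) {
    const metric = this.series.get(name);
    if (!metric || metric.type !== type) {
      throw new Error(`${type} "${name}" is not defined`);
    }
    return metric;
  }

  // Counter: only ever increases; one running total per label set.
  increment(name, labels = {}) {
    const { data } = this.lookup(name, 'counter');
    const key = JSON.stringify(Object.entries(labels).sort());
    data.set(key, (data.get(key) ?? 0) + 1);
  }

  // Histogram: keeps every observation so a distribution can be derived.
  observe(name, value) {
    this.lookup(name, 'histogram').data.push(value);
  }

  // Gauge: a point-in-time value that can go up or down.
  set(name, value) {
    this.lookup(name, 'gauge').data = value;
  }

  counterValue(name, labels = {}) {
    const key = JSON.stringify(Object.entries(labels).sort());
    return this.lookup(name, 'counter').data.get(key) ?? 0;
  }
}

const metrics = new InMemoryMetrics('my_skill');
metrics.defineCounter('requests_total');
metrics.defineHistogram('response_time_ms');
metrics.defineGauge('cache_size');

metrics.increment('requests_total', { status: 'success' });
metrics.increment('requests_total', { status: 'success' });
metrics.increment('requests_total', { status: 'error' });
metrics.observe('response_time_ms', 812);
metrics.set('cache_size', 128);
console.log(metrics.counterValue('requests_total', { status: 'success' }));
```

Keeping a separate counter series per `status` label is what later makes an error rate computable: errors divided by the total across all statuses.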
🔗 Distributed tracing
1. Why trace?
When an Agent's call chain gets long, you need to know where the time is going:
Example trace
Agent Request
├── Skill A (120ms)
│   ├── MCP Server 1 (50ms)
│   └── MCP Server 2 (70ms)
├── Skill B (80ms)
│   └── API Call (60ms)
└── Skill C (150ms)
    ├── Database Query (40ms)
    └── Cache Lookup (20ms)
Total: 350ms
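One way to read a trace like this is to compute each span's self time, meaning its duration minus the time spent in its direct children. The sketch below encodes the example trace by hand and assumes children run one after another; the object shape and `selfTimes` are hypothetical, not a tracing library's format.

```javascript
// The example trace above, written out as a plain object tree.
const trace = {
  name: 'Agent Request', durationMs: 350, children: [
    { name: 'Skill A', durationMs: 120, children: [
      { name: 'MCP Server 1', durationMs: 50, children: [] },
      { name: 'MCP Server 2', durationMs: 70, children: [] },
    ] },
    { name: 'Skill B', durationMs: 80, children: [
      { name: 'API Call', durationMs: 60, children: [] },
    ] },
    { name: 'Skill C', durationMs: 150, children: [
      { name: 'Database Query', durationMs: 40, children: [] },
      { name: 'Cache Lookup', durationMs: 20, children: [] },
    ] },
  ],
};

// Self time = a span's own duration minus the total of its direct children.
// Assumes children run sequentially, as in the example trace.
function selfTimes(span, out = []) {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  out.push({ name: span.name, selfMs: span.durationMs - childTotal });
  for (const child of span.children) selfTimes(child, out);
  return out;
}

// Rank spans by where the time is actually spent.
const ranked = selfTimes(trace).sort((a, b) => b.selfMs - a.selfMs);
console.log(ranked.slice(0, 3));
```

In this example Skill C comes out on top with 90ms of its own: its two children account for only 60ms of its 150ms, so the remaining time is spent inside the skill itself and is the first place to look.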
2. Implementing tracing
const { Skill } = require('@openclaw/core');
const { Tracer } = require('@openclaw/tracing');

class TracedSkill extends Skill {
  async init() {
    this.tracer = new Tracer('my-skill');
  }

  async execute(options) {
    // Create a span
    const span = this.tracer.startSpan('execute');

    try {
      // Add tags
      span.setTag('skill.name', this.name);
      span.setTag('request.id', options.requestId);

      // Run the processing logic
      const result = await this.process(options);
      span.setTag('status', 'success');
      return result;
    } catch (error) {
      span.setTag('status', 'error');
      span.log('error', error);
      throw error;
    } finally {
      // Always close the span, whether the call succeeded or failed
      span.finish();
    }
  }
}
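The example above records a single span. To get a tree like the one in the example trace, each span also needs to know its parent. Below is a rough sketch of one way to do that in plain Node.js using `AsyncLocalStorage`; it is not how `@openclaw/tracing` works, and `withSpan`, `finished`, and the span fields are hypothetical.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

// Holds the span that is "current" for the running async call chain.
const currentSpan = new AsyncLocalStorage();
// Finished spans; a real tracer would export these to a backend instead.
const finished = [];

// Run fn inside a new span. Nested calls automatically become children
// because they see the outer span as the current one.
async function withSpan(name, fn) {
  const parent = currentSpan.getStore();
  const span = {
    id: randomUUID(),
    traceId: parent ? parent.traceId : randomUUID(),
    parentId: parent ? parent.id : null,
    name,
    startMs: Date.now(),
  };
  try {
    return await currentSpan.run(span, fn);
  } catch (error) {
    span.error = error.message;
    throw error;
  } finally {
    span.durationMs = Date.now() - span.startMs;
    finished.push(span);
  }
}

// Usage: two skills executed inside one request.
const done = withSpan('Agent Request', async () => {
  await withSpan('Skill A', async () => {});
  await withSpan('Skill B', async () => {});
});
done.then(() => console.log(finished.map((s) => s.name)));
```

Because the parent is picked up from the async context, the skills do not have to pass span objects to each other, which keeps the tracing code out of their signatures.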
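🚨 Alert configuration
1. Alert rules
✅ Conditions worth alerting on
- Error rate > 5% (sustained for 5 minutes)
- P99 response time > 3s (sustained for 5 minutes)
- Token usage > 100K per hour
- Service unavailable (health check failing)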
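The "sustained for 5 minutes" part matters: a rule that fires on a single bad sample is a quick route to noisy alerts. A minimal sketch of such a rule follows, assuming one sample arrives per minute; `SustainedRule` and its parameters are hypothetical and unrelated to OpenClaw's alert engine.

```javascript
// Hypothetical rule: fires only once the value has stayed above the
// threshold continuously for at least `forMs` milliseconds.
class SustainedRule {
  constructor({ name, threshold, forMs }) {
    this.name = name;
    this.threshold = threshold;
    this.forMs = forMs;
    this.breachedSince = null; // timestamp of the first sample in the current breach
  }

  // Feed one sample; returns true while the alert should be firing.
  evaluate(value, nowMs) {
    if (value <= this.threshold) {
      this.breachedSince = null; // any healthy sample resets the timer
      return false;
    }
    if (this.breachedSince === null) this.breachedSince = nowMs;
    return nowMs - this.breachedSince >= this.forMs;
  }
}

const MINUTE_MS = 60 * 1000;
const rule = new SustainedRule({
  name: 'high_error_rate',
  threshold: 0.05,
  forMs: 5 * MINUTE_MS,
});

// One error-rate sample per minute; the dip at minute 2 resets the timer.
const samples = [0.08, 0.09, 0.02, 0.07, 0.07, 0.08, 0.09, 0.10, 0.11];
const firing = samples.map((value, minute) => rule.evaluate(value, minute * MINUTE_MS));
console.log(firing.indexOf(true)); // 8
```

The first breach ends after two minutes without firing; the second starts at minute 3 and only triggers at minute 8, once it has lasted the full five minutes.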
2. Example configuration
# Define alerts in the OpenClaw configuration
alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    duration: 5m
    notifications:
      - type: email
        to: admin@example.com
      - type: webhook
        url: https://hooks.slack.com/xxx
  - name: slow_response
    condition: p99_response_time > 3000
    duration: 5m
    notifications:
      - type: slack
        channel: '#alerts'
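📊 Monitoring dashboard
A good dashboard should include:
Recommended dashboard layout
- 📈 Overview: overall health, QPS, error rate
- ⏱️ Performance: response time distribution, P50/P95/P99
- 💰 Cost: token usage trend, estimated cost
- 🔧 Resources: CPU, memory, and disk usage
- 📝 Logs: recent error logs, key events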
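🎯 Best practices
✅ Monitoring principles
- Monitoring should be automated, with no manual checking required
- Alerts should be actionable: whoever receives one knows what to do
- Metrics should be comparable, so trends are visible
- Logs should be searchable, so problems can be located quickly
⚠️ Common pitfalls
- Tracking too many metrics, which leads to "alert fatigue"
- Monitoring without alerting, so critical problems go unnoticed
- Logging in so much detail that performance suffers
- Retaining metrics for too short a time to analyze long-term trends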