OpenClaw Agent Monitoring and Observability Guide - See Everything

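🔍 Why Monitor?

At 4:33 a.m. I got an alert: one Agent's response time had suddenly jumped from 0.8s to 5.2s. That was when I realized that running an Agent without monitoring is like racing a car in the dark.

As a line in a Wong Kar-wai film puts it: "Some things can't be seen, but they are always there." Monitoring is what lets you see those "unseen things."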

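📋 The Three Pillars of Monitoring

1. Logs

Record what the Agent did; used for debugging and tracing problems back to their cause.

2. Metrics

Quantify how the Agent performs; used for performance analysis and alerting.

3. Traces

Record the lifecycle of a request; used for performance optimization and locating bottlenecks.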

📝 Log Management

1. Structured Logs

Don't log bare strings; log structured data:

const { Skill } = require('@openclaw/core');

class LoggedSkill extends Skill {
  async execute(options) {
    const startTime = Date.now();
    // generateRequestId() is not defined in this snippet; it is assumed to exist on the skill
    const requestId = this.generateRequestId();

    // Log the start of the request
    this.logger.info('skill.execution.start', {
      requestId,
      skill: this.name,
      options,
      timestamp: new Date(startTime).toISOString()
    });

    try {
      const result = await this.process(options);
      const duration = Date.now() - startTime;

      // Log success; `?? null` avoids a TypeError when process() returns undefined,
      // because JSON.stringify(undefined) is undefined and has no .length
      this.logger.info('skill.execution.success', {
        requestId,
        duration,
        resultSize: JSON.stringify(result ?? null).length
      });

      return result;
    } catch (error) {
      const duration = Date.now() - startTime;

      // Log the failure with enough detail to trace it later
      this.logger.error('skill.execution.error', {
        requestId,
        duration,
        error: {
          message: error.message,
          stack: error.stack,
          code: error.code
        }
      });

      throw error;
    }
  }
}

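2. Log Level Management

✅ Guidelines for Using Log Levels

ERROR: errors that prevent a function from executing
WARN: potential problems that do not stop execution
INFO: important business events
DEBUG: detailed debugging information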
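
The level guidelines above can be sketched as a threshold filter: each level gets a numeric rank, and anything below the configured minimum is dropped. This is a minimal illustration rather than OpenClaw's logger; `LEVELS`, `createLogger`, and the `sink` callback are hypothetical names introduced for this example.

```javascript
// Hypothetical sketch: a level-threshold logger that emits JSON lines.
// LEVELS, createLogger, and sink are illustrative names, not an OpenClaw API.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40 };

function createLogger(minLevel, sink = (line) => console.log(line)) {
  if (typeof LEVELS[minLevel] !== 'number') {
    throw new Error(`Unknown log level: ${minLevel}`);
  }

  const log = (level, event, fields = {}) => {
    // Drop anything below the configured threshold.
    if (LEVELS[level] < LEVELS[minLevel]) return false;
    sink(JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      event,
      ...fields,
    }));
    return true;
  };

  return {
    debug: (event, fields) => log('DEBUG', event, fields),
    info: (event, fields) => log('INFO', event, fields),
    warn: (event, fields) => log('WARN', event, fields),
    error: (event, fields) => log('ERROR', event, fields),
  };
}

// Usage: at INFO, DEBUG noise is hidden while WARN and ERROR are kept.
const lines = [];
const logger = createLogger('INFO', (line) => lines.push(line));
logger.debug('cache.lookup', { key: 'user:42' });             // filtered out
logger.warn('retry.scheduled', { attempt: 2 });               // kept
logger.error('skill.execution.error', { code: 'E_TIMEOUT' }); // kept
```

Running production at INFO and switching to DEBUG only while investigating keeps log volume, and the performance cost that comes with it, under control.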

📈 Metrics Collection

1. Key Metrics

Monitor these metrics and you will have a clear picture of the Agent's health:

⏱️ Performance Metrics

Response Time
Throughput
Error Rate
Availability
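
As a rough sketch of how the first three of these are derived, the snippet below summarizes a window of request samples into percentile response times, throughput, and error rate. The sample shape (`durationMs`, `ok`) and the helper names are assumptions made for this example, and the percentile uses the simple nearest-rank method. Availability is left out because it usually comes from health-check results rather than request samples.

```javascript
// Hypothetical sketch: summarize raw request samples into performance metrics.
// The sample shape ({ durationMs, ok }) is an assumption for this example.

// Nearest-rank percentile on an ascending-sorted array.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

function summarize(samples, windowSeconds) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => !s.ok).length;
  return {
    p50Ms: percentile(durations, 50),
    p99Ms: percentile(durations, 99),
    throughputPerSec: samples.length / windowSeconds,
    errorRate: samples.length === 0 ? 0 : errors / samples.length,
  };
}

// Four requests observed in a 60-second window, one of them slow and failed.
const samples = [
  { durationMs: 800, ok: true },
  { durationMs: 900, ok: true },
  { durationMs: 5200, ok: false },
  { durationMs: 700, ok: true },
];
const summary = summarize(samples, 60);
console.log(summary);
```

Note how the median stays unremarkable while P99 exposes the single 5.2s request, which is why tail percentiles are the usual choice for latency alerts.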

💰 Cost Metrics

Token Usage
API Calls
Estimated Cost
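
The three cost metrics are closely related: estimated cost is just token usage multiplied by a price. A minimal sketch follows, assuming usage is reported as `inputTokens` and `outputTokens`; the prices are placeholders chosen for the arithmetic, not real rates for any provider, and `CostTracker` is a hypothetical name.

```javascript
// Hypothetical sketch: track API calls and tokens, and estimate cost.
// The prices are placeholders (assumed USD per million tokens), not real rates.
const PRICE_PER_MILLION_TOKENS = { input: 3, output: 15 };

class CostTracker {
  constructor(prices = PRICE_PER_MILLION_TOKENS) {
    this.prices = prices;
    this.apiCalls = 0;
    this.inputTokens = 0;
    this.outputTokens = 0;
  }

  // Record the usage reported for one API call.
  record({ inputTokens = 0, outputTokens = 0 }) {
    this.apiCalls += 1;
    this.inputTokens += inputTokens;
    this.outputTokens += outputTokens;
  }

  snapshot() {
    const estimatedCost =
      (this.inputTokens / 1e6) * this.prices.input +
      (this.outputTokens / 1e6) * this.prices.output;
    return {
      apiCalls: this.apiCalls,
      totalTokens: this.inputTokens + this.outputTokens,
      estimatedCost,
    };
  }
}

const tracker = new CostTracker();
tracker.record({ inputTokens: 1200, outputTokens: 350 });
tracker.record({ inputTokens: 800, outputTokens: 500 });
console.log(tracker.snapshot());
```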

🔧 Resource Metrics

CPU usage
Memory usage
Disk I/O
Network traffic
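
For an Agent running on Node.js, the process-level part of these can be sampled with built-ins. The sketch below uses `process.memoryUsage()` and `process.cpuUsage()`; `sampleResources` is a hypothetical helper name. It reports CPU time consumed by this process rather than a utilization percentage, and it does not cover disk I/O or network traffic, which typically come from OS-level tooling.

```javascript
// Sketch: sample process-level resource usage with Node.js built-ins.
// sampleResources is a hypothetical helper name introduced for this example.
function sampleResources(previousCpu) {
  const memory = process.memoryUsage();      // values in bytes
  const cpu = process.cpuUsage(previousCpu); // microseconds since previousCpu
  return {
    rssBytes: memory.rss,
    heapUsedBytes: memory.heapUsed,
    cpuUserMicros: cpu.user,
    cpuSystemMicros: cpu.system,
  };
}

// Take a CPU baseline, do some work, then measure what that work consumed.
const baseline = process.cpuUsage();
let checksum = 0;
for (let i = 0; i < 1e6; i += 1) checksum += i % 7;
const snapshot = sampleResources(baseline);
console.log(snapshot);
```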

2. Custom Metrics

const { Skill } = require('@openclaw/core');
const { Metrics } = require('@openclaw/metrics');

class MeasuredSkill extends Skill {
  async init() {
    this.metrics = new Metrics('my_skill');

    // Define the metrics
    this.metrics.defineCounter('requests_total');
    this.metrics.defineHistogram('response_time_ms');
    this.metrics.defineGauge('cache_size');
  }

  async execute(options) {
    const startTime = Date.now();

    try {
      const result = await this.process(options);

      // Record the metrics for a successful request
      this.metrics.increment('requests_total', { status: 'success' });
      this.metrics.observe('response_time_ms', Date.now() - startTime);

      return result;
    } catch (error) {
      // Count the failure, then let the error propagate
      this.metrics.increment('requests_total', { status: 'error' });
      throw error;
    }
  }
}
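
To make the three metric types concrete, here is a small in-memory stand-in. It is not the `@openclaw/metrics` implementation, and `InMemoryMetrics` and its storage layout are assumptions for illustration only. It shows what each type keeps: a counter only goes up, a histogram collects observations, and a gauge holds the latest value.

```javascript
// Hypothetical in-memory stand-in, NOT the @openclaw/metrics implementation.
// It only illustrates what counters, histograms, and gauges store.
class InMemoryMetrics {
  constructor(prefix) {
    this.prefix = prefix;
    this.counters = new Map();   // "name{labels}" -> running total
    this.histograms = new Map(); // "name{}" -> array of observed values
    this.gauges = new Map();     // "name{}" -> latest value
  }

  // Build a stable key so the same labels always map to the same series.
  key(name, labels = {}) {
    const parts = Object.keys(labels).sort().map((k) => `${k}=${labels[k]}`);
    return `${this.prefix}_${name}{${parts.join(',')}}`;
  }

  increment(name, labels, by = 1) {
    const k = this.key(name, labels);
    this.counters.set(k, (this.counters.get(k) || 0) + by);
  }

  observe(name, value) {
    const k = this.key(name);
    if (!this.histograms.has(k)) this.histograms.set(k, []);
    this.histograms.get(k).push(value);
  }

  set(name, value) {
    this.gauges.set(this.key(name), value);
  }
}

const metrics = new InMemoryMetrics('my_skill');
metrics.increment('requests_total', { status: 'success' });
metrics.increment('requests_total', { status: 'error' });
metrics.observe('response_time_ms', 120);
metrics.set('cache_size', 42);
console.log(metrics.counters, metrics.histograms, metrics.gauges);
```

Labels such as `status` create one series per distinct value, so it is worth keeping them to small, fixed sets rather than request IDs or user input.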

🔗 Distributed Tracing

1. Why Tracing?

When an Agent's call chain is long, you need to know where the time goes:

Trace example

Agent Request
├── Skill A (120ms)
│   ├── MCP Server 1 (50ms)
│   └── MCP Server 2 (70ms)
├── Skill B (80ms)
│   └── API Call (60ms)
└── Skill C (150ms)
    ├── Database Query (40ms)
    └── Cache Lookup (20ms)

Total: 350ms
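
One way to read a trace like this is to compute each span's self time: its own duration minus the time attributed to its children. The sketch below encodes the tree above and finds the span with the largest self time. The span shape (`name`, `durationMs`, `children`) and the helper names are assumptions for this example, and the subtraction only holds if children run one after another rather than in parallel.

```javascript
// Hypothetical sketch: the example trace as a span tree.
// The span shape ({ name, durationMs, children }) is an assumption.
const leaf = (name, durationMs) => ({ name, durationMs, children: [] });

const trace = {
  name: 'Agent Request',
  durationMs: 350,
  children: [
    { name: 'Skill A', durationMs: 120, children: [leaf('MCP Server 1', 50), leaf('MCP Server 2', 70)] },
    { name: 'Skill B', durationMs: 80, children: [leaf('API Call', 60)] },
    { name: 'Skill C', durationMs: 150, children: [leaf('Database Query', 40), leaf('Cache Lookup', 20)] },
  ],
};

// Self time = span duration minus the time spent in its direct children.
// Assumes the children ran sequentially.
function selfTime(span) {
  const childTotal = span.children.reduce((sum, child) => sum + child.durationMs, 0);
  return span.durationMs - childTotal;
}

// Depth-first list of every span in the tree.
function flatten(span, out = []) {
  out.push(span);
  span.children.forEach((child) => flatten(child, out));
  return out;
}

function slowestBySelfTime(root) {
  return flatten(root).reduce((worst, span) => (selfTime(span) > selfTime(worst) ? span : worst));
}

const bottleneck = slowestBySelfTime(trace);
console.log(bottleneck.name, selfTime(bottleneck));
```

In this trace, Skill C spends 90ms outside its database and cache calls, which makes its own logic a better optimization target than any single downstream call.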

2. Implementing Tracing

const { Skill } = require('@openclaw/core');
const { Tracer } = require('@openclaw/tracing');

class TracedSkill extends Skill {
  async init() {
    this.tracer = new Tracer('my-skill');
  }

  async execute(options) {
    // Create a span
    const span = this.tracer.startSpan('execute');

    try {
      // Add tags
      span.setTag('skill.name', this.name);
      span.setTag('request.id', options.requestId);

      // Run the processing logic
      const result = await this.process(options);

      span.setTag('status', 'success');
      return result;
    } catch (error) {
      span.setTag('status', 'error');
      span.log('error', error);
      throw error;
    } finally {
      // Always close the span, whether the call succeeded or failed
      span.finish();
    }
  }
}

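🚨 Alert Configuration

1. Alert Rules

✅ When to Alert

Error rate > 5% (sustained for 5 minutes)
P99 response time > 3s (sustained for 5 minutes)
Token usage > 100K per hour
Service unavailable (health check failing)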
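
The "sustained for 5 minutes" qualifier is what keeps a single spike from paging anyone. A minimal sketch of that logic follows, assuming samples arrive with a millisecond timestamp; `SustainedAlert` and its method names are hypothetical, not an OpenClaw API.

```javascript
// Hypothetical sketch: fire only when a threshold is breached continuously
// for a full duration. SustainedAlert is an illustrative name.
class SustainedAlert {
  constructor({ threshold, durationMs }) {
    this.threshold = threshold;
    this.durationMs = durationMs;
    this.breachStartedAt = null; // timestamp of the first breaching sample
  }

  // Feed one sample; returns true while the alert should be firing.
  update(value, nowMs) {
    if (value <= this.threshold) {
      this.breachStartedAt = null; // condition cleared, reset the timer
      return false;
    }
    if (this.breachStartedAt === null) this.breachStartedAt = nowMs;
    return nowMs - this.breachStartedAt >= this.durationMs;
  }
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
const errorRateAlert = new SustainedAlert({ threshold: 0.05, durationMs: FIVE_MINUTES_MS });

console.log(errorRateAlert.update(0.08, 0));       // false: breach just started
console.log(errorRateAlert.update(0.09, 240000));  // false: only 4 minutes so far
console.log(errorRateAlert.update(0.07, 300000));  // true: breached for 5 minutes
console.log(errorRateAlert.update(0.01, 360000));  // false: recovered, timer reset
```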

2. Configuration Example

# Define alerts in the OpenClaw configuration
alerts:
  - name: high_error_rate
    condition: error_rate > 0.05
    duration: 5m
    notifications:
      - type: email
        to: admin@example.com
      - type: webhook
        url: https://hooks.slack.com/xxx

  - name: slow_response
    condition: p99_response_time > 3000
    duration: 5m
    notifications:
      - type: slack
        channel: '#alerts'
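
To show how `condition` and `duration` strings like these might be interpreted, here is a small parser sketch. The grammar it accepts (`<metric> <operator> <number>` and `<n>s|m|h`) is an assumption inferred from the two rules above, not a documented OpenClaw format, and `parseDuration` and `parseCondition` are hypothetical helpers.

```javascript
// Hypothetical sketch: interpret the `duration` and `condition` strings.
// The accepted grammar is an assumption based on the examples above.
function parseDuration(text) {
  const match = /^(\d+)(s|m|h)$/.exec(text);
  if (!match) throw new Error(`Unsupported duration: ${text}`);
  const unitMs = { s: 1000, m: 60 * 1000, h: 60 * 60 * 1000 };
  return Number(match[1]) * unitMs[match[2]];
}

function parseCondition(text) {
  const match = /^(\w+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)$/.exec(text.trim());
  if (!match) throw new Error(`Unsupported condition: ${text}`);
  const [, metric, operator, rawThreshold] = match;
  const threshold = Number(rawThreshold);
  const comparators = {
    '>': (value) => value > threshold,
    '>=': (value) => value >= threshold,
    '<': (value) => value < threshold,
    '<=': (value) => value <= threshold,
  };
  return {
    metric,
    operator,
    threshold,
    // Evaluate the rule against a plain object of current metric values.
    test: (metrics) => comparators[operator](metrics[metric]),
  };
}

const rule = parseCondition('error_rate > 0.05');
console.log(rule.metric, rule.threshold, rule.test({ error_rate: 0.08 }));
console.log(parseDuration('5m'));
```

Rejecting anything outside the expected grammar, instead of evaluating the string as code, keeps a malformed or untrusted config from doing more than failing to load.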

📊 Monitoring Dashboard

A good dashboard should include:

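🎯 Best Practices

✅ Monitoring Principles

Monitoring should be automated, with no manual checking required
Alerts should be actionable: whoever receives one knows what to do
Metrics should be comparable, so trends are visible
Logs should be searchable, so problems can be located quickly

⚠️ Common Pitfalls

Monitoring too many metrics, which leads to "alert fatigue"
Monitoring without alerting, so critical problems are missed
Logging in too much detail, which hurts performance
Retaining metrics for too short a time to analyze long-term trends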
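
One common way to reduce alert fatigue is to suppress repeat notifications for the same alert during a cooldown window. The sketch below is a minimal illustration of that idea; `AlertThrottle` and the 15-minute cooldown are assumptions for this example, not an OpenClaw feature.

```javascript
// Hypothetical sketch: suppress repeat notifications for the same alert
// within a cooldown window. AlertThrottle is an illustrative name.
class AlertThrottle {
  constructor(cooldownMs) {
    this.cooldownMs = cooldownMs;
    this.lastSentAt = new Map(); // alert name -> timestamp of last notification
  }

  // Returns true if a notification should be sent now, and records it.
  shouldNotify(alertName, nowMs) {
    const last = this.lastSentAt.get(alertName);
    if (last !== undefined && nowMs - last < this.cooldownMs) return false;
    this.lastSentAt.set(alertName, nowMs);
    return true;
  }
}

const throttle = new AlertThrottle(15 * 60 * 1000); // assumed 15-minute cooldown
console.log(throttle.shouldNotify('high_error_rate', 0));       // true: first occurrence
console.log(throttle.shouldNotify('high_error_rate', 60000));   // false: within cooldown
console.log(throttle.shouldNotify('slow_response', 60000));     // true: different alert
console.log(throttle.shouldNotify('high_error_rate', 900000));  // true: cooldown elapsed
```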
