OpenClaw Log Analysis and Monitoring: The 3 a.m. Alert You Deserve

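For 3 minutes and 37 seconds I stared at the red curve spiking on my Grafana dashboard. API cost had quadrupled while the user count stayed flat, so which Skill was quietly burning tokens? This article walks through building a complete observability stack.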

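🏗️ Logging Architecture Overview

OpenClaw's logging system is organized in three layers: the application layer emits JSON logs, the collection layer aggregates them into storage, and the analysis layer provides search and visualization.

Log Format Configuration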

# openclaw.yaml
logging:
  level: "info"
  format: "json"
  output: "file"
  filePath: "/var/log/openclaw/gateway.log"
  rotation:
    maxSize: "100MB"
    maxFiles: 30
    compress: true
  fields:
    agent: "my-agent"
    environment: "production"

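Log Levels

| Level | Purpose | Typical events |
|-------|---------|----------------|
| DEBUG | Development and debugging | Tool-call parameters, model response details |
| INFO | Normal operation | Skill loading, message send/receive |
| WARN | Potential problems | Model fallback, rate-limit warnings |
| ERROR | Problems requiring action | API timeouts, tool failures |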
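
To make the level ordering concrete, here is a minimal Python sketch that keeps only records at or above a chosen severity. It assumes each log line is a JSON object with a `level` field and a `msg` field; those field names and the sample lines are illustrative guesses, not a documented OpenClaw schema.

```python
import json

# Severity order taken from the level table above (lowest to highest).
LEVEL_ORDER = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_by_level(lines, minimum="warn"):
    """Return parsed JSON log records at or above `minimum` severity."""
    threshold = LEVEL_ORDER[minimum.lower()]
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).lower()
        # Unknown levels map to -1 and are therefore dropped.
        if LEVEL_ORDER.get(level, -1) >= threshold:
            kept.append(record)
    return kept


# Hypothetical sample lines; real field names may differ.
sample = [
    '{"level": "info", "msg": "skill loaded"}',
    '{"level": "warn", "msg": "model fallback"}',
    '{"level": "error", "msg": "tool failed"}',
    "not json",
]
alerts = filter_by_level(sample, minimum="warn")
print([r["msg"] for r in alerts])
```

Running the same function with `minimum="debug"` returns every parseable record, which is handy when replaying a log file during development.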

📊 Setting Up Prometheus + Grafana Monitoring

Enabling Metrics Collection

# openclaw.yaml
telemetry:
  prometheus:
    enabled: true
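    port: 9090  # the Compose file below also publishes Prometheus on host port 9090; pick another port here if both run on one host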
    path: "/metrics"
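
Once the endpoint is enabled, it can help to sanity-check what it returns before wiring up Prometheus. The sketch below parses a simplified subset of the Prometheus text exposition format in Python. The sample text is made up for illustration (only the two metric names come from this article), and the parser assumes no trailing timestamps on sample lines.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: skips comment lines and assumes each sample line is
    "<series> <value>" with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label sets containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative payload; a real /metrics response will contain more series.
sample_text = """\
# HELP openclaw_requests_total Total requests.
# TYPE openclaw_requests_total counter
openclaw_requests_total 1200
openclaw_errors_total 18
"""
metrics = parse_metrics(sample_text)
print(metrics["openclaw_requests_total"])
```

In practice the text would come from an HTTP GET against the configured `path`; the parsing step stays the same.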

One-Command Deployment with Docker Compose

cat > docker-compose.monitoring.yml << 'EOF'
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports: ["3001:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:
EOF

docker compose -f docker-compose.monitoring.yml up -d

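Key Grafana Panels

  • Total requests: Agent request frequency per time window
  • Token consumption: real-time tracking of input and output tokens
  • Tool-call latency: P50/P95/P99 latency distribution
  • Error rate: grouped by tool type
  • Cost trend: daily and weekly API spend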
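
For readers unfamiliar with the P50/P95/P99 notation on the latency panel, the following Python sketch computes nearest-rank percentiles over a list of durations. The sample values are invented. Note that Prometheus usually estimates quantiles from histogram buckets, so dashboard numbers will not match an exact calculation like this one.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical tool-call durations in seconds, with one slow outlier.
durations = [0.8, 1.1, 0.9, 1.3, 12.0, 1.0, 0.7, 1.2, 1.4, 0.95]
p50 = percentile(durations, 50)
p95 = percentile(durations, 95)
print(p50, p95)
```

The median stays near one second while P95 jumps to the outlier, which is why tail percentiles are the ones worth alerting on.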

🔔 Alert Rule Configuration

# alert_rules.yml
groups:
  - name: openclaw_alerts
    rules:
      - alert: HighAPICost
        expr: openclaw_cost_daily_dollars > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Daily API spend exceeds $10"
      - alert: HighErrorRate
        expr: sum(rate(openclaw_errors_total[5m])) / sum(rate(openclaw_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"
      - alert: GatewayDown
        expr: up{job="openclaw"} == 0
        for: 1m
        labels:
          severity: critical
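
The HighErrorRate rule divides one counter's rate by another's. As a rough illustration of that arithmetic, this Python sketch derives a per-second rate from two (timestamp, value) samples and compares the ratio with the 5% threshold. It is a simplification: unlike Prometheus `rate()`, it ignores counter resets and extrapolation, and the sample numbers are invented.

```python
def counter_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    return (v1 - v0) / (t1 - t0)


def error_ratio(error_samples, request_samples):
    """Ratio of error rate to request rate; 0.0 when there is no traffic."""
    request_rate = counter_rate(request_samples)
    if request_rate == 0:
        return 0.0
    return counter_rate(error_samples) / request_rate


# Hypothetical counters sampled 300 seconds (5 minutes) apart.
requests = [(0, 1000), (300, 1600)]  # 600 new requests
errors = [(0, 10), (300, 46)]        # 36 new errors
ratio = error_ratio(errors, requests)
firing = ratio > 0.05
print(round(ratio, 3), firing)
```

Guarding against a zero request rate matters in practice: without traffic the PromQL division yields no usable value, so the alert would stay silent rather than fire.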

Notification Channel Configuration

# openclaw.yaml
alerts:
  channels:
    - type: "telegram"
      chatId: "${ADMIN_TELEGRAM_CHAT_ID}"
    - type: "discord"
      webhook: "${DISCORD_WEBHOOK_URL}"

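📈 Key Performance Indicators

| Metric | Name | Healthy threshold |
|--------|------|-------------------|
| Total requests | openclaw_requests_total | Baseline ±20% |
| Token consumption | openclaw_tokens_total | Within budget |
| Tool latency P95 | openclaw_tool_duration_p95 | < 15 s |
| Error rate | openclaw_error_rate | < 1% |
| Skill load time | openclaw_skill_load_seconds | < 2 s |
| Active sessions | openclaw_active_sessions | As expected |
| Model fallbacks | openclaw_model_fallbacks_total | < 5% |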
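
The "baseline ±20%" threshold can be expressed as a one-line relative-deviation check. The Python sketch below is one possible reading of that rule; how the baseline itself is obtained (for example, a trailing weekly average) is left open here and is an assumption on the reader's side.

```python
def outside_baseline(current, baseline, tolerance=0.20):
    """True when `current` deviates from `baseline` by more than `tolerance`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(current - baseline) / baseline > tolerance


# Hypothetical hourly request counts against a baseline of 1000.
print(outside_baseline(1150, 1000))  # 15% above baseline
print(outside_baseline(4000, 1000))  # four times the baseline
```

The check is symmetric on purpose: a sudden drop in requests is as suspicious as a spike.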

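🔍 Anomaly Detection

Common Anomaly Patterns

| Anomaly pattern | Symptom | Where to look |
|-----------------|---------|---------------|
| Token consumption spike | Cost curve shoots up | Check whether a Skill is calling itself in a loop |
| Latency spike | P95 exceeds 30 seconds | Check model API response times and the network |
| Rising error rate | Dense ERROR log entries | Check API keys, rate limits, and configuration |
| Session leak | Abnormal active session count | Check the session cleanup policy |
| Tool-call anomalies | A specific tool fails frequently | Check the status of dependent services |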
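
# Quick triage commands
# Show recent error logs
openclaw logs --level error --last 1h

# Count calls per tool (records without a tool field are skipped)
openclaw logs --last 1h | jq -r '.tool // empty' | sort | uniq -c | sort -rn

# Show the 10 slowest tool calls (records without a duration are skipped)
openclaw logs --last 24h | jq -s 'map(select(.duration != null)) | sort_by(-.duration) | .[0:10]'

# Sum token usage over the last 7 days; compare runs to see the trend
openclaw logs --last 7d | jq -s '[.[] | .tokens // 0] | add'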
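
A grand total does not show which Skill is responsible for a spike. As a sketch of the next step, the Python snippet below groups token usage by Skill and ranks the result. It assumes each JSON record carries a `tokens` field (as the jq command above does) and a `skill` field; the `skill` field name and the sample records are hypothetical.

```python
import json
from collections import defaultdict


def tokens_by_skill(lines):
    """Sum the `tokens` field per skill and return (skill, total) pairs,
    largest consumer first."""
    totals = defaultdict(int)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines
        if not isinstance(record, dict):
            continue
        tokens = record.get("tokens")
        if isinstance(tokens, (int, float)):
            totals[record.get("skill", "unknown")] += tokens
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records; real logs may name these fields differently.
sample = [
    '{"skill": "web-search", "tokens": 1200}',
    '{"skill": "summarize", "tokens": 400}',
    '{"skill": "web-search", "tokens": 5300}',
    '{"level": "info", "msg": "no token field"}',
]
ranking = tokens_by_skill(sample)
print(ranking[0])
```

Feeding it the output of the log command line by line would put the heaviest consumer at the top of the list, which is usually the first place to look for a looping Skill.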

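🔧 Troubleshooting in Practice

Scenario 1: Gateway Not Responding

# Check the process status
systemctl status openclaw

# Check that the port is listening
ss -tlnp | grep ':3000'

# Show recent errors
openclaw logs --level error --last 10m

# Check memory usage (the bracket keeps grep from matching itself)
free -h && ps aux | grep '[o]penclaw'

# Restart
openclaw gateway restart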

Scenario 2: Model Calls Keep Failing

# Check that the API key is valid
curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

# Check the account balance
# OpenAI Dashboard → Usage → View usage

# Check whether a fallback was triggered
openclaw logs | grep -i "fallback"

# Test the model connection manually
openclaw test model --name claude-sonnet-4-20250514

Scenario 3: Skill Execution Hangs

# List stuck sessions
openclaw sessions list --status active

# Kill a stuck session
openclaw sessions kill --session-id <id>

# Show the Skill's execution logs
openclaw logs --skill <skill-name> --last 5m

🔗 Related Resources