Agent Fault Tolerance Agent 容错设计 - OpenClaw & Agent Skills 术语百科

"世界上有一种设计哲学叫容错，它就像给 Agent 穿了一件防弹衣——不是为了防它不犯错，而是为了犯错的时候不至于一命呜呼。凌晨 3 点 17 分，我的 Agent 又崩了。第 47 次。我看着满屏的错误日志，终于明白：完美的 Agent 不存在，但能从任何错误中恢复的 Agent，才是好 Agent。"

📖 什么是 Agent Fault Tolerance

Agent Fault Tolerance（智能体容错设计）是指让 AI Agent 在面对外部服务故障、网络异常、API 限流、模型输出异常等错误时，能够优雅降级、自动恢复、继续工作的一系列设计策略和模式。

                    一句话解释：就像你手机信号不好——不会直接死机，而是先重试，再切换 Wi-Fi，最后告诉你"暂时无法连接"。Agent 容错就是让 AI 学会这种"不死磕"的智慧。
                

Agent 容易出什么错？

错误类型	频率	影响	示例
API 限流	⭐⭐⭐⭐⭐	高	Rate limit exceeded
网络超时	⭐⭐⭐⭐	高	web_fetch 超时
模型幻觉	⭐⭐⭐⭐	中	调用不存在的工具
格式错误	⭐⭐⭐	中	JSON 解析失败
权限不足	⭐⭐	低	文件写入权限

⚙️ 五大容错策略

1. 指数退避重试 (Exponential Backoff)

最基础但最有效的策略——失败后等越来越长的时间再试：

# 指数退避重试 - 每次等待时间翻倍
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.code === 'RATE_LIMIT') {
        # 1s, 2s, 4s...
        const wait = Math.pow(2, i) * 1000;
        await sleep(wait + Math.random() * 1000);  # 加随机抖动
      } else {
        throw error;  # 非限流错误直接抛
      }
    }
  }
  throw new Error(`Failed after ${maxRetries} retries`);
}

# 使用示例
const result = await retryWithBackoff(
  () => anthropic.messages.create({...})
);
# 第1次失败等1s，第2次等2s，第3次等4s
# 总等待: 7s（比傻等30s好多了）
                

2. 熔断器 (Circuit Breaker)

当某个服务持续失败时，直接"拉闸"，不再浪费时间请求它：

┌─────────┐ 连续失败5次 ┌─────────┐ 30秒后 ┌─────────┐
│ CLOSED │ ──────────→ │ OPEN │ ───────→ │HALF_OPEN│
│ (正常) │ │ (熔断) │ 半开测试 │ (试探) │
└─────────┘ ←────────── └─────────┘ ←────── └─────────┘
^ 试探成功试探失败 ^
└──────────────────────────────────────────────────────────┘

# 熔断器实现
class CircuitBreaker {
  constructor(failureThreshold = 5, resetTimeout = 30000) {
    this.state = 'CLOSED';  # 正常
    this.failures = 0;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      throw new Error('Circuit is OPEN - service unavailable');
    }

    try {
      const result = await fn();
      this.onSuccess();  # 成功，重置计数
      return result;
    } catch (error) {
      this.onFailure();  # 失败，增加计数
      throw error;
    }
  }

  onFailure() {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      setTimeout(() => {
        this.state = 'HALF_OPEN';  # 30s后试探
      }, this.resetTimeout);
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }
}

# 使用示例
const githubBreaker = new CircuitBreaker();
const issues = await githubBreaker.call(
  () => github.issues.list("miaoquai/tools")
);
                

3. 优雅降级 (Graceful Degradation)

# 优雅降级 - 主路径失败时走备用路径
async function generateNewsWithFallback() {
  # 方案 1：正常路径 - 多源搜索
  try {
    const results = await Promise.all([
      searchGoogle(query),
      searchHackerNews(query),
      searchReddit(query)
    ]);
    return generateDailyNews(results);
  } catch (error) {
    console.warn(`多源搜索失败: ${error.message}`);
    
    # 方案 2：降级 - 只用 RSS
    try {
      const rss = await fetchRSSFeeds();
      return generateDailyNews(rss);
    } catch (error2) {
      console.warn(`RSS 也失败了: ${error2.message}`);
      
      # 方案 3：终极降级 - 生成简化版
      return {
        status: "degraded",
        message: "今日新闻源暂时不可用，以下为缓存内容",
        content: await getCachedNews()
      };
    }
  }
}
                

4. 超时保护 (Timeout Guard)

# 每个工具调用都设超时
const toolTimeouts = {
  "web_search": 10000,   # 10秒
  "web_fetch": 15000,   # 15秒
  "write": 5000,       # 5秒
  "exec": 30000,       # 30秒
  "message": 10000     # 10秒
};

async function callWithTimeout(tool, params, timeout) {
  return Promise.race([
    tool.execute(params),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('Tool timeout')), timeout)
    )
  ]);
}

# OpenClaw Agent 配置
{
  "toolTimeouts": {
    "default": 15000,
    "perTool": toolTimeouts,
    "totalTaskTimeout": 300000  # 整个任务最多5分钟
  }
}
                

5. 检查点恢复 (Checkpoint Recovery)

# 检查点机制 - 记录进度，失败后从断点恢复
class CheckpointManager {
  async runWithCheckpoint(taskId, steps) {
    # 检查是否有之前的检查点
    const checkpoint = await this.loadCheckpoint(taskId);
    const startFrom = checkpoint?.completedStep || 0;

    for (let i = startFrom; i < steps.length; i++) {
      try {
        await steps[i].execute();
        
        # 成功，保存检查点
        await this.saveCheckpoint(taskId, {
          completedStep: i + 1,
          timestamp: Date.now()
        });
      } catch (error) {
        throw new Error(
          `Task failed at step ${i}: ${error.message}. ` +
          `Resume with checkpoint.`
        );
      }
    }
  }
}
                

🚀 OpenClaw 实战

妙趣AI 的容错架构

# OpenClaw Agent 容错配置
{
  "name": "miaoquai-ops",
  
  "faultTolerance": {
    # 重试策略
    "retry": {
      "maxRetries": 3,
      "backoff": "exponential",
      "baseDelay": 1000,
      "jitter": true
    },
    
    # 熔断器
    "circuitBreaker": {
      "failureThreshold": 5,
      "resetTimeout": 60000
    },
    
    # 超时保护
    "timeouts": {
      "toolCall": 15000,
      "totalTask": 300000
    },
    
    # 降级策略
    "fallback": {
      "enabled": true,
      "actions": [
        "使用缓存内容",
        "生成简化版输出",
        "记录错误并跳过"
      ]
    },
    
    # 检查点
    "checkpoint": {
      "enabled": true,
      "storage": "/var/www/miaoquai/.checkpoints/"
    }
  }
}
                

💡 OpenClaw 容错最佳实践：
1. 所有外部调用都加超时：特别是 web_fetch 和 exec
2. 生产环境必须开熔断器：防止级联故障
3. 长任务必开检查点：日报生成、SEO 批量处理
4. 降级方案要预先测试：别等到线上出问题才测
5. 错误日志要分类：限流/超时/格式错误分开处理

📊 容错 vs 不容错

指标	无容错	有容错
日报生成成功率	75%	98%
平均故障恢复时间	需要人工介入	自动恢复 < 30s
API 限流影响	任务完全中断	降级继续运行
长任务完成率	60%	95%

📝 总结

Agent Fault Tolerance 不是锦上添花，而是生产环境的生命线。就像你坐飞机——你不希望飞机有一个零件坏了就掉下来，你希望它有冗余设计。五个策略：重试、熔断、降级、超时、检查点——构成了 Agent 的"免疫系统"。

凌晨 3 点 17 分，Agent 第 47 次崩溃。但这次，它在 5 秒内自动恢复了。因为有了容错设计，我终于可以安心睡觉了。