OpenClaw 错误处理模式

前言：优雅地失败

世界上有一种程序，叫做"能在失败中优雅恢复的程序"。凌晨5点13分，我的Agent在执行批量SEO任务时撞上了API限流。没有崩溃，没有卡死，它自动切换到备用方案，默默完成了任务。第二天早上我才发现日志里记录了这次惊险的故障恢复。

这就是错误处理的艺术。这篇教程教你如何让你的Agent不怕出错。

错误分类

可恢复错误

网络超时、API限流、临时文件锁定

自动重试即可

部分失败

批量任务中部分失败、工具调用参数错误

跳过失败项继续

致命错误

权限不足、内存溢出、数据损坏

需要人工干预

模式一：指数退避重试

网络请求失败时最常用的策略。每次重试间隔加倍，避免雪崩：

async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      const delay = Math.pow(2, i) * 1000; // 1s, 2s, 4s
      console.log(`重试 ${i + 1}/${maxRetries}，等待 ${delay}ms`);
      
      if (i === maxRetries - 1) throw error;
      await sleep(delay);
    }
  }
}

// 使用示例
const result = await retryWithBackoff(() => webFetch(url));

最佳实践：OpenClaw的exec工具支持yieldMs和timeout参数，利用这些参数实现优雅的重试和超时控制。

模式二：优雅降级

当主要方案不可用时，切换到备选方案：

async function searchWithFallback(query) {
  // 主方案：web_search
  try {
    return await primarySearch(query);
  } catch (e) {
    console.log("主搜索失败，切换备用方案");
  }
  
  // 备用1：web_fetch
  try {
    const content = await webFetch("https://search.com?q=" + query);
    return extractResults(content);
  } catch (e) {
    console.log("备用搜索也失败，使用缓存");
  }
  
  // 备用2：返回缓存
  return getFromCache(query) || "暂无搜索结果";
}

模式三：批量容错

批量处理时，个别失败不应该影响整体：

async function batchProcess(items) {
  const results = [];
  const errors = [];
  
  for (const item of items) {
    try {
      const result = await processItem(item);
      results.push({ item, status: "success", result });
    } catch (error) {
      results.push({ item, status: "failed", error: error.message });
      errors.push({ item, error });
    }
  }
  
  // 汇总报告
  console.log(`成功: ${results.length - errors.length}/${results.length}`);
  if (errors.length > 0) {
    console.log("失败项:", errors);
  }
  
  return { results, errors };
}

模式四：超时控制

防止任务无限挂起：

// exec工具的超时控制
await exec({
  command: "some-long-running-command",
  timeout: 300,  // 5分钟超时
  yieldMs: 10000  // 10秒后后台运行
});

// 子Agent的超时控制
await sessions_spawn({
  task: "处理数据",
  runtime: "subagent",
  runTimeoutSeconds: 300  // 5分钟超时
});

模式五：审批流程安全阀

敏感操作的审批机制是最重要的安全措施：

allow-once：单次审批，用完作废
elevated：需要提升权限的操作
审批命令：用户通过/approve审批具体命令

红线：永远不要绕过审批机制。每次审批只能覆盖一个命令。如果一个操作需要多次敏感调用，需要多次审批。

模式六：进程监控

使用process工具管理后台任务：

// 启动后台任务
await exec({
  command: "long-running-task",
  background: true
});

// 检查状态
const status = await process({
  action: "poll",
  sessionId: "task-id",
  timeout: 30000  // 等待30秒
});

// 如需干预
await process({
  action: "kill",
  sessionId: "task-id"
});

实战：妙趣AI的容错体系

我们的Agent采用了多层容错设计：

L1：工具级——单个工具调用失败时重试3次
L2：任务级——批量任务中单个失败不影响整体
L3：会话级——子Agent超时自动终止并报告
L4：通知级——所有错误记录日志并通知管理员

真实案例：在一次批量SEO生成任务中，326个页面中有12个因磁盘空间不足失败。容错系统跳过失败项，继续生成剩余314个，最终通过通知系统提醒管理员清理磁盘。

最佳实践总结

所有外部调用都必须有超时设置
批量任务必须有部分容错能力
敏感操作必须有审批流程
错误必须有日志记录
每个错误场景都要有对应的恢复策略