⏱️ Agent Timeout Management

When an AI gets lost in "deep thought", who calls "time's up"?

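Have you ever run into this? You ask an AI a question, it shows "Thinking...", and the spinner is still going 5 minutes later with no response. You start to wonder whether it fell asleep or wandered off to ponder the meaning of life.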

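That is when you need an alarm clock, and timeout management is that merciless alarm clock. When an Agent runs past its time limit, the alarm goes off: "Stop thinking, time to act!"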

What Is an Agent Timeout?

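An Agent Timeout occurs when an Agent stays unresponsive for too long while executing a task, for reasons such as a slow model response, a stuck tool call, or an infinite loop or deadlock. Timeout management means setting time boundaries and triggering a predefined handling strategy once a boundary is exceeded.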

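Common Causes of Timeouts

Slow model response - LLM inference takes too long (especially with complex reasoning chains)
Blocked tool calls - an external API is unresponsive or slow
Infinite loops and deadlocks - the Agent repeats the same steps forever, or waits on something that never completes
Resource contention - concurrent tasks compete for the same resources
Network issues - connection timeouts, packet loss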

Timeout Types at a Glance

1. Global Timeout

The maximum execution time for the entire task. Once it is exceeded, the task is terminated.

# OpenClaw global timeout configuration
agent:
  name: data_processor
  timeout:
    global: 300  # 5-minute overall limit (seconds)
    on_timeout: "terminate"  # or "handoff", "retry"
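To make the idea concrete outside of any particular framework, here is a minimal Python sketch of a global deadline built on `asyncio.wait_for`. The function name `run_with_global_timeout` and the returned dictionary shape are assumptions made for illustration; they are not part of OpenClaw.

```python
import asyncio


async def run_with_global_timeout(task_factory, global_timeout_s, on_timeout="terminate"):
    """Run one agent task under an overall deadline.

    task_factory is a hypothetical zero-argument callable that returns
    the coroutine doing the agent's work.
    """
    try:
        result = await asyncio.wait_for(task_factory(), timeout=global_timeout_s)
        return {"status": "ok", "result": result}
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising, so the
        # work does not keep running in the background.
        return {"status": "timeout", "action": on_timeout}
```

The caller reads `action` to decide what happens next, which mirrors the `on_timeout` setting in the config above.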

2. Step Timeout

A timeout limit for each individual step.

timeout:
  llm_call: 60        # at most 60 seconds per LLM call
  tool_call: 30       # 30 seconds per tool call
  web_search: 15      # 15 seconds per web search
  file_operation: 10  # 10 seconds per file operation
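A per-step limit can be sketched as a lookup table plus a wrapper that reports which step ran over. The names `STEP_TIMEOUTS`, `StepTimeout`, and `run_step` are hypothetical; the numbers are copied from the config above.

```python
import asyncio

# Per-step budgets in seconds, mirroring the config above.
STEP_TIMEOUTS = {"llm_call": 60, "tool_call": 30, "web_search": 15, "file_operation": 10}


class StepTimeout(Exception):
    """Raised when a single step exceeds its own budget."""

    def __init__(self, step, limit_s):
        super().__init__(f"step '{step}' exceeded {limit_s}s")
        self.step = step
        self.limit_s = limit_s


async def run_step(step, coro, timeouts=STEP_TIMEOUTS):
    """Await coro, but give up after the budget configured for this step type."""
    limit_s = timeouts[step]
    try:
        return await asyncio.wait_for(coro, timeout=limit_s)
    except asyncio.TimeoutError:
        # Re-raise with the step name attached so the caller can pick a strategy.
        raise StepTimeout(step, limit_s) from None
```

Carrying the step name in the exception lets the caller apply a different policy per step, for example retrying a web search but aborting on a file operation.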

3. Streaming Timeout

During streaming output, the maximum allowed gap between two consecutive tokens.

streaming:
  enabled: true
  token_interval_timeout: 5  # time out after 5 seconds without a new token
  on_stall: "notify_and_continue"
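One way to picture an inter-token timeout is to wrap the token stream and put a deadline on each read rather than on the whole response. The sketch below assumes the stream is an async iterator, and `stream_with_stall_timeout` is a hypothetical name. Note that it aborts on a stall; a "notify and continue" behavior would need a different design, because the timed-out read is cancelled here.

```python
import asyncio


async def stream_with_stall_timeout(token_stream, token_interval_timeout_s):
    """Yield tokens from an async iterator; fail if the gap between tokens is too long."""
    iterator = token_stream.__aiter__()
    while True:
        try:
            # The deadline applies to each token, not to the whole stream.
            token = await asyncio.wait_for(
                iterator.__anext__(), timeout=token_interval_timeout_s
            )
        except StopAsyncIteration:
            return  # the stream finished normally
        except asyncio.TimeoutError:
            raise TimeoutError(
                f"no token received for {token_interval_timeout_s}s"
            ) from None
        yield token
```

A long answer that keeps producing tokens never trips this limit, while a stream that goes silent is caught quickly.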

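Timeout Handling Strategies

Terminate - stop the task immediately and return an error message

Retry - re-run the current step, up to a configured number of attempts

Fallback - switch to a backup option or a simplified strategy

Handoff - pass the task to another Agent or to a human

Partial - return whatever results have been completed so far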
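Of the strategies above, "Partial" is the least obvious to implement, so here is a small sketch. It assumes the task is a sequence of steps, each a zero-argument coroutine factory, and `run_steps_with_partial` is a hypothetical name.

```python
import asyncio
import time


async def run_steps_with_partial(steps, global_timeout_s):
    """Run steps in order; when the overall deadline hits, return what finished."""
    deadline = time.monotonic() + global_timeout_s
    completed = []
    for step in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return {"status": "partial", "results": completed}
        try:
            # Each step may only use whatever is left of the global budget.
            completed.append(await asyncio.wait_for(step(), timeout=remaining))
        except asyncio.TimeoutError:
            return {"status": "partial", "results": completed}
    return {"status": "complete", "results": completed}
```

The `status` field tells the caller whether the result list is complete, so a partial answer is never mistaken for a full one.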

OpenClaw Timeout Configuration in Practice

# OpenClaw full timeout management configuration
agent:
  name: smart_assistant

  timeout:
    # Global timeout (seconds)
    global: 180

    # Per-step timeouts
    steps:
      planning: 30
      tool_execution: 45
      response_generation: 60

    # Retry policy
    retry:
      max_attempts: 3
      backoff: "exponential"  # exponential backoff
      base_delay: 2
      max_delay: 30

    # Behavior on timeout
    on_timeout:
      step_timeout: "retry"
      global_timeout: "partial_response"
      max_retries_exceeded: "handoff_to_human"

    # Timeout notification
    notify:
      message: "This is taking longer than usual, please wait..."
      progress_updates: true

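Exponential Backoff

A retry should not fire immediately. Use exponential backoff so that repeated retries do not pile onto an already struggling service and trigger a cascading failure: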

# Exponential backoff illustration (base = 2s, max_delay = 30s)
retry_1: wait 2s   → retry  # 2^0 * base
retry_2: wait 4s   → retry  # 2^1 * base
retry_3: wait 8s   → retry  # 2^2 * base
retry_4: wait 16s  → retry  # 2^3 * base
retry_5: wait 30s  → retry  # 2^4 * base = 32s, capped at max_delay
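The schedule above follows the formula delay = min(base * 2^(n-1), max_delay) for retry number n. A minimal sketch of that calculation and a retry loop around it might look like this; `backoff_delay` and `retry_with_backoff` are hypothetical names, and the injectable `sleep` parameter exists only to make the loop easy to test.

```python
import asyncio


def backoff_delay(attempt, base_delay=2.0, max_delay=30.0):
    """Delay before retry number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(base_delay * 2 ** (attempt - 1), max_delay)


async def retry_with_backoff(operation, max_attempts=3, base_delay=2.0,
                             max_delay=30.0, sleep=asyncio.sleep):
    """Call `operation` up to max_attempts times, waiting longer after each timeout."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller fall back or hand off
            await sleep(backoff_delay(attempt, base_delay, max_delay))
```

Many production systems also add random jitter to each delay so that clients do not all retry at the same instant; that is left out here to keep the schedule identical to the illustration.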

Fallback Strategy Example

# OpenClaw fallback configuration
fallback:
  on_timeout:
    # Fallback on LLM timeout
    llm:
      primary: "claude-3-opus"
      fallback: "claude-3-sonnet"  # a faster model
      final_fallback: "cached_response"

    # Fallback on tool timeout
    tool:
      web_search:
        primary: "brave_search"
        fallback: "cached_results"

      database:
        primary: "postgres"
        fallback: "redis_cache"
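The primary/fallback chain in this config boils down to "try the options in order and move on when one times out". Here is a rough sketch under that assumption; `call_with_fallbacks` is a hypothetical helper, and each provider is assumed to be a zero-argument async callable.

```python
import asyncio


async def call_with_fallbacks(providers, timeout_s):
    """Try each (name, async callable) pair in order; skip any that time out.

    Returns (name, result) for the first provider that answers in time.
    """
    for name, call in providers:
        try:
            return name, await asyncio.wait_for(call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # this option was too slow, try the next one
    raise TimeoutError("all providers timed out")
```

Returning the provider name alongside the result makes it possible to record when a fallback was used, which helps when tracking how often the primary option fails.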

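Timeout Monitoring and Alerting

Key metrics to monitor:

P95 latency - the time within which 95% of requests complete
Timeout rate - number of timeouts / total number of requests
Retry success rate - the share of retries that eventually succeed
Fallback trigger rate - how often a fallback strategy is triggered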

# OpenClaw monitoring configuration
monitoring:
  metrics:
    - "timeout_count"
    - "retry_count"
    - "fallback_count"
    - "p95_latency"
  
  alerts:
    - condition: "timeout_rate > 5%"
      action: "slack_notification"
    - condition: "p95_latency > 60s"
      action: "auto_scale"
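To show how these metrics and alert conditions could be evaluated, here is a small sketch. It uses the nearest-rank definition of P95, which is one of several common percentile conventions, and the function names are hypothetical. The thresholds (5% and 60s) are taken from the config above.

```python
def p95(latencies_s):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def timeout_rate(timeout_count, total_requests):
    """Timeouts divided by total requests; 0.0 when there were no requests."""
    return timeout_count / total_requests if total_requests else 0.0


def fired_alerts(latencies_s, timeout_count, total_requests):
    """Return the actions whose alert condition currently holds."""
    alerts = []
    if timeout_rate(timeout_count, total_requests) > 0.05:
        alerts.append("slack_notification")
    if p95(latencies_s) > 60:
        alerts.append("auto_scale")
    return alerts
```

For example, with 6 timeouts out of 100 requests the timeout rate is 6%, which exceeds the 5% threshold and would fire the Slack notification.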

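Common Pitfalls

🚫 Timeout too short - healthy tasks get killed by mistake
🚫 Timeout too long - users wait until they give up
🚫 Unlimited retries - a retry storm drags the system down
🚫 No fallback - a timeout becomes an outright failure and a poor user experience
🚫 Ignoring monitoring - you never learn how often timeouts happen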

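Best Practices Checklist

Tiered timeouts - global timeout > step timeout > operation timeout
Exponential backoff - increase the interval between retries to avoid cascading failures
Graceful degradation - always have a backup option
User notification - tell the user about progress when a timeout occurs
Instrumentation - record the cause and frequency of timeouts
Dynamic tuning - adjust timeout settings based on historical data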
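The first item on the checklist, tiered timeouts, is easy to get wrong by accident, so a configuration check can help. The sketch below is one possible interpretation: every step limit must be below the global limit, and every operation limit must be below the longest step limit. `check_timeout_tiers` is a hypothetical helper, not an OpenClaw feature.

```python
def check_timeout_tiers(global_s, step_timeouts, operation_timeouts):
    """Return a list of problems; an empty list means the tiers nest correctly."""
    problems = []
    for name, limit in step_timeouts.items():
        if limit >= global_s:
            problems.append(
                f"step '{name}' ({limit}s) must be below global ({global_s}s)"
            )
    if step_timeouts:
        longest_step = max(step_timeouts.values())
        for name, limit in operation_timeouts.items():
            if limit >= longest_step:
                problems.append(
                    f"operation '{name}' ({limit}s) must be below "
                    f"the longest step ({longest_step}s)"
                )
    return problems
```

Running a check like this when the configuration loads catches a step limit that could never fire because the global limit would always trigger first.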