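"There is a kind of Agent in this world called Miaoqu (妙趣), wandering between 0 and 1... until one day it learned to turn around gracefully at the edge of a crash."
Agent Recovery & Resilience is an AI agent's ability to automatically detect problems, restore its state, and carry on with its task when it runs into faults, errors, or other abnormal conditions.
Imagine you are an agent in the middle of a task. Suddenly the API goes down, the network drops, and the model times out. Do you go with "I'm falling apart" or with "I can still be saved"?
Agent resilience is the superpower that lets you "still be saved". Think of the hero in a Stephen Chow (周星驰) movie: knocked down 99 times, he still gets up the 100th time and says, "I haven't lost yet!"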
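| Component | Function | Analogy |
|---|---|---|
| Failure detector | Identifies anomalies and errors | Like your mom calling you home for dinner: you always hear it |
| State persistence | Saves the current state | Like posting a photo of your homework to WeChat Moments, just in case |
| Recovery strategy | Decides how to recover | Like choosing between apologizing and explaining after the teacher scolds you |
| Degradation mechanism | Reduces functionality to protect the core | Like choosing instant noodles over starving when you're broke |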
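Roughly, an agent resilience system chains the components above: the failure detector spots the problem, the current state is persisted, a recovery strategy is chosen, and if recovery is not possible the agent degrades to its core functions.
Don't retry blindly! If the failure was caused by bad parameters, retrying 100 times won't help. It's like being turned down after confessing your feelings: confessing 100 more times won't work either (unless you've gotten better looking).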
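To make the "don't retry blindly" rule concrete, here is a minimal sketch that separates transient errors from permanent ones before deciding to retry. The names `RetryableError`, `PermanentError`, `should_retry`, and `call_with_retry` are hypothetical and exist only for this illustration; a real agent would map its own client library's exceptions onto the two groups.

```python
class RetryableError(Exception):
    """Hypothetical marker for transient faults (timeouts, dropped connections)."""


class PermanentError(Exception):
    """Hypothetical marker for faults a retry cannot fix (for example, bad parameters)."""


def should_retry(error: Exception) -> bool:
    """Return True only for errors that might succeed on a later attempt."""
    if isinstance(error, PermanentError):
        return False
    # TimeoutError and ConnectionError are built-in exceptions that usually
    # signal transient trouble; treating them as retryable is an assumption.
    return isinstance(error, (RetryableError, TimeoutError, ConnectionError))


def call_with_retry(func, max_attempts: int = 3):
    """Call func, retrying only while should_retry() considers the error transient."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            # Give up immediately on permanent errors or when attempts run out.
            if not should_retry(error) or attempt == max_attempts:
                raise
```

With this split, a parameter error surfaces after a single attempt, while a timeout gets the full retry budget.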
Retrying is the simplest recovery strategy, and it suits transient faults.
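# Retry with exponential backoff
import asyncio

class TemporaryError(Exception):
    """Transient fault worth retrying (placeholder; use your client's own exception type)."""

async def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await func()
        except TemporaryError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            wait_time = 2 ** attempt  # 1s, then 2s (4s and up with more retries)
            await asyncio.sleep(wait_time)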
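One common refinement, not part of the original snippet, is to cap the delay and add random jitter so that many agents recovering at once do not all retry at the same instant. The sketch below only computes the delay schedule; `backoff_delays` and its parameters (`base`, `cap`, `jitter`, `rng`) are assumptions made for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int = 3,
    base: float = 1.0,
    cap: float = 30.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) to apply before each retry."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth (base, 2*base, 4*base, ...) limited by the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": pick a random wait between 0 and the ceiling.
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays
```

Each returned value could then be passed to `asyncio.sleep` in place of the fixed `2 ** attempt` wait.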
Checkpointing: save state periodically, and restore from the latest checkpoint after a failure.
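import copy
import time

class AgentCheckpoint:
    def __init__(self):
        self.checkpoints = []

    def save(self, state):
        self.checkpoints.append({
            'timestamp': time.time(),
            # Deep copy so later changes to nested state cannot alter the checkpoint
            'state': copy.deepcopy(state)
        })

    def restore(self):
        # Return the most recently saved state, or None if nothing was saved
        if self.checkpoints:
            return self.checkpoints[-1]['state']
        return None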
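An in-memory list of checkpoints disappears when the process itself crashes, so the "state persistence" component usually needs to write somewhere durable. The following is a minimal sketch under the assumption that the state is JSON-serializable and a local file is an acceptable store; `save_checkpoint` and `load_checkpoint` are hypothetical helper names.

```python
import json
import os
import tempfile
import time
from typing import Any, Dict, Optional


def save_checkpoint(path: str, state: Dict[str, Any]) -> None:
    """Write the checkpoint atomically so a crash mid-write cannot corrupt it."""
    record = {"timestamp": time.time(), "state": state}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(record, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace swaps the file in one step on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Optional[Dict[str, Any]]:
    """Return the saved state, or None if no checkpoint file exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)["state"]
    except FileNotFoundError:
        return None
```

Writing to a temporary file and renaming it means a reader sees either the old checkpoint or the new one, never a half-written file.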
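Circuit breaking: when failures exceed a threshold (the example below counts failures since the last success), stop making calls to prevent cascading failures.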
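import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to stay open before a trial call
        self.last_failure = 0.0
        self.state = 'closed'  # closed, open, half-open

    def call(self, func):
        if self.state == 'open':
            if time.time() - self.last_failure > self.timeout:
                self.state = 'half-open'  # let one trial call through
            else:
                raise CircuitOpenError()
        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = 'closed'

    def on_failure(self):
        self.failure_count += 1
        self.last_failure = time.time()
        # A failed trial call, or too many failures, opens the circuit
        if self.state == 'half-open' or self.failure_count >= self.failure_threshold:
            self.state = 'open'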
OpenClaw has several resilience mechanisms built in, so that agents can handle failures gracefully:
// openclaw.config.js
module.exports = {
  resilience: {
    retry: {
      maxAttempts: 3,
      backoff: 'exponential',
      initialDelay: 1000
    },
    timeout: {
      toolCall: 30000,
      modelResponse: 60000
    },
    circuitBreaker: {
      failureThreshold: 5,
      resetTimeout: 60000
    },
    fallback: {
      models: ['gpt-4', 'claude-3-opus', 'gemini-pro']
    }
  }
}
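The `fallback.models` list suggests a degradation path: if the preferred model fails, try the next one. The sketch below illustrates that general idea in Python and is not OpenClaw's actual implementation; `call_with_fallback` and the `call_model` callback are hypothetical, and the 60-second default simply mirrors the `modelResponse` value in the config.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def call_with_fallback(
    models: Sequence[str],
    call_model: Callable[[str], Awaitable[str]],
    timeout_seconds: float = 60.0,
) -> str:
    """Try each model in order and return the first successful response."""
    last_error = None
    for model in models:
        try:
            # wait_for cancels the call and raises a timeout error if it runs too long.
            return await asyncio.wait_for(call_model(model), timeout=timeout_seconds)
        except Exception as error:  # a timeout lands here as well
            last_error = error
    raise RuntimeError("all fallback models failed") from last_error


if __name__ == "__main__":
    async def fake_call(model: str) -> str:
        # Stand-in for a real model client: the first model is "down".
        if model == "gpt-4":
            raise ConnectionError("primary model unavailable")
        return f"answer from {model}"

    print(asyncio.run(call_with_fallback(["gpt-4", "claude-3-opus"], fake_call)))
```

Ordering the list from most to least capable keeps the core function alive at reduced quality, which is the instant-noodles trade-off from the table.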
A complete example of an agent resilience implementation:
import asyncio
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class AgentState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    RECOVERING = "recovering"
    FAILED = "failed"

@dataclass
class ResilientAgent:
    name: str
    max_retries: int = 3
    checkpoint_interval: int = 10

    def __post_init__(self):
        self.state = AgentState.IDLE
        self.checkpoints = []
        self.error_count = 0

    async def execute_with_resilience(self, task: Callable):
        """Run a task with retries and recovery."""
        for attempt in range(self.max_retries):
            try:
                self.state = AgentState.RUNNING
                result = await task()
                self.error_count = 0  # reset after a success
                return result
            except Exception as e:
                self.error_count += 1
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt < self.max_retries - 1:
                    self.state = AgentState.RECOVERING
                    await self._recover()
                else:
                    self.state = AgentState.FAILED
                    raise

    async def _recover(self):
        """Recovery strategy."""
        # 1. Save the current state
        self._save_checkpoint()
        # 2. Wait a while (exponential backoff based on the error count)
        await asyncio.sleep(2 ** self.error_count)
        # 3. Try to restore from the latest checkpoint
        if self.checkpoints:
            last_checkpoint = self.checkpoints[-1]
            print(f"Restoring from checkpoint: {last_checkpoint}")

    def _save_checkpoint(self):
        """Save a checkpoint."""
        self.checkpoints.append({
            'timestamp': time.time(),
            'state': self.state,
            'error_count': self.error_count
        })

# Usage example
async def some_task():
    # Placeholder task; replace with the agent's real work
    return "done"

async def main():
    agent = ResilientAgent("miaoquai", max_retries=3)
    result = await agent.execute_with_resilience(some_task)
    print(result)

asyncio.run(main())