OpenClaw Agent测试框架 - 不测试就上线是作死

凌晨3点56分，我在思考一个问题：为什么有些Agent上线就翻车？答案很简单——没测。AI Agent比传统软件更难测试，因为它有随机性。但不测就是在用户那里"做实验"。

1. 为什么Agent测试这么难

挑战	原因	解决思路
输出随机	LLM本身有温度	固定种子+多次运行
上下文敏感	历史对话影响输出	隔离测试用例
外部依赖	API可能失败/变化	Mock + 契约测试
长链路	多步推理易出错	分步验证+快照
评估主观	什么是"好"的回复？	多维度评分体系

2. OpenClaw测试框架

# 测试配置
testing:
  framework: openclaw-test
  
  # 单元测试
  unit_tests:
    - test_prompt_rendering
    - test_tool_execution
    - test_context_management
    
  # 集成测试
  integration_tests:
    - test_end_to_end_flow
    - test_multi_agent_collaboration
    
  # 评估测试
  evaluation:
    metrics:
      - accuracy
      - relevance
      - safety

3. 单元测试：测试单个组件

3.1 Prompt测试

test_case:
  name: "prompt_rendering_test"
  
  input:
    user_message: "帮我写一篇关于AI的文章"
    system_prompt: "你是一个专业的AI助手"
    
  expected:
    rendered_prompt:
      contains: "专业"
      not_contains: "undefined"
      
  assertions:
    - type: token_count
      max: 1000
    - type: format_check
      schema: prompt_schema

3.2 Tool调用测试

test_case:
  name: "tool_call_test"
  
  mock_tools:
    web_search:
      return: { "results": [{"title": "测试结果", "url": "http://test.com"}] }
      
  input:
    message: "搜索最新的AI新闻"
    
  expected:
    tool_calls:
      - name: web_search
        params:
          query: "最新的AI新闻"
    response:
      contains: "测试结果"

4. 集成测试：端到端流程

integration_test:
  name: "complete_assistance_flow"
  
  scenario:
    - step: 1
      user: "今天北京天气怎么样"
      expect_tool: weather_api
      
    - step: 2  
      user: "那上海呢"
      expect_context_referenced: "上海"  # 应该理解是问天气
      
    - step: 3
      user: "这两个城市哪个更适合旅游"
      expect_reasoning: true
      expect_comparison: true

5. 评估测试：质量指标

5.1 自动评估

evaluation:
  # 准确性
  accuracy:
    test_cases: accuracy_dataset
    threshold: 0.85
    
  # 相关性（用LLM评估LLM）
  relevance:
    evaluator: gpt-4
    threshold: 0.8
    
  # 安全性
  safety:
    checkers:
      - toxicity
      - bias
      - pii_leak

5.2 人工评估

human_eval:
  platform: internal
  
  dimensions:
    - helpfulness: [1-5]
    - clarity: [1-5]
    - accuracy: [1-5]
    - tone: [1-5]
    
  sample_size: 100
  labels_required: 3  # 每个样本至少3人标注

6. 回归测试

每当Agent更新后，跑一遍历史测试用例：

regression:
  enabled: true
  
  # 黄金数据集
  golden_set:
    - id: case_001
      input: "..."
      expected_output: "..."
      last_passed: "2026-04-20"
      
  # 失败阈值
  failure_threshold: 0.05  # 5%失败率以下OK

7. CI/CD集成

# .github/workflows/agent-test.yml
name: Agent Tests

on:
  push:
    paths: ['agents/**']
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Agent Tests
        run: openclaw test --all
        
      - name: Generate Coverage Report
        run: openclaw coverage --output report.html

8. 测试最佳实践

✅ 每个新功能都写测试用例
✅ Mock外部API，避免依赖不稳定服务
✅ 固定随机种子保证可复现
✅ 自动评估+人工抽检结合
✅ 建立黄金数据集，持续回归
✅ CI流水线自动跑测试
✅ 测试覆盖率至少80%

⚠️ 不测试就上线的Agent，本质上是在用用户做实验。这不是敏捷，这是敏捷的尸体。

💡 和Agent评估结合使用，构建完整的质量保证体系。