🔍 OpenClaw Web Search and Content Fetch Techniques

Search, fetch, aggregate: the complete toolkit for AI information gathering

There is a superpower called "omniscience", and it turns an AI into a library connected to the whole world. web_search is the index, web_fetch is the reader, and RSS is the subscription system. Combine the three and you have your own private intelligence station.

🔎 web_search Query Optimization

Basic Search

# Basic search (backed by DuckDuckGo)
web_search(query="OpenClaw tutorial", count=5)

# Parameters
# query: search keywords
# count: number of results to return (1-10)
# region: region code (e.g. us-en, cn-zh)
# safeSearch: safe-search level (strict, moderate, off)
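Since `count` has a documented range, a thin wrapper can validate and clamp parameters before calling the tool. A minimal sketch, where `safe_search` is a hypothetical helper and `search_fn` stands in for the actual web_search tool (injected so the sketch is testable):

```python
def safe_search(search_fn, query, count=5, region=None, safe_search="moderate"):
    """Validate and clamp parameters before delegating to the search tool."""
    query = query.strip()
    if not query:
        raise ValueError("query must be non-empty")
    count = max(1, min(10, count))  # documented range: 1-10
    kwargs = {"query": query, "count": count, "safeSearch": safe_search}
    if region is not None:
        kwargs["region"] = region  # e.g. "us-en", "cn-zh"
    return search_fn(**kwargs)
```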

Advanced Search Techniques

# 1. Keyword strategies
# Exact-phrase match
web_search(query='"OpenClaw Skills" tutorial')

# Exclude a term
web_search(query="OpenClaw tutorial -beginner")

# OR search
web_search(query="OpenClaw OR ClawHub tutorial")

# Restrict to one site
web_search(query="site:docs.openclaw.ai cron")

# File type
web_search(query="OpenClaw filetype:pdf guide")

# 2. Optimizing Chinese-language searches
web_search(query="OpenClaw 教程 入门", region="cn-zh")  # "OpenClaw tutorial, getting started"
web_search(query="AI Agent 自动化 实战", count=10)       # "AI Agent automation in practice"

# 3. Freshness-sensitive searches
web_search(query="OpenClaw 2026 new features")
web_search(query="AI Agent latest news April 2026")
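These operators compose mechanically, so a small builder function keeps complex queries consistent. A sketch (`build_query` is a hypothetical helper, not part of the platform):

```python
def build_query(terms, exact=None, exclude=(), site=None, filetype=None, any_of=()):
    """Assemble a query string from common search operators."""
    parts = list(terms)
    if exact:
        parts.append(f'"{exact}"')         # exact-phrase match
    if any_of:
        parts.append(" OR ".join(any_of))  # OR search
    parts.extend(f"-{term}" for term in exclude)  # excluded terms
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)
```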

Processing Search Results

# Structure of search results
result = web_search(query="OpenClaw tutorial", count=5)
# Returns:
# [
#   {
#     "title": "OpenClaw Tutorial: Setup and Skills",
#     "url": "https://example.com/openclaw-tutorial",
#     "snippet": "Learn how to set up OpenClaw..."
#   }
# ]

# Batch-search multiple keywords
queries = [
    "OpenClaw Skills教程",
    "Agent自动化工作流",
    "MCP协议集成指南"
]
for q in queries:
    results = web_search(query=q, count=5)
    process_results(results)
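Batch queries often return overlapping hits, so deduplicating by URL before processing avoids repeated work. A sketch (`dedupe_results` is a hypothetical helper operating on result dicts shaped like the structure shown above):

```python
def dedupe_results(batches):
    """Merge several result batches, keeping the first hit per normalized URL."""
    seen = set()
    merged = []
    for batch in batches:
        for item in batch:
            key = item["url"].rstrip("/").lower()
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged
```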

🌐 web_fetch Content Extraction

Basic Usage

# Extract page content (converted to Markdown)
web_fetch(url="https://example.com/article", extractMode="markdown")

# Extract plain text
web_fetch(url="https://example.com/article", extractMode="text")

# Cap the character count (avoid oversized output)
web_fetch(url="https://example.com", maxChars=10000)
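If you need to trim content after the fact, truncation is easy to do client-side as well. A sketch of a word-boundary truncation helper (`truncate_chars` is a hypothetical helper, and the `[truncated]` marker is an arbitrary choice):

```python
def truncate_chars(text, max_chars):
    """Truncate to at most max_chars, cutting at the last space to avoid splitting a word."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars)
    if cut <= 0:        # no space found: hard cut at the limit
        cut = max_chars
    return text[:cut].rstrip() + " [truncated]"
```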

Handling Different Content Types

# 1. Technical documentation
web_fetch(
    url="https://docs.openclaw.ai/tools/skills",
    extractMode="markdown",
    maxChars=50000  # documentation can run long
)

# 2. Blog posts
web_fetch(
    url="https://blog.example.com/post/123",
    extractMode="markdown"
)

# 3. GitHub README
web_fetch(
    url="https://raw.githubusercontent.com/user/repo/main/README.md",
    extractMode="markdown"
)

# 4. API documentation pages
web_fetch(
    url="https://api.example.com/docs",
    extractMode="text",
    maxChars=30000
)
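The choice of extractMode and maxChars above follows the content type, which can often be guessed from the URL. A heuristic sketch (`fetch_plan` is hypothetical; the thresholds mirror the examples above, not platform rules):

```python
def fetch_plan(url):
    """Guess sensible web_fetch settings from the URL shape."""
    if "raw.githubusercontent.com" in url or url.endswith(".md"):
        return {"extractMode": "markdown", "maxChars": 20000}  # README / raw markdown
    if "api." in url:
        return {"extractMode": "text", "maxChars": 30000}      # API reference pages
    if "docs." in url or "/docs" in url:
        return {"extractMode": "markdown", "maxChars": 50000}  # long-form documentation
    return {"extractMode": "markdown", "maxChars": 10000}      # default: articles, blogs
```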

Content Post-Processing

# Fetch + analyze pipeline
def analyze_webpage(url):
    # 1. Fetch the page
    content = web_fetch(url=url, extractMode="markdown")

    # 2. Build the extraction prompt (an f-string, so {content} is interpolated)
    analysis = f"""
    Analyze the following web page content:
    {content}

    Please extract:
    1. The main points (3-5 items)
    2. Key data
    3. Related links
    4. A one-sentence summary
    """

    # 3. Run the AI analysis
    result = run_ai_analysis(analysis)
    return result
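Pages longer than the model's comfortable context can be analyzed chunk by chunk. A sketch of a splitter with overlap, so sentences cut at a boundary still appear whole in one chunk (`chunk_text` is a hypothetical helper):

```python
def chunk_text(text, max_chars=8000, overlap=200):
    """Split text into chunks of at most max_chars, overlapping by `overlap` chars."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```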

📡 RSS Aggregation in Practice

Common RSS Feeds

RSS_FEEDS = {
    "openai": "https://openai.com/blog/rss.xml",
    "anthropic": "https://www.anthropic.com/rss.xml",
    "huggingface": "https://huggingface.co/blog/feed.xml",
    "mit_tech": "https://www.technologyreview.com/feed/",
    "the_gradient": "https://thegradient.pub/rss/"
}
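If no dedicated RSS tool is available, a basic RSS 2.0 feed can be parsed with the standard library (Atom feeds, which some of these sources may serve, use different element names and would need separate handling). A minimal sketch with an inline sample feed:

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text, limit=10):
    """Extract title/link/pubDate from a basic RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "pubDate": item.findtext("pubDate", ""),
        })
        if len(items) >= limit:
            break
    return items

# Inline sample feed for demonstration
SAMPLE = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Post A</title><link>https://example.com/a</link>
<pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate></item>
<item><title>Post B</title><link>https://example.com/b</link></item>
</channel></rss>"""
```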

# RSS aggregation cron-job configuration
{
  "name": "rss-daily-aggregation",
  "schedule": { "kind": "cron", "expr": "0 */2 * * *", "tz": "Asia/Shanghai" },
  "payload": {
    "kind": "agentTurn",
    "message": """
    Run the RSS aggregation task:
    1. Fetch the latest 10 articles from each of these RSS feeds:
       - OpenAI Blog
       - Anthropic Blog
       - HuggingFace Blog
    2. Retrieve each article's full content with web_fetch
    3. Keep only articles related to AI Agents / OpenClaw
    4. Write a Chinese summary of each article (100 characters or fewer)
    5. Save the result to /var/www/miaoquai/rss/YYYY-MM-DD.html
    6. Update the RSS index page /var/www/miaoquai/rss/index.html
    """
  },
  "sessionTarget": "isolated",
  "delivery": { "mode": "none" }
}
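Step 5 of the job writes to a date-stamped path; note also that the expr `0 */2 * * *` fires every two hours, not once a day, despite the job's "daily" name. A small sketch of the path construction (`daily_rss_path` is a hypothetical helper):

```python
from datetime import date

def daily_rss_path(base="/var/www/miaoquai/rss", day=None):
    """Build the YYYY-MM-DD.html output path used by the aggregation job."""
    day = day or date.today()
    return f"{base}/{day.strftime('%Y-%m-%d')}.html"
```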

📊 Putting It All Together: A Competitor Monitoring System

# Multi-layer information-gathering architecture
from datetime import datetime

class CompetitorMonitor:
    def scan_competitors(self):
        """Scan competitors across every channel"""
        results = {}

        # Layer 1: search-engine discovery
        for competitor in ["futuretools.io", "thereisanaiforthat.com"]:
            search_results = web_search(
                query=f'site:{competitor} new features 2026',
                count=10
            )
            results[f"search:{competitor}"] = search_results

        # Layer 2: direct fetching
        for url in self.competitor_urls:
            content = web_fetch(url, extractMode="markdown")
            features = self.extract_features(content)
            results[f"content:{url}"] = features

        # Layer 3: RSS/blog monitoring
        for feed in self.competitor_feeds:
            articles = self.parse_rss(feed)
            new_articles = self.filter_new(articles)
            results[f"rss:{feed}"] = new_articles

        # Layer 4: social media
        social = web_search(
            query='"AI tools" site:twitter.com OR site:x.com 2026',
            count=10
        )
        results["social"] = social

        return results
    
    def generate_report(self, results):
        """Generate the competitor analysis report"""
        # scan_competitors keys results by "layer:source", so group by layer prefix
        by_layer = {}
        for key, value in results.items():
            by_layer.setdefault(key.split(":", 1)[0], []).append(value)

        report = f"""
# Competitor Report ({datetime.now().strftime('%Y-%m-%d')})

## Search-Engine Discoveries
{format_search_results(by_layer.get("search", []))}

## Website Updates
{format_content_updates(by_layer.get("content", []))}

## RSS Activity
{format_rss_updates(by_layer.get("rss", []))}

## Social-Media Trends
{format_social_trends(by_layer.get("social", []))}
"""
        save_report(report, "/var/www/miaoquai/competitor-report.html")
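The filter_new call in Layer 3 is left abstract above; one simple implementation keeps a set of already-seen links. A standalone sketch (for real incremental updates the seen-set would need to be persisted between runs, e.g. to a file):

```python
def filter_new(articles, seen_links):
    """Return only articles whose link has not been seen, recording them as seen."""
    fresh = []
    for article in articles:
        link = article["link"]
        if link not in seen_links:
            seen_links.add(link)
            fresh.append(article)
    return fresh
```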

⚡ Performance Optimization Tips

🚀 Speed-up strategies
  • Parallel fetching: fetch multiple URLs at once with SubAgents for a 3-5x speedup
  • Cache reuse: never fetch the same URL twice; keep a local cache
  • Truncation control: set maxChars to avoid processing oversized pages
  • Incremental updates: only fetch content that is new since the last run
# Parallel-fetch example
def parallel_fetch(urls):
    # Fetch in parallel with one SubAgent per URL
    workers = []
    for url in urls:
        worker = sessions_spawn(
            task=f"Fetch {url} with web_fetch and return the content as Markdown",
            mode="run",
            timeoutSeconds=60
        )
        workers.append(worker)

    # Collect the results
    results = {}
    for url, worker in zip(urls, workers):
        results[url] = worker.result

    return results
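The cache-reuse strategy can be as simple as an in-memory dict with a TTL. A sketch (`FetchCache` is a hypothetical helper; the clock is injectable so expiry can be tested without waiting):

```python
import time

class FetchCache:
    """In-memory URL cache with a TTL so repeated fetches reuse earlier results."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable clock, useful for testing
        self.store = {}      # url -> (fetched_at, content)

    def get_or_fetch(self, url, fetch_fn):
        """Return cached content for url, calling fetch_fn only if missing or stale."""
        now = self.clock()
        hit = self.store.get(url)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        content = fetch_fn(url)
        self.store[url] = (now, content)
        return content
```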

🛡️ Countering Anti-Scraping Measures

⚠️ Common anti-scraping measures and countermeasures
  • Rate limiting: 429 Too Many Requests → increase the request interval, rotate IPs
  • Cloudflare: challenge/verification page → use browser automation
  • User-Agent checks: 403 Forbidden → set a browser User-Agent
  • Cookie validation: redirected to a login page → use the browser tool with cookies
  • Dynamic rendering: fetched content is empty → wait for JS to run, or use the browser tool
# Fall back to browser when web_fetch fails
import time

def robust_fetch(url):
    try:
        # Prefer web_fetch (fast and lightweight)
        content = web_fetch(url=url, extractMode="markdown")
        if len(content) > 100:  # looks like real content, not an error stub
            return content
    except Exception:
        pass

    # Fall back to browser (slower but more reliable)
    browser(action="open", url=url)
    time.sleep(3)  # give the page time to render
    page = browser(action="snapshot")
    return extract_text(page)
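For rate limiting specifically, backing off and retrying often succeeds without a browser fallback. A sketch with exponential backoff (`fetch_with_backoff` is hypothetical; it assumes the fetch function raises on a rate-limit response, and `sleep` is injectable for testing):

```python
import time

def fetch_with_backoff(fetch_fn, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry fetch_fn on failure, doubling the delay after each attempt."""
    for attempt in range(retries + 1):
        try:
            return fetch_fn(url)
        except RuntimeError:
            if attempt == retries:
                raise  # out of retries: propagate the error
            sleep(base_delay * (2 ** attempt))
```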