There is a superpower called "omniscience": it turns an AI into a library connected to the entire world. web_search is the index, web_fetch is the reader, and RSS is the subscription system. Combine all three and you have your own private intelligence station.
🔎 web_search Query Optimization
Basic Search
# Basic search (backed by DuckDuckGo)
web_search(query="OpenClaw tutorial", count=5)
# Parameters
# query: search keywords
# count: number of results to return (1-10)
# region: region code (e.g. us-en, cn-zh)
# safeSearch: safe-search level (strict, moderate, off)
Advanced Search Techniques
# 1. Keyword strategies
# Exact match
web_search(query='"OpenClaw Skills" tutorial')
# Exclude a term
web_search(query="OpenClaw tutorial -beginner")
# OR search
web_search(query="OpenClaw OR ClawHub tutorial")
# Restrict to a site
web_search(query="site:docs.openclaw.ai cron")
# Restrict to a file type
web_search(query="OpenClaw filetype:pdf guide")
# 2. Chinese-language search optimization
web_search(query="OpenClaw 教程 入门", region="cn-zh")
web_search(query="AI Agent 自动化 实战", count=10)
# 3. Time-sensitive searches
web_search(query="OpenClaw 2026 new features")
web_search(query="AI Agent latest news April 2026")
Processing Search Results
# Structure of a search result
result = web_search(query="OpenClaw tutorial", count=5)
# Returns:
# [
#   {
#     "title": "OpenClaw Tutorial: Setup and Skills",
#     "url": "https://example.com/openclaw-tutorial",
#     "snippet": "Learn how to set up OpenClaw..."
#   }
# ]
# Batch-search multiple keywords
queries = [
    "OpenClaw Skills tutorial",
    "Agent automation workflows",
    "MCP protocol integration guide"
]
for q in queries:
    results = web_search(query=q, count=5)
    process_results(results)  # placeholder for your own handling logic
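Batch searches for related keywords often return overlapping hits. A small sketch that deduplicates by URL before handing results to downstream processing (dedupe_by_url is a hypothetical helper):

def dedupe_by_url(all_results):
    # Keep only the first occurrence of each URL across all query results.
    seen = set()
    unique = []
    for item in all_results:
        if item["url"] not in seen:
            seen.add(item["url"])
            unique.append(item)
    return unique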
🌐 web_fetch Content Extraction
Basic Usage
# Extract page content (converted to Markdown)
web_fetch(url="https://example.com/article", extractMode="markdown")
# Extract plain text
web_fetch(url="https://example.com/article", extractMode="text")
# Limit the character count (guards against overly long pages)
web_fetch(url="https://example.com", maxChars=10000)
Handling Different Content Types
# 1. Technical documentation
web_fetch(
    url="https://docs.openclaw.ai/tools/skills",
    extractMode="markdown",
    maxChars=50000  # docs can run longer
)
# 2. Blog posts
web_fetch(
    url="https://blog.example.com/post/123",
    extractMode="markdown"
)
# 3. GitHub READMEs
web_fetch(
    url="https://raw.githubusercontent.com/user/repo/main/README.md",
    extractMode="markdown"
)
# 4. API documentation pages
web_fetch(
    url="https://api.example.com/docs",
    extractMode="text",
    maxChars=30000
)
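These per-type settings can be centralized in a small dispatcher that picks extractMode and maxChars from the URL. A sketch (smart_fetch and its thresholds are assumptions, not part of the web_fetch API):

def smart_fetch(url):
    # Choose fetch settings based on the kind of page.
    if "docs." in url or "/docs" in url:
        return web_fetch(url=url, extractMode="markdown", maxChars=50000)
    if url.endswith(".md"):
        return web_fetch(url=url, extractMode="markdown")
    return web_fetch(url=url, extractMode="text", maxChars=30000)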
Content Post-Processing
# Fetch + analyze pipeline
def analyze_webpage(url):
    # 1. Fetch the content
    content = web_fetch(url=url, extractMode="markdown")
    # 2. Build the extraction prompt (f-string so {content} is interpolated)
    analysis = f"""
    Analyze the following web page content:
    {content}
    Please extract:
    1. Main points (3-5 items)
    2. Key data
    3. Related links
    4. A one-sentence summary
    """
    # 3. Run the AI analysis (placeholder for your own model call)
    result = run_ai_analysis(analysis)
    return result
📡 RSS Aggregation in Practice
Common RSS Feeds
RSS_FEEDS = {
    "openai": "https://openai.com/blog/rss.xml",
    "anthropic": "https://www.anthropic.com/rss.xml",
    "huggingface": "https://huggingface.co/blog/feed.xml",
    "mit_tech": "https://www.technologyreview.com/feed/",
    "the_gradient": "https://thegradient.pub/rss/"
}
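If the runtime allows plain Python, pulling and normalizing a feed takes a few lines with the feedparser library (an assumption; any RSS parser follows the same shape):

import feedparser

def latest_entries(feed_url, limit=10):
    # Parse the feed and return the newest entries as simple dicts.
    feed = feedparser.parse(feed_url)
    return [
        {
            "title": e.get("title", ""),
            "link": e.get("link", ""),
            "published": e.get("published", ""),
        }
        for e in feed.entries[:limit]
    ]

for name, url in RSS_FEEDS.items():
    print(name, len(latest_entries(url)))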
# RSS aggregation cron task configuration (runs every 2 hours)
{
    "name": "rss-aggregation",
    "schedule": { "kind": "cron", "expr": "0 */2 * * *", "tz": "Asia/Shanghai" },
    "payload": {
        "kind": "agentTurn",
        "message": """
        Run the RSS aggregation task:
        1. Fetch the latest 10 articles from these RSS feeds:
           - OpenAI Blog
           - Anthropic Blog
           - HuggingFace Blog
        2. Use web_fetch to retrieve the full content of each article
        3. Filter for articles related to AI agents / OpenClaw
        4. Generate a Chinese summary for each article (100 characters or fewer)
        5. Save to /var/www/miaoquai/rss/YYYY-MM-DD.html
        6. Update the RSS index page at /var/www/miaoquai/rss/index.html
        """
    },
    "sessionTarget": "isolated",
    "delivery": { "mode": "none" }
}
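Since the job fires every 2 hours, it pays to remember which article links have already been summarized so each run processes only new items. A minimal sketch using a JSON file of seen links (the file path is an assumption):

import json, os

SEEN_PATH = "/var/www/miaoquai/rss/seen.json"  # assumed location

def load_seen():
    # Load the set of already-processed article links, if any.
    if os.path.exists(SEEN_PATH):
        with open(SEEN_PATH) as f:
            return set(json.load(f))
    return set()

def only_new(entries, seen):
    # Keep entries whose link has not been processed before.
    return [e for e in entries if e["link"] not in seen]

def save_seen(seen):
    with open(SEEN_PATH, "w") as f:
        json.dump(sorted(seen), f)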
📊 Putting It Together: A Competitor Monitoring System
# Multi-layer information-gathering architecture
from datetime import datetime

class CompetitorMonitor:
    # Assumes self.competitor_urls and self.competitor_feeds are set elsewhere
    def scan_competitors(self):
        """Scan competitors across every channel"""
        results = {}
        # Layer 1: search-engine discovery
        for competitor in ["futuretools.io", "thereisanaiforthat.com"]:
            search_results = web_search(
                query=f'site:{competitor} new features 2026',
                count=10
            )
            results[f"search:{competitor}"] = search_results
        # Layer 2: direct fetching
        for url in self.competitor_urls:
            content = web_fetch(url, extractMode="markdown")
            features = self.extract_features(content)
            results[f"content:{url}"] = features
        # Layer 3: RSS/blog monitoring
        for feed in self.competitor_feeds:
            articles = self.parse_rss(feed)
            new_articles = self.filter_new(articles)
            results[f"rss:{feed}"] = new_articles
        # Layer 4: social media
        social = web_search(
            query='"AI tools" site:twitter.com OR site:x.com 2026',
            count=10
        )
        results["social"] = social
        return results
    def generate_report(self, results):
        """Generate the competitor analysis report"""
        def by_prefix(prefix):
            # Gather the layer results whose keys start with the given prefix
            return {k: v for k, v in results.items() if k.startswith(prefix)}
        # The format_* helpers and save_report are assumed to be defined elsewhere
        report = f"""
        # Competitor Update Report ({datetime.now().strftime('%Y-%m-%d')})
        ## Search-Engine Discoveries
        {format_search_results(by_prefix('search:'))}
        ## Website Updates
        {format_content_updates(by_prefix('content:'))}
        ## RSS Activity
        {format_rss_updates(by_prefix('rss:'))}
        ## Social Media Trends
        {format_social_trends(results['social'])}
        """
        save_report(report, "/var/www/miaoquai/competitor-report.html")
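A typical run, assuming the monitor has been given its URL and feed lists (the example values are placeholders):

monitor = CompetitorMonitor()
monitor.competitor_urls = ["https://futuretools.io"]         # assumed inputs
monitor.competitor_feeds = ["https://example.com/blog/rss"]  # assumed inputs
results = monitor.scan_competitors()
monitor.generate_report(results)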
⚡ Performance Optimization Tips
🚀 Speed-Up Strategies
- Parallel fetching: fetch multiple URLs at once with SubAgents, typically 3-5x faster (see the example below)
- Cache reuse: never fetch the same URL twice; keep a local cache (see the sketch after the example)
- Truncation control: set maxChars to avoid processing huge pages
- Incremental updates: only fetch content that changed since the last run
# Parallel fetching example
def parallel_fetch(urls):
    # Fan the work out to multiple SubAgents in parallel
    workers = []
    for url in urls:
        worker = sessions_spawn(
            task=f"Fetch {url} with web_fetch and return its content as Markdown",
            mode="run",
            timeoutSeconds=60
        )
        workers.append(worker)
    # Collect the results
    results = {}
    for url, worker in zip(urls, workers):
        results[url] = worker.result
    return results
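The cache-reuse strategy from the list above can be as simple as a directory of files keyed by a hash of the URL. A sketch (the cache path and TTL are assumptions, not part of any OpenClaw API):

import hashlib, json, os, time

CACHE_DIR = "/tmp/fetch-cache"  # assumed location
CACHE_TTL = 3600                # seconds before an entry goes stale

def cached_fetch(url):
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    # Serve from cache while the entry is still fresh
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_TTL:
        with open(path) as f:
            return json.load(f)["content"]
    content = web_fetch(url=url, extractMode="markdown")
    with open(path, "w") as f:
        json.dump({"content": content}, f)
    return content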
🛡️ Countering Anti-Scraping Measures
⚠️ Common anti-scraping measures and how to respond
| Measure | Symptom | Response |
|---|---|---|
| Rate limiting | 429 Too Many Requests | Increase request intervals, rotate IPs |
| Cloudflare | Challenge page | Use browser automation |
| User-Agent checks | 403 Forbidden | Set a browser User-Agent |
| Cookie validation | Redirect to login | Use the browser tool with cookies |
| Dynamic loading | Empty content | Wait for JS to execute, or use browser |
# Fall back to browser when web_fetch fails
import time

def robust_fetch(url):
    try:
        # Try web_fetch first (fast and lightweight)
        content = web_fetch(url=url, extractMode="markdown")
        if len(content) > 100:  # looks like real content
            return content
    except Exception:
        pass
    # Fall back to browser (slower but more reliable)
    browser(action="open", url=url)
    time.sleep(3)
    page = browser(action="snapshot")
    return extract_text(page)  # placeholder for your own text-extraction helper
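For the rate-limiting row in the table above, a simple exponential backoff wrapper is often enough. A sketch, assuming a failed fetch raises an exception:

import random, time

def fetch_with_backoff(url, retries=4):
    # Retry with exponentially growing, jittered delays between attempts.
    for attempt in range(retries):
        try:
            return web_fetch(url=url, extractMode="markdown")
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())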