WebScraper

Extract readable content from web pages. Use when: user wants to read article content, fetch documentation, grab product info, or get text from URLs. NOT for...
lesliepie
Uncategorized clawhub v1.0.0 1 version 100000 Key: not required

Overview

WebScraper Skill

Extract and parse content from web pages into readable markdown or plain text.

When to Use

USE this skill when:

  • "Read this article: [URL]"
  • "What does this page say?"
  • "Get the content from [URL]"
  • Fetch documentation, blog posts, news articles
  • Extract product information from e-commerce sites
  • Grab API documentation or tutorials
  • Summarize web page content

When NOT to Use

DON'T use this skill for:

  • Login-required pages (use BrowserAgent with session)
  • Heavy JavaScript-rendered content (use BrowserAgent)
  • Interactive web apps (dashboards, SPAs)
  • CAPTCHA-protected sites
  • Sites with strict anti-bot measures
  • Real-time data (stock tickers, live scores)

Commands

Fetch URL Content

# Using OpenClaw web_fetch tool (recommended)
# Called via tool, not direct CLI

# Basic fetch (markdown output)
web_fetch(url: "https://example.com/article")

# Text-only mode (no markdown)
web_fetch(url: "https://example.com/article", extractMode: "text")

# Limit content length
web_fetch(url: "https://example.com/article", maxChars: 5000)

Using curl (fallback)

# Simple HTML fetch
curl -s "https://example.com" | html2text -width 80

# With user-agent (avoid bot detection)
curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "https://example.com"

# Fetch and extract main content (requires readability-cli, which installs the `readable` command)
curl -s "https://example.com" | readable -

# Get just the title (needs GNU grep; -P is not available in BSD/macOS grep)
curl -s "https://example.com" | grep -oP '(?<=<title>).*?(?=</title>)'
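The title one-liner above depends on GNU grep's `-P` flag and on the whole `<title>` element sitting on one line. Where Node.js is available, a rough equivalent might look like the following. `extractTitle` is a hypothetical helper name, and a regex is only an approximation of real HTML parsing.

```javascript
// Hypothetical helper: pull the text of the first <title> element out of raw HTML.
// A regex is a rough tool for HTML; it is used here only to mirror the grep one-liner.
function extractTitle(html) {
  // [^>]* tolerates attributes on the tag; [\s\S]*? lets the title span several lines.
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  if (!match) return null;
  // Collapse internal whitespace so multi-line titles come back on one line.
  return match[1].replace(/\s+/g, " ").trim();
}

const sample = "<html><head><title>\n  Example Domain\n</title></head><body></body></html>";
console.log(extractTitle(sample)); // prints "Example Domain"
```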

Using Node.js (advanced)

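# Install cheerio for HTML parsing (local install, so require() can resolve it)
npm install cheerio

# Parse HTML with Node: pipe the page in and read it from stdin
curl -s "https://example.com" | node -e '
const fs = require("fs");
const cheerio = require("cheerio");
const html = fs.readFileSync(0, "utf8"); // file descriptor 0 is stdin
const $ = cheerio.load(html);
console.log($("article").text());
'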

Response Format

When fetching content, structure responses as:

## 📄 [Page Title]

**Source:** [URL](https://...)
**Fetched:** 2026-03-20

### Content

[Extracted content here...]

---
*Summary: [1-2 sentence summary if helpful]*
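One way to produce this template programmatically is sketched below. `formatResponse` and its field names are hypothetical, and using the URL itself as the link text is an assumption about what the `[URL]` placeholder means.

```javascript
// Hypothetical helper that fills in the response template shown above.
function formatResponse({ title, url, content, summary, fetchedAt = new Date() }) {
  const date = fetchedAt.toISOString().slice(0, 10); // YYYY-MM-DD, in UTC
  const lines = [
    `## 📄 ${title}`,
    "",
    `**Source:** [${url}](${url})`,
    `**Fetched:** ${date}`,
    "",
    "### Content",
    "",
    content.trim(),
  ];
  // The summary line is optional in the template ("if helpful").
  if (summary) {
    lines.push("", "---", `*Summary: ${summary}*`);
  }
  return lines.join("\n");
}

console.log(
  formatResponse({
    title: "Example Domain",
    url: "https://example.com",
    content: "This domain is for use in illustrative examples.",
    summary: "A placeholder page reserved for documentation.",
    fetchedAt: new Date("2026-03-20T12:00:00Z"),
  })
);
```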

Best Practices

1. Respect Rate Limits

# Add delay between requests
sleep 2 && curl "https://example.com/page1"
sleep 2 && curl "https://example.com/page2"

2. Use Proper User-Agent

# Desktop Chrome
curl -s -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "https://example.com"

# Mobile Safari
curl -s -A "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1" "https://example.com"

3. Handle Errors

# Check HTTP status
curl -s -o /dev/null -w "%{http_code}" "https://example.com"

# Timeout after 10 seconds
curl -s --max-time 10 "https://example.com"

# Retry on failure
curl -s --retry 3 "https://example.com"
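The same timeout-and-retry idea can be sketched in Node.js. This assumes Node.js 18 or later (global `fetch` and `AbortSignal.timeout`); `fetchWithRetry` and its options are hypothetical names. It only loosely mirrors the curl flags: it retries network errors, timeouts, and 5xx responses, and gives up immediately on other non-2xx statuses.

```javascript
// Sketch: fetch a page with a per-attempt timeout and a small number of retries.
// `fetchImpl` is injectable so the logic can be exercised without network access.
async function fetchWithRetry(url, { timeoutMs = 10000, retries = 3, fetchImpl = fetch } = {}) {
  let lastError;
  // One initial attempt plus up to `retries` further attempts.
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
      if (res.ok) return await res.text();
      if (res.status < 500) {
        // Client errors such as 404 will not improve on a second try.
        throw Object.assign(new Error(`HTTP ${res.status}`), { fatal: true });
      }
      lastError = new Error(`HTTP ${res.status}`); // server error: retry
    } catch (err) {
      if (err.fatal) throw err;
      lastError = err; // timeout or network error: retry
    }
  }
  throw lastError;
}

// Usage (performs a real request, so it is left commented out):
// fetchWithRetry("https://example.com").then((html) => console.log(html.length));
```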

4. Extract Specific Content

# Get all links
curl -s "https://example.com" | grep -oP 'href="\K[^"]+' | head -20

# Get images
curl -s "https://example.com" | grep -oP 'src="\K[^"]+\.(jpg|png|webp)'

# Get meta description
curl -s "https://example.com" | grep -oP '(?<=<meta name="description" content=")[^"]+'
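For environments without GNU grep, the three extractions above can be approximated in Node.js with plain regular expressions. The helper names below are hypothetical, and like the grep versions they only match double-quoted attributes and will miss unusual markup.

```javascript
// Hypothetical helpers mirroring the grep one-liners above.
function extractAttr(html, attr) {
  const pattern = new RegExp(`${attr}="([^"]+)"`, "g");
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

function extractLinks(html, limit = 20) {
  return extractAttr(html, "href").slice(0, limit);
}

function extractImages(html) {
  // Keep jpg/png/webp sources, allowing an optional query string after the extension.
  return extractAttr(html, "src").filter((src) => /\.(jpg|png|webp)(\?|$)/i.test(src));
}

function extractMetaDescription(html) {
  const m = /<meta\s+name="description"\s+content="([^"]*)"/i.exec(html);
  return m ? m[1] : null;
}

const page = `
<head><meta name="description" content="A demo page"></head>
<body>
  <a href="/docs">Docs</a> <a href="https://example.com/blog">Blog</a>
  <img src="/img/logo.png"> <img src="/img/anim.gif">
</body>`;
console.log(extractLinks(page));           // [ '/docs', 'https://example.com/blog' ]
console.log(extractImages(page));          // [ '/img/logo.png' ]
console.log(extractMetaDescription(page)); // A demo page
```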

Integration with OpenClaw

Using web_fetch Tool

// In your agent code
const content = await web_fetch({
  url: "https://example.com/article",
  extractMode: "markdown",  // or "text"
  maxChars: 10000
});

Batch Processing

For multiple URLs, process sequentially with delays:

URL1 → fetch → wait 2s → URL2 → fetch → wait 2s → URL3 → fetch
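The sequential pattern above could be sketched as follows. `fetchSequentially` is a hypothetical helper, and `fetchOne` stands for any async function that takes a URL and returns its content (for instance a wrapper around `web_fetch`); it is passed in because the real tool is only callable from inside the agent.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, pausing between requests.
async function fetchSequentially(urls, fetchOne, delayMs = 2000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // wait between requests, not before the first
    try {
      results.push({ url: urls[i], content: await fetchOne(urls[i]) });
    } catch (err) {
      // Record the failure and keep going so one bad URL does not abort the batch.
      results.push({ url: urls[i], error: err.message });
    }
  }
  return results;
}

// Usage with a stand-in fetcher (no network access needed):
fetchSequentially(["https://example.com/1", "https://example.com/2"], async (u) => `content of ${u}`, 100)
  .then((results) => console.log(results.length)); // prints 2
```

Collecting errors per URL instead of throwing is a design choice for this sketch; a caller that prefers fail-fast behavior can rethrow inside the `catch`.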

Common Use Cases

1. Article Summarization

1. Fetch article content
2. Extract main text (remove nav, footer, ads)
3. Generate summary
4. Return with source attribution
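Step 2 can be approximated without any dependencies, as in the rough sketch below. `extractMainText` is a hypothetical helper; regex-based cleanup is fragile with nested or malformed markup and does not decode HTML entities, so a readability-style extractor is usually the sturdier option.

```javascript
// Rough sketch of step 2: drop obvious non-content elements, then strip tags.
function extractMainText(html) {
  const noise = ["script", "style", "nav", "header", "footer", "aside"];
  let cleaned = html;
  for (const tag of noise) {
    // Remove the element together with everything inside it.
    cleaned = cleaned.replace(new RegExp(`<${tag}\\b[\\s\\S]*?</${tag}>`, "gi"), " ");
  }
  return cleaned
    .replace(/<[^>]+>/g, " ") // remove the remaining tags
    .replace(/\s+/g, " ")     // collapse whitespace
    .trim();
}

const page = `
<nav><a href="/">Home</a></nav>
<article><h1>Title</h1><p>First paragraph.</p></article>
<footer>© Example</footer>`;
console.log(extractMainText(page)); // prints "Title First paragraph."
```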

2. Product Information

1. Fetch product page
2. Extract: name, price, description, specs
3. Format as structured data
4. Return comparison-ready format
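Some product pages embed schema.org `Product` data as JSON-LD, which is easier to read than scraping visible markup. The sketch below assumes the page does so and returns `null` otherwise. `extractProduct` is a hypothetical helper; it covers name, price, and description but not specs, and it ignores JSON-LD variations such as `@graph` wrappers or array-valued `@type`.

```javascript
// Sketch: read product fields from schema.org JSON-LD, when the page embeds it.
function extractProduct(html) {
  const scripts = html.matchAll(
    /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi
  );
  for (const [, body] of scripts) {
    let data;
    try {
      data = JSON.parse(body);
    } catch {
      continue; // skip malformed JSON-LD blocks
    }
    // A block may hold one object or an array of them.
    for (const item of [].concat(data)) {
      if (item && item["@type"] === "Product") {
        const offer = [].concat(item.offers || [])[0] || {};
        return {
          name: item.name ?? null,
          price: offer.price ?? null,
          currency: offer.priceCurrency ?? null,
          description: item.description ?? null,
        };
      }
    }
  }
  return null;
}

const html = `<script type="application/ld+json">
{"@type":"Product","name":"Widget","description":"A small widget",
 "offers":{"@type":"Offer","price":"9.99","priceCurrency":"USD"}}
</script>`;
console.log(extractProduct(html));
```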

3. Documentation Lookup

1. Fetch docs page
2. Search for the specific topic
3. Extract the relevant section
4. Return code examples + explanations
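When the docs page has been fetched as Markdown, finding the section for a topic can be sketched as below. `findSection` is a hypothetical helper that assumes ATX-style headings (`#`, `##`, and so on) and matches the topic against heading text only.

```javascript
const FENCE = "`".repeat(3); // marker that opens and closes fenced code blocks

// Return the section whose heading mentions `topic`, from that heading up to the
// next heading of the same or a higher level. Returns null when nothing matches.
function findSection(markdown, topic) {
  const lines = markdown.split("\n");
  const needle = topic.toLowerCase();
  let start = -1;
  let level = 0;
  let inFence = false;
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].trimStart().startsWith(FENCE)) {
      inFence = !inFence; // '#' lines inside code blocks are comments, not headings
      continue;
    }
    if (inFence) continue;
    const heading = /^(#{1,6})\s+(.*)$/.exec(lines[i]);
    if (!heading) continue;
    if (start === -1) {
      if (heading[2].toLowerCase().includes(needle)) {
        start = i;
        level = heading[1].length;
      }
    } else if (heading[1].length <= level) {
      return lines.slice(start, i).join("\n").trim();
    }
  }
  return start === -1 ? null : lines.slice(start).join("\n").trim();
}

const doc = [
  "# API",
  "## Install",
  "Run the installer.",
  "## Authentication",
  "Pass a token in the header.",
  "### Example",
  "Authorization: Bearer <token>",
  "## Errors",
  "See the error table.",
].join("\n");
console.log(findSection(doc, "authentication"));
```

Run against the sample, this prints the four lines from `## Authentication` through `Authorization: Bearer <token>`, since the `### Example` subsection belongs to the matched section.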

Troubleshooting

Problem                 Solution
----------------------  -----------------------------------------
Content empty/missing   Site uses JS rendering → use BrowserAgent
Blocked by site         Add User-Agent, add delay, use proxy
Timeout                 Increase timeout, check URL validity
Garbled text            Check charset, try text mode
Login required          Use BrowserAgent with session cookies
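For the garbled-text case, the usual cause is decoding the body as UTF-8 when the server declared another charset. A minimal sketch of honoring the `Content-Type` header follows. `decodeBody` is a hypothetical helper, and it assumes a Node.js build with full ICU so that `TextDecoder` recognizes legacy encodings.

```javascript
// Sketch: decode a response body using the charset named in Content-Type,
// falling back to UTF-8 when none is given or the label is unknown.
function decodeBody(bytes, contentType = "") {
  const match = /charset=["']?([^;"'\s]+)/i.exec(contentType);
  const label = match ? match[1] : "utf-8";
  try {
    return new TextDecoder(label).decode(bytes);
  } catch {
    // TextDecoder throws a RangeError for labels it does not know.
    return new TextDecoder("utf-8").decode(bytes);
  }
}

const latin1 = Uint8Array.from([0x63, 0x61, 0x66, 0xe9]); // "café" in ISO-8859-1
console.log(decodeBody(latin1, "text/html; charset=ISO-8859-1")); // prints "café"
console.log(decodeBody(latin1, "text/html")); // garbled: the last byte is not valid UTF-8
```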

Related Skills

  • BrowserAgent - For interactive/JS-heavy sites
  • web_search - For finding URLs before fetching
  • coding-agent - For processing extracted data

Security Notes

⚠️ Important:

  • Respect robots.txt
  • Don't scrape personal data
  • Honor copyright/terms of service
  • Add delays between requests (2-5s)
  • Don't overload servers
  • Use official APIs when available
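As an illustration of the first point, a simplified robots.txt check is sketched below. `isAllowed` is a hypothetical helper. It handles `User-agent` groups and plain path prefixes in `Allow`/`Disallow`, with the longest matching rule winning, but it does not handle `*` or `$` wildcards inside paths, so a maintained parser is the safer choice for real use.

```javascript
// Simplified robots.txt check (no wildcard support inside paths).
function isAllowed(robotsTxt, path, userAgent = "*") {
  const groups = []; // each entry: { agents: [...], rules: [{ allow, prefix }] }
  let current = null;
  let lastWasAgent = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      if (value) current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === "allow" || field === "disallow") && current) {
      // An empty Disallow value means "nothing is disallowed", so no rule is stored.
      if (value) current.rules.push({ allow: field === "allow", prefix: value });
      lastWasAgent = false;
    }
  }
  const ua = userAgent.toLowerCase();
  // Prefer a group naming this agent; otherwise use the catch-all group.
  const group =
    groups.find((g) => g.agents.some((a) => a !== "*" && ua.includes(a))) ||
    groups.find((g) => g.agents.includes("*"));
  if (!group) return true;
  let best = { allow: true, prefix: "" };
  for (const rule of group.rules) {
    if (!path.startsWith(rule.prefix)) continue;
    const longer = rule.prefix.length > best.prefix.length;
    const tieAllow = rule.prefix.length === best.prefix.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best.allow;
}

const robots = [
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /private/press/",
].join("\n");
console.log(isAllowed(robots, "/blog/post"));          // true
console.log(isAllowed(robots, "/private/data"));       // false
console.log(isAllowed(robots, "/private/press/2024")); // true
```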

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-02 02:05

Security Scan

Tencent Cloud Security (Keen): safe, no risk
Tencent Cloud Security (Sanbu): safe, no risk
