
Data Scraper

Web page data collection and structured text extraction
mupengi-bot
Productivity · clawhub · v1.0.0 · 1 version · 100000 · Key: not required
Stars: 1 · Downloads: 2,039 · Installs: 81 · Versions: 1 · Tag: #latest

Overview

data-scraper

Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.

When to Use

  • Extract text content from web pages (articles, blogs, docs)
  • Scrape product prices, reviews, or listings
  • Monitor pages for changes (price drops, new content)
  • Batch-collect data from multiple URLs
  • Convert HTML tables to structured formats (JSON/CSV)

Quick Start

# Extract readable text from URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600

Extraction Modes

Text Mode (default)

Fetches the page and extracts its readable content, stripping HTML tags, scripts, and styles, much like a browser's reader mode.

data-scraper fetch URL
# Output: clean markdown text
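
To make the stripping step concrete, here is a minimal Python sketch of tag, script, and style removal using only the standard library. It is not the tool's actual implementation; the names `TextExtractor` and `html_to_text` are hypothetical, and it emits plain text rather than markdown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style>, and <noscript> bodies."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real reader-mode extractor would also drop navigation and boilerplate blocks; this sketch only shows the tag-stripping part.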

Selector Mode

Target specific CSS selectors for precise extraction.

data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data
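
As an illustration of what selector matching involves, the following Python sketch handles only comma-separated tag and `.class` selectors with the standard library. It is an assumption-laden simplification, not the tool's selector engine: `parse_selectors`, `SelectorExtractor`, and `extract` are hypothetical names, and combinators, IDs, attribute selectors, and void elements are not supported.

```python
from html.parser import HTMLParser


def parse_selectors(selector):
    """Split 'h2, .price' into [('tag', 'h2'), ('class', 'price')]."""
    parsed = []
    for part in selector.split(","):
        part = part.strip()
        if part.startswith("."):
            parsed.append(("class", part[1:]))
        elif part:
            parsed.append(("tag", part.lower()))
    return parsed


class SelectorExtractor(HTMLParser):
    """Collect the text of elements matching simple tag or .class selectors."""

    def __init__(self, selector):
        super().__init__(convert_charrefs=True)
        self._selectors = parse_selectors(selector)
        self._open = []    # matched elements still waiting for their end tag
        self.matches = []  # finished matches as {"selector", "text"} dicts

    def _match(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for kind, value in self._selectors:
            if kind == "tag" and tag == value:
                return value
            if kind == "class" and value in classes:
                return "." + value
        return None

    def handle_starttag(self, tag, attrs):
        # Track nested elements with the same tag so end tags pair up correctly.
        for entry in self._open:
            if entry["tag"] == tag:
                entry["depth"] += 1
        label = self._match(tag, attrs)
        if label is not None:
            self._open.append({"tag": tag, "depth": 0, "label": label, "parts": []})

    def handle_data(self, data):
        for entry in self._open:
            entry["parts"].append(data)

    def handle_endtag(self, tag):
        still_open = []
        for entry in self._open:
            if entry["tag"] != tag:
                still_open.append(entry)
            elif entry["depth"] > 0:
                entry["depth"] -= 1
                still_open.append(entry)
            else:
                # This end tag closes the matched element: emit its text.
                text = " ".join("".join(entry["parts"]).split())
                self.matches.append({"selector": entry["label"], "text": text})
        self._open = still_open


def extract(html, selector):
    parser = SelectorExtractor(selector)
    parser.feed(html)
    parser.close()
    return parser.matches
```

For full CSS selector support, a dedicated HTML library would be the more realistic choice.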

Table Mode

Extract HTML tables into structured formats.

data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)
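
The header-to-value mapping can be sketched in Python as follows. This is an illustrative approximation under stated assumptions: the first row of the table is treated as the header, nested tables are not handled, and `TableParser` and `table_to_records` are hypothetical names rather than part of the tool.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text parts of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so cells compare cleanly.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html, index=0):
    """Return table number `index` as a list of {header: value} dicts."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *rows = parser.tables[index]
    return [dict(zip(header, row)) for row in rows]
```

Passing the result to `json.dumps` would yield a JSON array of row objects like the one described above.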

Link Mode

Extract all links from a page with optional filtering.

data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
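
A rough Python equivalent of link extraction with glob filtering is sketched below. It assumes the filter pattern is matched against the URL path, which the listing does not specify; `LinkParser` and `extract_links` are hypothetical names.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, pattern="*"):
    """Return absolute URLs whose path matches the glob `pattern`."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    # Resolve relative links against the page URL before filtering.
    absolute = [urljoin(base_url, href) for href in parser.hrefs]
    return [url for url in absolute if fnmatchcase(urlsplit(url).path, pattern)]
```

Matching on the path rather than the full URL keeps a query string such as `?x=1` from defeating a `*.pdf` filter.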

Batch Scraping

# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/

urls.txt format:

https://site1.com/page
https://site2.com/page
https://site3.com/page
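
The batch loop can be approximated in Python as shown here. This is a sketch, not the tool's behavior: it assumes `--delay` is given in milliseconds, invents a numbered file naming scheme, and uses the hypothetical names `read_url_list`, `fetch`, and `batch_scrape`. The fetch and sleep functions are injectable so the loop can be exercised without network access.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def read_url_list(text):
    """Return the non-empty lines of a urls.txt-style string."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch(url, timeout=10):
    """Download a URL and return the raw response body."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def batch_scrape(urls, out_dir, delay_ms=0, fetch_fn=fetch, sleep_fn=time.sleep):
    """Fetch each URL in order, pausing delay_ms between requests.

    Bodies are written to out_dir as 0000.html, 0001.html, ...
    (an assumed naming scheme). Returns the list of written paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, url in enumerate(urls):
        if i > 0 and delay_ms > 0:
            sleep_fn(delay_ms / 1000)  # sleep_fn takes seconds
        target = out / f"{i:04d}.html"
        target.write_bytes(fetch_fn(url))
        written.append(target)
    return written
```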

Change Monitoring

# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL

Stores timestamped snapshots in data-scraper/snapshots/. Sends alerts via notification-hub when changes are detected.
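
Snapshot comparison can be sketched in Python with a content hash for cheap change detection and a line diff for reporting what changed. The snapshot storage layout and the notification-hub call are left out because the listing does not describe them; `fingerprint` and `diff_snapshots` are hypothetical names.

```python
import difflib
import hashlib


def fingerprint(text):
    """Return a SHA-256 hex digest, useful for a cheap 'did anything change?' check."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshots(old, new):
    """Return changed lines as '-removed' / '+added' strings (empty if identical)."""
    diff = list(difflib.unified_diff(old.splitlines(), new.splitlines(),
                                     lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers; '@@' lines mark hunks.
    return [line for line in diff[2:] if not line.startswith("@@")]
```

For example, comparing `"Price: $10\nIn stock"` with `"Price: $8\nIn stock"` reports only the price line as removed and re-added.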

Output Formats

| Format | Flag | Use Case |
|---|---|---|
| Text | --format text | Reading, summarization |
| JSON | --format json | Data processing |
| CSV | --format csv | Spreadsheets |
| Markdown | --format md | Documentation |
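
To show how one set of extracted records might map onto these formats, here is a small Python sketch. The `render` function is hypothetical, assumes every record has the same keys as the first, and covers only the JSON, CSV, and Markdown cases.

```python
import csv
import io
import json


def render(records, fmt):
    """Serialize a non-empty list of flat dicts as 'json', 'csv', or 'md'."""
    if fmt == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    cols = list(records[0])  # column order follows the first record
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=cols, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    if fmt == "md":
        lines = ["| " + " | ".join(cols) + " |",
                 "| " + " | ".join("---" for _ in cols) + " |"]
        lines += ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in records]
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")
```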

Headers & Auth

# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."
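
For comparison, the same three options expressed with Python's standard `urllib.request.Request` might look like the sketch below. `build_request` is a hypothetical helper, and the token, cookie, and user-agent values are placeholders.

```python
from urllib.request import Request


def build_request(url, headers=None, cookie=None, user_agent=None):
    """Build a Request carrying custom headers, a Cookie header, and a User-Agent."""
    req = Request(url)
    for name, value in (headers or {}).items():
        req.add_header(name, value)
    if cookie:
        req.add_header("Cookie", cookie)
    if user_agent:
        req.add_header("User-Agent", user_agent)
    return req


# Placeholder values; pass the result to urllib.request.urlopen to send it.
req = build_request(
    "https://example.com/private",
    headers={"Authorization": "Bearer TOKEN"},
    cookie="session=abc123",
    user_agent="Mozilla/5.0",
)
```

Credentials passed on a command line can end up in shell history, so reading tokens from an environment variable is usually the safer habit.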

Rate Limiting & Ethics

  • Default: 1 request per second per domain
  • Respects robots.txt when --polite flag is set
  • Configurable delay between requests
  • Stops on 429 (Too Many Requests) and backs off
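
The per-domain limit and the robots.txt check can be sketched in Python as follows. This is an illustration under assumptions, not the tool's code: `DomainRateLimiter` and `allowed_by_robots` are hypothetical names, the one-second interval mirrors the default stated above, and the robots.txt text is passed in directly rather than downloaded.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until a request to this URL's host is allowed, then record it."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[host] = now


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling `limiter.wait(url)` before each request keeps different domains independent, so a slow site does not throttle the rest of a batch.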

Error Handling

| Error | Behavior |
|---|---|
| 404 | Log and skip |
| 403/401 | Warn about auth requirement |
| 429 | Exponential backoff (max 3 retries) |
| Timeout | Retry once with longer timeout |
| SSL error | Warn, option to proceed with --insecure |
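
The 404 and 429 rows can be sketched in Python like this. The base delay of one second and the doubling factor are assumptions, since the listing only states "exponential backoff (max 3 retries)"; `backoff_delays` and `fetch_with_policy` are hypothetical names, and the other error rows are not covered.

```python
import time
from urllib.error import HTTPError


def backoff_delays(max_retries=3, base=1.0, factor=2.0):
    """Return the wait before each retry, e.g. [1.0, 2.0, 4.0] seconds."""
    return [base * factor ** attempt for attempt in range(max_retries)]


def fetch_with_policy(url, fetch_fn, sleep_fn=time.sleep, max_retries=3):
    """Call fetch_fn(url); skip on 404, retry on 429 with exponential backoff.

    Returns the body, or None for a 404. Any other HTTP error, or a 429
    that persists after max_retries retries, is re-raised.
    """
    # One initial attempt plus max_retries retries; None marks the final attempt.
    for delay in backoff_delays(max_retries) + [None]:
        try:
            return fetch_fn(url)
        except HTTPError as err:
            if err.code == 404:
                return None  # log-and-skip case
            if err.code == 429 and delay is not None:
                sleep_fn(delay)
                continue
            raise
```

If the server sends a `Retry-After` header with a 429, honoring it would be more polite than a fixed schedule.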

Integration

  • web-claude: Use as fallback when web_fetch isn't enough
  • competitor-watch: Feed scraped data into competitor analysis
  • seo-audit: Scrape competitor pages for SEO comparison
  • performance-tracker: Collect social metrics from public profiles

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-29 03:22 · Safe · Safe

Security Scan

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
