
Smart Web Scraper

Extract structured data from any web page. Supports CSS selectors, auto-detection of tables and lists, JSON/CSV output formats. Use when asked to scrape a we...
mariusfit
Developer Tools · clawhub · v1.0.0 · 1 version · 99898 · Key: not required
Stars: 0 · Downloads: 2,937 · Installs: 39 · Versions: 1
#automation#data#latest#scraping#web

Overview

Smart Web Scraper

Extract structured data from web pages into clean JSON or CSV.

Quick Start

# Scrape a page, extract all text content
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com"

# Extract specific elements with CSS selector
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com/products" -s ".product-card"

# Auto-detect and extract tables
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing"

# Extract all links from a page
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com"

# Extract structured data (title, meta, headings, links)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py structure "https://example.com"

# Output as JSON
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".item" -f json

# Output as CSV
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s "table tr" -f csv

# Save to file
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".product" -f json -o products.json

# Multi-page scrape (follow pagination)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py crawl "https://example.com/page/1" --pages 5 -s ".article"

Commands

| Command | Args | Description |
|---------|------|-------------|
| extract | [-s selector] [-f format] [-o file] | Extract content, optionally filtered by CSS selector |
| tables | [-f format] [-o file] | Auto-detect and extract all HTML tables |
| links | [--external] [--internal] | Extract all links (href + text) |
| structure | | Extract page structure: title, meta, headings, images, links |
| crawl | --pages N [-s selector] [-f format] [-o file] | Follow pagination links, extract from multiple pages |

Output Formats

| Format | Flag | Description |
|--------|------|-------------|
| Text | -f text | Plain text (default) |
| JSON | -f json | Structured JSON array |
| CSV | -f csv | Comma-separated values |
| Markdown | -f md | Markdown-formatted |
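
To make the difference between the formats concrete, here is a minimal standard-library sketch of how a list of extracted records might be serialized as text, JSON, or CSV. The `render` helper is hypothetical and is not taken from scripts/scraper.py; the record fields mirror the sample output shown in the Examples section. Markdown is left out because the listing does not show what that layout looks like.

```python
import csv
import io
import json


def render(records, fmt="text"):
    """Serialize a list of dicts as text, JSON, or CSV (hypothetical helper)."""
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt == "csv":
        buf = io.StringIO()
        # Header row: union of all keys, in first-seen order.
        fields = list(dict.fromkeys(k for r in records for k in r))
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    if fmt == "text":
        # Assumption: plain-text output is one record's text per line.
        return "\n".join(r.get("text", "") for r in records)
    raise ValueError(f"unknown format: {fmt}")


records = [
    {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
    {"text": "Widget Max - $49.99", "tag": "div", "class": "product"},
]
print(render(records, "csv"))
```

JSON preserves the per-record fields as-is, while CSV needs one shared header row, which is why the sketch collects the union of keys first.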

Examples

Extract product listings

uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://shop.example.com" -s ".product" -f json

Output:

[
  {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
  {"text": "Widget Max - $49.99", "tag": "div", "class": "product"}
]
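
The JSON above keeps name and price in a single `text` field, so downstream code usually has to split it. A minimal sketch of that post-processing step follows, assuming every record uses the "Name - $price" shape seen in this sample; the `parse_products` helper and its regular expression are illustrative and not part of the scraper.

```python
import json
import re
from decimal import Decimal

# Assumption: text looks like "<name> - $<price>", as in the sample output.
PRICE_RE = re.compile(r"^(?P<name>.+?)\s*-\s*\$(?P<price>\d+(?:\.\d{2})?)$")


def parse_products(raw_json):
    """Turn scraper JSON output into name/price records, skipping non-matching rows."""
    products = []
    for item in json.loads(raw_json):
        match = PRICE_RE.match(item["text"].strip())
        if match is None:
            continue  # leave out records that do not follow the assumed shape
        products.append(
            {"name": match.group("name"), "price": Decimal(match.group("price"))}
        )
    return products


sample = """[
  {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
  {"text": "Widget Max - $49.99", "tag": "div", "class": "product"}
]"""
for product in parse_products(sample):
    print(product["name"], product["price"])
```

`Decimal` is used rather than `float` so prices compare and sum exactly.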

Extract pricing table

uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing" -f csv
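
For readers curious what table detection involves, the following is a rough standard-library sketch that collects every table in an HTML string as rows of cell text. It is not the scraper's implementation (which the listing says relies on beautifulsoup4 and lxml), the `TableParser` class is hypothetical, the sample markup is made up, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


html = (
    "<table><tr><th>Plan</th><th>Price</th></tr>"
    "<tr><td>Basic</td><td>$5</td></tr></table>"
)
parser = TableParser()
parser.feed(html)
print(parser.tables)
```

Each inner list can be passed directly to `csv.writer(...).writerows(...)` to produce CSV like the command above.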

Get all external links

uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com" --external
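
The listing does not define what counts as "external". A common reading, assumed here, is any http(s) link whose host differs from the page's host. The sketch below shows that classification with `urllib.parse`; `split_links` is a hypothetical helper, not the scraper's code.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Resolve hrefs against page_url and split them into (internal, external)."""
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = split_links(
    "https://example.com/a/",
    ["/about", "b.html", "https://other.org/x", "mailto:hi@example.com"],
)
print(internal)
print(external)
```

Under this definition a subdomain such as `blog.example.com` would be treated as external; whether the tool behaves the same way is not stated.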

Rate Limiting

  • Default: 1 request per second (respectful crawling)
  • Override with --delay 0.5 (seconds between requests)
  • Respects robots.txt by default (override with --ignore-robots)
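
The behaviour described in these bullets can be sketched with the standard library's `urllib.robotparser` plus a simple delay. The `PoliteFetcher` class and its parameter names are assumptions chosen to mirror the `--delay` and `--ignore-robots` flags; the scraper's actual logic is not shown in this listing. The robots.txt rules are passed in as lines so the example runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetcher:
    """Gate requests on robots.txt rules and a minimum delay (hypothetical helper)."""

    def __init__(self, robots_lines, user_agent="*", delay=1.0, ignore_robots=False):
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        self.user_agent = user_agent
        self.delay = delay                  # seconds between requests
        self.ignore_robots = ignore_robots
        self._last = None                   # monotonic time of the previous request

    def allowed(self, url):
        """Return True if the URL may be fetched under the loaded rules."""
        return self.ignore_robots or self.robots.can_fetch(self.user_agent, url)

    def wait(self):
        """Sleep just long enough to keep `delay` seconds between requests."""
        if self._last is not None:
            remaining = self.delay - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


rules = ["User-agent: *", "Disallow: /private/"]
fetcher = PoliteFetcher(rules, delay=0.05)
print(fetcher.allowed("https://example.com/public"))
print(fetcher.allowed("https://example.com/private/report"))
```

A caller would check `allowed(url)` first and then call `wait()` immediately before each request, so the first request is never delayed.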

Notes

  • Requires beautifulsoup4 and lxml (auto-installed by uv run --with)
  • Uses a standard browser User-Agent to avoid blocks
  • Handles redirects, encoding detection, and error pages gracefully
  • No JavaScript rendering (use for static HTML pages)

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-29 01:14 · Safe · Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
