
Data Scraper

Extract data from websites and APIs for analysis. Use when the user needs to collect product prices from e-commerce sites, gather news articles, or extract structured data.
Author: dinghaibin
Source: clawhub · v1.0.0 · 1 version · Key: not required

Overview

Data Scraper

Extract structured data from websites and APIs.

Quick Start

# Basic page scrape
python scripts/scrape.py --url https://example.com --output data.json

Core Features

  • CSS/XPath selectors: Target specific elements
  • Multiple output formats: JSON, CSV, Markdown
  • Pagination support: Scrape multiple pages
  • Rate limiting: Respect server limits
  • Authentication: Handle login/sessions
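
The three output formats listed above can be illustrated with a short sketch. This is not the code in scripts/scrape.py, which is not shown here; the function name `render` and the sample record are hypothetical, and the sketch assumes every record is a flat dict with the same keys. Only the Python standard library is used.

```python
import csv
import io
import json


def render(records, fmt):
    """Serialize a list of flat dicts as JSON, CSV, or a Markdown table."""
    if fmt not in ("json", "csv", "markdown"):
        raise ValueError(f"unsupported format: {fmt}")
    if fmt == "json":
        return json.dumps(records, ensure_ascii=False, indent=2)
    if not records:
        return ""
    # Assumption: the first record's keys define the column order for all rows.
    fields = list(records[0].keys())
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buf.getvalue()
    # Markdown: header row, separator row, then one row per record.
    lines = [
        "| " + " | ".join(fields) + " |",
        "| " + " | ".join("---" for _ in fields) + " |",
    ]
    for rec in records:
        lines.append("| " + " | ".join(str(rec.get(f, "")) for f in fields) + " |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"title": "Widget", "price": "9.99"}]  # hypothetical record
    print(render(sample, "markdown"))
```

Keeping serialization separate from fetching means the same scraped records can be written in any of the three formats without scraping again.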

Usage

python scripts/scrape.py [OPTIONS]

Options:
  --url TEXT          URL to scrape (required)
  --selector TEXT     CSS selector for data extraction
  --output PATH       Output file path
  --format FORMAT     Output format: json, csv, markdown
  --limit NUM         Maximum items to scrape
  --wait SECS         Seconds to wait between requests
  --login URL         Login URL for authenticated scraping
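
The source of scripts/scrape.py is not included here, so the following is only a sketch of how a parser for the options above could be declared with `argparse`. The function name `build_parser`, the default values, and the numeric types chosen for `--limit` and `--wait` are assumptions, not the script's confirmed behavior.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a sketch only)."""
    parser = argparse.ArgumentParser(prog="scrape.py")
    parser.add_argument("--url", required=True, help="URL to scrape")
    parser.add_argument("--selector", help="CSS selector for data extraction")
    parser.add_argument("--output", help="Output file path")
    # Assumption: json is the default when --format is omitted.
    parser.add_argument("--format", choices=["json", "csv", "markdown"],
                        default="json", help="Output format")
    parser.add_argument("--limit", type=int, help="Maximum items to scrape")
    # Assumption: no delay unless --wait is given.
    parser.add_argument("--wait", type=float, default=0.0,
                        help="Seconds to wait between requests")
    parser.add_argument("--login", help="Login URL for authenticated scraping")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Restricting `--format` with `choices` makes an unsupported format fail at argument parsing rather than after the pages have been fetched.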

Examples

Product Price Collection

python scripts/scrape.py \
  --url "https://example.com/products" \
  --selector ".product" \
  --output prices.json \
  --format json

News Article Aggregation

python scripts/scrape.py \
  --url "https://news.example.com/latest" \
  --selector "article" \
  --output news.md \
  --format markdown

Configuration File

For more complex jobs, create a scrape.yaml file:

url: https://example.com/products
selectors:
  items: ".product-card"
  title: ".product-title"
  price: ".price::text"
  image: "img::attr(src)"
  link: "a::attr(href)"

pagination:
  type: click
  button: ".next-page"
  max_pages: 10

output:
  format: json
  file: products.json
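
The `::text` and `::attr(name)` suffixes in the selectors above resemble the pseudo-element extensions used by Scrapy's selector library; whether scrape.py relies on that library is not stated. As a sketch under that assumption, a selector string could be split into its CSS part and its extraction mode as follows. The helper name `split_selector` and the `"element"` mode label are hypothetical.

```python
import re

# Matches "<css>::attr(<name>)" and captures both parts.
_ATTR = re.compile(r"^(?P<css>.*)::attr\((?P<name>[\w:-]+)\)$")


def split_selector(spec):
    """Split 'css::text' or 'css::attr(name)' into (css, mode, attr_name)."""
    spec = spec.strip()
    match = _ATTR.match(spec)
    if match:
        return match.group("css"), "attr", match.group("name")
    if spec.endswith("::text"):
        return spec[: -len("::text")], "text", None
    # No suffix: treat the spec as a plain CSS selector for the element itself.
    return spec, "element", None


if __name__ == "__main__":
    for spec in (".price::text", "img::attr(src)", ".product-title"):
        print(spec, "->", split_selector(spec))
```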

Best Practices

  1. Check robots.txt before scraping
  2. Add delays between requests
  3. Cache responses for development
  4. Handle errors gracefully
  5. Store raw HTML for debugging
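
The first two practices can be sketched with the standard library alone. This is an illustration, not part of scrape.py: the helper names `allowed_urls` and `paced`, the sample robots.txt text, and the example URLs are all hypothetical.

```python
import time
import urllib.robotparser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def paced(items, wait_seconds, sleep=time.sleep):
    """Yield items, sleeping wait_seconds between consecutive ones."""
    for index, item in enumerate(items):
        if index:  # no delay before the first item
            sleep(wait_seconds)
        yield item


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"  # hypothetical rules
    candidates = [
        "https://example.com/products",
        "https://example.com/private/report",
    ]
    for url in paced(allowed_urls(robots, candidates), wait_seconds=1.0):
        print("would fetch:", url)
```

In real use, `RobotFileParser.set_url()` followed by `read()` would fetch the site's live robots.txt instead of parsing a string.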

Legal Note

Ensure you have permission to scrape the target websites. Check each site's Terms of Service and robots.txt.

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 23:41 · Safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
