Data Scraper

Extract structured data from websites and APIs.

Quick Start

# Basic page scrape
python scripts/scrape.py --url https://example.com --output data.json

Core Features

CSS/XPath selectors: Target specific elements
Multiple output formats: JSON, CSV, Markdown
Pagination support: Scrape multiple pages
Rate limiting: Respect server limits
Authentication: Handle login/sessions
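As an illustration of the output-format feature only, here is a minimal sketch of how a list of scraped records could be serialized as JSON, CSV, or a Markdown table using the standard library. The function name render and the flat-dictionary record shape are assumptions made for this example; the README does not show how scrape.py implements its formats.

```python
import csv
import io
import json


def render(records, fmt):
    """Serialize a list of flat dicts as JSON, CSV, or a Markdown table.

    Hypothetical helper for illustration; not taken from scrape.py.
    """
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)

    # Column order follows first appearance across all records.
    fields = list(dict.fromkeys(key for rec in records for key in rec))

    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells
        return buf.getvalue()

    if fmt == "markdown":
        def cell(value):
            # Escape pipes so cell text cannot break the table layout.
            return str(value).replace("|", "\\|")

        lines = [
            "| " + " | ".join(fields) + " |",
            "| " + " | ".join("---" for _ in fields) + " |",
        ]
        for rec in records:
            row = " | ".join(cell(rec.get(f, "")) for f in fields)
            lines.append("| " + row + " |")
        return "\n".join(lines) + "\n"

    raise ValueError(f"unsupported format: {fmt}")
```

Calling render(records, "csv") on a list such as [{"title": "Widget", "price": "9.99"}] yields a header row followed by one data row.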

Usage

python scripts/scrape.py [OPTIONS]

Options:
  --url TEXT          URL to scrape (required)
  --selector TEXT     CSS selector for data extraction
  --output PATH       Output file path
  --format FORMAT     Output format: json, csv, markdown
  --limit NUM         Maximum items to scrape
  --wait SECS         Seconds to wait between requests
  --login URL         Login URL for authenticated scraping
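
For readers who want to see how such an interface might be declared, the sketch below mirrors the options above with argparse. It is a hypothetical reconstruction, not the source of scrape.py: the function name build_parser, the value types, and the defaults (json for --format, 0 seconds for --wait) are all assumptions.

```python
import argparse


def build_parser():
    """Declare the documented options (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="scrape.py")
    parser.add_argument("--url", required=True, help="URL to scrape")
    parser.add_argument("--selector", help="CSS selector for data extraction")
    parser.add_argument("--output", help="Output file path")
    # Assumed default: the usage text does not say which format is the default.
    parser.add_argument("--format", choices=["json", "csv", "markdown"],
                        default="json", help="Output format")
    parser.add_argument("--limit", type=int, help="Maximum items to scrape")
    # Assumed default of 0 seconds; the usage text gives none.
    parser.add_argument("--wait", type=float, default=0.0,
                        help="Seconds to wait between requests")
    parser.add_argument("--login", help="Login URL for authenticated scraping")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Restricting --format with choices makes argparse reject unsupported values before any request is sent.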

Examples

Product Price Collection

python scripts/scrape.py \
  --url "https://example.com/products" \
  --selector ".product" \
  --output prices.json \
  --format json

News Article Aggregation

python scripts/scrape.py \
  --url "https://news.example.com/latest" \
  --selector "article" \
  --output news.md \
  --format markdown

Configuration File

For more complex jobs, describe the scrape in a scrape.yaml file:

url: https://example.com/products
selectors:
  items: ".product-card"
  title: ".product-title"
  price: ".price::text"
  image: "img::attr(src)"
  link: "a::attr(href)"

pagination:
  type: click
  button: ".next-page"
  max_pages: 10

output:
  format: json
  file: products.json
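
The ::text and ::attr(name) suffixes in the selectors above resemble the pseudo-elements used by Scrapy-style CSS selectors; the README does not say which selector library scrape.py relies on. Under that assumption, the sketch below shows one way to split such a selector into its CSS part and the kind of value to extract. The name split_selector and the returned tuple shape are hypothetical.

```python
import re

# Matches a trailing "::text" or "::attr(name)" pseudo-element.
_SUFFIX = re.compile(r"::(text|attr\(([^)]+)\))$")


def split_selector(selector):
    """Split 'css::attr(name)' into (css, kind, attribute_name).

    kind is "text", "attr", or "element" when no suffix is present.
    Hypothetical helper for illustration; not taken from scrape.py.
    """
    match = _SUFFIX.search(selector)
    if match is None:
        return selector, "element", None
    css = selector[:match.start()]
    if match.group(1) == "text":
        return css, "text", None
    return css, "attr", match.group(2)


if __name__ == "__main__":
    for sel in (".price::text", "img::attr(src)", ".product-card"):
        print(sel, "->", split_selector(sel))
```

With the configuration above, price would be read from the element's text, while image and link would be read from the src and href attributes.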

Best Practices

Check robots.txt before scraping
Add delays between requests
Cache responses for development
Handle errors gracefully
Store raw HTML for debugging
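
The first two practices can be sketched with the Python standard library. The example below checks a URL against robots.txt rules with urllib.robotparser and enforces a minimum gap between requests. The names allowed_by_robots and Throttle are hypothetical, and nothing here is taken from scrape.py.

```python
import time
import urllib.robotparser


def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, wait_seconds):
        self.wait_seconds = wait_seconds
        self._last = None  # time of the previous request, if any

    def pause(self):
        """Sleep just long enough to keep wait_seconds between calls."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.wait_seconds - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    throttle = Throttle(1.0)
    for url in ("https://example.com/products", "https://example.com/private/data"):
        if allowed_by_robots(rules, url):
            throttle.pause()
            print("would fetch", url)
        else:
            print("skipping", url)
```

To check a live site instead of an in-memory rule list, RobotFileParser also offers set_url() and read(), which download and parse the site's robots.txt.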

Legal Note

Make sure you have permission to scrape each target website: review its Terms of Service and its robots.txt before you start.

Version History

1 version in total

v1.0.0 (current), 2026-05-07 23:41

Security Scan

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
