Overview

Scrapling Web Scraping Skill

Goal

Use Scrapling to extract web data with minimal selector breakage and better anti-bot resilience.

Prefer this skill when users ask for:

website scraping
data extraction from HTML pages
Cloudflare/anti-bot resistant scraping
multi-page crawling
converting scraping tasks into reusable Python scripts

Safety and Legality

Before scraping, always:

Confirm that scraping the target matches the user's intent and is permitted by local law.
Avoid unauthorized access, login bypass, or private data scraping.
Respect target website terms and reasonable request rates.
For high-volume jobs, add delays and domain-level throttling.
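The delay and domain-level throttling advice above can be sketched with the standard library alone. This is a minimal illustration, not part of Scrapling: the DomainThrottle class, its min_delay parameter, and the 2-second default are all assumptions chosen for the example.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain (hypothetical helper)."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay      # seconds between hits on one domain (assumed default)
        self._clock = clock             # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}         # domain -> timestamp of the last request

    def wait(self, url):
        """Sleep if the domain was hit too recently; return the seconds slept."""
        domain = urlparse(url).netloc
        last = self._last_request.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[domain] = self._clock()
        return delay
```

Call throttle.wait(url) immediately before each fetch. Requests to different domains do not block each other, while repeated requests to one domain are spaced at least min_delay seconds apart.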

Default Environment (this machine)

All dependencies should live under D:\clawtest.

Recommended setup commands:

python -m venv D:\clawtest\.venv
D:\clawtest\.venv\Scripts\python -m pip install -U pip
D:\clawtest\.venv\Scripts\python -m pip install "scrapling[fetchers]"
D:\clawtest\.venv\Scripts\scrapling install

Notes:

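If you only need to parse HTML you already have, pip install scrapling is enough. In recent Scrapling releases the base package ships the parser only, so every fetcher class (including the plain Fetcher) needs the scrapling[fetchers] extra.
scrapling install downloads the browsers and related dependencies required by the browser-based fetchers (StealthyFetcher, DynamicFetcher).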

Fetcher Selection Guide

Choose the lightest option that works:

Fetcher:

Best for static pages and speed.

StealthyFetcher:

Best default when anti-bot checks likely exist.

DynamicFetcher:

Use when data is rendered by JavaScript.

Spider:

Use for multi-page crawl, queueing, concurrency, and structured export.
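The guide above can be read as a small decision function. The sketch below is illustrative only: choose_fetcher and its three flags are hypothetical names, and the precedence between the flags (crawl first, then JavaScript rendering, then anti-bot) is an assumption rather than something this skill prescribes.

```python
def choose_fetcher(needs_js=False, anti_bot_likely=False, multi_page=False):
    """Return the name of the lightest Scrapling class that fits the task.

    The flag precedence is an assumption made for this example.
    """
    if multi_page:
        return "Spider"            # crawl, queueing, concurrency, structured export
    if needs_js:
        return "DynamicFetcher"    # data is rendered by JavaScript
    if anti_bot_likely:
        return "StealthyFetcher"   # anti-bot checks likely exist
    return "Fetcher"               # static pages, fastest option
```

For example, choose_fetcher(anti_bot_likely=True) returns "StealthyFetcher", and calling it with no flags returns "Fetcher".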

Standard Workflow

Identify target fields and output schema first.
Pick a fetcher, starting with the lightest and escalating only when needed (Fetcher -> StealthyFetcher -> DynamicFetcher).
Extract with CSS/XPath and normalize into JSON-friendly fields.
Save data to JSON/JSONL/CSV.
Add retry, timeout, and polite delays for production.
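The normalize and save steps of this workflow might look like the following, using only the standard library. The helper names (normalize_item, save_jsonl, save_csv) and the title/price/url schema are assumptions borrowed from the templates in this skill, not Scrapling APIs.

```python
import csv
import json
from urllib.parse import urljoin


def normalize_item(raw, base_url):
    """Trim whitespace and resolve relative links into absolute URLs."""
    href = (raw.get("url") or "").strip()
    return {
        "title": (raw.get("title") or "").strip(),
        "price": (raw.get("price") or "").strip(),
        # Leave the field empty when no link was extracted.
        "url": urljoin(base_url, href) if href else "",
    }


def save_jsonl(items, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for item in items:
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")


def save_csv(items, path, fieldnames=("title", "price", "url")):
    """Write items as CSV with a header row."""
    with open(path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(items)
```

JSONL suits long or interrupted runs because each record is written as its own line, while CSV is convenient when the schema is flat and the output goes to a spreadsheet.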

Code Templates

1) Single Page Extraction (Stealthy default)

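from scrapling.fetchers import StealthyFetcher

# Enable adaptive element tracking for pages returned by this fetcher.
StealthyFetcher.adaptive = True
url = "https://example.com/products"
# timeout is in milliseconds; network_idle waits for network activity to settle.
page = StealthyFetcher.fetch(url, headless=True, network_idle=True, timeout=45000)

items = []
# auto_save=True stores element fingerprints so later runs can relocate the cards.
for card in page.css(".product-card", auto_save=True):
    items.append({
        "title": card.css("h2::text").get(default="").strip(),
        "price": card.css(".price::text").get(default="").strip(),
        "url": card.css("a::attr(href)").get(default="")
    })

print(items)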

2) Adaptive Re-location for changed layouts

# First run stores fingerprints:
products = page.css(".product-card", auto_save=True)

# Future run can recover after layout drift:
products = page.css(".product-card", adaptive=True)

3) Spider Crawl Skeleton

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "product_spider"
    start_urls = ["https://example.com/catalog"]

    async def parse(self, response: Response):
        for card in response.css(".product-card"):
            yield {
                "title": card.css("h2::text").get(default="").strip(),
                "price": card.css(".price::text").get(default="").strip(),
            }

        for href in response.css("a.next::attr(href)").all():
            yield response.follow(href, callback=self.parse)

if __name__ == "__main__":
    ProductSpider().start()

Expected Assistant Output Format

When executing a user task with this skill, respond with:

chosen fetcher/spider strategy and why
runnable script (or patch) tailored to target site
exact install/run commands for current machine
output path and data schema
anti-bot reliability notes and fallback plan

Practical Fallback Order

If extraction fails:

Validate selectors on fresh HTML.
Switch Fetcher -> StealthyFetcher.
Switch to DynamicFetcher for JS-rendered content.
Add adaptive selectors (auto_save=True then adaptive=True).
Add retries, backoff, and lower request rate.
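The escalation and retry steps above can be combined in one generic loop. This is a sketch under stated assumptions: fetch_with_fallback, is_ok, retries, and base_delay are hypothetical names, and the fetchers are passed in as plain callables so the helper does not depend on Scrapling itself.

```python
import time


def fetch_with_fallback(url, fetchers, is_ok, retries=2, base_delay=1.0, sleep=time.sleep):
    """Try each (name, fetch) pair in order, retrying with exponential backoff.

    Returns (name, result) for the first result accepted by is_ok.
    Raises RuntimeError when every fetcher has been exhausted.
    """
    last_error = None
    for name, fetch in fetchers:
        for attempt in range(retries + 1):
            try:
                result = fetch(url)
                if is_ok(result):
                    return name, result
                last_error = ValueError(f"{name}: result rejected by is_ok")
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                last_error = exc
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
    raise RuntimeError(f"all fetchers failed for {url}") from last_error


# Possible wiring with Scrapling (assumed, adjust to the installed version):
#   from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher
#   fetchers = [("Fetcher", Fetcher.get),
#               ("StealthyFetcher", StealthyFetcher.fetch),
#               ("DynamicFetcher", DynamicFetcher.fetch)]
#   name, page = fetch_with_fallback(url, fetchers,
#                                    is_ok=lambda p: bool(p.css(".product-card")))
```

Defining is_ok as "the target selector matched something" makes an empty page count as a failure, so the loop escalates past a fetcher that returns a block page or an unrendered shell.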

Version History

1 version in total

v1.0.0 (current)

2026-03-30 19:18
