← 返回
未分类

Web Retrieval

Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr...
Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr...
seanford seanford 来源
未分类 clawhub v1.0.0 1 版本 100000 Key: 无需
★ 0
Stars
📥 203
下载
💾 1
安装
1
版本
#latest

概述

Web Retrieval — Scrapling Expert

Local web fetching via Scrapling. Three fetchers, two scripts, one job: get the page content reliably.

Fetcher Selection Guide

Always start with the cheapest fetcher that works. Escalate only if needed.

FetcherCLI modeSpeedUse when
------------------------------------
Fetcher (curl_cffi)getFast (~1s)Static HTML, APIs, most public pages. Impersonates Chrome.
DynamicFetcherfetchMedium (~5s)JS-rendered pages, SPAs, pages that need browser execution
StealthyFetcherstealthySlow (~10s)Cloudflare, heavy anti-bot, fingerprint detection

Decision tree:

  1. Try get first — it handles 80% of pages
  2. If content is just a title or empty → escalate to fetch
  3. If blocked/Cloudflare detected → escalate to stealthy
  4. If still blocked → add --solve-cloudflare and/or --wait 3000

Fetch Script

FETCH="python3 $SKILL_DIR/scripts/fetch"

# Basic fetch (auto-escalates through modes)
$FETCH https://example.com

# Force specific mode
$FETCH https://example.com --mode stealthy

# Extract specific content with CSS selector
$FETCH https://example.com -s "article.main-content"

# Wait for JS-rendered content
$FETCH https://spa.example.com --mode fetch --wait 3000 --wait-selector ".content"

# Save to file
$FETCH https://example.com /tmp/output.md

# Plain text output
$FETCH https://example.com --text

# Raw HTML (for link extraction, parsing)
$FETCH https://example.com --html

# Cloudflare bypass
$FETCH https://protected.example.com --mode stealthy --solve-cloudflare

# Fast (no images/fonts/media)
$FETCH https://example.com --no-resources

# Network idle wait (good for dashboards)
$FETCH https://example.com --mode fetch --network-idle

Crawl Script

CRAWL="python3 $SKILL_DIR/scripts/crawl"

# Fetch a flat list of URLs from file → one .md per URL in output dir
$CRAWL --urls-file /tmp/urls.txt --output-dir /tmp/results/

# Fetch list → single JSON file
$CRAWL --urls-file /tmp/urls.txt --output-json /tmp/results.json

# Crawl a site 2 levels deep (same domain only)
$CRAWL https://docs.example.com --depth 2 --output-dir /tmp/docs/

# Spider with checkpoint (resume if interrupted)
$CRAWL https://large-site.com --depth 3 --checkpoint-dir /tmp/checkpoint/

# Crawl with URL filter (only pages matching pattern)
$CRAWL https://docs.openclaw.ai --depth 2 --allowed-pattern "/docs/" --output-dir /tmp/

# Use stealth mode for crawl
$CRAWL --urls-file /tmp/urls.txt --mode stealthy --output-json /tmp/out.json

Python API (for sub-agents / scripts)

from scrapling import Fetcher, StealthyFetcher, DynamicFetcher

# Static fetch with browser impersonation
page = Fetcher().get("https://example.com", stealthy_headers=True)
text = page.get_all_text(ignore_tags=("script", "style"))
links = [a.attrib.get("href") for a in page.css("a[href]")]
title = page.css_first("h1").text

# Dynamic (JS-rendered)
async with DynamicFetcher() as f:
    page = await f.async_fetch("https://spa.example.com")

# Stealthy
page = StealthyFetcher().fetch("https://cloudflare-site.com", wait=2000)

# CSS selector extraction
results = page.css("div.article-body p")  # returns list of elements
first = page.css_first("h1").text

# Response properties
page.status  # HTTP status
page.url     # final URL (after redirects)
page.html    # raw HTML string
page.find("div", {"class": "content"})  # BeautifulSoup-style

Scrapling Spider (site crawl with full control)

from scrapling.spiders import Spider, Request
from scrapling import Fetcher

class DocsCrawler(Spider):
    start_urls = ["https://docs.example.com"]

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    async def parse(self, response):
        # Yield scraped data
        yield {
            "url": response.url,
            "title": response.css_first("h1").text if response.css_first("h1") else "",
            "body": response.get_all_text(ignore_tags=("script", "style")),
        }
        # Follow links
        for link in response.css("a[href]"):
            href = link.attrib.get("href", "")
            if href.startswith("/") or "docs.example.com" in href:
                yield Request(response.urljoin(href), callback=self.parse)

# Run (checkpoint-enabled)
spider = DocsCrawler(crawldir="/tmp/crawl-checkpoint/")
result = spider.start()
print(f"Scraped {result.stats.items_scraped} pages")
items = list(result.items)

Key Scrapling CSS/Response Methods

MethodDescription
---------------------
page.css("selector")All matching elements
page.css_first("selector")First match or None
el.textText content of element
el.attrib["href"]Attribute value
page.get_all_text()Full page text (strips scripts/styles)
page.htmlRaw HTML
page.find(tag, attrs)BeautifulSoup-style find
page.urljoin(href)Resolve relative URL
response.urlFinal URL after redirects
response.statusHTTP status code

Output Formats

All CLI commands support three output formats via file extension:

  • .md — Markdown (default, best for LLM consumption)
  • .txt — Plain text
  • .html — Raw HTML (use for link extraction or further parsing)

Tips

  • JS pages with lazy loading: use --wait 2000 + --network-idle
  • Dynamic content: --wait-selector ".target-class" waits until element appears
  • Rate limiting: add --wait 1000 between requests in crawl mode
  • Cloudflare 403/503: --mode stealthy --solve-cloudflare
  • Missing content: try --mode fetch --network-idle before escalating to stealthy
  • CSS selectors: use for targeted extraction to reduce noise in output
  • Deep research crawls: use --checkpoint-dir so crawl survives interruption

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-05-29 21:38 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

knowledge-management

Skilled OpenClaw Advisor

seanford
查询本地 OpenClaw 文档索引,获取关于配置、功能、CLI 命令、渠道、提供商、插件、计划任务、会话、代理等方面的准确答案。
★ 0 📥 880
data-analysis

Data Analysis

ivangdavila
{"answer":"数据分析与可视化。查询数据库、生成报告、自动化电子表格,将原始数据转化为清晰可行的见解。适用于:(1) 您……"}
★ 211 📥 69,806
data-analysis

Tavily 搜索

jacky1n7
通过 Tavily API 进行网页搜索(Brave 替代方案)。当用户要求搜索网页、查找来源或链接,且 Brave 网页搜索不可用时使用。
★ 274 📥 100,815