Use this skill when the user wants to:
.md files

| Mode | Fetcher Class | Best For |
|---|---|---|
| http (default) | Fetcher | Fast static pages, RSS, APIs |
| async | AsyncFetcher | Batch of 5+ static URLs in parallel |
| stealth | StealthyFetcher | Anti-bot sites, Cloudflare, fingerprint checks |
| dynamic | PlayWrightFetcher | Heavy SPAs, React/Vue/Angular apps |

Decision rule: Start with http. If you get a 403 / CAPTCHA / empty body, switch
to stealth. If the content is rendered client-side (empty on first load), use dynamic.
Use async when scraping many static URLs at once to save time.
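
The decision rule can be read as an escalation ladder. The sketch below is one possible encoding of it, not part of the skill itself: the function names, the CAPTCHA marker strings, and the choice to treat async as sitting on the same rung as http are all assumptions made for illustration.

```python
from typing import List, Optional

# Escalation order taken from the decision rule: http first, then stealth, then dynamic.
ESCALATION = ["http", "stealth", "dynamic"]

# Hypothetical markers; real CAPTCHA pages vary widely.
CAPTCHA_HINTS = ("captcha", "verify you are human")


def looks_blocked(status: int, body: str) -> bool:
    """Return True for the failure signals named in the rule: 403, CAPTCHA, empty body."""
    text = body.strip().lower()
    if status == 403 or not text:
        return True
    return any(hint in text for hint in CAPTCHA_HINTS)


def pick_initial_mode(urls: List[str]) -> str:
    """Use async for batches of 5+ static URLs, otherwise start with http."""
    return "async" if len(urls) >= 5 else "http"


def next_mode(current: str, status: int, body: str) -> Optional[str]:
    """Return the next mode to try, or None if the result looks fine or nothing is left."""
    if not looks_blocked(status, body):
        return None
    # Assumption: async also targets static pages, so it escalates from the http rung.
    rung = "http" if current == "async" else current
    idx = ESCALATION.index(rung)
    return ESCALATION[idx + 1] if idx + 1 < len(ESCALATION) else None
```

A caller would start from `pick_initial_mode(urls)` and re-run the script with `--mode` set to whatever `next_mode` returns until it yields `None`.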
- --url URL: one target URL (repeat the flag for multiple: --url A --url B)
- --url-file FILE: plain text file with one URL per line
- --mode http|async|stealth|dynamic: fetcher backend (default: http)
- --selector CSS: CSS selector for the main content area (omit = full page)
- --preserve-links: keep hyperlinks in the Markdown output
- --output-dir DIR: save per-page .md files and a master index.json here
- --auto-save: fingerprint and persist selected elements to the local DB on first run
- --auto-match: on subsequent runs, find elements by fingerprint even if the site layout has changed (you do NOT need to update the CSS selector)

- --headless true|false|virtual: headless mode; virtual uses Xvfb (default: true)
- --network-idle: wait until no network activity for ≥500 ms before capturing
- --block-images: block image loading (saves bandwidth and proxy quota)
- --disable-resources: drop fonts/images/media/stylesheets for ~25% faster loads
- --wait-selector CSS: pause until this element appears in the DOM
- --wait-selector-state attached|visible|detached|hidden: element state (default: attached)
- --timeout MS: global timeout in ms (default: 30000)
- --wait MS: extra idle wait after page load in ms
- --humanize SECONDS: simulate human-like cursor movement (max duration in seconds)
- --geoip: spoof browser timezone, locale, language, and WebRTC IP from proxy geolocation
- --block-webrtc: prevent real-IP leaks via WebRTC
- --disable-ads: install uBlock Origin in the browser session
- --proxy URL: HTTP/SOCKS proxy as a URL string, or JSON: '{"server":"host:port","username":"u","password":"p"}'
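
The JSON form of --proxy is easy to get wrong when quoting by hand. The following sketch builds it programmatically with the three keys shown above. The helper name `proxy_argument` is hypothetical, and letting the credentials be omitted for an unauthenticated proxy is an assumption the flag description does not confirm.

```python
import json
import shlex


def proxy_argument(server: str, username: str = "", password: str = "") -> str:
    """Build the JSON form of --proxy from the keys shown in the flag description."""
    config = {"server": server}
    # Assumption: credentials can be left out for an unauthenticated proxy.
    if username:
        config["username"] = username
        config["password"] = password
    # Compact separators keep the value short; key order follows insertion order.
    return json.dumps(config, separators=(",", ":"))


# Placeholder values, not real credentials or hosts.
value = proxy_argument("host:port", "u", "p")
argv = ["--mode", "stealth", "--proxy", value, "--geoip", "--block-webrtc"]

# shlex.join quotes the JSON so it survives a POSIX shell as one argument.
print(shlex.join(argv))
```

Passing `argv` as a list to a process launcher avoids shell quoting altogether; the `shlex.join` call only matters when the command is pasted into a terminal.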
- --retry N: retry failed requests up to N times with exponential backoff (max 30 s)

- Only scrape http:// or https:// pages.
- If the output would include headers, footers, or cookie banners, use --selector to target the content area.
- When --auto-save is used, always also pass --selector so Scrapling knows which element fingerprint to record.
- On subsequent runs, use --auto-match instead of --auto-save. Do not use both flags at once.
- Use --mode async for batch jobs with 5+ static URLs to fetch them in parallel.
- Combine --disable-resources with --block-images in stealth/dynamic mode when you only need text content; this can cut load times by up to 40%.
- Check the top-level ok field and per-result ok fields before using content.
- If ok is false, report the exact error string; do not invent or guess content.
- If --network-idle is insufficient, use --wait-selector for a specific DOM element to guarantee the content has loaded before capture.
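
Several of these rules can be checked mechanically before the script is invoked. The sketch below is a hypothetical pre-flight helper, not something the skill ships. It treats the http:// or https:// rule as a restriction on URL schemes, which is an interpretation of the text above.

```python
from typing import List, Optional
from urllib.parse import urlparse


def check_rules(urls: List[str], selector: Optional[str] = None,
                auto_save: bool = False, auto_match: bool = False) -> List[str]:
    """Return a list of rule violations; an empty list means the call looks safe."""
    problems = []
    for url in urls:
        # Assumed reading of the rule: only http and https targets are allowed.
        if urlparse(url).scheme not in ("http", "https"):
            problems.append(f"unsupported URL scheme: {url}")
    if auto_save and auto_match:
        problems.append("--auto-save and --auto-match must not be combined")
    if auto_save and not selector:
        problems.append("--auto-save needs --selector")
    return problems
```

Returning every violation at once, rather than raising on the first, lets the caller report all problems in a single message.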
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --proxy "http://user:pass@host:port" \
  --humanize 2.0 \
  --geoip \
  --block-webrtc \
  --network-idle

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode dynamic \
  --wait-selector ".product-list" \
  --network-idle \
  --disable-resources

python3 "{baseDir}/scrape_to_markdown.py" \
  --mode async \
  --url "<URL1>" --url "<URL2>" --url "<URL3>"

python3 "{baseDir}/scrape_to_markdown.py" \
  --url-file urls.txt \
  --mode stealth \
  --disable-resources \
  --output-dir outputs

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-save \
  --output-dir outputs

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-match \
  --output-dir outputs

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --selector "main article" \
  --auto-match \
  --preserve-links \
  --network-idle \
  --disable-resources \
  --timeout 60000 \
  --retry 3 \
  --output-dir outputs

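
When the script is driven from Python rather than typed by hand, the same commands can be assembled as an argument list. This is a minimal sketch under stated assumptions: `build_command` is a hypothetical helper, `/path/to/skill` stands in for {baseDir}, and only flags that appear in the commands above are used.

```python
import shlex
from typing import List, Optional


def build_command(base_dir: str, urls: List[str], mode: str = "http",
                  selector: Optional[str] = None,
                  output_dir: Optional[str] = None) -> List[str]:
    """Assemble an argv list using only flags that appear in the commands above."""
    cmd = ["python3", f"{base_dir}/scrape_to_markdown.py"]
    for url in urls:
        cmd += ["--url", url]  # --url is repeated once per target
    if mode != "http":
        cmd += ["--mode", mode]  # http is the default, so it can be left out
    if selector:
        cmd += ["--selector", selector]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd


# "/path/to/skill" stands in for {baseDir}; the URLs are placeholders.
cmd = build_command("/path/to/skill",
                    ["https://example.com/a", "https://example.com/b"],
                    mode="async", output_dir="outputs")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`, which sidesteps shell quoting of selectors and URLs.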
JSON is printed to stdout. Always check ok before using content.
Top-level fields:
- ok: true only if every URL succeeded
- total / succeeded / failed: count summary
- results: array of per-URL result objects
- output_index_file: path to saved index.json (if --output-dir used)

Per-URL result fields (when ok: true):
- url: the requested URL
- status: HTTP status code (e.g. 200)
- title: page title text
- markdown: extracted content as Markdown (use this as the main content)
- markdown_length: character count (useful for quality checks)
- output_markdown_file: path to saved .md file (if --output-dir used)

On failure (ok: false in a result):
- error: exact error message; report this verbatim, do not invent content
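
The check-ok-first rule might look like this when consuming the JSON. The sketch separates usable pages from verbatim error strings. The helper name `split_results` and the sample payload are hypothetical: the payload is shaped after the fields listed above, and its values are made up for illustration.

```python
import json
from typing import Dict, List, Tuple


def split_results(stdout_text: str) -> Tuple[List[Dict], List[str]]:
    """Return (successful results, verbatim error strings) from the script's JSON."""
    payload = json.loads(stdout_text)
    pages, errors = [], []
    for result in payload.get("results", []):
        if result.get("ok"):
            pages.append(result)
        else:
            # Pass the error through unchanged instead of paraphrasing it.
            errors.append(result.get("error", ""))
    return pages, errors


# Hypothetical payload shaped after the documented fields; values are made up.
sample = json.dumps({
    "ok": False, "total": 2, "succeeded": 1, "failed": 1,
    "results": [
        {"ok": True, "url": "https://example.com/a", "status": 200,
         "title": "A", "markdown": "# A\n\nBody", "markdown_length": 9},
        {"ok": False, "error": "timeout"},
    ],
})

pages, errors = split_results(sample)
for page in pages:
    print(page["url"], page["markdown_length"])
for message in errors:
    print("failed:", message)
```

Because the top-level ok is true only when every URL succeeded, iterating over the per-result ok fields still recovers the pages that did load in a partially failed batch.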