Web Markdown Scraper

Use this skill when the user wants to:

Scrape one or more public webpages (static or JavaScript-rendered)
Convert HTML pages into clean Markdown
Extract article/body text for summarization, analysis, or indexing
Bypass anti-bot protections (Cloudflare, Datadome, etc.) via stealth mode
Scrape many URLs concurrently (async mode)
Track page elements reliably across website redesigns (automatch)
Save the extracted results as .md files

Fetcher Mode Selection Guide

Mode	Fetcher Class	Best For
------	--------------	----------
`http` (default)	`Fetcher`	Fast static pages, RSS, APIs
`async`	`AsyncFetcher`	Batch of 5+ static URLs in parallel
`stealth`	`StealthyFetcher`	Anti-bot sites, Cloudflare, fingerprint checks
`dynamic`	`PlayWrightFetcher`	Heavy SPAs, React/Vue/Angular apps

Decision rule: Start with http. If you get a 403, a CAPTCHA, or an empty body, switch to stealth. If the content is rendered client-side (empty on first load), use dynamic. Use async when scraping many static URLs at once to save time.
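The decision rule above can be sketched as a small helper. This is only an illustration: the function name and its parameters are hypothetical and describe what was observed on a first attempt in the default http mode; they are not part of the script.

```python
from typing import Optional


def choose_mode(url_count: int,
                status: Optional[int] = None,
                body: str = "",
                saw_captcha: bool = False,
                client_rendered: bool = False) -> str:
    """Pick a --mode value following the decision rule (hypothetical helper)."""
    # Content rendered client-side needs a real browser.
    if client_rendered:
        return "dynamic"
    # A 403, a CAPTCHA page, or an empty body suggests anti-bot blocking.
    if status == 403 or saw_captcha or (status is not None and not body.strip()):
        return "stealth"
    # Batches of 5+ static URLs benefit from parallel fetching.
    if url_count >= 5:
        return "async"
    return "http"
```

The caller decides whether an empty first load means client-side rendering (dynamic) or blocking (stealth); the sketch does not try to detect that on its own.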

Inputs

URL sources

--url URL — one target URL (repeat flag for multiple: --url A --url B)
--url-file FILE — plain text file with one URL per line
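When preparing input for both URL sources, something like the following sketch may help. It assumes blank lines and exact duplicates should be dropped, which the script itself may or may not do; the function name is hypothetical.

```python
from typing import Iterable, List


def collect_urls(flag_urls: Iterable[str], file_text: str = "") -> List[str]:
    """Merge --url values with the lines of a --url-file, keeping order."""
    seen = set()
    merged = []
    candidates = list(flag_urls) + file_text.splitlines()
    for raw in candidates:
        url = raw.strip()
        # Skip blank lines and exact duplicates (an assumption, see above).
        if not url or url in seen:
            continue
        seen.add(url)
        merged.append(url)
    return merged
```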

Fetcher

--mode http|async|stealth|dynamic — fetcher backend (default: http)

Content extraction

--selector CSS — CSS selector for the main content area (omit = full page)
--preserve-links — keep hyperlinks in the Markdown output
--output-dir DIR — save per-page .md files and a master index.json here

AutoMatch — production resilience

--auto-save — fingerprint & persist selected elements to the local DB on first run
--auto-match — on subsequent runs, find elements by fingerprint even if the site layout has changed (you do not need to update the CSS selector)

Browser options (stealth / dynamic only)

--headless true|false|virtual — headless mode; virtual uses Xvfb (default: true)
--network-idle — wait until no network activity for ≥500 ms before capturing
--block-images — block image loading (saves bandwidth and proxy quota)
--disable-resources — drop fonts/images/media/stylesheets for ~25% faster loads
--wait-selector CSS — pause until this element appears in the DOM
--wait-selector-state attached|visible|detached|hidden — element state (default: attached)
--timeout MS — global timeout in ms (default: 30000)
--wait MS — extra idle wait after page load in ms

StealthyFetcher extras (stealth mode only)

--humanize SECONDS — simulate human-like cursor movement (max duration in seconds)
--geoip — spoof browser timezone, locale, language, and WebRTC IP from proxy geolocation
--block-webrtc — prevent real-IP leaks via WebRTC
--disable-ads — install uBlock Origin in the browser session
--proxy URL — HTTP/SOCKS proxy as a URL string, or as a JSON object: '{"server":"host:port","username":"u","password":"p"}'
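The two accepted --proxy forms could be told apart roughly as follows. This is a sketch under the assumption that a value starting with a brace is the JSON form and that the JSON object must carry a server key; the script's actual parsing is not shown in this document.

```python
import json
from typing import Dict, Union


def parse_proxy(value: str) -> Union[str, Dict[str, str]]:
    """Return a proxy URL string unchanged, or decode the JSON object form."""
    text = value.strip()
    if text.startswith("{"):
        proxy = json.loads(text)
        # Requiring "server" is an assumption based on the example above.
        if not isinstance(proxy, dict) or "server" not in proxy:
            raise ValueError("proxy JSON must be an object with a 'server' key")
        return proxy
    return text
```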

Reliability

--retry N — retry failed requests up to N times with exponential backoff (max 30 s)
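Exponential backoff with a 30 s ceiling can be illustrated like this. The 1 s base delay and the doubling factor are assumptions, and the sketch reads "max 30 s" as a cap on each individual delay; the script may choose differently.

```python
from typing import List


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delay in seconds before each retry: base * 2**attempt, capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]
```

With these assumed defaults, --retry 3 would wait 1 s, 2 s, and 4 s before the successive retries.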

Rules

Only process public http:// or https:// pages.
Never bypass login walls, CAPTCHAs, paywalls, or access controls.
Prefer the main article or body content. Avoid polluting the output with navigation, headers, footers, or cookie banners; use --selector to target the content area.
When using --auto-save, always pass --selector as well so Scrapling knows which element to fingerprint.
On subsequent runs for pages whose layout has changed, use --auto-match instead of --auto-save. Do not use both flags at once.

Use --mode async for batch jobs with 5+ static URLs for parallel execution.
Combine --disable-resources with --block-images in stealth/dynamic mode when you only need text content; this can cut load times by up to 40%.

Always inspect the top-level ok field and per-result ok fields before using content.
If ok is false, report the exact error string — do not invent or guess content.
When --network-idle is insufficient, use --wait-selector for a specific DOM element to guarantee the content has loaded before capture.
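A few of the rules above are mechanical enough to check before invoking the script. The following is a minimal sketch with a hypothetical function name; it only checks the URL scheme and the two AutoMatch flag rules, and it cannot tell whether a page is actually public.

```python
from typing import List, Optional
from urllib.parse import urlparse


def check_rules(urls: List[str],
                selector: Optional[str] = None,
                auto_save: bool = False,
                auto_match: bool = False) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    for url in urls:
        parsed = urlparse(url)
        # Only http:// and https:// targets are allowed.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"not an http(s) URL: {url}")
    if auto_save and auto_match:
        problems.append("--auto-save and --auto-match must not be combined")
    if auto_save and not selector:
        problems.append("--auto-save requires --selector")
    return problems
```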

Command Patterns

Basic static page

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"

Static page — target specific content area

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"

Stealth mode — bypass anti-bot protection

python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle

Stealth + proxy + human fingerprint (maximum stealth)

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --proxy "http://user:pass@host:port" \
  --humanize 2.0 \
  --geoip \
  --block-webrtc \
  --network-idle

Dynamic SPA page (Playwright Chromium)

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode dynamic \
  --wait-selector ".product-list" \
  --network-idle \
  --disable-resources

Async concurrent batch (multiple URLs)

python3 "{baseDir}/scrape_to_markdown.py" \
  --mode async \
  --url "<URL1>" --url "<URL2>" --url "<URL3>"

Batch from file + stealth + save to disk

python3 "{baseDir}/scrape_to_markdown.py" \
  --url-file urls.txt \
  --mode stealth \
  --disable-resources \
  --output-dir outputs

First-run automatch setup (save fingerprint)

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-save \
  --output-dir outputs

Subsequent run after site layout change (adaptive match)

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-match \
  --output-dir outputs

Full production scrape

python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --selector "main article" \
  --auto-match \
  --preserve-links \
  --network-idle \
  --disable-resources \
  --timeout 60000 \
  --retry 3 \
  --output-dir outputs

Output Handling

JSON is printed to stdout. Always check ok before using content.

Top-level fields:

ok — true only if every URL succeeded
total / succeeded / failed — count summary
results — array of per-URL result objects
output_index_file — path to saved index.json (if --output-dir used)

Per-URL result fields (when ok: true):

url — the requested URL
status — HTTP status code (e.g. 200)
title — page <title> text
markdown — extracted content as Markdown (use this as the main content)
markdown_length — character count (useful for quality checks)
output_markdown_file — path to saved .md file (if --output-dir used)

On failure (ok: false in a result):

error — exact error message; report this verbatim, do not invent content

Version history: v1.0.0 (current), 2026-03-29 14:59
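The checks described under Output Handling might look like this when consuming the script's stdout. The function name is hypothetical, and the sketch assumes failed results also carry a url field, which the field list does not state explicitly.

```python
import json
from typing import Dict, List, Tuple


def split_results(stdout_text: str) -> Tuple[List[Dict], List[str]]:
    """Separate usable results from verbatim error strings."""
    payload = json.loads(stdout_text)
    usable, errors = [], []
    for result in payload.get("results", []):
        if result.get("ok"):
            usable.append(result)
        else:
            # Report the error exactly as given; never substitute content.
            url = result.get("url", "<unknown url>")
            errors.append(f'{url}: {result.get("error")}')
    return usable, errors
```

Only the markdown field of the usable results should be treated as page content; the error strings are passed on unchanged.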

Scrapling Web Extractor

Overview