> Step in. Go back. Scrape the dead. Automate the living.
Three-layer web intelligence stack. Pick the right layer — or combine all three.
| Layer | Tool | Job |
|---|---|---|
| Live | Scrapling | Extract content from any live URL |
| Historical | Wayback Machine CDX API | Travel back in time to any archived snapshot |
| Interactive | playwright-cli | Drive a real browser — login, click, scroll, fill forms |
Need web data?
│
├─ Historical / "what did it look like before"?
│ └─ Wayback CDX API → scrape snapshot via Scrapling or web_fetch
│
├─ Need to click / log in / fill forms first?
│ └─ playwright-cli → authenticate → hand off to Scrapling
│
└─ Just current content?
    ├─ Static / simple → scrapling extract get
    ├─ JS-heavy / React → scrapling extract fetch --network-idle
    └─ Heavily protected sites → scrapling extract stealthy-fetch --solve-cloudflare
# 1. Static sites, blogs, docs
scrapling extract get "https://example.com" output.md
# 2. JS-heavy / React / Next.js / dynamic content
scrapling extract fetch "https://example.com" output.md --network-idle --wait 3000
# 3. Cloudflare / rendering-protected
scrapling extract stealthy-fetch "https://example.com" output.md --solve-cloudflare
# Target specific content with a CSS selector
scrapling extract fetch "https://example.com" output.md --css-selector "main article"
scrapling extract get "https://example.com" output.md --css-selector ".pricing-table"
Rules:
- Use .md output for readable text, .html only for structure parsing
- Use --css-selector to avoid giant HTML blobs

See references/scrapling.md for full CLI flags, spider framework, and Python API.
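The three modes above run from cheapest to most expensive, so one way to apply them is to escalate only when a cheaper mode returns nothing. The sketch below assumes that an empty output file means the mode failed. That is a simplification, since a block page can still produce non-empty output. `scrape_with_fallback` is a hypothetical helper name, not part of Scrapling.

```shell
#!/usr/bin/env bash
# Sketch: try the cheapest Scrapling mode first, escalate only on empty output.
# scrape_with_fallback is a hypothetical helper, not a Scrapling command.

scrape_with_fallback() {
  local url="$1" out="$2"

  # 1. Plain HTTP request: fastest, enough for static pages.
  scrapling extract get "$url" "$out" && [ -s "$out" ] && return 0

  # 2. Browser rendering for JS-heavy pages.
  scrapling extract fetch "$url" "$out" --network-idle && [ -s "$out" ] && return 0

  # 3. Stealth mode for protected sites.
  scrapling extract stealthy-fetch "$url" "$out" --solve-cloudflare && [ -s "$out" ] && return 0

  echo "all three modes failed for $url" >&2
  return 1
}
```

Called as `scrape_with_fallback "https://example.com" output.md`, it stops at the first mode that writes a non-empty file, so static pages never pay the cost of launching a browser.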
# List up to 20 captures that returned HTTP 200
curl -s "https://web.archive.org/cdx/search/cdx?url=example.com&output=json&fl=timestamp,statuscode&filter=statuscode:200&limit=20"
# One capture per year (collapse on the first 4 digits of the timestamp)
curl -s "https://web.archive.org/cdx/search/cdx?url=example.com&output=json&collapse=timestamp:4&fl=timestamp,statuscode&filter=statuscode:200"
# Scrapling for clean extraction:
scrapling extract get "https://web.archive.org/web/20230601000000/https://example.com/" archive.md
# Or read via web_fetch:
# web_fetch: https://web.archive.org/web/20230601000000/https://example.com/
# Check for the closest available snapshot
curl -s "https://archive.org/wayback/available?url=example.com" | python3 -m json.tool
See references/wayback.md for full CDX API reference and ia CLI usage.
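The CDX queries return timestamps, while the scrape step needs full snapshot URLs. A minimal sketch of that glue follows, assuming the query uses `fl=timestamp,statuscode` as in the commands above, so that timestamps are the only quoted 14-digit values in the response. `wayback_url` and `cdx_timestamps` are hypothetical helper names.

```shell
# Sketch: turn CDX JSON output into replayable snapshot URLs.
# wayback_url and cdx_timestamps are hypothetical helpers.

# Build the replay URL for one snapshot: /web/<14-digit timestamp>/<original URL>
wayback_url() {
  printf 'https://web.archive.org/web/%s/%s\n' "$1" "$2"
}

# Read CDX JSON on stdin and print each 14-digit timestamp on its own line.
# The header row and status codes are skipped because they never match 14 digits.
cdx_timestamps() {
  grep -oE '"[0-9]{14}"' | tr -d '"'
}
```

Piping a CDX response through `cdx_timestamps | while read -r ts; do wayback_url "$ts" "https://example.com/"; done` prints one snapshot URL per capture, ready to pass to `scrapling extract get`.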
Use when the page requires login, clicking, or dynamic interaction before content is accessible.
# Open browser
playwright-cli open https://app.example.com
# Snapshot to get element refs
playwright-cli snapshot
# Interact
playwright-cli click e12
playwright-cli fill e5 "username@example.com"
playwright-cli press Tab
playwright-cli fill e6 "password"
playwright-cli press Enter
# Capture state
playwright-cli screenshot
playwright-cli eval "document.title"
# Close
playwright-cli close
# 1. playwright-cli open → log in → navigate to target
# 2. playwright-cli screenshot # verify you're authenticated
# 3. scrapling extract get <url> output.md # scrape the target (Scrapling runs as a separate client, so it only sees the login if the session cookies are passed to it)
# 1. Scrape current state
scrapling extract get "https://competitor.com/pricing" current.md
# 2. Find yearly snapshots
curl -s "https://web.archive.org/cdx/search/cdx?url=competitor.com/pricing&output=json&collapse=timestamp:4&fl=timestamp&filter=statuscode:200"
# 3. Scrape archived version from any year
scrapling extract get "https://web.archive.org/web/20230101000000/https://competitor.com/pricing" archive.md
# 4. Diff
diff archive.md current.md
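A raw diff of two Markdown scrapes can be long, so a line count is a useful first check on whether anything changed at all. The sketch below counts added and removed lines from the normal `diff` output. `diff_summary` is a hypothetical helper name.

```shell
# Sketch: summarise how much changed between two scraped files.
# diff_summary is a hypothetical helper.

diff_summary() {
  local old="$1" new="$2" added removed
  # In normal diff output, ">" marks lines only in the new file and "<" lines
  # only in the old one. grep -c exits non-zero on a zero count, hence "|| true".
  added="$(diff "$old" "$new" | grep -c '^>' || true)"
  removed="$(diff "$old" "$new" | grep -c '^<' || true)"
  echo "added=$added removed=$removed"
}
```

Running `diff_summary archive.md current.md` prints a single line such as `added=N removed=M`. A changed line counts once on each side.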
# playwright handles auth → Scrapling does the bulk lift
playwright-cli open https://example.com/login
playwright-cli fill e5 "your@email.com"
playwright-cli fill e6 "password"
playwright-cli press Enter
playwright-cli screenshot # verify you're in
scrapling extract get "https://example.com/members/content" output.md # needs the session cookies to see member content
This skill opens real browser sessions and can scrape login-protected pages. A few things to understand before using it:
When you need a full CLI harness for any desktop or web application:
# Install once in Claude Code
/plugin marketplace add HKUDS/CLI-Anything
/plugin install cli-anything
# Build a complete CLI for any software (7-phase pipeline)
/cli-anything:cli-anything ./target-app
# Iteratively refine
/cli-anything:refine ./target-app "focus on data export workflows"