Overview

THE_TIME_MASHEEN

> Step in. Go back. Scrape the dead. Automate the living.

Three-layer web intelligence stack. Pick the right layer — or combine all three.

Layer	Tool	Job
-------	------	-----
Live	Scrapling	Extract content from any live URL
Historical	Wayback Machine CDX API	Travel back in time to any archived snapshot
Interactive	playwright-cli	Drive a real browser — login, click, scroll, fill forms

Decision Tree

Need web data?
│
├─ Historical / "what did it look like before"?
│    └─ Wayback CDX API → scrape snapshot via Scrapling or web_fetch
│
├─ Need to click / log in / fill forms first?
│    └─ playwright-cli → authenticate → hand off to Scrapling
│
└─ Just current content?
     ├─ Static / simple         → scrapling extract get
     ├─ JS-heavy / React        → scrapling extract fetch --network-idle
     └─ Heavily protected sites → scrapling extract stealthy-fetch --solve-cloudflare
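The tree above can be read as a small routing function. The following is a minimal Python sketch under that reading; the function name `pick_layer` and its boolean parameters are hypothetical, and the returned strings only echo the tool names and commands used in this document.

```python
def pick_layer(historical=False, interactive=False, js_heavy=False, protected=False):
    """Return the tool chain suggested by the decision tree (illustrative only)."""
    if historical:
        # Historical questions go to the Wayback CDX API first, then a scrape of the snapshot.
        return ["wayback-cdx", "scrapling extract get"]
    if interactive:
        # Login / click / form flows start in a real browser, then hand off.
        return ["playwright-cli", "scrapling extract get"]
    if protected:
        # Protection outranks JS-heaviness: the stealthy tier also renders the page.
        return ["scrapling extract stealthy-fetch --solve-cloudflare"]
    if js_heavy:
        return ["scrapling extract fetch --network-idle"]
    return ["scrapling extract get"]
```

The branch order mirrors the tree: historical first, then interactive, then the three live tiers.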

Layer 1 — Live Scraping (Scrapling)

Escalation path (always start at the top)

# 1. Static sites, blogs, docs
scrapling extract get "https://example.com" output.md

# 2. JS-heavy / React / Next.js / dynamic content
scrapling extract fetch "https://example.com" output.md --network-idle --wait 3000

# 3. Cloudflare / heavily protected sites
scrapling extract stealthy-fetch "https://example.com" output.md --solve-cloudflare
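Starting at the top and escalating only on failure can be automated. The sketch below is one possible wrapper, not part of Scrapling itself: the names `TIERS`, `build_command`, and `escalate` are hypothetical, and "success" is assumed to mean exit code 0 plus a non-empty output file, which is a heuristic rather than anything the CLI guarantees.

```python
import os
import subprocess

# The three tiers from the escalation path, cheapest first.
TIERS = [
    ["get"],
    ["fetch", "--network-idle", "--wait", "3000"],
    ["stealthy-fetch", "--solve-cloudflare"],
]

def build_command(tier, url, out_path):
    """Build the CLI call for one tier: subcommand, URL, output file, then flags."""
    return ["scrapling", "extract", tier[0], url, out_path, *tier[1:]]

def escalate(url, out_path, run=subprocess.run):
    """Try each tier in order; stop at the first that exits 0 and writes a non-empty file."""
    for tier in TIERS:
        result = run(build_command(tier, url, out_path))
        wrote_output = os.path.exists(out_path) and os.path.getsize(out_path) > 0
        if result.returncode == 0 and wrote_output:
            return tier[0]  # name of the tier that worked
    return None  # every tier failed
```

The `run` parameter exists so the loop can be exercised with a stub instead of the real CLI.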

Extract specific sections (saves tokens)

scrapling extract fetch "https://example.com" output.md --css-selector "main article"
scrapling extract get "https://example.com" output.md --css-selector ".pricing-table"

Rules:

Always clean up temp files after reading
Use .md output for readable text, .html only for structure parsing
Use --css-selector to avoid giant HTML blobs
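The first two rules (clean up temp files, choose the extension deliberately) can be wrapped in two small helpers. This is a minimal sketch using only the Python standard library; the helper names `temp_output_path` and `read_then_delete` are hypothetical.

```python
import os
import tempfile

def temp_output_path(suffix=".md"):
    """Reserve a temp file path; pass suffix=".html" only when structure parsing is needed."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the scraper writes the file itself
    return path

def read_then_delete(path):
    """Read a scraped output file and remove it afterwards, even if reading fails."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    finally:
        if os.path.exists(path):
            os.remove(path)
```

A typical flow would be: reserve a path, pass it as the output file argument of the scrape command, then call `read_then_delete` on it.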

See references/scrapling.md for full CLI flags, spider framework, and Python API.

Layer 2 — Time Travel (Wayback Machine)

Find all snapshots of a URL

curl -s "https://web.archive.org/cdx/search/cdx?url=example.com&output=json&fl=timestamp,statuscode&filter=statuscode:200&limit=20"

One snapshot per year (change tracking)

curl -s "https://web.archive.org/cdx/search/cdx?url=example.com&output=json&collapse=timestamp:4&fl=timestamp,statuscode&filter=statuscode:200"
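With `output=json`, the CDX API returns an array of rows whose first row holds the field names requested via `fl`. The sketch below shows one way to turn that text into dictionaries; the function names `parse_cdx` and `years_seen` are hypothetical.

```python
import json

def parse_cdx(json_text):
    """Turn CDX `output=json` text into a list of dicts keyed by the header row."""
    if not json_text.strip():
        return []  # an empty body means no captures matched
    rows = json.loads(json_text)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]

def years_seen(captures):
    """List the distinct years present, using the first 4 digits of each timestamp."""
    return sorted({c["timestamp"][:4] for c in captures})
```

Piping the curl output into `parse_cdx` gives records such as `{"timestamp": ..., "statuscode": ...}` that can be fed into a snapshot URL.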

Scrape a specific point in time

# Scrapling for clean extraction:
scrapling extract get "https://web.archive.org/web/20230601000000/https://example.com/" archive.md

# Or read via web_fetch:
# web_fetch: https://web.archive.org/web/20230601000000/https://example.com/
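The snapshot URLs above follow the pattern `https://web.archive.org/web/<timestamp>/<original URL>`. A small builder makes that explicit; the function name `snapshot_url` and the 4 to 14 digit check are assumptions of this sketch, not rules taken from the Wayback documentation.

```python
def snapshot_url(timestamp, original_url):
    """Build a Wayback replay URL of the form /web/<timestamp>/<original URL>."""
    # Assumed sanity check: digits only, from YYYY up to YYYYMMDDhhmmss.
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4-14 digits (YYYY up to YYYYMMDDhhmmss)")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"
```

The returned URL can be passed to `scrapling extract get` exactly like a live URL.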

Check if a URL has ever been archived

curl -s "https://archive.org/wayback/available?url=example.com" | python3 -m json.tool
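The availability endpoint answers with a JSON object whose `archived_snapshots.closest` entry is present only when a capture exists. The sketch below reads that response; the function name `closest_snapshot` is hypothetical.

```python
import json

def closest_snapshot(json_text):
    """Return (timestamp, url) of the closest capture, or None if nothing is archived."""
    payload = json.loads(json_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]
```

A `None` result is the signal to skip the Wayback layer for that URL.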

See references/wayback.md for full CDX API reference and ia CLI usage.

Layer 3 — Interactive Automation (playwright-cli)

Use when the page requires login, clicking, or dynamic interaction before content is accessible.

# Open browser
playwright-cli open https://app.example.com

# Snapshot to get element refs
playwright-cli snapshot

# Interact
playwright-cli click e12
playwright-cli fill e5 "username@example.com"
playwright-cli press Tab
playwright-cli fill e6 "password"
playwright-cli press Enter

# Capture state
playwright-cli screenshot
playwright-cli eval "document.title"

# Close
playwright-cli close

Handoff pattern — authenticate then bulk scrape

# 1. playwright-cli open → log in → navigate to target
# 2. playwright-cli screenshot  # verify you're authenticated
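# 3. scrapling extract get <url> output.md  # scrape while the session is active; Scrapling runs in a separate process, so pass the session cookies explicitly if the page requires login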
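Carrying an authenticated session from the browser to a separate scraping process usually comes down to cookies. The sketch below assumes the session has been exported as Playwright storage-state JSON (a `cookies` array of objects with `name`, `value`, and `domain`); how that export is produced with playwright-cli is not covered by this document, and the helper name `cookie_string` is hypothetical. It only shows how such a file could be reduced to a `name=value` string for one target host.

```python
import json

def cookie_string(storage_state_text, host):
    """Join the cookies that apply to `host` into a 'name=value; name=value' string."""
    state = json.loads(storage_state_text)
    pairs = []
    for cookie in state.get("cookies", []):
        cookie_domain = cookie.get("domain", "").lstrip(".")
        # Keep exact-host cookies and parent-domain cookies
        # (e.g. a cookie for example.com also applies to app.example.com).
        if host == cookie_domain or host.endswith("." + cookie_domain):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)
```

The resulting string could then be supplied through whatever cookie option the scraper exposes; see references/scrapling.md for the exact flag rather than relying on this sketch.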

Combining All Three

Live vs. archived comparison (price changes, content drift, competitive intel)

# 1. Scrape current state
scrapling extract get "https://competitor.com/pricing" current.md

# 2. Find yearly snapshots
curl -s "https://web.archive.org/cdx/search/cdx?url=competitor.com/pricing&output=json&collapse=timestamp:4&fl=timestamp&filter=statuscode:200"

# 3. Scrape archived version from any year
scrapling extract get "https://web.archive.org/web/20230101000000/https://competitor.com/pricing" archive.md

# 4. Diff
diff archive.md current.md
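When the comparison needs to happen inside a script rather than in the shell, Python's standard `difflib` can stand in for `diff`. This is a minimal sketch; the function names `markdown_diff` and `changed_lines` are hypothetical.

```python
import difflib

def markdown_diff(old_text, new_text, old_name="archive.md", new_name="current.md"):
    """Return a unified diff (as one string) between the archived and current text."""
    lines = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm="",
    )
    return "\n".join(lines)

def changed_lines(old_text, new_text):
    """Return only removed ('- ') and added ('+ ') lines, without context or hint lines."""
    delta = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in delta if line.startswith(("+ ", "- "))]
```

`markdown_diff` returns an empty string when the two texts are identical, which makes "no change" easy to detect in a monitoring loop.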

Login-gated site — full extraction

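# playwright-cli handles auth → Scrapling does the bulk lift
playwright-cli open https://example.com/login
playwright-cli snapshot  # get element refs; e5 and e6 below are examples
playwright-cli fill e5 "your@email.com"
playwright-cli fill e6 "password"
playwright-cli press Enter
playwright-cli screenshot  # verify you're in
# Scrapling runs in a separate process; pass the session cookies if the page requires them
scrapling extract get "https://example.com/members/content" output.md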

Security

This skill opens real browser sessions and can scrape login-protected pages. A few things to understand before using it:

You control the browser. playwright-cli drives a browser on your machine. It navigates to URLs you specify and interacts with elements you tell it to.
All data stays local. Any session state used during automation exists only on your machine and is used only for the scraping task you initiate.
Use only on sites you have access to. This skill is designed for legitimate web research — competitive intelligence, content monitoring, archival work, and accessing sites you have an account on. It is not a tool for unauthorized access.
Review commands before running. As with any automation tool, understand what you're running before you run it.

CLI-Anything — make any software agent-native

When you need a full CLI harness for any desktop or web application:

# Install once in Claude Code
/plugin marketplace add HKUDS/CLI-Anything
/plugin install cli-anything

# Build a complete CLI for any software (7-phase pipeline)
/cli-anything:cli-anything ./target-app

# Iteratively refine
/cli-anything:refine ./target-app "focus on data export workflows"

Version History

1 version in total

v1.0.2 (current)

2026-03-19 13:02 Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
