Overview

Sitemap Content Scraper

Use this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.

Workflow

Ask for the website or URL scope if it is not already provided.
Run python3 {baseDir}/scripts/discover_sitemaps.py with the website or URL as its argument.
Summarize the discovered sitemap inventory in plain language.
If the user gave a scoped URL (for example https://example.com/docs), use scope_hint_substring from the discovery output as the default filter guidance.
Ask which content family the user wants, such as documentation, knowledge base, blog, academy, changelog, or another category.
Map the user request to the most relevant sitemap by name and sample URL patterns.
If multiple sitemaps still match, ask the user to choose one or give a tighter scope.
Ask for the destination folder if it is missing.
Run python3 {baseDir}/scripts/scrape_sitemap.py with --sitemap-url set to the chosen sitemap and --output-dir set to the destination folder. When a scoped URL was provided, also pass --include-substring with the scope hint unless the user overrides the scope.
Report what was scraped, where it was saved, and any skipped or failed pages.
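
To make the discovery step more concrete, here is a minimal sketch of how a sitemap document can be told apart from a sitemap index and reduced to a list of locations. It assumes the standard sitemaps.org XML namespace; `parse_sitemap` is a hypothetical helper written for illustration and is not taken from discover_sitemaps.py, whose internals are not shown here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, locations) for a sitemap document.

    kind is "index" for a <sitemapindex> (locations are child sitemaps)
    and "urlset" for a <urlset> (locations are page URLs).
    """
    root = ET.fromstring(xml_text)
    # Tags look like "{namespace}sitemapindex"; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        kind, path = "index", "sm:sitemap/sm:loc"
    elif local_name == "urlset":
        kind, path = "urlset", "sm:url/sm:loc"
    else:
        raise ValueError(f"not a sitemap document: <{local_name}>")
    locations = [
        el.text.strip() for el in root.findall(path, SITEMAP_NS) if el.text
    ]
    return kind, locations
```

An index result would typically be summarized to the user as the sitemap inventory, while a urlset result supplies the sample URLs used to describe each sitemap.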

Quick Commands

Discover sitemap inventory:

python3 {baseDir}/scripts/discover_sitemaps.py https://example.com

Discover and preserve scope hint from a direct URL prompt:

python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs

Scrape one sitemap into a chosen folder:

python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/docs-sitemap.xml \
  --output-dir /tmp/example-docs

Filter to a subset of URLs when the sitemap mixes sections:

python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/sitemap.xml \
  --output-dir /tmp/example-docs \
  --include-substring /docs/ \
  --exclude-substring /tag/
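
The effect of the two filter flags can be pictured with a short sketch. It assumes plain substring matching, that a URL must match at least one include substring when any are given, and that an exclude match always wins; these semantics and the `filter_urls` helper are illustrative assumptions, not the confirmed behavior of scrape_sitemap.py.

```python
def filter_urls(urls, include_substrings=(), exclude_substrings=()):
    """Keep URLs that match an include filter and no exclude filter."""
    kept = []
    for url in urls:
        # With no include filters, every URL is a candidate.
        if include_substrings and not any(s in url for s in include_substrings):
            continue
        # An exclude match removes the URL even if it was included.
        if any(s in url for s in exclude_substrings):
            continue
        kept.append(url)
    return kept


urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/tag/api",
    "https://example.com/blog/hello",
]
docs_only = filter_urls(urls, ("/docs/",), ("/tag/",))
```

Under these assumptions, `docs_only` keeps only the first URL: the tag page is excluded and the blog post never matches the include filter.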

Selection Rules

Prefer sitemaps explicitly named for the requested content family, such as docs-sitemap.xml, post-sitemap.xml, kb-sitemap.xml, or academy-sitemap.xml.
Use the sample URLs returned by discover_sitemaps.py to explain why a sitemap looks like docs, blog, help center, or another category.
If the request is broad, offer the discovered choices instead of scraping everything by default.
If no sitemap exists, stop and ask whether the user wants a bounded crawl workflow instead. Do not silently switch strategies.
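
One way to apply these rules is to score each discovered sitemap by its file name first and its sample URLs second. The sketch below is only an illustration: the keyword table, the weights, and the `rank_sitemaps` helper are all hypothetical, and the input shape (sitemap URL mapped to sample URLs) is an assumption about what discovery output could be reduced to.

```python
# Hypothetical keywords per content family; extend as needed.
FAMILY_KEYWORDS = {
    "documentation": ("docs", "documentation"),
    "blog": ("blog", "post"),
    "knowledge base": ("kb", "help", "knowledge"),
    "academy": ("academy",),
}


def rank_sitemaps(family, sitemaps):
    """Rank sitemap URLs for a content family, best match first.

    sitemaps maps each sitemap URL to a list of sample page URLs.
    Sitemaps with no evidence for the family are left out.
    """
    keywords = FAMILY_KEYWORDS.get(family, (family,))
    scored = []
    for sitemap_url, samples in sitemaps.items():
        name = sitemap_url.rsplit("/", 1)[-1].lower()
        # An explicit name such as docs-sitemap.xml is the strongest signal.
        score = 2.0 if any(k in name for k in keywords) else 0.0
        if samples:
            hits = sum(any(k in s.lower() for k in keywords) for s in samples)
            score += hits / len(samples)
        if score > 0:
            scored.append((score, sitemap_url))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]
```

If the returned list is empty or holds several close candidates, the rules above still apply: present the choices to the user instead of picking one silently.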

Output Contract

Save one Markdown file per scraped page.
Save manifest.json at the output root with success and failure details.
Keep source URLs in the Markdown header so the corpus remains traceable.
Preserve a stable folder structure derived from the source URL path.

Read {baseDir}/references/sitemap-selection.md when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.
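
As a rough illustration of the output contract, the sketch below derives a stable relative path from a source URL and prepends a header that records that URL. The exact layout and header format used by the bundled scraper are not specified here, so the front-matter style, the `index.md` fallback, and both helper names are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def output_path_for(url):
    """Map a page URL to a relative Markdown path based on its URL path."""
    parts = urlsplit(url)
    # Drop empty and dot segments so the path cannot escape the output root.
    segments = [s for s in parts.path.split("/") if s and s not in (".", "..")]
    if not segments:
        segments = ["index"]
    last = segments[-1]
    # Treat /guide.html and /guide as the same page name.
    for suffix in (".html", ".htm"):
        if last.lower().endswith(suffix):
            last = last[: -len(suffix)] or "index"
            break
    segments[-1] = last + ".md"
    return PurePosixPath(*segments)


def render_page(url, title, body_markdown):
    """Prefix the page body with a header that keeps the source URL."""
    header = f"---\nsource_url: {url}\ntitle: {title}\n---\n\n"
    return header + body_markdown
```

Because the path depends only on the URL, re-running a scrape writes each page to the same location, which keeps the corpus easy to diff and trace.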

Trigger Examples

"Scrape example.com/docs content into ./out/docs."
"Pull the help center pages from https://example.com/help."
"Find blog sitemaps for example.com and scrape only posts."

Guardrails

Scrape only public content.
Accept only http and https targets.
Reject localhost, private IP ranges, and internal-only hostnames.
Enforce public-only targets using both hostname resolution checks and redirect-target checks at request time.
Respect the chosen sitemap scope instead of broad site crawling.
Avoid login flows, private dashboards, carts, checkout paths, or user-specific pages.
Do not use authentication headers, cookies, or tokens.
Ask before writing outside the intended working area.
Tell the user when extraction quality looks weak on JavaScript-heavy pages. The bundled scraper is HTML-first and may miss client-rendered content.
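
The public-only guardrail can be sketched with the standard library. This is a minimal illustration, not the bundled implementation: `check_public_target` and its `resolve` parameter are hypothetical, and the check relies on `ipaddress`'s `is_global` property to reject loopback, private, and link-local addresses.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _resolve_host(host):
    """Resolve a hostname to a sorted list of IP address strings."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def check_public_target(url, resolve=None):
    """Raise ValueError unless url is http(s) and resolves only to public IPs.

    resolve is an optional callable (host -> list of IP strings), which
    makes the check testable without network access.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no hostname")
    addresses = (resolve or _resolve_host)(host)
    if not addresses:
        raise ValueError(f"{host} did not resolve to any address")
    for address in addresses:
        # is_global is False for loopback, private, and link-local ranges.
        if not ipaddress.ip_address(address).is_global:
            raise ValueError(f"{host} resolves to non-public address {address}")
    return addresses
```

To cover redirect targets as well, the same check would be run again on every redirect location before following it, instead of trusting only the first URL.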

Version History

1 version in total

v1.0.2 (current)

2026-05-03 09:40 Safe

Security Check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
