Use this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.
Run python3 {baseDir}/scripts/discover_sitemaps.py on the target site to list its sitemaps.
When the user provides a scoped URL (for example https://example.com/docs), use scope_hint_substring from the discovery output as the default filter guidance.
Run python3 {baseDir}/scripts/scrape_sitemap.py with --sitemap-url and --output-dir, and when a scoped URL was provided add --include-substring unless the user overrides the scope.

Discover sitemap inventory:
python3 {baseDir}/scripts/discover_sitemaps.py https://example.com
Discover and preserve scope hint from a direct URL prompt:
python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs
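To make the idea of a scope hint concrete, here is a minimal sketch of how a path-based hint could be derived from a scoped URL. The function name scope_hint_from_url is hypothetical, and the real logic and output format of discover_sitemaps.py are not shown in this document, so treat this as an illustration only.

```python
from typing import Optional
from urllib.parse import urlparse


def scope_hint_from_url(url: str) -> Optional[str]:
    """Return a path substring such as '/docs/' or None for a bare site URL.

    Hypothetical helper: the actual discovery script may derive its
    scope_hint_substring differently.
    """
    path = urlparse(url).path.strip("/")
    if not path:
        # A bare site URL carries no scope, so there is nothing to filter on.
        return None
    return f"/{path}/"


print(scope_hint_from_url("https://example.com/docs"))  # /docs/
print(scope_hint_from_url("https://example.com"))       # None
```

A hint in this shape could be passed straight to --include-substring when the user has not asked for a different scope.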
Scrape one sitemap into a chosen folder:
python3 {baseDir}/scripts/scrape_sitemap.py \
--sitemap-url https://example.com/docs-sitemap.xml \
--output-dir /tmp/example-docs
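For orientation, the sketch below shows the core step any sitemap-driven scrape relies on: pulling page URLs out of the loc elements of a sitemap document. It follows the public sitemaps.org XML format; the function name extract_locs and the sample document are assumptions for illustration, not the internals of scrape_sitemap.py.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(sitemap_xml: str) -> List[str]:
    """Return every <loc> value from a urlset or sitemapindex document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample sitemap used only for this demonstration.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/docs/install</loc></url>"
    "<url><loc>https://example.com/docs/usage</loc></url>"
    "</urlset>"
)
print(extract_locs(sample))
# ['https://example.com/docs/install', 'https://example.com/docs/usage']
```

Because the URL list comes from the sitemap itself, the size of the job is known before any page is fetched.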
Filter to a subset of URLs when the sitemap mixes sections:
python3 {baseDir}/scripts/scrape_sitemap.py \
--sitemap-url https://example.com/sitemap.xml \
--output-dir /tmp/example-docs \
--include-substring /docs/ \
--exclude-substring /tag/
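The include and exclude flags above can be read as a simple substring filter over the sitemap's URLs. The sketch below shows one plausible interpretation, assuming a URL is kept when it contains the include substring and does not contain the exclude substring. The function filter_urls is hypothetical and the script's exact matching rules may differ.

```python
from typing import Iterable, List, Optional


def filter_urls(
    urls: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep URLs that contain `include` (if given) and lack `exclude` (if given)."""
    kept = []
    for url in urls:
        if include and include not in url:
            continue  # outside the requested section
        if exclude and exclude in url:
            continue  # explicitly excluded
        kept.append(url)
    return kept


urls = [
    "https://example.com/docs/install",
    "https://example.com/docs/tag/setup",
    "https://example.com/blog/launch",
]
print(filter_urls(urls, include="/docs/", exclude="/tag/"))
# ['https://example.com/docs/install']
```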
Sitemap names often indicate the section they cover, for example docs-sitemap.xml, post-sitemap.xml, kb-sitemap.xml, or academy-sitemap.xml.
Use the output of discover_sitemaps.py to explain why a sitemap looks like docs, blog, help center, or another category.
Check manifest.json at the output root for success and failure details.
Read {baseDir}/references/sitemap-selection.md when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.
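As an illustration of how a sitemap filename can suggest a category and a reason for it, here is a minimal keyword-based sketch. The keyword table, the category labels, and the function classify_sitemap are all assumptions made for this example. The actual heuristics in discover_sitemaps.py and sitemap-selection.md are not reproduced here.

```python
import re
from typing import Tuple
from urllib.parse import urlparse

# Assumed keyword-to-category mapping, for illustration only.
KEYWORD_CATEGORIES = {
    "docs": "docs",
    "post": "blog",
    "blog": "blog",
    "kb": "help center",
    "help": "help center",
}


def classify_sitemap(sitemap_url: str) -> Tuple[str, str]:
    """Return (category, reason) based on tokens in the sitemap filename."""
    filename = urlparse(sitemap_url).path.rsplit("/", 1)[-1].lower()
    # Split on common separators so 'kb-sitemap.xml' yields 'kb', 'sitemap', 'xml'.
    for token in re.split(r"[-_.]", filename):
        if token in KEYWORD_CATEGORIES:
            return KEYWORD_CATEGORIES[token], f"filename contains '{token}'"
    return "other", "no known keyword in filename"


print(classify_sitemap("https://example.com/docs-sitemap.xml"))
# ('docs', "filename contains 'docs'")
print(classify_sitemap("https://example.com/academy-sitemap.xml"))
# ('other', 'no known keyword in filename')
```

Returning the reason alongside the category keeps the choice explainable, so the user can correct a wrong guess before any scraping starts.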
Example requests:
"... example.com/docs content into ./out/docs."
"... https://example.com/help."
"... example.com and scrape only posts."

Target restrictions:
Accept only http and https targets.
Reject localhost, private IP ranges, and internal-only hostnames.
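The target restrictions (http and https only; no localhost, private IP ranges, or internal-only hostnames) can be checked before any request is made. The following is a minimal sketch using only the standard library. The function is_allowed_target and its treatment of single-label, .local, and .internal hostnames as internal-only are assumptions, not the skill's actual validation code.

```python
import ipaddress
from urllib.parse import urlparse


def is_allowed_target(url: str) -> bool:
    """Return True only for http/https URLs that point at a public-looking host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    if not host or host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Assumption: single-label names and .local/.internal
        # suffixes are treated as internal-only hostnames.
        return "." in host and not host.endswith((".local", ".internal"))
    # IP literal: reject loopback, private, link-local, and reserved ranges.
    return not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved)


print(is_allowed_target("https://example.com/docs"))  # True
print(is_allowed_target("http://localhost:8000"))     # False
print(is_allowed_target("http://192.168.1.10/"))      # False
```

This check inspects only the URL text. A hostname that resolves to a private address would pass it, so a stricter implementation would also verify the resolved IP.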