
Sitemap Content Scraper

Discover website sitemaps from robots.txt and common sitemap locations, choose the right sitemap or content family such as docs, blog, help center, academy,...
Author: quareth · Source: clawhub · Version: v1.0.2 (latest) · API key: not required

Overview

Use this skill to turn a public website into a sitemap-driven scraping job. Prefer the existing sitemap structure over ad hoc crawling so the scrape stays bounded, explainable, and easy for the user to steer.

Workflow

  1. Ask for the website or URL scope if it is not already provided.
  2. Run python3 {baseDir}/scripts/discover_sitemaps.py with the website or URL as its argument.
  3. Summarize the discovered sitemap inventory in plain language.
  4. If the user gave a scoped URL (for example https://example.com/docs), use the scope_hint_substring value from the discovery output as the default filter.
  5. Ask which content family the user wants, such as documentation, knowledge base, blog, academy, changelog, or another category.
  6. Map the user request to the most relevant sitemap by name and sample URL patterns.
  7. If multiple sitemaps still match, ask the user to choose one or give a tighter scope.
  8. Ask for the destination folder if it is missing.
  9. Run python3 {baseDir}/scripts/scrape_sitemap.py with --sitemap-url set to the chosen sitemap and --output-dir set to the destination folder. When a scoped URL was provided, also pass --include-substring with the scope hint unless the user overrides the scope.
  10. Report what was scraped, where it was saved, and any skipped or failed pages.
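
The discovery step in the workflow above can be pictured with a small sketch. It is not the bundled discover_sitemaps.py; the function names and the list of fallback paths are assumptions made for illustration. It only shows how `Sitemap:` lines are read from a robots.txt body and how common fallback locations might be built.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical fallback locations; the bundled script may probe different ones.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def sitemaps_from_robots(robots_txt):
    """Return the URLs named by 'Sitemap:' lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "field: value" on the first colon.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def fallback_candidates(site_url):
    """Build common sitemap locations at the site root."""
    parts = urlsplit(site_url)
    root = f"{parts.scheme}://{parts.netloc}"
    return [urljoin(root, path) for path in COMMON_SITEMAP_PATHS]
```

Fetching robots.txt and each candidate is left out on purpose, since any real request should first pass the public-target checks listed under Guardrails.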

Quick Commands

Discover sitemap inventory:

python3 {baseDir}/scripts/discover_sitemaps.py https://example.com

Discover and preserve scope hint from a direct URL prompt:

python3 {baseDir}/scripts/discover_sitemaps.py https://example.com/docs
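
As a rough illustration of what a scope hint could look like, the sketch below derives a path substring from a scoped URL. How discover_sitemaps.py actually computes scope_hint_substring is not documented here, so the function name and the rule (use the URL path without a trailing slash) are assumptions.

```python
from typing import Optional
from urllib.parse import urlsplit


def scope_hint(url: str) -> Optional[str]:
    """Return a path substring such as '/docs' for a scoped URL, else None."""
    # Assumption: the hint is simply the URL path with trailing slashes removed.
    path = urlsplit(url).path.rstrip("/")
    return path or None
```

Under this rule a bare domain yields no hint, so no default filter would be suggested.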

Scrape one sitemap into a chosen folder:

python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/docs-sitemap.xml \
  --output-dir /tmp/example-docs

Filter to a subset of URLs when the sitemap mixes sections:

python3 {baseDir}/scripts/scrape_sitemap.py \
  --sitemap-url https://example.com/sitemap.xml \
  --output-dir /tmp/example-docs \
  --include-substring /docs/ \
  --exclude-substring /tag/
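
The effect of the two filter flags can be sketched in a few lines of Python. This is an illustration of substring filtering over sitemap `<loc>` entries, not the scraper's own code; the function name and its single-string parameters are assumptions.

```python
import xml.etree.ElementTree as ET


def filter_sitemap_urls(sitemap_xml, include=None, exclude=None):
    """Return <loc> URLs that contain `include` and do not contain `exclude`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Compare the local tag name so the sitemap XML namespace is ignored.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    kept = []
    for url in urls:
        if include is not None and include not in url:
            continue
        if exclude is not None and exclude in url:
            continue
        kept.append(url)
    return kept
```

With `include="/docs/"` and `exclude="/tag/"`, a URL such as https://example.com/docs/tag/api would be dropped while https://example.com/docs/intro would be kept.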

Selection Rules

  • Prefer sitemaps explicitly named for the requested content family, such as docs-sitemap.xml, post-sitemap.xml, kb-sitemap.xml, or academy-sitemap.xml.
  • Use the sample URLs returned by discover_sitemaps.py to explain why a sitemap looks like docs, blog, help center, or another category.
  • If the request is broad, offer the discovered choices instead of scraping everything by default.
  • If no sitemap exists, stop and ask whether the user wants a bounded crawl workflow instead. Do not silently switch strategies.
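
One way to picture these selection rules is a simple scoring pass over the discovered inventory. The sketch below is hypothetical: the keyword table, the weights, and the inventory shape (sitemap URL mapped to sample page URLs) are assumptions, not the contents of the skill's reference file.

```python
# Hypothetical keyword table; the skill's reference file may use other terms.
FAMILY_KEYWORDS = {
    "docs": ("docs", "documentation"),
    "blog": ("blog", "post"),
    "help": ("help", "kb", "support"),
    "academy": ("academy", "learn"),
}


def rank_sitemaps(family, inventory):
    """Order sitemap URLs by how strongly they match a content family.

    `inventory` maps each sitemap URL to a list of sample page URLs.
    Sitemaps with no match at all are left out of the result.
    """
    keywords = FAMILY_KEYWORDS[family]
    scored = []
    for sitemap_url, samples in inventory.items():
        name = sitemap_url.rsplit("/", 1)[-1].lower()
        # A family keyword in the file name is the strongest signal.
        name_score = 2 if any(word in name for word in keywords) else 0
        hits = sum(
            any(f"/{word}" in sample.lower() for word in keywords)
            for sample in samples
        )
        sample_score = hits / len(samples) if samples else 0
        score = name_score + sample_score
        if score > 0:
            scored.append((score, sitemap_url))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]
```

An empty result corresponds to the case where the user should be asked how to proceed, and more than one result corresponds to offering the choices rather than scraping everything.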

Output Contract

  • Save one Markdown file per scraped page.
  • Save manifest.json at the output root with success and failure details.
  • Keep source URLs in the Markdown header so the corpus remains traceable.
  • Preserve a stable folder structure derived from the source URL path.
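
The contract above can be sketched as follows. The manifest field names (`succeeded`, `failed`, `file`, `error`), the `source_url` header key, and the path-mapping rule are assumptions for illustration; the real manifest.json schema and file layout are defined by the bundled scraper.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def markdown_path(url):
    """Map a page URL to a relative .md path that mirrors the URL path."""
    segments = [
        s for s in urlsplit(url).path.split("/") if s and s not in (".", "..")
    ]
    if not segments:
        return PurePosixPath("index.md")
    # Append ".md" rather than replacing a suffix, so "v1.2" stays distinct.
    return PurePosixPath(*segments[:-1], segments[-1] + ".md")


def markdown_with_header(url, body):
    """Prefix the page body with a header that records the source URL."""
    return f"---\nsource_url: {url}\n---\n\n{body}\n"


def build_manifest(results):
    """Serialize (url, error_or_None) pairs into a manifest JSON string."""
    manifest = {"succeeded": [], "failed": []}
    for url, error in results:
        if error is None:
            entry = {"url": url, "file": str(markdown_path(url))}
            manifest["succeeded"].append(entry)
        else:
            manifest["failed"].append({"url": url, "error": error})
    return json.dumps(manifest, indent=2)
```

Deriving the path purely from the URL keeps repeated runs writing to the same locations, which is the property the last bullet asks for.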

Read {baseDir}/references/sitemap-selection.md when mapping user intent to sitemap candidates, handling ambiguous sitemap names, or explaining the output layout.

Trigger Examples

  • "Scrape example.com/docs content into ./out/docs."
  • "Pull the help center pages from https://example.com/help."
  • "Find blog sitemaps for example.com and scrape only posts."

Guardrails

  • Scrape only public content.
  • Accept only http and https targets.
  • Reject localhost, private IP ranges, and internal-only hostnames.
  • Enforce public-only targets using both hostname resolution checks and redirect-target checks at request time.
  • Respect the chosen sitemap scope instead of broad site crawling.
  • Avoid login flows, private dashboards, carts, checkout paths, or user-specific pages.
  • Do not use authentication headers, cookies, or tokens.
  • Ask before writing outside the intended working area.
  • Tell the user when extraction quality looks weak on JavaScript-heavy pages. The bundled scraper is HTML-first and may miss client-rendered content.
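
The scheme, hostname, and address guardrails can be sketched with the standard library. This is an assumption-laden illustration, not the bundled implementation: the function names, the internal hostname suffixes, and the injectable `resolve` parameter are all hypothetical.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Hypothetical suffix list for internal-only hostnames.
INTERNAL_SUFFIXES = (".localhost", ".local", ".internal")


def is_public_address(address):
    """True when the IP address is globally routable."""
    # Strip an IPv6 zone id such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%", 1)[0]).is_global


def resolve_host(host):
    """Resolve a hostname to its IP addresses; empty list on failure."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def check_target(url, resolve=resolve_host):
    """Return None when the URL is acceptable, otherwise a rejection reason."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return "only http and https targets are accepted"
    host = parts.hostname
    if not host:
        return "URL has no hostname"
    if host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return "internal-only hostname"
    addresses = resolve(host)
    if not addresses:
        return "hostname did not resolve"
    if not all(is_public_address(address) for address in addresses):
        return "hostname resolves to a non-public address"
    return None
```

To cover the redirect guardrail, the same check would need to run again on every redirect target before it is followed, rather than only on the first URL.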

Version History

1 version in total

  • v1.0.2 (current)
    2026-05-03 09:40, marked safe

Security Scans

  • Tencent Cloud Security (Keen): safe, no risk
  • Tencent Cloud Security (Sanbu): safe, no risk
