
Siteone Crawler

Website crawling, auditing, offline cloning, and markdown export using SiteOne Crawler (Rust). Trigger when the user asks to: crawl a website, audit/analyze...
Author: tsingliuwin. Source: clawhub. Version: v0.0.1. API key: not required.

Overview

SiteOne Crawler

Cross-platform website crawler/analyzer written in Rust.

Setup (run once)

Before first use, ensure the binary exists. If not found, install it automatically:

  1. Check if binary exists at the paths below (in order of priority):
    • $HOME/.siteone-crawler/siteone-crawler
    • Any siteone-crawler found in $PATH (via which siteone-crawler)
  2. If neither exists, download the latest release from GitHub:

```bash

INSTALL_DIR="$HOME/.siteone-crawler"
mkdir -p "$INSTALL_DIR"

# Detect OS/arch (e.g. linux + x64)
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
case "$ARCH" in x86_64) ARCH="x64" ;; aarch64|arm64) ARCH="arm64" ;; esac

# Get the latest release URL from the GitHub API.
# Uses plain grep/sed rather than grep -P, which BSD/macOS grep lacks.
RELEASE_URL=$(curl -sL https://api.github.com/repos/janreges/siteone-crawler/releases/latest \
  | grep -o "\"browser_download_url\": *\"[^\"]*${OS}-${ARCH}\.zip\"" \
  | head -1 | sed 's/.*"\(https[^"]*\)"$/\1/')

# Download, unpack, install, and clean up; stop at the first failing step.
[ -n "$RELEASE_URL" ] \
  && curl -sL "$RELEASE_URL" -o /tmp/siteone-crawler.zip \
  && unzip -o /tmp/siteone-crawler.zip -d /tmp/siteone-crawler \
  && mv /tmp/siteone-crawler/siteone-crawler "$INSTALL_DIR/" \
  && chmod +x "$INSTALL_DIR/siteone-crawler" \
  && rm -rf /tmp/siteone-crawler /tmp/siteone-crawler.zip

```

  3. After installation, set CRAWLER to the resolved path and verify with $CRAWLER --version.
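The OS/arch detection and URL extraction used during installation can be isolated into two small functions, which makes them easy to check without touching the network. This is only a sketch: asset_suffix and extract_asset_url are hypothetical helper names, and the `<os>-<arch>.zip` asset naming is taken from the install script, not verified against the actual release assets.

```bash
# Hypothetical helper: build the release asset suffix from uname output.
# $1 = output of `uname -s`, $2 = output of `uname -m`
asset_suffix() {
  local os arch
  os=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$2" in
    x86_64)        arch="x64" ;;
    aarch64|arm64) arch="arm64" ;;
    *)             arch="$2" ;;   # unknown architectures pass through unchanged
  esac
  printf '%s-%s.zip\n' "$os" "$arch"
}

# Hypothetical helper: read GitHub release JSON on stdin and print the first
# browser_download_url ending in the given suffix ($1).
# The suffix is used as a regex, so "." matches any character.
extract_asset_url() {
  grep -o "\"browser_download_url\": *\"[^\"]*$1\"" \
    | head -1 | sed 's/.*"\(https[^"]*\)"$/\1/'
}
```

With these in place, the release lookup becomes `curl -sL <releases/latest URL> | extract_asset_url "$(asset_suffix "$(uname -s)" "$(uname -m)")"`. An empty result means no asset matched the detected platform.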

Binary

CRAWLER="$HOME/.siteone-crawler/siteone-crawler"

If that path doesn't exist, fall back to $(which siteone-crawler). If neither resolves, run Setup first.

Always use the resolved path. The binary outputs colored text to terminal; use --no-color for script/pipeline usage and --output json for programmatic consumption.
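The lookup order described above (the install directory first, then $PATH) could be wrapped in a function along these lines. resolve_crawler is a hypothetical name, not part of the tool, and the sketch uses `command -v` in place of `which` because it is a shell builtin.

```bash
# Hypothetical helper: print the path of the first usable binary, or fail.
resolve_crawler() {
  local preferred="$HOME/.siteone-crawler/siteone-crawler"
  if [ -x "$preferred" ]; then
    # 1. Binary installed by the Setup step
    printf '%s\n' "$preferred"
  elif command -v siteone-crawler >/dev/null 2>&1; then
    # 2. Any siteone-crawler found in $PATH
    command -v siteone-crawler
  else
    # Nothing found: the caller should run Setup
    return 1
  fi
}
```

A typical call would be `CRAWLER=$(resolve_crawler) || echo "binary not found, run Setup first"`.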

Common Workflows

1. Quick Audit (HTML report)

$CRAWLER --url="https://example.com" --output-html-report="/path/to/report.html"

Generates a self-contained interactive HTML audit report with quality scores (0.0-10.0) across Performance, SEO, Security, Accessibility, Best Practices.

2. Full Audit + JSON + Upload

$CRAWLER --url="https://example.com" \
  --output-html-report="/path/to/report.html" \
  --output-json-file="/path/to/result.json" \
  --upload --upload-retention="7d"

3. Offline Clone

$CRAWLER --url="https://example.com" --offline-export-dir="/path/to/offline-site" --disable-javascript

Use --disable-javascript for SPA/React sites to get a browsable static version. Use --allowed-domain-for-external-files="*" to include CDN assets.

4. Markdown Export

Multi-file (browsable):

$CRAWLER --url="https://example.com" --markdown-export-dir="/path/to/md-export"

Single-file (ideal for AI tools):

$CRAWLER --url="https://example.com" --markdown-export-dir="/tmp/md" --markdown-export-single-file="/path/to/site.md" \
  --markdown-disable-images --markdown-disable-files

5. Sitemap Generation

$CRAWLER --url="https://example.com" --sitemap-xml-file="/path/to/sitemap" --sitemap-txt-file="/path/to/sitemap"

6. CI/CD Quality Gate

$CRAWLER --url="https://example.com" --ci --ci-min-score="7.0" --ci-max-404="0" --ci-max-5xx="0"

Exits with code 10 if the thresholds are not met. See references/cli-params.md for all --ci-* options.
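In a pipeline it can help to tell "thresholds not met" apart from other failures. The wrapper below is a minimal sketch under stated assumptions: run_quality_gate is a hypothetical name, $CRAWLER is expected to be set as described in the Binary section, and treating exit code 0 as a pass and any other non-10 code as a crawler error is an assumption rather than documented behavior.

```bash
# Hypothetical wrapper around the CI flags shown above.
# $1 = URL, $2 = minimum score (defaults to 7.0)
run_quality_gate() {
  local status
  "$CRAWLER" --url="$1" --no-color --ci --ci-min-score="${2:-7.0}" \
    --ci-max-404="0" --ci-max-5xx="0"
  status=$?
  case "$status" in
    0)  echo "quality gate passed" ;;
    10) echo "quality gate failed: thresholds not met" ;;
    *)  echo "crawler error (exit code $status)" ;;   # assumed meaning
  esac
  # Propagate the original exit code so the CI job fails when the gate fails.
  return "$status"
}
```

For example, `run_quality_gate "https://example.com" 8.0` runs the gate with a stricter minimum score.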

7. Stress/Load Test

$CRAWLER --url="https://example.com" --workers="20" --max-reqs-per-sec="100" --max-depth="1"

Warning: high worker counts and request rates can overload the target server, effectively a denial of service (DoS). Use with caution.

8. Single Page Crawl

$CRAWLER --url="https://example.com/about" --single-page --output-json-file="/path/to/result.json"

9. HTML-to-Markdown (local file)

$CRAWLER --html-to-markdown="/path/to/page.html" --html-to-markdown-output="/path/to/page.md"
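The single-file conversion above extends naturally to a whole directory by looping over its files. The function below is a sketch: convert_html_dir is a hypothetical name, it only uses the two flags shown above, and it assumes $CRAWLER is set as described in the Binary section.

```bash
# Hypothetical helper: convert every .html file in a directory to Markdown.
# $1 = directory with .html files, $2 = output directory for .md files
convert_html_dir() {
  local src="$1" dst="$2" html name
  mkdir -p "$dst"
  for html in "$src"/*.html; do
    # With no matches the glob stays literal, so skip paths that don't exist.
    [ -e "$html" ] || continue
    name=$(basename "$html" .html)
    "$CRAWLER" --html-to-markdown="$html" \
      --html-to-markdown-output="$dst/$name.md" || return 1
  done
}
```

Calling `convert_html_dir /path/to/html /path/to/md` writes one .md file per input file and stops at the first conversion that fails.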

10. Browse Exported Content

$CRAWLER --serve-markdown="/path/to/md-export" --serve-port="8321"
$CRAWLER --serve-offline="/path/to/offline-site" --serve-port="8321"

Key Parameters Reference

See references/cli-params.md for the complete parameter reference organized by category.

Most-used flags

| Flag | Purpose | Default |
|------|---------|---------|
| --url | Target URL (required) | - |
| --output | text or json | text |
| --workers | Concurrent threads | 3 |
| --max-reqs-per-sec | Requests per second limit | 10 |
| --max-depth | Crawl depth (0 = unlimited) | 0 |
| --timeout | Request timeout in seconds | 5 |
| --no-color | Disable colors | off |
| --ignore-robots-txt | Ignore robots.txt | off |

Resource filtering

| Flag | Effect |
|------|--------|
| --disable-all-assets | Only crawl pages |
| --disable-javascript | No JS (recommended for offline/SPA) |
| --disable-images | No images |
| --disable-styles | No CSS |
| --disable-files | No downloadable docs |

URL filtering

| Flag | Effect |
|------|--------|
| --include-regex | PCRE regex to include URLs |
| --ignore-regex | PCRE regex to skip URLs |
| --allowed-domain-for-crawling | Allow cross-domain crawling |
| --allowed-domain-for-external-files | Allow external asset domains |

Script Helpers

scripts/audit.sh — Quick audit wrapper

Runs a full crawl with HTML report and optional JSON output. See script for usage.

scripts/export-markdown.sh — Markdown export wrapper

Exports a website to markdown (single or multi-file). See script for usage.

Tips

  • For modern JS frameworks (Next.js, React, Vue), add --disable-javascript when doing offline exports
  • Use --output json for programmatic processing; JSON goes to STDOUT, progress to STDERR
  • Use --extra-columns="Title,Keywords,Description" to add SEO columns
  • Use --timezone="Asia/Shanghai" for local timestamps
  • For large sites, increase --memory-limit, --max-visited-urls, and --max-queue-length
  • Use --resolve to test local/dev servers (like curl --resolve)
  • HTML reports are self-contained — open in any browser, no server needed
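Because JSON goes to STDOUT and progress to STDERR, the two streams can be redirected separately so the JSON file stays clean. A minimal sketch, assuming $CRAWLER is set as described in the Binary section: crawl_to_json is a hypothetical name, and passing the value as --output="json" follows the --flag="value" style used in the workflows above, not a confirmed syntax.

```bash
# Hypothetical helper: keep machine-readable output and progress apart.
# $1 = URL, $2 = JSON output file, $3 = progress log file
crawl_to_json() {
  # STDOUT (JSON) goes to $2, STDERR (progress) goes to $3.
  "$CRAWLER" --url="$1" --output="json" --no-color >"$2" 2>"$3"
}
```

After `crawl_to_json "https://example.com" result.json progress.log`, result.json can be passed to a JSON parser without first stripping progress lines.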

Version history

1 version in total

  • v0.0.1 (current)
    2026-05-08 14:02, security scans: safe

