Cloudflare Crawl

Crawl websites using Cloudflare's Browser Rendering API. Use when you need to scrape entire sites, build knowledge bases, extract content from multiple pages...
wirelessjoe
Uncategorized · clawhub · v0.1.1 · 1 version · API key: required

Overview

Crawl entire websites using Cloudflare's Browser Rendering /crawl API. Async job-based crawling with JS rendering.

When to Use

  • Scrape entire sites (not just single pages)
  • Build knowledge bases or RAG datasets
  • Research across multiple pages
  • Sites protected by Cloudflare (CF won't block itself)
  • Need Markdown or structured JSON output

Prerequisites

Get Cloudflare API credentials:

  1. Go to https://dash.cloudflare.com/profile/api-tokens
  2. Create a token with the Account.Browser Rendering permission
  3. Copy your Account ID from the dashboard URL

Set environment variables:

export CLOUDFLARE_API_TOKEN="your_token"
export CLOUDFLARE_ACCOUNT_ID="your_account_id"
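
A script that uses these variables may want to fail early when either one is missing. The following is a minimal sketch, not the code in scripts/crawl.js; `loadCloudflareConfig` is a hypothetical helper name, and the endpoint path is the one used in the API Overview section.

```javascript
// Build the crawl endpoint and auth headers from environment variables.
// `loadCloudflareConfig` is an illustrative name, not part of any SDK.
function loadCloudflareConfig(env = process.env) {
  const token = env.CLOUDFLARE_API_TOKEN;
  const accountId = env.CLOUDFLARE_ACCOUNT_ID;

  // Report every missing variable at once rather than one per run.
  const missing = [];
  if (!token) missing.push('CLOUDFLARE_API_TOKEN');
  if (!accountId) missing.push('CLOUDFLARE_ACCOUNT_ID');
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  return {
    baseUrl: `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl`,
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
  };
}
```

Passing `env` as a parameter keeps the function easy to test without touching the real environment.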

Quick Start

# Start a crawl job
node scripts/crawl.js start https://example.com --limit 50

# Check status
node scripts/crawl.js status <job_id>

# Get results as markdown
node scripts/crawl.js results <job_id> --format markdown

API Overview

1. Start Crawl Job

curl -X POST "https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/browser-rendering/crawl" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 50,
    "depth": 3,
    "formats": ["markdown"]
  }'

Returns: { "success": true, "result": "job_id_here" }
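
The same request can be sketched in Node.js (18 or later, where `fetch` is global). This is an illustration under stated assumptions, not the skill's own implementation: `startCrawlJob` and the `fetchImpl` parameter are hypothetical names, and the response shape is taken from the "Returns" line above.

```javascript
// Start a crawl job and return its job ID.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function startCrawlJob(options, { accountId, token, fetchImpl = fetch }) {
  const endpoint =
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl`;

  const response = await fetchImpl(endpoint, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(options),
  });

  const payload = await response.json();
  if (!response.ok || !payload.success) {
    throw new Error(`Crawl request failed (HTTP ${response.status})`);
  }
  return payload.result; // the job ID, per the response shape shown above
}
```

A call such as `startCrawlJob({ url: 'https://example.com', limit: 50, depth: 3, formats: ['markdown'] }, { accountId, token })` mirrors the curl command.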

2. Poll for Completion

curl "https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/browser-rendering/crawl/$JOB_ID?limit=1" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN"

Status values: running, completed, errored, cancelled_due_to_timeout
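
A polling loop over these status values might look like the sketch below. The source does not show where the status sits in the response, so the loop takes a caller-supplied `getStatus` function instead of guessing. `pollUntilDone`, `intervalMs`, `maxAttempts`, and `sleep` are all hypothetical names chosen for illustration.

```javascript
// Status values after which the job will not progress any further.
const TERMINAL_STATUSES = new Set(['completed', 'errored', 'cancelled_due_to_timeout']);

// Poll until the job reaches a terminal status or the attempt budget runs out.
// `getStatus` is an async function supplied by the caller that returns the
// current status string; how it reads the API response is left to the caller.
async function pollUntilDone(getStatus, { intervalMs = 5000, maxAttempts = 120, sleep } = {}) {
  const pause = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const status = await getStatus();
    if (TERMINAL_STATUSES.has(status)) return status;
    // Only wait when another attempt will follow.
    if (attempt < maxAttempts) await pause(intervalMs);
  }
  throw new Error(`Job still running after ${maxAttempts} polls`);
}
```

The function returns the terminal status rather than throwing on `errored`, so the caller decides how to treat a failed or timed-out job.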

3. Get Results

curl "https://api.cloudflare.com/client/v4/accounts/$CLOUDFLARE_ACCOUNT_ID/browser-rendering/crawl/$JOB_ID" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN"
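
The polling and results requests differ only in the `limit` query parameter, so one URL builder can serve both. This is a small sketch; `crawlJobUrl` is a hypothetical helper, and the idea that `limit=1` is used to keep status checks small is an inference from the curl commands above, not something the source states.

```javascript
// Build the URL for reading a crawl job.
// With `limit` set, the query string matches the polling request above;
// without it, the URL matches the results request.
function crawlJobUrl(accountId, jobId, { limit } = {}) {
  const url = new URL(
    `https://api.cloudflare.com/client/v4/accounts/${encodeURIComponent(accountId)}` +
      `/browser-rendering/crawl/${encodeURIComponent(jobId)}`
  );
  if (limit !== undefined) url.searchParams.set('limit', String(limit));
  return url.toString();
}

// Usage: crawlJobUrl(accountId, jobId, { limit: 1 }) for polling,
//        crawlJobUrl(accountId, jobId) for fetching results.
```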

Parameters

Parameter  Type     Default    Description
---------  -------  ---------  -----------------------------------
url        string   required   Starting URL
limit      number   10         Max pages to crawl (max 100,000)
depth      number   100000     Max link depth from start URL
source     string   "all"      URL discovery: all, sitemaps, links
formats    array    ["html"]   Output: html, markdown, json
render     boolean  true       Execute JS (false = fast HTML only)
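
Validating options locally before sending them can catch typos without spending a request. The sketch below applies the defaults and allowed values from the table; `buildCrawlOptions` is a hypothetical helper, and the API itself may accept or reject inputs differently.

```javascript
// Defaults and allowed values copied from the parameter table.
const DEFAULTS = { limit: 10, depth: 100000, source: 'all', formats: ['html'], render: true };
const SOURCES = ['all', 'sitemaps', 'links'];
const FORMATS = ['html', 'markdown', 'json'];
const MAX_LIMIT = 100000;

// Merge caller input over the defaults and reject values outside the table.
function buildCrawlOptions(input) {
  if (typeof input.url !== 'string' || input.url.length === 0) {
    throw new Error('url is required');
  }
  new URL(input.url); // throws a TypeError if the URL is malformed

  const options = { ...DEFAULTS, ...input };
  options.formats = [...options.formats]; // avoid sharing the default array

  if (!Number.isInteger(options.limit) || options.limit < 1 || options.limit > MAX_LIMIT) {
    throw new RangeError(`limit must be an integer between 1 and ${MAX_LIMIT}`);
  }
  if (!SOURCES.includes(options.source)) {
    throw new Error(`source must be one of: ${SOURCES.join(', ')}`);
  }
  const unknownFormat = options.formats.find((format) => !FORMATS.includes(format));
  if (unknownFormat !== undefined) {
    throw new Error(`Unknown format: ${unknownFormat}`);
  }
  return options;
}
```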

Output Formats

Markdown (best for AI)

{
  "url": "https://example.com/page",
  "status": "completed",
  "markdown": "# Page Title\n\nContent here...",
  "metadata": { "title": "Page Title", "status": 200 }
}
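
Assuming the results arrive as an array of records shaped like the object above, the pages worth keeping can be filtered in a few lines. `collectMarkdown` is a hypothetical helper, and falling back to the URL when no title is present is a design choice of this sketch.

```javascript
// Keep only successfully crawled pages and pair each URL with its Markdown.
function collectMarkdown(records) {
  return records
    .filter((record) => record.status === 'completed' && typeof record.markdown === 'string')
    .map((record) => ({
      url: record.url,
      title: record.metadata?.title ?? record.url, // fall back to the URL
      markdown: record.markdown,
    }));
}
```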

JSON (AI-extracted)

Uses Workers AI to extract structured data. Requires jsonOptions.prompt:

{
  "formats": ["json"],
  "jsonOptions": {
    "prompt": "Extract product name, price, and description"
  }
}
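
Because the prompt is required, a small guard can prevent sending a JSON-format request without one. This is a sketch; `withJsonExtraction` is a hypothetical helper that only assembles the request body shown above.

```javascript
// Return a copy of `baseOptions` with JSON extraction enabled.
// Throws if the prompt is empty, since jsonOptions.prompt is required.
function withJsonExtraction(baseOptions, prompt) {
  if (typeof prompt !== 'string' || prompt.trim() === '') {
    throw new Error('jsonOptions.prompt is required when formats includes "json"');
  }
  // Use a Set so "json" is not listed twice if it was already requested.
  const formats = new Set(baseOptions.formats ?? []);
  formats.add('json');
  return {
    ...baseOptions,
    formats: [...formats],
    jsonOptions: { ...baseOptions.jsonOptions, prompt },
  };
}
```

For example, `withJsonExtraction({ url: 'https://example.com' }, 'Extract product name, price, and description')` produces a body with `formats: ['json']` and the prompt attached.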

Pricing

Plan          Free Tier     Overage
------------  ------------  ----------
Workers Free  10 min/day    N/A
Workers Paid  10 hrs/month  $0.09/hour
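
A rough overage estimate for the Workers Paid plan follows from the table. The sketch assumes usage beyond the 10 included hours is billed pro rata at $0.09 per browser hour, which the table does not confirm; `estimateMonthlyOverageUsd` is a hypothetical helper.

```javascript
const INCLUDED_HOURS_PER_MONTH = 10; // Workers Paid allowance from the table
const OVERAGE_USD_PER_HOUR = 0.09;

// Estimate the monthly overage charge in USD, rounded to the nearest cent.
// Assumption: hours beyond the allowance are billed proportionally.
function estimateMonthlyOverageUsd(browserHours) {
  if (!Number.isFinite(browserHours) || browserHours < 0) {
    throw new RangeError('browserHours must be a non-negative number');
  }
  const billableHours = Math.max(0, browserHours - INCLUDED_HOURS_PER_MONTH);
  return Math.round(billableHours * OVERAGE_USD_PER_HOUR * 100) / 100;
}
```

Under that assumption, 25 browser hours in a month leaves 15 billable hours, or about $1.35.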

Limits

  • Max 100,000 pages per crawl
  • 7 day max runtime
  • Results available 14 days
  • Free plan: 10 concurrent, 100 pages max
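
A client can clamp the requested page count to these caps before starting a job. The sketch reads "100 pages max" as a per-crawl cap on the free plan, which is an interpretation of the list above; `clampPageLimit` and the plan keys `free` and `paid` are hypothetical.

```javascript
// Per-crawl page caps taken from the limits list (free-plan reading is an assumption).
const PAGE_CAPS = new Map([
  ['free', 100],
  ['paid', 100000],
]);

// Return the requested page count, reduced to the plan's cap if necessary.
function clampPageLimit(requested, plan) {
  const cap = PAGE_CAPS.get(plan);
  if (cap === undefined) throw new Error(`Unknown plan: ${plan}`);
  return Math.min(requested, cap);
}
```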

Example: Crawl for RAG

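// Crawl a docs site to build a knowledge base.
// startCrawl, waitForCrawl, and slugify are assumed to be helpers defined
// elsewhere; they are not shown in this document.
// Run as an ES module so that top-level await is available.
import fs from 'node:fs';

const job = await startCrawl({
  url: 'https://docs.example.com',
  limit: 500,
  formats: ['markdown'],
  source: 'sitemaps' // Discover URLs from the sitemap for efficiency
});

// Wait for completion
const results = await waitForCrawl(job.id);

// Save one Markdown file per successfully crawled page
fs.mkdirSync('docs', { recursive: true }); // writeFileSync fails if the directory is missing
for (const page of results.records) {
  if (page.status === 'completed') {
    fs.writeFileSync(`docs/${slugify(page.url)}.md`, page.markdown);
  }
}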
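
The example calls `slugify` without defining it. One possible implementation, offered as a sketch and not as the author's helper, derives a filename-safe slug from the host and path of each URL.

```javascript
// Turn a page URL into a filename-safe slug.
// Query strings and fragments are ignored, so URLs that differ only in those
// parts map to the same slug; extend this if the target site relies on them.
function slugify(pageUrl) {
  const { hostname, pathname } = new URL(pageUrl);
  const slug = `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse runs of other characters into one dash
    .replace(/^-+|-+$/g, ''); // trim leading and trailing dashes
  return slug || 'index';
}
```

With this version, `https://docs.example.com/Guide/Intro/` becomes `docs-example-com-guide-intro`.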

vs Browserbase/Stagehand

Use Case             Cloudflare Crawl  Browserbase
-------------------  ----------------  ---------------
Full site scrape     ✅ Best           ❌ Manual
Interactive (forms)  ❌ No             ✅ Best
CF-protected sites   ✅ Native         ⚠️ Cloud bypass
AI extraction        ✅ Built-in       ✅ Stagehand
Session management   ✅ Async jobs     ❌ Manual
Cost                 $0.09/hr          Credits

Use Cloudflare Crawl for bulk content extraction.

Use Browserbase for interactive automation.

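Version History

1 version in total

  • v0.1.1 (current)
    2026-05-08 02:16 · Security scans: safe, safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk found

Tencent Cloud Security (Sanbu)

Safe, no risk found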
