← 返回
未分类 Key 中文

Crawl

Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL.
抓取任意网站并保存为本地markdown文件。适用于下载文档、知识库或网页内容以供离线访问或分析。无需编码,只需提供URL。
barneyjm barneyjm 来源
未分类 clawhub v0.1.0 1 版本 99747 Key: 需要
★ 0
Stars
📥 1,577
下载
💾 2
安装
1
版本
#latest

概述

Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

Prerequisites

Tavily API Key Required - Get your key at https://tavily.com

Add to ~/.claude/settings.json:

{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}

Quick Start

Using the Script

./scripts/crawl.sh '<json>' [output_dir]

Examples:

# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'

When output_dir is provided, each crawled page is saved as a separate markdown file.

Basic Crawl

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'

Focused Crawl with Instructions

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

API Reference

Endpoint

POST https://api.tavily.com/crawl

Headers

HeaderValue
---------------
AuthorizationBearer
Content-Typeapplication/json

Request Body

FieldTypeDefaultDescription
-----------------------------------
urlstringRequiredRoot URL to begin crawling
max_depthinteger1Levels deep to crawl (1-5)
max_breadthinteger20Links per page
limitinteger50Total pages cap
instructionsstringnullNatural language guidance for focus
chunks_per_sourceinteger3Chunks per page (1-5, requires instructions)
extract_depthstring"basic"basic or advanced
formatstring"markdown"markdown or text
select_pathsarraynullRegex patterns to include
exclude_pathsarraynullRegex patterns to exclude
allow_externalbooleantrueInclude external domain links
timeoutfloat150Max wait (10-150 seconds)

Response Format

{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}

Depth vs Performance

DepthTypical PagesTime
----------------------------
110-50Seconds
250-500Minutes
3500-5000Many minutes

Start with max_depth=1 and increase only if needed.

Crawl for Context vs Data Collection

For agentic use (feeding results into context): Always use instructions + chunks_per_source. This returns only relevant chunks instead of full pages, preventing context window explosion.

For data collection (saving to files): Omit chunks_per_source to get full page content.

Examples

For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

For Context: Targeted Technical Docs

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'

For Data Collection: Full Page Archive

Use when saving content to files for later processing:

curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'

Returns full page content - use the script with output_dir to save as markdown files.

Map API (URL Discovery)

Use map instead of crawl when you only need URLs, not content:

curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'

Returns URLs only (faster than crawl):

{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}

Tips

  • Always use chunks_per_source for agentic workflows - prevents context explosion when feeding results to LLMs
  • Omit chunks_per_source only for data collection - when saving full pages to files
  • Start conservative (max_depth=1, limit=20) and scale up
  • Use path patterns to focus on relevant sections
  • Use Map first to understand site structure before full crawl
  • Always set a limit to prevent runaway crawls

版本历史

共 1 个版本

  • v0.1.0 当前
    2026-05-21 12:19 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

Search

barneyjm
使用 Tavily 的 LLM 优化搜索 API 在网上搜索,返回包含内容片段、评分和元数据的相关结果。适用于无需编写代码即可查找任意主题的网页内容。
★ 0 📥 1,690

Query

barneyjm
使用Camino AI的位置智能API通过自然语言搜索地点。返回包含坐标、距离和元数据的相关结果。适用于查找餐厅、商店、地标或任何兴趣点等真实世界位置。
★ 2 📥 1,669

Places

barneyjm
使用灵活的查询格式定位地点(自由文本搜索或结构化地址)。返回坐标、地址及可选的街景照片。用于地址地理编码或查找特定名称的地点。
★ 2 📥 1,725