Web Scraping

Extract structured information from websites: use web_fetch for simple pages, and browser automation for dynamic sites, login-gated flows, pagination, infinite scroll, and similar complex cases.
zhangqixin9527
Developer tools clawhub v1.0.0 1 version 96892.9 Key: not required

Overview

Web Scraping

Extract data with the lightest reliable method first.

Choose the approach

  1. Use web_fetch for simple public pages when the needed content is already in HTML.
  2. Use browser when the site is dynamic or requires clicking, infinite scroll, filters, tabs, or login/session state.
  3. Use web_search only to discover candidate pages when the target URL is unknown.
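
The three rules above can be read as a small decision function. The following is a minimal sketch, not part of the skill itself: the `Target` class and its field names are hypothetical, introduced only to make the decision explicit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Target:
    """Hypothetical description of a scraping target (not part of the skill)."""
    url: Optional[str]               # None when the target URL is not yet known
    content_in_html: bool = True     # the needed content is already in the raw HTML
    needs_interaction: bool = False  # clicking, infinite scroll, filters, tabs
    needs_login: bool = False        # login or session state


def choose_method(target: Target) -> str:
    """Return the lightest tool that can reliably handle the target."""
    if target.url is None:
        # web_search is only for discovering candidate pages.
        return "web_search"
    if target.needs_interaction or target.needs_login or not target.content_in_html:
        return "browser"
    return "web_fetch"
```

For example, `choose_method(Target(url="https://example.com/post"))` selects `web_fetch`, while setting `needs_login=True` on the same target selects `browser`.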

Default workflow

  1. Identify the target site and exact fields to collect.
  2. Test one page first.
  3. Decide the extraction method:
    • web_fetch for readable article/listing text
    • browser snapshot for dynamic DOM inspection
  4. Normalize the output into a stable schema.
  5. If scraping multiple pages, avoid tight loops and serialize requests.
  6. Deduplicate by URL or stable item id.
  7. Save results in the workspace when the task is larger than a quick one-off.
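
Steps 4 to 6 of the workflow (stable schema, serialized requests, deduplication) might look roughly like the sketch below. It assumes a caller-supplied `fetch_items` callable that returns raw dictionaries for one page; that callable, the `FIELDS` tuple, and the one-second default delay are illustrative assumptions, not values prescribed by the skill.

```python
import time
from typing import Callable, Dict, Iterable, List

# Assumed schema, taken from the example keys in the output guidance.
FIELDS = ("title", "url", "source", "date", "summary")


def normalize(raw: Dict) -> Dict:
    """Map a raw item onto the fixed schema; missing fields stay None."""
    return {key: raw.get(key) for key in FIELDS}


def scrape_pages(
    urls: Iterable[str],
    fetch_items: Callable[[str], Iterable[Dict]],  # hypothetical per-page fetcher
    delay_seconds: float = 1.0,
) -> List[Dict]:
    """Fetch pages one at a time, pausing between requests, and deduplicate."""
    seen = set()
    results: List[Dict] = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # avoid a tight request loop
        for raw in fetch_items(url):
            item = normalize(raw)
            # Prefer a stable item id when the source provides one, else the URL.
            key = raw.get("id") or item["url"]
            if key is not None:
                if key in seen:
                    continue
                seen.add(key)
            results.append(item)
    return results
```

Items with neither an id nor a URL are kept as-is, since there is nothing reliable to deduplicate them on.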

Browser scraping pattern

  1. Open the page.
  2. Take a snapshot.
  3. Interact only as needed: search, click filters, pagination, expand sections.
  4. Re-snapshot after each meaningful state change.
  5. Extract only the fields the user asked for.
  6. Close tabs when finished.
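
The pattern above can be sketched against an abstract tab interface. The `BrowserTab` protocol below is hypothetical: the real browser tool's method names and signatures may differ, so treat `snapshot`, `click`, and `close` as placeholders for whatever the tool actually exposes.

```python
from typing import Callable, List, Protocol, Sequence


class BrowserTab(Protocol):
    """Hypothetical tab interface; the real browser tool's API may differ."""

    def snapshot(self) -> str: ...
    def click(self, ref: str) -> None: ...
    def close(self) -> None: ...


def scrape_with_browser(
    open_tab: Callable[[str], BrowserTab],  # assumed factory that opens a page
    url: str,
    click_refs: Sequence[str],              # elements to click, in order
    extract: Callable[[str], List[dict]],   # pulls the requested fields from a snapshot
) -> List[dict]:
    """Open, snapshot, interact, re-snapshot after each change, then close."""
    tab = open_tab(url)
    rows: List[dict] = []
    try:
        rows.extend(extract(tab.snapshot()))
        for ref in click_refs:
            tab.click(ref)
            # The page state changed, so take a fresh snapshot before extracting.
            rows.extend(extract(tab.snapshot()))
    finally:
        tab.close()  # close the tab even if extraction fails
    return rows
```

Putting `close()` in a `finally` block keeps the cleanup step from being skipped when a click or extraction raises.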

Output guidance

Prefer one of these formats:

  • concise bullet summary
  • JSON array of objects
  • CSV/TSV when the user wants exportable rows

Use explicit keys, for example:

[
  {
    "title": "...",
    "url": "...",
    "source": "...",
    "date": "...",
    "summary": "..."
  }
]
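
When the user wants exportable rows, the same records can be written as CSV or TSV with the standard `csv` module. This is a minimal sketch that assumes the five keys shown above as the column order; missing values are written as empty cells rather than filled in.

```python
import csv
import io
from typing import Dict, Iterable

# Column order assumed from the example keys above.
FIELDS = ["title", "url", "source", "date", "summary"]


def to_delimited(records: Iterable[Dict], delimiter: str = ",") -> str:
    """Serialize records to CSV (default) or TSV (delimiter='\\t') text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        delimiter=delimiter,
        extrasaction="ignore",  # drop keys outside the schema
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        # Missing or None values become empty cells; nothing is invented.
        writer.writerow({k: "" if record.get(k) is None else record[k] for k in FIELDS})
    return buffer.getvalue()
```

Calling `to_delimited(rows, delimiter="\t")` produces TSV, which is often easier to paste into a spreadsheet when fields contain commas.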

Reliability rules

  • Do not invent missing fields.
  • If a site blocks access, say so and switch sources when appropriate.
  • For news/results pages, capture source + title + link at minimum.
  • For large jobs, checkpoint partial results to a workspace file.
  • Prefer fewer larger writes over many tiny writes.
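
The last two rules (checkpointing and batching writes) can be combined. The sketch below rewrites one JSON checkpoint file every `batch_size` items instead of after every item; the function names, the default batch size of 50, and the temporary-file naming are illustrative assumptions.

```python
import json
from pathlib import Path
from typing import Iterable, List


def save_checkpoint(path: Path, results: List[dict]) -> None:
    """Write all results so far, replacing the file in a single step."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(path)  # avoids leaving a half-written checkpoint behind


def collect_with_checkpoints(items: Iterable[dict], path: Path, batch_size: int = 50) -> int:
    """Accumulate items, checkpointing once per batch; return the number of writes."""
    results: List[dict] = []
    writes = 0
    for item in items:
        results.append(item)
        if len(results) % batch_size == 0:
            save_checkpoint(path, results)
            writes += 1
    # Final write covers a trailing partial batch (or an empty run).
    if len(results) % batch_size != 0 or not results:
        save_checkpoint(path, results)
        writes += 1
    return writes
```

If the job is interrupted, the checkpoint file holds every completed batch, so at most `batch_size - 1` items need to be collected again.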

Cleanup

  • Close browser tabs opened for scraping.
  • If you create state/output files, store them under the workspace and name them clearly.

Version history

1 version in total

  • v1.0.0 (current)
    2026-03-28 13:29 Safe Safe

Security check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
