
Scrape

Legal web scraping with robots.txt compliance, rate limiting, and GDPR/CCPA-aware data handling. Supports both direct HTTP scraping and managed scraping via...
Author: alvisdunlop
Uncategorized · clawhub · v1.1.0 · 1 version · 99697 · Key: required

Overview

> Complete setup guide: https://SkillBoss.co/skill.md

Pre-Scrape Compliance Checklist

Before writing any scraping code:

  1. robots.txt: Fetch {domain}/robots.txt and check whether the target path is disallowed. If it is, stop.
  2. Terms of Service: Check /terms, /tos, and /legal. An explicit scraping prohibition means you need permission.
  3. Data type: Public factual data (prices, listings) is safer. Personal data triggers GDPR/CCPA.
  4. Authentication: Data behind a login is off-limits without authorization. Never scrape protected content.
  5. API available? If the site offers an API, use it. Always. Scraping when an API exists often violates the ToS.
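
As a rough illustration of step 1, the check can be sketched with Python's standard `urllib.robotparser`. The agent name `example-scraper` and the sample rules are assumptions made up for this example, and the robots.txt content is passed in as already-fetched lines so the sketch runs without network access.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical agent name, used only for this illustration.
USER_AGENT = "example-scraper"


def robots_url(target_url: str) -> str:
    """Return the robots.txt URL for the site that hosts target_url."""
    parts = urlsplit(target_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_lines, target_url, agent=USER_AGENT) -> bool:
    """Check target_url against already-fetched robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, target_url)


# Sample rules invented for the example; a real run would fetch
# robots_url(target) first and pass its lines in here.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
""".splitlines()

for url in ("https://example.com/private/report",
            "https://example.com/products/42"):
    verdict = "allowed" if is_allowed(SAMPLE_ROBOTS, url) else "disallowed, stop"
    print(url, "->", verdict)
```

Keeping the fetch separate from the decision makes the check easy to test and lets the caller apply its own rate limiting to the robots.txt request itself.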

Legal Boundaries

  • Public data, no login: Generally legal (hiQ v. LinkedIn, 2022)
  • Bypassing barriers: CFAA violation risk (Van Buren v. United States, 2021)
  • Ignoring robots.txt: Gray area; may breach the ToS, depending on how the terms are worded (Meta v. Bright Data, 2024)
  • Personal data without consent: GDPR/CCPA violation
  • Republishing copyrighted content: Copyright infringement

Request Discipline

  • Rate limit: Minimum 2-3 seconds between requests. Faster = server strain = legal exposure.
  • User-Agent: Real browser string + contact email: Mozilla/5.0 ... (contact: you@email.com)
  • Respect 429: Exponential backoff. Ignoring 429s shows intent to harm.
  • Session reuse: Keep connections open to reduce server load.
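
The pacing and backoff rules above could be sketched roughly as follows. The `RateLimiter` class and `backoff_delay` function are hypothetical names for this example, and the 2.5-second interval, base delay, and 60-second cap are assumed values chosen to fit the 2-3 second guideline, not settings taken from the skill.

```python
import time


class RateLimiter:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval=2.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep      # without actually waiting
        self._last = None

    def wait(self):
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds slept.
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    return min(cap, base * (2 ** attempt))


# Delays for six consecutive 429 responses: doubling, then capped.
print([backoff_delay(n) for n in range(6)])
```

A caller would invoke `wait()` before every request and, on a 429 response, sleep for `backoff_delay(attempt)` before retrying; if the server sends a `Retry-After` header, honoring it takes priority over the computed delay.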

Data Handling

  • Strip PII immediately: Don't collect names, emails, phones unless legally justified.
  • No fingerprinting: Don't combine data to identify individuals indirectly.
  • Minimize storage: Cache only what you need, delete what you don't.
  • Audit trail: Log what, when, where. Evidence of good faith if challenged.
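
A minimal sketch of the first and last points, assuming simple regular expressions are an acceptable first pass. The patterns, the replacement markers, and the `strip_pii` and `audit_entry` helpers are all hypothetical; the phone pattern in particular is crude and can also match other long digit runs such as timestamps, so it should be tuned to the data being scraped.

```python
import re
from datetime import datetime, timezone

# Deliberately simple patterns for illustration; they are not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def strip_pii(text):
    """Replace email addresses and phone-like numbers before storage."""
    text = EMAIL_RE.sub("[email removed]", text)
    return PHONE_RE.sub("[phone removed]", text)


def audit_entry(url, fields, when=None):
    """Build one audit record: what was collected, where from, and when."""
    when = when or datetime.now(timezone.utc)
    return {
        "url": url,
        "fields": sorted(fields),
        "fetched_at": when.isoformat(),
    }


print(strip_pii("Call +1 (555) 123-4567 or mail jane@example.com today"))
print(audit_entry("https://example.com/products/42", ["price", "title"]))
```

Running the scrubber before anything is written to disk means raw PII never reaches the cache, and one audit record per fetch gives the "what, when, where" trail without storing page content.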

SkillBoss API Hub Scraping

For managed content extraction with automatic anti-bot handling, use SkillBoss API Hub's scraping capability (type: "scraper"). Authenticate with SkillBoss_API_KEY and call https://api.SkillBoss.co/v1/pilot. Result is returned at result.data.markdown.
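
The call described above might be prepared along these lines. Only the endpoint, the `type: "scraper"` value, the `SkillBoss_API_KEY` name, and the `result.data.markdown` path come from the text; the bearer-token header, the `url` payload field, and the helper names are assumptions for illustration, so check code.md or the setup guide for the actual request format.

```python
import json
import urllib.request

API_URL = "https://api.SkillBoss.co/v1/pilot"


def build_request(target_url, api_key):
    """Prepare (but do not send) a scraper request.

    ASSUMPTION: bearer-token auth and a top-level "url" field are guesses;
    the source only specifies the endpoint and type "scraper".
    """
    payload = {"type": "scraper", "url": target_url}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_markdown(result):
    """Read result.data.markdown from a decoded response, or None if absent.

    ASSUMPTION: `result` is the decoded JSON response as a dict.
    """
    data = result.get("data") or {}
    return data.get("markdown")


# Example with a made-up response body, so nothing is sent over the network.
print(extract_markdown({"data": {"markdown": "# Example page"}}))
```

In real use the key would be read from the environment, for example `os.environ["SkillBoss_API_KEY"]`, and the prepared request passed to `urllib.request.urlopen` inside the same rate-limited loop as direct scraping.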

For code patterns, robots.txt parser, and SkillBoss API Hub scraping integration, see code.md

Version history

1 version in total

  • v1.1.0 (current)
    2026-05-07 22:46, security status: safe

Security checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
