
Skrape

Ethical web data extraction with robots exclusion protocol adherence, throttled scraping requests, and privacy-compliant handling ("Scrape responsibly!").
10oss
Developer Tools · clawhub · v1.1.1 · 1 version · 100000 · Key: not required
★ 0 stars · 📥 583 downloads · 💾 11 installs · 1 version
#latest, web-scraping, robots.txt, rate-limiting, nodejs, javascript, crawler, data-extraction, ethical-scraping, gdpr, http-client

Overview

Respect Creative Work

  • Design & text copying: Avoid copying design elements or substantial portions of text; while facts and data aren't typically protected by copyright, their presentation (website layouts, specific text, compilations) often is.
  • Source attribution: Properly attribute sources when appropriate; this shows integrity and builds trust with both content creators and your own audience.
  • Creator impact: Consider how your use might impact the original creator's work; respecting copyrighted material demonstrates ethical conduct.

Pre-Extraction Verification Steps

I. Access Authorization: Retrieve {domain}/robots.txt and review the /terms or /tos endpoints. Proceed only if neither prohibits extraction; halt if access is blocked or explicit restrictions exist.

II. Data Classification: Distinguish public factual information (listings, pricing) from personal information. The latter invokes GDPR/CCPA obligations and requires stronger justification.

III. Preferred Channels: Check whether the platform offers an API. If one is available, use it instead of direct extraction. Never access content that requires authentication without proper credentials.
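
Step I can be sketched in Node.js roughly as follows. This is a simplified illustration, not the logic shipped in code.md: it handles only prefix rules (no `*` or `$` wildcards), picks the longest matching rule, and lets Allow win ties. The function names and the `skrape-bot` agent in the usage note are hypothetical, and the status handling in `fetchRobots` (treat 5xx as fully blocked, other non-2xx as no published rules) is an assumption modeled on RFC 9309. It assumes Node 18+ for the global `fetch`.

```javascript
// Parse robots.txt text into groups of { agents, rules }.
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if ((field === "allow" || field === "disallow") && current && value !== "") {
        // An empty Disallow value restricts nothing, so it is skipped.
        current.rules.push({ allow: field === "allow", path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// Decide whether `path` may be fetched by `userAgent` under `robotsText`.
function isAllowed(robotsText, userAgent, path) {
  const groups = parseRobots(robotsText);
  const ua = userAgent.toLowerCase();
  // Prefer groups that name this agent; otherwise fall back to "*".
  let matching = groups.filter((g) =>
    g.agents.some((a) => a !== "*" && a !== "" && ua.includes(a))
  );
  if (matching.length === 0) {
    matching = groups.filter((g) => g.agents.includes("*"));
  }
  let best = null;
  for (const rule of matching.flatMap((g) => g.rules)) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true; // no matching rule means allowed
}

// Fetch robots.txt for an origin such as "https://example.com".
async function fetchRobots(origin) {
  const res = await fetch(new URL("/robots.txt", origin));
  if (res.status >= 500) return "User-agent: *\nDisallow: /"; // unreachable: assume blocked
  if (!res.ok) return ""; // e.g. 404: no rules published
  return res.text();
}
```

For example, `isAllowed(await fetchRobots("https://example.com"), "skrape-bot/1.1", "/listings")` would gate a single request. The terms-of-service review in step I and the API check in step III still need a human reading; this sketch covers only the robots.txt part.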

Operational Conduct & Compliance

  • Request discipline: Wait at least 2-3 seconds between requests, honor HTTP 429 responses with progressive backoff, maintain connection pooling, and send an authentic User-Agent that includes a contact email.
  • Access boundaries: Ignoring robots.txt carries uncertain legal standing (Meta v. Bright Data 2024); scraping publicly accessible content is typically permissible (hiQ v. LinkedIn 2022); circumventing access controls risks CFAA exposure (Van Buren v. US 2021).
  • Data & content restrictions: Collecting personal information without permission can breach GDPR/CCPA; redistributing copyrighted material can constitute copyright infringement.
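
The request-discipline bullet can be illustrated with a small wrapper around `fetch`. This is a sketch under stated assumptions: the 2-second floor comes from the guidance above, while the 60-second cap, the doubling schedule, the retry limit, and the `USER_AGENT` string are illustrative choices, not values from the skill. Only the numeric (seconds) form of `Retry-After` is handled; an HTTP-date value falls back to the doubling schedule. `fetchImpl` and `sleep` are injectable so the logic can be exercised without network access or real waiting.

```javascript
const MIN_DELAY_MS = 2000;  // floor from the "2-3 seconds minimum" guidance
const MAX_DELAY_MS = 60000; // assumed cap, not from the source
// Hypothetical identity: replace with your own tool name and contact address.
const USER_AGENT = "skrape-example/0.1 (+mailto:ops@example.com)";

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (1-based) after a 429 response.
function backoffDelayMs(attempt, retryAfter) {
  const seconds = Number(retryAfter);
  const usable =
    retryAfter != null && retryAfter !== "" && Number.isFinite(seconds) && seconds >= 0;
  if (usable) {
    // The server's hint wins, but never drop below our own floor.
    return Math.max(MIN_DELAY_MS, seconds * 1000);
  }
  // Otherwise double the floor on each attempt, up to the cap.
  return Math.min(MAX_DELAY_MS, MIN_DELAY_MS * 2 ** attempt);
}

// Fetch one URL politely: wait before every request, back off on 429.
async function politeFetch(url, opts = {}) {
  const { fetchImpl = fetch, sleep = defaultSleep, maxRetries = 4 } = opts;
  let wait = MIN_DELAY_MS;
  for (let attempt = 0; ; attempt++) {
    await sleep(wait);
    const res = await fetchImpl(url, { headers: { "User-Agent": USER_AGENT } });
    // Return on any non-429 status, or give up once retries are exhausted.
    if (res.status !== 429 || attempt >= maxRetries) return res;
    wait = backoffDelayMs(attempt + 1, res.headers.get("retry-after"));
  }
}
```

Calling `politeFetch` sequentially for each URL keeps at least `MIN_DELAY_MS` between requests. A crawler that issues requests concurrently would need a shared queue instead, which this sketch does not attempt.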

Information Stewardship

  • PII & profiling restrictions: Remove personal information promptly and avoid correlating data to identify individuals.
  • Limit retention: Store only necessary data, purge the rest.
  • Activity logging: Record extraction events (what, when, source) to demonstrate responsible conduct if questioned.
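
One way to apply these three bullets in code is an allow-list for stored fields, a redaction pass over free text, and a structured audit line per extraction. The sketch below is illustrative only: the field names in `ALLOWED_FIELDS`, the email pattern, and the log shape are assumptions, and a regex catches only obvious email addresses, not personal information in general.

```javascript
// Assumed schema: only fields with a stated need are stored.
const ALLOWED_FIELDS = ["title", "price", "currency", "url"];

// Keep allow-listed fields and drop everything else (limit retention).
function minimize(record) {
  return Object.fromEntries(
    Object.entries(record).filter(([key]) => ALLOWED_FIELDS.includes(key))
  );
}

// Rough pattern for email addresses; it will miss other kinds of PII.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Replace email addresses in free text with a placeholder.
function redactEmails(text) {
  return text.replace(EMAIL_RE, "[redacted-email]");
}

// Build one JSON log line recording what was extracted, when, and from where.
function auditEntry(source, what, now = new Date()) {
  return JSON.stringify({ when: now.toISOString(), source, what });
}
```

For instance, `minimize({ title: "Lamp", price: 19.5, sellerEmail: "a@b.com" })` keeps only `title` and `price`, and each `auditEntry(...)` line can be appended to a log file to document the run.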

Implementation patterns and the robots.txt evaluation logic are in code.md.

Version History

1 version in total

  • v1.1.1 (current)
    2026-03-30 00:07 · Safe · Safe

Security Check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
