← 返回
安全合规 Key 中文

Links to PDFs

Scrape documents from Notion, DocSend, PDFs, and other sources into local PDF files. Use when the user needs to download, archive, or convert web documents to PDF format. Supports authentication flows for protected documents and session persistence via profiles. Returns local file paths to downloaded PDFs.
从 Notion、DocSend、PDF 等来源抓取文档并保存为本地 PDF 文件。用于下载、归档或转换网页文档。支持受保护文档的身份验证及通过配置文件保持会话,返回下载文件的本地路径。
chrisling-dev
安全合规 clawhub v0.0.1 1 版本 99662.8 Key: 需要
★ 2
Stars
📥 2,620
下载
💾 1
安装
1
版本
#latest

概述

docs-scraper

CLI tool that scrapes documents from various sources into local PDF files using browser automation.

Installation

npm install -g docs-scraper

Quick start

Scrape any document URL to PDF:

docs-scraper scrape https://example.com/document

Returns local path: ~/.docs-scraper/output/1706123456-abc123.pdf

Basic scraping

Scrape with daemon (recommended, keeps browser warm):

docs-scraper scrape <url>

Scrape with named profile (for authenticated sites):

docs-scraper scrape <url> -p <profile-name>

Scrape with pre-filled data (e.g., email for DocSend):

docs-scraper scrape <url> -D email=user@example.com

Direct mode (single-shot, no daemon):

docs-scraper scrape <url> --no-daemon

Authentication workflow

When a document requires authentication (login, email verification, passcode):

  1. Initial scrape returns a job ID:

```bash

docs-scraper scrape https://docsend.com/view/xxx

# Output: Scrape blocked

# Job ID: abc123

```

  1. Retry with data:

```bash

docs-scraper update abc123 -D email=user@example.com

# or with password

docs-scraper update abc123 -D email=user@example.com -D password=1234

```

Profile management

Profiles store session cookies for authenticated sites.

docs-scraper profiles list     # List saved profiles
docs-scraper profiles clear    # Clear all profiles
docs-scraper scrape <url> -p myprofile  # Use a profile

Daemon management

The daemon keeps browser instances warm for faster scraping.

docs-scraper daemon status     # Check status
docs-scraper daemon start      # Start manually
docs-scraper daemon stop       # Stop daemon

Note: Daemon auto-starts when running scrape commands.

Cleanup

PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.

Manual cleanup:

docs-scraper cleanup                    # Delete all PDFs
docs-scraper cleanup --older-than 1h    # Delete PDFs older than 1 hour

Job management

docs-scraper jobs list         # List blocked jobs awaiting auth

Supported sources

  • Direct PDF links - Downloads PDF directly
  • Notion pages - Exports Notion page to PDF
  • DocSend documents - Handles DocSend viewer
  • LLM fallback - Uses Claude API for any other webpage

Scraper Reference

Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.

DirectPdfScraper

Handles: URLs ending in .pdf

Data fields: None (downloads directly)

Example:

docs-scraper scrape https://example.com/document.pdf

DocsendScraper

Handles: docsend.com/view/, docsend.com/v/, and subdomains (e.g., org-a.docsend.com)

URL patterns:

  • Documents: https://docsend.com/view/{id} or https://docsend.com/v/{id}
  • Folders: https://docsend.com/view/s/{id}
  • Subdomains: https://{subdomain}.docsend.com/view/{id}

Data fields:

FieldTypeDescription
--------------------------
emailemailEmail address for document access
passwordpasswordPasscode/password for protected documents
nametextYour name (required for NDA-gated documents)

Examples:

# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"

# Retry blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123

Notes:

  • DocSend may require any combination of email, password, and name
  • Folders are scraped as a table of contents PDF with document links
  • The scraper auto-checks NDA checkboxes when name is provided

NotionScraper

Handles: notion.so/, .notion.site/*

Data fields:

FieldTypeDescription
--------------------------
emailemailNotion account email
passwordpasswordNotion account password

Examples:

# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123

Notes:

  • Public Notion pages don't require authentication
  • Toggle blocks are automatically expanded before PDF generation
  • Uses session profiles to persist login across scrapes

LlmFallbackScraper

Handles: Any URL not matched by other scrapers (automatic fallback)

Data fields: Dynamic - determined by Claude analyzing the page

The LLM scraper uses Claude to analyze the page HTML and detect:

  • Login forms (extracts field names dynamically)
  • Cookie banners (auto-dismisses)
  • Expandable content (auto-expands)
  • CAPTCHAs (reports as blocked)
  • Paywalls (reports as blocked)

Common dynamic fields:

FieldTypeDescription
--------------------------
emailemailLogin email (if detected)
passwordpasswordLogin password (if detected)
usernametextUsername (if login uses username)

Examples:

# Generic webpage (no auth)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret

Notes:

  • Requires ANTHROPIC_API_KEY environment variable
  • Field names are extracted from the page's actual form fields
  • Limited to 2 login attempts before failing
  • CAPTCHAs require manual intervention

Data field summary

ScraperemailpasswordnameOther
---------------------------------------
DirectPdf----
DocSend-
Notion--
LLM Fallback✓*✓*-Dynamic*

*Fields detected dynamically from page analysis

Environment setup (optional)

Only needed for LLM fallback scraper:

export ANTHROPIC_API_KEY=your_key

Optional browser settings:

export BROWSER_HEADLESS=true   # Set false for debugging

Common patterns

Archive a Notion page:

docs-scraper scrape https://notion.so/My-Page-abc123

Download protected DocSend:

docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234

Batch scraping with profiles:

docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite

Output

Success: Local file path (e.g., ~/.docs-scraper/output/1706123456-abc123.pdf)

Blocked: Job ID + required credential types

Troubleshooting

  • Timeout: docs-scraper daemon stop && docs-scraper daemon start
  • Auth fails: docs-scraper jobs list to check pending jobs
  • Disk full: docs-scraper cleanup to remove old PDFs

版本历史

共 1 个版本

  • v0.0.1 当前
    2026-03-28 14:35 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

developer-tools

Hyperliquid CLI (with HIP3 Support)

chrisling-dev
通过HIP-3在Hyperliquid上7x24小时交易加密货币、股票(AAPL、NVDA、TSLA)、指数及大宗商品(黄金、白银)。支持实时仓位盈亏追踪、订单簿监控、多账户管理,以及延迟低于5毫秒的高频交易WebSocket客户端。
★ 5 📥 3,587
security-compliance

Skill Vetter

spclaudehome
AI智能体技能安全预审工具。安装ClawdHub、GitHub等来源技能前,检查风险信号、权限范围及可疑模式。
★ 1,210 📥 266,150
security-compliance

OpenClaw Backup

alex3alex
备份与恢复 OpenClaw 数据。适用于创建备份、设置自动备份计划、从备份恢复或管理备份轮转。处理 ~/.openclaw 目录归档并包含适当的排除规则。
★ 89 📥 30,586