url2pdf-mk — Webpage to PDF & Markdown Converter

> One-click webpage to PDF + Markdown converter. Transforms WeChat articles and web pages into

> offline-readable documents with full images, original layout and styles preserved.

> 一键将网页（尤其是微信文章）转换为可离线阅读的 PDF 和 Markdown 文档，完整保留排版、图片、样式。

📋 概述 | Overview

url2pdf-mk is a professional webpage content scraping and conversion tool, designed specifically for WeChat public account articles and regular web pages. It can:

✅ Convert any webpage into PDF and Markdown dual-format output / 将任意网页转换为 PDF 和 Markdown 双格式输出
✅ Preserve complete images, layout and styles / 完整保留网页中的图片、排版、样式
✅ Support both single-page and batch scraping (via xlsx file) / 支持单篇抓取和批量抓取
✅ Auto-create date-named folders on desktop / 自动按日期在桌面创建归档目录
✅ Smart detection of WeChat article titles and publish dates / 智能识别微信文章标题和发布日期

⚙️ 工作机制 | How It Works

核心技术架构 | Core Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    url2pdf-mk Workflow                           │
│                        工作流程                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input Sources / 输入源                                          │
│  ├── Single URL → scrape.py (Single Page)                       │
│  │   └── 单个 URL → 单篇抓取                                     │
│  ├── Multiple URLs → batch_scrape.py (Browser Batch)            │
│  │   └── 多个 URL → 批量浏览器版                                  │
│  └── xlsx file → Smart Routing (Browser/HTTP)                  │
│      └── xlsx 文件 → 智能路由                                    │
│                                                                 │
│  Scraping Engines / 抓取引擎                                      │
│  ├── Browser (CDP) ──→ Chrome DevTools Protocol                 │
│  │   ├── Launch Chrome/Chromium browser                          │
│  │   │   └── 启动浏览器                                           │
│  │   ├── Control browser via CDP protocol                        │
│  │   │   └── CDP 协议控制                                         │
│  │   ├── Render full page (including JavaScript)                │
│  │   │   └── 渲染完整页面                                         │
│  │   └── Extract DOM + screenshot to PDF                        │
│  │       └── 提取 DOM + 截图生成 PDF                               │
│  │                                                              │
│  └── HTTP (No Browser) ──→ requests library                      │
│      ├── Send HTTP requests to get HTML                         │
│      │   └── HTTP 请求获取 HTML                                   │
│      ├── Parse static HTML structure                            │
│      │   └── 解析静态 HTML                                        │
│      └── Markdown output only (no PDF)                          │
│          └── 仅输出 Markdown                                     │
│                                                                 │
│  Output Processing / 输出处理                                     │
│  ├── HTML → Markdown conversion (preserve image links)          │
│  │   └── HTML → Markdown 转换                                    │
│  ├── HTML → PDF generation (reportlab + full styles)            │
│  │   └── HTML → PDF 生成                                          │
│  └── Filename: {date}_{title}.{md/pdf}                          │
│      └── 文件命名：{发布日期}_{标题}                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

工作流程详解 | Workflow Details

1. Single Page Mode / 单篇抓取模式 (`scrape.py`)

URL → Launch browser → Load page → Extract content → Generate PDF + Markdown → Save to date folder
URL → 启动浏览器 → 加载页面 → 提取内容 → 生成 PDF + Markdown → 保存到日期目录

2. Browser Batch Mode / 批量浏览器版 (`batch_scrape.py`)

xlsx file → Read URL list → Reuse browser instance → Scrape each → Batch output
xlsx 文件 → 读取 URL 列表 → 复用浏览器实例 → 逐个抓取 → 批量输出

Advantage / 优势: Browser instance starts only once, high efficiency / 浏览器实例只启动一次，效率高
Prerequisite / 前提: Chrome/Chromium installed / 系统已安装 Chrome/Chromium

3. HTTP Batch Mode / 批量 HTTP 版 (`batch_http.py`)

xlsx file → Read URL list → Send HTTP requests → Parse HTML → Batch output Markdown
xlsx 文件 → 读取 URL 列表 → 发送 HTTP 请求 → 解析 HTML → 批量输出 Markdown

Advantage / 优势: No browser needed, low resource usage / 无需浏览器，资源占用低
Limitation / 限制: Cannot handle JavaScript-rendered content, no PDF output / 无法处理 JS 内容，无 PDF 输出

智能路由逻辑 | Smart Routing Logic

main.py serves as the unified entry point, automatically selecting the optimal approach based on input type:

Input Type / 输入类型	Chrome Available / Chrome 可用	Chrome Unavailable / Chrome 不可用
---------	------------	--------------
Single URL / 1 个 URL	Browser mode (PDF+MD) / 浏览器版	HTTP mode (MD only) / HTTP 版
Multiple URLs / ≥2 个 URL	Browser batch (PDF+MD) / 浏览器版批量	HTTP batch (MD only) / HTTP 版批量
xlsx file / xlsx 文件	Browser batch (PDF+MD) / 浏览器版批量	HTTP batch (MD only) / HTTP 版批量

CDP (Chrome DevTools Protocol) / CDP 协议说明

CDP is Chrome's debugging protocol. This tool uses it to:

Launch/Connect browser / 启动/连接浏览器: Connect to existing browser or launch new instance / 支持连接已有浏览器或启动新实例
Page control / 页面控制: Navigate, wait for load, execute JavaScript / 导航、等待加载、执行 JavaScript
Content extraction / 内容提取: Get complete DOM structure and computed styles / 获取完整的 DOM 结构和计算样式
PDF generation / PDF 生成: Use browser's native PDF print function / 调用浏览器原生 PDF 打印功能

Security measures / 安全措施:

CDP ports bound to 127.0.0.1 only, external network cannot access / CDP 端口仅绑定本机，外网无法访问
WebSocket connections require Token verification / WebSocket 连接需要 Token 验证
--isolated mode uses independent temporary Profile, no access to user data / --isolated 模式使用临时 Profile，不访问用户数据

🖥️ 适配环境 | Supported Environments

操作系统支持 | Operating System Support

Operating System / 操作系统	Status / 支持状态	Note / 说明
---------	---------	------
Windows 10/11	✅ Full Support / 完全支持	Chrome or Edge recommended / 推荐 Chrome 或 Edge
macOS 10.15+	✅ Full Support / 完全支持	Chrome, Chromium, Edge supported / 支持 Chrome、Chromium、Edge
Linux	✅ Full Support / 完全支持	Chrome or Chromium required / 需安装 Chrome 或 Chromium
Windows 7/8	⚠️ Limited / 有限支持	Upgrade recommended for best experience / 建议升级以获得最佳体验

浏览器支持 | Browser Support

Browser / 浏览器	Status / 支持状态	Version / 版本要求
--------	---------	---------
Google Chrome	✅ Recommended / 推荐	80+
Microsoft Edge (Chromium)	✅ Supported / 支持	80+
Chromium	✅ Supported / 支持	80+
Others / 其他浏览器	❌ Not supported / 不支持	Chromium-based only / 仅支持 Chromium 内核

Python 环境要求 | Python Environment Requirements

Component / 组件	Version / 版本要求	Purpose / 用途
------	---------	------
Python	3.7+	Runtime environment / 运行时环境
pip	Latest / 最新版	Package manager / 包管理器

📦 基础组件依赖 | Dependencies

Required Dependencies / 必需依赖 (must install before running)

pip install websockets openpyxl requests reportlab

Package / 依赖包	Version / 版本要求	Purpose / 用途
--------	---------	------
`websockets`	10.0+	CDP WebSocket communication / CDP WebSocket 通信
`openpyxl`	3.0+	Read xlsx files (batch mode) / 读取 xlsx 文件（批量模式）
`requests`	2.25+	HTTP requests (HTTP mode) / HTTP 请求（HTTP 版抓取）
`reportlab`	3.6+	PDF document generation / PDF 文档生成

Optional Dependencies / 可选依赖

Package / 依赖包	Purpose / 用途	Note / 说明
--------	------	------
`Pillow`	Image processing / 图片处理	PDF image optimization (auto-installed) / PDF 图片优化（自动安装）
`beautifulsoup4`	HTML parsing / HTML 解析	Enhanced content extraction (built-in) / 增强内容提取（已内置）

System Requirements / 系统组件要求

Component / 组件	Requirement / 说明
------	------
Chrome/Chromium/Edge	Required for browser mode / 浏览器版抓取必需
Network / 网络连接	Required to access target webpages / 访问目标网页必需
Disk space / 磁盘空间	100MB+ recommended / 建议预留 100MB+

快速环境检查 | Quick Environment Check

Run the following commands to check if the environment is ready:

# Check Python version
python3 --version  # Should show 3.7+

# 检查依赖是否已安装
python3 -c "import websockets, openpyxl, requests, reportlab; print('✅ All dependencies installed')"

# Check if Chrome is available
# Windows
where chrome
# macOS
which "Google Chrome"
# Linux
which google-chrome || which chromium-browser

🔒 安全说明（必读）| Security

Mode / 模式	Command / 命令	Profile	Use Case / 适用场景
------	------	---------	----------
Isolated (Recommended) / 隔离（推荐）	`main.py --isolated`	Temporary isolated Profile, no Cookie / 临时隔离 Profile，无 Cookie	Public content / 公开内容
Default / 默认	`main.py`	Reuse Chrome real Profile / 复用 Chrome 真实 Profile	WeChat articles requiring login / 需登录态的微信文章

风险与缓解 | Risks & Mitigations

Risk / 风险项	Description / 说明	Mitigation / 缓解措施
--------	------	----------
Profile reuse / Profile 复用	Can access Chrome cookies/login sessions / 可访问 Chrome Cookie/登录态	✅ Always use `--isolated` for public content / 公开内容一律用 `--isolated`
CDP remote debugging / CDP 远程调试	Enables Chrome DevTools protocol / 开启 Chrome DevTools 协议	Only listens on `127.0.0.1:9222~9232` (localhost) / 仅监听本机
CDP Proxy	Creates local proxy daemon / 创建本地代理守护进程	Only bound to `127.0.0.1`, external access blocked / 仅绑定本机；WS 需 Token 验证
Browser launch / 浏览器启动	May reuse existing browser window / 可能复用已有浏览器窗口	`--isolated` always launches independent instance / 始终启动独立实例
sys.path	Only adds skill directory, no external path access / 仅添加本技能目录	✅ Already securely isolated / 已安全隔离
pip dependencies / pip 依赖	No longer auto-install at runtime / 不再运行时自动安装	✅ Pre-installed + error prompt / 预装 + 报错提示

⚠️ 重要安全警告 | Important Security Warnings

> ⭐ For maximum safety, prefer --isolated mode or HTTP batch mode (batch_http.py) so the tool never touches your real browser profile or cookies.

Cookie and Login Access / Cookie 和登录态访问: Default mode connects to your existing Chrome instance, allowing access to browser cookies and logged-in sessions. Only use default mode when you trust the code and need to scrape pages requiring login.

CDP Proxy — Local Port & State Files / CDP 代理 — 本地端口与状态文件: The CDP proxy daemon opens a local listening port (default 9223) and writes PID/state files to your temp directory (named cdp-proxy-). After each task, manually check and delete these temp files if auto-cleanup failed.

Temporary Files / 临时文件: Scripts create temporary files (xlsx, proxy state, pid files) in system temp directory. These are auto-cleaned after task completion. For extra security, manually check and delete relevant files in temp directory after tasks.

CDP Port Security / CDP 端口安全: CDP debugging ports are bound to 127.0.0.1 only, external network cannot access. WebSocket connections require Token verification.

Don't Run as Root / 不要以 root 运行: Run with normal user privileges recommended / 建议以普通用户权限运行此工具。

VM / Sandbox / 虚拟机 / 沙箱: If you must run browser mode with your real profile, do so inside a VM or disposable environment to limit exposure.

Extra Assurance / 如需额外保证: Modify browser_launcher.py to force isolated profile only.

用法 | Usage

统一入口（推荐）| Unified Entry (Recommended)

# Single page / 单篇
python3 scripts/main.py "https://mp.weixin.qq.com/s?__biz=..."

# Batch (multiple URLs) / 批量（多个 URL）
python3 scripts/main.py "https://url1" "https://url2" "https://url3"

# Batch (xlsx file) / 批量（xlsx 文件）
python3 scripts/main.py /path/to/urls.xlsx

# Isolated mode (no login) / 隔离模式（不使用登录态）
python3 scripts/main.py --isolated "https://..."
python3 scripts/main.py --isolated /path/to/urls.xlsx

Smart Routing / 智能路由

Input / 输入	Processing / 处理方式
------	----------
Single URL / 1 个 URL	Single-page full scraping (PDF + Markdown) / 单篇完整抓取
Multiple URLs / ≥2 个 URL	Batch scraping: Browser mode if Chrome available, otherwise HTTP mode / Chrome 可用 → 浏览器版，否则 → HTTP 版
xlsx file / xlsx 文件	Auto-routed based on URL count / 根据 URL 数量自动路由

xlsx 格式 | xlsx Format

Column / 列	Content / 内容
----	------
B (2nd) / B（第2列）	Article title / 文章标题
C (3rd) / C（第3列）	Publish date / 发布日期
F (6th) / F（第6列）	Article URL / 文章 URL

输出 | Output

~/Desktop/{current date}/
├── 2025-04-03_article-title.md
└── 2025-04-03_article-title.pdf

脚本说明 | Scripts

Script / 脚本	Purpose / 用途
------	------
`main.py`	Unified entry, smart routing / 统一入口，智能路由
`scrape.py`	Single-page scraping, PDF + Markdown / 单篇抓取，PDF + Markdown
`batch_scrape.py`	Browser batch (requires Chrome) / 批量浏览器版（需 Chrome）
`batch_http.py`	HTTP batch (no browser, MD only) / 批量 HTTP 版（无浏览器，仅 Markdown）

已知限制 | Known Limitations

Video/Audio / 视频/音频: PDF keeps cover, Markdown keeps links / PDF 保留封面，Markdown 保留链接
WeChat Mini Programs / 微信小程序: Static content only / 仅提取静态内容
Anti-crawl limits / 反爬限制: HTTP mode may fail, switch to browser mode / HTTP 版可能失败，改用浏览器版

更新日志 | Changelog

v1.1.1（2026-04-14）— 安全描述增强

SKILL.md description metadata 新增安全警告：
明确告知浏览器默认模式会复用 Chrome Profile，可访问 Cookie/登录态
提示 CDP 代理会打开本地端口并写入临时状态文件
推荐公开内容优先使用 --isolated
推荐规则强化：
新增最安全方案：HTTP 批量模式（batch_http.py），无需浏览器，不访问 Profile
所有推荐规则前加 ⭐ 标识
⚠️ 安全警告章节扩充：
新增 CDP 代理本地端口与 cdp-proxy- 状态文件提示
新增 VM / 沙箱运行建议
临时文件手动清理建议

v1.1.0（2026-04-14）— English-first bilingual formatting

🌐 全面中英双语调整 / English-first throughout:
All section titles adjusted to "English | 中文" format
All tables adjusted to English first, Chinese second
All descriptions adjusted to English first, Chinese second
Architecture diagram converted to bilingual layout
Security warnings, usage instructions, and known limitations all bilingual
description metadata: EN description placed before Chinese

v1.0.9（2026-04-14）— 文档增强

SKILL.md 文档大幅更新：

📋 新增「概述」章节 / Overview section: 简要说明工具核心功能定位
⚙️ 新增「工作机制」/ How It Works 章节:
核心技术架构流程图
工作流程详解（单篇抓取、批量浏览器版、批量 HTTP 版）
智能路由逻辑表格
CDP 协议技术说明
🖥️ 新增「适配环境」/ Supported Environments 章节:
操作系统支持表（Windows/macOS/Linux）
浏览器支持表（Chrome/Edge/Chromium 版本要求）
Python 环境要求
📦 新增「基础组件依赖」/ Dependencies 章节:
必需依赖包及版本要求（websockets、openpyxl、requests、reportlab）
可选依赖说明
系统组件要求
快速环境检查命令
便于 SkillHub 平台海外用户理解工具功能

文档结构优化：

将「环境」章节整合到「基础组件依赖」中，避免重复
统一表格格式，提升可读性

v1.0.8（2026-04-13）— 安全增强

SKILL.md：新增⚠️ 重要安全警告章节，包含：
Cookie 和登录态访问警告
临时文件安全说明
CDP 端口安全说明
不要以 root 运行的建议
VM/disposable 环境运行建议
main.py：临时文件清理添加日志输出，清理成功/失败都会提示

v1.0.7（2026-04-13）— 文件名修复

批量抓取时，文件名（PDF + Markdown）现在使用从网页提取的真实标题和日期
修复了之前使用 xlsx 默认标题（如"文章1"）的问题
batch_scrape.py：在提取网页内容后，用网页获取的真实标题覆盖 xlsx 中的标题

v1.0.5（2026-04-13）— 安全修复

移除 batch_http.py 中的硬编码 xlsx 默认路径
所有脚本：pip 依赖从运行时 auto-install 改为预装检查 + 报错提示（安全加固）
batch_http.py 新增系统目录路径安全校验
SKILL.md：调整安全说明，--isolated 标记为公开内容首选

v1.0.4（2026-04-13）

新增 --isolated 隔离模式参数
移除硬编码路径，必须提供 xlsx 路径
新增 --output 可指定输出目录
强化安全文档

v1.0.1（2026-04-12）

新增批量抓取脚本
修复多项 bug

版本历史

共 4 个版本

v1.1.1 Initial release 当前

2026-05-17 21:43 安全安全
v1.0.2 优化了Windows环境的适配，修正了桌面目录名称BUG

2026-04-13 15:27 安全安全
v1.0.1 修正了批量处理网页需要多次授权的问题

2026-04-12 14:40 安全
v1.0.0 Initial release

2026-04-12 11:48 安全

url2pdf-mk

概述