> One-click webpage to PDF + Markdown converter. Transforms WeChat articles and web pages into
> offline-readable documents with full images, original layout and styles preserved.
> 一键将网页(尤其是微信文章)转换为可离线阅读的 PDF 和 Markdown 文档,完整保留排版、图片、样式。
url2pdf-mk is a professional webpage content scraping and conversion tool, designed specifically for WeChat public account articles and regular web pages. It can:
┌─────────────────────────────────────────────────────────────────┐
│ url2pdf-mk Workflow │
│ 工作流程 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Input Sources / 输入源 │
│ ├── Single URL → scrape.py (Single Page) │
│ │ └── 单个 URL → 单篇抓取 │
│ ├── Multiple URLs → batch_scrape.py (Browser Batch) │
│ │ └── 多个 URL → 批量浏览器版 │
│ └── xlsx file → Smart Routing (Browser/HTTP) │
│ └── xlsx 文件 → 智能路由 │
│ │
│ Scraping Engines / 抓取引擎 │
│ ├── Browser (CDP) ──→ Chrome DevTools Protocol │
│ │ ├── Launch Chrome/Chromium browser │
│ │ │ └── 启动浏览器 │
│ │ ├── Control browser via CDP protocol │
│ │ │ └── CDP 协议控制 │
│ │ ├── Render full page (including JavaScript) │
│ │ │ └── 渲染完整页面 │
│ │ └── Extract DOM + screenshot to PDF │
│ │ └── 提取 DOM + 截图生成 PDF │
│ │ │
│ └── HTTP (No Browser) ──→ requests library │
│ ├── Send HTTP requests to get HTML │
│ │ └── HTTP 请求获取 HTML │
│ ├── Parse static HTML structure │
│ │ └── 解析静态 HTML │
│ └── Markdown output only (no PDF) │
│ └── 仅输出 Markdown │
│ │
│ Output Processing / 输出处理 │
│ ├── HTML → Markdown conversion (preserve image links) │
│ │ └── HTML → Markdown 转换 │
│ ├── HTML → PDF generation (reportlab + full styles) │
│ │ └── HTML → PDF 生成 │
│ └── Filename: {date}_{title}.{md/pdf} │
│ └── 文件命名:{发布日期}_{标题} │
│ │
└─────────────────────────────────────────────────────────────────┘
Single page (`scrape.py`): URL → Launch browser → Load page → Extract content → Generate PDF + Markdown → Save to date folder
URL → 启动浏览器 → 加载页面 → 提取内容 → 生成 PDF + Markdown → 保存到日期目录

Browser batch (`batch_scrape.py`): xlsx file → Read URL list → Reuse browser instance → Scrape each → Batch output
xlsx 文件 → 读取 URL 列表 → 复用浏览器实例 → 逐个抓取 → 批量输出

HTTP batch (`batch_http.py`): xlsx file → Read URL list → Send HTTP requests → Parse HTML → Batch output Markdown
xlsx 文件 → 读取 URL 列表 → 发送 HTTP 请求 → 解析 HTML → 批量输出 Markdown
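
The "Parse HTML → output Markdown" step can be illustrated with a minimal sketch. This is not the converter used by `batch_http.py`; it is a standard-library stand-in that handles only headings, paragraphs, links and images. The class and function names are hypothetical, and the fallback from `data-src` to `src` for lazy-loaded images is an assumption about the input HTML.

```python
import re
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Tiny HTML-to-Markdown converter: headings, paragraphs, links, images."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []   # Markdown fragments, joined at the end
        self.href = None  # target of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self.href = attrs.get("href")
            self.parts.append("[")
        elif tag == "img":
            # Assumption: lazy-loaded images keep the real URL in data-src.
            src = attrs.get("data-src") or attrs.get("src") or ""
            self.parts.append(f"![{attrs.get('alt') or ''}]({src})")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Collapse runs of whitespace but keep single spaces around inline tags.
        text = re.sub(r"\s+", " ", data)
        if text.strip():
            self.parts.append(text)


def html_to_markdown(html):
    """Convert an HTML fragment to Markdown, keeping image links intact."""
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Because image URLs are copied as-is into `![alt](url)`, the Markdown output keeps pointing at the original images, which matches the "preserve image links" behaviour described in the workflow.
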
main.py serves as the unified entry point, automatically selecting the optimal approach based on input type:
| Input Type / 输入类型 | Chrome Available / Chrome 可用 | Chrome Unavailable / Chrome 不可用 |
|---|---|---|
| Single URL / 1 个 URL | Browser mode (PDF+MD) / 浏览器版 | HTTP mode (MD only) / HTTP 版 |
| Multiple URLs / ≥2 个 URL | Browser batch (PDF+MD) / 浏览器版批量 | HTTP batch (MD only) / HTTP 版批量 |
| xlsx file / xlsx 文件 | Browser batch (PDF+MD) / 浏览器版批量 | HTTP batch (MD only) / HTTP 版批量 |
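
The routing table above can be expressed as a small decision function. The following is a sketch under stated assumptions, not the code in `main.py`: `Route` and `choose_route` are hypothetical names, and the inputs are assumed to have already been classified as either a list of URLs or one xlsx file.

```python
from typing import NamedTuple, Tuple


class Route(NamedTuple):
    engine: str               # "browser" (CDP) or "http" (requests)
    batch: bool               # True for multiple URLs or an xlsx file
    outputs: Tuple[str, ...]  # file types the chosen engine can produce


def choose_route(input_kind: str, url_count: int, chrome_available: bool) -> Route:
    """Pick engine and outputs from the input type and Chrome availability."""
    if input_kind not in ("urls", "xlsx"):
        raise ValueError(f"unknown input kind: {input_kind!r}")
    if input_kind == "urls" and url_count < 1:
        raise ValueError("at least one URL is required")
    batch = input_kind == "xlsx" or url_count >= 2
    if chrome_available:
        return Route("browser", batch, ("pdf", "md"))
    # Without Chrome only the HTTP engine is usable, and it emits Markdown only.
    return Route("http", batch, ("md",))
```

For example, `choose_route("urls", 1, chrome_available=False)` selects the HTTP engine with Markdown-only output, matching the first row of the table.
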
CDP is Chrome's debugging protocol. This tool uses it to:
Security measures / 安全措施:

- The CDP port is bound to 127.0.0.1 only, so the external network cannot access it / CDP 端口仅绑定本机,外网无法访问
- `--isolated` mode uses an independent temporary Profile, with no access to user data / --isolated 模式使用临时 Profile,不访问用户数据

| Operating System / 操作系统 | Status / 支持状态 | Note / 说明 |
|---|---|---|
| Windows 10/11 | ✅ Full Support / 完全支持 | Chrome or Edge recommended / 推荐 Chrome 或 Edge |
| macOS 10.15+ | ✅ Full Support / 完全支持 | Chrome, Chromium, Edge supported / 支持 Chrome、Chromium、Edge |
| Linux | ✅ Full Support / 完全支持 | Chrome or Chromium required / 需安装 Chrome 或 Chromium |
| Windows 7/8 | ⚠️ Limited / 有限支持 | Upgrade recommended for best experience / 建议升级以获得最佳体验 |
| Browser / 浏览器 | Status / 支持状态 | Version / 版本要求 |
|---|---|---|
| Google Chrome | ✅ Recommended / 推荐 | 80+ |
| Microsoft Edge (Chromium) | ✅ Supported / 支持 | 80+ |
| Chromium | ✅ Supported / 支持 | 80+ |
| Others / 其他浏览器 | ❌ Not supported / 不支持 | Chromium-based only / 仅支持 Chromium 内核 |
| Component / 组件 | Version / 版本要求 | Purpose / 用途 |
|---|---|---|
| Python | 3.7+ | Runtime environment / 运行时环境 |
| pip | Latest / 最新版 | Package manager / 包管理器 |
pip install websockets openpyxl requests reportlab
| Package / 依赖包 | Version / 版本要求 | Purpose / 用途 |
|---|---|---|
| `websockets` | 10.0+ | CDP WebSocket communication / CDP WebSocket 通信 |
| `openpyxl` | 3.0+ | Read xlsx files (batch mode) / 读取 xlsx 文件(批量模式) |
| `requests` | 2.25+ | HTTP requests (HTTP mode) / HTTP 请求(HTTP 版抓取) |
| `reportlab` | 3.6+ | PDF document generation / PDF 文档生成 |
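
Since the required packages are expected to be pre-installed, a startup check that names the missing ones can be useful. The sketch below is an illustration only: `missing_packages` is a hypothetical helper, and it checks importability rather than the minimum versions listed in the table.

```python
import importlib.util

# Import names of the required packages from the table above.
REQUIRED = ["websockets", "openpyxl", "requests", "reportlab"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found, in the order given."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Report instead of installing, so nothing is fetched at runtime.
        print("Missing packages. Install with: pip install " + " ".join(missing))
    else:
        print("All dependencies installed")
```
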
| Package / 依赖包 | Purpose / 用途 | Note / 说明 |
|---|---|---|
| `Pillow` | Image processing / 图片处理 | PDF image optimization (auto-installed) / PDF 图片优化(自动安装) |
| `beautifulsoup4` | HTML parsing / HTML 解析 | Enhanced content extraction (built-in) / 增强内容提取(已内置) |
| Component / 组件 | Requirement / 说明 |
|---|---|
| Chrome/Chromium/Edge | Required for browser mode / 浏览器版抓取必需 |
| Network / 网络连接 | Required to access target webpages / 访问目标网页必需 |
| Disk space / 磁盘空间 | 100MB+ recommended / 建议预留 100MB+ |
Run the following commands to check if the environment is ready:
# Check Python version
python3 --version # Should show 3.7+
# Check that the dependencies are installed
python3 -c "import websockets, openpyxl, requests, reportlab; print('✅ All dependencies installed')"
# Check if Chrome is available
# Windows
where chrome
# macOS
ls "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
# Linux
which google-chrome || which chromium-browser
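
The per-platform Chrome checks above can also be done from Python. This is a minimal sketch, not the detection logic the tool ships with: `find_chrome` and `known_install_paths` are hypothetical names, and the listed install locations are common defaults that may differ on a given machine.

```python
import os
import shutil
import sys


def known_install_paths():
    """Common default install locations (assumed, not exhaustive)."""
    if sys.platform == "darwin":
        return [
            "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
            "/Applications/Chromium.app/Contents/MacOS/Chromium",
            "/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge",
        ]
    if sys.platform.startswith("win"):
        roots = [os.environ.get(v) for v in ("PROGRAMFILES", "PROGRAMFILES(X86)", "LOCALAPPDATA")]
        return [os.path.join(r, "Google", "Chrome", "Application", "chrome.exe") for r in roots if r]
    return []


def find_chrome(extra_paths=()):
    """Return the path of a Chromium-based browser, or None if none is found."""
    for path in list(extra_paths) + known_install_paths():
        if path and os.path.isfile(path):
            return path
    # Fall back to executables on PATH (the usual case on Linux).
    for name in ("google-chrome", "google-chrome-stable", "chromium", "chromium-browser", "chrome", "msedge"):
        found = shutil.which(name)
        if found:
            return found
    return None
```

A `None` result corresponds to the "Chrome Unavailable" column of the routing table, where only Markdown output is possible.
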
| Mode / 模式 | Command / 命令 | Profile | Use Case / 适用场景 |
|---|---|---|---|
| Isolated (Recommended) / 隔离(推荐) | main.py --isolated | Temporary isolated Profile, no Cookie / 临时隔离 Profile,无 Cookie | Public content / 公开内容 |
| Default / 默认 | main.py | Reuse Chrome real Profile / 复用 Chrome 真实 Profile | WeChat articles requiring login / 需登录态的微信文章 |
| Risk / 风险项 | Description / 说明 | Mitigation / 缓解措施 |
|---|---|---|
| Profile reuse / Profile 复用 | Can access Chrome cookies/login sessions / 可访问 Chrome Cookie/登录态 | ✅ Always use --isolated for public content / 公开内容一律用 --isolated |
| CDP remote debugging / CDP 远程调试 | Enables Chrome DevTools protocol / 开启 Chrome DevTools 协议 | Only listens on 127.0.0.1:9222~9232 (localhost) / 仅监听本机 |
| CDP Proxy | Creates local proxy daemon / 创建本地代理守护进程 | Only bound to 127.0.0.1, external access blocked; WebSocket connections require Token verification / 仅绑定本机;WS 需 Token 验证 |
| Browser launch / 浏览器启动 | May reuse existing browser window / 可能复用已有浏览器窗口 | --isolated always launches independent instance / 始终启动独立实例 |
| sys.path | Only adds skill directory, no external path access / 仅添加本技能目录 | ✅ Already securely isolated / 已安全隔离 |
| pip dependencies / pip 依赖 | No longer auto-install at runtime / 不再运行时自动安装 | ✅ Pre-installed + error prompt / 预装 + 报错提示 |
> ⚠️ Default mode reuses your real Chrome profile — it can read cookies and logged-in sessions. Always prefer --isolated for public content to avoid exposing your browser session.
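
To make the isolation idea concrete, here is a sketch of how a browser could be started with a throwaway profile and a local debugging port in the 9222 to 9232 range. It is not the code in `browser_launcher.py`: the function names and the `url2pdf-profile-` prefix are hypothetical. The Chrome switches themselves (`--remote-debugging-port`, `--user-data-dir`, `--no-first-run`, `--no-default-browser-check`) are standard, and Chrome's debugging port listens on localhost by default.

```python
import socket
import tempfile


def pick_local_port(start=9222, end=9232):
    """Return the first port in [start, end] that is free on 127.0.0.1."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")


def isolated_chrome_args(chrome_path, port, profile_dir=None):
    """Build a Chrome command line that uses a temporary, empty profile."""
    # A fresh directory means no cookies or logins from the real profile.
    profile_dir = profile_dir or tempfile.mkdtemp(prefix="url2pdf-profile-")
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
        "--no-first-run",
        "--no-default-browser-check",
    ]
```

The resulting list can be passed to `subprocess.Popen`. Deleting the temporary profile directory after the run is left to the caller in this sketch.
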
- `--isolated` (no login required, uses a temporary isolated profile / 无需登录,使用临时隔离 Profile)
- HTTP batch mode (`batch_http.py`): no browser, no profile access / 无需浏览器,不访问 Profile

> ⭐ For maximum safety, prefer --isolated mode or HTTP batch mode (batch_http.py) so the tool never touches your real browser profile or cookies.

- CDP proxy state files (prefix `cdp-proxy-`). After each task, manually check and delete these temp files if auto-cleanup failed.
- The CDP proxy is bound to 127.0.0.1 only, so the external network cannot access it. WebSocket connections require Token verification.
- … `browser_launcher.py` to force isolated profile only.

# Single page / 单篇
python3 scripts/main.py "https://mp.weixin.qq.com/s?__biz=..."
# Batch (multiple URLs) / 批量(多个 URL)
python3 scripts/main.py "https://url1" "https://url2" "https://url3"
# Batch (xlsx file) / 批量(xlsx 文件)
python3 scripts/main.py /path/to/urls.xlsx
# Isolated mode (no login) / 隔离模式(不使用登录态)
python3 scripts/main.py --isolated "https://..."
python3 scripts/main.py --isolated /path/to/urls.xlsx
| Input / 输入 | Processing / 处理方式 |
|---|---|
| Single URL / 1 个 URL | Single-page full scraping (PDF + Markdown) / 单篇完整抓取 |
| Multiple URLs / ≥2 个 URL | Batch scraping: Browser mode if Chrome available, otherwise HTTP mode / Chrome 可用 → 浏览器版,否则 → HTTP 版 |
| xlsx file / xlsx 文件 | Auto-routed based on URL count / 根据 URL 数量自动路由 |
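
Before any routing can happen, the command-line arguments have to be recognised as URLs or as an xlsx path. The sketch below shows one way to do that; `classify_inputs` is a hypothetical helper, and rejecting a mix of URLs and files is an assumption rather than documented behaviour of `main.py`.

```python
from urllib.parse import urlparse


def classify_inputs(args):
    """Return ("urls", [url, ...]) or ("xlsx", path) for a list of CLI arguments."""
    urls, files = [], []
    for arg in args:
        if urlparse(arg).scheme in ("http", "https"):
            urls.append(arg)
        elif arg.lower().endswith(".xlsx"):
            files.append(arg)
        else:
            raise ValueError(f"not a URL or .xlsx path: {arg!r}")
    if urls and files:
        # Assumption: URLs and an xlsx file are not combined in one run.
        raise ValueError("pass either URLs or one xlsx file, not both")
    if len(files) > 1:
        raise ValueError("only one xlsx file is supported in this sketch")
    if files:
        return ("xlsx", files[0])
    if urls:
        return ("urls", urls)
    raise ValueError("no input given")
```
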
| Column / 列 | Content / 内容 |
|---|---|
| B (2nd) / B(第2列) | Article title / 文章标题 |
| C (3rd) / C(第3列) | Publish date / 发布日期 |
| F (6th) / F(第6列) | Article URL / 文章 URL |
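
Reading the three columns above can be sketched as follows. This is an illustration, not the parser in `batch_scrape.py` or `batch_http.py`: the function names are hypothetical, and skipping rows whose column F is not an http(s) URL (for example a header row) is an assumption.

```python
def rows_to_articles(rows):
    """Turn worksheet rows (tuples of cell values) into article dicts.

    Column B (index 1) is the title, C (index 2) the publish date,
    and F (index 5) the article URL.
    """
    articles = []
    for row in rows:
        if len(row) < 6:
            continue  # row too short to contain column F
        title, publish_date, url = row[1], row[2], row[5]
        if not isinstance(url, str):
            continue  # empty cell
        url = url.strip()
        if not url.startswith(("http://", "https://")):
            continue  # header row or non-URL content (assumed skippable)
        articles.append({"title": title, "date": publish_date, "url": url})
    return articles


def load_articles(path):
    """Read the active sheet of an xlsx file with openpyxl."""
    from openpyxl import load_workbook  # imported lazily; only needed for xlsx input

    workbook = load_workbook(path, read_only=True, data_only=True)
    try:
        return rows_to_articles(workbook.active.iter_rows(values_only=True))
    finally:
        workbook.close()
```

Keeping the row logic separate from the file access means it can be exercised with plain tuples, without creating a spreadsheet.
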
~/Desktop/{current date}/
├── 2025-04-03_article-title.md
└── 2025-04-03_article-title.pdf
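
The layout above combines a folder named after the current date with files named `{date}_{title}`. A minimal sketch of building those paths is shown below. The helper names are hypothetical, the ISO `YYYY-MM-DD` folder format is an assumption, and the set of characters stripped from titles is one reasonable choice rather than the tool's actual rule.

```python
import re
from datetime import date
from pathlib import Path


def safe_title(title, max_len=80):
    """Remove characters that are invalid in file names and join words with '-'."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "", title)
    cleaned = re.sub(r"\s+", "-", cleaned.strip())
    return cleaned[:max_len] or "untitled"


def output_paths(title, publish_date, base=None, today=None):
    """Return (markdown_path, pdf_path) under <base>/<current date>/."""
    today = today or date.today()
    base = Path(base) if base else Path.home() / "Desktop"
    folder = base / today.isoformat()  # assumed folder format: YYYY-MM-DD
    stem = f"{publish_date.isoformat()}_{safe_title(title)}"
    return folder / f"{stem}.md", folder / f"{stem}.pdf"
```

Note that the folder uses the date of the run while the file name uses the article's publish date, so articles scraped on the same day land together regardless of when they were published.
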
| Script / 脚本 | Purpose / 用途 |
|---|---|
| `main.py` | Unified entry, smart routing / 统一入口,智能路由 |
| `scrape.py` | Single-page scraping, PDF + Markdown / 单篇抓取,PDF + Markdown |
| `batch_scrape.py` | Browser batch (requires Chrome) / 批量浏览器版(需 Chrome) |
| `batch_http.py` | HTTP batch (no browser, MD only) / 批量 HTTP 版(无浏览器,仅 Markdown) |
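- `--isolated`
- HTTP batch mode (`batch_http.py`): no browser required, no Profile access
- `cdp-proxy-` state file notice
- Major update to the SKILL.md documentation:
- Documentation structure improvements:
- `batch_scrape.py`: after extracting the page content, the real title fetched from the page overrides the title in the xlsx
- Hard-coded default xlsx path in `batch_http.py`
- `batch_http.py`: added path safety validation for system directories
- `--isolated` marked as the preferred option for public content
- `--isolated` isolated-mode parameter
- `--output` can specify the output directory

4 versions in total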