← 返回
未分类 中文

Webpage Export

Export webpages into clean local TXT, DOCX, and PDF files with source metadata, fallback extraction logic, and browser-assisted recovery for difficult pages....
导出网页为整洁的本地TXT、DOCX和PDF文件,支持源元数据、备用提取逻辑和浏览器辅助恢复困难页面。
lilw-yezi lilw-yezi 来源
未分类 clawhub v1.0.1 1 版本 100000 Key: 无需
★ 0
Stars
📥 694
下载
💾 0
安装
1
版本
#latest

概述

Webpage Export

Use this skill to turn a webpage URL into local files that downstream agents can archive, send, or reference.

Core workflow

  1. Run scripts/export_webpage.py to create a TXT snapshot first.
  2. Treat TXT as the baseline extracted record.
  3. Add --docx when the user wants a Word document.
  4. Add --pdf when Chrome/Chromium is available and the user wants a PDF.
  5. Keep the generated JSON metadata file; it records extraction quality, paths, warnings, and partial-failure status for downstream agents.
  6. Save outputs to an explicit --outdir when the user provides one; otherwise let the script use its local default export folder under the current working directory.
  7. For accuracy-sensitive work, keep original title, original URL, and extracted source metadata.

Commands

TXT only

python3 scripts/export_webpage.py "<url>"

TXT + DOCX

python3 scripts/export_webpage.py "<url>" --docx

TXT + PDF

python3 scripts/export_webpage.py "<url>" --pdf

TXT + DOCX + PDF with explicit output folder

python3 scripts/export_webpage.py "<url>" --docx --pdf --outdir ./exports/temp

Runtime requirements

  • Requires python3.
  • Requires curl for baseline webpage fetching.
  • PDF export requires Chrome or Chromium.
  • Browser-assisted fallback requires node and the playwright package.
  • DOCX export on macOS requires textutil.

Safety and execution notes

  • This skill fetches arbitrary URLs and may use a headless browser for difficult pages.
  • Browser-assisted fallback executes page JavaScript and should be used only when needed.
  • Prefer explicit --outdir values for production or shared environments.

What the script does

  • Fetch the page with curl
  • Extract title/source/publish-time when available
  • Try multiple body candidates before falling back to a full-page text snapshot
  • Score extraction quality and emit warnings for suspicious/partial results
  • Strip HTML into readable text for a TXT snapshot
  • Convert TXT to DOCX using textutil on macOS
  • Render webpage to PDF using Chrome/Chromium headless printing when available
  • Emit a JSON metadata file with status, paths, word count, quality, and warnings

Format choice

  • Prefer TXT as the baseline extracted record.
  • Prefer DOCX when the user wants an editable or shareable document.
  • Prefer PDF when the user wants page-like rendering or easier direct viewing.
  • For important work, do not treat PDF as the only source of truth.

Chrome/Chromium PDF path

When the user wants PDF, prefer Chrome/Chromium headless printing because it preserves Chinese text and webpage layout better than ad-hoc PDF generation.

Read references/chrome-pdf-guide.md when:

  • you need the exact Chrome PDF logic
  • PDF output is incomplete or suspicious
  • Chrome emits warnings and you need to judge whether the result is still usable
  • you need fallback decisions

Accuracy and fallbacks

Read references/accuracy-and-fallbacks.md when:

  • source accuracy matters
  • webpage metadata is incomplete
  • a field cannot be extracted cleanly
  • you need fallback behavior after a partial extraction

Delivery decisions

Read references/delivery-rules.md when:

  • deciding whether to deliver TXT, DOCX, PDF, or a combination
  • preparing files for downstream agents or user delivery
  • choosing archive placement under the local workspace

Limitations

  • Some highly dynamic or anti-bot pages may extract only partially.
  • PDF depends on Chrome/Chromium being installed.
  • DOCX depends on macOS textutil.
  • If a page is blocked in lightweight fetch mode, use this skill's curl-based extraction path before giving up.

Accuracy rule

Accuracy is the top standard. Keep original title, original URL, and extracted source metadata. If any field is uncertain, mark it as missing instead of guessing.

版本历史

共 1 个版本

  • v1.0.1 当前
    2026-05-02 03:44 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

data-analysis

Tavily 搜索

jacky1n7
通过 Tavily API 进行网页搜索(Brave 替代方案)。当用户要求搜索网页、查找来源或链接,且 Brave 网页搜索不可用时使用。
★ 273 📥 100,657
data-analysis

AdMapix

fly0pants
AdMapix 原始数据层,提供广告创意、应用、排名、下载/收入及市场元数据。返回 AdMapix API 的结构化 JSON;调用方...
★ 297 📥 141,569
data-analysis

Data Analysis

ivangdavila
{"answer":"数据分析与可视化。查询数据库、生成报告、自动化电子表格,将原始数据转化为清晰可行的见解。适用于:(1) 您……"}
★ 211 📥 69,559