
Html2md

Convert HTML pages to clean, agent-friendly markdown using Readability + Turndown. Strips navigation, ads, footers, cookie banners, social CTAs. Supports URL...
saikatkumardey
Uncategorized clawhub v1.0.1 1 version 100000 Key: not required

Overview

html2md

Aggressive HTML-to-markdown converter for AI agents. Mozilla Readability isolates main content, Turndown converts to markdown, then heavy post-processing strips remaining noise.

> Full flag reference and advanced examples: references/usage.md

Setup

cd <skill-dir>/scripts
npm install
npm link        # makes `html2md` globally available

Requires Node.js 22+.

Quick Start

html2md https://example.com                    # fetch + convert
html2md --file page.html                       # local HTML file
cat page.html | html2md --stdin                # pipe from stdin
html2md --max-tokens 2000 https://example.com  # budget-aware truncation
html2md --no-links https://example.com         # strip hrefs, keep text
html2md --json https://example.com             # JSON: {title, url, markdown, tokens}

Key Features

  • Readability extraction: removes navbars, sidebars, ads, and cookie banners. Falls back to the cleaned full page when Readability returns too little (e.g. HN's table layout).
  • Token budgeting: --max-tokens N keeps all headings, fills the remaining budget in document order, and appends [truncated — N more tokens]. Token counts use a 1 token ≈ 4 chars heuristic.
  • Post-processing: strips HTML comments, zero-width chars, social CTAs, breadcrumbs, and empty headings, and collapses excess blank lines.
  • Error handling: bad URLs, timeouts (15s), non-HTML content, and missing files all exit with code 1 and a descriptive message on stderr.
  • Output modes: plain markdown, or --json for programmatic use.
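
The token-budgeting behaviour described above can be sketched roughly as follows. This is a minimal illustration, not html2md's actual implementation: the function names (`estimateTokens`, `truncateToBudget`), the choice to split on blank lines, and the rule of stopping at the first body block that does not fit are all assumptions made for the example.

```javascript
// Hypothetical sketch of heading-preserving truncation; not html2md's real code.
const CHARS_PER_TOKEN = 4; // heuristic from the feature list: 1 token ≈ 4 chars

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function truncateToBudget(markdown, maxTokens) {
  // Treat blank-line-separated chunks as the unit of truncation (assumption).
  const blocks = markdown.split(/\n{2,}/).filter((b) => b.trim() !== "");
  const isHeading = (b) => /^#{1,6}\s/.test(b);

  // Pass 1: headings are always kept, so charge them to the budget first.
  let remaining = maxTokens;
  for (const b of blocks) {
    if (isHeading(b)) remaining -= estimateTokens(b);
  }

  // Pass 2: admit body blocks in document order until one does not fit.
  const kept = [];
  let dropped = 0;
  let full = false;
  for (const b of blocks) {
    if (isHeading(b)) {
      kept.push(b);
      continue;
    }
    const cost = estimateTokens(b);
    if (!full && cost <= remaining) {
      kept.push(b);
      remaining -= cost;
    } else {
      full = true; // later blocks are dropped too, preserving document order
      dropped += cost;
    }
  }

  if (dropped > 0) kept.push(`[truncated — ${dropped} more tokens]`);
  return kept.join("\n\n");
}
```

Charging headings first means the outline of the page survives even under a tight budget, at the cost of leaving less room for body text.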

When to Use vs web_fetch

| Use html2md when | Use web_fetch when |
|---|---|
| Reading pages in cron jobs / sub-agents | Quick one-off fetch in main session |
| Token budget matters (--max-tokens) | Page is a JSON/XML API endpoint |
| Heavy nav/ads/footers to strip | JS rendering not needed |
| Need JSON output | Simple pages |

Security Considerations

html2md fetches URLs and reads local files — that's its job. If you're passing untrusted input:

  • URL fetching: the tool will fetch whatever URL it's given. Don't pass user-controlled URLs without validation if your threat model includes SSRF.
  • File reading: --file reads any path the process can access. In agent workflows, the agent controls the path — this is equivalent to the agent using cat.
  • No shell execution: the tool itself never spawns shells or runs commands. When calling from scripts, use execFileSync (not execSync) to avoid shell injection.
  • No data exfiltration: output goes to stdout only. No network requests beyond the single URL fetch. No telemetry, no analytics, no phone-home.
  • Dependencies: jsdom (a JavaScript DOM implementation), Readability (Mozilla content extractor), Turndown (HTML→markdown). All are widely used open-source libraries.
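
As an illustration of the URL-validation and execFileSync advice above, a calling script might look something like this. It is a sketch under stated assumptions: `isAllowedUrl` and `fetchMarkdown` are hypothetical helper names, `html2md` is assumed to be on PATH (for example via `npm link`), and the address checks are deliberately not exhaustive. In particular, the check does not resolve DNS, so a hostname that points at an internal address still passes.

```javascript
const { execFileSync } = require("node:child_process");
const net = require("node:net");

// Hypothetical pre-flight filter: allow only http(s) URLs whose host is not an
// obvious loopback, private, or link-local literal. Not a complete SSRF defence.
function isAllowedUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return false; // not parseable as a URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  const host = url.hostname.replace(/^\[|\]$/g, ""); // strip IPv6 brackets
  if (host === "localhost" || host.endsWith(".localhost")) return false;

  if (net.isIPv4(host)) {
    const [a, b] = host.split(".").map(Number);
    if (a === 0 || a === 10 || a === 127) return false;
    if (a === 169 && b === 254) return false; // link-local
    if (a === 172 && b >= 16 && b <= 31) return false;
    if (a === 192 && b === 168) return false;
  }
  if (net.isIPv6(host)) {
    const h = host.toLowerCase();
    if (h === "::" || h === "::1") return false; // unspecified, loopback
    if (/^f[cd]/.test(h) || h.startsWith("fe80")) return false; // ULA, link-local
  }
  return true;
}

// Calls html2md without a shell: arguments are passed as an array, so
// characters such as `;` or `$(...)` inside the URL are never interpreted.
function fetchMarkdown(url, maxTokens) {
  if (!isAllowedUrl(url)) throw new Error(`refusing to fetch: ${url}`);
  const args = ["--max-tokens", String(maxTokens), url];
  // The tool's own timeout is 15s; the 20s here is an arbitrary safety margin.
  return execFileSync("html2md", args, { encoding: "utf8", timeout: 20000 });
}

// Usage (requires html2md to be installed):
//   const markdown = fetchMarkdown("https://example.com", 2000);
```

If html2md exits with code 1, execFileSync throws, and the thrown error carries the `status` and `stderr` of the child process, so the caller can log the message and move on.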

Examples

# Read a Paul Graham essay within 2000 tokens
html2md --max-tokens 2000 https://paulgraham.com/greatwork.html

# HN front page as clean text, no link noise
html2md --no-links --no-images https://news.ycombinator.com

# Get token count before committing
html2md --json https://example.com | jq .tokens

# Pipe to file
html2md https://docs.example.com/api > api-docs.md
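
The "token count before committing" check can also be done from a Node script instead of jq. The sketch below assumes the --json output has the shape listed in Quick Start, {title, url, markdown, tokens}, and that `tokens` is a number; `parsePage` and `fitsBudget` are hypothetical helper names, not part of html2md.

```javascript
// Parses the stdout of `html2md --json` and verifies the expected fields exist.
function parsePage(jsonText) {
  const page = JSON.parse(jsonText);
  for (const key of ["title", "url", "markdown", "tokens"]) {
    if (!(key in page)) throw new Error(`missing field: ${key}`);
  }
  return page;
}

// True when the reported token count is a finite number within the budget.
function fitsBudget(page, maxTokens) {
  return Number.isFinite(page.tokens) && page.tokens <= maxTokens;
}

// Usage, where `stdout` holds the captured output of `html2md --json <url>`:
//   const page = parsePage(stdout);
//   if (fitsBudget(page, 2000)) { /* add page.markdown to the prompt */ }
```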

Version History

1 version in total

  • v1.0.1 (current)
    2026-05-12 05:25 Safe Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
