Aggressive HTML-to-markdown converter for AI agents. Mozilla Readability isolates main content, Turndown converts to markdown, then heavy post-processing strips remaining noise.
> Full flag reference and advanced examples: references/usage.md
cd <skill-dir>/scripts
npm install
npm link # makes `html2md` globally available
Requires Node.js 22+.
html2md https://example.com # fetch + convert
html2md --file page.html # local HTML file
cat page.html | html2md --stdin # pipe from stdin
html2md --max-tokens 2000 https://example.com # budget-aware truncation
html2md --no-links https://example.com # strip hrefs, keep text
html2md --json https://example.com # JSON: {title, url, markdown, tokens}
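To make the `--no-links` behaviour ("strip hrefs, keep text") concrete, here is a minimal sketch of that kind of transformation applied to markdown. The function name `stripLinks` and the regular expression are illustrative assumptions, not html2md's actual implementation. The sketch only handles simple inline links and leaves images alone.

```javascript
// Hypothetical illustration of "strip hrefs, keep text"; not html2md's real code.
function stripLinks(markdown) {
  // (?<!!) skips image syntax such as ![alt](src), which starts with "!".
  // Simple inline links only: no nested brackets, no parentheses inside the URL.
  return markdown.replace(/(?<!!)\[([^\]]+)\]\([^)]*\)/g, "$1");
}

console.log(stripLinks("See [the docs](https://example.com/docs) and ![logo](logo.png)."));
```

Dropping the URLs matters mostly for token budgets: long hrefs cost tokens without adding readable content.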
- when Readability returns too little (e.g. HN's table layout).
- `--max-tokens N` keeps all headings, fills the remaining budget in document order, and appends `[truncated — N more tokens]`. It uses a heuristic of 1 token ≈ 4 characters.
- `--json` for programmatic use.

Choosing between html2md and web_fetch:

| Use html2md when | Use web_fetch when |
|---|---|
| Reading pages in cron jobs / sub-agents | Quick one-off fetch in main session |
| Token budget matters (`--max-tokens`) | Page is a JSON/XML API endpoint |
| Heavy nav/ads/footers to strip | JS rendering not needed |
| Need JSON output | Simple pages |
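The `--max-tokens` behaviour described above (keep every heading, fill the rest of the budget in document order, append a truncation marker) can be sketched roughly as follows. This is an assumption-laden illustration, not html2md's source: the names `estimateTokens` and `truncateMarkdown`, the blank-line block splitting, and the choice to stop filling at the first block that does not fit are all hypothetical.

```javascript
// Hypothetical sketch of heading-preserving truncation; not html2md's actual code.
const estimateTokens = (text) => Math.ceil(text.length / 4); // 1 token ≈ 4 chars

function truncateMarkdown(markdown, maxTokens) {
  // Assumption: blocks are separated by blank lines.
  const blocks = markdown.split(/\n{2,}/).filter((b) => b.trim() !== "");
  const isHeading = (b) => /^#{1,6}\s/.test(b);

  const total = blocks.reduce((sum, b) => sum + estimateTokens(b), 0);
  if (total <= maxTokens) return markdown; // already within budget

  // Headings are always kept, so charge them against the budget first.
  const headingCost = blocks
    .filter(isHeading)
    .reduce((sum, b) => sum + estimateTokens(b), 0);
  let remaining = maxTokens - headingCost;

  const kept = [];
  let dropped = 0;
  let full = false;
  for (const block of blocks) {
    const cost = estimateTokens(block);
    if (isHeading(block)) {
      kept.push(block);
    } else if (!full && cost <= remaining) {
      kept.push(block);
      remaining -= cost;
    } else {
      full = true; // assumption: stop filling at the first block that does not fit
      dropped += cost;
    }
  }
  kept.push(`[truncated — ${dropped} more tokens]`);
  return kept.join("\n\n");
}
```

Keeping headings even after the budget runs out leaves the reader (or agent) with an outline of what was cut, which helps when deciding whether to re-fetch with a larger budget.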
html2md fetches URLs and reads local files — that's its job. If you're passing untrusted input:
- `--file` reads any path the process can access. In agent workflows, the agent controls the path, so this is equivalent to the agent using `cat`.
- `execFileSync` (not `execSync`) to avoid shell injection.

# Read a Paul Graham essay within 2000 tokens
html2md --max-tokens 2000 https://paulgraham.com/greatwork.html
# HN front page as clean text, no link noise
html2md --no-links --no-images https://news.ycombinator.com
# Get token count before committing
html2md --json https://example.com | jq .tokens
# Pipe to file
html2md https://docs.example.com/api > api-docs.md
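For programmatic use from Node.js, the same `execFileSync`-over-`execSync` choice applies on the caller's side. The sketch below assumes `html2md` is on the `PATH` (for example via `npm link`) in a POSIX-like environment; the helper names `buildArgs` and `fetchPage` are hypothetical, and only flags listed in this document are used.

```javascript
const { execFileSync } = require("node:child_process");

// Build the argument array for html2md from a URL and optional settings.
function buildArgs(url, { maxTokens, noLinks = false } = {}) {
  const args = ["--json"];
  if (Number.isInteger(maxTokens) && maxTokens > 0) {
    args.push("--max-tokens", String(maxTokens));
  }
  if (noLinks) args.push("--no-links");
  args.push(url);
  return args;
}

// Hypothetical helper: run html2md and parse its JSON output.
function fetchPage(url, options) {
  // execFileSync passes arguments as an array and does not spawn a shell,
  // so characters such as `;` or `$(...)` in the URL are never interpreted.
  const stdout = execFileSync("html2md", buildArgs(url, options), {
    encoding: "utf8",
  });
  return JSON.parse(stdout); // documented shape: {title, url, markdown, tokens}
}
```

A call such as `fetchPage("https://example.com", { maxTokens: 2000 })` would then return the parsed object, and its `tokens` field can be checked before the markdown is passed on.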