
URL to Markdown

Convert HTML web pages from HTTP/HTTPS URLs to clean, readable Markdown files with optional batch processing and formatting features.
rwonly
Uncategorized clawhub v2.1.2 2 versions 100000 Key: not required
★ 2 stars
📥 391 downloads
💾 0 installs
#latest

Overview

Url2md

Convert web pages to clean, readable Markdown.
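
As a rough illustration of the idea, here is a minimal stdlib-only sketch of an HTML-to-Markdown pass. It is not the code in scripts/url2md.py: the names MiniMarkdown, html_to_markdown and fetch are hypothetical, and only headings, paragraphs and links are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class MiniMarkdown(HTMLParser):
    """Hypothetical, heavily simplified converter (headings, paragraphs, links)."""

    SKIP = {"script", "style", "noscript", "template"}
    BLOCK = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.out = []         # finished Markdown blocks
        self.buf = []         # text pieces of the block being built
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.href = None      # target of the currently open <a>, if any
        self.link_start = 0   # index in buf where the link text starts

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "a":
            # Resolve relative links against the page URL.
            self.href = urljoin(self.base_url, dict(attrs).get("href") or "")
            self.link_start = len(self.buf)
        elif tag in self.BLOCK:
            self.buf.clear()  # drop stray text seen between blocks

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a" and self.href is not None:
            text = "".join(self.buf[self.link_start:])
            del self.buf[self.link_start:]
            self.buf.append(f"[{text}]({self.href})")
            self.href = None
        elif tag in self.BLOCK:
            text = " ".join("".join(self.buf).split())  # collapse whitespace
            self.buf.clear()
            if text:
                prefix = "#" * int(tag[1]) + " " if tag[0] == "h" else ""
                self.out.append(prefix + text)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf.append(data)


def html_to_markdown(html, base_url):
    """Convert an HTML string to Markdown blocks separated by blank lines."""
    parser = MiniMarkdown(base_url)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.out)


def fetch(url, timeout=30):
    """Download a page with urllib and decode it using the declared charset."""
    req = Request(url, headers={"User-Agent": "url2md-sketch"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Under these assumptions, html_to_markdown(fetch(url), url) would yield a bare-bones Markdown rendering of a page; lists, images, tables and code blocks are left out of the sketch.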

Quick Start

Single URL

python3 scripts/url2md.py https://example.com/article

Output to a file:

python3 scripts/url2md.py https://example.com/article -o article.md

Batch Conversion

Create a file with URLs (one per line):

https://example.com/article-1
https://example.com/article-2
https://example.com/article-3

Convert all and save to a directory:

python3 scripts/url2md.py -f urls.txt -d ./markdown_files/
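
The batch flow above can be pictured as two small steps: read the URL list, then pick an output file per URL. The sketch below is an assumption about how such a step might look, not the script's actual behavior; read_urls, output_path and the host-plus-path naming scheme are hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def read_urls(text):
    """Return the lines of a URL list that are HTTP or HTTPS URLs."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and anything without an http(s) scheme are ignored.
        if line and urlparse(line).scheme in ("http", "https"):
            urls.append(line)
    return urls


def output_path(url, out_dir):
    """Derive a filesystem-safe .md path from the URL's host and path.

    The naming scheme is an illustrative assumption.
    """
    parts = urlparse(url)
    raw = f"{parts.netloc}{parts.path}".strip("/")
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", raw).strip("-") or "index"
    return Path(out_dir) / f"{slug}.md"


if __name__ == "__main__":
    listing = "https://example.com/article-1\nhttps://example.com/article-2\n"
    for url in read_urls(listing):
        print(url, "->", output_path(url, "markdown_files"))
```

Keeping the filename derivation in its own function makes it easy to swap in a different pattern later.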

Features

  • No dependencies: Uses only Python standard library (urllib, html.parser)
  • Reader-style scope: Strips script/style/noscript/template, then prefers the first
    <article> or <main> (else <body>) so output focuses on primary content
  • Title extraction: Uses og:title / Twitter title when present, otherwise <title>, added as H1 when enabled
  • YAML Frontmatter: Extracts structured metadata (title, author, published, description, category, source) from <meta> tags and Schema.org JSON-LD for knowledge-base workflows
  • Template system: Customize output format with variables ({{title}}, {{content}}, {{author}}, {{published}}, {{date}}, etc.)
  • Link resolution: Converts relative URLs to absolute
  • Basic formatting: Headings, paragraphs, lists, links, images, fenced code (with optional language), GFM-style tables, bold/italic
  • Noise removal: Skips navigation, sidebars, footers, forms, and other chrome inside the parsed fragment

Script Reference

scripts/url2md.py

Usage:

url2md.py [url] [options]

Options:

  Option                 Description
  url                    Single URL to convert
  -o, --output           Output file (default: stdout)
  -f, --file             File containing URLs to convert
  -d, --dir              Output directory for batch conversion
  --no-title             Skip adding page title as H1
  --full-page            Parse the full <body> instead of preferring <article>/<main> (more chrome, wider coverage)
  --timeout              Request timeout in seconds (default: 30)
  --frontmatter          Add YAML frontmatter with extracted metadata
  -t, --template         Path to a template file for customizing output
  --filename-template    Batch mode filename pattern (e.g. {{date}}-{{title}}.md)
  --download-images      Download remote images to a local folder (e.g. assets)
  -v, --version          Show version

Examples:

# Single URL to stdout
python3 scripts/url2md.py https://docs.python.org/3

# Save to file
python3 scripts/url2md.py https://docs.python.org/3 -o python-docs.md

# Batch with custom timeout
python3 scripts/url2md.py -f urls.txt -d ./output/ --timeout 60

# Skip title
python3 scripts/url2md.py https://example.com --no-title

# Whole body (no article/main focus)
python3 scripts/url2md.py https://example.com/sitemap --full-page -o sitemap.md

# YAML frontmatter (great for Obsidian / PKM)
python3 scripts/url2md.py https://example.com/article --frontmatter -o article.md

# Custom template
python3 scripts/url2md.py https://example.com/article -t article.tpl -o article.md

# Batch with smart filenames
python3 scripts/url2md.py -f urls.txt -d ./output/ --filename-template "{{date}}-{{title}}.md"

# Download images locally
python3 scripts/url2md.py https://example.com/article -o article.md --download-images assets
python3 scripts/url2md.py -f urls.txt -d ./output/ --download-images assets

Template variables: {{title}}, {{content}}, {{url}}, {{source}}, {{author}}, {{published}}, {{description}}, {{category}}, {{site_name}}, {{date}}, {{datetime}}

When to Use

  • Converting documentation pages to Markdown for local reference
  • Archiving web articles as text files
  • Building a knowledge base with structured metadata (frontmatter / templates)
  • Building static content from dynamic sources
  • Extracting readable content when browser tools are unavailable
  • Batch processing a list of URLs

Limitations

  • Converts static HTML only; does not execute JavaScript
  • Complex layouts (multi-column, heavy CSS) may lose structural fidelity
  • Login-required or paywalled content requires authentication tokens
  • Rate-limited sites may block repeated requests

Version History

2 versions in total

  • v2.1.2 (current): 2026-05-12 05:12, Safe / Safe
  • v1.0.0: 2026-05-11 04:55, Safe / Safe

Security Check

  • Tencent Cloud Security (Keen): Safe, no risk. Report: https://tix.qq.com/search/skill?keyword=8f726c8df27f4af07ecdccca9bcfe671
  • Tencent Cloud Security (Sanbu): Safe, no risk. Report: https://static.cloudsec.tencent.com/html-report-v2/2026/05/26/458558_807fd888e7153fb61fff828acbe71872.html?q-sign-algorithm=sha1&q-ak=AKID8JMG1bzBC1dz96qNhssfFftujT1NCoFi&q-sign-time=1781342868%3B1812878868&q-key-time=1781342868%3B1812878868&q-header-list=host&q-url-param-list=&q-signature=4fe9f76610682e2a9797fab09a168c7c1ae401c7