WeChat Article Extractor

Extract WeChat public account articles to clean Markdown. WeChat blocks headless browsers with a "环境异常" ("abnormal environment") CAPTCHA, and web_fetch often returns an empty JS-rendered page, so the reliable approach is to find a mirror of the article on an aggregator site and extract the content from there.

Scope & Boundaries

This skill handles:

Extracting article text, images, and metadata from WeChat article URLs
Finding mirror copies when direct access is blocked
Converting HTML to clean Markdown
Saving output as .md files

This skill does NOT handle:

Publishing or syncing to note-taking apps (that's the user's workflow)
Batch extraction of multiple articles (handle one at a time)
WeChat login, authentication, or account management
Translating article content

Inputs

Input	Required	Description
-------	----------	-------------
WeChat URL	Yes	An `mp.weixin.qq.com` link
Output filename	No	Defaults to kebab-case of article title
Save location	No	Defaults to `/tmp/`
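
The default output path described in the table above could be derived roughly as follows. This is a minimal sketch: the helper name `default_output_path` and the `wechat-article` fallback are illustrative assumptions, not part of the skill, and keeping CJK characters in the slug is a design choice made here because most WeChat titles are Chinese.

```python
import re
import unicodedata
from pathlib import Path


def default_output_path(title: str, save_dir: str = "/tmp/") -> Path:
    """Build <save_dir>/<kebab-case-title>.md (hypothetical helper)."""
    # Fold full-width punctuation to ASCII, then lowercase Latin letters.
    text = unicodedata.normalize("NFKC", title).lower()
    # Replace each run of non-word characters (and underscores) with one
    # hyphen. \w is Unicode-aware for str patterns, so CJK text is kept.
    slug = re.sub(r"[\W_]+", "-", text).strip("-")
    # Assumed fallback when the title contains no usable characters.
    if not slug:
        slug = "wechat-article"
    return Path(save_dir) / f"{slug}.md"


if __name__ == "__main__":
    print(default_output_path("Hello, World: A Guide"))
```

Passing an explicit `save_dir` covers the optional "Save location" input.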

Outputs

A Markdown file with full article content, images, and metadata header
Console confirmation with file path and character count
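
As an illustration of the console confirmation, the sketch below writes the Markdown file and builds a one-line report. The function name and message wording are assumptions; the image count is borrowed from the Step 6 report and is computed with a simple regex that only recognizes inline `![alt](url)` images.

```python
import re
from pathlib import Path

# Inline Markdown images: ![alt](url) or ![alt](url "title").
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def save_and_report(markdown: str, path: str) -> str:
    """Write the article to disk and return a confirmation line."""
    out = Path(path)
    out.write_text(markdown, encoding="utf-8")
    image_count = len(IMAGE_PATTERN.findall(markdown))
    # Character count is measured on the Markdown text, not the byte size.
    return f"Saved {out} ({len(markdown)} characters, {image_count} images)"
```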

Workflow

Step 1 — Try direct fetch (fast path)

web_fetch(url, extractMode="markdown", maxChars=50000)

Success check: If result rawLength > 500 AND content has real paragraphs (not just nav/footer text) → skip to Step 4 Option B.

Failure indicators: rawLength < 500, content is navigation/boilerplate only, or contains "环境异常" → go to Step 2.
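
The success and failure checks above can be sketched as a single predicate. The "real paragraphs" test is not defined by the skill, so the heuristic here (at least three blank-line-separated blocks of 40 or more characters that are not headings, list items, or bare links) and both thresholds are assumptions. A rawLength of exactly 500 is not covered by either rule; this sketch treats it as a failure.

```python
BLOCK_MARKER = "环境异常"  # text shown on WeChat's CAPTCHA page
MIN_RAW_LENGTH = 500       # threshold from Step 1


def direct_fetch_succeeded(content: str, raw_length: int,
                           min_paragraphs: int = 3,
                           min_paragraph_chars: int = 40) -> bool:
    """Return True when the fetched Markdown looks like a real article."""
    # Failure indicators: short result or the CAPTCHA marker.
    if raw_length <= MIN_RAW_LENGTH or BLOCK_MARKER in content:
        return False
    # Assumed heuristic for "real paragraphs": long enough blocks that do
    # not look like headings, list items, links, or images.
    blocks = [block.strip() for block in content.split("\n\n")]
    paragraphs = [
        block for block in blocks
        if len(block) >= min_paragraph_chars
        and not block.startswith(("#", "- ", "* ", "[", "!["))
    ]
    return len(paragraphs) >= min_paragraphs
```

A `True` result corresponds to skipping to Step 4 Option B; `False` corresponds to continuing with Step 2.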

Step 2 — Extract article metadata

From the URL or any partial content, identify:

Article title (from `<title>` or og:title)
Author / account name (from og:description or page content)

If metadata is unavailable from the URL, ask the user for the article title.

Step 3 — Search for mirrors

web_search("<article title> <author/account name>")

Mirror site priority (ranked by content quality and reliability):

1. 53ai.com: full content, reliable formatting
2. mp.ofweek.com: tech articles
3. juejin.cn: developer content
4. woshipm.com: product/business content
5. 36kr.com: tech/business news

If the title is unknown, try: web_search("site:53ai.com <keywords from URL path>")

If no mirrors are found: try the Chrome Extension Relay fallback (see the Fallback section).

Step 4 — Download and extract

Option A (mirror found):

curl -s -L "<mirror_url>" -o /tmp/wechat-article.html

Verify that the downloaded file is larger than 10 KB; a smaller file usually means a redirect or error page.

Run the extraction script:

python3 <skill_dir>/scripts/extract_wechat.py /tmp/wechat-article.html /tmp/<output-filename>.md

Replace `<skill_dir>` with the directory containing this SKILL.md.

Option B (direct fetch succeeded in Step 1):

Format the fetched Markdown with the header template below.

Step 5 — Verify output quality

Check the output file:

Has a title (not "WeChat Article")
Has multiple paragraphs of real content
Images have valid URLs (not broken or placeholder)
No excessive HTML artifacts remaining

If the output looks truncated or garbled, try a different mirror site (return to Step 3).

Step 6 — Deliver to user

Report:

File saved at: `<path>`
Title: `<title>`
Size: `<char count>` characters
Image count: `<N>` images

If the user wants the file saved to a specific location (e.g., Obsidian), follow their instructions for the final copy.

Markdown Header Template

Every extracted article must include this header:

# <title>

**作者：** <author>
**来源：** 微信公众号「<account_name>」
**日期：** <date>
**原文：** <original_wechat_url>

---

> **摘要：** <1-2 sentence summary generated from content>

---

Fields that cannot be determined should be omitted (don't write "Unknown").

Fallback: Chrome Extension Relay

If no mirror exists (very new or niche article):

Tell the user (in Chinese if they wrote in Chinese):

> "没有找到镜像。请在 Chrome 中打开这篇文章，然后点击 OpenClaw Browser Relay 扩展图标（badge 亮起），我就能直接读取内容。"

(In English: "No mirror was found. Please open this article in Chrome, then click the OpenClaw Browser Relay extension icon (the badge lights up), and I will be able to read the content directly.")

Then use:

browser(action="snapshot", profile="chrome")

Extract content from the snapshot and format it with the header template.

Error Handling

Problem	Detection	Action
---------	-----------	--------
WeChat blocks access	rawLength < 500 or "环境异常"	Search for mirrors (Step 3)
No mirrors found	Search returns 0 relevant results	Try Chrome Relay fallback
Mirror content truncated	Output < 1000 chars when original is long	Try next mirror site
Script extraction fails	Python error or empty output	Fall back to `web_fetch` on mirror URL
Images broken	Image URLs return 404	Note in output; images may expire

Success Criteria

Output Markdown contains the full article text (not truncated)
Title and metadata are correctly extracted
Images are preserved with working URLs
No HTML artifacts or navigation junk in output
File is saved at the specified location

Notes

WeChat image URLs from mirrors (e.g., the api.ibos.cn proxy) are generally valid and render in most Markdown viewers
Mirror sites typically publish within minutes of the original
The `· · ·` section dividers are a WeChat styling convention; preserve them
For very long articles (>50K chars), the script handles them fine but `web_fetch` may truncate

Configuration

No persistent configuration is required. The skill uses standard OpenClaw tools (`web_fetch`, `web_search`, `exec`) and optionally `browser` for the Chrome Relay fallback.

Required tools:

Tool	Purpose
------	---------
`web_fetch`	Direct article fetch attempt
`web_search`	Mirror site discovery
`exec`	Run curl and Python extraction script

Optional tools:

Tool	Purpose
------	---------
`browser`	Chrome Extension Relay fallback

System dependencies:

Dependency	Purpose
------------	---------
Python 3.8+	Extraction script
curl	Mirror page download

Version History

1 version in total

v1.0.0 (current), 2026-03-29 12:00

Security Scan

Tencent Cloud Security (Keen): safe, no risk
Tencent Cloud Security (Sanbu): safe, no risk
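
As a rough illustration of the Markdown Header Template rule that undeterminable fields are omitted rather than written as "Unknown", the header could be assembled as sketched below. The function name, parameter names, and exact line layout are assumptions (the template's original line breaks are not recoverable from the page); dropping the summary block when no summary is available is likewise a choice made for this sketch.

```python
from typing import Optional


def build_header(title: str,
                 author: Optional[str] = None,
                 account_name: Optional[str] = None,
                 date: Optional[str] = None,
                 original_url: Optional[str] = None,
                 summary: Optional[str] = None) -> str:
    """Render the metadata header, skipping fields that are unknown."""
    fields = [
        ("作者", author),
        ("来源", f"微信公众号「{account_name}」" if account_name else None),
        ("日期", date),
        ("原文", original_url),
    ]
    # Omit undetermined fields entirely instead of writing "Unknown".
    meta = [f"**{label}：** {value}" for label, value in fields if value]

    parts = [f"# {title}"]
    if meta:
        parts.append("\n".join(meta))
    parts.append("---")
    if summary:
        parts.append(f"> **摘要：** {summary}")
        parts.append("---")
    # Blank lines between blocks keep "---" from turning the preceding
    # line into a setext heading.
    return "\n\n".join(parts) + "\n"
```

The article body would then be appended directly after the returned header.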
