WeChat Article Extractor

Extract full text and figures from a WeChat public account (微信公众号) article URL and save as a clean Markdown file. Handles WeChat's bot-detection by finding m...
chunhualiao
Communication & Collaboration · clawhub · v1.0.0 · 1 version · 99723 · Key: not required
★ 2 stars · 📥 1,760 downloads · 💾 25 installs · #latest

Overview

WeChat Article Extractor

Extract WeChat public account articles to clean Markdown. WeChat blocks headless browsers with a 环境异常 ("abnormal environment") CAPTCHA, and web_fetch returns empty JS-rendered pages, so the reliable approach is to find a mirror on an aggregator site and extract the content from there.

Scope & Boundaries

This skill handles:

  • Extracting article text, images, and metadata from WeChat article URLs
  • Finding mirror copies when direct access is blocked
  • Converting HTML to clean Markdown
  • Saving output as .md files

This skill does NOT handle:

  • Publishing or syncing to note-taking apps (that's the user's workflow)
  • Batch extraction of multiple articles (handle one at a time)
  • WeChat login, authentication, or account management
  • Translating article content

Inputs

| Input | Required | Description |
| --- | --- | --- |
| WeChat URL | Yes | An mp.weixin.qq.com link |
| Output filename | No | Defaults to kebab-case of article title |
| Save location | No | Defaults to /tmp/ |
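
The two defaults above can be sketched in a few lines of Python. This is only an illustration, not the skill's actual code: the helper names `kebab_case` and `default_output_path` are hypothetical, and so is the `wechat-article` fallback stem used when a title has no usable characters. Unicode letters are kept, on the assumption that Chinese titles should survive as-is.

```python
import re
from pathlib import Path


def kebab_case(title: str) -> str:
    """Lowercase the title and join runs of letters/digits with hyphens."""
    # [^\W_]+ matches runs of Unicode letters and digits, so Chinese
    # characters are kept; punctuation, spaces and underscores separate words.
    words = re.findall(r"[^\W_]+", title.lower())
    return "-".join(words)


def default_output_path(title: str, save_dir: str = "/tmp") -> Path:
    """Build the default Markdown path (hypothetical helper)."""
    # Assumed fallback stem for titles made only of punctuation.
    stem = kebab_case(title) or "wechat-article"
    return Path(save_dir) / f"{stem}.md"
```

For example, `default_output_path("Hello, World: A Guide")` yields a file named `hello-world-a-guide.md` under `/tmp`.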

Outputs

  • A Markdown file with full article content, images, and metadata header
  • Console confirmation with file path and character count

Workflow

Step 1 — Try direct fetch (fast path)

web_fetch(url, extractMode="markdown", maxChars=50000)

Success check: If result rawLength > 500 AND content has real paragraphs (not just nav/footer text) → skip to Step 4 Option B.

Failure indicators: rawLength < 500, content is navigation/boilerplate only, or contains "环境异常" → go to Step 2.
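
The success and failure checks above could be expressed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the caller is assumed to pass in the `rawLength` value and the fetched text, and the "real paragraphs" heuristic (at least 3 blocks of 40+ characters that are not headings) is an illustrative choice, not something the skill specifies.

```python
BLOCK_MARKER = "环境异常"  # WeChat's "abnormal environment" CAPTCHA text
MIN_RAW_LENGTH = 500       # threshold taken from the success check above


def count_real_paragraphs(markdown: str, min_chars: int = 40) -> int:
    """Count blank-line-separated blocks long enough to be body text."""
    blocks = (block.strip() for block in markdown.split("\n\n"))
    # Short blocks (nav links, footers) and Markdown headings are ignored.
    return sum(
        1 for block in blocks
        if len(block) >= min_chars and not block.startswith("#")
    )


def direct_fetch_succeeded(raw_length: int, content: str,
                           min_paragraphs: int = 3) -> bool:
    """Return True when the direct fetch looks like a real article."""
    if BLOCK_MARKER in content:
        return False  # CAPTCHA page: go to Step 2
    if raw_length <= MIN_RAW_LENGTH:
        return False  # too short to be an article
    return count_real_paragraphs(content) >= min_paragraphs
```

A `True` result corresponds to skipping ahead to Step 4 Option B; anything else means continuing with Step 2.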

Step 2 — Extract article metadata

From the URL or any partial content, identify:

  • Article title (from <title> or og:title)
  • Author / account name (from og:description or page content)

If metadata is unavailable from the URL, ask the user for the article title.

Step 3 — Search for mirrors

    web_search("<article title> <author/account name>")

Mirror site priority (ranked by content quality and reliability):

  1. 53ai.com: full content, reliable formatting
  2. mp.ofweek.com: tech articles
  3. juejin.cn: developer content
  4. woshipm.com: product/business content
  5. 36kr.com: tech/business news

If the title is unknown, try: web_search("site:53ai.com <keywords from URL path>")

If no mirrors are found: try the Chrome Extension Relay fallback (see the Fallback section).

Step 4 — Download and extract

Option A (mirror found):

    curl -s -L "<mirror_url>" -o /tmp/wechat-article.html

Verify that the file size is > 10KB (a smaller file usually means a redirect or error page).

Run the extraction script:

    python3 <skill_dir>/scripts/extract_wechat.py /tmp/wechat-article.html /tmp/<output-filename>.md

Replace <skill_dir> with the directory containing this SKILL.md.

Option B (direct fetch succeeded in Step 1):

Format the fetched markdown with the header template below.

Step 5 — Verify output quality

Check the output file:

  • Has a title (not "WeChat Article")
  • Has multiple paragraphs of real content
  • Images have valid URLs (not broken/placeholder)
  • No excessive HTML artifacts remaining

If the output looks truncated or garbled, try a different mirror site (return to Step 3).

Step 6 — Deliver to user

Report:

  • File saved at: <path>
  • Title: <title>
  • Size: <char count> characters
  • Image count: <N> images

If the user wants it saved to a specific location (e.g., Obsidian), follow their instructions for the final copy.

Markdown Header Template

Every extracted article must include this header:

    # <title>

    **作者:** <author>
    **来源:** 微信公众号「<account_name>」
    **日期:** <date>
    **原文:** <original_wechat_url>

    ---

    > **摘要:** <1-2 sentence summary generated from content>

    ---

Omit any field that cannot be determined (don't write "Unknown").

Fallback: Chrome Extension Relay

If no mirror exists (very new or niche article), tell the user (in Chinese if they wrote in Chinese):

    "没有找到镜像。请在 Chrome 中打开这篇文章,然后点击 OpenClaw Browser Relay 扩展图标(badge 亮起),我就能直接读取内容。"

In English: "No mirror was found. Please open this article in Chrome, then click the OpenClaw Browser Relay extension icon (the badge lights up) so that I can read the content directly."

Then use:

    browser(action="snapshot", profile="chrome")

Extract content from the snapshot and format it with the header template.

Error Handling

| Problem | Detection | Action |
| --- | --- | --- |
| WeChat blocks access | rawLength < 500 or "环境异常" | Search for mirrors (Step 3) |
| No mirrors found | Search returns 0 relevant results | Try Chrome Relay fallback |
| Mirror content truncated | Output < 1000 chars when original is long | Try next mirror site |
| Script extraction fails | Python error or empty output | Fall back to web_fetch on mirror URL |
| Images broken | Image URLs return 404 | Note in output; images may expire |

Success Criteria

  • Output Markdown contains the full article text (not truncated)
  • Title and metadata are correctly extracted
  • Images are preserved with working URLs
  • No HTML artifacts or navigation junk in output
  • File is saved at the specified location

Notes

  • WeChat image URLs from mirrors (e.g., api.ibos.cn proxy) are generally valid and render in most Markdown viewers
  • Mirror sites typically publish within minutes of the original
  • The · · · section dividers are WeChat style; preserve them
  • For very long articles (>50K chars), the script handles them fine but web_fetch may truncate

Configuration

No persistent configuration required. The skill uses standard OpenClaw tools (web_fetch, web_search, exec) and optionally browser for the Chrome Relay fallback.

Required tools:

| Tool | Purpose |
| --- | --- |
| web_fetch | Direct article fetch attempt |
| web_search | Mirror site discovery |
| exec | Run curl and Python extraction script |

Optional tools:

| Tool | Purpose |
| --- | --- |
| browser | Chrome Extension Relay fallback |

System dependencies:

| Dependency | Purpose |
| --- | --- |
| Python 3.8+ | Extraction script |
| curl | Mirror page download |

Version History

1 version in total

  • v1.0.0 (current) · 2026-03-29 12:00 · Safe · Safe

Security Check

  • Tencent Cloud Security (Keen): safe, no risk. View report: https://tix.qq.com/search/skill?keyword=41f60aa45136b14a0651b6d1d87c5d75
  • Tencent Cloud Security (Sanbu): safe, no risk. View report: https://static.cloudsec.tencent.com/html-report-v2/2026/05/25/401425_c3eb3726de22f075c0d5cfdeddcd9f8c.html?q-sign-algorithm=sha1&q-ak=AKID8JMG1bzBC1dz96qNhssfFftujT1NCoFi&q-sign-time=1781211110%3B1812747110&q-key-time=1781211110%3B1812747110&q-header-list=host&q-url-param-list=&q-signature=28d628ea78502063f06da9cdd44260fb3d9a1680

Skill工具集 © 2026