
WeChat Work Doc Fetcher

Fetch and convert WeChat Work developer docs pages into clean Markdown files for use in Obsidian, handling SPA content and required authentication.
mouzhi · clawhub v1.0.0 (#latest) · Key: not required

Overview

wecom-doc-fetcher

Use this skill when the user wants to save any page from the WeChat Work (企业微信) developer documentation site (developer.work.weixin.qq.com/document/path/*) as a clean Markdown file in their Obsidian vault.

Files in this skill

wecom-doc-fetcher/
├── SKILL.md          # this file
└── wx_doc_fetch.py   # the fetch & convert script

Setup (one-time)

Run these once before using the skill:

pip install requests playwright
playwright install chromium

> playwright install chromium downloads a ~150 MB headless Chromium binary. This is required for automatic doc_id detection.

Python 3.8+ is required.


Usage

Place wx_doc_fetch.py anywhere convenient (e.g. your vault's scripts folder), then run:

# Basic: auto-detect doc_id, print to stdout
python wx_doc_fetch.py <URL>

# Save to file
python wx_doc_fetch.py <URL> output.md

# Skip Playwright, supply doc_id manually
python wx_doc_fetch.py <URL> output.md --doc-id <integer>

# Override cookies at runtime
python wx_doc_fetch.py <URL> output.md --cookies "wwapidoc.sid=xxx; ..."
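
The four invocations above imply a small command-line interface. As a minimal sketch, an argparse parser that accepts the same forms could look like this. It is illustrative only and not taken from wx_doc_fetch.py; the function names and attribute names are hypothetical.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line forms."""
    parser = argparse.ArgumentParser(prog="wx_doc_fetch.py")
    parser.add_argument("url", help="documentation page URL")
    # The output file is optional: when omitted, the Markdown goes to stdout.
    parser.add_argument("output", nargs="?", default=None, help="output .md file")
    parser.add_argument("--doc-id", type=int, default=None,
                        help="skip Playwright and use this doc_id")
    parser.add_argument("--cookies", default=None,
                        help="cookie string that overrides the built-in one")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv when argv is None)."""
    return build_parser().parse_args(argv)
```

With this layout, `--doc-id 31152` arrives as the integer `args.doc_id`, and a missing output path shows up as `args.output is None`.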

Example

python wx_doc_fetch.py https://developer.work.weixin.qq.com/document/path/94677 发送消息.md
# [info] path_id=94677  doc_id=31152
# [done] 已写入:发送消息.md

How It Works

The WeChat Work docs site is a Vue SPA — the visible content is not in the initial HTML. It is loaded at runtime via a private POST API:

POST https://developer.work.weixin.qq.com/docFetch/fetchCnt?lang=zh_CN&ajax=1&f=json
Body: doc_id=<integer>   (application/x-www-form-urlencoded)

The response includes data.content_md — the page content as a Markdown string. The script fetches this field, cleans it, and writes the result.
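
As a rough illustration of the request described above, the sketch below builds the POST with the standard library's urllib and pulls data.content_md out of a decoded response. The setup step installs requests, so the script's real code presumably differs. The function names are hypothetical, and the live call is expected to fail without an accepted session.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

FETCH_URL = ("https://developer.work.weixin.qq.com/docFetch/fetchCnt"
             "?lang=zh_CN&ajax=1&f=json")


def build_fetch_request(doc_id: int, cookies: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the form-encoded POST for one doc_id."""
    body = urllib.parse.urlencode({"doc_id": doc_id}).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if cookies:
        headers["Cookie"] = cookies  # e.g. "wwapidoc.sid=xxx; ..."
    return urllib.request.Request(FETCH_URL, data=body, headers=headers, method="POST")


def extract_content_md(payload: dict) -> str:
    """Return data.content_md, or raise if the response lacks it."""
    content = (payload.get("data") or {}).get("content_md")
    if not isinstance(content, str):
        raise ValueError("response has no data.content_md")
    return content


def fetch_content_md(doc_id: int, cookies: Optional[str] = None) -> str:
    """Send the request; needs network access and an accepted session."""
    request = build_fetch_request(doc_id, cookies)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return extract_content_md(json.loads(resp.read().decode("utf-8")))
```

Keeping request construction and response parsing separate from the network call makes both easy to check offline.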

Why not WebFetch / defuddle?

The page renders client-side. WebFetch and defuddle only see the pre-JS HTML skeleton — no content. Scraping innerText via browser tools works but produces a very large accessibility tree with poor formatting. The content_md API field is the cleanest, most token-efficient source.

URL path ID ≠ doc_id

The number in the browser URL (e.g. 94677) is a routing slug — not the doc_id the API needs. The actual doc_id (e.g. 31152) is determined at runtime by loading the page with Playwright and intercepting the fetchCnt XHR request.
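
The interception step might look like the following sketch. The two parsing helpers are pure functions. detect_doc_id uses Playwright's sync API (page.on("request", ...) and request.post_data), but its name, timeout, and wait strategy are assumptions, not the script's actual code.

```python
import re
import urllib.parse
from typing import Optional


def path_id_from_url(url: str) -> Optional[int]:
    """Extract the routing slug, e.g. 94677 from .../document/path/94677."""
    match = re.search(r"/document/path/(\d+)", url)
    return int(match.group(1)) if match else None


def doc_id_from_post_data(post_data: Optional[str]) -> Optional[int]:
    """Read doc_id from a form-encoded fetchCnt request body."""
    values = urllib.parse.parse_qs(post_data or "").get("doc_id", [])
    return int(values[0]) if values and values[0].isdigit() else None


def detect_doc_id(url: str, timeout_ms: int = 30000) -> Optional[int]:
    """Load the page headlessly and capture the doc_id the SPA sends."""
    from playwright.sync_api import sync_playwright  # imported lazily

    found = []

    def on_request(request):
        # Only the content request carries the doc_id we are after.
        if "fetchCnt" in request.url:
            doc_id = doc_id_from_post_data(request.post_data)
            if doc_id is not None:
                found.append(doc_id)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("request", on_request)
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        browser.close()
    return found[0] if found else None
```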


Manual doc_id Fallback

If Playwright is unavailable or times out:

  1. Open the target URL in Chrome
  2. DevTools → Network tab → filter by fetchCnt
  3. Click the request → Payload tab
  4. Read the doc_id value
  5. Pass it with --doc-id:
python wx_doc_fetch.py https://developer.work.weixin.qq.com/document/path/94677 发送消息.md --doc-id 31152

Cookie Configuration

The fetchCnt API requires an authenticated session. Playwright's headless browser obtains session cookies automatically when loading the page — no manual cookie setup needed for normal use.

If you see errCode: -30001 in the output, the session was rejected. To fix it:

  1. Open the site in Chrome while logged in
  2. DevTools → Network → any fetchCnt request → Copy as cURL
  3. Find the -b '...' cookie string in the copied command
  4. Either paste it into COOKIES_RAW at the top of wx_doc_fetch.py, or pass it via --cookies "..."
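
Steps 3 and 4 can be partly automated. A minimal sketch, assuming the copied command carries the cookies in a single-quoted -b '...' argument as described above; both helper names are hypothetical.

```python
import re
from typing import Dict, Optional


def cookie_string_from_curl(curl_command: str) -> Optional[str]:
    """Return the value of the first -b '...' argument, if present."""
    match = re.search(r"(?:^|\s)-b\s+'([^']*)'", curl_command)
    return match.group(1) if match else None


def parse_cookies(cookie_string: str) -> Dict[str, str]:
    """Split "name=value; name2=value2" into a dict."""
    cookies = {}
    for part in cookie_string.split(";"):
        # partition keeps any further "=" characters inside the value.
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies
```

The string returned by cookie_string_from_curl can be passed to --cookies unchanged; parse_cookies is only a convenience for checking that wwapidoc.sid and wwapidoc.token_wt are both present.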

Key cookies and their lifetimes:

| Cookie | Purpose | Lifetime |
|---|---|---|
| wwapidoc.sid | Session identifier | ~24 hours |
| wwapidoc.token_wt | JWT auth token | ~30 minutes |

API Reference

| Item | Detail |
|---|---|
| Endpoint | POST /docFetch/fetchCnt?lang=zh_CN&ajax=1&f=json&random= |
| Body | doc_id=<integer> (form-urlencoded) |
| Auth | Session cookies |
| Key response field | data.content_md |
| Other response fields | data.content_html, data.content_html_v2, data.content_txt, data.title, data.time |

content_md Cleaning Rules

The content_md field is mostly valid CommonMark but has site-specific issues. The clean_md() function in wx_doc_fetch.py handles all of them:

| # | Problem | Raw example | After cleaning |
|---|---|---|---|
| 1 | [TOC] marker at top | `[TOC]\n# 概述` | `# 概述` |
| 2 | Heading missing space after # | `##接口定义` | `## 接口定义` |
| 3 | Internal numeric anchor links | `[接收事件](#…)` | `接收事件` |
| 3 | Anchors with sub-path | `[开启API](#…/…)` | `开启API` |
| 4 | HTML line breaks inside table cells | `说明<br>补充` | `说明 补充` |
| 5 | `<b>` bold tags | `<b>注意</b>` | `**注意**` |
| 6 | `<code>` inline tags | `<code>open_kfid</code>` | `` `open_kfid` `` |
| 7 | `<font>` color tags | `<font color="…">警告</font>` | `警告` |
| 8 | `!!#rrggbb text!!` site-specific highlight | `!!#ff0000 重要!!` | `重要` |
| 9 | Leading spaces before table rows | `··\| 参数 \|` | `\| 参数 \|` |
| 10 | No blank line before table (Obsidian won't render) | `文字\n\| col \|` | `文字\n\n\| col \|` |
| 11 | Excess blank lines | 3+ \n in a row | 2 \n max |
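
To make a few of these rules concrete, here is a rough sketch of how rules 1, 2, 4, 9, and 11 could be written as regular-expression passes. It is not the clean_md() from wx_doc_fetch.py: the function name and the exact patterns are assumptions, rule 4 assumes the line breaks appear as `<br>` tags, and the remaining rules are omitted.

```python
import re


def clean_md_sketch(content: str) -> str:
    """Apply a subset of the cleaning rules (1, 2, 4, 9, 11)."""
    # Rule 1: drop a [TOC] marker at the very top.
    content = re.sub(r"\A\s*\[TOC\][ \t]*\n+", "", content)
    # Rule 2: insert the missing space after the leading #s of a heading.
    content = re.sub(r"^(#{1,6})(?=[^#\s])", r"\1 ", content, flags=re.MULTILINE)
    # Rule 4: replace <br>, <br/>, <br /> with one space (same line only).
    content = re.sub(r"[ \t]*<br\s*/?>[ \t]*", " ", content, flags=re.IGNORECASE)
    # Rule 9: strip leading spaces in front of table rows.
    content = re.sub(r"^[ \t]+(?=\|)", "", content, flags=re.MULTILINE)
    # Rule 11: collapse 3+ consecutive newlines to exactly 2.
    content = re.sub(r"\n{3,}", "\n\n", content)
    return content
```

One limitation of this sketch: the rule 2 pattern does not know about code blocks, so a line such as `#include` inside one would also gain a space. A production version would need to skip fenced regions.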

Rule 10 — critical regex note

The blank-line-before-table rule must anchor at the start of the preceding line and require that this line does not begin with |. Inspecting only the last character of the preceding line is not enough:

# CORRECT: anchors at the start of the previous line and requires that it is
#          not itself a table row, so rows are never split apart
re.sub(r"^([^|\n][^\n]*)\n(\|)", r"\1\n\n\2", content, flags=re.MULTILINE)

# WRONG: inspects only the last character of the previous line. Table rows
#        often end with "| " (trailing space), so that character says nothing
#        about whether the line is a row, and every table row gets a blank line
re.sub(r"([^\n])\n(\|)", r"\1\n\n\2", content)
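
The difference is easy to check on a small sample. The snippet below runs both patterns from this note on a paragraph followed by a table whose rows end in a trailing space. The sample text and the helper names are made up for illustration.

```python
import re

SAMPLE = "文字\n| col | \n| --- | \n| val | \n"


def add_blank_line_correct(content: str) -> str:
    # Only a line that does NOT start with "|" may gain a blank line after it.
    return re.sub(r"^([^|\n][^\n]*)\n(\|)", r"\1\n\n\2", content, flags=re.MULTILINE)


def add_blank_line_wrong(content: str) -> str:
    # Looks only at the last character of the previous line, so rows match too.
    return re.sub(r"([^\n])\n(\|)", r"\1\n\n\2", content)


fixed = add_blank_line_correct(SAMPLE)
broken = add_blank_line_wrong(SAMPLE)
print(repr(fixed))
print(repr(broken))
```

Here fixed gains a single blank line between the paragraph and the table, while broken also has a blank line between every pair of rows, which splits the table into separate one-row fragments.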

Version history

  • v1.0.0 (current), 2026-03-29 11:08
