Web Content Fetcher (WeChat images fix)

Extract article content from any URL as clean Markdown. Uses Scrapling script as primary method (with auto fast→stealth fallback), Jina Reader as alternative...
Author: haanya168 · Source: clawhub · Version: v0.0.1 (latest) · API key: not required

Overview

Web Content Fetcher

Given a URL, return its main content as clean Markdown — headings, links, images, lists, code blocks all preserved.

Note: This skill extracts content + remote image URLs. If the user wants an "offline" copy (download images to local disk and rewrite links), add a post-processing step (not included by default in this skill).
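The offline post-processing step mentioned in the note could look roughly like the sketch below. It only rewrites remote image links in the returned Markdown to local paths and reports which URLs need downloading; the helper names, the images/ directory, and the hash-based filenames are illustrative assumptions, not part of this skill.

```python
import hashlib
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches Markdown images: ![alt](url) or ![alt](url "title").
IMAGE_RE = re.compile(r'!\[([^\]]*)\]\((https?://[^\s)]+)([^)]*)\)')


def local_name(url: str) -> str:
    """Derive a stable local filename from a remote image URL (hypothetical scheme)."""
    suffix = PurePosixPath(urlparse(url).path).suffix or ".img"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return digest + suffix


def rewrite_image_links(markdown: str, image_dir: str = "images"):
    """Return (rewritten_markdown, {remote_url: local_path})."""
    mapping = {}

    def repl(match):
        alt, url, rest = match.group(1), match.group(2), match.group(3)
        # Reuse the same local path when an image appears more than once.
        path = mapping.setdefault(url, f"{image_dir}/{local_name(url)}")
        return f"![{alt}]({path}{rest})"

    return IMAGE_RE.sub(repl, markdown), mapping
```

The caller would then download each key of the returned mapping to its local path (for example with urllib.request.urlretrieve) before saving the rewritten Markdown.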

Extraction Strategy

Always try one method per URL — don't cascade blindly. Pick the right one upfront.

URL
 │
 ├─ 1. Scrapling script (preferred)
 │     Run fetch.py — check the domain routing table to decide fast vs --stealth.
 │     Works for most sites. Returns clean Markdown directly.
 │
 └─ 2. Jina Reader (fallback — only if Scrapling fails or dependencies not installed)
       web_fetch("https://r.jina.ai/<url>")
       Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
       Does NOT work for: WeChat (403), some Chinese platforms.
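For callers that do not have a web_fetch tool, the Jina Reader fallback can be approximated with the standard library. This is a minimal sketch: the function names, the User-Agent string, and the timeout are assumptions, and the blocked-host set only encodes the WeChat limitation stated above.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

JINA_PREFIX = "https://r.jina.ai/"
# Hosts the strategy above lists as not working through Jina Reader.
JINA_BLOCKED_HOSTS = {"mp.weixin.qq.com"}


def jina_reader_url(url: str) -> str:
    """Build the Jina Reader URL, refusing hosts known to return 403."""
    host = (urlparse(url).hostname or "").lower()
    if host in JINA_BLOCKED_HOSTS:
        raise ValueError(f"Jina Reader is not expected to work for {host}")
    return JINA_PREFIX + url


def fetch_via_jina(url: str, timeout: float = 20.0) -> str:
    """Fetch Markdown for `url` through Jina Reader (performs a network call)."""
    request = Request(
        jina_reader_url(url),
        headers={"User-Agent": "web-content-fetcher-sketch"},  # assumed value
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```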

Scrapling script

python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth]

<SKILL_DIR> is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:

  • Default (fast): HTTP fetch, ~1-3s, works for most sites
  • --stealth: Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without --stealth, the script automatically falls back to stealth if the fast result has too little content. So you rarely need to specify --stealth manually — the only reason to force it is when you already know the site needs it (see routing table), which saves the initial fast attempt.
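The fallback behaviour described above can be pictured as follows. This is an illustrative sketch, not the code in scripts/fetch.py: the 200-character threshold and the two fetcher callables are assumptions, since the source does not say how "too little content" is measured.

```python
from typing import Callable, Tuple

MIN_CHARS = 200  # hypothetical threshold for "too little content"


def fetch_with_fallback(
    url: str,
    fast: Callable[[str], str],
    stealth: Callable[[str], str],
    min_chars: int = MIN_CHARS,
) -> Tuple[str, str]:
    """Try the fast fetcher first; use stealth only if the result is too short.

    Returns (mode, content), where mode is "fast" or "stealth".
    """
    content = fast(url)
    if len(content.strip()) >= min_chars:
        return "fast", content
    # The fast result looks empty or truncated: pay for the headless browser.
    return "stealth", stealth(url)
```

Forcing --stealth for a known JS-heavy site corresponds to skipping the first call entirely, which is where the time saving comes from.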

Domain Routing

Use this table to pick the right mode on the first call:

| Domain | Command | Why |
|---|---|---|
| mp.weixin.qq.com | fetch.py --stealth | JS-rendered content |
| zhuanlan.zhihu.com | fetch.py --stealth | Anti-scraping + JS |
| juejin.cn | fetch.py --stealth | JS-rendered SPA |
| sspai.com | fetch.py | Static HTML |
| blog.csdn.net | fetch.py | Static HTML |
| ruanyifeng.com | fetch.py | Static blog |
| openai.com | fetch.py | Static HTML |
| blog.google | fetch.py | Static HTML |
| Everything else | fetch.py | Auto-fallback handles it |
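The routing table can also be expressed as a small lookup, which may help when choosing the mode programmatically. The function name is hypothetical, and treating subdomains of a listed domain as matching is an assumption the table does not state.

```python
from urllib.parse import urlparse

# Domains the routing table marks as needing --stealth on the first call.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True if the URL's host is listed as stealth-only."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: a subdomain of a listed domain is routed the same way.
    return any(host == d or host.endswith("." + d) for d in STEALTH_DOMAINS)
```

Every other host, including the static sites in the table, returns False and relies on the script's automatic fallback.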

Script Options

# Basic — auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
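When the script is driven from Python rather than a shell, the same invocations can be assembled as an argument list. A sketch under assumptions: the helper names are invented, sys.executable stands in for python3, combining the options in one call is assumed to be accepted, and a non-zero exit code on failure is assumed rather than documented.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def build_command(
    skill_dir: str,
    url: str,
    max_chars: Optional[int] = None,
    stealth: bool = False,
    as_json: bool = False,
) -> List[str]:
    """Assemble the fetch.py command line shown in the examples above."""
    cmd = [sys.executable, str(Path(skill_dir) / "scripts" / "fetch.py"), url]
    if max_chars is not None:
        cmd.append(str(max_chars))
    if stealth:
        cmd.append("--stealth")
    if as_json:
        cmd.append("--json")
    return cmd


def run_fetch(skill_dir: str, url: str, **options) -> str:
    """Run the script once and return its stdout (assumes non-zero exit on failure)."""
    proc = subprocess.run(
        build_command(skill_dir, url, **options),
        capture_output=True, text=True, timeout=120,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip() or "fetch.py failed")
    return proc.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain & or ?.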

Install Dependencies

First use only — the script checks and tells you if anything is missing:

pip install scrapling html2text

If on system-managed Python (macOS/Linux), add --break-system-packages or use a venv.
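A caller can also check for the two packages before invoking the script, for instance to go straight to Jina Reader when they are missing. This sketch assumes the import names match the pip package names (scrapling, html2text); the function name is hypothetical.

```python
import importlib.util
from typing import Iterable, List

REQUIRED_MODULES = ("scrapling", "html2text")  # assumed import names


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level modules that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
        print("Install with: pip install " + " ".join(missing))
```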

Failure Rules

  • Same URL fails once → give up, tell the user "unable to extract content from this URL"
  • Do not retry — each failed call wastes context tokens
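One way to enforce the no-retry rule in a wrapper is to remember which URLs have already failed. The class below is an illustrative sketch; its name and the choice to treat empty output as a failure are assumptions.

```python
from typing import Callable, Optional, Set


class FetchOnce:
    """Wrap a fetch function so that a URL is never attempted again after failing."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._failed: Set[str] = set()

    def get(self, url: str) -> Optional[str]:
        if url in self._failed:
            return None  # already failed once: give up immediately
        try:
            content = self._fetch(url)
        except Exception:
            content = None
        if not content:
            # Assumption: empty output counts as a failure too.
            self._failed.add(url)
            return None
        return content
```

When get() returns None, the caller reports "unable to extract content from this URL" instead of calling again.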

WeChat-specific gotchas

  • WeChat often uses lazy-loaded images where the real URL is in data-src and src is a tiny placeholder.
  • The extractor script normalizes these to real src URLs before running html2text.
  • If Markdown image lines contain URL-encoded SVG fragments (e.g. ...%3Csvg...%3E) appended after the closing ) of an image, the placeholder has leaked into Markdown parsing. Update fix_lazy_images() in scripts/fetch.py to remove or replace placeholder data:image/svg+xml src values.
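The lazy-image normalization described in the bullets above might be implemented along these lines. This is a regex-based sketch under assumptions, not the actual fix_lazy_images() from scripts/fetch.py: the function name is hypothetical, it only handles quoted attribute values, and dropping placeholder-only images is one possible policy.

```python
import re

# An <img> tag, allowing ">" to appear inside quoted attribute values.
IMG_TAG_RE = re.compile(r"""<img\b(?:[^>"']|"[^"]*"|'[^']*')*>""", re.IGNORECASE)
# Leading whitespace is required, so SRC_RE never matches the tail of "data-src".
DATA_SRC_RE = re.compile(r"""\s+data-src\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)
SRC_RE = re.compile(r"""\s+src\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)
PLACEHOLDER_PREFIX = "data:image/svg+xml"


def _fix_tag(match):
    tag = match.group(0)
    lazy = DATA_SRC_RE.search(tag)
    if lazy and lazy.group(2).strip():
        real_url = lazy.group(2).strip().replace('"', "&quot;")
        # Drop both attributes, then re-add src with the real URL.
        tag = DATA_SRC_RE.sub("", tag, count=1)
        tag = SRC_RE.sub("", tag, count=1)
        return tag[:4] + f' src="{real_url}"' + tag[4:]
    current = SRC_RE.search(tag)
    if current and current.group(2).strip().startswith(PLACEHOLDER_PREFIX):
        return ""  # placeholder with no real URL: remove the image
    return tag


def normalize_lazy_images(html: str) -> str:
    """Promote data-src to src and strip SVG placeholders before Markdown conversion."""
    return IMG_TAG_RE.sub(_fix_tag, html)
```

Running this on the HTML before html2text means the converter never sees the data:image/svg+xml placeholder, which is what produces the stray %3Csvg fragments.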

Version History

1 version in total

  • v0.0.1 (current)
    2026-03-30 11:33, marked safe

Security Checks

  • Tencent Cloud Security (Keen): safe, no risk
  • Tencent Cloud Security (Sanbu): safe, no risk
