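Scrapling Adaptive Scraping Framework

> In one sentence: Scrapling is an adaptive Python scraping framework that adapts automatically to changes in site structure, bypasses Cloudflare anti-bot protection, and supports three fetch modes: static, dynamic, and headless browser.

> - GitHub: https://github.com/D4Vinci/Scrapling

> - Docs: https://scrapling.readthedocs.io/

> - Requires: Python ≥ 3.10

> - License: BSD-3-Clause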

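1. Overview

Use cases

Scenario	Recommended fetcher	Notes
------	---------------	------
Public API / static HTML	`Fetcher`	Fastest; no browser needed
SPA / JS-rendered pages	`DynamicFetcher`	Requires Playwright + Chromium
Cloudflare anti-bot protection	`StealthyFetcher`	Handles Turnstile and similar challenges automatically
Large-scale crawling	Spider framework	Concurrency, checkpoint resume, multiple sessions
Quick content preview	CLI (`scrapling extract`)	No code required; a single command

Platform compatibility

Platform	Status	Notes
------	------	------
macOS	✅ Verified	System Python or WorkBuddy-managed Python
Linux	✅ Supported	System-level dependencies must be installed
Windows	✅ Supported	Running commands in PowerShell is recommended
WorkBuddy	✅ Verified	Managed Python 3.13.12
Claude Code (Codex)	✅ Supported	Adapted automatically via `detect_env()`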

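2. Prerequisites and environment detection

2.1 Automatic platform detection

Before running any command, call the following detection function to determine the current environment: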

import platform, os, sys

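def detect_env():
    """Return a configuration dict for the current platform."""
    is_workbuddy = os.path.exists(os.path.expanduser("~/.workbuddy"))
    # Heuristic used by this guide: either variable marks a Claude Code / Codex session
    is_codex = os.environ.get("CLAUDERC") or os.environ.get("CODECLIMATE_REPO_TOKEN")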
    
    if is_workbuddy:
        return {
            "platform": "workbuddy",
            "python": os.path.expanduser("~/.workbuddy/binaries/python/envs/default/bin/python3"),
            "pip": os.path.expanduser("~/.workbuddy/binaries/python/envs/default/bin/pip"),
            "scrapling": os.path.expanduser("~/.workbuddy/binaries/python/envs/default/bin/scrapling"),
        }
    elif is_codex:
        return {
            "platform": "codex",
            "python": "python3" if platform.system() != "Windows" else "python",
            "pip": "pip3" if platform.system() != "Windows" else "pip",
            "scrapling": "scrapling",
        }
    else:
        system = platform.system().lower()
        return {
            "platform": system,
            "python": "python3" if system != "windows" else "python",
            "pip": "pip3" if system != "windows" else "pip",
            "scrapling": "scrapling",
        }
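
The dictionary returned by `detect_env()` can drive the install commands in §3. The sketch below is one possible way to do that; `install_commands` is a hypothetical helper (not part of Scrapling), and it calls pip as `python -m pip` so that packages land in the same interpreter that later runs the templates.

```python
def install_commands(env: dict) -> list[list[str]]:
    """Build the §3 install commands for the interpreter described by `env`.

    `env` is assumed to have the shape returned by detect_env().
    """
    python = env["python"]
    return [
        [python, "-m", "pip", "install", "scrapling"],
        [python, "-m", "pip", "install", "playwright"],
        [python, "-m", "playwright", "install", "chromium"],
    ]


# Example with a hand-written env dict (same keys as detect_env() returns)
example_env = {
    "platform": "linux",
    "python": "python3",
    "pip": "pip3",
    "scrapling": "scrapling",
}
for cmd in install_commands(example_env):
    print(" ".join(cmd))
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` as is; the sketch only prints the commands so that it has no side effects.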

2.2 Dependency checklist

Dependency	Minimum version	Purpose	Check command
--------	---------	------	---------
Python	3.10+	Runtime	`python3 --version`
Scrapling	Latest	Scraping core	`python3 -c "import scrapling; print(scrapling.__version__)"`
Playwright	Matching version	Browser automation	`python3 -c "import playwright; print(playwright.__version__)"`
Chromium	Latest	Browser engine	`python3 -m playwright install --dry-run 2>&1 | grep chromium`
Git	-	Cloning the source (optional)	`git --version`
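
The Python-package rows of the checklist can also be verified from inside Python. The following is a minimal sketch using only the standard library; `package_status` is a hypothetical helper, and it assumes the import name and the installed distribution name are the same, which is the case for `scrapling` and `playwright`.

```python
import importlib.util
from importlib import metadata


def package_status(name: str) -> str:
    """Return the installed version of `name`, or a short status string."""
    # find_spec returns None when the top-level module cannot be imported
    if importlib.util.find_spec(name) is None:
        return "missing"
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a stdlib module)
        return "installed (version unknown)"


for pkg in ("scrapling", "playwright"):
    print(f"{pkg}: {package_status(pkg)}")
```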

2.3 Checking Python on each platform

# macOS - system Python
/usr/bin/python3 --version

# WorkBuddy - managed Python
ls ~/.workbuddy/binaries/python/versions/ 2>/dev/null

# Linux - system Python
python3 --version

# Windows - PowerShell
python --version

# Claude Code - automatic detection
python3 --version

3. Installation

3.1 Standard installation (pip)

# ✅ macOS | ✅ Linux | ✅ Windows | ✅ WorkBuddy | ✅ Codex
pip install scrapling
pip install playwright
python3 -m playwright install chromium

3.2 Installing into the WorkBuddy-managed environment

# Resolve the paths
PYTHON=~/.workbuddy/binaries/python/envs/default/bin/python3
PIP=~/.workbuddy/binaries/python/envs/default/bin/pip

# Install
$PIP install scrapling
$PIP install playwright
$PYTHON -m playwright install chromium

3.3 Verifying the installation

from scrapling import Fetcher, StealthyFetcher, DynamicFetcher, Selector
print("Scrapling OK | Fetcher ✅ | StealthyFetcher ✅ | DynamicFetcher ✅")

> If you only need basic HTTP scraping, Playwright and Chromium are not required. Fetcher works without a browser.

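4. Configuration management

4.1 One-time configuration (Setup once, reuse forever)

> The settings below are set once at install time and do not change afterwards.

Setting	Parameter	Platform notes	Description
--------	------	---------	------
`Fetcher.configure(adaptive=True)`	Global	All platforms	Enables adaptive parsing (elements are relocated automatically)
Python executable path	Variable	Determined by `detect_env()`	Detected automatically at install time
Default timeout	`timeout=30` (HTTP, seconds) / `timeout=30000` (browser, milliseconds)	All platforms	Can be overridden per task
Default proxy (optional)	`proxy="http://..."`	All platforms	For teams that share a single egress

Fetcher.configure() parameter reference:

Parameter	Default	Description
------	--------	------
`adaptive`	`False`	Enable adaptive parsing
`adaptive_domain`	`None`	Bind to a domain (match the same site across URLs)
`huge_tree`	`False`	Enable when handling very large HTML documents
`keep_comments`	`False`	Keep HTML comments
`keep_cdata`	`False`	Keep CDATA sections
`storage`	`None`	Storage backend for adaptive data (SQLite by default)
`storage_args`	`None`	Arguments for the storage backend

4.2 Per-task configuration (Per-task decision)

> The parameters below appear, scenario by scenario, in the "Tunable parameters" table of each scraping template in §5. Adjust them to the target at run time.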

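Fetcher selection flow:

User request (target URL)
    ↓
Assess the target site:
├── Plain static HTML / public API    →  §5.1 Fetcher (fastest)
├── JS-rendered / SPA                 →  §5.2 DynamicFetcher
├── Cloudflare / anti-bot protection  →  §5.3 StealthyFetcher
├── Pagination / listings             →  §5.4 Spider
└── Requires a logged-in session      →  §5.5 login-state template
    ↓
Output destination decided?
├── Local file → write directly (§7.5 post-processing)
├── WeCom      → load the wecom-unified skill
├── Lark       → load the lark-unified skill
├── DingTalk   → DingTalk OpenAPI
└── IMA        → load ima-skill
    ↓
Pre-run confirmation → run the template → output → route

Decision	Options	Based on
--------	-------	---------
Fetcher type	Fetcher / DynamicFetcher / StealthyFetcher	Strength of the target's anti-bot measures
Proxy strategy	None / single proxy / rotating proxies	Whether the target blocks IPs
Cloudflare solving	On / off	Whether the target sits behind Cloudflare
Output destination	Local / WeCom / Lark / DingTalk / IMA	User requirement
CSS/XPath selectors	Specified as needed	Content to extract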
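
The first branch of the flow above can be written as a small function. This is a simplified sketch, not part of Scrapling: `choose_fetcher` and its three boolean inputs are hypothetical, and the priority order (Cloudflare first, then JS rendering, then pagination) is an assumption about how to break ties when several traits apply.

```python
def choose_fetcher(
    needs_js: bool = False,
    has_cloudflare: bool = False,
    paginated: bool = False,
) -> str:
    """Map observed site traits to the template section to start from."""
    if has_cloudflare:
        return "StealthyFetcher"   # §5.3
    if needs_js:
        return "DynamicFetcher"    # §5.2
    if paginated:
        return "Spider"            # §5.4
    return "Fetcher"               # §5.1, the fastest option


print(choose_fetcher())                      # Fetcher
print(choose_fetcher(needs_js=True))         # DynamicFetcher
print(choose_fetcher(has_cloudflare=True))   # StealthyFetcher
```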

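5. Scraping strategy templates

> Every template follows the same five parts: use case → tunable parameters → code template → OUTPUT_SCHEMA → result routing guide.

⚠️ Pre-run checklist: confirm the following before using any template.

Check	What to confirm	User input needed?
--------	---------	----------------
✅ Environment ready	Is Scrapling installed? (see the self-check script in §10.1)	Confirm on first use
✅ Target URL	Has the user provided a complete URL?	Ask every time
✅ Fetcher choice	Static (Fetcher), dynamic (DynamicFetcher), or anti-bot (StealthyFetcher)?	Ask every time
✅ Output destination	Local / WeCom / Lark / DingTalk / IMA?	Ask every time
✅ Selectors	Does the user know what content to extract?	If unclear, preview with the CLI first

> If Scrapling is not installed yet, do not try to run any code. Go to §3 Installation and complete the installation first.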

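5.1 Static page scraping

Use case: public APIs, static HTML pages, and blogs, news, or documentation sites that do not rely on JS rendering.

> ⚠️ Confirm before use: the URL, the CSS selectors, and whether the output needs routing

Tunable parameters:

Parameter	Example value	Description
------	--------	------
url	`"https://example.com"`	Target URL
timeout	`30`	Timeout (seconds)
impersonate	`"chrome"`	Browser whose TLS fingerprint is imitated
proxy	`"http://user:pass@host:port"`	Proxy (optional)
stealthy_headers	`True`	Generate browser-like request headers automatically
cookies	`"session=abc123"`	Request cookies (optional)

Code template: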

from scrapling.fetchers import Fetcher
from scrapling.parser import Selector

env = detect_env()
PYTHON = env["python"]

# One-off fetch
resp = Fetcher.get(url, timeout=timeout, impersonate="chrome")
page = Selector(resp.text)

# Using a session (connection pooling; better suited to repeated requests)
from scrapling.fetchers import FetcherSession
with FetcherSession(impersonate="chrome") as session:
    page = session.get(url)
    data = page.css("h1::text").getall()

Predefined output schema (OUTPUT_SCHEMA):

OUTPUT_SCHEMA = {
    "url": str,             # source URL
    "title": str,           # page title
    "data": list[dict],     # list of extracted records
    "fields": {             # field definitions
        "field_name": "CSS selector path"
    },
    "timestamp": str,       # scrape time, ISO 8601
    "total": int            # number of records
}
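
OUTPUT_SCHEMA describes types rather than values, so it can help to see one concrete result in that shape. The sketch below assumes a hypothetical helper `build_output` (not part of Scrapling) and uses a timezone-aware UTC timestamp for the ISO 8601 field.

```python
from datetime import datetime, timezone


def build_output(url: str, title: str, records: list[dict], fields: dict) -> dict:
    """Assemble a result dict with the keys listed in OUTPUT_SCHEMA."""
    return {
        "url": url,
        "title": title,
        "data": records,
        "fields": fields,
        # ISO 8601 with an explicit UTC offset, e.g. 2026-05-11T08:00:00+00:00
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "total": len(records),
    }


result = build_output(
    url="https://example.com",
    title="Example Domain",
    records=[{"title": "First"}, {"title": "Second"}],
    fields={"title": "h1::text"},
)
print(result["total"], result["timestamp"])
```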

Result routing guide:

# Route to a local CSV file
to_csv(data, ["title", "url"], "output.csv")
# Route to WeCom / Lark / IMA → see §7 Result routing

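5.2 Dynamic page rendering (DynamicFetcher)

Use case: single-page applications, JS-rendered content, and pages that only become complete after their network requests finish.

> ⚠️ Confirm before use: the URL, whether the target really needs JS rendering (try §5.1 first to check), and the wait strategy

Tunable parameters:

Parameter	Example value	Description
------	--------	------
url	`"https://spa-site.com"`	Target URL
headless	`True`	Headless mode
network_idle	`True`	Wait for network idle (no activity for 500 ms)
wait_selector	`".content-loaded"`	Wait for the given element to appear
wait	`2000`	Extra wait (milliseconds)
timeout	`30000`	Timeout (milliseconds)
disable_resources	`False`	Skip unnecessary resources to speed up loading
block_ads	`True`	Block ads and trackers

Code template (three wait strategies):

from scrapling.fetchers import DynamicSession

# Strategy 1: wait for network idle
with DynamicSession(headless=True, network_idle=True) as session:
    page = session.fetch(url)
    data = page.css(".content").getall()

# Strategy 2: wait for a specific element to appear
with DynamicSession(headless=True) as session:
    page = session.fetch(url, wait_selector=".main-loaded")
    data = page.css(".main-content").getall()

# Strategy 3: extra wait time (useful for animations and transitions)
with DynamicSession(headless=True) as session:
    page = session.fetch(url, wait=3000)
    data = page.css(".result").getall()

OUTPUT_SCHEMA: same as 5.1

Result routing guide:

# Save as JSON
to_json(data, "dynamic_output.json")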

5.3 Cloudflare bypass (StealthyFetcher)

Use case: sites behind Cloudflare protection or Cloudflare Turnstile, and sites with strict anti-bot measures.

> ⚠️ Confirm before use: the URL and whether the target is behind Cloudflare (a verification page appears or requests time out). Solving the Cloudflare challenge needs a timeout of at least 60 s.

Tunable parameters:

Parameter	Example value	Description
------	--------	------
url	`"https://protected-site.com"`	Target URL
solve_cloudflare	`True`	Solve Cloudflare challenges automatically
headless	`True`	Headless mode
timeout	`60000`	Timeout (milliseconds; at least 60 s recommended for Cloudflare)
proxy	`"http://user:pass@host:port"`	Proxy (optional)
proxy_rotator	`ProxyRotator(...)`	Rotating proxies (optional)
hide_canvas	`False`	Add noise to the canvas fingerprint
block_webrtc	`True`	Prevent IP leaks through WebRTC
allow_webgl	`True`	Keep WebGL enabled

Code template:

from scrapling.fetchers import StealthySession

# Basic Cloudflare bypass
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch(url, timeout=60000)
    data = page.css("title::text").get()
    content = page.css(".main-content").getall()

# Bypass with proxy rotation
from scrapling.fetchers import StealthySession, ProxyRotator

proxies = [
    "http://proxy1:8080",
    "http://proxy2:8080",
]
rotator = ProxyRotator(proxies)

with StealthySession(
    headless=True,
    solve_cloudflare=True,
    proxy_rotator=rotator,
    timeout=60000
) as session:
    page = session.fetch(url)
    data = page.css("h1::text").get()

OUTPUT_SCHEMA: same as 5.1

Result routing guide:

# Cloudflare bypass → JSON output
to_json(data, "stealth_output.json")
# Route to IMA → load ima-skill → upload_document("stealth_output.json")

5.4 Pagination / list crawling (Spider)

Use case: product lists, article lists, search results, and other content that spans multiple pages.

Tunable parameters:

Parameter	Example value	Description
------	--------	------
start_urls	`["https://site.com/page/1"]`	Start URLs
concurrent_requests	`4`	Number of concurrent requests
download_delay	`1.0`	Delay between requests (seconds)
max_pages	`50`	Maximum number of pages to follow
crawldir	`"./crawl_data"`	Directory for checkpoint resume (optional)

Code template:

from scrapling.spiders import CrawlSpider, CrawlRule, LinkExtractor
from scrapling.spiders import Spider, Request, Response

# Option 1: CrawlSpider (declarative; follows links automatically)
class ListSpider(CrawlSpider):
    name = "list_scraper"
    start_urls = ["https://site.com/list"]
    concurrent_requests = 4
    download_delay = 1.0

    def rules(self):
        return [
            CrawlRule(
                LinkExtractor(allow=r"/item/\d+"),
                callback=self.parse_item
            ),
            CrawlRule(
                LinkExtractor(allow=r"/page/\d+"),
            ),
        ]

    async def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }

result = ListSpider().start()
result.items.to_json("output.json")
print(f"Scraped {len(result.items)} records")

# Option 2: with checkpoint resume
spider = ListSpider(crawldir="./crawl_data")
result = spider.start()
# Press Ctrl+C to pause; running it again resumes from the checkpoint

# Option 3: wrapped in try/except (for transient network failures)
try:
    spider = ListSpider(crawldir="./crawl_data")
    result = spider.start()
    print(f"Done: {len(result.items)} records")
except Exception as e:
    print(f"Crawl failed: {e}")
    print("Suggestion: check the network → check that the target is reachable → retry")
    print("If it keeps failing, see §8 Troubleshooting")

OUTPUT_SCHEMA:

OUTPUT_SCHEMA = {
    "url": str,
    "records": [dict],     # each record is a dict
    "fields": ["field1", "field2"],
    "total": int,
    "timestamp": str,
    "stats": {             # crawl statistics
        "requests": int,
        "failed": int,
        "elapsed_seconds": float
    }
}

Result routing guide:

# Write the Spider result straight to JSON
result.items.to_json("spider_output.json")

# Convert to CSV
to_csv([dict(r) for r in result.items], ["title", "url"], "spider_output.csv")

# Route to WeCom Smart Sheets → load the wecom-unified skill → sheet record append
# Route to Lark Base → load the lark-unified skill → base record append

5.5 Keeping a logged-in session

Use case: pages that require login cookies, and content behind authentication.

Tunable parameters:

Parameter	Example value	Description
------	--------	------
url	`"https://auth-site.com/dashboard"`	Target URL
cookies	`"session=abc123; token=xyz"`	Cookies carrying the login state
auth	`("user", "pass")`	HTTP Basic Auth
impersonate	`"chrome"`	Browser fingerprint
cookie_file	`"./cookies.json"`	File used to persist cookies

Code template:

from scrapling.fetchers import FetcherSession
import json, os

# Method 1: pass a cookie string directly
with FetcherSession(impersonate="chrome") as session:
    page = session.get(url, cookies="session=abc123; token=xyz")
    data = page.css(".protected-content").getall()

# Method 2: persist cookies to a file (kept across sessions)
COOKIE_FILE = "./cookies.json"

def save_cookies(cookies_dict):
    with open(COOKIE_FILE, "w") as f:
        json.dump(cookies_dict, f)

def load_cookies():
    if os.path.exists(COOKIE_FILE):
        with open(COOKIE_FILE) as f:
            return json.load(f)
    return {}

# First run: log in and save the cookies
with FetcherSession(impersonate="chrome") as session:
    # Log in first
    resp = session.post("https://site.com/login", data={
        "username": "user", "password": "pass"
    })
    # Save the cookies received after login
    save_cookies(resp.cookies)

# Later runs: load the cookies and go straight to the page
with FetcherSession(impersonate="chrome") as session:
    cookies = load_cookies()
    page = session.get(url, cookies=cookies)
    data = page.css(".content").getall()
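
Method 1 above carries cookies as a header-style string, while Method 2 stores them with `json.dump`, which needs a dict-like object. If the two need to be bridged, a string can be turned into a dict with the standard library. This is a sketch only; `cookie_string_to_dict` is a hypothetical helper and makes no assumption about which form `FetcherSession` itself prefers.

```python
from http.cookies import SimpleCookie


def cookie_string_to_dict(raw: str) -> dict[str, str]:
    """Parse 'name=value; name2=value2' into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw)
    return {name: morsel.value for name, morsel in jar.items()}


print(cookie_string_to_dict("session=abc123; token=xyz"))
# {'session': 'abc123', 'token': 'xyz'}
```

The resulting dict can be written with `save_cookies` from the template above.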

OUTPUT_SCHEMA: same as 5.1

Result routing guide:

# Data fetched after login → JSON output
to_json(data, "auth_data.json")
# Route to WeCom / Lark / IMA → see §7

6. Data extraction quick reference

6.1 CSS selectors

Scrapling supports full CSS3 plus a few Scrapling-specific pseudo-elements:

Selector	Description	Example
--------	------	------
`.class`	Class selector	`page.css(".title")`
`#id`	ID selector	`page.css("#main")`
`tag`	Tag selector	`page.css("div")`
`::text`	Get the text (Scrapling-specific)	`page.css("h1::text")`
`::attr(name)`	Get an attribute (Scrapling-specific)	`page.css("a::attr(href)")`
`:contains("text")`	Text contains	`page.css("div:contains('价格')")` (matches divs containing "价格", i.e. "price")
Chained calls	Multi-level filtering	`page.css(".list").css(".item::text")`

Methods:

page.css("h1::text").get()           # first match
page.css("h1::text").getall()        # all matches
page.css(".item").first              # first element, safely (no IndexError)
page.css(".item").last               # last element, safely

Selector examples from real-world page layouts:

# Product list page: extract product prices
page.css(".product-price::text").getall()       # <span class="product-price">¥99</span>

# Article list: extract titles and links
page.css("h2.entry-title a::text").getall()     # title text
page.css("h2.entry-title a::attr(href)").getall()  # link URLs

# Table data
page.css("table#data tbody tr").getall()        # each row
page.css("table#data tbody tr td::text").getall()  # each cell

# Pagination link
page.css("a.next.page-numbers::attr(href)").get()  # next page

6.2 XPath selectors

Expression	Description
--------	------
`//div`	All div elements
`//div[@class="title"]`	div elements with class="title"
`//h1/text()`	Text of h1
`//a/@href`	The href attribute of a
`//div[contains(text(), "keyword")]`	div elements whose text contains the keyword

Note: Scrapling does not support the XPath has-class() extension function. Use the .has_class() method instead:

elements = page.css("div")
for el in elements:
    if el.has_class("active"):
        print(el.text)

6.3 Extracting embedded JSON

import re
from scrapling.parser import TextHandler

# Extract JSON data from <script> tags
scripts = page.css("script::text").getall()
for script in scripts:
    if "window.__INITIAL_STATE__" in script:
        match = re.search(r"__INITIAL_STATE__\s*=\s*({.*?});", script, re.DOTALL)
        if match:
            # Parse with TextHandler.json()
            data = TextHandler(match.group(1)).json()
            print(data.keys())
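
The non-greedy pattern above stops at the first `};` it meets, so it can cut the object short when that sequence also occurs inside a string value. As an alternative sketch (a different technique from the template, using only the standard library), `json.JSONDecoder.raw_decode` can parse one complete JSON value starting at the opening brace and ignore whatever follows. `extract_assigned_json` is a hypothetical helper name.

```python
import json


def extract_assigned_json(script: str, marker: str = "window.__INITIAL_STATE__"):
    """Return the JSON object assigned to `marker`, or None if not found."""
    start = script.find(marker)
    if start == -1:
        return None
    brace = script.find("{", start)
    if brace == -1:
        return None
    try:
        # raw_decode parses exactly one JSON value beginning at index `brace`
        obj, _end = json.JSONDecoder().raw_decode(script, brace)
    except json.JSONDecodeError:
        return None
    return obj


sample = 'window.__INITIAL_STATE__ = {"user": {"id": 1}, "note": "a};b"};'
print(extract_assigned_json(sample))
# {'user': {'id': 1}, 'note': 'a};b'}
```

This only works when the assigned value is strict JSON; a JavaScript object literal with unquoted keys or trailing commas will return None.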

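6.4 Regex fallback

When CSS/XPath cannot locate the content, fall back to a regular expression:

from scrapling.parser import Selector

page = Selector(html_text)
# Try CSS first
data = page.css(".price::text").getall()

# Fallback: regex extraction
if not data:
    data = page.re(r"价格[：:]?\s*¥?([\d,.]+)")
    # Alternatively (instead of the line above): take only the first match
    data = page.re_first(r"\d{11}", default=None)  # extract an 11-digit mobile number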
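
The captured price strings still contain thousands separators, so a normalisation step usually follows. The sketch below applies the same pattern with the standard `re` module on plain text; `extract_prices` is a hypothetical helper and is independent of Scrapling's `page.re`.

```python
import re

# Same pattern as in the template: the label "价格" (price), an optional colon, optional ¥
PRICE_RE = re.compile(r"价格[：:]?\s*¥?([\d,.]+)")


def extract_prices(text: str) -> list[float]:
    """Find price strings and convert them to floats, skipping malformed ones."""
    prices = []
    for raw in PRICE_RE.findall(text):
        try:
            prices.append(float(raw.replace(",", "")))
        except ValueError:
            # e.g. the class matched only punctuation such as "," or "."
            continue
    return prices


print(extract_prices("价格：¥1,299.00 / 价格: 88"))
# [1299.0, 88.0]
```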

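7. Result routing

7.1 Standard intermediate format

Scrapling results are first normalised into the following format and then routed to the target platform:

ROUTABLE_DATA = {
    "schema": OUTPUT_SCHEMA,    # inherited from the §5 template
    "records": [dict],          # list of data records
    "source": str,              # source URL
    "timestamp": str,           # scrape time, ISO 8601
    "total": int                # number of records
}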

7.2 Routing target index

Target platform	Channel type	Tool	GitHub project	Configuration required
---------	---------	---------	------------	---------
CSV	Python	`csv.writer`	-	None
JSON	Python	`orjson` / `json.dump`	-	None
Excel	Python	`openpyxl`	https://github.com/theorchard/openpyxl	`pip install openpyxl`
WeCom Smart Docs	Skill	`wecom-unified` → `wecom-cli`	https://github.com/WecomTeam/wecom-cli	WeCom application credentials
WeCom Smart Sheets	Skill	`wecom-unified` → Smart Sheets API	https://github.com/WecomTeam/wecom-cli	Same as above
Lark Docs / Sheets	Skill	`lark-unified` → `lark-cli`	https://github.com/larksuite/cli	Lark application credentials
DingTalk Docs	REST API	DingTalk OpenAPI document endpoints	https://github.com/open-dingtalk/dingtalk-mcp	Client ID + Secret
IMA knowledge base	Skill	`ima-skill` → `upload_document`	ima.qq.com (official API)	IMA_OPENAPI credentials

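7.3 Routing decision flow

Scrapling output (standard JSON)
    ↓
Format conversion needed? → yes → CSV / Excel / Markdown / ... → write a temporary file
    ↓ no                                  ↓
Choose the target platform:
    ├── WeCom    → load the wecom-unified skill → run the matching command
    ├── Lark     → load the lark-unified skill → run the matching command
    ├── DingTalk → call the DingTalk OpenAPI (HTTP POST)
    ├── IMA      → load ima-skill → upload_document
    ├── Local    → write the file directly
    └── Other    → write a local file and report its path (fallback)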
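
The flow above can be sketched as a small dispatcher. Everything here is hypothetical: `route`, `write_local`, and the short target names are illustrative, and the platform branches only name the skill or API to hand off to rather than calling it, since those integrations live outside Python.

```python
import json
import tempfile
from pathlib import Path

# Hand-off targets named in the flow above (labels only; no API calls are made)
HANDOFFS = {
    "wecom": "wecom-unified skill",
    "lark": "lark-unified skill",
    "dingtalk": "DingTalk OpenAPI",
    "ima": "ima-skill",
}


def write_local(payload: dict, path: str) -> str:
    """Write the payload as UTF-8 JSON and return a short status message."""
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return f"saved to {path}"


def route(payload: dict, target: str, path: str = "output.json") -> str:
    """Always keep a local copy, then report where the data should go next."""
    message = write_local(payload, path)
    if target in HANDOFFS:
        return f"{message}; hand off to {HANDOFFS[target]}"
    # "local" and any unknown target use the local file as the fallback
    return message


with tempfile.TemporaryDirectory() as tmp:
    print(route({"records": [], "total": 0}, "ima", str(Path(tmp) / "output.json")))
```

Writing the local file first means the fallback branch needs no extra work if the hand-off is unavailable.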

7.4 Writing to each platform

WeCom Smart Docs / Smart Sheets:

→ Load the wecom-unified skill
→ Make sure wecom-cli is installed (https://github.com/WecomTeam/wecom-cli)
→ Create a document: wecom-cli document create --title "Results" --content markdown
→ Write to a sheet: wecom-cli sheet record append --table-id xxx --records json

Lark Docs / Base:

→ Load the lark-unified skill
→ Make sure lark-cli is installed (https://github.com/larksuite/cli)
→ Create a document: lark document create --title "Results" --content markdown
→ Write to a table: lark base record append --table-id xxx --records json

DingTalk Docs:

→ Option 1: call the DingTalk OpenAPI directly (https://open.dingtalk.com/)
→ Option 2: configure the dingtalk-mcp MCP Server (https://github.com/open-dingtalk/dingtalk-mcp)
   (Note: the MCP server does not currently cover document operations, so the REST API must be used directly)
→ Prerequisite: an enterprise app on the DingTalk Open Platform → obtain the Client ID + Client Secret
→ Document API reference: https://open.dingtalk.com/document/development/dingtalk-document-overview

IMA knowledge base:

→ Load ima-skill
→ Write a local JSON/CSV file first → ima-skill upload_document(file_path)
→ IMA OpenAPI: https://ima.qq.com

7.5 Post-processing templates (format conversion)

import csv, json

# Convert to CSV
def to_csv(records, fields, output_path):
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)

# Convert to JSON (output)
def to_json(records, output_path):
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

# Field cleaning (fill null values with a default)
def clean_records(records, default=""):
    for rec in records:
        for k, v in rec.items():
            if v is None:
                rec[k] = default
    return records
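
With its default settings, `csv.DictWriter` raises `ValueError` when a record has a key that is not in `fieldnames`, which is easy to hit with scraped records of uneven shape. A tolerant variant is sketched below; `records_to_csv_text` is a hypothetical helper that returns the CSV as a string so the behaviour is easy to inspect.

```python
import csv
import io


def records_to_csv_text(records: list[dict], fields: list[str]) -> str:
    """Serialise records to CSV text, tolerating extra and missing keys."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(
        buf,
        fieldnames=fields,
        restval="",             # value used for keys missing from a record
        extrasaction="ignore",  # drop keys that are not listed in `fields`
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


rows = [
    {"title": "A", "url": "https://example.com/a", "extra": 1},
    {"title": "B"},
]
print(records_to_csv_text(rows, ["title", "url"]))
```

The same two keyword arguments can be added to the `csv.DictWriter(...)` call in `to_csv` if dropping unlisted fields is acceptable for the task.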

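8. Troubleshooting common errors

8.0 Scrapling not installed / incomplete environment

Symptom: ModuleNotFoundError: No module named 'scrapling'
       → the code templates cannot run
Cause: Scrapling is not installed, or it is installed into a different Python environment
Fix:
  Step 1: Run the self-check script (§10.1) to confirm the current state
  Step 2: Go to §3 Installation → pick the procedure for your platform
  Step 3: Run the self-check again after installing

  ⚠️ Do not skip installation and run the code templates directly
  ⚠️ With multiple Python environments, make sure pip installs into the same interpreter that runs the code

8.1 403 / 429 anti-bot blocking

Symptom: HTTP 403 / 429 → empty content
Cause: the target site has detected scraper traits
Fix:
  Step 1: Add a browser fingerprint → impersonate="chrome"
  Step 2: Switch to StealthyFetcher → handles Cloudflare automatically
  Step 3: Add a proxy → proxy / proxy_rotator
  Step 4: Lower the request rate → download_delay / manual throttling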
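8.2 Browser fails to start

Symptom: "Browser closed" / "Channel closed"
Cause: the Playwright browser is not installed, or system dependencies are missing
Fix:
  $ python3 -m playwright install chromium
  $ python3 -m playwright install-deps chromium  # Linux only
  # Or reuse the system Chrome: set the real_chrome=True parameter

8.3 Encoding problems

Symptom: garbled Chinese text or other broken characters
Cause: the page encoding differs from what Scrapling detected automatically
Fix:
  resp = Fetcher.get(url)
  # Set the encoding manually
  resp.encoding = "utf-8"  # or "gbk" / "gb2312"
  # Or take it from the HTML meta tag
  import re
  charset = re.search(r'charset=([^"\']+)', resp.text)
  if charset: resp.encoding = charset.group(1)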
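
When the raw response bytes are available, the declared charset can be read before any decoding takes place, which avoids searching text that may already be garbled. The sketch below works on plain `bytes` with the standard library; `decode_html` is a hypothetical helper, and how the raw bytes are obtained from a response object is left open here.

```python
import re

# Matches <meta charset="gbk"> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode HTML bytes using the charset declared in the first 4 KB, if any."""
    match = META_CHARSET.search(raw[:4096])
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong declaration: degrade instead of failing
        return raw.decode(fallback, errors="replace")


raw = '<meta charset="gbk"><p>价格</p>'.encode("gbk")
print(decode_html(raw))
# <meta charset="gbk"><p>价格</p>
```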

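8.4 Unexpected results

Symptom: empty output or missing data
Checks:
  1. Verify the CSS selector → page.css("...").getall()
  2. Check whether the content is loaded dynamically → switch to DynamicFetcher
  3. Check for anti-bot blocking → inspect page.status
  4. Check whether the page is empty → print page.text[:200]
  5. Escalation path: Fetcher → DynamicFetcher → StealthyFetcher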
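
The escalation path in item 5 can be expressed as a loop over strategies ordered from cheapest to most expensive. The sketch below is generic and uses stand-in callables so that it runs without a browser; `fetch_with_fallback` is a hypothetical helper, and in practice each callable would wrap one of the §5 templates.

```python
def fetch_with_fallback(url, strategies, is_good):
    """Try each (name, fetch) pair in order; return the first acceptable page."""
    errors = []
    for name, fetch in strategies:
        try:
            page = fetch(url)
        except Exception as exc:
            errors.append((name, repr(exc)))
            continue
        if is_good(page):
            return name, page
        errors.append((name, "empty or blocked"))
    raise RuntimeError(f"all strategies failed: {errors}")


def broken(url):
    raise ConnectionError("blocked")


# Stand-ins for the three fetchers, cheapest first
strategies = [
    ("Fetcher", broken),
    ("DynamicFetcher", lambda url: ""),
    ("StealthyFetcher", lambda url: "<h1>ok</h1>"),
]
name, page = fetch_with_fallback("https://example.com", strategies, is_good=bool)
print(name)  # StealthyFetcher
```

`is_good` is where checks 1 to 4 belong, for example a function that verifies the status and that the target selector returns at least one match.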

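9. Typical use cases

Scenario A: Scrape a paginated product list → WeCom Smart Sheets

1. Decision: the target is a static paginated list → Fetcher + CrawlSpider
2. Template: §5.4 pagination template
3. OUTPUT_SCHEMA: {"url", "title", "price", "timestamp"}
4. Run: crawl → JSON → write to WeCom Smart Sheets

Routing:

→ Spider outputs JSON
→ Load the wecom-unified skill
→ Call the wecom-cli Smart Sheets record-write API

Scenario B: Scrape behind Cloudflare → IMA knowledge base

1. Decision: the target is protected by Cloudflare → StealthyFetcher
2. Template: §5.3 Cloudflare template
3. OUTPUT_SCHEMA: {"url", "content", "timestamp"}
4. Run: bypass → JSON → IMA knowledge base

Routing:

→ StealthyFetcher outputs JSON
→ Save to a local file
→ Load ima-skill
→ Call upload_document(file_path)

Scenario C: Screenshot of a dynamically rendered page → WeCom message

1. Decision: single-page application → DynamicFetcher
2. Template: §5.2 dynamic template
3. Output: a local screenshot file
4. Run: render → screenshot → WeCom file message

Routing:

→ DynamicFetcher outputs the screenshot
→ Load the wecom-unified skill
→ Call the message-sending API to send the image file

Scenario D: Scheduled price monitoring → CSV + Lark Base

1. Decision: scheduled job + static page → WorkBuddy automation + Fetcher
2. Template: §5.1 static template
3. OUTPUT_SCHEMA: {"product", "price", "change", "timestamp"}
4. Run: scheduled scrape → CSV (local archive) + Lark Base (visualisation)

Routing:

→ Fetcher outputs JSON
→ Write the CSV locally
→ Load the lark-unified skill
→ Call the lark-cli Base record-append API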

10. Appendix

10.1 Self-check script

Run the following Python script to verify that the Scrapling environment is complete:

#!/usr/bin/env python3
"""Scrapling installation self-check"""
import sys, platform, subprocess, os

errors, warnings = [], []
pass_count = 0

def check(label, ok, detail=""):
    global pass_count
    if ok:
        pass_count += 1
        print(f"  ✅ {label}")
    else:
        errors.append(label)
        print(f"  ❌ {label}" + (f" - {detail}" if detail else ""))

def check_warn(label, ok, detail=""):
    if not ok:
        warnings.append(f"{label}: {detail}")
        print(f"  ⚠️  {label} - {detail}")
    else:
        print(f"  ✅ {label}")

print("=" * 50)
print("Scrapling installation self-check")
print("=" * 50)

# System
os_name = platform.system()
print(f"\n📋 System: {os_name} {platform.release()}")

# Python (compare as a tuple so that a future 4.0 is not rejected)
ver = sys.version_info
check("Python ≥ 3.10", ver >= (3, 10),
      f"found: {ver.major}.{ver.minor}.{ver.micro}")
print(f"   Python: {sys.executable}")

# Git (check=True makes a non-zero exit count as a failure)
try:
    subprocess.run(["git", "--version"], capture_output=True, check=True)
    check("Git available", True)
except (OSError, subprocess.CalledProcessError):
    check_warn("Git available", False, "not needed for a pip-only install")

# Scrapling
try:
    import scrapling
    check("Scrapling installed", True)
    print(f"   Version: {scrapling.__version__}")
except ImportError:
    check("Scrapling installed", False)

# Playwright
try:
    import playwright
    check("Playwright installed", True)
    # The default browser cache directory differs per operating system
    if os_name == "Darwin":
        br_path = os.path.expanduser("~/Library/Caches/ms-playwright")
    elif os_name == "Windows":
        br_path = os.path.join(os.environ.get("LOCALAPPDATA", ""), "ms-playwright")
    else:
        br_path = os.path.expanduser("~/.cache/ms-playwright")
    if os.path.exists(br_path):
        browsers = os.listdir(br_path)
        check(f"Playwright browsers ({len(browsers)} found)", True)
    else:
        check_warn("Playwright browsers", False, "run: playwright install chromium")
except ImportError:
    check_warn("Playwright installed", False, "Fetcher works; StealthyFetcher does not")

# Fetcher availability
try:
    from scrapling import Fetcher
    check("Fetcher (basic HTTP)", True)
except Exception:
    check("Fetcher (basic HTTP)", False)
try:
    from scrapling import StealthyFetcher
    check("StealthyFetcher (anti-bot bypass)", True)
except Exception:
    check_warn("StealthyFetcher", False, "requires Playwright + a browser")
try:
    from scrapling import DynamicFetcher
    check("DynamicFetcher (browser automation)", True)
except Exception:
    check_warn("DynamicFetcher", False, "requires Playwright + a browser")

print(f"\n{'=' * 50}")
print(f"Passed: {pass_count}")
if errors:
    print(f"Failed: {len(errors)}")
    for e in errors:
        print(f"   - {e}")
if warnings:
    print(f"Warnings: {len(warnings)}")
    for w in warnings:
        print(f"   - {w}")
print(f"{'=' * 50}")

10.2 Reference links

Resource	Link
------	------
Scrapling GitHub	https://github.com/D4Vinci/Scrapling
Scrapling docs	https://scrapling.readthedocs.io/
Scrapling official Skill	`scrapling-official` (for API reference)
wecom-cli (WeCom)	https://github.com/WecomTeam/wecom-cli
lark-cli (Lark)	https://github.com/larksuite/cli
dingtalk-mcp (DingTalk)	https://github.com/open-dingtalk/dingtalk-mcp
DingTalk Docs OpenAPI	https://open.dingtalk.com/document/development/dingtalk-document-overview
IMA OpenAPI	https://ima.qq.com
Playwright Python	https://playwright.dev/python/

10.3 Version

Current version: v2.0 (2026-05-11)
