> ## ⚡ INSTANT VALUE — Install This If You:
> - Need to scrape Baidu/Taobao/Douyin/Zhihu/WeChat/1688 but keep hitting anti-crawl walls
> - Want platform-specific bypass recipes — not generic "use Selenium" advice, but tested configs for each Chinese site
> - Are using Scrapling but need Chinese site configurations (Baidu's cookie walls, Taobao's login gates, Douyin's dynamic rendering)
> - Want to know what's legal — China's data scraping legal boundaries (Criminal Law 285/286, Data Security Law, PIPL)
>
> 🎯 Why this over generic scraping skills? Generic scraping skills give you BeautifulSoup/Selenium tutorials. We give you tested anti-crawl configs for 10+ Chinese websites, legal compliance boundaries (avoid Criminal Law 285!), and Scrapling integration with Chinese site presets. Tutorials vs Recipes — you decide.
>
> 🔗 Based on Scrapling (31K+ GitHub Stars) — the fastest Python scraping framework with adaptive selectors and Cloudflare bypass. We add the China layer on top.
You are a Chinese website data scraping expert. You help AI agents and developers scrape data from Chinese websites that are notoriously difficult to crawl — Baidu, Taobao, Douyin, Zhihu, WeChat, 1688, and more.
Scrapling solves the general scraping problem. We solve the China-specific problem.
Chinese websites have unique anti-crawl mechanisms that generic tools can't handle:
This skill provides tested recipes for each platform, not generic advice.
┌─────────────────────────────────────┐
│           AI Agent / User           │
├─────────────────────────────────────┤
│        cn-data-scraper Skill        │
│  ┌─────────────┐ ┌───────────────┐  │
│  │ Platform    │ │ Legal         │  │
│  │ Recipes     │ │ Compliance    │  │
│  │ (10+ sites) │ │ Boundaries    │  │
│  └──────┬──────┘ └───────┬───────┘  │
│         │                │          │
│  ┌──────▼────────────────▼───────┐  │
│  │   Scrapling Framework         │  │
│  │   (Adaptive Selectors +       │  │
│  │    StealthyFetcher +          │  │
│  │    Camoufox Engine)           │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
Anti-crawl mechanisms:
Scrapling configuration:
from scrapling import StealthyFetcher
# Baidu search with anti-crawl bypass
page = StealthyFetcher.fetch(
    'https://www.baidu.com/s?wd=关键词',
    headless=True,
    network_idle=True,  # Wait for JS execution
    timeout=30000,      # Milliseconds
)
# Extract search results.
# Baidu frequently changes its hashed class names, so anchor on the result
# container class and keep a looser fallback for the inner fields.
results = page.css('div.c-container')
for result in results:
    title = result.css_first('h3 a')
    snippet = result.css_first('span.content-right_8Zs40')
    # Fallback: partial class match if the hashed class name changed
    if not snippet:
        snippet = result.css_first('[class*="content"]')
Key tips:
- User-Agent with Baidu app identifier for mobile results
- network_idle=True: Baidu loads results via AJAX

Anti-crawl mechanisms:
Scrapling configuration:
# Taobao requires login — use cookie injection
page = StealthyFetcher.fetch(
    'https://s.taobao.com/search?q=关键词',
    headless=True,
    network_idle=True,
    # Must inject login cookies
    cookies={
        '_m_h5_tk': 'your_token_here',
        '_m_h5_tk_enc': 'your_enc_token_here',
        'cookie2': 'your_cookie2',
        'sgcookie': 'your_sgcookie',
    },
)
# Extract product data
products = page.css('div.Card--doubleCardWrapper')
for product in products:
    title = product.css_first('span.Title--titleSpan')
    price = product.css_first('span.Price--priceInt')
    sales = product.css_first('span.Sales--sales')
Key tips:
- mtop.taobao.searchapi.search API endpoint for structured data

Anti-crawl mechanisms:
Scrapling configuration:
# Douyin web version — easier than app API
page = StealthyFetcher.fetch(
    'https://www.douyin.com/search/关键词',
    headless=True,
    network_idle=True,
    wait_selector='div.video-card',  # Wait for video cards to load
)
# Extract video data
videos = page.css('div.video-card')
for video in videos:
    title = video.css_first('p.title')
    author = video.css_first('span.author-card-user-name')
    likes = video.css_first('span.video-like-count')
Key tips:
Anti-crawl mechanisms:
Scrapling configuration:
# Zhihu search: this fetches the HTML results page (a JSON endpoint is noted in the tips)
page = StealthyFetcher.fetch(
    'https://www.zhihu.com/search?type=content&q=关键词',
    headless=True,
    network_idle=True,
    # Zhihu requires specific headers. The browser-based fetchers take them
    # as extra_headers; headers= belongs to the plain HTTP Fetcher.
    extra_headers={
        'Referer': 'https://www.zhihu.com/',
    },
)
# Extract search results
results = page.css('div.SearchResult-Card')
for result in results:
    title = result.css_first('h2.ContentItem-title a')
    excerpt = result.css_first('span.RichText')
    author = result.css_first('meta[itemprop="name"]')
Key tips:
- api.zhihu.com/search_v3 returns JSON, which is easier to parse

Anti-crawl mechanisms:
Scrapling configuration:
# WeChat article — direct URL access
page = StealthyFetcher.fetch(
    'https://mp.weixin.qq.com/s/ARTICLE_ID',
    headless=True,
    network_idle=True,
)
# Extract article content
title = page.css_first('h1.rich_media_title')
content = page.css_first('div.rich_media_content')
author = page.css_first('a.rich_media_meta_link')
publish_time = page.css_first('em#publish_time')
Key tips:
- weixin.sogou.com, but has aggressive anti-crawl
- mp.weixin.qq.com direct access for known URLs

Anti-crawl mechanisms:
Scrapling configuration:
# 1688 search
page = StealthyFetcher.fetch(
    'https://s.1688.com/selloffer/offer_search.htm?keywords=关键词',
    headless=True,
    network_idle=True,
)
# Extract product listings
products = page.css('div.offer-item')
for product in products:
    title = product.css_first('a.title')
    price = product.css_first('span.price')
    min_order = product.css_first('span.min-order')
    supplier = product.css_first('a.company-name')
Anti-crawl mechanisms:
Scrapling configuration:
# Xiaohongshu web version
page = StealthyFetcher.fetch(
    'https://www.xiaohongshu.com/search_result?keyword=关键词',
    headless=True,
    network_idle=True,
    wait_selector='section.note-item',
)
# Extract notes
notes = page.css('section.note-item')
for note in notes:
    title = note.css_first('div.title')
    author = note.css_first('span.name')
    likes = note.css_first('span.like-wrapper .count')
Key tips:
Anti-crawl mechanisms:
Scrapling configuration:
# Weibo search
page = StealthyFetcher.fetch(
    'https://s.weibo.com/weibo?q=关键词',
    headless=True,
    network_idle=True,
    cookies={
        'SUB': 'your_sub_cookie',  # Required for search
    },
)
# Extract posts
posts = page.css('div.card-wrap[action-type="feed_list_item"]')
for post in posts:
    author = post.css_first('a.name')
    content = post.css_first('p.txt')
    reposts = post.css_first('a[action-type="fl_forward"] em')
    # 'fl_comment' matches the fl_forward / fl_like naming; the original 'flcomment' looked like a typo
    comments = post.css_first('a[action-type="fl_comment"] em')
    likes = post.css_first('a[action-type="fl_like"] em')
Anti-crawl mechanisms:
Scrapling configuration:
# CSDN article
page = StealthyFetcher.fetch(
    'https://blog.csdn.net/author/article/ID',
    headless=True,
    network_idle=True,
)
# Select the article body directly. The anti-copy layer is only a CSS
# overlay, so the text is already present in the HTML and nothing needs removing.
content = page.css_first('article.baidu_pl')
Anti-crawl mechanisms:
Scrapling configuration:
# Boss Zhipin search
page = StealthyFetcher.fetch(
    'https://www.zhipin.com/web/geek/job?query=关键词',
    headless=True,
    network_idle=True,
    cookies={
        'geek_zp_token': 'your_token',
    },
)
# Extract job listings
jobs = page.css('li.job-card-wrapper')
for job in jobs:
    title = job.css_first('span.job-name')
    salary = job.css_first('span.salary')
    company = job.css_first('h3.company-name')
    location = job.css_first('span.job-area')
This is NOT optional. Violating these can result in criminal prosecution.
| Law | Scope | Max Penalty |
|---|---|---|
| Criminal Law Art. 253-1 | Personal information | 7 years + fine |
| Criminal Law Art. 285 | Unauthorized system access | 7 years |
| Criminal Law Art. 286 | System disruption | 15 years |
| Data Security Law | Data classification | ¥10M fine |
| PIPL (个人信息保护法) | Personal information | ¥50M or 5% revenue |
| Cybersecurity Law | Network data | ¥1M fine |
| Anti-Unfair Competition Law | Business data scraping | ¥3M fine |
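# Install Scrapling with all features (the quotes stop shells such as zsh from expanding the brackets)
pip install "scrapling[all]"
# Or minimal install
pip install scrapling
# Download the browsers used by the browser-based fetchers
scrapling install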
from scrapling import StealthyFetcher, Fetcher
# 1. Simple HTTP fetch (fast, no JS rendering)
page = Fetcher.get('https://example.com')
# 2. Stealthy browser fetch (bypasses anti-bot)
page = StealthyFetcher.fetch(
    'https://www.baidu.com/s?wd=test',
    headless=True,
    network_idle=True,
)
# 3. Resilient selectors that tolerate class-name changes
element = page.find_by_text('价格')  # Find by text content ('价格' means "price")
element = page.css_first('[class*="price"]')  # Partial class match
Chinese websites redesign frequently. Use these strategies to make your selectors resilient:
# ❌ BAD: Exact class names (break on redesign)
title = page.css_first('span.title_3wVZ1')
# ✅ GOOD: Structural selectors
title = page.css_first('h2 a') # Semantic HTML
# ✅ GOOD: Partial class match
title = page.css_first('[class*="title"]')
# ✅ GOOD: Text-based selection
price_label = page.find_by_text('价格')  # '价格' means "price"
# ✅ GOOD: Attribute-based
title = page.css_first('[data-type="title"]')
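# ✅ BEST: Scrapling's adaptive selectors
# Scrapling can store an element's characteristics and re-locate it after the
# markup changes. This has to be enabled explicitly (auto_save / auto_match in
# 0.2.x releases; check the Scrapling docs for the option name in your version).
# A plain css_first call like the one below does not adapt on its own.
element = page.css_first('div.product-title')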
import time
import random

from scrapling import Fetcher

def polite_scrape(urls, min_delay=2, max_delay=5):
    """Scrape with polite rate limiting"""
    results = []
    for url in urls:
        page = Fetcher.get(url)
        results.append(page)
        # Random delay between requests (seconds) to avoid hammering the site
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)
    return results
# Platform-specific rate limits (min/max delay in seconds, request cap per hour)
RATE_LIMITS = {
    'baidu': {'min': 3, 'max': 8, 'max_per_hour': 80},
    'taobao': {'min': 5, 'max': 12, 'max_per_hour': 30},
    'douyin': {'min': 5, 'max': 15, 'max_per_hour': 20},
    'zhihu': {'min': 3, 'max': 8, 'max_per_hour': 40},
    'wechat': {'min': 2, 'max': 5, 'max_per_hour': 60},
    '1688': {'min': 5, 'max': 10, 'max_per_hour': 30},
    'xiaohongshu': {'min': 8, 'max': 20, 'max_per_hour': 10},
    'weibo': {'min': 3, 'max': 8, 'max_per_hour': 40},
}
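The `max_per_hour` values are not enforced by `polite_scrape`, which only sleeps between requests. One way to honor both limits is sketched below. `HourlyRateLimiter` is a hypothetical helper, not part of Scrapling; the `clock` and `sleep` parameters are assumptions added so the class can be exercised without real waiting.

```python
import random
import time
from collections import deque


class HourlyRateLimiter:
    """Sleep a random interval between requests and cap requests per hour."""

    def __init__(self, min_delay, max_delay, max_per_hour,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.max_per_hour = max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # Timestamps of requests made in the last hour

    def wait(self):
        """Block until the next request is allowed, then record it."""
        now = self._clock()
        # Drop timestamps older than one hour.
        while self._sent and now - self._sent[0] >= 3600:
            self._sent.popleft()
        # If the hourly budget is used up, wait for the oldest entry to expire.
        if len(self._sent) >= self.max_per_hour:
            self._sleep(3600 - (now - self._sent[0]))
            self._sent.popleft()
        # Random per-request delay, as in polite_scrape.
        self._sleep(random.uniform(self.min_delay, self.max_delay))
        self._sent.append(self._clock())


# Example with the Baidu values from the table above; call limiter.wait()
# before each request.
limiter = HourlyRateLimiter(min_delay=3, max_delay=8, max_per_hour=80)
```

Keeping one limiter instance per platform lets each site's budget be tracked independently.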
def extract_product(page, platform='generic'):
    """Universal product data extractor"""
    templates = {
        'taobao': {
            'title': 'span.Title--titleSpan',
            'price': 'span.Price--priceInt',
            'sales': 'span.Sales--sales',
            'shop': 'a.ShopName--shopName',
        },
        '1688': {
            'title': 'a.title',
            'price': 'span.price',
            'min_order': 'span.min-order',
            'supplier': 'a.company-name',
        },
        'jd': {
            'title': 'div.sku-name',
            'price': 'span.price',
            'comments': 'a.comment-count',
        },
    }
    # Unknown platforms (including the default 'generic') fall back to the Taobao template
    selectors = templates.get(platform, templates['taobao'])
    fields = {}
    for field, sel in selectors.items():
        element = page.css_first(sel)  # Query once per field
        # .text is a property on Scrapling elements, not a method
        fields[field] = element.text if element is not None else None
    return fields
def extract_social_post(page, platform='generic'):
    """Universal social media post extractor"""
    templates = {
        'weibo': {
            'author': 'a.name',
            'content': 'p.txt',
            'reposts': 'a[action-type="fl_forward"] em',
            'comments': 'a[action-type="fl_comment"] em',
            'likes': 'a[action-type="fl_like"] em',
        },
        'xiaohongshu': {
            'author': 'span.name',
            'content': 'span.note-text',
            'likes': 'span.like-wrapper .count',
            'collects': 'span.collect-wrapper .count',
        },
        'douyin': {
            'author': 'span.author-card-user-name',
            'content': 'p.title',
            'likes': 'span.video-like-count',
        },
    }
    # Unknown platforms (including the default 'generic') fall back to the Weibo template
    selectors = templates.get(platform, templates['weibo'])
    fields = {}
    for field, sel in selectors.items():
        element = page.css_first(sel)  # Query once per field
        # .text is a property on Scrapling elements, not a method
        fields[field] = element.text if element is not None else None
    return fields
scripts/scrape.sh: Quick CLI Scraper
#!/bin/bash
# cn-data-scraper CLI tool
# Usage: ./scripts/scrape.sh <platform> <keyword> [output_file]
PLATFORM=$1
KEYWORD=$2
OUTPUT=${3:-/tmp/scrape_result.json}
if [ -z "$PLATFORM" ] || [ -z "$KEYWORD" ]; then
    echo "Usage: ./scripts/scrape.sh <platform> <keyword> [output_file]"
    echo "Platforms: baidu zhihu weibo csdn (use the Python API for the others)"
    exit 1
fi
# Pass the arguments through sys.argv instead of interpolating them into the
# Python source, so quotes in the keyword cannot break or inject code.
python3 - "$PLATFORM" "$KEYWORD" "$OUTPUT" <<'PY'
import json
import sys
from urllib.parse import quote

from scrapling import StealthyFetcher, Fetcher

platform, keyword, output = sys.argv[1:4]
q = quote(keyword, safe='')  # Percent-encode non-ASCII keywords
URLS = {
    'baidu': f'https://www.baidu.com/s?wd={q}',
    'zhihu': f'https://www.zhihu.com/search?type=content&q={q}',
    'weibo': f'https://s.weibo.com/weibo?q={q}',
    'csdn': f'https://so.csdn.net/so/search?q={q}',
}
url = URLS.get(platform)
if not url:
    print(json.dumps({'error': f'Platform {platform} not supported for CLI scraping. Use Python API for full features.'}))
    sys.exit(1)
try:
    if platform in ['baidu', 'zhihu', 'weibo']:
        page = StealthyFetcher.fetch(url, headless=True, network_idle=True, timeout=30000)
    else:
        page = Fetcher.get(url)
    # Extract all text content (.text is a property on Scrapling elements)
    texts = [el.text for el in page.css('p, span, h1, h2, h3, h4, h5, h6') if el.text]
    result = {
        'platform': platform,
        'keyword': keyword,
        'url': url,
        'content_count': len(texts),
        'preview': texts[:20],
    }
    with open(output, 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    print(json.dumps(result, ensure_ascii=False, indent=2))
except Exception as e:
    print(json.dumps({'error': str(e)}))
    sys.exit(1)
PY
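The search URLs in this document place the keyword directly in the URL string. For non-ASCII keywords it is usually safer to percent-encode them first. The snippet below is a small sketch using only the standard library; the variable names are illustrative, and the two URL shapes are taken from the Douyin and Baidu recipes above.

```python
from urllib.parse import quote, urlencode

keyword = '关键词'  # Placeholder meaning "keyword"

# Path segment (Douyin-style URL): percent-encode the keyword itself.
douyin_url = 'https://www.douyin.com/search/' + quote(keyword, safe='')

# Query string (Baidu-style URL): urlencode builds and encodes key=value pairs.
baidu_url = 'https://www.baidu.com/s?' + urlencode({'wd': keyword})

print(douyin_url)
print(baidu_url)
```

`urlencode` also handles spaces and reserved characters such as `&`, which would otherwise split the query string.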
For AI agent workflows, this skill can be used with MCP servers:
# Example: MCP tools for scraping Chinese websites
# FastMCP provides the @tool() decorator; the low-level mcp.server.Server does not.
from mcp.server.fastmcp import FastMCP

server = FastMCP("cn-data-scraper")

@server.tool()
def scrape_chinese_site(platform: str, keyword: str, max_results: int = 10) -> dict:
    """Scrape data from Chinese websites with anti-crawl bypass.

    Args:
        platform: Target platform (baidu/taobao/douyin/zhihu/wechat/1688/xiaohongshu/weibo)
        keyword: Search keyword
        max_results: Maximum number of results to return

    Returns:
        Dictionary with scraped data and metadata
    """
    # Implementation using Scrapling + platform recipes goes here
    raise NotImplementedError

@server.tool()
def check_legal_compliance(scraping_plan: str) -> dict:
    """Check if a scraping plan complies with Chinese law.

    Args:
        scraping_plan: Description of what data you plan to scrape

    Returns:
        Compliance assessment with risk level and legal references
    """
    raise NotImplementedError
| Feature | Scrapling | BeautifulSoup | Selenium | Playwright |
|---|---|---|---|---|
| Speed | 784x BS4 | Baseline | Slow | Medium |
| Anti-crawl bypass | ✅ Built-in | ❌ | ⚠️ Manual | ⚠️ Manual |
| Adaptive selectors | ✅ Auto | ❌ | ❌ | ❌ |
| Cloudflare bypass | ✅ Native | ❌ | ⚠️ Plugin | ⚠️ Plugin |
| Chinese site configs | ❌ (We add this) | ❌ | ❌ | ❌ |
| Legal compliance | ❌ (We add this) | ❌ | ❌ | ❌ |
| Memory usage | Low | Very Low | High | Medium |
| Setup complexity | pip install | pip install | Driver needed | pip install |
Our value-add: Scrapling handles the technical scraping. We add the China layer (platform recipes + legal compliance + adaptive selectors for Chinese sites).