Use this skill whenever working on news/information crawlers. Treat each site as one source in a long-running news collection system, not as a one-off page parser.
- Keywords to look for: `ajax`, `fetch`, `page`, `pageNum`, `columnId`, `channel`.
- Use the `user-js-reverse` MCP when JS/signature/cookie logic is involved.
- Use `get_market_news_info` as the lightweight main table.
- Use `get_market_news_content` for `content`, `content_html`, `images`, and `raw_data`.

```sql
UNIQUE KEY uk_source_unique_id (source,source_unique_id)
```
- `news_id` is only a relation/index, not the content table unique key.
- `media_name`: collection site name.
- `original_source`: true source displayed by the article, empty string if unavailable.
- `summary`: article-specific summary only; never store fixed site description, column intro, or template text.
- `author`: do not use the editor as a fake author.
- `stock_list`: list of stock codes associated with the article, as a plain string array such as `["AAPL", "INTC"]`. Extract it when the page or API provides it; otherwise store an empty array `[]`. Do not store arrays of objects or attach URLs or other extra information.
- `crawl_error`: short error only; do not store full HTML/response bodies.
- `extra_data`; large/raw fields go to the content table.
- If the body is empty but `images` is non-empty: success; keep `content` empty and store `content_html` if available.
- Otherwise: `crawl_status = 4` / body empty.

Prefer structured API data over HTML parsing for content extraction.
Data source priority:
- `get_text()` split by newline (last resort).

Paragraph rules:
- Separate paragraphs with `\n\n` (double newline), not single `\n`.
- `description` as the first paragraph.
- Split HTML on `<p>` tags; also consider heading and other block-level tags as paragraph-level elements. Skip non-content areas (disclaimers, author bios, related articles) by scoping to the correct content container first.
- If no `<p>` tags exist, fall back to splitting `get_text("\n")` and filtering empty lines.

Image position in content:
- Insert `[图片:caption]` in `content` at the image's position to preserve the spatial relationship between text and images.
- The `images` field stores the full URL list (ordered by appearance).
- `content_html` preserves the original HTML structure for re-rendering if needed.

Why this matters:
- `urllib.parse.quote`).

News crawlers must support first-time backfill, daily incremental collection, failed retry, batched dedupe, and resumability.
Standard strategy:
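- Query `MAX(publish_time)` per `source` + `column_name`.
- No existing data: threshold = `now - backfill_days` (default 90 days).
- Existing data: threshold = `max_publish_time - overlap_hours` (default 6 hours).
- Dedupe each batch against the database with `IN (...)`.

Never collect all columns/all historical pages into memory and only insert at the end.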
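The threshold rule above can be sketched as a small pure function. This is a minimal illustration, not the project's actual helper: the name `crawl_threshold` and its signature are assumptions, and `max_publish_time` stands for the result of the `MAX(publish_time)` query (`None` when the table has no rows for that source and column).

```python
from datetime import datetime, timedelta
from typing import Optional


def crawl_threshold(
    max_publish_time: Optional[datetime],
    now: datetime,
    backfill_days: int = 90,
    overlap_hours: int = 6,
) -> datetime:
    """Return the oldest publish_time worth fetching for one source + column_name."""
    if max_publish_time is None:
        # First run for this source/column: backfill a fixed window.
        return now - timedelta(days=backfill_days)
    # Incremental run: re-scan a small overlap so late-published items are not missed.
    return max_publish_time - timedelta(hours=overlap_hours)


now = datetime(2024, 5, 1, 12, 0, 0)
first_run = crawl_threshold(None, now)
incremental = crawl_threshold(datetime(2024, 5, 1, 9, 0, 0), now)
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the rule easy to test and makes a resumed run reproducible.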
- Write with `INSERT ... ON DUPLICATE KEY UPDATE`.

crawl_status:
- `1`: success
- `2`: detail request failed
- `3`: page structure abnormal
- `4`: body empty
- `5`: parse exception

File header: every crawler script must include a module docstring with `@Desc` followed by a `来源url:` line pointing to the target page URL. This makes it easy to locate the crawl target when debugging.
```python
"""
@Author: your name
@File  : 香港01财经快讯.py python3
@Desc  : HK01 (www.hk01.com, Traditional Chinese site) financial news flash crawler
来源url:https://www.hk01.com/channel/396/%E8%B2%A1%E7%B6%93%E5%BF%AB%E8%A8%8A
"""
```
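The header rule can also be checked mechanically. The sketch below is one possible check, not part of the original tooling: `has_source_header` is a hypothetical helper name, and the sample script and its `example.com` URL are made up for illustration. It parses the script without executing it and looks for an `@Desc` line followed by a `来源url:` line in the module docstring.

```python
import ast


def has_source_header(script_source: str) -> bool:
    """Check that the module docstring has an @Desc line followed by a 来源url: line."""
    docstring = ast.get_docstring(ast.parse(script_source)) or ""
    lines = [line.strip() for line in docstring.splitlines()]
    for index, line in enumerate(lines):
        if line.startswith("@Desc"):
            # The source URL line must appear somewhere after @Desc.
            return any(later.startswith("来源url:") for later in lines[index + 1:])
    return False


# Hypothetical script contents used only to exercise the check.
good_script = '''"""
@File : demo.py
@Desc : demo crawler
来源url:https://example.com/news
"""
print("ok")
'''
bad_script = 'print("ok")\n'

header_ok = has_source_header(good_script)
header_missing = has_source_header(bad_script)
```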
Add concise comments to non-obvious logic. Do not narrate what code does line-by-line. Focus on:
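- `"""Get the time threshold for incremental crawling: if the table has data, use the latest time minus overlap_hours; otherwise look back backfill_days days"""`
- `if not self.db_pool:  # local debugging without a DB: go straight to full backfill`
- `should_stop = (not page_has_newer)  # stop paging when the whole page is older than the threshold`
- `if not content_text:  # blocks is empty: fall back to HTML extraction`

Do NOT comment: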
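Every crawler must have built-in monitoring and alerting; do not rely on an external monitoring platform being configured separately.
Three alert layers (all must be implemented):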
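- `total_success == 0` → Feishu/notification: "0 rows inserted; check whether the site or API is abnormal".
- `total_fail / (total_success + total_fail) > 0.5` → Feishu/notification: "Failure rate too high; anti-crawling may have been triggered".
- Uncaught exception in `main()` → Feishu/notification: "Program error: {err}", then re-raise.

```python
# Example structure of main()
def main():
    crawler = XxxCrawler(max_workers=3)
    total_success, total_fail = crawler.run()
    if total_success == 0:
        notify("0 rows inserted in this run; check whether the site or API is abnormal")
    elif total_fail > 0 and total_fail / (total_success + total_fail) > 0.5:
        notify("Detail fetch failure rate is too high; anti-crawling may have been triggered")


if __name__ == "__main__":
    try:
        main()
    except Exception as err:
        notify(f"Program error: {err}")
        raise
```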
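The first two alert layers depend only on the run counters, so they can be isolated in a pure function that is easy to unit-test. This is a sketch under assumptions: `pick_alert` and `max_fail_ratio` are hypothetical names, and the returned text would be handed to whatever `notify` implementation the crawler uses.

```python
from typing import Optional


def pick_alert(total_success: int, total_fail: int, max_fail_ratio: float = 0.5) -> Optional[str]:
    """Return the alert text for a finished run, or None when the run looks healthy."""
    if total_success == 0:
        return "0 rows inserted in this run; check whether the site or API is abnormal"
    # total_success > 0 here, so the denominator is never zero.
    if total_fail / (total_success + total_fail) > max_fail_ratio:
        return "Detail fetch failure rate is too high; anti-crawling may have been triggered"
    return None


healthy = pick_alert(10, 2)
empty_run = pick_alert(0, 0)
mostly_failed = pick_alert(2, 10)
```

Checking the zero-success case first also guards the ratio against a division by zero when nothing was attempted.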
Break on consecutive empty pages:
The paging stop condition must distinguish two cases, which must not be conflated:
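| Case | should_stop | consecutive_empty | Action |
| ------ | :---: | :---: | ------ |
| All articles on the page are older than the threshold | True | not incremented | Normal end (reached the bottom) |
| API request failed / returned empty | False | +1 | Handled by the consecutive-empty-page counter |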
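Key principles:
- When the API request fails or returns nothing, return `([], False)` rather than `([], True)`.
- Set `should_stop=True` only when the page is confirmed to contain data and all of it is older than the threshold.
- The `consecutive_empty` check must run before the `seen` dedupe, using `raw_tasks` (the raw list result) rather than the deduped `tasks`.

```python
for page in range(max_pages):
    raw_tasks, should_stop = self.parse_list_page(end_index, threshold)
    if not raw_tasks:
        consecutive_empty += 1
        if consecutive_empty >= 3:
            print(f"Warning: {consecutive_empty} consecutive empty pages; the API may be abnormal")
            break
    else:
        consecutive_empty = 0
    tasks = [t for t in raw_tasks if t["id"] not in seen]
    seen.update(t["id"] for t in tasks)
    tasks = self.filter_not_exists(tasks)
    # ... process tasks ...
    if should_stop:
        break
```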
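To see both stop conditions in action, the loop can be run against stub list pages. The following is a self-contained sketch, not the crawler class itself: `collect_ids`, `flaky_api`, and `dead_api` are hypothetical names, and each stub returns the same `(raw_tasks, should_stop)` pair described in the table above.

```python
from typing import Callable, Dict, List, Tuple

Task = Dict[str, int]
ListPage = Tuple[List[Task], bool]  # (raw_tasks, should_stop)


def collect_ids(
    fetch_page: Callable[[int], ListPage], max_pages: int = 50, max_empty: int = 3
) -> Tuple[List[int], str]:
    """Page through a list API and report which condition ended the loop."""
    seen = set()
    collected = []
    consecutive_empty = 0
    for page in range(max_pages):
        raw_tasks, should_stop = fetch_page(page)
        # Count emptiness on the raw result, before any dedupe.
        if not raw_tasks:
            consecutive_empty += 1
            if consecutive_empty >= max_empty:
                return collected, "consecutive_empty"
        else:
            consecutive_empty = 0
        for task in raw_tasks:
            if task["id"] not in seen:
                seen.add(task["id"])
                collected.append(task["id"])
        if should_stop:
            return collected, "older_than_threshold"
    return collected, "max_pages"


def flaky_api(page: int) -> ListPage:
    # Page 1 fails once; page 2 is entirely older than the threshold.
    pages = {0: ([{"id": 1}, {"id": 2}], False), 1: ([], False), 2: ([{"id": 2}, {"id": 3}], True)}
    return pages.get(page, ([], False))


def dead_api(page: int) -> ListPage:
    return [], False


flaky_result = collect_ids(flaky_api)
dead_result = collect_ids(dead_api)
```

A single failed page does not end the run in `flaky_api`, while `dead_api` stops after three empty pages instead of scanning all `max_pages`.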
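- Some sites reject requests sent through the default proxy (for example `449 Foreign Host Forbidden`).
- Pass `proxies={}` to bypass it.

```python
# Requests that need to bypass the default proxy
new_request("GET", url, "json", proxies={}, headers=headers)
```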
- Build the `raw_data` field from `item` at insert time; there is no need to store the full API response in advance.

Do not stop at syntax/lint checks. Verify:
- `summary` is a real article summary.
- `original_source` is the true article source.
- `proxies={}`).

Prefer a class-based crawler with these responsibilities: