Automated pipeline for Chinese news aggregation and digest generation.
# Step 1: Install dependencies
pip install requests beautifulsoup4
# Step 2: One-click initialization (create tables + insert sample websites + keywords)
python scripts/news_digest_v2/init_db.py
# Step 3: Run the digest
python scripts/news_digest_v2/run_all_stages.py
Or do everything with a single command:
python scripts/news_digest_v2/quick_start.py
Output: .news-digest-out.md (workspace) + 新闻摘要_YYYYMMDD_HHMMSS.txt (desktop)
Stage 1: Fetch → Scrape websites → Filter → Save to SQLite DB
Stage 2: Process → Deduplicate (≥90% similarity) → Tag keywords
Stage 2.5: LLM → Batch LLM summarization (optional, requires API key)
Stage 3: Output → Read LLM summaries (fallback to rule summaries) → Save to files
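The deduplication step in Stage 2 can be illustrated with a small sketch. This is not the code in similarity.py; it assumes Jaccard similarity over character bigrams of the title and a first-seen-wins policy, and the function names `jaccard` and `mark_duplicates` are hypothetical.

```python
def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over character n-grams (Chinese text has no word spaces)."""
    grams_a = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    grams_b = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def mark_duplicates(titles: list[str], threshold: float = 0.9) -> list[bool]:
    """Flag each title that is >= threshold similar to an earlier, kept title."""
    kept: list[str] = []
    flags: list[bool] = []
    for title in titles:
        is_dup = any(jaccard(title, earlier) >= threshold for earlier in kept)
        flags.append(is_dup)
        if not is_dup:
            kept.append(title)  # only non-duplicates become comparison targets
    return flags


flags = mark_duplicates(["央行宣布降准", "央行宣布降准", "新能源车销量增长"])
```

Flagging rather than dropping matches the `articles` table description, which records a duplicate flag instead of deleting rows.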
requests, beautifulsoup4
Run the init script to create tables and seed them with sample data:
python scripts/news_digest_v2/init_db.py
This creates:
Default database path: news.db (in the skill directory).
Override with environment variable: NEWS_DIGEST_DB=/your/path/news.db
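A minimal sketch of how the path lookup might behave, assuming the environment variable simply takes precedence over the default file in the skill directory. The helper name `resolve_db_path` is hypothetical and is not taken from config.py.

```python
import os
from pathlib import Path


def resolve_db_path(skill_dir: str = ".") -> Path:
    """Return NEWS_DIGEST_DB if it is set and non-empty, else <skill_dir>/news.db."""
    override = os.environ.get("NEWS_DIGEST_DB")
    return Path(override) if override else Path(skill_dir) / "news.db"
```

With this behavior, exporting `NEWS_DIGEST_DB` before running any script points the whole pipeline at a different database file.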
After initialization, add or remove websites and keywords via SQL:
-- Add a website
INSERT INTO monitor_websites (name, url, selector, category, priority)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);
-- Add a keyword
INSERT INTO system_keywords (keyword, category, weight)
VALUES ('新能源', 'core', 5);
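The same inserts can be issued from Python with the standard `sqlite3` module. The sketch below runs against an in-memory database so it is safe to try; the column names come from the SQL above, while the column types, the `id` columns, and the helper names `add_website` and `add_keyword` are assumptions.

```python
import sqlite3

# In-memory stand-in for news.db. Column names mirror the INSERT statements
# above; the types and the id columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE monitor_websites (
        id INTEGER PRIMARY KEY, name TEXT, url TEXT,
        selector TEXT, category TEXT, priority INTEGER
    );
    CREATE TABLE system_keywords (
        id INTEGER PRIMARY KEY, keyword TEXT, category TEXT, weight INTEGER
    );
    """
)


def add_website(conn, name, url, selector, category, priority):
    """Insert one monitored site using bound parameters."""
    conn.execute(
        "INSERT INTO monitor_websites (name, url, selector, category, priority) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, url, selector, category, priority),
    )
    conn.commit()


def add_keyword(conn, keyword, category, weight):
    """Insert one scoring keyword using bound parameters."""
    conn.execute(
        "INSERT INTO system_keywords (keyword, category, weight) VALUES (?, ?, ?)",
        (keyword, category, weight),
    )
    conn.commit()


add_website(conn, "示例网站", "https://example.com", "a", "财经", 1)
add_keyword(conn, "新能源", "core", 5)
```

To apply this to the real database, replace `":memory:"` with the path to news.db and skip the `CREATE TABLE` statements, since init_db.py already creates the tables.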
| Table | Purpose |
|---|---|
| articles | Scraped news articles (title, content, URL, date, keywords, duplicate flag) |
| monitor_websites | Monitored websites (name, URL, CSS selector, category, enabled) |
| system_keywords | Keywords for relevance scoring (core vs auxiliary, with weight) |
| digest_output | LLM-generated summaries (optional) |
python scripts/news_digest_v2/run_all_stages.py
Takes ~13 minutes (network + LLM bound).
python scripts/news_digest_v2/quick_start.py
Runs init + fetch + process + output in one shot.
schedule: "0 20 * * *" # Daily 20:00
payload:
run: python scripts/news_digest_v2/run_all_stages.py
then: read .news-digest-out.md and send to messaging
timeout: 900 # 15 minutes
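【Source: Title】
Summary text (smart paragraph selection, within 300 characters, including key data and core facts)
发布时间 (published): YYYY-MM-DD
原文链接 (original link): http://...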
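Incomplete sentences are filtered out automatically:
Tutorial and guide content is filtered out entirely:
The tutorial keyword list for the social category in rules_config.py
Not simple truncation. Each paragraph is scored by: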
Then filtered: removes image captions, journalist bylines, ads, subtitles, boilerplate.
Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.
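Tutorials (all filtered): 教程, 指南, 攻略, 入门, 自学, 从零开始, 手把手, 保姆级教程, 怎么做, 如何使用, 操作步骤, 图文教程, 视频教程, 科研绘图, PS教程, Illustrator, etc.
Corporate PR and advertorials (all filtered): 产能突破, 全线投产, 技术溢出, 供应链底气, 跨界营销, 负面舆情, 品鉴官, 品牌定位, etc.
Education, social events, and awards (all filtered): 十佳, 颁奖仪式, 表彰, 职校生, 职业院校, 评选, 杰出代表, 工匠精神, etc.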
Invalid keywords: clickbait patterns, advertising, webpage navigation elements.
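The keyword-based exclusions above amount to substring matching against per-category lists. The sketch below shows the idea with a few of the listed keywords; it is not the content of rules_config.py or filters.py, and the names `EXCLUSION_KEYWORDS`, `exclusion_reason`, and the `tutorial` label are hypothetical.

```python
from typing import Optional

# A small subset of the keywords listed above, grouped by category.
# The category labels are illustrative, not the project's actual keys.
EXCLUSION_KEYWORDS = {
    "tutorial": ("教程", "指南", "攻略", "入门", "手把手", "保姆级教程"),
    "corporate_pr": ("产能突破", "全线投产", "品鉴官", "品牌定位"),
    "education_social": ("十佳", "颁奖仪式", "表彰", "评选"),
}


def exclusion_reason(text: str) -> Optional[str]:
    """Return the first category whose keyword occurs in text, or None to keep it."""
    for category, keywords in EXCLUSION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


titles = ["Python 入门教程", "某公司产能突破新高", "央行下调存款准备金率"]
kept = [t for t in titles if exclusion_reason(t) is None]
```

Returning the matched category instead of a bare boolean makes it easier to log why an article was dropped when tuning the lists.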
| Variable | Default | Description |
|---|---|---|
| NEWS_DIGEST_DB | news.db | SQLite database path |
| NEWS_DIGEST_LLM_API_KEY | (empty) | LLM API key for Stage 2.5 summarization |
| NEWS_DIGEST_LLM_BASE_URL | (empty) | LLM API base URL |
| NEWS_DIGEST_LLM_MODEL | qwen3.6-plus | LLM model name |
If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
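The skip-and-fall-back behavior can be sketched as follows. This assumes that both the API key and the base URL must be present for Stage 2.5 to run, which the source does not spell out; `llm_config` and `pick_summary` are hypothetical helper names.

```python
import os
from typing import Mapping, Optional


def llm_config(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return LLM settings, or None when Stage 2.5 should be skipped."""
    api_key = env.get("NEWS_DIGEST_LLM_API_KEY", "")
    base_url = env.get("NEWS_DIGEST_LLM_BASE_URL", "")
    if not api_key or not base_url:  # assumption: both values are required
        return None
    return {
        "api_key": api_key,
        "base_url": base_url,
        "model": env.get("NEWS_DIGEST_LLM_MODEL", "qwen3.6-plus"),
    }


def pick_summary(llm_summary: Optional[str], rule_summary: str) -> str:
    """Prefer the LLM summary; fall back to the rule-based one if it is missing."""
    return llm_summary or rule_summary
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary instead of mutating the real process environment.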
news-digest/
├── SKILL.md
└── scripts/
    └── news_digest_v2/
        ├── __init__.py
        ├── config.py               # DB path, websites, keywords, holidays, LLM config
        ├── database.py             # SQLite operations
        ├── fetcher.py              # Web scraping + smart summary extraction
        ├── filters.py              # Content filtering logic
        ├── formatter.py            # Output formatting + incomplete sentence handling
        ├── init_db.py              # One-click database initialization (NEW in v1.0.1)
        ├── quick_start.py          # One-command full pipeline (NEW in v1.0.1)
        ├── rules_config.py         # Exclusion rules, keywords, dateline patterns
        ├── similarity.py           # Jaccard deduplication
        ├── stage1_fetch.py         # Stage 1 entry (fetch)
        ├── stage2_process.py       # Stage 2 entry (dedup + keywords)
        ├── stage2_5_llm_summary.py # Stage 2.5 (LLM batch summarization)
        ├── stage3_output.py        # Stage 3 entry (read + format + save)
        └── run_all_stages.py       # Full pipeline entry
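Q: It won't run after installation?
A: Make sure you ran init_db.py first to initialize the database. Without the database and sample data, the later steps fail.
Q: pip install fails?
A: Try pip install --upgrade pip and then install again. If the network is the problem, use pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests beautifulsoup4.
Q: Some websites fail to fetch?
A: This is expected. Some sites have anti-scraping measures or SSL problems; the script moves on to the remaining sites, and the final output is unaffected.
Q: The output is empty?
A: Check whether the database contains data. Run python scripts/news_digest_v2/init_db.py to re-initialize.
Q: How do I customize the monitored websites?
A: Insert rows into the monitor_websites table via SQL. Fields: name, url, selector, category, priority.
Q: Will the database keep growing?
A: Yes, by roughly 30-50 articles per day. Clean out old data periodically, or delete news.db and re-initialize.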
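Periodic cleanup could look like the sketch below. It assumes the `articles` table stores the publication date as an ISO `YYYY-MM-DD` string in a column named `publish_date`; that column name and the helper `prune_old_articles` are hypothetical, so check the real schema before running anything similar against news.db.

```python
import sqlite3
from datetime import date, timedelta


def prune_old_articles(conn, keep_days=30, today=None):
    """Delete articles older than keep_days; return the number of rows removed."""
    today = today or date.today()
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    # ISO dates compare correctly as strings, so a plain '<' is enough here.
    cur = conn.execute("DELETE FROM articles WHERE publish_date < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Demo on an in-memory table with the assumed column name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, publish_date TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("old", "2024-01-01"), ("new", "2024-03-01")],
)
removed = prune_old_articles(conn, keep_days=30, today=date(2024, 3, 10))
```

Back up news.db first, since deleted rows also stop serving as comparison targets for later deduplication.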
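- … INSERT OR IGNORE); old news is retained.
- … is_duplicate = 1, not deleted.
- … rules_config.py):
- parse_rmrbhwb routing adds support for paper.people.com.cn/rmrb/ (new database entry id=47, priority=1)
- cross_day_dedup.py adds hard rule 4 (mutual title containment detection) and raises the Jaccard weight 0.5→0.6
- In formatter.py, the "收录新闻" (articles included) count now uses the actual output count instead of the pre-filter count
- … the length of filtered_news (after excluding duplicates), but without subtracting the items removed by the content-type blacklist
- … corporate_scandal category)
- … cross_day_dedup.py, which compares against the digests of the last 3 days and automatically blocks cross-day duplicate news
- stage1_fetch.py now reads len(WEBSITES) dynamically instead of the hard-coded 42
- decode_response() has been merged into the main flow and supports all known GBK sources
- fetcher.py adds a decode_response() function that forces GBK decoding for known GBK-encoded sources (人民日报海外版, People's Daily Overseas Edition), fixing the Cyrillic mojibake at its root
- stage2_5_llm_summary.py adds garbled-title detection; when a garbled title is found, the LLM is prompted to generate an accurate title from the body text
- formatter.py adds a fallback filter for garbled titles
- TITLE_EXCLUDE_KEYWORDS (评论丨/时评/社评/深度观察/记者观察/招聘/面试/递补/人事任免/讣告/专访, etc.)
- URL_EXCLUDE_PATTERNS (the People's Daily Online commentary channel, etc.)
- formatter.py automatically skips non-hard-news types on output
- qwen-plus (does not exist, 400 error) changed to qwen3.6-plus
- corporate_pr (corporate PR and advertorials) + education_social (education, awards, selections)
- … ·, spaces)
- … parse_cnr): on CNR (央广网) pages the title and body sit inside the same … tag; a dedicated parser now takes only … as the title, so the title and body no longer merge into an overlong title that gets filtered
- … tag is used to detect the encoding, improving the fetch success rate
- init_db.py for one-click database initialization with sample data
- quick_start.py for one-command full pipeline

6 versions in total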