Overview

cn-financial-scraper: a financial data scraper

Data Sources and Update Schedule

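| Data type | Source | Update frequency |
|---------|------|---------|
| Full list of financial institutions | National Financial Regulatory Administration, China Banking Association, Asset Management Association of China, Securities Association of China | Quarterly |
| Listed-company reports | Eastmoney (东方财富), CNINFO (巨潮资讯) | Scraped on demand |
| Broker research reports | Eastmoney (东方财富) | Scraped on demand |
| News | Eastmoney (东方财富), Tonghuashun (同花顺) | Scraped on demand |
| Fund products | Tiantian Fund (天天基金), fund companies' official websites | Scraped on demand |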

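Feature Overview

| Feature | Module | Description |
|------|------|------|
| Full institution list | full_institution_crawler.py | Fetches the full list of financial institutions from regulators |
| Web scraping | scraper.py | Basic web page content scraping |
| Page interaction | web_parser.py | Complex page operations (click, form fill, scroll) |
| Product parsing | web_parser.py | Parses funds, ETFs, FOFs, stocks, and bonds |
| Institution scraping | institution_scraper.py | Batch scraping of 500+ financial institutions |
| Institution updates | institution_updater.py | Quarterly update of the institution list |
| Announcement download | announcement_scraper.py | Announcement search, page scanning, PDF download |
| Document parsing | document_parser.py | Content extraction from PDF, Word, and Excel files |
| Product analysis | analyzer.py | Financial metrics, portfolio replication |
| Company reports | company_report_scraper.py | Annual, semi-annual, and quarterly reports of listed companies |
| Listed-company list | stock_list_updater.py | Full list of A-share listed companies plus a market snapshot |
| News scraping | news_scraper.py | News and sentiment analysis |
| Visual reports | visualization_reporter.py | ASCII charts, analysis report export |
| Real-time monitoring | realtime_monitor.py | Dynamic page monitoring, new-announcement detection |
| Institution-name scraper | name_scraper.py | Scrapes an institution's website by name; anti-bot handling; bilingual display |
| Broker research | research_report_scraper.py | Broker research report scraping (rating, analyst, target price) |
| Comprehensive reports | comprehensive_report_scraper.py | Single entry point combining periodic reports, broker research, and announcements |
| Full index | report_indexer.py | Full-scan indexer (SQLite, resumable scans) |
| Report export | report_exporter.py | Report generation and visual export (PPT, PDF, Word, Excel) |
| Batch scraping | batch_institution_crawler.py | Batch scraping of multiple institutions or companies |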

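Known Limitations

> The following situations may make a feature unavailable or degrade its results:

Limitations of Web Scraping

| Scenario | Likely problem | Workaround |
|------|---------|----------|
| Pages that require login | Data cannot be retrieved | Some pages allow guest access |
| CAPTCHA | Scraping is blocked | Retry later or use dynamic rendering mode |
| Cloudflare protection | Anti-bot mechanism is triggered | Use use_dynamic=True |
| Page structure changes | Parsing fails | Retry later; the code adapts automatically |
| Target site taken offline | Page not found | Check that the URL is correct |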

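Data Availability Limits

| Data type | Notes |
|---------|------|
| Real-time data | Must be scraped yourself; local data lags behind |
| Historical data | Depends on the target site's archiving policy |
| Individual-stock research | Covers only brokers that publish research publicly |
| Fund holdings | Shows only the latest reporting period |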

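Batch Scraping Recommendations

Keep each batch to 50 items or fewer, with at least 2 seconds between requests.
A full scan (1,000+ stocks) takes several hours; use resumable scanning.
Frequent requests may trigger rate limiting on the target site; running at night is recommended.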
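
The recommendations above can be sketched as a small helper that splits a work list into batches of at most 50 and pauses at least 2 seconds between requests. This is a minimal sketch rather than the project's own throttling code: `chunked`, `throttled_batches`, and the `fetch` and `sleep` parameters are hypothetical names, and `fetch` stands in for whatever scraping call you use.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def throttled_batches(
    items: Iterable[T],
    fetch: Callable[[T], R],
    batch_size: int = 50,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[R]]:
    """Run `fetch` over `items` batch by batch, pausing between requests.

    `fetch` is a hypothetical placeholder for a real scraping call.
    """
    first = True
    for batch in chunked(items, batch_size):
        results: List[R] = []
        for item in batch:
            if not first:
                sleep(delay_seconds)  # wait before every request except the first
            first = False
            results.append(fetch(item))
        yield results
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; in normal use the default `time.sleep` applies.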

Querying Scrapable Financial Institutions

> The complete list is available through the CLI or the Python API and is not reproduced in this document.

CLI Queries

# Show statistics (default)
python -m scripts.scrapable_registry stat

# Query by type
python -m scripts.scrapable_registry list 基金管理公司
python -m scripts.scrapable_registry list 证券公司
python -m scripts.scrapable_registry list 保险公司

# Search institutions
python -m scripts.scrapable_registry search 华夏
python -m scripts.scrapable_registry search 招商

Python API Queries

from scripts.scrapable_registry import ScrapableRegistry

registry = ScrapableRegistry()

# Get all scrapable institutions
all_institutions = registry.get_all_scrapable()

# Get institutions by type
fund_companies = registry.get_by_type("基金管理公司")
securities = registry.get_by_type("证券公司")

# Search institutions
results = registry.search("华夏")
for inst in results:
    print(f"{inst['name']} -> {inst['website']}")

# Get statistics
stats = registry.get_stats()
print(f"Total institutions: {stats['total']}, scrapable: {stats['scrapable_count']}")

Sample output:

# search() returns
[
  {'name': '华夏基金管理股份有限公司', 'type': '基金管理公司', 'website': 'https://www.chinaamc.com', 'scrapable': True},
  {'name': '华夏银行股份有限公司', 'type': '商业银行', 'website': 'https://www.hxb.com.cn', 'scrapable': True}
]

# get_stats() returns
{'total': 918, 'scrapable_count': 98, 'types': {'商业银行': 6, '基金管理公司': 77, ...}}
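
Records in the shape shown by the search() sample can be post-processed with plain Python. The sketch below groups scrapable institutions by type. `group_scrapable_by_type` is a hypothetical helper, not part of the registry API, and it assumes each record carries the `name`, `type`, and `scrapable` keys seen in the sample.

```python
from collections import defaultdict
from typing import Dict, List


def group_scrapable_by_type(records: List[dict]) -> Dict[str, List[str]]:
    """Map institution type to institution names, keeping only scrapable records."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        if record.get("scrapable"):
            grouped[record["type"]].append(record["name"])
    return dict(grouped)


# Sample records copied from the search() output in this document.
sample = [
    {"name": "华夏基金管理股份有限公司", "type": "基金管理公司",
     "website": "https://www.chinaamc.com", "scrapable": True},
    {"name": "华夏银行股份有限公司", "type": "商业银行",
     "website": "https://www.hxb.com.cn", "scrapable": True},
]
grouped = group_scrapable_by_type(sample)
print(grouped)
```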

Quick Start

Scrape an Institution's Web Page

from scripts.institution_scraper import InstitutionScraper

scraper = InstitutionScraper()
result = scraper.scrape_by_name("华夏基金管理")
print(result['content'][:500])

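Scrape a URL Directly

from scripts.scraper import FinancialPageScraper

scraper = FinancialPageScraper()
content = scraper.scrape_url("https://www.chinaamc.com")
if content is not None:  # scrape_url returns None on failure
    print(f"Fetched {len(content)} characters")

Sample output:

# On success, a Selector object is returned
Fetched 15234 characters

# On failure, None is returned and an error hint is printed
[错误] 爬取超时，目标网站响应太慢
[提示] 可以尝试：1) 稍后重试 2) 检查网络状况 3) 目标网站可能有限流
# Translation: "[Error] Scrape timed out; the target site is responding too slowly"
# "[Hint] You can try: 1) retry later 2) check your network 3) the target site may be rate limiting"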
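
Because a failed scrape returns None and the hint suggests retrying later, a caller can wrap the call in a retry loop with growing delays. This is a minimal sketch under stated assumptions: `retry_fetch` and its parameters are hypothetical, and `fetch` is any zero-argument callable that returns None on failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_fetch(
    fetch: Callable[[], Optional[T]],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fetch` until it returns a non-None value or attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    return None
```

With the scraper from this section, usage might look like `retry_fetch(lambda: scraper.scrape_url("https://www.chinaamc.com"))`.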

Parse Financial Products

from scripts.web_parser import parse_financial_product

# Parse a fund
result = parse_financial_product("https://fund.eastmoney.com/000001.html", "fund")

# In the calls below, `url` is the page URL of the product being parsed

# Parse an ETF
result = parse_financial_product(url, "etf")

# Parse a FOF
result = parse_financial_product(url, "fof")

# Parse a stock
result = parse_financial_product(url, "stock")

Download Announcements

from scripts.announcement_scraper import AnnouncementScraper

scraper = AnnouncementScraper()
announcements = scraper.search("华夏基金", limit=10)
for ann in announcements:
    print(f"{ann['title']} - {ann['date']}")
    scraper.download_pdf(ann['url'], save_dir="data/announcements/")

News

from scripts.news_scraper import NewsScraper

scraper = NewsScraper()
news = scraper.crawl_financial_news(limit=20)
for item in news:
    print(f"{item['title']} [{item['source']}]")

Institution List Data Files

> institution_registry.json is the core file (918 institutions, with URLs); institutions.json is the sanitized version (no URLs, only name and code); each *_list.json file is a per-category list.

| File | Contents | Count | Notes |
|------|------|------|------|

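Project Structure

cn-financial-scraper/
├── SKILL.md                    # This document
├── README.md                   # Project introduction
├── _meta.json                  # SkillHub metadata
├── requirements.txt            # Dependencies
├── scripts/
│   ├── __init__.py
│   ├── scraper.py              # Basic web scraper (v2.0: three-level fallback + cache)
│   ├── stock_list_updater.py   # A-share listed-company list updater
│   ├── web_parser.py           # Web page parsing (products/announcements)
│   ├── institution_scraper.py  # Batch institution scraper
│   ├── name_scraper.py         # Institution-name scraper (anti-bot + bilingual)
│   ├── announcement_scraper.py  # Announcement search and download
│   ├── news_scraper.py          # News scraping
│   ├── full_institution_crawler.py  # Full institution list
│   ├── institution_updater.py  # Quarterly update
│   ├── scrapable_registry.py   # Registry of scrapable institutions
│   ├── document_parser.py      # PDF/Word/Excel parsing
│   ├── analyzer.py             # Product analysis
│   ├── visualization_reporter.py # Visual reports
│   ├── realtime_monitor.py     # Real-time monitoring
│   ├── company_report_scraper.py  # Periodic reports of listed companies
│   ├── research_report_scraper.py # Broker research scraper
│   ├── comprehensive_report_scraper.py # Unified entry point for comprehensive reports
│   ├── report_indexer.py       # Full-scan indexer
│   └── report_exporter.py      # Report exporter (PPT/PDF/Word/Excel)
└── data/
    ├── institution_registry.json  # Institution registry (918 institutions)
    ├── *_list.json              # Institution lists by type
    ├── announcements/           # Downloaded announcements
    ├── company_reports/        # Downloaded periodic reports
    ├── broker_reports/         # Downloaded broker research
    ├── report_index.db         # Full report index database
    └── update_logs/            # Update logs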

Scraping Reports for All A-Shares

Broker Research Reports

from scripts.research_report_scraper import BrokerReportManager

manager = BrokerReportManager()

# Get research reports for a single stock
print(manager.generate_report_summary("600519"))

# Filter by broker
reports = manager.get_reports_by_broker("中金公司")

CLI:

python -m scripts.research_report_scraper search 600519
python -m scripts.research_report_scraper top           # Top research reports
python -m scripts.research_report_scraper broker 中金公司

Comprehensive Reports

from scripts.comprehensive_report_scraper import ComprehensiveReportManager

manager = ComprehensiveReportManager()

# Get reports of every type
data = manager.get_all_reports("600519")

# Generate a summary
print(manager.generate_report_summary("600519"))

# Process stocks in batch
result = manager.batch_process_stocks(["600519", "000001"])

CLI:

python -m scripts.comprehensive_report_scraper get 600519
python -m scripts.comprehensive_report_scraper batch 600519,000001

Full Indexer

from scripts.report_indexer import StockIndexer

indexer = StockIndexer()

# Initialize the stock list
indexer.init_stock_list()

# Full scan
indexer.full_scan(limit=100)  # Test with 100 stocks first

# Search the index ("年报" means "annual report")
results = indexer.search_reports("年报", report_type="periodic")

CLI:

python -m scripts.report_indexer init                    # Initialize the stock list
python -m scripts.report_indexer scan --full --limit 10  # Test scan of 10 stocks
python -m scripts.report_indexer scan --resume           # Resume an interrupted scan
python -m scripts.report_indexer search 年报 --type periodic
python -m scripts.report_indexer stats                  # Show statistics
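
The idea behind a resumable scan backed by SQLite can be illustrated with a progress table that records each finished stock code, so a rerun skips what is already done. This sketch does not reproduce report_indexer.py: the `scan_progress` table, `resumable_scan`, and the `scan_one` callback are hypothetical names chosen for illustration.

```python
import sqlite3
from typing import Callable, Iterable, List


def resumable_scan(
    conn: sqlite3.Connection,
    codes: Iterable[str],
    scan_one: Callable[[str], None],
) -> List[str]:
    """Scan each stock code once and record progress so a rerun skips finished codes.

    The `scan_progress` table and the `scan_one` callback are hypothetical.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS scan_progress (code TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT code FROM scan_progress")}
    scanned: List[str] = []
    for code in codes:
        if code in done:
            continue  # finished in an earlier run or earlier in this one
        scan_one(code)
        # Commit after each code so an interruption loses at most one item.
        conn.execute("INSERT INTO scan_progress (code) VALUES (?)", (code,))
        conn.commit()
        done.add(code)
        scanned.append(code)
    return scanned
```

Committing per code trades some write throughput for a simple recovery story: after a crash, calling the function again with the same list continues from the first unfinished code.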

Report Export

from scripts.report_exporter import ComprehensiveExporter
from scripts.comprehensive_report_scraper import ComprehensiveReportManager

manager = ComprehensiveReportManager()
data = manager.get_all_reports("600519")

exporter = ComprehensiveExporter()

# Export to all formats
results = exporter.export_all_formats(data, "600519")
# {'ppt': '...', 'docx': '...', 'xlsx': '...', 'pdf': '...'}

# Generate an analysis report
analysis = exporter.generate_analysis_report(data)

CLI:

python -m scripts.report_exporter export 600519 --formats ppt,docx,xlsx
python -m scripts.report_exporter analyze 600519
python -m scripts.report_exporter batch 600519,000001 --formats xlsx

Batch Scraping

Batch Institution Scraping

from scripts.batch_institution_crawler import BatchInstitutionCrawler

crawler = BatchInstitutionCrawler()

# Scrape a batch by name
institutions = [
    {"name": "易方达基金", "code": "EF"},
    {"name": "华夏基金", "code": "HX"},
    {"name": "广发基金", "code": "GF"},
]
results = crawler.crawl_by_names(institutions)

# Scrape every institution of one type
results = crawler.crawl_by_type("基金管理公司")

# Scrape all types
all_results = crawler.crawl_all_types()

# Export the results
output = crawler.export_results(format="json")

Batch Stock Scraping

from scripts.batch_institution_crawler import StockBatchCrawler

stock_crawler = StockBatchCrawler()

# Scrape a batch by stock code
codes = ["600519", "000858", "000001"]
results = stock_crawler.crawl_by_codes(codes)

# Export the results
output = stock_crawler.export_to_json()

CLI:

python -m scripts.batch_institution_crawler --type 基金管理公司
python -m scripts.batch_institution_crawler --names 易方达基金,华夏基金,广发基金
python -m scripts.batch_institution_crawler --stock-batch 600519,000858

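Install Dependencies

pip install "scrapling>=0.2.0" "playwright>=1.40.0" "requests>=2.28.0"
playwright install chromium
pip install "PyPDF2>=3.0.0" "python-docx>=0.8.11" "openpyxl>=3.1.0"

> Note on scrapling 0.2.x: in version 0.2.x, the Response object exposes its HTML through the .html_content attribute (the older .html is deprecated).

> This project injects a compatibility layer automatically, so no extra handling is needed.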
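
For code written outside the project, the attribute difference noted above can be handled with a small accessor that prefers `.html_content` and falls back to `.html`. This is a minimal sketch and does not show the project's own compatibility layer: `get_html` is a hypothetical helper, and the demo uses stand-in objects rather than real scrapling responses.

```python
from types import SimpleNamespace
from typing import Any


def get_html(response: Any) -> str:
    """Return a response's HTML, preferring .html_content over the older .html."""
    for attr in ("html_content", "html"):
        value = getattr(response, attr, None)
        if value is not None:
            return str(value)
    raise AttributeError("response has neither .html_content nor .html")


# Stand-in objects for illustration; real scrapling responses are not constructed here.
new_style = SimpleNamespace(html_content="<p>new</p>")
old_style = SimpleNamespace(html="<p>old</p>")
print(get_html(new_style))  # <p>new</p>
print(get_html(old_style))  # <p>old</p>
```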

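Data Freshness

| Data type | Actual lag | Notes |
|---------|---------|------|
| Institution list | Updated quarterly | Synced with regulator data within 15 days each quarter |
| Institution URLs | Updated in real time | Manually verified plus domain inference |
| Product data | Scraped on demand | Depends on the target site's anti-scraping measures |

Core principle: all data is for reference only. When scraping, comply with the target site's robots.txt and terms of use.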
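
One way to check robots.txt before scraping is Python's standard-library `urllib.robotparser`. The sketch below parses robots.txt text and asks whether a URL may be fetched. `is_allowed`, the `my-scraper` user agent, and the example rules are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, not taken from any real site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "my-scraper", "https://example.com/funds/"))     # True
print(is_allowed(EXAMPLE_ROBOTS, "my-scraper", "https://example.com/private/x"))  # False
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.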

FAQ and Troubleshooting

Installation Issues

Q: scrapling fails to install?

A: If scrapling fails to install, try the following:

# Use a mainland China mirror
pip install scrapling -i https://pypi.tuna.tsinghua.edu.cn/simple

# Or install playwright first, then scrapling
pip install playwright -i https://pypi.tuna.tsinghua.edu.cn/simple
playwright install chromium
pip install scrapling

Q: playwright installation times out?

A: Use a mainland China mirror, or install offline:

# Set the mirror
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Install
pip install playwright
playwright install chromium --with-deps

Q: Seeing "No module named 'xxx'"?

A: A dependency is missing. Run:

pip install scrapling playwright PyPDF2 python-docx openpyxl requests

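Scraping Issues

Q: Scraping reports "超时" (timeout)?

A: The target site may be slow or rate limiting. Try the following:

Retry later
Use dynamic rendering mode (use_dynamic=True)
Check your network connection

Q: Scraping hits an anti-bot mechanism (CAPTCHA, Cloudflare, etc.)?

A: Try the following:

Retry later (the target site may have a temporary restriction)
Use dynamic rendering mode (browser emulation)
Lower the request rate

Q: Results are empty or incomplete?

A: Possible causes:

The page structure has changed (the target site updated its UI)
Dynamic rendering mode is required
The page requires login or permissions

Q: Seeing "找不到目标元素" (target element not found)?

A: The page structure may have changed. You can:

Check that the URL is correct
Try dynamic rendering mode
Check whether the page requires login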

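Data Issues

Q: How often is the data updated?

A:

Institution list: updated quarterly
Institution URLs: updated in real time (manually verified)
Product data: scraped on demand; depends on the target site's anti-scraping measures

Q: How do I get real-time data?

A: Product data must be scraped yourself. Recommendations:

Use a reasonable request interval (a 2-second throttle is built in)
Cache results locally after scraping
Refresh periodically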
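
The "cache locally, refresh periodically" advice can be sketched as a file cache with a maximum age. This is an illustrative sketch, not the project's built-in cache: `cached_fetch`, the JSON file layout, and the one-hour default are assumptions, and `fetch` stands in for a real scraping call that returns JSON-serializable data.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Callable


def cached_fetch(
    cache_file: Path,
    fetch: Callable[[], Any],
    max_age_seconds: float = 3600.0,
    now: Callable[[], float] = time.time,
) -> Any:
    """Return cached JSON data if it is fresh enough, otherwise refetch and store it."""
    if cache_file.exists():
        entry = json.loads(cache_file.read_text(encoding="utf-8"))
        if now() - entry["fetched_at"] <= max_age_seconds:
            return entry["data"]
    data = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(
        json.dumps({"fetched_at": now(), "data": data}, ensure_ascii=False),
        encoding="utf-8",
    )
    return data


# Demo with a stand-in fetch function and a temporary directory.
fetch_calls = []


def fake_fetch() -> dict:
    fetch_calls.append(1)
    return {"code": "000001"}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "fund_000001.json"
    first = cached_fetch(path, fake_fetch)
    second = cached_fetch(path, fake_fetch)  # served from the cache file
    stale = cached_fetch(path, fake_fetch, max_age_seconds=-1.0)  # forces a refetch
```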

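Environment Requirements

Q: Is Python 2 supported?

A: No. Python 3.8 or later is required.

Q: Are special network settings needed?

A: No, a normal network connection is enough. Some target sites may require a proxy.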

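Version History

14 versions in total

| Version | Released | Changes | Security scan |
|------|------|------|------|
| v1.0.15 (current) | 2026-06-12 13:43 | Routine update | Safe |
| v1.0.14 | 2026-05-25 22:37 | Bug fixes and logic improvements | Safe |
| v1.0.13 | 2026-05-25 16:48 | Improved the description | Safe |
| v1.0.12 | 2026-05-25 12:23 | Fixed the PyPI mirror DNS resolution failure in the cnfinancialscraper project; added batch scraping | Safe |
| v1.0.11 | 2026-05-25 09:35 | Updated some files | Safe |
| v1.0.10 | 2026-05-25 09:13 | Added scraping, organizing, and export of all listed-company periodic reports, broker research reports, and announcements | Safe |
| v1.0.9 | 2026-05-24 14:08 | Updated data | Safe |
| v1.0.7 | 2026-05-24 02:11 | Improved the scraping logic | Safe |
| v1.0.6 | 2026-05-24 01:17 | Improved the description | Safe |
| v1.0.5 | 2026-05-24 01:04 | Expanded the list of scrapable institutions; added arbitrary-website scraping and translation | Safe |
| v1.0.3 | 2026-05-23 01:05 | Expanded the scraped data sources | Safe |
| v1.0.2 | 2026-05-23 00:31 | Added scrape lists for insurers, city commercial banks, private funds, and other institutions | Safe |
| v1.0.1 | 2026-05-22 16:29 | Added scraping, analysis, and visual export of listed-company periodic reports | Safe |
| v1.0.0 | 2026-05-22 09:59 | Initial release | Safe |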

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
