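Guangdong Provincial Government Procurement Network (广东省政府采购网): Award (Transaction) Result Announcement Collection Skill

1. Overview

This skill uses the Agent Browser automation capability to collect award (transaction) result announcements (中标（成交）结果公告) from the Guangdong Provincial Government Procurement Network. It drives a real browser the way a person would and handles page loading, pagination, element parsing, data cleaning, global deduplication, and incremental updates, then writes the results as JSON, Excel, or CSV.

Core principles:

Deduplicate on the project number as the unique primary key: global deduplication plus incremental filtering
Collect only the officially published fields as rendered by the browser: never invent or backfill missing data
A failure on one record must not stop the run: wrap each record in try/except, log the exception, and continue
Stay compatible with the browser-automation runtime throughout: the site is built on Vue + Element UI components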

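2. Target URL

Main page: https://gdgpo.czt.gd.gov.cn/maincms-web/noticeInformationGd
Announcement type: 中标（成交）结果公告 (award/transaction result announcement); it must be selected with a JS click
Site framework: Vue 2 + Element UI (most interactive elements are not exposed in the accessibility tree)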

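3. Page Structure Analysis

3.1 Search form (filter panel)

Control	JS locator	Notes
------	---------	------
Type button group	`querySelectorAll('.conditionwrap_right')[0]`	8 child spans; index=0 is "全部" (All), index=7 is "中标（成交）结果公告"
Start date input	`input[placeholder="开始日期"]`	Format `YYYY-MM-DD HH:mm:ss`; `browser_type` opens the calendar popup
End date input	`input[placeholder="结束日期"]`	Same as above
Query button	`button` whose `textContent.includes('查询')`	`browser_click` cannot trigger the Vue event
Reset button	`button` whose `textContent.includes('重置')`	Reset first, then operate, to avoid leftover state

3.2 List page (search result table)

Element	How to extract
------	----------
Total count	`"共 N 条"` StaticText in `browser_snapshot`
Table row (row i)	The i-th table row; each row has 3 cells: title / region / publish time
Title (row i)	Text of the 1st cell in row i (no href; see §3.4 for navigation)
Region (row i)	Text of the 2nd cell in row i
Publish time (row i)	Text of the 3rd cell in row i (format `YYYY-MM-DD`)
Page number input	`spinbutton`: type a number, then press Enter
Next-page button	`button` containing the next-page icon; check its `disabled` attribute to detect the last page
"暂无数据" (no data)	StaticText="暂无数据": marks an empty result set

> ⚠️ Every ref ID changes after each page turn, so browser_snapshot() must be called again to get the current element tree.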

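3.3 Detail page (announcement body)

Detail-page data comes from two sources:

Source A: notice-content header metadata (not in the accessibility tree)

The publisher, publish time, procurement plan number, budget amount, and similar fields at the top of the page are laid out purely with CSS and do not appear in the accessibility tree, so they must be extracted by running JS through browser_console.

> ⚠️ In the DOM the fields are concatenated with no line breaks or separators, for example:

> "发布机构：肇庆市尚意项目管理有限公司发布时间：2026-05-30 10:21:02采购计划编号：..."

> Extract each field with a regex bounded by the next field label; a plain textContent.split() cannot separate them.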
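The label-bounded extraction described above can be sketched offline in Python. This is a minimal illustration, not the skill's actual extractor: `LABELS` and `split_header` are hypothetical names, and the label list is an assumption taken from the field table in §6.1 rather than a confirmed page schema.

```python
import re

# Assumed label set (from the field table in section 6.1); the real page may differ.
LABELS = ["发布机构", "发布时间", "采购计划编号", "预算金额",
          "采购品目", "代理机构", "项目经办人", "项目负责人"]


def split_header(text: str) -> dict[str, str]:
    """Split 'label：valuelabel：value...' into {label: value}."""
    label_alt = "|".join(map(re.escape, LABELS))
    # A value runs lazily until the next known label followed by a colon, or the end.
    pattern = re.compile(
        rf"({label_alt})[：:]\s*(.*?)(?=(?:{label_alt})[：:]|$)", re.S
    )
    return {m.group(1): m.group(2).strip() for m in pattern.finditer(text)}


# The sample string quoted above, including its elided tail.
sample = "发布机构：肇庆市尚意项目管理有限公司发布时间：2026-05-30 10:21:02采购计划编号：..."
fields = split_header(sample)
```

Requiring a colon after the next label is what keeps "项目管理" inside the publisher name from being mistaken for the start of "项目经办人".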

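Source B: headings in the main content area (visible in the accessibility tree)

Field	Snapshot locator	Example
------	---------------	------
Project number	heading=4 `"一、项目编号：X"`	`0724-2611Z3151282`
Project name	heading=4 `"二、项目名称：X"`	Full project name
Winning supplier	table under `"三、采购结果"`	Supplier name / address / amount
Main subject-matter information	table under `"四、主要标的信息"`	Category / name / amount
Evaluation experts	paragraph after `"五、评审专家"`	List of experts
Agency service fee	table under `"六、代理服务费"`	Amount / who is charged
Announcement period	paragraph after `"七、公告期限"`	自本公告发布之日起 N 个工作日 (N working days from the publication date)
Purchaser information	paragraphs after h6 `"1.采购人信息"`	Name / address / contact details
Procurement agency information	paragraphs after h6 `"2.采购代理机构信息"`	Name / address / contact details

> ⚠️ Full-width characters: the page may use the full-width colon ： and the full-width space \u3000, so regexes and string splitting must handle both.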

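3.4 Detail-page navigation (Vue component compatibility)

Row clicks on the list page are handled by a Vue component; the table rows carry no href or onclick attribute. Navigation has to go through a method on the Vue instance:

// Safely navigate to the detail page (row i)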
(function(i) {
  let rows = document.querySelectorAll('.el-table__body-wrapper .el-table__row');
  if (!rows || !rows[i]) return JSON.stringify({error: 'row not found', i: i});
  if (!rows[i].__vue__) return JSON.stringify({error: 'row vue instance not found', i: i});
  let rowData = rows[i].__vue__._props.row;
  let table = document.querySelector('.el-table');
  if (!table || !table.__vue__) return JSON.stringify({error: 'table vue instance not found'});
  let p = table.__vue__.$parent;
  for (let d = 0; d < 15 && p; d++) {
    if (typeof p.rowClick === 'function') {
      p.rowClick(rowData);
      return JSON.stringify({success: true, rowData: rowData});
    }
    p = p.$parent;
  }
  return JSON.stringify({error: 'rowClick method not found'});
})(0);  // 0 = first row

After a successful navigation, window.location.href looks like:

https://gdgpo.czt.gd.gov.cn/maincms-web/noticeGd?type=notice&id={uuid}&channel={channel}&noticeType=001026&openTenderCode=SY-2026-050(CGZB)&channelName=项目采购公告&path=/noticeInformationGd

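3.5 Procurement method code mapping

The noticeType URL parameter maps to the procurement method:

Code	Procurement method
------	----------
001001	公开招标 (open tendering)
001002	邀请招标 (invited tendering)
001003	竞争性谈判 (competitive negotiation)
001004	竞争性磋商 (competitive consultation)
001005	询价 (request for quotations)
001006	单一来源 (single-source procurement)
001014	框架协议 (framework agreement)
001026	竞争性磋商 (competitive consultation)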
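For post-processing outside the browser, the same lookup can be done on a stored detail URL. The sketch below assumes the table above is complete for the announcements being collected; `procurement_method` is a hypothetical helper name, and unknown codes are returned unchanged rather than guessed.

```python
from urllib.parse import parse_qs, urlsplit

# Mapping copied from the table above.
NOTICE_TYPE_MAP = {
    "001001": "公开招标", "001002": "邀请招标", "001003": "竞争性谈判",
    "001004": "竞争性磋商", "001005": "询价", "001006": "单一来源",
    "001014": "框架协议", "001026": "竞争性磋商",
}


def procurement_method(detail_url: str) -> str:
    """Return the procurement method for a detail URL, or the raw code if unmapped."""
    query = parse_qs(urlsplit(detail_url).query)
    code = query.get("noticeType", [""])[0]
    return NOTICE_TYPE_MAP.get(code, code)
```

A URL without a noticeType parameter yields an empty string, which matches the document's rule that missing values are stored as "".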

4. Input Parameters

from dataclasses import dataclass


@dataclass
class GdgpoCollectorConfig:
    # === Target URL ===
    base_url: str = "https://gdgpo.czt.gd.gov.cn/maincms-web/noticeInformationGd"

    # === Time filter ===
    time_mode: str = "today"       # "today" | "last7" | "last30" | "custom"
    start_date: str = ""           # only for time_mode="custom", format "YYYY-MM-DD"
    end_date: str = ""             # only for time_mode="custom", format "YYYY-MM-DD"

    # === Pages to fetch ===
    max_pages: int = 50            # maximum number of pages to turn automatically

    # === Browser settings ===
    page_load_timeout: int = 30    # page load timeout (seconds)
    element_wait_timeout: int = 15 # element wait timeout (seconds)
    page_turn_interval: float = 2.0 # interval between page turns (seconds)
    headless: bool = True          # run the browser headless

    # === Output settings ===
    output_format: str = "json"    # "json" | "excel" | "csv"
    output_dir: str = ""           # ⚠️ the output path must be confirmed by the user

    # === Advanced settings ===
    page_size: int = 10            # rows per page
    human_like_delay: tuple = (1.0, 3.0)  # random delay between actions (seconds)
    anti_bot_speed: bool = True    # enable anti-bot pacing
    max_retries: int = 3           # retries for browser/element operations
    retry_delay: int = 5           # delay between retries (seconds)
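The time_mode values have to be turned into the `YYYY-MM-DD HH:mm:ss` strings that the date inputs expect. A minimal sketch follows; `resolve_date_range` is a hypothetical helper, and treating "last7" and "last30" as windows that include today is an assumption the document does not state.

```python
from datetime import date, timedelta


def resolve_date_range(time_mode, start_date="", end_date="", today=None):
    """Return (start, end) strings in the 'YYYY-MM-DD HH:mm:ss' form used by the date inputs."""
    today = today or date.today()
    if time_mode == "today":
        start = end = today
    elif time_mode == "last7":      # assumption: 7 calendar days ending today
        start, end = today - timedelta(days=6), today
    elif time_mode == "last30":     # assumption: 30 calendar days ending today
        start, end = today - timedelta(days=29), today
    elif time_mode == "custom":
        start, end = date.fromisoformat(start_date), date.fromisoformat(end_date)
        if start > end:
            raise ValueError("start_date must not be after end_date")
    else:
        raise ValueError(f"unknown time_mode: {time_mode!r}")
    return f"{start.isoformat()} 00:00:00", f"{end.isoformat()} 23:59:59"
```

Passing `today` explicitly keeps the function deterministic, which helps when a scheduled run has to be replayed for a past date.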

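5. Workflow

Phase 0: Confirm the output directory

# The output directory must be confirmed with the user before collection starts
# Default suggestion: ~/.hermes/cron/output/gdgpo/
# Ask with the clarify tool; once confirmed, store the answer in output_dir
# Create the directory automatically if it does not exist

Phase 1: Set the filters (all in one JS call)

1. browser_navigate(url=base_url)
2. browser_snapshot() → confirm the page has finished loading
3. browser_console runs the full JS filter template (reset → select type → set dates → click query)
4. await new Promise(r => setTimeout(r, 2000)) to wait for the results to render
5. browser_snapshot() → verify the result (check that "共 N 条" has N > 0)

Phase 2: List-page collection + pagination

1. Read "共 N 条" and compute total_pages = ceil(N / page_size)
2. current_page = 0
3. Loop while current_page < min(total_pages, max_pages):
   a. browser_snapshot() → read the rows on this page (title / region / publish time)
   b. For each row i in 0..rows-1:
      - Record the list-page fields as a fallback
      - Navigate to the detail page with JS (Vue rowClick, see §3.4)
      - Run Phase 3 detail-page collection
      - Go back to the list page in the browser (or re-navigate to the main page and restore the query state)
   c. Check whether this is the last page (next-page button disabled) → stop
   d. Turn the page: click the next-page button or type the page number
   e. current_page += 1
   f. Wait between page turns (human_like_delay)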
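The page-count arithmetic in step 1 and the loop bound in step 3 can be sketched as follows. `parse_total` and `pages_to_visit` are hypothetical helper names; the regex mirrors the "共 N 条" text described in §3.2.

```python
import math
import re


def parse_total(snapshot_text: str) -> int:
    """Extract N from '共 N 条'; return 0 when the marker is absent."""
    match = re.search(r"共\s*(\d+)\s*条", snapshot_text)
    return int(match.group(1)) if match else 0


def pages_to_visit(total: int, page_size: int = 10, max_pages: int = 50) -> int:
    """Number of list pages the loop should walk: min(ceil(N / page_size), max_pages)."""
    if total <= 0:
        return 0
    return min(math.ceil(total / page_size), max_pages)
```

For example, 127 results at 10 rows per page give 13 pages, while 1000 results are capped at the default max_pages of 50.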

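Phase 3: Detail-page collection

1. browser_console("window.location.href") → get detail_url
2. browser_console runs the JS that extracts the notice-content header metadata (regex)
3. browser_snapshot() → extract project number / project name / evaluation experts and so on from the headings
4. Merge all fields and set data_status

Phase 4: Data cleaning and deduplication

1. Date normalization: YYYY-MM-DD / YYYY-MM-DD HH:mm:ss
2. Amount normalization: drop the trailing .0000 and convert to float
3. Text cleaning: strip leading/trailing spaces, \n, \r, \u3000 (full-width space), and garbled characters
4. Deduplicate globally with project_id as the unique primary key
5. Check the incremental index file _index.json and filter out records already collected

Phase 5: Output

1. Write JSON / CSV / Excel according to output_format
2. File naming: gdgpo_bid_{date_range}_{timestamp}.{ext}
3. Generate the _report.json collection statistics report
4. Update the incremental index _index.json
5. browser_navigate back to the home page to release resources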

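6. Field Mapping

6.1 Core fields (20)

#	JSON Key	Field name	Required	Type	Extraction source
---	----------	--------	------	------	----------
1	`project_id`	Project number	✅	string	Detail page h4 `"一、项目编号：X"`
2	`project_name`	Project name	✅	string	Detail page h4 `"二、项目名称：X"`
3	`title`	Announcement title	✅	string	List page, 1st cell
4	`region`	Region	✅	string	List page, 2nd cell
5	`publish_date`	Publish date	✅	string	List page, 3rd cell (take YYYY-MM-DD)
6	`detail_url`	Detail link	✅	string	`window.location.href` after entering the detail page
7	`publisher`	Publisher	Conditional	string	notice-content regex `发布机构[：:]`
8	`publish_datetime`	Publish time (exact)	Conditional	string	notice-content regex `发布时间[：:]`
9	`procurement_plan_id`	Procurement plan number	Conditional	string	notice-content regex `采购计划编号[：:]`
10	`budget_amount`	Budget amount	Conditional	string	notice-content regex `预算金额[：:]` (keep the raw text)
11	`procurement_category`	Procurement category	Conditional	string	notice-content regex `采购品目[：:]`
12	`procurement_agent`	Procurement agency	Conditional	string	notice-content regex `代理机构[：:]`
13	`project_handler`	Project handler	Conditional	string	notice-content regex `项目经办人[：:]`
14	`project_leader`	Project leader	Conditional	string	notice-content regex `项目负责人[：:]`
15	`supplier`	Winning supplier	Conditional	string	Supplier name in the first row of the procurement result table
16	`bid_price`	Award (transaction) amount	Conditional	string	Amount text in the first row of the procurement result table
17	`evaluation_experts`	Evaluation experts	Conditional	string	Paragraph text after h4 `"五、评审专家"`
18	`announcement_period`	Announcement period	Conditional	string	Paragraph text after h4 `"七、公告期限"`
19	`data_status`	Data status	✅	enum	`normal` / `incomplete` / `error`
20	`collection_timestamp`	Collection timestamp	✅	string	Generated by the system, `YYYY-MM-DDTHH:mm:ss+08:00`

> Conditional = collected when present on the page; left as an empty string ("") when absent. Nothing is invented or backfilled.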

6.2 Winning supplier object structure

{
  "package_no": "1",
  "package_name": "每年第一和第三季度的肉菜类食材配送服务",
  "supplier_name": "广东供销农产品股份有限公司",
  "supplier_address": "惠州市博罗县泰美镇新塘村粤港澳大湾区...",
  "winning_amount": "折扣率：79.00%",
  "winning_amount_type": "折扣率",
  "winning_amount_numeric": 0.79
}
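The three winning_amount fields in the object above can be derived from the raw cell text. The sketch below is one possible reading of that example: `parse_winning_amount` is a hypothetical helper, and dividing by 100 when a percent sign follows the number is inferred from the 79.00% → 0.79 pair, not from a stated rule.

```python
import re


def parse_winning_amount(raw: str) -> dict:
    """Split 'label：value' and derive a numeric value; the raw text is always kept."""
    text = raw.strip()
    # Split on the first full-width or half-width colon, if there is one.
    m = re.match(r"^(.*?)[：:]\s*(.*)$", text, re.S)
    amount_type, value = (m.group(1).strip(), m.group(2)) if m else ("", text)
    num = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*([%％]?)", value)
    numeric = ""  # empty string when no number is present, following the empty-value rule
    if num:
        numeric = float(num.group(1).replace(",", ""))
        if num.group(2):  # percentage such as a discount rate
            numeric /= 100
    return {
        "winning_amount": text,
        "winning_amount_type": amount_type,
        "winning_amount_numeric": numeric,
    }
```

Cells without a label, such as a bare amount, come back with an empty winning_amount_type and the parsed number.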

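6.3 Data cleaning rules

Field	Rule
------	------
`publish_date`	Normalize to `YYYY-MM-DD`
`publish_datetime`	Normalize to `YYYY-MM-DD HH:mm:ss`
`budget_amount`	Keep the raw string; also store `budget_amount_numeric` as a float
All text fields	Strip leading/trailing spaces, `\n`, `\r`, `\u3000` (full-width space), and decode HTML entities
Empty values	Missing or failed to collect → empty string `""`, never None/null
Award amount	Parse the numeric part into a numeric field; keep the raw text
Full-width to half-width	Convert digits and colons to half-width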
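The text rules in this table can be sketched as one small cleaner. `clean_text` and `budget_numeric` are hypothetical names; only full-width digits, the full-width colon, and the full-width space are converted, because the table limits the conversion to digits and colons and a full Unicode normalization would also rewrite brackets such as （）.

```python
import html
import re

# Full-width digits ０-９ (U+FF10..U+FF19) → ASCII, plus colon and full-width space.
_FULLWIDTH = {0xFF10 + i: 0x30 + i for i in range(10)}
_FULLWIDTH[0xFF1A] = ord(":")
_FULLWIDTH[0x3000] = ord(" ")


def clean_text(value) -> str:
    """Decode HTML entities, convert full-width digits/colon/space, strip the edges."""
    if value is None:
        return ""
    return html.unescape(str(value)).translate(_FULLWIDTH).strip()


def budget_numeric(raw: str):
    """Float for budget_amount_numeric, or None when the text holds no number."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", clean_text(raw))
    return float(match.group().replace(",", "")) if match else None
```

The raw budget_amount string stays untouched in the record; only the companion numeric field goes through `budget_numeric`.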

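7. Exception Handling

7.1 Browser loading exceptions

Covers: page load timeouts, blank pages, 404/5xx errors, failed resource loads
Wait and retry automatically (at most max_retries times, retry_delay seconds apart)
When the retries are exhausted: log the failing URL and the reason → skip the current record → continue with the remaining collection tasks
Never abort the whole run, and never wait for manual intervention

import time

class SkipPageException(Exception):
    """Raised after the final failed attempt so the caller can skip this page."""

for attempt in range(max_retries):
    try:
        browser_navigate(url)
        browser_snapshot()  # confirm the page has content
        break
    except Exception as e:
        log_error(f"Page load failed (attempt {attempt+1}): {e}")
        if attempt < max_retries - 1:
            time.sleep(retry_delay)
        else:
            raise SkipPageException(url) from e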

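7.2 Element parsing exceptions

Covers: page elements changing dynamically, missing fields, tags moving position, elements not yet loaded
Leave the affected field empty (""): do not raise, do not backfill, do not alter the data
Mark data_status as "incomplete" at the same time
Record the missing fields in the log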

field = extract_field(page, "project_id")
if field is None:
    record["project_id"] = ""
    record["data_status"] = "incomplete"
    log_missing_field("project_id")

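7.3 Pagination exceptions

Covers: automatic page turning stops working, pagination loading is interrupted, pages are missed, records are fetched twice
Global deduplication on project_id: each collected record is added to a working set and checked for duplicates
Incremental filtering: load the set of already-collected project_id values from _index.json and skip existing records
Automatic last-page detection: the next-page button's disabled attribute is true, or the snapshot contains no table rows
If clicking fails, fall back to the page-number input strategy (spinbutton + Enter)

def is_last_page():
    """Return True when the last page has been reached."""
    snapshot = browser_snapshot()
    return "暂无数据" in snapshot or is_button_disabled("下一页")

def safe_turn_page(current_page):
    """Turn to the next page; current_page is the 1-based page now shown. Returns the new page number."""
    target_page = current_page + 1
    try:
        # Strategy 1: click the next-page button
        browser_click(next_page_ref)
    except Exception:
        # Strategy 2: fall back to the page-number input box
        browser_type(page_input_ref, str(target_page))
        browser_press("Enter")
    return target_page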

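7.4 Anti-bot and rate-limiting exceptions

Covers: the site's anti-scraping measures, human verification, restricted access
Lengthen the interval between actions automatically: double human_like_delay to imitate a slower human browsing pace
Randomize timing: wait random.uniform(min_delay, max_delay) seconds before each action
Mild anomalies (an occasional 503/429) → retry automatically
Severe anomalies (human verification, IP ban) → log and skip, marking data_status = "error"
Anti-bot pacing strategies:
Use a random User-Agent
Simulate mouse movement (only when necessary)
Scroll the page randomly before turning pages
Avoid acting at a fixed frequency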
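7.5 Task log output

When collection ends, a full statistics log is generated automatically (saved as _report.json in the same directory as the output file):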

{
  "task_id": "gdgpo_2026-05-30_001",
  "timestamp": "2026-06-01T15:30:00+08:00",
  "config": {
    "time_mode": "today",
    "start_date": "2026-05-30",
    "end_date": "2026-05-30",
    "max_pages": 50,
    "output_format": "json"
  },
  "stats": {
    "total_records": 1,
    "success_count": 1,
    "incomplete_count": 0,
    "error_count": 0,
    "duplicate_skipped": 0,
    "pages_collected": 1,
    "total_pages_available": 1
  },
  "browser": {
    "navigations": 3,
    "snapshots": 5,
    "js_executions": 8,
    "total_duration_seconds": 45.2
  },
  "errors": [],
  "missing_fields": [],
  "data_files": [
    "gdgpo_bid_2026-05-30_20260601_153000.json",
    "gdgpo_bid_2026-05-30_20260601_153000_report.json"
  ]
}

8. Incremental Update Strategy

8.1 Index file

_index.json is the global incremental index; it records every project_id already collected:

{
  "version": 1,
  "created_at": "2026-06-01T15:00:00+08:00",
  "updated_at": "2026-06-01T15:30:00+08:00",
  "total_collected": 127,
  "last_date_range": "2026-05-30~2026-05-30",
  "index": {
    "0724-2611Z3151282": "2026-06-01",
    "SY-2026-050(CGZB)": "2026-06-01"
  }
}

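8.2 Deduplication flow

On each run:
1. Before collecting, read _index.json (if it exists) and load the set of collected project_id values
2. For every newly collected record:
   a. Check whether its project_id is already in the set
   b. If it is → skip (count it as duplicate_skipped)
   c. If it is not → write it to the output file and add it to the set
3. After collection ends, update _index.json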

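8.3 Edge cases

First run: _index.json does not exist → treat it as an empty index and collect everything
Corrupted index file: JSON parsing fails → rebuild an empty index (back up the old one as _index.json.bak)
Collection across days: a different date range does not affect deduplication, because project_id is globally unique
Re-collection: if the user asks to re-collect a date range, the Agent must tell the user explicitly that duplicate data may result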
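The index loading and filtering rules above can be sketched as two functions. `load_index` and `filter_new` are hypothetical names; keeping records that have an empty project_id without indexing them is an assumption, since the document only covers records that have a key.

```python
import json
import shutil
from pathlib import Path


def load_index(index_path: Path) -> dict:
    """Load _index.json; a missing or corrupted file yields an empty index."""
    empty = {"version": 1, "index": {}}
    if not index_path.exists():
        return empty                      # first run: collect everything
    try:
        data = json.loads(index_path.read_text(encoding="utf-8"))
        if not isinstance(data.get("index"), dict):
            raise ValueError("index field missing or malformed")
        return data
    except (ValueError, AttributeError):  # JSONDecodeError is a ValueError
        # Keep the damaged file as _index.json.bak, then start over.
        shutil.copy2(index_path, index_path.with_name(index_path.name + ".bak"))
        return empty


def filter_new(records: list, index: dict, collected_on: str):
    """Return (new_records, duplicate_skipped) and register new project_ids."""
    seen = index["index"]
    fresh, skipped = [], 0
    for record in records:
        pid = record.get("project_id", "")
        if pid and pid in seen:
            skipped += 1
            continue
        fresh.append(record)
        if pid:                           # assumption: empty ids are kept but not indexed
            seen[pid] = collected_on
    return fresh, skipped
```

Writing the updated index back to disk after the run is left to the caller, matching step 3 of the flow in §8.2.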

9. Output File Conventions

9.1 File naming

# The output directory is confirmed by the user (Phase 0); the default is ~/.hermes/cron/output/gdgpo/
# Files are grouped by date

gdgpo_bid_{date_range}_{timestamp}.json     # data output
gdgpo_bid_{date_range}_{timestamp}.csv      # data output (CSV)
gdgpo_bid_{date_range}_{timestamp}.xlsx     # data output (Excel)
gdgpo_bid_{date_range}_{timestamp}_report.json  # collection report
_index.json                                 # incremental index (global)

date_range: 2026-05-30 (single day) or 2026-05-01~2026-05-30 (multiple days)
timestamp: 20260601_153000 (time collection finished, format YYYYMMDD_HHmmss)
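The naming pattern can be sketched as a single function. `build_output_name` is a hypothetical helper; the mapping from the "excel" output_format to the .xlsx extension follows the file list above.

```python
from datetime import date, datetime

_EXTENSIONS = {"json": "json", "csv": "csv", "excel": "xlsx"}


def build_output_name(start: date, end: date, finished_at: datetime,
                      output_format: str = "json", report: bool = False) -> str:
    """Build gdgpo_bid_{date_range}_{timestamp}[_report].{ext}."""
    if start == end:
        date_range = start.isoformat()
    else:
        date_range = f"{start.isoformat()}~{end.isoformat()}"
    stamp = finished_at.strftime("%Y%m%d_%H%M%S")
    if report:
        return f"gdgpo_bid_{date_range}_{stamp}_report.json"  # the report is always JSON
    return f"gdgpo_bid_{date_range}_{stamp}.{_EXTENSIONS[output_format]}"
```

An unsupported output_format raises KeyError, which surfaces a configuration mistake before any file is written.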

9.2 JSON format

The output is an array []; each element is one announcement object
Keys use lowercase snake_case
Empty fields are written as the empty string "", never null

[
  {
    "project_id": "SY-2026-050(CGZB)",
    "project_name": "乡道Y559线广宁竹海大观段水毁修复工程",
    "title": "广宁县公路事务中心乡道Y559线广宁竹海大观段水毁修复工程结果公告",
    "region": "广宁县",
    "publish_date": "2026-05-30",
    "publish_datetime": "2026-05-30 10:21:02",
    "publisher": "肇庆市尚意项目管理有限公司",
    "detail_url": "https://gdgpo.czt.gd.gov.cn/maincms-web/noticeGd?...",
    "data_status": "normal",
    "collection_timestamp": "2026-06-01T15:30:00+08:00",
    "supplier": "中国云南路建集团股份公司",
    "evaluation_experts": "植建文、李洁妍、黄鹤立（采购人代表）",
    ...
  }
]

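9.3 CSV / Excel format

The first row is the header row (lowercase snake_case, identical to the JSON keys)
CSV uses UTF-8 with BOM (so that Excel detects the Chinese text correctly)
Excel output auto-adjusts column widths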
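The CSV rules can be met with the standard library alone. A minimal sketch, with `write_csv` as a hypothetical helper: Python's "utf-8-sig" codec writes the BOM, and `restval=""` keeps missing fields as empty strings.

```python
import csv


def write_csv(records: list, path: str, fieldnames: list) -> None:
    """Write records as UTF-8-with-BOM CSV; the header row equals the JSON keys."""
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=fieldnames,
            restval="",             # missing fields become empty strings
            extrasaction="ignore",  # keys outside fieldnames are dropped
        )
        writer.writeheader()
        writer.writerows(records)
```

Excel column-width adjustment needs a spreadsheet library and is outside this sketch.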

10. JS Template Reference (full versions)

All JS code runs in browser_console. Because browser_console does not support top-level await, asynchronous steps must be wrapped in an (async () => { ... })() IIFE.

10.1 Type filter + date range + query (all-in-one template)

(async () => {
  // Step 1: click the reset button to clear any previous filter state
  const resetBtns = document.querySelectorAll('button');
  for (const btn of resetBtns) {
    if (btn.textContent.includes('重置')) { btn.click(); break; }
  }
  await new Promise(r => setTimeout(r, 500));

  // Step 2: select "中标（成交）结果公告" (index=7)
  const typeDiv = document.querySelectorAll('.conditionwrap_right')[0];
  if (!typeDiv) return JSON.stringify({error: 'type div not found'});
  if (!typeDiv.children[7]) return JSON.stringify({error: 'type button index 7 not found'});
  typeDiv.children[7].click();
  await new Promise(r => setTimeout(r, 200));

  // Step 3: set the start/end dates (use the native setter to bypass the calendar popup)
  const setter = Object.getOwnPropertyDescriptor(HTMLInputElement.prototype, 'value').set;
  const startInput = document.querySelector('input[placeholder="开始日期"]');
  const endInput = document.querySelector('input[placeholder="结束日期"]');
  if (!startInput || !endInput) return JSON.stringify({error: 'date inputs not found'});

  const startDate = '2026-05-30 00:00:00';
  const endDate = '2026-05-30 23:59:59';

  setter.call(startInput, startDate);
  startInput.dispatchEvent(new Event('input', { bubbles: true }));
  startInput.dispatchEvent(new Event('change', { bubbles: true }));

  setter.call(endInput, endDate);
  endInput.dispatchEvent(new Event('input', { bubbles: true }));
  endInput.dispatchEvent(new Event('change', { bubbles: true }));

  // Step 4: click the query button
  await new Promise(r => setTimeout(r, 300));
  const allBtns = document.querySelectorAll('button');
  for (const btn of allBtns) {
    if (btn.textContent.includes('查询')) { btn.click(); break; }
  }

  // Return a confirmation payload
  return JSON.stringify({
    success: true,
    type: '中标（成交）结果公告',
    startDate: startDate,
    endDate: endDate
  });
})()

10.2 Get the total count

(function() {
  const text = document.body.innerText || '';
  const match = text.match(/共\s*(\d+)\s*条/);
  return match ? match[1] : '0';
})()

10.3 Get the row count on the current page

(function() {
  const rows = document.querySelectorAll('.el-table__body-wrapper .el-table__row');
  return JSON.stringify({
    count: rows.length,
    titles: Array.from(rows).slice(0, 5).map(row => {
      // Sanity check: announcement title of the first 5 rows, read from the Vue row data
      return row.__vue__ ? row.__vue__._props.row.noticeName : 'no vue';
    })
  });
})()

10.4 Navigate to the detail page (safe version, row i)

(function(i) {
  const rows = document.querySelectorAll('.el-table__body-wrapper .el-table__row');
  if (!rows || !rows[i]) return JSON.stringify({error: 'row not found', index: i});
  if (!rows[i].__vue__) return JSON.stringify({error: 'row vue not found', index: i});
  const rowData = rows[i].__vue__._props.row;
  const table = document.querySelector('.el-table');
  if (!table || !table.__vue__) return JSON.stringify({error: 'table vue not found'});
  let p = table.__vue__.$parent;
  for (let d = 0; d < 15 && p; d++) {
    if (typeof p.rowClick === 'function') {
      p.rowClick(rowData);
      return JSON.stringify({success: true, noticeName: rowData.noticeName, id: rowData.id});
    }
    p = p.$parent;
  }
  return JSON.stringify({error: 'rowClick not found'});
})(0)

10.5 Extract the notice-content header metadata (regex version)

(function() {
  const content = document.querySelector('.notice-content');
  if (!content) return JSON.stringify({error: 'notice-content not found'});
  const text = content.textContent.trim();
  const result = {};
  // Each value runs lazily up to the next field label. A negated class such as [^项目]
  // excludes the single characters 项 and 目, so it fails on names like "…项目管理有限公司".
  const patterns = [
    { key: 'publisher', pattern: /发布机构[：:]?\s*([\s\S]+?)(?=发布时间|$)/ },
    { key: 'publish_datetime', pattern: /发布时间[：:]?\s*([\d\-:\s]+?)(?=采购计划编号|$)/ },
    { key: 'procurement_plan_id', pattern: /采购计划编号[：:]?\s*([\w-]+?)(?=预算金额|$)/ },
    { key: 'budget_amount', pattern: /预算金额[：:]?\s*([\d.]+)/ },
    { key: 'procurement_category', pattern: /采购品目[：:]?\s*([\s\S]+?)(?=代理机构|$)/ },
    { key: 'procurement_agent', pattern: /代理机构[：:]?\s*([\s\S]+?)(?=项目经办人|$)/ },
    { key: 'project_handler', pattern: /项目经办人[：:]?\s*([\s\S]+?)(?=项目负责人|$)/ },
    { key: 'project_leader', pattern: /项目负责人[：:]?\s*([^\n]+)/ }
  ];
  patterns.forEach(({ key, pattern }) => {
    const match = text.match(pattern);
    if (!match) return;
    const value = match[1].trim();
    // Keep the single space between date and time; remove whitespace in every other field
    result[key] = key === 'publish_datetime' ? value.replace(/\s+/g, ' ') : value.replace(/\s+/g, '');
  });
  return JSON.stringify(result);
})()

10.6 Extract structured fields from the headings

(function() {
  const result = { project_id: '', project_name: '', evaluation_experts: '',
    announcement_period: '', supplier: '', bid_price: '' };
  const headings = document.querySelectorAll('h4');
  headings.forEach(h => {
    const text = h.textContent.trim();
    if (text.includes('项目编号')) result.project_id = text.replace(/.*项目编号[：:]\s*/, '').trim();
    if (text.includes('项目名称')) result.project_name = text.replace(/.*项目名称[：:]\s*/, '').trim();
    if (text.includes('评审专家')) {
      let p = h.nextElementSibling;
      while (p && p.tagName === 'P') {
        const inner = p.textContent.trim();
        if (inner.includes('、') || inner.includes('采购人代表')) {
          result.evaluation_experts = inner; break;
        }
        p = p.nextElementSibling;
      }
    }
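    if (text.includes('公告期限')) {
      // Only the first paragraph after the heading is needed
      const p = h.nextElementSibling;
      if (p && p.tagName === 'P') result.announcement_period = p.textContent.trim();
    }
  });

  // Extract the supplier table under "三、采购结果".
  // §3.3 and §6.1 describe the neighbouring numbered headings as h4, so h4 is matched too; h3 stays as a fallback.
  const h3s = document.querySelectorAll('h3, h4');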
  h3s.forEach(h3 => {
    if (h3.textContent.includes('采购结果')) {
      const table = h3.nextElementSibling;
      if (table && table.tagName === 'TABLE') {
        const rows = table.querySelectorAll('tr');
        if (rows.length > 1) {
          const cells = rows[1].querySelectorAll('td');
          if (cells.length >= 1) result.supplier = cells[0].textContent.trim();
          if (cells.length >= 3) result.bid_price = cells[2].textContent.trim();
        }
      }
    }
  });

  // Extract the procurement method (from the URL)
  const url = window.location.href;
  const noticeType = url.match(/noticeType=([^&]+)/);
  const NOTICE_TYPE_MAP = {
    '001001': '公开招标', '001002': '邀请招标', '001003': '竞争性谈判',
    '001004': '竞争性磋商', '001005': '询价', '001006': '单一来源',
    '001014': '框架协议', '001026': '竞争性磋商'
  };
  if (noticeType) result.procurement_method = NOTICE_TYPE_MAP[noticeType[1]] || noticeType[1];

  // Extract the purchaser / procurement agency information
  const h6s = document.querySelectorAll('h6');
  h6s.forEach(h6 => {
    const text = h6.textContent.trim();
    if (text.includes('1.采购人信息') || text.includes('采购人信息')) {
      let next = h6.nextElementSibling;
      while (next && next.tagName !== 'H6') {
        const nt = next.textContent.trim();
        if (nt.includes('名')) result.purchaser_name = nt.split(/[：:]/).slice(1).join('').trim();
        if (nt.includes('地')) result.purchaser_address = nt.split(/[：:]/).slice(1).join('').trim();
        if (nt.includes('联系方式')) result.purchaser_phone = nt.split(/[：:]/).slice(1).join('').trim();
        next = next.nextElementSibling;
      }
    }
    if (text.includes('2.采购代理机构信息') || text.includes('采购代理机构信息')) {
      let next = h6.nextElementSibling;
      while (next && next.tagName !== 'H6') {
        const nt = next.textContent.trim();
        if (nt.includes('名')) result.agent_name = nt.split(/[：:]/).slice(1).join('').trim();
        if (nt.includes('地')) result.agent_address = nt.split(/[：:]/).slice(1).join('').trim();
        if (nt.includes('联系方式')) result.agent_phone = nt.split(/[：:]/).slice(1).join('').trim();
        next = next.nextElementSibling;
      }
    }
  });

  return JSON.stringify(result);
})()

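11. Pitfalls and Best Practices

⚠️ Known issues

#	Problem	Solution
---	------	----------
1	`browser_click` cannot trigger Vue component events	All type-filter and query actions must use `browser_console` + JS `click()` + `dispatchEvent`
2	`browser_type` on a date input opens a calendar overlay	Set the value with the native setter from `Object.getOwnPropertyDescriptor`, then `dispatchEvent` to trigger Vue reactivity
3	A leftover calendar popup interferes with later actions	Reset first, and run all filter steps in a single `browser_console` call
4	Every ref ID changes on each page turn	Call `browser_snapshot()` again after turning the page
5	notice-content metadata is not in the accessibility tree	Extract the concatenated fields with JS regexes (§10.5)
6	Title links have no href attribute	Navigate through the Vue component's `rowClick(data)` method (§10.4)
7	Last page is detected through the disabled next-page button	Check the `[disabled]` attribute or "暂无数据" in the snapshot
8	A single day may have no data (weekends/holidays)	Stop immediately once "共 0 条" and "暂无数据" are seen
9	The type switch may not take effect	Use `document.querySelector('.conditionwrap_right .active')` to confirm the text of the selected type
10	The page may use full-width characters	Make regexes and splitting accept the full-width colon `：` and full-width space `\u3000`

✅ Execution-order best practices

Do type + date + query in a single browser_console call, so that everything finishes before a calendar popup or other overlay can interrupt
Detail-page order: first extract the notice-content metadata with browser_console, then extract the structured heading fields with browser_snapshot (the snapshot updates slowly, so grab the metadata that is not in the tree first)
Wrap each record in its own try/except, so an exception never stops the whole run
Wait at least 2 s after turning a page, to give the Vue components enough time to render
When collection is done, browser_navigate back to the home page to release browser resources
The output directory must be confirmed by the user (Phase 0): use the clarify tool, default ~/.hermes/cron/output/gdgpo/
Anti-bot pacing: random delays of 1-3 s, simulated mouse scrolling, no fixed-frequency page turns

🔄 Pagination strategy priority

Strategy 1: click the "下一页" (next page) button → check the disabled attribute
Strategy 2: type the page number into the spinbutton → Enter
Strategy 3: re-run the query JS (back to page one) → check whether there really is only one page

📋 Detail-page field checklist

After entering the detail page, confirm the fields are complete in this order:

✅ detail_url: window.location.href
✅ project_id: h4 "一、项目编号"
✅ project_name: h4 "二、项目名称"
✅ supplier(s): table under "三、采购结果"
✅ Main subject-matter information: "四、主要标的信息"
✅ evaluation_experts: paragraph after "五、评审专家"
✅ announcement_period: paragraph after "七、公告期限"
✅ publisher, publish_datetime, procurement_plan_id, budget_amount, procurement_category, procurement_agent, project_handler, project_leader: notice-content header metadata (extracted with JS regexes)
✅ procurement_method: mapped from the noticeType URL parameter
✅ Purchaser / agency information: paragraphs after the h6 headings (accept full-width colons)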
