Automated wildlife detection pipeline for infrared trail camera footage. Version 2.2 — pre-execution interaction + configurable frame density.
| Optimization | Problem Addressed | Implementation |
|---|---|---|
| Pre-execution interaction | Agent extracted frames before asking user preferences | All questions moved to Phase 0 before any file operations |
| Configurable frame density | High density caused excessive token usage | User selects High/Medium/Low before extraction starts |
| Three-zone frame extraction | Missed detections (animals entering mid-video were missed) | First 25% (50% of frames) + Middle 50% (35% of frames) + Last 25% (15% of frames) |
| Increased frame density | Small/fast animals missed in sparse sampling | 8-28 frames per video (up from 5-20), 800px resolution (up from 640px) |
| Similar species contrast reasoning | Species misjudgment | Prompt model to compare distinguishing features dynamically |
| Human-priority detection rules | False positives in human activity scenes | Prompt: "detected humans → lower wildlife threshold, flag separately" |
| "Suspected wildlife" category | Unclear cases were rejected outright | Output: has_wildlife: True/False/疑似 (疑似 = suspected) instead of binary |
| Cross-frame deduplication prompt | Count errors (same animal counted multiple times) | Prompt instructs model to check cross-frame consistency ("跨帧一致性") |
| Dual-model API support | Single model bias | analyze_api.py supports Qwen-VL with checkpointing and retry |
5-phase pipeline for batch-processing trail camera videos:
| Phase | Task | Script | Agent Role |
|---|---|---|---|
| 0 | Pre-execution interaction and confirmation (mandatory) | N/A | Agent asks user |
| 1 | Scan videos, extract metadata | inventory.py | Auto |
| 2 | Extract frames from each video (three-zone, user-selected density) | extract_frames.py | Auto |
| 3 | Vision analysis + location correction + write results | analyze_api.py / Agent review | API batch or Agent reviews |
| 4 | Export results to Excel | export_excel.py | Auto |
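Key change (v2.2): all interactive confirmation is completed in Phase 0, and no file operation runs until the user has confirmed.
- ffmpeg and ffprobe (ffmpeg.exe / ffprobe.exe on Windows)
- openpyxl

Edit the CONFIG section at the top of scripts/inventory.py and scripts/extract_frames.py: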
FFMPEG_BIN = r"C:\path\to\ffmpeg\bin" # Windows
# FFMPEG_BIN = "/usr/bin" # Linux/macOS
INPUT_DIR = r"C:\TrailCamera\Videos" # Your footage folder
OUTPUT_DIR = r"C:\TrailCamera\Output" # Results folder
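⚠️ Before any file operation, the Agent must complete the following interaction and obtain explicit confirmation from the user.
Prompt the user:
> "🦌 Wildlife visual recognition is about to start. Please specify where the infrared camera is installed (at least to province/region level, e.g. '中国云南省', Yunnan Province, China); this is used to correct species identification results. Defaults to '中国' (China)."
Store location in output/location.txt (single line, e.g. 中国云南省高黎贡山).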
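
A minimal sketch of how the location answer might be persisted, assuming two hypothetical helpers (`save_location`, `load_location`) that are not part of the shipped scripts. An empty answer falls back to the default `中国` named in the prompt, and whitespace is collapsed so the file stays a single line.

```python
from pathlib import Path

DEFAULT_LOCATION = "中国"  # default named in the Phase 0 prompt


def save_location(answer: str, output_dir: str = "output") -> str:
    """Write the user's location (or the default) as a single line."""
    # Collapse all whitespace, including newlines, into single spaces.
    location = " ".join(answer.split()) or DEFAULT_LOCATION
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "location.txt").write_text(location + "\n", encoding="utf-8")
    return location


def load_location(output_dir: str = "output") -> str:
    """Read the stored location back, falling back to the default."""
    path = Path(output_dir) / "location.txt"
    if not path.exists():
        return DEFAULT_LOCATION
    return path.read_text(encoding="utf-8").strip() or DEFAULT_LOCATION
```

The same pattern would fit output/frame_density.txt, which is also a single-line file.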
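Prompt the user with full transparency:
> "🧠 Vision recognition models currently available:
> - A) Qwen-VL-Plus API (default; via the Alibaba Cloud Bailian API; batch processing with species contrast reasoning, cross-frame deduplication, and checkpoint resume)
> - B) Kimi built-in vision (I review frame by frame; suited to small batches or when the API is unavailable)
> - C) Other API (GPT-4o / Claude / Gemini; you need to provide your own API key)
>
> Choose A/B/C, or tell me your preference. If you choose C, please provide the API key and model name."
Configure the corresponding script according to the user's choice: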
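- Option A: analyze_api.py already contains an Alibaba Cloud key and can be used directly
- Option C: edit the analyze_api.py CONFIG, or create a new analysis script

Prompt the user:
> "📐 Frame density (affects recognition accuracy and token / API consumption):
> - High: current setting (8-28 frames per video across the three zones; highest accuracy, highest token consumption)
> - Medium: reduced by 50% (4-14 frames per video across the three zones; balances accuracy and cost)
> - Low: reduced by 75% (2-7 frames per video across the three zones; lowest cost, suited to a quick first-pass screening)
>
> Choose High/Medium/Low."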
Store density selection in output/frame_density.txt (single line: high / medium / low).
Frame density scaling rules:
| User Choice | Scaling | <30s | 30-60s | 60-120s | 120-300s | 300-600s | >600s |
|---|---|---|---|---|---|---|---|
| High | 100% | 8 | 12 | 15 | 18 | 22 | 28 |
| Medium | 50% | 4 | 6 | 8 | 9 | 11 | 14 |
| Low | 25% | 2 | 3 | 4 | 5 | 6 | 7 |
> Minimum guarantee: at least 1 frame is extracted from each zone, so the start, middle, and end of every video are covered.
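
The scaling table can be reproduced with a small lookup, sketched below. The function name `frames_for`, the round-half-up rule, and the handling of exact boundary durations (a 30 s clip is placed in the 30-60 s bucket) are assumptions chosen to match the table values; they are not taken from extract_frames.py. The per-zone minimum is not applied here, because the table itself lists totals below three frames for the shortest low-density clips.

```python
# (upper duration bound in seconds, total frames at high density)
BASE_FRAMES = [
    (30, 8), (60, 12), (120, 15), (300, 18), (600, 22), (float("inf"), 28),
]
SCALE_PERCENT = {"high": 100, "medium": 50, "low": 25}


def frames_for(duration_s: float, density: str) -> int:
    """Return the total frame count for a clip at the chosen density."""
    percent = SCALE_PERCENT[density]  # KeyError on an unknown density
    base = next(n for bound, n in BASE_FRAMES if duration_s < bound)
    # Integer round-half-up: 18 * 25% = 4.5 becomes 5, as in the table.
    # Python's built-in round() would give 4 here (banker's rounding).
    return (base * percent + 50) // 100
```

For example, `frames_for(90, "medium")` yields 8 and `frames_for(200, "low")` yields 5, matching the 60-120s and 120-300s columns.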
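Prompt the user:
> "📋 Please confirm:
> - Location: [location provided by the user]
> - Model: [model chosen by the user]
> - Frame density: [High/Medium/Low]
> - Videos to process: [INPUT_DIR path]
>
> If everything is correct, reply 'start' and I will run scan → frame extraction (at the chosen density) → vision recognition → report export."
Proceed to Phase 1 only after the user replies explicitly (for example "start", "confirm", or "go ahead").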
python scripts/inventory.py
Scans INPUT_DIR recursively for video files (.mp4, .mov, .avi, .mkv, .m4v, .mpg, .mpeg).
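
A minimal sketch of the recursive scan, using only the extension list above. The function name `find_videos` is hypothetical, and matching extensions case-insensitively is an assumption (trail cameras often write upper-case names such as RCNX0001.AVI); inventory.py may behave differently.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv", ".m4v", ".mpg", ".mpeg"}


def find_videos(input_dir: str) -> list[Path]:
    """Return every video file under input_dir, in sorted order."""
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*")
        # Lower-case the suffix so .AVI and .avi are treated alike.
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )
```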
Extracts per video:
- filename
- creation_time, fallback to file mtime

Outputs:
- output/inventory.json: raw data
- output/inventory.xlsx: Excel preview (Phase 1 data only)
- IMG_YYYYMMDD_HHMMSS, YYYY-MM-DD_HH-MM-SS, YYYYMMDD_HHMMSS, YYYY_MM_DD_HH_MM_SS
- creation_time / date tag

python scripts/extract_frames.py
Reads output/inventory.json and output/frame_density.txt, extracts frames per video using three-zone strategy with user-selected density.
Frame naming: Flat structure — output/frames/PIRT0001_frame_001.jpg, PIRT0001_frame_002.jpg, etc.
Frame extraction reads frame_density.txt to determine scaling factor:
- high → 100% of base frame count (8-28 frames)
- medium → 50% of base frame count (4-14 frames)
- low → 25% of base frame count (2-7 frames)

Base frame count (high density):
| Video Duration | Total Frames | First 25% (trigger zone) | Middle 50% (activity zone) | Last 25% (exit zone) |
|---|---|---|---|---|
| < 30 sec | 8 | 4 | 3 | 1 |
| 30–60 sec | 12 | 6 | 4 | 2 |
| 60–120 sec | 15 | 7 | 5 | 3 |
| 120–300 sec | 18 | 9 | 6 | 3 |
| 300–600 sec | 22 | 11 | 8 | 3 |
| > 600 sec | 28 | 14 | 10 | 4 |
> Rationale for three-zone split: Trail camera videos are triggered by motion, but animals may enter at start, linger mid-video, or exit at the end. Three-zone coverage maximizes detection probability across the full clip.
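
The three-zone split can be sketched as a timestamp generator. The per-zone counts come from the table above; the function name `zone_timestamps` and the choice to sample the midpoint of equal sub-intervals within each zone are assumptions for illustration, not the documented behavior of extract_frames.py. How zone counts are derived for medium and low density is not specified here, so the sketch takes the counts as an argument.

```python
# Frames per zone at high density, keyed by total frame count (from the table).
ZONE_TABLE = {
    8: (4, 3, 1), 12: (6, 4, 2), 15: (7, 5, 3),
    18: (9, 6, 3), 22: (11, 8, 3), 28: (14, 10, 4),
}
# Zone boundaries as fractions of the clip: first 25%, middle 50%, last 25%.
ZONE_BOUNDS = ((0.0, 0.25), (0.25, 0.75), (0.75, 1.0))


def zone_timestamps(duration_s: float, counts: tuple[int, int, int]) -> list[float]:
    """Return sample times in seconds, evenly spread inside each zone."""
    stamps: list[float] = []
    for (start, end), n in zip(ZONE_BOUNDS, counts):
        if n <= 0:
            continue  # nothing requested for this zone
        zone_start = start * duration_s
        step = (end - start) * duration_s / n
        # Use the midpoint of each sub-interval so that no sample falls
        # exactly on the end of the clip, where a seek may return no frame.
        stamps.extend(round(zone_start + step * (i + 0.5), 3) for i in range(n))
    return stamps
```

For a 20-second clip, `zone_timestamps(20.0, ZONE_TABLE[8])` places four samples in the first 5 seconds, three between 5 and 15 seconds, and one at 17.5 seconds.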
Frame resolution: 800px width (up from 640px) for better small animal detail.
Frame quality: JPEG quality=2 (high).
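
As an illustration of the resolution, quality, and naming settings, the sketch below builds one ffmpeg command per frame. It assumes that "quality=2" corresponds to ffmpeg's `-q:v 2` JPEG quantizer setting and that frames are grabbed with an input seek (`-ss` before `-i`); the helper names are hypothetical and extract_frames.py may assemble its command differently.

```python
from pathlib import Path


def frame_name(video_stem: str, index: int) -> str:
    """Flat naming scheme, e.g. PIRT0001_frame_001.jpg."""
    return f"{video_stem}_frame_{index:03d}.jpg"


def ffmpeg_frame_command(ffmpeg_bin: str, video: str,
                         timestamp_s: float, out_path: str) -> list[str]:
    """Build an argument list that extracts a single frame as JPEG."""
    exe = str(Path(ffmpeg_bin) / "ffmpeg")  # FFMPEG_BIN is the bin directory
    return [
        exe, "-y",                       # overwrite existing output
        "-ss", f"{timestamp_s:.3f}",     # seek before decoding the input
        "-i", str(video),
        "-frames:v", "1",                # write exactly one frame
        "-vf", "scale=800:-2",           # 800px wide, even height, aspect kept
        "-q:v", "2",                     # low quantizer = high JPEG quality
        str(out_path),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with spaces in Windows paths.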
Run the vision analysis that matches the user's choice in Phase 0b:
Option A — API Batch Mode (Qwen-VL)
python scripts/analyze_api.py
Sends frames per video to Qwen-VL-Plus API (frame count depends on density selected in Phase 0c).
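
Checkpointing and retry for a batch run could look roughly like the sketch below. It makes no real API call: `analyze` is a hypothetical callable standing in for whatever sends one video's frames to the model, and the checkpoint file layout, retry count, and linear backoff are all assumptions rather than the behavior of analyze_api.py.

```python
import json
import time
from pathlib import Path


def run_with_checkpoint(videos, analyze, checkpoint_path,
                        max_retries=3, backoff_s=2.0):
    """Analyze each video once, saving progress after every success."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    failed = []
    for name in videos:
        if name in done:  # finished in an earlier run, skip it
            continue
        for attempt in range(1, max_retries + 1):
            try:
                done[name] = analyze(name)
            except Exception:  # broad on purpose in this sketch
                if attempt == max_retries:
                    failed.append(name)  # not checkpointed, so a rerun retries it
                else:
                    time.sleep(backoff_s * attempt)  # linear backoff
                continue
            # Persist immediately so an interruption loses at most one video.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(done, ensure_ascii=False, indent=2),
                            encoding="utf-8")
            break
    return done, failed
```

Running the function a second time with the same checkpoint path skips every video already recorded, which is what makes an interrupted batch resumable.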
Option B — Agent Manual Review
> Agent views extracted frames using read tool on image files and records per-video summary.
Option C — Other API
> Use user-provided API key and model.
After raw vision results are in, read output/location.txt and apply correction rules:
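- Use the regional species reference and morphological descriptions in references/wildlife_guide.md to reason by elimination among similar species. Corrections do not hard-code specific species pairs; instead, the morphological features of species common to the region (body size, coat color, tail shape, behavior patterns, etc.) are compared dynamically against what was actually detected.

After applying location-based correction, write the final results to output/vision_analysis.json: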
{
"location": "中国云南省高黎贡山",
"model_used": "qwen-vl-plus",
"frame_density": "medium",
"correction_applied": true,
"videos": [
{
"filename": "RCNX0001.AVI",
"has_human": false,
"has_wildlife": true,
"species_detected": ["野猪"],
"individual_count": {"野猪": 2},
"confidence": "high",
"notes": "夜间拍摄,成年个体带幼崽,从左侧进入画面"
}
]
}
Field reference:
| Field | Description |
|---|---|
| has_human | True / False / 疑似 (suspected) |
| has_wildlife | True / False / 疑似 ("疑似", meaning suspected, was added in v2 for hard-to-identify cases) |
| species_detected | List of species names or [] |
| individual_count | Dict: {species: count} or total int |
| confidence | high / medium / low (low for suspected/unclear cases) |
| notes | Free text: behavior, weather, lighting, API raw response, correction notes |
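
A record can be checked against the field reference before it is written. The validator below is a sketch under the assumption that the table is the complete schema; `validate_record` is a hypothetical helper, not part of the shipped scripts.

```python
SUSPECTED = "疑似"  # the third state allowed for has_human / has_wildlife
CONFIDENCE_LEVELS = {"high", "medium", "low"}


def _is_tri_state(value) -> bool:
    return isinstance(value, bool) or value == SUSPECTED


def _is_count(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for key in ("has_human", "has_wildlife"):
        if not _is_tri_state(record.get(key)):
            problems.append(f"{key}: expected true, false, or {SUSPECTED!r}")
    species = record.get("species_detected")
    if not (isinstance(species, list) and all(isinstance(s, str) for s in species)):
        problems.append("species_detected: expected a list of names")
    count = record.get("individual_count")
    per_species = isinstance(count, dict) and all(
        isinstance(k, str) and _is_count(n) for k, n in count.items()
    )
    if not (per_species or _is_count(count)):
        problems.append("individual_count: expected {species: count} or an int")
    confidence = record.get("confidence")
    if not (isinstance(confidence, str) and confidence in CONFIDENCE_LEVELS):
        problems.append("confidence: expected high, medium, or low")
    if not isinstance(record.get("notes"), str):
        problems.append("notes: expected free text")
    return problems
```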
Writing command (for Agent or script):
import json

# location, model_name, density and corrected_results come from the earlier
# steps: location.txt, the Phase 0b choice, frame_density.txt, and the
# corrected per-video results.
vision_analysis = {
    "location": location,            # from location.txt
    "model_used": model_name,        # e.g. "qwen-vl-plus" or "kimi-vision"
    "frame_density": density,        # from frame_density.txt
    "correction_applied": True,
    "videos": corrected_results,     # list of dicts
}
with open("output/vision_analysis.json", "w", encoding="utf-8") as f:
    json.dump(vision_analysis, f, ensure_ascii=False, indent=2)
python scripts/export_excel.py
Reads inventory.json + vision_analysis.json, merges data, writes structured Excel:
| Column | Source |
|---|---|
| 序号 (No.) | auto |
| 原始文件名 (Original filename) | inventory |
| 拍摄时间 (Capture time) | inventory (parsed) |
| 视频时长(秒) (Video duration, s) | inventory |
| 是否有人类 (Human present) | vision_analysis |
| 是否有野生动物 (Wildlife present) | vision_analysis |
| 识别物种 (Species identified) | vision_analysis (comma-separated) |
| 个体数量 (Individual count) | vision_analysis |
| 置信度 (Confidence) | vision_analysis |
| 备注 (Notes) | vision_analysis |
Output: output/wildlife_report.xlsx
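
The merge step can be sketched in plain Python as follows. The inventory keys `timestamp` and `duration_s` are hypothetical, since the layout of inventory.json is not shown here, and the text format used for per-species counts is also an assumption; only `filename` and the vision_analysis fields come from the schema above.

```python
HEADERS = ["序号", "原始文件名", "拍摄时间", "视频时长(秒)", "是否有人类",
           "是否有野生动物", "识别物种", "个体数量", "置信度", "备注"]


def merge_rows(inventory: list[dict], analysis: dict) -> list[list]:
    """Join inventory entries with vision results by filename."""
    by_name = {v["filename"]: v for v in analysis.get("videos", [])}
    rows = [HEADERS]
    for index, item in enumerate(inventory, start=1):
        result = by_name.get(item["filename"], {})  # empty if never analyzed
        count = result.get("individual_count", "")
        if isinstance(count, dict):
            # Flatten {species: count} into one cell (format is an assumption).
            count = ", ".join(f"{name}: {n}" for name, n in count.items())
        rows.append([
            index,
            item["filename"],
            item.get("timestamp", ""),    # hypothetical inventory key
            item.get("duration_s", ""),   # hypothetical inventory key
            result.get("has_human", ""),
            result.get("has_wildlife", ""),
            ", ".join(result.get("species_detected", [])),
            count,
            result.get("confidence", ""),
            result.get("notes", ""),
        ])
    return rows
```

Each row is a plain list, so it can be handed to openpyxl with `worksheet.append(row)` when writing the workbook. Videos that appear in the inventory but have no vision result keep their inventory columns and leave the remaining cells empty.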
Color coding:
--fps 1 if you want 1 frame per second (modify extract_frames.py CONFIG)