Video Reader

Tool-driven video question answering with frame extraction, sub-agent analysis, and audio transcription
Author: qiankemeng
Category: Uncategorized · Source: clawhub · Version: v4.1.1 (1 version) · API key: not required

Overview

VideoARM Skill — Tool-Driven Video QA

You are a video QA orchestrator. You do NOT analyze images yourself — you dispatch sub-agents to do it.

Core Philosophy

OBSERVE → THINK → ACT → MEMORY (loop, max 10 iterations)

  • OBSERVE: Read memory file to recall all prior findings
  • THINK: Reason about what information you still need
  • ACT: Extract frames / audio, or spawn sub-agent for analysis
  • MEMORY: Write concise findings to memory file immediately
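
The loop above can be sketched as a small driver. This is a minimal illustration, not the skill's actual implementation: `run_loop` and its four callables are hypothetical names, and only the iteration cap of 10 comes from the text.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 10  # limit stated in the skill text


def run_loop(
    observe: Callable[[], dict],
    think: Callable[[dict], Optional[dict]],
    act: Callable[[dict], dict],
    remember: Callable[[dict, dict], None],
) -> int:
    """Run OBSERVE -> THINK -> ACT -> MEMORY until think() returns None.

    Returns the number of iterations that produced a finding.
    """
    for iteration in range(1, MAX_ITERATIONS + 1):
        memory = observe()          # OBSERVE: re-read memory every turn
        action = think(memory)      # THINK: choose the next action, or None to stop
        if action is None:
            return iteration - 1
        finding = act(action)       # ACT: extract frames/audio or spawn a sub-agent
        finding["iteration"] = iteration
        remember(memory, finding)   # MEMORY: persist the finding immediately
    return MAX_ITERATIONS
```

Passing the four phases as callables keeps the control flow testable without any video tooling; in practice each callable would wrap the tools described later in this document.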

Critical: Context Rebuild

Each turn, read memory file first. Do NOT rely on previous tool outputs in conversation history.

The memory file is your single source of truth. Tool outputs from prior turns may be lost or truncated. Always:

  1. Read /tmp/videoarm_memory.json at the start of each turn
  2. Use memory contents to decide next action
  3. Write new findings to memory immediately after each tool/sub-agent result
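
The three steps above amount to a read-modify-write cycle on one JSON file. A minimal sketch follows; the helper names `load_memory` and `save_memory` are illustrative, and the atomic write via a temporary file is an assumption added here so that an interrupted turn cannot leave a half-written memory file.

```python
import json
import os
import tempfile

MEMORY_PATH = "/tmp/videoarm_memory.json"  # path given in the skill text


def load_memory(path: str = MEMORY_PATH) -> dict:
    """Return the memory dict, or an empty dict if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_memory(memory: dict, path: str = MEMORY_PATH) -> None:
    """Write memory atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(memory, f, ensure_ascii=False, indent=2)
    os.replace(tmp_path, path)
```

A turn would then start with `memory = load_memory()` and call `save_memory(memory)` right after each tool or sub-agent result is merged in.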

Architecture: Orchestrator + Workers

Main Agent (Orchestrator)
  ├── Decides strategy: which time ranges, what questions
  ├── Calls videoarm-extract-frames → gets image path
  ├── Calls videoarm-audio → gets transcript
  ├── Spawns sub-agent(s) with:
  │     ├── Image path (sub-agent reads it with clean context)
  │     ├── Specific question to answer
  │     └── Relevant context (transcript excerpt, options)
  ├── Collects sub-agent results → writes to memory as frame_analyses
  ├── Writes findings to memory
  └── Decides: answer or continue (max 10 iterations)

Why sub-agents?

  • Clean context: No history pollution, focused analysis
  • Better accuracy: Fresh model sees only the relevant image + question
  • Context control: Main agent's context doesn't bloat with image tokens
  • Parallelism: Can spawn multiple sub-agents for different segments

Memory File: /tmp/videoarm_memory.json

Structure (three finding categories, mirroring the source agent pipeline):

{
  "video_path": "/path/to/video.mp4",
  "question": "Who used a tool?",
  "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
  "metadata": {"duration": 2689.74, "fps": 25.0, "total_frames": 67243},
  "scene_snapshots": [
    {
      "iteration": 1,
      "reason": "Initial scan of opening segment",
      "frame_interval": [0, 1500],
      "caption": "Caption: Person X is working with power tools in a workshop"
    }
  ],
  "audio_snippets": [
    {
      "iteration": 2,
      "reason": "Check dialogue in middle section",
      "segments": [
        {
          "frame_interval": [3000, 4500],
          "text": "he really needs work-life balance",
          "start_time": 120.0,
          "end_time": 180.0
        }
      ],
      "text": "he really needs work-life balance"
    }
  ],
  "frame_analyses": [
    {
      "iteration": 3,
      "reason": "Verify tool usage in frames 500-1000",
      "frame_interval": [500, 1000],
      "question": "What tool is the person using?",
      "answer": "The person is using an electric drill on a watermelon",
      "confidence": 0.85
    }
  ],
  "current_answer": "D",
  "confidence": 0.9,
  "iterations_used": 3
}

Memory Categories

| Category | Source Tool | What It Records |
|---|---|---|
| scene_snapshots | videoarm-extract-frames + sub-agent caption | Frame navigation: which ranges were viewed and what was seen |
| audio_snippets | videoarm-audio | Audio transcription segments with frame-aligned timestamps |
| frame_analyses | Sub-agent (clip analyzer pattern) | Targeted analysis: answer + confidence for specific questions about frame ranges |
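
As a sketch of how the three categories might be populated, the helpers below create an empty memory skeleton and append an entry to one category. The names `new_memory` and `record` are hypothetical, and bumping `iterations_used` once per recorded entry is an assumption inferred from the example file, where each entry has its own iteration number.

```python
CATEGORIES = ("scene_snapshots", "audio_snippets", "frame_analyses")


def new_memory(video_path: str, question: str, options: list, metadata: dict) -> dict:
    """Build the initial memory structure with empty finding categories."""
    memory = {
        "video_path": video_path,
        "question": question,
        "options": list(options),
        "metadata": dict(metadata),
    }
    for category in CATEGORIES:
        memory[category] = []
    memory["current_answer"] = None
    memory["confidence"] = 0.0
    memory["iterations_used"] = 0
    return memory


def record(memory: dict, category: str, entry: dict) -> dict:
    """Append an entry to one category and advance the iteration counter."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown memory category: {category}")
    iteration = memory["iterations_used"] + 1  # assumption: one entry per iteration
    memory[category].append({"iteration": iteration, **entry})
    memory["iterations_used"] = iteration
    return memory
```

Rejecting unknown category names early keeps typos from silently creating a fourth list that later turns would never read.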

Available Tools

1. videoarm-download

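Download a video from a URL (YouTube, etc.). The HTTPS_PROXY prefix in the example routes the download through a local proxy; omit or adjust it for your network.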

HTTPS_PROXY=http://127.0.0.1:7890 videoarm-download <url>

Returns: {"path": "/path/to/video.mp4", "cached": false}

2. videoarm-info

Get video metadata.

videoarm-info <path>

Returns: {"fps": 25.0, "total_frames": 67243, "duration": 2689.74, "has_audio": true}

3. videoarm-extract-frames

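Extract frames and tile them into a single grid image. When several ranges are given, the --num-frames budget is split across them in proportion to each range's length. The tool returns only the image path; do NOT read the image yourself, pass the path to a sub-agent instead.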

videoarm-extract-frames --video <path> \
  --ranges '[{"start_frame":0,"end_frame":1500}]' \
  --num-frames 30

Returns: {"image_path": "/tmp/xxx.jpg", ...}
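
To make "distributed proportionally by range length" concrete, here is one way such a split could be computed. This is a sketch using largest-remainder rounding; the tool's real rounding behavior is not documented here, so the exact counts it produces may differ. `allocate_frames` is a hypothetical name.

```python
def allocate_frames(ranges: list, num_frames: int) -> list:
    """Split num_frames across ranges in proportion to each range's length.

    Uses largest-remainder rounding so the counts always sum to num_frames.
    """
    lengths = [r["end_frame"] - r["start_frame"] for r in ranges]
    if not lengths or any(n <= 0 for n in lengths):
        raise ValueError("every range needs end_frame > start_frame")
    total = sum(lengths)
    exact = [num_frames * n / total for n in lengths]
    counts = [int(x) for x in exact]          # floor of each exact share
    leftover = num_frames - sum(counts)
    # Hand the remaining frames to the ranges with the largest fractional part.
    order = sorted(range(len(ranges)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

For example, with ranges of 1500 and 500 frames and a budget of 20, the longer range gets three times as many frames as the shorter one.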

4. videoarm-audio

Transcribe audio from a time range (seconds).

videoarm-audio <path> --start 0 --end 300

Returns: JSON with transcript and segments.

⚠️ Transcript can be very long. Extract key quotes and write to memory immediately.
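
Because videoarm-extract-frames takes frame indices while videoarm-audio takes seconds, the orchestrator has to convert between the two using the fps reported by videoarm-info. A minimal sketch (the helper names are illustrative, and rounding to the nearest frame is an assumption):

```python
def frames_to_seconds(start_frame: int, end_frame: int, fps: float) -> tuple:
    """Convert a frame interval to a (start, end) time range in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return start_frame / fps, end_frame / fps


def seconds_to_frames(start: float, end: float, fps: float) -> tuple:
    """Convert a time range in seconds to the nearest frame interval."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return int(round(start * fps)), int(round(end * fps))
```

With the 25 fps metadata from the memory example, frames 3000 to 4500 correspond to seconds 120.0 to 180.0, which matches the audio_snippets entry shown earlier.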

Sub-Agent Dispatch Patterns

Scene Snapshot (after extracting frames)

Spawn a sub-agent to caption the extracted frames:

sessions_spawn(
  task = """Read this image and analyze it: /tmp/xxx.jpg

Use the read tool to open it (it supports jpg images).

These are {num_frames} frames from a video ({time_range}).

Describe the main scene or action in these frames using a concise English sentence.
Prefix your answer with "Caption: "
""",
  cleanup = "delete"
)

→ Write result to scene_snapshots in memory.
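
The task text and the resulting memory entry can be assembled programmatically. The sketch below only builds strings and dicts; it does not call sessions_spawn. The helper names are hypothetical, the time range format is an assumption, and the frame count is a parameter so that it always matches the --num-frames value actually used.

```python
CAPTION_TASK_TEMPLATE = """Read this image and analyze it: {image_path}

Use the read tool to open it (it supports jpg images).

These are {num_frames} frames from a video ({time_range}).

Describe the main scene or action in these frames using a concise English sentence.
Prefix your answer with "Caption: "
"""


def build_caption_task(image_path: str, num_frames: int,
                       start_frame: int, end_frame: int, fps: float) -> str:
    """Fill in the caption task text for one extracted grid image."""
    time_range = f"{start_frame / fps:.1f}s to {end_frame / fps:.1f}s"
    return CAPTION_TASK_TEMPLATE.format(
        image_path=image_path, num_frames=num_frames, time_range=time_range
    )


def snapshot_entry(iteration: int, reason: str,
                   start_frame: int, end_frame: int, reply: str) -> dict:
    """Turn a sub-agent reply into a scene_snapshots entry."""
    caption = reply.strip()
    if not caption.startswith("Caption: "):
        caption = "Caption: " + caption  # normalize replies that omit the prefix
    return {
        "iteration": iteration,
        "reason": reason,
        "frame_interval": [start_frame, end_frame],
        "caption": caption,
    }
```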

Clip Analyzer (targeted question about frames)

This replaces the source code's clip_analyzer tool. Spawn a sub-agent with a specific question:

sessions_spawn(
  task = """Read this image and analyze it: /tmp/xxx.jpg

Use the read tool to open it (it supports jpg images).

These are {num_frames} frames from a video ({time_range}).
Context: {relevant_context}

Question: {specific_question}

Reply with JSON:
{
  "answer": "your detailed answer",
  "confidence": 0.85,
  "evidence": ["key observation 1", "key observation 2"]
}""",
  cleanup = "delete"
)

→ Write result to frame_analyses in memory with the answer and confidence.
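
Sub-agents do not always return bare JSON; the reply may be wrapped in extra text. A defensive parser along the following lines could turn the reply into a frame_analyses entry. This is a sketch: `parse_clip_reply` and `to_frame_analysis` are hypothetical names, and clamping confidence to the range 0 to 1 is an assumption.

```python
import json


def parse_clip_reply(reply: str) -> dict:
    """Extract the JSON object from a sub-agent reply and normalize its fields."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in sub-agent reply")
    data = json.loads(reply[start:end + 1])
    confidence = float(data.get("confidence", 0.0))
    confidence = max(0.0, min(1.0, confidence))  # clamp to [0, 1]
    return {
        "answer": str(data.get("answer", "")),
        "confidence": confidence,
        "evidence": list(data.get("evidence", [])),
    }


def to_frame_analysis(parsed: dict, iteration: int, reason: str,
                      frame_interval: list, question: str) -> dict:
    """Shape a parsed reply like the frame_analyses entries in the memory file."""
    return {
        "iteration": iteration,
        "reason": reason,
        "frame_interval": list(frame_interval),
        "question": question,
        "answer": parsed["answer"],
        "confidence": parsed["confidence"],
    }
```

The evidence list is parsed but not stored, since the memory example keeps only the answer and confidence; storing it as well would be a reasonable variation.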

Tips for sub-agent tasks:

  • Give specific questions, not vague ones
  • Include relevant context (audio transcript excerpts, character names from earlier findings)
  • Ask for structured JSON output with answer + confidence
  • Set cleanup="delete" to auto-clean

Workflow Example

Turn 1: Initialize

videoarm-download <url>        # Get video
videoarm-info <path>           # Get metadata

→ Create memory file with question + metadata + empty categories

Turn 2: First Sample

videoarm-extract-frames --video <path> --ranges '[...]' --num-frames 30

→ Spawn sub-agent to caption frames

→ Write to scene_snapshots in memory

Turn 3: Audio (if needed)

videoarm-audio <path> --start 0 --end 300

→ Extract key quotes → write to audio_snippets in memory

Turn 4: Focused Analysis

Based on memory, extract a specific frame range and spawn a sub-agent with a targeted question.

→ Write to frame_analyses in memory

Turn 5: Answer

Read memory → synthesize findings → answer with confidence.

Strategy Guidelines

  • Dialogue questions (who said what, why): Start with audio
  • Visual questions (who did what, what happened): Start with frames
  • Mixed questions: Audio first for context, then targeted frame extraction
  • Long videos (>10min): Sample strategically, don't scan everything
  • Multiple choice: Use process of elimination
  • Max iterations: 10 — plan your exploration budget wisely
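
For long videos, one simple reading of "sample strategically" is to start with a few evenly spaced windows and refine from there. The sketch below builds such windows and serializes them in the format that --ranges expects. The function names are hypothetical, even spacing is only one possible strategy, and treating end_frame as allowed to equal total_frames is an assumption.

```python
import json


def sample_windows(total_frames: int, num_windows: int, window_frames: int) -> list:
    """Return evenly spaced frame windows covering the start, middle, and end."""
    if num_windows < 1 or window_frames < 1 or total_frames < 1:
        raise ValueError("all arguments must be positive")
    window = min(window_frames, total_frames)
    if num_windows == 1:
        starts = [0]
    else:
        stride = (total_frames - window) / (num_windows - 1)
        starts = [round(i * stride) for i in range(num_windows)]
    return [{"start_frame": s, "end_frame": s + window} for s in starts]


def ranges_argument(windows: list) -> str:
    """Serialize windows as the compact JSON string passed to --ranges."""
    return json.dumps(windows, separators=(",", ":"))
```

Each window costs frames from the same --num-frames budget, so fewer, wider windows give a coarser overview while more, narrower ones give sharper local detail.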

Decision Making

When to answer:

  • Confidence > 0.85 from multiple sources
  • Evidence is consistent across findings
  • Approaching iteration limit

When to continue:

  • Confidence < 0.7
  • Contradictory evidence
  • Haven't checked the most relevant segment yet
  • Iterations remaining > 3
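
The criteria above can be combined into a single decision function. This is a sketch under stated assumptions: the text does not say how the conditions combine or what to do when confidence falls between 0.7 and 0.85, so the ordering of checks and the handling of that middle band are choices made here, and `decide` is a hypothetical name.

```python
def decide(confidence: float, sources: int, consistent: bool,
           iterations_used: int, max_iterations: int = 10) -> str:
    """Return "answer" or "continue" based on the guidelines above."""
    remaining = max_iterations - iterations_used
    if remaining <= 0:
        return "answer"            # budget exhausted: answer with what we have
    if confidence > 0.85 and sources >= 2 and consistent:
        return "answer"            # high confidence, multiple consistent sources
    if confidence < 0.7 or not consistent:
        return "continue"          # weak or contradictory evidence
    # Middle band (0.7 to 0.85) is not specified in the text.
    # Assumption: keep exploring only while more than 3 iterations remain.
    return "continue" if remaining > 3 else "answer"
```

The "haven't checked the most relevant segment yet" criterion is left to the orchestrator's own reasoning, since it cannot be derived from numeric fields alone.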

Version History

1 version in total

  • v4.1.1 (current)
    2026-05-07 07:32, security status: safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk