Overview

see-video

Extract frames from a video and inject them as a grid image + XML timestamps into LLM context.

Setup (first time only)

cd <skill directory>
npm install

Usage

node {baseDir}/scripts/inject.mjs <video_path> [--mode uniform|highlight] [--start N] [--end N]

On success, outputs JSON to stdout:

{
  "gridPath": "/tmp/video_llm-frames.jpg",
  "description": "<video_frames>...</video_frames>",
  "duration": 1326,
  "frameCount": 28,
  "layout": { "cols": 4, "rows": 7, "cellW": 384, "cellH": 216 },
  "videoWidth": 854,
  "videoHeight": 480,
  "inputSizeMb": 42.3
}

If the video exceeds 10 minutes and uniform mode was used without --start/--end, a hint field is included:

{
  "hint": "Video is 30 minutes long. This is a uniform overview. For better scene coverage re-run with --mode highlight, or use --start/--end to zoom into a specific section."
}

Recommended workflow for long videos:

1. First run with --mode highlight to see key scene changes across the whole video.
2. If the user wants detail on a specific section, re-run with --start N --end N.

On error, the script writes an ERROR: message and a Hint: message to stderr and exits with status 1.
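
The error contract above can be consumed with a small helper. This is a minimal sketch, not part of the skill: the function name `parseInjectError` is hypothetical, and it assumes that the ERROR: and Hint: parts each sit on their own line of stderr, which the source does not state explicitly.

```javascript
// Sketch: split the script's stderr into its ERROR and Hint parts.
// Assumption: each part is on its own line, prefixed "ERROR:" or "Hint:".
function parseInjectError(stderr) {
  const result = { error: null, hint: null };
  for (const rawLine of stderr.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("ERROR:")) {
      result.error = line.slice("ERROR:".length).trim();
    } else if (line.startsWith("Hint:")) {
      result.hint = line.slice("Hint:".length).trim();
    }
  }
  return result;
}
```

If neither prefix is found, both fields stay `null`, so the caller can fall back to a generic failure message.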

Injection procedure

Step 1 — Run the script (bash tool):

node {baseDir}/scripts/inject.mjs "/path/to/video.mp4"

Step 2 — Parse JSON:

Extract gridPath and description.
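
A minimal sketch of this parsing step, assuming stdout contains only the JSON object shown under Usage. The function name `parseInjectResult` is hypothetical; the field names `gridPath`, `description`, and `hint` come from the output examples above.

```javascript
// Sketch: pull the fields needed for injection out of the script's stdout.
// Assumption: stdout holds exactly one JSON object and nothing else.
function parseInjectResult(stdout) {
  let data;
  try {
    data = JSON.parse(stdout);
  } catch (err) {
    throw new Error(`inject.mjs did not print valid JSON: ${err.message}`);
  }
  if (data === null || typeof data !== "object") {
    throw new Error("JSON output is not an object");
  }
  if (typeof data.gridPath !== "string" || typeof data.description !== "string") {
    throw new Error("JSON output is missing gridPath or description");
  }
  return {
    gridPath: data.gridPath,
    description: data.description,
    // hint is only present for long videos sampled in uniform mode
    hint: typeof data.hint === "string" ? data.hint : null,
  };
}
```

Returning `hint` alongside the two required fields lets the caller decide whether to suggest a re-run with --mode highlight or --start/--end.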

Step 3 — Inject image (read tool):

read <gridPath>

The read tool injects the jpg as a native multimodal image block into context.

After viewing the grid, use the description XML timestamps to reference frames:

> "Look at the grid image above. Use the timestamps in the description XML to analyze the video. The number in the top-left of each cell is the frame index."

On error:

Translate the Hint: message into natural language for the user. Do not paste raw error output.
If read fails, the grid file has probably been cleaned up, because /tmp/ files are ephemeral. Re-run the script and read the grid immediately.

Options

| Option | Default | Description |
| --- | --- | --- |
| `--mode uniform` | ✅ | Evenly spaced frames |
| `--mode highlight` | | Scene-change biased sampling |
| `--start N` | `0` | Segment start (seconds) |
| `--end N` | end of video | Segment end (seconds) |
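
The options above can be assembled into an argument list before calling the script. The sketch below is illustrative only: `buildInjectArgs` is a hypothetical helper, and the checks it performs (non-negative seconds, end greater than start) are assumptions about sensible input rather than documented behavior of inject.mjs.

```javascript
// Sketch: build the argument list that follows "node scripts/inject.mjs".
// The validation rules here are assumptions, not documented script behavior.
function checkSeconds(name, value) {
  if (typeof value !== "number" || !Number.isFinite(value) || value < 0) {
    throw new RangeError(`${name} must be a non-negative number of seconds`);
  }
}

function buildInjectArgs(videoPath, options = {}) {
  const { mode = "uniform", start, end } = options;
  if (mode !== "uniform" && mode !== "highlight") {
    throw new Error(`Unknown mode: ${mode}`);
  }
  const args = [videoPath, "--mode", mode];
  if (start !== undefined) {
    checkSeconds("start", start);
    args.push("--start", String(start));
  }
  if (end !== undefined) {
    checkSeconds("end", end);
    args.push("--end", String(end));
  }
  if (start !== undefined && end !== undefined && end <= start) {
    throw new RangeError("end must be greater than start");
  }
  return args;
}
```

The resulting array can be passed to Node's `child_process.execFile("node", [scriptPath, ...args], callback)`, where `scriptPath` stands for the resolved location of scripts/inject.mjs. Passing arguments as an array avoids shell quoting problems with paths that contain spaces.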

Diagnostics

| Error | Cause | Action |
| --- | --- | --- |
| `Input file not found` | File missing or dropped by channel media size limit | Ask the user to share the file path directly as text |
| `corrupt, incomplete, or unsupported format` | Damaged file, interrupted transfer, or unsupported codec | Try a different file, or use `--start`/`--end` to skip problematic sections |
| `moov atom not found` | Incomplete mp4 (streaming not finished) | Retry with a complete file |
| `ffmpeg not found` | ffmpeg not installed | Check ffmpeg installation |
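
The diagnostics table can be encoded as a lookup so that an error message maps directly to a suggested action. This is a sketch under the assumption that the script's error text contains the substrings listed in the Error column; `DIAGNOSTICS` and `suggestAction` are hypothetical names.

```javascript
// Sketch: map known error substrings (from the table above) to actions.
// Assumption: the script's error text contains these substrings verbatim.
const DIAGNOSTICS = [
  {
    match: "Input file not found",
    action: "Ask the user to share the file path directly as text.",
  },
  {
    match: "corrupt, incomplete, or unsupported format",
    action: "Try a different file, or use --start/--end to skip problematic sections.",
  },
  {
    match: "moov atom not found",
    action: "Retry with a complete file.",
  },
  {
    match: "ffmpeg not found",
    action: "Check the ffmpeg installation.",
  },
];

// Returns the suggested action, or null when the error is not in the table.
function suggestAction(errorText) {
  const lower = errorText.toLowerCase();
  const entry = DIAGNOSTICS.find((d) => lower.includes(d.match.toLowerCase()));
  return entry ? entry.action : null;
}
```

Matching is case-insensitive and substring-based so that surrounding text such as a file path does not prevent a match.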

Notes

Frame count and cell size are determined automatically from video duration and aspect ratio
Grid is ~1500×1500px, cell long side 384–512px
Timestamps are in the description XML only, not overlaid on the image
Portrait and landscape videos both supported
Telegram users: if a video file is not attached to the message, check channels.telegram.mediaMaxMb in the OpenClaw config — the file may have been dropped at the channel level before reaching the agent
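
Because the grid geometry is reported in the `layout` field of the JSON output, the frame index printed in each cell can be mapped back to a pixel rectangle. The sketch below assumes that indices start at 1, that cells are laid out row by row from the top-left, and that there is no padding between cells. None of these is confirmed by the source, and `cellRect` is a hypothetical helper.

```javascript
// Sketch: locate a frame's cell inside the grid image.
// Assumptions: 1-based frame index, row-major order, no gaps between cells.
function cellRect(frameIndex, layout) {
  const { cols, rows, cellW, cellH } = layout;
  const zeroBased = frameIndex - 1;
  if (!Number.isInteger(frameIndex) || zeroBased < 0 || zeroBased >= cols * rows) {
    throw new RangeError(`frameIndex must be an integer from 1 to ${cols * rows}`);
  }
  return {
    x: (zeroBased % cols) * cellW,            // column offset in pixels
    y: Math.floor(zeroBased / cols) * cellH,  // row offset in pixels
    width: cellW,
    height: cellH,
  };
}
```

With the example layout from the Usage section (4 columns, 7 rows, 384×216 cells), frame 6 would fall in the second column of the second row under these assumptions.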

Version history

1 version in total

v1.0.0 (current)

2026-05-07 04:49

Security scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
