Overview

Video Understanding

Use this skill when you need to understand the content of a video.

Prerequisites

The FunASR conda environment (asr-local) must be available for audio processing; Step 3 invokes it through conda run
Ollama must be running with qwen3-vl:8b model available
ffmpeg must be in PATH
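
A quick preflight check can catch a missing tool before the workflow starts. The sketch below only tests whether the executables are on PATH. The helper name missing_tools is hypothetical, and the check does not verify that the asr-local environment exists or that the qwen3-vl:8b model has been pulled.

```python
import shutil

def missing_tools(names=("ffmpeg", "ollama", "conda")):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("ffmpeg, ollama and conda were all found on PATH")
```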

Workflow

Step 1: Extract Audio

ffmpeg -i "video.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "audio.wav" -y

Note: If the path contains Chinese characters, copy audio.wav to a path without Chinese characters before running ASR.
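
The copy described in the note can be automated. This is a minimal sketch, assuming the system temporary directory itself has an ASCII-only path. The helper name ascii_safe_copy and the fixed file name audio are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

def ascii_safe_copy(audio_path):
    """Return a path to the audio whose full path is ASCII-only.

    If the original path is already ASCII, it is returned unchanged.
    Otherwise the file is copied into a fresh temporary directory
    under a fixed ASCII name (assumes the temp directory is ASCII).
    """
    src = Path(audio_path)
    if str(src.resolve()).isascii():
        return src
    tmp_dir = Path(tempfile.mkdtemp(prefix="asr_"))
    dst = tmp_dir / ("audio" + src.suffix)
    shutil.copyfile(src, dst)
    return dst
```

The returned path is what would be substituted for AUDIO_PATH in Step 3.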

Step 2: Extract Key Frames

mkdir frames
ffmpeg -i "video.mp4" -vf "fps=1/10" -q:v 2 "frames/frame_%03d.jpg" -y
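
The same extraction can be driven from Python, which also avoids the error that mkdir raises when frames already exists. This is a sketch only: the function names and the interval_s parameter are hypothetical, and -y is placed first as a global option rather than last.

```python
import subprocess
from pathlib import Path

def frame_command(video, out_dir="frames", interval_s=10):
    """Build the ffmpeg argument list for one JPEG every `interval_s` seconds."""
    pattern = str(Path(out_dir) / "frame_%03d.jpg")
    return [
        "ffmpeg", "-y", "-i", str(video),
        "-vf", f"fps=1/{interval_s}",
        "-q:v", "2",
        pattern,
    ]

def extract_frames(video, out_dir="frames", interval_s=10):
    """Create `out_dir` if needed, run ffmpeg, and return the sorted frame paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(frame_command(video, out_dir, interval_s), check=True)
    return sorted(Path(out_dir).glob("frame_*.jpg"))
```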

Step 3: Speech Recognition (FunASR)

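conda run -n asr-local python -c "
import os
# Point ModelScope at the local model cache before funasr is imported
os.environ['MODELSCOPE_CACHE'] = 'C:/Users/TOM/.cache/modelscope'
from funasr import AutoModel
# Load the Chinese Paraformer model (16 kHz input, matching the audio from Step 1)
model = AutoModel(
    model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision='v2.0.4',
    disable_update=True,
    ncpu=4
)
# Replace AUDIO_PATH with the path to audio.wav
result = model.generate(input='AUDIO_PATH')
print(result)
"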
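
The printed result is a raw Python object, so a small helper can reduce it to plain text. The sketch below assumes that generate returns a list of dictionaries, each with a text field. That shape is an assumption to verify against the installed FunASR version, and the sample data is made up for illustration.

```python
def transcript_text(result):
    """Join the 'text' fields of a FunASR-style result list into one string."""
    parts = []
    for item in result:
        text = item.get("text", "").strip()
        if text:
            parts.append(text)
    return "\n".join(parts)

# Hypothetical sample shaped like the assumed output, not real ASR output.
sample = [{"key": "audio", "text": "你好 世界"}]
print(transcript_text(sample))
```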

Step 4: Image Understanding (qwen3-vl)

ollama run qwen3-vl:8b "Describe this image in detail: /path/to/frame.jpg"
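
When many frames need describing, calling Ollama's local HTTP API from a script may be more convenient than running the CLI once per file. The following is a sketch, assuming the server listens on its default address http://localhost:11434. The function names and the timeout value are hypothetical.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address

def build_payload(image_path, prompt="Describe this image in detail.",
                  model="qwen3-vl:8b"):
    """Build the JSON body for /api/generate with one base64-encoded image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt,
            "images": [image_b64], "stream": False}

def describe_image(image_path, timeout_s=300):
    """Send one frame to the local Ollama server and return its description."""
    body = json.dumps(build_payload(image_path)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

Calling describe_image on each file in frames yields one description per key frame.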

Step 5: Combine Results

Audio transcription → FunASR (local, Chinese speech recognition)
Key frames → qwen3-vl:8b via Ollama (local image understanding)
Summary/Analysis → Cloud LLM API (if needed)
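
One way to combine the two local outputs before sending them to a cloud LLM is to merge them into a single prompt. This is an illustrative sketch: the function name, the prompt wording, and the frame labels (which mirror the frame_%03d naming from Step 2) are all assumptions.

```python
def build_summary_prompt(transcript, frame_descriptions):
    """Merge the transcript and per-frame descriptions into one prompt string.

    `frame_descriptions` is a list ordered by frame number, starting at frame 1.
    """
    lines = ["Transcript:", transcript.strip() or "(no speech detected)", "",
             "Key frames:"]
    for number, description in enumerate(frame_descriptions, start=1):
        lines.append(f"[frame_{number:03d}] {description.strip()}")
    lines += ["", "Summarize the video using both sources."]
    return "\n".join(lines)
```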

Important Notes

Reading an image with the Read tool does NOT provide image understanding; always use qwen3-vl
For Chinese audio, FunASR is preferred over Whisper
Check for existing subtitle files (.txt, .srt, .vtt) before running ASR
FunASR models are cached by ModelScope at C:/Users/TOM/.cache/modelscope
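
The subtitle check from the notes above can be sketched as follows. It assumes that a subtitle file sits next to the video and shares its base name, which is a common convention but not something this skill guarantees. The helper name find_subtitles is hypothetical.

```python
from pathlib import Path

SUBTITLE_SUFFIXES = (".srt", ".vtt", ".txt")

def find_subtitles(video_path):
    """Return subtitle files next to the video that share its base name."""
    video = Path(video_path)
    candidates = [video.with_suffix(suffix) for suffix in SUBTITLE_SUFFIXES]
    return [path for path in candidates if path.is_file()]
```

If the returned list is non-empty, Step 3 can be skipped and the subtitle text used as the transcript.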

Version History

1 version in total

v1.0.0 (current)

2026-05-03 06:45

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
