
Tom Video Understanding

Local video comprehension skill. Uses ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding.
tomuiv
Uncategorized · clawhub v1.0.0 · 1 version · 100000 · Key: not required
★ 0 stars · 📥 454 downloads · 💾 0 installs · #latest

Overview

Video Understanding

Use this skill when you need to understand the content of a video.

Prerequisites

  • FunASR must be installed in a conda environment named asr-local (Step 3 invokes it with conda run -n asr-local)
  • Ollama must be running, with the qwen3-vl:8b model already pulled
  • ffmpeg must be in PATH
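A small preflight script can confirm these prerequisites before you start. The sketch below assumes Ollama's default local address (http://localhost:11434) and its /api/tags model listing. It also checks that conda is on PATH, because Step 3 calls conda run. The function names preflight and model_available are hypothetical and are not part of the skill.

```python
import json
import shutil
import urllib.request
from typing import Any, Dict, List

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumption: default Ollama address


def model_available(tags: Dict[str, Any], wanted: str = "qwen3-vl:8b") -> bool:
    """True if the /api/tags response lists the wanted model name."""
    return any(m.get("name") == wanted for m in tags.get("models", []))


def preflight() -> List[str]:
    """Return a list of problems; an empty list means everything was found."""
    problems: List[str] = []
    for tool in ("ffmpeg", "conda"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found in PATH")
    try:
        with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=5) as resp:
            if not model_available(json.loads(resp.read())):
                problems.append("qwen3-vl:8b missing (try: ollama pull qwen3-vl:8b)")
    except OSError as exc:  # URLError and connection errors are OSError subclasses
        problems.append(f"Ollama not reachable: {exc}")
    return problems
```

An empty list from preflight() means every check passed. Otherwise, each entry names the missing piece.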

Workflow

Step 1: Extract Audio

ffmpeg -i "video.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "audio.wav" -y
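For scripting, the same command can be built as an argument list and run without a shell, which avoids quoting problems with spaces in file names. This is a minimal sketch. audio_extract_cmd and extract_audio are illustrative names. The only change from the command above is that -y moves to the front, where ffmpeg conventionally takes global options.

```python
import subprocess
from typing import List


def audio_extract_cmd(video: str, wav: str) -> List[str]:
    """Build the Step 1 command: drop video, 16-bit PCM, 16 kHz, mono."""
    return [
        "ffmpeg", "-y",          # overwrite the output without prompting
        "-i", video,
        "-vn",                   # ignore the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM (plain WAV)
        "-ar", "16000",          # 16 kHz, matching the 16k ASR model in Step 3
        "-ac", "1",              # downmix to mono
        wav,
    ]


def extract_audio(video: str, wav: str) -> None:
    # A list argument (no shell) keeps paths with spaces intact
    subprocess.run(audio_extract_cmd(video, wav), check=True)
```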

Note: If the path contains Chinese characters, copy audio.wav to a path without Chinese characters before running ASR.
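Below is one way to automate this note. It assumes the ASR step only needs an ASCII-only copy of the file. ensure_ascii_path is a hypothetical helper that defaults to the system temp directory. That directory can itself be non-ASCII on some Windows accounts, so the helper checks for that too.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def ensure_ascii_path(wav_path: str, workdir: Optional[str] = None) -> str:
    """Return wav_path if it is ASCII-only, else the path of an ASCII-only copy."""
    if wav_path.isascii():
        return wav_path
    target_dir = Path(workdir or tempfile.gettempdir())
    if not str(target_dir).isascii():
        # The temp dir can contain a non-ASCII user name on Windows
        raise ValueError(f"pass an ASCII-only workdir, got {target_dir}")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "audio.wav"  # hypothetical fixed name, overwritten on each call
    shutil.copyfile(wav_path, target)
    return str(target)
```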

Step 2: Extract Key Frames

mkdir frames
ffmpeg -i "video.mp4" -vf "fps=1/10" -q:v 2 "frames/frame_%03d.jpg" -y
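In this command, fps=1/10 keeps one frame every 10 seconds, and -q:v 2 requests high JPEG quality (lower values mean better quality). The sketch below turns the interval into a parameter. frame_extract_cmd and extract_frames are hypothetical names. The %03d pattern stays zero-padded only up to 999 frames, roughly 2.8 hours at this rate, so the returned list is sorted numerically instead of alphabetically.

```python
import subprocess
from pathlib import Path
from typing import List


def frame_extract_cmd(video: str, out_dir: str, every_s: int = 10) -> List[str]:
    """Build the Step 2 command: one JPEG every `every_s` seconds."""
    pattern = str(Path(out_dir) / "frame_%03d.jpg")  # frame_001.jpg, frame_002.jpg, ...
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-vf", f"fps=1/{every_s}",  # one output frame per every_s seconds
        "-q:v", "2",                # JPEG quality scale: lower is better
        pattern,
    ]


def extract_frames(video: str, out_dir: str = "frames", every_s: int = 10) -> List[Path]:
    Path(out_dir).mkdir(parents=True, exist_ok=True)  # like `mkdir frames`, but no error if it exists
    subprocess.run(frame_extract_cmd(video, out_dir, every_s), check=True)
    # Sort by frame number so frame_1000.jpg comes after frame_999.jpg
    return sorted(Path(out_dir).glob("frame_*.jpg"), key=lambda p: int(p.stem.split("_")[1]))
```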

Step 3: Speech Recognition (FunASR)

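conda run -n asr-local python -c "
import os
# Point ModelScope at the local model cache; set it before importing funasr
os.environ['MODELSCOPE_CACHE'] = 'C:/Users/TOM/.cache/modelscope'
from funasr import AutoModel
# SeACo-Paraformer large: Mandarin (zh-cn) ASR on 16 kHz audio, matching Step 1
# disable_update=True skips the FunASR update check at startup
model = AutoModel(
    model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision='v2.0.4',
    disable_update=True,
    ncpu=4
)
# Replace AUDIO_PATH with the (ASCII-only) path to audio.wav from Step 1
result = model.generate(input='AUDIO_PATH')
print(result)
"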

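A multi-line python -c string can be awkward to quote in Windows cmd.exe. The alternative below keeps the same model settings but takes the audio path as an argument and prints only the recognized text. It assumes the script is saved as asr.py (a hypothetical name) and that generate() returns a list of dicts with a 'text' key, which is the usual shape of the printed result. FunASR is imported lazily, so join_transcript (also a hypothetical name) works without it.

```python
import os
import sys
from typing import Any, Dict, List

# Same model settings as the one-liner in Step 3
MODEL_ID = "iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
CACHE_DIR = "C:/Users/TOM/.cache/modelscope"


def join_transcript(result: List[Dict[str, Any]]) -> str:
    """Join the 'text' field of every item returned by generate(), one per line."""
    return "\n".join(item["text"] for item in result if item.get("text"))


def transcribe(wav_path: str) -> str:
    # Keep an existing MODELSCOPE_CACHE, otherwise use the Step 3 cache;
    # set before importing funasr, as in the one-liner
    os.environ.setdefault("MODELSCOPE_CACHE", CACHE_DIR)
    from funasr import AutoModel  # lazy import: join_transcript works without FunASR
    model = AutoModel(model=MODEL_ID, model_revision="v2.0.4", disable_update=True, ncpu=4)
    return join_transcript(model.generate(input=wav_path))


if __name__ == "__main__" and len(sys.argv) > 1:
    print(transcribe(sys.argv[1]))
```

Usage: conda run -n asr-local python asr.py C:/tmp/audio.wav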
Step 4: Image Understanding (qwen3-vl)

ollama run qwen3-vl:8b "Describe this image in detail: /path/to/frame.jpg"
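The same request can go through Ollama's HTTP API, which is convenient when looping over every file in frames/. This sketch assumes Ollama's default address and its /api/generate endpoint. That endpoint accepts base64-encoded images for vision models and, when stream is false, returns a single JSON object whose response field holds the text. build_payload and describe_frame are hypothetical names.

```python
import base64
import json
import urllib.request
from typing import Any, Dict

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"  # assumption: default address


def build_payload(image_path: str,
                  prompt: str = "Describe this image in detail.",
                  model: str = "qwen3-vl:8b") -> Dict[str, Any]:
    """Request body for /api/generate with one base64-encoded image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [image_b64], "stream": False}


def describe_frame(image_path: str, timeout: float = 300.0) -> str:
    """Send one frame to the local model and return its text answer."""
    body = json.dumps(build_payload(image_path)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_GENERATE_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling describe_frame once per frame, in frame order, gives the list of descriptions that Step 5 combines with the transcript.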

Step 5: Combine Results

  • Audio transcription → FunASR (local, Chinese speech recognition)
  • Key frames → qwen3-vl:8b via Ollama (local image understanding)
  • Summary/Analysis → Cloud LLM API (if needed)
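A minimal way to hand both local results to the cloud LLM is a single prompt that lists the transcript and the frame descriptions together. The sketch assumes frames were taken every 10 seconds, as in Step 2, and labels each one with an approximate timestamp (i * interval), because the fps filter may not sample at exactly those instants. build_summary_prompt is a hypothetical name, and the prompt wording is only an example.

```python
from typing import List


def build_summary_prompt(transcript: str, frame_descriptions: List[str],
                         every_s: int = 10) -> str:
    """Merge the FunASR transcript and qwen3-vl frame notes into one prompt."""
    lines = [
        "Summarize this video using its transcript and key-frame descriptions.",
        "",
        "Transcript:",
        transcript.strip() or "(no speech detected)",
        "",
        "Key frames (approximate times):",
    ]
    for i, desc in enumerate(frame_descriptions):
        t = i * every_s  # assumption: frame_001 is near 0 s, frame_002 near every_s, ...
        lines.append(f"[~{t // 60:02d}:{t % 60:02d}] {desc.strip()}")
    return "\n".join(lines)
```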

Important Notes

  • Reading an image with the Read tool does NOT provide image understanding; always use qwen3-vl
  • For Chinese audio, FunASR is preferred over Whisper
  • Check for existing subtitle files (.txt, .srt, .vtt) before running ASR
  • FunASR models are stored in the ModelScope cache at C:/Users/TOM/.cache/modelscope
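A quick lookup in the video's folder is usually enough for the subtitle check. This sketch assumes subtitle files share the video's base name, either exactly (talk.srt) or with an extra tag such as a language code (talk.zh.srt). find_subtitles is a hypothetical helper.

```python
from pathlib import Path
from typing import List

SUBTITLE_EXTS = {".txt", ".srt", ".vtt"}


def find_subtitles(video_path: str) -> List[Path]:
    """Subtitle files in the video's folder whose names start with '<stem>.'."""
    video = Path(video_path)
    prefix = video.stem + "."
    return sorted(
        p for p in video.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.suffix.lower() in SUBTITLE_EXTS
    )
```

If this returns any files, their text can stand in for the FunASR transcript when combining results.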

Version history

1 version in total

  • v1.0.0 (current)
    2026-05-03 06:45 · Safe

Security scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk

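🔗 Related skills

清华网络学堂 (Tsinghua Web Learning)

tomuiv
Tsinghua Web Learning automation: stores the password locally with DPAPI encryption, then logs in automatically, checks to-do items, downloads course materials, submits assignments, and marks items as read in bulk. Intended for day-to-day coursework.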
★ 4 📥 549

Local Video Understanding

tomuiv
Local video understanding skill. Uses ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding.
★ 1 📥 442

Openclaw Task Executor

tomuiv
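Classifies tasks by complexity, plans their execution, spawns and monitors sub-agents as needed, and continuously reports progress and results.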
★ 0 📥 376