Video Subtitle Extractor 🎬→📝

Cross-platform multi-engine ASR subtitle extraction pipeline. Downloads audio from any yt-dlp-compatible video platform, transcribes with SenseVoice / whisper.cpp / openai-whisper, and applies LLM-based text calibration for Chinese content.

Default engine: SenseVoice Small (Alibaba FunASR) — ~1.5GB RAM, 234MB disk, 20× realtime speed, ~96% Chinese accuracy.

Tested & verified on Windows 11 with real Bilibili & Xiaohongshu videos.

Quick Start

# One-command full pipeline (SenseVoice Small — default, blazing fast for Chinese)
python scripts/run.py <video_url> --output-dir ./output

# Use whisper.cpp GGML (quantized, ~2GB RAM)
python scripts/run.py <video_url> --backend whispercpp --model medium-q5_1

# Use openai-whisper (standard, 5GB RAM)
python scripts/run.py <video_url> --backend openai --model medium

# Download audio only
python scripts/download_audio.py <video_url> <output_dir>

# Download audio + video (keep both as middleware)
python scripts/download_audio.py <video_url> <output_dir> --save-video --video-quality 1080

# Transcribe existing audio with auto backend selection
python scripts/transcribe.py <audio_file> --backend auto --language zh

# Transcribe with specific backend
python scripts/transcribe.py <audio_file> --backend sensevoice --language zh
python scripts/transcribe.py <audio_file> --backend whispercpp --model medium-q5_1
python scripts/transcribe.py <audio_file> --backend openai --model medium

When to Use This Skill

Use this skill when:

The video has no built-in subtitles (Bilibili, Xiaohongshu, YouTube, etc.)
You need high-accuracy Chinese transcription (~96% with SenseVoice Small, ~95% with the Whisper medium model)
You want multiple output formats (TXT, SRT, VTT, JSON)
You need LLM-assisted text calibration for financial/technical terms
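The user says: "下载字幕" (download subtitles), "提取字幕" (extract subtitles), "语音转文字" (speech to text), "视频转文字" (video to text), "字幕提取" (subtitle extraction), "ASR转写" (ASR transcription)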

Workflow

Step 0: Install Dependencies (once)

python scripts/install_deps.py

Auto-detects the OS and installs ffmpeg (winget/brew/apt), yt-dlp (pip), and openai-whisper (pip). On Windows, ffmpeg is located even when it is not in PATH.
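The ffmpeg lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in install_deps.py: the function names `find_ffmpeg` and `add_to_path` are hypothetical, and the list of candidate directories is left to the caller because the real search paths are not listed here.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Iterable, Optional


def find_ffmpeg(
    candidate_dirs: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the directory that contains ffmpeg, or None if it is not found.

    Hypothetical helper: `candidate_dirs` stands in for the install
    locations the real script probes.
    """
    # 1. Prefer whatever is already reachable through PATH.
    on_path = which("ffmpeg")
    if on_path:
        return str(Path(on_path).parent)

    # 2. Otherwise probe the candidate install directories in order.
    for directory in candidate_dirs:
        for name in ("ffmpeg.exe", "ffmpeg"):
            if (Path(directory) / name).is_file():
                return str(Path(directory))
    return None


def add_to_path(directory: str) -> None:
    """Prepend a directory to PATH for the current process only."""
    parts = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory, *parts])
```

Passing `which` as a parameter keeps the lookup testable without a real ffmpeg install; the PATH change affects only the running process, so child processes such as yt-dlp inherit it while the system configuration stays untouched.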

Step 1: Download Audio

Run scripts/download_audio.py <video_url> [output_dir].

Uses yt-dlp to extract the best available audio format (m4a preferred). Supports Bilibili, YouTube, and 1800+ yt-dlp-compatible platforms. The script automatically detects ffmpeg even when not in system PATH.
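As a rough sketch of what an m4a-preferred audio download looks like, the snippet below builds a yt-dlp command line. It assumes the `yt-dlp` executable is on PATH; the function names and the output template are illustrative choices, not taken from download_audio.py.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_audio_command(
    url: str,
    output_dir: str,
    cookies_browser: Optional[str] = None,
) -> List[str]:
    """Build a yt-dlp command that downloads the best audio-only stream."""
    cmd = [
        "yt-dlp",
        # Prefer an m4a audio stream, then any audio stream,
        # then the best combined stream as a last resort.
        "-f", "bestaudio[ext=m4a]/bestaudio/best",
        # Name the file after the video title inside the output directory.
        "-o", str(Path(output_dir) / "%(title)s.%(ext)s"),
    ]
    if cookies_browser:
        # Reuse login cookies from a local browser, e.g. "chrome".
        cmd += ["--cookies-from-browser", cookies_browser]
    cmd.append(url)
    return cmd


def download_audio(url: str, output_dir: str, cookies_browser: Optional[str] = None) -> None:
    """Run the download; raises CalledProcessError if yt-dlp fails."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_audio_command(url, output_dir, cookies_browser), check=True)
```

Keeping command construction separate from execution makes it easy to retry the same download with `cookies_browser="chrome"` when the first attempt fails.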

Optional: Download video as middleware

Add --save-video to persist the full video alongside audio:

python scripts/download_audio.py <url> --save-video --video-quality 1080
python scripts/run.py <url> --save-video --video-quality 720 --calibrate

--video-quality preset table:

| Preset | yt-dlp behaviour | Typical result |
|--------|------------------|----------------|
| best (default) | Highest available | 4K on YouTube, 480p on Bilibili (no login) |
| 1080 | ≤1080p, falls back gracefully | 1080p where available |
| 720 | ≤720p | Good balance for local storage |
| 480 | ≤480p | Minimum acceptable for reference |
| 360 | ≤360p | Extremely small files |
| raw string | Direct yt-dlp format selector | Full flexibility |

> ⚠️ Bilibili note: Without login cookies, Bilibili caps at 480p. 720p+ requires --cookies-from-browser.
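One plausible way to turn a --video-quality value into a yt-dlp format selector is sketched below. The function name `video_format_selector` and the exact selector strings are assumptions for illustration; the script's actual selectors may differ.

```python
def video_format_selector(quality: str = "best") -> str:
    """Translate a --video-quality value into a yt-dlp format selector.

    Hypothetical mapping: numeric presets become height caps, and any
    other string is passed through unchanged as a raw selector.
    """
    if quality == "best":
        return "bestvideo+bestaudio/best"
    if quality.isdigit():
        height = int(quality)
        # Try separate video+audio streams under the cap, then a single
        # combined stream under the cap, then whatever is available.
        return (
            f"bestvideo[height<={height}]+bestaudio"
            f"/best[height<={height}]"
            f"/best"
        )
    # Anything else is treated as a raw yt-dlp format selector.
    return quality
```

The trailing `/best` alternative is what gives the graceful fallback: when a site offers nothing at or below the requested height, yt-dlp still downloads the best stream it can find instead of failing.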

If download fails: the video may require cookies. Try:

yt-dlp --cookies-from-browser chrome <url>

Step 2: ASR Transcription (Multi-Backend)

Run scripts/transcribe.py <audio_file>.

Three backends, auto-selected by default (priority: SenseVoice → whisper.cpp → openai-whisper):

🥇 SenseVoice Small (default for Chinese)

| Property | Value |
|----------|-------|
| RAM | ~1.5GB |
| Disk | ~234MB |
| Speed | 20× realtime (CPU) |
| Chinese accuracy | ~96% 🏆 |
| Model source | ModelScope (auto-download, no VPN needed) |
| Install | pip install funasr modelscope |

> Why SenseVoice? Alibaba's FunASR engine, Chinese-optimized from the ground up. No HuggingFace download needed (uses ModelScope mirror in China). ~4× faster than openai-whisper medium on CPU, with comparable or better Chinese accuracy.

🥈 whisper.cpp (GGML quantized, CPU-optimized)

| Model | RAM | Disk | Speed | Accuracy | Best for |
|-------|-----|------|-------|---------|----------|
| tiny-q5_1 | ~0.5GB | 32MB | fastest | low | Testing |
| medium-q5_1 | ~2GB | 1.1GB | 3-5× faster than openai | ~95% ⭐ | Lightweight quality |

> Setup: run pip install pywhispercpp, then download a GGML model from huggingface.co/ggerganov/whisper.cpp and place it in ~/.cache/whispercpp/. Run transcribe.py --show-backends for instructions.

🥉 openai-whisper (standard)

| Model | RAM | Disk | Speed | Accuracy | Best for |
|-------|-----|------|-------|---------|----------|
| medium | ~5GB | 1.42GB | ~165 fps | ~95% | Standard |
| large-v3 | ~10GB | 2.88GB | ~80 fps | ~97% | Best accuracy |
| large-v3-turbo | ~6GB | 1.6GB | ~120 fps | ~96% | Good balance |

> Note: the small model was removed (88-90% accuracy, superseded by SenseVoice). Use SenseVoice when you need a lightweight model.
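The auto-selection priority described in this step (SenseVoice → whisper.cpp → openai-whisper) could be sketched as a check for which Python packages are importable. This is an illustrative approximation: `pick_backend` and `BACKEND_MODULES` are hypothetical names, and the real transcribe.py may use additional criteria such as whether a GGML model file is present.

```python
from importlib.util import find_spec
from typing import Callable, Dict

# Backend name -> module whose presence signals that the backend is installed.
# Dict order doubles as the priority order.
BACKEND_MODULES: Dict[str, str] = {
    "sensevoice": "funasr",
    "whispercpp": "pywhispercpp",
    "openai": "whisper",
}


def is_installed(module: str) -> bool:
    """True if the top-level module can be imported in this environment."""
    return find_spec(module) is not None


def pick_backend(
    requested: str = "auto",
    installed: Callable[[str], bool] = is_installed,
) -> str:
    """Resolve a --backend value to a concrete backend name."""
    if requested != "auto":
        if requested not in BACKEND_MODULES:
            raise ValueError(f"unknown backend: {requested}")
        if not installed(BACKEND_MODULES[requested]):
            raise RuntimeError(f"backend '{requested}' is not installed")
        return requested

    # Auto mode: first installed backend in priority order wins.
    for name, module in BACKEND_MODULES.items():
        if installed(module):
            return name
    raise RuntimeError("no ASR backend is installed")
```

An explicitly requested backend fails loudly instead of silently falling back, which keeps benchmark comparisons honest; only `auto` walks the fallback chain.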

Step 3: Rule-Based Calibration

After transcription, apply calibrate.py for mechanical corrections:

# Calibrate a raw .txt transcript
python scripts/calibrate.py <output_dir>/<video_title>.txt
# Output: <video_title>_calibrated.txt

# Skip traditional→simplified conversion (already simplified input)
python scripts/calibrate.py raw.txt --no-tradsimp

Calibration categories: homophone fixes, financial terms, AI/tech company names, semiconductor/hardware terms, traditional→simplified (600+ chars).

For context-aware fixes (semantic errors, ambiguous names), use LLM review on top of rule-based output.

See references/calibration_guide.md for the full 80+ pattern library.
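To make the idea of rule-based calibration concrete, here is a minimal sketch of category-grouped string replacement with per-category counts. The `RULES` table is a tiny hypothetical excerpt (the three pairs are borrowed from the semiconductor examples in the changelog), and `calibrate` is an illustrative function, not the interface of calibrate.py.

```python
from collections import Counter
from typing import Dict, List, Tuple

# category -> list of (wrong, right) pairs. Hypothetical excerpt only;
# the full pattern library lives in the calibration guide.
RULES: Dict[str, List[Tuple[str, str]]] = {
    "semiconductor": [
        ("吸片", "芯片"),
        ("奈米", "纳米"),
        ("量子碎穿", "量子隧穿"),
    ],
}


def calibrate(
    text: str,
    rules: Dict[str, List[Tuple[str, str]]] = RULES,
) -> Tuple[str, Counter]:
    """Apply every replacement rule and count corrections per category."""
    counts: Counter = Counter()
    for category, pairs in rules.items():
        for wrong, right in pairs:
            hits = text.count(wrong)
            if hits:
                text = text.replace(wrong, right)
                counts[category] += hits
    return text, counts


if __name__ == "__main__":
    fixed, stats = calibrate("这颗吸片采用3奈米工艺。")
    print(fixed)        # corrected transcript
    print(dict(stats))  # corrections per category
```

Plain substring replacement is deliberately mechanical: it cannot tell when a "wrong" string is actually intended, which is why context-dependent cases are left to LLM review.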

Step 4: Deliver Results

Present the calibrated text. Always include:

Backend and model used (e.g. SenseVoice Small, medium, large-v3) and quality notes
Any sections with low confidence or unclear audio
Summary of corrections applied (counts by category)
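When the deliverable is a subtitle file rather than plain text, timed segments have to be rendered in SRT's `HH:MM:SS,mmm` layout. The sketch below assumes segments arrive as `(start_seconds, end_seconds, text)` tuples, which is an assumption about the data shape and not the pipeline's documented output structure.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, str]  # (start seconds, end seconds, text)


def srt_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS,mmm (SRT uses a comma before ms)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as SRT cues, numbered from 1 and separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)
```

Working in integer milliseconds avoids float rounding artifacts such as a cue ending at `,999` instead of the next whole second.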

Intermediate Artifacts & Step-by-Step Reuse

Every pipeline stage saves its output to disk. All artifacts persist in output_dir/ after the run — no data is lost between stages.
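Because each stage leaves a file behind, deciding which stages can be skipped reduces to checking which files already exist. The sketch below is illustrative: `find_artifacts` and `stages_to_run` are hypothetical helpers, and the file naming (`<base>.m4a`, `<base>.mp4`, `<base>.txt`, `<base>_calibrated.txt`) is inferred from the extensions mentioned in this document.

```python
from pathlib import Path
from typing import Dict, List, Optional


def find_artifacts(output_dir: str, base: str) -> Dict[str, Optional[Path]]:
    """Look up the artifacts of a previous run by filename base."""
    root = Path(output_dir)
    wanted = {
        "audio": f"{base}.m4a",
        "video": f"{base}.mp4",
        "transcript": f"{base}.txt",
        "calibrated": f"{base}_calibrated.txt",
    }
    return {
        kind: (root / name if (root / name).is_file() else None)
        for kind, name in wanted.items()
    }


def stages_to_run(artifacts: Dict[str, Optional[Path]]) -> List[str]:
    """Return the stages still needed to produce a calibrated transcript."""
    stages: List[str] = []
    if artifacts["transcript"] is None:
        # Transcription needs audio, so download first if it is missing.
        if artifacts["audio"] is None:
            stages.append("download")
        stages.append("transcribe")
    # Calibration is cheap, so it is always re-run in this sketch.
    stages.append("calibrate")
    return stages
```

For example, a directory that holds only `talk.m4a` yields `["transcribe", "calibrate"]`, while an empty directory yields all three stages.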

Artifact Map

|-------|----------|--------|------------------|-----------------|

| Step 3 | Transcript (calibrated) | _calibrated.txt | _calibrated.txt | ✅ calibrate.py |

Selective Reuse via run.py Flags

# Full pipeline (download + transcribe + calibrate)
python scripts/run.py <url> --calibrate --output-dir ./out

# Save video as middleware (downloads .m4a + .mp4)
python scripts/run.py <url> --save-video --calibrate --output-dir ./out

# Audio already exists → skip download, re-transcribe
python scripts/run.py <url> --skip-download --output-dir ./out

# Audio + transcript already exist → skip both, re-run calibration only
python scripts/run.py <url> --skip-download --skip-transcribe --calibrate --output-dir ./out

run.py auto-detects existing artifacts by matching the audio filename base. It will warn and fall back to downloading/transcribing if no match is found.

Standalone Scripts

Each stage has an independent entry point:

# Stage 1: Download audio only
python scripts/download_audio.py <url> [output_dir] [filename]

# Stage 1: Download audio + video at 720p (standalone)
python scripts/download_audio.py <url> [output_dir] --save-video --video-quality 720

# Stage 2: Transcribe existing audio
python scripts/transcribe.py <audio.m4a> --model medium --language zh --output-dir ./out

# Stage 3: Calibrate raw transcript (rule-based only)
python scripts/calibrate.py <raw.txt> [--output <path>] [--no-tradsimp]

Typical Reuse Scenarios

Scenario A: Change model, keep audio

# Already have .m4a from previous run
python scripts/run.py <url> --skip-download --model large-v3 --output-dir ./out

Scenario B: Change language, keep audio + transcript

# Have both .m4a and .txt; just recalibrate
python scripts/run.py <url> --skip-download --skip-transcribe --calibrate --language en --output-dir ./out

Scenario C: Batch calibrate multiple transcripts

# Apply calibration to all raw .txt files in a directory (PowerShell)
Get-ChildItem .\out\*.txt | Where-Object { $_.Name -notmatch '_calibrated' } | ForEach-Object { python scripts/calibrate.py $_.FullName }

Platform Support

| Platform | Status | Notes |
|----------|--------|-------|
| Bilibili | ✅ | Audio-only streams available without login. 720p+ video needs cookies. |
| Xiaohongshu | ✅ | Full support via XiaoHongShu extractor. Short links (xhslink.com) auto-resolved. No cookies needed. |
| YouTube | ✅ | Full support. Cookies may improve format selection. |
| Douyin/TikTok | ⚠️ | Requires login cookies (--cookies-from-browser or --cookies cookies.txt). No cookies = download fails. |
| All yt-dlp sites | ✅ | 1800+ supported platforms |

Extending with New ASR Models

scripts/transcribe.py is designed for backend extensibility:

1. Add model info to the ALL_MODELS dict
2. Implement a transcribe_<backend>() function
3. Add a CLI flag in argparse
4. Add the backend to detect_backends()

Available backends: sensevoice (✅ production), whispercpp (✅ code ready, GGML model manual), openai (✅ production)

Backend auto-selection: When --backend auto (default), the engine picks the best available backend in priority order:

1. SenseVoice: Chinese-optimized, fastest, lightest
2. whisper.cpp: CPU-optimized, quantized models
3. openai-whisper: general purpose, most compatible

Troubleshooting

| Problem | Solution |
|---------|----------|
| SIGKILL / ffmpeg FileNotFoundError | ffmpeg not in PATH. Script auto-detects 7 common install locations (winget, scoop, chocolatey, manual). If ffmpeg is elsewhere, add its directory to system PATH. |
| yt-dlp download fails | Update yt-dlp: pip install -U yt-dlp. Try with cookies. |
| "No subtitles found" | Expected. This skill uses ASR, not built-in captions. |
| ffmpeg not found | Run install_deps.py (handles Windows non-PATH detection). |
| GPU not utilized | openai-whisper CPU-only by default. SenseVoice also runs on CPU. whisper.cpp uses AVX/SSE SIMD on CPU. |
| funasr import error | pip install funasr modelscope (SenseVoice backend dependency). |
| pywhispercpp import error | pip install pywhispercpp (whisper.cpp backend dependency). |
| GGML model not found | Download from huggingface.co/ggerganov/whisper.cpp → place in ~/.cache/whispercpp/. Or use --backend sensevoice instead. |
| WDAC/AppLocker blocks torch DLL | Use pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu (CPU-only, signed). Also: pip install numpy==2.0.2. |
| HuggingFace blocked (China) | SenseVoice uses ModelScope (no VPN needed). For whisper.cpp GGML, use a VPN to download models once. |

Performance Benchmarks (Tested)

| Video Duration | Model | Backend | Time | RAM Peak | Accuracy |
|---------------|-------|---------|------|----------|----------|
| 7m 49s (Bilibili) | SenseVoice Small | sensevoice | ~20s | ~1.5GB | ~96% 🏆 |
| 9m 30s (Bilibili) | medium-q5_1 | whispercpp | ~2m | ~2GB | ~95% |
| 9m 30s (Bilibili) | medium | openai | ~4m | ~5GB | ~95% |
| 23m (Bilibili) | SenseVoice Small | sensevoice | ~60s | ~1.5GB | ~96% 🏆 |
| 23m (Bilibili) | medium | openai | ~12m | ~5GB | ~95% |

Tested on Windows 11, Intel i7, 16GB RAM. Performance may vary by CPU speed.

Changelog

v2.0.0: Multi-Engine ASR

New: SenseVoice Small backend (Alibaba FunASR) as the default engine; ~1.5GB RAM, 20× realtime, ~96% Chinese accuracy
New: whisper.cpp GGML backend via pywhispercpp; CPU-optimized quantized models (0.5-2GB RAM)
New: --backend parameter (auto/sensevoice/whispercpp/openai)
New: transcribe.py --show-backends diagnostic command
New: Auto-backend detection and fallback chain (SenseVoice → whisper.cpp → openai-whisper)
Remove: small model from openai-whisper (superseded by SenseVoice)
Remove: faster-whisper stub (CTranslate2; WDAC incompatibility, never actually worked)
Improve: transcriber architecture with clean backend dispatch and a shared output writer
Improve: Chinese-optimized default path (SenseVoice Small via ModelScope, no VPN needed)

v1.0.8

New: Semiconductor / hardware domain calibration rules (30+ patterns): 韬定律→道定律, 全站协同→全栈协同, 量子碎穿→量子隧穿, 吸片→芯片, 奈米→纳米, etc.
Improve: Huawei 道定律 video now correctly calibrated (0→26 pattern corrections, ~98% accuracy)
Fix: Douyin/TikTok platform note now clearly states the cookies requirement

v1.0.7

New: --video-quality parameter with presets best/2160/1440/1080/720/480/360 plus raw yt-dlp format support
New: find_artifacts_in_dir() replaces find_audio_in_dir(); caches both .m4a AND .mp4 on --skip-download --save-video
Change: Video download format selector now uses height<= filters (graceful fallback) instead of hardcoded mp4-only
Improve: download_video() now reports which quality preset is in use
Improve: Artifact map includes video (.mp4) with quality column
Improve: SKILL.md adds --video-quality usage table and Bilibili login caveats

v1.0.6

New: --save-video flag to download and persist full video (.mp4) alongside audio
New: download_video() function in download_audio.py (standalone: --save-video)
Improve: Artifact map now includes video (.mp4) as first-class middleware
Improve: _pipeline_meta.json includes video_path for full traceability

v1.0.5

Remove: small model from all backends (88-90% accuracy, too poor for production)
Change: Default model locked to medium (was medium in code, but docs still promoted small)
Improve: Model table and benchmarks now medium-only baseline

v1.0.2

New: Xiaohongshu (小红书) platform support via the yt-dlp XiaoHongShu extractor
New: Short link auto-resolution (xhslink.com → full URL via redirect)
Improve: Platform support table now lists Xiaohongshu explicitly
Improve: Quick Start examples include xhslink.com usage

v1.0.1

Fix: Expanded ffmpeg search paths from 3→7 (winget/scoop/chocolatey/ProgramFiles(x86))
Fix: ensure_deps() now injects ffmpeg into os.environ['PATH'] on success
Fix: SIGKILL troubleshooting updated; root cause is ffmpeg PATH, not OOM
Improve: Auto-detect GPU (torch.cuda.is_available()) for fp16 support
Improve: verbose=True for real-time transcription progress visibility
Improve: More accurate error messages in dependency checks

v1.0.0

Initial release: download (yt-dlp) + transcribe (whisper) + calibrate (LLM) pipeline
7 ffmpeg install path auto-detection
Multi-format output (TXT, SRT, VTT, JSON)
Platform support: Bilibili, YouTube, all yt-dlp sites

Version History

5 published versions:

v2.0.0 (current): 2026-05-28 13:08
v1.0.8: 2026-05-26 23:14
v1.0.6: 2026-05-26 17:44
v1.0.3: 2026-05-23 16:31
v1.0.0: 2026-05-21 23:56

Video Subtitle Extractor

Overview