Cross-platform multi-engine ASR subtitle extraction pipeline. Downloads audio from any yt-dlp-compatible video platform, transcribes with SenseVoice / whisper.cpp / openai-whisper, and applies LLM-based text calibration for Chinese content.
Default engine: SenseVoice Small (Alibaba FunASR) — ~1.5GB RAM, 234MB disk, 20× realtime speed, ~96% Chinese accuracy.
Tested & verified on Windows 11 with real Bilibili & Xiaohongshu videos.
# One-command full pipeline (SenseVoice Small — default, blazing fast for Chinese)
python scripts/run.py <video_url> --output-dir ./output
# Use whisper.cpp GGML (even lighter, 2GB RAM)
python scripts/run.py <video_url> --backend whispercpp --model medium-q5_1
# Use openai-whisper (standard, 5GB RAM)
python scripts/run.py <video_url> --backend openai --model medium
# Download audio only
python scripts/download_audio.py <video_url> <output_dir>
# Download audio + video (keep both as middleware)
python scripts/download_audio.py <video_url> <output_dir> --save-video --video-quality 1080
# Transcribe existing audio with auto backend selection
python scripts/transcribe.py <audio_file> --backend auto --language zh
# Transcribe with specific backend
python scripts/transcribe.py <audio_file> --backend sensevoice --language zh
python scripts/transcribe.py <audio_file> --backend whispercpp --model medium-q5_1
python scripts/transcribe.py <audio_file> --backend openai --model medium
Use this skill when:
python scripts/install_deps.py
Auto-detects OS and installs: ffmpeg (winget/brew/apt), yt-dlp (pip), openai-whisper (pip). Handles Windows ffmpeg path detection even when not in PATH.
Run scripts/download_audio.py .
Uses yt-dlp to extract the best available audio format (m4a preferred). Supports Bilibili, YouTube, and 1800+ yt-dlp-compatible platforms. The script automatically detects ffmpeg even when not in system PATH.
Optional: Download video as middleware
Add --save-video to persist the full video alongside audio:
python scripts/download_audio.py <url> --save-video --video-quality 1080
python scripts/run.py <url> --save-video --video-quality 720 --calibrate
--video-quality preset table:
| Preset | yt-dlp behaviour | Typical result |
|--------|------------------|----------------|
| best (default) | Highest available | 4K on YouTube, 480p on B站 (no login) |
| 1080 | ≤1080p, fallback gracefully | 1080p where available |
| 720 | ≤720p | Good balance for local storage |
| 480 | ≤480p | Minimum acceptable for reference |
| 360 | ≤360p | Extremely small files |
| raw string | Direct yt-dlp format selector | Full flexibility |
> ⚠️ B站 note: Without login cookies, B站 caps at 480p. 720p+ requires --cookies-from-browser.
If download fails: the video may require cookies. Try:
yt-dlp --cookies-from-browser chrome <url>
Run scripts/transcribe.py .
Three backends, auto-selected by default (priority: SenseVoice → whisper.cpp → openai-whisper):
| Property | Value |
|----------|-------|
| RAM | ~1.5GB |
| Disk | ~234MB |
| Speed | 20× realtime (CPU) |
| Chinese accuracy | ~96% 🏆 |
| Model source | ModelScope (auto-download, no VPN needed) |
| Install | pip install funasr modelscope |
> Why SenseVoice? Alibaba's FunASR engine, Chinese-optimized from the ground up. No HuggingFace download needed (uses ModelScope mirror in China). ~4× faster than openai-whisper medium on CPU, with comparable or better Chinese accuracy.
| Model | RAM | Disk | Speed | Quality | Best For |
|-------|-----|------|-------|---------|----------|
| tiny-q5_1 | ~0.5GB | 32MB | fastest | low | Testing |
| small-q5_1 | ~1GB | 466MB | fast | decent | Quick preview |
| medium-q5_1 | ~2GB | 1.1GB | 3-5× faster than openai | ~95% ⭐ | Lightweight quality |
> Setup: pip install pywhispercpp + download GGML model from huggingface.co/ggerganov/whisper.cpp → place in ~/.cache/whispercpp/. See --show-backends for instructions.
| Model | RAM | Disk | Speed | Quality | Best For |
|-------|-----|------|-------|---------|----------|
| medium | ~5GB | 1.42GB | ~165 fps | ~95% | Standard |
| large-v3 | ~10GB | 2.88GB | ~80 fps | ~97% | Best accuracy |
| large-v3-turbo | ~6GB | 1.6GB | ~120 fps | ~96% | Good balance |
> Note: small model removed (88-90% accuracy, superseded by SenseVoice). Use SenseVoice for light weight.
After transcription, apply calibrate.py for mechanical corrections:
# Calibrate a raw .txt transcript
python scripts/calibrate.py <output_dir>/<video_title>.txt
# Output: <video_title>_calibrated.txt
# Skip traditional→simplified conversion (already simplified input)
python scripts/calibrate.py raw.txt --no-tradsimp
Calibrate categories: homophone fixes, financial terms, AI/tech company names, semiconductor/hardware terms, traditional→simplified (600+ chars).
For context-aware fixes (semantic errors, ambiguous names), use LLM review on top of rule-based output.
See references/calibration_guide.md for the full 80+ pattern library.
Present the calibrated text. Always include:
Every pipeline stage saves its output to disk. All artifacts persist in output_dir/ after the run — no data is lost between stages.
| Stage | Artifact | Format | Filename pattern | Reusable alone? |
|-------|----------|--------|------------------|-----------------|
| Step 1 | Audio | .m4a | | ✅ download_audio.py |
| Step 1 | Video (optional) | .mp4 | | ✅ download_audio.py --save-video |
| Step 2 | Transcript (raw) | .txt .srt .vtt .json | | ✅ transcribe.py |
| Step 2 | Pipeline metadata | .json | _pipeline_meta.json | ✅ (reference only) |
| Step 3 | Transcript (calibrated) | _calibrated.txt | | ✅ calibrate.py |
run.py Flags
# Full pipeline (download + transcribe + calibrate)
python scripts/run.py <url> --calibrate --output-dir ./out
# Save video as middleware (downloads .m4a + .mp4)
python scripts/run.py <url> --save-video --calibrate --output-dir ./out
# Audio already exists → skip download, re-transcribe
python scripts/run.py <url> --skip-download --output-dir ./out
# Audio + transcript already exist → skip both, re-run calibration only
python scripts/run.py <url> --skip-download --skip-transcribe --calibrate --output-dir ./out
run.py auto-detects existing artifacts by matching the audio filename base. It will warn and fall back to downloading/transcribing if no match is found.
Each stage has an independent entry point:
# Stage 1: Download audio only
python scripts/download_audio.py <url> [output_dir] [filename]
# Stage 1: Download audio + video at 720p (standalone)
python scripts/download_audio.py <url> [output_dir] --save-video --video-quality 720
# Stage 2: Transcribe existing audio
python scripts/transcribe.py <audio.m4a> --model medium --language zh --output-dir ./out
# Stage 3: Calibrate raw transcript (rule-based only)
python scripts/calibrate.py <raw.txt> [--output <path>] [--no-tradsimp]
Scenario A — Change model, keep audio
# Already have .m4a from previous run
python scripts/run.py <url> --skip-download --model large-v3 --output-dir ./out
Scenario B — Change language, keep audio + transcript
# Have both .m4a and .txt; just recalibrate
python scripts/run.py <url> --skip-download --skip-transcribe --calibrate --language en --output-dir ./out
Scenario C — Batch calibrate multiple transcripts
# Apply calibration to all raw .txt files in a directory
Get-ChildItem .\out\*.txt | Where-Object { $_.Name -notmatch '_calibrated' } | ForEach-Object {
python scripts/calibrate.py $_.FullName
}
| Platform | Status | Notes |
|----------|--------|-------|
| Bilibili | ✅ | Audio-only streams available without login. 720P+ video needs cookies. |
| Xiaohongshu | ✅ | Full support via XiaoHongShu extractor. Short links (xhslink.com) auto-resolved. No cookies needed. |
| YouTube | ✅ | Full support. Cookies may improve format selection. |
| Douyin/TikTok | ⚠️ | Requires login cookies (--cookies-from-browser or --cookies cookies.txt). No cookies = download fails. |
| All yt-dlp sites | ✅ | 1800+ supported platforms |
scripts/transcribe.py is designed for backend extensibility:
ALL_MODELS dict
transcribe_() function
detect_backends()
Available backends: sensevoice (✅ production), whispercpp (✅ code ready, GGML model manual), openai (✅ production)
Backend auto-selection: When --backend auto (default), the engine picks the best available backend in priority order:
| Problem | Solution |
|---------|----------|
| SIGKILL / ffmpeg FileNotFoundError | ffmpeg not in PATH. Script auto-detects 7 common install locations (winget, scoop, chocolatey, manual). If ffmpeg is elsewhere, add its directory to system PATH. |
| yt-dlp download fails | Update yt-dlp: pip install -U yt-dlp. Try with cookies. |
| "No subtitles found" | Expected. This skill uses ASR, not built-in captions. |
| ffmpeg not found | Run install_deps.py (handles Windows non-PATH detection). |
| GPU not utilized | openai-whisper CPU-only by default. SenseVoice also runs on CPU. whisper.cpp uses AVX/SSE SIMD on CPU. |
| funasr import error | pip install funasr modelscope — SenseVoice backend dependency. |
| pywhispercpp import error | pip install pywhispercpp — whisper.cpp backend dependency. |
| GGML model not found | Download from huggingface.co/ggerganov/whisper.cpp → place in ~/.cache/whispercpp/. Or use --backend sensevoice instead. |
| WDAC/AppLocker blocks torch DLL | Use pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu (CPU-only, signed). Also: pip install numpy==2.0.2. |
| HuggingFace blocked (China) | SenseVoice uses ModelScope (no VPN needed). For whisper.cpp GGML, use a VPN to download models once. |
| Video Duration | Model | Backend | Time | RAM Peak | Accuracy |
|---------------|-------|---------|------|----------|----------|
| 7m 49s (Bilibili) | SenseVoice Small | sensevoice | ~20s | ~1.5GB | ~96% 🏆 |
| 9m 30s (Bilibili) | medium-q5_1 | whispercpp | ~2m | ~2GB | ~95% |
| 9m 30s (Bilibili) | medium | openai | ~4m | ~5GB | ~95% |
| 23m (Bilibili) | SenseVoice Small | sensevoice | ~60s | ~1.5GB | ~96% 🏆 |
| 23m (Bilibili) | medium | openai | ~12m | ~5GB | ~95% |
Tested on Windows 11, Intel i7, 16GB RAM. Performance may vary by CPU speed.
--backend parameter (auto/sensevoice/whispercpp/openai)
transcribe.py --show-backends diagnostic command
small model from openai-whisper (superseded by SenseVoice)
--video-quality parameter — presets best/2160/1440/1080/720/480/360 + raw yt-dlp format support
find_artifacts_in_dir() replaces find_audio_in_dir() — caches both .m4a AND .mp4 on --skip-download --save-video
height<= filters (graceful fallback) instead of hardcoded mp4-only
download_video() now reports which quality preset is in use
--video-quality usage table and B站 login caveats
--save-video flag — download and persist full video (.mp4) alongside audio
download_video() function in download_audio.py (standalone: --save-video)
_pipeline_meta.json includes video_path for full traceability
small model from all backends (88-90% accuracy, too poor for production)
medium (was medium in code, but docs still promoted small)
XiaoHongShu extractor
ensure_deps() now injects ffmpeg into os.environ['PATH'] on success
torch.cuda.is_available()) for fp16 support
verbose=True for real-time transcription progress visibility
共 5 个版本