Transcribe audio to text with OpenAI Whisper running locally. No API keys, no internet connection needed after the one-time model download, 100% private.
Smart auto-selection picks a model based on your audio's duration and complexity, so you don't have to choose one yourself.
# Auto mode — analyzes audio, picks best model automatically
scripts/transcribe.py voice.ogg
# Force a specific model
scripts/transcribe.py voice.ogg --model small
# Specify language (auto-detect if omitted)
scripts/transcribe.py voice.ogg --language zh # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en # English
scripts/transcribe.py voice.ogg --language yue # Cantonese
# Show segment timestamps
scripts/transcribe.py voice.ogg --segments
# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt
The script analyzes audio duration and complexity, then selects a model automatically:
| Audio Characteristic | Model Used | Why |
|---|---|---|
| Short (<10s), clean speech | base | Fast (2-3s). Accurate enough for simple content. |
| Short (<10s), mixed languages | small | Better multilingual handling for code-switching. |
| Medium (10-60s), clean | base | Balanced speed and accuracy. |
| Medium (10-60s), mixed | small | Handles accents and language transitions. |
| Long (1-2min) | small | Maintains context, still fast enough. |
| Very long (2min+) | medium | Maximum accuracy for extended recordings. |
You don't need to think about models. Just send audio.
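The selection table above can be read as a small decision function. The following is only a sketch under stated assumptions, not the script's actual code: the name `select_model` and the `mixed` flag are hypothetical, how the script detects mixed-language or complex audio is not described here, and the behavior at exactly 60s and 120s is a guess.

```python
def select_model(duration_s: float, mixed: bool = False) -> str:
    """Map clip duration (seconds) and a 'mixed languages' flag to a model name.

    Hypothetical helper mirroring the table above; thresholds at the exact
    boundaries (60s, 120s) are assumptions.
    """
    if duration_s < 60:
        # Short (<10s) and medium (10-60s) clips follow the same rule:
        # clean speech uses base, mixed-language speech uses small.
        return "small" if mixed else "base"
    if duration_s < 120:
        # Long (1-2 min) clips use small regardless of complexity.
        return "small"
    # Very long (2 min+) clips use medium.
    return "medium"


if __name__ == "__main__":
    for seconds, mixed in [(7.5, False), (7.5, True), (90, False), (300, False)]:
        print(seconds, mixed, select_model(seconds, mixed))
```

Passing `--model` on the command line bypasses this kind of automatic choice entirely.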
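# Installer script
python3 scripts/install.py
# Or install the packages manually (CPU-only PyTorch)
pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu
# Or install from requirements.txt (CPU-only PyTorch)
pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu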
> Note: The first run downloads the Whisper model (~139MB for base, ~461MB for small).
> Subsequent runs reuse the cached model (~/.cache/whisper/) and skip the download.
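To check which models are already cached before going offline, a short helper can list the cache directory. This is a sketch that assumes checkpoints are stored as `<model>.pt` files under `~/.cache/whisper/`; the function names `whisper_cache_dir` and `cached_models` are illustrative and not part of the skill.

```python
from pathlib import Path


def whisper_cache_dir() -> Path:
    """Default cache location mentioned in the note above."""
    return Path.home() / ".cache" / "whisper"


def cached_models(cache_dir=None) -> dict:
    """Return {model_name: size_in_MB} for every *.pt file in the cache.

    Assumes one checkpoint file per model, named after the model.
    """
    directory = Path(cache_dir) if cache_dir else whisper_cache_dir()
    if not directory.is_dir():
        return {}  # nothing downloaded yet
    return {
        path.stem: round(path.stat().st_size / 1e6, 1)
        for path in sorted(directory.glob("*.pt"))
    }


if __name__ == "__main__":
    models = cached_models()
    if not models:
        print("No cached models found; the first run will download one.")
    for name, size_mb in models.items():
        print(f"{name}: {size_mb} MB")
```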
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 72MB | ⚡⚡⚡ | ⭐⭐ | Real-time preview, very short clips |
| base | 139MB | ⚡⚡ | ⭐⭐⭐ | General use (auto-select default for short audio) |
| small | 461MB | ⚡ | ⭐⭐⭐⭐ | Mixed languages, accents (auto-select for long/complex) |
| medium | 1.5GB | 🐢 | ⭐⭐⭐⭐⭐ | Maximum accuracy, long recordings |
| large | 2.9GB | 🐢 | ⭐⭐⭐⭐⭐ | Research-grade transcription |
Whisper supports 99 languages.
Language is auto-detected by default. Use --language to provide a hint for better accuracy.
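For readers calling Whisper from Python rather than through the script, the language hint corresponds to the `language` option of `model.transcribe()` in the openai-whisper package. The sketch below is not taken from `scripts/transcribe.py`; the helper names are hypothetical, and `fp16=False` is an assumption that fits the CPU-only PyTorch install.

```python
from typing import Optional


def transcribe_options(language: Optional[str] = None) -> dict:
    """Build keyword arguments for whisper's model.transcribe()."""
    options = {"fp16": False}  # assumption: CPU-only inference
    if language:
        # Omitting the key leaves Whisper's own language detection active.
        options["language"] = language
    return options


def transcribe(path: str, model_name: str = "base",
               language: Optional[str] = None) -> str:
    """Transcribe one file; needs openai-whisper installed."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **transcribe_options(language))
    return result["text"].strip()


if __name__ == "__main__":
    print(transcribe_options("zh"))
```

Calling `transcribe("voice.ogg", language="zh")` would then be roughly equivalent to the `--language zh` example, assuming the script passes the hint through unchanged.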
| Feature | Description |
|---|---|
| 🔒 100% Private | Everything runs locally. No data leaves your machine. |
| 🆓 No API Costs | Free unlimited transcription. No quotas, no keys. |
| 🌐 99 Languages | Supports virtually all major world languages. |
| 🧠 Smart Auto-Model | Analyzes audio → picks optimal model automatically. |
| ⚡ Fast by Default | Short clips → base model (2-3s). Long clips → small/medium. |
| 🎯 Accurate When Needed | Complex/mixed audio automatically upgrades the model. |
| 📊 Segment Timestamps | Sentence-level timing for long recordings. |
| 📁 Multiple Formats | OGG, WAV, MP3, M4A, FLAC, OPUS and more. |
| Format | Extension | Notes |
|---|---|---|
| OGG Opus | .ogg | Common voice message format ✅ |
| WAV | .wav | Uncompressed, high quality |
| MP3 | .mp3 | Compressed audio |
| M4A | .m4a | Apple/MPEG-4 audio |
| FLAC | .flac | Lossless compressed |
| OPUS | .opus | Pure Opus stream |
$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱ Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...
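The duration and sample rate shown in the output above are the inputs to model selection. As a rough illustration of how such values can be read, the sketch below uses only the standard-library `wave` module, so it handles WAV files only; the script itself supports more formats and may well read them differently (for example through `soundfile`, which is in the install list). The function names are hypothetical.

```python
import wave


def wav_info(path: str):
    """Return (duration_seconds, sample_rate_hz) for a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        # Duration is the number of frames divided by frames per second.
        duration = wav.getnframes() / float(rate)
    return duration, rate


def describe(path: str) -> str:
    """Format the values like the status line in the sample output."""
    duration, rate = wav_info(path)
    return f"Duration: {duration:.1f}s | Sample rate: {rate}Hz"
```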
# Chinese
scripts/transcribe.py voice.ogg --language zh
# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments
# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg
# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt
# Force high accuracy
scripts/transcribe.py important.wav --model medium
$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱ Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?
📝 Segments:
[0.0s - 3.6s] Now I'm sending this voice message
[3.6s - 7.4s] to XiaoA, can you recognize what I said?
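The segment lines above match the shape of the `segments` list that openai-whisper returns from `model.transcribe()`, where each entry carries `start`, `end`, and `text` keys. A minimal formatting sketch follows; `format_segments` is an illustrative name, and the script's real implementation may differ.

```python
def format_segments(segments):
    """Render Whisper-style segments as '[start - end] text' lines."""
    lines = []
    for seg in segments:
        start, end = seg["start"], seg["end"]
        # Whisper segment text usually begins with a space, so strip it.
        lines.append(f"[{start:.1f}s - {end:.1f}s] {seg['text'].strip()}")
    return lines


if __name__ == "__main__":
    sample = [
        {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
        {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
    ]
    print("\n".join(format_segments(sample)))
```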
| Problem | Solution |
|---|---|
| "No module" error | Use the venv Python: python3 scripts/transcribe.py or run scripts/install.py |
| Slow transcription | The first run downloads and caches the model (~139-461MB), so it takes longer. Later runs reuse the cache. |
| Wrong language detected | Pass --language en or --language zh for a hint |
| Background noise | Use --model small or --model medium for noisy environments |
| Scenario | Cloud API Cost | This Skill | Savings |
|---|---|---|---|
| 10 short voice messages/day | ~$0.60/day (Whisper API) | $0 | ∞ |
| 1 hour meeting transcription | ~$2.88 (Deepgram) | $0 | ∞ |
| 1000 files for a project | ~$50-200 | $0 | ∞ |
| Agent processing voice inputs | LLM tokens + API fees | 0 tokens | Full token budget saved |
Your audio is yours. Always.