Voice Recognition

Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages...
Local Whisper speech-to-text: no API key required, fully private; suited to transcribing audio files and voice messages.
Author: 08jacky04
Category: Uncategorized | Source: clawhub | Version: v1.1.0 (1 version) | API key: not required

Overview

🎤 Voice Recognition — Smart Auto-Model Selection

Transcribe audio to text using local OpenAI Whisper. No API keys and 100% private; no internet connection is required once the model has been downloaded on the first run.

Smart auto-selection dynamically picks the best model based on your audio characteristics — you never have to think about which model to use.

Quick Start

# Auto mode — analyzes audio, picks best model automatically
scripts/transcribe.py voice.ogg

# Force a specific model
scripts/transcribe.py voice.ogg --model small

# Specify language (auto-detect if omitted)
scripts/transcribe.py voice.ogg --language zh   # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en   # English
scripts/transcribe.py voice.ogg --language yue  # Cantonese

# Show segment timestamps
scripts/transcribe.py voice.ogg --segments

# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt

Smart Auto-Selection

The script analyzes audio duration and complexity, then selects a model automatically:

| Audio Characteristic | Model Used | Why |
| --- | --- | --- |
| Short (<10s), clean speech | base | Fast (2-3s). Accurate enough for simple content. |
| Short (<10s), mixed languages | small | Better multilingual handling for code-switching. |
| Medium (10-60s), clean | base | Balanced speed and accuracy. |
| Medium (10-60s), mixed | small | Handles accents and language transitions. |
| Long (1-2min) | small | Maintains context, still fast enough. |
| Very long (2min+) | medium | Maximum accuracy for extended recordings. |

You don't need to think about models. Just send audio.
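
The selection table can be read as a small decision function. The sketch below is illustrative only: `select_model` and its `mixed_language` flag are hypothetical names, the real script's complexity detection is not described in this document, and treating exactly 60s as "long" and exactly 120s as "very long" is an assumption about the boundaries.

```python
def select_model(duration_s: float, mixed_language: bool = False) -> str:
    """Return a Whisper model name following the selection table.

    Hypothetical helper: the thresholds mirror the table, but how the real
    script decides that audio is "mixed" is not shown in this document.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if duration_s >= 120:  # very long (2min+)
        return "medium"
    if duration_s >= 60:   # long (1-2min)
        return "small"
    # Short (<10s) and medium (10-60s) clips follow the same rule in the table:
    # clean speech uses base, mixed-language speech uses small.
    return "small" if mixed_language else "base"


if __name__ == "__main__":
    for seconds, mixed in [(7.5, False), (32.0, True), (90.0, False), (300.0, False)]:
        print(f"{seconds:>6.1f}s mixed={mixed!s:<5} -> {select_model(seconds, mixed)}")
```

Passing `--model` on the command line bypasses this kind of logic entirely, which is useful when the automatic choice is too slow or not accurate enough for a given recording.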

Installation

Prerequisites

  • Python 3.10+
  • pip (Python package manager)

Via bundled installer

python3 scripts/install.py

Manual

pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu

Using requirements.txt

pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu

> Note: The first run downloads the Whisper model (~139MB for base, ~461MB for small). Subsequent runs load the cached model from ~/.cache/whisper/ and skip the download.
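
To see which models are already cached before going offline, a short check of the cache directory can help. This is a minimal sketch, assuming the cache lives at ~/.cache/whisper/ as stated in the note and that each model is stored as a `<model>.pt` file; the file naming and the `cached_models` helper are assumptions, not something this document specifies.

```python
from pathlib import Path


def cached_models(cache_dir=Path.home() / ".cache" / "whisper"):
    """Map cached model names to their size in MB (rounded).

    Assumes each model is stored as '<name>.pt' inside cache_dir.
    Returns an empty dict if the directory does not exist yet.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return {}
    return {
        path.stem: round(path.stat().st_size / 1e6)
        for path in sorted(cache_dir.glob("*.pt"))
    }


if __name__ == "__main__":
    found = cached_models()
    if not found:
        print("No cached models found; the first run will download one.")
    for name, size_mb in found.items():
        print(f"{name}: ~{size_mb}MB")
```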

Model Reference

| Model | Size | Speed | Accuracy | Best For |
| --- | --- | --- | --- | --- |
| tiny | 72MB | ⚡⚡⚡ | ⭐⭐ | Real-time preview, very short clips |
| base | 139MB | ⚡⚡ | ⭐⭐⭐ | General use (auto-select default for short audio) |
| small | 461MB | ⚡ | ⭐⭐⭐⭐ | Mixed languages, accents (auto-select for long/complex) |
| medium | 1.5GB | 🐢 | ⭐⭐⭐⭐⭐ | Maximum accuracy, long recordings |
| large | 2.9GB | 🐢 | ⭐⭐⭐⭐⭐ | Research-grade transcription |

Language Support

Whisper supports 99 languages including:

  • 🇨🇳 Chinese (Mandarin, Cantonese)
  • 🇺🇸 English
  • 🇪🇸 Spanish
  • 🇯🇵 Japanese
  • 🇰🇷 Korean
  • 🇫🇷 French
  • 🇩🇪 German

The language is auto-detected by default. Pass --language to supply a hint, which can improve accuracy when detection picks the wrong language.

Features

| Feature | Description |
| --- | --- |
| 🔒 100% Private | Everything runs locally. No data leaves your machine. |
| 🆓 No API Costs | Free unlimited transcription. No quotas, no keys. |
| 🌐 99 Languages | Supports virtually all major world languages. |
| 🧠 Smart Auto-Model | Analyzes audio → picks optimal model automatically. |
| ⚡ Fast by Default | Short clips → base model (2-3s). Long clips → small/medium. |
| 🎯 Accurate When Needed | Complex/mixed audio automatically upgrades the model. |
| 📊 Segment Timestamps | Sentence-level timing for long recordings. |
| 📁 Multiple Formats | OGG, WAV, MP3, M4A, FLAC, OPUS and more. |

Supported Audio Formats

| Format | Extension | Notes |
| --- | --- | --- |
| OGG Opus | .ogg | Common voice message format ✅ |
| WAV | .wav | Uncompressed, high quality |
| MP3 | .mp3 | Compressed audio |
| M4A | .m4a | Apple/MPEG-4 audio |
| FLAC | .flac | Lossless compressed |
| OPUS | .opus | Pure Opus stream |

Usage Examples

Quick transcription (auto model)

$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱  Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...

Transcription in context

# Chinese
scripts/transcribe.py voice.ogg --language zh

# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments

# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg

# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt

# Force high accuracy
scripts/transcribe.py important.wav --model medium
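
When the script is driven from another program (for example an agent handling voice input), the same invocations can be assembled programmatically. The sketch below uses only the flags documented above (`--model`, `--language`, `--segments`, `-o`); `build_command` and `transcribe` are hypothetical helper names, and it is an assumption that the transcript is written to stdout alongside the progress lines shown in the sample output.

```python
import subprocess
import sys


def build_command(audio, model=None, language=None, segments=False, output=None):
    """Build the argv list for scripts/transcribe.py from optional settings."""
    cmd = [sys.executable, "scripts/transcribe.py", str(audio)]
    if model:
        cmd += ["--model", model]        # omit to keep auto-selection
    if language:
        cmd += ["--language", language]  # omit to keep auto-detection
    if segments:
        cmd.append("--segments")
    if output:
        cmd += ["-o", str(output)]
    return cmd


def transcribe(audio, **options):
    """Run the script and return whatever it printed to stdout.

    Assumption: stdout may contain progress lines as well as the transcript.
    """
    result = subprocess.run(
        build_command(audio, **options),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(build_command("lecture.m4a", language="en", segments=True))
```

Building the argument list separately from running it keeps the command easy to log and inspect before anything is executed.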

Output with segments

$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱  Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?

📝 Segments:
   [0.0s - 3.6s] Now I'm sending this voice message
   [3.6s - 7.4s] to XiaoA, can you recognize what I said?
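
The segment listing above can be reproduced from Whisper-style segment data, where each segment carries `start`, `end`, and `text` fields. This is a minimal sketch of the formatting step only; `format_segments` is a hypothetical name and the script's actual implementation is not shown in this document.

```python
def format_segments(segments):
    """Render segments as '   [start - end] text' lines, one per segment."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()  # segment text often has a leading space
        lines.append(f"   [{seg['start']:.1f}s - {seg['end']:.1f}s] {text}")
    return "\n".join(lines)


# Sample data matching the output shown above.
example = [
    {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
    {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
]

if __name__ == "__main__":
    print("📝 Segments:")
    print(format_segments(example))
```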

Troubleshooting

| Problem | Solution |
| --- | --- |
| "No module" error | Use the venv Python: python3 scripts/transcribe.py, or run scripts/install.py |
| Slow transcription | The first run downloads and caches the model (~139-461MB). This delay is normal and happens only once. |
| Wrong language detected | Pass --language en or --language zh as a hint |
| Background noise | Use --model small or --model medium for noisy environments |
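
For the "No module" case, it can help to confirm which dependencies the current interpreter actually sees. The sketch below checks the import names of the packages from the Installation section (`whisper` is the import name of openai-whisper); `missing_modules` is a hypothetical helper, not part of the skill.

```python
import importlib.util

# Import names for the packages listed under Installation.
REQUIRED = ("whisper", "soundfile", "numpy", "torch")


def missing_modules(names=REQUIRED):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
        print("Try: python3 scripts/install.py")
    else:
        print("All required modules are importable.")
```

If the list is non-empty even after installation, the script is probably being run with a different Python than the one the packages were installed into.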

Token Savings Examples

| Scenario | Cloud API Cost | This Skill | Savings |
| --- | --- | --- | --- |
| 10 short voice messages/day | ~$0.60/day (Whisper API) | $0 | |
| 1 hour meeting transcription | ~$2.88 (Deepgram) | $0 | |
| 1000 files for a project | ~$50-200 | $0 | |
| Agent processing voice inputs | LLM tokens + API fees | 0 tokens | Full token budget saved |

Privacy & Security

  • 100% offline after the one-time model download: no audio leaves your machine.
  • No API keys — no third-party services, no accounts.
  • No telemetry — zero tracking.
  • No cloud — everything runs locally.
  • Zero token consumption — frees your LLM budget for reasoning.

Your audio is yours. Always.

Version History

1 version in total

  • v1.1.0 (current)
    2026-05-08 04:07, scanned safe

Security Scan

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
