Ten operations across four capabilities: identify (认) · manage (存) · transcribe (听) · clone (说).
| Component | Install | Purpose |
|---|---|---|
| Whisper | pip install openai-whisper | Speech-to-text |
| Speaker ID | pip install transformers librosa | Speaker identification (UniSpeech-SAT) |
| CosyVoice2 | SiliconFlow API (SF_API_KEY) | Voice cloning |
| ffmpeg | System package | Audio conversion |
Voice references are stored in voice-refs/ at workspace root.
Metadata lives in TOOLS.md under a "Voice Library" section.
See references/voice-library-format.md for format spec.
Input: audio → Output: who is speaking (or "unknown")
python3 scripts/voice_identify.py <audio_file> [--threshold 0.75]
Compares audio against all voice-refs/-ref.* using UniSpeech-SAT x-vector embeddings.
First run downloads model (~360MB) to /tmp/hf_models/.
Accuracy: Reliably separates male/female voices. Same-gender speakers need ≥5s audio for best results. Threshold 0.75 is default; raise to 0.85 for stricter matching.
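The threshold behaviour described above can be sketched as follows. This is a minimal illustration, not the logic of voice_identify.py itself: the function name `pick_speaker`, the shape of the `scores` dictionary, and the use of `>=` for the comparison are all assumptions.

```python
def pick_speaker(scores, threshold=0.75):
    """Return the best-scoring speaker, or "unknown" if nothing reaches the threshold.

    `scores` maps a speaker name to a similarity score for the query audio
    (hypothetical input shape; the real script may organise this differently).
    """
    if not scores:
        return "unknown"
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


if __name__ == "__main__":
    example = {"alice": 0.81, "bob": 0.62}
    print(pick_speaker(example))                  # default threshold
    print(pick_speaker(example, threshold=0.85))  # stricter matching
```

Raising the threshold trades recall for precision: the same scores that match at 0.75 fall back to "unknown" at 0.85.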
Input: audio + speaker name → stores in voice library
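Steps: save the audio as voice-refs/<speaker>-ref1.<ext>, transcribe it with whisper, add an entry to TOOLS.md (see format in references/), and add the speaker to SPEAKER_MAP in voice_identify.py.
Good reference audio: 10-15s clear speech, minimal noise, natural pace. 5s minimum.
List: read the TOOLS.md voice library section and run ls voice-refs/.
Update: replace the file in voice-refs/, update the TOOLS.md entry.
Delete: remove the file from voice-refs/, remove the TOOLS.md entry, remove the speaker from SPEAKER_MAP.
Input: text + library speaker → Output: audio in that speaker's voice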
set -a; source <env_file_with_SF_API_KEY>; set +a
python3 scripts/cosyvoice_clone.py \
--text "Text to speak" \
--ref voice-refs/<speaker>-ref1.<ext> \
--ref-text "What is said in reference audio" \
--output /tmp/clone_output.wav
Long reference (>15s): truncate first with ffmpeg -y -i <ref_audio> -t 15 -ar 24000 -ac 1 /tmp/ref_trimmed.wav.
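If the trimming step needs to run from a script rather than by hand, it could be wrapped along these lines. The helper names `build_trim_command` and `trim_reference` are hypothetical; the ffmpeg flags are the ones given above (15 seconds, 24 kHz, mono).

```python
import subprocess


def build_trim_command(src, dst="/tmp/ref_trimmed.wav", seconds=15):
    """Build the ffmpeg argument list that trims and resamples a reference clip."""
    return [
        "ffmpeg", "-y",
        "-i", src,            # input reference audio
        "-t", str(seconds),   # keep only the first `seconds` seconds
        "-ar", "24000",       # resample to 24 kHz
        "-ac", "1",           # downmix to mono
        dst,
    ]


def trim_reference(src, dst="/tmp/ref_trimmed.wav", seconds=15):
    """Run ffmpeg and return the output path; raises CalledProcessError on failure."""
    subprocess.run(build_trim_command(src, dst, seconds), check=True)
    return dst
```

Passing the arguments as a list avoids shell quoting problems when a file name contains spaces.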
Input: audio → Output: text
whisper <audio_file> --model small --output_format txt --output_dir /tmp --language <lang>
Languages: zh (Chinese), en (English), ja (Japanese). Omit for auto-detect.
Input: audio → Output: who said what
Run transcription (Op 5) and speaker identification (Op 1) in parallel and report both results together.
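One way to run the two steps concurrently is a small thread pool, sketched below. The function names are hypothetical, and the two wrappers make assumptions that should be checked against the real tools: that voice_identify.py prints its result to stdout, and that whisper writes its transcript to /tmp/<audio basename>.txt.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_identify(audio_path):
    """Call the identification script and return its raw stdout (format assumed)."""
    result = subprocess.run(
        ["python3", "scripts/voice_identify.py", audio_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def run_transcribe(audio_path, out_dir="/tmp"):
    """Call the whisper CLI, then read back the .txt file it writes."""
    subprocess.run(
        ["whisper", audio_path, "--model", "small",
         "--output_format", "txt", "--output_dir", out_dir],
        check=True,
    )
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    with open(os.path.join(out_dir, stem + ".txt"), encoding="utf-8") as fh:
        return fh.read().strip()


def who_said_what(audio_path, transcribe=run_transcribe, identify=run_identify):
    """Run both steps in parallel and return a combined report."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(transcribe, audio_path)
        speaker_future = pool.submit(identify, audio_path)
        return {"speaker": speaker_future.result(), "text": text_future.result()}
```

Threads are sufficient here because both steps spend their time waiting on external processes.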
Input: two audio files → Output: same person or not
python3 scripts/voice_identify.py <audio_1> --threshold 0.75
python3 scripts/voice_identify.py <audio_2> --threshold 0.75
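Compare the top-ranked speaker from both runs. If both runs return the same library speaker, the two clips are from the same person; an "unknown" result from either run is inconclusive.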
For direct pairwise comparison without a library, extract embeddings and compute cosine similarity (see voice_identify.py internals).
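The pairwise comparison itself is short once the two embeddings are available. The sketch below assumes each embedding is already a plain list of floats (extracting them is left to voice_identify.py), and reusing 0.75 as the cut-off is an assumption carried over from the default threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def same_speaker(emb_1, emb_2, threshold=0.75):
    """True when the two embeddings are at least `threshold` similar."""
    return cosine_similarity(emb_1, emb_2) >= threshold
```

Because cosine similarity ignores vector length, two embeddings that point in the same direction score close to 1.0 regardless of their magnitudes.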
Input: audio + library speaker → Output: same words, different voice
Input: audio question + library speaker → Output: AI answer in that speaker's voice
Input: text question + library speaker → Output: AI answer in that speaker's voice
set -a; source <env_file>; set +a
bash scripts/feishu_send_audio.sh <wav_file> <receive_id>
Converts wav → opus, uploads, sends as voice message.
Requires FEISHU_APP_ID + FEISHU_APP_SECRET env vars.
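A small pre-flight check can catch a missing credential before the send script runs. This is an illustrative helper (the name `missing_vars` is hypothetical and it is not part of feishu_send_audio.sh); it only checks the two variables named above.

```python
import os

REQUIRED_VARS = ("FEISHU_APP_ID", "FEISHU_APP_SECRET")


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Feishu credentials found")
```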
ffmpeg -y -i <video_file> -vn -ar 24000 -ac 1 /tmp/extracted_audio.wav