← 返回
效率工具 Key 中文

Audio Speaker Tools

Speaker separation, voice comparison, and audio processing tools. Use when working with multi-speaker audio, voice cloning, or speaker verification tasks inc...
说话人分离、声纹比对及音频处理工具。适用于多说话人音频、声音克隆或说话人验证等任务。
cmfinlan
效率工具 clawhub v1.0.0 1 版本 100000 Key: 需要
★ 0
Stars
📥 628
下载
💾 29
安装
1
版本
#latest

概述

Audio Speaker Tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.

Overview

This skill provides three main workflows:

  1. Speaker separation - Extract per-speaker audio from multi-speaker recordings
  2. Voice comparison - Measure speaker similarity between two audio files
  3. Audio processing - Segment extraction and voice isolation

Prerequisites

Setup Virtual Environment

Run once to create the venv and install dependencies:

bash scripts/setup_venv.sh

Default venv location: ./.venv

Requirements:

  • Python 3.9+
  • ffmpeg (brew install ffmpeg)
  • HuggingFace token (set as env var HF_TOKEN)

Scripts

1. Speaker Separation: diarize_and_slice_mps.py

Separate speakers from multi-speaker audio:

# Basic usage
HF_TOKEN=<your-hf-token> \
  /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir /path/to/output \
  --prefix MyShow

# With speaker constraints
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir ./out \
  --min-speakers 2 \
  --max-speakers 5 \
  --pad-ms 100

Process:

  1. Converts input to 16kHz mono WAV
  2. Runs Demucs vocal/background separation (optional, for cleaner input)
  3. Runs pyannote speaker diarization (MPS-accelerated)
  4. Extracts concatenated per-speaker WAV files

Output:

  • _speaker1.wav, _speaker2.wav, etc. (one per detected speaker)
  • diarization.rttm (time-stamped speaker segments)
  • segments.jsonl (JSON segments metadata)
  • meta.json (pipeline info and speaker index)

Important:

  • Always pass HF token via HF_TOKEN env var, never as CLI arg
  • MPS first, CPU fallback - Script prefers Metal GPU, falls back to CPU if unavailable
  • Default output: ./separated/

2. Voice Comparison: compare_voices.py

Measure similarity between two voice samples using Resemblyzer:

# Basic comparison
python scripts/compare_voices.py \
  --audio1 sample1.wav \
  --audio2 sample2.wav

# JSON output
python scripts/compare_voices.py \
  --audio1 reference.wav \
  --audio2 clone.wav \
  --threshold 0.85 \
  --json

# Exit code = 0 if pass, 1 if fail

Scores:

  • < 0.75 = Different speakers
  • 0.75-0.84 = Likely same speaker
  • 0.85+ = Excellent match (ideal for voice cloning validation)

Use cases:

  • Voice clone quality assessment (compare clone vs. original)
  • Speaker verification (authenticate speaker identity)
  • Validate speaker separation (confirm separated speakers are distinct)

See: references/scoring-guide.md for detailed interpretation

3. Audio Trimming

Use ffmpeg directly for segment extraction:

# Extract 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3

Workflows

Workflow 1: Extract Clean Voice Sample for Cloning

Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning

# 1. Separate speakers
HF_TOKEN=<your-hf-token> python scripts/diarize_and_slice_mps.py \
  --input podcast.mp3 --outdir ./out --prefix Podcast

# 2. Review speaker files (out/Podcast_speaker1.wav, etc.)

# 3. Select best sample (5-30s, clean speech)
ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

# 4. Upload to ElevenLabs as instant voice clone

See: references/elevenlabs-cloning.md for best practices

Workflow 2: Validate Voice Clone Quality

Goal: Measure how well a cloned voice matches the original

# 1. Generate test audio with ElevenLabs clone
# (done via ElevenLabs web UI or API)

# 2. Compare clone vs. reference
python scripts/compare_voices.py \
  --audio1 original_sample.wav \
  --audio2 elevenlabs_clone.wav \
  --threshold 0.85 \
  --json

# 3. Interpret score:
#    0.85+ = excellent, publish-ready
#    0.80-0.84 = acceptable, may need tweaking
#    < 0.80 = poor, try different sample or settings

See: references/scoring-guide.md for troubleshooting low scores

Workflow 3: Multi-Speaker Conversation Analysis

Goal: Separate and identify speakers in a conversation

# 1. Run diarization
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input meeting.mp3 --outdir ./out --prefix Meeting

# 2. Check detected speakers (meta.json)
cat out/meta.json

# 3. Compare speaker pairs to confirm separation
python scripts/compare_voices.py \
  --audio1 out/Meeting_speaker1.wav \
  --audio2 out/Meeting_speaker2.wav

# Expected: < 0.75 if separation worked correctly

Technical Notes

Device Acceleration

  • pyannote diarization: MPS (Metal) by default, CPU fallback
  • Resemblyzer: CPU only (no GPU acceleration)
  • Demucs: MPS by default when available

To force CPU for diarization: --device cpu

Audio Formats

  • Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
  • Processing: Internally converted to 16kHz mono WAV for diarization
  • Output: WAV format (44.1kHz stereo preserved from source)

HuggingFace Token

  • Required for: pyannote speaker diarization
  • Access: Must accept gated repo pyannote/speaker-diarization-3.1 on HF
  • Storage: Any secure secrets manager
  • Usage: Always pass via HF_TOKEN env var, never CLI arg

Sample Quality Tips

  • Shorter is better: 5-30s clean samples often score higher than 60+ second samples
  • Clean audio: Remove background noise with Demucs --two-stems vocals
  • Single speaker: Ensure isolated voice, not mixed conversation
  • Good recording: Studio mic > phone mic for voice comparison accuracy

References

  • elevenlabs-cloning.md - Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
  • scoring-guide.md - How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)

Common Issues

"Missing HF token" error

  • Export token before running: export HF_TOKEN=
  • Or pass inline: HF_TOKEN= python script.py ...

Low voice comparison scores for same speaker

  • Try shorter, cleaner samples (5-30s)
  • Use Demucs to isolate vocals: demucs --two-stems vocals input.mp3
  • Ensure consistent recording quality (same mic, environment)
  • See references/scoring-guide.md troubleshooting section

Diarization not detecting all speakers

  • Adjust --min-speakers and --max-speakers flags
  • Check audio quality (clear speech, minimal overlap)
  • Try longer audio (30+ seconds) for better speaker modeling

MPS/Metal acceleration not working

  • Ensure PyTorch with MPS support: python -c "import torch; print(torch.backends.mps.is_available())"
  • Fallback to CPU: --device cpu
  • Re-run setup_venv.sh to reinstall PyTorch

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-03-30 04:00 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

productivity

Obsidian

steipete
操作 Obsidian 仓库(纯 Markdown 笔记)并通过 obsidian-cli 自动化。
★ 433 📥 103,838
productivity

Weather

steipete
获取当前天气和预报(无需API密钥)
★ 446 📥 226,420
productivity

Nano Pdf

steipete
使用nano-pdf CLI通过自然语言指令编辑PDF
★ 275 📥 114,937