
vibevoice-tts

Generate speech audio from text using the VibeVoice-1.5B TTS model. This skill should be used when the user asks to generate speech, synthesize voice, create audio from text, produce a podcast, or perform text-to-speech. Supports single and multi-speaker (up to 4) generation with voice cloning. Default voice is zh-Xinran_woman (Chinese female). Runs locally with CUDA, MPS (Apple Silicon), or CPU. Uses uv for environment management.
u_52c67e90
Uncategorized enterprise v1.0.1 97058.8 Key: not required

Overview

VibeVoice TTS Skill

Generate high-quality speech audio from text using the VibeVoice-1.5B model running locally.

Quick Reference

  • Skill directory: {SKILL_DIR} (resolve at runtime, e.g. ~/.codebuddy/skills/vibevoice-tts)
  • Setup script: {SKILL_DIR}/scripts/setup.sh
  • TTS script: {SKILL_DIR}/scripts/tts_generate.py
  • Virtual env: {SKILL_DIR}/.venv/
  • Bundled voices: {SKILL_DIR}/voices/ (9 presets included)
  • Output directory: {SKILL_DIR}/outputs/
  • Default voice: zh-Xinran_woman (Chinese female)
  • Default model: vibevoice/VibeVoice-1.5B (auto-downloaded from HuggingFace)

> Note for AI agent: {SKILL_DIR} refers to the parent directory of scripts/. At runtime, resolve it to the actual installed skill path. All commands use uv run --python {SKILL_DIR}/.venv/bin/python to execute within the skill's own virtual environment.
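
A minimal sketch of the resolution described in the note, assuming the agent knows the path of a file inside scripts/. The helper names resolve_skill_dir and uv_prefix are hypothetical and are not part of the skill:

```python
from pathlib import Path
from typing import List


def resolve_skill_dir(script_path: str) -> Path:
    """Return the parent directory of scripts/ for a file that lives in scripts/."""
    # <skill_dir>/scripts/tts_generate.py -> <skill_dir>
    return Path(script_path).resolve().parent.parent


def uv_prefix(skill_dir: Path) -> List[str]:
    """Build the `uv run --python <venv python>` prefix used by every command."""
    return ["uv", "run", "--python", str(skill_dir / ".venv" / "bin" / "python")]


if __name__ == "__main__":
    # Hypothetical install location, used only for illustration.
    skill_dir = resolve_skill_dir("/opt/skills/vibevoice-tts/scripts/tts_generate.py")
    print(skill_dir.name)
    print(" ".join(uv_prefix(skill_dir)))
```

Returning the prefix as a list rather than a string keeps it safe to pass to subprocess.run without shell quoting.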

First-Time Setup

Run the setup script once to create the virtual environment and install dependencies:

bash {SKILL_DIR}/scripts/setup.sh

This will:

  1. Create a .venv/ virtual environment in the skill directory (Python 3.11)
  2. Install vibevoice from GitHub source (latest version) with all dependencies

Prerequisites: only uv needs to be installed on the system. Install with:

curl -LsSf https://astral.sh/uv/install.sh | sh

Verify Setup

Check that the environment is ready:

uv run --python {SKILL_DIR}/.venv/bin/python -c "import vibevoice; print('vibevoice OK')"

Workflow

Step 0: Ensure Environment

Before generating, check if {SKILL_DIR}/.venv/ exists. If not, run bash {SKILL_DIR}/scripts/setup.sh first.
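
The check above could be sketched as follows. The function names are hypothetical, and treating the mere presence of the .venv/ directory as "ready" is this sketch's simplification of the rule stated here:

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def setup_command_if_needed(skill_dir: Path) -> Optional[List[str]]:
    """Return the setup command when .venv/ is missing, otherwise None."""
    if (skill_dir / ".venv").is_dir():
        return None
    return ["bash", str(skill_dir / "scripts" / "setup.sh")]


def ensure_environment(skill_dir: Path) -> None:
    """Run setup.sh once if the virtual environment does not exist yet."""
    cmd = setup_command_if_needed(skill_dir)
    if cmd is not None:
        # check=True raises CalledProcessError if setup.sh fails.
        subprocess.run(cmd, check=True)
```

Splitting the decision from the execution makes the check easy to test without actually running the setup script.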

Step 1: Determine Input

Identify the text to synthesize and the number of speakers.

  • Plain text (no Speaker N: tags): auto-wrapped as single speaker.
  • Multi-speaker script: must follow Speaker N: ... format (N from 1 to 4).

For Chinese text, prefer English punctuation (commas , and periods .) to avoid pronunciation instability. Refer to references/script_format.md for detailed format guidance.
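
As an illustration of these input rules, the sketch below wraps untagged text as a single speaker, validates Speaker N: tags, and swaps full-width Chinese commas and periods for their English equivalents. It is not the skill's own preprocessing: the function name to_script is hypothetical, and rejecting untagged lines inside a tagged script is an assumption made here for strictness.

```python
import re

# Matches lines such as "Speaker 2: some text".
SPEAKER_LINE = re.compile(r"^Speaker (\d+):\s*(.*)$")

# Only the two marks named in the guidance: full-width comma and period.
PUNCT = str.maketrans({"\uff0c": ",", "\u3002": "."})


def to_script(text: str) -> str:
    """Normalize punctuation and return text in `Speaker N: ...` format."""
    lines = [ln.strip() for ln in text.translate(PUNCT).splitlines() if ln.strip()]
    tagged = [SPEAKER_LINE.match(ln) for ln in lines]

    if not any(tagged):
        # Plain text: wrap everything as a single speaker.
        return "Speaker 1: " + " ".join(lines)

    for ln, match in zip(lines, tagged):
        if match is None:
            # Assumption of this sketch: every line of a script must be tagged.
            raise ValueError(f"untagged line in multi-speaker script: {ln!r}")
        if not 1 <= int(match.group(1)) <= 4:
            raise ValueError(f"speaker number must be 1 to 4: {ln!r}")
    return "\n".join(lines)
```

For example, to_script("Speaker 5: hi") raises ValueError because only speakers 1 to 4 are supported.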

Step 2: Choose Voices

Select voice presets for each speaker. Refer to references/voice_presets.md for the full list.

Common choices:

  • Chinese female (default): zh-Xinran_woman
  • Chinese male: zh-Bowen_man
  • English female: en-Alice_woman or en-Maya_woman
  • English male: en-Frank_man or en-Carter_man

Short aliases like Xinran, Bowen, Alice, Frank are also accepted.
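
One way alias matching might work is sketched below. The actual lookup in tts_generate.py is not shown in this document, so the rule used here (case-insensitive match on the name between the language prefix and the gender suffix) is an assumption, as are the helper names. Only the six presets named above are listed; the bundle contains nine.

```python
from typing import Sequence

# Presets named in this document (the skill bundles nine in total).
KNOWN_PRESETS = [
    "zh-Xinran_woman",
    "zh-Bowen_man",
    "en-Alice_woman",
    "en-Maya_woman",
    "en-Frank_man",
    "en-Carter_man",
]


def alias_of(preset: str) -> str:
    """'zh-Xinran_woman' -> 'xinran' (assumed alias rule)."""
    return preset.split("-", 1)[1].rsplit("_", 1)[0].lower()


def resolve_voice(name: str, presets: Sequence[str] = KNOWN_PRESETS) -> str:
    """Map a full preset name or a short alias to a full preset name."""
    if name in presets:
        return name
    matches = [p for p in presets if alias_of(p) == name.lower()]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous voice: {name!r}")
    return matches[0]
```

Under these assumptions, resolve_voice("Xinran") gives zh-Xinran_woman and resolve_voice("frank") gives en-Frank_man.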

Step 3: Generate Audio

All commands use the skill's virtual environment via uv run:

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py [arguments]

Single speaker from plain text (most common)

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "你好,欢迎收听本期节目."

Output is saved to {SKILL_DIR}/outputs/tts_TIMESTAMP.wav.

Single speaker with custom voice and output

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "Hello and welcome to today's episode." \
    --speaker_names en-Alice_woman \
    --output /path/to/output.wav

Multi-speaker from a script file

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --txt_path /path/to/script.txt \
    --speaker_names zh-Xinran_woman zh-Bowen_man

Use custom voice files from a different directory

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "Hello world." \
    --voices_dir /path/to/my/custom/voices

All available arguments

| Argument | Default | Description |
|---|---|---|
| --text | (required if no --txt_path) | Plain text string to synthesize |
| --txt_path | (required if no --text) | Path to a .txt script file |
| --model_path | vibevoice/VibeVoice-1.5B | HuggingFace model id or local path |
| --speaker_names | zh-Xinran_woman | Voice preset name(s), space-separated |
| --output / -o | auto-generated | Output .wav file path |
| --output_dir | {SKILL_DIR}/outputs/ | Directory for auto-named outputs |
| --device | auto-detect | cuda, mps, or cpu |
| --cfg_scale | 1.3 | CFG guidance scale |
| --seed | random | Random seed for reproducibility |
| --ddpm_steps | 10 | DDPM denoising steps (more = higher quality, slower) |
| --disable_prefill | false | Skip voice cloning / speaker conditioning |
| --checkpoint_path | None | Path to fine-tuned LoRA checkpoint directory (optional) |
| --voices_dir | bundled voices | Directory containing custom voice .wav files |
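
For an agent that assembles the command programmatically, a sketch along these lines may help. It uses only flags listed in the arguments table. The function name build_tts_command is hypothetical, and treating --text and --txt_path as mutually exclusive is an assumption of this sketch rather than documented behavior.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def build_tts_command(
    skill_dir: Path,
    *,
    text: Optional[str] = None,
    txt_path: Optional[str] = None,
    speaker_names: Sequence[str] = ("zh-Xinran_woman",),
    output: Optional[str] = None,
    seed: Optional[int] = None,
    ddpm_steps: Optional[int] = None,
) -> List[str]:
    """Return an argv list for tts_generate.py; omitted options keep their defaults."""
    if (text is None) == (txt_path is None):
        # Assumption: exactly one input source is given.
        raise ValueError("pass exactly one of text or txt_path")

    cmd = [
        "uv", "run", "--python", str(skill_dir / ".venv" / "bin" / "python"),
        str(skill_dir / "scripts" / "tts_generate.py"),
    ]
    cmd += ["--text", text] if text is not None else ["--txt_path", txt_path]
    cmd += ["--speaker_names", *speaker_names]
    if output is not None:
        cmd += ["--output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if ddpm_steps is not None:
        cmd += ["--ddpm_steps", str(ddpm_steps)]
    return cmd
```

The resulting list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting problems with non-ASCII or multi-word text.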

Step 4: Verify Output

After generation, the script prints a summary with output path, duration, and generation time. Inform the user of the output file location.

Important Notes

  • First run downloads the model (~3 GB) from HuggingFace. Subsequent runs use the cached model.
  • No git clone needed: the setup script installs vibevoice directly from GitHub via uv pip install.
  • Mac MPS: uses float32 + sdpa attention. Requires ~6 GB unified memory for the 1.5B model.
  • CUDA: uses bfloat16 + flash_attention_2 for optimal speed. Falls back to sdpa if flash attention is unavailable.
  • Generation speed: on Apple Silicon, expect generation to take roughly 2-5x the duration of the output audio (e.g. 30 s of audio takes 60-150 s to generate).
  • Long text: the model supports up to ~90 minutes of audio. For very long scripts, generation may take a long time.
  • Multi-line text: --text is for single-line, short text only (shell will swallow newlines). When users provide multi-line text or long paragraphs, always prepare a temporary .txt file in the proper Speaker N: format and use --txt_path instead.
  • Fine-tuned models: use --checkpoint_path /path/to/lora/dir to load LoRA adapters trained via the fine-tuning pipeline.
  • Custom voices: place .wav files in a directory and pass --voices_dir to use them.
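
The multi-line note above can be handled with a small helper like the one sketched here. It assumes the input carries no Speaker N: tags yet and tags every non-empty line with the same speaker. The name write_script_file is hypothetical.

```python
import tempfile
from pathlib import Path


def write_script_file(text: str, speaker: int = 1) -> Path:
    """Write multi-line text to a temporary .txt file in `Speaker N: ...` format."""
    paragraphs = [p.strip() for p in text.splitlines() if p.strip()]
    body = "\n".join(f"Speaker {speaker}: {p}" for p in paragraphs)
    # delete=False so the file survives until tts_generate.py has read it.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(body + "\n")
        return Path(fh.name)
```

Pass the returned path via --txt_path, and remove the file with Path.unlink() once generation has finished.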

Loading a fine-tuned LoRA checkpoint

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --txt_path /path/to/script.txt \
    --checkpoint_path /path/to/finetuned/checkpoint \
    --speaker_names en-Alice_woman

Portability

This skill is fully self-contained and portable across machines. Everything lives under the skill directory:

{SKILL_DIR}/
├── .venv/          ← virtual environment (created by setup.sh)
├── voices/         ← 9 bundled voice presets (~10 MB)
├── outputs/        ← generated audio files
├── scripts/
│   ├── setup.sh    ← one-time environment setup
│   └── tts_generate.py
├── references/
└── SKILL.md

To use on a new machine:

  1. Copy the skill directory (or let CodeBuddy sync it)
  2. Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Run bash {SKILL_DIR}/scripts/setup.sh
  4. Done! Start generating speech.

No hardcoded paths, no external project dependencies, no git clone needed.

Version history

2 versions in total

  • v1.0.1 Initial release (current)
    2026-06-04 17:04 Safe
  • v1.0.0 Initial release
    2026-06-01 14:28 Safe
