
VibeVoice TTS Skill

Generate high-quality speech audio from text using the VibeVoice-1.5B model running locally.

Quick Reference

Skill directory: {SKILL_DIR} (resolve at runtime, e.g. ~/.codebuddy/skills/vibevoice-tts)
Setup script: {SKILL_DIR}/scripts/setup.sh
TTS script: {SKILL_DIR}/scripts/tts_generate.py
Virtual env: {SKILL_DIR}/.venv/
Bundled voices: {SKILL_DIR}/voices/ (9 presets included)
Output directory: {SKILL_DIR}/outputs/
Default voice: zh-Xinran_woman (Chinese female)
Default model: vibevoice/VibeVoice-1.5B (auto-downloaded from HuggingFace)

> Note for AI agent: {SKILL_DIR} refers to the parent directory of scripts/. At runtime, resolve it to the actual installed skill path. All commands use uv run --python {SKILL_DIR}/.venv/bin/python to execute within the skill's own virtual environment.
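The path resolution described in the note can be sketched as follows. This is a minimal illustration, not code taken from the skill: the helper names resolve_skill_dir and venv_python are hypothetical, and the only thing assumed from the text is that the skill directory is the parent of scripts/.

```python
from pathlib import Path


def resolve_skill_dir(script_path: str) -> Path:
    """Return the skill directory, i.e. the parent of the scripts/ folder."""
    script = Path(script_path).expanduser().resolve()
    # scripts/tts_generate.py -> scripts/ -> skill directory
    return script.parent.parent


def venv_python(skill_dir: Path) -> Path:
    """Path of the interpreter inside the skill's own virtual environment."""
    return skill_dir / ".venv" / "bin" / "python"
```

Resolving the path once and passing it around avoids repeating string substitution for {SKILL_DIR} in every command.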

First-Time Setup

Run the setup script once to create the virtual environment and install dependencies:

bash {SKILL_DIR}/scripts/setup.sh

This will:

Create a .venv/ virtual environment in the skill directory (Python 3.11)
Install vibevoice from GitHub source (latest version) with all dependencies

Prerequisites: only uv needs to be installed on the system. Install with:

curl -LsSf https://astral.sh/uv/install.sh | sh

Verify Setup

Check that the environment is ready:

uv run --python {SKILL_DIR}/.venv/bin/python -c "import vibevoice; print('vibevoice OK')"

Workflow

Step 0: Ensure Environment

Before generating, check if {SKILL_DIR}/.venv/ exists. If not, run bash {SKILL_DIR}/scripts/setup.sh first.
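One way an agent might implement this check is sketched below. The function name setup_command_if_needed is hypothetical; it only tests for the .venv/ directory and returns the setup command from this document, without running it.

```python
from pathlib import Path
from typing import List, Optional


def setup_command_if_needed(skill_dir: Path) -> Optional[List[str]]:
    """Return the setup command when .venv/ is missing, otherwise None."""
    if (skill_dir / ".venv").is_dir():
        return None  # environment already exists, nothing to do
    return ["bash", str(skill_dir / "scripts" / "setup.sh")]
```

A caller could pass the returned list to subprocess.run(cmd, check=True) and then proceed to Step 1.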

Step 1: Determine Input

Identify the text to synthesize and the number of speakers.

Plain text (no Speaker N: tags): auto-wrapped as single speaker.
Multi-speaker script: must follow Speaker N: ... format (N from 1 to 4).

For Chinese text, prefer English (half-width) punctuation, i.e. the comma , and period . rather than , and 。, to avoid pronunciation instability. Refer to references/script_format.md for detailed format guidance.
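The input rules above can be sketched in a few lines. This is an illustration only: how tts_generate.py itself wraps plain text is not shown in this document, and prepare_script, count_speakers, and the two-entry punctuation map are assumptions made for the example.

```python
import re

# Matches a line that starts with "Speaker N:" where N is 1 to 4.
SPEAKER_TAG = re.compile(r"^\s*Speaker\s+([1-4])\s*:", re.MULTILINE)

# Full-width comma and period mapped to their half-width equivalents.
PUNCT_MAP = str.maketrans({",": ",", "。": "."})


def prepare_script(text: str) -> str:
    """Normalize punctuation and wrap untagged text as a single speaker."""
    text = text.translate(PUNCT_MAP).strip()
    if SPEAKER_TAG.search(text):
        return text  # already a multi-speaker script, leave structure alone
    single_line = " ".join(text.split())
    return f"Speaker 1: {single_line}"


def count_speakers(script: str) -> int:
    """Number of distinct Speaker N tags found in the script."""
    return len(set(SPEAKER_TAG.findall(script)))
```

The speaker count is useful in Step 2, where one voice preset is chosen per speaker.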

Step 2: Choose Voices

Select voice presets for each speaker. Refer to references/voice_presets.md for the full list.

Common choices:

Chinese female (default): zh-Xinran_woman
Chinese male: zh-Bowen_man
English female: en-Alice_woman or en-Maya_woman
English male: en-Frank_man or en-Carter_man

Short aliases like Xinran, Bowen, Alice, Frank are also accepted.
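Alias matching could work roughly as in the sketch below. The script's real lookup logic is not documented here, so this is an assumption: resolve_voice is a hypothetical helper, and PRESETS lists only the six names mentioned above rather than all nine bundled voices.

```python
# Preset names taken from the "Common choices" list; not the full set.
PRESETS = [
    "zh-Xinran_woman",
    "zh-Bowen_man",
    "en-Alice_woman",
    "en-Maya_woman",
    "en-Frank_man",
    "en-Carter_man",
]


def resolve_voice(name, presets=PRESETS):
    """Map a full preset name or a short alias to a full preset name."""
    if name in presets:
        return name
    wanted = name.lower()
    # "zh-Xinran_woman" -> "Xinran": drop the language prefix and the suffix.
    matches = [
        p for p in presets
        if p.split("-", 1)[-1].split("_", 1)[0].lower() == wanted
    ]
    if len(matches) == 1:
        return matches[0]
    raise ValueError(f"unknown or ambiguous voice: {name!r}")
```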

Step 3: Generate Audio

All commands use the skill's virtual environment via uv run:

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py [arguments]

Single speaker from plain text (most common)

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "你好,欢迎收听本期节目."

Output is saved to {SKILL_DIR}/outputs/tts_TIMESTAMP.wav.

Single speaker with custom voice and output

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "Hello and welcome to today's episode." \
    --speaker_names en-Alice_woman \
    --output /path/to/output.wav

Multi-speaker from a script file

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --txt_path /path/to/script.txt \
    --speaker_names zh-Xinran_woman zh-Bowen_man

Use custom voice files from a different directory

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --text "Hello world." \
    --voices_dir /path/to/my/custom/voices

All available arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--text` | (required if no `--txt_path`) | Plain text string to synthesize |
| `--txt_path` | (required if no `--text`) | Path to a .txt script file |
| `--model_path` | `vibevoice/VibeVoice-1.5B` | HuggingFace model id or local path |
| `--speaker_names` | `zh-Xinran_woman` | Voice preset name(s), space-separated |
| `--output` / `-o` | auto-generated | Output .wav file path |
| `--output_dir` | `{SKILL_DIR}/outputs/` | Directory for auto-named outputs |
| `--device` | auto-detect | `cuda`, `mps`, or `cpu` |
| `--cfg_scale` | `1.3` | CFG guidance scale |
| `--seed` | random | Random seed for reproducibility |
| `--ddpm_steps` | `10` | DDPM denoising steps (more = higher quality, slower) |
| `--disable_prefill` | `false` | Skip voice cloning / speaker conditioning |
| `--checkpoint_path` | `None` | Path to fine-tuned LoRA checkpoint directory (optional) |
| `--voices_dir` | bundled voices | Directory containing custom voice .wav files |
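When the command is launched from code rather than typed into a shell, building it as an argument list sidesteps quoting problems with spaces and punctuation in the text. The sketch below covers only a few of the documented flags. build_tts_command is a hypothetical helper, and treating --text and --txt_path as mutually exclusive is an assumption of this example, not something the table states.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def build_tts_command(
    skill_dir: Path,
    text: Optional[str] = None,
    txt_path: Optional[str] = None,
    speaker_names: Sequence[str] = ("zh-Xinran_woman",),
    output: Optional[str] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Assemble the uv run invocation as an argument list."""
    # Assumption: exactly one input source is passed per call.
    if (text is None) == (txt_path is None):
        raise ValueError("pass exactly one of text or txt_path")

    cmd = [
        "uv", "run", "--python", str(skill_dir / ".venv" / "bin" / "python"),
        str(skill_dir / "scripts" / "tts_generate.py"),
    ]
    if text is not None:
        cmd += ["--text", text]
    else:
        cmd += ["--txt_path", str(txt_path)]
    cmd += ["--speaker_names", *speaker_names]
    if output is not None:
        cmd += ["--output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), which raises if the script exits with a non-zero status.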

Step 4: Verify Output

After generation, the script prints a summary with output path, duration, and generation time. Inform the user of the output file location.
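If an independent check of the result is wanted, the duration can be read back from the file with Python's standard wave module. This is a sketch under one assumption: the output is a PCM-encoded WAV, which is what the wave module reads. Both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by audio duration; 2.0 means twice as slow as playback."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds
```

A zero-length or unreadable file is a strong hint that generation failed, even if the script exited cleanly.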

Important Notes

First run downloads the model (~3 GB) from HuggingFace. Subsequent runs use the cached model.
No git clone needed: the setup script installs vibevoice directly from GitHub via uv pip install.
Mac MPS: uses float32 + sdpa attention. Requires ~6 GB unified memory for the 1.5B model.
CUDA: uses bfloat16 + flash_attention_2 for optimal speed. Falls back to sdpa if flash attention is unavailable.
Generation speed: expect a real-time factor of roughly 2-5 on Apple Silicon, meaning generation takes 2-5 times the duration of the audio (e.g. 30 s of audio takes 60-150 s).
Long text: the model supports up to ~90 minutes of audio. Generation time grows with audio length, so at the real-time factors above a 90-minute script would take roughly 3 to 7.5 hours on Apple Silicon.
Multi-line text: --text is for short, single-line text only, because newlines are easily lost when text is passed on the command line. When the user provides multi-line text or long paragraphs, always write a temporary .txt file in the proper Speaker N: format and use --txt_path instead.
Fine-tuned models: use --checkpoint_path /path/to/lora/dir to load LoRA adapters trained via the fine-tuning pipeline.
Custom voices: place .wav files in a directory and pass --voices_dir to use them.
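For the multi-line case in the notes above, a temporary script file might be prepared as sketched here. write_script_file is a hypothetical helper, and emitting one Speaker N: line per paragraph is an assumption. references/script_format.md remains the authority on the exact format.

```python
import tempfile
from pathlib import Path


def write_script_file(paragraphs, speaker=1) -> Path:
    """Write non-empty paragraphs as 'Speaker N: ...' lines to a temporary .txt file."""
    lines = [
        f"Speaker {speaker}: {' '.join(p.split())}"  # collapse inner newlines
        for p in paragraphs
        if p.strip()
    ]
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write("\n".join(lines) + "\n")
    return Path(handle.name)
```

The returned path is what gets passed to --txt_path. The file is not deleted automatically, so remove it once generation has finished.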

Loading a fine-tuned LoRA checkpoint

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
    --txt_path /path/to/script.txt \
    --checkpoint_path /path/to/finetuned/checkpoint \
    --speaker_names en-Alice_woman

Portability

This skill is fully self-contained and portable across machines. Everything lives under the skill directory:

{SKILL_DIR}/
├── .venv/          ← virtual environment (created by setup.sh)
├── voices/         ← 9 bundled voice presets (~10 MB)
├── outputs/        ← generated audio files
├── scripts/
│   ├── setup.sh    ← one-time environment setup
│   └── tts_generate.py
├── references/
└── SKILL.md

To use on a new machine:

Copy the skill directory (or let CodeBuddy sync it)
Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
Run bash {SKILL_DIR}/scripts/setup.sh
Done! Start generating speech.

No hardcoded paths, no external project dependencies, no git clone needed.

Version History

2 versions in total

v1.0.1 Initial release (current)

2026-06-04 17:04, security status: safe
v1.0.0 Initial release

2026-06-01 14:28, security status: safe
