Generate high-quality speech audio from text using the VibeVoice-1.5B model running locally.
- Skill directory: {SKILL_DIR} (resolve at runtime, e.g. ~/.codebuddy/skills/vibevoice-tts)
- Setup script: {SKILL_DIR}/scripts/setup.sh
- Generation script: {SKILL_DIR}/scripts/tts_generate.py
- Virtual environment: {SKILL_DIR}/.venv/
- Voice presets: {SKILL_DIR}/voices/ (9 presets included)
- Output directory: {SKILL_DIR}/outputs/
- Default voice: zh-Xinran_woman (Chinese female)
- Default model: vibevoice/VibeVoice-1.5B (auto-downloaded from HuggingFace)

> Note for AI agent: {SKILL_DIR} refers to the parent directory of scripts/. At runtime, resolve it to the actual installed skill path. All commands use uv run --python {SKILL_DIR}/.venv/bin/python to execute within the skill's own virtual environment.
Run the setup script once to create the virtual environment and install dependencies:
bash {SKILL_DIR}/scripts/setup.sh
This will:
- Create a .venv/ virtual environment in the skill directory (Python 3.11)
- Install vibevoice from GitHub source (latest version) with all dependencies

Prerequisites: only uv needs to be installed on the system. Install with:
curl -LsSf https://astral.sh/uv/install.sh | sh
Check that the environment is ready:
uv run --python {SKILL_DIR}/.venv/bin/python -c "import vibevoice; print('vibevoice OK')"
Before generating, check if {SKILL_DIR}/.venv/ exists. If not, run bash {SKILL_DIR}/scripts/setup.sh first.
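The pre-flight check above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the helper name `setup_command_if_needed` is hypothetical, and it only decides which command to run based on whether `.venv/` exists.

```python
from pathlib import Path
from typing import Optional


def setup_command_if_needed(skill_dir: Path) -> Optional[list[str]]:
    """Return the setup command when .venv/ is missing, else None.

    Hypothetical helper: it only inspects the directory layout described
    in this document and does not run anything itself.
    """
    if (skill_dir / ".venv").is_dir():
        return None  # environment already exists, nothing to do
    return ["bash", str(skill_dir / "scripts" / "setup.sh")]
```

A caller could pass the returned list to `subprocess.run(cmd, check=True)` before the first generation and skip the call when the result is `None`.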
Identify the text to synthesize and the number of speakers.
- Plain text (no Speaker N: tags): auto-wrapped as single speaker.
- Multi-speaker scripts: use the Speaker N: ... format (N from 1 to 4).

For Chinese text, prefer English punctuation (commas , and periods .) to avoid pronunciation instability. Refer to references/script_format.md for detailed format guidance.
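As a rough illustration of these formatting rules, the sketch below normalizes input text before it is written to a script file. It is an assumption-laden example: the function name `to_script` and the punctuation map are hypothetical, and the authoritative format remains whatever references/script_format.md specifies.

```python
import re

# Assumed mapping from full-width Chinese punctuation to the English
# comma and period recommended in this document.
PUNCT_MAP = str.maketrans({",": ",", "。": "."})

# Matches a leading "Speaker N:" tag and captures N.
SPEAKER_TAG = re.compile(r"^Speaker (\d+):")


def to_script(text: str) -> str:
    """Normalize text into 'Speaker N: ...' lines (hypothetical helper)."""
    text = text.translate(PUNCT_MAP)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not any(SPEAKER_TAG.match(ln) for ln in lines):
        # Plain text without tags: wrap everything as a single speaker.
        return "Speaker 1: " + " ".join(lines)
    for ln in lines:
        m = SPEAKER_TAG.match(ln)
        if m and not 1 <= int(m.group(1)) <= 4:
            raise ValueError(f"speaker number out of range 1-4: {ln!r}")
    # Tagged input is passed through line by line, untagged lines included.
    return "\n".join(lines)
```

The script itself already auto-wraps untagged text, so a helper like this is mainly useful when preparing a .txt file by hand.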
Select voice presets for each speaker. Refer to references/voice_presets.md for the full list.
Common choices:
- Chinese female: zh-Xinran_woman
- Chinese male: zh-Bowen_man
- English female: en-Alice_woman or en-Maya_woman
- English male: en-Frank_man or en-Carter_man

Short aliases like Xinran, Bowen, Alice, Frank are also accepted.
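How the bundled script matches aliases is not shown here, so the following is only one plausible way to do it. The sketch assumes an alias is the part of the preset name between the language prefix and the gender suffix, matched case-insensitively; `PRESETS` lists only the six names mentioned above, not all nine bundled presets, and `resolve_voice` is a hypothetical name.

```python
from typing import Sequence

# The six presets named in this document (the skill bundles nine).
PRESETS = (
    "zh-Xinran_woman", "zh-Bowen_man",
    "en-Alice_woman", "en-Maya_woman",
    "en-Frank_man", "en-Carter_man",
)


def resolve_voice(name: str, presets: Sequence[str] = PRESETS) -> str:
    """Resolve a full preset name or short alias to a full preset name.

    Assumption: matching is case-insensitive and the alias is the middle
    part of the name, e.g. "zh-Xinran_woman" -> "xinran".
    """
    wanted = name.lower()
    for preset in presets:
        alias = preset.split("-", 1)[1].rsplit("_", 1)[0].lower()
        if wanted in (preset.lower(), alias):
            return preset
    raise KeyError(f"unknown voice preset: {name}")
```

Resolving names up front lets an agent report an unknown voice before starting a slow generation run.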
All commands use the skill's virtual environment via uv run:
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py [arguments]
Single speaker with the default voice:
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
  --text "你好,欢迎收听本期节目."
Output is saved to {SKILL_DIR}/outputs/tts_TIMESTAMP.wav.

Specific voice and output path:
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
  --text "Hello and welcome to today's episode." \
  --speaker_names en-Alice_woman \
  --output /path/to/output.wav

Multi-speaker script read from a file:
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
  --txt_path /path/to/script.txt \
  --speaker_names zh-Xinran_woman zh-Bowen_man

Custom voices directory:
uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
  --text "Hello world." \
  --voices_dir /path/to/my/custom/voices
| Argument | Default | Description |
|---|---|---|
| --text | (required if no --txt_path) | Plain text string to synthesize |
| --txt_path | (required if no --text) | Path to a .txt script file |
| --model_path | vibevoice/VibeVoice-1.5B | HuggingFace model id or local path |
| --speaker_names | zh-Xinran_woman | Voice preset name(s), space-separated |
| --output / -o | auto-generated | Output .wav file path |
| --output_dir | {SKILL_DIR}/outputs/ | Directory for auto-named outputs |
| --device | auto-detect | cuda, mps, or cpu |
| --cfg_scale | 1.3 | CFG guidance scale |
| --seed | random | Random seed for reproducibility |
| --ddpm_steps | 10 | DDPM denoising steps (more = higher quality, slower) |
| --disable_prefill | false | Skip voice cloning / speaker conditioning |
| --checkpoint_path | None | Path to fine-tuned LoRA checkpoint directory (optional) |
| --voices_dir | bundled voices | Directory containing custom voice .wav files |
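When an agent assembles the invocation programmatically, building an argument list avoids shell-quoting problems. The sketch below uses only flags from the table above. The function name `build_tts_command` is hypothetical, and requiring exactly one of `text` or `txt_path` is this sketch's choice, since the table does not say whether both may be passed together.

```python
from pathlib import Path
from typing import Optional, Sequence


def build_tts_command(
    skill_dir: Path,
    *,
    text: Optional[str] = None,
    txt_path: Optional[Path] = None,
    speaker_names: Sequence[str] = (),
    output: Optional[Path] = None,
    seed: Optional[int] = None,
) -> list[str]:
    """Build the uv run argument list for tts_generate.py (hypothetical helper)."""
    # Assumption of this sketch: exactly one text source is supplied.
    if (text is None) == (txt_path is None):
        raise ValueError("pass exactly one of text or txt_path")
    cmd = [
        "uv", "run", "--python", str(skill_dir / ".venv" / "bin" / "python"),
        str(skill_dir / "scripts" / "tts_generate.py"),
    ]
    cmd += ["--text", text] if text is not None else ["--txt_path", str(txt_path)]
    if speaker_names:
        # Preset names are space-separated, i.e. separate argv entries.
        cmd += ["--speaker_names", *speaker_names]
    if output is not None:
        cmd += ["--output", str(output)]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Omitted options fall back to the defaults listed in the table.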
After generation, the script prints a summary with output path, duration, and generation time. Inform the user of the output file location.
- Installation: setup.sh installs vibevoice directly from GitHub via uv pip install.
- MPS (Apple Silicon): uses float32 + sdpa attention. Requires ~6 GB unified memory for the 1.5B model.
- CUDA: uses bfloat16 + flash_attention_2 for optimal speed. Falls back to sdpa if flash attention is unavailable.
- --text is for single-line, short text only (the shell will swallow newlines). When users provide multi-line text or long paragraphs, always prepare a temporary .txt file in the proper Speaker N: format and use --txt_path instead.
- Fine-tuned models: pass --checkpoint_path /path/to/lora/dir to load LoRA adapters trained via the fine-tuning pipeline.
- Custom voices: place .wav files in a directory and pass --voices_dir to use them.

uv run --python {SKILL_DIR}/.venv/bin/python {SKILL_DIR}/scripts/tts_generate.py \
--txt_path /path/to/script.txt \
--checkpoint_path /path/to/finetuned/checkpoint \
--speaker_names en-Alice_woman
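The note on --text versus --txt_path can be sketched as a small decision helper. This is an illustration under stated assumptions: the name `text_arguments` and the `MAX_INLINE_CHARS` threshold are hypothetical (the document only says "short"), and the input is expected to be in Speaker N: format already when it spans multiple lines.

```python
import tempfile

# Hypothetical cutoff for "short" text; the document gives no number.
MAX_INLINE_CHARS = 200


def text_arguments(text: str) -> list[str]:
    """Return ['--text', ...] for short one-liners, else ['--txt_path', ...]."""
    if "\n" not in text and len(text) <= MAX_INLINE_CHARS:
        return ["--text", text]
    # Multi-line or long input: write it to a temporary .txt file so the
    # newlines are preserved, then point the script at that file.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(text)
    return ["--txt_path", fh.name]
```

The temporary file is created with `delete=False`, so the caller is responsible for removing it once generation has finished.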
This skill is fully self-contained and portable across machines. Everything lives under the skill directory:
{SKILL_DIR}/
├── .venv/ ← virtual environment (created by setup.sh)
├── voices/ ← 9 bundled voice presets (~10 MB)
├── outputs/ ← generated audio files
├── scripts/
│ ├── setup.sh ← one-time environment setup
│ └── tts_generate.py
├── references/
└── SKILL.md
To use on a new machine:
1. Make sure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
2. Run the setup script: bash {SKILL_DIR}/scripts/setup.sh

No hardcoded paths, no external project dependencies, no git clone needed.