Google Gemini TTS

Generate spoken audio from text using Google's Gemini TTS models (default is Gemini 3.1 Flash TTS Preview, with fallback to Gemini 2.5 Flash/Pro preview TTS)...
Author: shubhamsaboo
Uncategorized · clawhub · v1.0.3 · 1 version · API key: required

Overview

Gemini TTS

Generate speech audio from text using Gemini TTS models. The default is Gemini 3.1 Flash TTS Preview, and the script still supports Gemini 2.5 preview TTS models when you pass -m.

What this skill does

  • Single-speaker text to speech
  • Two-speaker podcast-style audio
  • Style control with natural language prompts
  • WAV output that can be sent directly in chat or used in apps

Files

  • scripts/gemini_tts.sh: CLI wrapper around the Gemini REST API

Quick start

# Show all options
scripts/gemini_tts.sh --help

# Single speaker, default voice (Kore)
scripts/gemini_tts.sh "Hello, welcome to the show!"

# Pick a voice
scripts/gemini_tts.sh -v Puck "This is Puck speaking."

# With style control
scripts/gemini_tts.sh -s "Say in a warm, calm tone:" "Take a deep breath."

# Save to a specific file
scripts/gemini_tts.sh -o /tmp/greeting.wav "Hey there!"

# Multi-speaker conversation
scripts/gemini_tts.sh --multi "Host:Kore,Guest:Puck" \
  "Host: Welcome to the podcast! Guest: Thanks for having me."

The script prints the output WAV file path.
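
Because the script prints the output path, a caller can capture it for later steps. The following is a minimal sketch, assuming the path is the last line written to stdout. `TTS_CMD` and `speak_to_file` are hypothetical names introduced here, not part of the skill:

```shell
#!/usr/bin/env bash
# TTS_CMD is a hypothetical override so the wrapper can point at another path.
TTS_CMD="${TTS_CMD:-scripts/gemini_tts.sh}"

# speak_to_file ARGS...: run the script and echo only the WAV path.
# Assumption: the path is the last line the script writes to stdout.
speak_to_file() {
  local out
  out="$("$TTS_CMD" "$@")" || return 1
  printf '%s\n' "$out" | tail -n 1
}
```

A typical call would be `wav="$(speak_to_file -v Puck "This is Puck speaking.")"`, after which `$wav` can be handed to a player or uploader.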

Models

| Model | Best for |
|---|---|
| gemini-3.1-flash-tts-preview (default) | Best default now: low-latency, natural output, expressive narration |
| gemini-2.5-flash-preview-tts | Backward-compatible fast preview model |
| gemini-2.5-pro-preview-tts | Long-form narration and higher-end creative work |

Current note: Gemini 3.1 Flash TTS Preview is live and should be the default path for this skill. Gemini 2.5 preview TTS models remain useful as compatibility fallbacks.

> Preview model note: gemini-3.1-flash-tts-preview is a preview model. If Google renames or retires it, pass -m gemini-2.5-flash-preview-tts as a fallback, or check the current model list.

Switch model examples:

scripts/gemini_tts.sh -m gemini-2.5-pro-preview-tts "Your text here"
scripts/gemini_tts.sh -m gemini-2.5-flash-preview-tts "Your text here"
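
The fallback advice above can be automated. This is a rough sketch, assuming the script exits with a non-zero status when a model is unavailable (the README does not confirm that). `TTS_CMD` and `tts_with_fallback` are hypothetical names:

```shell
#!/usr/bin/env bash
# TTS_CMD is a hypothetical override for the script location.
TTS_CMD="${TTS_CMD:-scripts/gemini_tts.sh}"

# tts_with_fallback ARGS...: try the default model first, then each
# 2.5 preview model via -m until one call succeeds.
# Assumption: a failed call returns a non-zero exit status.
tts_with_fallback() {
  local model
  "$TTS_CMD" "$@" && return 0
  for model in gemini-2.5-flash-preview-tts gemini-2.5-pro-preview-tts; do
    "$TTS_CMD" -m "$model" "$@" && return 0
  done
  return 1
}
```

Do not pass your own -m to this wrapper, since it adds one on each retry.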

Voices

Available prebuilt voices:

Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat, Laomedeia, Achernar, Schedar, Rasalgethi, Nashira, Enif

The same 30-voice library is shared between gemini-3.1-flash-tts-preview and the gemini-2.5-flash-preview-tts / gemini-2.5-pro-preview-tts fallbacks, so a voice you pick for the default model will still work if you drop back to a fallback via -m.
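
A typo in a voice name is easier to catch before the API call. A small sketch that checks a name against the list as printed in this README; `is_known_voice` is a hypothetical helper, and the exact, case-sensitive match is an assumption (the API may be more lenient, and the list may change with preview releases):

```shell
#!/usr/bin/env bash
# Voice names copied from the list in this README.
VOICES="Zephyr Puck Charon Kore Fenrir Leda Orus Aoede Callirrhoe Autonoe
Enceladus Iapetus Umbriel Algieba Despina Erinome Gacrux Pulcherrima Achird
Zubenelgenubi Vindemiatrix Sadachbia Sadaltager Sulafat Laomedeia Achernar
Schedar Rasalgethi Nashira Enif"

# is_known_voice NAME: succeed only if NAME matches a listed voice exactly.
is_known_voice() {
  local v
  for v in $VOICES; do
    [ "$v" = "$1" ] && return 0
  done
  return 1
}
```

For example, `is_known_voice "$voice" || echo "unknown voice: $voice" >&2` could run before passing `-v "$voice"`.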

Style control

Gemini 3.1 Flash TTS reads plain transcripts naturally, but gives you two complementary ways to steer the delivery when you want more control.

Inline audio tags

Drop bracketed directions into the transcript. They modify what follows, can appear anywhere, and can stack or repeat across a single script:

[excitedly] Massive update today — [whispers] but keep it between us. [laughs]

Tags are open-ended; anything in [ ] is treated as a direction to the model. A useful starting set:

  • Emotion: [excitedly], [bored], [reluctantly], [amazed], [curious], [mischievously], [panicked], [sarcastic], [serious], [tired], [trembling]
  • Pace and volume: [very fast], [very slowly], [asmr], [deep and loud shouting], [whispers]
  • Non-verbal: [gasp], [giggles], [sighs], [snorts], [cough], [laughs], [crying]
  • Character / style: [like dracula], [like a dog], [singing], [sarcastically, one painfully slow word at a time]
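
Since anything in [ ] is read as a direction, a plain-text copy of the same script (for captions, say) needs the tags removed. A small sketch using standard `sed`; `strip_tags` is a hypothetical helper name, not something the skill ships:

```shell
#!/usr/bin/env bash
# strip_tags TEXT: drop every [bracketed] direction plus one following
# space, then trim trailing whitespace left behind by a final tag.
strip_tags() {
  printf '%s\n' "$1" | sed -E 's/\[[^]]*\] ?//g; s/[[:space:]]+$//'
}
```

The tagged version goes to the TTS script and the stripped version to whatever displays the text.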

Structured context prompt

For longer pieces where you want a consistent persona, prepend an AUDIO PROFILE / SCENE / DIRECTOR'S NOTES / TRANSCRIPT block. The four headers are load-bearing — the model uses them to separate performance context from the script it should actually speak:

# AUDIO PROFILE: Jaz, London morning-show radio DJ

## THE SCENE: 10 PM, neon-lit studio, "ON AIR" tally blazing.
Jaz is bouncing on their heels, hands on the faders, infectious energy.

### DIRECTOR'S NOTES
Style: vocal smile always audible; punchy consonants; elongated vowels on excitement words.
Accent: Brixton, London.
Pace: energetic, bouncing cadence, no dead air.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! [shouting] Turn it up!

Inline tags inside #### TRANSCRIPT override the baseline direction when you want a specific beat.
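
When the same persona is reused across many scripts, the four-header block can be assembled from variables. A minimal sketch; `build_prompt` is a hypothetical helper, and only the header text is taken from the example above:

```shell
#!/usr/bin/env bash
# build_prompt PROFILE SCENE NOTES TRANSCRIPT
# Prints the four-header block in the order shown in this README.
build_prompt() {
  local profile="$1" scene="$2" notes="$3" transcript="$4"
  printf '# AUDIO PROFILE: %s\n\n' "$profile"
  printf '## THE SCENE: %s\n\n' "$scene"
  printf "### DIRECTOR'S NOTES\n%s\n\n" "$notes"
  printf '#### TRANSCRIPT\n%s\n' "$transcript"
}
```

Whether the script accepts a multi-line block as its text argument is an assumption; check `scripts/gemini_tts.sh --help` before relying on it.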

Tips

  • Keep the script and direction coherent — the speaker, what is said, and how it is said should agree.
  • Don't overspecify. Give the model space to fill gaps; it reads better.
  • A simple preamble ("Say cheerfully: ...") still works for quick one-offs, but inline tags give you per-phrase control and structured prompts give you persona consistency.

Full prompting reference: Gemini speech-generation docs.

Multi-speaker

Up to 2 speakers. Use --multi "Name1:Voice1,Name2:Voice2" and make sure the speaker names in the text match.
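
A mismatch between the --multi spec and the names in the text is easy to miss. The sketch below checks both rules before calling the script. `check_multi` is a hypothetical helper, and matching on "Name:" is an assumption based on the Quick start example:

```shell
#!/usr/bin/env bash
# check_multi SPEC TEXT
# SPEC looks like "Name1:Voice1,Name2:Voice2". Fails if a speaker name
# never appears as "Name:" in TEXT, or if more than 2 speakers are given.
check_multi() {
  local spec="$1" text="$2" pair name count=0
  local IFS=','
  for pair in $spec; do
    name="${pair%%:*}"          # part before the first colon
    count=$((count + 1))
    case "$text" in
      *"$name:"*) ;;            # speaker label found in the text
      *) echo "speaker '$name' not found in text" >&2; return 1 ;;
    esac
  done
  [ "$count" -le 2 ] || { echo "at most 2 speakers are supported" >&2; return 1; }
}
```

Run it with the same two strings you are about to pass to `--multi` and the script.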

Supported languages

70+ languages are supported, including Arabic, Bengali, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, Vietnamese, and many more. See the Gemini speech-generation docs for the full locale list.

Limitations

  • Audio output only
  • Maximum 2 speakers in multi-speaker mode
  • Preview model names may change
  • No SSML support
  • No custom voice cloning

Verification

Basic smoke test once your API key is set:

export GEMINI_API_KEY=your_key_here   # GOOGLE_API_KEY is also accepted
scripts/gemini_tts.sh -o /tmp/gemini-test.wav "This is a Gemini TTS smoke test."
file /tmp/gemini-test.wav

Expected result: a playable WAV file is created (24 kHz mono, 16-bit PCM WAV).
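
If `file` is not available, the expected format can be read straight from the WAV header. A rough sketch, assuming a canonical 44-byte PCM header (fmt chunk directly after the RIFF header) and a little-endian machine, since `od` decodes integers in host byte order; `wav_info` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# wav_info FILE: print "channels sample_rate bits_per_sample".
# Offsets follow the canonical PCM WAV layout: channels at byte 22,
# sample rate at byte 24, bits per sample at byte 34.
wav_info() {
  local f="$1" ch rate bits
  ch="$(od -An -tu2 -j22 -N2 "$f" | tr -d '[:space:]')"
  rate="$(od -An -tu4 -j24 -N4 "$f" | tr -d '[:space:]')"
  bits="$(od -An -tu2 -j34 -N2 "$f" | tr -d '[:space:]')"
  echo "$ch $rate $bits"
}
```

For the smoke-test file, `wav_info /tmp/gemini-test.wav` should print `1 24000 16` if the output matches the format stated above.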

Version history

1 version in total

  • v1.0.3 (current)
    2026-05-02 15:33 · Safe · Safe

Security scan

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
