Supertonic TTS

On-device multilingual text-to-speech using Supertonic (Supertone). Use when the user needs local/offline TTS, voice generation, speech synthesis, or convert...
Author: pratyushchauhan
clawhub · v1.0.1 · 2 versions · API key: not required

Overview

Supertonic TTS Skill

Local, multilingual text-to-speech powered by Supertone's Supertonic ONNX model.

Core Features

  • Offline synthesis — Base TTS runs on-device via ONNX with no API key and no cloud calls during generation.
  • Tiny footprint — 66M–99M parameters. Runs on Pi, browser, e-reader, phone.
  • Stupid fast — Up to 167× real-time on consumer hardware. 4s of audio in ~25ms.
  • Studio output — 44.1kHz 16-bit mono WAV, no upsampler needed.
  • 31 languages — Full multilingual support with lang="na" auto-detect fallback.
  • Voice cloning — Clone any voice via online Voice Builder (requires uploading a reference clip to a third-party service), then deploy the exported voice style offline permanently.
  • Expression tags — Only <laugh> is user-verified to produce audible expression. The breathing and sighing tags are weak/unconfirmed. All others fail silently.
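
The speed figures in the list above can be sanity-checked with ordinary real-time-factor arithmetic: seconds of audio produced divided by seconds of compute. This is a minimal sketch; `realtime_factor` is an illustrative helper name and not part of the Supertonic SDK.

```python
def realtime_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Return how many times faster than real time the synthesis ran."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# 4 s of audio generated in about 25 ms, as quoted in the feature list
rtf = realtime_factor(4.0, 0.025)
print(f"{rtf:.0f}x real time")
```

The quoted example works out to roughly 160×, which sits just under the "up to 167×" peak figure.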

Prerequisites

Requires the Python SDK and model assets. Install once:

pip install supertonic

First run auto-downloads ~400MB of ONNX models from Hugging Face into ~/.cache/supertonic3/.
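
To confirm that the first-run download has completed, one option is to measure the cache directory. The sketch below uses only the standard library and the cache path quoted above; `cache_size_mb` is a hypothetical helper, and the layout of files inside the directory is not assumed.

```python
from pathlib import Path

# Cache location quoted in the text above
DEFAULT_CACHE = Path.home() / ".cache" / "supertonic3"


def cache_size_mb(cache_dir: Path = DEFAULT_CACHE) -> float:
    """Total size of all files under cache_dir in megabytes (0.0 if missing)."""
    if not cache_dir.is_dir():
        return 0.0
    total = sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())
    return total / 1_000_000


if __name__ == "__main__":
    print(f"{DEFAULT_CACHE}: {cache_size_mb():.1f} MB")
```

A result far below the ~400MB mentioned above would suggest the models have not been fully downloaded yet.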

Quick Use

Python SDK

from supertonic import TTS

tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")

wav, duration = tts.synthesize(
    text="Your text here",
    lang="en",           # language code or "na" for auto-detect
    voice_style=style,
    total_steps=8,       # quality: 5 (low) to 12 (high)
    speed=1.0,           # 0.7 (slow) to 2.0 (fast)
)

tts.save_audio(wav, "output.wav")
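
The comments in the example above give documented ranges for `total_steps` and `speed`. Whether the SDK itself rejects out-of-range values is not stated here, so a caller may want to check them first. A minimal sketch, where `check_synthesis_params` is a hypothetical helper rather than an SDK function:

```python
def check_synthesis_params(total_steps: int = 8, speed: float = 1.0) -> dict:
    """Validate values against the documented ranges and return them as kwargs."""
    if not 5 <= total_steps <= 12:
        raise ValueError(f"total_steps must be in 5..12, got {total_steps}")
    if not 0.7 <= speed <= 2.0:
        raise ValueError(f"speed must be in 0.7..2.0, got {speed}")
    return {"total_steps": total_steps, "speed": speed}


# Intended use with the SDK call shown above (not executed here):
#   wav, duration = tts.synthesize(
#       text="Your text here", lang="en", voice_style=style,
#       **check_synthesis_params(total_steps=10, speed=1.1),
#   )
params = check_synthesis_params(total_steps=10, speed=1.1)
print(params)
```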

CLI (via supertonic package)

# Basic synthesis
supertonic tts "Hello world" -o output.wav

# Pick voice and quality
supertonic tts "Use a different voice." -o output.wav --voice F1 --steps 10

# Custom cloned voice
supertonic tts "Hello in my voice." -o output.wav --custom-style-path voices/my_voice.json

# Multilingual
supertonic tts "こんにちは" -o japanese.wav --lang ja
supertonic tts "Bonjour" -o french.wav --lang fr
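
When the CLI is driven from another program, building the argument list explicitly avoids shell-quoting problems with non-ASCII text. The sketch below uses only the flags shown in the commands above; `build_tts_command` and `run_tts` are hypothetical helper names, and whether `--voice` and `--custom-style-path` may be combined is not stated in this document.

```python
import subprocess
from typing import List, Optional


def build_tts_command(
    text: str,
    output: str,
    voice: Optional[str] = None,
    steps: Optional[int] = None,
    lang: Optional[str] = None,
    custom_style_path: Optional[str] = None,
) -> List[str]:
    """Assemble a `supertonic tts` argument list from the flags shown above."""
    cmd = ["supertonic", "tts", text, "-o", output]
    if voice is not None:
        cmd += ["--voice", voice]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    if lang is not None:
        cmd += ["--lang", lang]
    if custom_style_path is not None:
        cmd += ["--custom-style-path", custom_style_path]
    return cmd


def run_tts(**kwargs) -> None:
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tts_command(**kwargs), check=True)


print(build_tts_command("こんにちは", "japanese.wav", lang="ja"))
```

Passing a list to `subprocess.run` bypasses the shell, so the text is handed to the CLI unchanged.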

Skill Scripts

cd ~/.openclaw/workspace/skills/supertonic-tts/scripts
source ~/.openclaw/workspace/.browser-use-venv/bin/activate

# Quick synthesis
python3 synthesize.py "Hello world" --voice M1 --output ~/hello.wav

# With expression tags (only <laugh> is confirmed to work)
python3 synthesize.py "You did it <laugh> I am so proud." --voice M5 --output laugh.wav

# Custom voice
python3 synthesize.py "Hello" --custom-style my_voice.json --output cloned.wav

# Japanese
python3 synthesize.py "こんにちは" --voice F3 --lang ja

# List voices
python3 list_voices.py

Voices

10 built-in voices: F1–F5 (female), M1–M5 (male).

Voice cloning: Record a short clip → upload to Voice Builder (online service, see privacy note in references/voices.md) → export JSON → load with get_voice_style_from_path().

See references/voices.md for voice descriptions and Voice Builder workflow.
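
The built-in voice names follow a simple pattern, so they can be enumerated and validated before a synthesis call. A small sketch under that assumption; `BUILTIN_VOICES` and `voice_gender` are illustrative names, not SDK symbols.

```python
# F1-F5 (female) and M1-M5 (male), as listed above
BUILTIN_VOICES = [f"{prefix}{n}" for prefix in ("F", "M") for n in range(1, 6)]


def voice_gender(name: str) -> str:
    """Return 'female' or 'male' for a built-in voice name, else raise."""
    if name not in BUILTIN_VOICES:
        raise ValueError(f"unknown built-in voice {name!r}; expected one of {BUILTIN_VOICES}")
    return "female" if name.startswith("F") else "male"


print(BUILTIN_VOICES)
print(voice_gender("M5"))
```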

Expression Tags

> ⚠️ Mostly non-functional in practice
>
> Supertonic accepts inline self-closing tags, but only <laugh> has been user-verified to produce a clearly audible expression (a laughter burst). The breathing and sighing tags may insert minor pauses but are not confirmed to produce audible breathing or sighing sounds.
>
> Do not rely on tags for expression. The tested tags that failed to produce an audible effect are listed in references/expression-tags.md.

Correct syntax (self-closing, inline):

text = "You did it <laugh> I am so proud."
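
Because unverified tags are reported to fail silently, one defensive option is to strip everything except the verified tag before synthesis. The sketch below is an assumption-laden illustration: `strip_unverified_tags` is a hypothetical helper, the regex only guesses at what counts as a tag, and `<sigh>` appears purely as an example of an unverified tag name.

```python
import re

# Only <laugh> is reported as verified in this document
VERIFIED_TAGS = {"laugh"}

# Matches inline tags such as <laugh> or <laugh/>
_TAG = re.compile(r"<([A-Za-z_][\w-]*)\s*/?>")


def strip_unverified_tags(text: str) -> str:
    """Remove inline tags that are not in VERIFIED_TAGS and tidy whitespace."""
    def keep_or_drop(match: re.Match) -> str:
        return match.group(0) if match.group(1).lower() in VERIFIED_TAGS else ""

    cleaned = _TAG.sub(keep_or_drop, text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


# "<sigh>" is a hypothetical unverified tag used only for illustration
print(strip_unverified_tags("Well <sigh> you did it <laugh> I am so proud."))
```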

Reliable alternative for emotion: explicit language + speed modulation:

| Emotion | Technique |
|---------|-----------|
| Happy | Upbeat words + speed=1.1 |
| Sad | Subdued words + speed=0.85 |
| Excited | Exclamations + speed=1.15 |
| Urgent | Short imperatives + speed=1.2 |

See references/expression-tags.md for full testing results.
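
The speed half of the emotion technique above can be captured in a lookup table; the wording of the text still has to carry the emotion itself. A minimal sketch, where `EMOTION_SPEED` and `speed_for` are illustrative names:

```python
# Speed multipliers taken from the emotion table above
EMOTION_SPEED = {
    "happy": 1.1,
    "sad": 0.85,
    "excited": 1.15,
    "urgent": 1.2,
}


def speed_for(emotion: str, default: float = 1.0) -> float:
    """Return the suggested speed for an emotion, or the neutral default."""
    return EMOTION_SPEED.get(emotion.strip().lower(), default)


print(speed_for("Sad"))
print(speed_for("neutral"))
```

The returned value is meant to be passed as the `speed` argument of the synthesis call.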

Parameters

| Param | Range | Default | What It Does |
|-------|-------|---------|--------------|
| total_steps | 5–12 | 8 | Quality vs speed tradeoff |
| speed | 0.7–2.0 | 1.0 | Speech rate multiplier |
| max_chunk_length | any | 300 | Break long text into chunks (120 for Korean) |
| silence_duration | any | 0.3 | Pause between chunks (seconds) |
| lang | ISO 639-1 or "na" | "en" | "na" = language-agnostic auto-detect |
| verbose | True/False | False | Show detailed progress |
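
To make `max_chunk_length` and `silence_duration` concrete, the sketch below shows one plausible way long text could be chunked and how a pause length converts to samples at 44.1kHz. It is not the SDK's actual splitting algorithm, which this document does not describe; `split_into_chunks` and `silence_samples` are hypothetical names.

```python
import re
from typing import List

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_into_chunks(text: str, max_chunk_length: int = 300) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_length chars."""
    if max_chunk_length <= 0:
        raise ValueError("max_chunk_length must be positive")
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chunk_length:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chunk_length])
            sentence = sentence[max_chunk_length:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chunk_length:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def silence_samples(silence_duration: float = 0.3, sample_rate: int = 44_100) -> int:
    """Number of zero samples needed for a pause of silence_duration seconds."""
    return int(round(silence_duration * sample_rate))


print(split_into_chunks("One. Two. Three.", max_chunk_length=10))
print(silence_samples())
```

With the defaults from the table, each pause between chunks corresponds to 13,230 samples of silence.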

Languages

31 languages + na (language-agnostic auto-detect). See references/languages.md for all codes.

Output

  • Format: 44.1kHz 16-bit mono WAV
  • Returns: (wav_array, duration_array)
  • wav.shape = (1, num_samples)
  • duration[0] = length in seconds
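
The output format above maps directly onto the standard library `wave` module. The sketch below illustrates the format only; the documented route for saving audio is `tts.save_audio()`. It assumes float samples in the range [-1.0, 1.0], which this document does not confirm, and `write_pcm16_mono` and `duration_seconds` are hypothetical helpers.

```python
import struct
import wave
from typing import Sequence

SAMPLE_RATE = 44_100  # 44.1kHz, as stated in the Output section


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of a mono signal in seconds."""
    return num_samples / sample_rate


def write_pcm16_mono(samples: Sequence[float], path: str,
                     sample_rate: int = SAMPLE_RATE) -> None:
    """Write samples as a 16-bit mono WAV file.

    Assumption: samples are floats in [-1.0, 1.0]; values outside are clipped.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


print(duration_seconds(88_200))
```

For an array shaped `(1, num_samples)` as described above, the first row (`wav[0]`) would be the sequence to pass in.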

Multi-Runtime Deployment

Supertonic runs across: Python, Node.js, Browser (WebGPU), Java, C++, C#, Go, Swift, iOS, Rust, Flutter.

Scripts

  • scripts/synthesize.py — CLI for quick text-to-speech (supports custom voices)
  • scripts/list_voices.py — Available voices and metadata

References

  • references/voices.md — Voice descriptions, selection guide, Voice Builder workflow
  • references/expression-tags.md — All tags, examples, caveats
  • references/languages.md — Supported language codes
  • references/deployment.md — Multi-runtime deployment options

Version History

2 versions in total

  • v1.0.1 (current)
    2026-05-29 13:41
  • v1.0.0
    2026-05-21 14:25 · Safe · Safe
