
Gladia Documentation Auto

Comprehensive Gladia speech-to-text reference auto-synced from docs.gladia.io. Use as a general-purpose fallback when other specialized skills don't match, o...
Source: gladiaio · Uncategorized · clawhub v1.0.2 (2 versions) · API key required

Overview

> SDK-first: always use the official SDK — see gladia-sdk-integration for policy, setup, and fallback criteria.

References

Consult these sibling skills as needed:

  • ../gladia-sdk-integration/SKILL.md -- SDK setup, client initialization, error handling, and SDK vs raw API decision guide
  • ../gladia-sdk-integration/references/sdk-versions.md -- Current SDK versions (auto-synced by CI)
  • ../gladia-troubleshooting/SKILL.md -- Common errors, gotchas, and verification checklist
  • ../gladia-live-transcription/SKILL.md -- Live streaming transcription
  • ../gladia-pre-recorded-transcription/SKILL.md -- Pre-recorded file transcription

name: Gladia

description: Use when building speech transcription features, processing audio/video files, implementing real-time transcription, extracting insights from audio (speaker identification, translation, sentiment), or integrating voice capabilities into applications. Agents should reach for this skill when users request transcription, audio analysis, or voice-driven features.

metadata:

mintlify-proj: gladia

version: "1.0"


Gladia Speech-to-Text API

Product summary

Gladia is a speech-to-text API that transcribes audio and video files in two modes: pre-recorded (asynchronous, file-based) and live (real-time, WebSocket-based). Beyond transcription, it provides audio intelligence features like speaker diarization, translation, sentiment analysis, PII redaction, and custom vocabulary matching. Agents use Gladia to build transcription workflows, extract structured data from audio, and power voice-driven applications.

Key files and commands:

  • SDKs: JavaScript (@gladiaio/sdk) and Python (gladiaio-sdk)
  • Authentication: Send your API key in the x-gladia-key header
  • Pre-recorded endpoint: POST /v2/pre-recorded (create job), GET /v2/pre-recorded/:id (poll results)
  • Live endpoint: POST /v2/live (init session), WebSocket connection for streaming audio
  • Primary docs: https://docs.gladia.io

When to use

Reach for this skill when:

  • A user wants to transcribe audio or video files (meetings, podcasts, calls, interviews)
  • Building real-time transcription (voice agents, live captions, meeting recorders)
  • Extracting speaker information (who said what, speaker count, speaker identification)
  • Translating transcripts to other languages or generating subtitles
  • Detecting sensitive information (PII redaction for GDPR/HIPAA compliance)
  • Improving transcription accuracy with domain-specific vocabulary
  • Analyzing sentiment or emotions in speech
  • Integrating with third-party platforms (Twilio, Vapi, LiveKit, Pipecat, etc.)

Quick reference

Authentication

```bash
# All requests require the x-gladia-key header
curl --header 'x-gladia-key: YOUR_API_KEY' https://api.gladia.io/v2/...
```

Pre-recorded workflow (file-based)

  1. Upload audio: POST /v2/upload → get audio_url
  2. Create job: POST /v2/pre-recorded with audio_url and options
  3. Poll result: GET /v2/pre-recorded/:id until status: "done"

Alternatively, the SDK's transcribe() method runs the whole workflow in one call.
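When the SDK is not an option (the fallback criteria live in gladia-sdk-integration), the three REST steps can be sketched with plain fetch. This is a minimal sketch, not Gladia's client. It assumes Node 18+ for the global fetch, FormData, and Blob; that /v2/upload takes a multipart field named audio and returns audio_url; that POST /v2/pre-recorded returns an id; and that GET /v2/pre-recorded/:id reports a status of "done" or "error" once finished. The function names and the injectable fetchImpl parameter are hypothetical, added so the flow can be exercised without network access.

```javascript
// Minimal REST sketch of upload -> create -> poll (response shapes are assumptions).
const GLADIA_API = "https://api.gladia.io/v2";

// Step 1: upload raw audio bytes; assumes the multipart field is named "audio".
async function uploadAudio(apiKey, bytes, filename, fetchImpl = fetch) {
  const form = new FormData();
  form.append("audio", new Blob([bytes]), filename);
  const res = await fetchImpl(`${GLADIA_API}/upload`, {
    method: "POST",
    headers: { "x-gladia-key": apiKey },
    body: form,
  });
  if (!res.ok) throw new Error(`Upload failed: HTTP ${res.status}`);
  return (await res.json()).audio_url;
}

// Step 2: create the job; `options` holds feature flags such as diarization.
async function createJob(apiKey, audioUrl, options = {}, fetchImpl = fetch) {
  const res = await fetchImpl(`${GLADIA_API}/pre-recorded`, {
    method: "POST",
    headers: { "x-gladia-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ audio_url: audioUrl, ...options }),
  });
  if (!res.ok) throw new Error(`Job creation failed: HTTP ${res.status}`);
  return (await res.json()).id;
}

// Step 3: poll every intervalMs (2-3 s avoids hammering the API) until "done" or "error".
// The default cap of 720 attempts x 2.5 s is about 30 minutes.
async function pollJob(apiKey, id, { intervalMs = 2500, maxAttempts = 720, fetchImpl = fetch } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchImpl(`${GLADIA_API}/pre-recorded/${id}`, {
      headers: { "x-gladia-key": apiKey },
    });
    if (!res.ok) throw new Error(`Polling failed: HTTP ${res.status}`);
    const job = await res.json();
    if (job.status === "done") return job;
    if (job.status === "error") throw new Error(`Job ${id} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${id} did not finish after ${maxAttempts} polls`);
}
```

Chained together: `const id = await createJob(apiKey, await uploadAudio(apiKey, bytes, "call.mp3"), { diarization: true });` followed by `const job = await pollJob(apiKey, id);`. Webhooks or callbacks remove the polling loop entirely.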

Live workflow (real-time)

  1. Init session: POST /v2/live with audio config (encoding, sample_rate, bit_depth, channels)
  2. Connect WebSocket: Use returned url to open WebSocket connection
  3. Send audio: Stream audio chunks as binary or base64-encoded JSON
  4. Read messages: Receive transcript, translation, sentiment, etc. via WebSocket
  5. Stop: Send a stop_recording message; the server closes the WebSocket once post-processing is done
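The five live steps map onto one HTTP call and one WebSocket. This is a minimal sketch that assumes POST /v2/live returns { id, url }, that audio may be sent as binary frames, and that the stop message is the JSON object { type: "stop_recording" }. It relies on a browser-style WebSocket (global in Node 22+) or an injected class with the same onopen/onmessage/onclose/send interface. initLiveSession, streamAudio, and their options are hypothetical names. chunks is a pre-collected array only to keep the example short; a real source would call send as audio arrives.

```javascript
// Minimal live-flow sketch; endpoint response and message shapes are assumptions.
const GLADIA_LIVE_API = "https://api.gladia.io/v2/live";

// Step 1: init the session; the audio config must match the real stream.
async function initLiveSession(apiKey, config, fetchImpl = fetch) {
  const res = await fetchImpl(GLADIA_LIVE_API, {
    method: "POST",
    headers: { "x-gladia-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify(config),
  });
  if (!res.ok) throw new Error(`Session init failed: HTTP ${res.status}`);
  const { id, url } = await res.json();
  return { id, url };
}

// Steps 2-5: connect, send binary chunks, request a stop, and collect messages.
// Resolves with every parsed message once the server closes the socket.
function streamAudio(url, chunks, { WebSocketImpl = globalThis.WebSocket, onMessage = () => {} } = {}) {
  return new Promise((resolve, reject) => {
    const messages = [];
    const ws = new WebSocketImpl(url);
    ws.onopen = () => {
      for (const chunk of chunks) ws.send(chunk); // raw audio bytes as binary frames
      ws.send(JSON.stringify({ type: "stop_recording" }));
    };
    ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      messages.push(message);
      onMessage(message);
    };
    ws.onerror = (err) => reject(err);
    ws.onclose = () => resolve(messages); // server closes after post-processing
  });
}
```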

Audio formats supported

| Type | Examples |
|------|----------|
| Audio | MP3, WAV, FLAC, AAC, OGG, Opus, M4A |
| Video | MP4, MOV, AVI, WebM, Matroska |
| Online | TikTok, Instagram, Facebook, Vimeo, LinkedIn, YouTube (via URL) |

File limits

  • Pre-recorded: Max 135 minutes (2h15m) or 1000 MB; enterprise plans support 4h15m
  • Live: Max 3 hours per session
  • Recommendation: Split files >60 minutes for better quality
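These limits translate directly into a pre-flight check and a chunk plan. This sketch works under stated assumptions: durations are in minutes, the 1000 MB cap is assumed to apply on enterprise plans too because the list does not say otherwise, and checkPreRecorded and planChunks are hypothetical helpers. Cutting the audio itself (for example with ffmpeg) is out of scope.

```javascript
// Limits copied from the list above; enterprise size cap is an assumption.
const LIMITS = {
  maxSizeMb: 1000,
  maxMinutes: 135, // 2h15m standard
  maxMinutesEnterprise: 255, // 4h15m on enterprise plans
  recommendedChunkMinutes: 60,
};

// Report every limit a file would break, plus whether splitting is recommended.
function checkPreRecorded({ sizeMb, durationMinutes, enterprise = false }) {
  const maxMinutes = enterprise ? LIMITS.maxMinutesEnterprise : LIMITS.maxMinutes;
  const problems = [];
  if (sizeMb > LIMITS.maxSizeMb) problems.push(`size ${sizeMb} MB exceeds ${LIMITS.maxSizeMb} MB`);
  if (durationMinutes > maxMinutes) problems.push(`duration ${durationMinutes} min exceeds ${maxMinutes} min`);
  return {
    ok: problems.length === 0,
    problems,
    recommendSplit: durationMinutes > LIMITS.recommendedChunkMinutes,
  };
}

// Split a recording into consecutive [start, end) minute ranges of at most chunkMinutes.
function planChunks(durationMinutes, chunkMinutes = LIMITS.recommendedChunkMinutes) {
  const chunks = [];
  for (let start = 0; start < durationMinutes; start += chunkMinutes) {
    chunks.push([start, Math.min(start + chunkMinutes, durationMinutes)]);
  }
  return chunks;
}
```

For example, `planChunks(150)` returns [[0, 60], [60, 120], [120, 150]], and `checkPreRecorded({ sizeMb: 400, durationMinutes: 150 })` reports the duration problem and recommends a split.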

Common audio intelligence features

| Feature | Pre-recorded | Live | Use case |
|---------|--------------|------|----------|
| Diarization | ✓ | ✓ | Identify speakers, separate voices |
| Translation | ✓ | ✓ | Translate to 100+ languages |
| Subtitles | ✓ | - | Generate SRT/VTT files |
| Custom vocabulary | ✓ | ✓ | Fix domain-specific terms |
| Custom spelling | ✓ | ✓ | Normalize misspelled words |
| Sentiment analysis | ✓ | ✓ | Detect sentiment & emotions |
| PII redaction | ✓ | - | Mask sensitive data (GDPR/HIPAA) |
| Named entity recognition | ✓ | ✓ | Extract people, places, dates |
| Summarization | ✓ | - | Auto-generate summaries |
| Chapterization | ✓ | - | Split into chapters/segments |
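For pre-recorded jobs, most rows of the table correspond to a boolean flag in the request body, with an optional *_config object. This sketch assumes the flag names below. diarization, translation, sentiment_analysis, and pii_redaction appear in the createUntyped example in the workflow section. subtitles (with subtitles_config.formats), named_entity_recognition, summarization, and chapterization are assumed from Gladia's API reference and should be checked there. featureOptions and its short feature keys are hypothetical.

```javascript
// Hypothetical short keys mapped to assumed pre-recorded flag names.
const SIMPLE_FLAGS = {
  diarization: "diarization",
  sentiment: "sentiment_analysis",
  pii: "pii_redaction",
  ner: "named_entity_recognition",
  summary: "summarization",
  chapters: "chapterization",
};

// Build pre-recorded request options for a list of wanted features.
function featureOptions(features, { targetLanguages = [], subtitleFormats = ["srt", "vtt"] } = {}) {
  const options = {};
  for (const feature of features) {
    if (Object.hasOwn(SIMPLE_FLAGS, feature)) {
      options[SIMPLE_FLAGS[feature]] = true;
    } else if (feature === "translation") {
      // Translation needs at least one target language.
      if (targetLanguages.length === 0) throw new Error("translation requires targetLanguages");
      options.translation = true;
      options.translation_config = { target_languages: targetLanguages };
    } else if (feature === "subtitles") {
      options.subtitles = true;
      options.subtitles_config = { formats: subtitleFormats };
    } else {
      throw new Error(`unknown feature: ${feature}`);
    }
  }
  return options;
}
```

The result of `featureOptions(["diarization", "subtitles", "translation"], { targetLanguages: ["fr"] })` can be merged into the pre-recorded request body alongside audio_url.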

Decision guidance

When to use pre-recorded vs. live

| Scenario | Use pre-recorded | Use live |
|----------|------------------|----------|
| User uploads a file to transcribe | ✓ | - |
| Real-time transcription (voice agent, meeting) | - | ✓ |
| Post-processing (subtitles, translation, summarization) | ✓ | - |
| Low-latency response needed | - | ✓ |
| Batch processing multiple files | ✓ | - |

When to use custom vocabulary vs. custom spelling

| Situation | Use custom vocabulary | Use custom spelling |
|-----------|-----------------------|---------------------|
| Model outputs garbled/phonetically wrong text | ✓ | - |
| Model outputs recognizable but misspelled word | - | ✓ |
| Domain-specific terms (brand names, jargon) | ✓ | - |
| Normalizing variant spellings | - | ✓ |
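The two features take different inputs. Vocabulary takes terms the model should listen for; spelling takes a map from a canonical form to the variants that should be replaced by it. This sketch assumes the field names custom_vocabulary_config.vocabulary (items with value and pronunciations), default_intensity, and custom_spelling_config.spelling_dictionary. The 0.4 default follows the intensity advice under Common gotchas, and vocabularyAndSpelling is a hypothetical helper.

```javascript
// Build custom vocabulary and/or custom spelling options (field names assumed).
function vocabularyAndSpelling({ terms = [], spellings = {}, defaultIntensity = 0.4 } = {}) {
  if (defaultIntensity < 0 || defaultIntensity > 1) {
    throw new RangeError("intensity must be between 0 and 1");
  }
  const options = {};
  if (terms.length > 0) {
    // Vocabulary: bias recognition toward terms the model garbles phonetically.
    options.custom_vocabulary = true;
    options.custom_vocabulary_config = {
      vocabulary: terms.map(({ value, pronunciations = [] }) => ({ value, pronunciations })),
      default_intensity: defaultIntensity,
    };
  }
  if (Object.keys(spellings).length > 0) {
    // Spelling: rewrite recognizable-but-misspelled variants to the canonical key.
    options.custom_spelling = true;
    options.custom_spelling_config = { spelling_dictionary: spellings };
  }
  return options;
}
```

For example, `vocabularyAndSpelling({ terms: [{ value: "Solaria", pronunciations: ["so laria"] }], spellings: { Gladia: ["gladya"] } })` enables both features in one options object.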

When to use diarization vs. multi-channel audio

| Scenario | Use diarization | Use multi-channel |
|----------|-----------------|-------------------|
| Single audio stream, multiple speakers | ✓ | - |
| Separate audio tracks per speaker | - | ✓ |
| Unknown number of speakers | ✓ | - |
| Known speaker count and channels | - | ✓ |
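The table, combined with the speaker-hint advice under Common gotchas, can be encoded as a small decision helper. In this sketch, speakerStrategy and its inputs are hypothetical; number_of_speakers, min_speakers, and max_speakers are the diarization_config hints this document names. No multi-channel request parameter is assumed, so that branch returns no options.

```javascript
// Choose diarization or multi-channel per the decision table (names are hypothetical).
function speakerStrategy({ separateTracks = false, channels = 1, knownSpeakers, minSpeakers, maxSpeakers } = {}) {
  if (separateTracks && channels > 1) {
    // One speaker per track: channels already separate the voices,
    // but billing is duration x number of channels.
    return { strategy: "multi-channel", options: {} };
  }
  // Single mixed stream: diarize, passing whatever speaker hints are known.
  const hints = {};
  if (Number.isInteger(knownSpeakers)) {
    hints.number_of_speakers = knownSpeakers;
  } else {
    if (Number.isInteger(minSpeakers)) hints.min_speakers = minSpeakers;
    if (Number.isInteger(maxSpeakers)) hints.max_speakers = maxSpeakers;
  }
  const options = { diarization: true };
  if (Object.keys(hints).length > 0) options.diarization_config = hints;
  return { strategy: "diarization", options };
}
```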

Workflow

Pre-recorded transcription (typical task)

  1. Understand requirements: Confirm audio format, language, and desired features (diarization, translation, subtitles, PII redaction).
  2. Check file constraints: Verify the file is under 1000 MB and 135 minutes; split it otherwise.
  3. Upload audio (if it is a local file):

```javascript
// gladiaClient: the SDK client, initialized as described in gladia-sdk-integration
const uploadResponse = await gladiaClient.preRecorded().uploadFile("path/to/audio.mp3");
const audioUrl = uploadResponse.audio_url;
```

  4. Create the transcription job with options:

```javascript
const job = await gladiaClient.preRecorded().createUntyped({
  audio_url: audioUrl,
  language_config: { languages: ["en"], code_switching: false },
  diarization: true,
  diarization_config: { min_speakers: 1, max_speakers: 5 },
  custom_vocabulary: true,
  custom_vocabulary_config: { vocabulary: ["Gladia", "Solaria"] },
  translation: true,
  translation_config: { target_languages: ["fr"], model: "base" },
  sentiment_analysis: true,
  pii_redaction: true,
  pii_redaction_config: { entity_types: ["GDPR"] },
});
```

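  5. Poll for results (or use webhooks/callbacks):

```javascript
// Poll every 2 s; stop on "error" as well as "done" so a failed job cannot loop forever
let result = await gladiaClient.preRecorded().get(job.id);
while (result.status !== "done" && result.status !== "error") {
  await new Promise((resolve) => setTimeout(resolve, 2000));
  result = await gladiaClient.preRecorded().get(job.id);
}
if (result.status === "error") {
  throw new Error(`Transcription job ${job.id} failed`);
}
```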

  6. Extract and validate results: Check the transcription.utterances, translation, sentiment_analysis, and diarization fields.
  7. Verify output: Confirm speaker attribution, translation accuracy, PII masking, and custom vocabulary replacements.
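Once the job reports done, the utterances can be flattened into readable, speaker-labelled lines for these checks. This is a minimal sketch that assumes the finished job nests utterances at result.transcription.utterances and that each utterance carries speaker, start and end (in seconds), and text. formatTranscript and speakerIds are hypothetical helpers.

```javascript
// Flatten a finished job into "[start-end] Speaker N: text" lines (shape assumed).
function formatTranscript(job) {
  const utterances = job?.result?.transcription?.utterances;
  if (!Array.isArray(utterances)) {
    throw new Error("result.transcription.utterances is missing; is the job done?");
  }
  return utterances.map((u) => {
    const speaker = u.speaker === undefined ? "Speaker ?" : `Speaker ${u.speaker}`;
    return `[${u.start.toFixed(1)}s-${u.end.toFixed(1)}s] ${speaker}: ${u.text.trim()}`;
  });
}

// Distinct speaker ids in ascending order, to compare against the expected speaker count.
function speakerIds(job) {
  const ids = job.result.transcription.utterances.map((u) => u.speaker);
  return [...new Set(ids)].sort((a, b) => a - b);
}
```

Comparing `speakerIds(job).length` with the number of people known to be on the recording is a quick first check on diarization.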

Live transcription (typical task)

  1. Understand the audio source: Confirm its encoding (e.g. wav/pcm), sample_rate, bit_depth, and channels.
  2. Initialize the session:

```javascript
// Audio settings must match the actual stream, or the output will be garbled
const liveSession = gladiaClient.liveV2().startSession({
  model: "solaria-1",
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 1,
  language_config: { languages: ["en"], code_switching: false },
  messages_config: { receive_partial_transcripts: true },
});
```

  3. Connect the WebSocket and set up handlers:

```javascript
// Log only final transcripts; partial transcripts arrive with is_final === false
liveSession.on("message", (message) => {
  if (message.type === "transcript" && message.data.is_final) {
    console.log(message.data.utterance.text);
  }
});
```

  4. Stream audio chunks as they arrive:

```javascript
// audioChunk: raw audio bytes in the encoding declared when the session started
liveSession.sendAudio(audioChunk);
```

  5. Stop recording when done:

```javascript
// Signals the end of audio; the WebSocket closes once post-processing is done
liveSession.stopRecording();
```

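  6. Retrieve final results (optional):

```javascript
// sessionId: the live session's id; apiKey: your Gladia API key
const response = await fetch(`https://api.gladia.io/v2/live/${sessionId}`, {
  headers: { "x-gladia-key": apiKey },
});
const result = await response.json();
```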
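With receive_partial_transcripts enabled, the message handler sees both partial and final transcripts. One way to combine them for live captions is an accumulator that shows the latest partial and commits finals. This sketch assumes the message shape used in the handler in step 3. createTranscriptAccumulator is a hypothetical helper, not an SDK API.

```javascript
// Keep the latest partial for display and append finals to the committed transcript.
function createTranscriptAccumulator() {
  const finals = [];
  let partial = "";
  return {
    handle(message) {
      if (message.type !== "transcript") return; // ignore translation, sentiment, lifecycle events
      const text = message.data.utterance.text.trim();
      if (message.data.is_final) {
        finals.push(text);
        partial = ""; // the final replaces the partial it supersedes
      } else {
        partial = text;
      }
    },
    // Committed text plus the in-progress partial, for live captions.
    display() {
      return [...finals, partial].filter(Boolean).join(" ");
    },
    // Only the finalized utterances.
    finalText() {
      return finals.join(" ");
    },
  };
}
```

Wire it up with `liveSession.on("message", (m) => acc.handle(m));` and render `acc.display()` on each update.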

Common gotchas

  • Empty language list with code switching: Do not set languages: [] and code_switching: true together. The detector will evaluate every utterance against 100+ languages, causing misdetections. Always provide a constrained list (3-5 languages max).
  • Forgetting audio metadata: For live transcription, encoding, sample_rate, bit_depth, and channels must match your actual audio stream. Mismatches cause garbled output.
  • Custom vocabulary intensity too high: Start at intensity: 0.4 and raise it only if terms are missed. High intensity causes false positives (unrelated words get replaced). Add pronunciation variants before raising intensity.
  • Polling without backoff: Don't hammer the API with rapid polls. Use 2-3 second intervals or webhooks/callbacks instead.
  • Exceeding file limits silently: Pre-recorded files >135 minutes or >1000 MB will fail. Split large files into ~60-minute chunks before uploading.
  • Not setting language when known: If you know the language, set languages: ["en"] explicitly. Omitting it forces detection, adding latency and risk of misdetection.
  • Diarization without hints: If you know the speaker count, set number_of_speakers or min_speakers/max_speakers. Hints improve accuracy.
  • PII redaction only for pre-recorded: PII redaction is not available for live transcription. Plan accordingly for compliance workflows.
  • Webhook/callback URL not reachable: If using webhooks, ensure your callback URL is publicly accessible and returns 2xx status. Gladia will retry failed deliveries.
  • Multi-channel audio billing: Transcribing multi-channel audio is billed by total duration × number of channels. A 1-hour 3-channel stream costs 3 hours of transcription.
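Several of these gotchas can be caught before a request is sent. This config-linter sketch assumes the field names used in this document's examples, plus custom_vocabulary_config.default_intensity for the vocabulary intensity. lintConfig and billedHours are hypothetical helpers. The thresholds (5 languages, 0.6 intensity) come from the guidance in this document rather than from an API limit.

```javascript
// Return human-readable warnings for the gotchas listed above (heuristics, not API rules).
function lintConfig(config, { mode = "pre-recorded" } = {}) {
  const warnings = [];
  const lang = config.language_config ?? {};
  const languages = lang.languages ?? [];
  if (languages.length === 0) {
    warnings.push(lang.code_switching
      ? "code_switching with an empty language list: constrain it to 3-5 languages"
      : "no language set: set languages explicitly when the language is known");
  } else if (lang.code_switching && languages.length > 5) {
    warnings.push("code_switching over more than 5 languages raises misdetection risk");
  }
  const intensity = config.custom_vocabulary_config?.default_intensity;
  if (intensity !== undefined && intensity > 0.6) {
    warnings.push(`vocabulary intensity ${intensity} > 0.6: start at 0.4 and add pronunciations first`);
  }
  const d = config.diarization_config ?? {};
  if (config.diarization && d.number_of_speakers === undefined
      && d.min_speakers === undefined && d.max_speakers === undefined) {
    warnings.push("diarization without speaker hints: add number_of_speakers or min/max_speakers if known");
  }
  if (mode === "live") {
    if (config.pii_redaction) warnings.push("pii_redaction is not available for live transcription");
    for (const field of ["encoding", "sample_rate", "bit_depth", "channels"]) {
      if (config[field] === undefined) warnings.push(`live session is missing ${field}`);
    }
  }
  return warnings;
}

// Multi-channel billing: total duration multiplied by the number of channels.
function billedHours(durationHours, channels = 1) {
  return durationHours * channels;
}
```

`billedHours(1, 3)` returns 3, matching the 1-hour, 3-channel example.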

Verification checklist

Before submitting transcription work:

  • [ ] Audio file is in a supported format (MP3, WAV, MP4, etc.) and under 1000 MB
  • [ ] File duration is under 135 minutes (or the file is split if longer)
  • [ ] API key is valid and sent in the x-gladia-key header
  • [ ] Language is set explicitly if known; avoid empty languages with code_switching: true
  • [ ] Custom vocabulary entries are tested; intensity is 0.4-0.6 unless tuned
  • [ ] Diarization hints (min/max speakers) are provided if speaker count is known
  • [ ] Webhook/callback URL (if used) is publicly accessible and returns 2xx
  • [ ] Results include expected fields: transcription.utterances, translation, sentiment_analysis, etc.
  • [ ] Speaker attribution is correct (diarization speaker field matches expected speakers)
  • [ ] PII redaction is applied (if required) and sensitive data is masked
  • [ ] Translation accuracy is spot-checked for domain-specific terms
  • [ ] Subtitles (if generated) have correct timing and formatting

Resources

  • Comprehensive page listing: https://docs.gladia.io/llms.txt
  • Getting started: https://docs.gladia.io/chapters/introduction/getting-started
  • Pre-recorded quickstart: https://docs.gladia.io/chapters/pre-recorded-stt/quickstart
  • Live transcription quickstart: https://docs.gladia.io/chapters/live-stt/quickstart
  • Audio intelligence features: https://docs.gladia.io/chapters/audio-intelligence/
  • Recommended parameters by use case: https://docs.gladia.io/chapters/pre-recorded-stt/recommended-parameters
  • API reference: https://docs.gladia.io/api-reference/
  • SDK documentation: https://docs.gladia.io/chapters/integrations/sdk

> For additional documentation and navigation, see: https://docs.gladia.io/llms.txt


> This file is auto-synced from https://docs.gladia.io/.well-known/agent-skills/gladia/skill.md

> Do not edit manually — changes will be overwritten by CI.


Version history

2 versions in total

  • v1.0.2 (current)
    2026-06-11 18:49
  • v1.0.1
    2026-06-09 19:33

