> SDK-first: always use the official SDK — see gladia-sdk-integration for policy, setup, and fallback criteria.
Consult these sibling skills as needed:
name: Gladia
description: Use when building speech transcription features, processing audio/video files, implementing real-time transcription, extracting insights from audio (speaker identification, translation, sentiment), or integrating voice capabilities into applications. Agents should reach for this skill when users request transcription, audio analysis, or voice-driven features.
metadata:
mintlify-proj: gladia
version: "1.0"
Gladia is a speech-to-text API that transcribes audio and video files in two modes: pre-recorded (asynchronous, file-based) and live (real-time, WebSocket-based). Beyond transcription, it provides audio intelligence features like speaker diarization, translation, sentiment analysis, PII redaction, and custom vocabulary matching. Agents use Gladia to build transcription workflows, extract structured data from audio, and power voice-driven applications.
Key files and commands:
- SDKs: JavaScript (`@gladiaio/sdk`) and Python (`gladiaio-sdk`)
- Authentication: `x-gladia-key` header with your API key
- Pre-recorded: `POST /v2/pre-recorded` (create job), `GET /v2/pre-recorded/:id` (poll results)
- Live: `POST /v2/live` (init session), WebSocket connection for streaming audio

Reach for this skill when:
# All requests require the x-gladia-key header
curl --header 'x-gladia-key: YOUR_API_KEY' https://api.gladia.io/v2/...
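The same header can be attached from JavaScript. The helper below is a minimal sketch: the name `gladiaRequestInit` and the `GLADIA_API_KEY` environment variable are assumptions made for illustration and are not part of the Gladia SDK.

```javascript
// Hypothetical helper: builds fetch() options that carry the x-gladia-key header.
function gladiaRequestInit(apiKey, { method = "GET", body } = {}) {
  if (!apiKey) {
    throw new Error("Missing Gladia API key");
  }
  const headers = { "x-gladia-key": apiKey };
  const init = { method, headers };
  if (body !== undefined) {
    // JSON bodies need an explicit content type.
    headers["Content-Type"] = "application/json";
    init.body = JSON.stringify(body);
  }
  return init;
}

// Usage (assumes the key is stored in the GLADIA_API_KEY environment variable):
// const init = gladiaRequestInit(process.env.GLADIA_API_KEY);
```

Failing fast on a missing key keeps an unauthenticated request from ever being sent.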
1. `POST /v2/upload` → get `audio_url`
2. `POST /v2/pre-recorded` with `audio_url` and options
3. `GET /v2/pre-recorded/:id` until `status: "done"`

Or use the SDK's `transcribe()` method to run the whole flow in one call.
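For cases where the SDK is not an option, the create and poll steps above can be sketched with plain `fetch`. This is an illustration under assumptions, not the SDK's implementation. It assumes Node 18+ (global `fetch`), an audio file already reachable by URL (so the upload step is skipped), a create response that carries an `id` field, and failed jobs that report `status: "error"`. The function name `transcribeFromUrl` and its options are hypothetical.

```javascript
const GLADIA_BASE_URL = "https://api.gladia.io";

// Creates a pre-recorded job, then polls it until it is done or attempts run out.
async function transcribeFromUrl(audioUrl, apiKey, options = {}) {
  const { fetchImpl = fetch, intervalMs = 2000, maxAttempts = 150 } = options;
  const headers = { "x-gladia-key": apiKey, "Content-Type": "application/json" };

  // Create the job.
  const createRes = await fetchImpl(`${GLADIA_BASE_URL}/v2/pre-recorded`, {
    method: "POST",
    headers,
    body: JSON.stringify({ audio_url: audioUrl }),
  });
  if (!createRes.ok) throw new Error(`Job creation failed: HTTP ${createRes.status}`);
  const { id } = await createRes.json(); // assumed response field

  // Poll until the job reports "done".
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const pollRes = await fetchImpl(`${GLADIA_BASE_URL}/v2/pre-recorded/${id}`, { headers });
    if (!pollRes.ok) throw new Error(`Polling failed: HTTP ${pollRes.status}`);
    const job = await pollRes.json();
    if (job.status === "done") return job;
    if (job.status === "error") throw new Error(`Job ${id} failed`); // assumed status value
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${id} did not finish after ${maxAttempts} polls`);
}
```

The injectable `fetchImpl` lets the flow be exercised against a stub, and the attempt cap keeps a stuck job from polling forever.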
1. `POST /v2/live` with audio config (`encoding`, `sample_rate`, `bit_depth`, `channels`)
2. Use the returned `url` to open the WebSocket connection
3. Send a `stop_recording` message; the WebSocket closes when post-processing is done

| Type | Examples |
|---|---|
| Audio | MP3, WAV, FLAC, AAC, OGG, Opus, M4A |
| Video | MP4, MOV, AVI, WebM, Matroska |
| Online | TikTok, Instagram, Facebook, Vimeo, LinkedIn, YouTube (via URL) |

| Feature | Pre-recorded | Live | Use case |
|---|---|---|---|
| Diarization | ✓ | ✓ | Identify speakers, separate voices |
| Translation | ✓ | ✓ | Translate to 100+ languages |
| Subtitles | ✓ | - | Generate SRT/VTT files |
| Custom vocabulary | ✓ | ✓ | Fix domain-specific terms |
| Custom spelling | ✓ | ✓ | Normalize misspelled words |
| Sentiment analysis | ✓ | ✓ | Detect sentiment & emotions |
| PII redaction | ✓ | - | Mask sensitive data (GDPR/HIPAA) |
| Named entity recognition | ✓ | ✓ | Extract people, places, dates |
| Summarization | ✓ | - | Auto-generate summaries |
| Chapterization | ✓ | - | Split into chapters/segments |
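The availability columns above can be turned into a small pre-flight check. The sketch below copies the table into a lookup and reports which requested features a mode does not support. The key names and the function `unsupportedFeatures` are hypothetical identifiers chosen for this example, not API fields.

```javascript
// Availability copied from the feature table (true = supported in that mode).
const FEATURE_SUPPORT = {
  diarization: { preRecorded: true, live: true },
  translation: { preRecorded: true, live: true },
  subtitles: { preRecorded: true, live: false },
  custom_vocabulary: { preRecorded: true, live: true },
  custom_spelling: { preRecorded: true, live: true },
  sentiment_analysis: { preRecorded: true, live: true },
  pii_redaction: { preRecorded: true, live: false },
  named_entity_recognition: { preRecorded: true, live: true },
  summarization: { preRecorded: true, live: false },
  chapterization: { preRecorded: true, live: false },
};

// Returns the requested features that `mode` ("preRecorded" or "live") lacks.
// Unknown feature names are reported as unsupported.
function unsupportedFeatures(requested, mode) {
  return requested.filter((name) => {
    const support = FEATURE_SUPPORT[name];
    return !support || !support[mode];
  });
}

// Usage: unsupportedFeatures(["translation", "subtitles"], "live")
```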

| Scenario | Use pre-recorded | Use live |
|---|---|---|
| User uploads a file to transcribe | ✓ | - |
| Real-time transcription (voice agent, meeting) | - | ✓ |
| Post-processing (subtitles, translation, summarization) | ✓ | - |
| Low-latency response needed | - | ✓ |
| Batch processing multiple files | ✓ | - |

| Situation | Use custom vocabulary | Use custom spelling |
|---|---|---|
| Model outputs garbled/phonetically wrong text | ✓ | - |
| Model outputs recognizable but misspelled word | - | ✓ |
| Domain-specific terms (brand names, jargon) | ✓ | - |
| Normalizing variant spellings | - | ✓ |
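To make the distinction concrete, the sketch below builds job options for each mechanism. The `custom_vocabulary` fields follow the job example elsewhere in this document. The `custom_spelling` and `custom_spelling_config.spelling_dictionary` field names are an assumption that should be checked against the API reference, and `buildTermOptions` is a hypothetical helper.

```javascript
// Builds only the option groups that have entries.
// vocabulary: terms the model tends to garble (brand names, jargon).
// spellingDictionary: canonical spelling -> list of variants to normalize.
function buildTermOptions({ vocabulary = [], spellingDictionary = {} } = {}) {
  const options = {};
  if (vocabulary.length > 0) {
    options.custom_vocabulary = true;
    options.custom_vocabulary_config = { vocabulary };
  }
  if (Object.keys(spellingDictionary).length > 0) {
    // Assumed field names; verify before relying on them.
    options.custom_spelling = true;
    options.custom_spelling_config = { spelling_dictionary: spellingDictionary };
  }
  return options;
}

// Usage with made-up variants:
// buildTermOptions({ vocabulary: ["Solaria"], spellingDictionary: { Gladia: ["Gladya"] } })
```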

| Scenario | Use diarization | Use multi-channel |
|---|---|---|
| Single audio stream, multiple speakers | ✓ | - |
| Separate audio tracks per speaker | - | ✓ |
| Unknown number of speakers | ✓ | - |
| Known speaker count and channels | - | ✓ |
```javascript
const uploadResponse = await gladiaClient.preRecorded().uploadFile("path/to/audio.mp3");
const audioUrl = uploadResponse.audio_url;
```
```javascript
const job = await gladiaClient.preRecorded().createUntyped({
  audio_url: audioUrl,
  language_config: { languages: ["en"], code_switching: false },
  diarization: true,
  diarization_config: { min_speakers: 1, max_speakers: 5 },
  custom_vocabulary: true,
  custom_vocabulary_config: { vocabulary: ["Gladia", "Solaria"] },
  translation: true,
  translation_config: { target_languages: ["fr"], model: "base" },
  sentiment_analysis: true,
  pii_redaction: true,
  pii_redaction_config: { entity_types: ["GDPR"] }
});
```
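Options like these can be sanity-checked before the job is created. The sketch below encodes pitfalls this document warns about: code switching with an empty or overly long language list, and diarization without speaker-count hints. The function `checkJobOptions` and its warning strings are hypothetical.

```javascript
// Returns human-readable warnings for a job options object; empty means no issues found.
function checkJobOptions(options) {
  const warnings = [];
  const languageConfig = options.language_config || {};
  const languages = languageConfig.languages || [];

  if (languageConfig.code_switching && languages.length === 0) {
    warnings.push("code_switching is enabled with an empty languages list");
  }
  if (languageConfig.code_switching && languages.length > 5) {
    warnings.push("code_switching is enabled with more than 5 candidate languages");
  }
  if (options.diarization && !options.diarization_config) {
    warnings.push("diarization is enabled without speaker-count hints");
  }
  return warnings;
}
```

The options in the job example above produce no warnings: the language list is constrained and `diarization_config` supplies speaker bounds.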
```javascript
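// Poll every 2 s until done; assumes a failed job reports status "error".
let result = await gladiaClient.preRecorded().get(job.id);
while (result.status !== "done") {
  if (result.status === "error") {
    throw new Error(`Transcription job ${job.id} failed`);
  }
  await new Promise((resolve) => setTimeout(resolve, 2000));
  result = await gladiaClient.preRecorded().get(job.id);
}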
```
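Once the polling loop returns a finished job, the utterances can be post-processed. The sketch below groups utterance text by speaker. It assumes each utterance carries `speaker` and `text` fields and that the list is reachable at `result.result.transcription.utterances`. Both the exact path and the helper name `transcriptBySpeaker` are assumptions.

```javascript
// Groups utterance text by speaker, preserving utterance order per speaker.
function transcriptBySpeaker(utterances) {
  const bySpeaker = new Map();
  for (const { speaker, text } of utterances) {
    // Utterances without a speaker label are collected under "unknown".
    const key = speaker ?? "unknown";
    if (!bySpeaker.has(key)) bySpeaker.set(key, []);
    bySpeaker.get(key).push(text.trim());
  }
  return bySpeaker;
}

// Usage (path assumed):
// const grouped = transcriptBySpeaker(result.result.transcription.utterances);
```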
Read the results from the `transcription.utterances`, `translation`, `sentiment_analysis`, and `diarization` fields.

```javascript
const liveSession = gladiaClient.liveV2().startSession({
  model: "solaria-1",
  encoding: "wav/pcm",
  sample_rate: 16000,
  bit_depth: 16,
  channels: 1,
  language_config: { languages: ["en"], code_switching: false },
  messages_config: { receive_partial_transcripts: true }
});
```
```javascript
liveSession.on("message", (message) => {
  if (message.type === "transcript" && message.data.is_final) {
    console.log(message.data.utterance.text);
  }
});
```
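When partial transcripts are enabled, a UI usually wants to show the latest partial and commit text only once it is final. The sketch below is one way to track that, using the message shape from the handler above (`type`, `data.is_final`, `data.utterance.text`). The state shape and the function `applyTranscriptMessage` are hypothetical.

```javascript
// Pure reducer: returns a new state holding committed finals and the current partial.
function applyTranscriptMessage(state, message) {
  if (message.type !== "transcript") return state;
  const text = message.data.utterance.text;
  if (message.data.is_final) {
    // A final result replaces whatever partial was on screen.
    return { finals: [...state.finals, text], partial: "" };
  }
  return { finals: state.finals, partial: text };
}

// Usage:
// let state = { finals: [], partial: "" };
// liveSession.on("message", (message) => { state = applyTranscriptMessage(state, message); });
```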
```javascript
liveSession.sendAudio(audioChunk);
```
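Chunk size follows from the session's audio config: bytes per chunk = frames per chunk × (`bit_depth` / 8) × `channels`. The sketch below computes that and slices a buffer of raw PCM samples accordingly. The helper names and the 100 ms default are assumptions for illustration, not SDK requirements.

```javascript
// Bytes of raw PCM covering `ms` milliseconds, aligned to whole frames.
function pcmChunkBytes({ sampleRate, bitDepth, channels }, ms) {
  const frames = Math.round((sampleRate * ms) / 1000);
  return frames * (bitDepth / 8) * channels;
}

// Yields consecutive slices of `buffer` (Uint8Array or Buffer), each about `ms` long.
// The last slice may be shorter.
function* chunkPcm(buffer, format, ms = 100) {
  const size = pcmChunkBytes(format, ms);
  for (let offset = 0; offset < buffer.length; offset += size) {
    yield buffer.subarray(offset, offset + size);
  }
}

// Usage with the session config shown earlier (16 kHz, 16-bit, mono):
// for (const chunk of chunkPcm(pcm, { sampleRate: 16000, bitDepth: 16, channels: 1 })) {
//   liveSession.sendAudio(chunk);
// }
```

With 16 kHz, 16-bit, mono audio, 100 ms is 1600 frames of 2 bytes each, so each chunk is 3200 bytes.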
```javascript
liveSession.stopRecording();
```
```javascript
const result = await fetch(`https://api.gladia.io/v2/live/${sessionId}`, {
  headers: { "x-gladia-key": apiKey }
});
```
- Do not combine an empty `languages: []` with `code_switching: true`. The detector will evaluate every utterance against 100+ languages, causing misdetections. Always provide a constrained list (3-5 languages max).
- In live mode, `encoding`, `sample_rate`, `bit_depth`, and `channels` must match your actual audio stream. Mismatches cause garbled output.
- For custom vocabulary, start with `intensity: 0.4` and raise it only if terms are missed. High intensity causes false positives (unrelated words get replaced). Add `pronunciations` variants before raising intensity.
- When the language is known, set it explicitly (for example `languages: ["en"]`). Omitting it forces detection, adding latency and risk of misdetection.
- For diarization, provide `number_of_speakers` or `min_speakers`/`max_speakers`. Hints improve accuracy.

Before submitting transcription work:

- `x-gladia-key` header set
- No empty `languages` with `code_switching: true`
- Results read from `transcription.utterances`, `translation`, `sentiment_analysis`, etc.
- Diarization output checked (`speaker` field matches expected speakers)

> For additional documentation and navigation, see: https://docs.gladia.io/llms.txt
> This file is auto-synced from https://docs.gladia.io/.well-known/agent-skills/gladia/skill.md
> Do not edit manually — changes will be overwritten by CI.