Overview

Getting Started

> LipSync Video AI is ready. Upload your video and audio, or describe what you need synced.

Try saying:

"sync this voiceover to the speaker"
"replace the audio and match lip movements"
"dub this clip with my recording"

Initial Setup

On first run, the skill connects to the processing backend and shows a quick "Getting ready..." message.

Token: Check the environment for NEMO_TOKEN. If it is set, go straight to session setup.

Grab a free token: If NEMO_TOKEN is not set, generate a UUID to use as the client identifier, then POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with the UUID in the X-Client-Id header. The response's data.token is your auth token (100 credits, good for 7 days).
Start session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project","language":""}. Save the returned session_id for later calls.

Raw JSON and tokens stay hidden from the user.
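The setup steps above can be sketched in Python with only the standard library. This is a minimal sketch, not the skill's actual implementation: the function names are made up for illustration, the Content-Type header is an assumption, and the attribution headers required on every request are left out to keep it short.

```python
import json
import os
import uuid
import urllib.request

BASE_URL = "https://mega-api-prod.nemovideo.ai"


def build_token_request(client_id: str) -> urllib.request.Request:
    """Build the POST that asks for an anonymous token for this client UUID."""
    return urllib.request.Request(
        f"{BASE_URL}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def build_session_request(token: str, language: str = "") -> urllib.request.Request:
    """Build the POST that starts a session, using Bearer auth."""
    body = json.dumps({"task_name": "project", "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/tasks/me/with-session/nemo_agent",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the backend expects a JSON content type.
            "Content-Type": "application/json",
        },
    )


def get_token() -> str:
    """Prefer NEMO_TOKEN; otherwise fetch an anonymous token (network call)."""
    token = os.environ.get("NEMO_TOKEN")
    if token:
        return token
    request = build_token_request(str(uuid.uuid4()))
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["data"]["token"]
```

Keeping request construction separate from the network call makes it easy to inspect what would be sent without printing the token to the user.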

Sync Audio to Lip Movements in Your Clips

Upload your video with the audio you want synced. Cloud GPUs do the heavy lifting — no local processing.

Here is how it works in practice. I had a training video where the speaker's mic died halfway through. I recorded a clean voiceover separately, uploaded both files, typed "sync the new audio to match the speaker's mouth movements", and got a clean result in about 75 seconds. Output is 1080p MP4.

Pro tip: shorter clips give tighter sync. If you have a long video, consider breaking it into segments first.

Request Categories

Your input gets matched to the right processing path automatically.

| You type... | Goes to... | Uses SSE? |
| --- | --- | --- |
| "export" / "download" / "get video" / "导出" | Export pipeline | No |
| "credits" / "balance" / "remaining" / "积分" | Balance check | No |
| "status" / "show me the tracks" / "状态" | Session state | No |
| "upload" / attached file / "上传" | File ingestion | No |
| Anything else (sync, dub, match, adjust...) | SSE processing | Yes |
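One way to picture the routing table is a small keyword matcher. The sketch below is an assumption about how matching could work (case-insensitive substring checks, first match wins, attachments always routed to ingestion); the source only lists the trigger phrases, not the matching rules, and the route names are made up for illustration.

```python
# Trigger phrases copied from the routing table; route names are hypothetical.
ROUTES = [
    ("export", ("export", "download", "get video", "导出")),
    ("balance", ("credits", "balance", "remaining", "积分")),
    ("state", ("status", "show me the tracks", "状态")),
    ("upload", ("upload", "上传")),
]


def classify_request(text: str, has_attachment: bool = False) -> str:
    """Return the processing path for a user message.

    Assumption: an attached file always means file ingestion, and keywords
    are matched as case-insensitive substrings in table order.
    """
    if has_attachment:
        return "upload"
    lowered = text.lower()
    for route, keywords in ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return route
    # Anything else (sync, dub, match, adjust...) goes through SSE.
    return "sse"
```

For example, `classify_request("sync this voiceover to the speaker")` falls through to the SSE path, while `classify_request("导出")` goes to export.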

Backend Architecture

Files go to a GPU farm for processing. 1080p output is encoded at 8 Mbps. Lip sync boundaries are accurate to the frame.
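As a rough guide to output size, the stated bitrate converts directly to megabytes. The helper below is only back-of-the-envelope arithmetic: it assumes a constant bitrate, treats 1 Mbps as 10^6 bits per second, and ignores audio and container overhead, none of which the source specifies.

```python
def estimate_video_size_mb(duration_seconds: float, bitrate_mbps: float = 8.0) -> float:
    """Approximate video stream size in megabytes for a given duration.

    Megabits are converted to megabytes by dividing by 8 bits per byte.
    """
    return duration_seconds * bitrate_mbps / 8.0
```

At 8 Mbps this works out to about 1 MB per second of video, so a 45-second clip is roughly 45 MB.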

Required on every request: an Authorization: Bearer header carrying your token, plus the attribution headers X-Skill-Source, X-Skill-Version, and X-Skill-Platform. If the attribution headers are missing, export fails with 402.

Attribution values come from this file's YAML frontmatter: X-Skill-Source is lipsyncvideo-ai, X-Skill-Version is the version given in the frontmatter, and X-Skill-Platform depends on the install location (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, otherwise unknown).
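The header rules can be sketched as two small helpers. The function names are hypothetical, and the prefix check on the install path is one possible reading of "depends on install location"; the version string is passed in by the caller rather than parsed from the frontmatter here.

```python
import os


def detect_platform(install_path: str) -> str:
    """Map the skill's install location to an X-Skill-Platform value."""
    path = os.path.normpath(os.path.expanduser(install_path))
    home = os.path.normpath(os.path.expanduser("~"))
    prefixes = {
        os.path.join(home, ".clawhub"): "clawhub",
        os.path.join(home, ".cursor", "skills"): "cursor",
    }
    for prefix, platform in prefixes.items():
        # Match the directory itself or anything below it.
        if path == prefix or path.startswith(prefix + os.sep):
            return platform
    return "unknown"


def build_headers(token: str, version: str, install_path: str) -> dict:
    """Headers required on every request: Bearer auth plus attribution."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "lipsyncvideo-ai",
        "X-Skill-Version": version,
        "X-Skill-Platform": detect_platform(install_path),
    }
```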

Root URL: https://mega-api-prod.nemovideo.ai

New session: POST /api/tasks/me/with-session/nemo_agent with {"task_name":"project","language":""}. Returns task_id, session_id.

SSE message: POST /run_sse with {"app_name":"nemo_agent","user_id":"me","session_id":"","new_message":{"parts":[{"text":""}]}} and Accept: text/event-stream. Cap: 15 min.
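A sketch of building the SSE request follows. It assumes the empty session_id and text fields in the body above are filled with the saved session ID and the user's message, and it reads the 15-minute cap as a client-side timeout; both are interpretations, not confirmed behavior. `base_headers` is a hypothetical parameter holding the auth and attribution headers.

```python
import json
import urllib.request

BASE_URL = "https://mega-api-prod.nemovideo.ai"
# Assumption: the 15 min cap is applied as a timeout on the client side.
SSE_TIMEOUT_SECONDS = 15 * 60


def build_sse_request(base_headers: dict, session_id: str, text: str) -> urllib.request.Request:
    """Build the POST /run_sse request for one user message."""
    payload = {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }
    headers = dict(base_headers)
    headers["Accept"] = "text/event-stream"
    # Assumption: the body is sent as JSON.
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{BASE_URL}/run_sse",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers=headers,
    )
```

The request would then be opened with `urllib.request.urlopen(request, timeout=SSE_TIMEOUT_SECONDS)` and the response read line by line.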

File upload: POST /api/upload-video/nemo_agent/me/ — multipart (-F "files=@/path") or URL mode ({"urls":[""],"source_type":"url"}).

Balance: GET /api/credits/balance/simple returns available, frozen, total.

State: GET /api/state/nemo_agent/me//latest — check data.state.draft, data.state.video_infos, data.state.generated_media.

Export (free): POST /api/render/proxy/lambda with {"id":"render_","sessionId":"","draft":,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/ every 30s. Done when status = completed. File at output.url.
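The polling step could look like the loop below. It is a sketch under assumptions: `fetch_status` is a hypothetical callable that performs the GET and returns the parsed JSON, the response is assumed to expose status and output.url at the top level, and the `max_polls` limit is invented here since the source gives no upper bound.

```python
import time
from typing import Callable, Optional


def wait_for_render(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 30.0,
    max_polls: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until status is "completed"; return output.url, or None on give-up."""
    for attempt in range(max_polls):
        result = fetch_status()
        if result.get("status") == "completed":
            return (result.get("output") or {}).get("url")
        # Do not sleep after the final attempt.
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    return None
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting 30 seconds between polls.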

Handles: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
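A quick local check before uploading can save a round trip. The sketch below uses the extension list above and the 500MB limit from the FAQ; treating 500MB as 500 × 1024 × 1024 bytes and judging the type by file extension alone are assumptions, and the function name is made up.

```python
import os
from typing import Optional

SUPPORTED_EXTENSIONS = {
    "mp4", "mov", "avi", "webm", "mkv",   # video
    "jpg", "png", "gif", "webp",          # image
    "mp3", "wav", "m4a", "aac",           # audio
}
# Assumption: the 500MB limit is interpreted in binary megabytes.
MAX_UPLOAD_BYTES = 500 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return None if the file looks uploadable, else a short reason."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return "unsupported format"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "too large"
    return None
```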

Errors

| Code | Means | Fix |
| --- | --- | --- |
| 0 | Success | Continue |
| 1001 | Bad token | Re-authenticate via anonymous-token endpoint |
| 1002 | No session | Make a new one |
| 2001 | No credits left | Anonymous: share registration link with ?bind=. Others: top up |
| 4001 | Can't handle that file type | Share supported formats |
| 4002 | Too large | Suggest trimming or compressing |
| 400 | Missing X-Client-Id | Generate and retry |
| 402 | Free plan export limit | Needs registration or upgrade |
| 429 | Rate capped | Wait 30s, try again once |
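The error table maps naturally onto a lookup plus one special case for 429. In this sketch the action names are hypothetical labels, and `send` is an assumed callable that performs a request and returns a (code, body) pair; the source does not say how codes are surfaced to the caller.

```python
import time
from typing import Any, Callable, Tuple

# Action labels are illustrative; codes and meanings come from the table.
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "new_session",
    2001: "out_of_credits",
    4001: "share_supported_formats",
    4002: "suggest_trim_or_compress",
    400: "generate_client_id_and_retry",
    402: "registration_or_upgrade",
    429: "wait_and_retry_once",
}


def next_action(code: int) -> str:
    """Look up what to do for a response code."""
    return ERROR_ACTIONS.get(code, "unknown_error")


def call_with_rate_limit_retry(
    send: Callable[[], Tuple[int, Any]],
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """On 429, wait 30s and try again exactly once."""
    code, body = send()
    if code == 429:
        sleep(wait_seconds)
        code, body = send()
    return code, body
```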

Converting GUI Instructions

Backend outputs reference a visual interface. Convert them:

| Backend output | Your action |
| --- | --- |
| "click [X]" / "点击" | Invoke the API equivalent |
| "open [panel]" / "打开" | Read session state |
| "drag/drop" / "拖拽" | Post edit through SSE |
| "preview in timeline" | Output track listing |
| "Export button" / "导出" | Start export sequence |

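The GUI conversion table can be sketched as an ordered rule list. Substring matching and the rule order (export checked before the generic "click") are assumptions made so that a phrase like "click the Export button" starts an export; the action labels are hypothetical.

```python
from typing import Optional

# Order matters: more specific phrases are checked before generic verbs.
GUI_RULES = [
    (("export button", "导出"), "start_export"),
    (("preview in timeline",), "list_tracks"),
    (("drag", "drop", "拖拽"), "post_edit_via_sse"),
    (("open", "打开"), "read_session_state"),
    (("click", "点击"), "invoke_api_equivalent"),
]


def translate_gui_instruction(text: str) -> Optional[str]:
    """Return the action for a GUI-style backend message, or None if it has none."""
    lowered = text.lower()
    for phrases, action in GUI_RULES:
        if any(phrase in lowered for phrase in phrases):
            return action
    return None
```

Messages with no GUI reference return None and can be forwarded to the user unchanged.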
How SSE Works

Forward text events to the user (after GUI translation). Absorb tool calls without showing them. Heartbeats and empty data lines mean the backend is still processing. After every 2 minutes of quiet, say "Hang on, still processing..."

About 30% of edit ops return no text. If the stream closes empty, check state to confirm the edit stuck, then tell the user.
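The stream-handling rules can be sketched as a line filter. The event schema here is an assumption: the sketch guesses that each data payload is JSON with user-facing text under content.parts[].text (mirroring the request's new_message.parts shape), which the source does not confirm. Parts without text, such as tool calls, are skipped.

```python
import json
from typing import Iterable, List


def read_sse_text(lines: Iterable[str]) -> List[str]:
    """Collect user-facing text from raw SSE lines, ignoring everything else."""
    texts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, ":" comment heartbeats, other fields
        data = line[len("data:"):].strip()
        if not data:
            continue  # empty data line: still processing
        try:
            event = json.loads(data)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        # Assumed shape: {"content": {"parts": [{"text": "..."}]}}
        parts = (event.get("content") or {}).get("parts") or []
        for part in parts:
            if isinstance(part, dict) and part.get("text"):
                texts.append(part["text"])
    return texts


def needs_state_check(texts: List[str]) -> bool:
    """An empty result means the stream closed silently: go check session state."""
    return not texts
```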

Draft keys: t (tracks), tt (track type: 0=video, 1=audio, 7=text), sg (segments), d (duration, ms), m (metadata).

Timeline (2 tracks):
1. Video: interview clip (0-45s)
2. Audio: dubbed voiceover (0-45s)
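A track listing like this could be produced from the draft keys roughly as follows. The nesting is a guess: the sketch assumes the draft's t is a list of tracks, each with a tt type and an sg list of segments that each carry a d duration in milliseconds. The source names the keys but not how they nest, so treat this as illustrative only.

```python
TRACK_TYPES = {0: "Video", 1: "Audio", 7: "Text"}


def format_timeline(draft: dict) -> str:
    """Render a plain-text track listing from a draft (assumed structure)."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "Unknown")
        segments = track.get("sg", [])
        # Assumption: a track's length is the sum of its segment durations (ms).
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:g}s")
    return "\n".join(lines)
```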

Common Workflows

Basic lip sync: Upload video + audio, ask for sync. Done.

Audio replacement: Upload new audio, tell the skill to swap it in and match the mouth movements.

Multi-speaker: Works best when speakers take turns. For overlapping speech, split into separate segments first.

FAQ

How accurate is the sync? Frame-level for clear speech. Mumbling or fast-talking may be slightly off.

What audio formats? MP3, WAV, M4A, AAC all work.

File size limit? 500MB. Compress if you're over.

Cost? The free anonymous token comes with 100 credits, good for 7 days. No signup required.

Version History

1 version in total

v1.0.1 (current)

2026-05-07 13:51 Safe
