Image-to-video generation on RunComfy. This skill is the canonical image-to-video entry point for the RunComfy Model API: give it a still image and a motion description, and it returns a short video clip. Any image works (portrait, product photo, environment, illustration), and the motion is driven by your prompt.
Image-to-video (often abbreviated i2v or image2video) is the task of generating a short video starting from a single still image. The image fixes the look — face, wardrobe, product, scene geometry — and the prompt drives the motion. Image-to-video is distinct from text-to-video (no input image) and from video-to-video (which transforms an existing clip).
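Image-to-video on RunComfy supports three patterns: general image-to-video, lip-sync to a custom voiceover, and multi-modal image-to-video (image plus reference video and reference audio).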
This skill picks the right image-to-video endpoint for the user's intent and calls runcomfy run with the matching schema.
Pick image-to-video on RunComfy whenever the user said "image to video", "i2v", "animate this image", "image2video", or "make a video from this", or showed an image and asked for video.
| User intent | Image-to-video model | Why |
|---|---|---|
| Default image-to-video: portraits, products, environments | happyhorse/happyhorse-1-0/image-to-video | #1 on Arena (Elo 1392 i2v); strong identity preservation; native synchronized audio in image-to-video output |
| Image-to-video with custom voiceover lip-sync | wan-ai/wan-2-7/text-to-video + audio_url | Drives lip-sync on the image-to-video frame from your audio file |
| Multi-modal image-to-video (image + ref video + ref audio) | bytedance/seedance-v2/pro | Multi-input image-to-video with up to 9 image refs and 3 audio refs |
The agent reads this table, classifies the user's image-to-video intent, and picks the matching endpoint.
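The routing step can be sketched as a small shell function. This is only an illustration: the intent labels general, lipsync, and multimodal are hypothetical names invented here, and the endpoint IDs are the ones that appear in this page's runcomfy run commands.

```shell
# Map a coarse intent label to an endpoint from the routing table.
# The labels are hypothetical; only the endpoint IDs come from this page.
pick_i2v_endpoint() {
  case "$1" in
    general)    echo "happyhorse/happyhorse-1-0/image-to-video" ;;
    lipsync)    echo "wan-ai/wan-2-7/text-to-video" ;;
    multimodal) echo "bytedance/seedance-v2/pro" ;;
    *)          echo "unknown intent: $1" >&2; return 1 ;;
  esac
}
```

Failing on an unknown label, rather than silently falling back to the default endpoint, keeps a misclassified request from spending a generation on the wrong model.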
Install the CLI with npm i -g @runcomfy/cli.
Sign in: runcomfy login opens a browser device-code flow.
In CI, set the RUNCOMFY_TOKEN environment variable instead.
The default image-to-video endpoint. Use for any general image-to-video task: portrait drift, product reveal, environment motion, character animation. Image-to-video output includes synchronized audio in the same generation pass.
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| image_url | string | yes | none | The source still for image-to-video. JPEG/PNG/WebP, min 300px, aspect 1:2.5–2.5:1, ≤10MB. |
| prompt | string | yes | none | Motion / camera / lighting description for the image-to-video output. ≤5000 chars. |
| resolution | enum | no | 1080P | 720P or 1080P. |
| duration | int | no | 5 | 3–15 seconds per image-to-video clip. |
| seed | int | no | 0 | Reuse for image-to-video variant comparisons. |
| watermark | bool | no | true | Provider watermark on image-to-video output. |
The output clip keeps the aspect ratio of the input image.
runcomfy run happyhorse/happyhorse-1-0/image-to-video \
--input '{
"image_url": "https://.../portrait.jpg",
"prompt": "Gentle camera drift around the subject'\''s face, subtle breathing motion, identity-stable features, soft natural light."
}' \
--output-dir <absolute/path>
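Since a schema mismatch only surfaces after the request is sent, it can help to check the documented limits locally first. The following is a minimal sketch, assuming the limits in the field table above; check_default_input is a hypothetical helper and not part of the runcomfy CLI.

```shell
# Pre-flight check for the default endpoint's documented limits.
# check_default_input is a hypothetical helper, not a runcomfy command.
# Usage: check_default_input <duration> <resolution> <prompt>
check_default_input() {
  cdi_duration=$1 cdi_resolution=$2 cdi_prompt=$3
  case "$cdi_duration" in
    ''|*[!0-9]*) echo "duration must be an integer" >&2; return 1 ;;
  esac
  if [ "$cdi_duration" -lt 3 ] || [ "$cdi_duration" -gt 15 ]; then
    echo "duration must be 3-15 seconds" >&2; return 1
  fi
  case "$cdi_resolution" in
    720P|1080P) ;;
    *) echo "resolution must be 720P or 1080P" >&2; return 1 ;;
  esac
  # ${#var} counts characters in the current locale (bytes in the C locale).
  if [ "${#cdi_prompt}" -eq 0 ] || [ "${#cdi_prompt}" -gt 5000 ]; then
    echo "prompt must be 1-5000 characters" >&2; return 1
  fi
}
```

A call such as check_default_input 5 1080P "Gentle camera drift" returns success, so it can guard the runcomfy run call with a plain &&. Image size and aspect checks are left out because they need an image tool.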
When the image-to-video output needs to lip-sync to a custom audio track, use Wan 2.7 with audio_url. The image-to-video clip is generated around your voiceover so mouth movement matches.
| Field | Type | Required | Notes |
|---|---|---|---|
| prompt | string | yes | Describe the talking-head shot for the image-to-video output. |
| audio_url | string | yes | WAV/MP3, 3–30s, ≤15MB. Drives lip-sync on the image-to-video frame. |
| aspect_ratio | enum | no | 16:9, 9:16, 1:1, 4:3, 3:4. |
| resolution | enum | no | 720p or 1080p. |
| duration | enum | no | 2–15 seconds. Match audio length for clean image-to-video lip-sync. |
runcomfy run wan-ai/wan-2-7/text-to-video \
--input '{
"prompt": "Medium close-up, soft key light, locked tripod, shallow DOF.",
"audio_url": "https://.../voiceover-en.mp3",
"duration": 12,
"aspect_ratio": "9:16"
}' \
--output-dir <absolute/path>
For multi-language dubs, keep the same prompt, swap audio_url per call, and lock seed so the visuals stay consistent across all image-to-video outputs.
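The multi-language dub loop might look like the sketch below, which prints the commands instead of running them. Several things here are assumptions: the seed key name is carried over from the default endpoint's schema (the Wan 2.7 field table does not list it), the language codes are placeholders, and the elided https://.../ URLs are kept exactly as placeholders.

```shell
# Build one Wan 2.7 input body per voiceover, keeping prompt and seed fixed.
# Assumptions: the "seed" key name is borrowed from the default endpoint's
# schema, and values contain no double quotes or backslashes (this sketch
# does no JSON escaping).
dub_input() {
  printf '{"prompt": "%s", "audio_url": "%s", "seed": %d}\n' "$1" "$2" "$3"
}

# Dry run: print the command for each language instead of executing it.
# The language codes and the elided URLs are placeholders.
print_dub_commands() {
  pdc_prompt="Medium close-up, soft key light, locked tripod, shallow DOF."
  for pdc_lang in en es ja; do
    printf "runcomfy run wan-ai/wan-2-7/text-to-video --input '%s' --output-dir <absolute/path>\n" \
      "$(dub_input "$pdc_prompt" "https://.../voiceover-$pdc_lang.mp3" 42)"
  done
}
```

Printing first makes it easy to confirm that only audio_url differs between calls before any generation is paid for.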
When the image-to-video output should fuse a subject image with a scene reference and voice reference, use Seedance 2.0 Pro. Multi-modal image-to-video accepts up to 9 image refs.
| Field | Type | Required | Notes |
|---|---|---|---|
| prompt | string | yes | Description for the image-to-video output. EN ≤1000 words. |
| image_url | array | yes | 0–9 source images for image-to-video. First is the primary subject. |
| video_url | array | no | 0–3 reference clips (2–15s each) for image-to-video scene cues. |
| audio_url | array | no | 0–3 reference audio (2–15s, <15MB each) for image-to-video voice cues. |
| duration | int | no | 4–15 seconds. |
| resolution | enum | no | 480p or 720p. |
runcomfy run bytedance/seedance-v2/pro \
--input '{
"prompt": "Subject from image 1 walks through the scene from video 1, voice from audio 1.",
"image_url": ["https://.../subject.jpg"],
"video_url": ["https://.../scene.mp4"],
"audio_url": ["https://.../voice.mp3"],
"duration": 8
}' \
--output-dir <absolute/path>
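Because the multi-modal route takes arrays with per-field caps (9 images, 3 clips, 3 audio files), a small helper can render the array and enforce the cap in one place. This is a sketch only; json_url_array is a hypothetical helper, and it assumes URLs without double quotes or backslashes.

```shell
# Render URLs as a JSON array, refusing more entries than the schema allows.
# json_url_array is a hypothetical helper; it does no JSON escaping.
# Usage: json_url_array <max_items> [url ...]
json_url_array() {
  jua_max=$1; shift
  if [ "$#" -gt "$jua_max" ]; then
    echo "too many entries: $# > $jua_max" >&2; return 1
  fi
  jua_out="" jua_sep=""
  for jua_url in "$@"; do
    jua_out="$jua_out$jua_sep\"$jua_url\""
    jua_sep=", "
  done
  printf '[%s]\n' "$jua_out"
}
```

For example, json_url_array 9 would be used for image_url and json_url_array 3 for video_url and audio_url, with the results spliced into the --input body.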
Image-to-video prompts behave differently from text-to-video prompts. The image already fixes the look — your prompt should drive motion, not redescribe the image.
What's the max duration of an image-to-video clip? 15 seconds across all image-to-video routes here. For longer image-to-video sequences, generate multiple clips and stitch.
What image formats does image-to-video accept? JPEG, PNG, WebP. Min 300px, ≤10MB, aspect 1:2.5 to 2.5:1.
Does image-to-video preserve face identity? Yes — the default image-to-video model has strong identity preservation. For best identity hold, the face should fill at least 5% of the frame in the input image.
Can image-to-video include audio? Yes. The default image-to-video model generates synchronized audio in the same pass. The lip-sync image-to-video route accepts your custom audio. The multi-modal image-to-video route accepts reference audio.
Image-to-video vs text-to-video on RunComfy? Image-to-video starts from your image (look fixed). Text-to-video starts from your prompt only (look generated). Use image-to-video when you have an exact reference; use text-to-video for novel content.
Image-to-video output resolution? 720p or 1080p on the default and lip-sync routes; 480p or 720p on the multi-modal route.
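For sequences longer than the 15-second cap, the "generate multiple clips and stitch" advice needs a plan for how long each clip should be. One possible approach, sketched here with a hypothetical plan_clips helper, spreads the total evenly so no clip drops below the default endpoint's 3-second minimum. Stitching itself is out of scope for this sketch.

```shell
# Split a target length (whole seconds) into clip durations of at most 15 s.
# plan_clips is a hypothetical helper; it spreads the total evenly so that
# no clip falls below the default endpoint's 3-second minimum.
plan_clips() {
  pc_total=$1 pc_max=15
  if [ "$pc_total" -lt 3 ]; then
    echo "total must be at least 3 seconds" >&2; return 1
  fi
  pc_n=$(( (pc_total + pc_max - 1) / pc_max ))   # number of clips (ceiling)
  pc_base=$(( pc_total / pc_n ))                  # seconds every clip gets
  pc_extra=$(( pc_total % pc_n ))                 # clips that get one more
  pc_i=0 pc_out=""
  while [ "$pc_i" -lt "$pc_n" ]; do
    if [ "$pc_i" -lt "$pc_extra" ]; then
      pc_len=$(( pc_base + 1 ))
    else
      pc_len=$pc_base
    fi
    pc_out="$pc_out${pc_out:+ }$pc_len"
    pc_i=$(( pc_i + 1 ))
  done
  echo "$pc_out"
}
```

With this scheme, plan_clips 40 prints 14 13 13 rather than 15 15 10, which keeps the clips close in length.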
| code | meaning |
|---|---|
| 0 | image-to-video succeeded |
| 64 | bad CLI args |
| 65 | bad input JSON for image-to-video / schema mismatch |
| 69 | upstream 5xx |
| 75 | retryable: timeout / 429 |
| 77 | not signed in or token rejected |
Full reference: docs.runcomfy.com/cli/troubleshooting.
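The exit codes above lend themselves to a thin wrapper that retries only on code 75 and reports everything else. The sketch below is one way to do that; run_with_retry and RETRY_DELAY are hypothetical names, and the linear backoff is an arbitrary choice rather than anything the CLI prescribes.

```shell
# Run a command and retry only when it exits with 75 (timeout / 429).
# run_with_retry and RETRY_DELAY are hypothetical names for this sketch.
# Usage: run_with_retry <max_attempts> <command> [args ...]
run_with_retry() {
  rwr_max=$1; shift
  rwr_attempt=1
  while :; do
    "$@"
    rwr_code=$?
    case "$rwr_code" in
      0)  return 0 ;;
      75) ;;   # retryable: continue to the backoff below
      64) echo "bad CLI args" >&2; return 64 ;;
      65) echo "bad input JSON or schema mismatch" >&2; return 65 ;;
      69) echo "upstream 5xx" >&2; return 69 ;;
      77) echo "not signed in or token rejected" >&2; return 77 ;;
      *)  return "$rwr_code" ;;
    esac
    if [ "$rwr_attempt" -ge "$rwr_max" ]; then
      echo "giving up after $rwr_attempt attempts" >&2; return 75
    fi
    # Linear backoff: delay grows with the attempt number.
    sleep "$(( ${RETRY_DELAY:-2} * rwr_attempt ))"
    rwr_attempt=$(( rwr_attempt + 1 ))
  done
}
```

It would wrap a normal invocation, for example run_with_retry 3 runcomfy run followed by the usual endpoint and flags. Code 69 is passed through without a retry because the table marks only 75 as retryable.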
The skill picks one of three image-to-video endpoints based on user intent (general image-to-video, lip-sync image-to-video, or multi-modal image-to-video) and invokes runcomfy run with the matching JSON body. The CLI POSTs to the RunComfy Model API, polls the image-to-video request status every 2 seconds, and downloads the resulting image-to-video file from the .runcomfy.net / .runcomfy.com URL into --output-dir. Ctrl-C cancels the in-flight image-to-video request.
runcomfy login writes the API token to ~/.config/runcomfy/token.json with mode 0600. Set the RUNCOMFY_TOKEN env var to bypass the file in CI.
Input is passed as JSON via --input. The CLI does NOT shell-expand it, so there is no shell-injection surface.
The CLI talks to model-api.runcomfy.net and .runcomfy.net / .runcomfy.com. No telemetry.
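When debugging a code 77 failure, it can be useful to see which credential source is in play. The following sketch assumes the precedence described above (the environment variable bypasses the token file); token_source is a hypothetical helper and never prints the token itself.

```shell
# Report where the credential would come from, based on the notes above.
# token_source is a hypothetical helper; it inspects only the environment
# and the documented file path, and never prints the token.
token_source() {
  if [ -n "${RUNCOMFY_TOKEN:-}" ]; then
    echo "env"
  elif [ -f "$HOME/.config/runcomfy/token.json" ]; then
    echo "file"
  else
    echo "none"
  fi
}
```

A result of none suggests running runcomfy login or setting RUNCOMFY_TOKEN before retrying.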