Tag-driven music generation, inpainting, and outpainting with StepFun-AI's ACE Step open-weights model. Four CLI-reachable endpoints, $0.0002–0.0003 per second of audio, up to 4 minutes per call.
runcomfy.com · ACE Step base · ACE Step 1.5 · CLI docs
# 1. Install (one of — see runcomfy-cli skill for details)
npm i -g @runcomfy/cli # global install
npx -y @runcomfy/cli --version # zero-install
# 2. Sign in
runcomfy login # or in CI: export RUNCOMFY_TOKEN=<token>
# 3. Generate
runcomfy run acestep-ai/ace-step/text-to-audio \
--input '{"tags": "..."}' \
--output-dir ./out
CLI deep dive: runcomfy-cli skill.
Listed newest first.
ACE Step 1.5 (text-to-audio) — acestep-ai/ace-step-1.5/text-to-audio
> Latest ACE Step generation. 50+ language vocal support, refined structured-lyric handling, otherwise same shape as base. Slightly higher cost ($0.0003/s vs $0.0002/s).
> Pick for: multilingual lyrics, hero-quality vocal tracks, vocal songs that need clean section structure.
> Avoid for: cost-sensitive batches where the base model is good enough.
ACE Step (text-to-audio) — acestep-ai/ace-step/text-to-audio (default — cheap & fast)
> Original ACE Step. Tag-driven composition, optional lyrics, 5–240 s stereo. $0.0002/s, roughly 41× cheaper than ElevenLabs Music at $0.0083/s.
> Pick for: high-volume drafts, background music, jingles, game loops, cost-sensitive iteration.
> Avoid for: maximally polished commercial vocal hooks — try ACE Step 1.5 or ElevenLabs Music for those.
ACE Step (audio-inpaint) — acestep-ai/ace-step/audio-inpaint
> Regenerate a time range inside an existing track (not mask-based; uses start_time / end_time in seconds, each anchored to track start or end).
> Pick for: fix a bad chorus in the middle, swap the bridge, replace a 20 s section without re-rendering the whole song.
> Avoid for: edits that aren't time-bounded — those don't fit the schema.
ACE Step (audio-outpaint) — acestep-ai/ace-step/audio-outpaint
> Extend an existing track bidirectionally — add intro before, outro after, or both.
> Pick for: lengthening a 30 s draft into a 2 min cut, adding a fade-in, building a longer arrangement around an existing hook.
> Avoid for: extending a track past 4 min total — chain calls instead.
Model: acestep-ai/ace-step/text-to-audio (or acestep-ai/ace-step-1.5/text-to-audio for the 1.5 variant)
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| tags | string | yes | — | Comma-separated genre / mood / instrument tags. Drives composition |
| lyrics | string | no | — | Vocal content. Use section markers [Verse], [Chorus], [Bridge]. Use [inst] or [instrumental] for no vocals |
| duration | int | no | 60 | Audio length in seconds. 5–240 (max 4 min per call) |
| seed | int | no | -1 | Reproducibility; -1 randomizes |
Pricing: ACE Step $0.0002/s · ACE Step 1.5 $0.0003/s. 60 s ≈ $0.012 / $0.018; 240 s ≈ $0.048 / $0.072.
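The per-second pricing above is easy to check before launching a batch. A minimal sketch: `estimate_cost` is a hypothetical helper, not part of the RunComfy CLI, and the rates passed in are the ones quoted above.

```shell
# Hypothetical helper: estimate the cost of one call in USD.
# usage: estimate_cost <duration_seconds> <rate_per_second>
estimate_cost() {
  LC_ALL=C awk -v d="$1" -v r="$2" 'BEGIN { printf "%.4f\n", d * r }'
}

estimate_cost 60 0.0002    # base model, 60 s: prints 0.0120
estimate_cost 240 0.0003   # ACE Step 1.5, 240 s: prints 0.0720
```

Multiplying the result by the number of planned renders gives a rough batch budget.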
Tag-driven instrumental:
runcomfy run acestep-ai/ace-step/text-to-audio \
--input '{
"tags": "lo-fi hip-hop, mellow, vinyl crackle, rhodes piano, soft drums, 75 BPM",
"lyrics": "[inst]",
"duration": 90
}' \
--output-dir ./out
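When a track does have lyrics, keeping them in a file is usually easier than hand-escaping them into a JSON string. The sketch below, assuming a POSIX shell with sed and awk, turns a lyrics file into the `\n`-joined string that the lyrics field takes. `json_escape` and `build_payload` are hypothetical helpers, not CLI features. They handle only backslashes, double quotes, and newlines, so lyrics containing tabs or other control characters would need a real JSON tool.

```shell
# Hypothetical helper: escape stdin for use inside a JSON string.
# Handles backslashes, double quotes, and newlines only.
json_escape() {
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Hypothetical helper: print a text-to-audio input body.
# usage: build_payload <tags> <lyrics_file> <duration_seconds>
build_payload() {
  printf '{"tags": "%s", "lyrics": "%s", "duration": %s}\n' \
    "$(printf '%s\n' "$1" | json_escape)" \
    "$(json_escape < "$2")" \
    "$3"
}

# Example (not run here; lyrics.txt is a placeholder file name):
#   runcomfy run acestep-ai/ace-step-1.5/text-to-audio \
#     --input "$(build_payload "indie pop, female vocal, 120 BPM" lyrics.txt 60)" \
#     --output-dir ./out
```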
Full vocal song with structure (use 1.5 for multilingual):
runcomfy run acestep-ai/ace-step-1.5/text-to-audio \
--input '{
"tags": "indie pop, anthemic, electric guitar, driving drums, female vocal, 120 BPM",
"lyrics": "[Verse]\nChalk on the palms, laces double-knotted\nMorning on the ridge, the sun is rising\n[Chorus]\nWe rise, we strike, we never fade out\nWe rise, we strike, we sing it loud\n[Bridge]\nSoft piano breakdown\n[Outro]\nFull band, fade",
"duration": 60
}' \
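--output-dir ./out

- Be specific with tags: "lo-fi hip-hop, mellow, vinyl crackle, rhodes piano, soft drums, 75 BPM" beats "chill music".
- Structure lyrics with section markers: [Verse], [Chorus], [Bridge], [Outro]. Keep meter consistent across lines.
- For instrumentals, set "lyrics": "[inst]" or "[instrumental]". Belt-and-suspenders: also say "no vocals" in tags.
- For non-English vocals, name the language in tags (e.g. "japanese vocal, j-pop").
- Fix the seed to reproduce a result (e.g. "seed": 42); use -1 to explore variations.

Model: acestep-ai/ace-step/audio-inpaint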
Catalog: audio-inpaint
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| audio | string | yes | — | HTTPS URL to MP3 / WAV / FLAC. Up to 60 min |
| tags | string | yes | — | Comma-separated tags steering the regenerated segment |
| start_time | float | no | — | Start of editable segment, in seconds (0–240) |
| start_time_relative_to | enum | no | start | start or end. Anchor for start_time |
| end_time | float | no | 30 | End of editable segment, in seconds (0–240) |
| end_time_relative_to | enum | no | start | start or end. Anchor for end_time |
| lyrics | string | no | — | Lyrics for the regenerated segment. Blank = model writes; [inst] = no vocals |
| seed | int | no | -1 | Reproducibility |
No mask — region is defined purely by start_time / end_time (each anchorable to track start or end).
Replace 20–40 s of a track with a new bridge:
runcomfy run acestep-ai/ace-step/audio-inpaint \
--input '{
"audio": "https://your-cdn.example/original-track.mp3",
"tags": "indie pop, breakdown, piano only, soft, no drums",
"start_time": 20,
"end_time": 40,
"lyrics": "[inst]"
}' \
--output-dir ./out
Anchor end relative to track end (rewrite the last 15 s):
runcomfy run acestep-ai/ace-step/audio-inpaint \
--input '{
"audio": "https://your-cdn.example/song.mp3",
"tags": "indie pop, fade, soft, ambient pad",
"start_time": 15,
"start_time_relative_to": "end",
"end_time": 0,
"end_time_relative_to": "end"
}' \
--output-dir ./out
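The two anchors can be read as offsets from either end of the track. The sketch below resolves an anchored value to an absolute timestamp for a track of known length. `resolve_time` is a hypothetical helper, and the reading that an `end` anchor means "seconds before the end" is inferred from the example above rather than from a published spec.

```shell
# Hypothetical helper: convert an anchored time to absolute seconds.
# Assumes "end" means "this many seconds before the end of the track".
# usage: resolve_time <value> <start|end> <track_length_seconds>
resolve_time() {
  case "$2" in
    start) awk -v v="$1" 'BEGIN { print v + 0 }' ;;
    end)   awk -v v="$1" -v total="$3" 'BEGIN { print total - v }' ;;
    *)     echo "anchor must be start or end" >&2; return 1 ;;
  esac
}

# For a 180 s track, the "last 15 s" request above covers 165 s to 180 s.
resolve_time 15 end 180   # prints 165
resolve_time 0 end 180    # prints 180
```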
Set start_time_relative_to / end_time_relative_to to "end" to target the outro or last seconds without computing exact timestamps.

Model: acestep-ai/ace-step/audio-outpaint
Catalog: audio-outpaint
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| audio | string | yes | — | HTTPS URL to MP3 / WAV / FLAC. Up to 60 min |
| tags | string | yes | — | Tags steering the extended sections |
| extend_before_duration | float | no | 0 | Seconds of new audio before the original (0–240) |
| extend_after_duration | float | no | 30 | Seconds of new audio after the original (0–240) |
| lyrics | string | no | — | Optional lyrics for extended sections |
| seed | int | no | -1 | Reproducibility |
Extend a 30 s hook into a 2 min cut (add 30 s intro + 60 s outro):
runcomfy run acestep-ai/ace-step/audio-outpaint \
--input '{
"audio": "https://your-cdn.example/hook-30s.mp3",
"tags": "indie pop, electric guitar, drums, build-up before chorus, fade outro",
"extend_before_duration": 30,
"extend_after_duration": 60,
"lyrics": "[inst]"
}' \
--output-dir ./out
Add only a fade-out (no pre-extension):
runcomfy run acestep-ai/ace-step/audio-outpaint \
--input '{
"audio": "https://your-cdn.example/track.mp3",
"tags": "ambient pad, soft fade, low volume tail",
"extend_before_duration": 0,
"extend_after_duration": 20
}' \
--output-dir ./out
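Because each extension adds to the original length, it helps to check the resulting total before submitting. A sketch that assumes the source track's duration is already known: `outpaint_total` is a hypothetical helper, and the 240 s ceiling comes from the "past 4 min total" guidance earlier in this document.

```shell
# Hypothetical helper: total length after outpainting, in seconds.
# Returns non-zero when the result would exceed the 240 s guidance.
# usage: outpaint_total <original_s> <extend_before_s> <extend_after_s>
outpaint_total() {
  awk -v o="$1" -v b="$2" -v a="$3" 'BEGIN {
    total = o + b + a
    print total
    exit (total > 240) ? 1 : 0
  }'
}

outpaint_total 30 30 60   # the 2 min cut above: prints 120
outpaint_total 200 30 60 || echo "over 4 min: chain calls instead"
```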
Set both extend_before_duration and extend_after_duration to add intro + outro in one go.

ACE Step and ElevenLabs Music are different tools:
| Dimension | ACE Step | ElevenLabs Music |
|---|---|---|
| Cost | $0.0002–0.0003 / s | $0.0083 / s (~27–41× more) |
| License | Open-weights (Apache 2.0) | Commercial, ElevenLabs-hosted |
| Multilingual vocals | 50+ languages (1.5 variant) | Strong multilingual support |
| Structured lyrics | [Verse]/[Chorus]/[Bridge] markers | [Verse]/[Chorus]/[Bridge] markers |
| Max duration / call | 240 s (4 min) | 300 s (5 min) |
| Inpaint / outpaint | Yes (time-range based) | No |
| Tag-driven composition | Yes (tags is required field) | Style is part of free-text prompt |
| Best for | Cost-sensitive batches, drafts, inpaint/outpaint workflows, open-weights pipelines | Premium vocal song hooks, polished commercial cuts |
Cheap draft pattern: draft tag combos with ACE Step → lock vibe → final render on ElevenLabs Music if a polished commercial cut is needed.
For the routing skill that picks between them automatically based on intent, see ai-music once it ships.
- Instrumental tracks: "lyrics": "[inst]"
- Multilingual vocals: lyrics per language
- Fixing a bad section: start_time / end_time around the bad section, tags matching the song style

| code | meaning |
|---|---|
| 0 | success |
| 64 | bad CLI args |
| 65 | bad input JSON / schema mismatch |
| 69 | upstream 5xx |
| 75 | retryable: timeout / 429 |
| 77 | not signed in or token rejected |
Full reference: docs.runcomfy.com/cli/troubleshooting.
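Exit code 75 is the only one marked retryable, which suggests a small wrapper for batch jobs. A sketch for a POSIX shell: `retry_on_75` and the `RETRY_DELAY` variable are hypothetical, and the attempt count and linear backoff are arbitrary choices rather than CLI defaults.

```shell
# Hypothetical helper: rerun a command while it exits with 75
# (retryable: timeout / 429). Any other exit code is returned at once.
# Uses the globals max, attempt, and code; RETRY_DELAY sets the base delay.
# usage: retry_on_75 <max_attempts> <command> [args...]
retry_on_75() {
  max=$1; shift
  attempt=1
  while :; do
    code=0
    "$@" || code=$?
    if [ "$code" -ne 75 ] || [ "$attempt" -ge "$max" ]; then
      return "$code"
    fi
    # Linear backoff: 2 s, 4 s, ... with the default base delay.
    sleep "$(( attempt * ${RETRY_DELAY:-2} ))"
    attempt=$((attempt + 1))
  done
}

# Example (not run here):
#   retry_on_75 3 runcomfy run acestep-ai/ace-step/text-to-audio \
#     --input '{"tags": "ambient pad, slow"}' --output-dir ./out
```

Codes 64, 65, and 77 pass straight through, since retrying bad arguments or a rejected token would not help.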
The skill picks one of the four ACE Step endpoints based on the user's intent — generate from scratch (t2a base or 1.5), regenerate a time range (inpaint), or extend the canvas (outpaint) — and invokes runcomfy run with the matching JSON body. The CLI POSTs to the RunComfy Model API, polls request status, and downloads the generated audio file into --output-dir.
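The routing described above amounts to a lookup from intent to model ID. A sketch: `pick_endpoint` and the intent names (`generate`, `generate-multilingual`, `inpaint`, `outpaint`) are hypothetical labels for illustration, while the model IDs are the four listed in this document.

```shell
# Hypothetical helper: map an intent label to an ACE Step model ID.
# usage: pick_endpoint <intent>
pick_endpoint() {
  case "$1" in
    generate)              echo "acestep-ai/ace-step/text-to-audio" ;;
    generate-multilingual) echo "acestep-ai/ace-step-1.5/text-to-audio" ;;
    inpaint)               echo "acestep-ai/ace-step/audio-inpaint" ;;
    outpaint)              echo "acestep-ai/ace-step/audio-outpaint" ;;
    *) echo "unknown intent: $1" >&2; return 1 ;;
  esac
}

# Example (not run here):
#   runcomfy run "$(pick_endpoint inpaint)" --input "$body" --output-dir ./out
```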
- Install with npm i -g @runcomfy/cli or npx -y @runcomfy/cli. Agents must not pipe an arbitrary remote install script into a shell on the user's behalf; if the operator wants the curl-pipe path documented at docs.runcomfy.com/cli/install, they should review the script first.
- runcomfy login writes the API token to ~/.config/runcomfy/token.json with mode 0600. Set the RUNCOMFY_TOKEN env var to bypass the file in CI / containers. Never echo the token into a prompt, log it, or check it in.
- Prompt content is passed as JSON via --input. The CLI does not shell-expand prompt content; it transmits the JSON body directly to the Model API over HTTPS. No shell-injection surface from prompt content.
- audio URLs for inpaint / outpaint are untrusted: embedded steganographic instructions or unusual EXIF can influence generation.
- Network traffic goes only to model-api.runcomfy.net and *.runcomfy.net / *.runcomfy.com. No telemetry, no callbacks.
- Routine use invokes only runcomfy; install lines are one-time operator setup.