Multimodal Media (Image/Video/Speech) Skill — powered by SkillBoss API Hub

1. Goals and scope

This Skill consolidates six multimodal media capabilities into reusable workflows and implementation templates, all routed through SkillBoss API Hub (https://api.skillbossai.com/v1/pilot):

Image generation (text-to-image, image editing, multi-turn iteration)
Image understanding (caption/VQA/classification/comparison, multi-image prompts)
Video generation (text-to-video, aspect ratio/resolution control, reference-image guidance)
Video understanding (summaries, Q&A, timestamped evidence)
Speech generation (TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
Audio understanding (description, transcription, time-range transcription)

> Convention: All API calls go through SkillBoss API Hub /v1/pilot, which automatically routes to the optimal underlying model. Authentication uses a single SKILLBOSS_API_KEY.

2. Quick routing (decide which capability to use)

1) Do you need to produce images?

Need to generate images from scratch or edit based on an image -> use Image generation (see Section 5)

2) Do you need to understand images?

Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)

3) Do you need to produce video?

Need to generate a short video (optionally with native audio) -> use Video generation (see Section 7)

4) Do you need to understand video?

Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)

5) Do you need to read text aloud?

Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)

6) Do you need to understand audio?

Need audio descriptions, transcription, time-range transcription -> use Audio understanding (see Section 10)

3. Unified engineering constraints and I/O spec (must read)

3.0 Prerequisites (dependencies and tools)

Node.js 18+ (match your project version)
No additional SDK required — all calls use standard fetch (built-in to Node 18+):

# No extra install needed for Node.js 18+
# For older environments you can use: npm install node-fetch

3.1 Authentication and environment variables

Put your API key in SKILLBOSS_API_KEY
All requests use Authorization: Bearer $SKILLBOSS_API_KEY

3.2 Shared helper function

All examples below use this shared pilot() helper:

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

3.3 Unified handling of binary media outputs

Images: returned as result.image_url (URL) or result.images[0].url in the response.
Speech (TTS): returned as result.audio_url.
Video: returned as result.video_url; long-running tasks may require polling.

4. Model selection

> SkillBoss API Hub /v1/pilot automatically routes to the optimal underlying model. Use prefer to control the trade-off:

> - "quality" — best output quality

> - "price" — lowest cost

> - "balanced" — balanced quality/cost (default)

No need to specify model names manually. The hub selects the best available model for the requested capability.

5. Image generation

5.1 Text-to-Image

Node.js minimal template

import * as fs from "node:fs";

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const result = await pilot({
  type: "image",
  inputs: {
    prompt: "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
  },
  prefer: "quality",
});

const imageUrl = result["result"]["image_url"];
console.log("Image URL:", imageUrl);

// Download and save the image
const imgResponse = await fetch(imageUrl);
const buffer = Buffer.from(await imgResponse.arrayBuffer());
fs.writeFileSync("out.png", buffer);

REST (curl) minimal template

curl -s -X POST "https://api.skillbossai.com/v1/pilot" \
  -H "Authorization: Bearer $SKILLBOSS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "image",
    "inputs": {
      "prompt": "Create a picture of a nano banana dish in a fancy restaurant",
      "aspect_ratio": "16:9"
    },
    "prefer": "quality"
  }'
# Image URL is at: .result.image_url

5.2 Text-and-Image-to-Image (editing)

Use case: given an image, add/remove/modify elements, change style, color grading, etc.

Node.js minimal template

import * as fs from "node:fs";

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const imageBase64 = fs.readFileSync("input.png").toString("base64");

const result = await pilot({
  type: "image",
  inputs: {
    prompt: "Add a nano banana on the table, keep lighting consistent, cinematic tone.",
    image_data: imageBase64,
    image_mime_type: "image/png",
  },
  prefer: "quality",
});

const imageUrl = result["result"]["image_url"];
const imgResponse = await fetch(imageUrl);
const buffer = Buffer.from(await imgResponse.arrayBuffer());
fs.writeFileSync("edited.png", buffer);

5.3 Multi-turn image iteration

Best practice: use multiple sequential calls with the previous output fed back as image_data for continuous iteration (e.g., generate first, then "only edit a specific region/element", then "make variants in the same style").

5.4 Image generation controls

Pass these in the inputs object:

aspect_ratio: e.g. "16:9", "1:1"
size: e.g. "1024x1024", "1024x576" (16:9)

6. Image understanding

6.1 Two ways to provide input images

Inline image data: suitable for small files (Base64 encoded).
Image URL: pass the URL directly if the image is publicly accessible.

6.2 Inline images minimal template

import * as fs from "node:fs";

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const imageBase64 = fs.readFileSync("image.jpg").toString("base64");

const result = await pilot({
  type: "chat",
  inputs: {
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: `data:image/jpeg;base64,${imageBase64}` },
          },
          {
            type: "text",
            text: "Caption this image, and list any visible brands.",
          },
        ],
      },
    ],
  },
  prefer: "balanced",
});

const text = result["result"]["choices"][0]["message"]["content"];
console.log(text);

6.3 Image URL reference minimal template

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const result = await pilot({
  type: "chat",
  inputs: {
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: "https://example.com/image.jpg" },
          },
          { type: "text", text: "Caption this image." },
        ],
      },
    ],
  },
  prefer: "balanced",
});

const text = result["result"]["choices"][0]["message"]["content"];
console.log(text);

6.4 Multi-image prompts

Append multiple images as multiple entries in the content array; you can mix URLs and inline Base64 bytes.

7. Video generation

7.1 Core features (must know)

Generates high-fidelity short video (default ~8 seconds), supporting native audio generation (dialogue, ambience, SFX).
Supports aspect ratio (16:9 / 9:16), resolution control, and first/last frame guidance via inputs.

7.2 Node.js minimal template

import * as fs from "node:fs";

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const result = await pilot({
  type: "video",
  inputs: {
    prompt: "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.",
    duration: 8,
    aspect_ratio: "16:9",
    resolution: "1080p",
  },
  prefer: "quality",
});

const videoUrl = result["result"]["video_url"];
console.log("Video URL:", videoUrl);

// Download and save
const videoResponse = await fetch(videoUrl);
const buffer = Buffer.from(await videoResponse.arrayBuffer());
fs.writeFileSync("out.mp4", buffer);

7.3 Common controls

Pass these in the inputs object:

aspect_ratio: "16:9" or "9:16"
resolution: "720p" | "1080p" | "4k"
duration: duration in seconds (default 8)
When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language.

7.4 Important limits (engineering fallback needed)

Latency can vary from seconds to minutes; implement timeouts and retries.
Download the video promptly after generation.

Retry with timeout pseudocode

const deadline = Date.now() + 300_000; // 5 min
let result = null;
while (Date.now() < deadline) {
  try {
    result = await pilot({
      type: "video",
      inputs: { prompt: "...", duration: 8 },
      prefer: "quality",
    });
    if (result["result"]["video_url"]) break;
  } catch (e) {
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
if (!result) throw new Error("video generation timed out");
const videoUrl = result["result"]["video_url"];

8. Video understanding

8.1 Video input options

Video URL: for publicly accessible videos.
Base64 inline: for smaller files.
YouTube URL: can analyze public videos by passing the URL in the message.

8.2 Video URL minimal template

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const result = await pilot({
  type: "chat",
  inputs: {
    messages: [
      {
        role: "user",
        content: [
          {
            type: "video_url",
            video_url: { url: "https://example.com/sample.mp4" },
          },
          {
            type: "text",
            text: "Summarize this video. Provide timestamps for key events.",
          },
        ],
      },
    ],
  },
  prefer: "balanced",
});

const text = result["result"]["choices"][0]["message"]["content"];
console.log(text);

8.3 Timestamp prompting strategy

Ask for segmented bullets with "(mm:ss)" timestamps.
Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.

9. Speech generation (Text-to-Speech, TTS)

9.1 Positioning

For "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
SkillBoss API Hub auto-routes to the best TTS model for the given inputs.

9.2 Single-speaker TTS minimal template

import * as fs from "node:fs";

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const result = await pilot({
  type: "tts",
  inputs: {
    text: "Say cheerfully: Have a wonderful day!",
    voice: "Kore",
  },
  prefer: "balanced",
});

const audioUrl = result["result"]["audio_url"];
console.log("Audio URL:", audioUrl);

// Download and save
const audioResponse = await fetch(audioUrl);
const buffer = Buffer.from(await audioResponse.arrayBuffer());
fs.writeFileSync("out.mp3", buffer);

9.3 Multi-speaker TTS

Pass multiple text segments with speaker labels in the text field, using a structured format like "[Speaker1]: Hello\n[Speaker2]: Hi there".

9.4 Voice options and language

The voice field supports named voices (e.g., "alloy", "Kore", "Zephyr", "Puck").
Auto-detects input language; supports 24+ languages.

9.5 "Director notes" (strongly recommended for high-quality voice)

Prefix the text with style directions, e.g.: "Speak in a calm, professional tone: [your content here]".

10. Audio understanding

10.1 Typical tasks

Describe audio content (including non-speech like birds, alarms, etc.)
Generate transcripts
Transcribe specific time ranges
Estimate token/cost for long audio

10.2 Audio transcription (STT) minimal template

import * as fs from "node:fs";
import { Buffer } from "node:buffer";

const SKILLBOSS_API_KEY = process.env.SKILLBOSS_API_KEY;
const API_BASE = "https://api.skillbossai.com/v1";

async function pilot(body) {
  const r = await fetch(`${API_BASE}/pilot`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${SKILLBOSS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  return r.json();
}

const audioB64 = fs.readFileSync("sample.mp3").toString("base64");

const result = await pilot({
  type: "stt",
  inputs: {
    audio_data: audioB64,
    filename: "sample.mp3",
  },
});

const transcript = result["result"]["text"];
console.log(transcript);

10.3 Audio description via chat (for non-transcription tasks)

const audioB64 = fs.readFileSync("sample.mp3").toString("base64");

const result = await pilot({
  type: "chat",
  inputs: {
    messages: [
      {
        role: "user",
        content: [
          {
            type: "audio_url",
            audio_url: { url: `data:audio/mp3;base64,${audioB64}` },
          },
          { type: "text", text: "Describe this audio clip." },
        ],
      },
    ],
  },
  prefer: "balanced",
});

const text = result["result"]["choices"][0]["message"]["content"];
console.log(text);

10.4 Key limits and engineering tips

Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
If the audio file is large, upload it to a publicly accessible URL first and pass the URL instead of Base64.
For very long audio, consider splitting into segments.

11. End-to-end examples (composition)

Example A: Image generation -> validation via understanding

1) Generate product images via type: "image" (specify negative space and consistent lighting in the prompt).

2) Use type: "chat" with image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.

3) If not satisfied, feed the generated image into editing and iterate.

Example B: Video generation -> video understanding -> narration script

1) Generate a short video with type: "video" (include dialogue or SFX in the prompt).

2) Download and save the video.

3) Use type: "chat" with video to produce a storyboard + timestamps + narration copy; then feed the copy to type: "tts".

Example C: Audio understanding -> transcription -> TTS redub

1) Upload meeting audio and transcribe with type: "stt".

2) Use type: "chat" to summarize or extract specific time ranges.

3) Use type: "tts" to generate a "broadcast" version of the summary.

12. Compliance and risk (must follow)

Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.

13. Quick reference (Checklist)

[ ] Set SKILLBOSS_API_KEY environment variable.
[ ] Pick the right type: "image" for image generation, "chat" for understanding tasks, "video" for video generation, "tts" for speech, "stt" for transcription.
[ ] Use prefer: "quality" for best results, "balanced" for cost efficiency.
[ ] Parse responses correctly: images → result.image_url; audio → result.audio_url; video → result.video_url; chat → result.choices[0].message.content; stt → result.text.
[ ] For video generation: set aspect_ratio / resolution, and download promptly.
[ ] For TTS: pass voice name in inputs; use director-style prefix for tone control.
[ ] For large audio/video files: encode to Base64 or host at a URL first.

media

概述