wjs-overlaying-video

Post-production for a video clip: cover, captions, illustrations, CTA, and custom motion graphics, all composed in ONE HyperFrames project and rendered in a SINGLE final encode. There is no cascade of decodes and re-encodes (each pass in such a cascade degrades quality and burns time).

When to use

Downstream of /wjs-segmenting-video: the segmentation skill hands you cropped clips + per-clip SRTs; this skill turns them into upload-ready MP4s with cover/captions/illustrations/CTA.
User has a finished video and wants to dress it up with motion graphics: opening hook, key-quote callout, closing slogan, chapter cards, AI-generated cover as first frame.
User wants HTML/CSS-quality captions on a video (kinetic word-by-word highlighting, custom fonts, large outlined text, seekable per cue).
User wants illustration overlays at specific hook moments: diagrams, big text emphasis, flow charts.

Don't use for:

Splitting one long video into clips → use /wjs-segmenting-video.
Creating the source SRT → use /wjs-transcribing-audio (then /wjs-translating-subtitles if you need a different language).
Full HyperFrames productions where the source isn't a fixed video → use hyperframes directly.
Uploading to WeChat Channels (微信视频号) or Douyin (抖音) → neither has a public API, so this skill only produces the MP4; the upload itself is manual.

What this skill IS — and IS NOT

Is	Is not
---	---
Everything that goes ON TOP of a video clip: cover, caption, chapter, illustration, CTA	Cutting / cropping a video (that's `/wjs-segmenting-video` + `/wjs-reframing-video`)
One HyperFrames composition per clip = ONE final encode	A multi-step decode/encode cascade
`cover` is the literal first frame of the output (platforms auto-pick it as thumbnail)	A separate thumbnail file the user uploads alongside
Captions are HTML/CSS — `-webkit-text-stroke` for white-on-anything readability	libass burn-in (deprecated)
Illustrations: re-usable `stack` / `hammer` patterns + custom escape hatch	One bespoke HTML/CSS per illustration without re-use
AI covers regenerated at native target aspect (1024×1792 for vertical, 1536×1024 for horizontal)	Single 1024×1536 default that letterboxes or crops on the platform

The pipeline

clip.mp4 + clip.zh-CN.burn.srt   (from /wjs-segmenting-video hand-off)
   ↓
1. (Optional) Generate AI cover via gpt-image-2
   make_cover.py --segments S.json --out output/ --size 1024x1792
   cover_NN_slug.png

2. Scaffold a HyperFrames project per clip
   hf_clip_NN/1080/{index.html, clip.mp4, cover.png, captions.json}

3. Compose: cover scene + body video + caption track + chapter chip
            + 1-2 illustrations at hook moments + CTA scene

4. npm run check (lint + validate + visual inspect)
   npm run render → upload-ready MP4

A 2-minute vertical 1080×1920 composition renders in roughly 2-3 minutes on an M-series Mac.

Standard overlay types (the 6 building blocks)

Every clip's final composition is built from some combination of these. The agent picks the right ones per clip: typically all 6 for a podcast highlight, or just 1-2 for a single annotation overlay.

1. `cover` — full-frame AI image as first frame

The cover IS the first frame (no animation, no zoom), so platforms that auto-pick the first frame as the thumbnail get your designed cover by default. Always verify with ffmpeg -ss 0 -vframes 1: frame 0 must NOT be black, or platform thumbnails will be black.
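The first-frame check can be scripted. The sketch below is one possible way to do it and is not part of the skill itself: it asks ffmpeg for frame 0 as a small grayscale raw image and computes the mean brightness. The function names, the 64×64 probe size, the threshold of 16, and the example path are all illustrative assumptions.

```javascript
const { execFileSync } = require("child_process");

// Mean brightness (0-255) of a buffer of 8-bit grayscale pixels.
function meanLuma(grayPixels) {
  if (grayPixels.length === 0) throw new Error("no pixel data");
  let sum = 0;
  for (const v of grayPixels) sum += v;
  return sum / grayPixels.length;
}

// Assumed threshold: treat a frame as black when its mean luma is below 16.
function isBlackFrame(grayPixels, threshold = 16) {
  return meanLuma(grayPixels) < threshold;
}

// Decode frame 0 of `mp4Path` as a 64x64 grayscale raw image (4096 bytes).
// Requires ffmpeg on PATH; this function is defined here but not called.
function readFirstFrameGray(mp4Path) {
  return execFileSync("ffmpeg", [
    "-v", "error",
    "-ss", "0", "-i", mp4Path,
    "-vframes", "1",
    "-vf", "scale=64:64",
    "-pix_fmt", "gray",
    "-f", "rawvideo", "-",
  ]);
}

// Usage with a hypothetical output path:
//   if (isBlackFrame(readFirstFrameGray("output/clip_01.mp4"))) {
//     throw new Error("frame 0 is black; the thumbnail would be black too");
//   }
```

A mean-luma test only catches a fully dark frame. It does not confirm that the frame is the intended cover, so a visual look at the extracted frame is still worthwhile.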

HTML:

<div id="cover" class="clip" data-start="0" data-duration="1.6"
     data-track-index="1" data-layout-allow-overflow>
  <img src="cover.png" alt="" data-layout-allow-overflow />
</div>

CSS:

#cover { position: absolute; inset: 0; background: #0c0d10; overflow: hidden; }
#cover img { position: absolute; inset: 0; width: 100%; height: 100%; object-fit: cover; }

Generation: use /wjs-segmenting-video/scripts/make_cover.py (it wraps gpt-image-2 images edit, with the midpoint frame as the reference image):

# For 1080×1920 vertical output (WeChat Channels / Douyin):
make_cover.py --segments S.json --out output/ --size 1024x1792 [--single N]

# For 1920×1080 horizontal output (YouTube / Bilibili):
make_cover.py --segments S.json --out output/ --size 1536x1024

Aspect must match output frame. --size 1024x1536 (2:3, the script default) gets letterboxed or cropped on 9:16 output, so always pass 1024x1792 for vertical. The cover image's aspect is what the viewer sees full-frame, so any mismatch is visible. Re-roll a single cover with --single N; the codex provider can fail transiently mid-batch.
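To make the aspect mismatch concrete, here is a small sketch of the arithmetic, assuming the cover is displayed with the object-fit: cover rule from the CSS above (uniform scale to fill the frame, centered crop). The helper name coverCropFraction is made up for illustration.

```javascript
// Fraction of the image area hidden when the image is scaled to fill the
// frame the way CSS `object-fit: cover` does (uniform scale, centered crop).
function coverCropFraction(imgW, imgH, frameW, frameH) {
  const scale = Math.max(frameW / imgW, frameH / imgH);
  const visibleW = frameW / (imgW * scale); // share of image width still visible
  const visibleH = frameH / (imgH * scale); // share of image height still visible
  return 1 - visibleW * visibleH;
}

const frame = { w: 1080, h: 1920 }; // vertical output frame
for (const [w, h] of [[1024, 1792], [1024, 1536]]) {
  const pct = (coverCropFraction(w, h, frame.w, frame.h) * 100).toFixed(1);
  console.log(`${w}x${h} on ${frame.w}x${frame.h}: ${pct}% cropped`);
}
```

Under these assumptions the 1024x1536 default loses roughly 15.6% of the image at the sides, while 1024x1792 loses roughly 1.6%. The small remainder exists because 1024x1792 is 4:7 rather than exactly 9:16.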

Codex auth required: the script calls the codex CLI via gpt-image-2-skill. If ~/.codex/auth.json is missing, the script exits with an error. See gpt-image-2-skill for setup.

2. `caption` — outlined HTML/CSS captions synced to SRT

White text with a thick black stroke, no bubble background, vertically centered in a fixed zone (so 1-line vs 2-line captions don't make the visual center jump up and down).

HTML:

<div id="caption" class="clip" data-start="{body_start}"
     data-duration="{body_dur}" data-track-index="4"></div>

CSS (vertical 1080×1920):

#caption {
  position: absolute; left: 0; right: 0; bottom: 240px;
  height: 240px; z-index: 10; overflow: visible;
}
#caption .bubble {
  position: absolute; top: 50%; left: 50%;
  display: inline-block;
  padding: 0 24px;
  font-size: 56px; line-height: 1.18; font-weight: 900;
  color: #ffffff; max-width: 1020px; text-align: center;
  -webkit-text-stroke: 5px #000;
  paint-order: stroke fill;
  text-shadow: 0 6px 12px rgba(0,0,0,0.55), 0 0 4px rgba(0,0,0,0.6);
  letter-spacing: 0.01em;
}

JS (one bubble per cue + GSAP fade in/out, all centered at container midpoint):

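// SRT cues are loaded as inline JSON. Each cue's start/end is offset
// by the cover-scene duration (e.g., 1.5s) so the timing aligns with
// the composition timeline (not the body's own t=0).
// `tl` is assumed to be the composition's GSAP timeline, created earlier
// in the page script; `gsap` must already be loaded.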
const captionEl = document.getElementById("caption");
const groups = JSON.parse(document.getElementById("captions-data").textContent);
const bubbles = groups.map((g, i) => {
  const b = document.createElement("span");
  b.className = "bubble"; b.id = "cap-" + i;
  b.textContent = g.text; b.style.opacity = "0";
  captionEl.appendChild(b);
  return b;
});
// GSAP xPercent/yPercent for centering (CSS transform would get
// overwritten the moment we tween y).
gsap.set(bubbles, { xPercent: -50, yPercent: -50 });
groups.forEach((g, i) => {
  const el = bubbles[i];
  tl.fromTo(el, { opacity: 0, y: 12 }, { opacity: 1, y: 0, duration: 0.18, ease: "power2.out" }, g.start);
  const exitStart = Math.max(g.start + 0.18, g.end - 0.12);
  tl.to(el, { opacity: 0, duration: 0.12, ease: "power2.in" }, exitStart);
  tl.set(el, { opacity: 0 }, g.end);
});
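The exit timing in the loop above is easy to misread, so here is a plain-JavaScript sketch (no GSAP needed) that reproduces the same arithmetic for one cue. The helper name cueWindows and the sample cue times are illustrative assumptions; the 0.18 and 0.12 durations are taken from the loop.

```javascript
const FADE_IN = 0.18;  // entrance tween duration used in the caption loop
const FADE_OUT = 0.12; // exit tween duration used in the caption loop

// Timeline positions the caption loop would use for a single cue.
function cueWindows({ start, end }) {
  // Never start the exit before the entrance has finished.
  const exitStart = Math.max(start + FADE_IN, end - FADE_OUT);
  return {
    fadeIn: [start, start + FADE_IN],
    fadeOut: [exitStart, exitStart + FADE_OUT],
    hardOff: end, // position of the final tl.set(el, { opacity: 0 })
  };
}

console.log(cueWindows({ start: 1.5, end: 4.0 })); // ordinary cue
console.log(cueWindows({ start: 5.0, end: 5.2 })); // cue shorter than 0.30s
```

For an ordinary cue the fade-out ends exactly at the cue's end. For a cue shorter than FADE_IN + FADE_OUT, the clamp keeps the exit from overlapping the entrance, at the cost of a fade-out window that runs slightly past the cue's end.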

Source SRT: slice + shift before inlining. Take clip_NN.zh-CN.burn.srt from segmentation, parse each cue, add the cover duration to every start/end, and inline the result as JSON in the captions-data element that the JS above parses.
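The parse-and-shift step can be sketched as follows. This is a minimal illustration, not the skill's own script: the function names are made up, the sample SRT text is invented, and the 1.6s offset simply mirrors the data-duration of the cover div shown earlier, on the assumption that the body starts exactly when the cover ends. The output shape ({ start, end, text } in seconds) matches what the caption JS reads.

```javascript
// Round to milliseconds so shifted times stay tidy in the JSON.
const round3 = (x) => Math.round(x * 1000) / 1000;

// Convert an SRT timestamp ("HH:MM:SS,mmm") to seconds.
function srtTimeToSeconds(stamp) {
  const m = /^(\d+):(\d{2}):(\d{2})[,.](\d{1,3})$/.exec(stamp.trim());
  if (!m) throw new Error(`bad SRT timestamp: ${stamp}`);
  const [, h, min, s, ms] = m;
  return Number(h) * 3600 + Number(min) * 60 + Number(s) + Number(ms.padEnd(3, "0")) / 1000;
}

// Parse SRT text into [{ start, end, text }] and add `offset` seconds to every cue.
function srtToShiftedCues(srtText, offset) {
  const blocks = srtText.replace(/\r\n?/g, "\n").trim().split(/\n{2,}/);
  const cues = [];
  for (const block of blocks) {
    const lines = block.split("\n");
    const timingIdx = lines.findIndex((l) => l.includes("-->"));
    if (timingIdx === -1) continue; // skip blocks without a timing line
    const [from, to] = lines[timingIdx].split("-->");
    cues.push({
      start: round3(srtTimeToSeconds(from) + offset),
      end: round3(srtTimeToSeconds(to) + offset),
      text: lines.slice(timingIdx + 1).join("\n"),
    });
  }
  return cues;
}

// Invented sample input; a real run would read the clip's SRT file instead.
const sampleSrt = [
  "1", "00:00:00,000 --> 00:00:02,400", "first line", "",
  "2", "00:00:02,400 --> 00:00:05,000", "second line", "",
].join("\n");

const COVER_DURATION = 1.6; // assumption: equals the cover scene's data-duration
console.log(JSON.stringify(srtToShiftedCues(sampleSrt, COVER_DURATION), null, 2));
```

The printed JSON is what would be inlined into the page. The sketch covers only parsing and shifting; any slicing of a longer SRT down to the clip's range is assumed to have happened upstream.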


