Auto-Caption Generation: Whisper + FFmpeg in a Node.js Worker

Source: DEV Community
Captions are no longer optional for short-form video. Widely cited industry figures suggest that 85% or more of social media videos are watched without sound. If your pipeline produces clips without captions, you're shipping an inferior product.

This post covers the full implementation: audio extraction, Whisper transcription, timing alignment, and burning captions directly into the video with FFmpeg. This is part of the caption stack used by ClipSpeedAI.

The Approach: Hardcoded vs. Soft Captions

Two options exist:

Soft captions (SRT/VTT): A separate subtitle track that the viewer can toggle. Doesn't work reliably on Instagram or TikTok uploads.
Hardcoded captions (burned in): Rendered into the video pixels permanently. Works everywhere.

For YouTube Shorts, TikTok, and Instagram Reels uploads, hardcoded captions are the only reliable option. That's what we're building.

Step 1: Extract Audio From the Clip

// extract_audio.js
import { execa } from 'execa';
import path from 'path';

export async function extractAudio(video