Zachary Lewis
Open source · Free

Turn video into structured data.

A research-grade pipeline that takes a raw video and extracts machine-learnable features — Whisper transcripts, scenes, keyframes, pacing, and more.

Video is opaque. This makes it queryable.

A video file is a wall of pixels and audio. This pipeline breaks it into stages — transcript, scenes, frames, pacing — so you can actually search, compare, and learn from what's inside. Built originally to study preschool animated episodes, it works on any video.
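As a small illustration of what "queryable" can mean once a transcript exists, the sketch below searches timestamped segments for a phrase. It assumes segments shaped like the `start` / `end` / `text` dictionaries that Whisper's Python package returns in its `segments` list. The function name `search_segments` and the sample lines are made up for illustration and are not part of the project.

```python
def search_segments(segments, query):
    """Return (start, end, text) for segments whose text contains the query."""
    needle = query.casefold()
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in segments
        if needle in seg["text"].casefold()
    ]


if __name__ == "__main__":
    # Made-up segments; a real run would load them from the transcript artifact.
    segments = [
        {"start": 0.0, "end": 2.4, "text": " Good morning, everyone!"},
        {"start": 2.4, "end": 5.1, "text": " Let's count the red apples."},
        {"start": 5.1, "end": 7.0, "text": " One apple, two apples."},
    ]
    for start, end, text in search_segments(segments, "apple"):
        print(f"{start:6.1f}s - {end:6.1f}s  {text}")
```

The same pattern extends to scenes or keyframes: once each stage writes timestamped records, a question about the video becomes a filter over those records.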

What you get

📝

Whisper transcription

Speech-to-text with Whisper (the tiny model by default, configurable), plus speaker diarization.

🎬

Scene detection

Find shot boundaries and split episodes into scenes with content-aware detection.

🖼️

Keyframe extraction

Pull representative frames per scene for downstream visual analysis.

🎚️

Normalization

Standardize resolution, framerate, and codec so every input is processed the same way.

📊

Pacing & structure

Surface pacing, motion, color patterns, dialogue structure, and narrative rhythm.

🧬

Embeddings (planned)

CLIP, audio, and text embeddings for comparing episodes — on the roadmap.
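To make the scene, keyframe, and pacing cards above more concrete, here is a minimal sketch of how shot boundaries could feed per-scene keyframe timestamps and a few pacing numbers. It is a simplified stand-in, not the project's implementation: the names `detect_cuts`, `scenes_from_cuts`, and `pacing_summary`, the threshold of 30.0, the midpoint keyframe rule, and the per-frame difference scores are all assumptions. A real run would compute the scores from decoded frames.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start

    @property
    def keyframe_time(self) -> float:
        # Assumed rule: take the scene midpoint as its representative frame.
        return self.start + self.duration / 2


def detect_cuts(diff_scores, threshold=30.0, min_scene_frames=2):
    """Return frame indices where a new shot starts.

    diff_scores[i] is an assumed content-difference score between frame i
    and frame i + 1. A cut is placed at frame i + 1 when the score exceeds
    the threshold and the current shot is at least min_scene_frames long.
    """
    cuts = []
    last_cut = 0
    for i, score in enumerate(diff_scores):
        frame = i + 1
        if score > threshold and frame - last_cut >= min_scene_frames:
            cuts.append(frame)
            last_cut = frame
    return cuts


def scenes_from_cuts(cuts, total_frames, fps):
    """Turn cut frame indices into Scene objects with times in seconds."""
    bounds = [0, *cuts, total_frames]
    return [Scene(a / fps, b / fps) for a, b in zip(bounds, bounds[1:])]


def pacing_summary(scenes):
    """Summarize pacing as scene count, average shot length, and cut rate."""
    total = sum(s.duration for s in scenes)
    if total <= 0:
        raise ValueError("scenes must cover a positive duration")
    return {
        "scene_count": len(scenes),
        "avg_shot_length_s": total / len(scenes),
        "cuts_per_minute": (len(scenes) - 1) * 60 / total,
    }


if __name__ == "__main__":
    scores = [1.0, 2.0, 45.0, 1.5, 0.5, 60.0, 2.0]  # made-up scores for 8 frames
    cuts = detect_cuts(scores)
    scenes = scenes_from_cuts(cuts, total_frames=8, fps=2.0)
    print(cuts)
    print([s.keyframe_time for s in scenes])
    print(pacing_summary(scenes))
```

Keeping scenes as plain start and end times means keyframes and pacing are both derived from the same boundaries, so changing the detection threshold updates everything downstream.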

The pipeline

Each video flows through staged processing — raw → normalized → scenes → features → embeddings — with artifacts saved per stage.

0

Raw ingestion

Import the source video.

1

Normalization

Standardize resolution, framerate, codec.

2

Audio extraction

Pull the audio track.

3

Transcription

Whisper speech-to-text + speaker diarization.

4

Scene detection

Identify shot boundaries.

5

Frame extraction

Keyframes per scene.

6–8

Features & embeddings

Visual/audio features and embeddings (planned).
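The early stages above map naturally onto FFmpeg calls that write into one folder per stage. The sketch below only builds the command lines for normalization and audio extraction; it does not claim to match what `pipeline.py` does. The stage folder names, the output file names (`video.mp4`, `audio.wav`), the 720p / 24 fps defaults, and the helper names are assumptions. The FFmpeg options themselves (`-vf scale`, `-r`, `-c:v`, `-vn`, `-ac`, `-ar`) are standard.

```python
from pathlib import Path

# Assumed folder names, loosely following the stage list above.
STAGES = ["raw", "normalized", "audio", "transcript", "scenes", "frames"]


def stage_dir(episodes_root: Path, name: str, stage: str) -> Path:
    """Return the artifact folder for one stage of one episode."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return episodes_root / name / stage


def normalize_cmd(src: Path, dst: Path, height: int = 720, fps: int = 24):
    """Build an FFmpeg command that fixes resolution, framerate, and codec."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-vf", f"scale=-2:{height}",  # keep aspect ratio, force an even width
        "-r", str(fps),               # constant output framerate
        "-c:v", "libx264", "-c:a", "aac",
        str(dst),
    ]


def extract_audio_cmd(src: Path, dst: Path, sample_rate: int = 16000):
    """Build an FFmpeg command that writes a mono audio track with no video."""
    return [
        "ffmpeg", "-y", "-i", str(src),
        "-vn", "-ac", "1", "-ar", str(sample_rate),
        str(dst),
    ]


def plan(episodes_root: Path, name: str, source: Path):
    """Return the commands for stages 1 and 2, in order."""
    normalized = stage_dir(episodes_root, name, "normalized") / "video.mp4"
    audio = stage_dir(episodes_root, name, "audio") / "audio.wav"
    return [
        normalize_cmd(source, normalized),
        extract_audio_cmd(normalized, audio),
    ]


if __name__ == "__main__":
    for cmd in plan(Path("episodes"), "demo", Path("input.mp4")):
        print(" ".join(cmd))
```

To actually run a planned command, create the stage folder first and pass the list to `subprocess.run(cmd, check=True)`. Building the commands separately from running them keeps each stage easy to inspect and rerun.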

Quick start

Requires Python 3.11+ and FFmpeg. Speaker diarization also needs a Hugging Face access token.

terminal
# set up the environment
python -m venv .venv
source .venv/bin/activate   # macOS / Linux
pip install -r requirements.txt

# run the pipeline on a video (stages 0 → 3 = transcript)
python pipeline.py

Artifacts land under episodes/<name>/ — one folder per stage (audio, transcript, scenes, frames…).
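After a run, a quick way to check what each stage produced is to walk the episode folder. This is a minimal sketch assuming only the one-folder-per-stage layout described above; the helper name `list_artifacts` and the demo file names are hypothetical, and the demo builds a throwaway folder so it runs without real data.

```python
import tempfile
from pathlib import Path


def list_artifacts(episode_dir: Path) -> dict[str, list[str]]:
    """Map each stage folder under an episode to its sorted file names."""
    return {
        stage.name: sorted(p.name for p in stage.iterdir() if p.is_file())
        for stage in sorted(episode_dir.iterdir())
        if stage.is_dir()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        episode = Path(tmp) / "episodes" / "demo"
        # Hypothetical artifacts standing in for real pipeline output.
        for stage, filename in [("audio", "audio.wav"), ("transcript", "transcript.json")]:
            (episode / stage).mkdir(parents=True)
            (episode / stage / filename).touch()
        for stage, files in list_artifacts(episode).items():
            print(f"{stage}: {files}")
```

Pointing `list_artifacts` at a real `episodes/<name>/` folder shows at a glance which stages have completed for that video.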

Free and open source.

Grab it on GitHub — or if you need custom media analysis built for your data, let's talk.
