Turn video into structured data.
A research-grade pipeline that takes a raw video and extracts machine-learnable features — Whisper transcripts, scenes, keyframes, pacing, and more.
Video is opaque. This makes it queryable.
A video file is a wall of pixels and audio. This pipeline breaks it into stages — transcript, scenes, frames, pacing — so you can actually search, compare, and learn from what's inside. Built originally to study preschool animated episodes, it works on any video.
What you get
Whisper transcription
Speech-to-text with Whisper (the tiny model by default, configurable), plus speaker diarization.
Scene detection
Find shot boundaries and split episodes into scenes with content-aware detection.
Keyframe extraction
Pull representative frames per scene for downstream visual analysis.
Normalization
Standardize resolution, framerate, and codec so every input is processed the same way.
Pacing & structure
Surface pacing, motion, color patterns, dialogue structure, and narrative rhythm.
Embeddings (planned)
CLIP, audio, and text embeddings for comparing episodes — on the roadmap.
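As a rough illustration of the content-aware scene detection mentioned above, the sketch below marks a cut wherever the difference between consecutive frames jumps past a threshold. It is a minimal sketch, not this pipeline's actual detector: the names `find_cuts` and `to_scenes`, the threshold value, and the idea of feeding in precomputed per-frame difference scores are all assumptions made for illustration.

```python
from typing import Sequence


def find_cuts(diffs: Sequence[float], threshold: float = 30.0,
              min_scene_len: int = 2) -> list[int]:
    """Return frame indices where a new scene is assumed to start.

    diffs[i] is a hypothetical content-difference score between frame i
    and frame i + 1 (for example, a mean absolute pixel difference).
    """
    cuts: list[int] = []
    last_cut = 0
    for i, score in enumerate(diffs):
        frame = i + 1  # the cut lands on the frame after the jump
        if score > threshold and frame - last_cut >= min_scene_len:
            cuts.append(frame)
            last_cut = frame
    return cuts


def to_scenes(cuts: Sequence[int], n_frames: int) -> list[tuple[int, int]]:
    """Turn cut indices into (start, end) frame ranges, end exclusive."""
    bounds = [0, *cuts, n_frames]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy scores with two large jumps: between frames 2/3 and frames 5/6.
diffs = [1.0, 2.0, 55.0, 1.5, 0.5, 80.0, 2.0]
cuts = find_cuts(diffs)
scenes = to_scenes(cuts, n_frames=len(diffs) + 1)
print(cuts)
print(scenes)
```

The `min_scene_len` guard is there so that a burst of high scores, such as a flash or a fast pan, does not produce a run of one-frame scenes.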
The pipeline
Each video flows through staged processing — raw → normalized → scenes → features → embeddings — with artifacts saved per stage.
Raw ingestion
Import the source video.
Normalization
Standardize resolution, framerate, codec.
Audio extraction
Pull the audio track.
Transcription
Whisper speech-to-text + speaker diarization.
Scene detection
Identify shot boundaries.
Frame extraction
Keyframes per scene.
Features & embeddings
Visual/audio features and embeddings (planned).
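The staged flow above can be pictured as an ordered list of stages that is run up to a chosen stop point. This is only an illustration: the stage names follow the list above and the 0-based numbering follows the quick start's "stages 0 → 3 = transcript" comment, but `STAGES`, `stages_to_run`, and the numbering of stages past 3 are assumptions, not the actual interface of `pipeline.py`.

```python
STAGES = [
    "raw_ingestion",        # 0: import the source video
    "normalization",        # 1: standardize resolution, framerate, codec
    "audio_extraction",     # 2: pull the audio track
    "transcription",        # 3: Whisper speech-to-text + diarization
    "scene_detection",      # 4: identify shot boundaries
    "frame_extraction",     # 5: keyframes per scene
    "features_embeddings",  # 6: planned
]


def stages_to_run(start: int = 0, stop: int = len(STAGES) - 1) -> list[str]:
    """Return the stage names from start to stop, both inclusive."""
    if not 0 <= start <= stop < len(STAGES):
        raise ValueError(f"invalid stage range {start}..{stop}")
    return STAGES[start:stop + 1]


# Stages 0 through 3 end with the transcript.
for index, name in enumerate(stages_to_run(0, 3)):
    print(f"stage {index}: {name}")
```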
Quick start
Requires Python 3.11+ and FFmpeg. Speaker diarization also needs a Hugging Face access token.
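The requirements above can be checked before installing anything. The following is a hypothetical preflight helper, not part of the project: `check_prereqs` is an invented name, and reading the token from an `HF_TOKEN` environment variable is an assumption about how the token is supplied.

```python
from __future__ import annotations

import os
import shutil
import sys


def check_prereqs(python_version: tuple[int, int],
                  ffmpeg_path: str | None,
                  hf_token: str | None) -> list[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if python_version < (3, 11):
        problems.append("Python 3.11+ is required")
    if ffmpeg_path is None:
        problems.append("ffmpeg was not found on PATH")
    if not hf_token:
        problems.append("no Hugging Face token set (needed for diarization)")
    return problems


problems = check_prereqs(
    python_version=sys.version_info[:2],
    ffmpeg_path=shutil.which("ffmpeg"),
    hf_token=os.environ.get("HF_TOKEN"),  # variable name is an assumption
)
for line in problems or ["all prerequisites look fine"]:
    print(line)
```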
# set up the environment
python -m venv .venv
source .venv/bin/activate  # macOS / Linux
pip install -r requirements.txt

# run the pipeline on a video (stages 0 → 3 = transcript)
python pipeline.py
Artifacts land under episodes/<name>/ — one folder per stage (audio, transcript, scenes, frames…).
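Here is a small sketch of how that layout could be addressed from code. Only the `episodes/<name>/` root and the four folder names come from the description above. The helper `stage_dirs`, the episode name `pilot`, and the presence check are assumptions for illustration, and the real tree may contain more stage folders.

```python
from pathlib import Path

# Stage folders named in the description; the real pipeline may add more.
STAGE_FOLDERS = ("audio", "transcript", "scenes", "frames")


def stage_dirs(name: str, root: Path = Path("episodes")) -> dict[str, Path]:
    """Map each stage folder name to its path under episodes/<name>/."""
    return {stage: root / name / stage for stage in STAGE_FOLDERS}


# "pilot" is a made-up episode name.
for stage, path in stage_dirs("pilot").items():
    status = "present" if path.is_dir() else "missing"
    print(f"{stage:<10} {path.as_posix()}  [{status}]")
```

Checking which folders exist is one simple way to see how far a run got before resuming it.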
Free and open source.
Grab it on GitHub — or if you need custom media analysis built for your data, let's talk.