Video Datasets
Not yet GA. Please contact us if you are interested.
Spiral stores video as a source blob plus a compact H.264 index. The source stays in object storage, while the index records enough container and codec metadata to plan sparse frame reads without reparsing the MP4 for every scan.
The current video path is intentionally focused:
- H.264 / AVC video in MP4 containers
- write-time indexing through se.Video(se.Blob(...))
- scan-time frame decode through se.video.read(...)
- MP4 preview generation through se.video.remux(...)
- optional exact-cadence transcoding for prefix-readable training artifacts
Why Index Video
Random frame access in compressed video is not a direct byte lookup. A requested display frame may depend on earlier or later reference frames in decode order, and those reference frames may live in separate MP4 samples. Spiral’s video index captures that frame graph once at write time.
The index stores:
- sample byte offsets and lengths
- display-order and decode-order frame positions
- presentation timestamps and durations
- sync samples and GOP boundaries
- active H.264 reference dependencies
- precomputed decode closures for sparse planning
With that index, scans can request a small set of display frames and fetch only the compressed byte ranges needed to decode those frames.
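As an illustration of how an index like this enables sparse planning, here is a minimal sketch (not Spiral's internal API; the data shapes are assumptions for the example) that merges the byte ranges needed to decode a decode closure:

```python
# Illustrative sketch: given per-sample byte ranges and a precomputed decode
# closure for each display frame, compute the merged byte ranges a sparse
# read must fetch from object storage.

def plan_ranges(sample_ranges, closures, wanted_frames):
    """sample_ranges: sample index -> (offset, length) in the MP4.
    closures: display frame -> set of sample indices needed to decode it.
    Returns merged (offset, end) ranges covering every needed sample."""
    needed = sorted({s for f in wanted_frames for s in closures[f]})
    merged = []
    for s in needed:
        off, length = sample_ranges[s]
        end = off + length
        if merged and off <= merged[-1][1]:  # contiguous or overlapping: extend
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((off, end))
    return merged

# Three samples laid out back to back; frame 2 depends on samples 0 and 2.
sample_ranges = {0: (0, 100), 1: (100, 80), 2: (180, 90)}
closures = {0: {0}, 1: {0, 1}, 2: {0, 2}}
print(plan_ranges(sample_ranges, closures, [2]))  # [(0, 100), (180, 270)]
```

Requesting frame 2 fetches only its closure's byte ranges, skipping sample 1 entirely.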
Install
```bash
pip install pyspiral
```

The optional H.264 transcode path uses the external GPL-separated spiral-h264 encoder bridge:

```bash
pip install 'pyspiral[h264]'
```

For local development, SPIRAL_H264 or SPIRAL_H264_BIN can point at a local spiral-h264 binary.
Write Videos
Use se.Video(se.Blob(...)) when writing MP4 payloads. Spiral stores a canonical video struct with two fields:
- source: the source blob
- index: the generated video index
```python
from spiral import Spiral, expressions as se

sp = Spiral()
clips = sp.project("training").table("clips")

clips.write(
    {
        "clip_id": clip_ids,
        "label": labels,
        "video": [se.Video(se.Blob(payload)) for payload in mp4_payloads],
    }
)
```

This shape keeps source media and planning metadata together. Reads can choose frames from video.index, then fetch bytes from video.source.
Decode Frames In A Scan
Use se.video.read(...) in the scan projection. It decodes selected frames on CPU and returns a struct containing RGB
frame bytes plus frame metadata.
```python
from spiral import Spiral, expressions as se

sp = Spiral()
clips = sp.project("training").table("clips")

scan = sp.scan(
    {
        "clip_id": clips["clip_id"],
        "label": clips["label"],
        "video": se.video.read(clips["video"]),
    }
)

loader = scan.to_data_loader(batch_size=32)
```

Each video value contains:

- frames: per-frame records with packed uint8 RGB bytes and timing metadata
- frame_shape: [height, width, 3]
Map this result to Torch tensors in your loader transform.
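As a minimal sketch of that transform (the per-frame byte field name, rgb here, is an assumption; adjust it to the actual schema), the packed bytes can be reshaped with NumPy and later wrapped with torch.from_numpy in a Torch pipeline:

```python
import numpy as np

def frames_to_array(video_value):
    """Stack a video value's packed RGB frame bytes into one uint8 array.

    Assumes each per-frame record exposes its packed bytes under "rgb";
    returns shape (num_frames, height, width, 3)."""
    h, w, c = video_value["frame_shape"]
    return np.stack([
        np.frombuffer(frame["rgb"], dtype=np.uint8).reshape(h, w, c)
        for frame in video_value["frames"]
    ])

# Two tiny 2x2 RGB frames as stand-in data.
fake = {
    "frame_shape": [2, 2, 3],
    "frames": [{"rgb": bytes(range(12))}, {"rgb": bytes(range(12, 24))}],
}
print(frames_to_array(fake).shape)  # (2, 2, 2, 3)
```

In a Torch loader, follow this with torch.from_numpy(...) and any dtype or channel-order conversion your model expects.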
Pass a list column or auxiliary expression as the second argument when you want sparse frame selection.
```python
import pyarrow as pa
from spiral import expressions as se

frame_indices = se.aux("frame_indices", pa.list_(pa.uint32()))

scan = sp.scan(
    {
        "clip_id": clips["clip_id"],
        "frames": se.video.read(clips["video"], frame_indices, height=224, width=224),
    }
)

key_table = pa.table(
    {
        "clip_id": [101, 202],
        "frame_indices": pa.array([[0, 8, 16, 24], [4, 9, 15]], type=pa.list_(pa.uint32())),
    }
)

batch = scan.to_table(key_table=key_table)
```

Sparse Access Model
The simplest video layout is a baseline P-chain: later frames depend on earlier references. Accessing one frame can require fetching the nearest preceding key frame and every compressed sample needed to reach the target.
Spiral’s index lets the scan planner compute that decode closure explicitly. For random access, the scan fetches only the byte ranges for the selected closure instead of reading the whole MP4.
The planner can choose between exact merged sample ranges, a widened contiguous span, or a full touched-GOP span. The right choice depends on storage backend behavior, range-count limits, and the overfetch ratio for the selected frames.
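A toy policy illustrating that trade-off (the thresholds and the decision order are assumptions for the example; the real planner's heuristics are not documented here):

```python
# Illustrative strategy choice for a sparse read. merged_ranges is a sorted
# list of exact (offset, end) byte ranges; gop_span covers every touched GOP.

def choose_strategy(merged_ranges, gop_span, max_ranges=16, max_overfetch=4.0):
    exact_bytes = sum(end - off for off, end in merged_ranges)
    span = (merged_ranges[0][0], merged_ranges[-1][1])
    span_bytes = span[1] - span[0]
    if len(merged_ranges) <= max_ranges:
        # Few enough ranges for the backend: fetch exactly what is needed.
        return "exact", merged_ranges
    if span_bytes / exact_bytes <= max_overfetch:
        # Too many ranges, but the widened span does not overfetch badly.
        return "widened", [span]
    # Fall back to reading every touched GOP in one pass.
    return "gop", [gop_span]

# Two far-apart ranges: fine as exact ranges, but with a range budget of 1
# the widened span would overfetch 5x, so the planner falls back to the GOP span.
print(choose_strategy([(0, 100), (900, 1000)], (0, 2000)))
print(choose_strategy([(0, 100), (900, 1000)], (0, 2000), max_ranges=1))
```

Which branch wins in practice depends on the storage backend's request pricing and range-count limits, as described above.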
Remux MP4 Previews
Training jobs usually do not want playable MP4s in the hot path. se.video.read(...) returns decoded frames for model
input. Remuxing is a convenience feature for previews, debug clips, and API responses.
se.video.remux(...) is a scan projection expression. It takes the selected H.264 decode closure and writes a new MP4
payload without re-encoding pixels.
```python
import subprocess
import sys
from pathlib import Path

import pyarrow as pa
from spiral import Spiral, expressions as se

sp = Spiral()
clips = sp.project("training").table("clips")

frame_indices = se.aux("frame_indices", pa.list_(pa.uint32()))

scan = sp.scan(
    {
        "clip_id": clips["clip_id"],
        "preview_mp4": se.video.remux(clips["video"], frame_indices),
    }
)

key_table = pa.table(
    {
        "clip_id": [101],
        "frame_indices": pa.array([[0, 8, 16, 24]], type=pa.list_(pa.uint32())),
    }
)

result = scan.to_table(key_table=key_table)
preview_mp4 = result["preview_mp4"][0].as_py()

output_path = Path("preview.mp4")
output_path.write_bytes(preview_mp4)

open_cmd = "open" if sys.platform == "darwin" else "xdg-open"
subprocess.run([open_cmd, str(output_path)], check=True)
```

This is cheap compared with transcoding because it copies the selected compressed samples into a new MP4 container. It is still extra work, so it belongs in preview and export flows rather than the training inner loop.
Exact-Cadence Transcoding
Sparse reads help when the source video already has a favorable frame graph. Training datasets often need a stronger property: keep the full-rate asset, but make an exact lower-rate view cheap to read.
The production-oriented path is an exact-cadence transcode:
- keep the full-rate cadence, for example 30 fps
- retain one or more exact lower-rate cadences, for example 10 fps
- make retained lower-rate frames decode-closed
- pack the MP4 so a retained cadence can be read as a file prefix
For the common 30 -> 10 case, the retained frames are the original display positions 0, 3, 6, .... The encoder must
avoid making those retained frames depend on frames that are omitted from the 10 fps view.
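That constraint can be stated as a small invariant check (an illustrative sketch, not the encoder's actual validation):

```python
# A cadence is decode-closed when every retained frame's references are
# themselves retained, so the lower-rate view never needs a dropped frame.

def cadence_is_decode_closed(num_frames, refs, step):
    """refs: display index -> list of display indices it references."""
    retained = set(range(0, num_frames, step))
    return all(all(r in retained for r in refs.get(f, [])) for f in retained)

# 30 -> 10 fps keeps display positions 0, 3, 6, ...
good = {1: [0], 2: [1], 3: [0], 4: [3], 5: [4], 6: [3]}  # retained refer only to retained
bad = {3: [2]}                                           # frame 3 leans on dropped frame 2
print(cadence_is_decode_closed(7, good, 3))  # True
print(cadence_is_decode_closed(7, bad, 3))   # False
```

The dropped frames (1, 2, 4, 5 here) may still reference retained ones freely; only the retained set must stay closed.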
```bash
spiral video transcode input.mp4 \
    --output /tmp/out.transcode.mp4 \
    --fps 10 \
    --quality 25 \
    --size 1280x720 \
    --preset medium
```

If --output is omitted, the CLI writes next to the input as <stem>.transcode.mp4. The result is a prefix-readable MP4. Lower temporal layers are packed first, so a lower-rate view can be served by reading a prefix rather than a separate file. The full-rate source cadence is always kept; --fps names the lower cadences that should be cheap to read.
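A simplified sketch of the packing idea, treating display frames rather than MP4 samples (the layering rule is an assumption for illustration):

```python
# Assign each frame to the coarsest cadence that retains it, then store
# lower layers first so a low-rate view is a file prefix.

def pack_order(num_frames, steps):
    """steps: cadence strides, coarse to fine, e.g. [3, 1] for 10 fps then 30 fps."""
    def layer(f):
        for i, step in enumerate(steps):
            if f % step == 0:
                return i
        return len(steps)
    return sorted(range(num_frames), key=lambda f: (layer(f), f))

order = pack_order(7, [3, 1])
print(order)  # [0, 3, 6, 1, 2, 4, 5] -- the 10 fps frames come first
```

Reading just the first three packed frames here serves the 10 fps view without touching the rest of the file.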
Use a deeper ladder only when it is valuable enough to justify the quality and validation cost:
```bash
spiral video transcode input.mp4 \
    --output /tmp/out.transcode.mp4 \
    --fps 10,5 \
    --quality 23,27 \
    --multi-track
```

Very low cadences create larger motion-prediction gaps. They can be useful, but they should be validated on the target dataset instead of treated as a free storage win.
--quality has two modes. A single value is x264 CRF. A comma-separated list is exact QP, aligned with the --fps
entries by position. Extra ffmpeg and x264 controls remain available as repeatable raw tokens:
```bash
spiral video transcode input.mp4 \
    --fps 10 \
    --quality 24 \
    --ffmpeg-arg=-t \
    --ffmpeg-arg=60 \
    --x264-param aq-mode=1
```

The patched encoder path lives behind the external spiral-h264 process. Spiral invokes ffmpeg to decode/filter the source into Y4M, passes the schedule to spiral-h264, muxes the returned H.264 samples into MP4, then packs and indexes the result.
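The two --quality modes described above can be sketched as a small parser (assumed semantics reconstructed from the description, not the actual CLI code):

```python
# A single value is x264 CRF; a comma-separated list is exact QP, aligned
# positionally with the --fps entries.

def parse_quality(quality: str, fps: str):
    values = quality.split(",")
    cadences = fps.split(",")
    if len(values) == 1:
        return {"mode": "crf", "crf": float(values[0])}
    if len(values) != len(cadences):
        raise ValueError("QP list must align with --fps entries")
    return {"mode": "qp", "qp": dict(zip(cadences, map(int, values)))}

print(parse_quality("25", "10"))       # CRF mode
print(parse_quality("23,27", "10,5"))  # per-cadence exact QP
```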
Operational Notes
Use spiral video inspect to print the indexed frame graph for an input or transcoded artifact:
```bash
spiral video inspect /tmp/out.pyramid.mp4 --show-frames 24 --max-gops 2
```

The video implementation is built around H.264 specifics rather than a codec-generic abstraction. That is deliberate: the planner needs execution-grade metadata, including reference dependencies and sample byte ranges, not just media file metadata.
For implementation details, see the spql-video crate.