Test-Time Training on Video Streams

Renhao Wang 1*, Yu Sun 1*, Yossi Gandelsman 1, Xinlei Chen 2, Alexei A. Efros 1, Xiaolong Wang 3
1UC Berkeley, 2Meta AI, 3UC San Diego
*Equal Contribution


Abstract

Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is trained on the same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances (video frames in our case) arrive in temporal order. Our extension is online TTT: the current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before it. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The relative improvement is 45% and 66% for instance and panoptic segmentation, respectively. Surprisingly, online TTT also outperforms its offline variant, which accesses more information by training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT. We analyze the role of locality with ablations and a theory based on the bias-variance trade-off.
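
The online TTT loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `online_ttt` and `ssl_grad`, the parameters `window`, `lr` and `steps`, and the toy self-supervised task (reconstructing each frame with a plain parameter vector under squared error, standing in for masked-autoencoder reconstruction) are all assumptions made for illustration. What it is meant to show is the control flow: the model is carried over from the previous frame rather than reset, and each update uses only the current frame plus a small window of frames immediately before it.

```python
from collections import deque


def ssl_grad(theta, frame):
    """Gradient of a toy self-supervised loss 0.5 * sum((theta - frame)^2).

    Hypothetical stand-in for a masked-autoencoder reconstruction loss.
    """
    return [t - x for t, x in zip(theta, frame)]


def online_ttt(frames, theta0, window=4, lr=0.5, steps=1):
    """Toy online TTT: one model, updated frame by frame in temporal order.

    frames : iterable of equal-length lists of floats (the video stream)
    theta0 : initial parameters (the fixed-model baseline)
    window : number of most recent frames (current one included) to train on
    """
    theta = list(theta0)            # initialized once, then never reset
    recent = deque(maxlen=window)   # sliding window; oldest frame drops out
    outputs = []
    for frame in frames:
        recent.append(frame)
        for _ in range(steps):
            # Average the self-supervised gradient over the window.
            grad = [0.0] * len(theta)
            for f in recent:
                g = ssl_grad(theta, f)
                grad = [a + b / len(recent) for a, b in zip(grad, g)]
            theta = [t - lr * g for t, g in zip(theta, grad)]
        # A real system would now run the main-task head on `frame`;
        # here we only record the adapted parameters.
        outputs.append(list(theta))
    return outputs


if __name__ == "__main__":
    stream = [[1.0, 2.0]] * 3       # a static "video" of three frames
    for params in online_ttt(stream, theta0=[0.0, 0.0], window=2):
        print(params)
```

In this sketch, resetting `theta` to `theta0` at the start of every iteration would recover per-frame TTT without carry-over, and filling `recent` with every frame of the video before the loop would correspond to the offline variant.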




Results

We experiment with four applications on three real-world datasets: 1) instance and panoptic segmentation on COCO Videos, a new dataset we annotated; 2) semantic segmentation on KITTI-STEP, a public dataset of urban driving videos; 3) colorization on COCO Videos and on a collection of black-and-white films by the Lumière Brothers.


Task 1: COCO Videos - Instance Segmentation



Restaurant

Input Video

Baseline

TTT


Havana

Input Video

Baseline

TTT


Task 2: COCO Videos - Panoptic Segmentation



School

Input Video

Baseline

TTT


Bangkok

Input Video

Baseline

TTT


Task 3: KITTI-STEP - Semantic Segmentation



Video 0002

Input Video

Baseline

TTT


Video 0018

Input Video

Baseline

TTT


Task 4: Video Colorization



L'Arrivée d'un Train En Gare de La Ciotat ("The Arrival of a Train")

Input Video

Baseline

TTT


La Pêche Aux Poissons Rouges ("Fishing for Goldfish")

Input Video

Baseline

TTT


Repas de Bébé ("Baby's Breakfast")

Input Video

Baseline

TTT


COCO Videos Example (Havana)

RGB Ground Truth

Baseline

TTT