Prior work has established test-time training (TTT) as a general framework
to further improve a trained model at test time. Before making a prediction on
each test instance, the model is trained on the same instance using a
self-supervised task, such as image reconstruction with masked autoencoders.
We extend TTT to the streaming setting, where multiple test instances (video
frames in our case) arrive in temporal order. Our extension is online TTT: the
current model is initialized from the previous model, then trained on the current
frame and a small window of frames immediately before it. Online TTT significantly
outperforms the fixed-model baseline on four tasks across three real-world datasets.
The relative improvement is 45% and 66% for instance and panoptic segmentation, respectively.
Surprisingly, online TTT also outperforms its offline variant, which accesses more
information by training on all frames from the entire test video regardless of
temporal order. This differs from previous findings using synthetic videos.
We conceptualize locality as the advantage of online over offline TTT. We analyze
the role of locality with ablations and a theory based on the bias-variance trade-off.
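
The online procedure described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the implementation used in this work: frames are short vectors, the self-supervised task is a linear reconstruction of the "masked" half of a frame from its visible half (a stand-in for a masked autoencoder), and the function names (`train`, `online_ttt`, `make_stream`), the window size, the step count, and the learning rate are all hypothetical. There is no main-task head here, so the self-supervised loss on the current frame serves as a proxy for prediction quality. The synthetic stream drifts slowly over time and is constructed so that a local window helps; it says nothing about the reported results.

```python
from collections import deque


def predict(W, visible):
    """Linear reconstruction of the masked part from the visible part."""
    return [sum(w * v for w, v in zip(row, visible)) for row in W]


def ssl_loss(W, frame, n_visible=2):
    """Squared reconstruction error on one frame (toy self-supervised loss)."""
    visible, masked = frame[:n_visible], frame[n_visible:]
    pred = predict(W, visible)
    return sum((p - m) ** 2 for p, m in zip(pred, masked))


def train(W, frames, steps, lr, n_visible=2):
    """Full-batch gradient descent on the mean loss over `frames`.

    Returns a new weight matrix; the input W is left unchanged.
    """
    W = [row[:] for row in W]
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for frame in frames:
            visible, masked = frame[:n_visible], frame[n_visible:]
            pred = predict(W, visible)
            for i, (p, m) in enumerate(zip(pred, masked)):
                for j, v in enumerate(visible):
                    grad[i][j] += 2.0 * (p - m) * v / len(frames)
        for i in range(len(W)):
            for j in range(len(W[0])):
                W[i][j] -= lr * grad[i][j]
    return W


def online_ttt(W0, stream, window=2, steps=5, lr=0.1):
    """Online TTT: start from the previous model, train on the current frame
    plus the `window` frames immediately before it, then evaluate."""
    W = W0
    recent = deque(maxlen=window + 1)  # current frame + `window` earlier ones
    losses = []
    for frame in stream:
        recent.append(frame)
        W = train(W, list(recent), steps, lr)  # initialized from previous W
        losses.append(ssl_loss(W, frame))
    return losses


def make_stream(n=30):
    """Assumed toy stream: the visible-to-masked relation drifts with time."""
    stream = []
    for t in range(n):
        a = 1.0 + 0.05 * t
        visible = [1.0, 0.5]
        stream.append(visible + [a * v for v in visible])
    return stream


W0 = [[1.0, 0.0], [0.0, 1.0]]  # "trained" model, matched to the first frame
stream = make_stream()

# Fixed-model baseline: never updated at test time.
fixed_total = sum(ssl_loss(W0, f) for f in stream)

# Online TTT: one model carried along the stream.
online_total = sum(online_ttt(W0, stream))

# Offline variant: a single model trained on all frames, ignoring order.
W_offline = train(W0, stream, steps=50, lr=0.1)
offline_total = sum(ssl_loss(W_offline, f) for f in stream)

print(fixed_total, offline_total, online_total)
```

In this toy, the offline model must fit the whole drifting stream with one set of weights, while the online model only needs to fit a few neighboring frames at a time. This is the sense of locality the sketch is meant to convey.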