Deep Learning of Invariant Spatio-Temporal Features from Video
Bo Chen, Jo-Anne Ting, Ben Marlin and Nando de Freitas
Abstract
We present a novel hierarchical, distributed model for unsupervised learning of invariant spatio-temporal features from video. Our approach builds on previous deep learning methods and uses the convolutional restricted Boltzmann machine (CRBM) as its basic processing unit. Our model, called the Space-Time Deep Belief Network (ST-DBN), alternates the aggregation of spatial and temporal information so that higher layers capture longer-range statistical dependencies in both space and time. Our experiments show that the ST-DBN outperforms convolutional deep belief networks (CDBNs) applied on a per-frame basis on both discriminative and generative tasks, including action recognition and video denoising. The ST-DBN also has superior feature invariance properties compared to CDBNs, and it can integrate information from both space and time to fill in missing data in video.
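To illustrate why alternating spatial and temporal aggregation lets higher layers see longer-range dependencies, the following minimal sketch tracks how the receptive field of a unit grows with depth. It is bookkeeping only and does not train or evaluate a CRBM. The function name `receptive_fields`, the filter and pooling sizes, and the assumption of stride-1 filters followed by non-overlapping pooling are all hypothetical choices made for illustration; they are not taken from the paper.

```python
def receptive_fields(layers):
    """Return the (spatial, temporal) receptive field after each layer.

    Each layer is a tuple (axis, filter_size, pool_size), where axis is
    "space" or "time". Assumes stride-1 filtering followed by
    non-overlapping pooling along that axis (an illustrative assumption).
    """
    rf = {"space": 1, "time": 1}    # receptive field in input pixels / frames
    jump = {"space": 1, "time": 1}  # input distance between adjacent units
    history = []
    for axis, filter_size, pool_size in layers:
        # Filtering widens the receptive field by (filter_size - 1) steps.
        rf[axis] += (filter_size - 1) * jump[axis]
        # Pooling widens it by (pool_size - 1) steps, then coarsens the grid.
        rf[axis] += (pool_size - 1) * jump[axis]
        jump[axis] *= pool_size
        history.append((rf["space"], rf["time"]))
    return history


# Hypothetical four-layer stack that alternates spatial and temporal layers.
layers = [("space", 3, 2), ("time", 3, 2), ("space", 3, 2), ("time", 3, 2)]
for depth, (pixels, frames) in enumerate(receptive_fields(layers), start=1):
    print(f"layer {depth}: {pixels} pixels wide, {frames} frames long")
```

Under these assumed sizes, each spatial layer leaves the temporal extent unchanged and each temporal layer leaves the spatial extent unchanged, so the two grow in turn as depth increases.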