Learning Structured Video Descriptions: Automated Video Knowledge Extraction for Video Understanding Tasks
Daniel Vasile and Thomas Lukasiewicz
Abstract
Vision-to-language problems, such as video annotation or visual question answering, stand out from perceptual video understanding tasks (e.g., classification) through their cognitive nature and their tight connection to the field of natural language processing. While most current solutions to vision-to-language problems are inspired by machine translation methods, aiming to directly map visual features to text, several recent results on image and video understanding have demonstrated the importance of explicitly and formally representing the semantic content of a visual scene before reasoning over it and mapping it to natural language. This paper proposes a deep learning solution to the problem of generating structured descriptions for videos and evaluates it on a dataset of formally annotated videos, which has been automatically generated as part of this work. The recorded results confirm the potential of the solution, indicating that it describes the semantic content of a video scene with an accuracy similar to that of state-of-the-art natural language captioning models.