Knowledge Graph Extraction from Videos
Louis Mahon, Eleonora Giunchiglia, Bowen Li and Thomas Lukasiewicz
Abstract
Nearly all existing methods for video annotation describe videos using natural language sentences. This approach has a number of shortcomings: (i) it is not possible to perform automated information processing on natural language annotations, (ii) natural language contains many syntactic patterns, which must be learnt by such methods, although these patterns are not directly relevant to the task of video annotation, (iii) it is difficult to quantitatively measure performance, as standard metrics (e.g., accuracy and F1-score) are inapplicable, and (iv) annotations are language-specific. In this paper, we propose the task of knowledge graph extraction from videos, i.e., producing a description of the contents of a given video in the form of a knowledge graph. Since no datasets exist for this task, we also include a method to automatically generate them, starting from video-captioning datasets in which videos are annotated with natural language. We then describe an initial deep-learning model for knowledge graph extraction from videos, and report results on MSVD* and MSR-VTT*, two datasets obtained from MSVD and MSR-VTT using our method.
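To illustrate why a structured annotation addresses shortcoming (iii), the following is a minimal sketch of how an F1-score could be computed once annotations are sets of facts rather than sentences. The tuple encoding of facts, the name `f1_score`, and the example facts are assumptions made for illustration only; they are not taken from the datasets or the model described in this paper.

```python
from typing import Set, Tuple

# Hypothetical encoding (an assumption for this sketch): each fact is a
# tuple whose first element is a predicate or entity class, followed by
# any arguments, e.g. ("play", "man", "guitar").
Fact = Tuple[str, ...]


def f1_score(predicted: Set[Fact], gold: Set[Fact]) -> float:
    """Return the F1-score of the predicted facts against the gold facts."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # Covers empty inputs and the case of no overlap.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations for a single video (made-up facts).
gold = {("man",), ("guitar",), ("play", "man", "guitar")}
predicted = {("man",), ("guitar",), ("hold", "man", "guitar")}

score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```

Because the comparison is an exact set intersection, no language-specific matching of word order or phrasing is involved, which is the property that sentence-level captions lack.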