Clustering Generative Adversarial Networks for Story Visualization
Bowen Li, Philip Torr, and Thomas Lukasiewicz
Abstract
Story visualization aims to generate a sequence of images, one for each sentence in a given sequence of sentences, such that each image semantically matches its sentence and the images within a story are consistent with each other. Current methods generate story images with a heavy architecture consisting of two generative adversarial networks, one for image quality and one for story consistency, and additionally rely on segmentation masks or auxiliary captioning networks. In this paper, we aim to build a concise, single-GAN network that depends neither on additional semantic information nor on captioning networks. To this end, we propose an approach to story visualization based on contrastive learning and clustering learning. Our network uses contrastive losses between language and visual information to maximize the mutual information between the two modalities, and further extends them with clustering learning during training to capture semantic similarity across modalities. As a result, the discriminator provides the generator with comprehensive feedback on both image quality and story consistency at the same time, enabling a single GAN-based network to produce high-quality synthetic results. Extensive experiments on two datasets demonstrate that our single-GAN network achieves a major step up from previous methods, while using only 32.7% and 1.9% of their parameters in the generator and the discriminator, respectively. Our approach improves FID from 78.67 to 44.75 and FSD from 93.88 to 40.54 on Pororo-SV, and establishes a strong benchmark with an FID of 68.06 and an FSD of 11.24 on Abstract Scenes.
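To illustrate the contrastive objective mentioned above, a standard InfoNCE-style loss between paired sentence and image embeddings can be written as follows; the notation ($s_i$ and $v_i$ for the sentence and image features of the $i$-th pair, $\tau$ for a temperature, $B$ for the batch size) is an illustrative assumption and not necessarily the exact formulation used in this work:

$$\mathcal{L}_{\text{contrast}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp\!\big(\mathrm{sim}(s_i, v_i)/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(s_i, v_j)/\tau\big)},$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity. Minimizing such a loss is a standard way of maximizing a lower bound on the mutual information between the language and visual modalities.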