Image-to-Image Translation with Text Guidance
Bowen Li, Philip Torr, and Thomas Lukasiewicz
Abstract
In this paper, we focus on image-to-image translation with text guidance, where a text description is used to control the visual attributes of the image synthesized from a given semantic mask. To accomplish this task, we propose a new multi-stage generative adversarial network with three novel components: (1) a discriminator with dual-directional feedback, which provides the generator at the same stage with fine-grained, region-level supervisory feedback, encouraging it to produce realistic images with finer regional details, and which also enables generators at subsequent stages to complete missing content and correct inappropriate visual attributes; (2) a compatibility loss that guides generators to produce both realistic objects and a realistic background, and to achieve good compatibility between them; and (3) a part-of-speech-tagging-based spatial attention that better builds connections between image regions and the corresponding semantic words. Experimental results demonstrate that our model can effectively control image translation using text descriptions. More importantly, the text input allows our model to produce much more diverse results, and even new synthetic images that lie outside the distribution of the dataset.
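To make the third component more concrete, the following PyTorch sketch illustrates one plausible way a part-of-speech-weighted spatial attention could be implemented; it is not the paper's exact formulation, and the function name, tensor shapes, and the source of the per-word POS weights are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pos_weighted_spatial_attention(region_feats, word_embs, pos_weights):
    """Hypothetical part-of-speech-weighted spatial attention (sketch).

    region_feats: (B, C, H*W) image region features
    word_embs:    (B, C, T)   word embeddings projected to the same space
    pos_weights:  (B, T)      per-word weights derived from POS tags
                              (e.g., larger for nouns and adjectives)
    Returns a word-context feature for every image region: (B, C, H*W).
    """
    # Similarity between every region and every word: (B, H*W, T)
    attn = torch.bmm(region_feats.transpose(1, 2), word_embs)
    # Emphasize visually grounded words via their POS-derived weights
    attn = attn * pos_weights.unsqueeze(1)
    attn = F.softmax(attn, dim=-1)
    # Aggregate word embeddings for each region: (B, C, H*W)
    context = torch.bmm(word_embs, attn.transpose(1, 2))
    return context

# Minimal usage example with random tensors
B, C, HW, T = 2, 256, 64, 12
regions = torch.randn(B, C, HW)
words = torch.randn(B, C, T)
weights = torch.rand(B, T)  # in practice, produced from a POS tagger
out = pos_weighted_spatial_attention(regions, words, weights)
print(out.shape)  # torch.Size([2, 256, 64])
```

The resulting per-region word context could then be fused with the generator's image features, which is one way to bias synthesis toward the semantic words that describe each region.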