Enhancing Text-to-Image Diffusion Models with 3D-Aware Control
- 11:00, 29th November 2024 (Michaelmas Term 2024), the Strachey Room in the Robert Hooke Building
Recent text-to-image diffusion models have shown tremendous capabilities in generating diverse, high-quality images. However, many attributes, such as non-rigid motions, camera poses and parameters, and materials, cannot be precisely described by text alone. In this talk, I will dive into two main streams of work in my PhD that aim to enable control over these 3D-aware attributes. First, we present Continuous 3D Words, a special set of input tokens that provide continuous user control over several 3D-aware attributes (e.g., time-of-day illumination, bird wing pose, dolly zoom effect, and object orientation) for image generation and editing. Second, we present ZeST, a zero-shot, training-free method for image-to-image material transfer. We show that by providing guidance in geometry and lighting, we can successfully transfer material properties such as transparency and metallic appearance from the exemplar, without explicit image decomposition. These works shed light on an exciting future direction: enhancing generative models, for both static images and videos, into flexible renderers with fine-grained, 3D-aware control.
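To give a flavour of the continuous-token idea, the sketch below shows one way a continuous 3D-aware attribute (e.g., object orientation) could be mapped to a learnable token embedding and appended to a text encoder's output before conditioning a diffusion model. This is an illustrative assumption of the general mechanism, not the authors' released implementation; the class name, dimensions, and conditioning tensor are hypothetical.

```python
# Minimal sketch (assumed names and dimensions, not the official code):
# a small MLP maps a scalar attribute value to a prompt-token embedding
# that is concatenated with frozen text-encoder outputs.
import torch
import torch.nn as nn


class ContinuousAttributeToken(nn.Module):
    """Map a scalar 3D-aware attribute to a prompt-token embedding."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, value: torch.Tensor) -> torch.Tensor:
        # value: (batch,) scalars in a normalised range, e.g. orientation / 360
        return self.mlp(value.unsqueeze(-1))  # (batch, embed_dim)


# Usage: append the learned attribute token to the text-prompt embeddings
# that condition the denoising network (stand-in tensors shown here).
token_net = ContinuousAttributeToken()
text_embeds = torch.randn(2, 77, 768)             # placeholder encoder output
orientation = torch.tensor([0.25, 0.75])          # e.g. 90° and 270°, normalised
attr_token = token_net(orientation).unsqueeze(1)  # (2, 1, 768)
conditioning = torch.cat([text_embeds, attr_token], dim=1)  # (2, 78, 768)
```

Because the attribute enters as a continuous value rather than a discrete word, the same token network can interpolate smoothly between settings at inference time.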