Enhancing Text-to-Image Diffusion Models with 3D-Aware Control

Ta-Ying Cheng ( University of Oxford )

Recent text-to-image diffusion models have shown tremendous capabilities in generating diverse, high-quality images. However, many attributes, such as non-rigid motions, camera poses/parameters, and materials, cannot be precisely described by text alone. In this talk, I will dive into two main streams of work in my PhD that aim to enable control over these 3D-aware attributes. First, we present Continuous 3D Words, a special set of input tokens that provide continuous user control over several 3D-aware attributes (e.g., time-of-day illumination, bird wing pose, dolly zoom effect, and object orientation) for image generation and editing. Second, we present ZeST, a zero-shot, training-free method for image-to-image material transfer. We show that by providing guidance on geometry and lighting, we can successfully transfer material properties such as transparency and metallic appearance from an exemplar without explicit image decomposition. These works shed light on an exciting future direction: enhancing generative models, for both static images and videos, to become flexible renderers with fine-grained, 3D-aware controls.
