Text2Place : Affordance Aware Human Guided Placement

Indian Institute of Science, Bengaluru

(Top): Proposed approach for text-based placement of humans. Given a background image, we predict a plausible semantic region, compatible with the text prompt, in which to place a human. Next, given a few subject images, we perform subject-conditioned inpainting to realistically place the subject in an appropriate pose that follows the scene affordances. (Bottom): Our method enables a) realistic human placement at diverse locations and in diverse poses, along with several downstream applications: b) scene hallucination, generating a compatible scene for a given human pose; c) text-based editing of the placed human; and d) placement of multiple persons. Notably, our method is the first to achieve this solely from textual descriptions of the scene.

Abstract

For a given scene, humans can easily reason about the locations and poses at which to place objects. Designing a computational model that reasons about these affordances is a significant challenge, mirroring the intuitive reasoning abilities of humans. This work tackles the problem of realistic human insertion into a given background scene, which we term Semantic Human Placement. The task is extremely challenging given the diversity of backgrounds, the scale and pose of the generated person, and the need to preserve the person's identity. We divide the problem into two stages: i) learning semantic masks, using text guidance, that localize the regions of the image where humans can be placed, and ii) subject-conditioned inpainting that places a given subject within the semantic mask while adhering to the scene affordances. To learn semantic masks, we leverage the rich object-scene priors of text-to-image generative models and optimize a novel parameterization of the semantic mask, eliminating the need for large-scale training. To the best of our knowledge, we are the first to provide an effective solution for realistic human placement in diverse real-world scenes. The proposed method generates highly realistic scene compositions while preserving both the background and the subject's identity. Further, we present results for several downstream tasks: scene hallucination from a single or multiple generated persons, and text-based attribute editing. Extensive comparisons against strong baselines demonstrate the superiority of our method in realistic human placement.



Method Overview

Text2Place: Affordance Aware Human Guided Placement Methodology

Our approach consists of two stages. a) Semantic mask optimization. Given a background image \( \mathcal{I}_b \), we initialize a blob mask \( \mathcal{M} \), parameterized as a set of Gaussian blobs, and a foreground person image \( \mathcal{I}_p \). The two images are combined into a composite image \( \mathcal{I}_c \), which is used to compute the SDS loss with the action prompt. During optimization, only \( \mathcal{M} \) and \( \mathcal{I}_p \) are updated, via gradients through \( \mathcal{I}_c \). After training, \( \mathcal{M} \) converges to a plausible human placement region, which is then used for inpainting. b) Subject-conditioned inpainting. Given a few subject images, we perform Textual Inversion to obtain a token embedding \( \mathbf{V}^* \). We then use the inpainting pipeline of T2I models to perform personalized inpainting of the subject.
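The mask parameterization and compositing described above can be illustrated with a minimal NumPy sketch. This is a non-differentiable stand-in for intuition only: in the actual method, the blob parameters (centers, scales) and \( \mathcal{I}_p \) would be differentiable tensors updated by backpropagating the SDS loss through \( \mathcal{I}_c \). The specific choices here (isotropic Gaussians, max-combination of blobs) are our assumptions, not details taken from the paper.

```python
import numpy as np

def blob_mask(centers, sigmas, H, W):
    """Render a soft mask M in [0, 1] as the max over K isotropic Gaussian blobs.

    centers: (K, 2) array of (y, x) blob centers in pixels.
    sigmas:  (K,)   array of blob standard deviations in pixels.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    mask = np.zeros((H, W))
    for (cy, cx), s in zip(centers, sigmas):
        d2 = (ys - cy) ** 2 + (xs - cx) ** 2
        mask = np.maximum(mask, np.exp(-d2 / (2.0 * s ** 2)))
    return mask

def composite(bg, person, mask):
    """Form I_c by alpha-blending the person image over the background:
    I_c = M * I_p + (1 - M) * I_b, with the mask broadcast over channels."""
    return mask[..., None] * person + (1.0 - mask[..., None]) * bg
```

In a differentiable framework the same two functions would sit between the blob parameters and the SDS loss, so that the loss gradient moves the blobs toward scene regions where the composited person is plausible under the action prompt.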

Results


Applications

Text based Editing

The user selects elements to delete (shown in blue) and provides a text prompt pertaining to the background. Our diffusion decoder then generates content in the missing region (shown in black).



Pose Variations




Object Scene Placement




Person Scene Hallucination




Single Person Scene Hallucination




Two Person Scene Hallucination





BibTeX

@article{rishubh2024text2place,
  title={Text2Place: Affordance Aware Human Guided Placement},
  author={Parihar, Rishubh and Gupta, Harsh and VS, Sachidanand and Babu, R. Venkatesh},
  journal={arXiv preprint .....},
  year={2024}
}

Acknowledgements

We thank Tejan and the members of the VAL Lab for their valuable suggestions.