Chimera

Compositional Image Generation using Part-Based Concepting

1 Arizona State University
2 Georgia Institute of Technology
3 Google DeepMind
Figure 1: Part-aware composition with Chimera: the model takes input images along with their specified part prompts (e.g., “head of a lion,” “body of a horse”) and generates a new entity that combines these parts into a coherent output.

Abstract

Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this taxonomy, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce PartEval, an objective metric that assesses the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms the baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality.
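
As an illustration of how such a taxonomy could be turned into training prompts, the sketch below enumerates multi-part composition prompts from a toy list of (part, subject) semantic atoms. The atom list, prompt template, and function names are illustrative assumptions, not the paper's actual data-generation script.

from itertools import combinations

# Toy list of (part, subject) semantic atoms; the paper's taxonomy contains
# 464 such pairs, so this short list is illustrative only.
SEMANTIC_ATOMS = [
    ("head", "lion"),
    ("body", "horse"),
    ("wings", "eagle"),
    ("tail", "peacock"),
]

def compose_prompt(atoms):
    """Join part-level phrases into a single composition prompt (assumed template)."""
    phrases = [f"the {part} of a {subject}" for part, subject in atoms]
    return "a single creature with " + ", ".join(phrases)

def enumerate_prompts(atoms, num_parts):
    """Yield every composition that draws its parts from distinct subjects."""
    for combo in combinations(atoms, num_parts):
        if len({subject for _, subject in combo}) == num_parts:
            yield compose_prompt(combo)

if __name__ == "__main__":
    for prompt in enumerate_prompts(SEMANTIC_ATOMS, 2):
        print(prompt)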

Method

Figure 2: An overview of our generation pipeline. User inputs are processed by Grounded-Segment-Anything (SAMv2) to produce segmented images, which are then encoded into the IP+ embedding space. This embedding conditions our fine-tuned PiT model, acting as a rectified flow prior, to generate a target latent that is subsequently rendered into the final image by the SDXL decoder.
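
For readers who want a concrete picture of this pipeline, the snippet below is a minimal sketch of the inference loop it describes. The callables standing in for the segmenter, IP+ encoder, PiT prior, and SDXL decoder are hypothetical placeholders, and the sampling schedule and flow convention are assumptions rather than the released implementation.

import torch

@torch.no_grad()
def generate(image_paths, part_prompts, segmenter, encoder, prior, decoder,
             steps=30, device="cuda"):
    """Sketch of the pipeline in Figure 2. The callables are placeholders:
    segmenter -- part segmentation in the style of Grounded-Segment-Anything (SAMv2),
    encoder   -- maps a segmented part into the IP+ image-embedding space,
    prior     -- the fine-tuned PiT rectified-flow prior, velocity = prior(x_t, t, cond),
    decoder   -- the SDXL decoder that renders a latent into an image."""
    # 1. Segment each source image with its part prompt (e.g. "head of a lion").
    part_crops = [segmenter(path, prompt)
                  for path, prompt in zip(image_paths, part_prompts)]

    # 2. Encode the segmented parts into image-conditioning embeddings.
    cond = torch.cat([encoder(crop) for crop in part_crops], dim=1)

    # 3. Integrate the rectified-flow prior from noise (t = 0) toward the target
    #    latent (t = 1) with plain Euler steps (an assumed convention).
    latent = torch.randn(1, prior.latent_dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        latent = latent + prior(latent, t, cond) * dt

    # 4. Render the predicted latent into the final image with the SDXL decoder.
    return decoder(latent)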

Results

Figure 3: Scores assigned by human annotators for object generations of increasing complexity. For example, “2-part” refers to generating objects using two provided image-text pairs.
Table 1: Generation quality comparison across Animals and Vehicles with different part counts. Lower is better for FID and KID.
Table 2: Comparison across Animals and Vehicles using PartEval (automatic part-level fidelity metric) and Human Evaluation. Higher is better.
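
The exact PartEval formulation is not reproduced on this page, so the sketch below shows only one assumed way a part-level fidelity score could be computed: ground each part prompt in the generated image and compare the resulting crop against the corresponding source part with an image-embedding similarity. The segmenter and encoder callables are hypothetical placeholders.

import torch

@torch.no_grad()
def part_eval_score(generated_image, source_parts, part_prompts, segmenter, clip_encoder):
    """Assumed sketch of a part-level fidelity score (not the paper's exact
    PartEval definition). For each prompted part, locate it in the generated
    image and compare the crop to the corresponding source part with a cosine
    similarity between image embeddings, averaging over parts."""
    scores = []
    for source_crop, prompt in zip(source_parts, part_prompts):
        # segmenter stands in for an open-vocabulary part segmenter and is
        # assumed to return None when the prompted part cannot be found.
        gen_crop = segmenter(generated_image, prompt)
        if gen_crop is None:  # a missing part counts as a compositional miss
            scores.append(0.0)
            continue
        sim = torch.cosine_similarity(clip_encoder(gen_crop),
                                      clip_encoder(source_crop), dim=-1)
        scores.append(sim.item())
    return sum(scores) / len(scores)
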
Figure 4: Example results from Chimera for 4-part compositions.

BibTeX

@article{singh2025chimera,
    title={Chimera: Compositional Image Generation using Part-Based Concepting},
    author={Singh, Shivam and Chen, Yiming and Chatterjee, Agneet and Raj, Amit and Hays, James and Yang, Yezhou and Baral, Chitta},
    journal={arXiv preprint arXiv:TBD},
    year={2025}
}