CSGO: Content-Style Composition in Text-to-Image Generation

Peng Xing, Haofan Wang, Yanpeng Sun, wangqixun, Baixu, Hao Ai, Jen-Yuan Huang, Zechao Li

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

The advancement of image style transfer has been fundamentally constrained by the absence of large-scale, high-quality datasets with explicit content-style-stylized supervision. Existing methods predominantly adopt training-free paradigms (e.g., image inversion), which limit controllability and generalization due to the lack of structured triplet data. To bridge this gap, we design a scalable and automated pipeline that constructs and purifies high-fidelity content-style-stylized image triplets. Leveraging this pipeline, we introduce IMAGStyle, the first large-scale dataset of its kind, containing 210K diverse and precisely aligned triplets for style transfer research. Empowered by IMAGStyle, we propose CSGO, a unified, end-to-end trainable framework that decouples content and style representations via independent feature injection. CSGO jointly supports image-driven style transfer, text-driven stylized generation, and text-editing-driven stylized synthesis within a single architecture. Extensive experiments show that CSGO achieves state-of-the-art controllability and fidelity, demonstrating the critical role of structured synthetic data in unlocking robust and generalizable style transfer. Source code: https://github.com/instantX-research/CSGO