🎉 [ICCV'25] Four papers have been accepted

Tags: Academic
Time: 2025/06/26

Four papers have been accepted to ICCV 2025.

The International Conference on Computer Vision (ICCV) is the premier international computer vision event, comprising the main conference and several co-located workshops and tutorials (acceptance rate: 24%).

Title: VSC: Visual Search Compositional Text-to-Image Diffusion Model

Authors: Do Huu Dat (VinUniversity), Nam Hyeon-Woo (POSTECH), Po-Yuan Mao (Academia Sinica), Tae-Hyun Oh (KAIST)
Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. By applying segmentation-based localization training, we address cross-attention misalignment, achieving improved accuracy in binding multiple attributes to objects. Our approach outperforms existing compositional text-to-image diffusion models on the T2I-CompBench benchmark, achieving better human-evaluated image quality and improved robustness as the number of binding pairs in the prompt increases.
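
As a rough illustration of the pipeline sketched in the abstract (decompose the prompt, build a visual prototype per sub-prompt, fuse it with the text embedding), the Python snippet below uses toy placeholder encoders; the decomposition rule, function names, and fusion weight are hypothetical assumptions, not the authors' implementation.

```python
# Conceptual sketch of a VSC-style conditioning pipeline (placeholders only).
import numpy as np

def decompose_prompt(prompt: str) -> list[str]:
    # Hypothetical decomposition: split a multi-pair prompt such as
    # "a red car and a blue bench" into per-pair sub-prompts.
    return [p.strip() for p in prompt.split(" and ")]

def encode_text(text: str, dim: int = 16) -> np.ndarray:
    # Pseudo-random stand-in for a CLIP-like text encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def visual_prototype(sub_prompt: str, n_images: int = 4, dim: int = 16) -> np.ndarray:
    # Stand-in for "generate a few images for the sub-prompt, embed them,
    # and average the embeddings into a visual prototype".
    rng = np.random.default_rng(abs(hash("img:" + sub_prompt)) % (2**32))
    proto = rng.normal(size=(n_images, dim)).mean(axis=0)
    return proto / np.linalg.norm(proto)

def fused_conditioning(prompt: str, alpha: float = 0.5) -> np.ndarray:
    # Fuse each sub-prompt's text embedding with its visual prototype and
    # stack the pairwise embeddings as the conditioning signal.
    subs = decompose_prompt(prompt)
    fused = [alpha * encode_text(s) + (1 - alpha) * visual_prototype(s) for s in subs]
    return np.stack(fused)

cond = fused_conditioning("a red car and a blue bench")
print(cond.shape)  # (num_sub_prompts, dim)
```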

Title: JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Authors: Kwon Byung-Ki (POSTECH & MSRA), Qi Dai (MSRA), Lee Hyoseok (POSTECH), Chong Luo (MSRA), Tae-Hyun Oh (KAIST)
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefits and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose: adaptive scheduling weights, which depend on the noise level of each modality, and an unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation.
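
To make the "control the timestep of each branch" idea concrete, here is a minimal sketch, assuming a toy two-branch denoiser; `JointModel` and its dimensions are hypothetical placeholders, not the released JointDiT architecture.

```python
# Per-branch timestep control in a joint RGB-depth diffusion model (toy sketch).
import torch
import torch.nn as nn

class JointModel(nn.Module):
    # Toy stand-in: one head per modality, conditioned on both inputs and
    # their (possibly different) timesteps.
    def __init__(self, dim: int = 8):
        super().__init__()
        self.rgb_head = nn.Linear(2 * dim + 2, dim)
        self.depth_head = nn.Linear(2 * dim + 2, dim)

    def forward(self, rgb, depth, t_rgb, t_depth):
        t = torch.stack([t_rgb, t_depth]).expand(rgb.shape[0], 2)
        x = torch.cat([rgb, depth, t], dim=-1)
        return self.rgb_head(x), self.depth_head(x)

model = JointModel()
rgb_noise, depth_noise = torch.randn(1, 8), torch.randn(1, 8)
clean_rgb, clean_depth = torch.zeros(1, 8), torch.zeros(1, 8)

# Joint generation: both branches start from noise (t = 1 for both).
model(rgb_noise, depth_noise, torch.tensor(1.0), torch.tensor(1.0))

# Depth estimation: the RGB branch is clean (t = 0), only depth is denoised.
model(clean_rgb, depth_noise, torch.tensor(0.0), torch.tensor(1.0))

# Depth-conditioned image generation: depth is clean, RGB is denoised.
model(rgb_noise, clean_depth, torch.tensor(1.0), torch.tensor(0.0))
```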

Title: VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Authors: Kim Sung-Bin (POSTECH & The University of Texas at Austin), Jeongsoo Choi (KAIST), Puyuan Peng (The University of Texas at Austin), Joon Son Chung (KAIST), Tae-Hyun Oh (KAIST), David Harwath (The University of Texas at Austin)
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perceptual studies and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
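
The two components named in the abstract (a visual adapter and an audio-visual fusion layer) can be pictured roughly as below; module names, dimensions, and the cross-attention fusion rule are illustrative assumptions, not the paper's exact design.

```python
# Toy sketch: project facial features into a codec-LM embedding space and
# fuse them into the speech-token stream with cross-attention.
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    # Maps per-frame facial features into the NCLM token-embedding space.
    def __init__(self, face_dim: int = 512, token_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, token_dim), nn.GELU(), nn.Linear(token_dim, token_dim)
        )

    def forward(self, face_feats):  # (B, T_video, face_dim)
        return self.proj(face_feats)

class AudioVisualFusion(nn.Module):
    # Speech-token hidden states attend to the adapted visual features.
    def __init__(self, token_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, speech_states, visual_feats):
        fused, _ = self.attn(speech_states, visual_feats, visual_feats)
        return self.norm(speech_states + fused)

adapter, fusion = VisualAdapter(), AudioVisualFusion()
face = torch.randn(2, 75, 512)       # e.g. 3 s of video at 25 fps
speech = torch.randn(2, 150, 1024)   # speech-token hidden states
print(fusion(speech, adapter(face)).shape)  # torch.Size([2, 150, 1024])
```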

Title: DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding

Authors: Jungbin Cho (Yonsei University), Junwan Kim (Yonsei University), Jisoo Kim (Yonsei University), Minseo Kim (Yonsei University), Mingu Kang (Sungkyunkwan University), Sungeun Hong (Sungkyunkwan University), Tae-Hyun Oh (KAIST), Youngjae Yu (POSTECH)
Human motion is inherently continuous and dynamic, posing significant challenges for generative models. While discrete generation methods are widely used, they suffer from limited expressiveness and frame-wise noise artifacts. In contrast, continuous approaches produce smoother, more natural motion but often struggle to adhere to conditioning signals due to high-dimensional complexity and limited training data. To resolve this discord between discrete and continuous representations, we introduce DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding, a novel method that leverages rectified flow to decode discrete motion tokens in the continuous, raw motion space. Our core idea is to frame token decoding as a conditional generation task, ensuring that DisCoRD captures fine-grained dynamics and achieves smoother, more natural motions. Compatible with any discrete-based framework, our method enhances naturalness without compromising faithfulness to the conditioning signals across diverse settings.
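
As a rough picture of "token decoding as conditional generation via rectified flow", the sketch below integrates a learned velocity field from noise to a continuous motion frame, conditioned on token embeddings; the network, dimensions, and Euler sampler are placeholder assumptions, not the paper's trained model.

```python
# Toy rectified-flow-style decoder for discrete motion tokens.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, motion_dim: int = 63, token_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + token_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x, token_emb, t):
        # Predict velocity dx/dt given current sample, token condition, and time.
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, token_emb, t], dim=-1))

@torch.no_grad()
def decode_tokens(token_emb, model, steps: int = 8):
    # Euler integration of the flow from Gaussian noise (t=0) to motion (t=1).
    x = torch.randn(token_emb.shape[0], 63)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        x = x + dt * model(x, token_emb, t)
    return x

model = VelocityNet()
token_emb = torch.randn(4, 32)   # embeddings of 4 discrete motion tokens
print(decode_tokens(token_emb, model).shape)  # torch.Size([4, 63])
```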