ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation
Authors: Kim Youwang (POSTECH), Lee Hyoseok (KAIST), Park Subin (UNIST), Gerard Pons-Moll (University of Tübingen), Tae-Hyun Oh (KAIST)
We introduce ELITE, a method for Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in the wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage that leverages both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies, which are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.
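To make the two-stage design above more concrete, here is a minimal, illustrative Python sketch of the pipeline structure. The function names (mgpm_init, render, enhance_single_step), the per-vertex Gaussian layout, and the loss terms are placeholder assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def mgpm_init(mesh_vertices: np.ndarray) -> dict:
    """Feed-forward Mesh2Gaussian-style initialization: one Gaussian per mesh vertex (assumed layout)."""
    n = mesh_vertices.shape[0]
    return {
        "mean": mesh_vertices.copy(),     # 3D Gaussian centers initialized on the tracked head mesh
        "scale": np.full((n, 3), 0.01),   # small isotropic initial scales
        "color": np.full((n, 3), 0.5),    # neutral initial colors
    }

def render(gaussians: dict, h: int = 16, w: int = 16) -> np.ndarray:
    """Stand-in for a differentiable Gaussian rasterizer; returns an image-shaped array."""
    return np.resize(gaussians["color"].mean(axis=0), (h, w, 3))

def enhance_single_step(rendering: np.ndarray) -> np.ndarray:
    """Stand-in for the rendering-guided single-step diffusion enhancer:
    it refines the current rendering rather than denoising from pure noise."""
    return np.clip(rendering + 0.05 * np.random.randn(*rendering.shape), 0.0, 1.0)

# Stage 1: fast avatar initialization from a tracked head mesh (random vertices here).
gaussians = mgpm_init(np.random.rand(500, 3))

# Stage 2: test-time generative adaptation, supervised jointly by real frames and
# enhancer-refined renderings (the synthetic supervision mentioned in the abstract).
for frame in (np.random.rand(16, 16, 3) for _ in range(4)):
    rendered = render(gaussians)
    pseudo_gt = enhance_single_step(rendered)
    loss = np.mean((rendered - frame) ** 2) + np.mean((rendered - pseudo_gt) ** 2)
    # The actual method would backpropagate this loss through a differentiable rasterizer
    # to update the Gaussian parameters; here we only report it.
    print(f"adaptation loss: {loss:.4f}")
```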
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Authors: Sohwi Lim (KAIST), Lee Hyoseok (KAIST), Jungjoon Park (KAIST), Tae-Hyun Oh (KAIST)
Human perception of visual similarity is inherently adaptive and subjective, depending on the users’ interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process from visual feature extraction, allowing highly efficient, multi-conditioned retrieval with fixed visual embeddings. We also construct CLAY-EVAL, a synthetic evaluation dataset for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves state-of-the-art retrieval accuracy and notable computational efficiency compared to previous works.
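As a rough illustration of how fixed visual embeddings can be reused under different textual conditions, the sketch below modulates cosine similarity with a per-dimension weighting derived from a condition embedding. The weighting rule and the averaging over conditions are assumptions for illustration, not the exact CLAY formulation; the point is that images are encoded once, and switching or combining conditions only touches the lightweight similarity computation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
d = 512
image_embs = l2_normalize(rng.standard_normal((1000, d)))   # precomputed once, reused for every query
query_emb  = l2_normalize(rng.standard_normal(d))           # embedding of the query image
cond_embs  = l2_normalize(rng.standard_normal((2, d)))      # e.g., "color" and "background" text conditions

def conditional_similarity(query, gallery, condition):
    """Cosine similarity in a condition-modulated subspace (illustrative rule)."""
    w = np.abs(condition)                    # per-dimension relevance derived from the text condition
    q = l2_normalize(query * w)
    g = l2_normalize(gallery * w)
    return g @ q

# Multi-conditioned retrieval: combine per-condition similarities without re-encoding any image.
scores = np.mean([conditional_similarity(query_emb, image_embs, c) for c in cond_embs], axis=0)
top5 = np.argsort(-scores)[:5]
print("top-5 retrieved indices:", top5)
```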
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
Authors: Chengxin Liu (KAIST), Wonseok Choi (POSTECH), Chenshuang Zhang (KAIST), Tae-Hyun Oh (KAIST)
Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent works show that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment can be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on this observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should attend only to important visual tokens during decoding, eliminating interference from irrelevant regions. To achieve this, we propose a token-dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns across different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate it on various datasets covering visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of the baselines.
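A minimal sketch of the inference-time idea follows: visual tokens are scored by how much their activations change across decoding stages, and text-to-visual attention is restricted to the highest-scoring ones. The scoring rule, the quantile threshold, and the single-head attention used here are simplifying assumptions rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_txt, d = 64, 8, 32
vis_act_early = rng.standard_normal((n_vis, d))   # visual-token activations at an early decoding stage
vis_act_late  = vis_act_early + rng.standard_normal((n_vis, d)) * rng.random((n_vis, 1))

# Token dynamics: tokens whose activations shift the most across stages are treated as important.
dynamics = np.linalg.norm(vis_act_late - vis_act_early, axis=-1)
keep = dynamics >= np.quantile(dynamics, 0.75)     # keep the top-25% most dynamic visual tokens

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# One text-to-visual attention step with unimportant visual tokens masked out.
q = rng.standard_normal((n_txt, d))                # text-token queries
k = vis_act_late                                   # visual-token keys
scores = q @ k.T / np.sqrt(d)
scores[:, ~keep] = -1e9                            # eliminate interference from irrelevant regions
attn = softmax(scores)
print("attention mass on kept visual tokens:", attn[:, keep].sum(axis=-1).round(3))
```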
PAVAS: Physics-Aware Video-to-Audio Synthesis
Authors: Oh Hyun-Bin (POSTECH), Yuhta Takida (Sony AI), Toshimitsu Uesaka (Sony AI), Tae-Hyun Oh (KAIST), Yuki Mitsufuji (Sony AI)
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving object's mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect the underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce the Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations.
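The sketch below illustrates the kind of physical cues described above: a finite-difference velocity from a recovered 3D trajectory combined with a (hypothetically VLM-estimated) mass, plus a simple correlation between physical and auditory attributes as a stand-in for APCC. The impact-energy proxy and the use of a Pearson correlation are illustrative assumptions; the abstract does not give the exact definitions.

```python
import numpy as np

fps = 30.0
# Recovered 3D trajectory of the moving object (T x 3); here a synthetic falling motion.
t = np.arange(0, 1, 1 / fps)
trajectory = np.stack([np.zeros_like(t), np.zeros_like(t), 2.0 - 4.9 * t**2], axis=-1)
mass_kg = 0.5                                    # would come from the VLM-based mass estimate

velocity = np.diff(trajectory, axis=0) * fps     # finite-difference velocity per frame
speed = np.linalg.norm(velocity, axis=-1)
impact_speed = speed[-1]                         # speed just before contact
impact_energy = 0.5 * mass_kg * impact_speed**2  # illustrative physical cue fed to the audio model
print(f"impact speed {impact_speed:.2f} m/s, energy proxy {impact_energy:.2f} J")

# Illustrative physics-audio consistency check over a set of clips: correlate a physical
# attribute with a corresponding audio attribute (e.g., peak loudness), as APCC is meant to capture.
physical = np.array([0.2, 1.1, 2.5, 4.0, 6.3])   # per-clip impact-energy proxies
auditory = np.array([0.1, 0.9, 2.2, 4.5, 6.0])   # per-clip audio loudness measurements
apcc_like = np.corrcoef(physical, auditory)[0, 1]
print(f"physics-audio correlation (illustrative): {apcc_like:.3f}")
```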
How Far Can We Go With Synthetic Data for Audio-Visual Sound Source Localization?
Authors: Arda Senocak, Sooyoung Park, Tae-Hyun Oh, Joon Son Chung
SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models
Authors: Chenshuang Zhang (KAIST), Kyeong Seon Kim (KAIST), Chengxin Liu (KAIST), Tae-Hyun Oh (KAIST)
Unlike environmental sounds that mainly indicate event occurrence (e.g., a dog barking), human speech carries rich semantics and temporal structure. Despite the advances of audio-visual large language models (LLMs) in video understanding, it remains unexplored whether current models can accurately align speech content with the corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs, where models generate inaccurate or misleading outputs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech–vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech–vision hallucinations from two complementary perspectives: semantic and temporal. Experimental results demonstrate that most advanced audio-visual LLMs struggle to align speech content with the corresponding visual signals. Our work uncovers a fundamental limitation of current audio-visual LLMs and highlights the need for speech-aware, grounded speech-video perception and comprehension.
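As a rough illustration of how a benchmark like this could score hallucination along the two axes above, the sketch below computes per-axis error rates over question-answer records. The record format and the exact-match criterion are assumptions for illustration, not the benchmark's actual protocol.

```python
from collections import defaultdict

# Each record: evaluation axis ("semantic" or "temporal"), ground-truth answer, model answer.
records = [
    {"axis": "semantic", "gt": "yes",    "pred": "yes"},
    {"axis": "semantic", "gt": "no",     "pred": "yes"},    # model asserts content the speech contradicts
    {"axis": "temporal", "gt": "before", "pred": "after"},  # model misorders speech relative to the visuals
    {"axis": "temporal", "gt": "after",  "pred": "after"},
]

totals, errors = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["axis"]] += 1
    if r["pred"].strip().lower() != r["gt"].strip().lower():
        errors[r["axis"]] += 1      # a mismatch is counted as a hallucination here

for axis in ("semantic", "temporal"):
    print(f"{axis} hallucination rate: {errors[axis] / totals[axis]:.2f}")
```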






