🎉 [ICCVw'25] Eleven papers have been accepted!

Tags: Academic
Time: 2025/09/17

  Eleven papers have been accepted to ICCVw 2025

Workshop on Computer Vision for Fashion, Art, and Design

Title: Dress-up: Generating Animatable Clothed 3D Humans via Latent Modeling of 3D Gaussian Texture Maps
Oral presentation
We present Dress-up, a generative framework for creating diverse, animatable 3D human avatars with novel identities and clothing. Unlike prior methods that mainly produce in-domain results with limited variation, Dress-up synthesizes high-fidelity 3D humans with diverse identities and clothing, achieved via efficient latent generative modeling and by leveraging multi-view 3D captures spanning a wide range of identities, poses, and outfits. Specifically, we design a latent-space model of clothed 3D human Gaussian texture maps with a latent diffusion model for realistic clothing and appearance generation. The framework ensures multi-view geometric and texture consistency while remaining robust to novel poses for animation. Dress-up can generate realistic, fully animatable avatars within 5 seconds, supporting real-time deployment. This enables a wide range of creative and immersive applications, from virtual production to interactive VR/AR experiences.
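For intuition about what a 3D Gaussian texture map carries, here is a minimal numpy sketch that unpacks a UV-space parameter map into per-Gaussian attributes. The 14-channel layout, map resolution, and function name are assumptions for illustration only, not the actual format used by Dress-up.

```python
import numpy as np

def unpack_gaussian_texture_map(tex_map: np.ndarray) -> dict:
    """Split a UV-space Gaussian texture map into per-Gaussian attributes.

    Assumed channel layout (illustrative only):
      0-2 position offset, 3-6 rotation quaternion,
      7-9 log scale, 10 opacity logit, 11-13 RGB color.
    """
    h, w, c = tex_map.shape
    assert c == 14, "this sketch assumes 14 channels per texel"
    flat = tex_map.reshape(-1, c)  # one Gaussian per texel
    return {
        "xyz_offset": flat[:, 0:3],
        "rotation": flat[:, 3:7] / np.linalg.norm(flat[:, 3:7], axis=1, keepdims=True),
        "scale": np.exp(flat[:, 7:10]),
        "opacity": 1.0 / (1.0 + np.exp(-flat[:, 10])),
        "rgb": flat[:, 11:14],
    }

# Example: a 256x256 map as a stand-in for the latent diffusion decoder's output.
params = unpack_gaussian_texture_map(np.random.randn(256, 256, 14).astype(np.float32))
print(params["xyz_offset"].shape)  # (65536, 3)
```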
Title: MUSE: A Training-free Multimodal Unified Semantic Embedder for Structure-Aware Retrieval of Scalable Vector Graphics and Images
Authors: Kyeongseon Kim, Baek Seong-Eun, Lee Jung-Mok, Tae-Hyun Oh
While Scalable Vector Graphics (SVG) codes appear either as plain text or, when rendered, as images, they are structured representations that encode geometric and layout information. However, existing methods typically convert SVGs into raster images, discarding their structural details. Similarly, previous sentence embedding methods generate high-quality text embeddings but do not extend to structured or visual modalities such as SVGs. To address these challenges, we propose the first training-free multimodal embedding method that uses a Multimodal Large Language Model (MLLM) to project text, images, and SVG code into an aligned space. Our method consists of two main components: (1) multimodal Explicit One-word Limitation (mEOL), which produces compact, semantically grounded embeddings across modalities without training; and (2) a semantic SVG module that rewrites SVG code by generating missing or non-descriptive components through visual reasoning. This lets the model embed structural signals overlooked in prior work. Our approach not only introduces the first SVG retrieval setting but also achieves strong empirical performance, surpassing prior methods including training-based models by up to +20.5% Recall@1 on a repurposed VGBench dataset. These results demonstrate that structural cues can significantly enhance semantic alignment in multimodal embeddings, enabling effective retrieval without any fine-tuning.
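As a rough illustration of retrieval in an aligned embedding space, the sketch below scores queries against a gallery by cosine similarity and computes Recall@1. The embeddings are hypothetical placeholders for what the mEOL step would produce; this is not the paper's pipeline.

```python
import numpy as np

def recall_at_1(query_emb: np.ndarray, gallery_emb: np.ndarray, gt_idx: np.ndarray) -> float:
    """Cross-modal retrieval by cosine similarity in a shared embedding space."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                 # (num_queries, num_gallery)
    top1 = sims.argmax(axis=1)     # best match per query
    return float((top1 == gt_idx).mean())

# Toy example with hypothetical pre-computed embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(100, 512))                      # e.g. text queries
svg_emb = text_emb + 0.1 * rng.normal(size=(100, 512))      # e.g. matching SVG embeddings
print(recall_at_1(text_emb, svg_emb, np.arange(100)))
```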
Title: Patch-wise Retrieval: An Interpretable Instance-Level Image Matching
Authors: Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Dong-ju Jeong, Jinyoung Hwang, Tae-Hyun Oh
Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance.
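A minimal sketch of the patch-wise scoring idea: compare a global query descriptor against a database image's patch descriptors and keep the best-matching patch, so each score is also spatially grounded. The patch count, feature dimension, and max aggregation are assumptions for illustration.

```python
import numpy as np

def patchify_score(query_desc: np.ndarray, db_patch_desc: np.ndarray):
    """Score one database image by its best-matching patch.

    query_desc:    (d,)   global descriptor of the query image
    db_patch_desc: (p, d) descriptors of the database image's patches
    Returns (score, patch index), so the match is spatially grounded.
    """
    q = query_desc / np.linalg.norm(query_desc)
    p = db_patch_desc / np.linalg.norm(db_patch_desc, axis=1, keepdims=True)
    sims = p @ q
    best = int(sims.argmax())
    return float(sims[best]), best

# Toy example with hypothetical 768-dim features and a 4x4 grid of patches.
rng = np.random.default_rng(0)
score, patch_idx = patchify_score(rng.normal(size=768), rng.normal(size=(16, 768)))
print(score, patch_idx)  # ranking by `score` also tells you *where* the match is
```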
Title: Text-Embedded 3DGS as Enhanced Language Embeddings
Authors: Dahye Lee, Kim Yu-Ji, GeonU Kim, Jaesung Choe, Tae-Hyun Oh
We introduce enhanced language embeddings with text features into 3D Gaussian Splatting (3DGS) for open-vocabulary 3D scene understanding. Existing approaches, commonly referred to as language-embedded 3DGS, primarily rely on fixed CLIP image embeddings. This choice leads to modality gaps and degraded retrieval performance. In this work, we introduce 3D text embedding spaces as an enhanced form of language embedding that aligns naturally with text queries and supports intra-domain retrieval. By exploring alternative text embedding models, our approach mitigates inter-domain modality gaps and achieves improved 3D text-querying performance over prior image-based feature spaces. This study establishes a new direction for constructing language-embedded 3D representations, leading toward more effective search spaces for open-vocabulary 3D scene understanding.
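For intuition, the sketch below performs intra-domain (text-to-text) querying against per-Gaussian text-space embeddings via cosine similarity. The embedding dimension, threshold, and the assumption that the query and the per-Gaussian features come from the same text-embedding model are illustrative, not the paper's exact setup.

```python
import numpy as np

def query_gaussians(text_query_emb: np.ndarray, gaussian_text_emb: np.ndarray, thresh: float = 0.25):
    """Select Gaussians whose embedded text feature matches an open-vocabulary query.

    text_query_emb:    (d,)   embedding of the text query
    gaussian_text_emb: (n, d) text-space feature stored on each 3D Gaussian
    """
    q = text_query_emb / np.linalg.norm(text_query_emb)
    g = gaussian_text_emb / np.linalg.norm(gaussian_text_emb, axis=1, keepdims=True)
    sims = g @ q                          # intra-domain (text-to-text) similarity
    return np.nonzero(sims > thresh)[0]   # indices of matching Gaussians

# Toy example with hypothetical 384-dim text embeddings on 10k Gaussians.
rng = np.random.default_rng(0)
print(query_gaussians(rng.normal(size=384), rng.normal(size=(10_000, 384))).shape)
```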
Title: ViMP: Visual Motion Prompting
Authors: Baek Seong-Eun, Nam Hyeon-Woo, Lee Jung-Mok, Tae-Hyun Oh
Recent Multimodal Large Language Models (MLLMs) have exhibited remarkable generalization performance across a variety of tasks. Despite this promise, their video understanding performance falls behind their image understanding. To better understand the current status of MLLMs in video understanding, we first investigate extreme video input representations that summarize the video frames into a single image. Our preliminary experiment shows that an image-based video representation with a simple arrow overlay representing a key motion of interest is surprisingly effective; we name this Visual Motion Prompting (ViMP). In ViMP, we draw a visual arrow by hand on a sampled frame of a video. ViMP conveys motion information together with visual context, reducing the need for multiple frames. By comparing ViMP with standard video inputs, we find that MLLMs with ViMP surpass MLLMs with multi-frame video inputs, even though the latter use more tokens and computation, and ViMP requires no additional training. Motivated by this observation, we propose Automatic and Dynamic ViMP (AD-ViMP), which requires no human intervention. Our AD-ViMP demonstrates noticeable performance improvements across diverse scenarios.
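A minimal sketch of hand-drawn visual motion prompting using OpenCV: sample one frame from a video and overlay a single arrow indicating the key motion before passing the image to an MLLM. The video path, arrow endpoints, and color are placeholders.

```python
import cv2

def draw_motion_arrow(frame_bgr, start_xy, end_xy):
    """Overlay a single arrow on a sampled frame to indicate the key motion."""
    out = frame_bgr.copy()
    cv2.arrowedLine(out, start_xy, end_xy, color=(0, 0, 255), thickness=4, tipLength=0.2)
    return out

# Sample one frame from a (hypothetical) video; the result is the single image fed to the MLLM.
cap = cv2.VideoCapture("example_video.mp4")   # path is a placeholder
ok, frame = cap.read()
cap.release()
if ok:
    prompted = draw_motion_arrow(frame, (120, 300), (420, 280))
    cv2.imwrite("vimp_prompt.png", prompted)
```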
Title: Dynamic HDR Radiance Fields via Neural Scene Flow
Reliving transient moments captured by a single camera requires reconstructing accurate radiance, geometry, and 3D motion. While significant progress has been made in dynamic 3D scene reconstruction, high-dynamic-range (HDR) radiance fields of dynamic scenes remain difficult to reconstruct. This work introduces HDR-NSFF, a novel approach to reconstructing dynamic HDR radiance fields from a monocular camera with varying exposures. HDR imaging requires multiple LDR images captured at different exposures, but capturing dynamic scenes with alternating exposures introduces challenges such as the correspondence problem, motion inconsistency, color discrepancies, and low frame rates. Here, Neural Scene Flow Fields (NSFF) is used to jointly model scene flow with neural radiance fields, enabling both novel view synthesis and temporal interpolation. NSFF is extended to HDR radiance field reconstruction by modeling learnable explicit camera response functions, so that the NSFF and camera response functions can be jointly estimated in challenging dynamic scenes. Since color inconsistency across multi-exposure images hinders standard optical flow estimation, we mitigate this issue by incorporating DINOv2 semantic features, which provide exposure-invariant object-level priors for motion estimation. By integrating these components, HDR-NSFF effectively reconstructs dynamic HDR radiance fields from single-camera footage, overcoming the limitations of previous methods and enabling novel view synthesis and high-quality time interpolation in challenging HDR scenarios.
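The abstract mentions learnable explicit camera response functions; below is a minimal PyTorch sketch of one simple parameterization, a per-channel gamma-style CRF mapping exposure-scaled HDR radiance to LDR values. This parameterization is an assumption for illustration, not necessarily the one used in HDR-NSFF.

```python
import torch
import torch.nn as nn

class LearnableCRF(nn.Module):
    """Toy differentiable camera response function: LDR = clip((H * t) ** gamma)."""

    def __init__(self, channels: int = 3):
        super().__init__()
        # one learnable gamma per color channel, initialized near a typical 1/2.2 curve
        self.log_gamma = nn.Parameter(torch.full((channels,), -0.79))  # exp(-0.79) ~ 0.45

    def forward(self, hdr_radiance: torch.Tensor, exposure_time: torch.Tensor) -> torch.Tensor:
        # hdr_radiance: (..., 3) rendered HDR values; exposure_time: scalar or broadcastable
        gamma = self.log_gamma.exp()
        irradiance = hdr_radiance * exposure_time
        return torch.clamp(irradiance.clamp(min=1e-6) ** gamma, 0.0, 1.0)

# The CRF can then be optimized jointly with the radiance field against the captured
# LDR frames, e.g. loss = F.mse_loss(crf(rendered_hdr, t_i), ldr_frame_i).
crf = LearnableCRF()
ldr = crf(torch.rand(4, 4, 3) * 10.0, torch.tensor(1 / 60.0))
print(ldr.shape)
```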
Title: Self-Supervised Collaborative Distillation: Enhancing Lighting Robustness and 3D Awareness
Authors: Wonjun Jo, Hyunwoo Ha, Kim Ji-Yeon, Hawook Jeong, Tae-Hyun Oh
As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders face two key challenges: nighttime lighting conditions and limited 3D awareness, both of which must be addressed for the robust perception and 3D understanding that reliable vision-based systems require. To address these issues, we propose a novel self-supervised approach, Collaborative Distillation, which improves light invariance and 3D awareness in 2D image encoders while retaining semantic context, integrating the strengths of 2D image and 3D LiDAR data. Our method significantly outperforms competing methods on various downstream tasks across diverse lighting conditions and exhibits strong generalization ability. This advancement highlights our method's practicality and adaptability in real-world scenarios.
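The abstract does not spell out the distillation objective, so the sketch below shows only a generic per-pixel cosine feature-distillation loss that pulls a student 2D encoder toward teacher features (e.g., LiDAR-informed targets). All names, shapes, and the pairing strategy are placeholders.

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Generic per-pixel cosine distillation: pull student 2D features toward a teacher.

    student_feat: (B, C, H, W) features from the 2D image encoder being trained
    teacher_feat: (B, C, H, W) target features (e.g. derived from LiDAR-aligned views),
                  detached so gradients only flow into the student
    """
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat.detach(), dim=1)
    return (1.0 - (s * t).sum(dim=1)).mean()

# Toy shapes only; the encoders and pairing strategy are placeholders.
loss = feature_distill_loss(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
print(loss.item())
```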
Title: Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration
Authors: Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, Tae-Hyun Oh
We introduce Dr. Splat, a novel approach for open-vocabulary 3D scene understanding leveraging 3D Gaussian Splatting. Unlike existing language-embedded 3DGS methods, which rely on a rendering process, our method directly associates language-aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key to our method is a language feature registration technique in which CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel-ray. Moreover, we integrate Product Quantization (PQ) trained on general large-scale image data to compactly represent embeddings without per-scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches on 3D perception benchmarks, such as open-vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks.
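A rough numpy sketch of the registration idea: each pixel's language-aligned CLIP embedding is accumulated onto the dominant Gaussians along that pixel's ray, weighted by their blending weights. The top-k choice and weight source are assumptions for illustration.

```python
import numpy as np

def register_pixel_features(gaussian_feat, gaussian_wsum, pixel_emb, ray_gaussian_ids, ray_weights, k=3):
    """Accumulate one pixel's CLIP embedding onto its top-k dominant Gaussians (in place).

    gaussian_feat: (N, d) running weighted sum of embeddings per Gaussian
    gaussian_wsum: (N,)   running sum of blending weights per Gaussian
    pixel_emb:     (d,)   language-aligned CLIP embedding of this pixel
    ray_gaussian_ids, ray_weights: Gaussians hit by this pixel's ray and their alpha-blending weights
    """
    for i in np.argsort(ray_weights)[::-1][:k]:      # dominant Gaussians along the ray
        gaussian_feat[ray_gaussian_ids[i]] += ray_weights[i] * pixel_emb
        gaussian_wsum[ray_gaussian_ids[i]] += ray_weights[i]

# Toy example: 1000 Gaussians, 512-dim embeddings, one pixel ray hitting 8 Gaussians.
rng = np.random.default_rng(0)
feat, wsum = np.zeros((1000, 512)), np.zeros(1000)
register_pixel_features(feat, wsum, rng.normal(size=512),
                        rng.integers(0, 1000, size=8), rng.random(8))
# After all pixels are processed, feat[wsum > 0] / wsum[wsum > 0, None] gives
# per-Gaussian embeddings that can be queried directly in 3D, without rendering.
```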
Title: Concept-guided Image-to-Image Retrieval via conditioned similarity in Vision-Language Model
Authors: Sohwi Lim, Lee Hyoseok, Tae-Hyun Oh
Existing image retrieval systems (e.g., Composed Image Retrieval) aim to align with various user intentions by utilizing Vision-Language Models (VLMs). However, current VLMs cannot extract visual representations based on varying instructions when computing image-to-image similarity, limiting interpretability and flexibility. To address this, we propose a concept-based image retrieval pipeline that focuses on different aspects within the given condition, enabling computation of similarities accordingly. Using the text-image joint space of VLMs, we construct a textual subspace under the condition and compute conditioned similarity by projecting the original visual features onto it. Experimental results demonstrate that the proposed method effectively extracts condition-aware representations, leading to substantial improvements in retrieval accuracy.
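A minimal sketch of conditioned similarity under the stated idea: build a projector from concept text embeddings, project both image features onto that textual subspace, and compare them there. The least-squares projector used here is a standard construction and may differ from the paper's exact formulation.

```python
import numpy as np

def conditioned_similarity(img_feat_a, img_feat_b, concept_text_feats):
    """Cosine similarity of two image features after projecting onto a textual subspace.

    img_feat_a, img_feat_b: (d,)   image embeddings from the VLM's joint space
    concept_text_feats:     (k, d) text embeddings spanning the condition's subspace
    """
    T = concept_text_feats                        # rows span the textual subspace
    P = T.T @ np.linalg.pinv(T @ T.T) @ T         # (d, d) orthogonal projector onto span(T)
    a, b = P @ img_feat_a, P @ img_feat_b         # keep only condition-relevant components
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Toy example: hypothetical 512-dim CLIP-like features and 5 concept prompts.
rng = np.random.default_rng(0)
print(conditioned_similarity(rng.normal(size=512), rng.normal(size=512), rng.normal(size=(5, 512))))
```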
Title: Multimodal Laughter Understanding in Video with Large Language Models
Authors: Lee Jung-Mok, Kim Sung-Bin, Lee Hyun, Tae-Hyun Oh
This paper presents a new benchmark and model architecture for laughter understanding, which introduces a Laugh-Expert LLM trained to detect, classify, and reason about human laughter. While existing multimodal large language models (AV-LLMs, V-LLMs) are capable of processing visual and audio signals, they often struggle to interpret subtle social cues and ambiguous nonverbal expressions. To address this limitation, we propose representing multimodal input comprising visual context, speech, and semantic information as explicit textual cues, enabling language models to reason more effectively and interpret laughter with greater precision.
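As a toy illustration of representing multimodal input as explicit textual cues, the sketch below serializes hypothetical visual, speech, and semantic descriptions into a single prompt for a language model. The field names and prompt wording are invented for illustration.

```python
def build_laughter_prompt(visual_cues: str, transcript: str, semantic_cues: str) -> str:
    """Serialize per-modality cues into explicit text for an LLM to reason over."""
    return (
        "You are analyzing a moment of laughter in a video.\n"
        f"Visual context: {visual_cues}\n"
        f"Speech transcript: {transcript}\n"
        f"Semantic cues: {semantic_cues}\n"
        "Question: Is laughter present, what type is it, and why did it occur?"
    )

print(build_laughter_prompt(
    visual_cues="two people at a dinner table, one covering their mouth",
    transcript="...and then he wore the costume to the meeting!",
    semantic_cues="informal setting, shared joke about a colleague",
))
```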
Title: FPGS: Feed-Forward Semantic-aware Photorealistic Style Transfer of Large-Scale Gaussian Splatting
Authors: GeonU Kim, Kim Youwang, Lee Hyoseok, Tae-Hyun Oh
We present FPGS, a feed-forward photorealistic style transfer method for large-scale radiance fields represented by Gaussian Splatting. FPGS stylizes large-scale 3D scenes with arbitrary, multiple style reference images without additional optimization, while preserving the multi-view consistency and real-time rendering speed of 3D Gaussians. Prior art required tedious per-style optimization or a time-consuming per-scene training stage and was limited to small-scale 3D scenes. FPGS efficiently stylizes large-scale 3D scenes by introducing a style-decomposed 3D feature field, which inherits AdaIN's feed-forward stylization machinery and supports arbitrary style reference images. Furthermore, FPGS supports multi-reference stylization with semantic correspondence matching and local AdaIN, which adds diverse user control over 3D scene styles. FPGS also preserves multi-view consistency by applying the semantic matching and style transfer processes directly to queried features in 3D space. In experiments, we demonstrate that FPGS achieves favorable photorealistic stylization quality for large-scale static and dynamic 3D scenes with diverse reference images.
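For reference, AdaIN itself is a simple statistics-matching operation; the sketch below applies it to hypothetical per-Gaussian content features and style-image features. Where the statistics are computed and how semantic matching selects the style features are simplifications, not FPGS's full pipeline.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN: re-normalize content features to match the style features' statistics.

    content_feat: (N, C) per-Gaussian content features queried from the 3D feature field
    style_feat:   (M, C) features extracted from the style reference image
    """
    c_mu, c_std = content_feat.mean(0), content_feat.std(0) + eps
    s_mu, s_std = style_feat.mean(0), style_feat.std(0) + eps
    return (content_feat - c_mu) / c_std * s_std + s_mu

# Toy example with hypothetical 32-dim features on 10k Gaussians.
stylized = adain(torch.randn(10_000, 32), torch.randn(4_096, 32))
print(stylized.shape)
```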