Title: Training-free Multimodal Embedding for Structure-Aware Retrieval of Scalable Vector Graphics and Images
Authors: Kyeongseon Kim (KAIST), Baek Seong-Eun (POSTECH), Lee Jung-Mok (POSTECH), Tae-Hyun Oh (KAIST)
Accepted as a Round 1 paper (6.4% acceptance rate)
While Scalable Vector Graphics (SVG) code can be viewed either as plain text or, once rendered, as an image, it is a structured representation that encodes geometric and layout information. However, existing methods typically convert SVGs into raster images, discarding their structural details. Similarly, previous sentence embedding methods generate high-quality text embeddings but do not extend to structured or visual modalities such as SVGs. To address these challenges, we propose the first training-free multimodal embedding method that uses a Multimodal Large Language Model (MLLM) to project text, images, and SVG code into an aligned space. Our method consists of two main components: (1) multimodal Explicit One-word Limitation (mEOL), which produces compact, semantically grounded embeddings across modalities without training; and (2) a semantic SVG module that rewrites SVG code by generating missing or non-descriptive components through visual reasoning. This lets the model embed structural signals overlooked in prior work. Our approach not only introduces the first SVG retrieval setting but also achieves strong empirical performance, surpassing prior methods, including training-based models, by up to +20.5% Recall@1 on a repurposed VGBench dataset. These results demonstrate that structural cues can significantly enhance semantic alignment in multimodal embeddings, enabling effective retrieval without any fine-tuning.
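For readers unfamiliar with the reported metric: Recall@1 is simply the fraction of queries whose single top-ranked retrieval is a correct match. A minimal sketch with cosine similarity over embedding matrices (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def recall_at_1(query_embs, gallery_embs, gt_indices):
    """Fraction of queries whose nearest gallery item (by cosine
    similarity) is the ground-truth match."""
    # L2-normalize rows so dot products become cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    top1 = (q @ g.T).argmax(axis=1)  # best-matching gallery index per query
    return float((top1 == np.asarray(gt_indices)).mean())

# Toy example with 2-D embeddings: query 0 retrieves correctly, query 1 does not.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
print(recall_at_1(queries, gallery, [0, 2]))  # → 0.5
```

A "+20.5% Recall@1" gain thus means 20.5 percentage points more queries whose very first result is correct.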
Title: Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching
Authors: Wonseok Choi (POSTECH), Sohwi Lim (KAIST), Nam Hyeon-Woo (POSTECH), Moon Ye-Bin (POSTECH), Dong-ju Jeong (Samsung Research), Jinyoung Hwang (Samsung Research), Tae-Hyun Oh (KAIST)
Instance-level image retrieval aims to find images containing the same object as a given query, despite variations in size, position, or appearance. To address this challenging task, we propose Patchify, a simple yet effective patch-wise retrieval framework that offers high performance, scalability, and interpretability without requiring fine-tuning. Patchify divides each database image into a small number of structured patches and performs retrieval by comparing these local features with a global query descriptor, enabling accurate and spatially grounded matching. To assess not just retrieval accuracy but also spatial correctness, we introduce LocScore, a localization-aware metric that quantifies whether the retrieved region aligns with the target object. This makes LocScore a valuable diagnostic tool for understanding and improving retrieval behavior. We conduct extensive experiments across multiple benchmarks, backbones, and region selection strategies, showing that Patchify outperforms global methods and complements state-of-the-art reranking pipelines. Furthermore, we apply Product Quantization for efficient large-scale retrieval and highlight the importance of using informative features during compression, which significantly boosts performance.
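The core mechanism described above can be sketched in a few lines: split each database image's feature map into a small grid of patch descriptors, score the image by the best cosine similarity between the global query descriptor and any patch, and read off the winning patch as the localized match. This is a hedged illustration under simplifying assumptions (average-pooled patches, random toy features), not the paper's implementation:

```python
import numpy as np

def patch_descriptors(feat_map, grid=2):
    """Average-pool an HxWxD feature map into (grid*grid, D) patch
    descriptors. Illustrative stand-in for a real backbone's features."""
    H, W, D = feat_map.shape
    hs, ws = H // grid, W // grid
    patches = [feat_map[i*hs:(i+1)*hs, j*ws:(j+1)*ws].mean(axis=(0, 1))
               for i in range(grid) for j in range(grid)]
    return np.stack(patches)

def patchwise_score(query_global, db_feat_map, grid=2):
    """Score a database image by the best cosine similarity between the
    global query descriptor and any of its patches; also return which
    patch matched, giving a spatially grounded (interpretable) result."""
    p = patch_descriptors(db_feat_map, grid)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    q = query_global / np.linalg.norm(query_global)
    sims = p @ q
    return float(sims.max()), int(sims.argmax())

# Toy example: plant the query signal in the top-left quadrant (patch 0).
rng = np.random.default_rng(0)
fmap = rng.normal(scale=0.1, size=(8, 8, 16))
query = rng.normal(size=16)
fmap[:4, :4] += query
score, patch = patchwise_score(query, fmap, grid=2)
print(patch)  # → 0: the matched patch also localizes the object
```

Ranking the database by this per-image score yields the retrieval list, and the matched-patch index is the kind of spatial evidence a metric like LocScore can check against the target object's region.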
Title: Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts
Authors: Jaehun Bang (UNIST), Moon Ye-Bin (POSTECH), Kyungdon Joo (UNIST), Tae-Hyun Oh (KAIST)
When searching for videos, users often rely on surrounding context such as background elements or temporal details beyond salient content. However, existing video models struggle with fine-grained spatio-temporal understanding, particularly surrounding contexts, and there are no datasets that effectively evaluate their performance. We introduce SS Datasets, three video retrieval datasets with detailed salient and surrounding captions. To capture rich, temporally localized contexts aligned with meaningful scene changes, we segment videos by scene transitions and generate captions with a vision-language model. Analyzing current models reveals difficulties in handling surrounding queries and temporally complex videos. To address this, we propose simple yet effective baselines that improve retrieval across diverse query types, enabling more robust generalization to real-world scenarios.
Acceptance Stats (Round 1)
WACV’25 received 1,329 valid submissions (excluding desk rejects and withdrawals). Of these, 85 papers were accepted (including mEOL), 507 were rejected, and 732 were invited to resubmit to Round 2. Over 4,100 reviews were completed by 2,637 reviewers and 274 area chairs — many thanks to them for their hard work and dedication!
Acceptance Stats (Round 2)
WACV’25 received 1,809 valid paper submissions in Round 2, excluding papers that were desk rejected or withdrawn. Of these, 440 resubmissions were accepted, 242 resubmissions were rejected, 303 new submissions were accepted, and 824 new submissions were rejected.
