🎉

[ICMLw’26] Eight papers have been accepted!

Eight papers have been accepted to ICMLw 2026

Workshop on SCALE: Scalable Learning and Optimization for Efficient Multimodal AI Agents

Title: SMART: Selective Multimodal Aggregation and Refinement over Time for Video Summarization

Authors: Joohyun Chang (KAIST), Minsung Kim (KAIST), Kim Sung-Bin (POSTECH), Chenshuang Zhang (KAIST), Tae-Hyun Oh (KAIST)

The growth of video content has created a strong demand for video summarization that finds key moments from numerous frames. Recent methods mainly use pretrained language models to estimate frame importance based on visual captions, but they often ignore non-visual cues such as speech and audio events. To address this, we propose SMART, a selective multimodal aggregation and temporal refinement framework for video summarization. SMART introduces visual-guided modality selection to attend to relevant auxiliary cues, as well as progressive window attention to refine timestep-level representations over a broader temporal context. Experiments show that SMART outperforms 13 state-of-the-art methods on the widely used TVSum dataset, demonstrating the effectiveness of selective multimodal integration for video summarization.

Workshop on Culture x AI: Evaluating AI as a Cultural Technology

Title: CuPS: Measuring Cultural Preference Signatures in LLM/VLM Agents and Their Steering by Profile Memories

Authors: Kyeongseon Kim, GeonU Kim, Joohyun Chang, Hyeyeon Kim, Tae-Hyun Oh (KAIST)

Cultural background shapes how people read the same signal differently. In this context, we ask a simple question. Do LLM/VLM agents also read these signals differently? We call this a cultural preference signature. We further ask whether this signature can be shifted by user information contained in pre-execution instruction documents that agents commonly consult, such as memory.md or agent.md. We introduce CuPS, a benchmark designed to measure such signatures. CuPS covers gesture interpretation, triadic categorization, and time-space mapping, with each domain measured across input forms that agents can receive, including text, emoji tokens, and rendered emoji images. Across Qwen and Llama agents, we observe that, much like people, each model carries its own way of reading these signals. In profile-memory experiments, the initial signature shifts in country-specific ways depending on user information documents constructed from personas sampled from NVIDIA Nemotron-Personas. These country-specific shifts appear not only when the user information is given explicitly, but also when it is given only implicitly.

Workshop on Compositional Learning: Safety, Interpretability, and Agents

Title: Reveal-and-Click: A Benchmark for VLM Agents under Hidden GUI Targets

Authors: Kyeongseon Kim* (KAIST), Jiyeon Son* (KAIST), Tae-Hyun Oh (KAIST)

GUI grounding benchmarks evaluate whether a VLM agent can localize a target element from a single static screenshot, under the assumption that the target is already visible. Real-world interfaces, however, often involve visibility constraints such as off-screen targets, hover-resolved visual ambiguity, occlusions, and delayed activation, so a correct click first requires an interaction that reveals the target. In these settings, grounding requires not only visual matching but also interaction to recover target visibility. We introduce Reveal-and-Click, a benchmark for grounding under limited observability. In Reveal-and-Click, agents must first expose a hidden or non-actionable target, then click it. Reveal-and-Click defines a minimal action space and seven visibility-constraint sub-types spanning viewport constraint, multi-state constraint, and temporal and manipulation-based settings. Across 16 controllable mock GUI environments and 601 tasks, humans achieve 94.6% grounding accuracy, while strong open-source VLM agents achieve only around 30%, revealing a large gap in visibility recovery rather than ordinary target localization.

Workshop on Compositional Learning: Safety, Interpretability, and Agents

Title: When GUI Grounding Fails: Entropy-Based Analysis and Training-Free Refinement

Authors: Chengxin Liu (KAIST), Moon Ye-Bin (POSTECH), Tae-Hyun Oh (KAIST)

Graphical User Interface (GUI) agents have recently emerged as powerful tools for enabling automated operation across diverse platforms, with the potential to alleviate human workload. Despite promising progress, GUI grounding remains a critical bottleneck for achieving precise interaction, as incorrect localization of UI elements can lead to task failure. Therefore, it is crucial to identify when the model is likely to fail. In this work, we demonstrate that the entropy of output tokens is strongly correlated with model failures. Based on this observation, we propose an entropy-aware, training-free method to improve GUI grounding performance. Experiments on two GUI grounding benchmarks show that the proposed method achieves consistent improvements over state-of-the-art models.
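As a rough illustration of the signal described above, the sketch below computes the Shannon entropy of per-token output distributions and flags a prediction as uncertain when the mean entropy is high. The function names, the averaging over tokens, and the 0.5 threshold are assumptions made for this example; the abstract does not specify how the paper aggregates entropy or refines predictions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(step_probs):
    """Average entropy over all generated tokens of one prediction."""
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)

def is_uncertain(step_probs, threshold=0.5):
    """Flag a prediction whose mean token entropy exceeds the threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return mean_entropy(step_probs) > threshold

# Toy distributions over a 4-token vocabulary (made-up numbers).
confident = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]

print(is_uncertain(confident), is_uncertain(uncertain))
```

A flagged prediction could then be routed to some refinement step, such as re-running grounding on a cropped region; that choice is also an assumption here, not something stated in the abstract.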

Workshop on Compositional Learning: Safety, Interpretability, and Agents

Title: Spatially Stable GUI Grounding via Zoom Consistency Loss

Authors: Moon Ye-Bin (POSTECH), Jiyeon Son (KAIST), Tae-Hyun Oh (KAIST)

GUI grounding, the task of localizing target UI elements from natural language instructions on a screenshot, is a core capability for GUI agents, yet remains challenging due to dense layouts and small elements in high-resolution interfaces. While inference-time zoom methods improve accuracy by re-running inference on cropped regions, they require multiple forward passes per grounding call, making them costly for multi-step agent deployment. Through controlled experiments, we find that models already possess sufficient visual understanding of target elements; what they lack is stable spatial focus under cluttered, high-resolution inputs, a problem we term spatial instability. To address this, we propose a Zoom Consistency Loss, a lightweight auxiliary training objective that enforces agreement between predictions on the original screenshot and on zoomed crops of the same image. At inference time, the model requires only a single forward pass with no additional overhead. Experiments across multiple benchmarks show consistent improvements, with particularly strong gains on the high-resolution ScreenSpot-Pro dataset (+3.80), demonstrating zoom consistency as an effective regularizer for spatially stable grounding.
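The idea of enforcing agreement between predictions on the full screenshot and on a zoomed crop can be sketched as follows. This is a minimal illustration, assuming that predictions are click points in normalized coordinates and that disagreement is measured by squared Euclidean distance; the abstract does not give the actual form of the loss, and all names below are hypothetical.

```python
def crop_to_full(point, crop_box):
    """Map a point given in crop-normalized coordinates back to the full image.

    crop_box is (x0, y0, x1, y1) in full-image normalized coordinates.
    """
    u, v = point
    x0, y0, x1, y1 = crop_box
    return (x0 + u * (x1 - x0), y0 + v * (y1 - y0))

def zoom_consistency_loss(full_pred, crop_pred, crop_box):
    """Squared distance between the full-image prediction and the crop
    prediction after both are expressed in full-image coordinates."""
    fx, fy = full_pred
    cx, cy = crop_to_full(crop_pred, crop_box)
    return (fx - cx) ** 2 + (fy - cy) ** 2

# The crop covers the bottom-right quadrant of the screenshot.
crop_box = (0.5, 0.5, 1.0, 1.0)

# Both views point at the same location, so the penalty is zero.
print(zoom_consistency_loss((0.75, 0.75), (0.5, 0.5), crop_box))

# The full-image prediction drifts to the left, so the penalty is positive.
print(zoom_consistency_loss((0.25, 0.75), (0.5, 0.5), crop_box))
```

In training, a term like this would be added to the main grounding objective, so the crop is needed only during training and inference stays at a single forward pass, as the abstract describes.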

Workshop on Compositional Learning: Safety, Interpretability, and Agents

Title: From Numbers to Narratives: Goal-Oriented Summarization of Machine Learning Model Differences

Authors: Nam Hyeon-Woo (POSTECH), Tae-Hyun Oh (KAIST), Zeynep Akata (Technical University of Munich), Stephan Alaniz (Télécom Paris, Institut Polytechnique de Paris)

Non-experts can now obtain natural-language explanations of how two ML models differ by feeding numerical results to an LLM-based agent. However, naively prompting an LLM often omits critical conditions, and non-experts often cannot easily detect these omissions, which can mislead downstream conclusions. We formulate this as goal-oriented summarization and propose Condenser, an iterative method that optimizes an explanation against two objectives: Completeness (faithfulness to observed differences) and Density (informativeness per unit length). Condenser+ extends Condenser with an LLM-based Explorer that actively selects conditions to evaluate. On four settings of increasing complexity (Colored MNIST, Fitzpatrick17k, Dollar Street, and open-condition gender classification), our methods produce concise and complete explanations. They also support downstream LLM tasks (prompt refinement on CelebA and subset benchmarking on Flickr30K), where measurable improvements further validate the effectiveness of our method. Goal-oriented summarization thus yields explanations that are concise, complete, and useful for downstream LLM tasks.

Workshop on Foundations of Deep Generative Models

Title: Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers

Authors: Lee Hyoseok (KAIST), Sohwi Lim (KAIST), Eunju Cha (Sookmyung Women's University), Tae-Hyun Oh (KAIST)

While latent diffusion models (LDMs) have emerged as powerful priors for inverse problems, existing LDM-based solvers frequently suffer from instability. In this work, we first identify the instability as a discrepancy between the solver dynamics and the stable reverse diffusion dynamics learned by the diffusion model, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play stabilization module that corrects LDM-based inverse problem solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often fail to hold in latent space, MCLC provides a principled stabilization mechanism, leading to more stable and reliable behavior in latent space.
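For readers unfamiliar with Langevin updates, the standard unadjusted Langevin step for sampling a latent $z$ given a measurement $y$ is shown below. This is the generic textbook form, not the MCLC update itself, which the abstract does not spell out; the step size $\eta$ and the notation are assumptions made for this illustration.

```latex
\begin{align}
z_{k+1} &= z_k + \eta \, \nabla_z \log p(z_k \mid y) + \sqrt{2\eta}\, \epsilon_k,
  \qquad \epsilon_k \sim \mathcal{N}(0, I), \\
\nabla_z \log p(z \mid y) &= \nabla_z \log p(z) + \nabla_z \log p(y \mid z).
\end{align}
```

The second line follows from Bayes' rule: the prior score would come from the diffusion model, while the likelihood term ties the update to the measurement.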

Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

Title: A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search

Authors: Baek Seong-Eun (POSTECH), Lee Jung-Mok (KAIST), Kim Sung-Bin (POSTECH), Tae-Hyun Oh (KAIST)

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) is resource-efficient, but its performance is highly sensitive to hyperparameter choices, making exhaustive search expensive. To address this, we propose a framework that integrates pre-trained LLM knowledge into Bayesian Optimization (BO) for efficient LoRA hyperparameter optimization. Our method uses an LLM as a discrete-to-continuous mapping module that converts hyperparameter configurations and domain-aware prompts into continuous embeddings, where BO is performed. The prompts describe the roles and relationships of LoRA hyperparameters, while an additional learnable token captures information not easily expressed in text. We further introduce proxy evaluation on a data subset, exploiting its strong correlation with full-data training to reduce evaluation cost. Experiments show that our method finds strong hyperparameters within about 30 iterations, achieving over 20% improvement over standard hyperparameters selected from roughly 45,000 combinations.