Eight papers have been accepted to ICMLw 2026 
Title: SMART: Selective Multimodal Aggregation and Refinement over Time for Video Summarization
Authors: Joohyun Chang (KAIST), Minsung Kim (KAIST), Kim Sung-Bin (POSTECH), Chenshuang Zhang (KAIST), Tae-Hyun Oh (KAIST)
The growth of video content has created a strong demand for video summarization that identifies key moments among numerous frames. Recent methods mainly use pretrained language models to estimate frame importance based on visual captions, but they often ignore non-visual cues such as speech and audio events. To address this, we propose SMART, a selective multimodal aggregation and temporal refinement framework for video summarization. SMART introduces visual-guided modality selection to attend to relevant auxiliary cues, as well as progressive window attention to refine timestep-level representations over a broader temporal context. Experiments show that SMART outperforms 13 state-of-the-art methods on the widely used TVSum dataset, demonstrating the effectiveness of selective multimodal integration for video summarization.
Title: CuPS: Measuring Cultural Preference Signatures in LLM/VLM Agents and Their Steering by Profile Memories
Authors: Kyeongseon Kim, GeonU Kim, Joohyun Chang, Hyeyeon Kim, Tae-Hyun Oh (KAIST)
Cultural background shapes how people read the same signal differently. We ask a simple question: do LLM/VLM agents also read these signals differently? We call a model's characteristic way of reading such signals its cultural preference signature. We further ask whether this signature can be shifted by user information contained in pre-execution instruction documents that agents commonly consult, such as memory.md or agent.md. We introduce CuPS, a benchmark designed to measure such signatures. CuPS covers gesture interpretation, triadic categorization, and time-space mapping, with each domain measured across input forms that agents can receive, including text, emoji tokens, and rendered emoji images. Across Qwen and Llama agents, we observe that, much like people, each model carries its own way of reading these signals. In profile-memory experiments, the initial signature shifts in country-specific ways depending on user information documents constructed from personas sampled from NVIDIA Nemotron-Personas. These country-specific shifts appear not only when the user information is given explicitly, but also when it is given only implicitly.
Title: IGG: A Benchmark for Interactive GUI Grounding under Visibility Constraints
Authors: Kyeongseon Kim* (KAIST), Jiyeon Son* (KAIST), Tae-Hyun Oh (KAIST)
GUI grounding benchmarks evaluate whether a VLM agent can localize a target element from a single static screenshot, typically assuming that the target is already visible. Real-world interfaces, however, often involve visibility constraints such as off-screen targets, hover-based descriptions, occlusions, and delayed activation, so a correct click first requires an interaction that reveals the target. In these settings, grounding requires not only visual matching but also interaction to recover target visibility. We introduce Interactive GUI Grounding (IGG), a benchmark for grounding under limited observability. In IGG, the target is not directly localizable or actionable from the initial screenshot, and agents must expose the target before localization. We define a minimal action space for visibility recovery and a three-level taxonomy of GUI constraints spanning single-state, multi-state, and temporal and advanced settings, with seven sub-types, enabling systematic evaluation of GUI agents under diverse visibility constraints.
Title: Entropy-Aware GUI Grounding: From Failure Analysis to Improved Localization
Authors: Chengxin Liu (KAIST), Moon Ye-Bin (POSTECH), Tae-Hyun Oh (KAIST)
Graphical User Interface (GUI) agents have recently emerged as powerful tools for enabling automated operation across diverse platforms, with the potential to alleviate human workload. Despite promising progress, GUI grounding remains a critical bottleneck for achieving precise interaction, as incorrect localization of UI elements can lead to task failure. Therefore, it is crucial to identify when the model is likely to fail. In this work, we demonstrate that the entropy of output tokens is strongly correlated with model failures. Based on this observation, we propose an entropy-aware, training-free method to improve GUI grounding performance. Experiments on two GUI grounding benchmarks show that the proposed method achieves state-of-the-art results.
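For readers unfamiliar with the quantity involved, the following is a minimal sketch of how output-token entropy might be computed from per-step logits. It is not the paper's method: the function names (`softmax`, `token_entropy`, `mean_sequence_entropy`) and the choice to average entropy over decoding steps are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of one output-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_sequence_entropy(step_logits):
    """Average token entropy over all decoding steps.

    Averaging is an illustrative assumption; other aggregations
    (max, last token, coordinate tokens only) are equally plausible.
    """
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)


# Toy logits over a 4-token vocabulary (hypothetical values).
confident = [[10.0, 0.0, 0.0, 0.0]] * 3   # peaked distribution
uncertain = [[1.0, 1.0, 1.0, 1.0]] * 3    # uniform distribution

print(mean_sequence_entropy(confident) < mean_sequence_entropy(uncertain))
```

In this toy setup the uniform distribution yields the maximum entropy for a 4-token vocabulary, while the peaked one yields a value near zero; a high value could then serve as a signal that a prediction deserves extra scrutiny.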
Title: Spatially Stable GUI Grounding via Zoom Consistency Loss
Authors: Moon Ye-Bin (POSTECH), Jiyeon Son (KAIST), Tae-Hyun Oh (KAIST)
GUI grounding, the task of localizing target UI elements from natural language instructions on a screenshot, is a core capability for GUI agents, yet remains challenging due to dense layouts and small elements in high-resolution interfaces. While inference-time zoom methods improve accuracy by re-running inference on cropped regions, they require multiple forward passes per grounding call, making them costly for multi-step agent deployment. Through controlled experiments, we find that models already possess sufficient visual understanding of target elements; what they lack is stable spatial focus under cluttered, high-resolution inputs, a problem we term spatial instability. To address this, we propose a Zoom Consistency Loss, a lightweight auxiliary training objective that enforces agreement between predictions on the original screenshot and on zoomed crops of the same image. At inference time, the model requires only a single forward pass with no additional overhead. Experiments across multiple benchmarks show consistent improvements, with particularly strong gains on the high-resolution ScreenSpot-Pro dataset (+3.80), demonstrating zoom consistency as an effective regularizer for spatially stable grounding.
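The idea of enforcing agreement between a full-screenshot prediction and a zoomed-crop prediction can be sketched as follows. This is an illustrative assumption, not the paper's formulation: the abstract does not specify the loss form, so the squared-distance penalty, the normalized coordinate convention, and the names `crop_to_full` and `zoom_consistency_loss` are all hypothetical.

```python
def crop_to_full(point, crop_box):
    """Map a point in crop-normalized coordinates back to the full image.

    point:    (x, y) in [0, 1] relative to the crop
    crop_box: (x0, y0, x1, y1) in [0, 1] relative to the full screenshot
    """
    x0, y0, x1, y1 = crop_box
    cx, cy = point
    return (x0 + cx * (x1 - x0), y0 + cy * (y1 - y0))


def zoom_consistency_loss(pred_full, pred_crop, crop_box):
    """Squared distance between the two predictions in full-image coordinates.

    The squared-L2 form is an assumption chosen for simplicity.
    """
    mx, my = crop_to_full(pred_crop, crop_box)
    return (pred_full[0] - mx) ** 2 + (pred_full[1] - my) ** 2


# Hypothetical example: the crop covers the bottom-right quadrant.
crop_box = (0.5, 0.5, 1.0, 1.0)
pred_crop = (0.5, 0.5)      # center of the crop
pred_full = (0.65, 0.75)    # prediction on the original screenshot

print(zoom_consistency_loss(pred_full, pred_crop, crop_box))
```

In training, a term like this would be added to the main grounding objective, so that the penalty shrinks as the two predictions agree; at inference only the full-screenshot prediction is needed, which matches the single-forward-pass property described above.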
Title: From Numbers to Narratives: Goal-Oriented Summarization of Machine Learning Model Differences
Authors: Nam Hyeon-Woo (POSTECH), Tae-Hyun Oh (KAIST), Zeynep Akata (Technical University of Munich), Stephan Alaniz (Télécom Paris, Institut Polytechnique de Paris)
Non-experts can now obtain natural-language explanations of how two ML models differ by feeding numerical results to an LLM-based agent. However, a naively prompted LLM often omits critical conditions, and non-experts cannot easily detect these omissions, which can mislead downstream conclusions. We formulate this as goal-oriented summarization and propose Condenser, an iterative method that optimizes an explanation against two objectives: Completeness (faithfulness to observed differences) and Density (informativeness per unit length). Condenser+ extends Condenser with an LLM-based Explorer that actively selects conditions to evaluate. On four settings of increasing complexity (Colored MNIST, Fitzpatrick17k, Dollar Street, and open-condition gender classification), our methods produce concise and complete explanations. They also support two downstream LLM tasks, prompt refinement on CelebA and subset benchmarking on Flickr30K, where measurable improvements further validate the effectiveness of our method. Goal-oriented summarization thus yields explanations that are concise, complete, and useful for downstream LLM tasks.
