🎉

[TPAMI’26] One paper has been accepted!

Tags
Academic
Time
2026/01/28

  One paper has been accepted to IEEE TPAMI 2026.

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), published by the IEEE Computer Society since its founding in 1979, is widely regarded as one of the most prestigious academic journals in artificial intelligence, machine learning, and computer vision. According to the 2024 JCR, it has an Impact Factor of 18.6 and a 5-year Impact Factor of 20.4, ranking third among 368 SCI journals in the Engineering, Electrical & Electronic category, behind only Nature Electronics and Proceedings of the IEEE.

CLIP-Actor-X: Text-driven 4D Human Avatar Generation via Cross-modal Synthesis-through-Optimization

Authors: Kim Youwang* (POSTECH), Taehyun Byun* (Korea University), Kim Ji-Yeon (POSTECH), Sungjoon Choi (Korea University), Tae-Hyun Oh (KAIST)
We propose CLIP-Actor-X, a text-driven motion generation and neural mesh stylization system for 4D human avatar generation. Given a text prompt from a user, CLIP-Actor-X generates a detailed 3D human mesh, motion animation, and texture that conform to the prompt. The system consists of two main modules. First, to generate realistic human motion, we build a text-driven human motion synthesis module, modeled as a retrieval-augmented generative model powered by a text-to-motion diffusion model. Second, our novel zero-shot neural style optimization module detailizes and texturizes the sampled sequence of a neutral human mesh template, so that the resulting mesh and appearance comply with the input text prompt in a temporally consistent and pose-agnostic manner. In contrast to prior art that uses an artist-designed, non-animatable mesh as input, our output representation is animatable and better aligned with the input text, without additional post-processing such as re-alignment, retargeting, or rigging. We further propose two ways to stabilize the optimization process: spatio-temporal view augmentation and visibility-aware embedding attention, which deal with poorly rendered views. We demonstrate that CLIP-Actor-X produces perceptually plausible and human-recognizable avatars in motion, with detailed geometry and texture, solely from a natural language prompt.
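The two-stage structure described above can be sketched as follows. This is a minimal illustrative skeleton, not the authors' actual code: every name here (`generate_motion`, `stylize`, `MeshFrame`) is hypothetical, and the real modules are neural networks (a text-to-motion diffusion model and a CLIP-guided style optimizer) rather than the stand-in functions shown.

```python
# Hypothetical sketch of the CLIP-Actor-X two-stage pipeline.
# All names and data structures are illustrative assumptions,
# not the paper's actual API.

from dataclasses import dataclass, field
from typing import List

@dataclass
class MeshFrame:
    pose: str            # stand-in for per-frame body pose parameters
    detail: str = ""     # geometry displacement added by stylization
    texture: str = ""    # appearance added by stylization

def generate_motion(prompt: str, num_frames: int = 4) -> List[MeshFrame]:
    """Stage 1 (stand-in): text-driven motion synthesis.

    In the paper this is a retrieval-augmented generative model powered
    by a text-to-motion diffusion model; here we simply emit posed
    frames of a neutral template to show the data flow.
    """
    return [MeshFrame(pose=f"{prompt}-pose-{t}") for t in range(num_frames)]

def stylize(frames: List[MeshFrame], prompt: str) -> List[MeshFrame]:
    """Stage 2 (stand-in): zero-shot neural style optimization.

    The real module optimizes mesh detail and texture against the text
    prompt; applying one shared style to every frame mimics the
    pose-agnostic, temporally consistent behavior described above.
    """
    style = f"style({prompt})"
    for frame in frames:
        frame.detail = style
        frame.texture = style
    return frames

prompt = "a breakdancing astronaut"
avatar = stylize(generate_motion(prompt), prompt)
```

The key design point the sketch mirrors is that stylization operates on the whole animated sequence at once, so detail and texture stay consistent across poses instead of being optimized per frame.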