AToM: Aligning Text-to-Motion Model at Event-Level
with GPT-4Vision Reward

Haonan Han1,*, Xiangzuo Wu1,*, Huan Liao1,*, Zunnan Xu1, Zhongyuan Hu1,
Ronghui Li1, Yachao Zhang2,†, Xiu Li1,†

1Shenzhen International Graduate School, Tsinghua University, 2School of Informatics, Xiamen University

Abstract

Recently, text-to-motion models have opened new possibilities for creating realistic human motion with greater efficiency and flexibility. However, aligning motion generation with event-level textual descriptions presents unique challenges due to the complex relationship between textual prompts and desired motion outcomes. To address this, we introduce AToM, a framework that enhances the alignment between generated motion and text prompts by leveraging reward from GPT-4Vision. AToM comprises three main stages. First, we construct MotionPrefer, a dataset that pairs three types of event-level textual prompts with generated motions, covering the integrity, temporal relationship, and frequency of motion. Second, we design a paradigm that uses GPT-4Vision for detailed motion annotation, including visual data formatting, task-specific instructions, and scoring rules for each sub-task. Finally, we fine-tune an existing text-to-motion model using reinforcement learning guided by this paradigm. Experimental results demonstrate that AToM significantly improves the event-level alignment quality of text-to-motion generation.

Pipeline

Pipeline Image
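
As a rough illustration of how per-motion scores from an annotator such as GPT-4Vision could be turned into a training signal, the sketch below builds (chosen, rejected) preference pairs from scored motions that share a prompt. The names `ScoredMotion`, `build_preference_pairs`, and `min_gap`, the numeric score scale, and the pairwise construction itself are assumptions made for illustration; they are not taken from the AToM implementation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ScoredMotion:
    """A generated motion plus an annotator score (hypothetical structure)."""
    motion_id: str
    score: float  # assumed: higher means better alignment with the prompt


def build_preference_pairs(motions_by_prompt, min_gap=1.0):
    """Return (prompt, chosen_id, rejected_id) tuples.

    Two motions for the same prompt form a pair only when their scores
    differ by at least `min_gap`, so near-ties are not treated as preferences.
    """
    pairs = []
    for prompt, motions in motions_by_prompt.items():
        for a, b in combinations(motions, 2):
            if abs(a.score - b.score) < min_gap:
                continue  # skip ambiguous comparisons
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append((prompt, chosen.motion_id, rejected.motion_id))
    return pairs


if __name__ == "__main__":
    # Made-up scores for one frequency-type prompt from this page.
    data = {
        "A person jumps forward one time": [
            ScoredMotion("m0", 4.0),
            ScoredMotion("m1", 1.0),
            ScoredMotion("m2", 3.5),
        ]
    }
    for pair in build_preference_pairs(data):
        print(pair)
```

Pairs of this kind are one common input to preference-based fine-tuning. Whether AToM uses pairwise preferences or raw scores as the reward is not stated on this page.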

Main results

Quantitative experiment

We conducted a quantitative experiment to compare AToM against the baseline and existing methods. The results show that AToM achieves better text-motion alignment, motion quality, and generative realism.

Quantitative Experiment Results

Qualitative experiment

Frequency

Pretrain

Qualitative Experiment GIF 1

Finetune

Qualitative Experiment GIF 2

A person paces diagonally two times

Pretrain

Qualitative Experiment GIF 3

Finetune

Qualitative Experiment GIF 4

A person jumps forward one time

Pretrain

Qualitative Experiment GIF 5

Finetune

Qualitative Experiment GIF 6

A person who has his arms raised head high raises his arms
above his head and lowers them three times

Pretrain

Qualitative Experiment GIF 7

Finetune

Qualitative Experiment GIF 8

A man raises a bar over his head two times

Integrity

Pretrain

Qualitative Experiment GIF 5

Finetune

Qualitative Experiment GIF 6

A person steps back, jumps up, and walks forward

Pretrain

Qualitative Experiment GIF 7

Finetune

Qualitative Experiment GIF 8

The figure bends down, gets on their hands and knees,
and begins crawling to the left

Pretrain

Qualitative Experiment GIF 5

Finetune

Qualitative Experiment GIF 6

A person walks diagonally, stomping, uses hands
in a karate motion, and walks backward similarly

Pretrain

Qualitative Experiment GIF 7

Finetune

Qualitative Experiment GIF 8

A person walks up stairs, turns right, and walks back down

Temporal

Pretrain

Qualitative Experiment GIF 5

Finetune

Qualitative Experiment GIF 6

A person walks forward, then is pushed to their
right and then returns to walking in the line

Pretrain

Qualitative Experiment GIF 7

Finetune

Qualitative Experiment GIF 8

A person takes a seat, appears to make a throwing
motion, and then stands up

Pretrain

Qualitative Experiment GIF 5

Finetune

Qualitative Experiment GIF 6

A person walks forwards doing ballet, then raising
one leg, then skipping and raising the other leg

Pretrain

Qualitative Experiment GIF 7

Finetune

Qualitative Experiment GIF 8

A person walks forwards doing ballet, skipping, then raising one leg, and then skipping and raising the other leg