Human-human motion generation is essential for understanding humans as social beings. Although several transformer-based methods have been proposed, they typically model each individual separately and overlook the causal relationships in temporal motion sequences. Furthermore, the attention mechanism in transformers has quadratic computational complexity, which significantly reduces efficiency on long sequences. In this paper, we introduce TIM (Temporal and Interactive Modeling), an efficient and effective approach that, to our knowledge, is the first human-human motion generation model built on RWKV. Specifically, we first propose Causal Interactive Injection to exploit the temporal properties of motion sequences and avoid non-causal, cumbersome modeling. We then present Role-Evolving Mixing to adapt to the continually evolving roles of the two participants throughout the interaction. Finally, to generate smoother and more plausible motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on the InterHuman and InterX datasets demonstrate that our method achieves superior performance. Notably, TIM achieves state-of-the-art results using only 32% of InterGen's trainable parameters.
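To make the core idea of Causal Interactive Injection concrete, the following is a minimal, illustrative sketch (not the authors' implementation): the motion tokens of the two interacting persons are interleaved into a single causal sequence so that a recurrent, linear-complexity backbone such as RWKV can model the interaction step by step. The module name, the feature dimension, and the GRU stand-in for the RWKV blocks are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class CausalInteractiveInjection(nn.Module):
    """Illustrative sketch: interleave two persons' motion tokens into one
    causal sequence so a recurrent/linear-complexity backbone can model the
    interaction without bidirectional attention. The GRU below is only a
    stand-in for the RWKV blocks used in the paper."""

    def __init__(self, motion_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Linear(motion_dim, hidden_dim)
        # Causal sequence model (stand-in for RWKV): each step sees only the past.
        self.backbone = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, motion_dim)

    def forward(self, motion_a: torch.Tensor, motion_b: torch.Tensor):
        # motion_a, motion_b: (batch, T, motion_dim) for persons A and B.
        b, t, d = motion_a.shape
        # Interleave frame by frame: a_1, b_1, a_2, b_2, ..., length 2T,
        # so each token is conditioned only on earlier frames of both persons.
        x = torch.stack((motion_a, motion_b), dim=2).reshape(b, 2 * t, d)
        h, _ = self.backbone(self.embed(x))
        out = self.head(h).reshape(b, t, 2, d)
        return out[:, :, 0], out[:, :, 1]  # per-person outputs


if __name__ == "__main__":
    # Hypothetical dimensions for demonstration only.
    model = CausalInteractiveInjection(motion_dim=262, hidden_dim=256)
    a, b = torch.randn(2, 64, 262), torch.randn(2, 64, 262)
    pred_a, pred_b = model(a, b)
    print(pred_a.shape, pred_b.shape)  # torch.Size([2, 64, 262]) each
```

Interleaving (rather than concatenating the two sequences end to end) keeps the two persons' frames temporally aligned within the causal stream, which is one natural way to realize the causal, interaction-aware modeling the abstract describes.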
Quantitative evaluation on the InterHuman test set.
Quantitative evaluation on the InterX test set.
We compare against InterGen for human-human motion generation. The motions synthesized by our method are more consistent with the descriptions.
two persons perform a synchronized dancing move together.
one person lifts the magazine in front of themselves with both hands, while the other person kicks up their right leg to assault the magazine.
one person embraces the other person's back with both arms, while the other person reciprocates the gesture.
two individuals are sparring with each other.