Human Motion and Animation Open access

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan

arXiv (Cornell University) | May 28, 2026

Abstract

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

Authors

Researchers on this paper

Yiheng Li | first
Zhuo Li | middle | ORCID 0009-0003-4139-6259
Ruibing Hou | middle | ORCID 0000-0003-2480-6538
Yingjie Chen | middle
Hong Chang | middle
Hao Liu | middle
Shiguang Shan | last

Citation

BibTeX

@article{Li2026AnyMo,
  title = {AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling},
  author = {Yiheng Li and Zhuo Li and Ruibing Hou and Yingjie Chen and Hong Chang and Hao Liu and Shiguang Shan},
  journal = {arXiv (Cornell University)},
  year = {2026},
  doi = {10.48550/arxiv.2605.29488},
  url = {https://doi.org/10.48550/arxiv.2605.29488}
}
