Abstract
Multimodal learning of motion and text seeks correspondences between skeletal time-series data acquired by motion capture and the text that describes the motion. Good associations enable both motion-to-text and text-to-motion applications. However, previous methods fail to associate motion with text in a way that accounts for the details of a description, for example, whether the left or the right arm moves. In this paper, we propose a motion-text contrastive learning method that establishes correspondences between motion and text in a shared embedding space. We show that our model outperforms previous methods on an action recognition task. We also show qualitatively that, by using a pre-trained text encoder, our model can perform motion retrieval that reflects detailed correspondences between motion and text.
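
As an illustration of the idea of contrastive learning in a shared embedding space, the following minimal sketch computes a symmetric contrastive loss over paired motion and text embeddings and then retrieves motions for a text query by cosine similarity. The abstract does not specify the loss, so the CLIP-style symmetric cross-entropy, the temperature value, and all function names here are assumptions made for illustration. The toy 2-D vectors stand in for the outputs of motion and text encoders, which are not modeled.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(motion_emb, text_emb, temperature=0.07):
    # Cosine similarity between every motion and every text, scaled by
    # an assumed temperature (0.07 is an illustrative value).
    m = [l2_normalize(v) for v in motion_emb]
    t = [l2_normalize(v) for v in text_emb]
    return [[sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t]
            for mi in m]


def diagonal_cross_entropy(logits):
    # Mean cross-entropy where row i's correct class is column i,
    # i.e. the i-th motion is paired with the i-th text.
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for numerical stability
        log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(motion_emb, text_emb, temperature=0.07):
    # Average of the motion-to-text and text-to-motion losses.
    logits = similarity_logits(motion_emb, text_emb, temperature)
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_t))


def retrieve_motions(text_query_emb, motion_emb, top_k=1):
    # Rank motions by cosine similarity to the text query embedding.
    q = l2_normalize(text_query_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(m)))
              for m in motion_emb]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    motions = [[1.0, 0.0], [0.0, 1.0]]      # toy motion embeddings
    texts = [[1.0, 0.0], [0.0, 1.0]]        # matching toy text embeddings
    print(symmetric_contrastive_loss(motions, texts))
    print(retrieve_motions([0.9, 0.1], motions))
```

The loss is small when each motion is closest to its own description and large when the pairs are mismatched, which is the property that pulls matching motion and text together in the shared space.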






Data Availability
The BABEL dataset used in the experiment is publicly available.
Additional information
This work was submitted and accepted for the Journal Track of the joint symposium of the 28th International Symposium on Artificial Life and Robotics, the 8th International Symposium on BioComplexity, and the 6th International Symposium on Swarm Behavior and Bio-Inspired Robotics (Beppu, Oita, January 25–27, 2023).
About this article
Cite this article
Horie, J., Noguchi, W., Iizuka, H. et al. Learning shared embedding representation of motion and text using contrastive learning. Artif Life Robotics 28, 148–157 (2023). https://doi.org/10.1007/s10015-022-00840-0
DOI: https://doi.org/10.1007/s10015-022-00840-0