Language-Assisted Human Part Motion Learning for Skeleton-Based Temporal Action Segmentation

Preprint (2024)
Author(s)

Bowen Chen (Harbin Institute of Technology)

Zhiyong Wang (Harbin Institute of Technology)

Benjamin Filtjens (Katholieke Universiteit Leuven)

Chunzhuo Wang (Katholieke Universiteit Leuven)

Weihong Ren (Harbin Institute of Technology)

Bart Vanrumste (Katholieke Universiteit Leuven)

Honghai Liu (Harbin Institute of Technology)

Affiliation
External organisation
DOI related publication
https://doi.org/10.48550/arXiv.2410.06353
Publication Year
2024
Language
English
Publisher
ArXiv

Abstract

Skeleton-based Temporal Action Segmentation involves the dense action classification of variable-length skeleton sequences. Current approaches primarily apply graph-based networks to extract framewise, whole-body-level motion representations, and use one-hot encoded labels for model optimization. However, whole-body motion representations do not capture fine-grained part-level motion representations and the one-hot encoded labels neglect the intrinsic semantic relationships within the language-based action definitions. To address these limitations, we propose a novel method named Language-assisted Human Part Motion Representation Learning (LPL), which contains a Disentangled Part Motion Encoder (DPE) to extract dual-level (i.e., part and whole-body) motion representations and a Language-assisted Distribution Alignment (LDA) strategy for optimizing spatial relations within representations. Specifically, after part-aware skeleton encoding via DPE, LDA generates dual-level action descriptions to construct a textual embedding space with the help of a large-scale language model. Then, LDA motivates the alignment of the embedding space between text descriptions and motions. This alignment allows LDA not only to enhance intra-class compactness but also to transfer the language-encoded semantic correlations among actions to skeleton-based motion learning. Moreover, we propose a simple yet efficient Semantic Offset Adapter to smooth the cross-domain misalignment. Our experiments indicate that LPL achieves state-of-the-art performance across various datasets (e.g., +4.4% Accuracy, +5.6% F1 on the PKU-MMD dataset). Moreover, LDA is compatible with existing methods and improves their performance (e.g., +4.8% Accuracy, +4.3% F1 on the LARa dataset) without additional inference costs.
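
The idea of aligning motion features with a textual embedding space can be illustrated with a generic contrastive objective. The sketch below is not the LDA formulation from the paper, whose details the abstract does not give. It assumes one text embedding per action class (toy vectors here, standing in for language-model outputs), scores a single frame's motion feature against each of them by cosine similarity, and applies a softmax cross-entropy against the ground-truth class. The function names, the temperature value, and the data are all hypothetical.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(motion_feat, text_embs, label, temperature=0.1):
    # Cross-entropy over the similarity distribution between one
    # motion feature and every class text embedding. This is a generic
    # contrastive loss used for illustration, not the paper's LDA.
    logits = [cosine(motion_feat, t) / temperature for t in text_embs]
    probs = softmax(logits)
    return -math.log(probs[label])


# Hypothetical 2-D text embeddings for two action classes.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss([0.9, 0.1], text_embs, label=0)
misaligned = alignment_loss([0.1, 0.9], text_embs, label=0)
```

In this toy setting, a motion feature pointing toward its class's text embedding yields a smaller loss than one pointing toward another class. Because such a loss is only needed during training, it is consistent with the abstract's claim that alignment adds no inference cost.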

No files available

Metadata only record. There are no files for this record.