A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network

Yupeng Huo; Jie Shen; Xu Chen; Keming Yu

doi:10.1117/12.3008203

College of Electrical Engineering and Control Science

Nanjing Tech University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Efficient spatiotemporal feature extraction from input video streams is crucial for dynamic gesture recognition. In video classification, convolutional neural networks (CNNs) are widely used as feature extractors, while recurrent neural networks (RNNs) are commonly used for sequence modeling. However, RNNs cannot model global dependencies and have a limited attention span along the temporal dimension, which becomes a performance bottleneck for dynamic gestures that require sensitivity to temporal correlations. To address this issue, this paper proposes R(2+1)D-Transformer, a Transformer-based dynamic gesture recognition model that focuses on global modeling. First, an R(2+1)D network serves as the feature extractor to capture spatiotemporal information. Then, a self-attention-based Transformer maps the spatiotemporal feature sequence to a semantic representation of the gesture movement, taking both temporal and spatial context into account. Finally, an MLP classification head produces the recognition result. Experiments on two publicly available dynamic gesture datasets, IPN-Hand and NvGesture, demonstrate the effectiveness and potential of the proposed model. These results offer a reference for further research and applications in dynamic gesture recognition.
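The global-modeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sizes (`T`, `D`, `NUM_CLASSES`), the random weights, and the helper names are all hypothetical, and random vectors stand in for the per-step features that an R(2+1)D backbone would produce. A single attention head with mean pooling and a linear head replaces the full Transformer and MLP head, and positional encoding is omitted. The sketch only shows how self-attention lets every temporal step attend to every other step in one operation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def self_attention(features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over T feature vectors.

    Returns the attended sequence and the T x T attention weights.
    """
    queries = [matvec(w_q, f) for f in features]
    keys = [matvec(w_k, f) for f in features]
    values = [matvec(w_v, f) for f in features]
    scale = math.sqrt(len(keys[0]))

    outputs, weights = [], []
    for q in queries:
        # Every time step is scored against every other step (global context).
        row = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        weights.append(row)
        outputs.append([sum(w * v[i] for w, v in zip(row, values))
                        for i in range(len(values[0]))])
    return outputs, weights


def classify(features, w_q, w_k, w_v, w_head):
    """Attend over time, mean-pool, then apply a linear head (MLP stand-in)."""
    attended, weights = self_attention(features, w_q, w_k, w_v)
    steps = len(attended)
    pooled = [sum(step[i] for step in attended) / steps
              for i in range(len(attended[0]))]
    probs = softmax(matvec(w_head, pooled))
    return probs, weights


# Hypothetical sizes: 8 temporal steps, 16-dim features, 5 gesture classes.
T, D, NUM_CLASSES = 8, 16, 5
rng = random.Random(0)
# Random vectors stand in for the R(2+1)D backbone's per-step features.
clip_features = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
w_q, w_k, w_v = (random_matrix(D, D, rng) for _ in range(3))
w_head = random_matrix(NUM_CLASSES, D, rng)

probs, attn = classify(clip_features, w_q, w_k, w_v, w_head)
print("class probabilities:", [round(p, 3) for p in probs])
print("attention matrix shape:", len(attn), "x", len(attn[0]))
```

Each row of `attn` is a probability distribution over all `T` steps, so the first and last steps of a clip interact directly. An RNN, by contrast, must carry that information through every intermediate state.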

Original language	English
Title of host publication	Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023
Editors	Yulin Wang, Ata Jahangir Moshayedi
Publisher	SPIE
ISBN (Electronic)	9781510671720
DOIs	https://doi.org/10.1117/12.3008203
State	Published - 2023
Event	3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023 - Nanjing, China
Duration	16 Jun 2023 → 18 Jun 2023

Publication series

Name	Proceedings of SPIE - The International Society for Optical Engineering
Volume	12934
ISSN (Print)	0277-786X
ISSN (Electronic)	1996-756X

Conference

Conference	3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023
Country/Territory	China
City	Nanjing
Period	16/06/23 → 18/06/23

Keywords

Feature extraction
Gesture recognition
Transformer

Access to Document

10.1117/12.3008203

Cite this

Huo, Y., Shen, J., Chen, X., & Yu, K. (2023). A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network. In Y. Wang, & A. J. Moshayedi (Eds.), Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023 Article 1293417 (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 12934). SPIE. https://doi.org/10.1117/12.3008203

@inproceedings{6c924f69144c4e2591a4e57b6374abbe,

title = "A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network",

abstract = "Efficient spatial-temporal feature extraction from input video streams is crucial for dynamic gesture recognition. In the task of video classification, convolutional neural networks (CNNs) are widely used as feature extractors, while methods based on recurrent neural networks (RNNs) are commonly employed for sequence modeling. However, RNNs lack the ability to model global dependencies and have a limited attention span in the temporal dimension. This becomes a performance bottleneck for dynamic gestures that require sensitivity to temporal correlations. To address this issue, this paper proposes a dynamic gesture recognition model called R(2+1)D-Transformer. It is a Transformer-based approach that focuses on global modeling. Firstly, the R(2+1)D network is employed as a spatial-temporal feature extractor to capture the spatiotemporal information. Then, self-attention-based Transformer is used to map the spatiotemporal feature sequence to the semantic representation of gesture movements, considering both the temporal and spatial context. Finally, the gesture recognition results are obtained through an MLP classification head. Experimental results demonstrate the effectiveness and potential of the proposed R(2+1)D-Transformer model on two publicly available dynamic gesture datasets, IPN-Hand and NvGesture. The promising performance of the proposed approach provides valuable insights and reference for further research and applications in dynamic gesture recognition.",

keywords = "Feature extraction, Gesture recognition, Transformer",

author = "Yupeng Huo and Jie Shen and Xu Chen and Keming Yu",

note = "Publisher Copyright: {\textcopyright} 2023 SPIE.; 3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023 ; Conference date: 16-06-2023 Through 18-06-2023",

year = "2023",

doi = "10.1117/12.3008203",

language = "English",

series = "Proceedings of SPIE - The International Society for Optical Engineering",

publisher = "SPIE",

editor = "Yulin Wang and Moshayedi, {Ata Jahangir}",

booktitle = "Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023",

address = "United States",

}

Huo, Y, Shen, J, Chen, X & Yu, K 2023, A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network. in Y Wang & AJ Moshayedi (eds), Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023., 1293417, Proceedings of SPIE - The International Society for Optical Engineering, vol. 12934, SPIE, 3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023, Nanjing, China, 16/06/23. https://doi.org/10.1117/12.3008203

A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network. / Huo, Yupeng; Shen, Jie; Chen, Xu et al.
Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023. ed. / Yulin Wang; Ata Jahangir Moshayedi. SPIE, 2023. 1293417 (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 12934).

TY - GEN

T1 - A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network

AU - Huo, Yupeng

AU - Shen, Jie

AU - Chen, Xu

AU - Yu, Keming

PY - 2023

Y1 - 2023

AB - Efficient spatial-temporal feature extraction from input video streams is crucial for dynamic gesture recognition. In the task of video classification, convolutional neural networks (CNNs) are widely used as feature extractors, while methods based on recurrent neural networks (RNNs) are commonly employed for sequence modeling. However, RNNs lack the ability to model global dependencies and have a limited attention span in the temporal dimension. This becomes a performance bottleneck for dynamic gestures that require sensitivity to temporal correlations. To address this issue, this paper proposes a dynamic gesture recognition model called R(2+1)D-Transformer. It is a Transformer-based approach that focuses on global modeling. Firstly, the R(2+1)D network is employed as a spatial-temporal feature extractor to capture the spatiotemporal information. Then, self-attention-based Transformer is used to map the spatiotemporal feature sequence to the semantic representation of gesture movements, considering both the temporal and spatial context. Finally, the gesture recognition results are obtained through an MLP classification head. Experimental results demonstrate the effectiveness and potential of the proposed R(2+1)D-Transformer model on two publicly available dynamic gesture datasets, IPN-Hand and NvGesture. The promising performance of the proposed approach provides valuable insights and reference for further research and applications in dynamic gesture recognition.

KW - Feature extraction

KW - Gesture recognition

KW - Transformer

UR - http://www.scopus.com/inward/record.url?scp=85177867103&partnerID=8YFLogxK

U2 - 10.1117/12.3008203

DO - 10.1117/12.3008203

M3 - Conference contribution

AN - SCOPUS:85177867103

T3 - Proceedings of SPIE - The International Society for Optical Engineering

BT - Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023

A2 - Wang, Yulin

A2 - Moshayedi, Ata Jahangir

PB - SPIE

T2 - 3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023

Y2 - 16 June 2023 through 18 June 2023

ER -
