A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network

Yupeng Huo; Jie Shen; Xu Chen; Keming Yu

doi:10.1117/12.3008203

College of Electrical Engineering and Control Science

Nanjing Tech University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Efficient spatial-temporal feature extraction from input video streams is crucial for dynamic gesture recognition. In the task of video classification, convolutional neural networks (CNNs) are widely used as feature extractors, while methods based on recurrent neural networks (RNNs) are commonly employed for sequence modeling. However, RNNs lack the ability to model global dependencies and have a limited attention span in the temporal dimension. This becomes a performance bottleneck for dynamic gestures that require sensitivity to temporal correlations. To address this issue, this paper proposes a dynamic gesture recognition model called R(2+1)D-Transformer. It is a Transformer-based approach that focuses on global modeling. Firstly, the R(2+1)D network is employed as a spatial-temporal feature extractor to capture the spatiotemporal information. Then, self-attention-based Transformer is used to map the spatiotemporal feature sequence to the semantic representation of gesture movements, considering both the temporal and spatial context. Finally, the gesture recognition results are obtained through an MLP classification head. Experimental results demonstrate the effectiveness and potential of the proposed R(2+1)D-Transformer model on two publicly available dynamic gesture datasets, IPN-Hand and NvGesture. The promising performance of the proposed approach provides valuable insights and reference for further research and applications in dynamic gesture recognition.
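The global temporal modeling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses random vectors as stand-ins for R(2+1)D clip features, a single attention head with no positional encoding, feed-forward block, or layer normalization, and a single linear layer in place of the MLP head. All sizes (T, D, NUM_CLASSES) and function names are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def self_attention(seq, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a feature sequence."""
    Q = [matvec(Wq, x) for x in seq]
    K = [matvec(Wk, x) for x in seq]
    V = [matvec(Wv, x) for x in seq]
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        # Each time step scores itself against every step in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, V)) for i in range(d)])
    return outputs, weights


def classify(seq, Wq, Wk, Wv, W_head):
    """Attend over the sequence, mean-pool over time, apply a linear head."""
    attended, weights = self_attention(seq, Wq, Wk, Wv)
    steps = len(attended)
    dim = len(attended[0])
    pooled = [sum(tok[i] for tok in attended) / steps for i in range(dim)]
    return softmax(matvec(W_head, pooled)), weights


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D, NUM_CLASSES = 8, 16, 5  # hypothetical: time steps, feature size, classes

# Stand-in for the per-clip feature sequence an R(2+1)D backbone would emit.
features = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
Wq, Wk, Wv = (random_matrix(D, D, rng) for _ in range(3))
W_head = random_matrix(NUM_CLASSES, D, rng)

probs, weights = classify(features, Wq, Wk, Wv, W_head)
print("class probabilities:", [round(p, 3) for p in probs])
print("attention from step 0:", [round(w, 3) for w in weights[0]])
```

Each row of `weights` is a probability distribution over all T time steps, so every step draws on the whole sequence in one operation. This is the property the abstract contrasts with the limited temporal span of RNNs.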

Original language	English
Title of host publication	Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023
Editors	Yulin Wang, Ata Jahangir Moshayedi
Publisher	SPIE
ISBN (Electronic)	9781510671720
DOI	https://doi.org/10.1117/12.3008203
Publication status	Published - 2023
Event	3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023 - Nanjing, China; Duration: 16 Jun 2023 → 18 Jun 2023

Publication series

Name	Proceedings of SPIE - The International Society for Optical Engineering
Volume	12934
ISSN (Print)	0277-786X
ISSN (Electronic)	1996-756X

Conference

Conference	3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023
Country/Territory	China
City	Nanjing
Period	16/06/23 → 18/06/23

Cite this

Huo, Y., Shen, J., Chen, X., & Yu, K. (2023). A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network. In Y. Wang & A. J. Moshayedi (Eds.), Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023 Article 1293417 (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 12934). SPIE. https://doi.org/10.1117/12.3008203

@inproceedings{6c924f69144c4e2591a4e57b6374abbe,

title = "A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network",

abstract = "Efficient spatial-temporal feature extraction from input video streams is crucial for dynamic gesture recognition. In the task of video classification, convolutional neural networks (CNNs) are widely used as feature extractors, while methods based on recurrent neural networks (RNNs) are commonly employed for sequence modeling. However, RNNs lack the ability to model global dependencies and have a limited attention span in the temporal dimension. This becomes a performance bottleneck for dynamic gestures that require sensitivity to temporal correlations. To address this issue, this paper proposes a dynamic gesture recognition model called R(2+1)D-Transformer. It is a Transformer-based approach that focuses on global modeling. Firstly, the R(2+1)D network is employed as a spatial-temporal feature extractor to capture the spatiotemporal information. Then, self-attention-based Transformer is used to map the spatiotemporal feature sequence to the semantic representation of gesture movements, considering both the temporal and spatial context. Finally, the gesture recognition results are obtained through an MLP classification head. Experimental results demonstrate the effectiveness and potential of the proposed R(2+1)D-Transformer model on two publicly available dynamic gesture datasets, IPN-Hand and NvGesture. The promising performance of the proposed approach provides valuable insights and reference for further research and applications in dynamic gesture recognition.",

keywords = "Feature extraction, Gesture recognition, Transformer",

author = "Yupeng Huo and Jie Shen and Xu Chen and Keming Yu",

note = "Publisher Copyright: {\textcopyright} 2023 SPIE.; 3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023 ; Conference date: 16-06-2023 Through 18-06-2023",

year = "2023",

doi = "10.1117/12.3008203",

language = "English",

series = "Proceedings of SPIE - The International Society for Optical Engineering",

publisher = "SPIE",

editor = "Yulin Wang and Moshayedi, {Ata Jahangir}",

booktitle = "Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023",

address = "United States",

}

Huo, Y, Shen, J, Chen, X & Yu, K 2023, A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network. in Y Wang & AJ Moshayedi (eds), Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023, 1293417, Proceedings of SPIE - The International Society for Optical Engineering, vol. 12934, SPIE, 3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023, Nanjing, China, 16/06/23. https://doi.org/10.1117/12.3008203

A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network. / Huo, Yupeng; Shen, Jie; Chen, Xu et al.
Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023. ed. / Yulin Wang; Ata Jahangir Moshayedi. SPIE, 2023. 1293417 (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 12934).

TY - GEN

T1 - A Dynamic Gesture Recognition Method Based on R(2+1)D-Transformer Network

AU - Huo, Yupeng

AU - Shen, Jie

AU - Chen, Xu

AU - Yu, Keming

PY - 2023

Y1 - 2023

N2 - Efficient spatial-temporal feature extraction from input video streams is crucial for dynamic gesture recognition. In the task of video classification, convolutional neural networks (CNNs) are widely used as feature extractors, while methods based on recurrent neural networks (RNNs) are commonly employed for sequence modeling. However, RNNs lack the ability to model global dependencies and have a limited attention span in the temporal dimension. This becomes a performance bottleneck for dynamic gestures that require sensitivity to temporal correlations. To address this issue, this paper proposes a dynamic gesture recognition model called R(2+1)D-Transformer. It is a Transformer-based approach that focuses on global modeling. Firstly, the R(2+1)D network is employed as a spatial-temporal feature extractor to capture the spatiotemporal information. Then, self-attention-based Transformer is used to map the spatiotemporal feature sequence to the semantic representation of gesture movements, considering both the temporal and spatial context. Finally, the gesture recognition results are obtained through an MLP classification head. Experimental results demonstrate the effectiveness and potential of the proposed R(2+1)D-Transformer model on two publicly available dynamic gesture datasets, IPN-Hand and NvGesture. The promising performance of the proposed approach provides valuable insights and reference for further research and applications in dynamic gesture recognition.

AB - Efficient spatial-temporal feature extraction from input video streams is crucial for dynamic gesture recognition. In the task of video classification, convolutional neural networks (CNNs) are widely used as feature extractors, while methods based on recurrent neural networks (RNNs) are commonly employed for sequence modeling. However, RNNs lack the ability to model global dependencies and have a limited attention span in the temporal dimension. This becomes a performance bottleneck for dynamic gestures that require sensitivity to temporal correlations. To address this issue, this paper proposes a dynamic gesture recognition model called R(2+1)D-Transformer. It is a Transformer-based approach that focuses on global modeling. Firstly, the R(2+1)D network is employed as a spatial-temporal feature extractor to capture the spatiotemporal information. Then, self-attention-based Transformer is used to map the spatiotemporal feature sequence to the semantic representation of gesture movements, considering both the temporal and spatial context. Finally, the gesture recognition results are obtained through an MLP classification head. Experimental results demonstrate the effectiveness and potential of the proposed R(2+1)D-Transformer model on two publicly available dynamic gesture datasets, IPN-Hand and NvGesture. The promising performance of the proposed approach provides valuable insights and reference for further research and applications in dynamic gesture recognition.

KW - Feature extraction

KW - Gesture recognition

KW - Transformer

UR - http://www.scopus.com/inward/record.url?scp=85177867103&partnerID=8YFLogxK

U2 - 10.1117/12.3008203

DO - 10.1117/12.3008203

M3 - Conference contribution

AN - SCOPUS:85177867103

T3 - Proceedings of SPIE - The International Society for Optical Engineering

BT - Third International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023

A2 - Wang, Yulin

A2 - Moshayedi, Ata Jahangir

PB - SPIE

T2 - 3rd International Conference on Computer Graphics, Image, and Virtualization, ICCGIV 2023

Y2 - 16 June 2023 through 18 June 2023

ER -
