Abstract
5G/6G technology strengthens skeleton-based human action recognition (HAR) by delivering the ultra-low latency and high data throughput required for real-time, accurate security analysis of human actions. Despite its growing popularity, current HAR methods frequently fail to capture the complexities of skeleton sequences. This study proposes a novel multimodal method that synergizes a Spatial-Temporal Attention LSTM (STA-LSTM) network with a Convolutional Neural Network (CNN) to extract nuanced features from the skeleton sequence. The STA-LSTM network models inter- and intra-frame relations in depth, while the CNN uncovers geometric correlations within the human skeleton. By integrating the Choquet fuzzy integral, we achieve a harmonized fusion of the classifiers built on each feature vector, and adopting the Kullback-Leibler and Jensen-Shannon divergences further ensures the complementary nature of these feature vectors. Together, the STA-LSTM network and CNN significantly advance human action recognition. Our approach demonstrated impressive accuracy on the benchmark skeleton datasets NTU-60, NTU-120, HDM05, and UTD-MHAD: it achieved cross-subject accuracies of 90.75% and 84.50% and cross-setting accuracies of 96.7% and 86.70% on NTU-60 and NTU-120, respectively, while recording 93.5% on HDM05 and 97.43% on UTD-MHAD. These results indicate that our model outperforms current techniques and shows excellent potential for sentiment analysis platforms that combine textual and visual signals.
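For reference, the fusion step rests on two standard constructions. The forms below are a minimal sketch of the textbook definitions, assuming classifier confidence scores $h_1, \dots, h_n$ (one per feature stream) and a fuzzy measure $\mu$ over the set of classifiers; how $\mu$ is learned and how the divergences enter the training objective are specifics of the paper and are not reproduced here. The Kullback-Leibler and Jensen-Shannon divergences between two class distributions $P$ and $Q$ are

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}, \qquad D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M), \quad M = \tfrac{1}{2}(P + Q),$$

and the discrete Choquet integral that aggregates the scores, with the scores reindexed so that $h_{(1)} \le \cdots \le h_{(n)}$, $h_{(0)} = 0$, and $A_{(i)} = \{(i), \ldots, (n)\}$, is

$$C_{\mu}(h) = \sum_{i=1}^{n} \bigl( h_{(i)} - h_{(i-1)} \bigr)\, \mu\bigl(A_{(i)}\bigr).$$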