Abstract
This study introduces an efficient deep neural network (DNN) framework for multimodal data fusion, targeting the integration of text, video, behavioral logs, and optical sensor data in smart education systems and talent cultivation platforms. The framework incorporates a dynamic feature selection module to prioritize critical multimodal features while reducing redundancy, alongside a hybrid compression pipeline that combines pruning and quantization to achieve a 62% reduction in floating-point operations (FLOPs) without compromising accuracy. A cross-modal contrastive alignment mechanism is further employed to bridge semantic gaps between heterogeneous modalities, leading to an F1-score of 91.5%. Incorporating optical sensing data enhances the detection of engagement levels, such as attention span and focus, which are critical for adaptive learning in smart cities.
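To illustrate the cross-modal contrastive alignment idea mentioned above, the following is a minimal sketch of a standard symmetric InfoNCE-style loss between paired embeddings from two modalities (for example, text and optical-sensor features projected into a shared space). The function name, temperature value, and two-modality setup are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's code): symmetric InfoNCE-style
# cross-modal contrastive alignment between two modality embeddings.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_a: torch.Tensor,
                                 z_b: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of the same samples in two modalities."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature           # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matching pairs (the diagonal) are positives; all other pairs are negatives.
    loss_a = F.cross_entropy(logits, targets)      # align modality A to B
    loss_b = F.cross_entropy(logits.t(), targets)  # align modality B to A
    return 0.5 * (loss_a + loss_b)
```

Minimizing such a loss pulls embeddings of the same sample from different modalities together while pushing apart embeddings of different samples, which is one common way to bridge semantic gaps between heterogeneous modalities.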