| Abstract |
For thousands of years, communication has played a crucial role in human existence, development, and globalization. Speech recognition has many applications, including biometric analysis, education, security, health care, and smart cities. Researchers have long studied how machine learning can be applied to speech processing, particularly voice recognition, and in recent years the focus has shifted to applying deep learning to problems involving human speech. In this paper, we describe our work using deep neural networks, namely a convolutional recurrent neural network (CRNN) and a gated recurrent unit (GRU) network, to recognize audio samples in spoken language. Seven classes of audio samples (Walk & footsteps, Kids speaking, Filling with water, Bass drum, Scissors, Clock, and Cough) were drawn from the Free Sound Datasets. The features used for recognition include Mel-spectral coefficients along with other spectral and intensity-related parameters. White noise and a retuned voice were used for data augmentation. According to the results, the GRU model achieved an average recognition accuracy of 93.25% and a word error rate (WER) of 7.84%. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. |
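
The abstract reports a word error rate but does not state how it was computed. As a minimal sketch, assuming the conventional definition WER = (S + D + I) / N, where S, D, and I are the substituted, deleted, and inserted words and N is the number of words in the reference, the metric can be obtained from a word-level edit distance. The function name `word_error_rate` and the sample strings are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (not from the paper): assumes whitespace
    tokenization and the standard WER = (S + D + I) / N definition.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One word of a four-word reference is dropped by the recognizer.
    print(word_error_rate("the clock is ticking", "the clock ticking"))
```

Because insertions count as errors, this ratio can exceed 1, and it is not simply the complement of classification accuracy.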