Deep Learning, Natural Language Processing
Speech Emotion Recognition (SER) Using CNN and LSTMs
Emotions that are expressed through speech carry extra insights into human actions and reasoning.

Emotions are a basic part of human psychology and translate directly into human actions, and the human voice is a remarkable instrument for reflecting many of them. Emotions expressed through speech carry extra insights into human actions and reasoning, and studying these relationships in depth can help us better understand people's motives. Emotion recognition therefore plays an important role in human-computer interaction.
My interest in this subject led me to build a model that can help classify basic human emotions. In this article, I will share how I did that.
The model was trained on an English-language dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Based on recent studies, Mel spectrograms help extract important features from audio data, and those features were fed into a CNN+LSTM model.
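As a rough illustration of the kind of CNN+LSTM stack this refers to, here is a minimal Keras sketch. It assumes Mel-spectrogram inputs with 200 Mel bands, a fixed number of time frames after padding, and the eight RAVDESS emotion classes; the layer sizes are placeholders rather than the exact architecture from the repository.
from tensorflow import keras
from tensorflow.keras import layers

n_mels = 200       # Mel bands, matching the feature-extraction settings used later
max_frames = 300   # assumed fixed number of time frames after padding/truncating
num_classes = 8    # RAVDESS labels eight emotions

model = keras.Sequential([
    keras.Input(shape=(n_mels, max_frames, 1)),
    # CNN front-end learns local time-frequency patterns from the Mel spectrogram
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Re-arrange so time comes first, then flatten frequency x channels per frame
    layers.Permute((2, 1, 3)),
    layers.Reshape((max_frames // 4, (n_mels // 4) * 64)),
    # LSTM models how the learned features evolve over time
    layers.LSTM(128),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
The intuition behind the combination is that the convolutional layers pick out local time-frequency patterns, while the LSTM then reads those patterns as a sequence over time.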
I have saved all of my code on GitHub: https://github.com/msaleem18/Speech_Emotion_Recognition
Dataset
For my model, I used the RAVDESS dataset mentioned above.
To read and process the audio data in Python, I used the Librosa library; the final data is stored as a NumPy array.
import numpy as np
import pandas as pd
import librosa as lib
import librosa.display

path = "/Users/saad/Saad/Education/Ryerson/MRP/Dataset/Audio_Speech_Actors_01-24/ALL"

# READ ENGLISH FILES
# Lists to hold the metadata encoded in each RAVDESS file name
# (modality, vocal channel, emotion, intensity, statement, repetition, actor)
# plus the gender, duration, sample rate and audio features for each recording
files = []
modality = []
vocal = []
emotion = []
intensity = []
statement = []
repetition = []
actor = []
gender = []
time = []
audio_data = []
sr = []
# Trackers for the largest and smallest feature-matrix dimensions seen
max_row = 0
max_col = 0
min_row = 1000
min_col = 1000

# Mel-spectrogram parameters
n_fft = 2048
hop_length = 512
n_mels = 200
for file_name in…
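The loop itself is where each file name is parsed and the Mel spectrogram is computed. Below is a rough sketch of what such a loop body can look like, assuming the RAVDESS file-name convention (modality-vocal channel-emotion-intensity-statement-repetition-actor, with even-numbered actors being female) and the Librosa calls introduced above; the details are illustrative, and the full loop is in the GitHub repository linked earlier.
import os

for file_name in os.listdir(path):
    if not file_name.endswith(".wav"):
        continue

    # RAVDESS file names look like "03-01-05-01-02-01-12.wav":
    # modality-vocal channel-emotion-intensity-statement-repetition-actor
    parts = file_name.replace(".wav", "").split("-")
    files.append(file_name)
    modality.append(int(parts[0]))
    vocal.append(int(parts[1]))
    emotion.append(int(parts[2]))
    intensity.append(int(parts[3]))
    statement.append(int(parts[4]))
    repetition.append(int(parts[5]))
    actor.append(int(parts[6]))
    gender.append("female" if int(parts[6]) % 2 == 0 else "male")

    # Load the waveform and turn it into a dB-scaled Mel spectrogram
    y, sample_rate = lib.load(os.path.join(path, file_name), sr=None)
    mel = lib.feature.melspectrogram(y=y, sr=sample_rate, n_fft=n_fft,
                                     hop_length=hop_length, n_mels=n_mels)
    mel_db = lib.power_to_db(mel, ref=np.max)

    time.append(len(y) / sample_rate)
    sr.append(sample_rate)
    audio_data.append(mel_db)

    # Track the largest and smallest spectrogram shapes seen so far
    max_row, max_col = max(max_row, mel_db.shape[0]), max(max_col, mel_db.shape[1])
    min_row, min_col = min(min_row, mel_db.shape[0]), min(min_col, mel_db.shape[1])
Tracking the smallest and largest shapes makes it straightforward to pad or trim every spectrogram to a common size before it goes into the network.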