Objective: To create a process that can classify any audio data; in particular, to demonstrate its usefulness by classifying environmental sounds.
Data Source: https://www.kaggle.com/datasets/mmoreaux/environmental-sound-classification-50
The dataset covers 50 different classes, each with 40 audio samples of 5 seconds, sampled at 16 kHz. The samples have been concatenated by class, yielding 50 WAV files of 3 min 20 s each.
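Because each class file is exactly 40 concatenated 5-second clips at a fixed sample rate, the original samples can be recovered with a simple reshape. A minimal sketch with NumPy (the zero-filled array stands in for one class's audio, which in practice would be loaded with a library such as soundfile or librosa):

```python
import numpy as np

SR = 16_000          # dataset sample rate (16 kHz)
CLIP_SEC = 5         # each class file concatenates 40 five-second samples
N_CLIPS = 40

# Stand-in for one class's concatenated file (3 min 20 s = 200 s of audio).
signal = np.zeros(SR * CLIP_SEC * N_CLIPS, dtype=np.float32)

# Recover the original 5-second samples by reshaping:
# one row per clip, SR * CLIP_SEC samples per row.
clips = signal.reshape(N_CLIPS, SR * CLIP_SEC)
print(clips.shape)   # (40, 80000)
```

This gives a (40, 80000) array per class, a convenient shape for per-clip feature extraction downstream.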
Visualizing Audio Files: Mel-spectrograms and Mel-frequency Cepstral Coefficients (MFCCs): Building machine learning models to classify, describe, or generate audio typically involves modeling tasks whose input data are audio samples. Audio samples are usually represented as time series, where the y-axis measures the amplitude of the waveform (see image below). However, the waveform itself does not necessarily yield clear class-identifying information. One of the best features to extract from audio waveforms (and digital signals in general) has proven to be the Mel-frequency cepstral coefficients (MFCCs).
Analysis: Steps for Calculating MFCCs for Audio Samples: (1) Slice the signal into short frames (of time); (2) Compute the periodogram estimate of the power spectrum for each frame; (3) Apply the mel filterbank to the power spectra and sum the energy in each filter; and (4) Take the discrete cosine transform (DCT) of the log filterbank energies.
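The four steps above can be sketched directly in NumPy/SciPy. The parameter choices below (25 ms frames, 10 ms hop, 512-point FFT, 26 mel filters, 13 coefficients) are common illustrative defaults, not values taken from this project; in practice a library such as librosa would typically compute this in one call.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16_000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    # (1) Slice the signal into short frames (25 ms frames, 10 ms hop
    #     at 16 kHz) and apply a Hamming window to each frame.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # (2) Periodogram estimate of the power spectrum for each frame.
    mag = np.abs(np.fft.rfft(frames, n_fft))
    power = (mag ** 2) / frame_len

    # (3) Mel filterbank: triangular filters spaced evenly on the mel
    #     scale; apply to the power spectra and sum energy per filter.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = power @ fbank.T
    energies = np.where(energies == 0, np.finfo(float).eps, energies)

    # (4) DCT of the log filterbank energies; keep the lowest coefficients.
    return dct(np.log(energies), type=2, axis=1, norm="ortho")[:, :n_coeffs]

# One second of a 440 Hz tone as a stand-in for a real clip.
t = np.linspace(0, 1, 16_000, endpoint=False)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t))
print(coeffs.shape)   # (98, 13): one 13-coefficient vector per frame
```

The result is one small, decorrelated feature vector per frame, which is what makes MFCCs convenient inputs for a classifier.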
Output: A subset of the classified sounds' waveforms (left) and corresponding Mel spectrograms (right) are shown below.