Classify sounds in audio signal
Audio Toolbox / Deep Learning
The Sound Classifier block uses YAMNet to classify audio segments into sound classes described by the AudioSet ontology. The block combines the required audio preprocessing with YAMNet network inference, and returns the predicted sound label, the prediction scores for each class, and the class labels that correspond to those scores.
audioIn — Sound data
Sound data to classify, specified as a one-channel signal (column vector). If Sample rate of input signal (Hz) is 16e3, there are no restrictions on the input frame length. If Sample rate of input signal (Hz) differs from 16e3, then the input frame length must be a multiple of the decimation factor of the resampling operation that the block performs. If the input frame length does not satisfy this condition, the block returns an error that states the required decimation factor.
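The decimation factor follows from expressing the rational resampling ratio 16e3/fs in lowest terms. This Python sketch illustrates the arithmetic (the function names are illustrative and not part of the block's interface):

```python
from math import gcd

def decimation_factor(fs_in, fs_out=16000):
    # Resampling by the rational factor fs_out/fs_in in lowest terms
    # decimates by fs_in/gcd(fs_in, fs_out).
    return int(fs_in) // gcd(int(fs_in), int(fs_out))

def check_frame_length(frame_len, fs_in):
    # Mimics the restriction described above: the input frame length
    # must be a multiple of the decimation factor.
    q = decimation_factor(fs_in)
    if frame_len % q != 0:
        raise ValueError(f"frame length must be a multiple of {q}")
    return True
```

For example, resampling from 44100 Hz to 16000 Hz reduces to 160/441, so the input frame length must be a multiple of 441; from 48000 Hz the factor is 3.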
sound — Predicted sound label
Predicted sound label, returned as an enumerated scalar.
scores — Predicted activations or scores
Predicted activation or score values for each supported sound label, returned as a 1-by-521 vector, where 521 is the number of classes in YAMNet.
labels — Class labels for predicted scores
Class labels for predicted scores, returned as a 1-by-521 vector.
Sample rate of input signal (Hz) — Sample rate of input signal in Hz
16e3 (default) | positive scalar
Specify the sample rate of the input signal as a positive scalar in Hz. If the sample rate is different from 16e3, then the block resamples the signal to 16e3, which is the sample rate that YAMNet supports.
Overlap percentage (%) — Overlap percentage between consecutive mel spectrograms
50 (default) | [0 100)
Specify the overlap percentage between consecutive mel spectrograms as a scalar in the range [0 100).
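The overlap percentage determines how far the block advances between successive 96-frame mel spectrograms. This Python sketch shows the arithmetic; the exact rounding rule the block applies is an assumption here:

```python
def spectrogram_hop(overlap_pct, frames_per_spec=96, hop_samples_per_frame=160):
    # Advance, in mel-spectrogram frames and in 16 kHz samples, between
    # successive 96-by-64 spectrograms. Rounding behavior is assumed.
    hop_frames = max(1, round(frames_per_spec * (1 - overlap_pct / 100)))
    return hop_frames, hop_frames * hop_samples_per_frame
```

With the default 50% overlap, successive spectrograms advance by 48 frames, which corresponds to 7680 samples at 16 kHz.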
Classification — Select to output sound classification
Enable the output port sound, which outputs the classified sound.
Predictions — Output all scores and associated labels
Enable the output ports scores and labels, which output all predicted scores and associated class labels.
The Sound Classifier block algorithm consists of two steps:
Preprocessing –– YAMNet-specific preprocessing that generates mel spectrograms.
Prediction –– Predicting the sounds, scores, and labels of the input signal using the YAMNet sound classification network.
Cast audioIn to single and resample to 16 kHz.
Compute the one-sided short-time Fourier transform (STFT) using a 25 ms periodic Hann window (400 samples) with a 10 ms hop (160 samples) and a 512-point DFT.
Convert the complex spectral values to magnitude and discard phase information.
Pass the one-sided magnitude STFTs through a 64-band mel-spaced filter bank. Doing so converts the 257-length STFT vectors to 64-length vectors in the mel scale.
Convert the 64-length vectors to a log scale.
Buffer the vectors into outputs of size 96-by-64, where 96 is the number of 10 ms frames in each mel spectrogram and 64 is the number of mel bands. The overlap between consecutive 96-by-64 mel spectrograms is determined by the value of the Overlap percentage (%) parameter.
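The frame and spectrogram counts implied by the steps above can be sketched in Python (assuming no signal padding; the block's exact edge handling may differ):

```python
def num_stft_frames(n_samples, win=400, hop=160):
    # Number of one-sided STFT frames from a 16 kHz signal using a
    # 25 ms (400-sample) window and a 10 ms (160-sample) hop.
    return 1 + (n_samples - win) // hop if n_samples >= win else 0

def num_spectrograms(n_frames, frames_per_spec=96, overlap_pct=50):
    # Number of 96-by-64 mel spectrograms, given the overlap percentage.
    # The rounding rule for the hop is an assumption.
    hop = max(1, round(frames_per_spec * (1 - overlap_pct / 100)))
    if n_frames < frames_per_spec:
        return 0
    return 1 + (n_frames - frames_per_spec) // hop
```

Under these assumptions, 0.975 s of audio at 16 kHz (15600 samples) yields exactly 96 STFT frames, that is, one 96-by-64 spectrogram.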
These 96-by-64 spectrograms are passed to the YAMNet network. The Sound Classifier block has a maximum of three outputs:
sound: The label of the most likely sound. The block outputs one sound label for each 96-by-64 spectrogram input.
scores: A 1-by-521 vector with a score value for each supported sound label.
labels: A 1-by-521 vector containing the sound labels.
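The relationship between the three outputs can be illustrated with a short Python sketch: sound is the label whose score is highest. Toy values are used here; the real block returns 521 scores and labels:

```python
def predicted_sound(scores, labels):
    # Pick the label with the highest activation, mirroring how the
    # sound output relates to the scores and labels outputs.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]
```

For example, with scores [0.1, 0.7, 0.2] over labels ["Speech", "Music", "Dog"], the predicted sound is "Music".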
Usage notes and limitations:
The Language parameter in the Configuration Parameters > Code Generation general category must be set to C++.
For ERT-based targets, the Support: variable-size signals parameter in the Code Generation > Interface pane must be enabled.
For a list of networks and layers supported for code generation, see Networks and Layers Supported for Code Generation (MATLAB Coder).