YAMNet Preprocess

Preprocess audio for YAMNet classification

Since R2021b

Libraries:
Audio Toolbox / Deep Learning

Description

The YAMNet Preprocess block generates mel spectrograms from audio input that can be fed to the YAMNet pretrained network or to a network that accepts the same inputs as YAMNet.

Examples

Compare Sound Classifier Block with Equivalent YAMNet Blocks

Show that the Sound Classifier block is equivalent to the cascade of the YAMNet Preprocess block and the YAMNet block.

Detect Air Compressor Sounds in Simulink Using YAMNet

Use a pretrained network, generated using transfer learning, in Simulink®.

Ports

Input

audioIn — Sound data
column vector

Sound data to classify, specified as a one-channel signal (column vector). If Sample rate of input signal (Hz) is 16e3, there are no restrictions on the input frame length. If Sample rate of input signal (Hz) is different from 16e3, then the input frame length must be a multiple of the decimation factor of the resampling operation that the block performs. If the input frame length does not satisfy this condition, the block throws an error message with information on the decimation factor.

Data Types: single | double
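
The frame-length rule for audioIn can be illustrated with a short Python sketch. It assumes an integer sample rate and assumes that the resampling ratio to 16 kHz is reduced to lowest terms with the greatest common divisor; the block may derive its decimation factor differently, so treat the numbers as illustrative. The function names are hypothetical.

```python
from math import gcd

TARGET_RATE = 16_000  # rate the block resamples to, in Hz


def resample_ratio(sample_rate: int) -> tuple[int, int]:
    """Return (interpolation, decimation) factors for sample_rate -> 16 kHz.

    Assumption: the ratio is reduced to lowest terms with the gcd.
    The block itself may pick its factors differently.
    """
    g = gcd(sample_rate, TARGET_RATE)
    return TARGET_RATE // g, sample_rate // g


def frame_length_ok(frame_length: int, sample_rate: int) -> bool:
    """True if frame_length is a multiple of the decimation factor."""
    _, decimation = resample_ratio(sample_rate)
    return frame_length % decimation == 0


print(resample_ratio(44_100))          # (160, 441)
print(frame_length_ok(882, 44_100))    # True: 882 = 2 * 441
print(frame_length_ok(1024, 44_100))   # False
print(frame_length_ok(1024, 16_000))   # True: decimation factor is 1
```

Under these assumptions, a 16 kHz input has a decimation factor of 1, which matches the statement that any frame length is accepted at that rate.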

Output

features — Mel spectrograms that can be fed to YAMNet pretrained network
96-by-64 matrix

Mel spectrograms generated from audioIn, returned as a 96-by-64 matrix, where:

96: the number of 25 ms frames in each mel spectrogram
64: the number of mel bands spanning 125 Hz to 7.5 kHz

The overlap between consecutive 96-by-64 mel spectrograms is determined by the value of the Overlap percentage (%) parameter.

Each 96-by-64 matrix represents a single mel spectrogram. For more details on how this block generates mel spectrograms, see Algorithms.

Data Types: single
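
As a rough illustration of the output geometry, the Python sketch below works out how many 16 kHz samples one 96-by-64 mel spectrogram spans and how far consecutive spectrograms advance for a given overlap percentage. The window and hop sizes (400 and 160 samples) come from the Algorithms section. The rounding rule for the overlap is an assumption, not documented behavior, and the function names are hypothetical.

```python
WINDOW = 400   # 25 ms analysis window at 16 kHz
HOP = 160      # 10 ms hop at 16 kHz
FRAMES = 96    # spectra per mel spectrogram


def samples_per_spectrogram() -> int:
    """Number of 16 kHz samples covered by one 96-frame spectrogram."""
    return (FRAMES - 1) * HOP + WINDOW


def spectrogram_hop_frames(overlap_percent: float) -> int:
    """Frames between the starts of consecutive spectrograms.

    Assumption: the overlap is rounded to a whole number of frames and
    the hop is at least one frame. The block may round differently.
    """
    if not 0 <= overlap_percent < 100:
        raise ValueError("overlap_percent must be in [0, 100)")
    overlap_frames = round(FRAMES * overlap_percent / 100)
    return max(1, FRAMES - overlap_frames)


print(samples_per_spectrogram())     # 15600 samples, i.e. 0.975 s
print(spectrogram_hop_frames(50))    # 48
print(spectrogram_hop_frames(0))     # 96
```

With the default 50% overlap, this sketch advances by 48 frames, so a new spectrogram would start roughly every 0.48 s of audio.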

Parameters

Sample rate of input signal (Hz) — Sample rate of input signal in Hz
`16e3` (default) | positive scalar

Sample rate of the input signal in Hz, specified as a positive scalar.

Data Types: single | double

Overlap percentage (%) — Overlap percentage between consecutive mel spectrograms
`50` (default) | [0 100)

Specify the overlap percentage between consecutive mel spectrograms as a scalar in the range [0 100).

Data Types: single | double

Block Characteristics

Data Types	`double` | `single`
Direct Feedthrough	`no`
Multidimensional Signals	`no`
Variable-Size Signals	`no`
Zero-Crossing Detection	`no`

Algorithms

The YAMNet Preprocess block generates mel spectrograms from audio input. These mel spectrograms can be fed to a YAMNet pretrained network or to a network that accepts the same inputs as YAMNet.

Preprocessing steps

1. Cast audioIn to single and resample to 16 kHz.
2. Compute the one-sided short-time Fourier transform (STFT) using a 25 ms periodic Hann window (400 samples) with a 10 ms hop (160 samples) and a 512-point DFT.
3. Convert the complex spectral values to magnitude and discard phase information.
4. Pass the one-sided magnitude STFTs through a 64-band mel-spaced filter bank. Doing so converts the 257-length STFT vectors to 64-length vectors in the mel scale.
5. Convert the 64-length vectors to a log scale.
6. Buffer the vectors into outputs of size 96-by-64, where 96 is the number of spectra in the mel spectrogram and 64 is the number of mel bands. The overlap between consecutive 96-by-64 mel spectrograms is determined by the value of the Overlap percentage (%) parameter.
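
The steps above can be sketched in plain Python, starting from audio that is already at 16 kHz (the cast and resampling step is omitted). This is a minimal illustration, not the block's implementation. The mel formula, the triangular filter shape, the natural logarithm, and the `LOG_OFFSET` value are assumptions that the text above does not specify, and all function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 16_000
WINDOW, HOP, NFFT = 400, 160, 512
NUM_BANDS, F_LOW, F_HIGH = 64, 125.0, 7500.0
LOG_OFFSET = 1e-3  # hypothetical value; keeps log() finite for silent bands
TWIDDLE = [cmath.exp(-2j * math.pi * m / NFFT) for m in range(NFFT)]


def hann_periodic(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrum(frame, window):
    """One-sided magnitude of the NFFT-point DFT of a windowed frame."""
    x = [s * w for s, w in zip(frame, window)]
    return [abs(sum(v * TWIDDLE[(k * n) % NFFT] for n, v in enumerate(x)))
            for k in range(NFFT // 2 + 1)]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # assumed mel formula


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank():
    """NUM_BANDS triangular filters, each a row of 257 weights (assumed shape)."""
    lo, hi = hz_to_mel(F_LOW), hz_to_mel(F_HIGH)
    edges = [mel_to_hz(lo + (hi - lo) * i / (NUM_BANDS + 1))
             for i in range(NUM_BANDS + 2)]
    freqs = [k * SAMPLE_RATE / NFFT for k in range(NFFT // 2 + 1)]
    bank = []
    for b in range(NUM_BANDS):
        left, center, right = edges[b], edges[b + 1], edges[b + 2]
        bank.append([max(0.0, min((f - left) / (center - left),
                                  (right - f) / (right - center)))
                     for f in freqs])
    return bank


def log_mel_frames(audio):
    """STFT, magnitude, mel filtering, and log: one 64-value row per hop."""
    window, bank = hann_periodic(WINDOW), mel_filter_bank()
    rows = []
    for start in range(0, len(audio) - WINDOW + 1, HOP):
        mag = magnitude_spectrum(audio[start:start + WINDOW], window)
        rows.append([math.log(sum(w * m for w, m in zip(f, mag)) + LOG_OFFSET)
                     for f in bank])
    return rows


def buffer_spectrograms(rows, hop_frames=48):
    """Buffer rows into 96-by-64 spectrograms; 48 corresponds to 50% overlap."""
    return [rows[i:i + 96] for i in range(0, len(rows) - 96 + 1, hop_frames)]


# Three analysis frames of a 1 kHz tone.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
        for n in range(WINDOW + 2 * HOP)]
frames = log_mel_frames(tone)
print(len(frames), len(frames[0]))  # 3 64
```

A pure-Python DFT is slow and is used here only to keep the sketch dependency-free; a practical implementation would use an FFT routine.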


Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using Simulink® Coder™.
