

Postprocess output of CREPE deep learning network

Since R2021a



    f0 = crepePostprocess(activations) converts the output of a CREPE pretrained network to pitch estimates in Hz.

    f0 = crepePostprocess(activations,'ConfidenceThreshold',TH) specifies the confidence threshold as a nonnegative scalar value less than 1.

    For example, f0 = crepePostprocess(activations,'ConfidenceThreshold',0.75) specifies a confidence threshold of 0.75.



    The CREPE network requires you to preprocess your audio signals to generate buffered, overlapped, and normalized audio frames that can be used as input to the network. This example walks through audio preprocessing using crepePreprocess and audio postprocessing with pitch estimation using crepePostprocess. The pitchnn function performs these steps for you.
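
    As a rough sketch of that shortcut, a single pitchnn call can stand in for the preprocess, predict, and postprocess steps. It assumes Audio Toolbox with the CREPE model files installed and reuses the audio file from this example; the name-value settings shown are illustrative choices, not requirements.

```matlab
% Sketch only: assumes Audio Toolbox and the pretrained CREPE model are installed
[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');

% pitchnn wraps crepePreprocess, the network prediction, and crepePostprocess
[f0,loc] = pitchnn(audioIn,fs,ModelCapacity="tiny",ConfidenceThreshold=0.8);

plot(loc,f0)
grid on
xlabel('Time (s)')
ylabel('Pitch Estimation (Hz)')
```

    The step-by-step workflow that follows is useful when you want to reuse the same activations with several thresholds without running the network again.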

    Read in an audio signal for pitch estimation. Visualize and listen to the audio. There are nine vocal utterances in the audio clip.

    [audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');
    soundsc(audioIn,fs)
    T = 1/fs;
    t = 0:T:(length(audioIn)*T) - T;
    plot(t,audioIn)
    grid on
    axis tight
    xlabel('Time (s)')
    ylabel('Amplitude')
    title('Singing in A Major')

    Use crepePreprocess to partition the audio into frames of 1024 samples with an 85% overlap between consecutive frames. The frames are placed along the fourth dimension.

    [frames,loc] = crepePreprocess(audioIn,fs);
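
    To confirm the layout described above, you can inspect the returned arrays. This is a minimal sketch that assumes Audio Toolbox is installed; the variable names frameLength, numFrames, and hopSeconds are introduced here for illustration only.

```matlab
% Sketch only: assumes Audio Toolbox and its bundled example audio file
[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');
[frames,loc] = crepePreprocess(audioIn,fs);

frameLength = size(frames,1);    % samples per frame
numFrames = size(frames,4);      % frames are stacked along dimension 4
hopSeconds = median(diff(loc));  % typical spacing between frame time stamps

fprintf('%d frames of %d samples, about %.1f ms apart\n', ...
    numFrames,frameLength,1000*hopSeconds);
```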

    Create a CREPE network with ModelCapacity set to tiny.

    netTiny = audioPretrainedNetwork("crepe",ModelCapacity="tiny");

    Predict the network activations.

    activationsTiny = predict(netTiny,frames);

    Use crepePostprocess to produce the fundamental frequency pitch estimation in Hz. Disable confidence thresholding by setting ConfidenceThreshold to 0.

    f0Tiny = crepePostprocess(activationsTiny,ConfidenceThreshold=0);

    Visualize the pitch estimation over time.

    plot(loc,f0Tiny)
    grid on
    axis tight
    xlabel('Time (s)')
    ylabel('Pitch Estimation (Hz)')
    title('CREPE Network Frequency Estimate - Thresholding Disabled')

    With confidence thresholding disabled, crepePostprocess provides a pitch estimate for every frame. Increase the ConfidenceThreshold to 0.8.

    f0Tiny = crepePostprocess(activationsTiny,ConfidenceThreshold=0.8);

    Visualize the pitch estimation over time.

    plot(loc,f0Tiny)
    grid on
    axis tight
    xlabel('Time (s)')
    ylabel('Pitch Estimation (Hz)')
    title('CREPE Network Frequency Estimate - Thresholding Enabled')
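
    One way to quantify the effect of the threshold is to count the frames that come back as NaN, which is how crepePostprocess marks frames below the confidence threshold. The sketch below uses a short made-up vector, f0Example, as a stand-in for f0Tiny so that it runs without the network.

```matlab
% Hypothetical pitch track standing in for f0Tiny; NaN marks rejected frames
f0Example = [NaN; 220.1; 219.8; NaN; NaN; 246.9];

isKept = ~isnan(f0Example);               % frames that passed the threshold
numRejected = sum(~isKept);               % count of rejected frames
fractionKept = mean(isKept);              % share of frames with an estimate
medianPitch = median(f0Example(isKept));  % summary over retained frames only

fprintf('%d frames rejected, %.0f%% kept, median pitch %.1f Hz\n', ...
    numRejected,100*fractionKept,medianPitch);
```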

    Create a new CREPE network with ModelCapacity set to full.

    netFull = audioPretrainedNetwork("crepe",ModelCapacity="full");

    Predict the network activations and convert them to pitch estimates, again using a confidence threshold of 0.8.

    activationsFull = predict(netFull,frames);
    f0Full = crepePostprocess(activationsFull,ConfidenceThreshold=0.8);

    Visualize the pitch estimation. There are nine primary groupings of pitch estimates, each corresponding to one of the nine vocal utterances.

    plot(loc,f0Full)
    grid on
    xlabel('Time (s)')
    ylabel('Pitch Estimation (Hz)')
    title('CREPE Network Frequency Estimate - Full')

    Find the time elements corresponding to the last vocal utterance.

    roundedLocVec = round(loc,2);
    lastUtteranceBegin = find(roundedLocVec == 16);
    lastUtteranceEnd = find(roundedLocVec == 18);
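
    Exact equality on rounded time stamps returns an empty result if no frame time rounds to the target, and several indices if more than one does. A nearest-neighbor lookup is a possible alternative. The sketch below uses a made-up time vector, locExample, with an assumed 10 ms spacing in place of loc.

```matlab
% Hypothetical time vector standing in for loc (10 ms spacing is an assumption)
locExample = 0.064:0.01:18;

% Pick the single frame closest to each boundary instead of testing equality
[~,beginIdx] = min(abs(locExample - 16));
[~,endIdx] = min(abs(locExample - 18));

segment = beginIdx:endIdx;   % frame indices covering the last utterance
```

    Because min returns one index, the range beginIdx:endIdx is always well defined.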

    For simplicity, take the most frequently occurring pitch estimate within the utterance group as the fundamental frequency estimate for that timespan. Generate a pure tone with a frequency matching the pitch estimate for the last vocal utterance.

    lastUtteranceEstimation = mode(f0Full(lastUtteranceBegin:lastUtteranceEnd))

    The value of lastUtteranceEstimation, 217.3 Hz, corresponds to the note A3. Overlay the synthesized tone on the last vocal utterance to audibly compare the two.
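
    The mapping from 217.3 Hz to A3 can be checked with a short calculation. This sketch assumes twelve-tone equal temperament with A4 = 440 Hz and the MIDI convention in which note 60 is C4; the variable names are illustrative.

```matlab
f = 217.3;                              % pitch estimate from the example, in Hz

midiExact = 69 + 12*log2(f/440);        % A4 = 440 Hz is MIDI note 69
midiNote = round(midiExact);            % nearest equal-tempered note
centsOff = 100*(midiExact - midiNote);  % deviation from that note, in cents

names = ["C","C#","D","D#","E","F","F#","G","G#","A","A#","B"];
noteName = names(mod(midiNote,12) + 1) + string(floor(midiNote/12) - 1);

fprintf('%s, %.1f cents from nominal\n',noteName,centsOff);
```

    Under these assumptions the estimate lands on A3 (nominally 220 Hz), roughly 21 cents flat.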

    lastVocalUtterance = audioIn(fs*16:fs*18);
    newTime = 0:T:2;
    compareTone = cos(2*pi*lastUtteranceEstimation*newTime).';
    soundsc(lastVocalUtterance + compareTone,fs);

    Call spectrogram to more closely inspect the frequency content of the singing. Use a frame size of 250 samples and an overlap of 225 samples or 90%. Use 4096 DFT points for the transform. The spectrogram reveals that the vocal recording is actually a set of complex harmonic tones composed of multiple frequencies.
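
    With the parameters described above, the call could look like the following sketch. It assumes Signal Processing Toolbox is available; the 'yaxis' option, which puts frequency on the vertical axis, is a display choice added here.

```matlab
% Sketch only: assumes Signal Processing Toolbox and the example audio file
[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');

% 250-sample window, 225-sample overlap (90%), 4096 DFT points
spectrogram(audioIn,250,225,4096,fs,'yaxis')
```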


    Input Arguments


    Activations output by a CREPE pretrained network, specified as an N-by-360 matrix, where N is the number of audio frames passed to the network.

    Data Types: single | double

    Confidence threshold for each value of f0, specified as the comma-separated pair consisting of 'ConfidenceThreshold' and a scalar in the range [0,1).

    To disable thresholding, set TH to 0.


    If the maximum value of the corresponding activations vector is less than TH, f0 is NaN.

    Data Types: single | double
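
    The thresholding rule can be illustrated with a toy example. This is a sketch of the rule as stated above, not the function's internal code: the activation matrix has 3 columns instead of 360, and the pitch values are made up.

```matlab
% Toy data: 2 frames, 3 bins each (a real activations matrix is N-by-360)
activationsExample = [0.1 0.9 0.2; 0.3 0.4 0.2];
f0Example = [220; 440];   % hypothetical per-frame pitch estimates in Hz
TH = 0.8;

% Confidence is taken as the largest activation in each frame (row)
confidence = max(activationsExample,[],2);

% Frames whose confidence is below TH are reported as NaN
f0Example(confidence < TH) = NaN;
```

    Here the first frame (confidence 0.9) keeps its estimate and the second (confidence 0.4) becomes NaN. With TH set to 0, no frame falls below the threshold, so every frame keeps its estimate.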

    Output Arguments


    Estimated fundamental frequency in Hertz, returned as an N-by-1 vector, where N is the number of generated frames.

    Data Types: single
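
    For intuition about how a 360-element activation vector can become a frequency, the sketch below follows the approach of the open-source reference CREPE implementation: bins spaced 20 cents apart, and a weighted average of the bins around the peak. Whether crepePostprocess uses exactly these constants and this window is an assumption; the activation vector is synthetic.

```matlab
% Synthetic activation: one smooth peak centered on bin 180 (made-up data)
bins = 1:360;
activation = exp(-0.5*((bins - 180)/1.5).^2);

% Bin centers in cents relative to 10 Hz, as in the reference CREPE code
centsMapping = linspace(0,7180,360) + 1997.3794084376191;

% Weighted average of the cents values in a 9-bin window around the peak
[confidence,peak] = max(activation);
window = max(1,peak - 4):min(360,peak + 4);
weights = activation(window);
cents = sum(weights.*centsMapping(window))/sum(weights);

f0 = 10*2^(cents/1200);   % convert cents (relative to 10 Hz) to Hz
```

    Averaging over neighboring bins lets the estimate fall between bin centers, so the resolution is finer than the 20-cent bin spacing.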


    [1] Kim, Jong Wook, Justin Salamon, Peter Li, and Juan Pablo Bello. “Crepe: A Convolutional Representation for Pitch Estimation.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 161–65. Calgary, AB: IEEE, 2018.

    Extended Capabilities

    C/C++ Code Generation
    Generate C and C++ code using MATLAB® Coder™.

    Version History

    Introduced in R2021a