3-D Speech Enhancement Using Trained Filter and Sum Network
In this example, you perform speech enhancement using a pretrained deep learning model. For details about the model and how it was trained, see Train 3-D Speech Enhancement Network Using Deep Learning. The speech enhancement model is an end-to-end deep beamformer that takes B-format ambisonic audio recordings and outputs enhanced mono speech signals.
Download Pretrained Network
Download the pretrained speech enhancement (SE) network, ambisonic test files, and labels. The model architecture is based on [1] and [4], as implemented in the baseline system for the L3DAS21 challenge task 1 [2]. The data the model was trained on and the ambisonic test files are provided as part of [2].
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","speechEnhancement/FaSNet.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
netFolder = fullfile(dataFolder,"speechEnhancement");
addpath(netFolder)
Load and Inspect Data
Load the clean speech and listen to it.
[cleanSpeech,fs] = audioread("cleanSpeech.wav");
soundsc(cleanSpeech,fs)
In the L3DAS21 challenge, "clean" speech files were taken from the LibriSpeech dataset and augmented to create synthetic three-dimensional acoustic scenes containing a randomly placed speaker and other sound sources typical of background noise in an office environment. The data is encoded as B-format ambisonics. First-order B-format ambisonic channels correspond to the sound pressure captured by an omnidirectional microphone (W) and the sound pressure gradients X, Y, and Z, corresponding to front/back, left/right, and up/down, captured by figure-of-eight capsules oriented along the three spatial axes. Load the ambisonic data.
[ambisonicData,fs] = audioread("ambisonicRecording.wav");
Listen to a channel of the ambisonic data.
channel = 1;
soundsc(ambisonicData(:,channel),fs)
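As a point of reference, this sketch splits the four B-format channels into named variables, assuming the conventional W, X, Y, Z channel ordering (the same ordering the supporting functions in this example use).

% Assumed first-order B-format channel ordering: W, X, Y, Z
W = ambisonicData(:,1); % omnidirectional sound pressure
X = ambisonicData(:,2); % front/back pressure gradient
Y = ambisonicData(:,3); % left/right pressure gradient
Z = ambisonicData(:,4); % up/down pressure gradient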
To plot the clean speech and the noisy ambisonic data, use the supporting function compareAudio.
compareAudio(cleanSpeech,ambisonicData,SampleRate=fs)
To visualize the spectrograms of the clean speech and the noisy ambisonic data, use the supporting function compareSpectrograms.
compareSpectrograms(cleanSpeech,ambisonicData)
Mel spectrograms are auditory-inspired transformations of spectrograms that emphasize, de-emphasize, and blur frequencies similarly to how the human auditory system does. To visualize the mel spectrograms of the clean speech and the noisy ambisonic data, use the supporting function compareSpectrograms and set Warp to "mel".
compareSpectrograms(cleanSpeech,ambisonicData,Warp="mel")
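The supporting function wraps the Audio Toolbox melSpectrogram function. To plot the mel spectrogram of a single channel directly, you can make the same call the supporting function makes internally.

% Plot the mel spectrogram of the omnidirectional channel directly.
melSpectrogram(ambisonicData(:,channel),fs)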
Perform 3-D Speech Enhancement
Use the supporting object, seModel, to perform speech enhancement. The seModel class definition is in the current folder when you open this example. The object encapsulates the SE model developed in Train 3-D Speech Enhancement Network Using Deep Learning. Create the model, then call enhanceSpeech on the ambisonic data to perform speech enhancement.
model = seModel(netFolder);
enhancedSpeech = enhanceSpeech(model,ambisonicData);
Listen to the enhanced speech. To compare the listening experience with the clean speech or the noisy ambisonic data, assign the desired signal to soundSource.
soundSource = enhancedSpeech;
soundsc(soundSource,fs)
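For example, to audition the clean reference or the noisy omnidirectional channel instead, reassign soundSource before calling soundsc:

% Alternative sound sources for comparison:
% soundSource = cleanSpeech;              % clean reference speech
% soundSource = ambisonicData(:,channel); % noisy omnidirectional channel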
Compare the clean speech, noisy speech, and enhanced speech in the time domain, as spectrograms, and as mel spectrograms.
compareAudio(cleanSpeech,ambisonicData,enhancedSpeech)
compareSpectrograms(cleanSpeech,ambisonicData,enhancedSpeech)
compareSpectrograms(cleanSpeech,ambisonicData,enhancedSpeech,Warp="mel")
Speech Enhancement for Speech-to-Text Applications
Compare the performance of the speech enhancement system on a downstream speech-to-text system. Use the wav2vec 2.0 speech-to-text model. This model requires a one-time download of pretrained weights to run. If you have not downloaded the wav2vec weights, the first call to speechClient provides a download link.
Create the wav2vec 2.0 speech client to perform transcription.
transcriber = speechClient("wav2vec2.0",segmentation="none");
Perform speech-to-text transcription using the clean speech, the ambisonic data, and the enhanced speech.
cleanSpeechResults = speech2text(transcriber,cleanSpeech,fs)
cleanSpeechResults = "i tell you it is not poison she cried"
noisySpeechResults = speech2text(transcriber,ambisonicData(:,channel),fs)
noisySpeechResults = "i tell you it is not parzona she cried"
enhancedSpeechResults = speech2text(transcriber,enhancedSpeech,fs)
enhancedSpeechResults = "i tell you it is not poisen she cried"
Speech Enhancement for Telecommunications Applications
Compare the performance of the speech enhancement system using the short-time objective intelligibility (STOI) measurement [5]. STOI has been shown to have a high correlation with the intelligibility of noisy speech and is commonly used to evaluate speech enhancement systems.
Calculate STOI for the omnidirectional channel of the ambisonic data and for the enhanced speech. A score of 1 indicates perfect intelligibility.
stoi(ambisonicData(:,channel),cleanSpeech,fs)
ans = 0.6941
stoi(enhancedSpeech,cleanSpeech,fs)
ans = single
0.8418
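As a sketch, you can also report the STOI improvement the enhancement provides by differencing the two scores. The enhanced score is returned in single precision, so cast it to double before differencing.

stoiNoisy = stoi(ambisonicData(:,channel),cleanSpeech,fs);
stoiEnhanced = double(stoi(enhancedSpeech,cleanSpeech,fs));
stoiImprovement = stoiEnhanced - stoiNoisy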
References
[1] Luo, Yi, Cong Han, Nima Mesgarani, Enea Ceolini, and Shih-Chii Liu. "FaSNet: Low-Latency Adaptive Beamforming for Multi-Microphone Audio Processing." In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 260–67. SG, Singapore: IEEE, 2019. https://doi.org/10.1109/ASRU46091.2019.9003849.
[2] Guizzo, Eric, Riccardo F. Gramaccioni, Saeid Jamili, Christian Marinoni, Edoardo Massaro, Claudia Medaglia, Giuseppe Nachira, et al. "L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing." In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), 1–6. Gold Coast, Australia: IEEE, 2021. https://doi.org/10.1109/MLSP52302.2021.9596248.
[3] Le Roux, Jonathan, Scott Wisdom, Hakan Erdogan, and John R. Hershey. "SDR – Half-Baked or Well Done?" In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 626–30. IEEE, 2019. https://doi.org/10.1109/ICASSP.2019.8683855.
[4] Luo, Yi, Zhuo Chen, and Takuya Yoshioka. "Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation." In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 46–50. IEEE, 2020. https://doi.org/10.1109/ICASSP40776.2020.9054266.
[5] Taal, Cees H., Richard C. Hendriks, Richard Heusdens, and Jesper Jensen. "An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech." IEEE Transactions on Audio, Speech, and Language Processing 19, no. 7 (September 2011): 2125–36. https://doi.org/10.1109/TASL.2011.2114881.
Supporting Functions
Compare Audio
function compareAudio(target,x,y,parameters)
%compareAudio Plot clean speech, B-format ambisonics, and predicted speech
% over time
arguments
    target
    x
    y = []
    parameters.SampleRate = 16e3
end

numToPlot = 2 + ~isempty(y);

f = figure;
tiledlayout(4,numToPlot,TileSpacing="compact",TileIndexing="columnmajor")
f.Position = [f.Position(1),f.Position(2),f.Position(3)*numToPlot,f.Position(4)];

t = (0:(size(x,1)-1))/parameters.SampleRate;

xmax = max(x(:));
xmin = min(x(:));

nexttile(1,[4,1])
plot(t,target,Color=[0 0.4470 0.7410])
axis tight
ylabel("Amplitude")
xlabel("Time (s)")
title("Clean Speech (Target Data)")
grid on

nexttile(5)
plot(t,x(:,1),Color=[0.8500 0.3250 0.0980])
title("Noisy Speech (B-Format Ambisonic Data)")
axis([t(1),t(end),xmin,xmax])
set(gca,Xticklabel=[],YtickLabel=[])
grid on
yL = ylabel("W",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")

nexttile(6)
plot(t,x(:,2),Color=[0.8600 0.3150 0.0990])
axis([t(1),t(end),xmin,xmax])
set(gca,Xticklabel=[],YtickLabel=[])
grid on
yL = ylabel("X",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")

nexttile(7)
plot(t,x(:,3),Color=[0.8700 0.3050 0.1000])
axis([t(1),t(end),xmin,xmax])
set(gca,Xticklabel=[],YtickLabel=[])
grid on
yL = ylabel("Y",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")

nexttile(8)
plot(t,x(:,4),Color=[0.8800 0.2950 0.1100])
axis([t(1),t(end),xmin,xmax])
xlabel("Time (s)")
set(gca,YtickLabel=[])
grid on
yL = ylabel("Z",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")

if numToPlot==3
    nexttile(9,[4,1])
    plot(t,y,Color=[0 0.4470 0.7410])
    axis tight
    xlabel("Time (s)")
    title("Enhanced Speech")
    grid on
    set(gca,YtickLabel=[])
end

end
Compare Spectrograms
function compareSpectrograms(target,x,y,parameters)
%compareSpectrograms Plot spectrograms of clean speech, B-format
% ambisonics, and predicted speech over time
arguments
    target
    x
    y = []
    parameters.SampleRate = 16e3
    parameters.Warp = "linear"
end
fs = parameters.SampleRate;

switch parameters.Warp
    case "linear"
        fn = @(x)spectrogram(x,hann(round(0.03*fs),"periodic"),round(0.02*fs),round(0.03*fs),fs,"onesided","power","yaxis");
    case "mel"
        fn = @(x)melSpectrogram(x,fs);
end

numToPlot = 2 + ~isempty(y);

f = figure;
tiledlayout(4,numToPlot,TileSpacing="tight",TileIndexing="columnmajor")
f.Position = [f.Position(1),f.Position(2),f.Position(3)*numToPlot,f.Position(4)];

nexttile(1,[4,1])
fn(target)
fh = gcf;
fh.Children(1).Children(1).Visible="off";
title("Clean Speech")

nexttile(5)
fn(x(:,1))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[],XtickLabel=[],Xlabel=[])
yL = ylabel("W",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")
title("Noisy Speech (B-Format Ambisonic Data)")

nexttile(6)
fn(x(:,2))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[],XtickLabel=[],Xlabel=[])
yL = ylabel("X",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")

nexttile(7)
fn(x(:,3))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[],XtickLabel=[],Xlabel=[])
yL = ylabel("Y",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")

nexttile(8)
fn(x(:,4))
fh = gcf;
fh.Children(1).Children(1).Visible="off";
set(gca,Yticklabel=[])
yL = ylabel("Z",FontWeight="bold");
set(yL,Rotation=0,VerticalAlignment="middle",HorizontalAlignment="right")

if numToPlot==3
    nexttile(9,[4,1])
    fn(y)
    fh = gcf;
    fh.Children(1).Children(1).Visible="off";
    set(gca,Yticklabel=[],Ylabel=[])
    title("Enhanced Speech")
end

end