
voiceActivityDetector

Detect presence of speech in audio signal

Description

The voiceActivityDetector System object™ detects the presence of speech in an audio segment. You can also use the voiceActivityDetector System object to output an estimate of the noise variance per frequency bin.

To detect the presence of speech:

  1. Create the voiceActivityDetector object and set its properties.

  2. Call the object with arguments, as if it were a function.

To learn more about how System objects work, see What Are System Objects?
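
For example, this minimal sketch (using an arbitrary frame of low-level noise as a stand-in for recorded speech) follows those two steps:

% Step 1: create the object and set its properties (defaults used here).
VAD = voiceActivityDetector;

% Step 2: call the object with arguments, as if it were a function.
audioIn = 0.1*randn(441,1);
[probability,noiseEstimate] = VAD(audioIn);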

Creation

Description

VAD = voiceActivityDetector creates a System object, VAD, that detects the presence of speech independently across each input channel.

VAD = voiceActivityDetector(Name,Value) sets each property Name to the specified Value. Unspecified properties have default values.

Example: VAD = voiceActivityDetector('InputDomain','Frequency') creates a System object, VAD, that accepts frequency-domain input.
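
With frequency-domain input, you supply the windowed DTFT of each frame yourself. A minimal sketch, assuming a Hann-windowed 160-sample frame of noise (the window choice and frame length here are illustrative assumptions):

VAD = voiceActivityDetector('InputDomain','Frequency');

frame = 0.1*randn(160,1);                  % stand-in for one audio frame
X = fft(frame.*hann(160,'periodic'));      % windowed DTFT of the frame
probability = VAD(X);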

Properties


Unless otherwise indicated, properties are nontunable, which means you cannot change their values after calling the object. Objects lock when you call them, and the release function unlocks them.

If a property is tunable, you can change its value at any time.

For more information on changing property values, see System Design in MATLAB Using System Objects.

InputDomain - Domain of the input signal, specified as 'Time' or 'Frequency'.

Tunable: No

Data Types: char | string

FFTLength - FFT length, specified as a positive scalar. The default is [], which means that the FFT length is equal to the number of rows of the input.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time'.

Data Types: single | double

Window - Time-domain window function applied before calculating the discrete-time Fourier transform (DTFT), specified as 'Hann', 'Rectangular', 'Flat Top', 'Hamming', 'Chebyshev', or 'Kaiser'.

The window function is designed using the algorithm of the corresponding window design function: hann, rectwin, flattopwin, hamming, chebwin, or kaiser.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time'.

Data Types: char | string

SidelobeAttenuation - Sidelobe attenuation of the window in dB, specified as a real positive scalar.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time' and Window to 'Chebyshev' or 'Kaiser'.

Data Types: single | double
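
For example, a short sketch that selects a Chebyshev analysis window with 80 dB of sidelobe attenuation (the attenuation value is arbitrary):

VAD = voiceActivityDetector('Window','Chebyshev','SidelobeAttenuation',80);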

SilenceToSpeechProbability - Probability of transition from a frame of silence to a frame of speech, specified as a scalar in the range [0,1].

Tunable: Yes

Data Types: single | double

SpeechToSilenceProbability - Probability of transition from a frame of speech to a frame of silence, specified as a scalar in the range [0,1].

Tunable: Yes

Data Types: single | double
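
Because the two transition probabilities are tunable, you can adjust them between calls without releasing the object. A minimal sketch on frames of noise:

VAD = voiceActivityDetector;
p1 = VAD(0.1*randn(441,1));      % calling the object locks it

% Tunable properties can change even while the object is locked.
VAD.SilenceToSpeechProbability = 0.4;
VAD.SpeechToSilenceProbability = 0.05;
p2 = VAD(0.1*randn(441,1));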

Usage

Description


[probability,noiseEstimate] = VAD(audioIn) applies the voice activity detector to the input, audioIn, and returns the probability that speech is present. It also returns the estimated noise variance per frequency bin.
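
For example, calling the detector on a two-channel matrix of noise (a stand-in for stereo audio) returns one probability per channel and one column of noise-variance estimates per channel:

VAD = voiceActivityDetector;
stereoIn = 0.1*randn(441,2);                   % two independent channels
[probability,noiseEstimate] = VAD(stereoIn);   % probability is a 1-by-2 row vector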

Input Arguments


Audio input to the voice activity detector, specified as a scalar, vector, or matrix. If audioIn is a matrix, the columns are treated as independent audio channels.

The size of the audio input is locked after the first call to the voiceActivityDetector object. To change the size of audioIn, call release on the object.
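
For example, a sketch that changes the frame size between calls by releasing the object first (the frame sizes are arbitrary):

VAD = voiceActivityDetector;
p = VAD(0.1*randn(480,1));    % input size is now locked to 480 rows

release(VAD)                  % unlock the object to allow the change
p = VAD(0.1*randn(960,1));    % call again with a different frame size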

If InputDomain is set to 'Time', audioIn must be real-valued. If InputDomain is set to 'Frequency', audioIn can be real-valued or complex-valued.

Data Types: single | double
Complex Number Support: Yes

Output Arguments


Probability that speech is present, returned as a scalar or row vector with the same number of columns as audioIn.

Data Types: single | double

Estimate of the noise variance per frequency bin, returned as a column vector or matrix with the same number of columns as audioIn.

Data Types: single | double

Object Functions

To use an object function, specify the System object as the first input argument. For example, to release system resources of a System object named obj, use this syntax:

release(obj)


clone - Create duplicate System object
isLocked - Determine if System object is in use
release - Release resources and allow changes to System object property values and input characteristics
reset - Reset internal states of System object
step - Run System object algorithm
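
A short sketch that exercises a few of these functions on a locked detector:

VAD = voiceActivityDetector;
p = VAD(0.1*randn(441,1));    % calling the object locks it

tf = isLocked(VAD)            % returns true while the object is in use
VAD2 = clone(VAD);            % duplicate object with the same property values
reset(VAD)                    % reset internal states
release(VAD)                  % allow property and input-size changes again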

Examples


Use the default voiceActivityDetector System object™ to detect the presence of speech in a streaming audio signal.

Create an audio file reader to stream an audio file for processing. Define parameters to chunk the audio signal into 10 ms non-overlapping frames.

fileReader = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
fs = fileReader.SampleRate;
fileReader.SamplesPerFrame = ceil(10e-3*fs);

Create a default voiceActivityDetector System object to detect the presence of speech in the audio file.

VAD = voiceActivityDetector;

Create a scope to plot the audio signal and corresponding probability of speech presence as detected by the voice activity detector. Create an audio device writer to play the audio through your sound card.

scope = timescope( ...
    'NumInputPorts',2, ...
    'SampleRate',fs, ...
    'TimeSpanSource','Property','TimeSpan',3, ...
    'BufferLength',3*fs, ...
    'YLimits',[-1.5 1.5], ...
    'TimeSpanOverrunAction','Scroll', ...
    'ShowLegend',true, ...
    'ChannelNames',{'Audio','Probability of speech presence'});
deviceWriter = audioDeviceWriter('SampleRate',fs);

In an audio stream loop:

  1. Read from the audio file.

  2. Calculate the probability of speech presence.

  3. Visualize the audio signal and speech presence probability.

  4. Play the audio signal through your sound card.

while ~isDone(fileReader)
    audioIn = fileReader();
    probability = VAD(audioIn);
    scope(audioIn,probability*ones(fileReader.SamplesPerFrame,1))
    deviceWriter(audioIn);
end

Use a voice activity detector to detect the presence of speech in an audio signal. Plot the probability of speech presence along with the audio samples.

Create a dsp.AudioFileReader System object™ to read a speech file.

afr = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
fs = afr.SampleRate;

Chunk the audio into 20 ms frames with 75% overlap between successive frames. Convert the frame time in seconds to samples. Determine the hop size (the increment of new samples per frame). In the audio file reader, set the samples per frame to the hop size. Create a dsp.AsyncBuffer object to manage the overlap between audio frames.

frameSize = ceil(20e-3*fs);
overlapSize = ceil(0.75*frameSize);
hopSize = frameSize - overlapSize;
afr.SamplesPerFrame = hopSize;

inputBuffer = dsp.AsyncBuffer('Capacity',frameSize);

Create a voiceActivityDetector System object. Specify an FFT length of 1024.

VAD = voiceActivityDetector('FFTLength',1024);

Create a scope to plot the audio signal and corresponding probability of speech presence as detected by the voice activity detector. Create an audioDeviceWriter System object to play audio through your sound card.

scope = timescope('NumInputPorts',2, ...
    'SampleRate',fs, ...
    'TimeSpanSource','Property','TimeSpan',3, ...
    'BufferLength',3*fs, ...
    'YLimits',[-1.5,1.5], ...
    'TimeSpanOverrunAction','Scroll', ...
    'ShowLegend',true, ...
    'ChannelNames',{'Audio','Probability of speech presence'});

player = audioDeviceWriter('SampleRate',fs);

Initialize a vector to hold the probability values.

pHold = ones(hopSize,1);

In an audio stream loop:

  1. Read a hop worth of samples from the audio file and save the samples into the buffer.

  2. Read a frame from the buffer with specified overlap from the previous frame.

  3. Call the voice activity detector to get the probability of speech for the frame under analysis.

  4. Set the last element of the probability vector to the new probability decision. Visualize the audio and speech presence probability using the time scope.

  5. Play the audio through your sound card.

  6. Set the probability vector to the most recent result for plotting in the next loop.

while ~isDone(afr)
    x = afr();
    n = write(inputBuffer,x);

    overlappedInput = read(inputBuffer,frameSize,overlapSize);

    p = VAD(overlappedInput);

    pHold(end) = p;
    scope(x,pHold)

    player(x);

    pHold(:) = p;
end

Release the player once the audio finishes playing.

release(player)

Create a dsp.AudioFileReader object to read in audio frame-by-frame.

fileReader = dsp.AudioFileReader("SingingAMajor-16-mono-18secs.ogg");

Create a voiceActivityDetector object to detect the presence of voice in streaming audio.

VAD = voiceActivityDetector;

While there are unread samples, read from the file and determine the probability that the frame contains voice activity. If the frame contains voice activity, call pitch to estimate the fundamental frequency of the audio frame. If the frame does not contain voice activity, declare the fundamental frequency as NaN.

f0 = [];
while ~isDone(fileReader)
    x = fileReader();
    
    if VAD(x) > 0.99
        decision = pitch(x,fileReader.SampleRate, ...
            WindowLength=size(x,1), ...
            OverlapLength=0, ...
            Range=[200,340]);
    else
        decision = NaN;
    end
    f0 = [f0;decision];
end

Plot the detected pitch contour over time.

t = linspace(0,(length(f0)*fileReader.SamplesPerFrame)/fileReader.SampleRate,length(f0));
plot(t,f0)
ylabel("Fundamental Frequency (Hz)")
xlabel("Time (s)")
grid on

Algorithms

The voiceActivityDetector implements the algorithm described in [1].

If InputDomain is specified as 'Time', the input signal is windowed and then converted to the frequency domain according to the Window, SidelobeAttenuation, and FFTLength properties. If InputDomain is specified as 'Frequency', the input is assumed to be a windowed discrete-time Fourier transform (DTFT) of an audio signal. The signal is then converted to the power domain. The noise variance is estimated according to [2]. The posterior and prior SNR are estimated according to the minimum mean-square error (MMSE) formula described in [3]. A log-likelihood ratio test and a hidden Markov model (HMM)-based hang-over scheme determine the probability that the current frame contains speech, according to [1].
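
As a rough illustration of the two input paths described above, the following sketch processes the same frame through a time-domain object and, separately, feeds a Hann-windowed DTFT of that frame to a frequency-domain object. The 'periodic' Hann window is an assumption made for illustration; the internal windowing details of the time-domain path may differ slightly, so the two probabilities are comparable rather than identical:

frame = 0.1*randn(441,1);

vadTime = voiceActivityDetector;               % windows and transforms internally
pTime = vadTime(frame);

vadFreq = voiceActivityDetector('InputDomain','Frequency');
X = fft(frame.*hann(441,'periodic'));          % windowed DTFT supplied by the caller
pFreq = vadFreq(X);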

References

[1] Sohn, Jongseo, Nam Soo Kim, and Wonyong Sung. "A Statistical Model-Based Voice Activity Detection." IEEE Signal Processing Letters. Vol. 6, No. 1, 1999, pp. 1–3.

[2] Martin, R. "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics." IEEE Transactions on Speech and Audio Processing. Vol. 9, No. 5, 2001, pp. 504–512.

[3] Ephraim, Y., and D. Malah. "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing. Vol. 32, No. 6, 1984, pp. 1109–1121.


Version History

Introduced in R2018a