voiceActivityDetector

Detect presence of speech in audio signal

Description

The voiceActivityDetector System object™ detects the presence of speech in an audio segment. You can also use the voiceActivityDetector System object to output an estimate of the noise variance per frequency bin.

To detect the presence of speech:

  1. Create the voiceActivityDetector object and set its properties.

  2. Call the object with arguments, as if it were a function.

To learn more about how System objects work, see What Are System Objects?
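The two steps above can be sketched as follows. The frame of white noise is a placeholder input; in practice you would pass frames of a real audio signal.

```matlab
% Step 1: create the object (default properties: time-domain input).
VAD = voiceActivityDetector;

% Step 2: call the object with an audio frame, as if it were a function.
audioFrame = 0.01*randn(441,1);   % placeholder: 10 ms of noise at 44.1 kHz
probability = VAD(audioFrame);    % probability that the frame contains speech
```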

Creation

Description

VAD = voiceActivityDetector creates a System object, VAD, that detects the presence of speech independently across each input channel.

VAD = voiceActivityDetector(Name,Value) sets each property Name to the specified Value. Unspecified properties have default values.

Example: VAD = voiceActivityDetector('InputDomain','Frequency') creates a System object, VAD, that accepts frequency-domain input.

Properties


Unless otherwise indicated, properties are nontunable, which means you cannot change their values after calling the object. Objects lock when you call them, and the release function unlocks them.

If a property is tunable, you can change its value at any time.

For more information on changing property values, see System Design in MATLAB Using System Objects.

InputDomain

Domain of the input signal, specified as 'Time' or 'Frequency'. The default is 'Time'.

Tunable: No

Data Types: char | string

FFTLength

FFT length, specified as a positive scalar. The default is [], which means that the FFT length is equal to the number of rows of the input.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time'.

Data Types: single | double

Window

Time-domain window function applied before calculating the discrete-time Fourier transform (DTFT), specified as 'Hann', 'Rectangular', 'Flat Top', 'Hamming', 'Chebyshev', or 'Kaiser'.

The window function is designed using the algorithms of the hann, rectwin, flattopwin, hamming, chebwin, and kaiser functions.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time'.

Data Types: char | string

SidelobeAttenuation

Sidelobe attenuation of the window in dB, specified as a real positive scalar.

Tunable: No

Dependencies

To enable this property, set InputDomain to 'Time' and Window to 'Chebyshev' or 'Kaiser'.

Data Types: single | double

SilenceToSpeechProbability

Probability of transition from a frame of silence to a frame of speech, specified as a scalar in the range [0,1].

Tunable: Yes

Data Types: single | double

SpeechToSilenceProbability

Probability of transition from a frame of speech to a frame of silence, specified as a scalar in the range [0,1].

Tunable: Yes

Data Types: single | double
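Because the two transition probabilities are tunable, you can change them while the object is locked, without calling release. A brief sketch (the property values and noise input are illustrative choices):

```matlab
VAD = voiceActivityDetector;
frame = 0.01*randn(441,1);               % placeholder audio frame
p1 = VAD(frame);                         % first call locks the object

% Tunable properties can change while the object is locked.
VAD.SilenceToSpeechProbability = 0.4;
VAD.SpeechToSilenceProbability = 0.05;
p2 = VAD(frame);

% Nontunable properties, such as InputDomain, require release first.
release(VAD)
VAD.InputDomain = 'Frequency';
```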

Usage

Description

[probability,noiseEstimate] = VAD(audioIn) applies the voice activity detector to the input, audioIn, and returns the probability that speech is present. It also returns the estimated noise variance per frequency bin.

Input Arguments


Audio input to the voice activity detector, specified as a scalar, vector, or matrix. If audioIn is a matrix, the columns are treated as independent audio channels.

The size of the audio input is locked after the first call to the voiceActivityDetector object. To change the size of audioIn, call release on the object.

If InputDomain is set to 'Time', audioIn must be real-valued. If InputDomain is set to 'Frequency', audioIn can be real-valued or complex-valued.

Data Types: single | double
Complex Number Support: Yes
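When InputDomain is 'Frequency', the input is expected to be a windowed DTFT of an audio frame. A minimal sketch, where the frame content, window, and FFT length are illustrative choices:

```matlab
VAD = voiceActivityDetector('InputDomain','Frequency');

frame = 0.01*randn(441,1);                 % placeholder time-domain frame
win = hann(numel(frame),'periodic');       % analysis window
X = fft(frame.*win,1024);                  % windowed DTFT, complex-valued

[probability,noiseEstimate] = VAD(X);      % noiseEstimate has 1024 rows
```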

Output Arguments


Probability that speech is present, returned as a scalar or row vector with the same number of columns as audioIn.

Data Types: single | double

Estimate of the noise variance per frequency bin, returned as a column vector or matrix with the same number of columns as audioIn.

Data Types: single | double

Object Functions

To use an object function, specify the System object as the first input argument. For example, to release system resources of a System object named obj, use this syntax:

release(obj)

clone - Create duplicate System object
isLocked - Determine if System object is in use
release - Release resources and allow changes to System object property values and input characteristics
reset - Reset internal states of System object
step - Run System object algorithm

Examples


Use the default voiceActivityDetector System object™ to detect the presence of speech in a streaming audio signal.

Create an audio file reader to stream an audio file for processing. Define parameters to chunk the audio signal into 10 ms non-overlapping frames.

fileReader = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
fs = fileReader.SampleRate;
fileReader.SamplesPerFrame = ceil(10e-3*fs);

Create a default voiceActivityDetector System object to detect the presence of speech in the audio file.

VAD = voiceActivityDetector;

Create a scope to plot the audio signal and corresponding probability of speech presence as detected by the voice activity detector. Create an audio device writer to play the audio through your sound card.

scope = timescope( ...
    'NumInputPorts',2, ...
    'SampleRate',fs, ...
    'TimeSpanSource','Property','TimeSpan',3, ...
    'BufferLength',3*fs, ...
    'YLimits',[-1.5 1.5], ...
    'TimeSpanOverrunAction','Scroll', ...
    'ShowLegend',true, ...
    'ChannelNames',{'Audio','Probability of speech presence'});
deviceWriter = audioDeviceWriter('SampleRate',fs);

In an audio stream loop:

  1. Read from the audio file.

  2. Calculate the probability of speech presence.

  3. Visualize the audio signal and speech presence probability.

  4. Play the audio signal through your sound card.

while ~isDone(fileReader)
    audioIn = fileReader();
    probability = VAD(audioIn);
    scope(audioIn,probability*ones(fileReader.SamplesPerFrame,1))
    deviceWriter(audioIn);
end

Use a voice activity detector to detect the presence of speech in an audio signal. Plot the probability of speech presence along with the audio samples.

Create a dsp.AudioFileReader System object™ to read a speech file.

afr = dsp.AudioFileReader('Counting-16-44p1-mono-15secs.wav');
fs = afr.SampleRate;

Chunk the audio into 20 ms frames with 75% overlap between successive frames. Convert the frame time in seconds to samples. Determine the hop size (the increment of new samples). In the audio file reader, set the samples per frame to the hop size. Create a default dsp.AsyncBuffer object to manage overlapping between audio frames.

frameSize = ceil(20e-3*fs);
overlapSize = ceil(0.75*frameSize);
hopSize = frameSize - overlapSize;
afr.SamplesPerFrame = hopSize;

inputBuffer = dsp.AsyncBuffer('Capacity',frameSize);

Create a voiceActivityDetector System object. Specify an FFT length of 1024.

VAD = voiceActivityDetector('FFTLength',1024);

Create a scope to plot the audio signal and corresponding probability of speech presence as detected by the voice activity detector. Create an audioDeviceWriter System object to play audio through your sound card.

scope = timescope('NumInputPorts',2, ...
    'SampleRate',fs, ...
    'TimeSpanSource','Property','TimeSpan',3, ...
    'BufferLength',3*fs, ...
    'YLimits',[-1.5,1.5], ...
    'TimeSpanOverrunAction','Scroll', ...
    'ShowLegend',true, ...
    'ChannelNames',{'Audio','Probability of speech presence'});

player = audioDeviceWriter('SampleRate',fs);

Initialize a vector to hold the probability values.

pHold = ones(hopSize,1);

In an audio stream loop:

  1. Read a hop worth of samples from the audio file and save the samples into the buffer.

  2. Read a frame from the buffer with specified overlap from the previous frame.

  3. Call the voice activity detector to get the probability of speech for the frame under analysis.

  4. Set the last element of the probability vector to the new probability decision. Visualize the audio and speech presence probability using the time scope.

  5. Play the audio through your sound card.

  6. Set the probability vector to the most recent result for plotting in the next loop.

while ~isDone(afr)
    x = afr();
    n = write(inputBuffer,x);

    overlappedInput = read(inputBuffer,frameSize,overlapSize);

    p = VAD(overlappedInput);

    pHold(end) = p;
    scope(x,pHold)

    player(x);

    pHold(:) = p;
end

Release the player once the audio finishes playing.

release(player)

Create a dsp.AudioFileReader object to read in audio frame-by-frame.

fileReader = dsp.AudioFileReader("SingingAMajor-16-mono-18secs.ogg");

Create a voiceActivityDetector object to detect the presence of voice in streaming audio.

VAD = voiceActivityDetector;

While there are unread samples, read from the file and determine the probability that the frame contains voice activity. If the frame contains voice activity, call pitch to estimate the fundamental frequency of the audio frame. If the frame does not contain voice activity, declare the fundamental frequency as NaN.

f0 = [];
while ~isDone(fileReader)
    x = fileReader();
    
    if VAD(x) > 0.99
        decision = pitch(x,fileReader.SampleRate, ...
            WindowLength=size(x,1), ...
            OverlapLength=0, ...
            Range=[200,340]);
    else
        decision = NaN;
    end
    f0 = [f0;decision];
end

Plot the detected pitch contour over time.

t = linspace(0,(length(f0)*fileReader.SamplesPerFrame)/fileReader.SampleRate,length(f0));
plot(t,f0)
ylabel("Fundamental Frequency (Hz)")
xlabel("Time (s)")
grid on


Algorithms

The voiceActivityDetector implements the algorithm described in [1].

If InputDomain is specified as 'Time', the input signal is windowed and then converted to the frequency domain according to the Window, SidelobeAttenuation, and FFTLength properties. If InputDomain is specified as 'Frequency', the input is assumed to be a windowed discrete-time Fourier transform (DTFT) of an audio signal. The signal is then converted to the power domain. Noise variance is estimated according to [2]. The posterior and prior SNR are estimated according to the Minimum Mean-Square Error (MMSE) formula described in [3]. A log likelihood ratio test and Hidden Markov Model (HMM)-based hang-over scheme determine the probability that the current frame contains speech, according to [1].
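The core likelihood ratio test of [1] can be illustrated with a simplified sketch. This is not the shipping implementation: it assumes the per-bin noise variance lambdaN and a priori SNR xi are already known (in the actual algorithm they are estimated per [2] and [3]), and it omits the HMM hang-over smoothing.

```matlab
% Simplified per-frame log likelihood ratio under the Gaussian model of [1].
frame = 0.01*randn(441,1);           % placeholder audio frame
nfft = 1024;
lambdaN = 1e-4*ones(nfft,1);         % assumed known noise variance per bin
xi = 0.5*ones(nfft,1);               % assumed known a priori SNR per bin

S = fft(frame.*hann(numel(frame),'periodic'),nfft);
gamma = abs(S).^2 ./ lambdaN;        % a posteriori SNR per bin

% Per-bin log likelihood ratio: log Lambda_k = gamma_k*xi_k/(1+xi_k) - log(1+xi_k)
logLR = gamma.*xi./(1+xi) - log(1+xi);

% Decision statistic: geometric mean of the likelihood ratios over bins.
L = mean(logLR);
```

In the full algorithm, this frame-level statistic then drives the HMM-based hang-over scheme, whose state transitions are governed by the SilenceToSpeechProbability and SpeechToSilenceProbability properties, to produce the output probability.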

References

[1] Sohn, Jongseo, Nam Soo Kim, and Wonyong Sung. "A Statistical Model-Based Voice Activity Detection." IEEE Signal Processing Letters. Vol. 6, No. 1, 1999, pp. 1–3.

[2] Martin, R. "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics." IEEE Transactions on Speech and Audio Processing. Vol. 9, No. 5, 2001, pp. 504–512.

[3] Ephraim, Y., and D. Malah. "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator." IEEE Transactions on Acoustics, Speech, and Signal Processing. Vol. 32, No. 6, 1984, pp. 1109–1121.

Version History

Introduced in R2018a