speech2text

Transcribe speech signal to text

Since R2022b

Syntax

transcript = speech2text(audioIn,fs)

transcript = speech2text(audioIn,fs,Client=clientObj)

[transcript,rawOutput] = speech2text(___)

Description

transcript = speech2text(audioIn,fs) transcribes speech in the input audio signal to text using a pretrained wav2vec 2.0 model.

Note

Using wav2vec 2.0 requires Deep Learning Toolbox™ and installing the pretrained model.

transcript = speech2text(audioIn,fs,Client=clientObj) transcribes speech using the specified pretrained deep learning model or third-party speech service.

Note

Using the Emformer pretrained model requires Deep Learning Toolbox and Audio Toolbox™ Interface for SpeechBrain and Torchaudio Libraries. You can download this support package from the Add-On Explorer. For more information, see Get and Manage Add-Ons.

To use third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

[transcript,rawOutput] = speech2text(___) also returns the unprocessed server output from the third-party speech service.

Examples

Download wav2vec 2.0 Network

Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.

Enter speechClient("wav2vec2.0") at the MATLAB command line. If the pretrained wav2vec 2.0 model is not installed, the function provides a download link. To install the model, click the link to download the file, and then unzip it to a location on the MATLAB path.

Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.

downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
wav2vecLocation = fullfile(tempdir,"wav2vec");
unzip(downloadFile,wav2vecLocation)
addpath(wav2vecLocation)

Check that the installation succeeded by entering speechClient("wav2vec2.0") at the command line. If the model is installed, the function returns a Wav2VecSpeechClient object.

speechClient("wav2vec2.0")

ans = 
  Wav2VecSpeechClient with properties:

    Segmentation: 'word'
      TimeStamps: 0

Perform Speech-to-Text Transcription

Read in an audio file containing speech and listen to it.

[y,fs] = audioread("speech_dft.wav");
sound(y,fs)

Use speech2text to transcribe the audio signal using the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.

transcript = speech2text(y,fs)

transcript = 
"the discreet forier transform of a real valued signal is conjugate symmetric"

Use Emformer for Streaming Speech-to-Text

Create a speechClient object that uses the Emformer pretrained model.

emformerSpeechClient = speechClient("emformer");

Create a dsp.AudioFileReader object to read in an audio file. In a streaming loop, read in frames of the audio file and transcribe the speech using speech2text with the Emformer speechClient. The Emformer speechClient object maintains an internal state to perform the streaming speech-to-text transcription.

afr = dsp.AudioFileReader("Counting-16-44p1-mono-15secs.wav");
txtTotal = "";
while ~isDone(afr)
    x = afr();
    txt = speech2text(x,afr.SampleRate,Client=emformerSpeechClient);
    txtTotal = txtTotal + txt;
end

txtTotal

txtTotal = 
"one two three four five six seven eight nine"
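If the audio is already in memory rather than in a file, the same frame-by-frame pattern can be sketched without dsp.AudioFileReader. This is a minimal illustration, not the documented workflow. The signal, the sample rate, and the 1024-sample frame length are assumptions, and `transcribeFrame` is a hypothetical stub that stands in for the speech2text call so the loop runs without the Emformer support package. Whether the Emformer client accepts a shorter final frame is not confirmed here.

```matlab
fs = 16000;                    % assumed sample rate for this sketch
x = zeros(3.5*fs,1);           % placeholder signal: 3.5 s of silence
frameLength = 1024;            % assumed frame size in samples

% Hypothetical stub. With the Emformer client installed, this would be:
%   transcribeFrame = @(frame) speech2text(frame,fs,Client=emformerSpeechClient);
transcribeFrame = @(frame) "";

txtTotal = "";
numFrames = 0;
for start = 1:frameLength:numel(x)
    stop = min(start+frameLength-1,numel(x));
    frame = x(start:stop);     % the final frame can be shorter than frameLength
    txtTotal = txtTotal + transcribeFrame(frame);
    numFrames = numFrames + 1;
end
```

Because the client object keeps its own state between calls, the loop passes only the current frame on each iteration and appends whatever text comes back.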

Input Arguments

`audioIn` — Audio input
column vector

Audio input signal, specified as a column vector (single channel).

Data Types: single | double
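Because `audioIn` must be a single-channel column vector, multichannel audio needs to be reduced to one channel before the call. The following sketch shows one way to do this, by averaging the channels. The synthetic stereo signal and the variable names are illustrative assumptions, and the speech2text call is left commented out because it requires the pretrained model.

```matlab
fs = 16000;                                   % assumed sample rate
t = (0:fs-1).'/fs;                            % one second of time stamps
stereo = [sin(2*pi*220*t),sin(2*pi*440*t)];   % samples-by-channels placeholder audio

mono = mean(stereo,2);                        % average across channels
audioIn = mono(:);                            % force a column vector

% transcript = speech2text(audioIn,fs);       % requires the pretrained wav2vec 2.0 model
```

Averaging is only one option. Selecting a single channel, for example `stereo(:,1)`, also yields a valid input.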

`fs` — Sample rate (Hz)
positive scalar

Sample rate in Hz, specified as a positive scalar.

Data Types: single | double

`clientObj` — Client object
`speechClient("wav2vec2.0")` (default) | `speechClient` object

Client object, specified as an object returned by speechClient. The object is an interface to a pretrained model or to a third-party speech service. By default, speech2text uses a wav2vec 2.0 client object.

Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.

Using the Emformer model requires Deep Learning Toolbox and Audio Toolbox Interface for SpeechBrain and Torchaudio Libraries. If this support package is not installed, calling speechClient with "emformer" provides a link to the Add-On Explorer, where you can download and install the support package.

To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.

Example: speechClient("wav2vec2.0")

Output Arguments

`transcript` — Speech transcript
table | string

Speech transcript of the input audio signal, returned as a table with a column containing the transcript and another column containing the associated confidence metrics. If the Segmentation property of clientObj is "none", speech2text returns the transcript as a string.

The returned table can have additional columns depending on the speechClient properties and server options.

Data Types: table | string
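Because `transcript` can be either a table or a string, downstream code may need to handle both forms. The sketch below collapses either form into a single string. The `result` table is a hypothetical stand-in for a speech2text output, and its column names are assumptions. The code reads the transcript column by position to avoid depending on those names.

```matlab
% Hypothetical stand-in for a speech2text result (column names are assumed).
result = table(["hello";"world"],[0.9;0.8], ...
    VariableNames=["Transcript","Confidence"]);

if istable(result)
    words = string(result{:,1});    % assumes the first column holds the transcript text
    fullText = join(words," ");     % join the segments into one string
else
    fullText = string(result);      % already a plain transcript
end
```

If the table has additional columns, such as time stamps, reading the first column by position still works as long as the transcript column comes first, which this sketch assumes.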

`rawOutput` — Unprocessed server output
`ResponseMessage` | structure

Unprocessed server output, returned as a matlab.net.http.ResponseMessage object containing the HTTP response from the third-party speech service. If the third-party speech service is Amazon®, speech2text returns the server output as a structure.

This output argument does not apply if clientObj interfaces with a pretrained model.

References

[1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.

Version History

Introduced in R2022b
