speech2text into sentences

Felix on 7 Nov 2022
Answered: Walter Roberson on 7 Nov 2022
Hello!
I'm trying to use the 'wav2vec2.0' model provided by MATLAB for voice transcription.
However, the output of the transcript is a list of individual words instead of sentences.
I am aware that the Google speech API returns separate sentences, but that is not a free solution and cannot run offline.
Is it possible to return sentences instead of individual words in speech2text using MATLAB native functions?
Thank you in advance.

Accepted Answer

Walter Roberson on 7 Nov 2022
No, this is something that requires sentiment analysis at the very least, possibly more complicated than that.
You can get clues from timing, but it is a mistake to treat a pause as the end of a sentence. People pause mid-sentence to consider their words, or to listen to something else going on, or pay attention to an event briefly, or for dramatic effect, or for emphasis.
Did you ever watch the old Saturday Night Live episodes with Dana Carvey's "Church Lady", where Church Lady talks about something and then asks, "Why is this happening? Could it be... Satan?!" There is a distinct pause before "Satan", but it would be understood to be part of the same sentence.
But suppose someone is having a bad day and asks, "Could someone tell me why this is happening to me? Why XXX oh why?", where XXX represents a pause of varying length. The duration of the pause could determine whether the average person would understand it as "Why? Oh, why?" (two sentences) or "Why, oh why?" (one sentence), but regardless of the duration, either version might be the correct one. If the person spends the time between the "Why" and what follows crying, a listener is more likely to decide that the sentence was not finished yet, and so to infer the comma version, even if the break lasts several minutes. You cannot use fixed pause-length thresholds to punctuate properly.
"Let me think... I think I will party!" could be one sentence or two, depending on the intent of the speaker, and the speaker probably does not have any distinct intent about the breakdown, so neither version is definitely right or wrong. In written form, capitalization can mark the difference, as in "Let me think... an apple?" versus "Let me think... An apple?", but since "I" is capitalized either way, the party version does not clearly distinguish one sentence from two even in writing, and the spoken form is likely to be ambiguous.
So you cannot determine sentence boundaries by formants or (reliably) by pauses, which means that you need AI-type techniques to do it properly, and those would likely be too large to deploy offline.
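To make the pause problem concrete, here is a minimal sketch in Python of the naive approach: splitting a word list wherever the silence before a word exceeds a fixed threshold. The per-word timings and the function name `split_on_pauses` are made up for illustration; they are not the output format of speech2text or of the wav2vec2.0 model.

```python
def split_on_pauses(words, threshold):
    """Group (text, start, end) tuples into 'sentences', starting a new one
    whenever the silent gap before a word exceeds `threshold` seconds."""
    sentences = []
    current = []
    prev_end = None
    for text, start, end in words:
        # A long enough gap since the previous word closes the current group.
        if prev_end is not None and start - prev_end > threshold:
            sentences.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = end
    if current:
        sentences.append(" ".join(current))
    return sentences


# Hypothetical timings (seconds) for "why ... oh why",
# with a 1.2 s pause after the first word.
words = [("why", 0.0, 0.4), ("oh", 1.6, 1.8), ("why", 1.9, 2.3)]

print(split_on_pauses(words, threshold=0.5))  # ['why', 'oh why']
print(split_on_pauses(words, threshold=1.5))  # ['why oh why']
```

The same audio yields two sentences or one depending only on the threshold chosen, and neither choice says anything about what the speaker meant. That is why timing can serve as a clue but not as the rule.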

Release

R2022b
