Improve speech recognition and remove postprocessing #837

Open
2 of 14 tasks
josancamon19 opened this issue Sep 14, 2024 · 3 comments
josancamon19 commented Sep 14, 2024

Refactoring STT system

https://artificialanalysis.ai/speech-to-text

Points to https://www.speechmatics.com/ as the winner in WER %


Deepgram's WER is roughly 40% worse, which is forcing us to do postprocessing with whisper-x.

I also tried AssemblyAI; unfortunately its streaming only works for English, so it's discarded.

Speechmatics is only marginally better than AssemblyAI, but it works with all languages and has interesting, future-proof features.

NOTE: I will build the exact same pipeline with Soniox first, since we already have 10k in credits, but I'm not sure I trust their accuracy numbers, as the WER comparison was done by Soniox themselves. They also ran it before the latest models were released.

Still, the reason for testing Soniox first is that a good portion of the pipeline is already integrated, so it shouldn't take long.


  • Set up the Speechmatics websocket concurrently with the existing Deepgram websocket (a rough sketch follows this list).
  • Add a settings dropdown in the app to select the transcription model (only while testing).
  • Test both options in 10 scenarios: (Deepgram + postprocessing) vs. (Speechmatics + postprocessing).
  • Write a script to view a line-by-line comparison between each of them (see the WER sketch after this list).
    • Prompt GPT to compare the 3 transcripts in each scenario and judge which has better accuracy.
    • (Maybe) Use Groq Whisper v3 as the source of truth and compute WER against it.
  • If tests show Speechmatics is within 5-10% of the whisper-x results, skip and remove postprocessing.
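
For the first task, here is a minimal asyncio sketch of fanning the same audio out to two streaming sockets concurrently. The URLs, auth, and message framing are placeholders, not the actual Deepgram or Speechmatics streaming protocols, which should come from their docs/SDKs.

```python
# Sketch only: endpoints, auth, and framing are placeholders, not the real
# Deepgram/Speechmatics streaming protocols.
import asyncio
import websockets

DEEPGRAM_URL = "wss://deepgram.example/listen"        # placeholder
SPEECHMATICS_URL = "wss://speechmatics.example/v2"    # placeholder


async def stream_to(name: str, url: str, chunks: list[bytes]) -> list[str]:
    """Send the same audio chunks to one provider and collect raw responses."""
    results: list[str] = []
    async with websockets.connect(url) as ws:         # auth omitted; provider-specific

        async def send_audio() -> None:
            for chunk in chunks:
                await ws.send(chunk)                   # opus/pcm bytes
                await asyncio.sleep(0.02)              # pace roughly like real time
            await ws.close()

        async def read_transcripts() -> None:
            async for message in ws:                   # partial/final transcript events
                results.append(f"{name}: {message}")

        await asyncio.gather(send_audio(), read_transcripts())
    return results


async def compare(chunks: list[bytes]) -> None:
    # Fan the same audio out to both providers so the transcripts can be compared.
    deepgram, speechmatics = await asyncio.gather(
        stream_to("deepgram", DEEPGRAM_URL, chunks),
        stream_to("speechmatics", SPEECHMATICS_URL, chunks),
    )
    print(deepgram)
    print(speechmatics)
```

And for the comparison script, a minimal scoring sketch that treats a Groq Whisper v3 transcript as the source of truth and computes WER with the `jiwer` package; the per-scenario file names are hypothetical.

```python
import jiwer


def load(path: str) -> str:
    # Normalize lightly so casing doesn't inflate the error rate.
    with open(path) as f:
        return f.read().lower().strip()


# Hypothetical per-scenario transcript files.
reference = load("scenario_01/groq_whisper_v3.txt")        # assumed source of truth
candidates = {
    "deepgram + whisper-x": load("scenario_01/deepgram_post.txt"),
    "speechmatics": load("scenario_01/speechmatics.txt"),
}

for name, hypothesis in candidates.items():
    error_rate = jiwer.wer(reference, hypothesis)
    print(f"{name}: WER = {error_rate:.2%}")
```

If Speechmatics stays within the 5-10% band against the whisper-x output across the 10 scenarios, that is the signal to drop the postprocessing step.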

Important:

  • Need to double-check scalability.

  • Need to ask for free credits; it's 4x more expensive than Deepgram.

  • Speechmatics will only be supported for Opus; for 1.0.2 we will continue using Deepgram.

Add-ons:

  • A VAD implementation will be needed; finish that ticket, especially for Opus (a sketch follows this list).
  • Push more users to migrate: start a "campaign" to help users move from 1.0.2 to 1.0.4 in < 30 days so we can deprecate pcm8.
    • Understand the data (how many are still on pcm8?)
  • Improve speech recognition: make sure the file is being sent correctly (use the raw audio .wav instead of the saved Opus-encoded bytes), and double-check the duration at which it performs well 90% of the time.
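
For the VAD add-on, a minimal sketch assuming Silero VAD via its public torch.hub entry point; the input file name is hypothetical and the Opus stream is assumed to already be decoded to 16 kHz WAV (decoder not shown).

```python
import torch

# Load Silero VAD from the public snakers4/silero-vad hub entry point.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, *_) = utils

# Hypothetical input: the Opus stream already decoded to a 16 kHz WAV file.
wav = read_audio("scenario_01/raw_audio.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

# Forward only the segments that actually contain speech to the STT socket.
for segment in speech:
    chunk = wav[segment["start"]:segment["end"]]
    # send `chunk` (or its re-encoded bytes) to the transcription websocket here
```

Dropping silent stretches before they reach the provider should cut cost and reduce spurious words on near-silent audio.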
josancamon19 self-assigned this Sep 14, 2024
josancamon19 commented:

How the WER tests were made by artificialanalysis:

[screenshot of the artificialanalysis.ai WER test methodology]

kodjima33 commented:

@josancamon19 can you please specify which languages are required to complete the task? This will help me understand more quickly whom to ask to do it.
