Improve speech recognition and remove postprocessing #837

Open
2 of 14 tasks
josancamon19 opened this issue Sep 14, 2024 · 3 comments
josancamon19 commented Sep 14, 2024

Refactoring STT system

https://artificialanalysis.ai/speech-to-text

Points to https://www.speechmatics.com/ as the winner in WER %


Deepgram's WER is roughly 40% worse, which is forcing us to do postprocessing with whisper-x.

I also tried AssemblyAI; unfortunately its streaming only works for English, so it's discarded.

Speechmatics is only marginally better than AssemblyAI, but it works with all languages and has interesting, future-proof features.

NOTE: I will build the exact same pipeline with Soniox first, since we already have 10k in credits, but I'm not sure I trust their accuracy numbers, as the WER comparison was done by Soniox themselves. They also ran it before the latest models were released.

Still, the reason for testing Soniox first is that a good portion of the pipeline is already integrated, so it shouldn't take long.


  • Set up the Speechmatics websocket concurrently with the existing Deepgram websocket (a rough sketch follows this list).
  • Add a settings dropdown in the app to select the transcription model (only while testing).
  • Test both options in 10 scenarios: (Deepgram + postprocessing) vs. (Speechmatics + postprocessing).
  • Write a script to view a line-by-line comparison between each of them (see the WER sketch after this list).
    • Prompt GPT to compare the 3 transcripts in each scenario and judge which has better accuracy.
    • (Maybe) Use Groq Whisper v3 as the source of truth and compute WER against it.
  • If tests show Speechmatics is within 5-10% of the whisper-x results, skip and remove postprocessing.
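
For the first task, here is a minimal asyncio sketch of fanning the same audio out to two streaming sockets concurrently. The URLs, auth, and message framing are placeholders, not the actual Deepgram or Speechmatics streaming protocols, which should come from their docs/SDKs.

```python
# Sketch only: endpoints, auth, and framing are placeholders, not the real
# Deepgram/Speechmatics streaming protocols.
import asyncio
import websockets

DEEPGRAM_URL = "wss://deepgram.example/listen"        # placeholder
SPEECHMATICS_URL = "wss://speechmatics.example/v2"    # placeholder


async def stream_to(name: str, url: str, chunks: list[bytes]) -> list[str]:
    """Send the same audio chunks to one provider and collect raw responses."""
    results: list[str] = []
    async with websockets.connect(url) as ws:         # auth omitted; provider-specific

        async def send_audio() -> None:
            for chunk in chunks:
                await ws.send(chunk)                   # opus/pcm bytes
                await asyncio.sleep(0.02)              # pace roughly like real time
            await ws.close()

        async def read_transcripts() -> None:
            async for message in ws:                   # partial/final transcript events
                results.append(f"{name}: {message}")

        await asyncio.gather(send_audio(), read_transcripts())
    return results


async def compare(chunks: list[bytes]) -> None:
    # Fan the same audio out to both providers so the transcripts can be compared.
    deepgram, speechmatics = await asyncio.gather(
        stream_to("deepgram", DEEPGRAM_URL, chunks),
        stream_to("speechmatics", SPEECHMATICS_URL, chunks),
    )
    print(deepgram)
    print(speechmatics)
```

And for the comparison script, a minimal scoring sketch that treats a Groq Whisper v3 transcript as the source of truth and computes WER with the `jiwer` package; the per-scenario file names are hypothetical.

```python
import jiwer


def load(path: str) -> str:
    # Normalize lightly so casing doesn't inflate the error rate.
    with open(path) as f:
        return f.read().lower().strip()


# Hypothetical per-scenario transcript files.
reference = load("scenario_01/groq_whisper_v3.txt")        # assumed source of truth
candidates = {
    "deepgram + whisper-x": load("scenario_01/deepgram_post.txt"),
    "speechmatics": load("scenario_01/speechmatics.txt"),
}

for name, hypothesis in candidates.items():
    error_rate = jiwer.wer(reference, hypothesis)
    print(f"{name}: WER = {error_rate:.2%}")
```

If Speechmatics stays within the 5-10% band against the whisper-x output across the 10 scenarios, that is the signal to drop the postprocessing step.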

Important:

  • Need to double-check scalability.

  • Need to ask for free credits; it's 4x more expensive than Deepgram.

  • Speechmatics will only be supported for Opus; for 1.0.2 we will continue using Deepgram.

Add-ons:

  • A VAD implementation will be needed; finish that ticket, especially for Opus (a sketch follows this list).
  • Push more users to migrate: start a "campaign" to help users move from 1.0.2 to 1.0.4 in < 30 days so we can deprecate pcm8.
    • Understand the data (how many are still on pcm8?)
  • Improve speech recognition: make sure the file is being sent correctly (use the raw audio .wav instead of the saved Opus-encoded bytes), and double-check the duration at which it performs well 90% of the time.
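
For the VAD add-on, a minimal sketch assuming Silero VAD via its public torch.hub entry point; the input file name is hypothetical and the Opus stream is assumed to already be decoded to 16 kHz WAV (decoder not shown).

```python
import torch

# Load Silero VAD from the public snakers4/silero-vad hub entry point.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, *_) = utils

# Hypothetical input: the Opus stream already decoded to a 16 kHz WAV file.
wav = read_audio("scenario_01/raw_audio.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

# Forward only the segments that actually contain speech to the STT socket.
for segment in speech:
    chunk = wav[segment["start"]:segment["end"]]
    # send `chunk` (or its re-encoded bytes) to the transcription websocket here
```

Dropping silent stretches before they reach the provider should cut cost and reduce spurious words on near-silent audio.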
josancamon19 self-assigned this Sep 14, 2024
josancamon19 commented:

How the WER tests were made by artificialanalysis:

[screenshot of the artificialanalysis.ai WER test methodology]

kodjima33 commented:

@josancamon19 can you please specify which languages are required to complete the task? This will help me understand more quickly whom to ask to do it.
