Speech

Automatic speech recognition has developed rapidly in recent years, with many new model architectures based on deep neural networks, such as Wav2letter, Quartznet, ESPNet or Deepspeech. Of these, Deepspeech, originally developed by Mozilla, seems to have by far the most mature software infrastructure and is relatively easy to adapt to a new language. After last year's budget cuts at Mozilla, the Deepspeech project has all but stagnated, but it has been picked up by Coqui, an organization formed by some of the original Deepspeech team.

Traditional HMM-based ASR systems like CMU Sphinx are only suitable for limited-vocabulary speech recognition. The exception is Kaldi (HMM+DNN), which can successfully compete with newer systems; however, as it is primarily an academic system, its complexity is immense. It is challenging to create a useful model even by adapting existing ones, and building a model with all the caveats and possible enhancements is out of reach for anyone outside academic circles who does not specialize in the technologies involved.

Deepspeech/Coqui is therefore a good fit for a general, reasonably performant system with many example applications.
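To illustrate, here is a minimal sketch of transcribing a WAV file with the Deepspeech Python package (Coqui's stt package exposes a nearly identical API). The model, scorer and audio file names are placeholders, and the audio is assumed to be 16-bit mono PCM at the model's sample rate, as the pretrained models expect.

```python
import wave

import numpy as np
import deepspeech

# Placeholder file names: substitute the acoustic model (.pbmm) and
# external scorer (.scorer) you actually downloaded.
model = deepspeech.Model("deepspeech-model.pbmm")
model.enableExternalScorer("deepspeech-model.scorer")

# Deepspeech expects 16-bit mono PCM at the model's sample rate
# (16 kHz for the pretrained English models).
with wave.open("recording.wav", "rb") as wav:
    assert wav.getframerate() == model.sampleRate()
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```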

[Figure: The Deepspeech model architecture (source: the Mozilla Deepspeech page)]

My early Czech model (trained on rather little data, and the language model can still be improved) is available for download at https://github.com/comodoro/deepspeech-cs.
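Such a model can also be used for streaming recognition rather than whole-file transcription. The sketch below uses the standard Deepspeech streaming API; the file names are hypothetical, so check the actual release assets in the repository.

```python
import numpy as np
import deepspeech

# Hypothetical file names; use the actual files shipped with the
# release at https://github.com/comodoro/deepspeech-cs.
model = deepspeech.Model("czech-model.pbmm")
model.enableExternalScorer("czech-model.scorer")

stream = model.createStream()

def feed_chunk(chunk: np.ndarray) -> str:
    """Feed one chunk of 16-bit mono PCM and return the partial transcript."""
    stream.feedAudioContent(chunk)
    return stream.intermediateDecode()

# ... feed chunks from a microphone or file here ...

print(stream.finishStream())  # final transcript once the audio ends
```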
