openai/whisper-large-v3
Convert speech in audio to text

Whisper is a general-purpose speech recognition model. Trained on a large dataset of diverse audio, it is a multitask model that can perform multilingual speech recognition, speech translation, and language identification.
This version runs only the most recent Whisper model, large-v3, and is optimized for performance and simplicity.
Model Versions
| Model Size | Version |
|---|---|
| large-v3 | link |
| large-v2 | link |
| all others | link |
While this implementation only uses the large-v3 model, we maintain links to previous versions for reference.
Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
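
To make the "sequence of tokens" idea concrete, here is a minimal sketch of how a task can be specified to the decoder as a short prefix of special tokens. The token strings follow Whisper's published multitask format, but the helper `build_decoder_prompt` is a hypothetical illustration, not part of any Whisper library, and it works on token strings rather than the integer IDs a real tokenizer would produce.

```python
from typing import List, Optional


def build_decoder_prompt(
    language: Optional[str],
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Hypothetical helper: build the special-token prefix that selects a task.

    This is an illustrative simplification; a real pipeline would map these
    strings to token IDs with the model's tokenizer.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    # Every decoding pass starts with the start-of-transcript token.
    tokens = ["<|startoftranscript|>"]

    # With no language given, the prompt stops here: the model's next
    # predicted token is the language tag, which is language identification.
    if language is None:
        return tokens

    tokens.append(f"<|{language}|>")  # language tag, e.g. <|en|>
    tokens.append(f"<|{task}|>")      # <|transcribe|> or <|translate|>

    # Without timestamps, the decoder is told to emit text tokens only.
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


if __name__ == "__main__":
    print(build_decoder_prompt("en"))
    print(build_decoder_prompt("fr", task="translate", timestamps=True))
    print(build_decoder_prompt(None))
```

Because the task, the language, and the timestamp mode are all expressed as tokens in the same sequence, switching from transcription to translation changes only the prompt, not the model.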
