Just saw Simon test Microsoft's new open-source VibeVoice-ASR on a Mac, and the model is impressive.

It has 9 billion parameters, processes 60 minutes of continuous audio in a single pass, and outputs who is speaking, when they speak, and what they say.
Traditional pipelines combine Whisper with pyannote for diarization; now one model handles everything, supporting 50+ languages and Chinese-English code-switching.
He used the 4-bit quantized version (a 5.71GB file) on an M5 Max to transcribe a 1-hour podcast: 8 minutes 45 seconds, with peak memory of 61.5GB, well beyond what a typical 32GB laptop can handle.
Interestingly, the model identified a two-person conversation as three speakers, because Lenny's voice was captured in different recording environments.
Running it locally takes at least 64GB of RAM; for podcast transcription and meeting notes, what used to be a multi-step pipeline is now compressed into a single inference.
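A quick sanity check on those numbers (a sketch using only the figures from the post; the note about higher-precision layers is a general assumption about quantized checkpoints, not something the post states):

```python
# Back-of-the-envelope check on the reported figures.

params = 9e9          # 9 billion parameters
bits_per_weight = 4   # 4-bit quantization

# Raw weight size for a pure 4-bit model, in decimal gigabytes.
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"4-bit weights alone: {weight_gb:.2f} GB")  # 4.50 GB
# The reported 5.71GB file is larger, which is typical: some layers
# (e.g. embeddings) usually stay in higher precision, plus metadata.

# Throughput: 60 minutes of audio transcribed in 8m45s.
audio_s = 60 * 60
runtime_s = 8 * 60 + 45
print(f"Speed: {audio_s / runtime_s:.1f}x real time")  # ~6.9x
```

So the weights themselves fit comfortably in the 5.71GB file, and the 61.5GB peak memory is dominated by activations and KV cache for a full hour of audio, not the weights.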

What do you think of this model?