Dublin Core
Title
Quantitative Analysis of Voice Recognition Models
Abstract
With the growing adoption of virtual communication and voice-driven applications, the need for accurate, real-time, and privacy-conscious transcription tools has become critical. Existing solutions largely rely on cloud infrastructure, introducing concerns around latency, cost, and data privacy. This project investigates whether modern speech recognition models can perform competitively in fully offline environments while maintaining accuracy and responsiveness.
To this end, we conducted a comparative evaluation of four voice transcription models: Whisper, Faster-Whisper, Wav2Vec2, and Vosk, using the AMI Meeting Corpus. Each model was assessed on four key metrics: Word Error Rate (WER), Character Error Rate (CER), BLEU, and ROUGE-L. Our findings demonstrate that Faster-Whisper outperforms the others in both accuracy and latency, making it a strong candidate for edge deployment.
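To make the evaluation concrete, the following is a minimal sketch of how the four metrics could be computed for a single reference/hypothesis pair. The use of the jiwer, sacrebleu, and rouge-score packages is an assumption for illustration; the original text does not name the tooling used.

# Illustrative metric computation for one reference/hypothesis pair.
# Assumed tooling (not stated in the abstract): pip install jiwer sacrebleu rouge-score
from jiwer import wer, cer
import sacrebleu
from rouge_score import rouge_scorer

reference = "okay so let us move on to the next agenda item"
hypothesis = "okay so lets move on to the next agenda item"

# Word and character error rates (lower is better).
print("WER:", wer(reference, hypothesis))
print("CER:", cer(reference, hypothesis))

# BLEU on a single sentence pair (higher is better).
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
print("BLEU:", bleu.score)

# ROUGE-L F-measure (higher is better).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L:", scorer.score(reference, hypothesis)["rougeL"].fmeasure)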
Building upon this analysis, a lightweight desktop application was developed using Python and PyQt5. The app captures microphone input in real time, applies Voice Activity Detection (VAD) and loudness filtering to reduce noise, and transcribes valid segments using Faster-Whisper. Additionally, the tool integrates Ollama, a local LLM engine, to optionally generate intelligent responses to transcribed text.
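As an illustration of the described pipeline, here is a minimal sketch of the capture, filter, and transcribe loop. The model size, thresholds, and the choice of sounddevice and webrtcvad are assumptions for the sketch, not details given above.

# Minimal sketch of the capture -> VAD/loudness filter -> transcribe loop.
# Assumed tooling: pip install sounddevice webrtcvad faster-whisper numpy
import numpy as np
import sounddevice as sd
import webrtcvad
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000
FRAME_MS = 30                       # webrtcvad accepts 10/20/30 ms frames
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000
RMS_THRESHOLD = 0.01                # loudness gate (assumed value)

vad = webrtcvad.Vad(2)              # aggressiveness 0-3 (assumed setting)
model = WhisperModel("small", device="cpu", compute_type="int8")

def record_segment(seconds=3.0):
    """Capture a short mono chunk from the default microphone."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    return audio.ravel()

def is_valid(audio):
    """Keep the chunk only if it is loud enough and mostly speech."""
    if np.sqrt(np.mean(audio ** 2)) < RMS_THRESHOLD:
        return False
    pcm = (audio * 32767).astype(np.int16).tobytes()
    step = FRAME_LEN * 2            # bytes per 30 ms frame of 16-bit PCM
    frames = [pcm[i:i + step] for i in range(0, len(pcm), step)]
    voiced = sum(vad.is_speech(f, SAMPLE_RATE) for f in frames if len(f) == step)
    return voiced > len(frames) // 2

while True:                         # Ctrl+C to stop
    chunk = record_segment()
    if is_valid(chunk):
        segments, _ = model.transcribe(chunk, language="en")
        print(" ".join(s.text.strip() for s in segments))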
This work contributes a dual outcome: a detailed empirical evaluation of modern transcription models on realistic meeting audio, and a functional, privacy-preserving voice assistant prototype for local systems. The results highlight the feasibility and value of running sophisticated voice AI tools on personal machines without cloud dependency, paving the way for secure adoption in sensitive domains such as legal, healthcare, and enterprise communication.
Keywords
speech recognition, Whisper, Faster-Whisper, transcription models, real-time, privacy, WER, PyQt5