Voxtral Transcribe 2: What Mistral's AI Transcription Model Means
Mistral AI released Voxtral Transcribe 2 on February 5, 2026, introducing two speech-to-text models that push the boundaries of transcription accuracy and speed. The release includes Voxtral Mini Transcribe V2 for batch processing and Voxtral Realtime for live, ultra-low-latency transcription.
According to Mistral’s announcement, the models support 13 languages and achieve the lowest word error rate at the lowest price point of any transcription API. Voxtral Realtime ships as open weights under the Apache 2.0 license, meaning developers can deploy it on-device for privacy-sensitive applications.
This matters for anyone who records meetings, interviews, lectures, or podcasts. The transcription market just got more competitive, and tools like ScreenApp’s transcription, Otter.ai, and Fireflies now face a powerful open-source alternative. Here is what changed and what it means for your workflow.
Related guides: Best free audio to text converters, AI note taker tools, Live transcription apps
What Is Voxtral Transcribe 2?
Voxtral Transcribe 2 is a family of two speech-to-text models from Mistral AI, the Paris-based company known for open-source large language models. The two models serve different use cases.
Voxtral Mini Transcribe V2 handles batch transcription. You upload an audio file, and it returns a transcript with speaker diarization (who said what), word-level timestamps, and context biasing for technical terms. It costs roughly $0.003 per minute of audio and achieves approximately 4% word error rate on the FLEURS benchmark. That makes it half the price of OpenAI Whisper’s API ($0.006/min) while, according to Mistral, delivering better accuracy.
Voxtral Realtime is built for live transcription. It uses a streaming architecture that transcribes audio as it arrives, with latency configurable down to under 200 milliseconds. At a 2.4-second delay, it matches the batch model’s accuracy. At 480ms, its word error rate stays within 1-2 percentage points of the batch model’s. This is the model released under Apache 2.0, so anyone can download and run it locally.
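Word error rate, the metric quoted above, is the number of word substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Mistral's evaluation code. The sample sentences are made up, and published benchmarks typically normalize punctuation and casing in ways this example does not reproduce.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: the hypothesis drops one of ten reference words.
reference = "please send the quarterly report to the team before friday"
hypothesis = "please send the quarterly report to team before friday"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

One missing word out of ten gives a 10% word error rate, so a figure of roughly 4% corresponds to about one word error in every 25 reference words.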
Both models support 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
Two Models, Two Use Cases
Understanding which model fits your needs is straightforward.
Use Voxtral Mini Transcribe V2 when:
- You have pre-recorded audio (meetings, interviews, podcasts)
- You need speaker labels and timestamps
- You want the highest possible accuracy
- Turnaround time of seconds to minutes is acceptable
- You need context biasing for domain-specific vocabulary
Use Voxtral Realtime when:
- You need live captions or subtitles
- You are building voice agents or real-time assistants
- Latency under 500ms matters
- You want to run the model on your own hardware
- Privacy requires on-device processing
The distinction matters because most transcription tools bundle everything into one product. Mistral split the problem into two specialized solutions, each optimized for its use case.
On-Device Transcription
The biggest story here is not accuracy or speed. It is privacy.
Voxtral Realtime runs on-device with a 4-billion-parameter footprint. When you run it locally, your audio never leaves your computer, phone, or server. For healthcare providers, legal professionals, financial advisors, and anyone handling sensitive conversations, this changes the calculus entirely.
Most transcription services today send your audio to cloud servers for processing. Otter.ai, Fireflies, and even ScreenApp process audio in the cloud. OpenAI’s Whisper API works the same way. While these services have privacy policies and encryption, the audio still travels to and gets processed on third-party infrastructure.
With Voxtral Realtime, organizations can deploy the model inside their own network. No audio leaves the premises. No third-party data processing agreements needed. No risk of data breaches at a transcription provider.
The trade-off is that you need to manage your own infrastructure. Running a 4B parameter model requires a decent GPU (or a modern laptop with sufficient memory). For individuals, cloud services remain more convenient. For enterprises with compliance requirements, on-device is a game-changer.
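To put the 4B-parameter figure in hardware terms, a rough estimate of the memory needed just to hold the weights is the parameter count multiplied by the bytes per parameter. The sketch below assumes a few common numeric precisions. The precision of the released weights is not stated here, so treat these as ballpark numbers; real usage will be higher once activations, audio buffers, and runtime overhead are included.

```python
PARAMETERS = 4_000_000_000  # the 4-billion-parameter footprint quoted above

# Assumed precisions for illustration; not a statement about how the
# Voxtral Realtime weights are actually distributed.
BYTES_PER_PARAMETER = {
    "float32": 4.0,
    "float16 / bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(parameters: int, bytes_per_parameter: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


for precision, size in BYTES_PER_PARAMETER.items():
    print(f"{precision:>20}: {weight_memory_gb(PARAMETERS, size):.1f} GB")
```

At 16-bit precision this comes to roughly 8 GB for the weights alone, which is why a GPU or a laptop with plenty of memory is the practical starting point.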
How Voxtral Compares
Here is how Voxtral Transcribe 2 stacks up against the major transcription tools available in 2026.
| Tool | Type | On-Device | Diarization | Price | Best For |
|---|---|---|---|---|---|
| Voxtral Transcribe 2 | API / Self-hosted | Yes (Realtime) | Yes | $0.003/min (API) | Developers, privacy-first |
| OpenAI Whisper | API / Self-hosted | Yes (open-source) | Not built in | $0.006/min (API) | Developers, general use |
| ScreenApp | Web app | No | Yes | Free / $19/mo | Full workflow: record + transcribe + summarize |
| Otter.ai | Web app / Mobile | No | Yes | Free / $8.33/mo | Meeting transcription |
| Fireflies.ai | Web app / Bot | No | Yes | Free / $10/mo | Meeting notes and CRM |
A few things stand out in this comparison. Voxtral is the cheapest API option and the only one offering both on-device deployment (Voxtral Realtime) and built-in diarization (Voxtral Mini Transcribe V2) within a single model family. Whisper is open-source but lacks native speaker diarization. The cloud services (ScreenApp, Otter.ai, Fireflies) offer complete products with UIs, integrations, and workflows that raw transcription models do not provide.
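To see what the per-minute API prices in the table mean at volume, here is a small back-of-the-envelope calculation. It uses only the two API prices quoted above; the 100-hour monthly workload is an arbitrary assumption for illustration, and the subscription tools are left out because their plans are not priced per minute.

```python
# Per-minute API prices as quoted in the comparison table.
PRICE_PER_MINUTE_USD = {
    "Voxtral Mini Transcribe V2": 0.003,
    "OpenAI Whisper API": 0.006,
}


def monthly_cost(hours_per_month: float, price_per_minute: float) -> float:
    """Cost in USD of transcribing the given number of audio hours."""
    return round(hours_per_month * 60 * price_per_minute, 2)


HOURS = 100  # assumed workload, chosen only to make the arithmetic concrete
for name, price in PRICE_PER_MINUTE_USD.items():
    print(f"{name}: ${monthly_cost(HOURS, price):.2f} for {HOURS} hours")
```

For 100 hours of audio this works out to $18.00 versus $36.00, so the gap only becomes meaningful in absolute terms at fairly high volumes.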
Raw Model vs. Complete Tool
This is the critical distinction most coverage of Voxtral misses.
Voxtral Transcribe 2 is a transcription model. It converts speech to text. That is all it does. There is no recording interface, no meeting scheduler, no summary generator, no search function, no sharing system, no integrations with Zoom or Google Meet.
For developers building transcription into their own products, Voxtral is excellent. Individuals and teams who need to transcribe meetings, lectures, or interviews still need a complete tool.
ScreenApp handles the full workflow: record your screen or upload audio, get an automatic transcript with speaker diarization, generate an AI summary, and search across all your transcripts later. The transcription is one step in a larger process.
Think of it this way: Voxtral is an engine. ScreenApp is the car. Most people need the car. Developers and enterprises building their own cars need the engine.
This is why VentureBeat called 2026 “the year of note-taking.” The underlying models keep getting better and cheaper, which makes the complete tools built on top of them more powerful and more affordable.
Privacy Considerations
The privacy angle deserves a deeper look because it affects different users differently.
For individuals: Cloud transcription services are generally fine. Your meeting recordings are encrypted in transit and at rest. The convenience of a hosted service outweighs the theoretical privacy risk for most personal and small business use cases.
For regulated industries: On-device transcription is significant. HIPAA compliance in healthcare, attorney-client privilege in law, and financial regulations all create situations where sending audio to third-party servers introduces compliance risk. Running Voxtral Realtime inside a hospital’s network or on a law firm’s servers removes that third-party transfer.
For enterprises: The choice depends on your threat model. If you are worried about a transcription provider being breached, on-device helps. If you are worried about insider threats, it does not, because the audio still exists on your internal systems.
ScreenApp addresses privacy through encryption and data handling policies rather than on-device processing. For most users, this provides adequate protection. For organizations with strict data residency requirements, on-device models like Voxtral offer an additional option. You can learn more about how ScreenApp handles audio data on the voice test and recording page.
What This Means for 2026
The transcription market is moving fast. Here is what to watch for the rest of 2026.
Prices will keep falling. Voxtral at $0.003/min undercuts Whisper at $0.006/min. This pressure will push all transcription APIs toward lower pricing, which benefits end-user tools that rely on these APIs.
On-device will become standard. Apple already offers on-device transcription in iOS. Google has similar capabilities in Android. Voxtral brings this to the open-source world at production quality. Within a year, expect most transcription tools to offer an on-device option.
The value shifts to workflow. When transcription itself becomes cheap and accurate, the differentiation moves to what you do with the transcript. Summarization, action item extraction, searchable archives, and integrations become the real product. This is already where tools like ScreenApp and Otter.ai compete.
Real-time transcription opens new use cases. Sub-200ms latency enables live captioning, real-time translation, voice agents, and accessibility features that were not practical before. Expect to see these capabilities appear in video conferencing tools, customer support systems, and educational platforms.
Transcribe with ScreenApp
If you need transcription today and do not want to set up your own infrastructure, ScreenApp provides everything in one place.
- Record or upload your audio at screenapp.io/features/online-transcript-generator.
- Get your transcript with speaker labels and timestamps automatically.
- Generate summaries using the AI summarizer to extract key points and action items.
No software to install, no models to configure, no GPU required.
After Transcription
Once you have your transcript, ScreenApp gives you more tools to work with:
- AI Note Taker: Generate structured meeting notes from any recording
- Transcript Diarization: See exactly who said what with speaker labels
- Live Transcription: Transcribe audio in real time as it plays
- Speech to Text Extension: Transcribe directly from your browser
FAQ
Is Voxtral Transcribe 2 free?
Voxtral Realtime is open-weights under Apache 2.0, so you can download and run it for free on your own hardware. The API through Mistral’s platform costs $0.003 per minute for Voxtral Mini Transcribe V2.
How does Voxtral compare to Whisper?
Voxtral achieves lower word error rates than Whisper at half the API cost ($0.003/min vs $0.006/min). Voxtral also includes native speaker diarization, which Whisper lacks. Both can run on-device.
Can I use Voxtral for meeting transcription?
As a raw model, yes, but you would need to build your own recording and playback interface. For ready-to-use meeting transcription, tools like ScreenApp, Otter.ai, or Fireflies provide a complete experience out of the box.
What languages does Voxtral support?
Voxtral Transcribe 2 supports 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
Is on-device transcription better than cloud?
It depends on your needs. On-device offers better privacy since audio never leaves your hardware. Cloud transcription is more convenient and does not require local compute resources. For most individuals, cloud is fine. For regulated industries, on-device is valuable.
What is speaker diarization?
Speaker diarization identifies who spoke when in a recording. Instead of a single block of text, you get labeled segments like “Speaker 1: …” and “Speaker 2: …”. Voxtral Mini Transcribe V2 and ScreenApp both offer this feature.
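As a rough illustration of what diarized output looks like once it is turned into readable text, the sketch below merges consecutive segments from the same speaker and prints them as labeled lines. The `Segment` structure and its field names are hypothetical, chosen for this example; they are not the response schema of Voxtral, ScreenApp, or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical diarized segment: who spoke, when, and what was said."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str


def format_transcript(segments: list[Segment]) -> str:
    """Merge back-to-back segments by the same speaker into labeled lines."""
    lines: list[str] = []
    previous_speaker = None
    for seg in segments:
        if seg.speaker == previous_speaker:
            # Same speaker kept talking: extend the current line.
            lines[-1] += " " + seg.text
        else:
            minutes, seconds = divmod(int(seg.start), 60)
            lines.append(f"[{minutes:02d}:{seconds:02d}] {seg.speaker}: {seg.text}")
            previous_speaker = seg.speaker
    return "\n".join(lines)


# Made-up sample data for illustration.
segments = [
    Segment("Speaker 1", 0.0, 2.1, "Shall we start?"),
    Segment("Speaker 2", 2.4, 4.0, "Yes."),
    Segment("Speaker 2", 4.2, 6.0, "I have the numbers."),
    Segment("Speaker 1", 65.0, 66.0, "Great."),
]
print(format_transcript(segments))
```

Merging adjacent segments keeps the transcript readable when a diarizer splits one person's turn into several short pieces.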
Will Voxtral replace Otter.ai or ScreenApp?
No. Voxtral is a transcription model, not a complete product. Otter.ai and ScreenApp provide recording, transcription, summarization, search, sharing, and integrations. Voxtral could power the transcription layer inside these tools, but it does not replace the full workflow.