On-Device vs Cloud Transcription: Comparing Privacy and Speed
The launch of Mistral’s Voxtral Realtime in February 2026 reignited a debate that has been building for years: should AI transcription happen on your device or in the cloud? The answer is not as simple as “local is more private” or “cloud is more accurate.” Both approaches have real trade-offs in privacy, speed, accuracy, cost, and convenience.
According to VentureBeat, the release of production-quality open-source models that run on consumer hardware marks a turning point. For the first time, individuals and organizations can get near-cloud-quality transcription without sending a single byte of audio to an external server.
This guide breaks down when on-device transcription makes sense, when cloud is the better choice, and how tools like ScreenApp and Voxtral fit into each scenario.
Related guides: Voxtral Transcribe 2 overview, Best AI transcription tools 2026, Live transcription
What On-Device Means
On-device transcription means the AI model runs entirely on your local hardware. Your audio never leaves your computer, phone, or server. The model processes the speech-to-text conversion using your device’s CPU or GPU.
Current on-device options include:
- Voxtral Realtime (Mistral, 2026) - 4B parameter streaming model, Apache 2.0, runs on a single GPU
- OpenAI Whisper (open-source) - Multiple model sizes from 39M to 1.5B parameters, runs on CPU or GPU
- Apple Speech Recognition - Built into iOS and macOS, processes on the device’s Neural Engine
- Google Speech-to-Text - On-device mode available on Pixel and Android devices
- Mozilla DeepSpeech - Open-source, lightweight, CPU-friendly
The key requirement is that your device has enough compute power to run the model. Smaller models (Whisper tiny/base) run on almost any modern laptop. Larger models (Voxtral Realtime at 4B parameters) need a dedicated GPU or Apple Silicon with sufficient memory.
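As a rule of thumb, the memory a model needs follows from its parameter count. This is a simplified sketch (it counts fp16 weights only and ignores activation memory), not a vendor specification:

```python
def model_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory for a model stored in fp16/bf16
    (2 bytes per parameter). Activations and buffers are extra."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# A 4B-parameter model like Voxtral Realtime needs roughly 8 GB just
# for fp16 weights, which is why a GPU or Apple Silicon machine with
# 8 GB+ of memory is the practical floor.
print(model_memory_gb(4))    # 8.0
# Whisper large (~1.5B parameters) fits in about 3 GB of weights.
print(model_memory_gb(1.5))  # 3.0
```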
What Cloud Means
Cloud transcription sends your audio to remote servers where powerful hardware processes it. The transcript is returned to your device over the internet.
Cloud transcription services include:
- ScreenApp - Record, transcribe, summarize, and search in the browser
- Otter.ai - Meeting-focused transcription with team features
- Fireflies.ai - Meeting intelligence with CRM integrations
- OpenAI Whisper API - Pay-per-minute cloud Whisper at $0.006/min
- Voxtral API - Mistral’s managed transcription at $0.003/min
- Deepgram - Enterprise transcription API
- AssemblyAI - Developer API with audio intelligence
Cloud services handle the infrastructure, scaling, and model updates. You send audio, you get text. No GPU required on your end.
Privacy Comparison
Privacy is the primary reason people consider on-device transcription. Here is a detailed comparison.
On-device privacy advantages:
- Audio never leaves your hardware
- No third-party data processing
- No risk of server-side data breaches at the transcription provider
- No data retention policies to worry about
- Full control over transcript storage and deletion
- Compliant by default with data residency requirements
On-device privacy limitations:
- Your device can still be compromised (malware, physical access)
- Transcripts stored locally are only as secure as your device
- No automatic backup means data loss if the device fails
- You are responsible for your own security practices
Cloud privacy advantages:
- Professional security teams manage infrastructure
- Encryption in transit (TLS) and at rest
- Regular security audits and compliance certifications
- Automatic backups prevent data loss
- Access controls and team permissions
Cloud privacy limitations:
- Audio travels over the internet to third-party servers
- Provider employees may have theoretical access to data
- Data breaches at the provider expose your recordings
- Subpoenas can compel providers to release data
- Data may be processed in jurisdictions with different privacy laws
For most individuals and small businesses, cloud transcription with a reputable provider offers adequate privacy. The encryption and security practices of established services like ScreenApp exceed what most individuals implement on their own devices.
For regulated industries, the calculus changes. Healthcare (HIPAA), legal (attorney-client privilege), financial services, and government agencies may require that sensitive audio never leave their controlled infrastructure. On-device transcription with Voxtral Realtime or Whisper satisfies this requirement.
Speed Comparison
Speed in transcription means two things: how fast you get a transcript from recorded audio (batch speed), and how quickly text appears during live speech (latency).
Batch transcription speed:
On-device batch speed depends on your hardware. A modern GPU (NVIDIA RTX 4090 or Apple M3 Max) processes audio 10-30x faster than real-time with Whisper large-v3. A laptop CPU might process audio at 1-3x real-time, meaning a one-hour recording takes 20-60 minutes to transcribe.
Cloud batch speed is generally faster because providers use optimized hardware. Voxtral Mini Transcribe V2 processes audio approximately 3x faster than competing services according to Mistral’s benchmarks. Most cloud services return transcripts of hour-long recordings in 2-5 minutes.
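The batch-speed numbers above reduce to a simple ratio: wall-clock time is audio length divided by how many times faster than real-time your hardware runs. A small illustrative helper:

```python
def transcription_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Wall-clock minutes to transcribe a recording, given how many
    times faster than real-time the hardware processes audio."""
    return audio_minutes / realtime_factor

one_hour = 60
print(transcription_minutes(one_hour, 30))  # 2.0  -> high-end GPU at ~30x real-time
print(transcription_minutes(one_hour, 1))   # 60.0 -> laptop CPU at 1x real-time
```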
Real-time latency:
Voxtral Realtime achieves sub-200ms latency, configurable up to 2.4 seconds for higher accuracy. This is on-device performance.
Cloud real-time transcription adds network latency on top of processing time. Expect 500ms to 2 seconds total latency depending on your internet connection and the provider’s infrastructure. Tools like ScreenApp’s live transcription and Otter.ai’s live captions fall in this range.
For live captioning and voice agents, on-device wins on raw latency. For batch transcription of recordings, cloud wins on total throughput unless you have high-end local hardware.
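The latency trade-off can be framed as a budget check: cloud latency is processing time plus the network round trip, which is what rules it out for sub-200ms targets. A sketch with illustrative numbers:

```python
def total_latency_ms(processing_ms: float, network_rtt_ms: float = 0) -> float:
    """End-to-end latency: model processing plus any network round trip
    (zero for on-device)."""
    return processing_ms + network_rtt_ms

def meets_budget(latency_ms: float, budget_ms: float) -> bool:
    """True if the pipeline fits the application's latency budget."""
    return latency_ms <= budget_ms

# Illustrative: ~180ms of processing either way, plus a 400ms round
# trip when the same model sits behind a cloud API.
on_device = total_latency_ms(180)
cloud = total_latency_ms(180, network_rtt_ms=400)

print(meets_budget(on_device, 200))  # True: fits a sub-200ms voice-agent budget
print(meets_budget(cloud, 200))      # False: 580ms, inside the 500ms-2s cloud range
```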
Accuracy Comparison
This is where the gap between on-device and cloud has narrowed dramatically in 2026.
Current accuracy benchmarks (word error rate on FLEURS, lower is better):
| Model | Deployment | Word Error Rate | Notes |
|---|---|---|---|
| Voxtral Mini V2 | Cloud API | ~4% | Best accuracy at lowest cost |
| Voxtral Realtime (2.4s delay) | On-device | ~4% | Matches batch at higher delay |
| Voxtral Realtime (480ms delay) | On-device | ~5-6% | Within 1-2% of batch |
| Whisper large-v3 | On-device / Cloud | ~5-7% | Varies by language |
| GPT-4o Mini Transcribe | Cloud API | ~5% | OpenAI's latest |
The key insight: on-device accuracy has reached parity with cloud for major languages when using the latest models. Voxtral Realtime at 2.4 seconds delay matches the cloud batch model. The accuracy gap only appears when you push for ultra-low latency (sub-500ms).
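Word error rate, the metric in the table above, is the minimum number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the reference word count. A minimal edit-distance implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference is a 20% WER;
# a ~4% WER means roughly one error every 25 words.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown box jumps"))  # 0.2
```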
For less common languages, larger models still have an edge. Whisper supports 97 languages versus Voxtral's 13, so for long-tail languages a broader-coverage model, whether Whisper run locally or a cloud API, is the better choice.
Cost Comparison
Cost includes both direct expenses and hidden costs.
On-device costs:
- Hardware: $0 if you already have capable hardware, $500-$2,000+ for a GPU
- Electricity: $0.01-$0.05 per hour of transcription (varies by hardware and rates)
- Maintenance: Your time setting up and maintaining the system
- Model updates: Free (download new weights when available)
- Per-minute cost after setup: Effectively $0
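The electricity line item above follows directly from power draw and local rates. A quick sanity check with illustrative numbers:

```python
def electricity_cost_per_hour(watts: float, usd_per_kwh: float) -> float:
    """Cost of running the hardware for one hour of transcription work."""
    return watts / 1000 * usd_per_kwh

# A 300W GPU at $0.12/kWh lands inside the $0.01-$0.05/hour range above,
# and faster-than-real-time processing lowers the cost per audio hour further.
print(round(electricity_cost_per_hour(300, 0.12), 3))  # 0.036
```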
Cloud costs:
- API pricing: $0.003-$0.006/min for raw transcription
- SaaS pricing: $0-$30/mo for consumer tools
- No hardware investment
- No maintenance
- Automatic updates and improvements
Break-even analysis:
If you use Voxtral’s API at $0.003/min and transcribe 100 hours per month, that costs $18/month. Self-hosting on a cloud GPU (roughly $0.50-$1.00/hr for an A10G) would cost $50-$100/month for the same workload, making the API cheaper for moderate usage.
If you already own a GPU or Apple Silicon machine, self-hosting costs only electricity. But be careful with the payback math if you are buying hardware just for transcription: at 100 hours/month the API bill is only $18, so a $1,500 GPU takes years, not months, to pay for itself at that volume.
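The break-even comparison can be made explicit. A hypothetical calculator using the article's figures ($0.003/min API, a rented A10G at the midpoint of $0.50-$1.00/hr, or owned hardware paying only an illustrative electricity rate):

```python
API_RATE_PER_MIN = 0.003     # Voxtral API pricing
RENTED_GPU_PER_HOUR = 0.75   # midpoint of the $0.50-$1.00/hr A10G range
ELECTRICITY_PER_HOUR = 0.04  # owned hardware, illustrative rate

def monthly_cost(hours_of_audio: float, mode: str) -> float:
    """Monthly transcription cost in USD, rounded to cents.
    Rented and owned GPU modes assume roughly real-time processing."""
    if mode == "api":
        return round(hours_of_audio * 60 * API_RATE_PER_MIN, 2)
    if mode == "rented_gpu":
        return round(hours_of_audio * RENTED_GPU_PER_HOUR, 2)
    if mode == "owned_gpu":
        return round(hours_of_audio * ELECTRICITY_PER_HOUR, 2)
    raise ValueError(f"unknown mode: {mode}")

print(monthly_cost(100, "api"))         # 18.0 -> API wins at moderate volume
print(monthly_cost(100, "rented_gpu"))  # 75.0
print(monthly_cost(100, "owned_gpu"))   # 4.0

# Months for a $1,500 GPU to pay for itself against the API at 100 hrs/month:
savings = monthly_cost(100, "api") - monthly_cost(100, "owned_gpu")
print(round(1500 / savings))            # 107 -> roughly nine years, not months
```

At much higher volumes the picture flips: at 1,000 hours/month the API costs $180 while owned hardware stays near electricity cost.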
For light users (under 10 hours/month), cloud tools like ScreenApp with included transcription minutes in the subscription offer the best value. You get transcription plus recording, summaries, and search for a flat monthly fee.
When to Use On-Device
On-device transcription is the right choice when:
You handle sensitive data. Medical records, legal proceedings, financial discussions, classified information. If a data breach at a transcription provider would cause real harm, keep the audio local.
You need real-time with minimal latency. Building a voice agent, live captioning system, or accessibility tool? Sub-200ms latency requires on-device processing. Network round-trips add too much delay.
You have high volume. Processing thousands of hours monthly? The marginal cost of on-device transcription approaches zero after the hardware investment. At scale, this beats per-minute API pricing.
You have compliance requirements. HIPAA, GDPR data residency, government security clearances, or industry-specific regulations that mandate data stays on controlled infrastructure.
You are building a product. If transcription is a core feature of your application, on-device deployment gives you control over quality, latency, and cost. No dependency on third-party API availability or pricing changes.
When to Use Cloud
Cloud transcription is the right choice when:
You want convenience. No setup, no hardware requirements, no maintenance. Open a browser, upload audio, get a transcript. Tools like ScreenApp handle everything.
You need a complete workflow. Transcription alone is rarely the end goal. You want summaries, action items, searchable archives, team sharing, and integrations. Cloud tools bundle these features. ScreenApp’s AI summarizer and note taker turn raw transcripts into usable output.
You are an individual or small team. The overhead of managing on-device infrastructure is not worth it for a few hours of transcription per week. A $19/month subscription is cheaper than your time.
You need broad language support. Cloud services offer more languages. If you transcribe in less common languages, cloud APIs with larger models perform better.
You want guaranteed uptime. Cloud providers offer SLAs and redundancy. Your laptop battery dying during a transcription is a you problem. Cloud services keep processing regardless.
You do not have a GPU. Running modern transcription models at useful speeds requires at least a mid-range GPU or Apple Silicon. If your hardware cannot handle it, cloud is your only option.
The Hybrid Approach
The most practical approach for many organizations combines both.
Use on-device for: Sensitive recordings (board meetings, HR conversations, legal consultations), real-time captioning, and high-volume batch processing.
Use cloud for: General meetings, lectures, podcasts, and anything where convenience matters more than data sensitivity. ScreenApp handles these use cases with recording, transcription, diarization, and summarization in one tool.
Some organizations route recordings through a classification step: sensitive audio goes to the on-device pipeline, everything else goes to the cloud tool. This balances privacy with productivity.
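That classification step can be as simple as a tag check on each recording. A minimal hypothetical sketch (the tag names and pipeline labels are illustrative):

```python
# Tags that mark a recording as sensitive; illustrative, not exhaustive.
SENSITIVE_TAGS = {"legal", "hr", "medical", "board", "financial"}

def route(recording_tags: set[str]) -> str:
    """Send sensitive audio to the local pipeline; everything else
    goes to the cloud tool."""
    if recording_tags & SENSITIVE_TAGS:
        return "on-device"
    return "cloud"

print(route({"hr", "quarterly-review"}))  # on-device
print(route({"marketing", "standup"}))    # cloud
```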
ScreenApp’s Position
ScreenApp is a cloud transcription tool that prioritizes the complete workflow over raw model deployment. Here is what that means in practice.
Your audio is encrypted in transit and at rest. Transcripts are stored securely and accessible only to you and anyone you share them with. For most users and use cases, this level of security is more than adequate.
Where ScreenApp differentiates from raw transcription models is in what happens after transcription:
- Recording: Capture screen, camera, or audio directly in the browser
- Diarization: Automatic speaker labels on every transcript
- Summaries: AI-generated key points and action items
- Search: Find any moment across all your recordings
- Notes: Structured meeting notes from any transcript
On-device models like Voxtral give you the transcript. ScreenApp gives you the transcript plus everything you need to actually use it.
The Future of Transcription
The on-device versus cloud debate will not be settled permanently because both approaches keep improving.
On-device is getting better. Model compression techniques are shrinking model sizes while maintaining accuracy. Apple’s Neural Engine, Qualcomm’s NPU, and similar dedicated AI hardware in consumer devices will make on-device transcription effortless within 2-3 years. Voxtral Realtime at 4B parameters is already runnable on a laptop. The next generation will likely be half that size.
Cloud is getting cheaper. Voxtral's API at $0.003/min is half the Whisper API's $0.006/min. Competition from Deepgram, AssemblyAI, and others continues pushing prices down. The marginal cost of cloud transcription is approaching zero for light users.
The workflow layer matters most. As transcription accuracy converges across providers, the value shifts to what you do with the transcript. Summarization, action item extraction, searchable archives, and integrations become the differentiators. This is where cloud tools have a structural advantage because they can iterate on features faster than local software.
Expect hybrid solutions. Future tools will likely offer both modes: on-device for sensitive content, cloud for convenience and advanced features. The toggle between local and cloud processing will become as simple as choosing a Wi-Fi network.
Getting Started
If you want on-device transcription today:
- Download Voxtral Realtime weights from Hugging Face
- Set up a Python environment with the required dependencies
- Run the model on your GPU or Apple Silicon device
- Build or use an existing interface for audio input and transcript output
If you want cloud transcription today:
- Go to screenapp.io/features/online-transcript-generator
- Upload audio, paste a URL, or start recording
- Get your transcript with speaker labels and timestamps
- Run the AI summarizer for key points
Both approaches work. Choose based on your privacy needs, technical comfort, and how much you value the workflow features that come with cloud tools.
FAQ
Is on-device transcription more private?
Yes, in the sense that your audio never leaves your hardware. However, on-device transcription is only as secure as your device itself. If your computer is compromised, your transcripts are exposed regardless of where the processing happened.
Can on-device models match cloud accuracy?
In 2026, yes for major languages. Voxtral Realtime at 2.4 seconds delay matches cloud accuracy. For less common languages or specialized domains, cloud services with larger models still have an edge.
Do I need a GPU for on-device transcription?
For real-time transcription with Voxtral Realtime, a GPU or Apple Silicon with 8GB+ memory is recommended. For batch transcription with Whisper, a CPU works but is 5-10x slower. Smaller Whisper models (tiny, base) run acceptably on modern CPUs.
How much does cloud transcription cost?
API pricing ranges from $0.00249/min (AssemblyAI) to $0.006/min (Whisper). Consumer tools like ScreenApp ($19/mo) and Otter.ai ($8.33/mo annual) include transcription minutes in subscription plans with no per-minute charges.
Is Voxtral Realtime truly free?
The model weights are free under Apache 2.0. You can download and run them without paying Mistral anything. You do pay for your own hardware and electricity. If you use Mistral’s hosted API instead, that costs $0.003/min.
Which is better for meetings?
Cloud tools like ScreenApp and Otter.ai are better for meetings because they provide recording, live transcription, speaker labels, summaries, and sharing in one interface. On-device models require you to build or find these features separately.
Can I switch between on-device and cloud?
Yes. Many organizations use on-device for sensitive recordings and cloud for general meetings. There is no technical barrier to using both approaches, though you will need separate tools for each.