Voice messages in conversations
Where voice messages work
Inbound voice (visitor to bot):
- WhatsApp. Voice notes are standard.
- Telegram. Voice messages are common.
- Voice channel. Full audio call.
Outbound voice (bot to visitor):
- Voice channel. TTS response.
- WhatsApp. Voice note replies are planned for Business and Enterprise plans; not available today.
For pure-text channels (SMS, email), voice notes don't apply.
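The channel lists above can be captured as a small lookup table. This is a minimal sketch, not an AskVault API: the names `Support`, `VOICE_SUPPORT`, and `voice_support` are hypothetical, and Telegram outbound is marked unsupported only because it is not listed above.

```python
from enum import Enum


class Support(Enum):
    YES = "yes"
    PLANNED = "planned"
    NO = "no"


# Hypothetical table built from the lists above; not an AskVault API.
# Telegram outbound is NO only because the lists above do not mention it.
VOICE_SUPPORT = {
    "whatsapp": {"inbound": Support.YES, "outbound": Support.PLANNED},
    "telegram": {"inbound": Support.YES, "outbound": Support.NO},
    "voice": {"inbound": Support.YES, "outbound": Support.YES},
    "sms": {"inbound": Support.NO, "outbound": Support.NO},
    "email": {"inbound": Support.NO, "outbound": Support.NO},
}


def voice_support(channel: str, direction: str) -> Support:
    """Return voice support for a channel and direction ("inbound" or "outbound")."""
    return VOICE_SUPPORT.get(channel.lower(), {}).get(direction, Support.NO)
```

Unknown channels fall back to `Support.NO`, so a new text-only channel is treated like SMS or email until it is added to the table.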
How transcription works
When a visitor sends a voice message:
- The channel delivers the audio file to AskVault.
- ASR (automatic speech recognition) transcribes it, typically in under 5 seconds.
- The transcribed text flows through the normal bot pipeline.
- The bot responds in text by default.
Accuracy: about 85% to 95% on clear English audio, lower in noisy environments.
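The flow above can be sketched in a few lines. This is an illustration under stated assumptions, not AskVault's implementation: `VoiceMessage`, `BotReply`, and `handle_voice` are hypothetical names, and the ASR engine and the bot pipeline are passed in as plain callables so the sketch does not depend on any real service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceMessage:
    channel: str   # e.g. "whatsapp" or "telegram"
    audio: bytes   # audio file as delivered by the channel


@dataclass
class BotReply:
    text: str        # text reply sent back to the visitor
    transcript: str  # what ASR heard, kept for agent review


def handle_voice(
    message: VoiceMessage,
    transcribe: Callable[[bytes], str],  # hypothetical ASR hook
    answer: Callable[[str], str],        # hypothetical text bot pipeline
) -> BotReply:
    """Transcribe the audio, then treat the transcript like typed input."""
    transcript = transcribe(message.audio)
    reply_text = answer(transcript)
    return BotReply(text=reply_text, transcript=transcript)
```

Keeping the transcript on the reply reflects the point that agents can check what the bot actually heard when accuracy drops.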
Languages
35 languages are auto-detected, including:
- English, Spanish, French, German, Italian, Portuguese, Dutch.
- Hindi, Bengali, Tamil, Telugu, Marathi.
- Mandarin, Cantonese, Japanese, Korean.
- Arabic, Hebrew, Turkish.
- And more.
If auto-detection struggles, configure a language hint for the workspace.
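One way to picture how a workspace hint could interact with auto-detection is a simple fallback. This sketch rests on assumptions: the function name `resolve_language`, the confidence score, and the 0.6 threshold are all hypothetical, and the product may apply the hint differently.

```python
from typing import Optional


def resolve_language(
    detected: Optional[str],
    confidence: float,
    workspace_hint: Optional[str],
    min_confidence: float = 0.6,  # assumed threshold, not a documented value
) -> Optional[str]:
    """Prefer a confident detection; otherwise fall back to the workspace hint."""
    if detected is not None and confidence >= min_confidence:
        return detected
    # Low confidence or no detection: use the hint if one is configured,
    # else keep whatever was detected (possibly nothing).
    return workspace_hint or detected
```

For example, `resolve_language("nl", 0.3, "de")` returns `"de"`, while a confident detection such as `resolve_language("es", 0.9, "en")` keeps `"es"`.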
Audio retention
- Audio files are stored for 90 days as standard.
- 1 year on Enterprise.
- Encrypted at rest.
- Accessible to agents via conversation view.
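The retention windows above translate into a simple date calculation. This is a sketch under stated assumptions: `RETENTION_DAYS` and `audio_expires_at` are hypothetical names, "1 year" is taken as 365 days, and any plan other than Enterprise is assumed to get the 90-day window.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping: 90 days as standard, 1 year (taken as 365 days) on Enterprise.
RETENTION_DAYS = {"standard": 90, "enterprise": 365}


def audio_expires_at(received_at: datetime, plan: str) -> datetime:
    """Return the moment an audio file leaves the retention window."""
    days = RETENTION_DAYS.get(plan.lower(), RETENTION_DAYS["standard"])
    return received_at + timedelta(days=days)


received = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(audio_expires_at(received, "standard").date())    # 2024-03-31
print(audio_expires_at(received, "Enterprise").date())  # 2024-12-31
```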
Player in conversation view
Agents see:
- Audio player with play/pause/seek.
- Transcript below with timestamps.
- Click any word in the transcript to jump to that moment in the audio.
Useful for QA and verification.
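Click-to-seek depends on each transcript word carrying a start time. The sketch below shows the idea with hypothetical names (`Word`, `seek_position`, `word_at`); it assumes words are sorted by start time and does not describe the player's actual code.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio


def seek_position(words: List[Word], index: int) -> float:
    """Clicking word `index` seeks the player to that word's start time."""
    return words[index].start


def word_at(words: List[Word], position: float) -> int:
    """Reverse lookup: index of the word being spoken at `position` seconds."""
    starts = [w.start for w in words]
    # Last word whose start time is <= position; clamp to the first word.
    return max(bisect_right(starts, position) - 1, 0)
```

The reverse lookup is what lets a player highlight the current word while the audio plays, which helps when checking a transcript against the recording.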
Bot capabilities
The bot can:
- Answer questions from voice. Treats transcript like text input.
- Acknowledge voice with text ("I heard your message about refunds. Here's our policy...").
- Trigger skills based on transcribed intent.
What the bot doesn't do today:
- Reply with voice on WhatsApp. Planned.
- Process voice quality cues (tone, urgency). Planned via voice sentiment analysis.
Quality considerations
For best results, customers should:
- Speak clearly in a quiet environment.
- Use the channel's native voice feature (not an attached audio file).
- Keep messages under 60 seconds for best transcription.
Limits
- Per-message duration. 5 minutes.
- Audio formats. Channel-native (OGG/Opus for WhatsApp and Telegram voice messages, etc.).
- Transcription latency. Typically under 5 seconds.
- File size cap. 20 MB per audio.
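The duration and size limits above can be checked before audio is sent for transcription. This is a minimal sketch with hypothetical names (`check_limits` and the two constants); it assumes 1 MB means 1024 × 1024 bytes, which the limits above do not specify.

```python
from typing import List

MAX_DURATION_SECONDS = 5 * 60        # per-message duration limit: 5 minutes
MAX_FILE_BYTES = 20 * 1024 * 1024    # 20 MB cap, assuming 1 MB = 1024 * 1024 bytes


def check_limits(duration_seconds: float, size_bytes: int) -> List[str]:
    """Return a list of limit violations; an empty list means the audio is accepted."""
    problems = []
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 5 minutes")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 20 MB")
    return problems
```

Returning every violation at once, rather than stopping at the first, lets a caller show the customer a single complete message.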
Common pitfalls
Garbled transcription. Noisy audio produces an unreliable transcript, so the bot may misunderstand the request. An agent should review the audio.
Wrong language detected. The bot then replies in the wrong language. Set a language hint for the workspace.
Long, rambling message. A customer talks for 4 minutes and the bot extracts the intent imperfectly. Encourage shorter messages.
FAQ
Will customers know it's a bot if they send voice?
The bot responds in text by default. Customers can tell from the response style.
Can I disable voice transcription?
Yes, under Workspace Settings > Voice. Audio is still archived, but transcription is skipped.
Does voice count against my query quota?
Yes. Each transcribed voice message plus its bot reply counts as 1 query.