Voice messages in conversations

Written by Aashiq, Founder, AskVault · Reviewed by Aashiq

Last updated: May 15, 2026 · 3 min read

Where voice messages work

Inbound voice (visitor to bot):

WhatsApp. Voice notes are standard.
Telegram. Voice messages are common.
Voice channel. Full audio call.

Outbound voice (bot to visitor):

Voice channel. TTS response.
WhatsApp. Voice note response on Business and Enterprise (planned).

For pure-text channels (SMS, email), voice notes don't apply.

How transcription works

When a visitor sends a voice message:

The channel delivers the audio file to AskVault.
ASR (automatic speech recognition) transcribes it, typically within 5 seconds.
The transcribed text flows through the normal bot pipeline.
The bot responds in text by default.

Accuracy: about 85 to 95% on clear English, lower in noisy environments.
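The steps above can be sketched roughly as follows. This is a minimal illustration, not AskVault's implementation: `VoiceMessage`, `handle_voice_message`, and the two stand-in functions are hypothetical names, and the real ASR and bot pipeline are replaced with fakes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceMessage:
    channel: str   # e.g. "whatsapp" or "telegram" (hypothetical identifiers)
    audio: bytes   # raw audio as delivered by the channel


def handle_voice_message(
    message: VoiceMessage,
    transcribe: Callable[[bytes], str],
    answer: Callable[[str], str],
) -> str:
    """Transcribe the audio, then treat the transcript like typed input."""
    transcript = transcribe(message.audio)
    return answer(transcript)  # the reply is text by default


# Stand-ins for the real ASR service and bot pipeline (assumptions).
def fake_transcribe(audio: bytes) -> str:
    return "what is your refund policy"


def fake_answer(text: str) -> str:
    return f"You asked: {text}"


reply = handle_voice_message(
    VoiceMessage(channel="whatsapp", audio=b"..."),
    fake_transcribe,
    fake_answer,
)
print(reply)  # You asked: what is your refund policy
```

The point of the sketch is that once the transcript exists, the rest of the flow is the same as for a typed message.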

Languages

35 languages are auto-detected, including:

English, Spanish, French, German, Italian, Portuguese, Dutch.
Hindi, Bengali, Tamil, Telugu, Marathi.
Mandarin, Cantonese, Japanese, Korean.
Arabic, Hebrew, Turkish.
And more.

Configure a per-workspace language hint if auto-detection struggles.

Audio retention

Audio files are stored for 90 days as standard.
On Enterprise, they are stored for 1 year.
Encrypted at rest.
Accessible to agents via conversation view.
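As a rough illustration of the retention windows, the sketch below computes when a recording would fall out of retention. The plan keys, the function name, and the choice to treat "1 year" as 365 days are assumptions made for the example; the article does not specify how the cutoff is calculated.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from this article; "1 year" is assumed to mean 365 days.
RETENTION_DAYS = {"standard": 90, "enterprise": 365}


def audio_expires_at(received_at: datetime, plan: str) -> datetime:
    """Return the moment the audio would leave retention (hypothetical helper)."""
    days = RETENTION_DAYS.get(plan, RETENTION_DAYS["standard"])
    return received_at + timedelta(days=days)


received = datetime(2026, 5, 15, tzinfo=timezone.utc)
print(audio_expires_at(received, "standard").date())    # 2026-08-13
print(audio_expires_at(received, "enterprise").date())  # 2027-05-15
```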

Player in conversation view

Agents see:

Audio player with play/pause/seek.
Transcript below with timestamps.
Click any word to jump to that moment.

Useful for QA and verification.
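Click-to-seek of this kind usually relies on per-word timestamps. The sketch below shows one way it could work, assuming each transcript word carries a start time in seconds; the `Word` type and both helpers are hypothetical and do not describe AskVault's player.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio (assumed field)


def seek_position(words: List[Word], index: int) -> float:
    """Clicking word `index` moves the player to that word's start time."""
    return words[index].start


def word_at(words: List[Word], position: float) -> int:
    """Return the index of the word being spoken at `position` seconds."""
    starts = [w.start for w in words]
    return max(bisect_right(starts, position) - 1, 0)


words = [Word("I", 0.0), Word("need", 0.4), Word("a", 0.9), Word("refund", 1.1)]
print(seek_position(words, 3))      # 1.1
print(words[word_at(words, 1.0)].text)  # a
```

The reverse lookup (`word_at`) is what would let a player highlight the current word during playback.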

Bot capabilities

The bot can:

Answer questions from voice. It treats the transcript like text input.
Acknowledge voice with text ("I heard your message about refunds. Here's our policy...").
Trigger skills based on transcribed intent.

What the bot doesn't do today:

Reply with voice on WhatsApp. Planned.
Process voice quality cues (tone, urgency). Planned via voice sentiment analysis.

Quality considerations

For best results, customers should:

Speak clearly in a quiet environment.
Use the channel's native voice feature (not an attached audio file).
Keep messages under 60 seconds for best transcription.

Limits

Per-message duration. 5 minutes.
Audio formats. Channel-native (OGG for WhatsApp, MP3 for Telegram, etc.).
Transcription latency. Typically under 5 seconds.
File size cap. 20 MB per audio file.
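If you pre-screen audio before it reaches a channel, the duration and size limits can be checked with something like the sketch below. The function and constant names are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption; the article does not say which definition applies.

```python
from typing import List

MAX_DURATION_SECONDS = 5 * 60          # per-message duration limit
MAX_FILE_BYTES = 20 * 1024 * 1024      # 20 MB, assuming 1 MB = 1024 * 1024 bytes


def check_limits(duration_seconds: float, size_bytes: int) -> List[str]:
    """Return a list of limit violations; an empty list means the audio fits."""
    problems = []
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 5 minutes")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 20 MB")
    return problems


print(check_limits(45, 300_000))            # []
print(check_limits(301, 21 * 1024 * 1024))  # ['duration exceeds 5 minutes', 'file exceeds 20 MB']
```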

Common pitfalls

Garbled transcription. Noisy audio produces a poor transcript, so the bot may misunderstand. An agent should review the audio.

Wrong language detected. The bot replies in the wrong language. Set a workspace language hint.

Long voice ramble. A customer talks for 4 minutes and the bot extracts intent imperfectly. Encourage shorter messages.

FAQ

Will customers know it's a bot if they send voice?

The bot responds in text by default, so customers can tell from the response style.

Can I disable voice transcription?

Yes, under Workspace Settings > Voice. Audio is still archived, but transcription is skipped.

Does voice count against my query quota?

Yes. Each transcribed voice message plus its bot reply counts as 1 query.
