How to evaluate AI customer support bots
The six criteria
Rank platforms on:
- Source citation. Does every answer cite a source you can verify?
- Hallucination rate. What % of answers contain made-up info?
- Channel coverage. Are widget, WhatsApp, Slack, email, and voice all supported natively?
- Integration depth. Native HubSpot/Salesforce/Stripe vs Zapier-only?
- Compliance posture. SOC 2, HIPAA, GDPR evidence available?
- Total cost. Platform subscription plus integration costs plus engineering time.
Evaluation method
Run a 30-question evaluation:
- Index the same content in 2 to 4 candidate platforms.
- Prepare 30 representative customer questions.
- Send each to each platform.
- Score per criterion.
- Tally.
About 4 to 8 hours per platform. Worth the time before a multi-year commitment.
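The steps above can be sketched as a small harness. The `ask` callable is a placeholder for whatever reaches each vendor's bot (API client, widget automation); the function names here are assumptions, not any vendor's real API.

```python
def run_evaluation(platforms, questions, ask):
    """Send every question to every platform.

    Returns (platform, question, answer) rows ready for per-criterion
    scoring by hand.
    """
    rows = []
    for platform in platforms:
        for question in questions:
            rows.append((platform, question, ask(platform, question)))
    return rows
```

With 4 platforms and 30 questions this yields 120 rows; logging them to a spreadsheet keeps scoring auditable.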
Source citation scoring
Per answer:
- Score 1. No citation.
- Score 2. Generic "from documentation" without URL.
- Score 3. Cited URL but inaccurate (page doesn't say what bot claims).
- Score 4. Accurate citation, but the exact supporting passage is hard to verify.
- Score 5. Accurate citation, exact passage verifiable.
Aim for a 4.5+ average. Below 4, the bot isn't fit for audited environments.
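The rubric and thresholds above reduce to a short helper. A minimal sketch; the verdict labels follow this document's thresholds, not an industry standard.

```python
def citation_summary(scores):
    """scores: per-answer rubric integers from 1 to 5."""
    avg = sum(scores) / len(scores)
    if avg >= 4.5:
        verdict = "pass"
    elif avg >= 4.0:
        verdict = "marginal"
    else:
        verdict = "audit-unfit"
    return round(avg, 2), verdict
```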
Hallucination check
Take 10 questions you know the answers to. Score:
- 0% hallucination. Excellent.
- Under 5%. Acceptable for low-stakes.
- 5 to 15%. Risky for paid customer support.
- Above 15%. Not production-ready.
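The banding above, as a sketch, given counts from your known-answer questions:

```python
def hallucination_band(hallucinated, total):
    """Map a hallucination count over `total` known-answer questions to a band."""
    rate = 100.0 * hallucinated / total
    if rate == 0:
        return "excellent"
    if rate < 5:
        return "acceptable for low-stakes"
    if rate <= 15:
        return "risky for paid customer support"
    return "not production-ready"
```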
Channel coverage matrix
Per platform:
| Channel | Native? | Add-on cost |
|---|---|---|
| Widget | Yes / No | $ |
| WhatsApp | Yes / No | $ |
| Slack | Yes / No | $ |
| Email | Yes / No | $ |
| Voice | Yes / No | $ |
| Telegram | Yes / No | $ |
If most cells are "via Zapier" or "with add-on", expect 30 to 60% higher total cost.
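One way to tally the filled-in matrix: represent each row as `(is_native, monthly_addon_cost)` and sum the two columns. The dollar figures in the test are illustrative placeholders, not real vendor pricing.

```python
def channel_tally(matrix):
    """matrix: {channel: (is_native, monthly_addon_cost)}.

    Returns (count of native channels, total monthly add-on cost for the rest).
    """
    native_count = sum(1 for is_native, _ in matrix.values() if is_native)
    addon_cost = sum(cost for is_native, cost in matrix.values() if not is_native)
    return native_count, addon_cost
```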
Integration depth
For each integration you need:
- Native (real-time, OAuth-based, free with platform).
- Workaround (Zapier, custom webhook, manual sync).
- Not supported.
Native is dramatically better. Workaround adds latency and operational burden.
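The three tiers can be encoded so a candidate's integration list scores numerically. The 3/2/1 weights are an assumption for illustration, not from the text; adjust them to how much a workaround actually costs you.

```python
from enum import Enum

class Depth(Enum):
    NATIVE = 3       # real-time, OAuth-based, free with platform
    WORKAROUND = 2   # Zapier, custom webhook, manual sync
    UNSUPPORTED = 1

def integration_score(depths):
    """Average tier across the integrations you need; higher is better."""
    return sum(d.value for d in depths) / len(depths)
```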
Compliance evidence
Ask vendors:
- SOC 2 Type II report (most recent).
- HIPAA BAA template (if applicable to you).
- GDPR data-processor agreement.
- Penetration test summary.
- Sub-processor list.
If they can't share within 5 business days, that's a signal. Enterprise-ready vendors have these on demand.
Total cost calculation
Beyond list price:
- Subscription cost.
- Per-channel add-ons and per-integration fees (e.g., Zapier tasks).
- Engineering time (setup plus maintenance per year).
- Migration cost if switching.
- Switching cost (lock-in).
A "cheap" platform with a high engineering burden may cost more than an "expensive" turnkey one.
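An illustrative annual model for the line items above. The default hourly rate is an assumption; substitute your own loaded engineering cost.

```python
def annual_tco(subscription, channel_addons, eng_hours,
               hourly_rate=120, migration=0):
    """Annual total cost of ownership: list price plus the hidden line items."""
    return subscription + channel_addons + eng_hours * hourly_rate + migration
```

On these assumptions, a $5,000/yr platform needing 100 engineering hours plus $1,200 in add-ons already runs $18,200.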
Sample scorecard
For a mid-size SaaS:
| Criterion | Weight | Platform A | Platform B | AskVault |
|---|---|---|---|---|
| Source citation | 25% | 3.2 | 4.5 | 4.8 |
| Hallucination | 25% | 12% | 4% | 2% |
| Channels native | 15% | 1 | 4 | 13 |
| Integrations native | 15% | 2 | 6 | 12 |
| Compliance | 10% | None | SOC 2 | SOC 2 + HIPAA |
| Total cost (annual) | 10% | $5,000 | $25,000 | $4,800 |
Score against your own weights.
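Since the criteria mix scales (1-5 rubric scores, percentages, dollar amounts), one sketch of weighted scoring is to min-max normalize each criterion's column to 0-1 across platforms, inverting lower-is-better metrics such as hallucination rate and cost, then apply your weights.

```python
def normalize(column, lower_is_better=False):
    """Min-max normalize one criterion's values across platforms to 0-1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [1.0] * len(column)
    scaled = [(v - lo) / (hi - lo) for v in column]
    return [1 - s for s in scaled] if lower_is_better else scaled

def weighted_scores(table, weights, lower_is_better):
    """table: {criterion: [value per platform]}; returns one total per platform."""
    n = len(next(iter(table.values())))
    totals = [0.0] * n
    for criterion, column in table.items():
        for i, s in enumerate(normalize(column, criterion in lower_is_better)):
            totals[i] += weights[criterion] * s
    return totals
```

The best platform on every criterion scores 1.0 and the worst 0.0, so totals are comparable across very different candidate sets.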
Common pitfalls
Evaluating on demos. Vendor cherry-picks. Insist on your own 30-question test.
Ignoring channel coverage. Easy to skip; expensive to retrofit.
Underweighting compliance. Missing evidence lets procurement kill the deal 12 months in.
Trial too short. 14 days isn't enough for full eval. Request 30.
FAQ
How do I score citation accuracy without checking every page?
Spot-check 5 random citations per platform. Not statistically airtight, but enough to surface systematic citation problems.
Should hallucination rate be 0%?
Practically, no. Aim for under 2% on text questions. Architecture-enforced grounding (RAG plus citation) is the key.
How long does a full evaluation take?
4 to 8 hours per platform; 16 to 32 hours total for 4 candidates. Worth it.
Related guides
- What is RAG?
- Hallucination prevention
- Compare to Chatbase
- Compare to SiteGPT
- Choosing between LLM providers