How to evaluate AI customer support bots
The six criteria
Rank platforms on:
- Source citation. Does every answer cite a source you can verify?
- Hallucination rate. What % of answers contain made-up info?
- Channel coverage. Are widget, WhatsApp, Slack, email, and voice all supported natively?
- Integration depth. Native HubSpot/Salesforce/Stripe vs Zapier-only?
- Compliance posture. SOC 2, HIPAA, GDPR evidence available?
- Total cost. Platform subscription plus integration costs plus engineering time.
Evaluation method
Run a 30-question evaluation:
- Index the same content in 2 to 4 candidate platforms.
- Prepare 30 representative customer questions.
- Send each to each platform.
- Score per criterion.
- Tally.
About 4 to 8 hours per platform. Worth the time before a multi-year commitment.
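The steps above can be sketched as a small harness. The `ask` callable is a placeholder for whatever reaches each vendor's bot (API client, widget automation); the function names here are assumptions, not any vendor's real API.

```python
def run_evaluation(platforms, questions, ask):
    """Send every question to every platform.

    Returns (platform, question, answer) rows ready for per-criterion
    scoring by hand.
    """
    rows = []
    for platform in platforms:
        for question in questions:
            rows.append((platform, question, ask(platform, question)))
    return rows
```

With 4 platforms and 30 questions this yields 120 rows; logging them to a spreadsheet keeps scoring auditable.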
Source citation scoring
Per answer:
- Score 1. No citation.
- Score 2. Generic "from documentation" without URL.
- Score 3. Cited URL but inaccurate (page doesn't say what bot claims).
- Score 4. Accurate citation, but the exact supporting passage is hard to verify.
- Score 5. Accurate citation, exact passage verifiable.
Aim for a 4.5+ average. Below 4, the bot isn't fit for audited environments.
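The rubric and thresholds above reduce to a short helper. A minimal sketch; the verdict labels follow this document's thresholds, not an industry standard.

```python
def citation_summary(scores):
    """scores: per-answer rubric integers from 1 to 5."""
    avg = sum(scores) / len(scores)
    if avg >= 4.5:
        verdict = "pass"
    elif avg >= 4.0:
        verdict = "marginal"
    else:
        verdict = "audit-unfit"
    return round(avg, 2), verdict
```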
Hallucination check
Take 10 questions you know the answers to. Score:
- 0% hallucination. Excellent.
- Under 5%. Acceptable for low-stakes.
- 5 to 15%. Risky for paid customer support.
- Above 15%. Not production-ready.
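The banding above, as a sketch, given counts from your known-answer questions:

```python
def hallucination_band(hallucinated, total):
    """Map a hallucination count over `total` known-answer questions to a band."""
    rate = 100.0 * hallucinated / total
    if rate == 0:
        return "excellent"
    if rate < 5:
        return "acceptable for low-stakes"
    if rate <= 15:
        return "risky for paid customer support"
    return "not production-ready"
```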
Channel coverage matrix
Per platform:
| Channel | Native? | Add-on cost |
|---|---|---|
| Widget | Yes / No | $ |
| WhatsApp | Yes / No | $ |
| Slack | Yes / No | $ |
| Email | Yes / No | $ |
| Voice | Yes / No | $ |
| Telegram | Yes / No | $ |
If most cells are "via Zapier" or "with add-on", expect 30 to 60% higher total cost.
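One way to tally the filled-in matrix: represent each row as `(is_native, monthly_addon_cost)` and sum the two columns. The dollar figures in the test are illustrative placeholders, not real vendor pricing.

```python
def channel_tally(matrix):
    """matrix: {channel: (is_native, monthly_addon_cost)}.

    Returns (count of native channels, total monthly add-on cost for the rest).
    """
    native_count = sum(1 for is_native, _ in matrix.values() if is_native)
    addon_cost = sum(cost for is_native, cost in matrix.values() if not is_native)
    return native_count, addon_cost
```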
Integration depth
For each integration you need:
- Native (real-time, OAuth-based, free with platform).
- Workaround (Zapier, custom webhook, manual sync).
- Not supported.
Native is dramatically better. Workaround adds latency and operational burden.
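The three tiers can be encoded so a candidate's integration list scores numerically. The 3/2/1 weights are an assumption for illustration, not from the text; adjust them to how much a workaround actually costs you.

```python
from enum import Enum

class Depth(Enum):
    NATIVE = 3       # real-time, OAuth-based, free with platform
    WORKAROUND = 2   # Zapier, custom webhook, manual sync
    UNSUPPORTED = 1

def integration_score(depths):
    """Average tier across the integrations you need; higher is better."""
    return sum(d.value for d in depths) / len(depths)
```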
Compliance evidence
Ask vendors:
- SOC 2 Type II report (most recent).
- HIPAA BAA template (if applicable to you).
- GDPR data-processor agreement.
- Penetration test summary.
- Sub-processor list.
If they can't share within 5 business days, that's a signal. Enterprise-ready vendors have these on demand.
Total cost calculation
Beyond list price:
- Subscription cost.
- Per-channel add-ons and per-integration fees (e.g., Zapier tasks).
- Engineering time (setup plus maintenance per year).
- Migration cost if switching.
- Switching cost (lock-in).
A "cheap" platform with a high engineering burden may cost more than an "expensive" turnkey one.
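An illustrative annual model for the line items above. The default hourly rate is an assumption; substitute your own loaded engineering cost.

```python
def annual_tco(subscription, channel_addons, eng_hours,
               hourly_rate=120, migration=0):
    """Annual total cost of ownership: list price plus the hidden line items."""
    return subscription + channel_addons + eng_hours * hourly_rate + migration
```

On these assumptions, a $5,000/yr platform needing 100 engineering hours plus $1,200 in add-ons already runs $18,200.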
Sample scorecard
For a mid-size SaaS:
| Criterion | Weight | Platform A | Platform B | AskVault |
|---|---|---|---|---|
| Source citation | 25% | 3.2 | 4.5 | 4.8 |
| Hallucination | 25% | 12% | 4% | 2% |
| Channels native | 15% | 1 | 4 | 13 |
| Integrations native | 15% | 2 | 6 | 12 |
| Compliance | 10% | None | SOC 2 | SOC 2 + HIPAA |
| Total cost (annual) | 10% | $5,000 | $25,000 | $4,800 |
Score against your own weights.
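Since the criteria mix scales (1-5 rubric scores, percentages, dollar amounts), one sketch of weighted scoring is to min-max normalize each criterion's column to 0-1 across platforms, inverting lower-is-better metrics such as hallucination rate and cost, then apply your weights.

```python
def normalize(column, lower_is_better=False):
    """Min-max normalize one criterion's values across platforms to 0-1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [1.0] * len(column)
    scaled = [(v - lo) / (hi - lo) for v in column]
    return [1 - s for s in scaled] if lower_is_better else scaled

def weighted_scores(table, weights, lower_is_better):
    """table: {criterion: [value per platform]}; returns one total per platform."""
    n = len(next(iter(table.values())))
    totals = [0.0] * n
    for criterion, column in table.items():
        for i, s in enumerate(normalize(column, criterion in lower_is_better)):
            totals[i] += weights[criterion] * s
    return totals
```

The best platform on every criterion scores 1.0 and the worst 0.0, so totals are comparable across very different candidate sets.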
Common pitfalls
Evaluating on demos. Vendor cherry-picks. Insist on your own 30-question test.
Ignoring channel coverage. Easy to skip; expensive to retrofit.
Underweighting compliance. Missing evidence lets procurement kill the deal 12 months in.
Trial too short. 14 days isn't enough for full eval. Request 30.
FAQ
How do I score citation accuracy without checking every page?
Spot-check 5 random citations per platform. Not statistically airtight, but enough to surface systematic citation problems.
Should hallucination rate be 0%?
Practically, no. Aim for under 2% on text questions. Architecture-enforced grounding (RAG plus citation) is the key.
How long does a full evaluation take?
4 to 8 hours per platform; 16 to 32 hours total for 4 candidates. Worth it.
Related guides
- What is RAG?
- Hallucination prevention
- Compare to Chatbase
- Compare to SiteGPT
- Choosing between LLM providers