How to evaluate AI customer support bots

5 min read

The six criteria

Rank platforms on:

  1. Source citation. Does every answer cite a source you can verify?
  2. Hallucination rate. What % of answers contain made-up info?
  3. Channel coverage. Are widget, WhatsApp, Slack, email, and voice supported natively?
  4. Integration depth. Native HubSpot/Salesforce/Stripe vs Zapier-only?
  5. Compliance posture. SOC 2, HIPAA, GDPR evidence available?
  6. Total cost. What does it cost including platform, integrations, and engineering time?

Evaluation method

Run a 30-question evaluation:

  1. Index the same content in 2 to 4 candidate platforms.
  2. Prepare 30 representative customer questions.
  3. Send each to each platform.
  4. Score per criterion.
  5. Tally.

About 4 to 8 hours per platform. Worth the time before a multi-year commitment.
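The steps above can be sketched as a small harness. The two sample questions and the stub `ask` callables are placeholders — swap in your real 30-question set and API clients for each candidate platform:

```python
# Sketch of the evaluation loop; platform clients are stubbed as callables.
QUESTIONS = [
    "How do I reset my password?",
    "What is your refund policy?",
    # ...extend to your full 30-question set
]

def evaluate(platforms: dict, questions: list[str]) -> dict:
    """Send every question to every platform; collect raw answers for scoring."""
    return {
        name: [{"question": q, "answer": ask(q)} for q in questions]
        for name, ask in platforms.items()
    }

# Stub clients -- replace with real API calls per platform.
platforms = {
    "platform_a": lambda q: f"A's answer to: {q}",
    "platform_b": lambda q: f"B's answer to: {q}",
}
results = evaluate(platforms, QUESTIONS)
```

Scoring per criterion then happens on the collected answers, so every platform is judged on identical inputs.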

Source citation scoring

Per answer:

  • Score 1. No citation.
  • Score 2. Generic "from documentation" without URL.
  • Score 3. Cited URL but inaccurate (page doesn't say what bot claims).
  • Score 4. Accurate citation, but the relevant passage is hard to pin down.
  • Score 5. Accurate citation, exact passage verifiable.

Aim for a 4.5+ average. Below 4, the platform isn't fit for audited environments.
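Applying the rubric is simple arithmetic over the per-answer scores. A minimal sketch, with an illustrative score list:

```python
def citation_verdict(scores: list[int]) -> str:
    """Map per-answer citation scores (1-5 rubric) to a verdict."""
    avg = sum(scores) / len(scores)
    if avg >= 4.5:
        return "audit-ready"
    if avg >= 4.0:
        return "borderline"
    return "audit-unfit"

print(citation_verdict([5, 5, 4, 5, 4]))  # avg 4.6 -> audit-ready
```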

Hallucination check

Take 10 questions you know the answers to. Score:

  • 0% hallucination. Excellent.
  • Under 5%. Acceptable for low-stakes.
  • 5 to 15%. Risky for paid customer support.
  • Above 15%. Not production-ready.
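The bands above translate directly into a classifier. A sketch, assuming you count hallucinated answers out of the questions you checked:

```python
def hallucination_band(hallucinated: int, total: int) -> str:
    """Classify a measured hallucination rate into the bands above."""
    rate = 100 * hallucinated / total
    if rate == 0:
        return "excellent"
    if rate < 5:
        return "acceptable for low-stakes"
    if rate <= 15:
        return "risky for paid support"
    return "not production-ready"

print(hallucination_band(1, 30))  # 3.3% -> acceptable for low-stakes
```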

Channel coverage matrix

Per platform:

| Channel  | Native?  | Add-on cost |
|----------|----------|-------------|
| Widget   | Yes / No | $           |
| WhatsApp | Yes / No | $           |
| Slack    | Yes / No | $           |
| Email    | Yes / No | $           |
| Voice    | Yes / No | $           |
| Telegram | Yes / No | $           |

If most cells are "via Zapier" or "with add-on", expect 30 to 60% higher total cost.
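Filling the matrix in code makes the tally trivial. The add-on prices below are illustrative assumptions, not vendor quotes:

```python
# Coverage matrix for one platform; addon_cost is $/month (assumed figures).
matrix = {
    "widget":   {"native": True,  "addon_cost": 0},
    "whatsapp": {"native": False, "addon_cost": 50},
    "slack":    {"native": False, "addon_cost": 30},
    "email":    {"native": True,  "addon_cost": 0},
    "voice":    {"native": False, "addon_cost": 120},
    "telegram": {"native": False, "addon_cost": 20},
}

non_native = [c for c, v in matrix.items() if not v["native"]]
monthly_addons = sum(v["addon_cost"] for v in matrix.values())
print(non_native)      # 4 of 6 channels need add-ons
print(monthly_addons)  # $/month on top of the subscription
```

If the non-native list covers most of the channels you need, budget for the 30 to 60% cost uplift.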

Integration depth

For each integration you need:

  • Native (real-time, OAuth-based, free with platform).
  • Workaround (Zapier, custom webhook, manual sync).
  • Not supported.

Native is dramatically better. Workaround adds latency and operational burden.
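One way to make the three tiers scoreable is a simple enum with point values; the integrations listed are hypothetical examples, and the point weights are an assumption you should tune:

```python
from enum import Enum

class IntegrationDepth(Enum):
    NATIVE = 3       # real-time, OAuth-based, free with platform
    WORKAROUND = 1   # Zapier, custom webhook, manual sync
    UNSUPPORTED = 0

# Hypothetical needs for one platform under evaluation.
needs = {
    "hubspot":    IntegrationDepth.NATIVE,
    "stripe":     IntegrationDepth.WORKAROUND,
    "salesforce": IntegrationDepth.UNSUPPORTED,
}

native_count = sum(1 for d in needs.values() if d is IntegrationDepth.NATIVE)
depth_score = sum(d.value for d in needs.values())
print(native_count, depth_score)  # 1 native integration, 4 points total
```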

Compliance evidence

Ask vendors:

  • SOC 2 Type II report (most recent).
  • HIPAA BAA template (if applicable to you).
  • GDPR data-processor agreement.
  • Penetration test summary.
  • Sub-processor list.

If they can't share within 5 business days, that's a signal. Enterprise-ready vendors have these on demand.

Total cost calculation

Beyond list price:

  • Subscription cost.
  • Per-channel add-ons (Zapier plans, per-integration fees).
  • Engineering time (setup plus maintenance per year).
  • Migration cost to move your content and history onto the platform.
  • Exit cost if you later switch away (lock-in).

A "cheap" platform with a high engineering burden may cost more than an "expensive" turnkey one.
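A quick worked example of that point. All figures below (hours, hourly rate, add-on totals) are assumptions to replace with your own estimates:

```python
def total_annual_cost(subscription, addons, eng_hours, hourly_rate, migration=0):
    """First-year total cost of ownership from your own estimates."""
    return subscription + addons + eng_hours * hourly_rate + migration

# Assumed: the cheap platform needs 200 eng hours/yr at $100/hr plus $2,400 in add-ons.
cheap = total_annual_cost(subscription=5_000, addons=2_400, eng_hours=200, hourly_rate=100)
turnkey = total_annual_cost(subscription=25_000, addons=0, eng_hours=20, hourly_rate=100)
print(cheap, turnkey)  # 27400 27000 -- the "cheap" platform costs more
```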

Sample scorecard

For a mid-size SaaS:

| Criterion           | Weight | Platform A | Platform B | AskVault      |
|---------------------|--------|------------|------------|---------------|
| Source citation     | 25%    | 3.2        | 4.5        | 4.8           |
| Hallucination       | 25%    | 12%        | 4%         | 2%            |
| Channels native     | 15%    | 1          | 4          | 13            |
| Integrations native | 15%    | 2          | 6          | 12            |
| Compliance          | 10%    | None       | SOC 2      | SOC 2 + HIPAA |
| Total cost (annual) | 10%    | $5,000     | $25,000    | $4,800        |

Score against your own weights.
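The weighted tally is a one-liner once each criterion is normalized to a common scale. A sketch using the sample weights above; the normalized scores for "platform_b" are illustrative, not derived from the table:

```python
# Weights from the sample scorecard; scores assumed normalized to 0-5.
WEIGHTS = {"citation": 0.25, "hallucination": 0.25, "channels": 0.15,
           "integrations": 0.15, "compliance": 0.10, "cost": 0.10}

def weighted_score(scores: dict[str, float]) -> float:
    return round(sum(WEIGHTS[c] * s for c, s in scores.items()), 2)

platform_b = {"citation": 4.5, "hallucination": 4.0, "channels": 3.5,
              "integrations": 3.0, "compliance": 4.0, "cost": 2.5}
print(weighted_score(platform_b))  # 3.75
```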

Common pitfalls

Evaluating on demos. Vendors cherry-pick. Insist on your own 30-question test.

Ignoring channel coverage. Easy to skip; expensive to retrofit.

Underweighting compliance. Missing audit evidence lets procurement kill the deal 12 months in.

Trial too short. 14 days isn't enough for a full evaluation. Request 30.

FAQ

How do I score citation accuracy without checking every page?

Spot-check 5 random citations per platform. That's enough to surface systematic problems; expand the sample if results are mixed.

Should hallucination rate be 0%?

Practically, no. Aim for under 2% on text questions. Architecture-enforced grounding (RAG plus citation) is the key.

How long does a full evaluation take?

4 to 8 hours per platform; 16 to 32 hours total for 4 candidates. Worth it.
