Compare models side-by-side

When to use

Three patterns:

  1. Picking a default model. Compare models on representative queries before committing.
  2. Validating an upgrade. Before switching to a higher tier, run sample queries against both.
  3. Quality regression check. When AskVault rolls out a new model version, verify that answer quality hasn't regressed.

How to use

  1. Open Chat Playground.
  2. Click "Compare Models".
  3. Pick 2 to 3 models from your available roster.
  4. Type a query.
  5. Click Run.
  6. Side-by-side results appear within 5 seconds.
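
This page documents the UI only; if your plan exposes an API, you could in principle script the same comparison. The sketch below is hypothetical: the endpoint URL, payload fields, and auth scheme are assumptions, not AskVault's documented API.

```python
# Hypothetical sketch only -- AskVault's actual API (if one exists) may differ.
# The endpoint URL, payload shape, and auth header are assumptions.
import requests

resp = requests.post(
    "https://api.askvault.example/v1/playground/compare",  # assumed URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "models": ["model-a", "model-b"],  # 2 to 3 models, per the limits below
        "query": "What's your refund policy?",
    },
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["results"]:  # assumed response shape
    print(result["model"], result["latency_ms"], result["answer"][:80])
```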

What's compared

Per response:

  • Answer text.
  • Cited sources (whether each model cites, and which sources it uses).
  • Latency (time to first token and total response time).
  • Token usage (input plus output).
  • Skill triggers (did each model trigger the same skills?).

Useful for objective evaluation.

Sample comparison

For the query "What's your refund policy?":

Model               Answer quality      Latency  Tokens  Citations
Model A (Free)      Concise, accurate   1.2s     320     2 sources
Model B (Growth)    Detailed, accurate  2.1s     580     2 sources
Model C (Business)  Most thorough       2.8s     750     3 sources

Pick based on your priority: speed (A), balance (B), thoroughness (C).

Cost implications

Higher-tier models cost more in token billing. The dashboard estimates cost per query so you can weigh the trade-offs:

  • More tokens generally means better quality, but also higher operational cost.
  • For high-volume bots, choose the smallest model that maintains quality.
  • For low-volume bots, prioritize quality.
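
As a back-of-envelope check, you can reproduce a cost-per-query estimate from the token counts in a comparison. The per-token prices below are placeholders, not AskVault's actual rates; substitute the rates from your plan's billing page.

```python
# Rough cost-per-query estimate from a comparison run.
# NOTE: prices are made-up placeholders -- use your plan's actual rates.
price_per_1k_tokens = {
    "Model A (Free)": 0.0005,      # placeholder, USD per 1K tokens
    "Model B (Growth)": 0.0015,    # placeholder
    "Model C (Business)": 0.0030,  # placeholder
}

# Token totals (input plus output) from the sample comparison above.
tokens_per_query = {
    "Model A (Free)": 320,
    "Model B (Growth)": 580,
    "Model C (Business)": 750,
}

monthly_queries = 50_000  # assumed volume for a high-traffic bot

for model, tokens in tokens_per_query.items():
    per_query = tokens / 1000 * price_per_1k_tokens[model]
    print(f"{model}: ${per_query:.5f}/query, ~${per_query * monthly_queries:,.2f}/month")
```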

About 70% of teams default to mid-tier; 20% pick the high-tier; 10% pick the smallest.

Batch comparison

For Business and above:

  1. Go to Compare Models > Batch.
  2. Upload a CSV of test queries.
  3. AskVault runs each query against each model.
  4. You get back a CSV with answers, latency, and scores.

Useful for regression testing or pre-launch validation.
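
If you assemble the test CSV programmatically, a minimal sketch follows. The single `query` column header is an assumption; check the batch upload dialog for the exact format it expects.

```python
# Write a batch-comparison CSV of test queries (100 queries max per batch).
# Assumption: one "query" column -- confirm the expected header in the upload dialog.
import csv

queries = [
    "What's your refund policy?",
    "How do I reset my password?",
    "Do you ship internationally?",
]

with open("test_queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])
    writer.writerows([q] for q in queries)
```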

Limits

  • Models per comparison: 3 simultaneous.
  • Batch size: 100 queries per batch.
  • Playground queries: unlimited and free.
  • Response latency: under 5 seconds per model.
  • Comparison history retention: 90 days.

Common pitfalls

Comparing on a single query. Not statistically meaningful. Use 10 to 30 queries for confidence.

Cherry-picking favorable queries. Test on representative customer queries, not just easy ones.

Latency variance. Single-call latency varies. Run 5+ calls per model and average.
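
If you time calls yourself rather than reading the playground's numbers, average several runs. A minimal sketch, where run_query stands in for however you actually invoke the model:

```python
# Average latency over repeated calls to smooth single-call variance.
import statistics
import time

def run_query(model: str, query: str) -> None:
    """Placeholder: substitute your actual call to the model."""

def mean_latency(model: str, query: str, runs: int = 5) -> float:
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(model, query)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)
```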

FAQ

Does compare-mode count against my quota?

No. Playground queries are free.

Can I compare across workspaces?

Yes, via cross-workspace comparison mode (Enterprise).

Will my production bot switch models after a comparison?

No. Production changes only if you update the workspace's default model afterward.
