# Compare models side-by-side
## When to use
Three patterns:
- Picking a default model. Compare models on representative queries before committing.
- Validating an upgrade. Before switching to a higher tier, run sample queries against both.
- Quality regression check. When AskVault rolls out a new model version, compare old and new outputs to confirm answer quality holds.
## How to use
1. Open Chat Playground.
2. Click "Compare Models".
3. Pick 2 to 3 models from your available roster.
4. Type a query.
5. Click Run.
Side-by-side results appear within 5 seconds.
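If you'd rather script a comparison than click through the UI, the same flow can be expressed as an HTTP call. The endpoint, payload fields, and response shape below are illustrative assumptions, not a documented AskVault API; treat this as a sketch of the pattern.

```python
import requests

# Hypothetical endpoint and payload -- AskVault's real API may differ.
API_URL = "https://api.askvault.example/v1/playground/compare"
API_KEY = "YOUR_API_KEY"

payload = {
    "models": ["model-a", "model-b", "model-c"],  # 2 to 3 models, per the limit
    "query": "What's your refund policy?",
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()

# Assumed response shape: one result per model with answer and metrics.
for result in resp.json()["results"]:
    print(result["model"], result["latency_ms"], "ms")
```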
## What's compared
Per response:
- Answer text.
- Cited sources (whether each model cites, and how many).
- Latency (time to first token and total response time).
- Token usage (input plus output).
- Skill triggers (whether each model triggered the same skills).
Useful for objective evaluation.
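If you export or log comparison runs for your own analysis, the per-response fields map naturally onto a small record. A minimal sketch in Python, with field names that are assumptions rather than AskVault's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CompareResult:
    """One model's response in a comparison run (illustrative fields)."""
    model: str
    answer: str                  # full answer text
    citations: int               # number of cited sources
    first_token_ms: float        # time to first token
    total_ms: float              # total response time
    input_tokens: int
    output_tokens: int
    skills_triggered: list[str]  # which skills fired

    @property
    def tokens(self) -> int:
        # Token usage is billed as input plus output.
        return self.input_tokens + self.output_tokens
```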
## Sample comparison
For the query "What's your refund policy?":
| Model | Answer quality | Latency | Tokens | Citations |
|---|---|---|---|---|
| Model A (Free) | Concise, accurate | 1.2s | 320 | 2 sources |
| Model B (Growth) | Detailed, accurate | 2.1s | 580 | 2 sources |
| Model C (Business) | Most thorough | 2.8s | 750 | 3 sources |
Pick based on your priority: speed (A), balance (B), thoroughness (C).
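To make "pick based on your priority" reproducible rather than eyeballed, you can turn the table into a weighted score. A sketch, assuming you assign the subjective quality ratings (0 to 1) yourself:

```python
# Weighted scoring over the sample comparison above. Quality scores are
# subjective ratings you assign; latency and token figures come from the
# comparison output.
candidates = {
    "Model A": {"quality": 0.80, "latency_s": 1.2, "tokens": 320},
    "Model B": {"quality": 0.90, "latency_s": 2.1, "tokens": 580},
    "Model C": {"quality": 0.95, "latency_s": 2.8, "tokens": 750},
}

# Tune weights to your priority: raise w_latency for speed,
# w_quality for thoroughness.
w_quality, w_latency, w_tokens = 0.6, 0.3, 0.1

max_latency = max(c["latency_s"] for c in candidates.values())
max_tokens = max(c["tokens"] for c in candidates.values())

def score(c: dict) -> float:
    # Latency and tokens are normalized so that lower is better.
    return (w_quality * c["quality"]
            + w_latency * (1 - c["latency_s"] / max_latency)
            + w_tokens * (1 - c["tokens"] / max_tokens))

best = max(candidates, key=lambda name: score(candidates[name]))
print(best, round(score(candidates[best]), 3))
```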
## Cost implications
Higher-tier models cost more in token billing. The dashboard estimates cost per query so you can weigh:
- More tokens generally means better quality, but higher operational cost.
- For high-volume bots, choose the smallest model that maintains quality.
- For low-volume bots, prioritize quality.
About 70% of teams default to mid-tier; 20% pick the high-tier; 10% pick the smallest.
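To weigh these trade-offs concretely, estimate cost per query and per month from the token counts in your comparison. The per-1K-token prices below are placeholders, not AskVault's actual rates; substitute your plan's pricing:

```python
# Hypothetical per-1K-token prices -- substitute your plan's real rates.
PRICE_PER_1K = {"Model A": 0.0005, "Model B": 0.002, "Model C": 0.006}

usage = {"Model A": 320, "Model B": 580, "Model C": 750}  # tokens per query
monthly_queries = 50_000  # e.g. a high-volume bot

for model, tokens in usage.items():
    per_query = tokens / 1000 * PRICE_PER_1K[model]
    print(f"{model}: ${per_query:.5f}/query, "
          f"~${per_query * monthly_queries:,.2f}/month "
          f"at {monthly_queries:,} queries")
```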
## Batch comparison
Available on Business plans and above:
1. Open Compare Models > Batch.
2. Upload a CSV of test queries.
3. AskVault runs each query against each model.
4. Download the results as a CSV with answers, latency, and scores.
Useful for regression testing or pre-launch validation.
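Since a batch run is CSV in, CSV out, it's easy to script around. This sketch writes an upload file and summarizes a downloaded results file; the column names (`query`, `model`, `latency_ms`, `score`) are assumptions, so check them against the actual export headers:

```python
import csv
from collections import defaultdict

# Write the upload file: one test query per row (assumed "query" column).
queries = [
    "What's your refund policy?",
    "How do I reset my password?",
    "Do you ship internationally?",
]
with open("test_queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query"])
    writer.writerows([q] for q in queries)

# After the batch finishes, summarize the downloaded results per model.
totals = defaultdict(lambda: {"latency": 0.0, "score": 0.0, "n": 0})
with open("batch_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        t = totals[row["model"]]
        t["latency"] += float(row["latency_ms"])
        t["score"] += float(row["score"])
        t["n"] += 1

for model, t in totals.items():
    print(f"{model}: avg latency {t['latency'] / t['n']:.0f} ms, "
          f"avg score {t['score'] / t['n']:.2f}")
```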
## Limits
| Limit | Value |
|---|---|
| Models per comparison | 3 simultaneous |
| Batch size | 100 queries per batch |
| Playground queries | Unlimited and free |
| Response latency | Under 5 seconds per model |
| Comparison history retention | 90 days |
## Common pitfalls
**Comparing on a single query.** One query is not statistically meaningful; use 10 to 30 queries for confidence.
**Cherry-picking favorable queries.** Test on representative customer queries, not just the easy ones.
**Latency variance.** Single-call latency varies; run five or more calls per model and average the results (see the sketch below).
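A quick way to smooth out that variance is to time several runs of the same query and average. In this sketch, `run_query` is a hypothetical stand-in for however you invoke each model:

```python
import statistics
import time

def run_query(model: str, query: str) -> str:
    """Stand-in for your actual model call (playground API, SDK, etc.)."""
    raise NotImplementedError

def timed_latencies(model: str, query: str, runs: int = 5) -> list[float]:
    # Time each call end to end; perf_counter is monotonic and high-resolution.
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(model, query)
        latencies.append(time.perf_counter() - start)
    return latencies

# Example usage once run_query is wired up:
# latencies = timed_latencies("model-a", "What's your refund policy?")
# print(statistics.mean(latencies), statistics.stdev(latencies))
```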
## FAQ
**Does compare mode count against my quota?**
No. Playground queries are free.
**Can I compare across workspaces?**
Yes, via cross-workspace comparison mode (Enterprise).
**Will my production bot immediately reflect comparison results?**
Only if you change the workspace's default model afterward.