POST /v1/query/stream: Server-Sent Events reference
Endpoint
POST https://api.askvault.co/v1/query/stream

Same request schema as POST /v1/query. The response is a Server-Sent Events stream instead of synchronous JSON.
Authentication: Authorization: Bearer ak_xxx. See authentication.
When to use streaming
Two cases where streaming wins over synchronous:
- Live chat UI. The customer sees the bot "typing" word by word. Perceived latency drops dramatically even though total latency is the same.
- Long answers. Synchronous calls might time out at 30 seconds for very long responses. Streaming sends tokens as they're generated, so there's no single 30-second boundary to hit.
For backend automation (a nightly job processing emails), synchronous is simpler. For any user-facing real-time UI, streaming.
Minimal example
```bash
curl -N -X POST https://api.askvault.co/v1/query/stream \
  -H "Authorization: Bearer ak_xxx" \
  -H "Content-Type: application/json" \
  -d '{"workspace_id": "wt_xxx", "message": "How does pricing work?"}'
```

```python
import os

import requests

with requests.post(
    "https://api.askvault.co/v1/query/stream",
    headers={"Authorization": f"Bearer {os.environ['ASKVAULT_API_KEY']}"},
    json={"workspace_id": "wt_xxx", "message": "How does pricing work?"},
    stream=True,
    timeout=60,
) as r:
    for line in r.iter_lines():
        if line and line.startswith(b"data: "):
            print(line[6:].decode())
```

```javascript
const response = await fetch("https://api.askvault.co/v1/query/stream", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.ASKVAULT_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ workspace_id: "wt_xxx", message: "How does pricing work?" }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop();
  for (const line of lines) {
    if (line.startsWith("data: ")) {
      const event = JSON.parse(line.slice(6));
      console.log(event);
    }
  }
}
```

The curl -N flag disables output buffering so you see tokens as they arrive.
Event types
The stream sends multiple event types. Parse each one based on the type field:
token
Partial answer text. Many of these arrive as the LLM generates the response.
{ "type": "token", "text": "To "}Concatenate the text values in order to assemble the full answer.
source
A citation as it's selected during retrieval. Arrives before token events.
{ "type": "source", "document_id": "doc_a1b2c3", "document_title": "Refund policy", "url": "https://acme.co/policies/refunds", "relevance_score": 0.94, "snippet": "Refunds are available..."}Multiple source events per stream, typically 3 to 5.
done
Stream completion. Always the last event.
{ "type": "done", "confidence": "high", "tokens_used": 187, "latency_ms": 1842, "request_id": "req_xxx", "conversation_id": "conv_xxx"}After done, the server closes the connection.
error
Stream-level error. Stops the stream.
{ "type": "error", "code": "model_provider_error", "detail": "Upstream LLM provider returned 503"}If you receive an error event, the connection closes. Retry with backoff for transient codes (model_provider_error, provider_down); don't retry for permanent codes (invalid_workspace_id, etc.).
Latency profile
Typical timing of events:
- 0 to 100 ms. Request validated, retrieval begins.
- 100 to 250 ms. source events arrive (3 to 5 of them in quick succession).
- 250 to 350 ms. First token event arrives.
- 350 ms to 2.5 seconds. Token stream continues.
- 2.5 to 4 seconds. done event arrives, connection closes.
First-token latency under 300 ms is the key UX number. The customer sees the bot start typing almost immediately, which feels fast even if the full answer takes 3 seconds.
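To verify that number against your own traffic, you can time the first token client-side. A sketch reusing the Python example from above:

```python
import json
import os
import time

import requests

start = time.monotonic()
with requests.post(
    "https://api.askvault.co/v1/query/stream",
    headers={"Authorization": f"Bearer {os.environ['ASKVAULT_API_KEY']}"},
    json={"workspace_id": "wt_xxx", "message": "How does pricing work?"},
    stream=True,
    timeout=60,
) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        if json.loads(line[6:])["type"] == "token":
            # Time from sending the request to the first visible token.
            print(f"first token after {(time.monotonic() - start) * 1000:.0f} ms")
            break
```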
Request parameters
Same as POST /v1/query. One stream-specific extension:
| Field | Type | Required | Description |
|---|---|---|---|
| include_sources_in_text | boolean | No | If true, source citation markers like [1], [2] appear inline in the token stream. Default false. |
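For example, the same Python request as above with the flag enabled; everything except the new field is unchanged:

```python
payload = {
    "workspace_id": "wt_xxx",
    "message": "How does pricing work?",
    # Token events will now carry inline markers like [1], [2].
    "include_sources_in_text": True,
}
```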
Parsing the stream
Server-Sent Events format:
```
event: message
data: {"type":"token","text":"Hello"}

event: message
data: {"type":"token","text":" world"}

event: message
data: {"type":"done","confidence":"high",...}
```

Each event has:

- An optional event: <name> line (AskVault uses message for all).
- A data: <json> line with the event payload.
- A blank line as the terminator.
Standard SSE parsers handle this. If you're writing a parser by hand, split on \n\n to find event boundaries, then on \n to find lines within an event.
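If you do write one by hand, here is a minimal sketch of that approach, buffering across reads so an event split by a chunk boundary isn't dropped:

```python
import json

def iter_sse_events(chunks):
    """Yield parsed event payloads from an iterable of raw byte chunks.

    Splits on blank lines to find event boundaries, then on newlines
    to find the data: line within each event.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            for line in raw_event.split(b"\n"):
                if line.startswith(b"data: "):
                    yield json.loads(line[6:])

# Usage with the requests example above:
#   for event in iter_sse_events(r.iter_content(chunk_size=None)):
#       print(event["type"])
```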
Error recovery
For 5xx errors mid-stream:
- Close the stream connection.
- Wait 2^attempt seconds (exponential backoff).
- Retry the same request, optionally with conversation_id to resume context.
- Max 5 attempts.
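A sketch of that loop, where stream_query and TransientStreamError are hypothetical stand-ins for your own stream consumer and its retryable-failure exception:

```python
import time

class TransientStreamError(Exception):
    """Retryable failure: 5xx mid-stream, model_provider_error, provider_down."""

MAX_ATTEMPTS = 5

def query_with_retry(payload, stream_query):
    # Keeping conversation_id in the payload lets a retry resume
    # the same conversation context.
    for attempt in range(MAX_ATTEMPTS):
        try:
            return stream_query(payload)
        except TransientStreamError:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1, 2, 4, 8 seconds
```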
For client-side errors (network drop, browser tab backgrounded): don't auto-retry; show the user the partial response and a "regenerate" button.
Cancellation
To cancel a stream mid-flight, close the connection on the client. AskVault detects the close and stops generation within 200 ms, freeing the compute budget.
Cancelled streams count as 1 query against your quota (same as completed streams). We don't refund quota for cancellations.
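Client-side, cancelling just means closing the connection early. In the Python example that's breaking out of the read loop; the stop-button hook here is hypothetical:

```python
import os

import requests

def user_clicked_stop() -> bool:
    """Hypothetical hook: wire this to your UI's stop control."""
    return False

with requests.post(
    "https://api.askvault.co/v1/query/stream",
    headers={"Authorization": f"Bearer {os.environ['ASKVAULT_API_KEY']}"},
    json={"workspace_id": "wt_xxx", "message": "How does pricing work?"},
    stream=True,
    timeout=60,
) as r:
    for line in r.iter_lines():
        if user_clicked_stop():
            # Exiting the with block closes the connection; AskVault
            # detects the close and stops generation within 200 ms.
            break
```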
Limits
- Plan availability. Same as synchronous: Free through Enterprise.
- Rate limits. Per the rate limits page. Streaming queries count the same as synchronous.
- Concurrent streams per key. 50 on Growth, 200 on Business. Higher concurrency available on Enterprise.
Common pitfalls
Connection hangs after the first event. Output buffering in your HTTP client. Disable it (curl -N; in Python requests, pass stream=True and read with iter_lines(); etc.).
Some tokens missing. Hand-rolled SSE parsers drop events when a chunk boundary splits an event mid-line. Buffer across reads, or better, use a battle-tested SSE library.
Stream feels slow on first query, fast on subsequent. Cold-start latency. Pre-warm with a noop query at app start.
Done event never arrives. Connection dropped. Treat the partial response as final and show a "regenerate" option.
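For the cold-start pitfall, a pre-warm sketch; the throwaway message is an assumption, and note that an accepted pre-warm query still counts against quota like any other:

```python
import os
import threading

import requests

def prewarm():
    # Any short query warms the pipeline; close the stream immediately
    # since we only care about the side effect, not the answer.
    requests.post(
        "https://api.askvault.co/v1/query/stream",
        headers={"Authorization": f"Bearer {os.environ['ASKVAULT_API_KEY']}"},
        json={"workspace_id": "wt_xxx", "message": "ping"},
        stream=True,
        timeout=60,
    ).close()

# Fire and forget at app start so the first real query skips the cold start.
threading.Thread(target=prewarm, daemon=True).start()
```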
FAQ
Does streaming cost more than synchronous?
No. Same per-query cost.
Can I use streaming in a browser?
Yes, but your API key would be exposed. Proxy through your backend or use the widget channel which authenticates with a public workspace token.
How do I show "typing" indicators in my UI?
Show a typing animation as soon as you send the request. Hide it when the first token event arrives and start appending tokens to the bubble.
Can I get the source citations before tokens arrive?
Yes, source events arrive before tokens in the stream. Render the source list as soon as you see them, then start streaming the answer.
Does cancellation save quota?
No. Once the request is accepted, it counts against your quota regardless of completion. Cancellation only saves the compute time on our side.
Related guides
- POST /v1/query reference
- API authentication
- Rate limits per plan
- Error codes
- How to embed an AI chatbot on a React website