
POST /v1/query/stream: Server-Sent Events reference


Endpoint

POST https://api.askvault.co/v1/query/stream

Same request schema as POST /v1/query. The response is a Server-Sent Events stream instead of synchronous JSON.

Authentication: Authorization: Bearer ak_xxx. See authentication.

When to use streaming

Two cases where streaming wins over synchronous:

  1. Live chat UI. The customer sees the bot "typing" word by word. Perceived latency drops dramatically even though total latency is the same.
  2. Long answers. Synchronous calls might time out at 30 seconds for very long responses. Streaming sends tokens as they're generated, so there's no single 30-second boundary to hit.

For backend automation (a nightly job processing emails), synchronous is simpler. For any user-facing real-time UI, use streaming.

Minimal example

curl -N -X POST https://api.askvault.co/v1/query/stream \
  -H "Authorization: Bearer ak_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "workspace_id": "wt_xxx",
    "message": "How does pricing work?"
  }'

The curl -N flag disables output buffering so you see tokens as they arrive.
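The same request can be made from Python. The sketch below uses only the standard library; `stream_query` and `parse_data_line` are illustrative names, not part of any SDK, and the response object is read line by line so tokens surface as they arrive:

```python
import json
import urllib.request

API_URL = "https://api.askvault.co/v1/query/stream"

def parse_data_line(line: str):
    """Parse one SSE 'data:' line into an event dict; return None for other lines."""
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None

def stream_query(api_key: str, workspace_id: str, message: str):
    """Yield parsed event dicts from the streaming endpoint as they arrive."""
    body = json.dumps({"workspace_id": workspace_id, "message": message}).encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # HTTPResponse is line-iterable, so we see each SSE line
            event = parse_data_line(raw.decode("utf-8").rstrip("\n"))
            if event is not None:
                yield event
                if event.get("type") in ("done", "error"):
                    break  # server closes after these; stop reading

# usage (hypothetical ids):
# for event in stream_query("ak_xxx", "wt_xxx", "How does pricing work?"):
#     print(event)
```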

Event types

The stream sends multiple event types. Parse each one based on the type field:

token

Partial answer text. Many of these arrive as the LLM generates the response.

{
  "type": "token",
  "text": "To "
}

Concatenate the text values in order to assemble the full answer.
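That assembly is a one-liner over the parsed events; `assemble_answer` is a hypothetical helper name:

```python
def assemble_answer(events):
    """Concatenate the text of 'token' events, preserving arrival order."""
    return "".join(e["text"] for e in events if e.get("type") == "token")
```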

source

A citation as it's selected during retrieval. Arrives before token events.

{
  "type": "source",
  "document_id": "doc_a1b2c3",
  "document_title": "Refund policy",
  "url": "https://acme.co/policies/refunds",
  "relevance_score": 0.94,
  "snippet": "Refunds are available..."
}

Multiple source events per stream, typically 3 to 5.

done

Stream completion. Always the last event.

{
  "type": "done",
  "confidence": "high",
  "tokens_used": 187,
  "latency_ms": 1842,
  "request_id": "req_xxx",
  "conversation_id": "conv_xxx"
}

After done, the server closes the connection.

error

Stream-level error. Stops the stream.

{
  "type": "error",
  "code": "model_provider_error",
  "detail": "Upstream LLM provider returned 503"
}

If you receive an error event, the connection closes. Retry with backoff for transient codes (model_provider_error, provider_down); don't retry for permanent codes (invalid_workspace_id, etc.).
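Putting the four event types together, a minimal dispatcher can fold each parsed event into an accumulator. The `new_state`/`handle_event` names and the state shape are assumptions for illustration, and `TRANSIENT_CODES` reflects only the codes named above:

```python
# Assumed set based on the codes documented above; your error catalogue may be larger.
TRANSIENT_CODES = {"model_provider_error", "provider_down"}

def new_state():
    """Fresh accumulator for one stream."""
    return {"answer": "", "sources": [], "meta": None, "retryable": None}

def handle_event(event, state):
    """Fold one parsed stream event into the accumulator and return it."""
    kind = event.get("type")
    if kind == "token":
        state["answer"] += event["text"]    # partial answer text, in order
    elif kind == "source":
        state["sources"].append(event)      # citations arrive before tokens
    elif kind == "done":
        state["meta"] = event               # confidence, tokens_used, latency_ms, ...
    elif kind == "error":
        state["retryable"] = event.get("code") in TRANSIENT_CODES
    return state
```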

Latency profile

Typical timing of events:

  • 0 to 100 ms. Request validated, retrieval begins.
  • 100 to 250 ms. source events arrive (3 to 5 of them in quick succession).
  • 250 to 350 ms. First token event arrives.
  • 350 ms to 2.5 seconds. Token stream continues.
  • 2.5 to 4 seconds. done event arrives, connection closes.

First-token latency under 300 ms is the key UX number. The customer sees the bot start typing almost immediately, which feels fast even if the full answer takes 3 seconds.

Request parameters

Same as POST /v1/query. One stream-specific extension:

Field: include_sources_in_text
Type: boolean
Required: No
Description: If true, source citation markers like [1], [2] appear inline in the token stream. Default: false.

Parsing the stream

Server-Sent Events format:

event: message
data: {"type":"token","text":"Hello"}

event: message
data: {"type":"token","text":" world"}

event: message
data: {"type":"done","confidence":"high",...}

Each event has:

  • An optional event: <name> line (AskVault uses message for all).
  • A data: <json> line with the event payload.
  • A blank line as the terminator.

Standard SSE parsers handle this. If you're writing a parser by hand, split on \n\n to find event boundaries, then on \n to find lines within an event.
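A hand-rolled parser along those lines might look like this, for illustration only (`split_events` is a hypothetical name; as noted under common pitfalls, prefer a proven SSE library in production):

```python
import json

def split_events(stream_text: str):
    """Split raw SSE text on blank lines, then extract each event's data payload."""
    events = []
    for chunk in stream_text.split("\n\n"):       # blank line terminates an event
        data_lines = [
            ln[len("data:"):].strip()
            for ln in chunk.split("\n")
            if ln.startswith("data:")             # ignore 'event:' and other lines
        ]
        if data_lines:
            # Per the SSE spec, multiple data: lines join with a newline.
            events.append(json.loads("\n".join(data_lines)))
    return events
```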

Error recovery

For 5xx errors mid-stream:

  1. Close the stream connection.
  2. Wait 2^attempt * 1 seconds (exponential backoff).
  3. Retry the same request, optionally with conversation_id to resume context.
  4. Max 5 attempts.

For client-side errors (network drop, browser tab backgrounded): don't auto-retry; show the user the partial response and a "regenerate" button.
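The retry steps above can be sketched as follows. `run_stream` and `TransientStreamError` are hypothetical stand-ins for your transport layer, and attempts are counted from 0 here, so the waits are 1 s, 2 s, 4 s, 8 s:

```python
import time

MAX_ATTEMPTS = 5

class TransientStreamError(Exception):
    """Raised by your transport layer for retryable 5xx mid-stream failures."""

def backoff_delay(attempt: int) -> float:
    """2^attempt * 1 seconds: 1, 2, 4, 8, 16 for attempts 0..4."""
    return 2 ** attempt

def retry_stream(run_stream, conversation_id=None):
    """Retry a streaming request with exponential backoff, up to MAX_ATTEMPTS."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            # Optionally pass conversation_id to resume context on a retry.
            return run_stream(conversation_id=conversation_id)
        except TransientStreamError:
            if attempt == MAX_ATTEMPTS - 1:
                raise                       # out of attempts: surface the failure
            time.sleep(backoff_delay(attempt))
```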

Cancellation

To cancel a stream mid-flight, close the connection on the client. AskVault detects the close and stops generation within 200 ms, freeing the compute budget.

Cancelled streams count as 1 query against your quota (same as completed streams). We don't refund quota for cancellations.

Limits

  • Plan availability. Same as synchronous: Free through Enterprise.
  • Rate limits. Per the rate limits page. Streaming queries count the same as synchronous.
  • Concurrent streams per key. 50 on Growth, 200 on Business. Higher concurrency available on Enterprise.

Common pitfalls

Connection hangs after the first event. Output buffering in your HTTP client. Disable it (curl -N; in Python requests, pass stream=True to the request and read with iter_lines()).

Some tokens missing. Hand-rolled SSE parsers can drop events when an event spans a TCP chunk boundary. Use a battle-tested SSE library, not a hand-rolled parser.

Stream feels slow on first query, fast on subsequent. Cold-start latency. Pre-warm with a noop query at app start.

Done event never arrives. Connection dropped. Treat the partial response as final and show a "regenerate" option.

FAQ

Does streaming cost more than synchronous?

No. Same per-query cost.

Can I use streaming in a browser?

Yes, but your API key would be exposed. Proxy through your backend or use the widget channel which authenticates with a public workspace token.

How do I show "typing" indicators in my UI?

Show a typing animation as soon as you send the request. Hide it when the first token event arrives and start appending tokens to the bubble.

Can I get the source citations before tokens arrive?

Yes, source events arrive before tokens in the stream. Render the source list as soon as you see them, then start streaming the answer.

Does cancellation save quota?

No. Once the request is accepted, it counts against your quota regardless of completion. Cancellation only saves the compute time on our side.
