How vector databases work, in plain English
The 60-second version
A traditional database matches exact strings. Search for "password reset" and you get pages with those exact words. A vector database matches meaning. Search for "I forgot my login" and you get pages about password reset, account recovery, "can't sign in", and "trouble accessing your account", even though none of them share words with the query.
The magic is in two steps:
- Embedding. A neural network turns each piece of text into a vector of 768 to 4,096 floating-point numbers. Texts with similar meanings produce vectors that are mathematically close.
- Similarity search. When you query, you embed the query the same way, then ask the database "find the K vectors closest to this one". The database returns them, sorted by similarity.
That's it. Everything else (indexing strategies, dimensionality reduction, hybrid retrieval) is plumbing on top.
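The two steps above can be sketched in a few lines of Python. The three-dimensional vectors here are hand-made toy stand-ins for a real model's 768-to-4,096-dimensional output; in a real system you'd get them from an embedding API.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Step 1: "embed" each document. These tiny vectors are toys; a real
# embedding model produces hundreds or thousands of dimensions.
corpus = {
    "password reset":       [0.90, 0.10, 0.00],
    "forgot my password":   [0.80, 0.20, 0.10],
    "weather in Bangalore": [0.00, 0.10, 0.90],
}

# Step 2: embed the query the same way, then return the K closest vectors.
def top_k(query_vec, k=2):
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, corpus[doc]), reverse=True)
    return ranked[:k]

print(top_k([0.85, 0.15, 0.05]))  # a query vector near the password-related texts
```

This brute-force version is exactly what a vector database does conceptually; the indexing strategies below only make the "find the closest" step fast.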
What's an embedding, really?
An embedding is a list of numbers. For a common production embedding model it's 1,536 numbers. Each piece of text gets one embedding.
"password reset"       -> [0.012, -0.034, 0.078, ..., 0.041]
"forgot my password"   -> [0.018, -0.031, 0.082, ..., 0.039]
"weather in Bangalore" -> [-0.142, 0.211, -0.038, ..., 0.176]

The first two are close (similar meanings). The third is far away (unrelated). The neural network that produces these vectors was trained on hundreds of billions of words of text, learning that "password reset" and "forgot my password" should land near each other in the 1,536-dimensional space.
You don't need to understand HOW the network learned this. You only need to trust that it did, and that the relationship "similar meaning produces similar vectors" holds reliably.
Cosine similarity, the standard distance function
To check if two vectors are "close", we compute their cosine similarity. The math:
cos(A, B) = (A · B) / (|A| * |B|)

That's the dot product of the two vectors divided by the product of their magnitudes. The result is a number between -1 and 1:
- 1.0 means the vectors point in exactly the same direction. Identical meaning.
- 0.0 means orthogonal. Unrelated.
- -1.0 means opposite directions. Antonymous meaning.
In practice, useful matches usually score 0.7 to 0.9. Anything below 0.5 is too distant to be a good match. Anything above 0.95 means you found a near-duplicate.
You almost never need to compute this by hand. Every production vector database has it as a built-in operator.
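Still, a direct implementation of the formula is a useful sanity check. A minimal sketch, using short toy vectors whose similarities you can verify by hand:

```python
import math

def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (|A| * |B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b)

a = [1.0, 0.0]
print(cosine_similarity(a, [1.0, 0.0]))   # same direction   -> 1.0
print(cosine_similarity(a, [0.0, 1.0]))   # orthogonal       -> 0.0
print(cosine_similarity(a, [-1.0, 0.0]))  # opposite         -> -1.0
print(cosine_similarity(a, [1.0, 1.0]))   # 45 degrees apart -> ~0.707
```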
Why "approximate" nearest-neighbor is fine
The naive way to find the K closest vectors is to compute cosine similarity against every vector in the database. For a million vectors, that's a million operations per query. Slow.
Approximate nearest-neighbor (ANN) algorithms cheat. They build an index that lets you find vectors that are "probably in the top K" without comparing against everything. Two common families:
HNSW (Hierarchical Navigable Small World). Builds a graph where each vector is connected to its closest neighbors at multiple "levels". Search starts at the top level (sparse, fast traversal), narrows down at lower levels (dense, more accurate). Sub-millisecond queries on tens of millions of vectors. Most production systems use HNSW.
IVF (Inverted File Index). Clusters vectors into a few thousand buckets. Search probes only the closest buckets, ignores the rest. Faster build time, slightly worse recall than HNSW.
Both lose maybe 1-3% recall (i.e., they might miss 1 to 3 of the top-100 results) in exchange for 10x to 1000x speedup. For RAG use cases, this trade is essentially free; the LLM step is so much slower that retrieval is rarely the bottleneck anyway.
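The bucket idea behind IVF fits in a short sketch. This toy version uses two hand-picked centroids and 2-D vectors; real implementations train thousands of centroids with k-means and tune how many buckets to probe (`nprobe`).

```python
def dist(a, b):
    """Squared Euclidean distance; fine for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyIVF:
    """IVF in miniature: bucket vectors by nearest centroid, probe few buckets."""
    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = {i: [] for i in range(len(centroids))}

    def add(self, vec, label):
        nearest = min(range(len(self.centroids)), key=lambda i: dist(vec, self.centroids[i]))
        self.buckets[nearest].append((label, vec))

    def search(self, query, k=1, nprobe=1):
        # Probe only the nprobe closest buckets instead of scanning everything.
        order = sorted(range(len(self.centroids)), key=lambda i: dist(query, self.centroids[i]))
        candidates = [item for i in order[:nprobe] for item in self.buckets[i]]
        candidates.sort(key=lambda lv: dist(query, lv[1]))
        return [label for label, _ in candidates[:k]]

index = ToyIVF(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add([0.90, 0.10], "password reset")
index.add([0.78, 0.25], "forgot my password")
index.add([0.10, 0.90], "weather in Bangalore")

print(index.search([0.85, 0.15], k=2, nprobe=1))  # scans one bucket, not all three vectors
```

The recall loss comes from `nprobe`: a true nearest neighbor that landed in an unprobed bucket is simply never seen. Raising `nprobe` trades speed back for recall.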
Production vector database options in 2026
A few common picks:
- Postgres with a vector extension. Add the vector extension to a Postgres database. Same database for relational data and vectors. SOC 2 Type II compliance typically comes along if you use a managed Postgres provider. Fast queries at production scale. AskVault uses this pattern.
- Pinecone. Fully managed, designed-for-vectors SaaS. Easy to start with, expensive at scale.
- Weaviate. Open source, GraphQL API. Good if you want self-hosted with rich filtering.
- Qdrant. Open source, Rust. Very fast. Good for latency-sensitive workloads.
- Chroma. Developer-friendly. Lower production maturity than the others.
- Milvus. Open source, designed for billion-scale workloads. Operational overhead is significant.
For most teams the right starting point is whatever Postgres-based vector solution your stack already has. You don't need a separate database for this until you've genuinely outgrown Postgres performance, and most B2B workloads never get close to that point.
Filtering: combining vectors with structured data
Pure vector search returns the K closest vectors regardless of any other property. In practice you usually want to combine it with structured filtering. Examples:
- "Find the closest vectors to this query, but only from documents tagged audience: ["managers"]."
- "Find the closest vectors, but only from documents updated in the last 30 days."
- "Find the closest vectors, but only from this workspace."
The pattern is: build a structured filter, apply it before (pre-filter) or after (post-filter) the vector search. Pre-filtering is usually preferred because it shrinks the search space; post-filtering is needed when the filter is computed dynamically.
For multi-tenant systems, the workspace filter is usually a pre-filter applied at the index level, so you never even consider vectors from other workspaces. That's why AskVault's per-workspace vector partition is structural rather than an after-the-fact filter.
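A plain-Python sketch makes the pre-filter vs post-filter difference concrete. The records and 2-D vectors here are toys; the point is that post-filtering with a small K can return nothing at all when the globally closest vectors belong to the wrong tenant.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Each record: (doc_id, workspace, vector). Toy 2-D vectors stand in for embeddings.
records = [
    ("doc-1", "acme",   [0.90, 0.10]),
    ("doc-2", "acme",   [0.70, 0.30]),
    ("doc-3", "globex", [0.95, 0.05]),  # closest overall, but the wrong workspace
]

def search_prefiltered(query, workspace, k=1):
    # Pre-filter: shrink the candidate set BEFORE the similarity scan.
    candidates = [r for r in records if r[1] == workspace]
    candidates.sort(key=lambda r: cosine(query, r[2]), reverse=True)
    return [doc for doc, _, _ in candidates[:k]]

def search_postfiltered(query, workspace, k=1):
    # Post-filter: rank everything, THEN drop non-matching results.
    ranked = sorted(records, key=lambda r: cosine(query, r[2]), reverse=True)
    return [doc for doc, ws, _ in ranked[:k] if ws == workspace]

print(search_prefiltered([1.0, 0.0], "acme"))   # finds acme's best match
print(search_postfiltered([1.0, 0.0], "acme"))  # empty: the lone top hit was globex's
```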
Vector search vs full-text search
Pure vector search is great for semantic matches but bad at exact-name retrieval. If someone asks "configure the OPENAI_BASE_URL environment variable", the query vector emphasizes "configure" and "environment variable" semantically. It might miss the chunk that literally lists OPENAI_BASE_URL because that exact token doesn't dominate the vector.
Full-text search (BM25) is the opposite. It nails exact-name retrieval but misses semantic relationships.
The fix is hybrid retrieval: run vector AND keyword search in parallel, fuse the results. The vector search catches semantic matches; the keyword search catches exact-name retrieval. Combined, hybrid retrieval beats either alone on real-world technical support questions by roughly 8 to 14 percentage points in answer accuracy.
AskVault uses hybrid retrieval on higher-tier plans where the trade-off makes sense for the workload.
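One widely used way to fuse the two result lists is reciprocal rank fusion (RRF). A sketch with made-up document IDs; the two input lists stand in for the vector and BM25 result sets:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Each doc scores sum(1 / (k + rank)) across the ranked lists it appears in."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["reset-guide", "account-recovery", "sso-setup"]   # semantic matches
keyword_hits = ["env-vars", "reset-guide", "sso-setup"]           # exact-token matches

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that appear in both lists ("reset-guide", "sso-setup") get boosted, while a strong exact-name hit from only the keyword side ("env-vars") still survives into the fused ranking. The constant `k=60` is the conventional default; it dampens the advantage of rank 1 over rank 2.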
How big should chunks be?
Chunking is the step where you split documents into smaller pieces before embedding. Common mistakes:
- Too small. Chunks under 100 tokens produce noisy embeddings. Too many false-positive matches.
- Too large. Chunks over 1,500 tokens dilute the semantic signal. The vector captures the average meaning of the chunk, not the specific section you want.
- Naively split. Splitting every 500 tokens regardless of document structure breaks paragraphs mid-sentence and lists mid-item. Causes weird citation behavior.
The right answer for most documents is 400 to 800 tokens per chunk with 50 to 100 tokens of overlap between adjacent chunks (so context survives the boundary). Production chunkers also use document structure: split on heading boundaries, preserve list items as units, keep table rows together.
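The fixed-size-with-overlap part is simple enough to sketch. Here each list element stands in for one token; a production chunker would use a real tokenizer and the structure-aware rules above.

```python
def chunk(tokens, size=600, overlap=75):
    """Fixed-size chunks with overlap so context survives each boundary."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += size - overlap  # step forward, keeping `overlap` tokens shared
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
pieces = chunk(tokens, size=600, overlap=75)
print(len(pieces))    # 2 chunks for 1,000 tokens
print(pieces[1][0])   # second chunk starts at token 525 (600 - 75)
```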
Re-embedding when content changes
Vector embeddings are computed once per chunk and stored. If the content changes, you have to re-embed. Two patterns:
- Full re-embed on document change. Simple. Re-runs every chunk through the embedding model when any part of the document is updated. Cheap if your documents are small.
- Incremental re-embed on diff. Detects which chunks actually changed, re-embeds only those. Faster for large documents.
AskVault's recurring crawl handles re-embedding automatically. Daily or weekly crawls detect content changes and re-embed only the affected chunks.
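The diff-detection idea behind incremental re-embedding can be sketched with content hashes. This is a simplified stand-in: it compares chunk text by hash and re-embeds anything new, whereas a production system would also track chunk identity and delete stale vectors.

```python
import hashlib

def content_hash(chunk_text):
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

def chunks_to_reembed(old_chunks, new_chunks):
    """Compare stored hashes against a fresh crawl; re-embed only changed chunks."""
    old_hashes = {content_hash(c) for c in old_chunks}
    return [c for c in new_chunks if content_hash(c) not in old_hashes]

old = ["Reset your password from Settings.", "Contact support by email."]
new = ["Reset your password from Settings.", "Contact support via live chat."]

print(chunks_to_reembed(old, new))  # only the changed chunk hits the embedding model
```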
Common debugging patterns
When a vector search returns wrong results, three things to check:
- Are the right chunks in the database? Inspect the index directly. If the relevant content isn't there, your chunker dropped it or the crawler missed it.
- Is the query embedding correct? Embed the user's query manually, find the top-K matches. If the top match looks right but the answer is wrong, the LLM step is the problem, not retrieval.
- Are filters too tight? A workspace filter, audience filter, or date filter that's stricter than needed will return zero or too-few results. Loosen one filter at a time.
For technical support specifically, retrieval is rarely the problem in well-architected RAG systems. Most "wrong answer" bugs are in chunking, query rewriting, or LLM-side prompt engineering.
FAQ
Do I need to know any of this to use AskVault?
No. AskVault handles all the vector database operations for you. This page exists so you can debug RAG performance, evaluate competitors, or build your own system if you outgrow ours.
Can I bring my own embeddings?
For Enterprise customers wanting full data isolation, we support pre-embedded chunk uploads. You compute embeddings client-side, upload them via a special endpoint, AskVault stores and queries them. Custom contract required.
Are 1536-dimensional embeddings always best?
No. Smaller embeddings (256 to 768 dimensions) are faster and cheaper but lose some retrieval accuracy. For most production B2B workloads, 1,024 to 1,536 dimensions is the sweet spot. Above 4,096 you hit diminishing returns and dramatic cost increases.
Why not just use OpenAI's file search?
OpenAI's Assistants file search is a managed RAG product, similar to AskVault. It works but it's a closed system: you can't tune chunking, can't see the retrieved chunks at debug time, can't easily switch LLM providers, can't audience-tag documents. AskVault's value proposition is the things you get to control.
Can vector databases search images and audio too?
Yes. Multi-modal embedding models (CLIP for images, Whisper for audio) produce vectors in the same way. The same database can store image and audio vectors alongside text vectors. AskVault is text-only today; multi-modal retrieval is on the roadmap.
Related guides
- What is Retrieval-Augmented Generation (RAG)?
- How to scrape a JavaScript-rendered website
- POST /v1/query reference