File uploads to the knowledge base
Supported file types
Five formats today:
- PDF. Most common. Text-based PDFs index fully; scanned PDFs require OCR pre-processing.
- DOCX. Microsoft Word. Tables and headings preserved.
- TXT. Plain text. Indexed as-is.
- MD (Markdown). Headings, lists, code blocks structured during indexing.
- CSV. Parsed as a table; column headers preserved. Each row becomes a queryable chunk.
For other formats:
- PPTX (PowerPoint). Export to PDF first, then upload.
- XLSX (Excel). Export to CSV first.
- HTML. Use URL crawling instead.
- Image-only PDFs. Run OCR via Adobe Acrobat or a free tool, then upload.
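The format rules above can be turned into a small pre-upload check. This is a minimal sketch, not part of AskVault: the function name `upload_advice` and the hint strings are hypothetical, and the extension lists simply mirror this guide.

```python
from pathlib import Path

# Formats this guide lists as directly supported.
SUPPORTED = {".pdf", ".docx", ".txt", ".md", ".csv"}

# Conversion advice for formats the guide says to convert first.
CONVERT_FIRST = {
    ".pptx": "export to PDF, then upload",
    ".xlsx": "export to CSV, then upload",
    ".html": "use URL crawling instead",
    ".doc": "convert to .docx, then upload",
}


def upload_advice(filename: str) -> str:
    """Return 'upload' for supported types, otherwise a short hint."""
    ext = Path(filename).suffix.lower()
    if ext in SUPPORTED:
        return "upload"
    return CONVERT_FIRST.get(ext, "unsupported format")
```

An extension check cannot tell a scanned PDF from a text-based one, so a `.pdf` that passes here may still need OCR.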
Uploading a file
Two paths.
Via the dashboard.
- Open Knowledge Hub > Add Source > File Upload.
- Drag and drop one or more files into the upload area.
- Optionally pick an audience tag before upload.
- Click Upload. Indexing starts within 10 seconds.
Via the API.
curl -X POST https://api.askvault.co/v1/documents/upload \
  -H "Authorization: Bearer ak_xxx" \
  -F "file=@/path/to/handbook.pdf" \
  -F "audience=internal" \
  -F "workspace_id=ws_xxx"
Returns the document ID and indexing-status URL.
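For scripts, the same request can be sketched in Python using only the standard library. The endpoint, header, and form fields come from the curl call; the helper names `build_upload_request` and `upload` are hypothetical, and treating the response body as JSON is an assumption, since this guide does not specify the response format.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

UPLOAD_URL = "https://api.askvault.co/v1/documents/upload"


def build_upload_request(path, api_key, workspace_id, audience=None):
    """Build (but do not send) a multipart/form-data upload request."""
    path = Path(path)
    boundary = uuid.uuid4().hex
    fields = {"workspace_id": workspace_id}
    if audience:
        fields["audience"] = audience

    parts = []
    # Plain text form fields first.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # Then the file part, with a guessed content type.
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f'Content-Type: {content_type}\r\n\r\n'.encode()
        + path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        UPLOAD_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def upload(path, api_key, workspace_id, audience=None):
    """Send the request; assumes (not confirmed here) a JSON response body."""
    req = build_upload_request(path, api_key, workspace_id, audience)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

`urllib.request.urlopen` raises `urllib.error.HTTPError` on non-2xx responses, so an HTTP 413 for an oversized file surfaces as an exception rather than a return value.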
What happens during indexing
The pipeline:
- File received. Stored encrypted at rest immediately.
- Parser runs. Format-specific extractor pulls text, headings, tables.
- Chunker splits content into semantic chunks (typically 200 to 500 tokens each).
- Embedder generates vectors for each chunk.
- Vectors stored in the workspace's index.
- Document marked ready. Available for retrieval.
Total time: about 30 seconds per MB of content. A 10 MB PDF indexes in about 5 minutes.
Watch progress under Knowledge Hub > [document], which shows "Queued > Indexing > Ready" with a percentage.
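The timing rule is simple arithmetic: 10 MB x 30 s/MB = 300 s, or 5 minutes. The sketch below applies it and uses the estimate to bound a polling loop. The `fetch_status` callable is hypothetical: you would supply a function that reads the document's status however you have access to it, and the status strings are taken from the dashboard labels, which the API may not share.

```python
import time

SECONDS_PER_MB = 30  # figure quoted in this guide


def estimated_indexing_seconds(size_mb: float) -> float:
    """Rough indexing time for a file of the given size."""
    return size_mb * SECONDS_PER_MB


def wait_until_ready(fetch_status, size_mb, poll_every=5.0, sleep=time.sleep):
    """Poll fetch_status() until it returns 'Ready' or the budget runs out.

    fetch_status is a caller-supplied callable (hypothetical) returning one
    of 'Queued', 'Indexing', 'Ready'. The time budget is twice the estimate.
    """
    budget = 2 * estimated_indexing_seconds(size_mb)
    waited = 0.0
    while waited <= budget:
        if fetch_status() == "Ready":
            return True
        sleep(poll_every)
        waited += poll_every
    return False
```

The doubled budget is an arbitrary safety margin chosen for the sketch, not a documented timeout.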
Per-file size limit
50 MB per file. For larger documents:
- Split into smaller files. PDF tools can split by chapter.
- Compress images in the source document before uploading.
- Convert to text-only. A scanned PDF can be 10x larger than the OCR-extracted text equivalent.
If a file exceeds 50 MB at upload, the dashboard rejects it. The API returns HTTP 413.
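A local check before uploading avoids a round trip that ends in a rejection. This is a minimal sketch: the helper name `oversized` is hypothetical, and it assumes 1 MB means 1,048,576 bytes, which this guide does not specify.

```python
import os

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumes 1 MB = 1,048,576 bytes


def oversized(paths, limit=MAX_FILE_BYTES):
    """Return (path, size_in_bytes) for every file above the limit."""
    too_big = []
    for p in paths:
        size = os.path.getsize(p)
        if size > limit:
            too_big.append((p, size))
    return too_big
```

If the service counts megabytes as 1,000,000 bytes instead, files just under the assumed limit could still be rejected, so leave some headroom.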
Workspace storage cap
Total content size per workspace varies by plan:
- Free. 5 MB. Roughly 50 to 100 pages of typical text.
- Starter. 15 MB. Roughly 150 to 300 pages.
- Growth. 40 MB. Roughly 400 to 800 pages.
- Business. 100 MB. Roughly 1,000 to 2,000 pages.
- Enterprise. Unlimited.
When you hit the cap, new uploads fail with HTTP 413. Either delete unused documents or upgrade the plan.
Check current usage under Knowledge Hub > Storage Usage.
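The plan caps can be checked before a batch upload. In this sketch the caps are copied from the list above; the function names are hypothetical, and how AskVault measures "content size" (raw file bytes or extracted text) is not stated here, so treat the result as an estimate.

```python
# Storage caps in MB per plan, as listed in this guide. None = unlimited.
PLAN_CAP_MB = {
    "free": 5,
    "starter": 15,
    "growth": 40,
    "business": 100,
    "enterprise": None,
}


def fits_in_plan(plan: str, used_mb: float, incoming_mb: float) -> bool:
    """True if the incoming content fits under the plan's cap."""
    cap = PLAN_CAP_MB[plan.lower()]
    if cap is None:
        return True
    return used_mb + incoming_mb <= cap


def smallest_plan_for(total_mb: float) -> str:
    """Smallest plan whose cap covers total_mb of content."""
    for plan in ("free", "starter", "growth", "business"):
        if total_mb <= PLAN_CAP_MB[plan]:
            return plan
    return "enterprise"
```

For example, a workspace on Starter with 12 MB used cannot take a 4 MB upload, because 12 + 4 exceeds the 15 MB cap.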
File-type indexing details
How each format is handled:
PDF (text-based). Pages extracted in order. Headings detected from font-size heuristics. Tables converted to inline text with cell separators.
PDF (scanned/image). Without OCR, indexing produces empty results. Pre-process with Adobe Acrobat Pro (Recognize Text), Tesseract (free CLI), or an online tool, then re-upload.
DOCX. Headings, paragraphs, tables, bullet lists structurally preserved. Tracked changes and comments stripped.
TXT. Indexed line-by-line. No structural inference; chunks split at paragraph breaks.
Markdown. Headings (H1-H6) used for chunk boundaries. Code blocks indexed as code; fenced language preserved.
CSV. First row treated as column headers. Each subsequent row becomes a chunk with column context. Useful for FAQ tables, product catalogs, or structured data.
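To make "each row becomes a chunk with column context" concrete, here is one plausible rendering. It is an illustration only: AskVault's actual chunk format is not documented in this guide, and the `column: value` layout and the function name `csv_row_chunks` are assumptions.

```python
import csv
import io


def csv_row_chunks(text: str) -> list[str]:
    """Turn each data row into one string that carries its column names."""
    reader = csv.DictReader(io.StringIO(text))  # first row = column headers
    chunks = []
    for row in reader:
        chunks.append("; ".join(f"{col}: {val}" for col, val in row.items()))
    return chunks
```

Given a two-column FAQ file with headers `question,answer`, each chunk reads like `question: ...; answer: ...`, which is why descriptive column headers tend to help retrieval.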
Audience tagging on upload
Tag files at upload time:
- In the upload modal, click "Set audience" before clicking Upload.
- Pick a tag (or type a new one).
- All uploaded files inherit the tag.
Override per file later under Knowledge Hub > [document] > Audience.
See audience tags for how the bot uses these.
Replacing or updating a file
When the source document changes:
Replace option. Under Knowledge Hub > [document] > Replace, upload the new version. The system swaps content while preserving the document ID. Existing citations remain valid.
Re-upload as new. Upload a fresh copy and delete the old one. Use only if the document fundamentally changed and old citations should break.
Replace is the safer default. Document IDs and citation links stay stable.
Citations in bot responses
When the bot answers from an uploaded file:
Per the Employee Handbook, PTO accrues at 1.5 days per month for full-time staff. [Source: Employee Handbook 2026.pdf, page 14]
Citations include:
- File name as the human-readable label.
- Approximate page number for PDFs (best-effort, can drift by 1 to 2 pages on layout-heavy PDFs).
- Click-through link to download the source file (visible to authorized visitors only).
OCR pre-processing for scanned PDFs
Many older PDFs are scanned images, not text. Without OCR, AskVault indexes nothing useful.
Three OCR options:
- Adobe Acrobat Pro. File > Recognize Text. About 30 seconds per 10 pages.
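- Tesseract (free CLI). Open-source, runs locally. Tesseract reads images rather than PDFs, so rasterize the pages first (for example, pdftoppm -r 300 -png input.pdf page), then run tesseract page-1.png output -l eng pdf on each page image to produce a searchable PDF.
- Online tools like ilovepdf.com or smallpdf.com.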
After OCR, upload the OCR'd PDF (now text-searchable) to AskVault.
Bulk upload
For uploading many files:
Dashboard. Drag-and-drop up to 50 files at once. Each indexes in parallel.
API. Loop through files, calling /v1/documents/upload once per file. The concurrency cap is 5 files in flight per workspace.
Zip archives. Today, AskVault doesn't auto-extract zip files. Unzip first, then bulk-upload.
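An API bulk upload that respects the cap of 5 in-flight files can be sketched with a bounded thread pool. The `upload_one` callable is supplied by you (for example, a function that posts one file to /v1/documents/upload); `bulk_upload` itself is a hypothetical helper, not part of any AskVault SDK.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 5  # per-workspace concurrency cap from this guide


def bulk_upload(paths, upload_one, max_in_flight=MAX_IN_FLIGHT):
    """Call upload_one(path) for each path, at most max_in_flight at a time.

    Returns {path: result}, where a failed upload maps to its exception
    so that one bad file does not abort the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {pool.submit(upload_one, p): p for p in paths}
        for fut, p in futures.items():
            try:
                results[p] = fut.result()
            except Exception as exc:
                results[p] = exc
    return results
```

The pool size enforces the cap only within one process. If several scripts upload to the same workspace at once, they share the limit and need to coordinate.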
Planned features (on the roadmap)
None of the following is available yet. Each item notes the current workaround:
- PPTX and XLSX native support. Today, convert to PDF or CSV first. Native parsing planned.
- Zip archive auto-extract. Today, unzip manually. Auto-extract planned.
- OCR-as-a-service. Today, OCR is your responsibility. Server-side OCR planned for Business and above.
- Table-extraction improvement. Today, table parsing is best-effort. Improved table-to-CSV conversion planned for complex layouts.
Limits
- Per-file size. 50 MB.
- Files per workspace. 1,000.
- Total content per workspace. Plan-dependent (5 to 100 MB; unlimited on Enterprise).
- Indexing speed. About 30 seconds per MB.
- Bulk upload concurrency (API). Up to 5 files in flight per workspace.
Common pitfalls
PDF uploaded but no answers reference it. Scanned PDF without OCR. Run OCR and re-upload.
Indexing stuck at "Queued". The workspace hit the indexing rate limit. Wait 5 minutes for the queue to catch up.
Workspace hit storage cap. Free plan covers 5 MB. Either delete old documents or upgrade.
Table content garbled in answers. Complex PDF tables lose structure during extraction. Upload the source CSV alongside for better table queries.
File rejected as "unsupported format". Check the extension. The legacy .doc format (old Word) is not supported; convert to .docx first.
FAQ
Can I upload files larger than 50 MB?
Not per single file. Split into multiple files, each under 50 MB.
Do I need to OCR PDFs myself?
For scanned PDFs, yes today. Server-side OCR is planned.
How long does indexing take?
About 30 seconds per MB. A 10 MB PDF indexes in roughly 5 minutes.
Can I bulk upload via API?
Yes. Loop through files, calling /v1/documents/upload once per file. The concurrency cap is 5 files in flight per workspace.
What happens to my files if I delete the workspace?
Files are wiped within 30 days, and backups are purged at 90 days. Nothing is recoverable after that.
Related guides
- Knowledge Hub overview
- URL crawling
- PDF upload specifics
- Audience tags
- AI document analysis use case