How to upload PDF and Word documents to AskVault
Supported file types
Five formats supported out of the box:
- PDF. Most common case. AskVault extracts text, preserves tables, handles multi-column layouts.
- DOCX. Microsoft Word documents. Headings, lists, and tables preserved.
- TXT. Plain text. No formatting to preserve, fast to ingest.
- MD (Markdown). Markdown files. Headings, lists, code blocks, tables preserved.
- CSV. Comma-separated tabular data. Each row indexes as a separately-retrievable record by default; full-table mode also available.
Image-only PDFs (scanned documents) require OCR. AskVault runs OCR automatically on PDFs it detects as image-only, which adds 5 to 30 seconds of processing time per page. OCR is available on Growth and above.
For other formats (XLSX, PPTX, HTML files), upload as a different supported type after conversion. Excel files convert to CSV; PowerPoint to PDF; HTML can be crawled instead via URL crawling.
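The conversions above can be scripted before upload. This is a minimal sketch, assuming LibreOffice's soffice command-line tool is installed locally; it is one common converter, not something AskVault provides, and the function names are illustrative.

```shell
# Map an unsupported extension to the supported format suggested above.
# Prints the target format, or returns non-zero if no conversion is known.
target_format() {
  case "$1" in
    *.xlsx|*.XLSX) echo csv ;;
    *.pptx|*.PPTX) echo pdf ;;
    *) return 1 ;;
  esac
}

# Convert one file with LibreOffice in headless mode (assumed to be installed).
# The converted file is written to the current directory.
convert_for_upload() {
  fmt=$(target_format "$1") || {
    echo "no known conversion for $1" >&2
    return 1
  }
  soffice --headless --convert-to "$fmt" "$1"
}
```

For example, convert_for_upload pricing.xlsx should leave a pricing.csv next to it, ready to upload. Multi-sheet workbooks may need one export per sheet, since a CSV holds a single table.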
Upload via the dashboard
- Open Knowledge > Add Source > Upload File in AskVault.
- Drag and drop your file(s). Or click to browse.
- Configure per-file options. Audience tags, parse-as-table for CSV, document title override.
- Click Upload.
Files queue for processing. Indexing time depends on size:
- Small PDFs (under 10 pages). 10 to 30 seconds.
- Standard docs (10 to 100 pages). 30 to 120 seconds.
- Large reports (100 to 500 pages). 2 to 8 minutes.
- OCR'd scanned PDFs. 5 to 30 seconds per page on top of regular indexing.
Progress visible under Knowledge Hub > [filename] > Status.
Upload via API
For programmatic upload (CI/CD pipelines, bulk imports from your CMS):
```shell
curl -X POST https://api.askvault.co/v1/documents \
  -H "Authorization: Bearer ak_xxx" \
  -F "workspace_id=wt_xxx" \
  -F "file=@policy.pdf" \
  -F "audience=hr_team,managers"
```

Response:

```json
{
  "document_id": "doc_xxx",
  "filename": "policy.pdf",
  "size_bytes": 245678,
  "status": "indexing",
  "audience": ["hr_team", "managers"]
}
```

Poll the document's status:

```shell
curl https://api.askvault.co/v1/documents/doc_xxx \
  -H "Authorization: Bearer ak_xxx"
```

Status transitions through queued, indexing, ready. Failed uploads transition to failed with an error code in the response.
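In a pipeline you usually want to block until the document reaches ready or failed. The sketch below wraps the status request in a polling loop. The poll interval, the poll limit, and the function names are assumptions, and the status field is read with sed only to avoid a dependency; jq would be more robust if you have it.

```shell
# Pull the "status" value out of a JSON response read from stdin.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll a document until it reaches a terminal state (ready or failed).
# $1 = document id, $2 = API key,
# $3 = seconds between polls (default 5), $4 = maximum polls (default 120).
wait_for_document() {
  doc_id=$1
  api_key=$2
  interval=${3:-5}
  max_polls=${4:-120}
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    status=$(curl -sS "https://api.askvault.co/v1/documents/$doc_id" \
      -H "Authorization: Bearer $api_key" | extract_status)
    case "$status" in
      ready)  echo ready;  return 0 ;;
      failed) echo failed; return 1 ;;
    esac
    # Still queued or indexing: wait and try again.
    sleep "$interval"
    i=$((i + 1))
  done
  echo timeout
  return 2
}
```

Calling wait_for_document doc_xxx ak_xxx returns 0 once the document is ready, so it can gate a later pipeline step.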
How AskVault chunks each format
Format-specific chunking rules:
- PDF. Detects logical reading order, handles multi-column layouts, preserves table rows as a unit. Chunks at 400 to 800 tokens with parent-heading prefixes.
- DOCX. Splits on heading styles (Heading 1, Heading 2). Preserves bullet lists and tables. Same chunk sizes.
- TXT. Splits on paragraph breaks. Less structural info available; chunks tend to be more uniform.
- MD. Splits on ## boundaries. Preserves code blocks intact.
- CSV. Two modes:
  - Row-per-document (default). Each row is its own document; useful for product catalogs.
  - Full-table (parse-as-table option). Whole CSV becomes one chunk. Useful for small reference tables.
See chunking strategies for the rationale.
Audience tagging at upload
For documents that should only be visible to specific verified users, set audience tags at upload time:
- Dashboard. Type tag names in the audience field on the upload form.
- API. Pass audience as a multi-value form field (audience=hr_team,audience=managers).
The bot only retrieves these documents for verified visitors whose audience set includes one of the tags. Combine with identity verification for production access control. Available on Growth and above.
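As a sketch of the multi-value form described above, the function below sends one audience field per tag. The environment variable names ASKVAULT_API_KEY and ASKVAULT_WORKSPACE_ID are hypothetical. Note that the curl example under "Upload via API" passes a single comma-separated value instead, so confirm which form your workspace accepts before relying on either.

```shell
# Upload a file with one audience form field per tag.
# Usage: upload_with_audience FILE [TAG...]
# ASKVAULT_API_KEY and ASKVAULT_WORKSPACE_ID are hypothetical variable names.
upload_with_audience() {
  file=$1
  shift
  tag_count=$#
  # Append "-F audience=<tag>" for each tag, then drop the bare tags
  # so that "$@" holds only the curl form arguments.
  for tag in "$@"; do
    set -- "$@" -F "audience=$tag"
  done
  shift "$tag_count"
  curl -sS -X POST https://api.askvault.co/v1/documents \
    -H "Authorization: Bearer $ASKVAULT_API_KEY" \
    -F "workspace_id=$ASKVAULT_WORKSPACE_ID" \
    -F "file=@$file" \
    "$@"
}
```

Usage: upload_with_audience policy.pdf hr_team managers.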
Bulk upload
For uploading 50+ files at once:
- Dashboard. Drag-and-drop supports multiple files in one go. Up to 100 files per drop.
- API. Loop POST requests with controlled concurrency. Rate limits apply per your plan.
- ZIP upload. Upload a ZIP file containing multiple supported documents. AskVault extracts and processes each. Useful for shipping the entire /policies/ folder.
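For the API route, "controlled concurrency" can be as simple as xargs with a process cap. The sketch below assumes GNU or BSD find and xargs (for -iname, -0, and -P), a default of 4 parallel requests chosen arbitrarily, and the hypothetical environment variables ASKVAULT_API_KEY and ASKVAULT_WORKSPACE_ID. Tune the concurrency to your plan's rate limit.

```shell
# List files under a directory whose extension is one of the five supported formats.
list_supported_files() {
  find "$1" -type f \( -iname '*.pdf' -o -iname '*.docx' -o -iname '*.txt' \
    -o -iname '*.md' -o -iname '*.csv' \)
}

# Upload every supported file under $1, at most $2 requests at a time (default 4).
# File names containing newlines are not handled by this sketch.
bulk_upload() {
  # The child shells started by xargs need the credentials in their environment.
  export ASKVAULT_API_KEY ASKVAULT_WORKSPACE_ID
  list_supported_files "$1" | tr '\n' '\0' |
    xargs -0 -P "${2:-4}" -n 1 sh -c '
      curl -sS -X POST https://api.askvault.co/v1/documents \
        -H "Authorization: Bearer $ASKVAULT_API_KEY" \
        -F "workspace_id=$ASKVAULT_WORKSPACE_ID" \
        -F "file=@$1"' _
}
```

Usage: bulk_upload ./policies 4. For the ZIP route, zip -r policies.zip policies/ produces a single archive to upload instead.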
For bulk imports from existing knowledge systems (Confluence, Notion), use the native integrations instead of file upload. Notion and GitHub integrations stay in sync over time; one-time bulk uploads don't.
Plan-level content cap
Total storage per workspace:
- Free. 5 MB.
- Starter. 15 MB.
- Growth. 40 MB.
- Business. 100 MB.
- Enterprise. Custom.
The average B2B SaaS policy document is 50 to 200 KB of extracted text, so 100 MB covers roughly 500 to 2,000 documents in practice. Plan upgrades are immediate.
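The same arithmetic applies to the other plans. A small sketch, assuming 1 MB = 1,000 KB (which is what the 500 to 2,000 figure implies) and the 50 to 200 KB per-document range quoted above:

```shell
# Rough number of documents that fit in a plan's cap.
# $1 = cap in MB, $2 = average extracted text per document in KB.
docs_capacity() {
  echo $(( $1 * 1000 / $2 ))
}

# Print the estimated range per plan, from 200 KB documents to 50 KB documents.
plan_capacity_table() {
  while read -r plan cap_mb; do
    echo "$plan: $(docs_capacity "$cap_mb" 200) to $(docs_capacity "$cap_mb" 50) documents"
  done <<EOF
Free 5
Starter 15
Growth 40
Business 100
EOF
}
```

Running plan_capacity_table gives 25 to 100 documents on Free, 75 to 300 on Starter, 200 to 800 on Growth, and 500 to 2,000 on Business.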
If you hit the cap, new uploads fail with HTTP 413 Payload Too Large. Delete unused docs or upgrade.
Common pitfalls
PDF text extracts garbled. PDF was created from a scan and OCR is disabled. Enable OCR or re-upload after running through Adobe Acrobat's "Recognize Text" pre-processing.
Tables look wrong in retrieved chunks. PDF table detection failed. For dense tabular PDFs, convert to CSV first; the CSV ingestion preserves table structure better.
Document indexed but bot doesn't cite it. Audience tags restrict it to verified users. If you're testing as an anonymous visitor, the bot won't see it. Test with an identity-verified user or temporarily remove audience tags.
Large PDF times out. Files over 100 MB hit our upload limit. Split into smaller files (per chapter or section) and upload separately.
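A size check before upload avoids waiting on a request that will be rejected. This sketch assumes the 100 MB limit means 100,000,000 bytes; adjust MAX_UPLOAD_BYTES if the limit turns out to be binary megabytes. The function names are illustrative.

```shell
# Upload limit from the pitfall above, assumed here to be decimal megabytes.
MAX_UPLOAD_BYTES=$((100 * 1000 * 1000))

# Print a file's size in bytes.
file_size_bytes() {
  wc -c < "$1" | tr -d ' '
}

# Succeeds if the file is at or under the upload limit.
within_upload_limit() {
  [ "$(file_size_bytes "$1")" -le "$MAX_UPLOAD_BYTES" ]
}
```

Usage: within_upload_limit report.pdf || echo "split report.pdf before uploading". Any PDF tool can do the splitting; qpdf's --split-pages option is one example.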
FAQ
Does AskVault keep the original file after indexing?
No. The original file is discarded within 10 minutes of indexing completion. Only the extracted text and embeddings remain in the workspace. This reduces the data-at-rest surface for sensitive documents.
Can I update a document by re-uploading?
Yes. Upload a file with the same name; AskVault detects the match and replaces the old version. Vector embeddings are regenerated; old chunks are removed.
Does PDF page order matter for retrieval?
The chunker tries to detect logical reading order. For multi-column layouts (research papers, magazines), expect occasional out-of-order issues. Standard single-column PDFs are rarely affected.
Can I extract specific PDF page ranges?
Not via the upload form. Pre-process the PDF to extract the page range, then upload. Alternatively, use the API with the page_range parameter, available on Business and above.
What about password-protected PDFs?
Upload with the password field. AskVault decrypts during ingestion, extracts text, then discards both the file and the password.
Related guides
- URL crawling
- Q&A pairs
- Chunking strategies for production RAG
- How to restrict the AI bot to specific URLs only
- Notion integration setup