
Batch URL ingestion


When to use batch ingest

Three patterns:

  1. Migrating from another platform. You have a list of URLs from Chatbase/SiteGPT to re-index.
  2. Onboarding without a sitemap. Known set of important URLs to index.
  3. Targeted re-index of a subset after content changes.

For full-site crawls without a URL list, see full-site crawl.

Method 1: CSV upload

  1. Prepare a CSV with one URL per line. Optional columns: audience, tags.
  2. Knowledge Hub > Add Source > Batch URL > Upload CSV.
  3. Click Upload.
  4. AskVault validates (checks URL format, reachability).
  5. Parallel crawl starts within 60 seconds.

CSV size cap: 10 MB (about 100,000 URLs).
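You can sanity-check the file before upload so validation fails less often. A minimal sketch with the standard library (the `audience` and `tags` column names come from the steps above; `looks_like_url` and `build_batch_csv` are illustrative helper names, and the check here only covers URL shape, not reachability):

```python
import csv
import io
from urllib.parse import urlparse

def looks_like_url(u: str) -> bool:
    """Cheap client-side shape check; AskVault still validates on upload."""
    p = urlparse(u)
    return p.scheme in ("http", "https") and bool(p.netloc)

def build_batch_csv(rows):
    """rows: iterable of (url, audience, tags). Returns CSV text, skipping malformed URLs."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["url", "audience", "tags"])
    for url, audience, tags in rows:
        if looks_like_url(url):
            w.writerow([url, audience, tags])
    return buf.getvalue()

csv_text = build_batch_csv([
    ("https://yoursite.co/pricing", "public", "pricing"),
    ("not-a-url", "public", ""),  # dropped by the shape check
])
```

Anything the shape check drops would have failed server-side validation anyway; filtering first keeps the failure count in the dashboard meaningful.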

Method 2: paste list

For smaller batches:

  1. Knowledge Hub > Add Source > Batch URL > Paste.
  2. Paste up to 1,000 URLs (one per line).
  3. Click Start.

Method 3: API

For programmatic ingestion:

curl -X POST https://api.askvault.co/v1/documents/batch-crawl \
  -H "Authorization: Bearer ak_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://yoursite.co/page1", "https://yoursite.co/page2"],
    "audience": "public"
  }'

Returns a job ID; poll for completion.
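Since a single API call accepts at most 1,000 URLs (see Limits below), larger lists need to be split into multiple requests. A sketch of the chunking, using the `urls` and `audience` fields from the curl example above (`batch_payloads` is an illustrative helper, not part of the API):

```python
import json

MAX_URLS_PER_CALL = 1000  # per-call limit from the Limits section

def batch_payloads(urls, audience="public"):
    """Split a URL list into JSON payload strings sized for the batch-crawl endpoint."""
    for i in range(0, len(urls), MAX_URLS_PER_CALL):
        yield json.dumps({
            "urls": urls[i:i + MAX_URLS_PER_CALL],
            "audience": audience,
        })

payloads = list(batch_payloads([f"https://yoursite.co/page{n}" for n in range(2500)]))
```

Each payload would be POSTed with your API key exactly as in the curl example; each response's job ID can then be polled. Keep concurrent jobs within the 3-per-workspace cap.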

Crawl rate

Configurable:

  • Global default. 30 URLs in flight across the batch.
  • Per-host throttle. Max 8 URLs in flight per origin (prevents hammering your own server).
  • Both are adjustable under Source Settings.

For a 1,000-URL batch on a healthy site: about 5 to 15 minutes to complete.
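The 5-to-15-minute window is just throughput arithmetic: it corresponds to roughly 1 to 3 completed URLs per second. A rough back-of-envelope helper (illustrative only; the dashboard's ETA uses the live observed rate):

```python
def eta_minutes(total_urls: int, urls_per_second: float) -> float:
    """Rough completion time in minutes, assuming a steady crawl rate."""
    return total_urls / urls_per_second / 60

fast = eta_minutes(1000, 3.0)  # healthy site, throttle rarely hit -> ~5.6 min
slow = eta_minutes(1000, 1.0)  # per-host throttle dominating -> ~16.7 min
```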

Audience and tags per URL

CSV with metadata:

url,audience,tags
https://yoursite.co/pricing,public,pricing
https://yoursite.co/enterprise-docs,enterprise,enterprise|docs
https://yoursite.co/internal-runbook,internal,runbook

Each URL inherits the specified audience and tags.
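Note that multiple tags are `|`-separated within the single `tags` column (as in the `enterprise|docs` row above). If you generate or post-process such a file yourself, the stdlib `csv` module handles it; a sketch:

```python
import csv
import io

# Same layout as the sample CSV above.
sample = """url,audience,tags
https://yoursite.co/pricing,public,pricing
https://yoursite.co/enterprise-docs,enterprise,enterprise|docs
"""

rows = []
for rec in csv.DictReader(io.StringIO(sample)):
    # Split the pipe-delimited tags column into a list.
    rec["tags"] = rec["tags"].split("|") if rec["tags"] else []
    rows.append(rec)
```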

Progress monitoring

Dashboard shows:

  • Total URLs in batch.
  • Queued / Indexing / Ready / Failed counts.
  • ETA based on current rate.
  • Per-URL detail (click any row).

Failed URLs surface with the reason (404, timeout, blocked, etc.).
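If you export the failed rows, a quick tally by reason tells you whether you are looking at a systemic problem (many "blocked") or stragglers (a few timeouts). A sketch with hypothetical data:

```python
from collections import Counter

# (url, failure reason) pairs as surfaced in the dashboard -- example data.
failures = [
    ("https://yoursite.co/gone", "404"),
    ("https://yoursite.co/slow", "timeout"),
    ("https://yoursite.co/blocked-1", "blocked"),
    ("https://yoursite.co/blocked-2", "blocked"),
]
by_reason = Counter(reason for _, reason in failures)
```

A cluster of "blocked" entries points at the crawler-blocking pitfall covered below; scattered 404s usually mean the source list is stale.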

Re-running

Three options for re-crawling URLs in a batch:

  • All URLs: trigger full re-crawl.
  • Failed only: retry just the failed ones.
  • Stale only: re-crawl URLs not synced in N days.
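"Stale only" is just a cutoff on the last sync time. If you track sync times on your side and want the same selection before triggering a re-crawl, the filter looks like this (field and function names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def stale_urls(last_synced: dict, days: int, now=None):
    """last_synced: {url: datetime}. Returns URLs not synced in `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [url for url, ts in last_synced.items() if ts < cutoff]

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
urls = stale_urls(
    {
        "https://yoursite.co/pricing": datetime(2024, 6, 29, tzinfo=timezone.utc),
        "https://yoursite.co/old-page": datetime(2024, 5, 1, tzinfo=timezone.utc),
    },
    days=30,
    now=now,
)
```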

Limits

  • CSV size. 10 MB.
  • URLs per single API call. 1,000.
  • Concurrent batch jobs. 3 per workspace.
  • Crawl rate. Up to 30 in flight; 8 per host.

Common pitfalls

Rate limited by your own server. Lower the per-host throttle under Source Settings.

Many failures. Your site is blocking our crawler. Allowlist our crawler IPs or use the BYOK scraper.

Slow batch. The per-host throttle is usually the bottleneck. Increase it under Source Settings.

Duplicates. The same URL ingested under different query strings. Normalize URLs before upload.
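One way to normalize before upload, assuming query strings and fragments don't change your page content (drop them, canonicalize the host and trailing slash, then dedupe; `normalize` is an illustrative helper):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Strip query string and fragment; lowercase the host; drop trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

raw = [
    "https://yoursite.co/pricing?utm_source=x",
    "https://YourSite.co/pricing/",
    "https://yoursite.co/pricing#plans",
]
unique = sorted({normalize(u) for u in raw})  # all three collapse to one URL
```

If query strings do select distinct content on your site (e.g. `?page=2`), keep them and dedupe on the full URL instead.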

FAQ

Can I cancel a batch mid-flight?

Yes. Cancel button in dashboard. In-flight URLs complete; queued ones drop.

Will batch affect my plan's MB cap?

Yes. Each crawled page counts toward the MB cap.

Can I pause and resume?

Pause is supported on Business plans and above. Pausing freezes the queue; resuming picks up where it left off.
