
How to crawl a sitemap into AskVault

4 min read

A sitemap is an explicit URL inventory the site owner publishes. Two reasons it beats link-following:

  1. Speed. AskVault knows every URL upfront, so it can fetch pages in parallel with controlled concurrency. This is typically 5 to 10x faster than following internal links recursively.
  2. Completeness. Link-following misses orphaned pages (pages with no inbound links) and dynamically generated pages. A sitemap can list both.

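For any site that publishes a sitemap, sitemap crawling is the right choice. Shopify and Webflow sites typically publish one at /sitemap.xml, WordPress sites publish one at /wp-sitemap.xml or at a plugin-specific path such as /sitemap_index.xml, and Next.js sites commonly generate one at /sitemap.xml.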

Setup

Two minutes end-to-end.

  1. Find your sitemap URL. Usually https://yoursite.com/sitemap.xml. Some sites use /sitemap_index.xml (the index of multiple sub-sitemaps). Verify by visiting the URL in your browser; you should see XML listing URLs.
  2. In AskVault, open Knowledge > Add Source > Website.
  3. Paste the sitemap URL. AskVault auto-detects it as a sitemap (vs a regular page).
  4. Pick crawl mode = Sitemap.
  5. Optionally configure include/exclude patterns. Useful for sitemaps with thousands of URLs where you only want a subset.
  6. Click Start Crawl.

AskVault parses the sitemap (or sitemap index, recursively if needed), enumerates every listed URL, and crawls them in parallel. Progress is visible under Knowledge Hub > [site] > Status.
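To illustrate what "crawls them in parallel" with controlled concurrency can look like, here is a minimal Python sketch. It is not AskVault's implementation; the names crawl_all and fetch, and the default of 8 workers, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10.0):
    """Download one page body. A stand-in for a real fetcher."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL with at most `max_workers` requests in flight.

    Returns a dict mapping each URL to its body, or to the exception
    that was raised while fetching it.
    """
    urls = list(urls)  # allow generators; we iterate twice below

    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception as exc:  # one failed page should not stop the crawl
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = list(pool.map(safe_fetch, urls))  # map preserves input order
    return dict(zip(urls, bodies))
```

The worker count is the concurrency limit: raising it speeds up the crawl but puts more load on the target site.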

Sitemap index vs single sitemap

Two common sitemap structures:

  • Single sitemap. One file listing every URL. Most small sites use this. The sitemap protocol caps each file at 50,000 URLs and 50 MB uncompressed.
  • Sitemap index. A top-level file that lists multiple sub-sitemaps. Each sub-sitemap covers a section (posts, pages, products). Large sites use this.

AskVault handles both. For sitemap indexes, the crawler walks each sub-sitemap and aggregates the URLs.
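The difference between the two structures is the root element: a single sitemap has a <urlset> root with <url> entries, and an index has a <sitemapindex> root with <sitemap> entries. The sketch below shows one way to walk both with Python's standard library. The function collect_urls and the fetch_xml callback are hypothetical names, not part of AskVault.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def collect_urls(sitemap_url, fetch_xml, seen=None):
    """Return page URLs from a sitemap, following sitemap indexes recursively.

    `fetch_xml` is a caller-supplied function mapping a URL to XML text.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index files that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch_xml(sitemap_url))

    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                urls.extend(collect_urls(loc.text.strip(), fetch_xml, seen))
        return urls

    # Otherwise treat the document as a plain <urlset>.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]
```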

Multi-language sitemaps

AskVault supports sites that use hreflang annotations or per-language sitemaps:

  • Single sitemap with hreflang. Each URL has alternate-language variants. AskVault picks the language matching your workspace's primary language by default.
  • Per-language sitemaps. Sitemap index lists sitemap-en.xml, sitemap-fr.xml, etc. Configure which languages to ingest under Crawl Config > Languages.

For multi-language workspaces, ingest all languages and the bot routes responses to the language matching each conversation.
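In a sitemap with hreflang annotations, each <url> entry carries xhtml:link elements pointing at its language variants. The following sketch shows how a single language could be selected from such a file. The function name urls_for_language is hypothetical, and the fallback (keep the entry's own <loc> when no alternate matches) is an assumption, not documented AskVault behavior.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def urls_for_language(sitemap_xml, language):
    """Pick one URL per <url> entry: the hreflang alternate for `language`,
    or the entry's own <loc> when no alternate matches (assumed fallback)."""
    picked = {}  # dict keeps insertion order and removes duplicates
    wanted = language.lower()
    for entry in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        choice = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        for link in entry.findall("xhtml:link", NS):
            is_alternate = link.get("rel") == "alternate"
            if is_alternate and link.get("hreflang", "").lower() == wanted:
                choice = link.get("href", choice)
                break
        if choice:
            picked[choice] = None
    return list(picked)
```

Because every language variant usually lists the same set of alternates, several entries resolve to the same URL, which is why the result is deduplicated.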

Include and exclude patterns

For sitemaps with thousands of URLs where you only want a subset, use patterns:

  • Include path prefixes. /docs/, /help/. Only URLs matching these prefixes are crawled.
  • Exclude path prefixes. /admin/, /internal/, /tag/. URLs matching these are skipped.

Patterns apply during sitemap enumeration. AskVault parses the sitemap, filters URLs, then crawls only the kept ones.
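The filtering step can be pictured with a short Python sketch. The function filter_urls is a hypothetical name, and letting an exclude prefix win over an include prefix is an assumption; the page does not say how AskVault resolves a URL that matches both.

```python
from urllib.parse import urlsplit


def filter_urls(urls, include=(), exclude=()):
    """Keep URLs whose path starts with an include prefix (if any are given)
    and does not start with an exclude prefix."""
    include, exclude = tuple(include), tuple(exclude)
    kept = []
    for url in urls:
        path = urlsplit(url).path or "/"
        if include and not path.startswith(include):
            continue  # include list is set and this path is outside it
        if path.startswith(exclude):  # an empty tuple never matches
            continue
        kept.append(url)
    return kept
```

For example, include=["/docs/"] with exclude=["/docs/internal/"] keeps https://yoursite.com/docs/setup and drops https://yoursite.com/docs/internal/notes.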

Gzipped sitemaps

Sites with very large sitemaps often gzip them (sitemap.xml.gz). AskVault decompresses on the fly. No special configuration needed.
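If you want to inspect a gzipped sitemap yourself before adding it, a few lines of Python are enough. This sketch checks the gzip magic bytes instead of trusting the file extension; the function name read_sitemap_bytes and the UTF-8 assumption are illustrative only.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_sitemap_bytes(raw):
    """Return sitemap XML as text, decompressing first if the payload is gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")  # assumes UTF-8, which the sitemap protocol requires
```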

Crawl-frequency hints in sitemaps

Sitemaps can include <changefreq> and <priority> tags per URL. AskVault uses these as hints for the re-crawl schedule:

  • <changefreq>daily</changefreq> URLs. Re-crawled daily.
  • <changefreq>weekly</changefreq> URLs. Re-crawled weekly.
  • <priority> weight. Higher-priority URLs surface more prominently in retrieval (default weight applied across all).

If your sitemap doesn't include these tags, AskVault applies a uniform weekly re-crawl.
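The schedule described above amounts to a small lookup. In the Python sketch below, only the daily and weekly mappings and the weekly default come from this page; treating every other <changefreq> value as weekly is an assumption, and recrawl_interval is a hypothetical name.

```python
from datetime import timedelta

RECRAWL_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}
DEFAULT_INTERVAL = timedelta(weeks=1)  # used when the tag is absent


def recrawl_interval(changefreq):
    """Map a <changefreq> value (or None) to a re-crawl interval.

    Values other than "daily" and "weekly" fall back to the default here;
    how AskVault treats them is not documented on this page.
    """
    if not changefreq:
        return DEFAULT_INTERVAL
    return RECRAWL_INTERVALS.get(changefreq.strip().lower(), DEFAULT_INTERVAL)
```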

Skipping the sitemap

If your sitemap is missing or outdated, fall back to standard URL crawling:

  • Pure URL crawl. AskVault walks internal links recursively starting from the home page.
  • Pattern-based crawl. Restrict the recursion to specific path prefixes.

For sites without a sitemap, URL crawl is the only option. See URL crawling.

Limits

  • Max URLs per sitemap file. 50,000 (the sitemap standard). For larger sites, use a sitemap index.
  • Initial crawl latency. 500 pages in about 10 minutes; 5,000 pages in 1 to 2 hours.
  • Per-plan content cap. Same as URL crawling: 5 MB Free, 15 MB Starter, 40 MB Growth, 100 MB Business.

Common pitfalls

Sitemap returns 404. Your site doesn't publish at the expected path. Common alternatives: /sitemap_index.xml, /wp-sitemap.xml, /sitemap1.xml. Try a few; AskVault accepts any URL.
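Trying the alternatives by hand works, but the check is easy to script. In this Python sketch, find_sitemap and the get_status callback (a function that returns the HTTP status code for a URL) are hypothetical names; the candidate paths are the ones mentioned on this page.

```python
from urllib.parse import urljoin

CANDIDATE_PATHS = [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/wp-sitemap.xml",
    "/sitemap1.xml",
]


def find_sitemap(site_root, get_status):
    """Return the first candidate sitemap URL that answers 200, else None."""
    for path in CANDIDATE_PATHS:
        candidate = urljoin(site_root, path)  # absolute path replaces any existing path
        if get_status(candidate) == 200:
            return candidate
    return None
```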

Sitemap loads but is empty. Some CMS plugins generate empty sitemaps. Verify by opening the URL in your browser. If it lists no URLs, fix the sitemap in the source CMS first.

Some URLs missing. They weren't in the sitemap. Add them via the CMS or include them manually as URL sources.

Crawl slower than expected. Some sites throttle aggressive crawlers by returning 429 responses or blocking by IP. AskVault has built-in backoff, but heavily throttled sites can take significantly longer.
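For readers unfamiliar with backoff, the idea is to wait longer after each throttled response before retrying. The Python sketch below uses exponential delays; it is a generic illustration, not AskVault's actual policy, and fetch_with_backoff, fetch_status, and the default parameters are assumptions.

```python
import time


def fetch_with_backoff(fetch_status, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a throttled request with exponentially growing pauses.

    `fetch_status(url)` is a caller-supplied function returning
    (status_code, body). Only HTTP 429 triggers a retry.
    """
    for attempt in range(max_attempts):
        status, body = fetch_status(url)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body  # still throttled after the final attempt
```

Each retry doubles the wait, which is why a site that throttles heavily stretches the total crawl time.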

FAQ

What if my site has multiple sitemaps?

AskVault handles sitemap indexes natively. Point at the index file (typically /sitemap_index.xml) and the crawler walks all sub-sitemaps.

Can I crawl a sitemap on a different domain?

Yes. The sitemap can be hosted on a different domain than the URLs it lists. Useful for CDN-cached sitemaps.

Does AskVault respect lastmod for incremental sync?

Yes. During re-crawl, AskVault checks <lastmod> in the sitemap. URLs unchanged since the last sync are skipped. Significantly reduces re-crawl time on stable sites.
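The comparison behind incremental sync can be sketched in Python as follows. The names parse_lastmod and urls_needing_recrawl are hypothetical, and re-crawling URLs that have no <lastmod> at all is an assumption rather than documented behavior.

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse a <lastmod> value such as 2024-05-01 or 2024-05-01T10:00:00Z."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:  # date-only values carry no timezone; assume UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_needing_recrawl(entries, last_sync):
    """Return URLs modified after `last_sync` (a timezone-aware datetime).

    `entries` is an iterable of (url, lastmod_string_or_None) pairs.
    URLs without a lastmod are always included (assumed behavior).
    """
    stale = []
    for url, lastmod in entries:
        if lastmod is None or parse_lastmod(lastmod) > last_sync:
            stale.append(url)
    return stale
```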

What about news sitemaps or image sitemaps?

News sitemaps follow the same standard; AskVault treats them like regular sitemaps. Image sitemaps are crawled, but only for URL discovery; AskVault doesn't index image content directly.

Can I crawl a sitemap behind authentication?

Yes. Provide a session cookie or API header under Crawl Config > Custom Headers. The crawler sends the auth on every sitemap and content fetch.
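Before pasting a cookie or header into Crawl Config, it can help to confirm that it actually unlocks the sitemap. The Python sketch below builds such a request with the standard library; build_request is a hypothetical helper and the cookie value is a placeholder.

```python
from urllib.request import Request


def build_request(url, custom_headers):
    """Attach the same auth headers that would be sent on every fetch."""
    return Request(url, headers=dict(custom_headers))


# Placeholder values; substitute your own sitemap URL and session cookie.
request = build_request(
    "https://yoursite.com/sitemap.xml",
    {"Cookie": "session=abc123"},
)
# Passing `request` to urllib.request.urlopen would perform the fetch.
```

If the request returns the sitemap XML, the same header should work for the crawler.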
