How to crawl a website into your AskVault knowledge base
What URL crawling does
URL crawling is the most common way to populate an AskVault knowledge base. You give the platform a URL and it walks the site, extracts text, chunks it, and indexes it for retrieval. The bot then answers questions from that content with source citations linking back.
Four ways to scope the crawl:
- Single page. Just one URL.
- Full site. Walk every internal link recursively starting from the given URL.
- Sitemap. Read the site's /sitemap.xml and ingest every URL listed.
- Pattern-based. Walk the site but restrict to URL paths matching a pattern.
Most teams start with full-site or sitemap. Both work well.
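If you're curious what sitemap mode does under the hood, here is a rough sketch in Python. It is illustrative only, not AskVault's internal code, and the acme.co URL is a placeholder: fetch the sitemap and collect every listed URL for ingestion.

```python
# Conceptual sketch of sitemap mode: fetch /sitemap.xml and collect every
# <loc> URL. AskVault does this for you; nothing here needs to be run.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(site: str) -> list[str]:
    """Return every URL listed in the site's sitemap.xml."""
    with urllib.request.urlopen(f"{site}/sitemap.xml") as resp:
        root = ET.fromstring(resp.read())
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

# sitemap_urls("https://acme.co") -> ["https://acme.co/", "https://acme.co/help", ...]
```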
Setup, step by step
- Open Knowledge > Add Source > Website in AskVault.
- Paste the URL, e.g. https://acme.co or https://acme.co/help.
- Pick the crawl mode: single page, full site, sitemap, or pattern. Default: full site.
- Configure optional settings. Include patterns, exclude patterns, CSS selector, cookies, sync schedule.
- Click Start Crawl.
AskVault begins crawling. Progress appears under Knowledge Hub > [your URL] > Status. 50 pages index in about 90 seconds. 500 pages take about 10 minutes. Larger sites (5,000+ pages) take 30 minutes to 2 hours.
Include and exclude patterns
For sites where you want only specific paths, use patterns:
- Include patterns (allowed path prefixes), e.g. /docs/, /help/, /guides/. The crawler stays within these.
- Exclude patterns (disallowed path prefixes), e.g. /admin/, /internal/, /staff/, /blog/2019/. The crawler skips these.
Combine both for fine-grained control. Most marketing sites need an exclude on the blog (old content) and an include on docs and help.
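As a mental model, prefix filtering works roughly like the sketch below. This is illustrative Python, not AskVault's matching engine; in particular, exclude-beats-include precedence is an assumption here.

```python
# Illustrative prefix-based include/exclude filtering.
from urllib.parse import urlparse

INCLUDE = ("/docs/", "/help/", "/guides/")                      # allowed path prefixes
EXCLUDE = ("/admin/", "/internal/", "/staff/", "/blog/2019/")   # disallowed path prefixes

def should_crawl(url: str) -> bool:
    path = urlparse(url).path
    if any(path.startswith(p) for p in EXCLUDE):
        return False                      # assumption: exclude always wins
    if INCLUDE:
        return any(path.startswith(p) for p in INCLUDE)
    return True                           # no include patterns: allow everything

# should_crawl("https://acme.co/docs/setup")    -> True
# should_crawl("https://acme.co/blog/2019/x")   -> False
```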
CSS selector for targeted extraction
By default the crawler extracts the main content area, stripping nav, footer, and sidebar. For sites with unusual layouts, specify a CSS selector that targets your main content:
- main (semantic main tag)
- #content (common ID)
- .article-body (common class)
- [role="main"] (ARIA landmark)
If the default extraction includes noise (nav repeated on every chunk, ads cluttering the index), set a CSS selector. Configure it under Crawl Config > CSS Selector.
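To see what a selector changes, here is a small demo using BeautifulSoup as a stand-in (AskVault's extractor is configured entirely in the UI): only text inside the matched element survives.

```python
# Demo of selector-scoped extraction: nav and footer text never make it in.
from bs4 import BeautifulSoup

html = """
<nav>Home | Pricing | Docs</nav>
<main id="content"><h1>Refund policy</h1><p>Refunds within 30 days.</p></main>
<footer>© Acme</footer>
"""

soup = BeautifulSoup(html, "html.parser")
node = soup.select_one("#content")        # the selector you set in Crawl Config
print(node.get_text(" ", strip=True))     # -> "Refund policy Refunds within 30 days."
```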
Cookies for behind-login content
For content behind a login (member-only knowledge base, customer portal docs), provide a session cookie:
- Log in to the source site in your browser.
- Open browser DevTools, Application > Cookies.
- Copy the session cookie value.
- In AskVault, Crawl Config > Cookies. Paste.
The crawler sends the cookie on every request and gets the authenticated version of pages. Two warnings:
- Cookies expire. Most sessions expire in days or weeks. Schedule monthly re-validation.
- Log out after indexing. Once the crawl completes, log out of the source account so the cookie can't be re-used.
For long-lived authenticated crawling, an API key in a custom request header is safer than a session cookie. Configure under Crawl Config > Custom Headers.
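For reference, the two request shapes look roughly like this. The requests library is used purely for illustration, and the cookie name, header name, and portal URL are hypothetical; AskVault sends the equivalent headers for you once configured.

```python
import requests

# Session-cookie crawl: the value copied from DevTools, sent on every request.
requests.get(
    "https://portal.acme.co/docs/billing",
    cookies={"session_id": "PASTE-VALUE-FROM-DEVTOOLS"},
)

# Longer-lived alternative: an API key in a custom request header.
requests.get(
    "https://portal.acme.co/docs/billing",
    headers={"Authorization": "Bearer YOUR-API-KEY"},
)
```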
Recurring re-crawl
To keep content fresh, set a re-crawl schedule:
- Daily (recommended for active sites with regular content updates).
- Weekly (right for stable knowledge bases).
- Manual (re-crawl on demand only).
Re-crawl is incremental. Only pages that changed since the last crawl are re-extracted and re-embedded. New pages are picked up; deleted pages are removed from the index.
Daily re-crawl typically uses about 5 to 10% of the initial-crawl time because most pages don't change.
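One common mechanism behind incremental re-crawls is a conditional GET against validators saved from the previous crawl. Whether AskVault uses validators, content hashes, or both is an implementation detail, but the effect is what's described above. A minimal sketch:

```python
# Conditional GET: the server answers 304 Not Modified if the page is unchanged.
import requests

saved = {"etag": '"abc123"', "last_modified": "Tue, 07 May 2024 10:00:00 GMT"}

resp = requests.get(
    "https://acme.co/docs/setup",
    headers={
        "If-None-Match": saved["etag"],
        "If-Modified-Since": saved["last_modified"],
    },
)
if resp.status_code == 304:
    print("unchanged: skip re-extraction and re-embedding")
else:
    print("changed: re-extract, re-chunk, re-embed this page")
```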
Prune dead links
When re-crawling, AskVault detects pages that returned 404 (deleted) and removes them from the index. This is the default behavior; to disable it, go to Crawl Config > Prune Dead Links.
For sites with planned URL changes (redirects), set Follow Redirects = On so the crawler captures the new URL while removing the old.
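Conceptually, the per-URL decision during a re-crawl looks something like this sketch (illustrative only; status-code handling beyond 404s and redirects is simplified):

```python
import requests

def recheck(url: str) -> str:
    resp = requests.get(url, allow_redirects=True)
    if resp.status_code == 404:
        return "remove from index"               # Prune Dead Links behavior
    if resp.history:                             # a 301/302 chain was followed
        return f"re-index under {resp.url}"      # Follow Redirects = On
    return "keep"
```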
How the crawler handles JavaScript
Modern sites built with React, Vue, Angular, or Next.js render content client-side. A naive HTTP fetch returns an empty <div id="root"></div>. AskVault detects this and escalates to a headless browser automatically. See the scraper architecture for the full mechanism.
You don't configure anything for JS-rendered sites. The crawler handles it.
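If you want a mental model of the escalation, here is a rough sketch using requests and Playwright as stand-ins for AskVault's internal fetcher and headless browser. The 500-character threshold is an assumption, not the real heuristic.

```python
# Sketch: if the plain HTTP response has almost no text (an empty SPA shell),
# render the page in a headless browser instead.
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    html = requests.get(url).text
    text = BeautifulSoup(html, "html.parser").get_text(strip=True)
    if len(text) > 500:                     # threshold is an assumption
        return html                         # static HTML was enough
    with sync_playwright() as p:            # escalate: render client-side JS
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html
```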
Anti-bot protection
For Cloudflare-protected, Akamai-protected, or similar sites, AskVault has built-in bypass logic. For sites that block our bypass (some hostile sites still do), use BYOK Scraper mode, where you bring your own anti-bot subscription (available on Business plans and above).
Content limits
- Per-plan content cap. 5 MB on Free, 15 MB on Starter, 40 MB on Growth, 100 MB on Business. Average page is 2 to 8 KB of extracted text.
- Per-page size. Pages over 1 MB get truncated. The first 1 MB indexes; the rest is skipped.
- File types. HTML pages index normally. PDFs, DOCX, and similar in the same crawl get extracted as separate documents.
For sites larger than your plan allows, either upgrade or use include/exclude patterns to scope down.
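A quick back-of-the-envelope check before you crawl: divide your plan cap by a typical extracted-page size. The plan figures are from the list above; 5 KB per page is a midpoint assumption from the 2 to 8 KB range.

```python
PLAN_CAP_MB = {"Free": 5, "Starter": 15, "Growth": 40, "Business": 100}
AVG_PAGE_KB = 5   # assumed typical extracted text per page

for plan, cap in PLAN_CAP_MB.items():
    print(f"{plan}: ~{cap * 1024 // AVG_PAGE_KB:,} pages")
# Free: ~1,024   Starter: ~3,072   Growth: ~8,192   Business: ~20,480
```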
Common pitfalls
Crawl finds 5 pages instead of 500. The site lacks internal links from the start URL, or the crawler stayed on the first page. Switch to sitemap mode if /sitemap.xml is available.
Some pages missing. They're behind a login or pattern-excluded. Check Knowledge Hub > [your URL] > Excluded URLs for the list of refused pages.
Bot quotes wrong content. Crawl ingested boilerplate (nav, footer) rather than article body. Set a CSS selector for the main content area.
Re-crawl misses recent changes. Schedule is set to weekly and the page changed yesterday. Trigger a manual resync or move to daily.
FAQ
Can I crawl multiple sites into one workspace?
Yes. Add multiple URL sources under Knowledge > Add Source > Website. Each appears separately in Knowledge Hub. Re-crawl schedules are per-source.
Does the crawler respect robots.txt?
Yes. Sites that disallow crawling in their robots.txt are refused. For internal sites where you own both ends, you can override under Crawl Config > Ignore robots.txt (Business and above).
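If you want to check what a site's robots.txt permits before starting a crawl, Python's standard library can do it (AskVault performs this check itself unless you enable the override):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://acme.co/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://acme.co/docs/setup"))   # True if allowed
```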
What about session-based pagination?
Pagination links that update the URL (/page/2, ?page=2) get followed. JS-driven "Load more" buttons that don't change the URL are harder; the crawler tries on a best-effort basis.
Does it work with infinite scroll?
Yes, when the crawler escalates to a headless browser. The browser scrolls the page to trigger lazy-loaded content.
How do I delete crawled content?
Either delete the entire source (removes everything) or individual pages from Knowledge Hub > [page] > Delete. Cascade removal happens within 5 minutes.
Related guides
- How to scrape a JavaScript-rendered website
- The tiered scraper architecture
- How to restrict the AI bot to specific URLs only
- PDF and document uploads
- Q&A pairs