Full-site crawl
When to use full-site crawl
Three patterns:
- Initial onboarding. Index your site without needing a sitemap.
- No sitemap available. Older sites or CMS without auto-sitemap.
- Discovery crawl. Find content you didn't know existed.
For known URL lists, prefer batch URL ingest. For sitemap-driven crawls, use sitemap crawl.
Setup
- Knowledge Hub > Add Source > URL Crawl > Full Site.
- Enter starting URL (typically homepage).
- Set max depth (default 5; 0 = homepage only, 1 = homepage plus direct links).
- Set include patterns (e.g., /docs/*, /blog/*).
- Set exclude patterns (e.g., /admin/*, ?print=true).
- Click Start.
The crawl begins within 60 seconds.
Depth
- Depth 1. Homepage plus its direct links.
- Depth 3. Three hops from homepage.
- Depth 5. Default; deep enough for most sites.
- Depth 10. Maximum.
A deeper crawl catches more content but takes longer. For typical marketing sites, depth 5 is enough.
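If the depth numbers are easier to follow in code, here is a minimal sketch of a depth-limited breadth-first crawl. It is illustrative only (not AskVault's crawler), and get_links is a hypothetical stand-in for fetching a page and extracting its links.

```python
from collections import deque

def crawl(start_url, max_depth=5, get_links=lambda url: []):
    # Breadth-first crawl: the starting URL is depth 0, each hop adds 1.
    # `get_links` is a hypothetical stand-in for fetching a page and
    # extracting the links it contains.
    seen = {start_url}
    queue = deque([(start_url, 0)])
    indexed = []
    while queue:
        url, depth = queue.popleft()
        indexed.append(url)              # this page gets indexed
        if depth >= max_depth:
            continue                     # don't follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return indexed
```

With max_depth=0 only the starting URL is indexed; with max_depth=1 you get the homepage plus its direct links, matching the settings above.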
Include / exclude patterns
URL globbing:
- Include /docs/*. Only URLs starting with /docs/.
- Exclude *?print=*. Skip print-friendly versions.
- Exclude /admin/*. Skip admin areas.
- Exclude /tag/* and /category/*. Skip taxonomy pages (just lists, not content).
Patterns are applied in order: includes first, then excludes. About 80% of crawls benefit from at least one exclude pattern.
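If you're unsure how a pattern will behave, the filtering order can be approximated with Python's fnmatch globbing. This is only a sketch of include-then-exclude matching, not AskVault's exact matcher (note that ? inside an fnmatch pattern is a single-character wildcard, which happens to match the literal ? here).

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_index(url, includes, excludes):
    # Includes are applied first, then excludes, against the path + query string.
    parsed = urlparse(url)
    target = parsed.path + (("?" + parsed.query) if parsed.query else "")
    if includes and not any(fnmatch(target, p) for p in includes):
        return False          # not matched by any include pattern
    if any(fnmatch(target, p) for p in excludes):
        return False          # matched an exclude pattern
    return True

# Keep /docs/* and /blog/*, skip admin pages and print-friendly views.
should_index("https://example.com/docs/intro?print=true",
             includes=["/docs/*", "/blog/*"],
             excludes=["/admin/*", "*?print=*"])   # -> False
```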
Auto-followed links
By default, the crawler follows:
- Same-origin links. Within your domain.
- Subdomain links if explicitly enabled.
- Anchor links (#hash) deduplicated.
The crawler doesn't follow:
- Cross-origin links. External sites.
- rel="nofollow" links.
- robots.txt-disallowed paths.
Robots.txt respect
By default, AskVault respects your robots.txt:
- User-agent: AskVault-Bot rules apply.
- User-agent: * rules apply as a fallback.
- Disallow rules skip those paths.
If you want AskVault to crawl despite robots.txt (on a site you own), set "Ignore robots.txt" in source settings.
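To preview what a robots.txt-respecting crawler would do on your site, you can check a URL with Python's standard-library parser. It mirrors the fallback behavior described above, though it is not AskVault's own parser.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()   # fetch and parse robots.txt

# can_fetch() uses the User-agent: AskVault-Bot group if one exists,
# otherwise it falls back to the User-agent: * group.
allowed = rp.can_fetch("AskVault-Bot", "https://example.com/admin/settings")
print("allowed" if allowed else "disallowed by robots.txt")
```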
Re-crawl behavior
After initial crawl:
- Scheduled re-crawl daily, weekly, or monthly.
- Incremental. Only changed pages are re-indexed (detected via ETag and Last-Modified headers).
- Full re-crawl any time on demand.
Incremental re-crawl typically completes in about 10% of the initial-crawl time.
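Change detection works like an ordinary HTTP conditional request: the crawler sends the stored validators, and a 304 response means the page has not changed. A standard-library sketch of that pattern (not AskVault's internal code):

```python
import urllib.error
import urllib.request

def fetch_if_changed(url, etag=None, last_modified=None):
    # Conditional GET: a 304 Not Modified response means the stored copy is current.
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag, last_modified   # unchanged: skip re-indexing
        raise
```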
Cookies for authenticated crawls
For sites with auth-gated content, see cookies for login crawls.
Performance considerations
For very large sites (1,000+ pages):
- The crawl runs in the background. The bot becomes queryable as pages complete.
- Parallelism: up to 30 requests in flight, 8 per host.
- About 5 to 15 minutes per 100 pages depending on site speed.
For sites over 5,000 pages: contact us about Enterprise high-volume crawling.
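The two limits compose like nested semaphores: a global cap of 30 requests in flight and a per-host cap of 8. A minimal asyncio sketch of that throttling scheme (do_request is a hypothetical async HTTP fetch, not an AskVault API):

```python
import asyncio
from urllib.parse import urlparse

GLOBAL_LIMIT = asyncio.Semaphore(30)   # at most 30 requests in flight overall
HOST_LIMITS = {}                       # lazily created 8-slot semaphore per host

async def throttled_fetch(url, do_request):
    # Acquire both the global and the per-host limit before fetching.
    host = urlparse(url).hostname
    host_limit = HOST_LIMITS.setdefault(host, asyncio.Semaphore(8))
    async with GLOBAL_LIMIT, host_limit:
        return await do_request(url)
```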
Audience tags during crawl
Apply audience tags to all crawled URLs:
- Per-source default. All URLs public.
- Per-path overrides. Use globs (/internal/* → internal).
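Per-path overrides behave like a first-match glob table that falls back to the per-source default. A small illustrative sketch, with the glob syntax approximated by Python's fnmatch:

```python
from fnmatch import fnmatch

def audience_tag(path, default="public", overrides=None):
    # First matching per-path glob wins; otherwise use the per-source default.
    for pattern, tag in (overrides or {}).items():
        if fnmatch(path, pattern):
            return tag
    return default

audience_tag("/internal/handbook", overrides={"/internal/*": "internal"})  # -> internal
audience_tag("/docs/intro",        overrides={"/internal/*": "internal"})  # -> public
```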
Limits
- Pages per crawl. 5,000.
- Max depth. 10.
- Per-host throttle. 8 in flight.
- Re-crawl frequency. As fast as daily.
Common pitfalls
- Crawl gets stuck. The site may be rate-limiting or blocking bots. Try a lower per-host throttle.
- Crawl indexes pages you don't want. Add exclude patterns and re-crawl.
- Pagination loops. Calendar pages or ?page=1...?page=999. Exclude with *?page=* if not needed.
- JS-rendered content missed. Most sites work; SPA sites may need explicit JS rendering, which AskVault auto-detects.
FAQ
Can I crawl a competitor's site?
Yes, if their robots.txt allows it. Don't violate their Terms of Service.
Does the crawl affect my site's analytics?
AskVault's crawler identifies itself as AskVault-Bot in its User-Agent header. Filter it out of your analytics if desired.
Will my server be hit hard?
The per-host throttle caps concurrent requests at 8, so the load is typically light.