Full-site crawl
When to use full-site crawl
Three patterns:
- Initial onboarding. Index your site without needing a sitemap.
- No sitemap available. Older sites or CMS without auto-sitemap.
- Discovery crawl. Find content you didn't know existed.
For known URL lists, prefer batch URL ingest. For sitemap-driven crawls, use sitemap crawl.
Setup
- Knowledge Hub > Add Source > URL Crawl > Full Site.
- Enter starting URL (typically homepage).
- Set max depth (default 5; 0 = homepage only, 1 = homepage plus direct links).
- Set include patterns (e.g., /docs/*, /blog/*).
- Set exclude patterns (e.g., /admin/*, ?print=true).
- Click Start.
The crawl begins within 60 seconds.
Depth
- Depth 1. Homepage plus its direct links.
- Depth 3. Three hops from homepage.
- Depth 5. Default; deep enough for most sites.
- Depth 10. Maximum.
A deeper crawl catches more content but takes longer. For typical marketing sites, depth 5 is enough.
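If the depth numbers are easier to follow in code, here is a minimal sketch of a depth-limited breadth-first crawl. It is illustrative only (not AskVault's crawler), and get_links is a hypothetical stand-in for fetching a page and extracting its links.

```python
from collections import deque

def crawl(start_url, max_depth=5, get_links=lambda url: []):
    # Breadth-first crawl: the starting URL is depth 0, each hop adds 1.
    # `get_links` is a hypothetical stand-in for fetching a page and
    # extracting the links it contains.
    seen = {start_url}
    queue = deque([(start_url, 0)])
    indexed = []
    while queue:
        url, depth = queue.popleft()
        indexed.append(url)              # this page gets indexed
        if depth >= max_depth:
            continue                     # don't follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return indexed
```

With max_depth=0 only the starting URL is indexed; with max_depth=1 you get the homepage plus its direct links, matching the settings above.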
Include / exclude patterns
URL globbing:
- Include /docs/*. Only URLs starting with /docs/.
- Exclude *?print=*. Skip print-friendly versions.
- Exclude /admin/*. Skip admin areas.
- Exclude /tag/* and /category/*. Skip taxonomy pages (just lists, not content).
Patterns are applied in order: includes first, then excludes. About 80% of crawls benefit from at least one exclude pattern.
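If you're unsure how a pattern will behave, the filtering order can be approximated with Python's fnmatch globbing. This is only a sketch of include-then-exclude matching, not AskVault's exact matcher (note that ? inside an fnmatch pattern is a single-character wildcard, which happens to match the literal ? here).

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_index(url, includes, excludes):
    # Includes are applied first, then excludes, against the path + query string.
    parsed = urlparse(url)
    target = parsed.path + (("?" + parsed.query) if parsed.query else "")
    if includes and not any(fnmatch(target, p) for p in includes):
        return False          # not matched by any include pattern
    if any(fnmatch(target, p) for p in excludes):
        return False          # matched an exclude pattern
    return True

# Keep /docs/* and /blog/*, skip admin pages and print-friendly views.
should_index("https://example.com/docs/intro?print=true",
             includes=["/docs/*", "/blog/*"],
             excludes=["/admin/*", "*?print=*"])   # -> False
```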
Auto-followed links
By default, the crawler follows:
- Same-origin links. Within your domain.
- Subdomain links if explicitly enabled.
- Anchor links (#hash) deduplicated.
The crawler doesn't follow:
- Cross-origin links. External sites.
- rel="nofollow" links.
- robots.txt-disallowed paths.
Robots.txt respect
By default, AskVault respects your robots.txt:
- User-agent: AskVault-Bot rules apply.
- User-agent: * rules apply as a fallback.
- Disallow rules skip those paths.
If you want AskVault to crawl despite robots.txt (on a site you own), set "Ignore robots.txt" in source settings.
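To preview what a robots.txt-respecting crawler would do on your site, you can check a URL with Python's standard-library parser. It mirrors the fallback behavior described above, though it is not AskVault's own parser.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()   # fetch and parse robots.txt

# can_fetch() uses the User-agent: AskVault-Bot group if one exists,
# otherwise it falls back to the User-agent: * group.
allowed = rp.can_fetch("AskVault-Bot", "https://example.com/admin/settings")
print("allowed" if allowed else "disallowed by robots.txt")
```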
Re-crawl behavior
After initial crawl:
- Scheduled re-crawl daily, weekly, or monthly.
- Incremental. Only changed pages are re-indexed (detected via ETag and Last-Modified headers).
- Full re-crawl any time on demand.
Incremental re-crawl typically completes in about 10% of the initial-crawl time.
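Change detection works like an ordinary HTTP conditional request: the crawler sends the stored validators, and a 304 response means the page has not changed. A standard-library sketch of that pattern (not AskVault's internal code):

```python
import urllib.error
import urllib.request

def fetch_if_changed(url, etag=None, last_modified=None):
    # Conditional GET: a 304 Not Modified response means the stored copy is current.
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag, last_modified   # unchanged: skip re-indexing
        raise
```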
Cookies for authenticated crawls
For sites with auth-gated content, see cookies for login crawls.
Performance considerations
For very large sites (1,000+ pages):
- The crawl runs in the background. The bot becomes queryable as pages complete.
- Parallelism: up to 30 requests in flight, 8 per host.
- About 5 to 15 minutes per 100 pages depending on site speed.
For sites over 5,000 pages: contact us about Enterprise high-volume crawling.
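The two limits compose like nested semaphores: a global cap of 30 requests in flight and a per-host cap of 8. A minimal asyncio sketch of that throttling scheme (do_request is a hypothetical async HTTP fetch, not an AskVault API):

```python
import asyncio
from urllib.parse import urlparse

GLOBAL_LIMIT = asyncio.Semaphore(30)   # at most 30 requests in flight overall
HOST_LIMITS = {}                       # lazily created 8-slot semaphore per host

async def throttled_fetch(url, do_request):
    # Acquire both the global and the per-host limit before fetching.
    host = urlparse(url).hostname
    host_limit = HOST_LIMITS.setdefault(host, asyncio.Semaphore(8))
    async with GLOBAL_LIMIT, host_limit:
        return await do_request(url)
```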
Audience tags during crawl
Apply audience tags to all crawled URLs:
- Per-source default. All URLs public.
- Per-path overrides. Use globs (/internal/* → internal).
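Per-path overrides behave like a first-match glob table that falls back to the per-source default. A small illustrative sketch, with the glob syntax approximated by Python's fnmatch:

```python
from fnmatch import fnmatch

def audience_tag(path, default="public", overrides=None):
    # First matching per-path glob wins; otherwise use the per-source default.
    for pattern, tag in (overrides or {}).items():
        if fnmatch(path, pattern):
            return tag
    return default

audience_tag("/internal/handbook", overrides={"/internal/*": "internal"})  # -> internal
audience_tag("/docs/intro",        overrides={"/internal/*": "internal"})  # -> public
```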
Limits
- Pages per crawl. 5,000.
- Max depth. 10.
- Per-host throttle. 8 in flight.
- Re-crawl frequency. As fast as daily.
Common pitfalls
- Crawl gets stuck. The site may be rate-limiting or blocking bots. Try a lower per-host throttle.
- Crawl indexes pages you don't want. Add exclude patterns and re-crawl.
- Pagination loops. Calendar pages or ?page=1...?page=999. Exclude with *?page=* if not needed.
- JS-rendered content missed. Most sites work; SPA sites may need explicit JS rendering, which AskVault auto-detects.
FAQ
Can I crawl a competitor's site?
Yes, if their robots.txt allows it. Don't violate their Terms of Service.
Does the crawl affect my site's analytics?
AskVault's crawler identifies itself as AskVault-Bot in its User-Agent header. Filter it out of your analytics if desired.
Will my server be hit hard?
The per-host throttle caps concurrent requests at 8, so the load is typically light.