Full-site crawl

4 min read

When to use full-site crawl

Full-site crawl fits three patterns:

  1. Initial onboarding. Index your site without needing a sitemap.
  2. No sitemap available. Older sites or CMS without auto-sitemap.
  3. Discovery crawl. Find content you didn't know existed.

For known URL lists, prefer batch URL ingest. For sitemap-driven crawls, use sitemap crawl.

Setup

  1. Knowledge Hub > Add Source > URL Crawl > Full Site.
  2. Enter starting URL (typically homepage).
  3. Set max depth (default 5; 0 = homepage only, 1 = homepage plus direct links).
  4. Set include patterns (e.g., /docs/*, /blog/*).
  5. Set exclude patterns (e.g., /admin/*, ?print=true).
  6. Click Start.

The crawl begins within 60 seconds of clicking Start.
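
For reference, the same settings can be pictured as a single config object. This is only an illustrative sketch: the field names below are hypothetical, not AskVault's actual API, which is configured through the UI steps above.

  # Hypothetical config sketch of the setup steps above.
  # Field names are illustrative, not AskVault's actual API.
  crawl_source = {
      "type": "full_site",
      "start_url": "https://example.com/",   # step 2: starting URL
      "max_depth": 5,                        # step 3: 0 = homepage only
      "include": ["/docs/*", "/blog/*"],     # step 4: include patterns
      "exclude": ["/admin/*", "?print=true"],# step 5: exclude patterns
  }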

Depth

  • Depth 1. Homepage plus its direct links.
  • Depth 3. Three hops from homepage.
  • Depth 5. Default; deep enough for most sites.
  • Depth 10. Maximum.

A deeper crawl catches more content but takes longer. For typical marketing sites, depth 5 is enough.
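
To make the depth numbers concrete, here is a minimal breadth-first sketch of how hop counting works. The get_links function is a stand-in for real fetching and link extraction; this is not AskVault's implementation.

  # Minimal sketch of depth counting: breadth-first from the start URL,
  # where depth 0 is the homepage and each followed link adds one hop.
  from collections import deque

  def crawl_order(start_url, max_depth, get_links):
      seen = {start_url}
      queue = deque([(start_url, 0)])
      while queue:
          url, depth = queue.popleft()
          yield url, depth
          if depth == max_depth:
              continue                     # don't expand links past max depth
          for link in get_links(url):
              if link not in seen:
                  seen.add(link)
                  queue.append((link, depth + 1))

  site = {"home": ["a", "b"], "a": ["c"], "b": [], "c": []}
  print(list(crawl_order("home", 1, lambda u: site.get(u, []))))
  # [('home', 0), ('a', 1), ('b', 1)]  -- depth 1: homepage plus direct links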

Include / exclude patterns

URL globbing:

  • Include /docs/*. Only URLs starting with /docs/.
  • Exclude *?print=*. Skip print-friendly versions.
  • Exclude /admin/*. Skip admin areas.
  • Exclude /tag/* and /category/*. Skip taxonomy pages (just lists, not content).

Patterns are applied in order: includes first, then excludes. About 80% of crawls benefit from at least one exclude pattern.
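
As a rough sketch of that filtering order, the following uses Python's shell-style glob matcher. Treat it as illustrative only; AskVault's exact glob semantics may differ in detail.

  # Sketch of include-then-exclude filtering with shell-style globs.
  from fnmatch import fnmatch

  def allowed(path, include, exclude):
      # Include patterns run first: if any are set, the path must match one.
      if include and not any(fnmatch(path, pat) for pat in include):
          return False
      # Exclude patterns run second: any match rejects the path.
      return not any(fnmatch(path, pat) for pat in exclude)

  include = ["/docs/*"]
  exclude = ["*?print=*", "/admin/*", "/tag/*", "/category/*"]
  # Note: in fnmatch, "?" is a single-character wildcard, so "*?print=*"
  # still matches a literal "?print=" here.
  print(allowed("/docs/setup", include, exclude))             # True
  print(allowed("/docs/setup?print=true", include, exclude))  # False: excluded
  print(allowed("/blog/post", include, exclude))              # False: not included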

By default, the crawler follows:

  • Same-origin links. Within your domain.
  • Subdomain links if explicitly enabled.
  • Anchor links (#hash), deduplicated so each page is fetched once.

The crawler doesn't follow:

  • Cross-origin links. External sites.
  • rel="nofollow" links.
  • robots.txt-disallowed paths.
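
The default rules above roughly amount to the checks sketched below. This is a conceptual illustration, not AskVault's code.

  # Conceptual sketch of the default link-following rules above.
  from urllib.parse import urlparse, urldefrag

  def should_follow(link_url, rel, start_url, allow_subdomains=False):
      if "nofollow" in (rel or "").split():
          return False                                 # rel="nofollow": skipped
      start_host = urlparse(start_url).netloc
      link_host = urlparse(link_url).netloc
      if link_host == start_host:
          return True                                  # same-origin: followed
      if allow_subdomains and link_host.endswith("." + start_host):
          return True                                  # subdomains: only if enabled
      return False                                     # cross-origin: not followed

  # Anchor links are deduplicated by dropping the #hash fragment.
  print(urldefrag("https://example.com/docs#setup").url)  # https://example.com/docs

  print(should_follow("https://example.com/docs", "", "https://example.com/"))       # True
  print(should_follow("https://docs.example.com/", "", "https://example.com/"))      # False by default
  print(should_follow("https://example.com/x", "nofollow", "https://example.com/"))  # False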

Robots.txt respect

By default, AskVault respects your robots.txt:

  • User-agent: AskVault-Bot rules apply.
  • User-agent: * rules apply as fallback.
  • Disallow rules skip those paths.

If you want AskVault to crawl despite robots.txt (for example, on a site you own), enable "Ignore robots.txt" in the source settings.
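
Python's standard robotparser resolves user agents the same way (a specific group first, "*" as the fallback), which makes it a convenient way to preview what AskVault-Bot would be allowed to fetch. The robots.txt content below is example data, not your site's.

  # Preview robots.txt decisions with Python's standard robotparser.
  from urllib.robotparser import RobotFileParser

  robots = RobotFileParser()
  robots.parse("""
  User-agent: AskVault-Bot
  Disallow: /admin/

  User-agent: *
  Disallow: /private/
  """.splitlines())

  print(robots.can_fetch("AskVault-Bot", "/admin/settings"))  # False: disallowed for AskVault-Bot
  print(robots.can_fetch("AskVault-Bot", "/docs/setup"))      # True
  print(robots.can_fetch("SomeOtherBot", "/private/page"))    # False: falls back to the * rules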

Re-crawl behavior

After initial crawl:

  • Scheduled re-crawl daily, weekly, or monthly.
  • Incremental. Only changed pages are re-indexed (detected via ETag and Last-Modified headers).
  • Full re-crawl any time on demand.

Incremental re-crawl typically completes in about 10% of the initial-crawl time.
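
Change detection relies on standard conditional requests. AskVault handles this internally; the stdlib sketch below only illustrates how ETag / Last-Modified skipping works.

  # Conditional GET sketch: resend the saved ETag / Last-Modified values;
  # a 304 Not Modified response means the page can be skipped.
  import urllib.request
  from urllib.error import HTTPError

  def needs_reindex(url, etag=None, last_modified=None):
      req = urllib.request.Request(url)
      if etag:
          req.add_header("If-None-Match", etag)
      if last_modified:
          req.add_header("If-Modified-Since", last_modified)
      try:
          with urllib.request.urlopen(req) as resp:
              # Page changed (or no saved validators): re-index, save new validators.
              return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
      except HTTPError as err:
          if err.code == 304:
              return False, etag, last_modified        # unchanged since the last crawl
          raise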

Cookies for authenticated crawls

For sites with auth-gated content, see cookies for login crawls.

Performance considerations

For very large sites (1,000+ pages):

  • The crawl runs in the background; the bot becomes queryable as pages complete.
  • Parallelism: up to 30 requests in flight, capped at 8 per host (see the sketch after this list).
  • About 5 to 15 minutes per 100 pages depending on site speed.
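
Those limits behave like nested caps: one global, one per host. Below is a rough asyncio sketch under that assumption; fake_fetch stands in for the real HTTP request, and this is not AskVault's scheduler.

  # Sketch of the stated limits with asyncio semaphores:
  # a global cap on requests in flight plus a smaller cap per host.
  import asyncio
  from collections import defaultdict
  from urllib.parse import urlparse

  async def crawl_all(urls, fetch, global_limit=30, per_host_limit=8):
      global_sem = asyncio.Semaphore(global_limit)
      host_sems = defaultdict(lambda: asyncio.Semaphore(per_host_limit))

      async def fetch_with_limits(url):
          host = urlparse(url).netloc
          async with global_sem, host_sems[host]:      # both caps must allow the request
              return await fetch(url)

      return await asyncio.gather(*(fetch_with_limits(u) for u in urls))

  async def fake_fetch(url):
      await asyncio.sleep(0.01)                        # stand-in for network I/O
      return url

  urls = [f"https://example.com/page/{i}" for i in range(100)]
  print(len(asyncio.run(crawl_all(urls, fake_fetch))))  # 100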

For sites over 5,000 pages: contact us about Enterprise high-volume crawling.

Audience tags during crawl

Apply audience tags to all crawled URLs:

  • Per-source default. All URLs public unless a per-path override applies.
  • Per-path overrides. Use globs to map paths to tags (e.g., /internal/* → internal); see the sketch below.
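
Conceptually, per-path overrides resolve like the sketch below: the first matching glob wins, and anything else falls back to the per-source default. Tag names and matching details here are illustrative, not AskVault's exact behavior.

  # Sketch of resolving audience tags from per-path glob overrides.
  from fnmatch import fnmatch

  default_tag = "public"                     # per-source default
  overrides = [("/internal/*", "internal")]  # per-path overrides, first match wins

  def audience_tag(path):
      for pattern, tag in overrides:
          if fnmatch(path, pattern):
              return tag
      return default_tag

  print(audience_tag("/internal/handbook"))  # internal
  print(audience_tag("/docs/setup"))         # public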

Limits

  • Pages per crawl. 5,000.
  • Max depth. 10.
  • Per-host throttle. 8 in flight.
  • Re-crawl frequency. As fast as daily.

Common pitfalls

Crawl gets stuck. The site may be rate-limiting or blocking bots. Try lowering the per-host throttle.

Indexes pages you don't want. Add exclude patterns. Re-crawl.

Pagination loops. Calendar pages or ?page=1 through ?page=999 can trap the crawler. Exclude them with *?page=* if you don't need that content.

JS-rendered content missed. Most sites work out of the box; single-page apps may need explicit JS rendering, which AskVault auto-detects.

FAQ

Can I crawl a competitor's site?

Yes, if their robots.txt allows it; don't violate their Terms of Service.

Does the crawl affect my site's analytics?

AskVault's crawler identifies itself as AskVault-Bot in its User-Agent header. Filter it out in your analytics if desired.

Will my server be hit hard?

The per-host throttle caps requests at 8 in flight, which is typically a light load.
