CSS selector targeting for URL crawling

Why CSS selectors matter

By default, the crawler indexes each page's full HTML. That causes several problems:

  • Navigation links become noise in retrieval.
  • Footers repeat across pages.
  • Sidebars advertise unrelated content.
  • Cookie banners pollute every page.

By targeting <main> or .content, you index only the actual content. Retrieval quality typically improves by 25 to 40%.

Setup

  1. Knowledge Hub > URL Crawl source > Edit.
  2. Advanced > Content Selector.
  3. Enter a CSS selector (e.g., article.post-content).
  4. Optionally add exclude selectors (e.g., .related-posts, nav, footer).
  5. Re-crawl to apply.

Common selector patterns

  • main. Most semantic-HTML sites have a main element.
  • article. Blog posts.
  • #content. Common ID.
  • .post-content. Common in Ghost and some WordPress themes.
  • .entry-content. Default WordPress themes.
  • [role="main"]. ARIA-based.

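Test the selector in browser DevTools first: running document.querySelectorAll("your-selector") in the console shows which elements it matches.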

Excluding sub-elements

Even within <main>, exclude noise:

Include: main
Exclude: main .ads, main .related, main .author-bio
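
As a rough illustration of how an include/exclude pair narrows a page, the Python sketch below keeps text inside one include tag and drops any subtree that carries an excluded class. It is not AskVault's implementation: IncludeExcludeExtractor and extract are hypothetical names, and the sketch only understands a bare tag name for the include and bare class names for the excludes, not full CSS selectors.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IncludeExcludeExtractor(HTMLParser):
    """Hypothetical helper: collect text under include_tag, minus excluded classes."""

    def __init__(self, include_tag, exclude_classes):
        super().__init__()
        self.include_tag = include_tag
        self.exclude_classes = set(exclude_classes)
        self.stack = []   # (tag, included, excluded) for every open element
        self.chunks = []  # text fragments that survived both filters

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_included = self.stack[-1][1] if self.stack else False
        parent_excluded = self.stack[-1][2] if self.stack else False
        classes = set((dict(attrs).get("class") or "").split())
        included = parent_included or tag == self.include_tag
        excluded = parent_excluded or bool(classes & self.exclude_classes)
        self.stack.append((tag, included, excluded))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1][1] and not self.stack[-1][2]:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract(html, include_tag, exclude_classes):
    parser = IncludeExcludeExtractor(include_tag, exclude_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<body><nav>Home</nav><main><p>Post body.</p>'
    '<div class="ads">Buy now</div><div class="related">More</div>'
    '</main><footer>Copyright</footer></body>'
)
print(extract(page, "main", ["ads", "related", "author-bio"]))
```

Only the paragraph inside <main> survives: the nav and footer sit outside the include, and the .ads and .related blocks are dropped even though they sit inside it.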

Example: blog with sidebar

Site: yourblog.co/posts/...

Include: article.post-content
Exclude: .share-buttons, .related-posts, .author-card

The bot indexes only the post body. Recommendation blocks such as "Related Posts: X, Y, Z" stay out of the index.

Example: documentation site

Site: docs.yoursite.co/...

Include: main .docs-content
Exclude: .toc, .footer-nav, .feedback-widget

This indexes only the docs content, not the sidebar TOC, breadcrumbs, or similar page chrome.

Testing the selector

Use the Crawl Preview:

  1. Knowledge Hub > URL source > Test Selector.
  2. Enter a sample URL.
  3. AskVault renders the extracted content.
  4. Verify that it includes what you want and excludes what you don't.

About 80% of sites get a clean selector on the first try.

Auto-detection fallback

If you don't set a selector, AskVault uses readability-based heuristics:

  • Detects the main content via density and tag analysis.
  • About 65 to 80% accuracy on standard layouts.
  • Custom selector recommended when accuracy matters.
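
To give a feel for what "density and tag analysis" can mean, here is a toy Python sketch that scores block elements by how much non-link text they hold. It is an assumption-laden illustration, not AskVault's heuristic: DensityScorer, guess_main_block, and the BLOCK_TAGS list are hypothetical, text is credited only to the nearest enclosing block, and the markup is assumed to be well formed.

```python
from html.parser import HTMLParser

# Hypothetical set of elements treated as candidate content blocks.
BLOCK_TAGS = {"main", "article", "section", "div",
              "nav", "aside", "header", "footer"}


class DensityScorer(HTMLParser):
    """Toy scorer: tally text and link text per block element."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one dict per block element seen
        self.open_blocks = []  # indexes into self.blocks for open elements
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.blocks.append({"tag": tag, "text": 0, "link_text": 0})
            self.open_blocks.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.open_blocks:
            self.open_blocks.pop()

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0 or not self.open_blocks:
            return
        block = self.blocks[self.open_blocks[-1]]
        block["text"] += length
        if self.link_depth:
            block["link_text"] += length


def guess_main_block(html):
    """Return the tag name of the block with the most non-link text, or None."""
    scorer = DensityScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return None
    best = max(scorer.blocks, key=lambda b: b["text"] - b["link_text"])
    return best["tag"]


sample = (
    '<nav><a href="/">Home</a><a href="/about">About us</a></nav>'
    '<div><p>A long paragraph of actual article text that readers came for.</p></div>'
    '<footer><a href="/terms">Terms</a></footer>'
)
print(guess_main_block(sample))
```

Link-heavy blocks such as the nav and footer score zero here, which is why heuristics of this kind do well on standard layouts and struggle on pages where the real content is itself mostly links.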

Limits

  • Selector complexity. Standard CSS3 syntax.
  • Include / exclude combinations. Up to 5 of each per source.
  • Re-crawl on selector change. About 30 seconds per MB.
  • Test preview latency. Under 10 seconds.
  • Setup time. About 5 minutes per source.
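
The two numeric limits above lend themselves to a quick pre-flight check. The Python sketch below is a hypothetical helper, not an AskVault API: it reads "up to 5 of each" as at most 5 include and at most 5 exclude selectors, and it treats the 30 seconds per MB figure as a linear estimate.

```python
MAX_SELECTORS_PER_KIND = 5   # "Up to 5 of each per source"
RECRAWL_SECONDS_PER_MB = 30  # "About 30 seconds per MB"


def check_selector_counts(include, exclude):
    """Return a list of problems; an empty list means both counts are within limits."""
    problems = []
    for kind, selectors in (("include", include), ("exclude", exclude)):
        if len(selectors) > MAX_SELECTORS_PER_KIND:
            problems.append(
                f"{len(selectors)} {kind} selectors, limit is {MAX_SELECTORS_PER_KIND}"
            )
    return problems


def estimate_recrawl_seconds(crawled_mb):
    """Rough re-crawl time for a source of the given size, in seconds."""
    return crawled_mb * RECRAWL_SECONDS_PER_MB


print(check_selector_counts(["main"], [".toc", ".footer-nav", ".feedback-widget"]))
print(estimate_recrawl_seconds(12))
```

Under these assumptions, a 12 MB source takes about 360 seconds, or 6 minutes, to re-crawl after a selector change.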

Common pitfalls

Selector too specific. Misses content on pages with slightly different markup. Use a broader selector.

Selector too broad. Includes nav and footer. Tighten it, or add exclude selectors.

JS-rendered content. AskVault renders the page headlessly first and applies the selector afterwards, so content inserted by JavaScript is still matched.

Pages with multiple <main> elements. Pick a more specific selector.
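
To check whether a page has this problem before you settle on a selector, a small script can count the <main> elements in its HTML. This is a minimal sketch using Python's standard html.parser. count_tag is a hypothetical helper, not part of AskVault, and it only sees the static HTML you pass in, not content added later by JavaScript.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags with a given name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


html = '<main>Article</main><div class="modal"><main>Popup</main></div>'
if count_tag(html, "main") > 1:
    print("More than one <main>: use a more specific selector")
```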

FAQ

Can I use different selectors per URL pattern?

Yes, via URL globbing under Source Settings.
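
Conceptually, per-pattern selectors amount to a first-match lookup from URL glob to selector. The Python sketch below shows the idea with the standard fnmatch module. The rule list, the default, and selector_for are hypothetical, and AskVault's actual glob syntax may differ. In fnmatch, for instance, * also matches "/".

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (URL glob, content selector). The first match wins.
RULES = [
    ("https://yourblog.co/posts/*", "article.post-content"),
    ("https://docs.yoursite.co/*", "main .docs-content"),
]
DEFAULT_SELECTOR = "main"


def selector_for(url, rules=RULES, default=DEFAULT_SELECTOR):
    """Return the selector of the first rule whose glob matches the URL."""
    for pattern, selector in rules:
        if fnmatchcase(url, pattern):
            return selector
    return default


print(selector_for("https://yourblog.co/posts/hello-world"))
print(selector_for("https://docs.yoursite.co/guide/setup"))
print(selector_for("https://yourblog.co/about"))
```

Ordering matters in a scheme like this: put the most specific patterns first so a broad glob does not shadow them.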

Does this work for JS-rendered SPAs?

Yes. AskVault renders pages headlessly before applying the selector.

Will the selector affect crawl speed?

Yes, slightly. Crawls run about 5 to 10% faster because there is less HTML to process.
