CSS selector targeting for URL crawling
Why CSS selectors matter
By default, the crawl indexes every page's full HTML. This causes several problems:
- Navigation links become noise in retrieval.
- Footers repeat across pages.
- Sidebars advertise unrelated content.
- Cookie banners pollute every page.
By targeting <main> or .content, you index only the actual content. Retrieval quality typically improves by 25 to 40%.
Setup
- Knowledge Hub > URL Crawl source > Edit.
- Advanced > Content Selector.
- Enter a CSS selector (e.g., article.post-content).
- Optionally exclude selectors (e.g., .related-posts,nav,footer).
- Re-crawl to apply.
Common selector patterns
- main. Most semantic-HTML sites have a main element.
- article. Blog posts.
- #content. Common ID.
- .post-content. WordPress.
- .entry-content. Ghost.
- [role="main"]. ARIA-based.
Test in browser DevTools first.
Excluding sub-elements
Even within <main>, exclude noise:
Include: main
Exclude: main .ads, main .related, main .author-bio
Example: blog with sidebar
Site: yourblog.co/posts/...
Include: article.post-content
Exclude: .share-buttons, .related-posts, .author-card
The bot indexes only the actual post body. Recommendations like "Related Posts: X, Y, Z" don't leak into the index.
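As a rough illustration of how an include selector and exclude selectors combine, the sketch below uses Python's standard html.parser to keep text inside the include selector and drop anything under an exclude selector. It is not AskVault's implementation: the helper names (parse_simple, SelectorExtractor, extract) and the sample page are made up for this example, it assumes well-formed markup, and it handles only simple selectors such as article.post-content or .related-posts, not descendant selectors like main .ads.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Matches "tag", ".class", "#id", or combinations such as "article.post-content".
SIMPLE = re.compile(r"([a-zA-Z][\w-]*)?((?:[.#][\w-]+)*)")


def parse_simple(selector):
    """Parse a simple selector into (tag, id, classes). Hypothetical helper."""
    match = SIMPLE.fullmatch(selector.strip())
    if not match or not match.group(0):
        raise ValueError(f"unsupported selector in this sketch: {selector!r}")
    parts = re.findall(r"([.#])([\w-]+)", match.group(2))
    ids = [value for kind, value in parts if kind == "#"]
    classes = {value for kind, value in parts if kind == "."}
    return match.group(1), (ids[0] if ids else None), classes


def matches(parsed, tag, attrs):
    """Return True if an element (tag + attribute list) satisfies the selector."""
    want_tag, want_id, want_classes = parsed
    attrs = dict(attrs)
    have_classes = set((attrs.get("class") or "").split())
    return ((want_tag is None or want_tag == tag)
            and (want_id is None or attrs.get("id") == want_id)
            and want_classes <= have_classes)


class SelectorExtractor(HTMLParser):
    def __init__(self, include, excludes):
        super().__init__()
        self.include = parse_simple(include)
        self.excludes = [parse_simple(s) for s in excludes]
        self.stack = []   # one marker per open element: "include", "exclude", or None
        self.chunks = []  # text fragments that survive the filter

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if any(matches(s, tag, attrs) for s in self.excludes):
            self.stack.append("exclude")
        elif matches(self.include, tag, attrs):
            self.stack.append("include")
        else:
            self.stack.append(None)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep text only when some ancestor is included and none is excluded.
        text = data.strip()
        if text and "include" in self.stack and "exclude" not in self.stack:
            self.chunks.append(text)


def extract(html, include, excludes=()):
    parser = SelectorExtractor(include, excludes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page resembling the blog example.
PAGE = """
<html><body>
<nav>Home | About</nav>
<article class="post-content">
  <h1>Why selectors matter</h1>
  <p>Only this body text should be indexed.</p>
  <div class="related-posts">Related Posts: X, Y, Z</div>
</article>
<footer>Copyright notice</footer>
</body></html>
"""

print(extract(PAGE, "article.post-content",
              [".share-buttons", ".related-posts", ".author-card"]))
```

Running the same call without the exclude list keeps the "Related Posts" line, which is the kind of leakage the exclude selectors are meant to prevent.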
Example: documentation site
Site: docs.yoursite.co/...
Include: main .docs-content
Exclude: .toc, .footer-nav, .feedback-widget
Indexes only the docs content, not the sidebar TOC, breadcrumb, etc.
Testing the selector
Use the Crawl Preview:
- Knowledge Hub > URL source > Test Selector.
- Enter a sample URL.
- AskVault renders the extracted content.
- Verify that it includes what you want and excludes what you don't.
About 80% of sites get a clean selector on the first try.
Auto-detection fallback
If you don't set a selector, AskVault uses readability-based heuristics:
- Detects the main content via density and tag analysis.
- About 65 to 80% accuracy on standard layouts.
- Custom selector recommended when accuracy matters.
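To give a feel for what "density and tag analysis" can mean, here is a minimal sketch of one such heuristic. It is an assumption for illustration only, not AskVault's actual algorithm: each container element is scored by its non-link text length minus its link text length, plus a hypothetical per-tag bonus or penalty (TAG_BONUS), and text is credited to the innermost open container. It assumes well-formed markup.

```python
from html.parser import HTMLParser

# Hypothetical weights: semantic content tags get a bonus, page chrome a penalty.
TAG_BONUS = {"main": 50, "article": 50, "section": 0, "div": 0,
             "nav": -50, "aside": -50, "footer": -50, "header": -25}


class DensityScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []     # one [tag, text_chars, link_chars] per container seen
        self.open = []       # indexes into self.blocks for containers still open
        self.link_depth = 0  # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in TAG_BONUS:
            self.blocks.append([tag, 0, 0])
            self.open.append(len(self.blocks) - 1)
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in TAG_BONUS and self.open:
            self.open.pop()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        if length and self.open:
            block = self.blocks[self.open[-1]]  # innermost open container
            block[2 if self.link_depth else 1] += length


def score(block):
    tag, text_chars, link_chars = block
    return text_chars - link_chars + TAG_BONUS[tag]


def guess_main_block(html):
    """Return (tag, score) of the best-scoring container, or None if there is none."""
    scorer = DensityScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return None
    best = max(scorer.blocks, key=score)
    return best[0], score(best)


# Made-up page: link-heavy nav, aside, and footer around a text-heavy article.
SAMPLE = """
<body>
  <nav><a href="/">Home</a> <a href="/docs">Docs</a> <a href="/blog">Blog</a></nav>
  <div class="wrap">
    <article>
      <p>CSS selectors let the crawler keep the post body and skip page chrome.</p>
      <p>See the <a href="/docs">docs</a> for details.</p>
    </article>
    <aside><a href="/promo">Unrelated promotion</a></aside>
  </div>
  <footer><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></footer>
</body>
"""

print(guess_main_block(SAMPLE))
```

A heuristic like this can misfire on unusual layouts, for example link-heavy articles or content split across many sibling containers, which is one reason an explicit selector is the safer choice when accuracy matters.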
Limits
- Selector complexity. Standard CSS3 syntax.
- Include / exclude combinations. Up to 5 of each per source.
- Re-crawl on selector change. About 30 seconds per MB.
- Test preview latency. Under 10 seconds.
- Setup time. About 5 minutes per source.
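The limits above lend themselves to a quick pre-flight check. The sketch below encodes only the two figures stated in the list (up to 5 include and 5 exclude selectors per source, and roughly 30 seconds of re-crawl per MB). The function names are hypothetical and the time estimate is a back-of-the-envelope figure, not a guarantee.

```python
MAX_SELECTORS_PER_KIND = 5    # "Up to 5 of each per source"
RECRAWL_SECONDS_PER_MB = 30   # "About 30 seconds per MB"


def check_selector_limits(includes, excludes):
    """Return a list of human-readable problems; an empty list means within limits."""
    problems = []
    for kind, selectors in (("include", includes), ("exclude", excludes)):
        if len(selectors) > MAX_SELECTORS_PER_KIND:
            problems.append(
                f"{len(selectors)} {kind} selectors; the limit is {MAX_SELECTORS_PER_KIND}"
            )
    return problems


def estimate_recrawl_seconds(total_mb):
    """Rough re-crawl duration for a source of the given size in MB."""
    return total_mb * RECRAWL_SECONDS_PER_MB


print(check_selector_limits(["main"], [".toc", ".footer-nav", ".feedback-widget"]))
print(estimate_recrawl_seconds(12))
```

For a 12 MB source, the estimate works out to 12 x 30 = 360 seconds, or about 6 minutes.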
Common pitfalls
Selector too specific. Misses content on pages with slightly different markup. Use a broader selector.
Selector too broad. Includes nav and footer. Tighten the selector or add exclude selectors.
JS-rendered content. The selector is applied after JavaScript renders, so it works with AskVault's headless rendering.
Pages with multiple <main> elements. Pick a more specific selector.
FAQ
Can I use different selectors per URL pattern?
Yes, via URL globbing under Source Settings.
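Conceptually, per-pattern selectors amount to a lookup from URL glob to selector. The sketch below shows the idea with Python's fnmatch module. It is an assumption for illustration: the exact glob syntax AskVault accepts is not described here, and the rule list, the https:// prefixes, the first-match-wins order, and the selector_for name are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical rules, checked in order; the first matching pattern wins.
RULES = [
    ("https://yourblog.co/posts/*", "article.post-content"),
    ("https://docs.yoursite.co/*", "main .docs-content"),
]


def selector_for(url, rules=RULES, default="main"):
    """Return the selector of the first glob that matches the URL, else the default."""
    for pattern, selector in rules:
        # In fnmatch, "*" matches any run of characters, including "/".
        if fnmatchcase(url, pattern):
            return selector
    return default


print(selector_for("https://yourblog.co/posts/hello-world"))
print(selector_for("https://docs.yoursite.co/guide/setup"))
print(selector_for("https://example.com/"))
```

Ordering the rules from most specific to least specific keeps a broad pattern from shadowing a narrower one.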
Does this work for JS-rendered SPAs?
Yes. AskVault renders pages headlessly before applying the selector.
Will the selector affect crawl speed?
Yes, slightly. Crawls run about 5 to 10% faster because there is less HTML to process.