How to scrape a JavaScript-rendered website for LLM training
Why HTTP-only scrapers fail in 2026
Around 65% of marketing sites built since 2022 ship as JavaScript-rendered single-page apps. The HTML the server returns is essentially empty:
```html
<!DOCTYPE html>
<html>
  <head><title>Acme Corp</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/main.abc123.js"></script>
  </body>
</html>
```

The actual content (hero copy, feature list, pricing, testimonials, FAQ) is built at runtime by the JavaScript bundle. A scraper that runs httpx.get(url) and returns the response body gets the empty shell. Then your RAG system embeds the empty shell. Then your AI agent has no idea what your business does.
This is the single most common reason a "smart chatbot" trained on a website returns useless answers. The scraper didn't render JavaScript, so the knowledge base is empty.
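To see the failure concretely, here's the naive fetch in a few lines of Python (the URL is a placeholder):

```python
import httpx

# Naive single-request fetch: for an SPA this returns the empty shell above,
# not the content a user sees after JavaScript runs.
r = httpx.get("https://yoursite.com", timeout=10, follow_redirects=True)
print(r.status_code, len(r.text))  # often 200 with only a few hundred bytes of HTML
```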
The two failure modes you actually have to handle
There are two distinct cases where naive scraping breaks. They need different fixes.
Case 1: JavaScript-rendered content. The site is a single-page app built with React, Vue, Angular, Svelte, or similar. The server returns a shell, the client builds the page. Detected by inspecting the response body: if it's under 3 KB and contains a <div id="root"> (or app, __next, etc.) with little surrounding text, you need a headless browser.
Case 2: Anti-bot protection. The site is protected by Cloudflare, Akamai, PerimeterX, or similar. The server returns a challenge page ("Just a moment...", "Checking your browser before accessing...") and only serves real content after a JavaScript proof-of-work or a TLS fingerprint check. Detected by status code (403 or 503) or body content (the word "challenge" or "Just a moment").
Most production scrapers conflate these. They escalate to a headless browser in both cases. That works but it's wasteful: about 35% of pages don't need a browser, and a browser-based fetch is 10x slower and 50x more expensive than a plain HTTP fetch.
The tiered approach
The right architecture is a tiered fallback: try the cheap fetch first, escalate when it fails. Here's the shape of it:
- Tier 0: plain HTTP fetch with a normal browser user-agent. Sub-second per page. Works on roughly 35% of modern marketing sites.
- Tier 1: headless browser (Playwright or Puppeteer with Chromium). 3 to 5 seconds per page. Works on JavaScript-rendered SPAs.
- Tier 2: anti-bot bypass service (your own subscription to a premium scraping API). 3 to 8 seconds per page. Works on Cloudflare-protected and high-friction sites.
- Tier 3: BYOK fallback where the customer brings their own scraping API key for the hardest sites we still can't crack.
Each tier escalates only if the previous one returns a known-failure signal. Critically: once a host fails at Tier 0, future requests to that host skip Tier 0 entirely and start at Tier 1. That's the "per-host learned escalation" that turns a 50-minute crawl into a 5-minute one.
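Here's a minimal sketch of that per-host learned escalation. It assumes you already have one fetch function per tier (cheapest first), each returning rendered HTML or None on a failure signal; the names below are placeholders, not AskVault's API:

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

Fetcher = Callable[[str], Optional[str]]

def make_tiered_fetcher(tiers: list[Fetcher]) -> Fetcher:
    """Wrap per-tier fetchers with a per-host escalation cache."""
    host_floor: dict[str, int] = {}  # host -> cheapest tier known to work

    def fetch(url: str) -> Optional[str]:
        host = urlsplit(url).hostname or ""
        for tier in range(host_floor.get(host, 0), len(tiers)):
            html = tiers[tier](url)
            if html is not None:
                host_floor[host] = tier  # future URLs on this host start here
                return html
        return None

    return fetch

# Usage (fetch_plain, fetch_browser, fetch_antibot are hypothetical tier functions):
# fetch = make_tiered_fetcher([fetch_plain, fetch_browser, fetch_antibot])
```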
The failure signals you check for
Detection logic is what separates a real scraper from a slow one. These are the signals AskVault checks before escalating:
- HTTP status. 403 or 503 with a Cloudflare-specific body. 429 means rate-limited (escalate, don't retry).
- Body length. Under 3 KB on a page that should be a real article. Usually means JS-rendered.
- DOM signature. <div id="root"></div>, <div id="__next"></div>, or <div id="app"></div> with little surrounding text. React, Next.js, Vue.
- Challenge phrases. "Just a moment", "Checking your browser", "Please enable JavaScript", "Attention Required! | Cloudflare".
- Meta refresh redirects to a challenge page. Some bot-protection services route through a JS redirect.
- Cookie wall. A <meta http-equiv="refresh" ...> redirect to a consent page that blocks content until cookies are accepted.
When any of these match, escalate to the next tier. Otherwise, accept the response and move on.
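As code, the escalation check might look like this rough sketch (the 3 KB threshold and phrase lists mirror the signals above; tune them for your corpus):

```python
CHALLENGE_PHRASES = ("just a moment", "checking your browser",
                     "please enable javascript", "attention required")
SPA_SHELLS = ('<div id="root">', '<div id="__next">', '<div id="app">')

def should_escalate(status: int, body: str) -> bool:
    text = body.lower()
    if status in (403, 429, 503):
        return True                     # blocked, rate-limited, or challenged
    if len(body) < 3_000:
        return True                     # suspiciously small for a real page
    if any(p in text for p in CHALLENGE_PHRASES):
        return True                     # anti-bot challenge page
    if any(s in text for s in SPA_SHELLS):
        return True                     # JS-rendered shell
    if 'http-equiv="refresh"' in text:
        return True                     # meta refresh to a challenge or consent page
    return False
```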
DIY: a minimal Python implementation
If you're rolling your own scraper instead of using AskVault, here's the smallest version that handles both failure modes. It uses httpx for Tier 0 and a headless browser for Tier 1:
```python
import httpx
from playwright.sync_api import sync_playwright

CHALLENGE_PHRASES = (
    "just a moment",
    "checking your browser",
    "attention required",
    "enable javascript",
)
SPA_SIGNATURES = (
    '<div id="root">',
    '<div id="__next">',
    '<div id="app">',
)

def fetch_page(url: str) -> str | None:
    # Tier 0: plain HTTP
    try:
        r = httpx.get(
            url,
            headers={"User-Agent": "Mozilla/5.0 ..."},
            timeout=10,
            follow_redirects=True,
        )
        if (
            r.status_code == 200
            and len(r.text) > 3000
            and not any(p in r.text.lower() for p in CHALLENGE_PHRASES)
            and not any(s in r.text for s in SPA_SIGNATURES)
        ):
            return r.text
    except httpx.RequestError:
        pass

    # Tier 1: headless browser
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        html = page.content()
        browser.close()
        return html
```

A few things this minimal version doesn't do (which a production scraper should):
- Per-host escalation cache. Once a host fails Tier 0, future URLs from that host skip Tier 0.
- Rate limiting per host. Some sites will ban you if you fetch 50 pages per second.
- Concurrent fetches with backoff. A real crawler runs 20+ pages in parallel with adaptive throttling.
- Content extraction. Once you have the rendered HTML, you still need to strip nav, footer, scripts, and ads before chunking.
If you want to ship something quickly, you can take the snippet above, add a queue, and you'll have a working RAG ingestion scraper in under 100 lines of code. Just don't expect it to handle Cloudflare or hostile bot protection out of the box.
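As a sketch of what "add a queue" can look like, here's a simple breadth-first loop with a per-host politeness delay around the fetch_page function above (the one-second delay and the missing link discovery are deliberate simplifications):

```python
import time
from collections import deque
from urllib.parse import urlsplit

def crawl(start_urls: list[str], min_delay: float = 1.0) -> dict[str, str]:
    """Breadth-first fetch with a simple per-host politeness delay."""
    queue = deque(start_urls)
    last_hit: dict[str, float] = {}   # host -> time of the last request to it
    pages: dict[str, str] = {}        # url -> rendered HTML

    while queue:
        url = queue.popleft()
        if url in pages:
            continue
        host = urlsplit(url).hostname or ""
        wait = min_delay - (time.monotonic() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)          # stay around 1 request/second per host
        last_hit[host] = time.monotonic()
        html = fetch_page(url)        # the tiered fetcher from the snippet above
        if html:
            pages[url] = html
            # a fuller crawler would extract links here and extend the queue
    return pages
```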
Or use AskVault
The whole point of AskVault is that we did this work. When you crawl a website through AskVault:
- A real production scraper runs the tiered fallback automatically. No per-host configuration.
- The per-host escalation cache learns which sites need a browser, so the second URL from a hostile host is fast.
- Anti-bot challenges are handled at the right tier without you knowing or caring.
- Failed crawls retry with exponential backoff so transient errors don't tank your knowledge base.
- Extracted content goes straight into a workspace-isolated vector index ready for retrieval.
Setup: paste your URL into the onboarding wizard. Indexing 50 pages takes about 90 seconds; 500 pages takes about 10 minutes.
Detect which case applies to your site
If you want to know how hard your site is to scrape before you start, run these two commands:
```bash
curl -s -o /dev/null -w "%{http_code}\n" https://yoursite.com
curl -s https://yoursite.com | head -c 2000
```

Read the output. If the status is 200 and you see real content (your hero copy, navigation links, product names), you're fine. A plain HTTP scraper will work.
If you see a <div id="root"> with nothing inside, your site is JS-rendered. You need a headless browser.
If you see "Just a moment" or "Checking your browser", you're behind Cloudflare or similar. You need an anti-bot bypass service or a real browser fingerprint.
Common follow-up questions
Why not use a headless browser for everything?
Cost and speed. A headless browser fetch takes 3 to 5 seconds, runs Chromium, uses about 200 MB of memory per concurrent page, and costs roughly $0.30 per 1,000 pages in cloud compute. A plain HTTP fetch takes 200 to 800 ms, uses 5 MB of memory, and is essentially free.
A 1,000-page docs site crawled headless: 50 minutes, $0.30, 6 minutes of compute. Crawled with HTTP-first tiering: 5 minutes, $0.05 (only the 350 pages that needed escalation paid the browser tax). 10x throughput at the same coverage.
How do I handle infinite-scroll content?
Headless browsers can scroll the page programmatically: run page.evaluate("window.scrollTo(0, document.body.scrollHeight)"), wait for new content to load, and repeat. AskVault's scraper handles this automatically if it detects an IntersectionObserver-style infinite-scroll pattern.
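A rough sketch of that loop with Playwright's sync API (the round limit and wait time are illustrative; page is an already-open Playwright page):

```python
def scroll_to_bottom(page, max_rounds: int = 20) -> None:
    """Scroll until the page height stops growing or the round limit is hit."""
    last_height = 0
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)   # give lazy-loaded content time to arrive
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:     # nothing new loaded; stop
            break
        last_height = height
```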
What about content behind login walls?
If the content is gated, your scraper needs the cookies of an authenticated session. AskVault supports a Cookies field in Crawl config where you paste a session cookie. The scraper sends it on every request and gets the authenticated version of the page. Be careful: log out of the source account after indexing so the cookie expires.
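For a DIY scraper, the equivalent is attaching the session cookie to every request; a minimal httpx sketch, with the cookie name, value, and URL as placeholders:

```python
import httpx

# Placeholder session cookie copied from an authenticated browser session.
cookies = {"session_id": "paste-your-session-cookie-value-here"}

with httpx.Client(cookies=cookies, follow_redirects=True, timeout=10) as client:
    r = client.get("https://yoursite.com/account/docs")  # placeholder gated URL
    print(r.status_code, len(r.text))  # a login redirect usually means the cookie didn't work
```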
Will my scraping get detected and blocked?
Probably yes if you scrape aggressively without rate limits or an identifying user-agent. Be polite: 1 request per second per host, identify yourself in the User-Agent, respect robots.txt. AskVault's scraper does all of this by default.
Does this work with Single-Page Apps that use client-side routing?
Yes. You need to either (a) crawl by enumerating all URLs from the sitemap, or (b) crawl the home page and discover internal links from the rendered DOM (not the raw HTML, since SPA routing renders client-side). AskVault's scraper does both. It prefers the sitemap when available.
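A rough sketch of option (b) with Playwright: render the page, then read links from the live DOM rather than the raw HTML (the same-host filter is illustrative):

```python
from urllib.parse import urlsplit
from playwright.sync_api import sync_playwright

def discover_links(url: str) -> list[str]:
    """Render the page, then collect same-host links from the live DOM."""
    host = urlsplit(url).hostname
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        hrefs = page.eval_on_selector_all("a[href]", "els => els.map(e => e.href)")
        browser.close()
    return sorted({h for h in hrefs if urlsplit(h).hostname == host})
```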
Related guides
- Install the AskVault widget on any website
- How to restrict the AI bot to specific URLs only
- How vector databases work
- What is Retrieval-Augmented Generation (RAG)?