HTTP-first scraping with automatic browser escalation


The core insight

About 35% of modern websites work fine with a plain HTTP GET request. Server-rendered marketing pages, traditional CMS sites, blog posts, documentation hosted on most static-site generators. A fast HTTP fetch gets the full content in 200 to 800 milliseconds.

The other 65% are JavaScript-rendered single-page apps, behind anti-bot protection, or both. For those you need a headless browser that executes JavaScript and returns the rendered DOM. Browser fetches take 3 to 8 seconds and use about 200 MB of memory per concurrent page.

A naive scraper escalates everything to a browser. Slow and expensive. A better scraper tries HTTP first and only escalates when needed. That's the AskVault approach.

The signals

Two kinds of failures trigger escalation. Each has its own detection signal.

Failure type 1: JavaScript-rendered content.

The HTTP fetch succeeds (status 200) but the response body is essentially empty. Signals:

  • Body length under 3 KB on a page that should be a real article.
  • DOM signature: the body contains <div id="root"></div>, <div id="__next"></div>, or <div id="app"></div> with little surrounding text.
  • The page's title is generic ("Acme") but the visible content doesn't match the expected article structure.

When any of these match, we know the actual content is built client-side. Escalate to a headless browser.
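In code, this check can be a handful of cheap heuristics run against the HTTP response before deciding to escalate. A minimal sketch, with illustrative markers and thresholds rather than AskVault's exact values:

```python
import re

# Illustrative markers and thresholds; the real values may differ.
EMPTY_SHELL_MARKERS = [
    '<div id="root"></div>',
    '<div id="__next"></div>',
    '<div id="app"></div>',
]
MIN_BODY_BYTES = 3 * 1024  # the "under 3 KB" signal above


def looks_js_rendered(status: int, body: str) -> bool:
    """Return True when a 200 response looks like an empty SPA shell."""
    if status != 200:
        return False  # blocked/challenged responses are handled by the anti-bot check
    if len(body.encode("utf-8")) < MIN_BODY_BYTES:
        return True
    # Crudely strip tags to estimate how much visible text surrounds the mount point.
    visible_text = re.sub(r"<[^>]+>", " ", body)
    has_shell = any(marker in body for marker in EMPTY_SHELL_MARKERS)
    return has_shell and len(visible_text.split()) < 50
```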

Failure type 2: anti-bot protection.

The HTTP fetch is blocked or challenged. Signals:

  • Status code 403, 503, or 429 from a known bot-protection service.
  • Body contains "Just a moment", "Checking your browser", "Attention Required! | Cloudflare", or "Please enable JavaScript".
  • A meta-refresh redirect to a challenge page.

When any of these match, escalate to a browser that can solve the JavaScript challenge.
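The anti-bot check is the same idea applied to status codes and challenge phrases. A sketch using the signals listed above:

```python
BLOCK_STATUSES = {403, 429, 503}
CHALLENGE_PHRASES = [
    "Just a moment",
    "Checking your browser",
    "Attention Required! | Cloudflare",
    "Please enable JavaScript",
]


def looks_bot_blocked(status: int, body: str) -> bool:
    """Return True when the response looks like an anti-bot block or challenge."""
    if status in BLOCK_STATUSES:
        return True
    if any(phrase in body for phrase in CHALLENGE_PHRASES):
        return True
    # Meta-refresh redirects to a challenge page are another common tell.
    return 'http-equiv="refresh"' in body.lower()
```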

The escalation path

AskVault's scraper runs the request through this sequence, escalating only when the current step fails:

  1. HTTP fetch with a normal browser user-agent. Fast, free. Works on 35% of pages.
  2. Headless browser fetch. Executes JavaScript, returns the rendered DOM. Works on most JS-rendered SPAs.
  3. Anti-bot bypass service. A premium scraping API designed to defeat Cloudflare, Akamai, PerimeterX. Slower and costlier, but works on hostile sites.
  4. Customer's BYOK key. If the site is so hostile we still can't crack it, the customer can bring their own scraping API key for that specific host.

Each step takes the request as input and returns either rendered HTML (success) or a known-failure signal (escalate). The pipeline is deterministic: the same URL produces the same outcome on every retry.
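As a sketch, the pipeline can be modeled as an ordered set of tiers and a loop that walks them. The fetcher callables here are hypothetical stand-ins for the real HTTP client, headless browser, and scraping APIs:

```python
from enum import IntEnum
from typing import Callable


class Tier(IntEnum):
    HTTP = 1          # plain HTTP fetch with a browser user-agent
    BROWSER = 2       # headless browser, executes JavaScript
    ANTIBOT_API = 3   # premium anti-bot bypass service
    BYOK = 4          # customer's own scraping API key


# Each fetcher returns rendered HTML on success, or None as the known-failure signal.
Fetcher = Callable[[str], str | None]


def fetch_with_escalation(url: str, fetchers: dict[Tier, Fetcher],
                          start: Tier = Tier.HTTP) -> tuple[Tier, str]:
    """Walk the tiers in order, escalating only when the current one fails."""
    for tier in Tier:
        if tier < start:
            continue  # a learned per-host cache can tell us to start higher
        html = fetchers[tier](url)
        if html is not None:
            return tier, html
    raise RuntimeError(f"all tiers failed for {url}")
```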

Per-host learned escalation cache

Here's the optimization that turns this from "tiered fallback" to "actually 10x faster". Once a host fails at level 1, future URLs from that host start at level 2.

Concretely: the first URL we fetch from app.shopify.com runs through level 1, fails, escalates to level 2, succeeds. We cache the fact that app.shopify.com needs level 2 on the first try. The next 999 URLs we crawl from the same host go straight to level 2, skipping level 1's failed attempt.

Without the cache, every URL pays the failure tax. A 1,000-page crawl where 65% of pages need a browser would do 1,000 level-1 fetches plus 650 escalations to level 2. About 50 minutes wall-clock.

With the cache, the first 5 to 10 URLs from a hostile host pay the discovery cost; the remaining 990+ go straight to the right tier. About 5 minutes wall-clock for the same crawl.
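Building on the sketch above, the learned cache can be as small as a host-to-tier map. This is a hypothetical in-memory version; a production crawler would persist it across runs:

```python
from urllib.parse import urlsplit

# Maps hostname -> lowest tier known to succeed for that host.
escalation_cache: dict[str, Tier] = {}


def crawl_url(url: str, fetchers: dict[Tier, Fetcher]) -> str:
    host = urlsplit(url).hostname or ""
    start = escalation_cache.get(host, Tier.HTTP)
    tier, html = fetch_with_escalation(url, fetchers, start=start)
    if tier > escalation_cache.get(host, Tier.HTTP):
        # Future URLs from this host skip the tiers that just failed.
        escalation_cache[host] = tier
    return html
```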

Concurrency and rate limiting

The scraper runs multiple concurrent fetches across hosts. Two limits matter:

  • Global concurrency cap. Total in-flight fetches across the whole crawl. Prevents memory blowout when many hosts are slow.
  • Per-host concurrency cap. In-flight fetches against any single host. Prevents getting rate-limited by overloading one origin.

Both are tuned conservatively. The per-host cap is low enough that polite scraping holds. The global cap is high enough that a 1,000-page crawl across 50 hosts saturates the available compute.

When a host returns 429 (rate-limited), AskVault backs off with exponential delay. After 3 consecutive rate-limits the host enters a temporary ban: no more requests to that host for an hour. Prevents one hostile site from blocking the whole crawl.
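Those rules map naturally onto two semaphores plus a per-host failure counter. A rough asyncio sketch with illustrative caps, not AskVault's actual tuning; do_fetch is a hypothetical coroutine that returns (status, body):

```python
import asyncio
import time
from collections import defaultdict

GLOBAL_CAP = 50        # illustrative values only
PER_HOST_CAP = 2
COOLDOWN_SECONDS = 3600

global_sem = asyncio.Semaphore(GLOBAL_CAP)
host_sems = defaultdict(lambda: asyncio.Semaphore(PER_HOST_CAP))
host_429_count = defaultdict(int)
host_banned_until: dict[str, float] = {}


async def polite_fetch(host: str, url: str, do_fetch) -> str | None:
    if time.monotonic() < host_banned_until.get(host, 0.0):
        return None  # host is in its one-hour cooldown; skip for now
    async with global_sem, host_sems[host]:
        status, body = await do_fetch(url)
    if status == 429:
        host_429_count[host] += 1
        if host_429_count[host] >= 3:
            host_banned_until[host] = time.monotonic() + COOLDOWN_SECONDS
        else:
            await asyncio.sleep(2 ** host_429_count[host])  # exponential backoff
        return None
    host_429_count[host] = 0
    return body
```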

Content extraction after fetch

Getting the rendered HTML is half the battle. Extracting useful text is the other half. AskVault's chunker:

  1. Strips navigation, header, footer, and ads. These are noise, recognized by common HTML structures (<nav>, <header>, <footer>, <aside>, [role="navigation"], etc.) and CSS selectors typical of ad networks.
  2. Preserves heading structure. H1, H2, H3 boundaries become chunk boundaries. The parent heading is prepended to each chunk so context survives.
  3. Keeps tables intact. Tables get serialized to Markdown table syntax and stay as a single chunk. LLMs cite tables verbatim at much higher rates than fragmented data.
  4. Preserves code blocks. Code stays in fenced format. Critical for technical docs.
  5. Splits paragraphs only at natural boundaries. No mid-sentence cuts. Aim for 400 to 800 tokens per chunk with 50 to 100 tokens of overlap.

The output is a series of clean text chunks ready for embedding.
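For illustration, here is a simplified version of the strip-then-chunk flow using BeautifulSoup. It keeps the heading-boundary idea but skips the Markdown table serialization and token-overlap logic described above; the tag lists and selectors are assumptions, not AskVault's internals:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

NOISE_TAGS = ["nav", "header", "footer", "aside", "script", "style"]
NOISE_SELECTORS = ['[role="navigation"]', ".ad", ".advert"]  # illustrative ad selectors


def chunk_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")

    # 1. Strip navigation, header, footer, ads.
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()
    for selector in NOISE_SELECTORS:
        for tag in soup.select(selector):
            tag.decompose()

    # 2. Split at heading boundaries, prepending the parent heading to each chunk.
    chunks, heading, buffer = [], "", []
    for el in soup.find_all(["h1", "h2", "h3", "p", "table", "pre"]):
        if el.name in ("h1", "h2", "h3"):
            if buffer:
                chunks.append({"heading": heading, "text": " ".join(buffer)})
                buffer = []
            heading = el.get_text(strip=True)
        else:
            buffer.append(el.get_text(" ", strip=True))
    if buffer:
        chunks.append({"heading": heading, "text": " ".join(buffer)})
    return chunks
```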

Authenticated pages

Some content is gated behind a login. AskVault supports a Cookies field in Crawl config where you paste a session cookie. The scraper sends the cookie on every request and gets the authenticated version of the page.

Two warnings:

  • Cookies expire. Schedule a re-crawl with a fresh cookie monthly, or your indexed content goes stale.
  • Don't share live session cookies. Log out of the source account after indexing so the cookie can't be re-used by AskVault or anyone with access.

For long-lived authenticated access, an API key in a request header is better than a session cookie. Configure under Crawl config > Custom Headers.
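For reference, this is roughly what the scraper does with those values on each request. The URL and credentials below are placeholders; in AskVault you only paste the values into Crawl config rather than writing code:

```python
import requests  # pip install requests

session_cookie = "session=PASTE_COOKIE_VALUE_HERE"          # placeholder
api_key_header = {"Authorization": "Bearer YOUR_API_KEY"}   # placeholder

resp = requests.get(
    "https://docs.example.com/internal/getting-started",
    headers={"Cookie": session_cookie, **api_key_header},
    timeout=30,
)
resp.raise_for_status()
print(len(resp.text), "bytes of authenticated HTML")
```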

Robots.txt and politeness

The scraper respects robots.txt by default. If a site disallows /, we don't crawl it at all. If it disallows specific paths, we skip those.

User-agent strings identify the scraper as AskVault-Bot/1.0 with a contact URL. Site owners who want us to back off can do so via robots.txt or by emailing security@askvault.co.

Request frequency stays under 1 request per second per host by default, with adaptive backoff if the host shows signs of strain.
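The robots check itself is standard. Python's urllib ships a parser that covers the behavior described above; example.com is a placeholder:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# The user-agent matches the identifier the scraper sends.
if rp.can_fetch("AskVault-Bot/1.0", "https://example.com/docs/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt; skipping")
```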

What you get in the dashboard

Operationally, this scraper architecture surfaces in Knowledge Hub > Crawl Status:

  • Per-document indexing state. QUEUED, INDEXING, READY, or one of several error states.
  • Last-attempted timestamp. When the scraper last tried this URL.
  • Failure breakdown. If indexing failed, the dashboard shows the specific failure (timeout, 403, empty body, content too thin) so you know whether to retry or change settings.
  • Retry budget. Transient failures auto-retry on an exponential backoff schedule. Permanent failures (404, robots-disallowed) don't retry and are surfaced in the KPI tile "X sources need attention".

Why this matters for your AI agent

If your AI agent doesn't index a page, it can't answer questions from that page. A scraper that fails silently on JavaScript-rendered content means the bot gives wrong or empty answers on the modern half of your site.

The most common reason new customers see "the bot answered wrong" in week 1 is that a chunk of their site is JS-rendered and their previous tool's basic scraper skipped it. Once AskVault re-indexes with proper rendering, the bot's accuracy jumps.

Crawl coverage isn't just about completeness. It's about whether the bot can answer the question at all.

FAQ

Can I disable browser escalation if I'm sure my site is HTTP-only?

Yes. Under Crawl config > Advanced, force-HTTP mode skips the escalation logic entirely. Useful if you know all your content is server-rendered and want maximum speed.

What if the bot keeps failing on a specific page?

Check the Knowledge Hub crawl-status page. If the page consistently fails, you can manually upload its content as a document or paste it as a Snippet. The bot will retrieve from those sources instead.

Can I bring my own scraping API key for hostile sites?

Yes, on the Business plan and above. The BYOK scraper feature lets you use your own subscription to a premium scraping API for specific hosts where AskVault's built-in tiers don't crack the site. Configure under Settings > BYOK Scraper.

Does the scraper render dynamic content like infinite scroll?

Yes, when escalated to a browser. The browser scrolls the page programmatically and waits for new content to load. Works on most infinite-scroll patterns.
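The scroll-and-wait pattern looks roughly like this with a headless browser library such as Playwright. This is a generic sketch, not AskVault's implementation:

```python
from playwright.sync_api import sync_playwright  # pip install playwright


def fetch_infinite_scroll(url: str, max_scrolls: int = 10) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        previous_height = 0
        for _ in range(max_scrolls):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)  # give lazy-loaded content time to arrive
            height = page.evaluate("document.body.scrollHeight")
            if height == previous_height:
                break  # nothing new loaded; we've reached the bottom
            previous_height = height
        html = page.content()
        browser.close()
        return html
```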

How does it handle pagination?

The scraper follows pagination links automatically if they're standard <a href="...?page=N"> patterns. JS-driven "Load more" buttons that don't update the URL are harder; the scraper finds them on a best-effort basis.
