feat(playwright): add JsRenderingDetector parse filter & bolt #1898
Conversation
Heuristically flags URLs whose content looks JavaScript-rendered by inspecting SPA framework fingerprints, noscript blocks, empty hydration roots, and a thin-content fallback. Sets a routing metadata key so DelegatorProtocol can dispatch subsequent fetches to Playwright while the bulk of the crawl stays on a cheap HTTP client.
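To make the heuristics concrete, here is a minimal standalone sketch of the kind of checks such a detector might apply. The class name, marker list, and threshold are illustrative assumptions, not the actual StormCrawler parse filter API or the values used in this PR.

```java
// Hypothetical sketch of JsRenderingDetector-style heuristics; all names
// and thresholds here are assumptions, detached from the StormCrawler API.
public class JsRenderHeuristics {

    // Fingerprints commonly left in markup by SPA frameworks (assumed list).
    private static final String[] SPA_MARKERS = {
        "data-reactroot", "ng-version", "__next_data__", "data-v-app"
    };

    // Below this much visible text, fall back to treating the page as thin.
    private static final int THIN_CONTENT_CHARS = 250;

    public static boolean looksJsRendered(String html) {
        String lower = html.toLowerCase();

        // 1. SPA framework fingerprints in the raw markup.
        for (String marker : SPA_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }

        // 2. A <noscript> block telling the user to enable JavaScript.
        if (lower.contains("<noscript") && lower.contains("enable javascript")) {
            return true;
        }

        // 3. An empty hydration root such as <div id="root"></div>.
        if (lower.matches("(?s).*<div[^>]*id=\"(root|app)\"[^>]*>\\s*</div>.*")) {
            return true;
        }

        // 4. Thin-content fallback: almost no text once tags are stripped.
        String text = lower.replaceAll("(?s)<script.*?</script>", "")
                           .replaceAll("<[^>]+>", " ")
                           .replaceAll("\\s+", " ").trim();
        return text.length() < THIN_CONTENT_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(looksJsRendered("<div id=\"root\"></div>"));  // true
        System.out.println(looksJsRendered(
            "<p>" + "static content ".repeat(30) + "</p>"));             // false
    }
}
```

The real filter would additionally write the routing metadata key consumed by DelegatorProtocol instead of returning a boolean.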
Pairs JsRenderingDetector with a bolt that reads the routing flag and emits to StatusStreamName for an immediate refetch through Playwright, suppressing the cheap fetch's stub from the index. Adds a requiredMessages parameter to the detector for free-form JS-required / loader / cookie prompts that don't fit the noscript pattern.
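The bolt's core decision can be sketched as a pure function, detached from the Storm classes so it runs standalone. The metadata key and enum names are hypothetical stand-ins, not the identifiers used in the PR:

```java
import java.util.Map;

// Hypothetical sketch of the redirection bolt's routing decision.
// "jsrender.required" is an assumed metadata key, not the PR's actual one.
public class JsRenderRouting {

    static final String FLAG = "jsrender.required";

    enum Action { INDEX, REFETCH_VIA_STATUS }

    // If the detector flagged the page, suppress the stub document and send
    // the URL back through the status stream for a Playwright fetch;
    // otherwise index the parsed content as usual.
    static Action route(Map<String, String> metadata) {
        return "true".equals(metadata.get(FLAG))
                ? Action.REFETCH_VIA_STATUS
                : Action.INDEX;
    }

    public static void main(String[] args) {
        System.out.println(route(Map.of(FLAG, "true")));  // REFETCH_VIA_STATUS
        System.out.println(route(Map.of()));              // INDEX
    }
}
```

In the actual bolt this decision would translate into an emit on StatusStreamName versus the default indexing stream.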
This is great. Quick question: why not have the redirection bolt point back to the FetcherBolt instead, so the URL gets refetched straight away? Obviously we'd have to make sure it doesn't get into an endless loop, e.g. by checking whether the URL has already been fetched by Playwright. Also, should outlinks with the same hostname inherit the flag? Should we have a URL filter to that effect?
A direct edge back to the fetcher would bypass the scheduler, which is what enforces fetch intervals, per-host delays, and robots-aware re-queueing. With a batch of URLs from the same domain flagged for JS rendering, that path would re-fetch immediately and risk hammering the host.

It also breaks Storm's acking model: the original tuple from the spout never anchors cleanly, every re-fetch extends the tuple tree, and any failure deep in the cycle replays from the spout. Routing through the status index breaks the tree at a natural boundary: the status update is acked, and the spout re-emits the URL as a fresh tuple.

There's a persistence angle too: once the "needs JS rendering" flag is written into the status index metadata, it survives a topology restart. An in-flight cyclic tuple would not.

It also matches the established pattern in StormCrawler: redirects, retries, and re-fetches always loop via status → scheduler → spout. A sideband direct re-fetch path would diverge from that pattern.

Finally, the approach removes the need for explicit loop detection. The metadata flag in the status index is itself the guard against re-rendering, so no extra "already rendered by Playwright" check is required.
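That last point about the flag doubling as the loop guard can be shown with a toy simulation, under assumed names: once the flag is persisted in status metadata the delegator picks Playwright, the rendered page no longer trips the detector, and the cycle terminates on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation of the status-routed re-fetch cycle. Key names, the
// stand-in heuristic, and the fake fetch results are all hypothetical.
public class RefetchCycleSketch {

    static String fetch(String url, Map<String, String> status) {
        boolean usePlaywright = "true".equals(status.get("jsrender.required"));
        // The cheap client sees the empty shell; Playwright sees rendered HTML.
        return usePlaywright ? "<p>rendered article text</p>"
                             : "<div id=\"root\"></div>";
    }

    static boolean looksJsRendered(String html) {
        return html.contains("id=\"root\"></div>");  // stand-in heuristic
    }

    static int fetchesUntilIndexed(String url) {
        Map<String, String> status = new HashMap<>();
        int fetches = 0;
        while (true) {
            fetches++;
            String html = fetch(url, status);
            if (!looksJsRendered(html)) {
                return fetches;              // content indexed normally
            }
            status.put("jsrender.required", "true");  // persisted flag
        }
    }

    public static void main(String[] args) {
        // One cheap fetch plus one Playwright fetch, then the loop ends.
        System.out.println(fetchesUntilIndexed("https://example.com/a"));  // 2
    }
}
```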
On the second point: agreed, propagating the flag to outlinks on the same host (or expressing it as a URL filter) is worth doing, but I'd rather keep it out of this PR to keep the scope tight. I'll open a follow-up issue or PR so we can discuss the inheritance rules and filter shape on their own. wdyt?