feat(playwright): add JsRenderingDetector parse filter & bolt #1898

Open
rzo1 wants to merge 3 commits into main from playwright-js-detector

rzo1 (Contributor) commented May 4, 2026

Heuristically flags URLs whose content looks JavaScript-rendered by
inspecting SPA framework fingerprints, noscript blocks, empty hydration
roots, and a thin-content fallback. Sets a routing metadata key so
DelegatorProtocol can dispatch subsequent fetches to Playwright while
the bulk of the crawl stays on a cheap HTTP client.
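
To make the four signals concrete, here is a minimal sketch of how such a heuristic could be combined. It is not the detector from this PR: the class name, the fingerprint list, the mount-point ids, the 200-character threshold, and the rule that a fingerprint only counts on a thin page are all assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the JsRenderingDetector from this PR.
class JsRenderingHeuristicSketch {

    // Assumed fingerprints (lower-cased) left by common SPA frameworks.
    private static final List<String> FINGERPRINTS =
            List.of("data-reactroot", "ng-version", "__next_data__");

    // An empty mount point such as <div id="root"></div>.
    private static final Pattern EMPTY_ROOT = Pattern.compile(
            "<div[^>]*\\bid=[\"'](root|app|__next)[\"'][^>]*>\\s*</div>",
            Pattern.CASE_INSENSITIVE);

    // Captures the body of each <noscript> block.
    private static final Pattern NOSCRIPT = Pattern.compile(
            "<noscript[^>]*>(.*?)</noscript>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumed threshold for the thin-content fallback.
    private static final int MIN_VISIBLE_CHARS = 200;

    // True if any <noscript> block mentions JavaScript.
    static boolean hasNoscriptPrompt(String html) {
        Matcher m = NOSCRIPT.matcher(html);
        while (m.find()) {
            if (m.group(1).toLowerCase(Locale.ROOT).contains("javascript")) {
                return true;
            }
        }
        return false;
    }

    // Rough count of visible characters: drop script/style blocks, then all tags.
    static int visibleTextLength(String html) {
        String noCode = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String text = noCode.replaceAll("(?s)<[^>]+>", " ");
        return text.replaceAll("\\s+", " ").trim().length();
    }

    static boolean looksJsRendered(String html) {
        String lower = html.toLowerCase(Locale.ROOT);
        boolean fingerprint = FINGERPRINTS.stream().anyMatch(lower::contains);
        boolean thin = visibleTextLength(html) < MIN_VISIBLE_CHARS;
        // Strong signals decide alone; a fingerprint only counts on a thin page.
        return EMPTY_ROOT.matcher(html).find()
                || hasNoscriptPrompt(html)
                || (thin && fingerprint);
    }
}
```

Requiring thin content alongside a fingerprint is one way to avoid flagging server-side-rendered pages that ship framework markers together with full content; the actual detector may weigh the signals differently.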

rzo1 added 2 commits May 4, 2026 20:46
Heuristically flags URLs whose content looks JavaScript-rendered by
inspecting SPA framework fingerprints, noscript blocks, empty hydration
roots, and a thin-content fallback. Sets a routing metadata key so
DelegatorProtocol can dispatch subsequent fetches to Playwright while
the bulk of the crawl stays on a cheap HTTP client.
…list

Pairs JsRenderingDetector with a bolt that reads the routing flag and
emits to StatusStreamName for an immediate refetch through Playwright,
suppressing the cheap fetch's stub from the index. Adds a requiredMessages
parameter to the detector for free-form JS-required / loader / cookie
prompts that don't fit the noscript pattern.
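
As a rough illustration of what a free-form `requiredMessages` check could look like, the sketch below treats the parameter as a list of phrases matched case-insensitively against the page text. That interpretation, the class name, and the example phrases are assumptions; the PR may parse and match the parameter differently.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative sketch; the real parameter handling in the PR may differ.
class RequiredMessagesSketch {
    private final List<String> phrases;

    RequiredMessagesSketch(List<String> requiredMessages) {
        // Normalise once so each page check is a plain substring scan.
        this.phrases = requiredMessages.stream()
                .map(p -> p.trim().toLowerCase(Locale.ROOT))
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }

    // True if the page text contains any configured phrase, ignoring case.
    boolean matches(String pageText) {
        String text = pageText.toLowerCase(Locale.ROOT);
        return phrases.stream().anyMatch(text::contains);
    }
}
```

With hypothetical phrases such as "Please enable JavaScript" or "Loading...", a page that shows only a loader or a JS-required banner outside any noscript block would still be picked up.
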
rzo1 added this to the 3.6.0 milestone May 4, 2026

jnioche (Contributor) commented May 5, 2026

This is great. Quick question: why not have the redirection bolt point back to the FetcherBolt instead so that the URL gets refetched straight away? Obviously would have to make sure it does not get into an endless loop by checking if it has been fetched by Playwright.

Should any outlinks with the same hostname inherit the flag? Should we have a URL filter to that effect?

rzo1 (Contributor, Author) commented May 5, 2026

> Quick question: why not have the redirection bolt point back to the FetcherBolt instead so that the URL gets refetched straight away? Obviously would have to make sure it does not get into an endless loop by checking if it has been fetched by Playwright.

A direct edge back to the fetcher would bypass the scheduler, which is what enforces fetch intervals, per-host delays, and robots-aware re-queueing. With a batch of URLs from the same domain flagged for JS rendering, that path would re-fetch immediately and risk hammering the host. It also breaks Storm's acking model: the original tuple from the spout never anchors cleanly, every re-fetch extends the tuple tree, and any failure deep in the cycle replays from the spout. Routing through the status index breaks the tree at a natural boundary: the status update is acked, and the spout re-emits as a fresh tuple.

There is a persistence angle too: once the "needs JS rendering" flag is written into the status index metadata, it survives a topology restart. An in-flight cyclic tuple would not. Routing through the status index also matches the established pattern in StormCrawler: redirects, retries, and re-fetches always loop via status → scheduler → spout. A sideband direct re-fetch path would break with that pattern.

Finally, the approach removes the need for explicit loop detection. The metadata flag in the status index is itself the guard against re-rendering, so no extra "already rendered by Playwright" check is required.
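
The flag-as-guard argument can be shown with a small, Storm-free simulation. This is a sketch under stated assumptions: the metadata key `needs.js.rendering`, the class and method names, and the in-memory map standing in for the status index are all hypothetical, and scheduling delays are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation; no Storm or StormCrawler classes are involved.
class StatusLoopSketch {
    // Hypothetical metadata key; the real key is not named in this thread.
    static final String FLAG = "needs.js.rendering";

    // Stand-in for the status index: URL -> metadata. It outlives any single tuple.
    final Map<String, Map<String, String>> statusIndex = new HashMap<>();
    final List<String> fetchLog = new ArrayList<>();
    final List<String> indexed = new ArrayList<>();

    // One spout -> fetch -> parse -> route pass. Returns true if the URL was re-queued.
    boolean processOnce(String url, boolean pageNeedsJs) {
        Map<String, String> md = statusIndex.computeIfAbsent(url, k -> new HashMap<>());
        boolean flagged = "true".equals(md.get(FLAG));

        // The persisted flag alone picks the protocol for this pass.
        String protocol = flagged ? "playwright" : "http";
        fetchLog.add(protocol + " " + url);

        // Only an unflagged (cheap) fetch can be re-routed, so the cycle ends after one hop.
        if (!flagged && pageNeedsJs) {
            md.put(FLAG, "true"); // status update; the stub is not indexed
            return true;
        }
        indexed.add(url);
        return false;
    }
}
```

Calling `processOnce` twice for the same JS-dependent URL produces one `http` fetch followed by one `playwright` fetch, and the URL is indexed only on the second pass. No separate "already rendered" check is involved: the second pass cannot re-route because the flag is already set.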

> Should any outlinks with the same hostname inherit the flag? Should we have a URL filter to that effect?

On the second point: agreed, propagating the flag to outlinks of the same host (or expressing it as a URL filter) is worth doing, but I'd rather keep it out of this PR to keep the scope tight. I'll open a follow-up PR or issue for it so we can discuss the inheritance rules and filter shape on their own. wdyt?
