feat(playwright): add JsRenderingDetector parse filter & bolt #1898
Conversation
Heuristically flags URLs whose content looks JavaScript-rendered by inspecting SPA framework fingerprints, noscript blocks, empty hydration roots, and a thin-content fallback. Sets a routing metadata key so DelegatorProtocol can dispatch subsequent fetches to Playwright while the bulk of the crawl stays on a cheap HTTP client.
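To make the heuristics concrete, here is a minimal standalone sketch of the kind of checks such a detector might apply. The class name, marker list, and threshold are illustrative assumptions, not the actual StormCrawler parse filter API or the values used in this PR.

```java
// Hypothetical sketch of JsRenderingDetector-style heuristics; all names
// and thresholds here are assumptions, detached from the StormCrawler API.
public class JsRenderHeuristics {

    // Fingerprints commonly left in markup by SPA frameworks (assumed list).
    private static final String[] SPA_MARKERS = {
        "data-reactroot", "ng-version", "__next_data__", "data-v-app"
    };

    // Below this much visible text, fall back to treating the page as thin.
    private static final int THIN_CONTENT_CHARS = 250;

    public static boolean looksJsRendered(String html) {
        String lower = html.toLowerCase();

        // 1. SPA framework fingerprints in the raw markup.
        for (String marker : SPA_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }

        // 2. A <noscript> block telling the user to enable JavaScript.
        if (lower.contains("<noscript") && lower.contains("enable javascript")) {
            return true;
        }

        // 3. An empty hydration root such as <div id="root"></div>.
        if (lower.matches("(?s).*<div[^>]*id=\"(root|app)\"[^>]*>\\s*</div>.*")) {
            return true;
        }

        // 4. Thin-content fallback: almost no text once tags are stripped.
        String text = lower.replaceAll("(?s)<script.*?</script>", "")
                           .replaceAll("<[^>]+>", " ")
                           .replaceAll("\\s+", " ").trim();
        return text.length() < THIN_CONTENT_CHARS;
    }

    public static void main(String[] args) {
        System.out.println(looksJsRendered("<div id=\"root\"></div>"));  // true
        System.out.println(looksJsRendered(
            "<p>" + "static content ".repeat(30) + "</p>"));             // false
    }
}
```

The real filter would additionally write the routing metadata key consumed by DelegatorProtocol instead of returning a boolean.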
Pairs JsRenderingDetector with a bolt that reads the routing flag and emits to StatusStreamName for an immediate refetch through Playwright, suppressing the cheap fetch's stub from the index. Adds a requiredMessages parameter to the detector for free-form JS-required / loader / cookie prompts that don't fit the noscript pattern.
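The bolt's core decision can be sketched as a pure function, detached from the Storm classes so it runs standalone. The metadata key and enum names are hypothetical stand-ins, not the identifiers used in the PR:

```java
import java.util.Map;

// Hypothetical sketch of the redirection bolt's routing decision.
// "jsrender.required" is an assumed metadata key, not the PR's actual one.
public class JsRenderRouting {

    static final String FLAG = "jsrender.required";

    enum Action { INDEX, REFETCH_VIA_STATUS }

    // If the detector flagged the page, suppress the stub document and send
    // the URL back through the status stream for a Playwright fetch;
    // otherwise index the parsed content as usual.
    static Action route(Map<String, String> metadata) {
        return "true".equals(metadata.get(FLAG))
                ? Action.REFETCH_VIA_STATUS
                : Action.INDEX;
    }

    public static void main(String[] args) {
        System.out.println(route(Map.of(FLAG, "true")));  // REFETCH_VIA_STATUS
        System.out.println(route(Map.of()));              // INDEX
    }
}
```

In the actual bolt this decision would translate into an emit on StatusStreamName versus the default indexing stream.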
This is great. Quick question: why not have the redirection bolt point back to the FetcherBolt instead, so the URL gets refetched straight away? Obviously we'd have to make sure it doesn't get into an endless loop, e.g. by checking whether the URL has already been fetched by Playwright. Also, should outlinks with the same hostname inherit the flag? Should we have a URL filter to that effect?
A direct edge back to the fetcher would bypass the scheduler, which is what enforces fetch intervals, per-host delays, and robots-aware re-queueing. With a batch of URLs from the same domain flagged for JS rendering, that path would re-fetch immediately and risk hammering the host.

It also breaks Storm's acking model: the original tuple from the spout never anchors cleanly, every re-fetch extends the tuple tree, and any failure deep in the cycle replays from the spout. Routing through the status index breaks the tree at a natural boundary: the status update is acked, and the spout re-emits the URL as a fresh tuple.

There's a persistence angle too: once the "needs JS rendering" flag is written into the status index metadata, it survives a topology restart. An in-flight cyclic tuple would not.

It also matches the established pattern in StormCrawler: redirects, retries, and re-fetches always loop via status → scheduler → spout. A sideband direct re-fetch path would diverge from that pattern.

Finally, the approach removes the need for explicit loop detection. The metadata flag in the status index is itself the guard against re-rendering, so no extra "already rendered by Playwright" check is required.
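That last point about the flag doubling as the loop guard can be shown with a toy simulation, under assumed names: once the flag is persisted in status metadata the delegator picks Playwright, the rendered page no longer trips the detector, and the cycle terminates on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation of the status-routed re-fetch cycle. Key names, the
// stand-in heuristic, and the fake fetch results are all hypothetical.
public class RefetchCycleSketch {

    static String fetch(String url, Map<String, String> status) {
        boolean usePlaywright = "true".equals(status.get("jsrender.required"));
        // The cheap client sees the empty shell; Playwright sees rendered HTML.
        return usePlaywright ? "<p>rendered article text</p>"
                             : "<div id=\"root\"></div>";
    }

    static boolean looksJsRendered(String html) {
        return html.contains("id=\"root\"></div>");  // stand-in heuristic
    }

    static int fetchesUntilIndexed(String url) {
        Map<String, String> status = new HashMap<>();
        int fetches = 0;
        while (true) {
            fetches++;
            String html = fetch(url, status);
            if (!looksJsRendered(html)) {
                return fetches;              // content indexed normally
            }
            status.put("jsrender.required", "true");  // persisted flag
        }
    }

    public static void main(String[] args) {
        // One cheap fetch plus one Playwright fetch, then the loop ends.
        System.out.println(fetchesUntilIndexed("https://example.com/a"));  // 2
    }
}
```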
On the second point: agreed, propagating the flag to outlinks on the same host (or expressing it as a URL filter) is worth doing, but I'd rather keep it out of this PR to keep the scope tight. I'll open a follow-up issue or PR so we can discuss the inheritance rules and filter shape on their own. wdyt?