58 changes: 58 additions & 0 deletions docs/src/main/asciidoc/configuration.adoc
@@ -494,6 +494,64 @@ See the link:https://github.com/apache/stormcrawler/tree/main/external/playwrigh
| playwright.load.event | - | Page load event to wait for (e.g., "domcontentloaded", "networkidle").
|===

===== JS rendering detection

Browser fetching is much more expensive than a plain HTTP fetch, so most operators only want
Playwright on URLs that actually need it. The `JsRenderingDetector` parse filter inspects the
parsed page from a cheap fetch and sets a metadata flag (default `fetch.with=playwright`) on URLs
that look JavaScript-rendered. Pair it with link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/DelegatorProtocol.java[DelegatorProtocol]
to route subsequent fetches of those URLs to the Playwright protocol while leaving everything else
on a fast HTTP client.

Detection signals (cheapest first, short-circuiting):

* SPA framework fingerprints in raw HTML — `data-reactroot`, `ng-version=`, `__NEXT_DATA__`,
`window.__NUXT__`, `data-svelte-h=`, `data-vue-app`, `data-astro-cid`, `<router-outlet`.
* `<noscript>` blocks containing language like _"enable JavaScript"_.
* Empty SPA hydration roots: `<div id="root"></div>` / `#app` / `#__next` / `#__nuxt`.
* Outcome-based fallback: at least one `<script>` is present and both `text.length` and the
outlink count are below configurable thresholds.

Detection is skipped when `playwright.protocol.end` is already on the URL (i.e. it was just
fetched by Playwright) or when the routing key is already set, so the filter is idempotent.

Register the filter in `parsefilters.json`:

[source,json]
----
{
  "class": "org.apache.stormcrawler.protocol.playwright.parsefilter.JsRenderingDetector",
  "name": "js-rendering-detector",
  "params": { "minTextLength": 200, "minOutlinks": 2 }
}
----

And route on the metadata key it sets:

[source,yaml]
----
http.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
https.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
protocol.delegator.config:
  - className: "org.apache.stormcrawler.protocol.playwright.HttpProtocol"
    filters:
      "fetch.with": "playwright"
  - className: "org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
----

The dotted metadata key is quoted in the YAML above for readability; SnakeYAML accepts the
unquoted form too. Note that `DelegatorProtocol` requires the *last* entry in
`protocol.delegator.config` to have no `filters:` — it acts as the fallback, so keep the cheap
protocol at the bottom of the list.

The parse filter alone does **not** trigger an immediate refetch: it only sets the metadata flag
on the current fetch, and `DefaultScheduler` reschedules the URL according to the FETCHED interval
(`fetchInterval.default`, 24h by default). For faster turnaround, add a per-metadata-key
fetch interval (`fetchInterval.fetch.with=playwright: 5` refetches flagged URLs after 5 minutes).
To also keep the stub document out of the index, insert `JsRenderingRedirectionBolt`
between the parser and indexer. The bolt reads the routing flag and, on a hit, emits only to the
status stream with `Status.FETCHED`, so the URL is rescheduled and the stub document never reaches
the index. The full parameter list and tuning notes are in the link:https://github.com/apache/stormcrawler/tree/main/external/playwright[playwright module README].

==== Language ID

Language identification for crawled documents using the lang-detect library.
109 changes: 109 additions & 0 deletions external/playwright/README.md
@@ -47,3 +47,112 @@ Per-URL metadata triggers:
|---|---|
| `playwright.trace` | If present on the input metadata, a Playwright trace zip is recorded for the navigation and its path is returned in the response metadata under the same key. |

## JS rendering detection

Browser-based fetching is expensive — typically 10–50× slower than a plain HTTP fetch and limited by how many browsers a host can run concurrently. Most operators only want Playwright on the URLs that actually need it. The `JsRenderingDetector` parse filter solves the routing question without adding new infrastructure: it inspects the parsed page from a cheap fetch and, when the content looks JS-rendered, sets a metadata flag that `DelegatorProtocol` (already part of `core`) routes on.

### How detection works

The filter applies four heuristics, cheapest-first, and short-circuits on the first hit:

1. **SPA framework fingerprints** in raw HTML — `data-reactroot`, `ng-version=`, `__NEXT_DATA__`, `window.__NUXT__`, `data-svelte-h=`, `data-vue-app`, `data-astro-cid`, `<router-outlet`. Defaults are overridable via the `fingerprints` parameter.
2. **`<noscript>` blocks** that explicitly request JavaScript — match patterns like _"enable JavaScript"_, _"requires JavaScript"_, _"JavaScript is disabled"_.
3. **Empty SPA hydration roots** — `<div id="root"></div>` / `#app` / `#__next` / `#__nuxt` with no children. IDs are overridable via `emptyRootIds`.
4. **Outcome-based fallback** — when at least one `<script>` is present and both `text.length < minTextLength` (default 200) and `outlinks.size() < minOutlinks` (default 2), the URL is flagged as a thin SPA. The `<script>` gate keeps the filter from flagging static error stubs.
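
The cascade above can be sketched in plain Java as follows. This is a simplified illustration, not the filter's actual implementation: the class `JsDetectionSketch` and its `detect` method are hypothetical, the `<noscript>` check covers only one of the listed phrases, and the regular expressions stand in for whatever DOM inspection the real filter performs. The reason strings follow the formats listed under "What the filter sets".

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified sketch of the four-step cascade; not the real JsRenderingDetector. */
final class JsDetectionSketch {

    static final List<String> FINGERPRINTS = List.of(
            "data-reactroot", "ng-version=", "__NEXT_DATA__", "window.__NUXT__",
            "data-svelte-h=", "data-vue-app", "data-astro-cid", "<router-outlet");
    static final List<String> EMPTY_ROOT_IDS = List.of("root", "app", "__next", "__nuxt");
    static final Pattern NOSCRIPT = Pattern.compile(
            "<noscript[^>]*>(.*?)</noscript>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the reason for the first signal that fires, or empty if none does. */
    public static Optional<String> detect(
            String html, String text, int outlinks, int minTextLength, int minOutlinks) {
        // 1. Framework fingerprints: plain substring search in the raw HTML.
        for (String fp : FINGERPRINTS) {
            if (html.contains(fp)) {
                return Optional.of("fingerprint:" + fp);
            }
        }
        // 2. <noscript> blocks asking the user to enable JavaScript.
        Matcher m = NOSCRIPT.matcher(html);
        while (m.find()) {
            if (m.group(1).toLowerCase(Locale.ROOT).contains("enable javascript")) {
                return Optional.of("noscript-js-required");
            }
        }
        // 3. Empty hydration roots such as <div id="root"></div>.
        for (String id : EMPTY_ROOT_IDS) {
            Pattern emptyRoot = Pattern.compile(
                    "<div[^>]*\\bid=\"" + Pattern.quote(id) + "\"[^>]*>\\s*</div>");
            if (emptyRoot.matcher(html).find()) {
                return Optional.of("empty-root:" + id);
            }
        }
        // 4. Thin content, gated on the presence of at least one <script>.
        boolean hasScript = html.toLowerCase(Locale.ROOT).contains("<script");
        if (hasScript && text.length() < minTextLength && outlinks < minOutlinks) {
            return Optional.of("thin-content:text=" + text.length() + ",outlinks=" + outlinks);
        }
        return Optional.empty();
    }
}
```

Because the checks run in order, a page with both an empty `#root` and very little text is reported as `empty-root:root` rather than as thin content; a static error page with no `<script>` is never flagged by step 4.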

### What the filter sets

| Metadata key | Value | Notes |
|---|---|---|
| `fetch.with` | `playwright` | Routing key, overridable via `metadataKey` / `metadataValue`. |
| `fetch.with.reason` | e.g. `fingerprint:data-reactroot`, `noscript-js-required`, `empty-root:root`, `thin-content:text=12,outlinks=0` | Diagnostic — set unless `recordReason: false`. |

### Loop guards

- Detection is skipped when `playwright.protocol.end` is already present on the URL — i.e. the URL was just fetched by Playwright; reapplying the heuristic would just reflag it. Override the watch key via `skipIfMetadataPresent`.
- Detection is also skipped when the routing key is already set, so the filter is idempotent and safe to leave permanently in `parsefilters.json`.
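
The two guards amount to a short pre-check before any heuristic runs. The sketch below is an illustration under stated assumptions, not the filter's code: `LoopGuardSketch` and `shouldRunDetection` are hypothetical names, and metadata is modelled as a flat `Map<String, String>` rather than StormCrawler's `Metadata` class.

```java
import java.util.Map;

/** Hypothetical model of the two loop guards; not the real JsRenderingDetector. */
final class LoopGuardSketch {

    public static boolean shouldRunDetection(
            Map<String, String> metadata, String skipKey, String routingKey) {
        // Guard 1: the page just came back from Playwright. An empty skip key disables this guard.
        if (skipKey != null && !skipKey.isEmpty() && metadata.containsKey(skipKey)) {
            return false;
        }
        // Guard 2: the URL is already flagged, so running the heuristics again adds nothing.
        return !metadata.containsKey(routingKey);
    }
}
```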

### Parameters

| Name | Type | Default | Notes |
|---|---|---|---|
| `metadataKey` | string | `fetch.with` | Routing key set on a hit. |
| `metadataValue` | string | `playwright` | Value to set. |
| `minTextLength` | int | `200` | Outcome-based threshold for visible text. |
| `minOutlinks` | int | `2` | Outcome-based threshold for extracted outlinks. |
| `fingerprints` | string array | _see above_ | Substrings searched in raw HTML; replaces defaults when set. |
| `emptyRootIds` | string array | `["root","app","__next","__nuxt"]` | Element IDs treated as empty SPA hydration roots. |
| `requiredMessages` | string array | _empty_ | Additional substrings that, when found anywhere in the HTML, flag the URL. Use for site-specific JS-required prompts and loader text that don't fit the noscript pattern (e.g. `"Loading..."`, `"[object Object]"`, `"Please enable cookies"`). |
| `skipIfMetadataPresent` | string | `playwright.protocol.end` | Short-circuit when this metadata key is set. Empty string disables. |
| `recordReason` | bool | `true` | Also set `metadataKey + ".reason"` describing which signal fired. |

### Wiring

Add the filter to your `parsefilters.json`:

```json
{
  "class": "org.apache.stormcrawler.protocol.playwright.parsefilter.JsRenderingDetector",
  "name": "js-rendering-detector",
  "params": { "minTextLength": 200, "minOutlinks": 2 }
}
```

Route on the metadata key it sets via `DelegatorProtocol`:

```yaml
http.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
https.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
protocol.delegator.config:
  - className: "org.apache.stormcrawler.protocol.playwright.HttpProtocol"
    filters:
      "fetch.with": "playwright"
  - className: "org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
```

A few wiring notes:

- The dotted metadata key (`fetch.with`) is quoted in the YAML above to make it unambiguous to a human reader; SnakeYAML treats unquoted `fetch.with: "playwright"` as the same single-key scalar, so either parses correctly.
- `DelegatorProtocol` requires the **last** entry in `protocol.delegator.config` to have no `filters:` — it acts as the fallback. Keep OkHttp (or whichever cheap protocol you pick) at the bottom of the list.
- The filter alone does **not** trigger an immediate refetch. It only sets the metadata; the URL is rescheduled by `DefaultScheduler` according to the FETCHED interval (`fetchInterval.default`, 24h by default), and `DelegatorProtocol` picks Playwright on the next scheduled fetch. To get faster turnaround, add a per-metadata-key fetch interval to your YAML: `fetchInterval.fetch.with=playwright: 5` (refetch flagged URLs in 5 minutes instead of 24 hours). To also keep the stub document out of the index in the meantime, add the `JsRenderingRedirectionBolt` described below.
- Sibling URLs on the same host don't inherit the flag — that requires a host-keyed metadata transfer scheme and is intentionally out of scope.
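
To make the ordering rule concrete, here is a small standalone sketch of first-match routing with a filterless fallback. It models the behaviour described above and is not `DelegatorProtocol`'s code: `RoutingSketch`, `Rule`, and `pick` are hypothetical names, metadata is modelled as a flat `Map<String, String>`, and every filter on an entry is assumed to have to match.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model of first-match routing; not the real DelegatorProtocol. */
final class RoutingSketch {

    /** One protocol entry: a class name plus the metadata key/values that must all match. */
    static final class Rule {
        final String className;
        final Map<String, String> filters;

        Rule(String className, Map<String, String> filters) {
            this.className = className;
            this.filters = filters;
        }
    }

    /** Mirrors the YAML above: Playwright for flagged URLs, OkHttp as the fallback. */
    public static final List<Rule> RULES = List.of(
            new Rule("org.apache.stormcrawler.protocol.playwright.HttpProtocol",
                    Map.of("fetch.with", "playwright")),
            new Rule("org.apache.stormcrawler.protocol.okhttp.HttpProtocol", Map.of()));

    public static String pick(List<Rule> rules, Map<String, String> metadata) {
        for (Rule rule : rules) {
            // An entry with no filters matches everything, which is why it has to come last.
            boolean allMatch = rule.filters.entrySet().stream()
                    .allMatch(e -> e.getValue().equals(metadata.get(e.getKey())));
            if (allMatch) {
                return rule.className;
            }
        }
        throw new IllegalStateException("no fallback entry configured");
    }
}
```

In this model, putting the filterless OkHttp entry first would make it match every URL, and the Playwright entry would never be reached.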

### Forcing an immediate refetch — `JsRenderingRedirectionBolt`

The detector flags URLs but does not, on its own, stop the cheap fetch's stub document from flowing downstream from the parser into the indexer and outlink emission. For most crawls that's fine: the next scheduled fetch replaces the stub with the rendered version. If you want the stub discarded and the URL sent straight back to the status stream for a Playwright refetch, insert `JsRenderingRedirectionBolt` between the parser and the indexer. The bolt:

- reads the routing flag set by the detector (or any other upstream component),
- on hit, emits **only** to `StatusStreamName` with `Status.FETCHED` so the URL is rescheduled and the stub never reaches the index,
- on miss, passes the tuple through unchanged,
- short-circuits when `playwright.protocol.end` is already on the URL — the loop guard.

The bolt has no detection logic of its own; it just acts on the metadata flag. That keeps the heuristics in one place (the parse filter) and lets you swap or extend the bolt independently.

Topology fragment:

```text
... -> JSoupParserBolt -> JsRenderingRedirectionBolt -> IndexerBolt -> ...
                                      \-> StatusStream
```

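Pair the bolt with a per-metadata-key fetch interval in your YAML; without it, `Status.FETCHED` reschedules the URL at `fetchInterval.default` (24h):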

```yaml
# refetch flagged URLs in 5 minutes rather than 24 hours
fetchInterval.fetch.with=playwright: 5
```
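
As a rough illustration of what that interval changes, the sketch below picks a refetch delay from a URL's metadata. It is a simplified model, not `DefaultScheduler`'s implementation: `IntervalSketch` and `minutesUntilRefetch` are hypothetical names, metadata is a flat map, and intervals are assumed to be expressed in minutes, so the 24h default becomes 1440.

```java
import java.util.Map;

/** Hypothetical model of per-metadata fetch intervals; not the real DefaultScheduler. */
final class IntervalSketch {

    static final int DEFAULT_MINUTES = 24 * 60; // fetchInterval.default (24h)
    static final int FLAGGED_MINUTES = 5;       // fetchInterval.fetch.with=playwright: 5

    public static int minutesUntilRefetch(Map<String, String> metadata) {
        // Flagged URLs use the short interval; everything else keeps the default.
        return "playwright".equals(metadata.get("fetch.with"))
                ? FLAGGED_MINUTES
                : DEFAULT_MINUTES;
    }
}
```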

Configuration keys:

| Key | Default | Notes |
|---|---|---|
| `playwright.redirect.metadata.key` | `fetch.with` | Routing key the bolt watches for. |
| `playwright.redirect.metadata.value` | `playwright` | Value the bolt watches for. |
| `playwright.redirect.skip.if.metadata.present` | `playwright.protocol.end` | Loop guard — pass through unchanged when this key is set on the URL. Empty string disables. |

### When _not_ to use it

- **Operator allowlist suffices.** If you already know which hosts need a browser, add them as a `urlPatterns` rule on the Playwright leg of `DelegatorProtocol` and skip the filter.
- **Anti-bot / WAF challenge pages.** Cloudflare, DataDome, and Akamai challenge fingerprints aren't covered here; those usually need a stealth-mode browser, not just rendering.
- **Aggressively first-fetch-sensitive crawls.** The first fetch on an unknown SPA host is always wasted (you get a stub document) before the filter learns about the host. If that's unacceptable, prefer the operator allowlist.

@@ -0,0 +1,130 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to you under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.stormcrawler.protocol.playwright.bolt;

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.stormcrawler.Constants;
import org.apache.stormcrawler.Metadata;
import org.apache.stormcrawler.persistence.Status;
import org.apache.stormcrawler.protocol.playwright.HttpProtocol;
import org.apache.stormcrawler.util.ConfUtils;
import org.slf4j.LoggerFactory;

/**
 * Bolt that consumes the routing flag set by {@link
 * org.apache.stormcrawler.protocol.playwright.parsefilter.JsRenderingDetector} (or any other
 * upstream component) and forces an immediate refetch through Playwright instead of letting the
 * cheap fetch's stub document propagate downstream.
 *
 * <p>Pipeline placement: between the parser bolt (which produces tuples of {@code (url, content,
 * metadata, text)}) and the indexer / persistence bolts. On hit, the bolt emits only to the {@link
 * Constants#StatusStreamName} with status {@link Status#FETCHED}, so the URL is rescheduled and the
 * stub never reaches the index. On miss, the tuple passes through to the default stream unchanged.
 *
 * <p>Pair this with a per-metadata-key fetch interval to control how soon the refetch happens — by
 * default {@code Status.FETCHED} reschedules at {@code fetchInterval.default} (24h):
 *
 * <pre>{@code
 * # refetch flagged URLs in 5 minutes rather than 24 hours
 * fetchInterval.fetch.with=playwright: 5
 * }</pre>
 *
 * <h3>Configuration</h3>
 *
 * <ul>
 *   <li>{@code playwright.redirect.metadata.key} (default {@code fetch.with})
 *   <li>{@code playwright.redirect.metadata.value} (default {@code playwright})
 *   <li>{@code playwright.redirect.skip.if.metadata.present} (default {@link
 *       HttpProtocol#MD_KEY_END}) — passes the tuple through unchanged when this metadata key is
 *       already set, preventing loops with content that came back from Playwright. Set to empty to
 *       disable the loop guard.
 * </ul>
 */
public class JsRenderingRedirectionBolt extends BaseRichBolt {

    private static final org.slf4j.Logger LOG =
            LoggerFactory.getLogger(JsRenderingRedirectionBolt.class);

    public static final String CONF_METADATA_KEY = "playwright.redirect.metadata.key";
    public static final String CONF_METADATA_VALUE = "playwright.redirect.metadata.value";
    public static final String CONF_SKIP_IF_METADATA_PRESENT =
            "playwright.redirect.skip.if.metadata.present";

    public static final String DEFAULT_METADATA_KEY = "fetch.with";
    public static final String DEFAULT_METADATA_VALUE = "playwright";

    private OutputCollector collector;
    private String routingKey;
    private String routingValue;
    private String skipIfMetadataPresent;

    @Override
    public void prepare(
            final Map<String, Object> conf,
            final TopologyContext context,
            final OutputCollector collector) {
        this.collector = collector;
        this.routingKey = ConfUtils.getString(conf, CONF_METADATA_KEY, DEFAULT_METADATA_KEY);
        this.routingValue = ConfUtils.getString(conf, CONF_METADATA_VALUE, DEFAULT_METADATA_VALUE);
        this.skipIfMetadataPresent =
                ConfUtils.getString(conf, CONF_SKIP_IF_METADATA_PRESENT, HttpProtocol.MD_KEY_END);
    }

    @Override
    public void execute(final Tuple tuple) {
        final String url = tuple.getStringByField("url");
        final byte[] content = tuple.getBinaryByField("content");
        final Metadata metadata = (Metadata) tuple.getValueByField("metadata");
        final String text = tuple.getStringByField("text");

        if (shouldRedirect(metadata)) {
            LOG.debug("Redirecting {} to Playwright (status stream)", url);
            collector.emit(
                    Constants.StatusStreamName, tuple, new Values(url, metadata, Status.FETCHED));
        } else {
            collector.emit(tuple, new Values(url, content, metadata, text));
        }
        collector.ack(tuple);
    }

    private boolean shouldRedirect(final Metadata metadata) {
        if (metadata == null) {
            return false;
        }
        if (skipIfMetadataPresent != null
                && !skipIfMetadataPresent.isEmpty()
                && metadata.containsKey(skipIfMetadataPresent)) {
            // already came back from Playwright — don't loop
            return false;
        }
        return metadata.containsKeyWithValue(routingKey, routingValue);
    }

    @Override
    public void declareOutputFields(final OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "content", "metadata", "text"));
        declarer.declareStream(Constants.StatusStreamName, new Fields("url", "metadata", "status"));
    }
}