62 changes: 58 additions & 4 deletions docs/src/main/asciidoc/configuration.adoc
@@ -488,11 +488,65 @@ See the link:https://github.com/apache/stormcrawler/tree/main/external/playwrigh
|===
| key | default value | description

| playwright.cdp.url | - | Chrome DevTools Protocol URL for connecting to an existing browser instance.
| playwright.remote.ws | - | Remote WebSocket URL for Playwright (alternative to CDP).
| playwright.skip.download | true | Skip automatic browser download. Set to false to let Playwright manage its own browser.
| playwright.load.event | - | Page load event to wait for (e.g., "domcontentloaded", "networkidle").
| playwright.cdp.url | - | Chrome DevTools Protocol URL for connecting to an existing browser instance (e.g. `http://localhost:9222`). Mutually exclusive with `playwright.remote.ws`.
| playwright.remote.ws | - | Remote WebSocket URL for Playwright (alternative to CDP, e.g. `ws://localhost:3000/`).
| playwright.skip.download | false | Skip automatic browser download. Implicitly forced to `true` when `playwright.cdp.url` or `playwright.remote.ws` is set.
| playwright.load.event | load | Page load event to wait for. One of `load`, `domcontentloaded`, `networkidle`.
| playwright.skip.resource.types | - | List of resource types aborted during navigation (`document`, `stylesheet`, `image`, `media`, `font`, `script`, `texttrack`, `xhr`, `fetch`, `eventsource`, `websocket`, `manifest`, `other`).
| playwright.evaluations | - | List of JavaScript expressions evaluated after load; each JSON-serialised result is stored in response metadata under the expression itself.
| playwright.capture.content.on.error | false | If `true`, also capture `page.content()` for non-2xx responses — useful for SPAs that return a stub then hydrate via JS.
| playwright.override.status.on.content | false | When content was captured for a non-2xx response, override the reported HTTP status with `200`. The original status is preserved under the `playwright.origin.status` response metadata key. No-op unless `playwright.capture.content.on.error` is also `true`.
| playwright.page.actions.config.file | - | JSON file declaring an ordered chain of `PageAction` implementations applied after navigate succeeds and before content capture. See _Page actions_ below.
|===

===== Page actions

The Playwright protocol exposes a `PageAction` extension point — an ordered chain of post-navigate
DOM transformations loaded from a JSON file referenced by `playwright.page.actions.config.file`. Use
this to plug site-specific behaviour (tab/accordion expansion, cookie-banner dismissal,
infinite-scroll, custom `evaluate()` calls, screenshotting, ...) into the protocol without
subclassing it. The chain runs only when content would otherwise be captured (on 2xx, or on non-2xx
when `playwright.capture.content.on.error` is `true`). Per-action failures are logged and swallowed
so one bad action cannot abort the rest of the chain.

[source,json]
----
{
  "org.apache.stormcrawler.protocol.playwright.PageActions": [
    {
      "class": "org.apache.stormcrawler.protocol.playwright.actions.DismissOverlayAction",
      "name": "cookies",
      "params": { "selectors": ["#cookie-accept"] }
    },
    {
      "class": "org.apache.stormcrawler.protocol.playwright.actions.ExpandClickablesAction",
      "name": "tabs",
      "params": {
        "selectors": [".tab-widget .tab-header"],
        "root": ".tab-widget",
        "body": ".tab-widget-body",
        "waitMs": 300
      }
    }
  ]
}
----

Built-in actions:

[cols="1,3", options="header"]
|===
| Class | Purpose

| `ExpandClickablesAction` | Clicks every element matching the configured selectors and clones the resulting body container into a hidden cache under the same widget root, so `page.content()` ends up containing the HTML of every tab/accordion panel rather than only the active one.
| `EvaluateAction` | Evaluates a list of JavaScript expressions and stores each JSON-serialised result in response metadata.
| `ScrollToBottomAction` | Repeatedly scrolls to the bottom of the page until the document height stops growing, the step cap is reached, or the time budget elapses — useful for infinite-scroll feeds.
| `WaitForSelectorAction` | Waits for a selector to reach an `attached` / `detached` / `visible` / `hidden` state. Soft-fails on timeout by default; set `required: true` to fail the action on timeout.
| `DismissOverlayAction` | Dismisses cookie banners, GDPR walls, newsletter modals, etc. by clicking the first match of each selector, and optionally removes sticky overlays from the DOM via `removeSelectors`.
| `ScreenshotAction` | Captures a screenshot of the page and stores it base64-encoded in response metadata. For diagnostics / small-volume use; larger crawls should write to a blob store.
|===

See the link:https://github.com/apache/stormcrawler/tree/main/external/playwright[playwright module README] for the full parameter list of each built-in action and a guide on writing your own.

==== Language ID

51 changes: 51 additions & 0 deletions docs/src/main/asciidoc/extending.adoc
@@ -162,6 +162,57 @@ https.protocol.implementation: "com.example.MyProtocol"

Use the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/DelegatorProtocol.java[DelegatorProtocol] when you need to route URLs to different protocol implementations based on metadata or URL patterns.

==== Custom Page Action (Playwright)

The link:https://github.com/apache/stormcrawler/tree/main/external/playwright[Playwright protocol] exposes a `PageAction` extension point so you can plug site-specific post-navigate behaviour (tab/accordion expansion, cookie-banner dismissal, infinite-scroll, custom `evaluate()` calls, screenshotting, ...) into the protocol without subclassing it. Actions are loaded as an ordered chain from a JSON file referenced by `playwright.page.actions.config.file` and follow the same `Configurable` lifecycle as URL/parse filters. The chain runs after `page.navigate()` succeeds and before `page.content()` is captured, so any DOM mutations land in the rendered content returned by the protocol.

A handful of built-in actions ship with the module — `DismissOverlayAction`, `ExpandClickablesAction`, `ScrollToBottomAction`, `EvaluateAction`, `WaitForSelectorAction`, `ScreenshotAction`. Reach for a custom action when none of those fit. Extend `PageAction` and implement `apply`:

[source,java]
----
import org.apache.stormcrawler.protocol.playwright.PageAction;
import org.apache.stormcrawler.Metadata;
import com.fasterxml.jackson.databind.JsonNode;
import com.microsoft.playwright.Page;
import java.util.Map;

public class MyPageAction extends PageAction {

    private String selector;

    @Override
    public void configure(Map<String, Object> stormConf, JsonNode params) {
        if (!params.has("selector")) {
            throw new IllegalArgumentException("MyPageAction requires 'selector'");
        }
        this.selector = params.get("selector").asText();
    }

    @Override
    public void apply(Page page, String url,
            Metadata sourceMetadata, Metadata responseMetadata) {
        page.locator(selector).click();
    }
}
----

Reference the action by its fully-qualified class name in the chain JSON:

[source,json]
----
{
  "org.apache.stormcrawler.protocol.playwright.PageActions": [
    {
      "class": "com.example.MyPageAction",
      "name": "my-action",
      "params": { "selector": "#load-more" }
    }
  ]
}
----

Per-action failures in `apply()` are logged and swallowed by the chain wrapper so one bad action cannot abort the rest. If you need a hard failure on misconfiguration, throw from `configure()` — that propagates at topology start-up, before any URL is fetched.

==== Custom Bolt or Spout

For a bolt that emits on the status stream (like fetchers and parsers), extend link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/StatusEmitterBolt.java[StatusEmitterBolt]:
70 changes: 70 additions & 0 deletions external/playwright/README.md
@@ -40,10 +40,80 @@ The setting `playwright.skip.download` to `true` in the configuration will assum
| `playwright.evaluations` | _empty_ | List of JavaScript expressions evaluated on the page after load. Each result is JSON-serialized and stored in the response metadata under the expression itself as the key. |
| `playwright.capture.content.on.error` | `false` | By default the rendered DOM is only captured when the origin returns a 2xx status. Set to `true` to also capture `page.content()` for non-2xx responses — useful for Single-Page Applications that return a non-2xx stub document and then hydrate the real content via JavaScript. |
| `playwright.override.status.on.content` | `false` | When the rendered DOM was captured for a non-2xx response, override the reported HTTP status with `200` so downstream components treat the URL as `FETCHED`. The original origin status is preserved in the response metadata under the key `playwright.origin.status`. No-op unless `playwright.capture.content.on.error` is also `true`. |
| `playwright.page.actions.config.file` | _unset_ | Path to a JSON file declaring an ordered chain of `PageAction` implementations applied after `page.navigate()` succeeds and before `page.content()` is captured. Use this to plug site-specific post-navigate behaviour (tab/accordion expansion, cookie-banner dismissal, scroll-to-bottom, custom `evaluate()` calls, ...) into the protocol without subclassing it. The chain runs only when content would otherwise be captured (i.e. on 2xx, or on non-2xx if `playwright.capture.content.on.error` is `true`). |

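The interplay between `playwright.capture.content.on.error` and `playwright.override.status.on.content` described in the table above can be sketched in plain Java. This is only an illustration under stated assumptions, not the module's code: `StatusOverrideSketch` and `reportedStatus` are hypothetical names, a plain `Map` stands in for StormCrawler's `Metadata`, and "2xx" is taken to mean 200 to 299.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the status-override rule; not part of the module. */
final class StatusOverrideSketch {

    // key named in the configuration table above
    static final String ORIGIN_STATUS_KEY = "playwright.origin.status";

    private StatusOverrideSketch() {}

    /**
     * Returns the HTTP status to report downstream. When content was captured for a
     * non-2xx response and the override is enabled, the origin status is recorded in
     * the response metadata and 200 is returned instead.
     */
    static int reportedStatus(
            int originStatus,
            boolean captureContentOnError,
            boolean overrideStatusOnContent,
            Map<String, String> responseMetadata) {
        boolean fetched = originStatus >= 200 && originStatus < 300; // assumption: 2xx range
        boolean contentCaptured = fetched || captureContentOnError;
        if (!fetched && contentCaptured && overrideStatusOnContent) {
            responseMetadata.put(ORIGIN_STATUS_KEY, Integer.toString(originStatus));
            return 200;
        }
        // the override is a no-op when content was not captured on error
        return originStatus;
    }

    public static void main(String[] args) {
        Map<String, String> md = new HashMap<>();
        System.out.println(reportedStatus(404, true, true, md) + " " + md);
    }
}
```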
Per-URL metadata triggers:

| Metadata key | Effect |
|---|---|
| `playwright.trace` | If present on the input metadata, a Playwright trace zip is recorded for the navigation and its path is returned in the response metadata under the same key. |

## Page actions

Custom post-navigate behaviour is added by implementing `PageAction` and listing the implementation in the JSON file referenced by `playwright.page.actions.config.file`. Actions follow the same `Configurable` pattern as URL/parse filters and are loaded as an ordered chain. A failure in one action is logged and swallowed so the rest of the chain still runs.

```json
{
  "org.apache.stormcrawler.protocol.playwright.PageActions": [
    {
      "class": "org.apache.stormcrawler.protocol.playwright.actions.ExpandClickablesAction",
      "name": "tabs",
      "params": {
        "selectors": [".tab-widget .tab-header"],
        "root": ".tab-widget",
        "body": ".tab-widget-body",
        "waitMs": 300
      }
    }
  ]
}
```
```
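The "logged and swallowed" chain semantics described above can be illustrated with a small, dependency-free sketch. It is an assumption about the shape of the wrapper, not the actual `PageActions` implementation: `Step` and `SwallowingChain` are hypothetical stand-ins, and a list of strings replaces both the Playwright `Page` and the logger.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PageAction; the real class receives a Page and Metadata. */
interface Step {
    void apply(List<String> log) throws Exception;
}

/** Hypothetical chain wrapper: runs steps in declaration order, swallowing failures. */
final class SwallowingChain {

    private final List<Step> steps;

    SwallowingChain(List<Step> steps) {
        this.steps = new ArrayList<>(steps);
    }

    /** Runs every step in order and returns how many of them threw. */
    int apply(List<String> log) {
        int failures = 0;
        for (Step step : steps) {
            try {
                step.apply(log);
            } catch (Exception e) {
                // record the failure and carry on with the next step
                failures++;
                log.add("swallowed: " + e.getMessage());
            }
        }
        return failures;
    }
}
```

With three steps where the second throws, the first and third still run and `apply` returns `1`; that is the behaviour the paragraph above attributes to the chain.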

### Built-in actions

| Class | Purpose |
|---|---|
| `ExpandClickablesAction` | Clicks every element matching the configured selectors and clones the resulting body container into a hidden cache under the same widget root, so `page.content()` ends up containing the HTML of every tab/accordion panel rather than only the originally active one. Anchors with an `href` are skipped. Parameters: `selectors` (array, required), `root` (string, required), `body` (string, required), `waitMs` (int, default `200`), `clickTimeoutMs` (int, default `2000`). |
| `EvaluateAction` | Evaluates a list of JavaScript expressions on the page and stores the JSON-serialised result of each in the response metadata. Parameters: `expressions` (array of strings, required), `keyPrefix` (string, optional — when set, results are stored under `keyPrefix + index` rather than under the expression itself, matching the legacy `playwright.evaluations` behaviour). |
| `ScrollToBottomAction` | Repeatedly scrolls to the bottom of the page until the document height stops growing, the step cap is reached, or the time budget elapses — useful for infinite-scroll feeds. Parameters: `waitMs` (int, default `500`), `maxSteps` (int, default `20`), `maxDurationMs` (int, default `15000`). |
| `WaitForSelectorAction` | Waits for a selector to reach a given state before allowing the chain to proceed. By default a timeout is treated as a soft failure (logged and swallowed); set `required: true` to fail the action on timeout. Parameters: `selector` (string, required), `state` (one of `attached`, `detached`, `visible`, `hidden` — default `visible`), `timeoutMs` (int, default `5000`), `required` (bool, default `false`). |
| `DismissOverlayAction` | Dismisses cookie banners, GDPR walls, newsletter modals, etc. by clicking the first match of each selector, and optionally removes sticky overlays by deleting matching elements from the DOM. Missing elements and click failures are silently skipped. Parameters: `selectors` (array of strings), `removeSelectors` (array of strings), `timeoutMs` (int, default `1500`). At least one of `selectors` or `removeSelectors` must be non-empty. |
| `ScreenshotAction` | Captures a screenshot of the page and stores it base64-encoded in the response metadata. Intended for diagnostics and small-volume use; larger crawls should write to a blob store instead. Parameters: `metadataKey` (string, default `playwright.screenshot`), `fullPage` (bool, default `false`), `type` (`png` or `jpeg`, default `png`), `quality` (int 0-100, only honoured for JPEG). |

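The three stop conditions listed for `ScrollToBottomAction` (height stops growing, step cap, time budget) can be sketched as a plain loop. This is an illustrative approximation rather than the action's source: `ScrollLoopSketch` and `scrollUntilStable` are hypothetical names, and the `scrollAndMeasure` callback stands in for "scroll to the bottom, wait `waitMs`, read the document height".

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a scroll-until-stable loop; not the module's implementation. */
final class ScrollLoopSketch {

    private ScrollLoopSketch() {}

    /**
     * @param scrollAndMeasure assumed callback: scrolls, waits, returns the new document height
     * @param initialHeight document height before the first scroll
     * @param maxSteps upper bound on scroll steps
     * @param maxDurationMs overall time budget in milliseconds
     * @return the number of scroll steps actually performed
     */
    static int scrollUntilStable(
            LongSupplier scrollAndMeasure, long initialHeight, int maxSteps, long maxDurationMs) {
        final long start = System.nanoTime();
        final long budgetNanos = maxDurationMs * 1_000_000L;
        long lastHeight = initialHeight;
        int steps = 0;
        while (steps < maxSteps && System.nanoTime() - start < budgetNanos) {
            long height = scrollAndMeasure.getAsLong();
            steps++;
            if (height <= lastHeight) {
                break; // the document stopped growing
            }
            lastHeight = height;
        }
        return steps;
    }
}
```

Bounding the loop by both a step count and a wall-clock budget keeps a feed that never stops growing from stalling the fetch.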
### Writing your own action

Extend `PageAction` (which extends `AbstractConfigurable`) and implement `apply(Page, url, sourceMetadata, responseMetadata)`. Read parameters from the supplied `JsonNode` in `configure()` and validate at load time — anything thrown there propagates through `PageActions.fromConf()` and stops the topology from starting with a misconfigured chain.

```java
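import org.apache.stormcrawler.protocol.playwright.PageAction;
import org.apache.stormcrawler.Metadata;
import com.fasterxml.jackson.databind.JsonNode;
import com.microsoft.playwright.Page;
import java.util.Map;

public class MyAction extends PageAction {

    private String selector;

    // validate parameters at load time so a misconfigured chain stops the topology
    @Override
    public void configure(final Map<String, Object> stormConf, final JsonNode params) {
        if (!params.has("selector")) {
            throw new IllegalArgumentException("MyAction requires 'selector'");
        }
        this.selector = params.get("selector").asText();
    }

    // runs after navigation and before page.content() is captured
    @Override
    public void apply(final Page page, final String url,
            final Metadata sourceMetadata, final Metadata responseMetadata) {
        page.locator(selector).click();
    }
}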
```

Reference it from the chain JSON via its fully-qualified class name. Per-action failures are logged and swallowed by the chain wrapper so one bad action cannot abort the rest — if you need a hard failure, raise it from `configure()`, not from `apply()`.

## Tests

The module has three test classes:

- `PageActionsTest` — JSON loader / chain construction (no browser).
- `actions/ActionConfigureTest` — `configure()` validation for every built-in (no browser).
- `PageActionsLiveTest` — end-to-end browser tests for the chain and individual actions.

The live tests use the same `assumeTrue("false".equals(System.getProperty("CI_ENV", "false")))` gate as `ProtocolTest`, so they run locally (`mvn test`) but skip on CI runners launched with `-DCI_ENV=true`. They expect a usable Chromium — either via `mvn exec:java -e -Dexec.mainClass=com.microsoft.playwright.CLI -Dexec.args="install chromium"` or by pointing `playwright.cdp.url` at an existing browser.

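The gate quoted above reduces to a single system-property check. The helper below is a hypothetical illustration (`LiveTestGate` is not a class in the module) of how that expression evaluates: the tests run when `CI_ENV` is unset or `false`, and are skipped for any other value.

```java
/** Hypothetical helper mirroring the CI_ENV gate used by the live tests. */
final class LiveTestGate {

    private LiveTestGate() {}

    /** True when CI_ENV is unset or exactly "false", i.e. when live tests should run. */
    static boolean liveTestsEnabled() {
        return "false".equals(System.getProperty("CI_ENV", "false"));
    }

    public static void main(String[] args) {
        // run with -DCI_ENV=true to see the gate close
        System.out.println("live tests enabled: " + liveTestsEnabled());
    }
}
```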
7 changes: 7 additions & 0 deletions external/playwright/playwright-conf.yaml
@@ -36,3 +36,10 @@ config:
# playwright.capture.content.on.error is also true.
# playwright.override.status.on.content: false

# JSON file declaring an ordered chain of PageAction implementations applied
# after navigate() succeeds and before page.content() is captured. Use this
# to plug in site-specific post-navigate behaviour (tab/accordion expansion,
# cookie-banner dismissal, infinite-scroll, custom evaluate calls, ...)
# without subclassing the protocol. See README for the JSON shape.
# playwright.page.actions.config.file: "page-actions.json"

@@ -79,6 +79,8 @@ public class HttpProtocol extends AbstractHttpProtocol {

    private WaitUntilState loadEvent;

    private PageActions pageActions = PageActions.emptyPageActions;

    @Override
    public void configure(final Config conf) {
        super.configure(conf);
@@ -172,6 +174,9 @@ public void configure(final Config conf) {

        // expressions to evaluate
        evaluations = ConfUtils.loadListFromConf(MD_EVALUATIONS, conf);

        // optional chain of page actions applied after navigate, before content capture
        pageActions = PageActions.fromConf(conf);
    }

    @Override
@@ -260,6 +265,8 @@ public ProtocolResponse getProtocolOutput(String url, Metadata md) throws Except
        boolean contentCaptured = false;

        if (fetched || captureContentOnError) {
            // run any configured post-navigate actions before capturing content
            pageActions.apply(page, url, md, responseMetaData);
            // retrieve the rendered content
            content = page.content().getBytes(StandardCharsets.UTF_8);
            contentCaptured = true;
@@ -320,6 +327,7 @@ private Proxy getProxy(String proxyserver, String proxyuser, String proxypwd) {
    public void cleanup() {
        synchronized (this) {
            super.cleanup();
            pageActions.cleanup();
            context.close();
            browser.close();
        }
@@ -0,0 +1,56 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to you under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.stormcrawler.protocol.playwright;

import com.microsoft.playwright.Page;
import org.apache.storm.task.IBolt;
import org.apache.stormcrawler.Metadata;
import org.apache.stormcrawler.util.AbstractConfigurable;
import org.jetbrains.annotations.NotNull;

/**
 * A pluggable post-navigate page transformation. Each implementation is invoked after {@code
 * page.navigate()} succeeds and before {@code page.content()} is captured, so any DOM mutations it
 * makes are reflected in the rendered content returned by the protocol.
 *
 * <p>Actions are loaded as an ordered chain via {@link PageActions} from a JSON file referenced by
 * the {@code playwright.page.actions.config.file} configuration key. They follow the same {@link
 * org.apache.stormcrawler.util.Configurable} pattern as URL/parse filters.
 */
public abstract class PageAction extends AbstractConfigurable {

    /**
     * Apply this action to the page.
     *
     * @param page the live Playwright {@link Page}, already navigated to {@code url}
     * @param url the URL being fetched
     * @param sourceMetadata input metadata associated with the URL (read-only intent)
     * @param responseMetadata response metadata being built up; actions may add diagnostics here
     */
    public abstract void apply(
            @NotNull final Page page,
            @NotNull final String url,
            @NotNull final Metadata sourceMetadata,
            @NotNull final Metadata responseMetadata)
            throws Exception;

    /** Release any resources held by the action. See {@link IBolt#cleanup()} for more details. */
    public void cleanup() {
        // nothing to do here
    }
}