6 changes: 6 additions & 0 deletions samples/browser-harness-webscraping/.env.template
@@ -0,0 +1,6 @@
# Microsoft Playwright Service - Environment Variables
# Copy this file to .env and fill in your values

# Playwright Service (Required for all samples)
PLAYWRIGHT_SERVICE_URL=
PLAYWRIGHT_SERVICE_ACCESS_TOKEN=
86 changes: 86 additions & 0 deletions samples/browser-harness-webscraping/README.md
@@ -0,0 +1,86 @@
# Parallel Web Scraping with Browser-Harness + Playwright Workspaces

This sample demonstrates how to use [browser-harness](https://github.com/browser-use/browser-harness) with [Playwright Workspaces (PWW)](https://aka.ms/pww/docs) to run 10+ parallel remote browser sessions for web scraping, with LiveView for real-time debuggability.

## Overview

When you need to scrape data from many pages simultaneously — product prices, inventory levels, competitor catalogs — you need parallel browser sessions. This sample shows how to:

1. **Connect browser-harness** to PWW's remote CDP endpoint
1. **Spawn 10+ parallel browser sessions** — each with its own isolated browser
1. **Scrape product data** from multiple pages concurrently

## Prerequisites

- **Azure subscription** with permissions to create Playwright Workspaces
- A **Playwright Workspace** and a **Playwright Service access token**. See [how to create a workspace](https://learn.microsoft.com/en-us/azure/app-testing/playwright-workspaces/quickstart-run-end-to-end-tests?tabs=playwrightcli&pivots=playwright-test-runner) and [how to create an access token](https://learn.microsoft.com/en-us/azure/app-testing/playwright-workspaces/how-to-manage-access-tokens).
- **Python 3.10+**
- **Git** installed

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Set Up Environment Variables

Copy `.env.template` to `.env` and fill in your values:

```bash
cp .env.template .env
```

Required variables:
```
PLAYWRIGHT_SERVICE_URL=<playwright-service-url>
PLAYWRIGHT_SERVICE_ACCESS_TOKEN=<playwright-service-access-token>
```
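
Before handing these values to an agent, it can help to sanity-check them. The sketch below is a minimal, hypothetical checker (`check_env` is not part of this sample). It assumes the URL shape documented in `playwright_service_client.py`, that is `wss://<region>.api.playwright.microsoft.com/playwrightworkspaces/<workspaceId>/browsers`.

```python
import re

# URL shape documented in playwright_service_client.py.
_EXPECTED = re.compile(
    r"wss://\w+\.api\.playwright\.microsoft\.com/playwrightworkspaces/[^/]+/browsers"
)


def check_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    url = env.get("PLAYWRIGHT_SERVICE_URL", "")
    token = env.get("PLAYWRIGHT_SERVICE_ACCESS_TOKEN", "")
    if not url:
        problems.append("PLAYWRIGHT_SERVICE_URL is not set")
    elif not _EXPECTED.match(url):
        problems.append("PLAYWRIGHT_SERVICE_URL does not match the expected wss:// format")
    if not token:
        problems.append("PLAYWRIGHT_SERVICE_ACCESS_TOKEN is not set")
    return problems
```

After loading `.env`, calling `check_env(dict(os.environ))` returns one message per missing or malformed value. It only checks shape, not whether the token is actually valid.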


### 3. Set Up browser-harness to Connect to Playwright Service Browsers

In a coding agent of your choice, such as Codex or Claude Code, use the following prompt:

```text
Set up https://github.com/browser-use/browser-harness for me.

Read `install.md` and follow the steps to install browser-harness and connect it to my Playwright Workspaces remote browsers.

Get the SERVICE_URL needed for provisioning remote browsers by running the `get_cdp_browsers_endpoint()` function from `playwright_service_client.py`.

Then update your skill to follow the two-step connection flow for Playwright remote browsers:

1. HTTP GET the SERVICE_URL (allow 60-90s for the browser to spin up). Parse the JSON response to extract the `sessionUrl` (a wss:// WebSocket URL).
2. Set BU_CDP_WS to the resolved sessionUrl in .env, then restart the daemon ONCE.

IMPORTANT:

- Do NOT kill or restart the daemon after the session is connected — the remote browser is destroyed when the WebSocket connection closes.
- Do NOT set shouldRedirect=true; use shouldRedirect=false and manually resolve the sessionUrl.
- The cold start takes 30-90s. Use a generous timeout on the initial HTTP GET.
- After connecting, verify with: browser-harness <<'PY'\nprint(page_info())\nPY

Once connected, confirm with a screenshot that the remote browser is alive.
```
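
The two-step connection flow described in the prompt can be sketched in plain Python. This is a minimal illustration, not the agent's actual implementation: the helper names are hypothetical, and the `Authorization: Bearer` header is an assumption that should be confirmed against the service documentation. The `sessionUrl` field and the `BU_CDP_WS` variable come from the prompt above.

```python
import json
import urllib.request


def extract_session_url(body: str) -> str:
    """Step 1 (parsing): pull the wss:// sessionUrl out of the JSON response body."""
    data = json.loads(body)
    session_url = data.get("sessionUrl") if isinstance(data, dict) else None
    if not isinstance(session_url, str) or not session_url.startswith("wss://"):
        raise ValueError(f"No wss:// sessionUrl in response: {body[:200]}")
    return session_url


def resolve_session_url(service_url: str, access_token: str, timeout: float = 120.0) -> str:
    """Step 1 (request): GET the SERVICE_URL with a generous timeout for the cold start."""
    request = urllib.request.Request(
        service_url,
        # ASSUMPTION: bearer-token auth; check the header name in the service docs.
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_session_url(response.read().decode("utf-8"))


def env_line(session_url: str) -> str:
    """Step 2: the line to write to .env before restarting the daemon once."""
    return f"BU_CDP_WS={session_url}"
```

Keeping the parsing separate from the request makes it possible to check the response handling without provisioning a browser. The 120-second default is simply a margin above the 30-90s cold start mentioned in the prompt.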

#### Start scraping with browser-harness and Playwright remote browsers

Once this is done, you can ask your agent to use browser-harness with Playwright remote browsers to perform web scraping. Use a prompt similar to this:

```text

Go to the e-commerce websites Website 1 and Website 2 for India and search for gifts under 500 for 10-year-old kids that are useful and reusable, not single-use.
Delivery to Bengaluru should be within 3 days, and 5 pieces of the item should be available.
Create an independent Playwright Service remote browser session per website and use one sub-agent per website to browse in parallel using browser-harness. Close each remote session after scraping.

```
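
The one-session-per-website pattern in the prompt can be pictured as a small fan-out. The sketch below is a minimal illustration that assumes a hypothetical `scrape_site` worker standing in for whatever a sub-agent does with its own remote session. It does not call browser-harness or the Playwright service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def scrape_site(site: str) -> dict:
    """Hypothetical worker: a real one would open its own remote session, scrape, then close it."""
    return {"site": site, "items": []}


def scrape_in_parallel(
    sites: list[str], worker: Callable[[str], dict] = scrape_site
) -> dict[str, dict]:
    """Run one worker per site concurrently and collect results keyed by site."""
    if not sites:
        return {}
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        # map() preserves input order, so results line up with `sites`.
        results = pool.map(worker, sites)
        return dict(zip(sites, results))
```

Threads are used here because each worker would spend most of its time waiting on a remote browser. One worker per site mirrors the "one sub-agent per website" instruction.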

## More Resources

- [Playwright Workspaces Documentation](https://aka.ms/pww/docs)
- [Browser-Harness GitHub](https://github.com/browser-use/browser-harness)
- [PWW Pricing](https://aka.ms/pww/pricing)
93 changes: 93 additions & 0 deletions samples/browser-harness-webscraping/playwright_service_client.py
@@ -0,0 +1,93 @@
"""
Microsoft Playwright Service - Python Client

Build the service URL used to request remote CDP browsers.

----------------------------------------
📌 Prerequisites
----------------------------------------
pip install python-dotenv

----------------------------------------
📌 Environment Variables
----------------------------------------
PLAYWRIGHT_SERVICE_URL=wss://<region>.api.playwright.microsoft.com/playwrightworkspaces/<workspaceId>/browsers
PLAYWRIGHT_SERVICE_ACCESS_TOKEN=your_access_token

----------------------------------------
📌 How to Use
----------------------------------------
from playwright_service_client import get_cdp_browsers_endpoint

endpoint = get_cdp_browsers_endpoint()
"""

import re
import os
from dotenv import load_dotenv

load_dotenv()


class PlaywrightServiceError(Exception):
    """Exception for Playwright Service errors."""
    pass


# URL pattern: wss://<region>.api.playwright.microsoft.com/playwrightworkspaces/<workspaceId>/browsers
_URL_PATTERN = re.compile(
    r'wss://(\w+)\.api\.playwright\.microsoft\.com/playwrightworkspaces/([^/]+)/browsers'
)


def _parse_url(url: str) -> tuple[str, str]:
    """Extract region and workspace ID from service URL."""
    match = _URL_PATTERN.match(url)
    if not match:
        raise PlaywrightServiceError(
            f"Invalid PLAYWRIGHT_SERVICE_URL format: {url}\n"
            f"Expected: wss://<region>.api.playwright.microsoft.com/playwrightworkspaces/<workspaceId>/browsers"
        )
    return match.group(1), match.group(2)


def get_cdp_browsers_endpoint(
    service_url: str | None = None,
    access_token: str | None = None,
) -> str:
    """
    Build the SERVICE_URL that an agent can use to request browsers it can connect to via CDP.

    Args:
        service_url: Service URL (defaults to PLAYWRIGHT_SERVICE_URL env var)
        access_token: Access token (defaults to PLAYWRIGHT_SERVICE_ACCESS_TOKEN env var)

    Returns:
        URL for getting CDP browsers

    Example:
        SERVICE_URL = get_cdp_browsers_endpoint()
    """
    # Get credentials from env vars if not provided
    service_url = service_url or os.getenv("PLAYWRIGHT_SERVICE_URL")
    access_token = access_token or os.getenv("PLAYWRIGHT_SERVICE_ACCESS_TOKEN")

    if not service_url:
        raise PlaywrightServiceError(
            "PLAYWRIGHT_SERVICE_URL environment variable is not set.\n"
            "Expected: wss://<region>.api.playwright.microsoft.com/playwrightworkspaces/<workspaceId>/browsers"
        )
    # The token is only checked for presence here; it is not embedded in the returned URL.
    if not access_token:
        raise PlaywrightServiceError(
            "PLAYWRIGHT_SERVICE_ACCESS_TOKEN environment variable is not set."
        )

    # Parse URL to get region and workspace ID
    region, workspace_id = _parse_url(service_url)

    # Build API URL (https scheme, same host and path as the wss:// service URL)
    api_url = (
        f"https://{region}.api.playwright.microsoft.com"
        f"/playwrightworkspaces/{workspace_id}/browsers"
        f"?os=linux&browser=chromium&playwrightVersion=cdp&shouldRedirect=false"
    )

    return api_url
1 change: 1 addition & 0 deletions samples/browser-harness-webscraping/requirements.txt
@@ -0,0 +1 @@
python-dotenv