Add opt-in browser-scraper Worker for JS-rendered monitor pages#14
Add opt-in browser-scraper Worker for JS-rendered monitor pages#14valuecodes merged 4 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a new apps/browser-scraper Cloudflare Worker so the operator can optionally scrape JS-rendered pages through Browser Rendering instead of the existing native fetch() path. It also factors URL validation into a shared workspace package, adds the operator-side service binding and schedule flag plumbing, and includes the logger metadata redaction update.
Changes:
- Adds a new
@repo/browser-scraperWorker that uses Playwright on Cloudflare Browser Rendering, plus its config, package metadata, and tests. - Moves URL safety checks into a shared
@repo/url-validatorpackage and switches operator scraping/validation code to use it. - Adds operator-side browser-scraper binding/migration wiring and updates CI deploy order so the new service exists before operator deploys.
Reviewed changes
Copilot reviewed 25 out of 28 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
pnpm-lock.yaml |
Locks new workspace/package dependencies for browser-scraper and url-validator. |
packages/url-validator/vitest.config.ts |
Adds Vitest config for the shared URL validator package. |
packages/url-validator/tsconfig.json |
Adds TS config for the shared URL validator package. |
packages/url-validator/src/url-validator.ts |
Introduces shared source URL validation logic used by operator and browser-scraper. |
packages/url-validator/src/url-validator.test.ts |
Adds unit tests for the shared URL validator. |
packages/url-validator/package.json |
Defines the new shared package metadata and scripts. |
packages/url-validator/eslint.config.ts |
Adds ESLint config for the new shared package. |
packages/logger/src/logger.ts |
Adds top-level metadata redaction for sensitive log keys. |
packages/logger/src/__tests__/logger.test.ts |
Tests the new logger redaction behavior. |
apps/operator/wrangler.jsonc |
Wires the browser-scraper service binding into the operator Worker. |
apps/operator/src/types/env.ts |
Extends operator environment typings with the browser-scraper binding. |
apps/operator/src/services/scrape.ts |
Refactors scraping to support native fetch or browser-scraper via options. |
apps/operator/src/services/scrape.test.ts |
Adds tests for the new browser-scraper scrape path. |
apps/operator/src/scheduled.ts |
Passes per-schedule browser options into monitor execution. |
apps/operator/src/modules/telegram/controller.ts |
Switches monitor URL validation import to the shared package. |
apps/operator/src/db/schema.ts |
Adds the use_browser column to schedules. |
apps/operator/package.json |
Adds the shared URL validator dependency to operator. |
apps/operator/migrations/meta/_journal.json |
Registers the new migration in Drizzle metadata. |
apps/operator/migrations/meta/0005_snapshot.json |
Captures the schema snapshot including use_browser. |
apps/operator/migrations/0005_nervous_scorpion.sql |
Adds the SQL migration for the new schedule column. |
apps/browser-scraper/wrangler.jsonc |
Configures the new browser-scraper Worker and Browser Rendering binding. |
apps/browser-scraper/tsconfig.json |
Adds TS config for the new Worker. |
apps/browser-scraper/src/services/playwright.ts |
Implements Playwright-based rendering and subresource URL checks. |
apps/browser-scraper/src/services/playwright.test.ts |
Adds unit tests for Playwright rendering behavior. |
apps/browser-scraper/src/index.ts |
Exposes the browser-scraper Worker request handler. |
apps/browser-scraper/package.json |
Defines package metadata/scripts/dependencies for the new Worker. |
apps/browser-scraper/eslint.config.ts |
Adds ESLint config for the new Worker. |
.github/workflows/main.yml |
Updates deploy ordering so browser-scraper deploys before operator. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| useBrowser: int("use_browser", { mode: "boolean" }) | ||
| .notNull() | ||
| .default(false), |
There was a problem hiding this comment.
Intentional and called out as out-of-scope in the PR plan: enabling a schedule for the browser path is currently a manual step (wrangler d1 execute … "UPDATE schedules SET use_browser = 1 WHERE id = '…'"). Wiring it into createScheduleSchema / the OpenAI create_schedule tool / ScheduleService.create is a deliberate follow-up so we can validate the daily browser-time budget before exposing the toggle to LLM-generated schedules.
| const html = await page.content(); | ||
| const truncated = html.length > MAX_HTML_CHARS; | ||
| const finalHtml = truncated ? html.slice(0, MAX_HTML_CHARS) : html; |
There was a problem hiding this comment.
Intentional. page.content() has no streaming variant on the @cloudflare/playwright binding, so the only enforcement point the worker exposes is post-serialization. The cap is therefore on the response payload we send back over the service binding (and ultimately into the AI prompt), not on browser memory — the 60s hard timeout and 10 min/day daily budget bound the memory side. Char vs byte: agreed it's not a hard byte cap, but UTF-8 worst case is ~4× and downstream convertContent handles oversized input fine, so the slack is acceptable here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d65cd20a7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - name: Deploy operator to Cloudflare Workers | ||
| run: pnpm --filter @repo/operator run deploy |
There was a problem hiding this comment.
Run DB migration before deploying the operator
This workflow deploys the updated operator immediately but never applies migration 0005_nervous_scorpion.sql, even though the commit adds a new schedules.use_browser column that is now part of the runtime schema (apps/operator/src/db/schema.ts) and read during scheduled execution. On environments where db:migrate:remote has not been run yet, the newly deployed worker can fail with no such column: use_browser as soon as it queries schedules. Add a migration step (for example pnpm --filter @repo/operator db:migrate:remote) before the operator deploy step.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
A dedicated workflow at .github/workflows/migrations.yml already applies D1 migrations on push to main whenever apps/operator/migrations/** changes (wrangler d1 migrations apply switch-operator-db --remote). Duplicating that step inside the main deploy job would either run it twice or couple the two workflows. There is a known race window where the deploy can land before the migration workflow finishes; the current mitigation is to merge migration-only changes ahead of the feature commit when the gap matters. Happy to add an explicit guard if the race becomes a real problem.
What
Adds a separate
apps/browser-scraperWorker that runs Playwright on Cloudflare Browser Rendering, called from operator via a Service Binding. Nativefetch()stays the default; per-monitoruse_browserflag opts into the browser path. Includes the previously committed logger metadata-key redaction.switch-operator-browser-scraperwithworkers_dev: false(only reachable via service binding); SSRF guards on top-level URL, post-redirect URL, and every subresource viapage.route()use_browsercolumn (migration0005),BROWSER_SCRAPERservice binding,scrapeUrlswitched to options-object signaturepackages/url-validatorworkspace, used by both WorkersHow to test
pnpm typecheck && pnpm lint && pnpm format:check && pnpm test— all green (210 tests)wrangler tailflowSecurity review
workers_dev: false); SSRF validator blocks private/loopback/non-HTTPS on three paths (initial URL, post-redirect, subresources); 2 MB body cap and 15s navigation timeout protect the daily browser budget.@cloudflare/playwright@1.2.0added to the new Worker only; operator bundle unchanged.