fix(scrape-webpage): harden analyze-webpage.js against bot detection and HTTP/2 errors #82

Description

@arumsey

Summary

analyze-webpage.js uses a hardcoded Chromium executable path and a fragile browser
configuration. Sites that detect headless browsers can block it, for example by
rejecting HTTP/2 connections or by bot fingerprinting.

Changes needed

  • Replace hardcoded executablePath (/ms-playwright/chromium-1208/chrome-linux/chrome)
    with dynamic Chrome-first auto-detection: try channel: 'chrome' first (real TLS
    fingerprint), fall back to bundled Chromium silently. The hardcoded path breaks if the
    Chromium version changes and doesn't work outside Docker.
  • Add --disable-http2 to browser args to prevent ERR_HTTP2_PROTOCOL_ERROR from
    servers that reject HTTP/2 connections from headless browsers.
  • Switch navigation to wait for domcontentloaded (60s timeout, then a 5s settle delay)
    instead of networkidle. Many sites never reach networkidle, so the script times out
    unnecessarily.
  • Add explicit timeout: 60000 to the screenshot call to prevent failures on large pages.
  • Align browser context config with run-bulk-import.js: Chrome 131 UA, realistic
    sec-ch-ua headers, locale, timezone, ignoreHTTPSErrors.
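
The changes above could be sketched roughly as follows. This is a minimal illustration, not the actual analyze-webpage.js code. The helper names (`launchCandidates`, `launchBrowser`, `capturePage`) are hypothetical, and the user agent string, sec-ch-ua values, locale, and timezone are assumed placeholders that should be copied from run-bulk-import.js rather than from here. Playwright's `chromium` object is passed in as a parameter so the fallback logic can be exercised without a browser installed.

```javascript
// Sketch only. Values marked "assumed" are illustrative placeholders,
// not taken from run-bulk-import.js.

const LAUNCH_ARGS = ['--disable-http2']; // avoid ERR_HTTP2_PROTOCOL_ERROR

const CONTEXT_OPTIONS = {
  // assumed Chrome 131 desktop UA string
  userAgent:
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
  locale: 'en-US', // assumed
  timezoneId: 'America/New_York', // assumed
  ignoreHTTPSErrors: true,
  extraHTTPHeaders: {
    // assumed client-hint values matching the UA above
    'sec-ch-ua': '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
  },
};

// Ordered launch options: installed Chrome first, bundled Chromium second.
// No executablePath is set, so Playwright resolves the binary itself.
function launchCandidates() {
  const base = { headless: true, args: LAUNCH_ARGS };
  return [{ ...base, channel: 'chrome' }, base];
}

// Try each candidate in order, moving on silently when a launch fails.
// `chromium` is Playwright's chromium browser type.
async function launchBrowser(chromium) {
  let lastError;
  for (const options of launchCandidates()) {
    try {
      return await chromium.launch(options);
    } catch (err) {
      lastError = err; // e.g. Chrome is not installed; try the next option
    }
  }
  throw lastError;
}

// Navigate without waiting for networkidle, let the page settle, then
// take a full-page screenshot with an explicit timeout.
async function capturePage(browser, url, screenshotPath) {
  const context = await browser.newContext(CONTEXT_OPTIONS);
  try {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    await page.waitForTimeout(5000); // 5s settle
    await page.screenshot({ path: screenshotPath, fullPage: true, timeout: 60000 });
  } finally {
    await context.close();
  }
}

// Usage (requires the playwright package):
//   const { chromium } = require('playwright');
//   const browser = await launchBrowser(chromium);
//   await capturePage(browser, 'https://example.com', 'page.png');
//   await browser.close();
```

Keeping the candidate list separate from the launch loop makes the Chrome-first order easy to verify and to extend if another channel is needed later.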
