Heuristic logic reorganized and inconsistency fix by Abdul03Rafay · Pull Request #17 · benevolentbandwidth/beacon

Abdul03Rafay · 2026-06-07T20:51:18Z

1st Commit — Jun, 7th.

Reorganized the heuristics .ts file into three separate files:

contentHeuristics.ts — page content (text, title, meta description)
urlHeuristics.ts — the main page URL
linkHeuristics.ts — links found on the page
Renamed heuristics_link.ts → linkHeuristics.ts for naming consistency.

Implemented `urlHeuristics.ts`

Moved our EDA-backed URL rules out of contentHeuristics.ts where they didn't belong:

Tier 1: IP address hostname, @ credential injection, URL > 144 chars (all 100% phishing in EDA)
Tier 2: URL > 75 chars + hyphen-stacking or percent-encoded path (compound signal)
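
The two tiers above can be sketched roughly as follows. This is a minimal illustration, not the code in urlHeuristics.ts: the function names, flag names, and regexes are hypothetical, and "hyphen-stacking" is assumed here to mean two or more hyphens in the hostname.

```typescript
// Hypothetical sketch of the URL tiers; names and thresholds other than
// 144 and 75 are assumptions, not taken from urlHeuristics.ts.

const IPV4_HOST = /^\d{1,3}(\.\d{1,3}){3}$/;

/** Tier 1: signals that were 100% phishing in the EDA. */
function tier1Flags(rawUrl: string): string[] {
  const url = new URL(rawUrl);
  const flags: string[] = [];
  if (IPV4_HOST.test(url.hostname)) flags.push("ipAddressHostname");
  // Userinfo before "@" hides the real host: http://paypal.com@evil.example/
  if (url.username !== "" || url.password !== "") {
    flags.push("atCredentialInjection");
  }
  if (rawUrl.length > 144) flags.push("urlLengthHard");
  return flags;
}

/** Tier 2: a long URL counts only when paired with a second signal. */
function tier2Flags(rawUrl: string): string[] {
  if (rawUrl.length <= 75) return [];
  const url = new URL(rawUrl);
  const hyphenCount = (url.hostname.match(/-/g) ?? []).length;
  const hyphenStacking = hyphenCount >= 2; // assumed threshold
  const percentEncodedPath = /%[0-9a-fA-F]{2}/.test(url.pathname);
  return hyphenStacking || percentEncodedPath ? ["urlLengthCompound"] : [];
}
```

Keeping the Tier 2 length check as a gate rather than a flag of its own means a long but otherwise ordinary URL contributes nothing by itself.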

Stripped `contentHeuristics.ts` to content-only

Removed all URL rules and mismatched link detection. Now contains only:

sparsityNoMeta — sparse body text + no meta description (EDA proxy for largestlinelength + lineofcode)
Scam phrase detection across title, meta, and body

Upgraded `linkHeuristics.ts`

Added mismatched link detection (moved from contentHeuristics.ts) — visible text claims domain X, href goes to domain Y
Fixed scoring from 0.0–1.0 ratio to a proper 0–10 scale
Requires 3+ flags per link before counting it as suspicious (avoids false positives from weak signals like HTTP-only links)
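
As an illustration of the mismatch rule and the 3-flag threshold, here is a minimal sketch. The `PageLink` type, the helper names, and the domain regex are assumptions made for this example; the actual linkHeuristics.ts may extract and compare domains differently.

```typescript
// Hypothetical sketch; types, helper names and the regex are assumptions.

interface PageLink {
  text: string; // visible anchor text
  href: string; // absolute URL the anchor points to
}

const DOMAIN_IN_TEXT = /\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b/i;

const stripWww = (host: string): string => host.replace(/^www\./, "");

/** True when the visible text names a domain that the href does not go to. */
function isMismatchedLink(link: PageLink): boolean {
  const match = link.text.match(DOMAIN_IN_TEXT);
  if (!match) return false; // the text does not claim any domain
  const claimed = stripWww(match[1].toLowerCase());
  let actual: string;
  try {
    actual = stripWww(new URL(link.href).hostname.toLowerCase());
  } catch {
    return false; // unparseable href: leave it to other rules
  }
  // Subdomains of the claimed domain are treated as a match.
  return actual !== claimed && !actual.endsWith("." + claimed);
}

/** A link counts as suspicious only when at least `minFlags` flags fire. */
function isSuspiciousLink(flags: string[], minFlags = 3): boolean {
  return flags.length >= minFlags;
}
```

With this threshold, a link whose only flag is something weak such as HTTP-only is never counted on its own.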

Wired all three modules into the live pipeline

Updated content.ts to call all three modules on every page load and combine their results:

analyzeUrl + analyzeContent + analyzeLinks → combineResults → stored in background → shown in popup

Inverted the scoring system

Score is now a safety score: 10 = safe, 0 = scam. Each module computes a threat score internally and inverts it on return.
combineResults converts each safety score back to a threat score, sums the threats, and re-inverts the total, so compound signals still push the score down correctly.
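
The invert, sum, re-invert step can be sketched like this. The function name `combineSafetyScores` and the clamping to the 0–10 range are assumptions for illustration; the real combineResults may take richer result objects.

```typescript
// Hypothetical sketch of the scoring inversion; clamping is an assumption.

const MAX_SCORE = 10;

/** Convert a module's safety score (10 = safe) back to a threat score. */
const toThreat = (safety: number): number => MAX_SCORE - safety;

/** Sum per-module threats, clamp to the scale, and return a safety score. */
function combineSafetyScores(safetyScores: number[]): number {
  const totalThreat = safetyScores.reduce((sum, s) => sum + toThreat(s), 0);
  const clamped = Math.min(MAX_SCORE, Math.max(0, totalThreat));
  return MAX_SCORE - clamped;
}
```

For example, module safety scores of 8, 7, and 10 correspond to threats of 2, 3, and 0, so the combined safety score is 10 − 5 = 5: two individually mild signals compound into a noticeably lower score, which averaging the safety scores would not do.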

Tests and build

Updated contentHeuristics.test.ts to content-only cases
Created urlHeuristics.test.ts for URL rules
All tests run via npx tsx src/heuristics/<file>.test.ts
npm run build produces clean dist — extension loads from extension/ folder in Chrome

2nd Commit — Jun, 18th.

Heuristics refinement, additions and documentation

Overview

This PR strengthens Beacon's phishing detection engine with 4 new URL rules, 5 DOM-level content rules, a pipeline-level URL veto, and a trusted aggregator bypass. It also fixes two correctness bugs and adds a full system diagram.

URL Heuristics (`urlHeuristics.ts`)

4 new rules:

| Rule | Tier | Weight | Catches |
| --- | --- | --- | --- |
| `brandSubdomainSpoofing` | 1 | 5 | `paypal.verify-accounts.evil.xyz`: brand name in non-brand hostname |
| `freeHostingFinancialSubdomain` | 1 | 5 | `v0-greendotfinance.vercel.app`: financial keyword in Vercel/Netlify/etc. subdomain |
| `suspiciousTld` | 1 | 5 | `.icu`, `.tk`, `.ml`, `.ga`, `.cf`, `.gq`, `.cfd`, `.cyou` |
| `gibberishDomainLabel` | 2 | 3 | `nvkcy.icu`: 4+ char label with zero vowels (auto-generated domains) |
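
A minimal sketch of the two simplest rules in the table, `suspiciousTld` and `gibberishDomainLabel`. The function names are hypothetical, and two details are assumptions: "y" is not counted as a vowel (so that `nvkcy` fires), and the TLD itself is excluded from the label check.

```typescript
// Hypothetical sketch; helper names and the treatment of "y" are assumptions.

const SKETCH_HIGH_RISK_TLDS = new Set([
  "icu", "tk", "ml", "ga", "cf", "gq", "cfd", "cyou",
]);

/** suspiciousTld: the last hostname label is on the high-risk list. */
function hasSuspiciousTld(hostname: string): boolean {
  const tld = hostname.toLowerCase().split(".").pop() ?? "";
  return SKETCH_HIGH_RISK_TLDS.has(tld);
}

/** gibberishDomainLabel: some non-TLD label has 4+ chars and no vowels. */
function hasGibberishLabel(hostname: string): boolean {
  const labels = hostname.toLowerCase().split(".").slice(0, -1);
  return labels.some((label) => label.length >= 4 && !/[aeiou]/.test(label));
}
```

On `nvkcy.icu` both checks fire, which matches the table: a Tier 1 TLD hit plus a Tier 2 gibberish-label hit.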

urlLengthHard made compound: a path > 144 chars alone no longer triggers the rule; the domain must also be suspicious (high-risk TLD or ≥ 2 hyphens). Fixes a false positive on Workday/ATS job-board URLs such as `pwc.wd3.myworkdayjobs.com/...`.
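
The compound condition could look roughly like the sketch below. The function and constant names are hypothetical, and the hyphen count is assumed to apply to the hostname.

```typescript
// Hypothetical sketch of the compound urlLengthHard check.

const LENGTH_RULE_RISKY_TLDS = new Set([
  "icu", "tk", "ml", "ga", "cf", "gq", "cfd", "cyou",
]);

function urlLengthHardFires(rawUrl: string): boolean {
  const url = new URL(rawUrl);
  const longPath = url.pathname.length > 144; // path only, no query string
  const tld = url.hostname.split(".").pop() ?? "";
  const hyphenCount = (url.hostname.match(/-/g) ?? []).length;
  const suspiciousDomain = LENGTH_RULE_RISKY_TLDS.has(tld) || hyphenCount >= 2;
  // Both conditions are required: a long path on a clean domain is ignored.
  return longPath && suspiciousDomain;
}
```

A long Workday-style path on a plain `.com` hostname with no hyphens therefore returns `false`, while the same path on a hyphen-stacked or high-risk-TLD host still fires.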

Content Heuristics (`contentHeuristics.ts`)

5 new DOM-level rules:

suspiciousButtonText — CTA buttons containing "Claim Now", "Access My Funds", "Unlock My Account", etc. (wt 3 each)
credentialFormCrossDomain — password form that POSTs to a different domain (wt 6 — survives URL veto by design)
fakeSecurityBadge — Norton/McAfee/DigiCert/Comodo/Trustwave in image alt texts (wt 2)
countdownUrgency — clock pattern + urgency word ("expires", "remaining", "hurry") in body (wt 2)
overlayScamPhrases — scam phrases detected inside open <dialog> or [role="dialog"] modals (wt 3 each)
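
Two of these rules are sketched below as an illustration. The `FormInfoSketch` shape, the helper names, and the regexes are assumptions; in particular, this sketch compares hostnames exactly, whereas the real rule may compare registrable domains.

```typescript
// Hypothetical sketch of credentialFormCrossDomain and countdownUrgency.

interface FormInfoSketch {
  action: string; // form action, possibly relative or empty
  hasPasswordField: boolean;
}

/** Resolve `target` against `base` and return its hostname, or null. */
function hostOf(target: string, base: string): string | null {
  try {
    return new URL(target, base).hostname;
  } catch {
    return null;
  }
}

/** A password form that submits to a different host than the page. */
function credentialFormCrossDomain(
  pageUrl: string,
  forms: FormInfoSketch[],
): boolean {
  const pageHost = hostOf(pageUrl, pageUrl);
  if (pageHost === null) return false;
  return forms.some((form) => {
    if (!form.hasPasswordField) return false;
    const actionHost = hostOf(form.action, pageUrl);
    return actionHost !== null && actionHost !== pageHost;
  });
}

const CLOCK_PATTERN = /\b\d{1,2}:\d{2}(:\d{2})?\b/;
const URGENCY_WORD = /\b(expires|remaining|hurry)\b/i;

/** A clock-like pattern together with an urgency word in the body text. */
function countdownUrgency(bodyText: string): boolean {
  return CLOCK_PATTERN.test(bodyText) && URGENCY_WORD.test(bodyText);
}
```

Requiring both the clock pattern and the urgency word keeps ordinary timestamps, such as a meeting time, from firing the rule.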

Expanded SCAM_PHRASES: 9 financial credential phishing phrases added — "your account has been suspended", "your funds are on hold", "your card has been blocked", etc.

Pipeline (`content.ts` + `types/heuristics.ts`)

DOM extraction: extractPageData() now collects forms, buttonText, badgeAltTexts, and overlayText from the live page; FormInfo type and new fields added to ExtractedPageData
URL veto: when URL threat = 0, content/link threats below 6 are discarded — prevents ad copy and cookie banners on clean pages from producing false positives
Trusted aggregator bypass: Google, Bing, Wikipedia, Reddit, YouTube, Twitter skip content+link analysis when the URL is clean
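
The veto and the bypass can be sketched as follows. Everything named here (`ThreatScores`, `applyUrlVeto`, `isTrustedAggregatorSketch`, and the exact domain strings) is hypothetical, and the sketch assumes the threshold of 6 is applied to each module's threat total; the actual content.ts may apply it at a different granularity.

```typescript
// Hypothetical sketch of the URL veto and the trusted aggregator gate.

interface ThreatScores {
  url: number;
  content: number;
  link: number;
}

const VETO_THRESHOLD = 6;

/** When the URL is clean, drop content/link threats below the threshold. */
function applyUrlVeto(threats: ThreatScores): ThreatScores {
  if (threats.url !== 0) return threats;
  const keep = (t: number): number => (t >= VETO_THRESHOLD ? t : 0);
  return { url: 0, content: keep(threats.content), link: keep(threats.link) };
}

// Domain strings are assumed; the PR only names the sites.
const TRUSTED_AGGREGATOR_DOMAINS = [
  "google.com", "bing.com", "wikipedia.org",
  "reddit.com", "youtube.com", "twitter.com",
];

/** Exact match or subdomain match, so "wikipedia.org.evil.xyz" fails. */
function isTrustedAggregatorSketch(hostname: string): boolean {
  const host = hostname.toLowerCase();
  return TRUSTED_AGGREGATOR_DOMAINS.some(
    (domain) => host === domain || host.endsWith("." + domain),
  );
}
```

Under these assumptions a weight-6 signal such as credentialFormCrossDomain survives the veto on a clean URL, while a weight-3 scam phrase in ad copy does not.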

Bug Fixes

linkHeuristics.ts: analyzeLinks() returned score 0 (maximum danger) for an empty links array — fixed to 10 (safe). Was causing false positives on pages like fifa.com with no outbound links.
popup/App.tsx: LLM risk_score (0 = safe, 10 = dangerous) was being used directly instead of being inverted to match the heuristic safety scale (10 = safe, 0 = dangerous).
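
Both fixes come down to keeping every score on the same 10 = safe scale. A minimal sketch, with hypothetical function names and an assumed clamp on the LLM value:

```typescript
// Hypothetical sketch of the two fixes; names and clamping are assumptions.

const SAFETY_SCALE_MAX = 10;

/** LLM risk_score is 0 = safe, 10 = dangerous; flip it to 10 = safe. */
function llmRiskToSafety(riskScore: number): number {
  const clamped = Math.min(SAFETY_SCALE_MAX, Math.max(0, riskScore));
  return SAFETY_SCALE_MAX - clamped;
}

/** With no links there is nothing to flag, so report the safe score. */
function linkSafetyOrDefault(
  links: string[],
  scoreLinks: (links: string[]) => number,
): number {
  if (links.length === 0) return SAFETY_SCALE_MAX;
  return scoreLinks(links);
}
```

For instance, an LLM risk_score of 3 becomes a safety score of 7, and a page with an empty links array scores 10 regardless of what the scoring callback would return.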

Documentation & Tests

HEURISTICS.md: New file with a full Mermaid pipeline diagram (extractPageData → trusted aggregator gate → URL/content/link modules → combineResults → LLM Tier 2) and rule weight reference tables
urlHeuristics.test.ts: Regression tests for all new rules; Workday long-URL false positive test
contentHeuristics.test.ts: EMPTY_DOM spread helper for existing tests; new cases for credentialFormCrossDomain, suspiciousButtonText, countdownUrgency

3rd Commit — Jun, 18th.

Evaluation notebook updated to reflect heuristics additions

Overview

Keeps eval/beacon_eval.ipynb in sync with the heuristics changes landed in the previous commit. All Python rule implementations now match their TypeScript counterparts exactly, and the labeled dataset is extended with targeted test cases for each new rule.

Heuristics Engine (`beacon_eval.ipynb`)

URL heuristics (Cell 6) — 4 new rules + 1 fix:

Added BRAND_LEGITIMATE_DOMAINS, HIGH_RISK_TLDS, FREE_HOSTING_PLATFORMS, FINANCIAL_SUBDOMAIN_KEYWORDS constants
Added _check_brand_subdomain_spoofing, _check_free_hosting_financial, _check_suspicious_tld, _check_gibberish_domain_label
Fixed _check_url_length_hard: now checks path-only length (query string excluded) and requires domain-level suspicion alongside long path — matches the compound fix in TypeScript
URL_RULES list updated from 4 to 8 entries

Content heuristics (Cell 7) — expanded phrases + 5 DOM rules:

SCAM_PHRASES extended with 9 financial credential phishing phrases
SUSPICIOUS_CTA_PHRASES constant added
analyze_content() now handles buttonText, forms, badgeAltTexts, and overlayText fields from page data
Implements suspiciousButtonText, credentialFormCrossDomain, fakeSecurityBadge, countdownUrgency, and overlayScamPhrases rules

Combined pipeline (Cell 9):

Added TRUSTED_AGGREGATORS list and is_trusted_aggregator() helper
combine_results() now applies URL veto (discards content/link threats < 6 when URL threat = 0)

Page fetcher (Cell 11):

fetch_page_data() extended to extract forms (action URL + password field presence), buttonText (pipe-joined from buttons and submit inputs), badgeAltTexts (Norton/McAfee/DigiCert/etc. img alts), and overlayText (from <dialog> and [role="dialog"] elements)

Full pipeline (Cell 21):

Added trusted aggregator bypass before content/link analysis — mirrors content.ts behaviour

Per-rule breakdown (Cell 18):

all_rules list updated from 4 entries to all 8 URL rules

Dataset (`dataset/urls.csv`)

6 new labeled entries added (59 → 65 URLs):

| URL | Label | Expected | Rules fired |
| --- | --- | --- | --- |
| `v0-greendotfinance.vercel.app` | phishing | uncertain | `freeHostingFinancialSubdomain` |
| `mygslobotsb-online.icu` | phishing | uncertain | `suspiciousTld` |
| `nvkcy.icu` | phishing | scam | `suspiciousTld` + `gibberishDomainLabel` |
| `pwc.wd3.myworkdayjobs.com/...` | legitimate | safe | none (regression test for `urlLengthHard` fix) |
| `paypal.verify-accounts.malicious.xyz/login` | phishing | uncertain | `brandSubdomainSpoofing` |
| `secure-paypal-login.attacker.com/account/verify` | phishing | uncertain | `brandSubdomainSpoofing` |

Abdul03Rafay and others added 4 commits June 7, 2026 16:39

Heuristic logic reorganized and inconsistency fix

30f8bf3

Merge branch 'main' into main

104d06a

Heuristics refinement, additions and documentation.

f58e746

Beacon Heuristics Evaluation Notebook and Datasets.

1cef6de

Abdul03Rafay wants to merge 4 commits into benevolentbandwidth:main from Abdul03Rafay:main
