Heuristic logic reorganized and inconsistency fix #17
Open
Abdul03Rafay wants to merge 4 commits into
1st Commit: Jun 7th

Reorganized the heuristics .ts file into three separate files:
- `contentHeuristics.ts`: page content (text, title, meta description)
- `urlHeuristics.ts`: the main page URL
- `linkHeuristics.ts`: links found on the page

Implemented `urlHeuristics.ts`
Moved our EDA-backed URL rules out of `contentHeuristics.ts`, where they didn't belong.

Stripped `contentHeuristics.ts` to content-only
Removed all URL rules and mismatched link detection. Now contains only:
- `sparsityNoMeta`: sparse body text + no meta description (EDA proxy for largestlinelength + lineofcode)

Upgraded `linkHeuristics.ts`
- Mismatched link detection (previously in `contentHeuristics.ts`): visible text claims domain X, href goes to domain Y

Wired all three modules into the live pipeline
Updated `content.ts` to call all three modules on every page load and combine their results.

Inverted the scoring system
`combineResults` converts each safety score to a threat score, sums the threats, and re-inverts the total, so compound signals still push the score down correctly.

Tests and build
- `contentHeuristics.test.ts` reduced to content-only cases
- `urlHeuristics.test.ts` for URL rules
- Run tests with `npx tsx src/heuristics/<file>.test.ts`
- `npm run build` produces a clean dist; the extension loads from the `extension/` folder in Chrome

2nd Commit: Jun 18th
Heuristics refinement, additions and documentation
Overview
This PR strengthens Beacon's phishing detection engine with 4 new URL rules, 5 DOM-level content rules, a pipeline-level URL veto, and a trusted aggregator bypass. It also fixes two correctness bugs and adds a full system diagram.
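As a rough illustration of how the pipeline-level URL veto could sit inside the safety/threat inversion, here is a minimal sketch. It is not the code from `content.ts`: the function name `combineSafetyScores`, the `ModuleScores` shape, the module-level (rather than per-rule) veto, and the clamp to 10 are all assumptions made for illustration. The 10 = safe / 0 = dangerous scale and the threshold of 6 are taken from this PR's description.

```typescript
// Sketch only: names, shapes and the clamp are assumptions, not the PR's actual code.
const MAX_SCORE = 10;     // safety scale used in this PR: 10 = safe, 0 = dangerous
const VETO_THRESHOLD = 6; // content/link threats below this are dropped when the URL is clean

interface ModuleScores {
  url: number;     // safety score from the URL module, 0..10
  content: number; // safety score from the content module, 0..10
  link: number;    // safety score from the link module, 0..10
}

// Convert a safety score into a threat score on the same 0..10 range.
function toThreat(safety: number): number {
  return MAX_SCORE - safety;
}

function combineSafetyScores(scores: ModuleScores): number {
  const urlThreat = toThreat(scores.url);
  let contentThreat = toThreat(scores.content);
  let linkThreat = toThreat(scores.link);

  // URL veto: if the URL shows no threat at all, discard weak content/link threats.
  if (urlThreat === 0) {
    if (contentThreat < VETO_THRESHOLD) contentThreat = 0;
    if (linkThreat < VETO_THRESHOLD) linkThreat = 0;
  }

  // Sum the threats (assumed to be clamped at the top of the scale), then re-invert.
  const totalThreat = Math.min(MAX_SCORE, urlThreat + contentThreat + linkThreat);
  return MAX_SCORE - totalThreat;
}
```

Under these assumptions, a clean URL with a mild content signal (`{ url: 10, content: 7, link: 10 }`) stays at 10, while a content threat of 6 (`content: 4`) survives the veto and lowers the combined score.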
URL Heuristics (`urlHeuristics.ts`)
4 new rules:
- `brandSubdomainSpoofing`: brand name in a non-brand hostname, e.g. `paypal.verify-accounts.evil.xyz`
- `freeHostingFinancialSubdomain`: financial keyword in a Vercel/Netlify/etc. subdomain, e.g. `v0-greendotfinance.vercel.app`
- `suspiciousTld`: `.icu`, `.tk`, `.ml`, `.ga`, `.cf`, `.gq`, `.cfd`, `.cyou`
- `gibberishDomainLabel`: 4+ char label with zero vowels (auto-generated domains), e.g. `nvkcy.icu`

`urlLengthHard` made compound: a path > 144 chars alone no longer triggers; the domain must also be suspicious (high-risk TLD or ≥ 2 hyphens). Fixes a false positive on Workday/ATS job-board URLs like `pwc.wd3.myworkdayjobs.com/...`.

Content Heuristics (`contentHeuristics.ts`)
5 new DOM-level rules:
- `suspiciousButtonText`: CTA buttons containing "Claim Now", "Access My Funds", "Unlock My Account", etc. (weight 3 each)
- `credentialFormCrossDomain`: password form that POSTs to a different domain (weight 6; survives the URL veto by design)
- `fakeSecurityBadge`: Norton/McAfee/DigiCert/Comodo/Trustwave in image alt texts (weight 2)
- `countdownUrgency`: clock pattern + urgency word ("expires", "remaining", "hurry") in body (weight 2)
- `overlayScamPhrases`: scam phrases detected inside open `<dialog>` or `[role="dialog"]` modals (weight 3 each)

Expanded `SCAM_PHRASES`: 9 financial credential phishing phrases added, such as "your account has been suspended", "your funds are on hold", and "your card has been blocked".

Pipeline (`content.ts` + `types/heuristics.ts`)
- `extractPageData()` now collects `forms`, `buttonText`, `badgeAltTexts`, and `overlayText` from the live page; `FormInfo` type and new fields added to `ExtractedPageData`

Bug Fixes
- `linkHeuristics.ts`: `analyzeLinks()` returned score `0` (maximum danger) for an empty links array; fixed to `10` (safe). This was causing false positives on pages like `fifa.com` with no outbound links.
- `popup/App.tsx`: the LLM `risk_score` (0 = safe, 10 = dangerous) was being used directly instead of being inverted to match the heuristic safety scale (10 = safe, 0 = dangerous).

Documentation & Tests
- `HEURISTICS.md`: new file with a full Mermaid pipeline diagram (extractPageData → trusted aggregator gate → URL/content/link modules → combineResults → LLM Tier 2) and rule weight reference tables
- `urlHeuristics.test.ts`: regression tests for all new rules; Workday long-URL false positive test
- `contentHeuristics.test.ts`: `EMPTY_DOM` spread helper for existing tests; new cases for `credentialFormCrossDomain`, `suspiciousButtonText`, `countdownUrgency`

3rd Commit: Jun 18th
Evaluation notebook updated to reflect heuristics additions
Overview
Keeps `eval/beacon_eval.ipynb` in sync with the heuristics changes landed in the previous commit. All Python rule implementations now match their TypeScript counterparts exactly, and the labeled dataset is extended with targeted test cases for each new rule.

Heuristics Engine (`beacon_eval.ipynb`)
URL heuristics (Cell 6), 4 new rules + 1 fix:
- `BRAND_LEGITIMATE_DOMAINS`, `HIGH_RISK_TLDS`, `FREE_HOSTING_PLATFORMS`, `FINANCIAL_SUBDOMAIN_KEYWORDS` constants
- `_check_brand_subdomain_spoofing`, `_check_free_hosting_financial`, `_check_suspicious_tld`, `_check_gibberish_domain_label`
- `_check_url_length_hard`: now checks path-only length (query string excluded) and requires domain-level suspicion alongside a long path; matches the compound fix in TypeScript
- `URL_RULES` list updated from 4 to 8 entries

Content heuristics (Cell 7), expanded phrases + 5 DOM rules:
- `SCAM_PHRASES` extended with 9 financial credential phishing phrases
- `SUSPICIOUS_CTA_PHRASES` constant added
- `analyze_content()` now handles `buttonText`, `forms`, `badgeAltTexts`, and `overlayText` fields from page data
- `suspiciousButtonText`, `credentialFormCrossDomain`, `fakeSecurityBadge`, `countdownUrgency`, and `overlayScamPhrases` rules

Combined pipeline (Cell 9):
- `TRUSTED_AGGREGATORS` list and `is_trusted_aggregator()` helper
- `combine_results()` now applies the URL veto (discards content/link threats < 6 when URL threat = 0)

Page fetcher (Cell 11):
- `fetch_page_data()` extended to extract `forms` (action URL + password field presence), `buttonText` (pipe-joined from buttons and submit inputs), `badgeAltTexts` (Norton/McAfee/DigiCert/etc. img alts), and `overlayText` (from `<dialog>` and `[role="dialog"]` elements)

Full pipeline (Cell 21):
- `content.ts` behaviour

Per-rule breakdown (Cell 18):
- `all_rules` list updated from 4 entries to all 8 URL rules

Dataset (`dataset/urls.csv`)
6 new labeled entries added (59 → 65 URLs):
- `v0-greendotfinance.vercel.app`: `freeHostingFinancialSubdomain`
- `mygslobotsb-online.icu`: `suspiciousTld`
- `nvkcy.icu`: `suspiciousTld` + `gibberishDomainLabel`
- `pwc.wd3.myworkdayjobs.com/...`: `urlLengthHard` fix
- `paypal.verify-accounts.malicious.xyz/login`: `brandSubdomainSpoofing`
- `secure-paypal-login.attacker.com/account/verify`: `brandSubdomainSpoofing`
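
To make the dataset entries above easier to reason about, here is a minimal sketch of how three of the URL rules might classify them. It is an illustration, not the code in `urlHeuristics.ts`: the helper names (`hasSuspiciousTld`, `hasGibberishLabel`, `isUrlLengthHard`), the choice to skip the TLD when scanning labels, and treating only `a e i o u` as vowels are assumptions. The TLD list, the 4+ character / zero-vowel condition, the 144-character path limit, and the "high-risk TLD or ≥ 2 hyphens" condition come from this PR's description.

```typescript
// Sketch only: helper names and parsing details are assumptions for illustration.
const HIGH_RISK_TLDS = new Set(["icu", "tk", "ml", "ga", "cf", "gq", "cfd", "cyou"]);
const MAX_PATH_LENGTH = 144;

// suspiciousTld: the last hostname label is on the high-risk list.
function hasSuspiciousTld(hostname: string): boolean {
  const tld = hostname.toLowerCase().split(".").pop() ?? "";
  return HIGH_RISK_TLDS.has(tld);
}

// gibberishDomainLabel: some non-TLD label has 4+ chars and no vowels (a, e, i, o, u).
function hasGibberishLabel(hostname: string): boolean {
  const labels = hostname.toLowerCase().split(".").slice(0, -1); // drop the TLD
  return labels.some((label) => label.length >= 4 && !/[aeiou]/.test(label));
}

// urlLengthHard (compound): a long path only counts when the domain is also suspicious.
function isUrlLengthHard(rawUrl: string): boolean {
  const url = new URL(rawUrl);
  const hyphenCount = (url.hostname.match(/-/g) ?? []).length;
  const domainSuspicious = hasSuspiciousTld(url.hostname) || hyphenCount >= 2;
  // url.pathname excludes the query string, so only the path length is measured.
  return url.pathname.length > MAX_PATH_LENGTH && domainSuspicious;
}
```

With these assumptions, `nvkcy.icu` trips both `hasSuspiciousTld` and `hasGibberishLabel`, `mygslobotsb-online.icu` trips only the TLD check, and a long synthetic path such as `"https://pwc.wd3.myworkdayjobs.com/" + "a".repeat(200)` does not trigger `isUrlLengthHard` because the domain itself is not suspicious.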