s3ak6i-dev/Locus


 ██╗      ██████╗  ██████╗██╗   ██╗███████╗
 ██║     ██╔═══██╗██╔════╝██║   ██║██╔════╝
 ██║     ██║   ██║██║     ██║   ██║███████╗
 ██║     ██║   ██║██║     ██║   ██║╚════██║
 ███████╗╚██████╔╝╚██████╗╚██████╔╝███████║
 ╚══════╝ ╚═════╝  ╚═════╝ ╚═════╝ ╚══════╝

AI-powered drug target discovery — from a plain-English query to ranked variants, 3D structure, and approved drugs in seconds.



Type EGFR non-small cell lung cancer → get ranked driver mutations, a live 3D structure, 12 approved drugs with IC₅₀ values, and a research-grade AI report. One query. Under 60 seconds.


What is Locus?

Locus connects five major bioinformatics databases in a single automated pipeline, ranks pathogenic variants by a multi-factor composite score grounded in clinical evidence, and generates a structured research summary using a 70B language model. It is built for computational biologists, pharmacologists, and drug discovery teams who need rapid, evidence-based variant prioritisation without writing database queries.


The 8-Step Pipeline

 User query (plain English)
         │
         ▼
 ┌───────────────────────────────────────────────────────────┐
 │  Step 1 · Parse Query                            🤖 Groq  │
 │  Llama 3.3 70B extracts { gene, disease, tissue }         │
 └────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
 ┌───────────────────────────────────────────────────────────┐
 │  Step 2 · Validate Gene                        🧬 Ensembl │
 │  Confirms symbol exists in GRCh38; rejects typos early    │
 └────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
 ┌───────────────────────────────────────────────────────────┐
 │  Step 3 · Fetch Variants                       📋 ClinVar │
 │  esearch + esummary — pathogenic + likely pathogenic       │
 │  Extracts rsIDs (SNPs) and HGVS (exon deletions, indels)  │
 │  Reads ClinVar review status → 0–4 star evidence rating   │
 └────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
 ┌───────────────────────────────────────────────────────────┐
 │  Step 4 · Annotate — VEP Batch                 🧬 Ensembl │
 │  /id endpoint (rsIDs) + /hgvs endpoint (exon del, indels) │
 │  → protein position, SIFT, PolyPhen, consequence, HGVS    │
 │  → gnomAD allele frequency from colocated variants        │
 └────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
 ┌───────────────────────────────────────────────────────────┐
 │  Step 5 · Score Variants                       🔢 Locus   │
 │  Composite = 0.55 × pathScore × starWeight                │
 │            + 0.45 × rarityScore                           │
 │            + 0.15 if active/binding site hit (UniProt)    │
 └────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
 ┌───────────────────────────────────────────────────────────┐
 │  Step 6 · Rank Variants                          🤖 Groq  │
 │  Llama 3.3 70B ranks top 10 with clinical reasoning       │
 └────────────────────────────┬──────────────────────────────┘
                              │
             ┌────────────────┴────────────────┐
             ▼                                 ▼
 ┌─────────────────────────┐   ┌───────────────────────────┐
 │  Step 7a · Structure    │   │  Step 7b · Drugs          │
 │  UniProt → RCSB PDB     │   │  UniProt → ChEMBL target  │
 │  → AlphaFold fallback   │   │  → mechanisms + IC₅₀      │
 └───────────┬─────────────┘   └─────────────┬─────────────┘
             └────────────────┬────────────────┘
                              │
                              ▼
 ┌───────────────────────────────────────────────────────────┐
 │  Step 8 · Generate Report                        🤖 Groq  │
 │  Structured research summary: mechanism, structure, drugs │
 └───────────────────────────────────────────────────────────┘

Features

🧬 Scientific Accuracy

  • Two-pass ClinVar query — missense variants preferred, falls back to all pathogenic types
  • VEP dual-endpoint — rsID batch via /id, HGVS via /hgvs for exon deletions (EGFR exon 19 del, etc.)
  • ClinVar star weighting — practice-guideline variants (4★) carry a starWeight 5× that of variants with no assertion (0★)
  • Active/binding site boost — +0.15 when protein position overlaps UniProt functional annotation
  • gnomAD allele frequency — 6-tier rarity scoring from ultra-rare (1.0) to common (0.05)
  • 3D domain overlays — UniProt features (Domain, Active site, Binding site, DNA binding) coloured on structure
  • AlphaFold fallback — predicted structure for all reviewed human proteins when no PDB entry exists
  • ChEMBL drug discovery — approved drugs + clinical candidates with IC₅₀, mechanism, and phase

⚙️ Engineering Robustness

  • fetchWithRetry — exponential backoff (1s → 2s → 4s) on HTTP 408 / 429 / 5xx and network errors; fresh AbortSignal per attempt
  • Zod validation — all 5 external API responses validated at parse boundary; schema drift logged with field path
  • Sidecar health check — 2s ping at pipeline start; offline → skip 60s timeout, log clearly, use fallback formula
  • Per-day rate limiting — 10 analyses / user / 24h with resetsAt ISO timestamp in 429 response
  • 30-day variant cache TTL — cachedAt checked before serving; stale entries re-fetched
  • 7-day drug cache TTL — UniProt accession as key (stable, unambiguous vs. gene name aliases)
  • SSE disconnect detection — request.signal.aborted checked each poll loop iteration
  • Env var validation — all 7 required variables checked at module load with actionable error message

🚀 Production Readiness

  • GitHub Actions CI — tsc --noEmit + next build on every push and pull request
  • Content Security Policy — strict allowlist for scripts, connects, images; frame-ancestors 'none'
  • /api/health — live JSON status of DB, sidecar, Ensembl, ChEMBL
  • GDPR deletion — DELETE /api/account removes all analyses and revokes the session
  • Mobile sidebar — hamburger + slide-in overlay below lg breakpoint
  • Cursor-based pagination — analysis history loads 10 at a time, "Load more" button
  • 404 handling — stale analysis URLs surface an error state rather than an infinite spinner
  • Input sanitisation — gene name validated against /^[A-Z0-9_-]{1,20}$/i; query 5–500 chars
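
The input sanitisation rule above can be sketched as two small checks. This is a minimal illustration, not the project's actual code: the regex and the 5–500 bounds come from the bullet above, while the function names, the result type, and the decision to trim whitespace before measuring length are assumptions.

```typescript
// Hypothetical helpers; only GENE_PATTERN and the 5-500 bounds come from the README.
const GENE_PATTERN = /^[A-Z0-9_-]{1,20}$/i;
const QUERY_MIN_CHARS = 5;
const QUERY_MAX_CHARS = 500;

type ValidationResult = { ok: true } | { ok: false; reason: string };

// Checks the free-text query length (assumption: surrounding whitespace is ignored).
function validateQuery(query: string): ValidationResult {
  const length = query.trim().length;
  if (length < QUERY_MIN_CHARS || length > QUERY_MAX_CHARS) {
    return {
      ok: false,
      reason: `Query must be ${QUERY_MIN_CHARS}-${QUERY_MAX_CHARS} characters`,
    };
  }
  return { ok: true };
}

// Checks the gene symbol extracted from the query before it reaches any external API.
function validateGene(gene: string): ValidationResult {
  return GENE_PATTERN.test(gene)
    ? { ok: true }
    : { ok: false, reason: "Gene symbol must match /^[A-Z0-9_-]{1,20}$/i" };
}
```

Under these assumptions, either check failing would map to the 400 response described in the API reference.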

🖥️ Interface

  • Score chart — SIFT / PolyPhen / gnomAD bars render when regulatory data is unavailable (instead of invisible zero bars)
  • Expandable variant rows — full VEP annotation on click (protein pos, HGVS, consequence badge, all scores)
  • Click-to-zoom 3D viewer — selecting a variant row zooms the structure to that residue with a white flash
  • Hover tooltips — amino acid, chain, and domain name on mouseover in the 3D viewport
  • Domain legend — colour-coded by type (Domain → blue, Active site → red, Binding site → orange)
  • Consequence badges — colour-coded inline (stop gained → red, missense → orange, synonymous → blue)
  • Guide page — plain-English glossary of 22 terms, pipeline walkthrough, column reference for non-biologists

Composite Scoring Formula

compositeScore = min(1.0,
    0.55 × pathScore × starWeight    ←  deleterious prediction × evidence quality
  + 0.45 × rarityScore               ←  population allele frequency
  + 0.15 × isFunctionalSite          ←  UniProt active/binding site overlap
)

Components

Component         Source                                        Range
pathScore         1 − SIFT score, or PolyPhen score if no SIFT  0.0 – 1.0
starWeight        (clinvarStars + 1) / 5                        0.20 – 1.00
rarityScore       gnomAD AF tier                                0.05 – 1.00
isFunctionalSite  UniProt Active site / Binding site overlap    0 or 1

gnomAD Rarity Tiers

Allele Frequency  Score  Interpretation
< 1 × 10⁻⁶        1.00   Ultra-rare — almost certainly pathogenic
< 1 × 10⁻⁵        0.90   Very rare
< 1 × 10⁻⁴        0.75   Rare
< 1 × 10⁻³        0.50   Low frequency
< 1 × 10⁻²        0.20   Common variant
≥ 1 × 10⁻²        0.05   Benign frequency

ClinVar Evidence Weights

Review Status                      Stars     starWeight
Practice guideline                 ⭐⭐⭐⭐  1.00
Reviewed by expert panel           ⭐⭐⭐    0.80
Multiple submitters, no conflicts  ⭐⭐      0.60
Single submitter / conflicting     ⭐        0.40
No assertion provided              (none)    0.20
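
The formula and the three tables above can be combined into one function. The sketch below is illustrative only: the weights, tiers, and starWeight formula are taken from this section, while the type and function names are hypothetical. It also makes two assumptions the README does not state: a variant with no gnomAD frequency is treated as AF = 0 (ultra-rare), and a variant with neither SIFT nor PolyPhen gets pathScore = 0.

```typescript
// Hypothetical input shape for one annotated variant.
interface ScoredVariantInput {
  sift?: number;             // SIFT score: 0 = deleterious, 1 = tolerated
  polyphen?: number;         // PolyPhen score: 0 = benign, 1 = damaging
  clinvarStars: number;      // 0-4 review stars
  gnomadAf?: number;         // population allele frequency
  isFunctionalSite: boolean; // overlaps a UniProt active/binding site
}

// Six-tier rarity score from the gnomAD table.
function rarityScore(af: number): number {
  if (af < 1e-6) return 1.0;
  if (af < 1e-5) return 0.9;
  if (af < 1e-4) return 0.75;
  if (af < 1e-3) return 0.5;
  if (af < 1e-2) return 0.2;
  return 0.05;
}

function compositeScore(v: ScoredVariantInput): number {
  // Prefer 1 - SIFT; fall back to PolyPhen; assumption: 0 when both are missing.
  const pathScore = v.sift !== undefined ? 1 - v.sift : (v.polyphen ?? 0);
  const starWeight = (v.clinvarStars + 1) / 5;
  // Assumption: a missing frequency means "not observed in gnomAD".
  const rarity = rarityScore(v.gnomadAf ?? 0);
  const siteBoost = v.isFunctionalSite ? 0.15 : 0;
  return Math.min(1.0, 0.55 * pathScore * starWeight + 0.45 * rarity + siteBoost);
}
```

As a worked case, a one-star variant with SIFT 0.5, AF 5 × 10⁻⁴, and an active-site hit gives 0.55 × 0.5 × 0.4 + 0.45 × 0.5 + 0.15 = 0.485.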

External APIs

API                  Used for                                                  Timeout        Retries
Ensembl REST         Gene validation, VEP /id batch, VEP /hgvs batch           10s / 45s      3 ×
NCBI ClinVar         Pathogenic variant search (esearch + esummary)            15s            3 ×
UniProt REST         Protein accession lookup + domain features                8s             3 ×
RCSB PDB             Experimental structure search + resolution + ligands      10s / 6s       3 ×
AlphaFold DB         Predicted structure for all reviewed human proteins       10s            3 ×
ChEMBL               Drug mechanisms, max phase, IC₅₀ bioactivity              10s / 6s / 5s  3 ×
Groq                 Llama 3.3 70B — parse, rank, report (3 calls)             n/a            1 ×
AlphaGenome sidecar  Regulatory scoring — offline by default; 2s health check  2s             0 ×

All fetches use fetchWithRetry — each retry creates a fresh AbortSignal.timeout() so an expired signal from attempt 1 does not immediately fail attempt 2.
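
A retry wrapper with these properties might look like the sketch below. It is not the contents of lib/pipeline/fetchWithRetry.ts: the signature, default values, and helper names are assumptions, and "3 ×" is read here as three retries after the first attempt, with waits of 1s, 2s, and 4s. It relies only on the standard fetch and AbortSignal.timeout() APIs available in Node.js ≥ 20.

```typescript
// Retry on 408, 429, and any 5xx, as listed in the Engineering Robustness section.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff: attempt 0 -> 1000 ms, 1 -> 2000 ms, 2 -> 4000 ms.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

const sleepMs = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical signature; the real wrapper may differ.
async function fetchWithRetry(
  url: string,
  init: RequestInit = {},
  timeoutMs = 10_000,
  maxRetries = 3,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      // A new signal per attempt: reusing one that already timed out
      // would abort the next fetch immediately.
      const response = await fetch(url, {
        ...init,
        signal: AbortSignal.timeout(timeoutMs),
      });
      if (!isRetryableStatus(response.status)) return response;
      lastError = new Error(`HTTP ${response.status} from ${url}`);
    } catch (error) {
      lastError = error; // network failure or timeout
    }
    if (attempt < maxRetries) await sleepMs(backoffDelayMs(attempt));
  }
  throw lastError;
}
```

Non-retryable responses such as 404 are returned to the caller unchanged, so schema validation and error handling stay at the call site.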


Tech Stack

Layer          Technology                            Version
Framework      Next.js App Router + Turbopack        16
Language       TypeScript (strict, zero `any`)       5
Styling        Tailwind CSS v4                       4
Auth           Auth.js (Google OAuth)                v5 beta
Database       Neon Postgres (serverless WebSocket)
ORM            Drizzle ORM                           0.44
LLM            Groq — Llama 3.3 70B Versatile
3D Viewer      3Dmol.js (WebGL)                      2.5
Charts         Recharts                              3.9
Validation     Zod                                   3.25
UI Components  Radix UI + Lucide React
CI             GitHub Actions

Project Structure

locus/
├── .github/
│   └── workflows/ci.yml            # tsc + next build on every push and PR
│
├── app/
│   ├── (app)/
│   │   ├── analyze/[id]/page.tsx   # Result page: variants, 3D, drugs, report
│   │   ├── dashboard/              # Recent analyses overview
│   │   ├── history/page.tsx        # Cursor-paginated analysis history
│   │   └── guide/page.tsx          # Plain-English terminology guide
│   └── api/
│       ├── analyze/route.ts        # POST: start pipeline; rate-limited
│       │   └── [id]/stream/        # SSE: real-time progress
│       ├── health/route.ts         # GET: dependency status
│       ├── history/route.ts        # GET: cursor-paginated
│       └── account/route.ts        # DELETE: GDPR erasure
│
├── lib/
│   ├── env.ts                      # Startup env var validation (throws on missing)
│   └── pipeline/
│       ├── fetchWithRetry.ts       # Exponential backoff wrapper
│       ├── apiSchemas.ts           # Zod schemas for all 5 external APIs
│       ├── parseQuery.ts           # Groq: extract gene / disease / tissue
│       ├── fetchVariants.ts        # ClinVar: variants + star rating + HGVS
│       ├── annotateVariants.ts     # Ensembl VEP: rsID batch + HGVS batch
│       ├── scoreVariants.ts        # Composite score + sidecar health check
│       ├── rankVariants.ts         # Groq: rank top 10 with reasoning
│       ├── fetchStructure.ts       # UniProt + RCSB PDB + AlphaFold
│       ├── fetchDrugs.ts           # ChEMBL: drugs + IC₅₀
│       └── generateReport.ts       # Groq: research summary
│
├── components/
│   ├── analysis/
│   │   ├── ProteinViewer.tsx       # 3Dmol.js viewer with domain colour overlays
│   │   └── DrugPanel.tsx           # Drug cards with IC₅₀ and mechanism
│   └── layout/
│       └── Sidebar.tsx             # Responsive nav (hamburger on mobile)
│
└── next.config.ts                  # CSP headers, server action config

Getting Started

Prerequisites

  • Node.js ≥ 20
  • A Neon Postgres database (free tier is sufficient)
  • A Groq API key (free)
  • Google OAuth credentials from Google Cloud Console

1 — Clone and install

git clone https://github.com/s3ak6i-dev/Locus.git
cd Locus
npm install

2 — Configure environment

Create .env.local in the project root:

# Database
DATABASE_URL=postgresql://user:pass@ep-xxx.neon.tech/locus?sslmode=require

# Auth
NEXTAUTH_SECRET=<run: openssl rand -hex 32>
NEXTAUTH_URL=http://localhost:3000
GOOGLE_CLIENT_ID=your-client-id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-client-secret

# AI
GROQ_API_KEY=gsk_...

# Bioinformatics APIs (public, no key required)
ENSEMBL_API_BASE=https://rest.ensembl.org
CLINVAR_API_BASE=https://eutils.ncbi.nlm.nih.gov/entrez/eutils

# AlphaGenome sidecar (optional — app falls back gracefully when offline)
GENOME_SERVICE_URL=http://localhost:8000
GENOME_SERVICE_SECRET=any-local-secret
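
The feature list mentions that required variables are checked at module load with an actionable error. A minimal sketch of such a check follows. The function names are hypothetical, and because this README does not say which seven variables lib/env.ts treats as required, the list is left as a parameter.

```typescript
type EnvSource = Record<string, string | undefined>;

// Returns the names that are unset or blank in the given environment.
function findMissingEnv(required: readonly string[], env: EnvSource): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Throws one error naming every missing variable, so a misconfigured
// deployment fails at startup rather than midway through a pipeline run.
function assertEnv(required: readonly string[], env: EnvSource): void {
  const missing = findMissingEnv(required, env);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}. ` +
        "Add them to .env.local (see Getting Started, step 2).",
    );
  }
}
```

A call such as assertEnv(["DATABASE_URL", "GROQ_API_KEY"], process.env) at the top of a module would then fail fast if either value is absent.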

3 — Push the database schema

npm run db:push

4 — Run the development server

npm run dev

Open http://localhost:3000, sign in with Google, and run your first analysis.


Database Schema

-- Core
analyses (
  id             UUID        PRIMARY KEY,
  user_id        TEXT        REFERENCES users(id),
  query          TEXT,                          -- original user query
  gene           TEXT,                          -- extracted gene symbol e.g. "EGFR"
  disease        TEXT,
  tissue         TEXT,
  status         TEXT,                          -- 'running' | 'complete' | 'error'
  variant_count  INTEGER,
  top_targets    JSONB,                         -- RankedVariant[]
  structure_meta JSONB,                         -- StructureMeta
  drug_data      JSONB,                         -- Drug[]
  report         TEXT,
  error_message  TEXT,
  created_at     TIMESTAMP,
  completed_at   TIMESTAMP
)

-- 30-day TTL: variant scores keyed by (gene, variant_id)
variant_cache (
  id              UUID  PRIMARY KEY,
  gene            TEXT,
  variant_id      TEXT,                         -- rsID or "hgvs_NM_..." for exon deletions
  rna_delta       REAL,
  atac_delta      REAL,
  splice_delta    REAL,
  composite_score REAL,
  cached_at       TIMESTAMP,
  UNIQUE (gene, variant_id)
)

-- 7-day TTL: drug data keyed by UniProt accession (not gene name)
drug_cache (
  id        UUID  PRIMARY KEY,
  gene      TEXT  UNIQUE,                       -- stores UniProt accession e.g. "P00533"
  drugs     JSONB,                              -- Drug[]
  cached_at TIMESTAMP
)

-- 7-day TTL: structure metadata
structure_cache (
  id            UUID  PRIMARY KEY,
  gene          TEXT  UNIQUE,
  pdb_id        TEXT,
  source        TEXT,                           -- 'rcsb' | 'alphafold'
  structure_url TEXT,
  resolution    REAL,
  plddt         REAL,
  uniprot_id    TEXT,
  cached_at     TIMESTAMP
)
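
The TTL comments in the schema imply a freshness check on cached_at before a cached row is served. A minimal sketch of that check, using the 30-day and 7-day values stated above (the constant and function names are assumptions):

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// TTLs from the schema comments above.
const VARIANT_CACHE_TTL_DAYS = 30;
const DRUG_CACHE_TTL_DAYS = 7;
const STRUCTURE_CACHE_TTL_DAYS = 7;

// True while the cached row is younger than its TTL; a stale row would be re-fetched.
function isCacheFresh(cachedAt: Date, ttlDays: number, now: Date = new Date()): boolean {
  return now.getTime() - cachedAt.getTime() < ttlDays * MS_PER_DAY;
}
```

Passing now explicitly keeps the check deterministic in tests.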

API Reference

POST /api/analyze

Start a new analysis pipeline (async — returns immediately with an ID).

Request

{ "query": "EGFR non-small cell lung cancer" }

Response 200

{ "analysisId": "3f4a2b1c-..." }

Errors

Status  Reason
400     Query outside 5–500 characters, or invalid gene name extracted
401     Not authenticated
429     Rate limit: 10 analyses per user per 24 hours

429 body:

{ "error": "Rate limit: 10 analyses per day", "resetsAt": "2026-06-26T09:00:00.000Z" }
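
One way to derive resetsAt is a rolling 24-hour window over the user's recent analyses. The sketch below assumes that model; the README does not say whether the real limiter uses a rolling or a fixed window, and the function and type names are hypothetical.

```typescript
const DAILY_LIMIT = 10;
const RATE_WINDOW_MS = 24 * 60 * 60 * 1000;

type RateDecision = { allowed: true } | { allowed: false; resetsAt: string };

// startedAt: creation times of the user's analyses (any order).
function checkRateLimit(startedAt: Date[], now: Date = new Date()): RateDecision {
  const windowStart = now.getTime() - RATE_WINDOW_MS;
  const recent = startedAt
    .map((d) => d.getTime())
    .filter((t) => t > windowStart)
    .sort((a, b) => a - b);
  if (recent.length < DAILY_LIMIT) return { allowed: true };
  // A slot frees up once the 10th-most-recent run leaves the 24-hour window.
  const blockingRun = recent[recent.length - DAILY_LIMIT];
  return { allowed: false, resetsAt: new Date(blockingRun + RATE_WINDOW_MS).toISOString() };
}
```

Ten runs started at 09:00 UTC on 2026-06-25 and checked at 14:00 the same day give resetsAt 2026-06-26T09:00:00.000Z, matching the example body above.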

GET /api/analyze/:id/stream

Server-Sent Events stream. Polls the DB every 1 second until the analysis reaches complete or error, or 120 seconds pass. Stops immediately if the client disconnects.

Event   Payload
update  { status, gene, disease, variantCount }
done    { analysisId }
error   { message }
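
The polling behaviour described above can be sketched as an async generator that yields SSE frames. This is an illustration under assumptions, not the route's actual code: loadAnalysis is a hypothetical database accessor, the snapshot shape and the fallback error messages are invented for the example, and the sketch simply ends the stream when the 120-second budget runs out.

```typescript
type SseEventName = "update" | "done" | "error";

// Hypothetical view of one row of the analyses table.
interface AnalysisSnapshot {
  status: "running" | "complete" | "error";
  gene: string | null;
  disease: string | null;
  variantCount: number | null;
  errorMessage: string | null;
}

// One SSE frame: an event line, a data line, and a blank line as terminator.
function formatSseEvent(event: SseEventName, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

const pause = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function* analysisEvents(
  analysisId: string,
  signal: AbortSignal, // e.g. request.signal in a route handler
  loadAnalysis: (id: string) => Promise<AnalysisSnapshot | null>,
  pollMs = 1_000,
  maxMs = 120_000,
): AsyncGenerator<string> {
  const deadline = Date.now() + maxMs;
  // Stop on timeout or as soon as the client has disconnected.
  while (Date.now() < deadline && !signal.aborted) {
    const row = await loadAnalysis(analysisId);
    if (row === null) {
      yield formatSseEvent("error", { message: "Analysis not found" });
      return;
    }
    const { status, gene, disease, variantCount } = row;
    yield formatSseEvent("update", { status, gene, disease, variantCount });
    if (status === "complete") {
      yield formatSseEvent("done", { analysisId });
      return;
    }
    if (status === "error") {
      yield formatSseEvent("error", { message: row.errorMessage ?? "Analysis failed" });
      return;
    }
    await pause(pollMs);
  }
}
```

A route handler could encode each yielded string and enqueue it on a ReadableStream served with Content-Type: text/event-stream.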

GET /api/health

Live dependency status — useful for monitoring and debugging.

Response

{
  "status": "degraded",
  "db": "ok",
  "genome_sidecar": "offline",
  "ensembl": "ok",
  "chembl": "ok",
  "timestamp": "2026-06-25T14:00:00.000Z"
}

GET /api/history?cursor=&limit=10

Cursor-based paginated history for the authenticated user. cursor is the ISO timestamp of the last item received.

Response

{
  "analyses": [...],
  "nextCursor": "2026-06-20T12:00:00.000Z",
  "hasMore": true
}
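
The cursor contract above can be illustrated with an in-memory sketch. The response field names come from the example; the item type, the function name, and returning a null nextCursor on the last page are assumptions. A database-backed version would typically filter on created_at < cursor, order newest first, and fetch limit + 1 rows to decide hasMore.

```typescript
// Hypothetical summary of one history item; createdAt is an ISO 8601 UTC timestamp.
interface AnalysisSummary {
  id: string;
  createdAt: string;
}

interface HistoryPage {
  analyses: AnalysisSummary[];
  nextCursor: string | null;
  hasMore: boolean;
}

function paginateHistory(
  all: AnalysisSummary[],
  cursor: string | null,
  limit = 10,
): HistoryPage {
  // ISO timestamps in the same UTC format sort correctly as plain strings.
  const newestFirst = [...all].sort((a, b) =>
    a.createdAt < b.createdAt ? 1 : a.createdAt > b.createdAt ? -1 : 0,
  );
  // Keep only items strictly older than the last item the client has seen.
  const older = cursor === null
    ? newestFirst
    : newestFirst.filter((a) => a.createdAt < cursor);
  const analyses = older.slice(0, limit);
  const hasMore = older.length > limit;
  const nextCursor = hasMore ? analyses[analyses.length - 1].createdAt : null;
  return { analyses, nextCursor, hasMore };
}
```

Because the cursor is a timestamp rather than an offset, analyses created while the user is paging do not shift or duplicate later pages. Two items with an identical timestamp at a page boundary could be skipped, which a tie-breaking id in the cursor would avoid.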

DELETE /api/account

Permanently delete all analyses for the authenticated user and sign them out. Implements the GDPR right to erasure.


Content Security Policy

Locus ships strict CSP headers from next.config.ts:

default-src   'self'
script-src    'self' 'unsafe-eval' 'unsafe-inline' cdn.jsdelivr.net
style-src     'self' 'unsafe-inline'
connect-src   'self'
              rest.ensembl.org
              eutils.ncbi.nlm.nih.gov
              rest.uniprot.org
              search.rcsb.org  data.rcsb.org  files.rcsb.org
              alphafold.ebi.ac.uk  www.ebi.ac.uk
img-src       'self' data: lh3.googleusercontent.com
frame-ancestors 'none'
object-src    'none'

'unsafe-eval' is required by 3Dmol.js WebGL shader compilation. 'unsafe-inline' is required by Tailwind CSS v4's runtime style injection.


Design System

Locus uses a bespoke dark research-instrument aesthetic.

Role              Hex      Used for
App background    #0A0A0A  Page background
Card background   #111111  All card surfaces
Hover background  #1A1A1A  Interactive hover states
Border default    #2A2A2A  Card and table borders
Border hover      #3A3A3A  Focus and hover borders
Text primary      #F0F0F0  Headings, values
Text secondary    #888888  Labels, descriptions
Text muted        #555555  Placeholders, icons
Teal              #1D9E75  Biology data, rank 1–3 variants
Blue              #378ADD  AI outputs, rank 4–7 variants
Orange            #BA7517  Binding sites, warnings
Red               #E24B4A  Deleterious predictions, errors
Purple            #8B5CF6  Splice variants, AlphaFold

Fonts: Space Grotesk (headings) · Inter (body) · JetBrains Mono (IDs, scores)


CI / Deployment

GitHub Actions

Every push and pull request to main runs:

- npm ci
- npx tsc --noEmit     # Type check — zero errors required
- npx next build        # Full production build

Vercel (recommended)

  1. Connect your GitHub repo to Vercel
  2. Add all environment variables from the Getting Started section in the Vercel dashboard
  3. Run npm run db:push against your production Neon database

AlphaGenome Sidecar (optional)

The sidecar is a Python FastAPI service (/genome-service) that wraps the AlphaGenome model. The app detects its status with a 2-second health check at the start of every pipeline run. When offline (which is the current default), the pipeline falls back to the SIFT / PolyPhen / gnomAD formula with no user-visible degradation beyond logging.


Contributing

  1. Fork the repository
  2. Create a feature branch — git checkout -b feat/your-feature
  3. Ensure npm run build passes with zero TypeScript errors
  4. Open a PR — CI runs tsc and next build automatically

Acknowledgements


License

MIT © Surya Krishna Bharadwaj Kunapuli


Built with Next.js · Groq · Ensembl · UniProt · RCSB PDB · ChEMBL · Neon
