BioScreen

Structure-based biosecurity screening for AI-designed proteins.

Current DNA synthesis screening tools compare sequences against databases of known pathogens using sequence homology. AI protein design tools (RFdiffusion, ProteinMPNN) can now generate novel sequences that fold into the same 3D structure as known toxins while sharing near-zero sequence similarity — over 75% of these bypass conventional screening (Wittmann et al., Science 2025).

BioScreen has two components: (1) per-sequence screening that evaluates what a protein does, not what it looks like, using 3D structure prediction, active-site geometry, and functional annotation; and (2) session/behavioral monitoring that tracks queries across a session to detect convergent optimization — someone iterating toward a dangerous structure even if each individual query looks benign.

How It Works

Component 1 — Per-Sequence Screening

Amino Acid Sequence
       │
       ├──→ ESM-2 Embedding ──→ FAISS cosine sim vs toxin DB ──→ Embedding Risk Score
       │
       ├──→ ESMFold (3D structure) ──→ Foldseek structural alignment ──→ Structure Risk Score
       │
       └──→ Function Prediction (GO terms / EC numbers) ──→ Function Risk Score
       │
       └──→ Combined Risk Assessment (0–1)

Fast path (~seconds, CPU): ESM-2 embeddings → cosine similarity against pre-computed toxin database.

Full path (~15s, GPU): Adds ESMFold structure prediction → Foldseek structural search → function prediction.

Component 2 — Session/Behavioral Monitoring

Incoming Queries (per session)
       │
       ├──→ Fingerprint (ESM-2 embedding + sequence hash)
       │         │
       │         └──→ Sliding-window session store (keyed by session/client ID)
       │
       ├──→ Convergent Optimization Detection
       │         └──→ Pairwise cosine sim across rolling window
       │             → flag when mean similarity climbs above threshold across ≥N queries
       │
       ├──→ Multi-Provider Perturbation Detection
       │         └──→ Flag near-identical queries (cosine sim > 0.95 + low edit distance)
       │
       └──→ Anomaly Score (0–1)
                 └──→ 0.6 × convergence + 0.4 × perturbation

Detects patterns that per-sequence screening misses:

Convergent optimization — a user iterating toward a toxin structure across multiple queries, even when no single query is flagged
Multi-provider probing — near-identical sequences with slight perturbations submitted within a time window, suggesting attempts to probe screening thresholds
Anomaly scoring — cosine similarity of embeddings in a rolling window + edit-distance clustering → simple, interpretable, and demo-ready

Tech Stack

Structure Prediction: ESMFold via NVIDIA NIM API (or local HuggingFace)
Embeddings: ESM-2 (facebook/esm2_t33_650M_UR50D)
Structural Search: Foldseek (3Di alphabet, 4000× faster than TM-align)
Function Prediction: GO term / EC number classification
Vector DB: FAISS for embedding similarity search
Reference Data: UniProt Tox-Prot + PDB toxin-annotated structures
Backend: FastAPI (async)
Frontend: Streamlit

Quick Start

# Clone
git clone https://github.com/seanchiuai/bioscreen.git
cd bioscreen

# Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # Add your NVIDIA NIM API key

# Build toxin reference database
python scripts/build_db.py

# Run API
uvicorn app.main:app --reload

# Run frontend (separate terminal)
streamlit run frontend/streamlit_app.py

API

Endpoint	Method	Description
`/api/screen`	POST	Screen a single protein sequence
`/api/batch`	POST	Screen multiple sequences
`/api/health`	GET	Health check
`/api/toxins`	GET	Reference database stats
`/api/session/{id}`	GET	Current session state and query count
`/api/session/{id}/alerts`	GET	Latest behavioral anomaly assessment

Screen a sequence

curl -X POST http://localhost:8000/api/screen \
  -H "Content-Type: application/json" \
  -d '{"sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH..."}'

Response

{
  "risk_score": 0.87,
  "risk_level": "HIGH",
  "matched_toxins": [
    {"name": "Botulinum neurotoxin type A", "similarity": 0.91, "uniprot_id": "P0DPI1"}
  ],
  "predicted_functions": ["metalloprotease activity", "neurotoxin"],
  "explanation": "High structural similarity to known neurotoxin active site despite <15% sequence identity."
}

Project Structure

bioscreen/
├── app/
│   ├── main.py              # FastAPI entry point
│   ├── config.py             # Settings from .env
│   ├── api/routes.py         # API endpoints
│   ├── pipeline/
│   │   ├── sequence.py       # Input validation, DNA→protein
│   │   ├── structure.py      # ESMFold prediction
│   │   ├── embedding.py      # ESM-2 embeddings
│   │   ├── similarity.py     # Foldseek + cosine similarity
│   │   ├── function.py       # GO/EC prediction
│   │   └── scoring.py        # Combined risk scoring
│   ├── monitoring/
│   │   ├── session_store.py  # In-memory sliding-window session store
│   │   ├── analyzer.py       # Convergence + perturbation detection
│   │   └── schemas.py        # Session/alert Pydantic models
│   ├── database/
│   │   ├── toxin_db.py       # FAISS index management
│   │   └── build_db.py       # Build reference DB from UniProt
│   └── models/schemas.py     # Pydantic models
├── frontend/streamlit_app.py # Web UI
├── scripts/
│   ├── build_db.py           # DB builder CLI
│   └── demo.py               # Demo script
├── data/                     # Toxin embeddings/structures
└── tests/
    ├── test_pipeline.py      # Pipeline tests
    └── test_session_monitoring.py  # Session monitoring tests

Key References

Wittmann et al. (2025) "Strengthening nucleic acid biosecurity screening against generative protein design tools" — Science. DOI: 10.1126/science.adu8578
Wheeler et al. (2026) "The Limits of Sequence-Based Biosecurity Screening Tools in the Age of AI-Assisted Protein Design" — bioRxiv. DOI: 10.64898/2026.03.04.709671
Baker & Church (2024) "Protein design meets biosecurity" — Science.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
app		app
blast		blast
data		data
docs/plans		docs/plans
frontend		frontend
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
DESIGN.md		DESIGN.md
FRONTEND_GUIDE.md		FRONTEND_GUIDE.md
README.md		README.md
SETUP.md		SETUP.md
requirements.txt		requirements.txt
test_sequences.txt		test_sequences.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BioScreen

How It Works

Component 1 — Per-Sequence Screening

Component 2 — Session/Behavioral Monitoring

Tech Stack

Quick Start

API

Screen a sequence

Response

Project Structure

Key References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BioScreen

How It Works

Component 1 — Per-Sequence Screening

Component 2 — Session/Behavioral Monitoring

Tech Stack

Quick Start

API

Screen a sequence

Response

Project Structure

Key References

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages