OpenTargetGraph is an end-to-end bioinformatics platform designed to identify and visualise potential drug targets using state-of-the-art Protein Language Models (PLMs).
It demonstrates a modern TechBio stack, combining robust data engineering (Polars/Parquet), scalable orchestration (Dagster), and AI-driven structural biology (ESM-2 Embeddings) to bridge the gap between raw biological data and therapeutic insights.
A note on the name: The "Graph" in "OpenTargetGraph" is a bit of a misnomer. No graph database or knowledge graph is used in this project.
This platform suggests drugs that could inhibit kinase protein targets, based on the structural similarity between those targets and other kinases the drugs are known to inhibit. A literature search can then be conducted to gather additional evidence for the suggested drugs.
The application includes the following components:
- Data Ingestion: Automates the retrieval of high-value drug targets (e.g., Kinases) from UniProt and bioactive small molecules from ChEMBL.
- AI Analysis: Generates high-dimensional vector embeddings for protein sequences using Meta AI's ESM-2 (Evolutionary Scale Modeling) transformer.
- Vector Search & Relational Storage: Stores target metadata and drug activity in a PostgreSQL database, and stores ESM-2 embeddings using pgvector. This enables semantic similarity searches to group related protein targets and infer target-drug associations based on structural proximity.
- Visualisation: A Streamlit dashboard that offers:
- 3D Protein Structure rendering (via Py3Dmol).
- An "Embedding Space" t-SNE projection to visualise clusters of similar targets.
- Semantic search for drug candidates based on protein similarity.
- Autonomous Research Assistant: Deep-dive literature analysis via PubMed and LLM-driven research reports.
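The similarity search at the heart of the platform ranks stored targets by how close their embeddings are to a query vector. A minimal, pure-Python sketch of the idea (toy 3-dimensional vectors and made-up kinase names; pgvector performs the equivalent nearest-neighbour query inside PostgreSQL):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors, the metric behind
    pgvector's cosine-distance operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: real ESM-2 vectors have hundreds of dimensions.
targets = {
    "KINASE_A": [0.9, 0.1, 0.0],
    "KINASE_B": [0.8, 0.2, 0.1],
    "KINASE_C": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

# Rank targets by similarity to the query embedding (most similar first).
ranked = sorted(targets, key=lambda t: cosine_similarity(query, targets[t]),
                reverse=True)
print(ranked)  # ['KINASE_A', 'KINASE_B', 'KINASE_C']
```

Drugs known to inhibit the top-ranked neighbours then become candidate inhibitors for the query target.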
```
├── open_target_graph/
│   ├── assets/              # Dagster Software-Defined Assets
│   │   ├── db/              # Loads data into PostgreSQL using pgvector
│   │   ├── ingestion/       # ETL logic for UniProt/ChEMBL
│   │   └── modeling/        # Hugging Face Transformers & PyTorch inference logic for ESM-2 embeddings
│   ├── agents/              # Agentic logic
│   │   ├── researcher.py    # The Pydantic output schema and LLM system prompt
│   │   └── workflow.py      # The LangGraph state machine
│   └── dashboard/           # Streamlit frontend application
├── data/                    # Local storage for Parquet files (gitignored)
├── docker-compose.yml       # Docker Compose file for local development
├── Dockerfile.dagster       # Dockerfile for Dagster
├── Dockerfile.streamlit     # Dockerfile for Streamlit
└── pyproject.toml           # Python package and dependency management
```
```mermaid
graph TD
    subgraph "Data Ingestion (Dagster + Polars)"
        A[UniProt API] -->|Fetch Kinases| B(Raw Kinase Data)
        C[ChEMBL API] -->|Fetch Molecules| B
        B -->|Clean & Join| D(Processed Data 'Silver' Tables)
    end
    subgraph "AI Modeling (Hugging Face + PyTorch)"
        D -->|Protein Sequence| E[ESM-2 Transformer Model]
        E -->|Generate Vector| F(Vector Embeddings)
    end
    subgraph "Storage & Application Serving"
        D -->|Load Metadata| G[(PostgreSQL DB)]
        F -->|Load Vectors| G
        G -.->|pgvector query| H[Streamlit UI Dashboard]
        H -->|Literature Search| I[PubMed API]
        H -->|Report Generation| J[Gemini LLM Agent]
    end
```
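ESM-2 produces one embedding vector per amino-acid residue, but the pipeline needs a single fixed-length vector per protein to store in pgvector. A common way to do this (and a plausible sketch of what the `modeling` assets do, shown here in pure Python rather than with the actual PyTorch hidden states) is mean pooling over sequence positions:

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors position-wise into one protein vector.
    Toy stand-in for pooling ESM-2 hidden states; real vectors come from
    the Hugging Face Transformers model output."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]

# Toy example: a 3-residue protein with 2-dimensional per-residue embeddings.
protein = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedding = mean_pool(protein)
print(embedding)  # [3.0, 4.0]
```

The pooled vector is what gets written to the pgvector column and compared across targets.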
The modeling pipeline downloads the facebook/esm2... model from the Hugging Face Hub. To avoid rate limits and enable faster downloads, you should use an access token.
- Create a free account on HuggingFace.co.
- Go to your Access Tokens and create a new token with `read` permissions.
- Create a `.env` file in the root of the project.
- Add your token to the `.env` file. Dagster will automatically load this for you:
  `HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx`
- Ensure `.env` is added to your `.gitignore` file to avoid committing secrets.
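For clarity on what "automatically load" means, here is a minimal sketch of how a single `.env` entry is parsed and exported (python-dotenv and Dagster do this for you in practice; `parse_dotenv_line` is a hypothetical helper, not part of the project):

```python
import os

def parse_dotenv_line(line):
    """Parse a single KEY=value line the way a .env loader would.
    Returns (key, value), or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

key, value = parse_dotenv_line("HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
os.environ.setdefault(key, value)  # make the token visible to the process
print(key)  # HF_TOKEN
```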
The dashboard uses the Gemini API for the research assistant. To use this feature, you need a Gemini API key.
- Create a free account on Google AI Studio.
- Go to Get API Key and create a new API key.
- Add your API key to the `.env` file:
  `GEMINI_API_KEY=your_gemini_api_key`
- Ensure `.env` is added to your `.gitignore` file to avoid committing secrets.
Local Setup
- Python 3.9+
- uv: A fast Python package installer and resolver, used for environment management.
Clone the repository and create a virtual environment using uv.
```
git clone https://github.com/edwardchalstrey/open_target_graph.git
cd open_target_graph
uv venv
uv pip install -e ".[dev]"
```

To update the dependencies:

```
uv sync
```

The project uses Dagster to orchestrate data fetching and ML model inference. Run the following command to launch the Dagster UI:

```
uv run dagster dev
```

This will start the Dagster UI, typically at http://localhost:3000.
Navigate to the Dagster UI in your browser and click on Lineages. To configure the number of kinases fetched:
- Select the `raw_uniprot_kinases` asset.
- Click the dropdown arrow next to Materialize all and select Launchpad.
- In the configuration editor, specify the `num_kinases`.
- Click Materialize selected to materialize the first asset.
- Click off the `raw_uniprot_kinases` asset, then click the dropdown arrow again and choose Materialize unsynced to materialize the remaining assets.
Alternatively, you can simply click Materialize all to use the default of 100. This will execute the pipeline, download the data from UniProt and ChEMBL, generate embeddings, and load the results into the PostgreSQL database.
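The Launchpad accepts run config as YAML. A sketch of what the `num_kinases` entry might look like (the exact key path depends on how the asset's config schema is defined, so treat the structure below as an assumption):

```yaml
ops:
  raw_uniprot_kinases:
    config:
      num_kinases: 250
```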
Note: For the local setup, the load_to_postgres asset requires a PostgreSQL database to be running locally.
The platform requires a PostgreSQL database with the pgvector extension to store and query the generated embeddings.
TODO: Add instructions for setting up PostgreSQL with pgvector locally (so far this has only been tested via Docker).
Ensure this database is running before executing the data pipeline or launching the dashboard.
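As a starting point for a non-Docker setup, pgvector generally needs to be installed as a package for your PostgreSQL version and then enabled per database. A sketch of the enabling step (run against the database the pipeline will load into; see the pgvector documentation for installation on your platform):

```sql
-- Enable the pgvector extension in the target database.
-- Requires the pgvector package to be installed for your PostgreSQL version.
CREATE EXTENSION IF NOT EXISTS vector;
```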
Once the data assets from the pipeline have been materialized and loaded into the PostgreSQL database, you can launch the interactive Streamlit dashboard.
```
uv run streamlit run open_target_graph/dashboard/app.py
```

The application will now be running and accessible at http://localhost:8501.
See manual setup above.
```
uv run pytest
```

To run the entire application stack including Dagster, PostgreSQL (with pgvector), and the Streamlit dashboard all at once:
- Ensure Docker is installed and running.
- Clone the repository:

  ```
  git clone https://github.com/edwardchalstrey/open_target_graph.git
  cd open_target_graph
  ```

- Run the following command from the project root. By default, this will pull pre-built images from Docker Hub; to build locally, add the `--build` flag:

  ```
  docker compose up -d
  ```
Navigate to the Dagster UI in your browser, then configure and materialize the assets exactly as described in the Local Setup section above (the steps are identical).
Wait for the data ingestion to finish, then open the Streamlit GUI at http://localhost:8501.
To stop the application, run the following command:
```
docker compose down
```

To run the tests in the Docker container:

```
docker compose exec dagster uv run pytest
```

