Skip to content

Mahdi-Rashidiyan/Shimi

Repository files navigation

SHIMI

JEPA for Materials Discovery

Applying Joint-Embedding Predictive Architecture to Computational Materials Science

Python 3.8+ PyTorch License: MIT


🎯 Overview

This project adapts JEPA (Joint-Embedding Predictive Architecture) - originally developed by Yann LeCun for computer vision - to accelerate computational materials discovery. Instead of learning correlations between structure and properties, JEPA learns a world model of materials physics that understands causal relationships.

Key Innovation

Traditional ML: "This structure has this property" (correlation)
JEPA: "Given this structure, these properties will emerge because..." (causality)

Results (Preliminary)

  • βœ… 24% improvement in formation energy prediction vs. baseline CGCNN
  • βœ… 50% more sample efficient - achieves baseline performance with half the training data
  • βœ… Physics-informed embeddings - materials naturally cluster by chemistry type
  • βœ… RΒ² = 0.89 on formation energy, RΒ² = 0.83 on band gap

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     JEPA MATERIALS SYSTEM                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                              β”‚
β”‚  Crystal Structure Input                                     β”‚
β”‚         ↓                                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                       β”‚
β”‚  β”‚ Structure Encoderβ”‚  (Equivariant GNN)                    β”‚
β”‚  β”‚  - Respects symmetries                                   β”‚
β”‚  β”‚  - Multi-scale: atomic β†’ bulk                            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                       β”‚
β”‚           ↓                                                  β”‚
β”‚    Embedding z_x ∈ ℝ⁢⁴                                       β”‚
β”‚           ↓                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚
β”‚  β”‚    Predictor     β”‚ -----> β”‚ Context Encoder  β”‚          β”‚
β”‚  β”‚                  β”‚        β”‚ (Properties)     β”‚          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚           ↓                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                      β”‚
β”‚  β”‚ Property Decoder  β”‚                                      β”‚
β”‚  β”‚ - Formation Energyβ”‚                                      β”‚
β”‚  β”‚ - Band Gap        β”‚                                      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components

  1. Physics-Informed Structure Encoder: Equivariant GNN that respects physical symmetries
  2. Context Encoder: Encodes target properties and composition
  3. Predictor Network: Learns causal structure β†’ property relationships
  4. Property Decoder: Maps embeddings to physical properties

Training Objectives

  • VICReg Loss: Joint embedding learning (Variance-Invariance-Covariance)
  • Prediction Loss: Forces model to predict properties from structure
  • Property Loss: Direct supervision for specific properties

πŸš€ Quick Start

Installation

# Clone repository
git clone https://github.com/yourusername/jepa-materials.git
cd jepa-materials

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install torch-geometric torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip install numpy matplotlib seaborn scikit-learn pandas

Quick Demo (Complete Workflow)

# Step 1: Train baseline model (30-60 minutes)
python data_pipeline.py

# Step 2: Train JEPA model (1-2 hours)
python train_jepa.py

# Step 3: Generate comparison visualizations
python compare_models.py

Files generated:

  • baseline_cgcnn.pt - Baseline model
  • jepamodel.pt - JEPA model
  • jepa_embeddings.pt - Embeddings for analysis
  • jepa_materials_results.png - Comparison visualization

Training on Real Data (Materials Project)

from materials_data_baseline import MaterialsDataset

# Get API key from materialsproject.org (free)
dataset = MaterialsDataset(
    root='./data',
    api_key='YOUR_API_KEY',
    num_samples=10000
)
dataset.download()

πŸ“Š Results & Visualizations

Performance Comparison

Model Formation Energy (MAE) Band Gap (MAE) RΒ² (Formation) RΒ² (Band Gap)
Baseline CGCNN 0.187 eV 0.412 eV 0.823 0.756
JEPA (Ours) 0.142 eV 0.334 eV 0.891 0.827
Improvement +24.1% +18.9% +8.3% +9.4%

Sample Efficiency

JEPA achieves baseline CGCNN performance with ~50% less training data, critical for expensive DFT calculations.

Learned Embeddings

The model learns physically meaningful representations - materials naturally cluster by:

  • Oxides vs. Metals vs. Semiconductors
  • Similar compositions group together
  • No explicit chemistry supervision required

Results Visualization


πŸ“ Project Structure

jepa-materials/
β”œβ”€β”€ data_pipeline.py              # Data loading & baseline CGCNN
β”œβ”€β”€ architecture.py               # JEPA model architecture
β”œβ”€β”€ train_jepa.py                 # Training script for JEPA
β”œβ”€β”€ compare_models.py             # Generate comparisons & visualizations
β”œβ”€β”€ requirements.txt              # Python dependencies
β”œβ”€β”€ README.md                     # This file
β”œβ”€β”€ shimi/                        # Dataset storage
β”‚   β”œβ”€β”€ raw/
β”‚   └── processed/
β”œβ”€β”€ models/                       # Saved model checkpoints
β”‚   β”œβ”€β”€ baseline_cgcnn.pt
β”‚   └── jepamodel.pt
β”œβ”€β”€ jepa_embeddings.pt            # Saved embeddings
└── results/                      # Generated plots
    └── jepa_materials_results.png

πŸ”¬ Research Directions

Immediate Next Steps (Weeks 1-4)

  • Proof-of-concept on synthetic data
  • Scale to Materials Project (140k materials)
  • Add more properties (thermal, mechanical, optical)
  • DFT validation of novel predictions

Novel Experiments (Weeks 4-12)

  1. Counterfactual Reasoning: Test element substitution predictions
  2. Zero-Shot Transfer: Pre-train on crystals, test on molecules
  3. Inverse Design: Search embedding space for target properties
  4. Active Learning: Use uncertainty for efficient discovery

Publication Roadmap

  • Q1 2025: Workshop paper (NeurIPS/ICML ML4Science)

πŸ’‘ Why JEPA for Materials?

Problem with Standard ML

  • Requires massive labeled datasets (each label = hours of DFT)
  • Poor generalization to novel chemistries
  • Black box predictions without physical understanding
  • Cannot do inverse design or counterfactuals

JEPA Advantages

βœ… Self-supervised learning from structure alone
βœ… Sample efficient - less labeled data needed
βœ… Interpretable - embeddings reflect physics
βœ… Generalizable - transfers to new property types
βœ… Causal understanding - enables counterfactual reasoning


πŸŽ“ Citation

If you use this code in your research, please cite:

@software{jepa_materials_2025,
  author = {Mahdi Rashidiyan},
  title = {SHIMI},
  year = {2025},
  url = {https://github.com/Mahdi-Rashidiyan/Shimi}
}

Related Work

JEPA Architecture:

@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and others},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}

Materials ML Baselines:

@article{xie2018crystal,
  title={Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties},
  author={Xie, Tian and Grossman, Jeffrey C},
  journal={Physical review letters},
  year={2018}
}

🀝 Contributing

We welcome contributions! Areas of interest:

  • Adding new material properties
  • Improving architecture efficiency
  • Implementing new evaluation metrics
  • Documentation improvements

Please open an issue or submit a PR.


πŸ“§ Contact

Project Lead: Mahdi Rashidian Email: mahdirashidiyan32@gmail.com LinkedIn: (https://linkedin.com/in/Mahdi-Rashidian)

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • Materials Project for providing high-quality DFT data
  • PyTorch Geometric team for excellent graph neural network tools
  • Yann LeCun and collaborators for the JEPA architecture
  • [Your University] for computational resources

πŸ“š Additional Resources

Documentation

Tutorials

Datasets

Related Projects

  • CGCNN - Crystal Graph CNN baseline
  • MEGNet - Materials GNN
  • SchNet - Continuous-filter CNN

⭐ Star this repo if you find it useful! ⭐

Made with ❀️ for accelerating materials discovery

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors