🤓 NMRTrans: Structure Elucidation from Experimental NMR Spectra via Set Transformers

NMRTrans is a transformer-based framework that performs structure elucidation from experimental NMR spectra. By leveraging Set Transformers with Induced Set Attention Blocks (ISAB) and Pooling by Multihead Attention (PMA), NMRTrans encodes unordered NMR peak sets into modality-specific representations. The framework fuses these representations with optional molecular formula constraints and employs a T5 decoder for autoregressive SMILES generation, effectively handling the permutation-invariant nature of spectral data while maintaining chemical validity.
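To make the permutation-invariance claim concrete, here is a minimal sketch of attention pooling over an unordered peak set. It is a single-head, projection-free simplification of PMA written in plain Python, not the NMRTrans implementation; the function name `attention_pool`, the two-dimensional peak encoding, and the `seed` values are assumptions made for illustration.

```python
import math
import random


def attention_pool(seed, peaks):
    """Pool a set of peak vectors into one vector using a single seed query.

    Simplified PMA: weights = softmax(seed . peak / sqrt(d)),
    output = weighted sum of the peak vectors.
    """
    d = len(seed)
    scores = [sum(s * p for s, p in zip(seed, peak)) / math.sqrt(d) for peak in peaks]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * peak[i] for w, peak in zip(weights, peaks)) for i in range(d)]


# Hypothetical 13C peaks encoded as (scaled chemical shift, normalized intensity).
peaks = [[0.105, 1.0], [0.385, 0.6], [0.640, 0.9], [0.855, 0.3]]
seed = [0.7, -0.2]  # would be a learned parameter in a real model

shuffled = peaks[:]
random.Random(0).shuffle(shuffled)

pooled = attention_pool(seed, peaks)
pooled_shuffled = attention_pool(seed, shuffled)

# The pooled vector does not depend on peak order (up to float rounding).
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(pooled, pooled_shuffled))
```

Because the softmax weights and the weighted sum are both computed per element and then summed, reordering the peaks only reorders the terms of the sum, which is why the encoder output is insensitive to the order in which peaks are listed.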

📢 Latest News

2026.7.2: Pretrained checkpoints have been released and the codebase has been refactored.
2026.5.17: NMRTrans was accepted to the KDD 2026 AI4S Track!
2026.3.6: 🚀 Released training and inference code and datasets.
2026.2.10: 📄 Our paper is now available on arXiv.

💻 Installation

Clone the repository:

git clone https://github.com/little1d/NMRTrans.git
cd NMRTrans

Create a Python 3.10 environment and install dependencies with uv:

conda create -n nmrtrans python=3.10 -y
conda activate nmrtrans

pip install uv
uv pip install -e .

⚙️ Configuration

NMRTrans uses a single YAML entry point for both training and inference:

cp configs/config.yaml configs/local.yaml
# Edit configs/local.yaml for local paths, checkpoints, data paths, and runtime settings.

Paths in YAML files are resolved relative to the project root. At least one NMR modality must be enabled with USE_C_NMR or USE_H_NMR; molecular formula guidance is optional through USE_FORMULA_GUIDANCE.
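The modality rule above can be checked before launching a run. The following is a minimal sketch that operates on an already-loaded config dictionary; the function name `validate_modalities` and the choice to treat missing keys as `False` are assumptions, and the project's own loader may behave differently.

```python
def validate_modalities(cfg: dict) -> list[str]:
    """Return the enabled feature names, or raise if no NMR modality is on."""
    # Assumption: a missing flag is treated as disabled.
    use_c = bool(cfg.get("USE_C_NMR", False))
    use_h = bool(cfg.get("USE_H_NMR", False))
    use_formula = bool(cfg.get("USE_FORMULA_GUIDANCE", False))

    if not (use_c or use_h):
        raise ValueError("Enable at least one of USE_C_NMR or USE_H_NMR.")

    features = []
    if use_c:
        features.append("c_nmr")
    if use_h:
        features.append("h_nmr")
    if use_formula:
        features.append("formula")  # optional guidance, never sufficient alone
    return features


print(validate_modalities({"USE_C_NMR": True, "USE_FORMULA_GUIDANCE": True}))
```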

src/config_local.py remains supported for backward compatibility, but it is only read when no YAML file is passed. When --config_path is provided, the configuration is built from the project defaults plus that YAML file, and no values are inherited from config_local.py. To use local Python overrides instead of YAML, copy the template and run python src/train.py without --config_path:

cp src/config_local.py.example src/config_local.py
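The precedence described above can be illustrated with a small sketch. It assumes a shallow key-by-key merge, and `resolve_config` is a hypothetical helper rather than a function from the repository; the example values are made up.

```python
def resolve_config(defaults, yaml_values=None, local_overrides=None):
    """Merge configuration sources following the documented precedence."""
    resolved = dict(defaults)
    if yaml_values is not None:
        # --config_path given: project defaults plus the YAML file only.
        resolved.update(yaml_values)
    elif local_overrides is not None:
        # No YAML passed: fall back to values from config_local.py.
        resolved.update(local_overrides)
    return resolved


defaults = {"BATCH_SIZE": 1024, "USE_C_NMR": True}
from_yaml = {"BATCH_SIZE": 128}
from_local_py = {"BATCH_SIZE": 256, "USE_H_NMR": True}

# With a YAML file, config_local.py values are not inherited.
print(resolve_config(defaults, yaml_values=from_yaml, local_overrides=from_local_py))
# Without a YAML file, the local Python overrides apply.
print(resolve_config(defaults, local_overrides=from_local_py))
```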

🔬 Inference

Pretrained checkpoints are available from the NMRTrans Hugging Face collection. Download the checkpoint that matches the input features you want to use.

For the C-NMR + H-NMR + Formula baseline:

mkdir -p checkpoints/pretrained
huggingface-cli download little1d/C-H-Formula nmrtrans-c-h-nmr-formula.ckpt --local-dir checkpoints/pretrained

python src/test.py \
  --config_path configs/local.yaml \
  --ckpt_path checkpoints/pretrained/nmrtrans-c-h-nmr-formula.ckpt \
  --features c_nmr,h_nmr,formula

Other released checkpoints can be downloaded in the same way by substituting the Hugging Face repository and checkpoint filename:

| Input features | Hugging Face repo | Checkpoint file | `--features` value |
| --- | --- | --- | --- |
| C-NMR | `little1d/C` | `nmrtrans-c-nmr.ckpt` | `c_nmr` |
| H-NMR | `little1d/H` | `nmrtrans-h-nmr.ckpt` | `h_nmr` |
| C-NMR + H-NMR | `little1d/C-H` | `nmrtrans-c-h-nmr.ckpt` | `c_nmr,h_nmr` |
| C-NMR + Formula | `little1d/C-Formula` | `nmrtrans-c-nmr-formula.ckpt` | `c_nmr,formula` |
| H-NMR + Formula | `little1d/H-Formula` | `nmrtrans-h-nmr-formula.ckpt` | `h_nmr,formula` |
| C-NMR + H-NMR + Formula | `little1d/C-H-Formula` | `nmrtrans-c-h-nmr-formula.ckpt` | `c_nmr,h_nmr,formula` |

Template:

huggingface-cli download <repo_id> <checkpoint_file> --local-dir checkpoints/pretrained

python src/test.py \
  --config_path configs/local.yaml \
  --ckpt_path checkpoints/pretrained/<checkpoint_file> \
  --features <feature_list>
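For scripted evaluation across several checkpoints, the template can be filled in programmatically. The sketch below only builds the two command lines as argument lists and does not run them; the repository IDs and filenames come from the released checkpoint table, while `build_commands` and its default paths are illustrative assumptions.

```python
RELEASED = {
    frozenset({"c_nmr"}): ("little1d/C", "nmrtrans-c-nmr.ckpt"),
    frozenset({"h_nmr"}): ("little1d/H", "nmrtrans-h-nmr.ckpt"),
    frozenset({"c_nmr", "h_nmr"}): ("little1d/C-H", "nmrtrans-c-h-nmr.ckpt"),
    frozenset({"c_nmr", "formula"}): ("little1d/C-Formula", "nmrtrans-c-nmr-formula.ckpt"),
    frozenset({"h_nmr", "formula"}): ("little1d/H-Formula", "nmrtrans-h-nmr-formula.ckpt"),
    frozenset({"c_nmr", "h_nmr", "formula"}): (
        "little1d/C-H-Formula",
        "nmrtrans-c-h-nmr-formula.ckpt",
    ),
}
ORDER = ("c_nmr", "h_nmr", "formula")  # order used in the table's --features column


def build_commands(features, local_dir="checkpoints/pretrained",
                   config_path="configs/local.yaml"):
    """Return (download_cmd, test_cmd) argument lists for a feature combination."""
    key = frozenset(f.strip() for f in features.split(",") if f.strip())
    if key not in RELEASED:
        raise ValueError(f"No released checkpoint for features: {features!r}")
    repo_id, ckpt = RELEASED[key]
    feature_arg = ",".join(f for f in ORDER if f in key)
    download_cmd = ["huggingface-cli", "download", repo_id, ckpt,
                    "--local-dir", local_dir]
    test_cmd = ["python", "src/test.py",
                "--config_path", config_path,
                "--ckpt_path", f"{local_dir}/{ckpt}",
                "--features", feature_arg]
    return download_cmd, test_cmd


download_cmd, test_cmd = build_commands("h_nmr, c_nmr")
print(" ".join(download_cmd))
print(" ".join(test_cmd))
```

The returned lists can be passed to `subprocess.run(cmd, check=True)` if you want the script to execute them.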

Make sure configs/local.yaml points to the corresponding data/cache paths before running evaluation on the released test split.

📊 Results

The table reports metrics under greedy top-1 autoregressive decoding on the released test split. The Epoch column gives the epoch of the validation-best checkpoint selected for each release.

| Input features | Epoch | Sequence acc. | Token acc. | Tanimoto similarity |
| --- | --- | --- | --- | --- |
| C-NMR | 8529 | 0.0427 | 0.3816 | 0.3296 |
| H-NMR | 7469 | 0.1994 | 0.5647 | 0.5516 |
| C-NMR + H-NMR | 7469 | 0.3646 | 0.6824 | 0.6997 |
| C-NMR + Formula | 8189 | 0.1813 | 0.5062 | 0.5153 |
| H-NMR + Formula | 9589 | 0.3719 | 0.6673 | 0.6902 |
| C-NMR + H-NMR + Formula | 5409 | 0.4447 | 0.7229 | 0.7569 |

Notes

All metrics are computed under greedy top-1 autoregressive decoding.

These numbers may differ slightly from those reported in the paper because they were re-evaluated with the refactored codebase and released checkpoints.
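This README does not spell out how the three metrics are computed, so the sketch below shows one common reading as an aid to interpreting the table. It assumes sequence accuracy is exact match of the full token sequence, token accuracy is position-wise agreement over the target length, and Tanimoto similarity is computed on sets of fingerprint on-bits. The fingerprint type, the tokenizer, and all function names are assumptions; the evaluation code in src/ is the authority.

```python
def sequence_accuracy(preds, targets):
    """Fraction of predictions whose token sequence matches the target exactly."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def token_accuracy(preds, targets):
    """Fraction of target positions where the predicted token agrees."""
    matches = sum(pt == tt for p, t in zip(preds, targets) for pt, tt in zip(p, t))
    total = sum(len(t) for t in targets)  # missing predicted tokens count as wrong
    return matches / total


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Toy example with character-level tokens (a real tokenizer would differ).
preds = [list("CCO"), list("CCN")]
targets = [list("CCO"), list("CCO")]

print(sequence_accuracy(preds, targets))
print(token_accuracy(preds, targets))
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Under this reading, token accuracy can be high while sequence accuracy stays low, because a single wrong token fails the exact match; that gap is visible in every row of the table.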

🏋️ Training

Training from scratch requires the T5 backbone and the preprocessed NMRTrans data cache.

Download the T5 backbone:

mkdir -p models
huggingface-cli download t5-small --local-dir models/t5-small

Download the released training, validation, and test splits from the NMRTrans-Data dataset repository:

mkdir -p cache
huggingface-cli download little1d/NMRTrans-Data --repo-type dataset --local-dir cache

The released cache contains the pre-split train, validation, and test files used by the paper and the released checkpoints. Place them under NMRTrans/cache as shown above, then make sure the dataset paths in configs/local.yaml point to these files.

Train with the local YAML configuration:

export CUDA_VISIBLE_DEVICES=0,1,2,3

mkdir -p checkpoints

python src/train.py --config_path configs/local.yaml

The default example configuration targets 4 GPUs with BATCH_SIZE=1024. To run on smaller hardware, reduce the GPU count and batch size, for example a single GPU with BATCH_SIZE=128. When the effective batch size changes, consider retuning related optimization parameters such as the learning rate and ACCUM_GRAD_BATCHES, since these changes may affect convergence speed and final performance.
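The arithmetic behind the effective batch size can be sketched as follows. This assumes BATCH_SIZE is a per-device value, so that the effective batch size is BATCH_SIZE × number of GPUs × ACCUM_GRAD_BATCHES; if the project treats BATCH_SIZE as a global value, drop the device factor. The linear learning-rate scaling shown is a widely used heuristic rather than a recommendation from the authors, and the base learning rate in the example is hypothetical.

```python
import math


def effective_batch_size(per_device_batch, num_devices, accum_grad_batches=1):
    """Samples contributing to one optimizer step (per-device assumption)."""
    return per_device_batch * num_devices * accum_grad_batches


def accumulation_to_match(target_effective, per_device_batch, num_devices):
    """Smallest accumulation factor that reaches the target effective batch."""
    return math.ceil(target_effective / (per_device_batch * num_devices))


def linear_scaled_lr(base_lr, base_effective, new_effective):
    """Linear scaling heuristic: learning rate proportional to effective batch."""
    return base_lr * new_effective / base_effective


reference = effective_batch_size(1024, 4)            # default 4-GPU example
accum = accumulation_to_match(reference, 128, 1)     # single GPU, BATCH_SIZE=128
print(reference, accum)

# Alternative: keep accumulation at 1 and scale the learning rate instead.
smaller = effective_batch_size(128, 1)
print(linear_scaled_lr(1e-4, reference, smaller))    # 1e-4 is a hypothetical base LR
```

Matching the effective batch size through accumulation keeps the optimization setup closest to the default, at the cost of more forward and backward passes per optimizer step.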

To train a different input combination, edit USE_C_NMR, USE_H_NMR, and USE_FORMULA_GUIDANCE in configs/local.yaml, or use one of the prepared experiment YAML files under configs/:

python src/train.py --config_path configs/experiment_c_h_formula.yaml

Resume training from a Lightning checkpoint:

python src/train.py \
  --config_path configs/local.yaml \
  --ckpt_path checkpoints/path/to/checkpoint.ckpt

For reproducibility, the full training curves and experiment parameters are published on SwanLab.

🔧 Finetuning

NMRTrans checkpoints can be used as initialization for further training, but finetuning on new data is not always as simple as swapping in a new dataset. For in-distribution 1D NMR datasets in the same preprocessed format, users can usually start from an existing checkpoint and update the dataset paths in the YAML config.

For new spectral settings or additional modalities, such as 2D NMR or HSQC, the data pipeline and model interface should be adapted consistently. In practice, this may require updating the raw-data parser, the serialized dataset format, MergedDataset, the collate function, feature normalization, modality masks, and the corresponding encoder/fusion inputs in the model. After these changes, a pretrained NMRTrans checkpoint can still provide a useful initialization for compatible parts of the architecture, while newly introduced modules may need to be initialized and trained from scratch.
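Reusing only the compatible parts of a checkpoint usually comes down to matching parameter names and shapes. The sketch below shows that selection step on plain dictionaries of shapes so it runs without any dependencies; `split_compatible` and the parameter names in the example are hypothetical and do not reflect the actual NMRTrans module layout.

```python
def split_compatible(pretrained_shapes, model_shapes):
    """Split the new model's parameters into reusable and fresh ones.

    A parameter is reusable when the checkpoint has the same name with
    the same shape; everything else must be initialized from scratch.
    """
    reusable, fresh = [], []
    for name, shape in model_shapes.items():
        if pretrained_shapes.get(name) == shape:
            reusable.append(name)
        else:
            fresh.append(name)
    return reusable, fresh


# Hypothetical parameter names and shapes, for illustration only.
pretrained = {
    "c_encoder.weight": (256, 2),
    "h_encoder.weight": (256, 4),
    "fusion.weight": (512, 512),
}
extended_model = {
    "c_encoder.weight": (256, 2),
    "h_encoder.weight": (256, 4),
    "hsqc_encoder.weight": (256, 3),  # new modality, not in the checkpoint
    "fusion.weight": (768, 512),      # fusion input widened, shape mismatch
}

reusable, fresh = split_compatible(pretrained, extended_model)
print("load from checkpoint:", reusable)
print("train from scratch:", fresh)

# With PyTorch, shapes can be read as {k: tuple(v.shape) for k, v in state_dict.items()},
# and the reusable entries loaded with model.load_state_dict(filtered, strict=False).
```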

To continue training from a compatible checkpoint:

python src/train.py \
  --config_path configs/local.yaml \
  --ckpt_path checkpoints/path/to/checkpoint.ckpt

📝 Citation

If you use NMRTrans in your research, please cite:

@article{yang2026nmrtrans,
      title={NMRTrans: Structure Elucidation from Experimental NMR Spectra via Set Transformers},
      author={Liujia Yang* and Zhuo Yang* and Jiaqing Xie* and Yubin Wang* and Ben Gao and Tianfan Fu and Xingjian Wei and Jiaxing Sun and Jiang Wu and Conghui He and Yuqiang Li and Qinying Gu},
      year={2026},
      eprint={2602.10158},
      archivePrefix={arXiv},
      primaryClass={physics.chem-ph},
      url={https://arxiv.org/abs/2602.10158},
}

📬 Contact

Thank you for your interest in NMRTrans. If you have questions about the algorithm, implementation details, or issues encountered while running the code, please open a GitHub issue so the discussion can help other users as well. You can also reach us by email at yzachary1551@gmail.com.

📄 License

This project is released under the MIT License. See LICENSE for details.
