The purpose of this repository is to collect useful scripts which mainly use RDKit. Individual scripts can be combined in pipes if input/output is SMILES.
Contributions are welcome!
Some scripts may require further dependencies.
- There is a `read_input.py` script which contains the function `read_input`. It reads molecules from SMI, SDF, SDF.GZ and PKL (pickled molecules as tuples of mol and mol_title) files and from STDIN (SMI and SDF formats are supported), and returns tuples of (mol, mol_title). It is a generator, so it can be used to process large collections of molecules. I recommend using this function if you do not need other data from the input files.
- There is a `_template.py` file which can be used as a template for new scripts. Please do not change the names of the input, output, ncpu and verbose arguments. This helps keep command-line arguments consistent across scripts.
- Add help messages to your scripts.
- Scripts should be able to read from STDIN and write to STDOUT so that they can be combined with pipes. Data passed through STDIN and STDOUT should be treated as SMILES strings. This is currently implemented in all scripts where it is relevant.
- All scripts may contain errors, so use them at your own risk. If you find a mistake, please open an issue and we will fix it. We also constantly revise old scripts and fix errors, because every found mistake is the penultimate one.

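
The conventions above (shared argument names, help messages, SMILES over STDIN/STDOUT) can be illustrated with a minimal sketch. This is not the repository's `_template.py`: the short flags (`-i`, `-o`, `-c`, `-v`), the helper names `iter_smiles`, `build_parser` and `main`, and the fallback of using the SMILES as the title are all assumptions made for illustration. Only the argument names input, output, ncpu and verbose come from the guidelines.

```python
"""Hypothetical skeleton of a pipe-friendly script (not the actual _template.py)."""
import argparse
import sys


def iter_smiles(lines):
    """Yield (smiles, title) pairs from SMILES lines.

    Assumption: the title is whatever follows the first whitespace;
    if it is missing, the SMILES string itself is used as the title.
    """
    for line in lines:
        fields = line.strip().split(None, 1)
        if not fields:
            continue  # skip blank lines
        smi = fields[0]
        title = fields[1] if len(fields) > 1 else smi
        yield smi, title


def build_parser():
    """Create a parser with the shared argument names and help messages."""
    parser = argparse.ArgumentParser(description='Copy SMILES records from input to output.')
    parser.add_argument('-i', '--input', metavar='FILENAME', default=None,
                        help='input SMILES file. If omitted, STDIN is read.')
    parser.add_argument('-o', '--output', metavar='FILENAME', default=None,
                        help='output SMILES file. If omitted, STDOUT is used.')
    parser.add_argument('-c', '--ncpu', metavar='INTEGER', type=int, default=1,
                        help='number of CPU cores to use (unused in this sketch).')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='print progress to STDERR.')
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    src = open(args.input) if args.input else sys.stdin
    dst = open(args.output, 'w') if args.output else sys.stdout
    try:
        for n, (smi, title) in enumerate(iter_smiles(src), 1):
            dst.write(f'{smi}\t{title}\n')
            if args.verbose:
                # progress goes to STDERR so it never pollutes piped SMILES
                print(f'{n} molecules processed', file=sys.stderr)
    finally:
        if src is not sys.stdin:
            src.close()
        if dst is not sys.stdout:
            dst.close()

# In a real script, main() would be called under the usual
# `if __name__ == '__main__':` guard.
```

Writing progress messages to STDERR rather than STDOUT keeps the piped stream limited to SMILES records, so the next script in the pipe can parse it without filtering.
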
| Script | Description |
|---|---|
| add_prefix | Add a prefix to molecule names in SDF file. |
| extractsdf | Extract molecule names and field values from input SDF. |
| extract_mol_by_name | Extract molecules by name (partial name matching) to new SDF file. |
| insert_sdf | Add data from a text file as additional fields to input SDF file. |
| remove_dupl_by_field | Remove entries from SDF file having duplicated mol title or field value. |
| rename_mols | Identify identical entries (conformers) and rename consistently. |
| rename_mols_simple | Rename SDF titles using a tab-separated old/new names file. |
| sdf_field2title | Insert field values into molecular title (or SMILES, or sequential titles). |
| sdf_title2field | Insert molecular title into a given SDF field. |
| strip_blank_lines | Remove empty lines in multi-line field values in input SDF. |

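
Several of the scripts above operate on the molecule title, which in an SDF file is the first line of each record, with records terminated by a `$$$$` line. The following is a minimal plain-text sketch of prefixing titles under that layout. It is not the implementation of add_prefix, and the function name `add_title_prefix` is hypothetical.

```python
def add_title_prefix(lines, prefix):
    """Yield SDF lines with `prefix` prepended to the title line of each record.

    Assumes a well-formed SDF: the title is the first line of a record
    and every record ends with a line containing only '$$$$'.
    """
    at_title = True  # the first line of the file is the first record's title
    for line in lines:
        if at_title:
            yield prefix + line
            at_title = False
        else:
            yield line
        if line.rstrip() == '$$$$':
            at_title = True  # the next line starts a new record
```

Because the function is a generator over lines, it can be applied to an open file handle or to STDIN without loading the whole SDF into memory.
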
| Script | Description |
|---|---|
| cansmi | Return canonical SMILES of input molecules. |
| frags2mols | Save disconnected components as individual molecules with suffix in name. |
| molchemaxon2pdb | Convert molecules to separate PDB files using RDKit & ChemAxon. |
| mols2pdb | Convert molecules (SMI/SDF) to PDB, adding hydrogens and conformers. |
| pkl2sdf | Convert PKL to SDF (e.g. conformers generated by gen_conf_rdkit). |
| sdf2mols | Split SDF into multiple MOL files. |
| sdf2pkl | Convert SDF to multi-conformer PKL (requires sequential mol titles). |
| smi2sdf | Convert SMILES to SDF including extra fields if present. |
| split_pdb | Split PDB by chains and save to separate PDB files. |

Manipulate Mol objects (calculate properties, generate conformers/stereoisomers, filter compounds, etc.):

| Script | Description |
|---|---|
| add_h | Add hydrogens to molecules. |
| calc_center_rdkit | Calculate geometric center of atoms. |
| clust_scaffolds | Cluster molecules by Murcko scaffold network levels. |
| compare_charged_centers | Get SMILES patterns of charged centers in two sets of molecules. |
| count_undefined_stereocenters | Count undefined stereocenters and print names + counts. |
| discard_compounds_rdkit | Remove multi-component & non-organic molecules. |
| discard_radicals | Discard structures containing radical electrons. |
| filter_organic | Split structures into organic (H, B, C, N, O, F, P, S, Cl, Br, I only) and inorganic sets. |
| draw_mols | Return PNG images of molecules. |
| filter_conf | Filter conformers by RMS value. |
| filter_conf_adv | Select representative conformers using clustering and advanced features. |
| gen_conf_rdkit | Generate conformers. |
| gen_stereo_rdkit | Enumerate stereoisomers (tetrahedral & double bond). |
| gen_stereo_rdkit_native | Enumerate stereoisomers using RDKit’s built-in function. |
| get_map | Calculate UMAP/t-SNE coordinates for input structures. |
| get_mol_center | Return geometric center of molecule. |
| get_contact_scaffold | Extract contact-anchored ligand scaffolds from SDF poses using ProLIF. |
| get_substr | Filter molecules by SMARTS (supports multiple patterns & negative matches). |
| get_total_charge | Calculate total formal charge. |
| keep_largest | Keep largest fragment by heavy atom count. |
| mirror_mols | Generate mirrored 3D structures (enantiomers). |
| murcko | Return Murcko scaffolds ignoring stereochemistry. |
| neutralize | Neutralize structures. |
| physchem_calc | Calculate physicochemical properties (MW, logP, TPSA, QED, etc.). |
| pmapper_descriptors | Calculate 3D pharmacophore descriptors (with pmapper). |
| remove_isotopes | Remove isotope labels and removable explicit hydrogens. |
| remove_stereo | Remove stereoconfiguration from all centers. |
| remove_dupl_rdkit | Remove duplicates via InChIKey comparison. |
| rmsd_rdkit | Calculate RMSD (MCS if atom matching fails, with symmetry checks). |
| sanitize_rdkit | Remove molecules with sanitization errors + annotate stereocenters, etc. |
| sphere_exclusion | Select diverse subset of compounds. |
| test_pains | Return list of molecules matching PAINS. |

| Script | Description |
|---|---|
| binning | Take a table with values and return binned values based on thresholds. |
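
Threshold-based binning, as described for the binning script, can be sketched with the standard library. This is an illustration of the general technique and not the script's actual code. The function name `bin_value` and the convention that a value equal to a threshold falls into the upper bin are assumptions.

```python
from bisect import bisect_right


def bin_value(value, thresholds):
    """Return the 0-based bin index of `value`.

    `thresholds` must be sorted in ascending order. With thresholds [t1, t2]:
    bin 0 holds values < t1, bin 1 holds t1 <= value < t2, bin 2 holds values >= t2.
    """
    return bisect_right(thresholds, value)


# Example: bin a small column of values with two thresholds.
values = [4.2, 5.0, 5.9, 7.1]
binned = [bin_value(v, [5.0, 6.0]) for v in values]
```

Using `bisect_left` instead would assign a value equal to a threshold to the lower bin. Which convention the script follows should be checked in its help message.
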