Change a heterocycle to another isostere with HCIE #memo #cheminformatics #RDKit

Aromatic heterocycles are often used in drug design as the central core of a molecule, and they are often replaced with other isosteric heterocycles to improve potency or ADMET properties, or to find new IP space.

There are a few tools for finding isosteric heterocycles. For example, Brood, developed by OpenEye, is one of the most famous tools for bioisosteric replacement, but it requires a commercial license for industry users. (If you are in academia, you can use an academic license!)

Today I would like to introduce a very useful package for finding isosteric heterocycles named ‘HCIE’ :)

The original article is open access. You can read it at the following URL.
https://pubs.acs.org/doi/10.1021/acs.jmedchem.5c03118

The authors used VEHICLe and an expanded dataset as the data source, and they defined similarity with exit vectors, ESPsim, and a shape score. So it is a really reasonable approach for finding bioisosteres.

Fortunately, HCIE is released as OSS under the MIT license. So I tried to use it.

HCIE can be obtained from the following URL (https://github.com/BrennanGroup/HCIE/tree/main).

HCIE supports one and two exit vectors. I tried to test both cases. Let’s write code! (I still write code by myself, not with AI such as GitHub Copilot…)

At first I used an example molecule as a template.

from hcie import DatabaseSearch
from rdkit import Chem
from rdkit.Chem import Draw
# check template mol
smi = Chem.MolToSmiles(Chem.MolFromSmiles('C1=CC(=CC(=C1)Cl)NC2=C3C(=NC=NC3=NN2)N'))
Draw.MolToImage(Chem.MolFromSmiles(smi))

One-vector case: the following example is a molecule which has one variable part.

    # query with 1 variable
    query = "Nc1ncnc2n[nH]c([R])c12"
    querymol = Chem.MolFromSmarts(query)
    Draw.MolToImage(querymol)

    # search isosteres with HCIE
    search = DatabaseSearch(query, 'core1')
    search.search()

    cores = [core for core in Chem.SDMolSupplier('core1_hcie_results/core1_aligned_results.sdf')]
    for core in cores:
        print(Chem.MolToSmiles(core))

    # add atom map numbers to the dummy atoms so molzip can use them later
    cores4zip = []
    for core in cores[1:]:
        for atm in core.GetAtoms():
            if atm.GetAtomicNum() == 0:
                atm.SetAtomMapNum(1)
        cores4zip.append(Chem.MolFromSmiles(Chem.MolToSmiles(core)))
    
    > output
    # the first molecule is the query, so we skip it when generating new molecules
    Nc1ncnc2n[nH]cc12
    *c1[nH]nc2ncnc(N)c12
    *c1[nH]nc2ncnc(Cl)c12
    *n1cnc2ncnc(N)c21
    *c1[nH]nc2nnnc(N)c12
    

I added atom map numbers to the result cores because I would like to use the molzip function in the next step.
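As a quick illustration of why the map numbers matter, here is a minimal molzip sketch on toy molecules (not from the HCIE results):

```python
from rdkit import Chem

# minimal sketch: Chem.molzip fuses two fragments at dummy atoms
# that carry the same atom map number
core = Chem.MolFromSmiles('c1ccc(cc1)[*:1]')   # phenyl core with one exit vector
side = Chem.MolFromSmiles('CN[*:1]')           # methylamino side chain
merged = Chem.molzip(core, side)
print(Chem.MolToSmiles(merged))  # -> CNc1ccccc1
```

Without matching map numbers on the two dummy atoms, molzip has no way to know which attachment points belong together.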

    # define side chains to combine with the cores
    sidechain1 = 'c1ccc(Cl)cc1N[*:1]'
    sc1 = Chem.MolFromSmiles(sidechain1)
    Draw.MolToImage(sc1)
    
    sidechain2 = 'CCN[*:2]'
    sc2 = Chem.MolFromSmiles(sidechain2)
    Draw.MolToImage(sc2)
    

I generated new molecules from the searched cores and side chain 1 with the molzip function.

    genmols1 = [Chem.molzip(sc1, core4zip) for core4zip in cores4zip]
    Draw.MolsToGridImage(genmols1[:10])
    

Wow, it worked fine! Then I tried to get cores which have two vectors.

    # query with 2 variables 
    query2 = "[R]c1ncnc2n[nH]c([R])c12"
    querymol2 = Chem.MolFromSmarts(query2)
    Draw.MolToImage(querymol2)
    
    search2 = DatabaseSearch(query2, 'core2')
    search2.search()
    

It is worth knowing that the results from a query with two vectors already have atom map numbers, which means I can attach side chains very easily.

    cores2 = [core2 for core2 in Chem.SDMolSupplier('core2_hcie_results/core2_aligned_results.sdf')]
    cores24zip = []
    for core2 in cores2:
        print(Chem.MolToSmiles(core2))
        cores24zip.append(Chem.MolFromSmiles(Chem.MolToSmiles(core2)))
    > output
    c1ncc2c[nH]nc2n1
    c1nc([*:1])c2c([*:2])[nH]nc2n1
    c1cc([*:1])c2c([*:2])[nH]nc2n1
    n1nc([*:1])c2c([*:2])[nH]nc2n1
    c1cc2n[nH]c([*:2])c2c([*:1])n1
    c1nc([*:1])c2c([*:2])onc2n1
    c1nc([*:1])c2c([*:2])[nH]cc2n1
    c1nnc2n[nH]c([*:2])c2c1[*:1]
    c1[nH]c([*:1])c2c([*:2])nnnc12
    ..snip..
    
    genmols2 = [Chem.molzip(Chem.molzip(sc1, core2), sc2) for core2 in cores24zip[1:]]
    Draw.MolsToGridImage(genmols2[:10])
    

Yay! I could get new molecules which have side chains 1 and 2 with a new core.

In summary, HCIE is a useful tool for finding aromatic heterocycle replacements for drug-like molecules. I would like to thank the developers for sharing such useful code :)

Rendering molecular images in a DataFrame and plot with marimo #Memo #RDKit #Cheminformatics

As lots of readers know, RDKit has useful functions for coding with jupyter-lab. PandasTools and IPythonConsole are useful because, by using them, RDKit can render mol objects as SVG in a pandas dataframe.
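For readers who have not used it, a minimal sketch of the PandasTools workflow (with toy SMILES) looks like this:

```python
import pandas as pd
from rdkit.Chem import PandasTools

# attach RDKit mol objects to a dataframe; in a notebook the ROMol
# column then renders as molecule images instead of text
df = pd.DataFrame({'smiles': ['c1ccccc1', 'CCO']})
PandasTools.AddMoleculeColumnToFrame(df, smilesCol='smiles', molCol='ROMol')
print(df.ROMol[0].GetNumAtoms())  # -> 6 (heavy atoms of benzene)
```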

Recently I found that marimo is becoming famous as a next generation of jupyter-notebook. I recommend reading Pat’s blog post if you have not read it.

    https://patwalters.github.io/Practical-Cheminformatics-with-Marimo/

Pat developed useful cheminformatics code for marimo. So I tried to use it.

At first, I made a test env with pixi.

    $ pixi init
    $ pixi add marimo
    $ pixi add rdkit
    $ pixi add scikit-learn
    $ pixi add pandas
$ pixi add seaborn
    $ pixi add altair
    $ pixi add --pypi marimo-chem-utils
    
    $ pixi shell
    
    # my pixi.toml
    [workspace]
    channels = ["https://conda.modular.com/max-nightly", "conda-forge"]
    name = "marimo_dev"
    platforms = ["linux-64"]
    version = "0.1.0"
    
    [tasks]
    
    [dependencies]
    marimo = ">=0.20.2,<0.21"
    rdkit = ">=2025.9.6,<2026"
    pandas = ">=3.0.1,<4"
    scikit-learn = ">=1.8.0,<2"
    
    

Now I have made and activated the environment for the test. Then launch the marimo editor and write code.

To run marimo, just type ‘$ marimo edit’.

    import marimo as mo
    import pandas as pd
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import rdDepictor
    from marimo_chem_utils import (
        add_fingerprint_column,
        add_image_column,
        add_inchi_key_column,
        add_tsne_columns,
        interactive_chart
    )
    from rdkit.Chem import PandasTools
    from rdkit.Chem import Descriptors
    
    df = PandasTools.LoadSDF('./cdk2.sdf')
    df
    
    

Now all columns seem to be defined as string type.

So I changed the column types; it’s just a simple pandas operation. After that, marimo can render charts from the numerical dataset. It seems cool ;)

    df['Cluster'] = df['Cluster'].astype(np.int32)
    df['r_mmffld_Potential_Energy-OPLS_2005'] = df['r_mmffld_Potential_Energy-OPLS_2005'].astype(np.float64)
    df['r_mmffld_RMS_Derivative-OPLS_2005'] = df['r_mmffld_RMS_Derivative-OPLS_2005'].astype(np.float64)
    df
    

Then I removed the 3D conformation properties from the molecules to render 2D images, and added image data to the dataframe. useful_rdkit_utils is really useful: it can generate a base64 image by just calling the mol_to_base64_image method. And the column name should be ‘image’ because I use the image as a tooltip.

    import useful_rdkit_utils as uru
    df.ROMol.apply(rdDepictor.Compute2DCoords)
    df['image'] = df.ROMol.apply(uru.mol_to_base64_image, target='altair')
    df
    

After running the code described above, I could add images to the dataframe.

Then I calculated molecular descriptors for making a scatter plot with compound images. It’s really easy to do with the current version of RDKit.

    descs = [Descriptors.CalcMolDescriptors(m) for m in df.ROMol]
    desc_df = pd.DataFrame(descs)
    m_df = df.join(desc_df)
    

Finally, I made x and y selectors for rendering the scatter plot and used altair to make the figure.

    x = mo.ui.dropdown(options=m_df.columns)
    x
    
    y = mo.ui.dropdown(options=m_df.columns)
    y
    
    import altair as alt
    mo.ui.altair_chart(alt.Chart(m_df).mark_point().encode(
        x=x.value,
        y=y.value,
        tooltip=alt.Tooltip(['image']))
                      )
    

Adding a tooltip named ‘image’ is important because, by using that name, altair can render the image in the tooltip.

In summary, marimo is a really cool package and useful for cheminformatics.

Split a PROTAC molecule into 3 components with protac_splitter #RDKit #cheminformatics #memo #python

Proteolysis-targeting chimeras (PROTACs) are one of the most interesting modalities these days because they can engage a protein of interest (POI) and an E3 ubiquitin ligase, causing degradation of the POI.

PROTAC molecules are built from 3 components: a POI binder, a linker, and an E3 binder. So chemists and cheminformaticians would like to analyse these molecules by component.

However, it’s sometimes difficult to split a molecule into these components because PROTAC structures are diverse. It would be a simple problem if every linker were PEG :P, but the real world is more complex. It means that data curation of PROTAC molecules is tough work.

Today I would like to introduce an interesting article from AstraZeneca’s team. They reported a PROTAC splitting program called protac_splitter. The original publication is open access; the URL is below.
PROTAC-Splitter: A Machine Learning Framework for Automated Identification of PROTAC Substructures

Fortunately, all the code is available from GitHub.
    https://github.com/ribesstefano/PROTAC-Splitter

I tried to use the package with test data provided by the ChEMBL team!

At first I built the environment with pixi, as always (I like pixi for package management recently).

    $ mkdir protac_split
    $ cd protac_split
    $ pixi init
    $ pixi add python=3.10.8
    $ pixi add --pypi "protac_splitter @ git+https://github.com/ribesstefano/PROTAC-Splitter.git"
    $ pixi add --pypi jupyter
    $ pixi shell
    # All required packages will be installed!
    

Then I got the PROTAC dataset from the following URL.

    EBI’s blog https://chembl.blogspot.com/2026/01/exploring-targeted-protein-degradation.html

Link for data: https://docs.google.com/spreadsheets/d/1JAeBkxyp5wq4-4vGqLdT-6ZPeVqAeA80pwdAs-JaE0Q/edit?gid=1946153757#gid=1946153757

    OK, let’s launch jupyter-lab.

    from rdkit import Chem
    from rdkit.Chem import Draw
    from rdkit.Chem.Draw import IPythonConsole
    from rdkit.Chem import PandasTools
    from protac_splitter import split_protac
    from protac_splitter import split_prediction
    from protac_splitter import split_protac_graph_based
    import numpy as np
    import pandas as pd
    # I renamed the csv file after downloading it from the link above because the name contained white space.
    df = pd.read_csv('TPD_combined_v36_Jan_20.csv')
    print(df.shape)
    
    >(21657, 20)
    

Some rows have NaN in the SMILES column, so I removed them.

    df_no_na = df.dropna(subset=['CANONICAL_SMILES']).copy()
    PandasTools.AddMoleculeColumnToFrame(df_no_na, smilesCol='CANONICAL_SMILES')
    Draw.MolsToGridImage(df_no_na['ROMol'][:10], molsPerRow=3, subImgSize=(300,100))
    

OK, I tried to split the top 50 molecules.

    pred_res = split_protac(df_no_na.iloc[0:50,:], protac_smiles_col='CANONICAL_SMILES')
    PandasTools.AddMoleculeColumnToFrame(pred_res, smilesCol='default_pred_n0')
    Draw.MolsToGridImage(pred_res.ROMol[:10], subImgSize=(300,300))
    

Hmm, it seems that protac_splitter works fine on simple molecules which have linear linkers.

By the way, how about more difficult cases, such as molecules which have rigid linkers? The dataset contains rigid molecules; by rigid I mean molecules with a low number of rotatable bonds.

    from rdkit.Chem import rdMolDescriptors
    df_no_na['NumRotBond'] = df_no_na['ROMol'].apply(rdMolDescriptors.CalcNumRotatableBonds)
    df_no_na['NumRotBond'].plot.hist(bins=100)
    
    rigid_protac = df_no_na[df_no_na['NumRotBond']<=5].copy()
    pred_res2 = split_protac(rigid_protac[:50], protac_smiles_col='CANONICAL_SMILES')
    PandasTools.AddMoleculeColumnToFrame(pred_res2, smilesCol='default_pred_n0')
    Draw.MolsToGridImage(rigid_protac.ROMol[300:320], molsPerRow=3, subImgSize=(300,200))
    
    
    Draw.MolsToGridImage(pred_res2['ROMol'][:20], subImgSize=(300,400))
    

    It worked.

protac_splitter uses XGBoost, trained with a PROTAC dataset and synthetic data. As the authors discussed in the article, protac_splitter has limitations and is not perfect, but it’s worth knowing about, because splitting these molecules by writing lots of SMARTS rules yourself is tough work…
Thanks for reading.
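To illustrate why the rule-based alternative is tedious, here is a single hand-written rule (a toy example, not the authors' method) that cuts a molecule at amide bonds; a real PROTAC rule set would need many such patterns:

```python
from rdkit import Chem

# toy rule-based split: fragment at every amide C-N bond
mol = Chem.MolFromSmiles('CC(=O)NCCOCCNC(=O)C')  # toy linker-like molecule
amide = Chem.MolFromSmarts('[CX3](=O)[NX3]')     # match atoms: C, O, N
bonds = [mol.GetBondBetweenAtoms(m[0], m[2]).GetIdx()
         for m in mol.GetSubstructMatches(amide)]
frags = Chem.FragmentOnBonds(mol, bonds, addDummies=True)
print(len(Chem.GetMolFrags(frags)))  # -> 3 fragments
```

This single rule only handles amides; esters, ethers, triazoles, direct aryl-aryl linkers, and so on would each need their own pattern, which is exactly the curation burden protac_splitter tries to remove.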
    Thanks for reading.

    My code is uploaded to gist.


    New clustering algorithm for cheminformatics #bblean #cheminformatics #RDKit

Clustering is one of the common but really important tasks in cheminformatics. There are lots of known clustering algorithms, as readers know, but now we have to deal with huge compound datasets such as Enamine REAL, WuXi Galaxy, ZINC, and so on in the era of AI-driven drug discovery. It’s becoming a tough task to calculate properties and fingerprints for huge numbers of compounds and to cluster them. So we need an efficient clustering algorithm :) Also, one which works on my PC without a GPU is preferred (IMHO, because I don’t have a powerful private GPU machine :P).

Recently I tried bblean, which was reported by Ramón Alain Miranda-Quintana’s group. I heard his presentation at ACS Spring 2025 and was interested in the program, but could not find time to test it until now. I tried it this weekend :)

The code is open; you can get it from GitHub. Let’s start!

At first, I made an experimental environment with pixi :)

    $ mkdir bblean_test
    $ cd bblean_test
    $ gh repo clone mqcomplab/bblean
    $ pixi import -f bblean/environment.yaml
    $ pixi add jupyter
    $ pixi shell
    # install with C++ extension
    $ BITBIRCH_BUILD_CPP=1 pip install -e bblean/
    

Now I could build the bblean env with pixi.

I tested bblean with the ChEMBL 36 dataset. The data can be downloaded from the ChEMBL download site. https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/

    $ wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/chembl_36.sdf.gz
    

Then I extracted the gz file and made a SMILES list of molecules with 150 < MW < 700 using RDKit (the process is a common one, so I will skip writing it in this post).
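For completeness, the skipped step probably looks something like this sketch (file names taken from the surrounding commands; the exact filtering logic here is my assumption):

```python
import gzip
from rdkit import Chem
from rdkit.Chem import Descriptors

def keep(mol):
    # keep molecules in the 150 < MW < 700 window used for the test
    return mol is not None and 150 < Descriptors.MolWt(mol) < 700

def write_filtered_smiles(sdf_gz_path='chembl_36.sdf.gz', out_path='chembl36.smi'):
    # stream the gzipped SDF and write one SMILES per line
    with gzip.open(sdf_gz_path, 'rb') as inf, open(out_path, 'w') as out:
        for mol in Chem.ForwardSDMolSupplier(inf):
            if keep(mol):
                out.write(Chem.MolToSmiles(mol) + '\n')
```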

Almost there! OK, let’s use bblean from the CLI.

    iwatobipen ~/dev/bblean_test📦 default  󱫌 8s272ms
    on 🌀 ➜ time bb fps-from-smiles chembl36.smi 
    - Total time elapsed: 47.0539 s
    Finished. Outputs written to /home/iwatobipen/dev/bblearn/packed-fps-uint8-ecfp4-6e41d8c2.npy
    
    real	0m47.664s
    user	0m3.280s
    sys	0m0.769s
    
    iwatobipen ~/dev/bblean_test📦  
    on 🌀 ➜ wc chembl36.smi 
      2701131   5402262 180481148 chembl36.smi
    

As you can see, fingerprint calculation for over 2.7 million compounds finished in under 1 minute with 20 cores!

Then I ran the clustering.

    iwatobipen ~/dev/bblean_test📦 default  󱫌 9s954ms
    on 🌀 ➜ time bb run packed-fps-uint8-ecfp4-6e41d8c2.npy -o output
    
            ______ _ _  ______ _          _        
            | ___ (_) | | ___ (_)        | |       ______                      
            | |_/ /_| |_| |_/ /_ _ __ ___| |__     ___  / ___________ _______  
            | ___ \ | __| ___ \ | '__/ __| '_ \    __  /  _  _ \  __ `/_  __ \ 
            | |_/ / | |_| |_/ / | | | (__| | | |   _  /___/  __/ /_/ /_  / / / 
            \____/|_|\__\____/|_|_|  \___|_| |_|   /_____/\___/\__,_/ /_/ /_/  
    
    
    BitBirch-Lean is developed by the Miranda-Quintana Lab https://github.com/mqcomplab
    If you find this software useful please cite the following articles:
        • BitBIRCH: efficient clustering of large molecular libraries:
            https://doi.org/10.1039/D5DD00030K
        • BitBIRCH Clustering Refinement Strategies:
            https://doi.org/10.1021/acs.jcim.5c00627
        • BitBIRCH-Lean:
            (preprint) https://www.biorxiv.org/content/10.1101/2025.10.22.684015v1
    
    Running single-round, serial (1 process) clustering
    
    - Branching factor: 254
    - Merge criterion: diameter
    - Threshold: 0.3
    - Num. files loaded: 1
    - Num. fingerprints loaded for each file: 2,701,131
    - Total num. fingerprints: 2,701,131
    - Output directory: /home/iwatobipen/dev/bblean_test/output
    
    - Total time elapsed: 127.9181 s
    - Peak RAM use: 1.8784 GiB
    
    real	2m12.539s
    user	2m13.463s
    sys	0m1.521s
    
    

The clustering process finished within a few minutes without a GPU. It’s amazing to me. But wait, I should check the clustering results; sometimes such results are not good.

I checked the results with jupyter. (I would like to move to marimo, but today I used jupyter.)

The following simple visualization suggests that the clustering works well: similar compounds fall in the same cluster id.

    The check code is uploaded to my gist.


Readers who are interested in bblean, let’s try it!

And the original publication can be read on bioRxiv.
    https://www.biorxiv.org/content/10.1101/2025.10.22.684015v1.full.pdf+html

Useful utils for analysing chemical reactions #cheminformatics #rdkit #rxnutils

Recently, lots of users know that generative models for molecules are really useful for drug design. But one of the big challenges is how to make the designed molecules. So retrosynthesis-prediction AI is a hot topic in this area, I think.

There are useful retrosynthesis AIs reported, such as Spaya.AI, Synthia, Reaxys, and SciFinder as commercial packages, and AiZynthFinder as OSS.

There are pros and cons to these tools, but they all use AI (ML) or reaction rules. So the important part is analysing reactions, either to prepare training data or to prepare reaction templates.

Extracting reaction templates from reaction data is a difficult task in cheminformatics. To do it, we first need atom-atom mapping for each reaction, and then we extract the template from each mapped reaction.
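As background, a retro template is essentially a reaction SMARTS run in the retro direction. Here is a hand-written toy template (not one extracted by reaction_utils) applied with plain RDKit:

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# toy retro template: disconnect an amide into acid + amine
retro = rdChemReactions.ReactionFromSmarts(
    '[C:1](=[O:2])[NX3:3]>>[C:1](=[O:2])O.[N:3]')
product = Chem.MolFromSmiles('CC(=O)Nc1ccccc1')  # acetanilide
for reactants in retro.RunReactants((product,)):
    smis = []
    for m in reactants:
        Chem.SanitizeMol(m)
        smis.append(Chem.MolToSmiles(m))
    print(' + '.join(smis))
```

Writing such templates by hand for every reaction class is exactly the work that automated template extraction replaces.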

Today, I would like to share a useful package for analysing reactions named reaction_utils, which is developed by AstraZeneca :) The GitHub URL is below.
    https://github.com/MolecularAI/reaction_utils

I built the environment with pixi :)

    $ mkdir reaction_utils_test
    $ cd reaction_utils_test
    $ pixi init
    $ pixi add python=3.12
    $ pixi add jupyter
    $ pixi add --pypi reaction-utils
    $ pixi shell
    

Now I could build the reaction_utils env and activate it.

    Let’s enjoy the package :)

    $ wget https://raw.githubusercontent.com/snu-micc/LocalMapper/refs/heads/main/comparison/USPTO_sampled.csv
    
    # The following code was run in jupyter-lab
    from rdkit import Chem
    from rdkit.Chem import Draw
    from rdkit.Chem.Draw import IPythonConsole
    from rxnutils.chem.reaction import ChemicalReaction
    import pandas as pd
    df = pd.read_csv('./USPTO_sampled.csv')
    
    rxns_rxn_mapper = df.RXNMapper.to_list()
    reaction = rxns_rxn_mapper[4]
    # read reaction
    rxn = ChemicalReaction(reaction)
    
    from rdkit.Chem import rdChemReactions
    rdrxn = rdChemReactions.ReactionFromSmarts(reaction, useSmiles=True)
    rdrxn
    

reaction_utils can conveniently extract reaction templates with a user-defined radius.

    rxn.generate_reaction_template(radius=1)
    rxn.retro_template
    rdChemReactions.ReactionFromSmarts(rxn.retro_template.smarts)
    
    rxn.generate_reaction_template(radius=2)
    rxn.retro_template
    rdChemReactions.ReactionFromSmarts(rxn.retro_template.smarts)
    

As you can see, the package can extract reaction information with a few lines of code. It’s really useful for building your own retrosynthesis AI ;)

I would like to recommend the package to readers who are interested.

    I uploaded my code on gist. Thanks for reading!


    Look Back at 2025 #diary

I’m writing this post at around 20:30 JST.

    I would like to look back at 2025 in my blog post ;)

1. Running
  This year my total running distance was 1078 km, about the same as 2024. The shortfall came from a lack of long-distance runs at weekends. I would like to cover at least 1200 km next year!
2. Coding
  From my GitHub profile, I made 76 commits this year. Hmm… I should contribute and commit more code next year. My job role has changed, but I should keep going.
3. My blog-site
  I posted 24 posts (excluding this one) this year. The pace has decreased compared to last year. Writing blog posts is worthwhile for me because it keeps me learning, so I would like to continue writing next year too :)
4. About my work
  My role changed this year. Management is a really tough task; it’s difficult but rewarding. I would like to promote an AI-driven drug discovery workflow!
5. For next year
  The current progress of AI technology is really amazing. Now my colleagues rapidly develop web apps with AI agents, even in languages that are new to them. We can use deep research and similar AI agents to search for things instead of googling. The world has changed dramatically. I should keep learning!

    This will be the last post of this year. Thanks for reading.

    I want to wish you and your family a safe, beautiful and happy New Year.

Considering predictive models for cheminformatics tasks #cheminformatics #memo #journal

This year, my role changed from researcher to manager. It’s a really big change, for better or worse…

Fortunately I’m still working in the cheminformatics field :) And this year I had the opportunity to run a hands-on training at the CBI annual meeting. My colleague and I disclosed the materials on GitHub.
    https://github.com/cbi-society/cheminfo_tutorial_20251027_pub

The topic was making a predictive model with the deep-learning-based algorithm ‘chemprop’ and fine-tuning the model with an ADMET dataset from Polaris. During the hands-on session I had lots of fruitful discussions with participants and Greg Landrum (thanks!).

In the real world, there are lots of new deep-learning-based predictive methods available. So which model to use is a very difficult question for a cheminformatician ;) And whether we should use a deep learning approach instead of a classical model is also a difficult question, I think, because tree-based models such as LightGBM show stable performance, in which case a DL model is not required.

For people who have the same questions, I would like to recommend reading the publication from Yaëlle Fischer et al.

‘Deep Learning vs Classical Methods in Potency and ADME Prediction: Insights from a Computational Blind Challenge’

The article is available from ChemRxiv!
    https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/68a4412023be8e43d6e2e7eb/original/deep-learning-vs-classical-methods-in-potency-adme-prediction-insights-from-the-polaris-antiviral-challenge.pdf

In this article, the authors benchmarked deep learning and classical machine learning methods using Polaris data (potency and ADME) as input.

They used pIC50, LogD, HLM, MLM, kinetic solubility and MDCK (permeability) data from Polaris. And they used classical descriptors or learned embeddings as input for classical ML or deep learning, as listed below.

Classical descriptors: ECFP4, Avalon, RDKit-2D, Mordred
Classical ML: Random Forest, XGBoost, LGBM, SVM

Learned embeddings: CDDD, MolFormer, ChemBERTa, CheMeleon
Deep learning: ChemProp, TabPFN

Figure 4 shows that classical descriptors outperformed deep learning embeddings, indicating that general-purpose embeddings do not outperform traditional descriptors.

Figure 6 shows that deep learning models performed better in the ADME tasks, but classical models performed better in the potency tasks (pIC50 of SARS/MERS). TabPFN showed the best performance across all tasks.

There is more data available for ADMET than for potency, which is good for deep learning models, because a huge amount of data is required to train them (IMHO).

The authors’ conclusion is reasonable to me.

The key to success lies in collecting high-quality data cost-effectively in the era of AI, even if open-source deep learning models are available for everyone.

    Use joblib with free-threaded python #memo #diary

Python 3.14 was released on 7 October 2025. The new version of Python supports free-threaded mode (no GIL). I asked an AI about it, and a reasonable description is shown below.

    To understand the merit of “No-GIL” Python (officially known as free-threaded Python), we first need to quickly establish what the Global Interpreter Lock (GIL) is and why it has historically been a bottleneck.

In standard CPython (the default Python implementation), the GIL is a mutex that allows only one thread to control the Python interpreter at a time. Even if you have a CPU with 32 cores, a standard multi-threaded Python program can effectively only use one of those cores at any given instant for Python bytecode.

    The removal of the GIL (spearheaded by PEP 703) is a massive shift. Here are the primary merits of a No-GIL Python:


    1. True Multi-Core Parallelism

    The most significant benefit is the ability to run Python threads in parallel on multiple CPU cores.

• Current State (With GIL): Threads provide concurrency (they take turns running), but not parallelism (running simultaneously). This works fine for waiting on data (I/O-bound tasks like downloading files), but fails for calculation-heavy tasks.
    • No-GIL State: Threads can execute Python bytecode on different cores at the exact same time.

    2. Significant Speedup for CPU-Bound Tasks

For applications that require heavy computation (data processing, image manipulation, complex mathematics), No-GIL Python allows you to scale performance by adding more threads.

    • Example: If you have a loop processing a massive list of numbers, splitting that work across 4 threads on a 4-core machine in standard Python might actually be slower than 1 thread due to the overhead of context switching.
    • With No-GIL: That same 4-thread operation could theoretically run nearly 4x faster.

    3. Reduced Reliance on multiprocessing

    Currently, to bypass the GIL, Python developers use the multiprocessing library. This spawns entirely separate processes (separate memory spaces) rather than threads. While effective, it comes with heavy “costs” that No-GIL eliminates:

    • No Pickling Overhead: In multiprocessing, data must be serialized (pickled) and sent between processes. This is slow and uses extra CPU. With free-threading, threads share memory directly, so there is zero serialization overhead.
    • Lower Memory Footprint: Processes consume much more RAM than threads because each process needs its own copy of the Python runtime and resources. Threads share the same heap.

    4. Simplified Extension Development

    Many popular libraries (like NumPy or PyTorch) currently have to write complex C/C++ code to strictly release the GIL whenever they do heavy lifting, just so other threads can run.

    • The Merit: With the GIL gone, library maintainers can write parallel code more naturally without constantly juggling the lock state, potentially leading to simpler and more robust library codebases.

    5. Better Responsiveness

In mixed-workload applications (e.g., a web server doing some background number crunching), the GIL can cause latency spikes where the background thread “hogs” the lock, making the main thread unresponsive to user requests. Free-threading allows the UI or request handler to run uninterrupted on one core while the computation happens on another.


    The Trade-offs (The Cost of Merit)

    It is important to note that removing the GIL is not “free.” There are reasons it existed for so long:

1. Single-Threaded Performance Hit: To make Python thread-safe without a global lock, the interpreter must use many smaller locks (fine-grained locking) on individual objects. This adds overhead. The current experimental No-GIL builds are roughly 10-15% slower on single-threaded code compared to standard Python.
2. Thread Safety Risks: The GIL implicitly protected developers from many race conditions. Without it, developers must be much more careful about locking shared data, or they risk data corruption.
3. Ecosystem Compatibility: Many existing C-extensions (libraries written in C) rely on the GIL for safety. They need to be updated to support the free-threaded build, or they may crash.

    Summary: Who benefits the most?

Use Case          | Benefit Level | Why?
Data Science / AI | High          | Faster preprocessing; easier parallelization of custom logic.
Web Servers       | Medium        | Better handling of mixed CPU/IO loads; lower latency.
Simple Scripts    | None          | May actually run slightly slower due to overhead.
Game Dev          | High          | Logic updates and physics can run on separate cores more easily.

    Current Status

“Free-threaded” Python is now available as an experimental option in Python 3.13 (released October 2024). It is not yet the default; you must download a specific build or compile it yourself to test it.

    Would you like me to explain how to install the Python 3.13 free-threaded experimental build to test this yourself?

After reading the description, I wanted to use free-threaded Python :) So I tried to create a new Python environment with pixi. I also installed scikit-learn because the new version of scikit-learn supports free-threaded Python.
    https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html

    $ mkdir free_thread_py314
    $ cd free_thread_py314
    $ pixi init
    $ pixi add python-freethreading==3.14
    $ pixi add scikit-learn
    $ pixi add --pypi jupyter
    $ pixi shell
    $ jupyter lab
    

The following code is my test code in a jupyter notebook. I uploaded the code to Gist.
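Before benchmarking, it's worth confirming the interpreter really runs with the GIL disabled. A small check, assuming Python 3.13+ where `sys._is_gil_enabled()` exists:

```python
import sys

def gil_enabled() -> bool:
    # standard builds always hold the GIL; free-threaded builds can report False
    if hasattr(sys, '_is_gil_enabled'):
        return sys._is_gil_enabled()
    return True  # older Pythons have no free-threaded mode at all

print('GIL enabled:', gil_enabled())
```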


I tested grid search for parameter optimization, but the performance did not improve compared to the default ‘loky’ setting or the ‘threading’ (no-GIL) setting.

Then I tried a simple calculation with 3 backends: ‘loky’, ‘threading’ and ‘multiprocessing’.
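My comparison was along these lines (a sketch with an assumed toy workload, not the exact Gist code):

```python
import math
import time
from joblib import Parallel, delayed

def work(n):
    # CPU-bound toy task
    return sum(math.sqrt(i) for i in range(n))

if __name__ == '__main__':
    # time the same job under each joblib backend
    for backend in ('loky', 'threading', 'multiprocessing'):
        start = time.perf_counter()
        results = Parallel(n_jobs=4, backend=backend)(
            delayed(work)(200_000) for _ in range(8))
        print(f'{backend}: {time.perf_counter() - start:.2f} s')
```

On a standard (GIL) build, ‘threading’ cannot speed up this kind of CPU-bound work; on a free-threaded build the threads can actually run in parallel.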

As you can see, the default ‘loky’ setting was the fastest option of the three trials.

But I found that no-GIL does work, which I confirmed by checking the processes with the ‘top’ command.

When I ran the code with the ‘loky’ option, 5 processes each using ~20% CPU were launched.

But when I ran the code with the ‘threading’ option, I found a single process using over 200% CPU.

I think this means that a free-threaded Python environment can use CPUs more efficiently for CPU-bound tasks.

BTW, RDKit does not support free-threaded Python yet. I don’t know whether the performance of RDKit would improve with no-GIL or not.

    Similarity screening with RDKit #RDKit #SimilarityScreener #memo #cheminformatics

Similarity-based screening is one of the common ways to explore SAR rapidly. For example, if you get a hit compound but lack the human resources to make analogue compounds, catalogue SAR is a useful way to expand the SAR.

Of course, ‘SIMILARITY’ is a really difficult term in cheminformatics. There are lots of metrics for measuring compound similarity, but I will not cover that topic in this post :)

As most readers know, RDKit offers lots of cheminformatics functions. So we can screen for similar compounds with basic RDKit functions: calculate Morgan FPs, compute the Tanimoto similarity between a probe compound and a supplier database such as a catalogue database, then pick the top X similar compounds. I often use this procedure in my work. Today I tried another method named SimilarityScreener.
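The basic manual procedure described above can be sketched like this (toy catalogue and probe of my own choosing):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Morgan FPs + Tanimoto, then pick the top-N most similar catalogue compounds
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
probe = fpgen.GetFingerprint(Chem.MolFromSmiles('c1ccc2[nH]ccc2c1'))  # indole
catalogue = ['c1ccc2[nH]cnc2c1', 'CCO', 'c1ccc2occc2c1']
fps = [fpgen.GetFingerprint(Chem.MolFromSmiles(s)) for s in catalogue]
sims = DataStructs.BulkTanimotoSimilarity(probe, fps)
top2 = sorted(zip(catalogue, sims), key=lambda x: x[1], reverse=True)[:2]
for smi, sim in top2:
    print(f'{smi}\t{sim:.3f}')
```

SimilarityScreener wraps essentially this loop (plus the supplier handling) into a reusable class.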

    https://www.rdkit.org/docs/source/rdkit.Chem.Fingerprints.SimilarityScreener.html#

    SimilarityScreener is really easy to use. I tested the method with the Kinase SARfari dataset. Kinase SARfari is one of the legacy datasets of ChEMBL.
    https://chembl.gitbook.io/chembl-interface-documentation/legacy-resources

    OK let’s write code. At first I imported some required methods.

    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator
    from rdkit.Chem.Draw import IPythonConsole
    from rdkit import Chem, DataStructs
    from rdkit.Chem.Fingerprints import SimilarityScreener
    from rdkit.Chem import Draw
    from rdkit import RDLogger
    RDLogger.DisableLog('rdApp.*') #Disable all rdkit related log (it's not recommended way :P)
    import gzip
    

    Then define a fingerprinter function and load the data. Any suitable fingerprint can be used, such as Morgan, AtomPair, Avalon, etc.

    def fingerprinter(mol):
        fpgen = rdFingerprintGenerator.GetRDKitFPGenerator(maxPath=7,
                                                           minPath=2,
                                                           fpSize=2024)
        return fpgen.GetFingerprint(mol)
    
    inf = gzip.open('./ks_compound.sdf.gz')
    # ForwardSDMolSupplier did not work with the screener, so I re-wrote the file and used SDMolSupplier
    suppl = Chem.ForwardSDMolSupplier(inf)
    mols = [m for m in suppl if m is not None]
    print(len(mols))
    >>53962
    w = Chem.SDWriter('kinase_sar.sdf')
    for m in mols:
        w.write(m)
    w.close()
    suppl = Chem.SDMolSupplier('kinase_sar.sdf')
    # tofacitinib as probe
    prob = fingerprinter(Chem.MolFromSmiles('C[C@@H]1CCN(C[C@@H]1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N'))
    

    I used tofacitinib as the probe molecule. The following code shows examples of TopN screening and threshold-based screening.

    metric = DataStructs.TanimotoSimilarity
    screener = SimilarityScreener.TopNScreener(10,
                                               metric=metric,
                                               probe=prob,
                                               fingerprinter=fingerprinter,
                                               dataSource=suppl,
                                              )
    matches = [m for m in screener]
    print(len(matches))
    >>> 10
    Draw.MolsToGridImage([row[1] for row in matches], legends=[f"{row[0]:.2}" for row in matches], molsPerRow=5)
    
    metric = DataStructs.TanimotoSimilarity
    screener = SimilarityScreener.ThresholdScreener(0.6,
                                               metric=metric,
                                               probe=prob,
                                               fingerprinter=fingerprinter,
                                               dataSource=suppl,
                                              )
    matches = [m for m in screener]
    print(len(matches))
    >>> 81
    Draw.MolsToGridImage([row[1] for row in matches[:10]], legends=[f"{row[0]:.2}" for row in matches[:10]], molsPerRow=5)
    

    As the code shows, SimilarityScreener offers an easy way to screen compounds. Only a few lines of code are required to perform similarity-based screening. It’s worth knowing the function.

    But the function is not so fast, so if you would like to screen millions or billions of compounds I would not recommend using it :)

    It was interesting for me to write code with unfamiliar functions of RDKit.

    Install roshambo2 to pixi env #memo #cheminformatics #RDKit

    Recently I’m using pixi for environment management, because it works very fast and can handle not only conda packages but also PyPI packages in a local .pixi env.

    Today, I tried to install roshambo2 into a pixi env. The original article was published in JCIM and the code is available under the MIT license.
    Article
    https://pubs.acs.org/doi/10.1021/acs.jcim.5c01322
    Code
    https://github.com/molecularinformatics/roshambo2/tree/main

    In this article the authors show the performance of ROSHAMBO2 on a gaming GPU. For example, they reported that ~160 million unique query-ligand overlap evaluations took only ~6 minutes with a single RTX 4090 (24 GB VRAM). It seems like amazing performance! If you are interested in the article, please check it out :)

    I introduced the code in a previous post, where I installed roshambo2 in the basic way described in the README.

    Let’s install roshambo2 into a pixi env! Before doing so, gcc, g++, cmake, the NVIDIA driver and the CUDA toolkit should be installed (of course a GPU is required).

    $ gh repo clone molecularinformatics/roshambo2
    $ cd roshambo2
    $ pixi import --import environment.yaml
    # install packages which are listed in environment.yaml
    $ pixi shell
    

    Next I modified CMakeLists.txt, because during the build process CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES was not found automatically. So I hard-coded the include path as an absolute path.

    project(_roshambo2 LANGUAGES CXX CUDA)
    
    # Find OpenMP package
    find_package(OpenMP REQUIRED)
    
    # Find CUDA
    find_package(CUDA REQUIRED)
    
    # Find pybind
    find_package(pybind11)
    
    # I modified the line.
    #include_directories("${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}")
    include_directories(/usr/include)
    
    ~snip~
    

    Then install roshambo2 with the pip command instead of ‘pixi add’. I got other error messages when I used pixi add file://absolutepath --pypi.

    $ pip install .
    # Install process worked :)
    

    I also built the documentation.

    $ pixi add myst-parser
    $ cd doc
    $ make html
    

    After typing the commands above, the documents were generated in the doc/build directory.

    The documents are useful for checking the API and examples.

    Roshambo2 supports a dataset preparation process via a CLI interface.

    From the document,
    “The slowest part of the program is the assignment of color features using RDKit. The second slowest part is reading in the 3D SDF files. To ease the searching of large datasets we created a Roshambo2 data format (h5 file) and a script that will read in 3D SDF files, assign color features if requested, and create formatted Roshambo2 H5 files. The H5 files can be read in very quickly. The idea is you can prepare the dataset H5 files ahead of time and then run searches quickly each time you have a new query molecule.”

    $ prepare_dataset_from_sdf.py --color dataset.sdf processed_dataset.h5
    

    I will try to use roshambo2 against a large compound dataset.

    Run MD simulation with Openff&Openmm on pixi’s env #cheminformatics #RDKit #pixi #memo

    My background is organic chemistry, but now I’m working as a cheminformatician. So I have lots of experience in cheminformatics and medicinal chemistry but not much experience in molecular simulation. But I’m interested in the simulation field, and OpenMM and OpenFF are some of my favorite packages for learning MD.

    OpenMM and OpenFF are actively developed, and Jeff shared useful notebooks at RDKit UGM 2024 and 2025. I attended his lecture at RDKit UGM 2024 but unfortunately missed it this year because I did not participate in the meeting.

    The link to the materials is below.

    https://github.com/openforcefield/rdkit_ugm_2025_demo

    Fortunately the notebook is shared on the openforcefield GitHub site. So I tried to run the code in a pixi env. It’s my first attempt to build a conda-style virtual environment using pixi :)

    OK let’s try it!

    # clone code with github CLI (I like github CLI :-))
    $ gh repo clone openforcefield/rdkit_ugm_2025_demo
    $ cd rdkit_ugm_2025_demo
    

    If you would like to build the env with conda, the procedure is well documented in README.md. Just type the commands below.

    # using mamba instead of conda is highly recommended I think
    mamba env create -y -f environment.yml
    mamba activate openff-ugm-2025
    

    But I would like to use pixi. So I need to modify the procedure.

    # at rdkit_ugm_2025_demo
    $ cp environment.yml environment.yml.bk
    # edit the following lines in environment.yml because pixi cannot parse the git+https protocol
    # openff-toolkit 0.16.8 does not support NAGL FF which is used in the notebook
     -  - openff-toolkit-examples =0.16.8 
     +  - openff-toolkit-examples =0.17.1
     -   - pip:
     -       - git+https://github.com/openforcefield/openff-pablo.git@v0.0.1a1
    

    Then I used the pixi init command with the import option. With that option, pixi makes a pixi.toml file for environment building.

    $ pixi init --import environment.yml
    # typing pixi shell installs all packages (same as conda env create -f environment.yml)
    $ pixi shell
    # after building the env, I installed openff-pablo into the env with the pixi add command; it is similar to conda install
    $ pixi add --git https://github.com/openforcefield/openff-pablo.git openff-pablo --tag v0.0.1a1 --pypi
    

    Then I could run the notebook in the pixi env.

    # before activating env
    iwatobipen ~/dev/rdkit_ugm_2025_demo  main(+16224/-15133)[📝?✓] 🐍 v3.11.14 📦  
    on 🌀 ➜ which python
    
    iwatobipen ~/dev/rdkit_ugm_2025_demo  main(+16338/-15149)[📝?✓] 🐍 v3.11.14 📦  
    on 🌀 ➜ pixi shell
    
    # after activating env
    iwatobipen ~/dev/rdkit_ugm_2025_demo  main(+16338/-15149)[📝?✓] 🐍 v3.11.14 📦 default 
    on 🌀 ➜ which python
    /home/iwatobipen/dev/rdkit_ugm_2025_demo/.pixi/envs/default/bin/python
    
    

    The following code is the same as the original code. There is nothing new.

    from pathlib import Path
    
    import ipywidgets as widgets
    import mdtraj
    import nglview
    import numpy
    import numpy as np
    import openmm
    import openmm.unit as omm_unit
    import rdkit
    from openff.interchange import Interchange
    from openff.interchange.drivers.openmm import get_openmm_energies
    from openff.toolkit import AmberToolsToolkitWrapper, ForceField, Molecule, Topology
    from openff.toolkit.utils.nagl_wrapper import NAGLToolkitWrapper
    from openff.units import Quantity, ensure_quantity, unit
    from openff.units.openmm import from_openmm
    from openmm.app import Simulation
    from pdbfixer import PDBFixer
    
    # Warm up NAGL
    ntkw = NAGLToolkitWrapper()
    ntkw.assign_partial_charges(Molecule.from_smiles('C'), "openff-gnn-am1bcc-0.1.0-rc.3.pt")
    

    I think openff-interchange is the key to making MD simulations easier. openff-interchange handles the information required for MD of ligands, solvents and proteins, and it can connect to various MD engines, not only OpenMM but also GROMACS, AMBER, LAMMPS, etc. You can learn about openff-interchange from the original documentation. It’s worth reading. https://docs.openforcefield.org/projects/interchange/en/stable/using/intro.html

    Another interesting part is the last two lines, which use a GNN for calculating partial charges of ligands. Recently, deep-learning-based methods have become common for calculating partial charges, because it is a time-consuming step and a GNN works very fast (of course it is not always accurate).

    The example in the repository is not recommended for production usage because it is set up with a very short simulation time. But I think these kinds of materials are really useful for people like me who would like to learn how to use MD packages for their research :)

    In summary, I introduced (as a memo) the way to build an environment with pixi from a repository written for conda usage. It will be useful (at least for me) for people who would like to build an env with pixi from a GitHub repo.

    Build environment from github repository with pixi #cheminformatics #memo #pixi

    I had really fruitful discussions last week at the CBI 2025 annual meeting. I really appreciate all participants and presenters.

    I realized that I love open science and I would like to contribute to it :)

    I wrote about a new package for library management called pixi. In the previous post, I introduced pixi for making a new environment.

    There are lots of GitHub repositories available, and lots of code is installed via the pip or conda commands. So I wanted to know how to install such code with pixi, because pixi would not be so useful if installing this kind of code were difficult.

    I tried to make a chemprop env with pixi.

    The procedure is below.

    $ gh repo clone chemprop/chemprop
    $ cd chemprop
    $ pixi init --format pyproject
    ✔ Added package 'chemprop' as an editable dependency.
    ✔ Added environments 'hpopt', 'test'
    

    After running the code, pyproject.toml was modified.

    [build-system]
    requires = ["setuptools>=45", "wheel", "setuptools_scm[toml]>=6.2"]
    build-backend = "setuptools.build_meta"
    
    [project]
    name = "chemprop"
    description = "Molecular Property Prediction with Message Passing Neural Networks"
    version = "2.2.1"
    authors = [
        {name = "The Chemprop Development Team (see LICENSE.txt)", email="chemprop@mit.edu"}
    ]
    readme = "README.md"
    license = {text = "MIT"}
    classifiers = [
    	"Programming Language :: Python :: 3",
    	"Programming Language :: Python :: 3.11",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent"
    ]
    keywords = [
        "chemistry",
        "machine learning",
        "property prediction",
        "message passing neural network",
        "graph neural network",
        "drug discovery"
    ]
    requires-python = ">=3.11"
    dependencies = [
        "lightning >= 2.0",
        "numpy",
        "pandas",
        "rdkit",
        "scikit-learn",
        "scipy",
        "torch >= 2.1",
        "astartes[molecules]",
        "ConfigArgParse",
        "rich",
        "descriptastorus",
    ]
    
    [project.optional-dependencies]
    hpopt = ["ray[tune]", "hyperopt", "optuna"]
    dev = ["black == 23.*", "bumpversion", "autopep8", "flake8", "pytest", "pytest-cov", "isort"]
    docs = ["nbsphinx", "sphinx", "sphinx-argparse != 0.5.0", "sphinx-autobuild", "sphinx-autoapi", "sphinxcontrib-bibtex", "sphinx-book-theme", "nbsphinx-link", "ipykernel", "docutils < 0.21", "readthedocs-sphinx-ext", "pandoc"]
    test = ["pytest >= 6.2", "pytest-cov"]
    notebooks = ["ipykernel", "matplotlib"]
    
    [project.urls]
    documentation = "https://chemprop.readthedocs.io/en/latest/"
    source = "https://github.com/chemprop/chemprop"
    PyPi = "https://pypi.org/project/chemprop/"
    
    [project.scripts]
    chemprop = "chemprop.cli.main:main"
    
    [tool.black]
    line-length = 100
    target-version = ["py311"]
    skip-magic-trailing-comma = true
    required-version = "23"
    
    [tool.autopep8]
    in_place = true
    recursive = true
    aggressive = 2
    max_line_length = 100
    
    [tool.pytest.ini_options]
    addopts = "--cov chemprop"
    markers = [
        "integration",
        "CLI",
    ]
    
    [tool.isort]
    profile = "black"
    line_length = 100
    force_sort_within_sections = true
    
    [tool.setuptools.packages.find]
    include = ["chemprop"]
    exclude = ["tests", "examples", "docs", "requirements", ".github"]
    
    [tool.pixi.workspace]
    channels = ["https://conda.modular.com/max-nightly", "conda-forge"]
    platforms = ["linux-64"]
    
    [tool.pixi.pypi-dependencies]
    chemprop = { path = ".", editable = true }
    
    [tool.pixi.environments]
    default = { solve-group = "default" }
    dev = { features = ["dev"], solve-group = "default" }
    docs = { features = ["docs"], solve-group = "default" }
    hpopt = { features = ["hpopt"], solve-group = "default" }
    notebooks = { features = ["notebooks"], solve-group = "default" }
    test = { features = ["test"], solve-group = "default" }
    
    [tool.pixi.tasks]
    

    Then I modified one line, because a Python 3.14 env would be created with the toml file but ray[tune] does not support Python 3.14.

    - requires-python = ">=3.11"
    + requires-python = "<=3.12"
    

    Then type following command.

    $ pixi shell
    # installed required packages
    $ which python
    /home/iwatobipen/dev/chemprop/.pixi/envs/default/bin/python
    $ pixi add jupyter # add additional packages with the pixi add command
    

    By using pixi, all environment information is stored in the .pixi/envs/ directory.

    Today I showed how to build an environment with pixi from a GitHub repo. It’s really useful because we have lots of code dependency issues, and pixi may be a savior for cheminformatics.

    Try to use new package manager for python #pixi #memo

    October has arrived, and Japan has become quite cool. It’s a great season for running now.

    By the way, package management is an important task for lots of data scientists, because they use lots of packages and each package depends on other packages. Especially with CUDA, I often struggle with these issues.

    Anaconda is a useful package manager, but it sometimes takes a long time to solve dependencies. Mamba is one solution. I like it :)

    Recently I tried to move from conda to uv for package management. But uv does not support management of conda packages, so it is difficult to migrate all environments to uv, because much of the code shared on GitHub depends on conda and provides its dependencies as an environment.yml file.

    I googled how to manage conda packages with uv, but there was no suitable solution. Instead I found a new package manager called pixi. Pixi is developed in Rust, the same as uv, so it works very fast!

    Today I tried to use pixi. Installing pixi is really easy.

    curl -fsSL https://pixi.sh/install.sh | bash
    

    After the install process, I could use the pixi command from the terminal. OK, let’s make a test env.

    pixi init pixi-cheminfo --format pyproject
    iwatobipen🌱 /home/iwatobipen/dev took 7s 
    ➜ tree pixi-cheminfo/
    pixi-cheminfo/
    ├── pyproject.toml
    └── src
        └── pixi_cheminfo
            └── __init__.py
    
    3 directories, 2 files
    cd pixi-cheminfo
    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.12.11 
    ➜ cat pyproject.toml 
    [project]
    authors = [{name = "iwatobipen", email = "seritala@gmail.com"}]
    dependencies = []
    name = "pixi-cheminfo"
    requires-python = ">= 3.11"
    version = "0.1.0"
    
    [build-system]
    build-backend = "hatchling.build"
    requires = ["hatchling"]
    
    [tool.pixi.workspace]
    channels = ["conda-forge"]
    platforms = ["linux-64"]
    
    [tool.pixi.pypi-dependencies]
    pixi_cheminfo = { path = ".", editable = true }
    
    [tool.pixi.tasks]
    
    

    Now I could make a new env with pixi. Then activate the env, like ‘conda activate’. The pixi shell command, which works like conda activate, is shown below.

    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.12.11 
    ➜ which python
    /home/iwatobipen/conda/bin/python
    
    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.12.11 
    ➜ pixi shell
    
    
    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.14.0 via 󰏗 v0.56.0 (default) 
    ➜ which python
    /home/iwatobipen/dev/pixi-cheminfo/.pixi/envs/default/bin/python
    
    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.14.0 via 󰏗 v0.56.0 (default) 
    
    

    The pixi add command is the same as the conda install command.

    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.14.0 via 󰏗 v0.56.0 (default) 
    ➜ pixi add rdkit
    ✔ Added rdkit >=2025.9.1,<2026
    
    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.13.7 via 󰏗 v0.56.0 (default) took 10s 
    ➜ pixi add jupyter
    ✔ Added jupyter >=1.1.1,<2
    iwatobipen🌱 /home/iwatobipen/dev/pixi-cheminfo is 󰏗 v0.1.0 via  v3.13.7 via 󰏗 v0.56.0 (default) took 16s 
    ➜ ipython
    Python 3.13.7 | packaged by conda-forge | (main, Sep  3 2025, 14:30:35) [GCC 14.3.0]
    Type 'copyright', 'credits' or 'license' for more information
    IPython 9.6.0 -- An enhanced Interactive Python. Type '?' for help.
    Tip: Put a ';' at the end of a line to suppress the printing of output.
    
    In [1]: from rdkit import Chem
    
    In [2]: mol = Chem.MolFromSmiles('CC')
    
    In [3]: mol.GetNumAtoms()
    Out[3]: 2
    
    exit
    
    ➜ cat pyproject.toml 
    [project]
    authors = [{name = "iwatobipen", email = "seritala@gmail.com"}]
    dependencies = []
    name = "pixi-cheminfo"
    requires-python = ">= 3.11"
    version = "0.1.0"
    
    [build-system]
    build-backend = "hatchling.build"
    requires = ["hatchling"]
    
    [tool.pixi.workspace]
    channels = ["conda-forge"]
    platforms = ["linux-64"]
    
    [tool.pixi.pypi-dependencies]
    pixi_cheminfo = { path = ".", editable = true }
    
    [tool.pixi.tasks]
    
    [tool.pixi.dependencies]
    rdkit = ">=2025.9.1,<2026"
    jupyter = ">=1.1.1,<2"
    
    

    Pixi is useful for package management, the same as conda and uv.

    I would like to check pixi’s documentation in more detail.

    GPU based fast Shape alignment of molecules #RDKit #Roshambo2 #Cheminformatics

    Most readers know that the power of GPUs is changing the way of cheminformatics. I introduced nvMolKit, which is developed by NVIDIA for cheminformatics tasks. Now we can handle huge amounts of data in a short time :)

    nvMolKit can calculate compound similarity and conformations rapidly with a GPU, but GPU-assisted alignment is not implemented.

    A ligand-based approach is a common way of drug discovery even if AlphaFold3 is available these days. So a rapid ligand-based alignment method is useful for finding new scaffolds. Last year, I introduced ROSHAMBO for GPU-based molecular alignment.
    https://iwatobipen.wordpress.com/2024/08/08/new-cheminformatics-package-for-molecular-alignment-and-3d-similarity-scoring-cheminformatics-rdkit-memo/comment-page-1/

    ROSHAMBO ver. 1 worked very well, but the package was a little bit difficult to install. Fortunately I found a new version of ROSHAMBO in JCIM. And the authors disclosed the code on GitHub!
    Article https://pubs.acs.org/doi/full/10.1021/acs.jcim.5c01322
    Github https://github.com/molecularinformatics/roshambo2/tree/main

    I had time for writing code this weekend, so I tried to use ROSHAMBO2.

    At first, I built an env for roshambo2.

    gh repo clone molecularinformatics/roshambo2
    cd roshambo2
    conda env create -n roshambo2 -f environment.yaml
    conda activate roshambo2
    pip install .
    pip install moleculekit #optional for feature vizualization
    mamba install -c conda-forge jupyter pymol-open-source #optional for feature vizualization
    

    After installation, I tried to align molecules with the cdk2.sdf dataset.

    from rdkit import Chem
    mols = [m for m in Chem.SDMolSupplier('./cdk2.sdf', removeHs=False)]
    # make query molecule from cdk2.sdf
    w = Chem.SDWriter('top.sdf')
    w.write(mols[0])
    w.close()
    
    # shape based align and calculate score (Shape Tanimoto)
    from roshambo2 import Roshambo2
    roshambo2_calc = Roshambo2('top.sdf', 'cdk2.sdf')
    scores = roshambo2_calc.compute()
    # save aligned molecules as sdf
    roshambo2_calc.write_best_fit_structures(hits_sdf_prefix='hits_for_query')
    

    After running the code above I could get a CSV and an SDF which contains the aligned molecules. The CSV file has scores and SMILES, but there is no color score, of course.

    Let’s calculate the Tanimoto combo (shape and color). It’s almost the same as the previous code but with the color=True option added.

    roshambo2_calc_col = Roshambo2("top.sdf", "cdk2.sdf", color=True)
    scores = roshambo2_calc_col.compute(optim_mode='combination') 
    
    feature_to_symbol = {'Donor':'H', 'Acceptor':'He', 'PosIonizable':'Li', 'NegIonizable':'Be', 'Aromatic':'B', 'Hydrophobe':'C'}
    
    roshambo2_calc_col.write_best_fit_structures(hits_sdf_prefix='hits_for_query',
     feature_to_symbol_map=feature_to_symbol)
    

    In this case I could get the color score too.
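For reference, the Tanimoto combo is commonly defined (e.g. in ROCS-style scoring) as the sum of the shape and color Tanimoto scores, so it ranges from 0 to 2. A trivial sketch with made-up scores:

```python
def tanimoto_combo(shape_tanimoto: float, color_tanimoto: float) -> float:
    # TanimotoCombo as commonly defined for shape/color overlays:
    # the sum of the shape and color Tanimoto scores, ranging 0-2
    return shape_tanimoto + color_tanimoto

# hypothetical scores, for illustration only
print(round(tanimoto_combo(0.82, 0.55), 2))  # → 1.37
```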

    And the PyMOL views are above. Dots and spheres are features, sky blue is the aligned molecule and pink is the original orientation of the target molecule. Yellow is the query molecule. As you can see, the target molecule is well aligned.

    Roshambo2 works not only via CLI but also in server mode in a multi-GPU environment, and it can generate datasets in HDF5 format. So it is suitable for handling large datasets.

    In summary, ROSHAMBO2 is a powerful package for cheminformatics. Thanks for developing such a useful package!

    RDKit meets GPU #RDKit #nvmolkit #nvidia #cheminformatics

    Unfortunately I could not participate in RDKit UGM 2025 this year… I would like to join the meeting next year.

    By the way, recently we can use GPUs to accelerate chem/bio informatics calculations such as deep learning applications or clustering tasks. NVIDIA’s RAPIDS is one of the famous packages for GPU-based data science.

    But there was no package which accelerates RDKit functions directly. RDKit natively works fast because most parts are implemented in C++, even if a GPU is not available.

    This week, NVIDIA’s team disclosed a really cool package named nvMolKit. You can check the details at the following URL.
    https://research.nvidia.com/labs/dbr/blog/nvMolKit/

    Fortunately they shared the code! So I tried to install and use nvMolKit.

    nvMolKit requires an NVIDIA GPU with compute capability 7.0 (V100) or higher. My notebook PC has a gaming GPU (GeForce RTX 4060), so it meets the requirements.

    At first, I installed some packages in the same way as the README procedure.

    # Update package list
    sudo apt-get update
    
    # Install build tools and development headers
    sudo apt-get install build-essential libeigen3-dev
    sudo apt-get install libstdc++-12-dev libomp-15-dev
    
    # nvMolKit requires a C++ compiler. You can install it system-wide or via conda:
    
    # Example: Install clang on Ubuntu:
    sudo apt-get install clang-15 clang-format-15 clang-tidy-15
    

    Then I installed the CUDA toolkit.

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    # newer versions do not support compute capability 7.0, so I installed an older version. It depends on your GPU.
    sudo apt-get -y install cuda-toolkit-12-8
    

    Then install nvMolkit.

    # Remove old CMake
    sudo apt remove --purge --auto-remove cmake
    
    # Install CMake 3.30.1
    wget https://github.com/Kitware/CMake/releases/download/v3.30.1/cmake-3.30.1-linux-x86_64.sh
    chmod +x cmake-3.30.1-linux-x86_64.sh
    sudo ./cmake-3.30.1-linux-x86_64.sh --prefix=/usr/local --skip-license
    
    # Create and activate environment
    conda create --name nvmolkit_dev_py312 python=3.12.1
    conda activate nvmolkit_dev_py312
    
    # Install RDKit with development headers
    conda install -c conda-forge rdkit=2024.09.3 rdkit-dev=2024.09.3
    
    # Install Boost subpackages in case RDKit install did not include them transitively
    conda install -c conda-forge libboost libboost-python libboost-devel libboost-headers libboost-python-devel
    
    # Install Torch, make sure it's a GPU-enabled version. If having trouble install, check out the
    # torch installation guidelines: https://pytorch.org/get-started/locally/
    pip install torch torchvision torchaudio
    python -c "import torch; print(torch.__version__); print(f'Is a CUDA build? {torch.cuda.is_available()}')"
    
    
    # Activate your environment
    conda activate nvmolkit_dev_py312
    
    gh repo clone NVIDIA-Digital-Bio/nvMolKit
    cd nvMolKit
    CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) pip -v install .
    

    The GCC version should be <=13.x for my environment. The process failed when I used GCC ver. 14.

    After compilation, I installed Jupyter via conda.

    Now everything is finished. I wrote test code for a GPU/CPU comparison.

    The test code is almost the same as the example notebook.

    I could run the code from terminal.

    python example_conf.py
    >
    Loaded 100 molecules from ../benchmarks/data/MPCONF196.sdf
    # with GPU
    Conformer generation completed in 7.38 seconds
    Generated 471 total conformers
    Rate: 63.8 conformers/second
    # with CPU
    Conformer generation completed in 127.52 seconds
    Generated 500 total conformers
    Rate: 3.9 conformers/second
    
    

    My CPU implementation uses a loop, which might be a very inefficient way (bad code). Compared to that code, the GPU version worked very fast, because nvMolKit embeds molecules as a batch. If the user has multiple GPUs, nvMolKit can use them, which will increase the performance further.

    The current nvMolKit supports the following functions:

    • Morgan FP calculation, Tanimoto and cosine similarity, MMFF minimization and conformer generation.

    These functions are time-consuming steps when handling huge compound datasets. In summary, nvMolKit is a really cool and useful package for cheminformatics.

    I hope the package will be maintained sustainably.

    Contribute open science #memo #dirary

    I am enjoying a weekend with free time for the first time in a while :)
    I checked my blog site and found that I started it on 2012/08/10. It means that I’ve kept writing the blog for almost 13 years.

    When I started the blog I was a wet-lab medicinal chemist, but now I’m a cheminformatician. When I first started working, I never imagined that I would end up pursuing such a career path.

    I have had lots of opportunities to discuss science through these activities, including writing blog posts, SNS (X or Bluesky) and conferences.

    And now AI technologies are moving very fast. I could not have expected this movement when I started my career. Young researchers in my group can use such technologies and apply them to their problems quickly. I feel that I need to catch up with the cutting edge of AI technology as soon as possible when I work with them.

    Recent AI technologies are a mixed bag IMHO, so we need to evaluate these technologies based on sound science.

    Some people say or think that by using AI technology chemists don’t need to design molecules, just make them. Hmm… I don’t think so. AI should be a good partner for researchers. I’m getting tired of the hype surrounding AI technology.

    Current progress in AI technology is supported by the open science community. For example, Boltz and Chai are among the hot areas of current science. Also in the cheminformatics area, RDKit, Open Babel and CDK are important tools. Most current AI/cheminformatics tools use these packages. So I think open science is really important for the progress of science, and I would like to contribute to these communities.

    Finally, I would like to discuss the future of drug discovery with AI. What do readers think about it?

     Useful workflow of medicinal chemistry knowledge #Knime #Cheminformatics #MMP

    I spent a really busy summer vacation this year, so it was difficult to make time to post to the blog ;-) Today I had time to write a short post. Thank you for reading, as always.

    Bioisosteric replacement is one of the important strategies for drug design: for example, improving biological activity without affecting ADME properties, or improving ADME properties without reducing biological activity.

    There are lots of isosteres of the benzene ring, because the benzene ring is a common part of drugs but it increases lipophilicity. So medicinal chemists try to replace the benzene ring to reduce compound logP and improve ADMET properties.

    Some days ago, I read a useful JMC article. The title is ‘A Data-Driven Perspective on Bioisostere Evaluation: Mapping the Benzene Bioisostere Landscape with BioSTAR’ and the link to the article is below (open access!!!)

    The authors show the cutting edge of bioisosteric replacement of the benzene ring, and they disclosed an analysis workflow for KNIME.

    I’m interested in the WF, so I tried to use it.

    It’s really easy to use. Thanks to the authors for sharing such a useful workflow!

    Reader who has interest the WF, you can get it from following URL ;)
    https://hub.knime.com/polhllado/spaces/Public/BioSTAR%20workflow~5Ns3wwqRutwZl3y0/current-state

    I installed Knime 5.5.1 on my PC and ran the WF.

    The new version of Knime provides a data preview window for each node. I think it's really useful for checking each step of the process.

    I like coding, but I feel Knime is a powerful tool for doing cheminformatics without coding.

    Knime is a really cool tool for data analysis!

    New code for evaluating ligand strain calculation #RDKit #StrainRelief #Cheminfo

    The Japanese summer is very hot and humid most days. It's not so good for running.
    It's important to stay hydrated. I'm keeping up my lunchtime 5 km run when I work from home.

    Fortunately I could make time to read scientific articles this weekend and found an interesting article on arXiv. I would like to share the nice research.

    https://arxiv.org/abs/2503.13352 And the article was recently published in JCIM.

    https://pubs.acs.org/doi/10.1021/acs.jcim.5c00586

    Ligand strain is important for binding to a target protein. If a molecule binds the protein with high strain energy, it's not a favorable pose. So evaluating strain energy is important for filtering unwanted conformers from a dataset and/or considering favorable poses for SBDD.

    In this article the authors propose a new method named StrainRelief, which enables accurate evaluation of ligand strain energy. It uses a neural network potential trained on a QM dataset.

    Ligand strain is defined as the difference between the local minimum energy and the global minimum energy.
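That definition can be sketched with plain RDKit and MMFF94 as a rough stand-in for the neural network potential (this is my illustration, not the StrainRelief implementation):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# a small flexible molecule as a stand-in for a docked ligand
mol = Chem.AddHs(Chem.MolFromSmiles("OCCc1ccccc1"))
AllChem.EmbedMultipleConfs(mol, numConfs=20, randomSeed=42)

# minimise every conformer; returns (not_converged, energy) pairs
results = AllChem.MMFFOptimizeMoleculeConfs(mol)
energies = [energy for _, energy in results]

local_min = energies[0]          # relaxed "input pose" (first conformer here)
global_min = min(energies)       # best conformer of the ensemble
strain = local_min - global_min  # ligand strain, >= 0 by construction
print(f"strain: {strain:.2f} kcal/mol")
```

StrainRelief replaces the force field with a MACE neural potential and handles the conformer search more carefully, but the bookkeeping is the same.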

    The article shows that their proposed method outperforms other methods such as MMFF94s and ANI-based methods.

    Fortunately the method is shared on GitHub! Today I tried to use the code.

    The following example is almost the same as the example code in the original repository.

    https://github.com/prescient-design/StrainRelief

    Let’s make environment at first!

    $ gh repo clone prescient-design/StrainRelief
    $ cd StrainRelief
    $ mamba env create -f env.yml
    $ mamba activate strain
    $ pip install -e .
    
    $ pip install --force-reinstall e3nn==0.5 fairchem-core
    
    $ pre-commit install
    $ conda activate strain
    

    Then I tried to check strain energy with the code.

    from hydra import compose, initialize
    from omegaconf import OmegaConf
    from rdkit import Chem
    from rdkit.Chem import rdDetermineBonds
    from rdkit.Chem.Draw import IPythonConsole
    import pandas as pd
    from strain_relief.cmdline import strain_relief as sr
    IPythonConsole.ipython_3d = True
    
    # StrainRelief expects its input as a parquet file of binary RDKit molecules
    def sdf_to_parquet(sdf_file, parquet_file):
        suppl = Chem.SDMolSupplier(sdf_file, sanitize=False, removeHs=False)
        mols = [mol for mol in suppl if mol is not None]
        print(f'{len(mols)} MOLS are LOADED!')
        df = pd.DataFrame([{"mol_bytes": mol.ToBinary(), **mol.GetPropsAsDict()} for mol in mols])
        df = df.reset_index(drop=False, names='id')
        df.to_parquet(parquet_file)
    
    sdf_path = './data/example_ligboundconf.sdf'
    parquet_path = './data/example_ligboundconf_input.parquet'
    
    sdf_to_parquet(sdf_path, parquet_path)
    
    # compose the Hydra config, switching the experiment to the MACE potential
    with initialize(version_base="1.1", config_path="./src/strain_relief/hydra_config"):
        cfg = compose(
            config_name="default", 
            overrides=[
                #"experiment=mmff94s",
                "experiment=mace",
                f"io.input.parquet_path={parquet_path}"
                ]
            )
    
    results = sr(cfg)
    
    # The molecule passed strain energy check
    
    docked = Chem.Mol(results.mol_bytes[0])
    local_min = Chem.Mol(results.local_min_mol[0])
    global_min = Chem.Mol(results.global_min_mol[0])
    rdDetermineBonds.DetermineBonds(docked)
    rdDetermineBonds.DetermineBonds(local_min)
    rdDetermineBonds.DetermineBonds(global_min)
    

    The molecule passed the strain energy check, so all of the poses look similar.
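One way to make "the poses look similar" quantitative is a best-fit heavy-atom RMSD between conformers. A self-contained sketch with a hypothetical molecule (not the StrainRelief output above):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign

# two embedded conformers of the same molecule
mol = Chem.AddHs(Chem.MolFromSmiles("OCCc1ccccc1"))
AllChem.EmbedMultipleConfs(mol, numConfs=2, randomSeed=7)

# best RMSD between conformer 0 and conformer 1,
# searching over symmetry-equivalent atom mappings
rms = rdMolAlign.GetBestRMS(mol, mol, prbId=0, refId=1)
print(f"best RMSD: {rms:.2f}")
```

A small RMSD between the docked pose and the minimised poses supports the visual impression that the conformations agree.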

    The next example shows a molecule that failed the check.

    docked2 = Chem.Mol(results.mol_bytes[1])
    local_min2 = Chem.Mol(results.local_min_mol[1])
    global_min2 = Chem.Mol(results.global_min_mol[1])
    rdDetermineBonds.DetermineBonds(docked2)
    rdDetermineBonds.DetermineBonds(local_min2)
    rdDetermineBonds.DetermineBonds(global_min2)
    
    

    In the bottom conformation, the global minimum, the hydroxyl group seems to make a hydrogen bond, but the local minimum conformation does not.

    In summary, StrainRelief is useful code for evaluating ligand conformations.

    If you're interested in the code, please try it.

    Thanks for sharing the code!

    Apply molecular filter more easily #RDKit #Datamol #Cheminformatics

    Molecular filters are important for removing unwanted molecules from huge compound datasets. And there are lots of filters available these days: for example, PAINS, the Rule of 5, CNS rules, etc…

    These filters are publicly available, but it's difficult to use them all from one platform. However, the datamol and medchem packages support lots of compound filters!

    You can read the documentation of datamol at the following URLs.
    https://docs.datamol.io/stable/
    https://datamol.io/#medchem

    Today I would like to give a brief introduction to compound filtering with datamol/medchem.

    The medchem package also supports the Lilly medchem rules. I introduced that function almost two years ago :) https://iwatobipen.wordpress.com/2023/12/17/useful-package-for-filtering-molecules-of-python-rdkit-python-memo/

    Today I would like to show a basic example of medchem filters. I uploaded my example code to a gist.

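As a minimal stand-in for the gist, here is the same idea with RDKit's built-in FilterCatalog for PAINS; the medchem package wraps this kind of substructure filtering (plus many more rule sets) in a higher-level API:

```python
from rdkit import Chem
from rdkit.Chem import FilterCatalog

# build a PAINS filter catalog
params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog.FilterCatalog(params)

smiles = [
    "CC(=O)Nc1ccc(O)cc1",       # paracetamol, expected to pass
    "O=C1NC(=S)SC1=Cc1ccccc1",  # benzylidene rhodanine, a classic PAINS hit
]
mols = [Chem.MolFromSmiles(s) for s in smiles]
flags = [catalog.HasMatch(m) for m in mols]
print(flags)
```

Flagged molecules can then be dropped from the dataset or sent for visual inspection.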

    The code is almost the same as the examples in the original documentation. I think these packages are really useful for cheminformaticians because datamol supports lots of cheminformatics-related tasks.

    Recently we can generate huge numbers of molecules with AI, so I think the importance of compound filters is increasing, because it's difficult to filter molecules by visual inspection.

    If you're interested in the package, I recommend trying it.

    Use MCP for making query of Database #cheminformatics #AI

    Recently I've become interested in MCP, because MCP can connect tasks and APIs with natural language so we don't need code (BTW, I love writing code!!!!).

    Today I tried to make SQL queries with MCP.

    I used Claude Desktop. At first I modified the config file as shown below; the postgres part is the addition.

    $ vim ~/.config/Claude/claude_desktop_config.json
    {
        "mcpServers": {
            "weather": {
                "command": "uv",
                "args": [
                    "--directory",
                    "/home/iwatobipen/dev/weather",
                    "run",
                    "weather.py"
                ]
            },
            "chemblmcp": {
                "command": "uv",
                "args": [
                    "--directory",
                    "/home/iwatobipen/dev/chembl_mcp",
                    "run",
                    "chemblmcp.py"
                ]
            },
            "postgres": {
                "command": "npx",
                "args": [
                    "-y",
                    "@modelcontextprotocol/server-postgres",
                    "postgresql://iwatobipen@localhost:5432/chembl_35"
                ]
            }
        }
    }

    Then I launched Claude Desktop and asked the following question.

    Then the LLM made a query and searched for the information.

    Request
    
    {
      `sql`: `SELECT * FROM information_schema.tables WHERE table_name LIKE '%chembl%' OR table_name LIKE '%compound%' OR table_name LIKE '%activity%' OR table_name LIKE '%bioactivity%';`
    }
    Response
    
    [
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "activity_stds_lookup",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "activity_properties",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "activity_smid",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "activity_supp_map",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "activity_supp",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_properties",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_structural_alerts",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "chembl_id_lookup",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "chembl_release",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_structures",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "pg_catalog",
        "table_name": "pg_stat_activity",
        "table_type": "VIEW",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "NO",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_records",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      }
    ]
    Great! I can see this is a ChEMBL database. Let me look for the main activities table and compounds table:
    
    
    Request
    
    {
      `sql`: `SELECT * FROM information_schema.tables WHERE table_name LIKE '%activities%' OR table_name LIKE '%compound%' OR table_name LIKE '%molecule%';`
    }
    Response
    
    [
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "activities",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_properties",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_structural_alerts",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_structures",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "molecule_atc_classification",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "molecule_hierarchy",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "molecule_synonyms",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "molecule_dictionary",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "molecule_frac_classification",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "molecule_hrac_classification",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "compound_records",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      },
      {
        "table_catalog": "chembl_35",
        "table_schema": "public",
        "table_name": "molecule_irac_classification",
        "table_type": "BASE TABLE",
        "self_referencing_column_name": null,
        "reference_generation": null,
        "user_defined_type_catalog": null,
        "user_defined_type_schema": null,
        "user_defined_type_name": null,
        "is_insertable_into": "YES",
        "is_typed": "NO",
        "commit_action": null
      }
    ]
    ...
    
    

    Finally I got the following summary.

    Here's one bioactivity data record for CHEMBL941 (Imatinib) from the database:
    Bioactivity Record:
    
    Activity ID: 388530
    Standard Type: IC50 (half maximal inhibitory concentration)
    Standard Value: 40.0 nM
    Standard Units: nM (nanomolar)
    Standard Relation: = (equals)
    pChEMBL Value: 7.40
    Target: Tyrosine-protein kinase ABL (CHEMBL2111414)
    Assay Description: Inhibition of Abl tyrosine kinase
    Assay Type: B (Binding assay)
    
    This bioactivity data shows that Imatinib (CHEMBL941) has an IC50 of 40 nM against the ABL tyrosine kinase, indicating potent inhibitory activity. The pChEMBL value of 7.40 (which is -log10 of the molar IC50) confirms this is a highly potent interaction. This is consistent with Imatinib's mechanism of action as a tyrosine kinase inhibitor used in treating chronic myelogenous leukemia.
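The pChEMBL arithmetic in that summary is easy to verify: pChEMBL is -log10 of the activity expressed in mol/L.

```python
import math

ic50_nM = 40.0                         # IC50 reported for CHEMBL941 above
pchembl = -math.log10(ic50_nM * 1e-9)  # convert nM to mol/L, then take -log10
print(round(pchembl, 2))  # 7.4
```

So the 40 nM IC50 and the pChEMBL value of 7.40 in the LLM's answer are consistent with each other.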
    

    It was really amazing for me because the LLM built the SQL query from my request.

    And MCP could communicate with the local PostgreSQL chembl_35 DB.

    It failed when I used a SMILES string as a query, so there is still room for improvement in the technology or in my prompt.

    But MCP is a powerful way to improve the efficiency of lots of tasks ;P
