InChi LargeMolecules Switch #974

@cubbardo

Generating isomeric or canonical SMILES from an SDF record with a large number of atoms throws this error:

An InChI could not be generated and used to canonise SMILES: null
Could not generate InChI Numbers: Too many atoms [did you forget 'LargeMolecules' switch?]

See the mailing list.

Andrew Dalke's detailed explanation of the issue follows:

CDK uses InChI to generate absolute SMILES. Here's a comment from the code:

 * Create a absolute SMILES generator. Unique SMILES uses the InChI to
 * canonise SMILES and encodes isotope or stereo-chemistry. The InChI
 * module is not a dependency of the SMILES module but should be present
 * on the classpath when generation absolute SMILES.

If you remove either the SmiFlavor.Canonical or the SmiFlavor.Isomeric bit flag from your output flavor then you'll get a SMILES, though it won't be an absolute SMILES.

More specifically, CDK uses InChI to generate the atom labels used during canonical SMILES generation. In cdk/smiles/SmilesGenerator.java there's a code path which looks like:

        // apply the canonical labelling
        if (SmiFlavor.isSet(flavour, SmiFlavor.Canonical)) {

            // determine the output order
            int[] labels = labels(flavour, molecule);

where labels() is:

private static int[] labels(int flavour, final IAtomContainer molecule) throws CDKException {
    // FIXME: use SmiOpt.InChiLabelling
    long[] labels = SmiFlavor.isSet(flavour, SmiFlavor.Isomeric) ? inchiNumbers(molecule)
            : Canon.label(molecule,
                          GraphUtil.toAdjList(molecule),
                          createComparator(molecule, flavour));
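
The branch above can be summarised as a small decision table. The following Java sketch is self-contained and does not use CDK: CANONICAL and ISOMERIC are stand-in bit values picked for illustration (the real constants live in SmiFlavor), isSet is assumed to behave like a plain bitmask test, and labellingPath is a hypothetical helper that only names which route would be taken.

```java
public class LabellingPathSketch {
    // Stand-in bit values for illustration only; NOT the real SmiFlavor constants.
    public static final int CANONICAL = 0x1;
    public static final int ISOMERIC = 0x2;

    // Assumption: a flag counts as set when any of its bits is present.
    public static boolean isSet(int flavour, int flag) {
        return (flavour & flag) != 0;
    }

    // Hypothetical helper: names the labelling route for a given flavour.
    public static String labellingPath(int flavour) {
        if (!isSet(flavour, CANONICAL)) {
            return "none"; // the canonical labelling step is skipped entirely
        }
        return isSet(flavour, ISOMERIC)
                ? "inchiNumbers" // goes through InChI, so its atom limit applies
                : "Canon.label"; // CDK's own labelling, no InChI involved
    }

    public static void main(String[] args) {
        int absolute = CANONICAL | ISOMERIC;
        System.out.println(labellingPath(absolute));              // inchiNumbers
        System.out.println(labellingPath(absolute & ~ISOMERIC));  // Canon.label
        System.out.println(labellingPath(absolute & ~CANONICAL)); // none
    }
}
```

Only the first combination reaches InChI, which is why dropping either flag avoids the error.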

Thus, if SmiFlavor.Canonical and SmiFlavor.Isomeric are set, it ends up using code in cdk/graph/invariant/InChINumbersTools.java which configures InChI to do the atom order assignments, via the 'auxiliary information':

public static long[] getNumbers(IAtomContainer atomContainer) throws CDKException {
    String aux = auxInfo(atomContainer, new InchiFlag[0]);
  ...

static String auxInfo(IAtomContainer container, InchiFlag... flags) throws CDKException {
    InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
    boolean org = factory.getIgnoreAromaticBonds();
    factory.setIgnoreAromaticBonds(true);
    InChIGenerator gen = factory.getInChIGenerator(container, flags);
    factory.setIgnoreAromaticBonds(org); // an option on the singleton so we should reset for others
    if (gen.getStatus() == InchiStatus.ERROR)
        throw new CDKException("Could not generate InChI Numbers: " + gen.getMessage());
    return gen.getAuxInfo();

That calls into the InChI library, which has this check (it actually appears in a few places, all with the same idea):

max_num_at = ip->bLargeMolecules ? MAX_ATOMS : NORMALLY_ALLOWED_INP_MAX_ATOMS;
if (nNumAtoms >= max_num_at)
{
    TREAT_ERR( *err, 0, "Too many atoms [did you forget 'LargeMolecules' switch?]" );
    *err = 70;
    orig_inp_data->num_inp_atoms = -1;
    goto err_exit;
}

where

#define MAX_ATOMS 32766
#define NORMALLY_ALLOWED_INP_MAX_ATOMS 1024
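
To make the effect of the two limits concrete, here is a minimal Java sketch of the same comparison. It is not InChI or CDK code: the names AtomLimitSketch, maxAtoms and isRejected are made up for illustration, and only the two constants and the `>=` test are taken from the C excerpt above.

```java
public class AtomLimitSketch {
    // Values copied from the InChI #defines quoted above.
    public static final int MAX_ATOMS = 32766;
    public static final int NORMALLY_ALLOWED_INP_MAX_ATOMS = 1024;

    // Hypothetical helper: mirrors the "max_num_at = ..." line.
    public static int maxAtoms(boolean largeMolecules) {
        return largeMolecules ? MAX_ATOMS : NORMALLY_ALLOWED_INP_MAX_ATOMS;
    }

    // Hypothetical helper: mirrors "if (nNumAtoms >= max_num_at)".
    public static boolean isRejected(int numAtoms, boolean largeMolecules) {
        return numAtoms >= maxAtoms(largeMolecules);
    }

    public static void main(String[] args) {
        int atoms = 1079; // atom count of the test structure used later in this report
        System.out.println("without LargeMolecules: rejected=" + isRejected(atoms, false));
        System.out.println("with LargeMolecules:    rejected=" + isRejected(atoms, true));
    }
}
```

Because the comparison as quoted is `>=`, a structure with exactly 1024 atoms would also be rejected without the switch.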

The higher limit is enabled with the InChI flag 'LargeMolecules', which jna-inchi exposes at https://github.com/dan2097/jna-inchi/blob/master/jna-inchi-api/src/main/java/io/github/dan2097/jnainchi/InchiFlag.java#L47 with the comment:

/** Allows input of molecules up to 32767 atoms [Produces 'InChI=1B' indicating beta status of resulting identifiers]*/

so it appears that changing cdk/graph/invariant/InChINumbersTools.java line 49 from:

    String aux = auxInfo(atomContainer, new InchiFlag[0]);

to include LargeMolecules in that 'new InchiFlag' array would make this work.

However, I'm not a Java developer and don't know how to make this change or test it. I can say the flag does not seem to be user-configurable.

I am a Python developer, and I can reproduce the error using my 'chemfp translate' tool, which uses a Java/Python bridge to work with the CDK. The following uses RDKit to translate a FASTA sequence to an SDF with 1079 atoms:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --out sdf | head -6
megatryp
RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1079 1232 0 0 0

I can have it go from FASTA to SDF using RDKit then have CDK read the SDF to produce the SMILES generation failure:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta --via sdf -U cdk --out smi
Error: CDK cannot create the SMILES string (input title='megatryp'): An InChI could not be generated and used to canonise SMILES: null, file '', line 1, record #1: first line is '>megatryp'. Skipping.

(--via defaults to 'sdf', so I'll omit it in the remaining examples.)

I can configure the CDK SMILES writer to use the Default flavor without the 'Canonical' option, to show that this workaround gives a (non-canonical) SMILES:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Canonical | fold | head -2
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)

Here I'll disable Isomeric instead, so it should be canonical but not isomeric, which might be okay for you:

% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | chemfp translate -T rdkit --in fasta -U cdk --out smi -W flavor=Default,-Isomeric | fold | head -2
O=C(O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(

That's the flavor you pass into SmilesGenerator().
