Abstract
The rapid expansion of ultralarge chemical libraries has revolutionized drug discovery, providing access to billions of compounds. However, this growth poses relevant challenges for traditional virtual screening (VS) methods. To address these limitations, synthon-based approaches have emerged as scalable alternatives, exploiting combinatorial chemistry principles to prioritize building blocks over enumerated molecules. In this work, we present exaScreen and exaDock, two novel synthon-based methodologies designed for ligand-based and structure-based VS, respectively. In the former case, synthon selection is guided by the 3D hydrophobic/philic distribution pattern in conjunction with a specific synthon alignment protocol based on a quadrupolar expansion over the atoms that participate in the linking bonds between fragments. On the other hand, accommodation to the binding site under a geometrically restrained docking of synthon-based hybrid compounds is used in the selection of the optimal synthon combinations. These strategies exhibit comparable performance to the search performed using fully enumerated libraries in identifying active compounds with significantly lower computational cost, offering computationally efficient strategies for VS in ultralarge chemical spaces.


Introduction
The emergence of ultralarge chemical libraries has expanded the frontiers in drug discovery, enabling the exploration of a wider spectrum of chemically diverse libraries and enriching the chances to identify new chemical intellectual property. , In the last years, the scientific community has witnessed a huge expansion of the accessible chemical space, as exemplified by libraries such as EnamineREAL, WuXi GalaXi, Otava CHEMriya, or eMolecules eXplore, , which provide access to billions of on-demand compounds. However, this unprecedented growth raises serious challenges for the efficient screening of these libraries with traditional methods, which would require a vast amount of computational effort and resources. ,
To address these challenges, active learning and synthon-based methods have emerged as promising alternatives to mine huge chemical libraries. − In the former case, the screening of the chemical space is guided by machine learning (ML) models trained with more computationally expensive methods, such as docking tools. − While this strategy is compatible with current screening tools, the scalability can become a bottleneck, as the ML models rely on fully enumerated libraries. Some innovative approaches have addressed this specific issue by exploiting the available information about the chemical reactions and reagents listed in the library instead of the complete enumerated library. However, these sampling strategies require a given percentage of the library (0.1%–2.4%) to be explored with the computationally expensive tool, which may be a limiting factor due to the sustained expansion of the accessible chemical universe. On the other hand, synthon-based methods leverage the principles of combinatorial chemistry to avoid screening fully enumerated libraries. − Synthons are virtual fragments derived from building block suited for synthetic procedures corresponding to the chemical moieties preserved in the final product of a given reaction. This strategy relies on the selection of synthons rather than the generated molecules that will be subsequently used to enumerate a focused chemical database, eliminating the need to handle the entire enumerated library. In this scenario, the computational cost is proportional to the number of synthons rather than the exhaustive list of final molecules, which ameliorates the scalability in processing ultralarge chemical spaces.
Synthon-based methods have been applied in both ligand-based (LB) and structure-based virtual screening (VS), taking advantage of the lessons learned from fragment-based screening in the last two decades through exploitation of 2D and shape similarity. Recent advances have enabled the development of 3D synthon-based tools such as Shape-Aware Synthon Search and SpaceGrow, which include molecular volume as the driving descriptor for synthon selection. The key challenge in LB methodologies lies in the alignment of synthons while considering the orientation of the linker. In current approaches, the linker is either treated as a pharmacophoric point or used to define a rotational axis. The former can lead to incorrect overlays, particularly when linkers are oriented in opposite directions. The latter addresses this challenge by employing an alignment algorithm that represents molecules as cylinders, with the linker serving as the central axis, computing molecular similarity from a binary interpretation of shape.
On the other hand, significant progress has also been made in SBVS, , as exemplified by the work of Beroza et al., who developed a successful synthon-docking campaign on the billion space over ROCK1 receptor. Their approach involved docking synthons, followed by selectively growing promising candidates to form complete molecules. Within the same workflow, V-SYNTHES was also successfully applied to explore a library of 11 billion compounds against the CB1 and CB2 cannabinoid receptors, retrieving ligands in the low micromolar activity range. In spite of these successful applications, the performance of these approaches largely depend on the accurate placement of synthons within the receptor cavity. As a result, they perform full enumeration starting from an initial pool of promising docked synthons, which limits scalability to ensure proper docking poses, a challenge that becomes relevant as library sizes continue to grow.
In this work, we exploit the fragmentation of a known bioactive molecule (i.e., the ligand bound to the protein target in the X-ray crystallographic structure), which will be denoted as reference compound hereafter, into fragment-like moieties to overcome the limitations of current state-of-the-art synthon-based methods. For SBVS, we propose using reference fragments to guide synthons placement. The linker atoms of the reference fragments are connected to the suitable linking atoms in the library’s synthons. Thus, simultaneous searches for both synthons in two-component reactions can be performed, scaling linearly with 2N, where N is the number of synthons in the library. Moreover, the linker atoms of these synthons can also be used to define a quadrupole tensor from atomic contributions. This process preserves the orientation of the growing-synthon vector while enabling the use of a discretized field for robust molecular similarity assessments in LBVS. To this end, two strategies denoted exaScreen and exaDock have been developed in the context of LBVS and SBVS, respectively.
In exaScreen, the selection of synthons is guided by the 3D hydrophobic/philic (Hyphar descriptors) distribution patterns of molecules, which are derived from the decomposition of the n-octanol/water partition coefficient (logP) of a molecule into atomic contributions. These descriptors are exploited to assess the hydrophobic/philic similarity between synthons and chemical fragments derived from the partitioning of the reference molecule. To this end, the position of the linker atom is used as the expansion center of the quadrupolar tensor of the logP to maximize the similarity between synthon and reference fragment. Although this framework can be a priori applied to any atomic property, the use of Hyphar descriptors is motivated by their successful application in predicting molecular overlays and pharmacophore-guided drug discovery.
In exaDock, the selection of synthons is performed via docking of hybrid compounds that combine synthons with the fragments originated upon partitioning of the reference compound. Accordingly, the synthon-based hybrid compounds should encode a proper signature for the proper accommodation to the distinct subpockets occupied by each fragment of the reference compound. If the linkage of the synthon to the remaining fragments of the reference compound enables the placement of the hybrid compound in the binding cavity, this strategy should disclose the optimal synthons recurrently overlaid to the reference fragments that could be reconstituted in novel compounds tailored to the distinct subpockets of the cavity.
In the following sections, a detailed description of the methodological framework of exaScreen and exaDock is provided in conjunction with a comprehensive analysis of the performance and computational cost obtained from these techniques in comparison with the application of brute force screening tools using the fully enumerated library of compounds.
Methods
Combinatorial methods applied to VS evaluate synthons rather than enumerated compounds to effectively navigate huge chemical spaces. Since each synthon contains information about the chemical reactions that may be involved with and its role as reactant, they can be easily combined to generate a focused library of purchasable enumerated compounds. Remarkably, this approach should reduce the computational cost of the VS from O( ) (i.e., the size of the fully enumerated library) to O( ), where n represents the number of synthons per reaction component, R is the number of reaction components, and k stands for the number of reactions. Building on this framework, exaScreen and exaDock have been developed for an efficient sampling of ultralarge chemical libraries for the LBVS and SBVS, respectively. In this study, we focus on two-component reactions, as they represent the predominant synthesis pathway for EnamineREAL compounds. Future work will extend to three or more component reactions, incorporating the generation of a 3D structure from multiple growth vectors positioned far apart in space.
ExaScreen: Synthon-Based LBVS
ExaScreen enumerates compounds derived from synthons that maximize the hydrophobic/philic similarity relative to the fragments generated from the reference compound. At this point, we assume that a template of the 3D bioactive conformation of the reference compound is a priori known by the user (for instance, through the analysis of the conformational minima, often coupled to the use of pharmacophore data, or even from the available structural information derived from crystallographic structures or NMR data in solution). −
Fragmentation of the reference compound is performed using retrosynthetic rules based on retrosynthetic combinatorial analysis procedure (RECAP). Atoms at the cleaved bond will be denoted hereinafter as linking atoms. Noteworthy, the fragments generated in this way retain the original 3D structural information on the reference compound in the bioactive species (Figure ).
1.

Schematic representation of the procedure followed to align synthons onto the fragments generated from the 3D bioactive conformation of the reference molecule. (A) The reference molecule is split into fragments and the synthons are overlapped to the reference fragments generated using the hydrophobic/philic quadrupole (Q) determined relative to the linking atoms as center of quadrupolar expansion. (B) Representation of the hydrophobic (white) and hydrophilic (blue) fields generated from the overlaid reference fragment and synthon (shown as sticks).
The similarity between synthons and reference fragments is estimated maximizing the overlay in their 3D hydrophobicity/philicity pattern described by the atomic contributions to the logP (Hyphar descriptors). These atomic descriptors (logPi) are obtained from quantum mechanical calculations of the solvation free energy of the solute in n-octanol (ΔG n‑oct) and water (ΔG wat) using the MST continuum solvation model (eq ), − where the solvation free energy (ΔG sol) is determined as the addition of electrostatic (ΔG x ) and nonelectrostatic (ΔG x ) contributions (eq ).
| 1 |
where N denotes the number of atoms in the solute.
| 2 |
where x stands for n-octanol or water.
Although ΔG sol, and hence logP, is a property of the whole molecule, the decomposition into atomic contributions is facilitated by the dependence of (i) ΔG x on the atomic contribution to the molecular surface, and (ii) ΔG x on the solvent response through the surface generated by the atoms in the solute. Thus, ΔG x involves the cavitation of atom i (ΔG x,i ) weighted by the ratio between the surface of such an atom (S vW,i ) and the total van der Waals surface (S vW), and the linear relationship between the van der Waals term of atom i (ΔG x,i ) and its atomic surface (eq ).
| 3 |
where ΔG P,i stands for the cavitation term determined using Pierotti’s scaled particle theory, and ξ i denotes the surface tension of atom i.
Finally, ΔG x can be determined using a simple expression corresponding to the interaction of the gas phase molecular wave function (Ψo) with the set of point charges (q k ) that reflect the solvent response over the solvent-exposed surface of the each atom (eq ; for a detailed description, the reader is addressed to previous reviews of the MST method). ,
| 4 |
where M stands for the set of point charges reflecting the solvent’s repose to the electrostatic charge of the solute.
Let us note that Hyphar descriptors for the data set of available synthons can be precalculated during the library preparation stage, leading to a significant reduction in the computational cost of the VS.
The 3D hydrophobic/philic fields generated from Hyphar descriptors are used to evaluate the degree of pairwise resemblance between synthons and fragments of the reference compound. To this end, the atomic hydrophobic/philic contributions are projected onto a grid using an exponential function (p(q); eq ), which follows the formalism implemented in COMSIA. The measurement of the hydrophobic/philic similarity is supplemented by the similarity between acceptor/donor hydrogen-bond fields generated from polar atoms in both fragment and synthon, thus mimicking the formalism adopted in PharmScreen to guide the overlay between molecules.
| 5 |
To determine the overlay between synthons and fragments, the quadrupole moment of the 3D hydrophobic/philic distribution (Q; eq ) is computed, using the linking atom’s position as the center of expansion for both the reference fragment and the synthon. In the case of multiple linkers for the same synthon (e. g., ring formation), the expansion center is defined as the midpoint between the original linking atoms. The three principal axes of Q, which are invariant to translation and rotation, represent the canonical axes that define the orientations of the synthon.
| 6 |
Once the reference fragment and synthon are overlaid, the hydrophobic/philic + HB fields similarity is used to identify the optimal synthons (Figure ). This ranking is utilized to generate compounds that combine the top-ranked synthons identified for each fragment in the reference molecule. Generation of these compounds exploits the information on the chemical reaction that would link the synthons as complementary reactants, subject to a restrained energy minimization to relax the internal geometrical parameters that define the bond between the linking atoms, also including first neighbors, in the synthon-based compound. In this process, positional restraints are introduced to maintain the synthons onto the reference fragments to permit a slight structural adjustment upon formation of the bond between synthons while preserving the 3D overlay of the synthon-based compound onto the bioactive conformation of the reference molecule (Supporting Information Figure S1). For our purposes here, the restrained minimization was performed with the MMFF94 force field, with a spatial constraint of 0.5 Å for the nonlinking atoms and their direct neighbors. Finally, the compounds are enumerated and ranked according to the arithmetic sum of the individual rank assigned to the combined synthons against the corresponding reference fragments.
2.

Diagram of exaScreen workflow. Starting from the HyPhar parametrization of the fragments generated upon partitioning of the reference compound, maximization of the hydrophobic/philic overlay guides the selection of the optimal synthons from precalculated Hyphar parameters. Synthons are then combined in synthon-based compounds.
ExaDock: Synthon-Based SBVS
The computational protocol in exaDock relies on the 3D structure of the ligand–receptor complex derived from X-ray crystallography or alternatively from docking the ligand in the target protein. This information permits to exploit the ligand pose in the bound complex to explore the proper positioning of synthons through docking in the subpockets filled by the fragments of the reference compound. For our purposes here, synthon docking was performed with Glide. − Nevertheless, other docking software, including restraints from known pharmacophore (hot spot) points, could also be utilized.
As in the exaScreen workflow, the first step is to partition the reference molecule following the RECAP rules into two or more fragments of comparable size (Figure ). The suitability of the synthons to replace a reference fragment (i.e., f i ) is examined using a restrained docking of a hybrid compound formed by the synthon that would replace fragment f i and the rest of reference fragments (i.e., f j with j ≠ i). This procedure assumes that attachment of a suitable synthon should not significantly alter the pose of the fragments (f j ) shared by the reference ligand and the hybrid compound, thus retaining the bioactive conformation required for the binding of the reference compound to the target protein. To achieve this, we implemented a simple, yet effective method that exploits a SMARTS pattern matching of the common fragment in the reference and hybrid compounds, provided that the overall structural deviation is lower than a given threshold (taken as 2 Å in this study).
3.

Diagram of exaDock workflow. Synthons are joined to the reference fragment, and docking is performed with the reference fragment restrained to its crystallographic position. Then, the highest-scoring synthons are combined to create hybrid molecules, which are finally docked without any restraints.
The recurrent application of this procedure to all the reference fragments enables ranking the available synthons according to the docking score of the hybrid compounds that fill the distinct subpockets in the binding cavity. Finally, the top-scored synthons identified for each subpocket are combined using the reactions defined in the synthon database to generate a pure synthon-based compound. The suitability of the resulting set of pure synthon-based compounds can finally be assessed through unrestrained docking calculations to confirm the proper fit in the binding pocket.
Preparation of the Synthons Library
In this study, the Enamine Real Space library has been used as a commercial source of synthons to examine the performance of exaScreen and exaDock for two-component reactions. Chemaxon’s Calculator Plugins were employed to generate the 3D structures of the compounds from the SMILES code that represent each synthon in the chemical library. The protonation state of synthons was assigned considering a physiological pH, and when applicable, up to 16 stereoisomers were defined. Conformers were generated using RDKit’s ETKDG method. Cleaved bonds were capped with a dummy atom to keep the chemical environment information.
Choice of Test Systems
To evaluate the reliability of exaScreen and exaDock, 10 systems from a diverse range of protein targets were selected from well-known LBVS and SBVS benchmarks reported in the literature (Table ). , In addition, the soluble Epoxide Hydrolase (sEH) enzyme was also included due to the successful application of Hyphar descriptors in a recent study.
1. Molecular Systems Used for Benchmarking and Validation of exaScreen and exaDock.
| system ID | system name | number of actives | PDB code |
|---|---|---|---|
| AA2AR | adenosine receptor A2 | 264 | 5K2D |
| ABL1 | tyrosine-protein kinase | 84 | 2GQG |
| FA10 | coagulation factor X | 215 | 2J2U |
| FGFR1 | fibroblast growth factor receptor 1 | 67 | 4RWJ |
| H1 | histaminic 1 receptor | 23 | 3RZE |
| LCK | tyrosine protein kinase | 83 | 3BYS |
| MAPK2 | MAP kinase 2 | 48 | 2PZY |
| MMP13 | collagenase 3 | 185 | 3ELM |
| PLK1 | serine/threonine-protein kinase | 45 | 4J52 |
| RXRA | retinoic acid receptor RXR-alpha | 46 | 4RMD |
| SEH | soluble epoxide hydrolase | 17 | 4OCZ |
Listed for each target in the original benchmarks ,, and filtered according to molecular size (see text).
To obtain the reference compound for each system, X-ray structures retrieved from the Protein DataBank were cleaned, retaining only the protein and the ligand of interest (only chain A was considered when multiple chains are found in the X-ray structure). The isolated ligand was used as the reference molecule for the validation of exaScreen, whereas the whole ligand–protein complex was used to validate the performance of exaDock. It is worth noting that deletion of water molecules is not expected to introduce a bias in the results obtained for the selected targets upon inspection of the specific pose found in the X-ray structures (Table ). This feature, which is useful for the sake of comparison with other tests, may however be reconsidered for the application to other targets, where water molecules may assist the proper binding of the ligand.
The protein targets were prepared using the Protein Preparation Wizard module in Maestro under default settings. Additionally, missing side chains were modeled using Prime and polar hydrogen orientations were optimized using PROPKA3.0 , at a physiological pH. To properly accommodate the reference compound into the binding site and eliminate potential clashes between ligand and protein, a small minimization of the ligand and the residues located at 6 Å from the ligand was done. The docking grid was defined around the ligand using the default parameters of the Glide Grid Generator.
For each target, the set of known actives was collected from the original benchmarks on which the molecular systems are based. , With regard to the sEH enzyme, actives were collected from the X-ray structures used to build the QSAR model reported in our previous study. In all cases, the original list of actives were filtered in order to select actives with a molecular size similar to the reference compound (see Performance Evaluation section for more details). The same set of actives were used to evaluate the performance of exaScreen and exaDock in LBVS and SBVS, respectively.
Performance Evaluation: Validation of the Alignment Protocol in ExaScreen
With exaScreen being a 3D screening method, it is relevant to check the performance of the computational procedure adopted in exaScreen to overlay synthons onto the reference fragments. Since the prediction of molecular overlays is influenced by the choice of the template, the template’s conformation, and the degree of similarity between compounds and templates, the performance of the overlay procedure was assessed based on the self-alignment of reference compounds (reconstituted from their respective fragments) and the overlay between pairs of active compounds.
To this end, a test set was assembled from the cocrystallized ligands reported in the well-known DUD-E+. This set was curated through the application of a series of filters, which included (i) the susceptibility of the query to be partitioned following the RECAP rules, and (ii) the detection of a significant structural overlay with other queries of the same target as measured from the overlap between their chemical skeletons, which should lead to a shape Tanimoto index >0.7. Overall, this filtering ensures that the compounds included in the test set have similar size and occupy the same cavity in the target. These criteria led to the retrieval of ligands from 87 X-ray structures (Supporting Information Table S1), enabling the evaluation of the self-alignment of 80 compounds and 112 pairwise molecular overlays.
For this database, the 3D structure of the fragments present in the compounds was built up from the SMILES code, the protonation state at physiological pH adjusted with Chemaxon’s Calculator Plugins, and conformers generated using RDKit’s ETKDG method, thus mimicking the procedure described for the synthons library (see above). This approach was adopted to simulate a scenario where the 3D conformation of the compound to be overlaid onto the template is unknown. The molecular overlays obtained from the self-alignment of ligands and the pairwise alignments with other compatible compounds were evaluated from the positional root-mean square deviation (RMSD). Alignments with RMSD <2 Å were considered successful.
Performance Evaluation: Validation of the Screening Efficiency of ExaScreen and ExaDock
The capability to discern between actives and decoys was examined through comparison of the exaScreen results and those obtained from the brute force screening performed with PharmScreen. , This comparison was conducted using a data set of 1 million enumerated compounds per system, that is, a total number of 11 million compounds for the whole set of targets.
For evaluating the SBVS, the results obtained from exaDock were validated through comparison with Glide. Docking calculations were carried out using the default procedure in SP mode (representative files for docking calculations are provided in the Supporting Information). In this case, a data set of 100,000 enumerated compounds was considered per system, leading to a whole set of 1.1 million molecules for the 11 targets included in this analysis.
The chemical spaces used to compare the proposed methods against brute force (PharmScreen and Glide) approaches were created using a random selection of synthons from the Enamine Real Space, although restrained to have a similar size to the reference fragment (i.e., the number of heavy atoms in the synthons was allowed to differ up to ±4 relative to the average number of heavy atoms in the reference fragments). The same synthons used to generate the enumerated sets were subsequently used to validate the performance of the corresponding combinatorial (exaScreen or exaDock) method. Additionally, known hits were added to each system to assess the capability to recover active molecules in the combinatorial and fully enumerated screening strategies. As in the case of decoy synthons, only actives with chemical moieties that differ at most by ± 4 heavy atoms relative to the average number of heavy atoms in the reference fragments were considered (the number of hits per target, which varies from 17 to 264, is reported in Table ). The known actives were split using the same rules adopted for reference ligands and prepared as synthons (see Preparation of the Synthons Library section).
Results and Discussion
Synthon Alignment in ExaScreen
To test the suitability of the procedure implemented for synthon alignment onto the reference fragments, 80 compounds retrieved from X-ray structures were selected as a benchmark data set to examine the accuracy of 80 self-alignments and 112 pairwise molecular overlays (see Performance Evaluation: Validation of the Alignment Protocol in ExaScreen section). The alignment quality was assessed by evaluating the enumerated compound reported by exaScreen, using the fragments generated upon partitioning of the crystallographic ligands.
Figure shows the accuracy of the molecular overlays predicted for each self- and pairwise alignment (see Supporting Information Table S1 and S2 for additional details of this analysis). The results point out that 97.5% of the self-alignments have RMSD values lower than 2.0 Å. Only in two cases, X39 (akt2, PDB 2X39) and ZSV (hivint, PDB 3ZSV; Table S1), the alignment led to overlays with RMSDs larger than the threshold value (see Supporting Information Figure S2 for visualization the self-alignment obtained for these two cases). On the other hand, 91.1% of the pairwise overlays exhibit RMSDs below 2.0 Å, and those cases above this threshold have RMSD values comprised between 2.2 (akt1, overlay between SMR and SM9 in PDBs 3QKL and 3QKM, respectively; Table S2) and 3.6 (nram, overlay between RA2 and BCZ2 in PDBs 1B9 V and 1L7F, respectively; Table S2) (a graphical display of the pairwise molecular alignment for these two cases is available in Supporting Information Figure S2). It is worth noting that when MacroModel was used to generate conformers (using the default parameters) as an alternative method to RDKit’s ETKDG, the self-alignment of 3ZSV-ZSV and 2 × 39-X39 was correctly predicted, leading to overlays with RMSD <2.0 Å (Figure ). This reflects the dependence of the 3D alignment protocol on the completeness of the conformational space. This trend was also found in three out of the 10 pairwise overlays (Figure ).
4.
RMSD (Å) distribution obtained from the overlays obtained from (left) the self-alignment of 80 compounds, and (right) the pairwise overlay of 112 alignments. Results obtained using the default protocol for conformer generation in RDKit’s ETKDG are shown as black (RMSD ≤2 Å) and pink (RMSD >2 Å) dots. For compounds with RMSD >2 Å, the results obtained with conformers generated using MacroModel are shown as blue dots.
These results give confidence to the use of the hydrophobic/philic quadrupole determined relative to the linking atom as a metric to guide the molecular alignment between synthon and fragment, supporting the application of this computational approach to enumerate synthon libraries in LBVS.
Performance of ExaScreen in LBVS
The suitability of exaScreen in recovering active compounds compared to the brute force (PharmScreen) approach was examined for a set of 11 targets. To this end, each system included a set of known actives (ranging from 17 to 264 for the set of targets; see Table ). On the other hand, for the sake of comparison, PharmScreen utilized for each target a unique library containing 1 million decoys enumerated from the set of ∼10,000 synthons used in exaScreen calculations.
The results demonstrated that exaScreen achieved comparable recovery rates to PharmScreen in 10 systems at different levels of enrichment ranging from 0.1% to 0.5%, which correspond to 1000 and 5000 compounds, respectively (Figure ). Interestingly, in FGFR1 and AA2AR, exaScreen outperformed PharmScreen. This increase in performance could be attributed to the restrained synthon search, which fixes the synthon’s linking atoms and could assist the proper alignment with the reference fragment. The only exception is ABL1, where exaScreen showed a poorer performance than PharmScreen due to a suboptimal similarity of synthons against a specific fragment of the reference molecule (see ExaScreen: Case Studies for a detailed discussion).
5.

Histogram representation of the recovery of active molecules obtained from exaScreen (pink) and Pharmscreen (orange) calculations for each system. The different shades of colors represent the values (%) at 0.1, 0.25, and 0.5% of the ranking (from darker to lighter color), which encompass the top 1000, 2500, and 5000 molecules, respectively, for each system.
In addition to the overall screening performance of both methods, the degree of identity of the chemical space recovered by the two methods was also analyzed. Figure shows the overlap between retrieved actives determined for different enrichment levels. This analysis revealed that most targets exhibit a nearly complete overlap between exaScreen and PharmScreen. However, lower levels of overlap were detected in a few instances, such as AA2AR (see below for discussion). Overall, the results support the suitability of the synthon-based search approach as a strategy to identify active compounds, delivering comparable results with a brute force strategy.
6.

Venn diagram of the actives recovered exclusively by exaScreen (magenta) or PharmScreen (orange) and simultaneously recovered by the two methods (light pink) at 0.1, 0.25, and 0.5% of the ranking (top 1000, 2500, and 5000 molecules, respectively) for each system.
ExaScreen: Case Studies
Considering the preceding discussion, it is worth examining in more detail the performance of exaScreen and PharmScreen for the tyrosine kinase ABL1 and the GPCR for adenosine AA2AR.
ABL1
The results for this enzyme show a lower performance of exaScreen compared to PharmScreen. The final ranking of the synthon-based compounds generated by exaScreen comes from a combined score that reflects the hydrophobic/philic resemblance of the synthons against the fragments present in the reference molecule. For the specific case of ABL1, inspection of the overlays between the actives reconstituted from exaScreen calculations and the reference molecule revealed an optimal correspondence between the hydrophobic/philic fields of the synthons against one reference fragment (Figure , synthon 1), but a different hydrophobic/philic profile is observed against the other fragment (Figure , synthon 2). This discrepancy has a negative impact on the ranking of the reconstituted actives, which are then deprioritized in exaScreen. In contrast, this effect is mitigated in PharmScreen calculations, since the evaluation of the hydrophobic/philic measurement is made over the entire molecule, thus making the molecular overlay less sensitive to local differences in the hydrophobic/philic 3D distribution maps (Figure , complete molecule). However, it is worth noting that the decoys prioritized above the known actives by exaScreen, which generally involves synthons more similar to the reference fragments, have not been experimentally tested and they could also represent promising candidates against this target (see Supporting Information Figure S3).
7.
Hydrophobic/philic fields of the reference compound chosen for ABL1 (PDB ID 2GQG) and three active compounds ranked within the top 5000 by PharmScreen. The hydrophobic and hydrophilic regions are shown as white and blue isocontours, respectively, for the chosen reference and active (CHEMBL 1173200, 250213, and 564289) compounds.
AA2AR
In contrast to ABL1, exaScreen retrieves more actives than PharmScreen for AA2AR (Figure ). Moreover, among all the systems studied, AA2AR exhibits the lowest overlap between both methods (Figure ). Clustering of the chemical skeleton of the actives shows that whereas clusters 2, 5, 6, and 8 are populated by both techniques, clusters 1, 3, 4, and 11 are exclusively sampled by exaScreen (Figure ). Particularly, cluster 4 contains 15 actives exclusively sampled by exaScreen, whereas only cluster 8 is predominantly populated by molecules sampled by PharmScreen. This distinction explains the low overlap observed for AA2AR compared to other systems (See Supporting Information Figure S4 for the cluster analyses of the rest of the systems).
8.
Hierarchical clustering of active compounds ranked within the top 5000 for AA2AR (PDB ID 5K2D). The bar plot represents the total number of actives in each cluster, based on circular fingerprints (radius: 6, threshold: 0.85). The main scaffolds for clusters 4 and 8 are highlighted.
The larger number of clusters found by exaScreen can be attributed to a better alignment of the synthons. The reduction in the conformational space using less flexible and smaller molecules (synthons) compared to the reconstituted full compound improves the accuracy of the similarity measurement and facilitates the adoption of the proper overlay with the template, as exemplified in Figure for CHEMBL372475 (cluster 4). The same situation can be extrapolated to FGFR1, where a reduction in conformational space also improves the synthon-based overlay (Figure ).
9.

Overlay of the reference structure (gray) for (left) AA2AR (PDB ID 5K2D) and (right) FGFR1 (PDB ID 4RWJ) with the active compounds CHEMBL372475 and CHEMBL 253292, respectively, obtained using exaScreen (pink) and PharmScreen (orange).
Performance of ExaDock in SBVS
This analysis examined the ability of exaDock to recover actives from synthons against a brute force VS carried out for the enumerated compounds with Glide. To this end, the number of active compounds retrieved at 1.0, 2.5, and 5.0%, which correspond to 1000, 2000, and 5000 compounds, was assessed for the two methods.
The results reveal a comparable performance between exaDock and Glide (Figure ), although the performance of exaDock is notably lower for the mitogen-activated protein kinase 2 and at less extent for the histamine receptor 1 (H1). The behavior for these two targets can be attributed to the physicochemical features of the binding cavity (see below for a detailed discussion). Moreover, one can also notice that the overlap between the actives recovered by exaDock, and Glide is generally low (Figure ), which highlights the influence of the structural features of the pocket on the selection of the synthon-based compounds compared to the results obtained within the LBVS framework, which is merely guided by the similarity alignment. For instance, whereas an almost complete overlap of recovered actives is found for H1 (11 out of 12 actives), the overlap is minimum for ABL1 (3 out of 21 actives) for the top 5% (see below for discussion of these two cases).
10.

Histogram representation of the recovery of active molecules obtained for exaDock (green) and Glide (gray) calculations for each system. The different shades of colors represent the values (%) at 1.0, 2.5, and 5.0% (from darker to lighter color), which encompass the top 1000, 2500, and 5000 molecules, respectively, for each system.
11.

Venn diagram of the actives recovered exclusively by exaDock (green) or Glide (gray) and simultaneously recovered by the two methods (light green) at 1%, 2.5%, and 5% of the ranking (top 1000, 2500, and 5000 molecules, respectively) for each system.
ExaDock: Case Studies
The inclusion of the geometrical and chemical features of the binding site in docking calculations has a notable effect on the ranking of the synthon-based compounds obtained by exaDock and Glide, leading to a larger sensitivity of the overlap between retrieved actives, as discussed for several cases.
LCK
For this tyrosine-protein kinase, 9 and 8 actives were uniquely recovered using exaDock and Glide, respectively, and only 6 actives were simultaneously detected by both methods at the top 5.0% (Figure ). As noted in Figure A,B, the actives retrieved by exaDock reflect the selection of synthons ranked according to the restrained docking of the hybrid (synthon + fragment) compound. This procedure was conceived to retain the proper arrangement of the fragment (f j ) shared between the hybrid compound and the reference molecule, thus limiting the sampling of the synthon space to the subpocket filled by the other fragment (f j ); see ExaDock: Synthon-Based SBVS. Interestingly, when the 9 active (synthon-based) compounds are docked without imposing positional restraints, only a single compound adopted a pose that mimicked the arrangement of the reference compound. In contrast, the rest of the actives adopted poses markedly different, where part of the ligand was oriented protruding from the binding pocket (i.e., pointing toward the solvent) (Figure C). This example provides a rationale to justify the relatively low overlap in the actives recovered by exaDock and Glide (Figure ). In turn, this suggests that the selection of synthons made by exaDock enables the prioritization of molecules that would be ranked in a lower position if they were docked directly in the binding pocket (Figure ). The lower performance obtained with Glide without restraints could stem from using a single protein structure. When performing rigid docking, accommodating actives that differ from the reference could be more complex. Therefore, the use of restraints or the hybrid molecule (which includes part of the reference) facilitates the ligand accommodation.
12.
Representation of the 9 active compounds recovered (top 5.0%) using exaDock for the LCK receptor. The surface of the protein is shown as pale yellow, and the reference molecule is shown as pink sticks. (A) Poses of the hybrid (synthon + fragment) compounds obtained from the restrained (fragment 1) docking with Glide (green sticks). (B) Poses of the hybrid (synthon + fragment) compounds obtained from the restrained (fragment 2) docking with Glide (green sticks). (C) Poses of the synthon-based compounds recovered by exaDock and subsequently docked with Glide (gray sticks).
H1 and ABL1
These two systems exemplify cases of high (H1) and low (ABL1) overlap between exaDock and Glide results. In the former case, 11 actives were simultaneously recovered by exaDock and Glide, while only 1 and 7 hits were exclusively detected by these methods (Figure ). In contrast, 18 and 19 actives were uniquely recovered by exaDock and Glide, and only 3 actives were found simultaneously by the two methods (Figure ). The distinct outcome obtained for H1 and ABL1 may reflect the differences in the binding pocket of these targets. Thus, H1 presents a narrow and deep pocket, which limits the conformational flexibility of putative ligands, whereas ABL1 exhibits a highly solvent-exposed pocket (Figure ).
13.
Representation of the poses obtained for actives simultaneously selected by exaDock and Glide for targets (top) ABL1 and (bottom) H1. The left panels depict the pose of the 3 (ABL1) and 11 (H1) hybrid (restrained fragment + synthon) compounds (shown as green-colored sticks) in exaDock calculations, and the right panel illustrates the pose of the same compounds (shown as gray-colored sticks) evaluated from Glide computations. The reference compound is shown as pink-colored sticks, and the surface denotes the shape of the solvent-exposed (ABL1) and buried (H1) binding pocket.
To explore the influence of the pocket on the performance of exaDock (and Glide) in selecting the best ligands, the topological and physicochemical features of the binding cavity characterized using Fpocket , was determined for the whole set of targets and compared with the actives recovered by both exaDock and Glide (overlap in Supporting Information Tables S3 and S4). Although caution is necessary due to the limited number of targets included here, the statistical analysis highlights the relevance of the “alpha sphere density” parameter, which reflects the compactness of the pocket (i.e., a small/large value indicates a rather buried/exposed cavity), and at less extent the “flexibility”, which depends on the normalized B-factor of atoms in the binding pocket (Supporting Information Table S5). In this regard, the simultaneous recovery of actives is favored in compact and buried cavities (as noted in the normalized values of −1.04 and −0.44 for H1 and ABL1; Supporting Information Table S4) and by the local flexibility of the pocket (as noted in the normalized values of 1.82 and −0.77 for H1 and ABL1; Supporting Information Table S4). In fact, the binding modes between the molecules obtained with exaDock and Glide are similar for the H1 receptor (Figure ), whereas a larger discrepancy can be observed between the two methods for ABL1.
MAPK2
As noted above, exaDock identifies around half of the actives compared to Glide for this kinase (Figures and ). The binding site contains residues D207, K93, and L141, which form interactions with the reference compound (and most actives in the poses obtained with Glide), but a significant fraction of the ligand is located around the edges of the pocket, protruding to the solvent.
Inspection of the hybrid compounds built from exaDock reveal that synthons are properly positioned forming hydrogen bonds with D207 and K93, and their docking scores are more favorable relative to those obtained from decoys (Figure ). However, when synthons target the solvent-exposed fragment of the reference ligand, docking scores are indistinguishable from those of the decoys (Figure ), reflecting the detrimental effect on the score due to the lack of specific recognition motifs in this latter area. This limitation does not occur in the brute-force docking, because the primary driving force in defining the pose comes from the hydrogen bond network formed inside of the pocket with residues L141, D207, and K93.
14.
Histogram distribution of the docking scores for the hybrid compounds derived from exaDock calculations upon restraining (left) the innermost and (right) outermost fragment of the reference ligand (shown as pink-colored sticks). The black dashed line in the histograms denotes the lowest docking score for the decoys. (Right) When synthons sample the outermost fragment, the scores are worse than the values obtained for decoys. (Right) In contrast, synthons that sample the innermost subpocket that contain D207 and K93 are better scored.
Computational Efficiency of ExaScreen and ExaDock
From a computational point of view, combinatorial approaches are expected to perform better compared to brute force methods (Table ), enabling the exploration of ultralarge chemical spaces. This obeys the distinct cost of the synthon-based and brute force methods, which can be estimated to have a linear and exponential dependence, respectively, with the number of synthons. Indeed, this is reflected in the cost of synthon-based and brute-force strategies for the set of targets considered for LBVS and SBVS in this study. The ratio of simulation time required for PharmScreen and Glide relative to exaScreen and exaDock is 119 and 21, respectively, for the data set used in the experiments (1 M and 100 K). For Enamine REAL 2024.9 (70B), these values increase to 112,162, and 52,564, respectively (Table ). Previous studies indicate savings factors of 800 (including conformers generation time, 229 M) for LBVS and 5000 (11B) for SBVS. Assuming speedups depend on the fully enumerated library size, the resource savings reported in this work can be considered competitive.
2. Computational Resources Required for Benchmarking and Validation of exaScreen and exaDock for the Molecular Systems Used in This Study .
| this
study (1 M compounds) |
Enamine
REAL 2024.9 (70B compounds)
|
|||
|---|---|---|---|---|
| LBVS | library size | simulation time (h) | library size | simulation time (h) |
| pharmScreen | 0.19 GB | 11.9 | 1.5 PB | 8.3× 105 |
| exaScreen | 0.02 GB | 0.1 | 6.5 GB | 7.4 |
| this
study (100 K compounds) |
Enamine
REAL 2024.9 (70B compounds) |
|||
|---|---|---|---|---|
| SBVS | library size | simulation time (h) | library size | simulation time (h) |
| glide | 0.6 GB | 5.79 | 39.6 TB | 4.1× 106 |
| exaDock | 3 MB | 0.28 | 369 MB | 78 |
Values reported considering the usage of an AWS c5.12xlarge instance 1st generation Intel Xeon Platinum 8000 series (Skylake-SP) 3.4 GHz, 48 threads, and 96 GB of memory.
Simulation time for PharmScreen, Glide, and exaDock estimated from the numbers obtained in this study.
Average of the simulation time required to run Enamine REAL 2024.9 with exaScreen for all the molecular systems used in this study (values per system are reported in Figure S6).
The speedup of exaScreen is greater than that of exaDock because, in our SBVS approach, a final step is required to dock the top enumerated molecules, which limits the method’s scalability. The time required by exaDock to run EnamineREAL (2024.9) has been extrapolated from the numbers obtained in this study, whereas for exaScreen, the value corresponds to the average experimental time taken for each reference used in this study to process the complete Enamine library (Supporting Information Table S6). As the size of the chemical library continues to grow, combinatorial methods will exhibit better computational efficiency due to their scalability.
Considering the size of the EnamineREAL (2024.9) library, which would include 70 billion compounds, such a ratio would be extrapolated to be ∼105 (Table ). Notably, the size of exaDock library is lower because conformers are not included. It becomes clear that, as the size of the chemical library continues to grow, combinatorial methods will exhibit better computational efficiency due to their scalability.
Final Remarks
This work examines the suitability of two computational strategies to exploit synthons for exploring ultralarge chemical spaces in the framework of LBVS and SBVS at a reduced computational cost due to the linear scaling with the number of building blocks in the synthon library.
The two strategies rely on the 3D structure of the bioactive compound selected by the user as reference template, which is subsequently partitioned into fragments at bonds amenable to combinatorial chemistry. In the context of LBVS, exaScreen exploits atomic descriptors to select the optimal synthons through similarity measurements of the molecular fields generated by the synthons and the reference fragment. In the current implementation described here, the Hyphae descriptors have been used to guide the overlay according to the 3D hydrophobic/philic distribution of both synthon and fragment, taking advantage of the successful application to VS based on lipophilic similarity. , For the set of reference compounds considered here, the overall performance of exaScreen compares with the results obtained with the brute force (PharmScreen) approach, as noted in the similar recovery of actives and the significant degree of identity between the actives selected by the two methods. Nevertheless, this approach can be easily adapted to the implementation of other atomic descriptors, thus enabling to assess how the performance could be affected by the nature of the chemical descriptors, as long as their suitability may depend on the specific physicochemical features of the reference compound.
In the case of SBVS methods, exaDock exploits a restrained docking to explore hybrid compounds (fragment reference + synthon) recurrently for the distinct reference fragments, which ultimately leads to the selection of the optimal synthon-based compounds built from the best ranked synthons identified for the reference fragments. This strategy aims to guide the synthon selection from the 3D pose of the reference compound, minimizing the occurrence of biases due to the known influence exerted by the structural features of the binding pocket on the outcome of docking calculations. In this regard, it is not surprising that the results highlight a larger difference in the actives recovered by exaDock and Glide, which is in contrast with the similar performance obtained in the LBVS analysis. This reflects the influence due to the geometrical and physicochemical properties of the binding site, particularly to the compactness of the pocket and to the solvent exposure of the reference fragments. On the other hand, it is worth noting that the SBVS strategy outlined here, which uses Glide SP to rank the synthon-based compounds, can be coupled to other scoring functions, such as the HINT hydropathic interaction, which may be better suited for a specific class of protein targets.
Overall, the LBVS and SBVS strategies obtain competitive enrichment compared to brute force methods requiring orders of magnitude lower computational resources, paving the way for an efficient and accurate approach to explore upcoming trillion-sized chemical spaces in 3D. Certainly, this exciting scenario should promote the definition of standardized blind data sets and normalized metrics to enable the comparison of the performance between distinct synthon-based methodologies for the screening of ultralarge chemical libraries, considering both the computational efficiency and the enrichment afforded by these techniques.
Supplementary Material
Acknowledgments
The authors thank the Spanish Ministerio de Ciencia e Innovación (PID2020-117646RB-I00 MCIN/AEI/10.13039/501100011033; PID2023-147942OB-I00; Maria de Maetzu CEX2021-001202 M) and Generalitat de Catalunya (2021SGR00671) for financial support. B. M. is a recipient of an Industrial PhD fellowship from the Generalitat de Catalunya (2022 DI 112) and the Spanish Ministerio de Ciencia e Innovación (DIN2021-011749).
ExaScreen and exaDock are proprietary licensed software released by Pharmacelera. The crystallographic data used in the alignment validation, the 3D structure of the actives molecules and the protein structures used in the validation of the VS, and the KNIME workflow to perform the cluster analysis are available in our public GitHub repository at https://github.com/Pharmacelera/ Virtual-Screening-on-Ultra-Large-Chemical-Spaces. The synthons library (EnamineREAL, 2024.9) can be obtained from Enamine Ltd. (https://enamine.net/)
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.5c00222.
Example of the restrained energy minimization performed for the synthon-based compounds; RMSD values obtained for the self-alignment and pairwise alignment; examples of high RMSD values reported by exaScreen for self-alignment and pairwise molecular alignments in the test set; selected decoys by exaScreen with good hydrophobic complementarity regarding reference structure in ABL1; hierarchical clustering of actives retrieved at 0.5% obtained with exaScreen and PharmScreen; hierarchical clustering of the actives retrieved at 5.0% (top 5000) for the distinct targets obtained with exaDock and Glide; geometrical and physicochemical descriptors of the binding pockets; statistical analysis between the number of actives simultaneously detected by exaDock and Glide; simulation time required to run Enamine REAL 2024.9 with exaScreen for the molecular systems used in this study; and finally, configuration file for docking simulation using Glide (PDF)
The authors declare the following competing financial interest(s): B.M.-L., A.H., E.H., and J.V. are members of Pharmacelera.
References
- Neumann A., Marrison L., Klein R.. Relevance of the Trillion-Sized Chemical Space “EXplore” as a Source for Drug Discovery. ACS Med. Chem. Lett. 2023;14(4):466–472. doi: 10.1021/acsmedchemlett.3c00021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lyu J., Wang S., Balius T. E., Singh I., Levit A., Moroz Y. S., O’Meara M. J., Che T., Algaa E., Tolmachova K., Tolmachev A. A., Shoichet B. K., Roth B. L., Irwin J. J.. Ultra-Large Library Docking for Discovering New Chemotypes. Nature. 2019;566(7743):224–229. doi: 10.1038/s41586-019-0917-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grygorenko O. O., Radchenko D. S., Dziuba I., Chuprina A., Gubina K. E., Moroz Y. S.. Generating Multibillion Chemical Space of Readily Accessible Screening Compounds. iScience. 2020;23(11):101681. doi: 10.1016/j.isci.2020.101681. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tingle B. I., Irwin J. J.. Large-Scale Docking in the Cloud. J. Chem. Inf. Model. 2023;63(9):2735–2741. doi: 10.1021/acs.jcim.3c00031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gorgulla C., Boeszoermenyi A., Wang Z. F., Fischer P. D., Coote P. W., Padmanabha Das K. M., Malets Y. S., Radchenko D. S., Moroz Y. S., Scott D. A., Fackeldey K., Hoffmann M., Iavniuk I., Wagner G., Arthanari H.. An Open-Source Drug Discovery Platform Enables Ultra-Large Virtual Screens. Nature. 2020;580(7805):663–668. doi: 10.1038/s41586-020-2117-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Warr W. A., Nicklaus M. C., Nicolaou C. A., Rarey M.. Exploration of Ultralarge Compound Collections for Drug Discovery. J. Chem. Inf. Model. 2022;62(9):2021–2034. doi: 10.1021/acs.jcim.2c00224. [DOI] [PubMed] [Google Scholar]
- Bedart C., Simoben C. V., Schapira M.. Emerging Structure-Based Computational Methods to Screen the Exploding Accessible Chemical Space. Curr. Opin. Struct. Biol. 2024;86:102812. doi: 10.1016/j.sbi.2024.102812. [DOI] [PubMed] [Google Scholar]
- Corrêa Veríssimo G., Salgado Ferreira R., Gonçalves Maltarollo V.. Ultra-Large Virtual Screening: Definition, Recent Advances, and Challenges in Drug Design. Mol. Inf. 2025;44(1):e202400305. doi: 10.1002/minf.202400305. [DOI] [PubMed] [Google Scholar]
- Meyenburg C., Dolfus U., Briem H., Rarey M.. Galileo: Three-Dimensional Searching in Large Combinatorial Fragment Spaces on the Example of Pharmacophores. J. Comput.-Aided Mol. Des. 2023;37(1):1–16. doi: 10.1007/s10822-022-00485-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gentile F., Agrawal V., Hsing M., Ton A. T., Ban F., Norinder U., Gleave M. E., Cherkasov A.. Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery. ACS Cent. Sci. 2020;6(6):939–949. doi: 10.1021/acscentsci.0c00229. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Graff D. E., Shakhnovich E. I., Coley C. W.. Accelerating High-Throughput Virtual Screening through Molecular Pool-Based Active Learning. Chem. Sci. 2021;12(22):7866–7881. doi: 10.1039/D0SC06805E. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cavasotto C. N., Di Filippo J. I.. The Impact of Supervised Learning Methods in Ultralarge High-Throughput Docking. J. Chem. Inf. Model. 2023;63(8):2267–2280. doi: 10.1021/acs.jcim.2c01471. [DOI] [PubMed] [Google Scholar]
- Bande A. Y., Baday S.. Accelerating Molecular Docking Using Machine Learning Methods. Mol. Inf. 2024;43(6):e202300167. doi: 10.1002/minf.202300167. [DOI] [PubMed] [Google Scholar]
- Marin E., Kovaleva M., Kadukova M., Mustafin K., Khorn P., Rogachev A., Mishin A., Guskov A., Borshchevskiy V.. Regression-Based Active Learning for Accessible Acceleration of Ultra-Large Library Docking. J. Chem. Inf. Model. 2024;64(7):2612–2623. doi: 10.1021/acs.jcim.3c01661. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Klarich K., Goldman B., Kramer T., Riley P., Walters W. P.. Thompson Sampling–An Efficient Method for Searching Ultralarge Synthesis on Demand Databases. J. Chem. Inf. Model. 2024;64(4):1158–1171. doi: 10.1021/acs.jcim.3c01790. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cheng C., Beroza P.. Shape-Aware Synthon Search (SASS) for Virtual Screening of Synthon-Based Chemical Spaces. J. Chem. Inf. Model. 2024;64(4):1251–1260. doi: 10.1021/acs.jcim.3c01865. [DOI] [PubMed] [Google Scholar]
- Müller J., Klein R., Tarkhanova O., Gryniukova A., Borysko P., Merkl S., Ruf M., Neumann A., Gastreich M., Moroz Y. S., Klebe G., Glinca S.. Magnet for the Needle in Haystack: “Crystal Structure First” Fragment Hits Unlock Active Chemical Matter Using Targeted Exploration of Vast Chemical Spaces. J. Med. Chem. 2022;65(23):15663–15678. doi: 10.1021/acs.jmedchem.2c00813. [DOI] [PubMed] [Google Scholar]
- Sadybekov A. A., Sadybekov A. V., Liu Y., Iliopoulos-Tsoutsouvas C., Huang X. P., Pickett J., Houser B., Patel N., Tran N. K., Tong F., Zvonok N., Jain M. K., Savych O., Radchenko D. S., Nikas S. P., Petasis N. A., Moroz Y. S., Roth B. L., Makriyannis A., Katritch V.. Synthon-Based Ligand Discovery in Virtual Libraries of over 11 Billion Compounds. Nature. 2022;601(7893):452–459. doi: 10.1038/s41586-021-04220-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rarey M., Stahl M.. Similarity Searching in Large Combinatorial Chemistry Spaces. J. Comput.-Aided Mol. Des. 2001;15(6):497–520. doi: 10.1023/A:1011144622059. [DOI] [PubMed] [Google Scholar]
- Liphardt T., Sander T.. Fast Substructure Search in Combinatorial Library Spaces. J. Chem. Inf. Model. 2023;63(16):5133–5141. doi: 10.1021/acs.jcim.3c00290. [DOI] [PubMed] [Google Scholar]
- Schmidt R., Klein R., Rarey M.. Maximum Common Substructure Searching in Combinatorial Make-on-Demand Compound Spaces. J. Chem. Inf. Model. 2022;62(9):2133–2150. doi: 10.1021/acs.jcim.1c00640. [DOI] [PubMed] [Google Scholar]
- Andrews K. M., Cramer R. D.. Toward General Methods of Targeted Library Design: Topomer Shape Similarity Searching with Diverse Structures as Queries. J. Med. Chem. 2000;43(9):1723–1740. doi: 10.1021/jm000003m. [DOI] [PubMed] [Google Scholar]
- Beroza P., Crawford J. J., Ganichkin O., Gendelev L., Harris S. F., Klein R., Miu A., Steinbacher S., Klingler F. M., Lemmen C.. Chemical Space Docking Enables Large-Scale Structure-Based Virtual Screening to Discover ROCK1 Kinase Inhibitors. Nat. Commun. 2022;13(1):1–10. doi: 10.1038/s41467-022-33981-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hönig S. M. N., Flachsenberg F., Ehrt C., Neumann A., Schmidt R., Lemmen C., Rarey M.. SpaceGrow: Efficient Shape-Based Virtual Screening of Billion-Sized Combinatorial Fragment Spaces. J. Comput.-Aided Mol. Des. 2024;38(1):13. doi: 10.1007/s10822-024-00551-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Carlsson J., Luttens A.. Structure-Based Virtual Screening of Vast Chemical Space as a Starting Point for Drug Discovery. Curr. Opin. Struct. Biol. 2024;87:102829. doi: 10.1016/j.sbi.2024.102829. [DOI] [PubMed] [Google Scholar]
- Luque F. J., Barril X., Orozco M.. Fractional Description of Free Energies of Solvation. J. Comput.-Aided Mol. Des. 1999;13(2):139–152. doi: 10.1023/A:1008036526741. [DOI] [PubMed] [Google Scholar]
- Vázquez J., Deplano A., Herrero A., Ginex T., Gibert E., Rabal O., Oyarzabal J., Herrero E., Luque F. J.. Development and Validation of Molecular Overlays Derived from Three-Dimensional Hydrophobic Similarity with PharmScreen. J. Chem. Inf. Model. 2018;58(8):1596–1609. doi: 10.1021/acs.jcim.8b00216. [DOI] [PubMed] [Google Scholar]
- Vázquez J., Ginex T., Herrero A., Morisseau C., Hammock B. D., Luque F. J.. Screening and Biological Evaluation of Soluble Epoxide Hydrolase Inhibitors: Assessing the Role of Hydrophobicity in the Pharmacophore-Guided Search of Novel Hits. J. Chem. Inf. Model. 2023;63(10):3209–3225. doi: 10.1021/acs.jcim.3c00301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cappel D., Dixon S. L., Sherman W., Duan J.. Exploring Conformational Search Protocols for Ligand-Based Virtual Screening and 3-D QSAR Modeling. J. Comput.-Aided Mol. Des. 2015;29(2):165–182. doi: 10.1007/s10822-014-9813-4. [DOI] [PubMed] [Google Scholar]
- Friedrich N. O., Flachsenberg F., Meyder A., Sommer K., Kirchmair J., Rarey M.. Conformator: A Novel Method for the Generation of Conformer Ensembles. J. Chem. Inf. Model. 2019;59(2):731–742. doi: 10.1021/acs.jcim.8b00704. [DOI] [PubMed] [Google Scholar]
- McNutt A. T., Bisiriyu F., Song S., Vyas A., Hutchison G. R., Koes D. R.. Conformer Generation for Structure-Based Drug Design: How Many and How Good? J. Chem. Inf. Model. 2023;63(21):6598–6607. doi: 10.1021/acs.jcim.3c01245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vázquez J., García R., Llinares P., Luque F. J., Herrero E.. On the Relevance of Query Definition in the Performance of 3D Ligand-Based Virtual Screening. J. Comput.-Aided Mol. Des. 2024;38(1):18. doi: 10.1007/s10822-024-00561-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Das S., Merz K. M.. Molecular Gas-Phase Conformational Ensembles. J. Chem. Inf. Model. 2024;64(3):749–760. doi: 10.1021/acs.jcim.3c01309. [DOI] [PubMed] [Google Scholar]
- Lewell X. Q., Judd D. B., Watson S. P., Hann M. M.. RECAP - Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Comput. Sci. 1998;38(3):511–522. doi: 10.1021/ci970429i. [DOI] [PubMed] [Google Scholar]
- Ginex T., Muñoz-Muriedas J., Herrero E., Gibert E., Cozzini P., Luque F. J.. Development and Validation of Hydrophobic Molecular Fields Derived from the Quantum Mechanical IEF/PCM-MST Solvation Models in 3D-QSAR. J. Comput. Chem. 2016;37(13):1147–1162. doi: 10.1002/jcc.24305. [DOI] [PubMed] [Google Scholar]
- Curutchet C., Orozco M., Luque F. J.. Solvation in Octanol: Parametrization of the Continuum MST Model. J. Comput. Chem. 2001;22(11):1180–1193. doi: 10.1002/jcc.1076. [DOI] [Google Scholar]
- Curutchet C., Bidon-Chanal A., Soteras I., Orozco M., Luque F. J.. MST Continuum Study of the Hydration Free Energies of Monovalent Ionic Species. J. Phys. Chem. B. 2005;109(8):3565–3574. doi: 10.1021/jp047197s. [DOI] [PubMed] [Google Scholar]
- Soteras I., Curutchet C., Bidon-Chanal A., Orozco M., Luque F. J.. Extension of the MST Model to the IEF Formalism: HF and B3LYP Parametrizations. J. Mol. Struct.:THEOCHEM. 2005;727(1–3):29–40. doi: 10.1016/j.theochem.2005.02.029. [DOI] [Google Scholar]
- Luque F. J., Bofill J. M., Orozco M.. New Strategies to Incorporate the Solvent Polarization in Self-Consistent Reaction Field and Free-Energy Perturbation Simulations. J. Chem. Phys. 1995;103(23):10183–10191. doi: 10.1063/1.469921. [DOI] [Google Scholar]
- Javier Luque F., Curutchet C., Muñoz-Muriedas J., Bidon-Chanal A., Soteras I., Morreale A., Gelpí J. L., Orozco M.. Continuum Solvation Models: Dissecting the Free Energy of Solvation. Phys. Chem. Chem. Phys. 2003;5:3827–3836. doi: 10.1039/b306954k. [DOI] [Google Scholar]
- Muñoz-Muriedas J., Perspicace S., Bech N., Guccione S., Orozco M., Luque F. J.. Hydrophobic Molecular Similarity from MST Fractional Contributions to the Octanol/Water Partition Coefficient. J. Comput.-Aided Mol. Des. 2005;19(6):401–419. doi: 10.1007/s10822-005-7928-3. [DOI] [PubMed] [Google Scholar]
- Klebe G., Abraham U., Mietzner T.. Molecular Similarity Indices in a Comparative Analysis (CoMSIA) of Drug Molecules to Correlate and Predict Their Biological Activity. J. Med. Chem. 1993;37(24):4130–4146. doi: 10.1021/jm00050a010. [DOI] [PubMed] [Google Scholar]
- Glide Schrödinger Release 2024-3; Schrödinger, LLC: New York, 2024. [Google Scholar]
- Halgren T. A., Murphy R. B., Friesner R. A., Beard H. S., Frye L. L., Pollard W. T., Banks J. L. G.. Glide: New Approach for Rapid, Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening. J. Med. Chem. 2004;47(7):1750–1759. doi: 10.1021/jm030644s. [DOI] [PubMed] [Google Scholar]
- Friesner R. A., Banks J. L., Murphy R. B., Halgren T. A., Klicic J. J., Mainz D. T., Repasky M. P., Knoll E. H., Shelley M., Perry J. K., Shaw D. E., Francis P., Shenkin P. S.. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004;47(7):1739–1749. doi: 10.1021/jm0306430. [DOI] [PubMed] [Google Scholar]
- ChemAxon. “Calculator Plugins Were Used for Structure Property Prediction and Calculation”, Marvin 23.16. 2024.
- RDKit Open-Source Cheminformatics Software, 2025.
- Cleves A. E., Jain A. N.. Structure- And Ligand-Based Virtual Screening on DUD-E+: Performance Dependence on Approximations to the Binding Pocket. J. Chem. Inf. Model. 2020;60(9):4296–4310. doi: 10.1021/acs.jcim.0c00115. [DOI] [PubMed] [Google Scholar]
- Weiss D. R., Bortolato A., Tehan B., Mason J. S.. GPCR-Bench: A Benchmarking Set and Practitioners’ Guide for G Protein-Coupled Receptor Docking. J. Chem. Inf. Model. 2016;56(4):642–651. doi: 10.1021/acs.jcim.5b00660. [DOI] [PubMed] [Google Scholar]
- Berman H. M., Westbrook J., Feng Z., Gilliland G., Bhat T. N., Weissig H., Shindyalov I. N., Bourne P. E.. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Maestro Schrödinger Release 2024-3; Schrödinger, LLC: New York, 2024. [Google Scholar]
- Søndergaard C. R., Olsson M. H. M., Rostkowski M., Jensen J. H.. Improved Treatment of Ligands and Coupling Effects in Empirical Calculation and Rationalization of p K a Values. J. Chem. Theory Comput. 2011;7(7):2284–2295. doi: 10.1021/ct200133y. [DOI] [PubMed] [Google Scholar]
- Olsson M. H. M., Søndergaard C. R., Rostkowski M., Jensen J. H.. PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical p K a Predictions. J. Chem. Theory Comput. 2011;7(2):525–537. doi: 10.1021/ct100578z. [DOI] [PubMed] [Google Scholar]
- Vazquez J., Deplano A., Herrero A., Gibert E., Herrero E., Luque F. J.. Assessing the Performance of Mixed Strategies To Combine Lipophilic Molecular Similarity and Docking in Virtual Screening. J. Chem. Inf. Model. 2020;60(1):4231–4245. doi: 10.1021/acs.jcim.9b01191. [DOI] [PubMed] [Google Scholar]
- Le Guilloux V., Schmidtke P., Tuffery P.. Fpocket: An open source platform for ligand pocket detection. BMC Bioinf. 2009;10:168. doi: 10.1186/1471-2105-10-168. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schmidtke P., Barril X.. Understanding and Predicting Druggability. A High-Throughput Method for Detection of Drug Binding Sites. J. Med. Chem. 2010;53(15):5858–5867. doi: 10.1021/jm100574m. [DOI] [PubMed] [Google Scholar]
- Kellogg G. E., Abraham D. J.. Hydrophobicity Is LogP o/w More than the Sum of Its Parts? Eur. J. Med. Chem. 2000;35(7–8):651–661. doi: 10.1016/S0223-5234(00)00167-7. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
ExaScreen and exaDock are proprietary licensed software released by Pharmacelera. The crystallographic data used in the alignment validation, the 3D structure of the actives molecules and the protein structures used in the validation of the VS, and the KNIME workflow to perform the cluster analysis are available in our public GitHub repository at https://github.com/Pharmacelera/ Virtual-Screening-on-Ultra-Large-Chemical-Spaces. The synthons library (EnamineREAL, 2024.9) can be obtained from Enamine Ltd. (https://enamine.net/)






