Inspiration

The "needle in a haystack" nature of drug discovery is the primary bottleneck in modern medicine. We were inspired by the challenge of predicting Drug-Target Interaction (DTI): how strongly a given molecule will bind to a given protein target. In a world where bringing a single drug to market can take a decade and billions of dollars, we wanted to build a computational "sieve" that could accurately prioritize the most promising candidates before a single wet-lab experiment is ever conducted.

What it does

Artificial Enzymes is a high-performance machine learning pipeline designed to predict the binding affinity (pIC50) between drugs and proteins. By processing raw SMILES strings and protein sequences, our system provides a quantitative estimate of how strongly a drug will interact with its target. Our unique approach utilizes a Hybrid Ensemble Architecture that combines the structural pattern recognition of Gradient Boosted Trees with the continuous "interaction physics" of a Deep Neural Network.
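At inference time, a hybrid ensemble like this reduces to a weighted blend of the two members' predictions. Here is a minimal sketch in NumPy; the prediction arrays and the 50/50 blend weight are illustrative assumptions, since the write-up does not state the actual weighting:

```python
import numpy as np

# Hypothetical pIC50 predictions from the two ensemble members on a
# small validation batch. In the real pipeline these would come from the
# trained XGBoost model and the PyTorch MLP.
xgb_preds = np.array([6.1, 7.3, 5.2])
mlp_preds = np.array([6.5, 7.1, 5.6])

def blend(preds_a, preds_b, w=0.5):
    """Weighted average of two regressors' outputs (w = weight on preds_a)."""
    return w * preds_a + (1.0 - w) * preds_b

ensemble_preds = blend(xgb_preds, mlp_preds)
```

In practice the blend weight would be tuned on a held-out fold rather than fixed at 0.5.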

How we built it

Our technical stack was chosen for both speed and generalization:

- Feature Engineering: We transformed raw SMILES into 2048-bit Morgan Fingerprints and augmented them with 10 RDKit physicochemical descriptors (LogP, TPSA, Molecular Weight, etc.) to give the model "physical intuition".
- Protein Embeddings: We utilized pre-trained ProtT5-XL-UniRef50 embeddings to capture the complex evolutionary and structural information of protein targets in a 1024-dimensional space.
- The Ensemble: We built a dual-model system:
  - XGBoost: Optimized with a histogram-based tree method for fast structural memorization.
  - PyTorch MLP: A deep architecture with separate drug and protein encoders linked by a non-linear interaction head.
- Data Transformation: We implemented a QuantileTransformer to map our skewed pIC50 distribution to a normal curve, which significantly reduced error at high-affinity extremes.
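The drug featurization step can be sketched with RDKit. This is a minimal illustration rather than our production code: it computes the 2048-bit Morgan fingerprint and three of the ten descriptors named above (the remaining seven follow the same pattern):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

def featurize_smiles(smiles: str):
    """2048-bit Morgan fingerprint concatenated with physicochemical descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable SMILES
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    fp_arr = np.zeros((2048,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, fp_arr)
    # Three of the ten descriptors used in the pipeline (LogP, TPSA, MolWt).
    desc = np.array([Descriptors.MolLogP(mol),
                     Descriptors.TPSA(mol),
                     Descriptors.MolWt(mol)], dtype=np.float32)
    return np.concatenate([fp_arr, desc])

features = featurize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Each drug is thus a fixed-length vector (2048 fingerprint bits + descriptors), ready to be concatenated with the 1024-dimensional ProtT5 protein embedding.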

Challenges we ran into

The biggest hurdle was the "Cold Target" problem. Early in development, our models achieved high accuracy on known proteins but dropped to an $R^2$ of nearly 0.08 when facing entirely unseen targets. We also fought significant infrastructure battles: bypassing local Windows admin restrictions for Python environments and managing the 12.7 GB RAM limit in Google Colab while processing 331,000 high-dimensional rows.

Accomplishments that we're proud of

We are incredibly proud of our ensemble's generalization jump. By moving from a standalone XGBoost model to our MLP hybrid, we successfully "forced" the model to learn protein-drug interactions rather than just memorizing labels. Additionally, integrating RDKit descriptors into our SMILES pipeline allowed the model to maintain accuracy even when structural fingerprints were sparse.

What we learned

We learned that "Data transformation is 2000x more important than the model." No matter how deep our Neural Network was, it only began to perform once we properly scaled our protein embeddings and normalized our target values. We also gained deep experience in Group K-Fold validation, learning that a model's performance on "seen" data is often an illusion that crumbles without a proper "Cold Split" strategy.
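Both lessons above fit in a few lines of scikit-learn: GroupKFold keyed on protein ID produces the "Cold Split" (no protein ever appears on both sides of a fold), and QuantileTransformer maps the skewed targets onto a normal curve. The toy arrays and group assignments here are illustrative stand-ins for the real 331,000-row dataset:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Toy stand-ins: 12 drug-protein pairs with 4 features each.
X = rng.normal(size=(12, 4))
y = rng.exponential(scale=2.0, size=(12, 1))          # skewed, like raw labels
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # protein IDs

# Normalize the skewed target distribution before training.
qt = QuantileTransformer(n_quantiles=12, output_distribution="normal")
y_norm = qt.fit_transform(y)

# Cold split: group by protein so each target is wholly in train OR validation.
gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y_norm, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```

Evaluating on the held-out groups is what exposes the "Cold Target" gap that a random split hides.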

What's next for Artificial Enzymes

The next frontier for this project is 3D Spatial Integration. While ProtT5 embeddings are powerful, they are 1D sequences; we want to incorporate 3D atomic coordinates from PDBBind to model the physical pocket fit. We also plan to expand our training data by merging the ChEMBL and BindingDB datasets to improve our "Full Cold" prediction accuracy for rare orphan receptors.
