Inspiration

Drug discovery is an incredibly costly and time-consuming process, and early-stage toxicity testing often relies on animal experiments. We were inspired by the idea of reducing this dependency by predicting toxicity directly from molecular structure using machine learning.

If we can accurately identify harmful compounds before lab testing, we can help accelerate drug development while making the process more ethical and efficient.

What it does

We built a machine learning pipeline that predicts log(LD50), a standard measure of acute toxicity, directly from molecular structure.

Given a molecule’s SMILES representation, our model outputs a toxicity prediction where:

Lower values → more toxic
Higher values → less toxic
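As a small illustration of the scale, the sketch below converts a raw LD50 dose into a log value. The base-10 logarithm and the mg/kg units are assumptions made for the example; the writeup does not state which base or units the dataset uses, and `log_ld50` is a hypothetical helper name.

```python
import math


def log_ld50(ld50_mg_per_kg: float) -> float:
    """Return log10 of an LD50 dose (base and units are assumed here)."""
    if ld50_mg_per_kg <= 0:
        raise ValueError("LD50 must be positive")
    return math.log10(ld50_mg_per_kg)


# A smaller lethal dose gives a lower log value, i.e. a more toxic compound.
potent = log_ld50(5.0)     # hypothetical compound lethal at 5 mg/kg
mild = log_ld50(5000.0)    # hypothetical compound lethal at 5000 mg/kg
assert potent < mild
```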

Beyond prediction, our goal was to make the model interpretable, so we could understand why certain molecules are toxic.

How we built it

Molecular Representation

We converted SMILES strings into numerical features using:

Morgan Fingerprints (2048 bits): capture local structural patterns in the molecule
Molecular Descriptors (19 features): include interpretable properties such as molecular weight, ring count, and lipophilicity

Each molecule is represented as a 2067-dimensional feature vector.
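The featurization step could look roughly like the sketch below, which uses RDKit. Several details are assumptions: the fingerprint radius of 2 is not stated in the writeup, `featurize` is a hypothetical name, and only three example descriptors are computed instead of the full set of 19, so the vector here has 2051 entries, not 2067.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, rdFingerprintGenerator, rdMolDescriptors

# Radius 2 is an assumption; the writeup only gives the 2048-bit size.
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def featurize(smiles: str) -> np.ndarray:
    """Concatenate Morgan fingerprint bits with a few interpretable descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    fingerprint = _MORGAN.GetFingerprintAsNumPy(mol).astype(np.float64)
    # Three illustrative descriptors; the project used 19 (not listed in the writeup).
    descriptors = np.array(
        [
            Descriptors.MolWt(mol),               # molecular weight
            rdMolDescriptors.CalcNumRings(mol),   # ring count
            Crippen.MolLogP(mol),                 # lipophilicity estimate
        ],
        dtype=np.float64,
    )
    return np.concatenate([fingerprint, descriptors])


x = featurize("CCO")  # ethanol
print(x.shape)        # 2048 fingerprint bits + 3 descriptors
```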

We trained and compared multiple models:

Ridge Regression: linear baseline
Random Forest: captures non-linear relationships
XGBoost: gradient boosting, the strongest individual model
Weighted Ensemble: combines Random Forest and XGBoost for the best performance
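One way a weighted ensemble of two regressors can work is sketched below. The writeup does not say how the weights were chosen, so the grid search over validation error is an assumption, as are the helper names (`blend`, `mse`, `best_weight`) and the made-up numbers standing in for model predictions.

```python
from typing import Sequence


def blend(pred_a: Sequence[float], pred_b: Sequence[float], weight_a: float) -> list[float]:
    """Weighted average of two models' predictions."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def best_weight(y_val, pred_a, pred_b, steps: int = 20) -> float:
    """Grid-search the weight on model A that minimises validation MSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mse(y_val, blend(pred_a, pred_b, w)))


# Made-up validation targets and predictions, for illustration only.
y_val = [2.0, 3.0, 4.0]
rf_val = [2.3, 3.3, 4.3]    # stand-in for Random Forest output (reads high)
xgb_val = [1.9, 2.9, 3.9]   # stand-in for XGBoost output (reads low)

w = best_weight(y_val, rf_val, xgb_val)
ensemble_val = blend(rf_val, xgb_val, w)
```

Because the two stand-in models err in opposite directions, the blended prediction has a lower validation error than either one alone, which is the usual motivation for combining models.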

Challenges we ran into

Imbalanced data: very few non-toxic examples, which made them harder to predict
Feature dimensionality: handling 2000+ features without overfitting
Model selection: balancing performance against interpretability

Accomplishments that we're proud of

Matched state-of-the-art performance (R² = 0.65) on toxicity prediction
Built an end-to-end pipeline from raw SMILES to predictions
Combined models to achieve stronger, more stable results
Generated interpretable insights aligned with real chemical knowledge
Identified structural patterns driving toxicity (e.g., aromatic rings)
Balanced performance with explainability rather than settling for a black-box model
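For readers unfamiliar with the metric, R² is the fraction of variance in the true values that the predictions explain: 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. The sketch below computes it from scratch; `r_squared` is a hypothetical helper, and the evaluation split behind the 0.65 figure is not described in this writeup.

```python
def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    if ss_tot == 0:
        raise ValueError("R² is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Predicting the mean for every molecule explains none of the variance.
baseline_score = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```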

What we learned

A major focus of our project was interpretability.

Molecular size is the strongest predictor of toxicity
Aromatic rings are consistently associated with higher toxicity
The model highlights specific toxic substructures within molecules

This suggests that our model is not just accurate; it aligns with known chemical behaviour and provides actionable insights.
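The writeup does not name the interpretability method used, so the following is only one possible approach: permutation importance, a model-agnostic technique that shuffles one feature at a time and measures how much the error grows. The toy model, data, and function names are all hypothetical stand-ins.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y_true, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(y_true, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy stand-in model that depends only on feature 0 (think "molecular weight").
def toy_model(row):
    return 0.01 * row[0]


rows = [[100.0, 1.0], [200.0, 0.0], [300.0, 2.0], [400.0, 1.0], [500.0, 3.0], [600.0, 0.0]]
y = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, y)
```

A feature the model ignores gets an importance of zero, while a feature it relies on gets a positive score, which is how a ranking such as "molecular size matters most" can be read off a trained model.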

What's next for Lethal Dose - PharmaTech

Apply Graph Neural Networks (GNNs) to better capture molecular structure
Incorporate 3D molecular features
Improve handling of imbalanced data

Our work shows that machine learning can:

Reduce reliance on animal testing
Accelerate drug discovery
Provide interpretable insights for safer molecule design

We didn’t just build a model; we built a tool to understand toxicity.
