Inspiration

Medical prescriptions and healthcare documents are often difficult for patients to understand. Complex medical terminology, abbreviations, and clinical language can create confusion, leading to misunderstandings about medications, treatments, and diagnoses.

We wanted to explore whether a Small Language Model (SLM), trained entirely from scratch, could learn medical language patterns and help bridge the communication gap between healthcare professionals and patients. Instead of relying on large pre-trained models, we challenged ourselves to build a specialized medical language model from the ground up.

What it does

MediGPT translates complex medical prescriptions, diagnoses, and clinical terminology into plain English that patients can easily understand.

For example, instead of presenting:

"Administer antihypertensive therapy with ACE inhibitors while monitoring renal function."

MediGPT can generate:

"Take medication that helps lower your blood pressure. Regular kidney function tests may be required while using this medicine."

The goal is to make healthcare information more accessible, improve patient understanding, and reduce confusion caused by highly technical medical language.

How we built it

Unlike many AI projects that fine-tune existing large models, MediGPT was built completely from scratch.

Dataset Construction

We assembled a training corpus consisting of:

  • 15,000+ medical documents
  • Medical Q&A pairs from MedQuAD
  • Biomedical content from PubMedQA
  • Over 6.5 million tokens of medical text
  • 28,000+ unique medical vocabulary terms
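
Corpus figures like the token and vocabulary counts above can be gathered with a short script. The sketch below is illustrative only: the `corpus_stats` helper is hypothetical, and the lowercase-and-split tokenization is an assumption standing in for whatever tokenizer the project actually used.

```python
from collections import Counter


def corpus_stats(documents):
    """Return (total_tokens, unique_terms) for an iterable of text documents."""
    counts = Counter()
    for doc in documents:
        # Assumption: lowercasing and whitespace splitting approximate the real tokenizer.
        counts.update(doc.lower().split())
    return sum(counts.values()), len(counts)


# Toy documents, not taken from the real dataset.
sample_docs = ["Hypertension and neuropathy", "hypertension treatment"]
total_tokens, unique_terms = corpus_stats(sample_docs)
print(f"tokens={total_tokens} unique={unique_terms}")
```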

Custom Medical Preprocessing

We designed a domain-specific preprocessing pipeline that removes noise such as HTML tags, URLs, and irrelevant artifacts while preserving important medical terminology such as:

  • covid-19
  • anti-inflammatory
  • hypertension
  • neuropathy
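
A minimal sketch of this kind of medical-aware cleaning, using only Python's standard library, might look like the following. The function names, regular expressions, and the set of punctuation that is kept are all assumptions made for illustration, not the project's actual pipeline.

```python
import html
import re

# Match real tags only ("<p>", "</b>"), so text such as "BP < 120" is left alone.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Keep letters, digits, whitespace, basic punctuation, and hyphens.
NOISE_RE = re.compile(r"[^a-z0-9\s.,;:%/()'+-]")
WS_RE = re.compile(r"\s+")
# A token is a run of letters or digits, optionally joined by internal hyphens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def clean_medical_text(raw: str) -> str:
    """Strip HTML tags, URLs, and stray symbols while keeping hyphenated terms."""
    text = html.unescape(raw)       # "&amp;" -> "&", and so on
    text = TAG_RE.sub(" ", text)    # drop HTML tags
    text = URL_RE.sub(" ", text)    # drop URLs
    text = text.lower()
    text = NOISE_RE.sub(" ", text)  # drop remaining symbols
    return WS_RE.sub(" ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split cleaned text into tokens, keeping terms like 'covid-19' whole."""
    return TOKEN_RE.findall(text)


raw = ("<p>COVID-19 patients may need <b>anti-inflammatory</b> drugs. "
       "See https://example.com/info</p>")
cleaned = clean_medical_text(raw)
print(cleaned)
print(tokenize(cleaned))
```

The design point is that hyphens are treated as part of a term rather than as punctuation, so "covid-19" and "anti-inflammatory" survive as single vocabulary entries.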

Model Architecture

  • GPT-style decoder-only transformer
  • 8 Transformer layers
  • 8 attention heads
  • 512-dimensional embeddings
  • 512-token context window
  • 8.5 million parameters
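
To make these hyperparameters concrete, the sketch below collects them in a small config object and estimates weight counts for a standard GPT-style layout. Several values are assumptions that the write-up does not state: the feed-forward width (`d_ff`, set here to the common 4x multiple), the tokenizer vocabulary size (`vocab_size`, a hypothetical placeholder), learned positional embeddings, bias terms on every linear layer, and tied input/output embeddings.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    n_layer: int = 8             # from the write-up
    n_head: int = 8              # from the write-up
    d_model: int = 512           # from the write-up
    block_size: int = 512        # context window, from the write-up
    d_ff: int = 2048             # assumption: 4 * d_model
    vocab_size: int = 8000       # assumption: hypothetical tokenizer size
    tie_embeddings: bool = True  # assumption: output head shares the token embedding


def count_parameters(cfg: GPTConfig) -> dict:
    """Estimate weight counts for a standard decoder-only transformer."""
    d = cfg.d_model
    attention = 4 * (d * d + d)                  # Q, K, V, and output projections
    mlp = (d * cfg.d_ff + cfg.d_ff) + (cfg.d_ff * d + d)
    layer_norms = 2 * 2 * d                      # two LayerNorms, scale and shift each
    per_layer = attention + mlp + layer_norms
    blocks = cfg.n_layer * per_layer
    embeddings = cfg.vocab_size * d + cfg.block_size * d
    final_norm = 2 * d
    head = 0 if cfg.tie_embeddings else cfg.vocab_size * d
    return {
        "per_layer": per_layer,
        "blocks": blocks,
        "embeddings": embeddings,
        "total": blocks + embeddings + final_norm + head,
    }


cfg = GPTConfig()
assert cfg.d_model % cfg.n_head == 0  # each head gets d_model / n_head = 64 dimensions
counts = count_parameters(cfg)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

The total is very sensitive to the assumed values. With a 4x feed-forward width, the eight blocks alone come to about 25.2 million weights, so the reported 8.5 million figure would imply a narrower layout or a different counting convention than the one assumed here.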

Training Pipeline

The model was trained using:

  • PyTorch
  • AdamW optimizer
  • Xavier initialization
  • Cosine annealing learning rate scheduling
  • Linear warmup
  • Weight decay regularization
  • Gradient clipping

The complete training process was implemented from scratch to better understand every component of modern transformer architectures.
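
The warmup and cosine annealing steps combine into a single learning-rate schedule. The sketch below shows one common formulation in plain Python; `max_lr`, `min_lr`, `warmup_steps`, and `total_steps` are illustrative values, not the project's actual settings.

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=500, total_steps=10_000):
    """Linear warmup to max_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Warmup: grow linearly so early updates stay small.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Progress through the decay phase, from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, 499, 5250, 10_000):
    print(step, f"{lr_at_step(step):.6f}")
```

In a PyTorch training loop, a function like this would typically be evaluated once per step and written into each of the optimizer's parameter groups before calling `optimizer.step()`.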

Challenges we ran into

Medical Data Quality

Medical datasets contain highly specialized terminology and inconsistent formatting. Standard preprocessing often destroys important clinical terms, so we built a custom medical-aware cleaning pipeline.

Training Stability

Training transformers from scratch is challenging. We faced unstable gradients, slow convergence, and overfitting. Xavier initialization and gradient clipping kept gradient magnitudes under control, linear warmup followed by cosine annealing smoothed convergence, and weight decay helped limit overfitting.
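
Two of the stabilizers used in training, Xavier initialization and gradient clipping, are simple enough to write out in a few lines. The pure-Python sketch below is for illustration only: the helper names are hypothetical, and a PyTorch project would more likely call `torch.nn.init.xavier_uniform_` and `torch.nn.utils.clip_grad_norm_`, which apply the same ideas to tensors.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


rng = random.Random(0)
weights = xavier_uniform(512, 512, rng)  # limit is sqrt(6 / 1024), about 0.0765
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Scaling the initial weights by layer width keeps activation variance roughly constant across layers, while clipping caps the size of any single update when a batch produces an unusually large gradient.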

Domain-Specific Learning

Teaching a small model to understand complex medical concepts while remaining computationally efficient required careful balancing of architecture size, dataset quality, and training configuration.

Accomplishments that we're proud of

  • Built a medical language model completely from scratch
  • Trained on over 6.5 million medical tokens
  • Achieved a validation perplexity of 9.84
  • Trained a model that learned meaningful biomedical terminology and clinical reasoning patterns
  • Generated coherent medical explanations and descriptions of how conditions relate to one another
  • Demonstrated that specialized small models can perform effectively without massive computational resources
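
For context on the perplexity figure, perplexity is the exponential of the mean per-token cross-entropy loss measured in nats. A validation perplexity of 9.84 therefore corresponds to a mean loss of ln(9.84), roughly 2.29 nats per token. The sketch below shows the calculation; the `perplexity` helper and the sample loss values are illustrative, not taken from the project's evaluation code.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Illustrative only: four tokens that each have a loss of ln(9.84).
losses = [math.log(9.84)] * 4
print(f"mean loss = {sum(losses) / len(losses):.3f} nats")
print(f"perplexity = {perplexity(losses):.2f}")
```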

What we learned

This project taught us that a domain-specific Small Language Model can reach strong results, in our case a validation perplexity of 9.84, when trained on carefully curated data.

We gained hands-on experience with:

  • Transformer architecture
  • Self-attention mechanisms
  • Tokenization and vocabulary design
  • Language model training
  • Learning rate scheduling
  • Medical NLP pipelines
  • Model evaluation and text generation

Most importantly, we learned that building AI systems from scratch provides a much deeper understanding than simply fine-tuning existing models.

What's next for MediGPT

Future improvements include:

  • Prescription-specific fine-tuning
  • Medical report simplification
  • Drug interaction explanations
  • Clinical note summarization
  • Multilingual support
  • Integration with healthcare applications
  • Expansion to 50M+ parameter medical models

Our long-term vision is to make medical information understandable and accessible for everyone, regardless of their medical background.
