Inspiration

Medical prescriptions and healthcare documents are often difficult for patients to understand. Complex terminology, abbreviations, and clinical phrasing can lead to misunderstandings about medications, treatments, and diagnoses.

We wanted to explore whether a Small Language Model (SLM), trained entirely from scratch, could learn medical language patterns and help bridge the communication gap between healthcare professionals and patients. Instead of relying on large pre-trained models, we challenged ourselves to build a specialized medical language model from the ground up.

What it does

MediGPT translates complex medical prescriptions, diagnoses, and clinical terminology into plain English that patients can easily understand.

For example, instead of presenting:

"Administer antihypertensive therapy with ACE inhibitors while monitoring renal function."

MediGPT can generate:

"Take medication that helps lower your blood pressure. Regular kidney function tests may be required while using this medicine."

The goal is to make healthcare information more accessible, improve patient understanding, and reduce confusion caused by highly technical medical language.

How we built it

Unlike many AI projects that fine-tune existing large models, MediGPT was built completely from scratch.

Dataset Construction

15,000+ medical documents
Medical Q&A pairs from MedQuAD
Biomedical content from PubMedQA
Over 6.5 million medical tokens
28,000+ unique medical vocabulary terms

Custom Medical Preprocessing

We designed a domain-specific preprocessing pipeline that preserves important medical terminology such as:

covid-19
anti-inflammatory
hypertension
neuropathy

while removing noise such as HTML tags, URLs, and irrelevant artifacts.
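
The cleaning step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the regular expressions, the lowercasing step, the function name `clean_medical_text`, and the sample sentence are all assumptions made for the example.

```python
import html
import re

# Hypothetical patterns; the project's real cleaning rules are not given in the writeup.
TAG_RE = re.compile(r"<[^>]+>")                  # HTML tags such as <p> or <br/>
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # bare URLs
# Keep letters, digits, whitespace, hyphens, and basic punctuation; drop the rest.
NOISE_RE = re.compile(r"[^a-z0-9\s\-.,;:%/()]")
SPACE_RE = re.compile(r"\s+")


def clean_medical_text(raw: str) -> str:
    """Strip markup and URLs while leaving hyphenated medical terms intact."""
    text = html.unescape(raw)        # decode entities such as &amp; or &nbsp;
    text = TAG_RE.sub(" ", text)     # remove tags before URLs so a URL cannot swallow a tag
    text = URL_RE.sub(" ", text)
    text = text.lower()              # assumed: the example terms above are lowercase
    text = NOISE_RE.sub(" ", text)   # hyphens survive, so "covid-19" stays one term
    return SPACE_RE.sub(" ", text).strip()


sample = "<p>Patients with COVID-19 may need anti-inflammatory drugs. See https://example.org/info</p>"
cleaned = clean_medical_text(sample)
print(cleaned)
```

The key design choice is that the hyphen is on the allow-list, so terms like covid-19 and anti-inflammatory are not split into separate fragments before tokenization.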

Model Architecture

GPT-style decoder-only transformer
8 Transformer layers
8 attention heads
512-dimensional embeddings
512-token context window
8.5 million parameters
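
To make the architecture list concrete, here is a small sketch that stores the listed hyperparameters in a config object and estimates weight counts for a standard GPT-style block. The class `GPTConfig`, the function `count_weights`, the 4x feed-forward multiplier, the learned position embeddings, and the vocabulary size are assumptions for illustration. The writeup does not state the feed-forward width or vocabulary size, and the totals depend heavily on both, so the numbers printed here are illustrative and need not match the reported parameter count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    # Values taken from the architecture list above.
    n_layers: int = 8
    n_heads: int = 8
    d_model: int = 512
    context_len: int = 512
    # Assumptions, not stated in the writeup.
    ffn_mult: int = 4         # feed-forward width as a multiple of d_model
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size, used only as an example

    @property
    def head_dim(self) -> int:
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads


def count_weights(cfg: GPTConfig) -> dict:
    """Count weight-matrix entries only (biases and LayerNorm are ignored)."""
    d = cfg.d_model
    attention = 4 * d * d                      # Q, K, V, and output projections
    feed_forward = 2 * d * (cfg.ffn_mult * d)  # up- and down-projection
    return {
        "per_block": attention + feed_forward,
        "all_blocks": cfg.n_layers * (attention + feed_forward),
        "token_embedding": cfg.vocab_size * d,
        "position_embedding": cfg.context_len * d,  # assumes learned positions
    }


cfg = GPTConfig()
print("head_dim:", cfg.head_dim)
for name, value in count_weights(cfg).items():
    print(f"{name}: {value:,}")
```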

Training Pipeline

The model was trained using:

PyTorch
AdamW optimizer
Xavier initialization
Cosine annealing learning rate scheduling
Linear warmup
Weight decay regularization
Gradient clipping

The complete training process was implemented from scratch to better understand every component of modern transformer architectures.
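
The combination of linear warmup and cosine annealing listed above can be written as one function of the step number. The sketch below is framework-free and uses hypothetical hyperparameters (peak rate, floor rate, warmup length, total steps); the values actually used for MediGPT are not given in the writeup.

```python
import math


def lr_at_step(step: int, max_lr: float, min_lr: float,
               warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine


# Hypothetical settings, chosen only to show the shape of the schedule.
MAX_LR, MIN_LR, WARMUP, TOTAL = 3e-4, 3e-5, 100, 1000
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s, MAX_LR, MIN_LR, WARMUP, TOTAL))
```

In a PyTorch training loop, one way to apply this is to write the returned value into each entry of `optimizer.param_groups` under the `"lr"` key before calling `optimizer.step()`.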

Challenges we ran into

Medical Data Quality

Medical datasets contain highly specialized terminology and inconsistent formatting. Standard preprocessing, such as stripping all punctuation and digits, often destroys clinical terms like covid-19 or anti-inflammatory, so we built a custom medical-aware cleaning pipeline.

Training Stability

Training transformers from scratch is challenging. We faced unstable gradients, slow convergence, and overfitting. Xavier initialization, learning-rate warmup, and cosine scheduling significantly improved stability, while gradient clipping and weight decay from our training pipeline targeted gradient spikes and overfitting respectively.
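
As an illustration of why Xavier initialization helps, the sketch below samples a weight matrix from the Xavier (Glorot) uniform distribution, whose bound is chosen so that the weight variance is 2 / (fan_in + fan_out). It is a plain-Python stand-in for what `torch.nn.init.xavier_uniform_` does with its default gain; the function name `xavier_uniform` and the 512 x 512 shape are assumptions for the example.

```python
import math
import random


def xavier_uniform(fan_in: int, fan_out: int, seed: int = 0) -> list:
    """Sample a fan_out x fan_in matrix from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


# Example shape: a square projection at the model width listed earlier.
weights = xavier_uniform(fan_in=512, fan_out=512)
flat = [w for row in weights for w in row]

bound = math.sqrt(6.0 / (512 + 512))
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print("bound:", bound)
print("sample variance:", variance, "target:", 2.0 / (512 + 512))
```

Keeping the variance tied to the layer's fan-in and fan-out keeps activations and gradients at a similar scale from layer to layer, which is the property that matters for early training stability.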

Domain-Specific Learning

Teaching a small model to understand complex medical concepts while remaining computationally efficient required careful balancing of architecture size, dataset quality, and training configuration.

Accomplishments that we're proud of

Built a medical language model completely from scratch
Trained on over 6.5 million medical tokens
Achieved validation perplexity of 9.84
Learned meaningful biomedical terminology and clinical reasoning patterns
Generated coherent medical explanations and condition relationships
Demonstrated that specialized small models can perform effectively without massive computational resources
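
For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token cross-entropy loss. The sketch below assumes the loss is measured in nats (natural logarithm), which is the usual convention in PyTorch; under that assumption a perplexity of 9.84 corresponds to a mean validation loss of about 2.29. The function name `perplexity` is illustrative.

```python
import math


def perplexity(token_nlls: list) -> float:
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Convert the reported perplexity back to a mean loss, then round-trip it.
mean_loss = math.log(9.84)
print("mean loss (nats):", mean_loss)
print("perplexity:", perplexity([mean_loss] * 4))
```

Intuitively, a perplexity near 10 means the model is, on average, about as uncertain as if it were choosing uniformly among ten candidate tokens at each position.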

What we learned

This project taught us that a domain-specific Small Language Model can reach useful quality (a validation perplexity of 9.84 in our case) when trained on carefully curated data.

We gained hands-on experience with:

Transformer architecture
Self-attention mechanisms
Tokenization and vocabulary design
Language model training
Learning rate scheduling
Medical NLP pipelines
Model evaluation and text generation

Most importantly, we learned that building AI systems from scratch provides a much deeper understanding than simply fine-tuning existing models.

What's next for MediGPT

Future improvements include:

Prescription-specific fine-tuning
Medical report simplification
Drug interaction explanations
Clinical note summarization
Multilingual support
Integration with healthcare applications
Expansion to 50M+ parameter medical models

Our long-term vision is to make medical information understandable and accessible for everyone, regardless of their medical background.

Built With

adamw
architecture
attention
bpe
deep
generative
gpt
language
learning
machine
mechanism
medical
models
natural
natural-language-processing
numpy
pandas
processing
python
pytorch
small
tiktoken
tokenization
transformers

Updates

Rishabh Shenoy started this project — Jun 03, 2026 11:36 AM EDT
