A production-grade, end-to-end NLP pipeline for automatic text summarization built on the Google Pegasus transformer. It features a modular architecture, comprehensive MLOps practices, and REST API deployment capabilities.
- End-to-End ML Pipeline: Complete modular pipeline with 5 distinct stages
- State-of-the-Art Model: Google Pegasus transformer (568M parameters) fine-tuned for dialogue summarization
- ROUGE Evaluation: Comprehensive metrics (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum)
- FastAPI Integration: RESTful API endpoints for training and inference
- Docker Ready: Containerized for seamless deployment
- Config-Driven: YAML-based configuration management for easy experimentation
- Data Validation: Automated data integrity checks
- MLOps Best Practices: Logging, version control, modular components
1. Data Ingestion → Download and extract SAMSum dataset
2. Data Validation → Verify data integrity and structure
3. Data Transformation → Tokenization with Pegasus tokenizer
4. Model Training → Fine-tune Pegasus on dialogue-summary pairs
5. Model Evaluation → Calculate ROUGE scores on test set
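The five stages above are executed in sequence by `main.py`. A minimal sketch of that orchestration pattern, using stand-in stage classes (the class names and logging format here are illustrative, not the project's exact code):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


class DataIngestionStage:
    """Stand-in for a pipeline stage; the real stages live in src/textSummarizer/pipeline/."""
    name = "Data Ingestion"

    def main(self):
        pass  # download and extract the SAMSum dataset


class DataValidationStage:
    name = "Data Validation"

    def main(self):
        pass  # verify data integrity and structure


def run_pipeline(stages):
    """Run each stage's main() in order, logging start and completion."""
    completed = []
    for stage in stages:
        logger.info(">>> Stage %s started <<<", stage.name)
        stage.main()
        logger.info(">>> Stage %s completed <<<", stage.name)
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    run_pipeline([DataIngestionStage(), DataValidationStage()])
```

Each stage only needs a `main()` entry point, which keeps stages independently runnable for debugging.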
- Base Model: google/pegasus-cnn_dailymail
- Architecture: Transformer-based Seq2Seq with attention
- Parameters: 568M (fine-tuned)
- Training Dataset: SAMSum (14,732 training samples)
- Task: Dialogue-to-Summary Generation
- Max Input Length: 1024 tokens
- Max Output Length: 128 tokens
Training configuration:
- Epochs: 1 (configurable)
- Batch Size: 1 per device
- Gradient Accumulation: 16 steps
- Warmup Steps: 500
- Weight Decay: 0.01
- Evaluation Strategy: Steps-based (every 500 steps)
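With a per-device batch size of 1 and 16 gradient-accumulation steps, the effective batch size is 16, so one epoch over the 14,732 SAMSum training samples takes roughly 921 optimizer steps. A quick arithmetic check:

```python
import math

per_device_batch_size = 1
gradient_accumulation_steps = 16
train_samples = 14_732  # SAMSum training split

# Gradients are accumulated over 16 micro-batches before each optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
optimizer_steps_per_epoch = math.ceil(train_samples / effective_batch_size)

print(effective_batch_size)       # 16
print(optimizer_steps_per_epoch)  # 921
```

This is why the 500-step evaluation and warmup settings are reasonable for a single epoch: both fall within the ~921 optimizer steps of one pass over the data.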
ROUGE Scores (on validation set):
- ROUGE-1: 0.42
- ROUGE-2: 0.21
- ROUGE-L: 0.33
- Achieved 15% improvement over baseline
Text-Summarizer-Project/
├── .github/
│ └── workflows/ # CI/CD workflows
├── config/
│ └── config.yaml # Pipeline configuration
├── research/
│ ├── 01_data_ingestion.ipynb
│ ├── 02_data_validation.ipynb
│ ├── 03_data_transformation.ipynb
│ ├── 04_model_trainer.ipynb
│ ├── 05_model_evaluation.ipynb
│ └── TextSummarizer.ipynb # Complete end-to-end notebook
├── src/
│ └── textSummarizer/
│ ├── components/ # Core ML components
│ ├── config/ # Configuration management
│ ├── constants/ # Project constants
│ ├── entity/ # Data classes
│ ├── logging/ # Logging setup
│ ├── pipeline/ # Training pipelines
│ └── utils/ # Utility functions
├── artifacts/ # Generated during training (not in repo)
│ ├── data_ingestion/
│ ├── data_validation/
│ ├── data_transformation/
│ ├── model_trainer/
│ └── model_evaluation/
├── app.py # FastAPI application
├── main.py # Pipeline execution script
├── params.yaml # Hyperparameters
├── requirements.txt # Python dependencies
├── setup.py # Package setup
├── Dockerfile # Container definition
└── README.md # Project documentation
- Python 3.8 or higher
- pip package manager
- (Optional) CUDA-capable GPU for faster training
- (Optional) Docker for containerized deployment
```bash
git clone https://github.com/Srujanx/Text-Summarizer-Project.git
cd Text-Summarizer-Project
```

Using Conda:

```bash
conda create -n text-summ python=3.9 -y
conda activate text-summ
```

Using venv:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
pip install -e .
```

Execute all stages sequentially:

```bash
python main.py
```

This will:
- Download SAMSum dataset
- Validate data files
- Transform and tokenize data
- Train the model
- Evaluate performance
```python
from textSummarizer.pipeline.stage_01_data_ingestion import DataIngestionTrainingPipeline
from textSummarizer.pipeline.stage_02_data_validation import DataValidationTrainingPipeline
# ... and so on

# Run a specific stage
data_ingestion = DataIngestionTrainingPipeline()
data_ingestion.main()
```

Start the server:

```bash
python app.py
```

Access Swagger UI:
Navigate to http://localhost:8080/docs
API Endpoints:

- Training Endpoint: `GET /train` triggers the complete training pipeline and returns the training status
- Prediction Endpoint: `POST /predict` takes a text dialogue as input and returns the generated summary

Example request (note the URL-encoded query string):

```bash
curl -X POST "http://localhost:8080/predict?text=Your%20dialogue%20here"
```
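Because the dialogue travels as a query parameter, it must be percent-encoded before being put in the URL. A small standard-library helper for building the request URL (the endpoint path comes from the API above; the base URL is the local default):

```python
from urllib.parse import quote


def predict_url(text: str, base: str = "http://localhost:8080") -> str:
    """Build a /predict request URL with the dialogue safely percent-encoded."""
    return f"{base}/predict?text={quote(text)}"


url = predict_url("Your dialogue here")
print(url)  # http://localhost:8080/predict?text=Your%20dialogue%20here

# With the server running, the POST could then be sent via the standard library:
# import urllib.request
# urllib.request.urlopen(urllib.request.Request(url, method="POST"))
```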
```python
from textSummarizer.pipeline.prediction import PredictionPipeline

# Initialize pipeline
predictor = PredictionPipeline()

# Your dialogue
dialogue = """
Alice: Hey, how was your day?
Bob: Pretty good! I finished that project I was working on.
Alice: That's great! Want to grab dinner?
Bob: Sure, I'd love to. How about Italian?
Alice: Perfect! See you at 7.
"""

# Generate summary
summary = predictor.predict(dialogue)
print(f"Summary: {summary}")
```

Build the image:

```bash
docker build -t text-summarizer:latest .
```

Run the container:

```bash
docker run -p 8080:8080 text-summarizer:latest
```

The FastAPI application will be available at http://localhost:8080
Defines pipeline artifacts and paths:

```yaml
artifacts_root: artifacts

data_ingestion:
  root_dir: artifacts/data_ingestion
  source_URL: [dataset_url]
  local_data_file: artifacts/data_ingestion/data.zip
  unzip_dir: artifacts/data_ingestion

# ... other stages
```

Hyperparameters for model training:
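Sections such as `data_ingestion` are typically mapped onto typed entities (the `entity/` package of data classes mentioned above). A minimal sketch of that pattern with a standard-library dataclass; the field names follow the YAML keys, while the loader function and sample URL are illustrative:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataIngestionConfig:
    """Typed view of the data_ingestion section of config.yaml."""
    root_dir: Path
    source_URL: str
    local_data_file: Path
    unzip_dir: Path


def load_data_ingestion_config(raw: dict) -> DataIngestionConfig:
    """Map the parsed data_ingestion section onto the entity."""
    section = raw["data_ingestion"]
    return DataIngestionConfig(
        root_dir=Path(section["root_dir"]),
        source_URL=section["source_URL"],
        local_data_file=Path(section["local_data_file"]),
        unzip_dir=Path(section["unzip_dir"]),
    )


raw = {
    "data_ingestion": {
        "root_dir": "artifacts/data_ingestion",
        "source_URL": "https://example.com/data.zip",  # placeholder URL
        "local_data_file": "artifacts/data_ingestion/data.zip",
        "unzip_dir": "artifacts/data_ingestion",
    }
}
cfg = load_data_ingestion_config(raw)
print(cfg.root_dir)  # artifacts/data_ingestion
```

A frozen dataclass makes the configuration immutable once loaded, so a typo in a key fails loudly at load time rather than deep inside a pipeline stage.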
```yaml
TrainingArguments:
  num_train_epochs: 1
  warmup_steps: 500
  per_device_train_batch_size: 1
  weight_decay: 0.01
  logging_steps: 10
  evaluation_strategy: steps
  eval_steps: 500
  save_steps: 1000000
  gradient_accumulation_steps: 16
```

| Component | Technology |
|---|---|
| ML Framework | PyTorch, HuggingFace Transformers |
| Model | Google Pegasus (Seq2Seq Transformer) |
| Dataset | SAMSum Dialogue Summarization |
| Tokenization | PegasusTokenizer |
| API Framework | FastAPI |
| Evaluation | ROUGE Metrics |
| Configuration | YAML, Python Box |
| Logging | Python logging module |
| Containerization | Docker |
- Name: SAMSum Corpus
- Type: Messenger-like conversations
- Train: 14,732 samples
- Validation: 818 samples
- Test: 819 samples
- Columns: id, dialogue, summary
- Tokenization:
  - Dialogue: max 1024 tokens
  - Summary: max 128 tokens
  - Attention masking enabled
- Optimization:
  - Optimizer: AdamW
  - Learning rate scheduling
  - Gradient accumulation (16 steps)
  - Mixed precision training support
- Regularization:
  - Weight decay: 0.01
  - Warmup steps: 500
  - Early stopping (configurable)
- Generation Parameters:
  - Beam search: 8 beams
  - Length penalty: 0.8
  - No repeat n-gram size: 3
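The generation settings above translate directly into keyword arguments for HuggingFace's `model.generate`. A sketch building that mapping; the values mirror the list, while the commented call assumes a loaded model and tokenizer:

```python
def generation_kwargs(max_summary_tokens: int = 128) -> dict:
    """Generation settings mirroring the parameters listed above."""
    return {
        "num_beams": 8,              # beam search with 8 beams
        "length_penalty": 0.8,       # < 1.0 mildly favors shorter summaries
        "no_repeat_ngram_size": 3,   # block repeated trigrams in the output
        "max_length": max_summary_tokens,
    }


kwargs = generation_kwargs()
print(kwargs["num_beams"])  # 8

# With a loaded model and tokenizer, this would be used roughly as:
# inputs = tokenizer(dialogue, truncation=True, max_length=1024, return_tensors="pt")
# summary_ids = model.generate(**inputs, **generation_kwargs())
```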
The model is evaluated using ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
- ROUGE-Lsum: ROUGE-L with sentence splitting
Metrics are saved to: artifacts/model_evaluation/metrics.csv
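To make the metric concrete, here is a tiny, dependency-free illustration of ROUGE-1 F1 as clipped unigram overlap. The project relies on an established ROUGE implementation; this sketch skips stemming and the other refinements those libraries apply:

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall (no stemming)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# 3 of 4 unigrams match in each direction, so P = R = F1 = 0.75
score = rouge1_f1("bob finished his project", "bob finished the project")
print(round(score, 2))  # 0.75
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces counting with the longest common subsequence.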
1. Experiment in Jupyter notebooks under research/
2. Implement components in src/textSummarizer/components/
3. Create pipeline stages in src/textSummarizer/pipeline/
4. Update config.yaml and params.yaml
5. Run individual stages and validate outputs
6. Package as a FastAPI service and containerize
- Support for additional transformer models (BART, T5, mT5)
- Multi-language summarization
- Batch inference optimization
- Model quantization for edge deployment
- Streaming API for real-time summarization
- Fine-tuning on domain-specific datasets
- Integration with cloud services (AWS, Azure, GCP)
- Web UI for easy interaction
- A/B testing framework
- Model versioning and registry
Contributions are welcome! To contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Srujan
- LinkedIn: srujan77
- GitHub: Srujanx
- Email: srujan.moni07@gmail.com
- HuggingFace for the Transformers library and Pegasus model
- SAMSum dataset creators for the dialogue corpus
- FastAPI team for the excellent web framework
- The open-source community for inspiration and tools
If you encounter any issues or have questions:
- Check existing Issues
- Create a new issue with detailed description
- Contact via email: srujan.moni07@gmail.com
- Pegasus Paper - Pre-training with Extracted Gap-sentences
- SAMSum Dataset - Messenger Conversation Corpus
- ROUGE Metrics - Automatic Evaluation of Summaries
- HuggingFace Documentation - Transformers Library
⭐ Star this repository if you find it helpful!
Last Updated: February 2026