A safety guardrail system that classifies prompts and evaluates policy compliance. Takes a prompt and determines if it's safe or attempting a jailbreak, then retrieves relevant policies and judges whether the content complies with safety or usage guidelines to reduce off-topic prompts.
Possible use cases:
- Preventing misuse of AI chatbots by limiting filtering prompts sent to the LLM.
- Moderating ChatBot behaviour by filtering prompts using chosen policies.
- Overview
- Model Performance
- Features
- Setup
- Usage
- Docker & Deployment
- Project Structure
- API Documentation
- Tools & Technologies
- Remarks
Clean Talk is a two-stage safety evaluation system:
-
Classifier (DistilBERT) - Classifies prompts into 6 categories:
safe- Safe and benign promptsadversarial_harmful- Adversarial attacks attempting to cause harmvanilla_harmful- Directly harmful prompts without adversarial framingadversarial_benign- Adversarial attacks on benign topicsunsafe- Unsafe promptsvanilla_benign- Benign prompts without adversarial intent
Trained on a combination of these 2 datasets: nvidia/Aegis-AI-Content-Safety-Dataset-2.0 & allenai/wildjailbreak
-
Judge LLM (RAG + Gemini) - Uses RAG with Pinecone vector database to retrieve relevant policies and evaluate compliance via Google Gemini.
The DistilBERT classifier was trained with strong performance metrics:
- Final Accuracy: 82% on validation set
- Final F1 Score: 82%
- Training Loss: Converged from 0.84 → 0.47
- Learning Rate: 5e-05, Max Epoch: 5
The model shows excellent performance across all 6 safety categories:
- Strong diagonal values indicate high accuracy per class
- Clear distinction between safety categories
- Few misclassifications across categories
- ✅ Real-time prompt classification using DistilBERT
- ✅ Confidence scores for predictions
- ✅ FastAPI backend for easy integration
- ✅ Streamlit web interface with policy management
- ✅ RAG-based policy evaluation using Pinecone & Google Gemini
- ✅ Dynamic custom policy creation and storage
- ✅ Docker support
python -m venv venv
source venv/bin/activate Copy .env.example to .env and fill in the given variables.
pip install -r requirements.txtThe model is automatically downloaded from Hugging Face Hub on first run and cached locally.
To train your own model:
Run notebooks/02_training.ipynb. You can either upload to Hugging Face and update the model reference in src/core/classifier.py, or store it locally in models/best_model.pt.
Run the notebook notebooks/03_setup_rag.ipynb to initialize Pinecone and seed initial policies (you can also add policies directly through the Streamlit interface).
Start the FastAPI backend on http://localhost:8000:
python src/api/main.pyThe API will be available at: http://localhost:8000/docs (Swagger UI)
Alternatively, using uvicorn directly:
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000In a new terminal (ensure API is running), start the Streamlit frontend:
streamlit run src/app.pyThe app will be available at: http://localhost:8501
Features:
- Prompt Classification - Enter a prompt to get instant safety classification
- Sample Prompts - Choose from pre-loaded test prompts to explore the system
- Policy Management - Add new safety policies via the sidebar
- Current Policies - View all stored policies
- Policy Compliance Check - Automatically validates prompts against stored policies using RAG
For deployment to Streamlit Cloud without a separate backend:
streamlit run src/streamlit.pystreamlit.py combines the FastAPI backend and Streamlit frontend into a single file. Deploy by pushing to GitHub and selecting src/streamlit.py as the main file in Streamlit Cloud.
docker-compose up --buildStarts backend API on http://localhost:8000 and frontend on http://localhost:8501.
Docker Files:
Dockerfile.api- Backend FastAPI containerDockerfile.app- Frontend Streamlit containerdocker-compose.yml- Orchestrates both services
You can also deploy directly to Google Cloud Run by connecting the GitHub repository.
clean-talk/
├── Dockerfile.api # Backend API container configuration
├── Dockerfile.app # Frontend Streamlit container configuration
├── docker-compose.yml # Docker Compose orchestration (runs both services)
├── .dockerignore # Files excluded from Docker containers
│
├── README.md # This file
├── requirements.txt # Python dependencies
│
├── models/
│ └── best_model.pt # Trained model checkpoint
│
├── notebooks/
│ ├── 01_data_exploration.ipynb # EDA and data analysis
│ ├── 02_training.ipynb # Model training
│ └── 03_setup_rag.ipynb # Pinecone DB set up
│
├── reports/
│ ├── experiment_logs/ # Training logs and metrics
│ ├── diagrams/ # Training metrics visualisation
│ └── api_log.csv # Log of all API requests
│
└── src/
├── api/
│ └── main.py # FastAPI application
├── core/
│ ├── classifier.py # DistilBERT model inference
│ ├── features.py # Feature engineering utilities
│ └── safety_rag.py # RAG pipeline with Pinecone & Gemini
├── app.py # Streamlit frontend with separate backend
├── streamlit.py # Streamlit with integrated API (for Cloud deployment)
└── utils/
├── api_logger.py # API request/response logging
├── logger.py # Training logger
└── helper.py # Utility functions
Health check endpoint.
Response:
{
"message": "Prompt Classification API is running"
}Classify a prompt using the DistilBERT model and return safety classification with confidence score.
Request:
{
"prompt": "string"
}Response:
{
"prompt": "string",
"classification": "string (one of: safe, adversarial_harmful, vanilla_harmful, adversarial_benign, unsafe, vanilla_benign)",
"confidence": "float (0.0 to 1.0)"
}Evaluate prompt compliance against stored policies using RAG (Retrieval-Augmented Generation).
Request:
{
"prompt": "string"
}Response:
{
"decision": "string (VIOLATION or NOT A VIOLATION)",
"policy": "string (the relevant policy that was checked)",
"response_to_user": "string (detailed explanation)"
}Add a new safety policy to the Pinecone vector database.
Request:
{
"policy": "string"
}Response:
{
"id": "string",
"text": "string",
"status": "uploaded"
}- PyTorch - Deep learning framework
- Transformers - Hugging Face model library
- DistilBERT - Fast, lightweight BERT model
- FastAPI - Modern web framework
- Streamlit - Interactive web app framework
- Uvicorn - ASGI server for FastAPI
- Pandas - Data manipulation
- Scikit-learn - ML utilities
- Matplotlib & Seaborn - Visualization
- Datasets - Hugging Face datasets library for loading training data
- Pinecone - Vector database for policy storage and retrieval
- Sentence-Transformers (all-MiniLM-L6-v2) - Embedding generation for semantic search
- Google Gemini - LLM for policy evaluation and judgment
This is an exploratory project. The aim of this project is to learn tools and frameworks that are commonly used in AI projects.

