atharshlakshmi/clean-talk
Clean Talk: Guardrail API

A safety guardrail system that classifies prompts and evaluates policy compliance. Given a prompt, it first determines whether the prompt is safe or a jailbreak attempt, then retrieves relevant policies and judges whether the content complies with safety and usage guidelines, reducing off-topic and harmful prompts.

Possible use cases:

  1. Preventing misuse of AI chatbots by filtering prompts before they reach the LLM.
  2. Moderating chatbot behaviour by filtering prompts against chosen policies.

Table of Contents

  • Overview
  • Model Performance
  • Features
  • Setup
  • Usage
  • Docker
  • Project Structure
  • API Documentation
  • Tools & Technologies
  • Remarks

Overview

Clean Talk is a two-stage safety evaluation system:

  1. Classifier (DistilBERT) - Classifies prompts into 6 categories:

    • safe - Safe and benign prompts
    • adversarial_harmful - Adversarial attacks attempting to cause harm
    • vanilla_harmful - Directly harmful prompts without adversarial framing
    • adversarial_benign - Adversarial attacks on benign topics
    • unsafe - Unsafe prompts
    • vanilla_benign - Benign prompts without adversarial intent

    Trained on a combination of two datasets: nvidia/Aegis-AI-Content-Safety-Dataset-2.0 and allenai/wildjailbreak.

  2. Judge LLM (RAG + Gemini) - Uses RAG with Pinecone vector database to retrieve relevant policies and evaluate compliance via Google Gemini.
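The two-stage flow can be sketched as a simple pipeline. This is an illustrative sketch only: the stage functions below are stubs standing in for the real components (the DistilBERT classifier and the Pinecone + Gemini judge), the routing logic is one plausible wiring, and all names here are hypothetical.

```python
# Illustrative two-stage guardrail flow. The stage functions are stubs
# standing in for the real components; names are hypothetical.
from dataclasses import dataclass

# Labels treated as benign enough to pass to stage 2 (an assumption).
SAFE_LABELS = {"safe", "vanilla_benign"}

@dataclass
class GuardrailResult:
    classification: str
    decision: str          # "ALLOWED" or "BLOCKED"
    reason: str

def classify_stub(prompt: str) -> str:
    """Stand-in for the DistilBERT classifier (stage 1)."""
    return "adversarial_harmful" if "ignore all rules" in prompt.lower() else "safe"

def judge_stub(prompt: str) -> tuple[str, str]:
    """Stand-in for the RAG policy judge (stage 2)."""
    return "NOT A VIOLATION", "No stored policy applies."

def guardrail(prompt: str) -> GuardrailResult:
    label = classify_stub(prompt)
    if label not in SAFE_LABELS:
        # Flagged prompts are rejected before ever reaching the judge.
        return GuardrailResult(label, "BLOCKED", f"classifier flagged: {label}")
    decision, explanation = judge_stub(prompt)
    verdict = "BLOCKED" if decision == "VIOLATION" else "ALLOWED"
    return GuardrailResult(label, verdict, explanation)
```

Running the classifier first keeps the cheap DistilBERT check in front of the more expensive retrieval-plus-LLM stage.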

Model Performance

Training Results

The DistilBERT classifier achieved the following training results:

[Figure: Training Results - Loss Curves and Metric Performance]
  • Final Accuracy: 82% on validation set
  • Final F1 Score: 82%
  • Training Loss: Converged from 0.84 → 0.47
  • Learning Rate: 5e-05, Epochs: 5

Confusion Matrix

The model performs consistently across all 6 safety categories:

[Figure: Confusion Matrix - Clean-Talk Guardrail]
  • Strong diagonal values indicate high accuracy per class
  • Clear distinction between safety categories
  • Few misclassifications across categories

Features

  • ✅ Real-time prompt classification using DistilBERT
  • ✅ Confidence scores for predictions
  • ✅ FastAPI backend for easy integration
  • ✅ Streamlit web interface with policy management
  • ✅ RAG-based policy evaluation using Pinecone & Google Gemini
  • ✅ Dynamic custom policy creation and storage
  • ✅ Docker support

Setup

1. Clone and Create Virtual Environment

python -m venv venv
source venv/bin/activate  

2. Set Up Environment Variables

Copy .env.example to .env and fill in the required values.

3. Install Dependencies

pip install -r requirements.txt

4. Prepare the Model

The model is automatically downloaded from Hugging Face Hub on first run and cached locally.

To train your own model: Run notebooks/02_training.ipynb. You can either upload to Hugging Face and update the model reference in src/core/classifier.py, or store it locally in models/best_model.pt.

5. Set up Pinecone RAG

Run the notebook notebooks/03_setup_rag.ipynb to initialize Pinecone and seed initial policies (you can also add policies directly through the Streamlit interface).

Usage

Running the API

Start the FastAPI backend on http://localhost:8000:

python src/api/main.py

Interactive API docs (Swagger UI) are available at http://localhost:8000/docs

Alternatively, using uvicorn directly:

uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000

Running the Streamlit App

In a new terminal (ensure API is running), start the Streamlit frontend:

streamlit run src/app.py

The app will be available at: http://localhost:8501

Features:

  • Prompt Classification - Enter a prompt to get instant safety classification
  • Sample Prompts - Choose from pre-loaded test prompts to explore the system
  • Policy Management - Add new safety policies via the sidebar
  • Current Policies - View all stored policies
  • Policy Compliance Check - Automatically validates prompts against stored policies using RAG

Streamlit Cloud Deployment

For deployment to Streamlit Cloud without a separate backend:

streamlit run src/streamlit.py

streamlit.py combines the FastAPI backend and Streamlit frontend into a single file. Deploy by pushing to GitHub and selecting src/streamlit.py as the main file in Streamlit Cloud.

Docker

Running with Docker

docker-compose up --build

Starts backend API on http://localhost:8000 and frontend on http://localhost:8501.

Docker Files:

  • Dockerfile.api - Backend FastAPI container
  • Dockerfile.app - Frontend Streamlit container
  • docker-compose.yml - Orchestrates both services

Google Cloud Deployment

You can also deploy directly to Google Cloud Run by connecting the GitHub repository.

Project Structure

clean-talk/
├── Dockerfile.api             # Backend API container configuration
├── Dockerfile.app             # Frontend Streamlit container configuration
├── docker-compose.yml         # Docker Compose orchestration (runs both services)
├── .dockerignore              # Files excluded from Docker containers
│
├── README.md                  # This file
├── requirements.txt           # Python dependencies            
│
├── models/
│   └── best_model.pt          # Trained model checkpoint
│
├── notebooks/
│   ├── 01_data_exploration.ipynb   # EDA and data analysis
│   ├── 02_training.ipynb           # Model training
│   └── 03_setup_rag.ipynb          # Pinecone DB setup
│
├── reports/
│   ├── experiment_logs/       # Training logs and metrics
│   ├── diagrams/              # Training metrics visualisation
│   └── api_log.csv            # Log of all API requests 
│
└── src/
    ├── api/
    │   └── main.py            # FastAPI application
    ├── core/
    │   ├── classifier.py      # DistilBERT model inference
    │   ├── features.py        # Feature engineering utilities
    │   └── safety_rag.py      # RAG pipeline with Pinecone & Gemini
    ├── app.py                 # Streamlit frontend with separate backend
    ├── streamlit.py           # Streamlit with integrated API (for Cloud deployment)
    └── utils/
        ├── api_logger.py      # API request/response logging
        ├── logger.py          # Training logger
        └── helper.py          # Utility functions

API Documentation

Endpoints

GET /

Health check endpoint.

Response:

{
  "message": "Prompt Classification API is running"
}

POST /classify

Classify a prompt using the DistilBERT model and return safety classification with confidence score.

Request:

{
  "prompt": "string"
}

Response:

{
  "prompt": "string",
  "classification": "string (one of: safe, adversarial_harmful, vanilla_harmful, adversarial_benign, unsafe, vanilla_benign)",
  "confidence": "float (0.0 to 1.0)"
}
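A minimal client for this endpoint can be written with the standard library alone. This sketch assumes the API is running locally on port 8000 as described under "Running the API"; the helper names are hypothetical.

```python
# Minimal stdlib client for POST /classify. Assumes the FastAPI backend
# is running on http://localhost:8000; helper names are hypothetical.
import json
import urllib.request

def build_classify_request(prompt: str,
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /classify request with a JSON body."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify(prompt: str) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_classify_request(prompt)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    result = classify("How do I reset my password?")
    print(result["classification"], result["confidence"])
```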

POST /policy_check

Evaluate prompt compliance against stored policies using RAG (Retrieval-Augmented Generation).

Request:

{
  "prompt": "string"
}

Response:

{
  "decision": "string (VIOLATION or NOT A VIOLATION)",
  "policy": "string (the relevant policy that was checked)",
  "response_to_user": "string (detailed explanation)"
}
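Downstream code can branch on the documented decision field. An illustrative handler (the helper name and sample values are hypothetical; the response schema matches the one documented above):

```python
# Illustrative handling of a /policy_check response. The schema matches
# the documentation above; handle_policy_result is a hypothetical helper.
def handle_policy_result(result: dict) -> str:
    """Turn a policy-check verdict into a user-facing message."""
    if result["decision"] == "VIOLATION":
        # Surface the judge's explanation rather than echoing the prompt.
        return f"Blocked by policy: {result['response_to_user']}"
    return "No policy violation detected."

sample = {
    "decision": "VIOLATION",
    "policy": "No financial advice",
    "response_to_user": "The prompt requests regulated financial advice.",
}
print(handle_policy_result(sample))
```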

POST /add_policy

Add a new safety policy to the Pinecone vector database.

Request:

{
  "policy": "string"
}

Response:

{
  "id": "string",
  "text": "string",
  "status": "uploaded"
}

Tools & Technologies

Core Libraries

  • PyTorch - Deep learning framework
  • Transformers - Hugging Face model library
  • DistilBERT - Fast, lightweight BERT model
  • FastAPI - Modern web framework
  • Streamlit - Interactive web app framework
  • Uvicorn - ASGI server for FastAPI

Data & ML

  • Pandas - Data manipulation
  • Scikit-learn - ML utilities
  • Matplotlib & Seaborn - Visualization
  • Datasets - Hugging Face datasets library for loading training data

RAG & LLM

  • Pinecone - Vector database for policy storage and retrieval
  • Sentence-Transformers (all-MiniLM-L6-v2) - Embedding generation for semantic search
  • Google Gemini - LLM for policy evaluation and judgment
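Retrieval in the RAG stage boils down to embedding the prompt and ranking stored policies by vector similarity. A toy sketch of that idea, with hand-made 3-d vectors standing in for real all-MiniLM-L6-v2 embeddings (policy texts and vectors here are invented for the example):

```python
# Toy cosine-similarity retrieval, illustrating what the Pinecone policy
# store does. Real embeddings come from all-MiniLM-L6-v2; the 3-d
# vectors below are made up for the example.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical policy store: policy text -> embedding.
policies = {
    "No medical advice": [0.9, 0.1, 0.0],
    "No financial advice": [0.1, 0.9, 0.1],
}

def retrieve(query_vec: list[float]) -> str:
    """Return the stored policy most similar to the query embedding."""
    return max(policies, key=lambda p: cosine(policies[p], query_vec))
```

The retrieved policy is then passed to Gemini along with the prompt for the compliance judgment.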

Remarks

This is an exploratory project; its aim is to learn tools and frameworks commonly used in AI projects.
