In this project I implemented a Transformer-based text classification pipeline from scratch in Rust, with no external machine learning libraries. It classifies textual inputs into predefined categories using standard Natural Language Processing (NLP) techniques. It is currently built for a spam detection use case, but it can easily be adapted for sentiment analysis, topic categorization, content moderation, and similar applications. Each module has its own detailed README file. Below is a high-level summary of the modules, followed by a comprehensive description of the configuration settings.
- Custom Transformer Model: A fully implemented Transformer architecture optimized for text classification tasks.
- End-to-End Pipeline: Includes modules for tokenization, training, evaluation, and inference.
- Scalability: Designed for extensibility, enabling support for more advanced features or larger datasets.
This module generates positional encodings to inject sequential information into token embeddings. The encodings are calculated using sine and cosine functions.
- Purpose: Provides position-based context to the model.
- Read Full Documentation
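For reference, the standard sinusoidal scheme pairs a sine and a cosine at each frequency, so nearby positions receive similar encodings and distant ones diverge. A minimal sketch of that formula (the function name and `Vec`-based shapes are illustrative, not necessarily this crate's API):

```rust
/// Sinusoidal positional encodings:
/// PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
/// PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
fn positional_encoding(seq_len: usize, d_model: usize) -> Vec<Vec<f32>> {
    (0..seq_len)
        .map(|pos| {
            (0..d_model)
                .map(|i| {
                    // Each even/odd dimension pair shares one frequency.
                    let freq = 1.0 / 10000f32.powf((2 * (i / 2)) as f32 / d_model as f32);
                    let angle = pos as f32 * freq;
                    if i % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect()
        })
        .collect()
}
```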
Implements the scaled dot-product attention and multi-head attention mechanisms, which are critical for enabling the Transformer to focus on relevant parts of the input sequence.
- Purpose: Allows the model to weigh the importance of different tokens dynamically.
- Read Full Documentation
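The underlying operation is `Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V`. A single-head sketch, assuming row-major `Vec<Vec<f32>>` matrices (names and the matrix representation are assumptions, not this module's actual API):

```rust
/// Scaled dot-product attention for one head.
/// q, k: (seq_len, d_k); v: (seq_len, d_v); returns (seq_len, d_v).
fn scaled_dot_product_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let scale = (q[0].len() as f32).sqrt();
    q.iter()
        .map(|q_row| {
            // Similarity of this query with every key, scaled by sqrt(d_k).
            let scores: Vec<f32> = k
                .iter()
                .map(|k_row| q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f32>() / scale)
                .collect();
            // Numerically stable softmax over the scores.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            // Attention-weighted sum of the value rows.
            let mut out = vec![0.0f32; v[0].len()];
            for (w, v_row) in exps.iter().zip(v) {
                for (o, val) in out.iter_mut().zip(v_row) {
                    *o += (w / sum) * val;
                }
            }
            out
        })
        .collect()
}
```

Multi-head attention runs this routine on several learned projections of the input in parallel and concatenates the results.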
Defines a feed-forward neural network used within the Transformer layers. This module processes attention outputs with non-linearity for feature transformation.
- Purpose: Applies dense layers with activation functions.
- Read Full Documentation
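The standard position-wise form is `FFN(x) = max(0, x·W1 + b1)·W2 + b2`. A per-token sketch (the weight layout is an assumption, not necessarily how this module stores its parameters):

```rust
/// Two dense layers with a ReLU in between, applied to one token vector.
/// w1: (d_ff, d_model), b1: (d_ff); w2: (d_model, d_ff), b2: (d_model).
fn feed_forward(x: &[f32], w1: &[Vec<f32>], b1: &[f32], w2: &[Vec<f32>], b2: &[f32]) -> Vec<f32> {
    // Hidden layer: ReLU(W1 · x + b1).
    let hidden: Vec<f32> = w1
        .iter()
        .zip(b1)
        .map(|(row, b)| (row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>() + b).max(0.0))
        .collect();
    // Output layer: W2 · hidden + b2.
    w2.iter()
        .zip(b2)
        .map(|(row, b)| row.iter().zip(&hidden).map(|(w, h)| w * h).sum::<f32>() + b)
        .collect()
}
```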
Applies layer normalization to stabilize the training process by normalizing inputs within each layer.
- Purpose: Mitigates internal covariate shift during training.
- Read Full Documentation
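Conceptually, each vector is shifted to zero mean and unit variance, then rescaled and shifted by learned parameters. A sketch (the `gamma`/`beta` names are the conventional ones, not confirmed from this module):

```rust
/// Normalize one vector, then apply learned scale (gamma) and shift (beta).
fn layer_norm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| g * (v - mean) / (var + eps).sqrt() + b)
        .collect()
}
```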
Combines attention, feed-forward, and normalization mechanisms to form the building blocks of the Transformer model.
- Purpose: Represents a single Transformer encoder layer.
- Read Full Documentation
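In the original post-norm arrangement, each sub-layer (attention, then feed-forward) is wrapped in a residual connection followed by layer normalization: `x → LayerNorm(x + Sublayer(x))`. A sketch of that shared add-and-norm step (the helper name is illustrative):

```rust
/// Residual connection + normalization, applied after each sub-layer:
/// output = norm(x + sublayer_out).
fn add_and_norm(x: &[f32], sublayer_out: &[f32], norm: impl Fn(&[f32]) -> Vec<f32>) -> Vec<f32> {
    let summed: Vec<f32> = x.iter().zip(sublayer_out).map(|(a, b)| a + b).collect();
    norm(&summed)
}
```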
Handles token embedding and integrates positional encodings into the input sequence representation.
- Purpose: Converts tokens into dense vectors with positional context.
- Read Full Documentation
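Conceptually, each token id indexes a learned embedding table, and the positional encoding row for the token's position is added element-wise. A sketch (names and `Vec`-based shapes are assumptions):

```rust
/// table: (vocab_size, d_model); pos_enc: (max_seq_len, d_model).
fn embed(token_ids: &[usize], table: &[Vec<f32>], pos_enc: &[Vec<f32>]) -> Vec<Vec<f32>> {
    token_ids
        .iter()
        .enumerate()
        .map(|(pos, &id)| {
            // Embedding lookup plus the matching positional encoding row.
            table[id].iter().zip(&pos_enc[pos]).map(|(e, p)| e + p).collect()
        })
        .collect()
}
```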
Integrates multiple encoder layers and provides a complete Transformer architecture for text classification.
- Purpose: Acts as the core model architecture.
- Read Full Documentation
Implements cross-entropy loss for multi-class classification tasks, providing both loss computation and gradient calculations.
- Purpose: Guides the optimization process by computing classification errors.
- Read Full Documentation
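For one sample with logits `z` and true class `y`, the loss is `-log softmax(z)[y]` and the gradient with respect to the logits is `softmax(z) - one_hot(y)`. A sketch (the signature is illustrative):

```rust
/// Returns (loss, gradient w.r.t. logits) for a single sample.
fn cross_entropy(logits: &[f32], target: usize) -> (f32, Vec<f32>) {
    // Numerically stable softmax.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();
    let loss = -probs[target].ln();
    // dL/dz_i = softmax(z)_i - [i == target]
    let grad = probs
        .iter()
        .enumerate()
        .map(|(i, p)| p - if i == target { 1.0 } else { 0.0 })
        .collect();
    (loss, grad)
}
```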
Implements optimization algorithms like Stochastic Gradient Descent (SGD) and Adam for updating model parameters during training.
- Purpose: Efficiently minimizes the loss function.
- Read Full Documentation
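For reference, one Adam step keeps exponential moving averages of the gradient and its square, bias-corrects them, and scales the update. A per-tensor sketch using the default hyperparameters listed in the configuration section below (the signature is illustrative):

```rust
/// One Adam update over a flat parameter slice.
/// m, v: running first/second moment estimates; t: 1-based step counter.
fn adam_step(
    params: &mut [f32], grads: &[f32],
    m: &mut [f32], v: &mut [f32], t: i32,
    lr: f32, beta1: f32, beta2: f32, eps: f32, // e.g. 0.001, 0.9, 0.999, 1e-8
) {
    for i in 0..params.len() {
        m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i];
        v[i] = beta2 * v[i] + (1.0 - beta2) * grads[i] * grads[i];
        // Bias-corrected moment estimates.
        let m_hat = m[i] / (1.0 - beta1.powi(t));
        let v_hat = v[i] / (1.0 - beta2.powi(t));
        params[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}
```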
Facilitates the loading, batching, and preprocessing of datasets for training and evaluation.
- Purpose: Manages dataset handling for input to the model.
- Read Full Documentation
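A typical preprocessing step pads or truncates every sequence to `MAX_SEQ_LENGTH` and chunks the result into batches of `BATCH_SIZE`. A sketch (the function name and token-id type are assumptions):

```rust
/// Pad/truncate each token-id sequence to max_len, then batch.
fn make_batches(seqs: Vec<Vec<u32>>, pad_id: u32, max_len: usize, batch: usize) -> Vec<Vec<Vec<u32>>> {
    let padded: Vec<Vec<u32>> = seqs
        .into_iter()
        .map(|mut s| {
            s.truncate(max_len);       // clip long sequences
            s.resize(max_len, pad_id); // pad short ones with [PAD]'s id
            s
        })
        .collect();
    padded.chunks(batch).map(|c| c.to_vec()).collect()
}
```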
Handles the training loop, including forward and backward passes, loss computation, parameter updates, and metric tracking.
- Purpose: Automates the training process for multiple epochs.
- Read Full Documentation
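The loop's overall shape, with the per-batch forward/backward/update step abstracted behind a closure (a simplification of the actual `Trainer` interface):

```rust
/// Run `epochs` passes over the batches; `step` performs the forward
/// pass, backward pass, and parameter update, returning the batch loss.
fn train<F>(epochs: usize, batches: &[(Vec<Vec<u32>>, Vec<usize>)], mut step: F)
where
    F: FnMut(&[Vec<u32>], &[usize]) -> f32,
{
    for epoch in 0..epochs {
        let mut total_loss = 0.0;
        for (inputs, labels) in batches {
            total_loss += step(inputs, labels);
        }
        println!("epoch {}: mean loss {:.4}", epoch + 1, total_loss / batches.len() as f32);
    }
}
```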
Evaluates the model on a test dataset, computing metrics such as accuracy, precision, recall, and F1-score.
- Purpose: Validates model performance on unseen data.
- Read Full Documentation
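As an example of the metric computations, binary precision, recall, and F1 follow from the true-positive, false-positive, and false-negative counts. A sketch treating class 1 (e.g. spam) as the positive class:

```rust
/// (precision, recall, F1) for binary predictions with positive class 1.
fn precision_recall_f1(preds: &[usize], labels: &[usize]) -> (f32, f32, f32) {
    let (mut tp, mut fp, mut fneg) = (0f32, 0f32, 0f32);
    for (&p, &l) in preds.iter().zip(labels) {
        match (p, l) {
            (1, 1) => tp += 1.0,
            (1, 0) => fp += 1.0,
            (0, 1) => fneg += 1.0,
            _ => {}
        }
    }
    // .max(1.0) guards against 0/0 when a count is empty.
    let precision = tp / (tp + fp).max(1.0);
    let recall = tp / (tp + fneg).max(1.0);
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```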
Runs predictions on new inputs, providing both class labels and probability distributions for each prediction.
- Purpose: Enables practical usage of the trained model.
- Read Full Documentation
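The last step is picking the arg-max class from the probability distribution (e.g. a softmax over the model's logits, as in the loss sketch above). A sketch (the function name is illustrative):

```rust
/// Return the most probable class label along with the full distribution.
fn predict(probs: Vec<f32>) -> (usize, Vec<f32>) {
    let label = probs
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(i, _)| i)
        .unwrap_or(0);
    (label, probs)
}
```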
The configuration settings are defined in the config.rs file and are crucial for controlling model behavior, training dynamics, and tokenization. Below are the key parameters:
- MAX_SEQ_LENGTH: Maximum length of input sequences (default: 128).
- PAD_TOKEN: Padding token ([PAD]) used to ensure uniform sequence lengths.
- UNK_TOKEN: Unknown token ([UNK]) for handling out-of-vocabulary words.
- CLS_TOKEN: Classification token ([CLS]) added at the start of each input sequence.
- SEP_TOKEN: Separator token ([SEP]) added between sentence pairs.
- BATCH_SIZE: Number of samples processed simultaneously during training (default: 32).
- LEARNING_RATE: Learning rate for the optimizer (default: 0.001).
- BETA1: Beta1 parameter for the Adam optimizer (default: 0.9).
- BETA2: Beta2 parameter for the Adam optimizer (default: 0.999).
- EPSILON: Small constant for numerical stability in Adam updates (default: 1e-8).
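Put together, the settings above might look like this in config.rs (a sketch; the real file may organize or type them differently):

```rust
pub const MAX_SEQ_LENGTH: usize = 128;
pub const PAD_TOKEN: &str = "[PAD]";
pub const UNK_TOKEN: &str = "[UNK]";
pub const CLS_TOKEN: &str = "[CLS]";
pub const SEP_TOKEN: &str = "[SEP]";

pub const BATCH_SIZE: usize = 32;
pub const LEARNING_RATE: f32 = 0.001;
pub const BETA1: f32 = 0.9;
pub const BETA2: f32 = 0.999;
pub const EPSILON: f32 = 1e-8;
```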
The end-to-end pipeline proceeds in four stages:

- Tokenization: Preprocesses text into tokenized and padded sequences.
- Model Training: Uses the `Trainer` module to train the Transformer model on the training dataset.
- Evaluation: Validates the model's performance using the `Evaluator` module.
- Inference: Deploys the trained model for predictions using the `Inference` module.
- Rust: Ensure Rust is installed. Install it using:

  ```bash
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  ```

- Clone the repository:

  ```bash
  git clone <repository_url>
  cd transformer_project
  ```

- Build the project:

  ```bash
  cargo build
  ```

- Run the project:

  ```bash
  cargo run
  ```
- Automatic epoch selection: Update the training module to determine the number of epochs automatically, either within the module itself or later during integration.
- Visualization Tools: Add visualization modules.
- Pre-trained Models: Incorporate pre-trained weights for fine-tuning.
- Tests: Fix the existing tests and add more coverage.