The `data_loader` module provides functionality for parsing datasets into tokenized and padded inputs, along with corresponding labels, for use in Transformer-based models. It supports multiple dataset formats (e.g., JSON and CSV), making it suitable for a variety of text classification tasks. The module integrates with the tokenizer to preprocess raw text into numerical sequences and handles batching to prepare data for efficient training and testing.
In machine learning pipelines, data preprocessing is critical to ensure raw datasets are converted into model-compatible formats. The `data_loader` module facilitates this by:
- Parsing raw datasets from supported file formats
- Tokenizing and padding text data using a tokenizer
- Converting labels into numerical representations for classification
- Preparing data batches for training and evaluation
This module bridges the gap between raw datasets and model training by automating preprocessing, reducing manual intervention.
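For orientation, end-to-end usage might look like the sketch below. Note that `load_dataset` and its parameters are hypothetical stand-ins for illustration, not the module's confirmed API.

```python
# Hypothetical usage sketch -- `load_dataset` and its parameters are
# assumptions, not the actual data_loader API.
from data_loader import load_dataset

batches = load_dataset(
    "emails.csv",        # raw dataset in CSV or JSON format
    max_seq_length=128,  # fixed length for padded token sequences
    batch_size=32,
)
for token_batch, label_batch in batches:
    pass  # feed each batch to the Transformer model for training/evaluation
```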
The module accepts datasets in the following formats.

CSV format:
- Each row contains a `text` field and a `label` field
- Example:

```csv
text,label
"Win a free iPhone now!",spam
"Meeting scheduled for 10 AM tomorrow.",not spam
```
JSON format:
- A list of objects, each containing `text` and `label` keys
- Example:

```json
[
  { "text": "Win a free iPhone now!", "label": "spam" },
  { "text": "Meeting scheduled for 10 AM tomorrow.", "label": "not spam" }
]
```

The module performs the following processing steps:
- Detects the file format based on the extension (e.g., `.csv`, `.json`); see the sketch after this list
- Reads and extracts the text and label fields from the dataset
- Uses the tokenizer to convert text into tokenized and padded sequences of fixed length
- Maps string labels (e.g., `"spam"`, `"not spam"`) to numerical categories (e.g., `1`, `0`)
- Divides tokenized data and labels into batches for efficient processing during training and testing
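A minimal, self-contained sketch of the format-detection and parsing steps follows. The function name `parse_dataset` and its return shape are assumptions made for this example, not the module's actual interface.

```python
import csv
import json
import os


def parse_dataset(path):
    """Read (text, label) pairs from a CSV or JSON dataset file.

    Illustrative sketch: the function name and return shape are
    assumptions, not the module's actual interface.
    """
    ext = os.path.splitext(path)[1].lower()  # detect format from the extension
    texts, labels = [], []
    if ext == ".csv":
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                texts.append(row["text"])   # KeyError here signals a missing field
                labels.append(row["label"])
    elif ext == ".json":
        with open(path, encoding="utf-8") as f:
            for item in json.load(f):
                texts.append(item["text"])
                labels.append(item["label"])
    else:
        raise ValueError(f"Unsupported file format: {ext!r}")
    return texts, labels
```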
Given an input sentence, the tokenizer splits it into tokens, maps each token to an integer using a vocabulary, and applies padding to ensure all sequences are of equal length.
Let:
- `text` be the input sentence
- `vocab(token)` be the vocabulary index for a token
- `PAD_INDEX` be the index for the padding token
- `MAX_SEQ_LENGTH` be the fixed sequence length
Tokenization: for a sentence `text`:
- Split into tokens: `tokens = preprocess(text)`
- Map tokens to indices: `token_indices = [vocab(token) for token in tokens]`
Padding: to ensure uniform sequence length:

```python
if len(token_indices) < MAX_SEQ_LENGTH:
    # Pad short sequences up to the fixed length
    token_indices.extend([PAD_INDEX] * (MAX_SEQ_LENGTH - len(token_indices)))
else:
    # Truncate long sequences to the fixed length
    token_indices = token_indices[:MAX_SEQ_LENGTH]
```
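Putting the tokenization and padding steps together, a self-contained sketch is shown below. The stand-in `preprocess` tokenizer, the toy vocabulary, the unknown-token index, and the constant values are all illustrative assumptions.

```python
MAX_SEQ_LENGTH = 8  # fixed sequence length (illustrative value)
PAD_INDEX = 0       # index reserved for the padding token
UNK_INDEX = 1       # index for out-of-vocabulary tokens (an assumption)

# Toy vocabulary for illustration; a real vocabulary is built from the corpus.
VOCAB = {"win": 2, "a": 3, "free": 4, "iphone": 5, "now": 6}


def preprocess(text):
    # Minimal stand-in tokenizer: lowercase, split on whitespace,
    # and strip basic punctuation.
    return [tok.strip('.,!?"') for tok in text.lower().split()]


def tokenize_and_pad(text):
    tokens = preprocess(text)
    token_indices = [VOCAB.get(tok, UNK_INDEX) for tok in tokens]
    if len(token_indices) < MAX_SEQ_LENGTH:
        # Pad short sequences up to the fixed length
        token_indices.extend([PAD_INDEX] * (MAX_SEQ_LENGTH - len(token_indices)))
    else:
        # Truncate long sequences to the fixed length
        token_indices = token_indices[:MAX_SEQ_LENGTH]
    return token_indices


print(tokenize_and_pad("Win a free iPhone now!"))  # -> [2, 3, 4, 5, 6, 0, 0, 0]
```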
"spam" -> 1"not spam" -> 0
Key features:
- Supports multiple input formats (CSV, JSON)
- Handles diverse text classification tasks with configurable tokenization and padding
- Efficient batch preparation for large datasets
- Seamless integration with Transformer-based models
- Provides meaningful errors for unsupported file formats or missing fields