fasttfidf is a Python library that provides TF-IDF vectorization with automatic memory management and SIMD acceleration. It processes datasets larger than RAM using memory-mapped files and a streaming architecture, making it practical to work with multi-gigabyte text corpora on commodity hardware.
- Memory-efficient processing: Handles datasets larger than available RAM through streaming and automatic memory management
- SIMD optimization: Leverages AVX2 (x86_64) and NEON (ARM) instruction sets for accelerated text processing
- Multiprocessing: Parallel vocabulary building across CPU cores with automatic load balancing
- Batch training support: Train models incrementally without loading full dataset into memory
- Vocabulary exploration: Built-in methods for analyzing and querying learned vocabularies

Limitations:
- CSV and Parquet only: Requires CSV or Parquet files with a text column header
- No preprocessing: Does not perform stopword removal, stemming, or lemmatization; input text must be preprocessed beforehand
- Batch processing required: Transform returns raw components (data, indices, indptr) that require manual conversion to sparse matrices
- Manual IDF application: IDF weighting and normalization must be applied manually during transformation (see the transformation examples)
- File-based only: Cannot process in-memory data structures or Python dataframes/lists directly
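
Because the library reads only from files, text held in memory has to be written out first. The following is a minimal sketch using only the standard library, assuming the single text column header described above. The function name write_text_csv and the file name are illustrative and not part of fasttfidf.

```python
import csv
import os
import tempfile


def write_text_csv(docs, path):
    """Write an iterable of strings to a CSV file with a single `text` column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])  # header row naming the text column
        for doc in docs:
            writer.writerow([doc])  # csv quotes fields containing commas


docs = ["first document", "second document, with a comma"]
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
write_text_csv(docs, csv_path)
```

The resulting path can then be handed to fit or open_stream. Whether the library's CSV parser accepts quoted fields with embedded newlines is not stated here, so stripping newlines from documents beforehand is the safer choice.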
conda create -n fasttfidf python=3.9 -y
conda activate fasttfidf
conda install -c conda-forge arrow-cpp pyarrow psutil pybind11 pytest -y
git clone https://github.com/purijs/fasttfidf
cd fasttfidf
python -m pip install --no-build-isolation -e .

Alternatively, install the Apache Arrow C++ library first:
macOS (Homebrew):
brew install apache-arrow

Linux (Ubuntu/Debian):
sudo apt update
sudo apt install -y -V ca-certificates lsb-release wget
wget https://apache.jfrog.io/artifactory/arrow/ubuntu/apache-arrow-apt-source-latest-$(lsb_release -cs).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release -cs).deb
sudo apt update
sudo apt install -y libarrow-dev libparquet-dev pkg-config

Windows:
conda install -c conda-forge arrow-cpp pyarrow

Then install from source:
git clone https://github.com/purijs/fasttfidf
cd fasttfidf
pip install pybind11 pytest setuptools psutil pyarrow
pip install -e .

Requirements:
- Python 3.9+
- NumPy >= 1.19.0
- SciPy >= 1.5.0
- C++17 compatible compiler
- Apache Arrow C++ library (version: 22)
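
Before building, it can help to confirm that the Python-side requirements are present. The sketch below is a hypothetical helper, not something shipped with fasttfidf. The module names are taken from the requirements and install commands above, and it cannot verify the C++ compiler or the Arrow C++ library.

```python
import importlib.util
import sys

# Module names assumed from the requirements and install commands above.
REQUIRED_MODULES = ["numpy", "scipy", "pyarrow", "pybind11"]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 9)):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing Python package: {name}")
    return problems


for problem in check_environment():
    print(problem)
```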
import fasttfidf_csv
# Fit the vectorizer on your training data
vec = fasttfidf_csv.TfidfVectorizer()
vec.fit('train.csv', num_processes=0) # use all cores
# Save model for later use
vec.save('model.tfidf')

fasttfidf provides two transformation workflows depending on your use case.
For datasets larger than RAM, train models incrementally:
import fasttfidf_csv
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import normalize
import numpy as np
# Load model
vec = fasttfidf_csv.TfidfVectorizer()
vec.load('model.tfidf')
# Get IDF weights once
idf_weights = vec.get_idf_array().astype(np.float32)
n_features = len(idf_weights)
# Initialize incremental learner
model = SGDClassifier(loss='log_loss', max_iter=1)
# Stream and train in batches
vec.open_stream('train.csv')
batch_size = 128 * 1024 * 1024 # 128 MB
while True:
    # Get raw term frequencies
    batch = vec.get_batch(batch_size)
    if batch is None:
        break
    data, indices, indptr = batch
    n_docs = len(indptr) - 1
    # Build sparse matrix
    X = csr_matrix((data, indices, indptr),
                   shape=(n_docs, n_features))
    # Apply IDF weighting and L2 normalization
    X = X.astype(np.float32)
    X.data *= idf_weights[X.indices]
    normalize(X, norm='l2', copy=False)
    # Incremental training
    # y_batch: labels for this batch's n_docs rows; y_train: all training labels
    # (neither comes from fasttfidf; load them separately, in file order)
    model.partial_fit(X, y_batch, classes=np.unique(y_train))

When the full TF-IDF matrix fits in memory:
import fasttfidf_csv
from scipy.sparse import vstack, csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize
import numpy as np
# Load model
vec = fasttfidf_csv.TfidfVectorizer()
vec.load('model.tfidf')
# Get IDF weights
idf_weights = vec.get_idf_array().astype(np.float32)
n_features = len(idf_weights)
# Collect all batches
matrices = []
vec.open_stream('train.csv')
batch_size = 500 * 1024 * 1024 # 500 MB
while True:
    batch = vec.get_batch(batch_size)
    if batch is None:
        break
    data, indices, indptr = batch
    n_docs = len(indptr) - 1
    # Build sparse matrix
    X = csr_matrix((data, indices, indptr),
                   shape=(n_docs, n_features))
    # Apply IDF and normalize
    X = X.astype(np.float32)
    X.data *= idf_weights[X.indices]
    normalize(X, norm='l2', copy=False)
    matrices.append(X)
# Combine all batches
X_train = vstack(matrices)
# Train model on full matrix (y_train: labels for all rows, loaded separately)
model = LogisticRegression()
model.fit(X_train, y_train)

# Get vocabulary statistics
stats = vec.get_vocab_stats()
print(f"Vocabulary size: {stats['vocab_size']}")
print(f"Total documents: {stats['total_docs']}")
# Find rarest terms
rare_words = vec.get_top_idf_words(n=10)
for word, idf in rare_words:
    print(f"{word}: {idf:.3f}")
# Find most common terms
common_words = vec.get_bottom_idf_words(n=10)
# Search vocabulary
results = vec.search_words('machine', max_results=50)
# Check specific terms
idf_value = vec.get_word_idf('computer')
doc_freq = vec.get_word_df('computer')

fasttfidf expects CSV files with a header row and a text column:
text
This is the first document.
This document is the second document.
And this is the third one.

fasttfidf also supports Apache Parquet files with a text column. Use the fasttfidf_parquet module for Parquet files:
import fasttfidf_parquet
vec = fasttfidf_parquet.TfidfVectorizer()
vec.fit('/path/to/parquet/files/', num_processes=0) # or file.parquet
vec.save('model.tfidf')
# Transform works the same way
vec.open_stream('test.parquet')
batch = vec.get_batch(128 * 1024 * 1024)

Important: The library does not perform any text preprocessing. Your files must contain preprocessed text, with stopwords removed, text lowercased, and any other desired preprocessing already applied.
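
As an illustration of the preprocessing that has to happen before files reach the library, here is a minimal sketch using only the standard library. The stopword list, the token pattern, and the function name preprocess are all illustrative assumptions. fasttfidf's own tokenization rules are not described here, so treat this as one possible cleaning step rather than the required one.

```python
import re

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "this", "to"}

# Assumption: tokens are runs of lowercase letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, keep alphanumeric tokens, and drop stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(t for t in tokens if t not in stopwords)


print(preprocess("This is the first document."))  # first document
```

Each cleaned string would then be written to the text column of the CSV or Parquet file used for fit and open_stream.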

- fit(filename, num_processes=0, min_df=1, max_df=0, max_features=0, verbose=True) - Build vocabulary from CSV file
  - filename: Path to CSV file
  - num_processes: Number of workers (0 = auto-detect)
  - min_df: Minimum document frequency
  - max_df: Maximum document frequency (0 = no limit)
  - max_features: Limit vocabulary size (0 = no limit)
  - verbose: Print progress messages
- save(filename) - Save model to disk as text file
- load(filename) - Load model from disk
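
The min_df, max_df, and max_features parameters of fit prune the vocabulary by document frequency. The sketch below illustrates the general idea in plain Python. It assumes the thresholds are absolute document counts and that max_features keeps the most frequent terms, neither of which is confirmed here, and the helper names are hypothetical.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each term once per document
    return df


def filter_vocabulary(df, min_df=1, max_df=0, max_features=0):
    """Apply count-based thresholds; 0 means 'no limit'."""
    kept = {t: n for t, n in df.items()
            if n >= min_df and (max_df == 0 or n <= max_df)}
    if max_features:
        # Assumption: keep the most frequent terms, ties broken alphabetically.
        top = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
        kept = dict(top)
    return kept


docs = ["first document", "document second document", "third one"]
df = document_frequencies(docs)
print(filter_vocabulary(df, min_df=2))  # {'document': 2}
```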

- open_stream(filename) - Open CSV file for streaming transformation
- get_batch(batch_size_bytes) - Get next batch of raw term frequencies
  - Returns: (data, indices, indptr) tuple of NumPy arrays, or None when stream ends
    - data: uint16 array of term frequencies
    - indices: int32 array of column indices
    - indptr: int32 array of row pointers (CSR format)
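
To make the (data, indices, indptr) triple returned by get_batch concrete, the sketch below walks a small hand-written CSR triple row by row. It uses plain Python lists rather than real library output, and iter_csr_rows is a hypothetical helper. The same slicing applies to the NumPy arrays described above.

```python
def iter_csr_rows(data, indices, indptr):
    """Yield one {column_index: value} dict per row of a CSR triple."""
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        yield dict(zip(indices[start:end], data[start:end]))


# Hand-written example: 3 documents over a 4-term vocabulary.
data = [1, 2, 1, 3]     # term frequencies
indices = [0, 2, 1, 3]  # column (term) index of each value
indptr = [0, 2, 2, 4]   # row i spans data[indptr[i]:indptr[i + 1]]

rows = list(iter_csr_rows(data, indices, indptr))
print(rows)  # [{0: 1, 2: 2}, {}, {1: 1, 3: 3}]
```

The middle row is empty because its start and end pointers are equal, which is how CSR represents a document with no in-vocabulary terms. The number of documents in a batch is always len(indptr) - 1.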

- get_vocabulary() - Return vocabulary as dict mapping word -> index
- get_idf() - Return IDF values as Python list
- get_idf_array() - Return IDF values as NumPy array
- get_feature_names() - Return feature names in index order
- get_vocab_size() - Return vocabulary size
- get_total_docs() - Return total documents processed during fit

- get_vocab_stats() - Get vocabulary statistics dict
- get_top_idf_words(n=10) - Get n words with highest IDF (rarest)
- get_bottom_idf_words(n=10) - Get n words with lowest IDF (most common)
- get_word_idf(word) - Get IDF value for specific word
- get_word_df(word) - Get document frequency for word
- search_words(pattern, max_results=100) - Search vocabulary by substring
- get_words_in_idf_range(min_idf, max_idf, max_results=1000) - Filter by IDF range
- get_words_in_df_range(min_df, max_df, max_results=1000) - Filter by document frequency
- has_word(word) - Check if word exists in vocabulary
- get_random_words(n=10, seed=42) - Get random vocabulary sample
- export_vocabulary_with_idf() - Export vocabulary as dict with IDF values

fasttfidf uses a three-stage pipeline optimized for large-scale processing:
- Vocabulary Building: Memory-mapped file access with multiprocessing and dynamic sub-batching prevents out-of-memory errors on large datasets. Worker processes use adaptive memory management to stay within available RAM.
- IDF Calculation: Inverse document frequency values are computed once during fit and cached in the model file for efficient transformation.
- Streaming Transformation: Zero-copy batch processing returns CSR sparse matrix components (data, indices, indptr) that can be incrementally processed or combined.
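
The IDF and weighting steps in this pipeline can be illustrated with a small pure-Python sketch. The exact IDF formula fasttfidf uses is not stated here. The code below assumes the smoothed form ln((1 + n) / (1 + df)) + 1, which is scikit-learn's default, followed by L2 normalization. The function names are hypothetical.

```python
import math
from collections import Counter


def smoothed_idf(docs):
    """IDF per term using ln((1 + n) / (1 + df)) + 1 (an assumed formula)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each term counts once per document
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def tfidf_row(doc, idf):
    """Raw term counts times IDF, then L2-normalized."""
    weights = {t: c * idf[t] for t, c in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


texts = ["first document", "second document second", "third one"]
docs = [t.split() for t in texts]
idf = smoothed_idf(docs)
row = tfidf_row(docs[0], idf)
```

Here "document" appears in two of the three documents and so receives a lower IDF than "first", which appears in only one. After normalization the squared weights of each row sum to 1.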
Run the test suite:
pytest tests.py -v

This project is distributed under the MIT License. See the LICENSE file for details.
If you use fasttfidf in a scientific publication, please cite:
@software{fasttfidf2025,
author = {Puri, Jaskaran Singh},
title = {fasttfidf: High-performance TF-IDF for large-scale text datasets},
year = {2025},
url = {https://github.com/purijs/fasttfidf}
}