
Image by Author
When working with datasets containing millions or billions of rows, traditional data analysis libraries often struggle. Pandas, while excellent for small to medium datasets, loads entire datasets into memory and performs operations eagerly, which can quickly exhaust available RAM and slow down analysis workflows. This creates a bottleneck for data scientists and analysts who need to work with large datasets efficiently.
Vaex is a Python library designed specifically to handle datasets that are too large to fit comfortably in memory. It uses memory mapping to access data directly from disk without loading everything into RAM, and it employs lazy evaluation to defer computations until absolutely necessary. This approach lets you work with billion-row datasets on a standard laptop with the same ease as working with smaller datasets in pandas.
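Vaex's memory mapping rests on the same mechanism NumPy exposes through `np.memmap`: the file stays on disk and the operating system pages in only the bytes you actually touch. Here is a minimal sketch of that idea using plain NumPy and a hypothetical temporary file (not Vaex itself):

```python
import os
import tempfile
import numpy as np

# Write 10 million float64 values (~76 MB) to a binary file on disk.
path = os.path.join(tempfile.mkdtemp(), "measurements.dat")
np.arange(10_000_000, dtype=np.float64).tofile(path)

# Memory-map the file: nothing is read into RAM up front; the OS pages in
# only the chunks we actually access.
data = np.memmap(path, dtype=np.float64, mode="r")

# Indexing touches a few pages, not the whole 76 MB file.
print(data[42])   # 42.0
print(data[-1])   # 9999999.0
```

Vaex applies this technique to whole DataFrames stored in formats like HDF5 and Arrow, which is why opening a huge file feels instantaneous.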
In this tutorial, you’ll learn the basics of Vaex and see how it compares to pandas when working with large datasets. We’ll explore loading data, filtering, creating computed columns, and performing aggregations using a dataset with over one million rows.
Installation and Setup
Vaex can be installed using pip:
pip install vaex
Vaex has minimal dependencies and works with Python 3.6 and higher. The library is especially efficient when working with HDF5 or Apache Arrow file formats, though it can also read CSV, Parquet, and other common formats. For this tutorial, we’ll use one of Vaex’s built-in example datasets.
Loading Data
Let’s start by loading a dataset and examining its basic properties.
import vaex
# Load the dataset
df = vaex.datasets.iris_1e6()
# Display basic information
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Dataset size: {df.nbytes / 1024**2:.1f} MB")
This loads a dataset with over one million rows of iris flower measurements.
Shape: (1005000, 5)
Columns: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class_']
Dataset size: 38.3 MB
The dataset contains 1,005,000 rows and 5 columns, taking up 38.3 MB of space. Notice how quickly the data loads. This is because Vaex doesn’t actually load all the data into memory at once.
Comparing Vaex to Pandas
To understand why Vaex is valuable for large datasets, let’s compare it directly to pandas. We’ll create five new calculated columns in both libraries and observe the differences in speed and memory usage.
import pandas as pd
import time
df_vaex = vaex.datasets.iris_1e6()
df_pandas = df_vaex.to_pandas_df()
print("=== Creating 5 New Calculated Columns ===\n")
# Vaex (lazy)
start = time.time()
df_vaex['sepal_ratio'] = df_vaex.sepal_length / df_vaex.sepal_width
df_vaex['petal_ratio'] = df_vaex.petal_length / df_vaex.petal_width
df_vaex['sepal_area'] = df_vaex.sepal_length * df_vaex.sepal_width
df_vaex['petal_area'] = df_vaex.petal_length * df_vaex.petal_width
df_vaex['total_size'] = df_vaex.sepal_area + df_vaex.petal_area
vaex_time = time.time() - start
print(f"Vaex: {vaex_time*1000:.2f} ms (lazy, no computation yet)")
# Pandas (eager)
start = time.time()
df_pandas['sepal_ratio'] = df_pandas['sepal_length'] / df_pandas['sepal_width']
df_pandas['petal_ratio'] = df_pandas['petal_length'] / df_pandas['petal_width']
df_pandas['sepal_area'] = df_pandas['sepal_length'] * df_pandas['sepal_width']
df_pandas['petal_area'] = df_pandas['petal_length'] * df_pandas['petal_width']
df_pandas['total_size'] = df_pandas['sepal_area'] + df_pandas['petal_area']
pandas_time = time.time() - start
print(f"Pandas: {pandas_time*1000:.2f} ms (computed immediately)")
print(f"\nSpeed: {pandas_time/vaex_time:.1f}x faster with Vaex")
pandas_new_memory = df_pandas.memory_usage(deep=True).sum() / 1024**2
print(f"\nPandas memory: {pandas_new_memory:.1f} MB (doubled from 38.3 MB)")
print("Vaex memory: 38.3 MB (unchanged with virtual columns)")
Here we create the same five calculated columns in both libraries and measure execution time.
=== Creating 5 New Calculated Columns ===

Vaex: 1.51 ms (lazy, no computation yet)
Pandas: 28.03 ms (computed immediately)

Speed: 18.6x faster with Vaex

Pandas memory: 76.7 MB (doubled from 38.3 MB)
Vaex memory: 38.3 MB (unchanged with virtual columns)
The results are striking. Vaex completed the operations 18.6 times faster than pandas because it didn’t actually compute anything yet. It simply recorded the operations to execute later. More importantly, pandas doubled its memory usage from 38.3 MB to 76.7 MB as it created physical copies of each new column. Vaex’s memory usage remained at 38.3 MB because it creates virtual columns that are computed on demand rather than stored in memory.
This difference becomes even more significant with larger datasets. If you’re working with a 10 GB dataset and create five new columns in pandas, you’re suddenly using 20 GB of RAM. With Vaex, you’d still be using about 10 GB.
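What Vaex records under the hood can be pictured as a small expression object that stores the operation instead of the result. The following is a simplified, hypothetical sketch of that idea in plain Python and NumPy, not Vaex's actual implementation:

```python
import numpy as np

class VirtualColumn:
    """Toy virtual column: store the expression, evaluate on demand."""
    def __init__(self, func, *sources):
        self.func = func        # the recorded operation
        self.sources = sources  # input arrays (or other virtual columns)

    def evaluate(self):
        resolved = [s.evaluate() if isinstance(s, VirtualColumn) else s
                    for s in self.sources]
        return self.func(*resolved)

sepal_length = np.array([5.9, 6.1, 6.6])
sepal_width = np.array([3.0, 3.0, 2.9])

# Creating the column is instant: no array is allocated, nothing is computed.
sepal_ratio = VirtualColumn(lambda a, b: a / b, sepal_length, sepal_width)

# Computation happens only when the values are actually needed.
print(sepal_ratio.evaluate())
```

Because only the recipe is stored, "adding a column" costs a few bytes regardless of how many rows the underlying data has.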
Note: These benchmarks were run on a specific hardware configuration. Performance will vary based on your system specifications and library versions, so test on your own setup for accurate results.
Filtering Data
Filtering data in Vaex creates views rather than copies, making it memory efficient.
# Create filtered views without copying data
setosa = df[df.class_ == 0]
large_sepals = df[df.sepal_length > 6.0]
complex_filter = df[(df.sepal_length > 6.0) & (df.petal_width < 1.5)]
print(f"Original dataset: {len(df):,} rows")
print(f"Setosa flowers: {len(setosa):,} rows")
print(f"Large sepals: {len(large_sepals):,} rows")
print(f"Complex filter: {len(complex_filter):,} rows")
print("\nAll filters created instantly without copying data")
Each filter operation creates a new view of the data without copying the underlying arrays.
Original dataset: 1,005,000 rows
Setosa flowers: 335,000 rows
Large sepals: 408,700 rows
Complex filter: 87,100 rows

All filters created instantly without copying data
All three filtered datasets were created instantly because Vaex simply stores the filter conditions rather than creating new datasets. You can combine multiple conditions using standard Python operators like & for AND and | for OR. When you eventually use these filtered views, Vaex applies the filters on the fly.
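Conceptually, a Vaex filter is a stored predicate rather than a copied subset. This hypothetical sketch (plain NumPy, not Vaex's internals) shows why creating such a view is instant while the cost is deferred to evaluation:

```python
import numpy as np

class FilteredView:
    """Toy filtered view: keep the condition, never copy the rows."""
    def __init__(self, data, condition):
        self.data = data            # the original array, shared
        self.condition = condition  # a predicate applied on demand

    def __len__(self):
        # The scan happens here, when the view is actually used.
        return int(np.count_nonzero(self.condition(self.data)))

    def values(self):
        return self.data[self.condition(self.data)]

sepal_length = np.array([5.9, 6.1, 6.6, 5.0, 6.7])

# Constructing the view is O(1): no rows are scanned or copied yet.
large = FilteredView(sepal_length, lambda a: a > 6.0)

print(len(large))      # 3
print(large.values())  # [6.1 6.6 6.7]
```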
Creating Virtual Columns
Virtual columns are one of Vaex’s most valuable features. These columns don’t exist in memory but are computed on demand when needed.
# Create computed columns without using extra memory
df['sepal_ratio'] = df.sepal_length / df.sepal_width
df['petal_ratio'] = df.petal_length / df.petal_width
df['sepal_area'] = df.sepal_length * df.sepal_width
# View the results
print("Virtual columns created:")
print(df[['sepal_length', 'sepal_width', 'sepal_ratio', 'sepal_area']].head())
print("\nThese columns are computed on-the-fly, no extra memory used")
We create three new calculated columns that appear alongside the original columns.
Virtual columns created:
  #    sepal_length    sepal_width    sepal_ratio    sepal_area
  0             5.9            3         1.96667         17.7
  1             6.1            3         2.03333         18.3
  2             6.6            2.9       2.27586         19.14
  3             6.7            3.3       2.0303          22.11
  4             5.5            4.2       1.30952         23.1
  5             5.1            3.4       1.5             17.34
  6             6.3            2.3       2.73913         14.49
  7             5              3.5       1.42857         17.5
  8             6.7            3.1       2.16129         20.77
  9             6              2.2       2.72727         13.2

These columns are computed on-the-fly, no extra memory used
The virtual columns appear in the output as if they were regular columns, but they’re calculated on the fly when accessed. This means you can create dozens of derived columns without worrying about memory usage. If you need to materialize these columns for export or further processing, you can do so explicitly using methods like materialize() or when exporting to a file format.
Aggregations and Grouping
Vaex handles aggregations efficiently by computing statistics in a single pass over the data.
# Quick aggregation statistics
print("Mean values across species:")
print(f"Sepal length: {df.mean(df.sepal_length):.2f}")
print(f"Petal length: {df.mean(df.petal_length):.2f}")
# Groupby with multiple aggregations
grouped = df.groupby(by='class_', agg={
    'mean_sepal': vaex.agg.mean('sepal_length'),
    'mean_petal': vaex.agg.mean('petal_length'),
    'count': vaex.agg.count()
})
print("\nGrouped by species (class_):")
print(grouped)
Computing means across the entire dataset and grouping by categories happens efficiently.
Mean values across species:
Sepal length: 5.84
Petal length: 3.76

Grouped by species (class_):
  #    class_    mean_sepal    mean_petal    count
  0         0         5.006         1.464    335000
  1         1         5.936         4.26     335000
  2         2         6.588         5.552    335000
Vaex’s aggregation system lets you compute multiple statistics in a single pass through the data. The groupby() operation works similarly to pandas but is designed for large datasets. You can combine multiple aggregation functions including mean, sum, std, min, max, and count. For datasets with millions or billions of rows, this single-pass approach is typically faster than pandas because it minimizes the number of passes through the data.
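The single-pass strategy can be illustrated in plain Python: stream through the rows once, accumulating count, sum, and sum of squares per group, then derive mean and standard deviation from those accumulators. This is a hypothetical sketch of the technique, not Vaex's actual code:

```python
import numpy as np

values = np.array([5.0, 6.0, 7.0, 5.5, 6.5])
groups = np.array([0, 1, 1, 0, 1])

# One streaming pass: group -> [count, sum, sum of squares]
acc = {}
for g, v in zip(groups, values):
    c = acc.setdefault(int(g), [0, 0.0, 0.0])
    c[0] += 1
    c[1] += v
    c[2] += v * v

# Mean and std fall out of the accumulators with no second pass.
for g, (n, s, ss) in sorted(acc.items()):
    mean = s / n
    std = (ss / n - mean ** 2) ** 0.5
    print(f"group {g}: count={n} mean={mean:.3f} std={std:.3f}")
```

Accumulator-based aggregation also parallelizes naturally: chunks of the data can be reduced independently and their accumulators merged, which is how out-of-core engines keep memory usage flat.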
Conclusion
Vaex provides an efficient way to work with large datasets that would otherwise be difficult or impossible to handle with traditional libraries. Its lazy evaluation system means operations are nearly instantaneous to set up, and its memory mapping capabilities let you work with datasets much larger than your available RAM. The virtual column system lets you create many derived columns with essentially no memory overhead, while the filtering and aggregation systems provide the familiar DataFrame interface you’re used to from pandas.
For datasets under a few hundred megabytes, pandas remains an excellent choice with its rich ecosystem and familiar syntax. However, once your datasets grow to gigabytes or larger, Vaex becomes an invaluable tool that can mean the difference between a smooth analysis workflow and constant memory errors. Consider using Vaex when working with large CSV files, when you need to create many derived columns, or when your pandas workflows are consuming too much memory or running slowly.
