Language: Python
Data Science
Vaex was created by Maarten Breddels around 2015 to handle very large tabular datasets efficiently. It combines memory mapping, lazy evaluation, and optimized aggregation algorithms to provide a pandas-like interface that scales to billions of rows, which makes it well suited to big data analysis and visualization.
Vaex is a high-performance Python library for out-of-core DataFrames, enabling visualization and exploration of datasets larger than memory. It allows fast filtering, grouping, aggregations, and statistical computations without loading the full dataset into RAM.
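To make the out-of-core idea concrete, here is a minimal standard-library sketch of scanning a memory-mapped column in chunks. It is not how Vaex is implemented. The helper names `write_doubles` and `mapped_mean` are hypothetical, the file is assumed to be a non-empty raw array of float64 values, and the chunk size is arbitrary.

```python
import mmap
import os
import tempfile
from array import array


def write_doubles(path, values):
    """Hypothetical helper: store values as raw float64 in native byte order."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mapped_mean(path, chunk_rows=1024):
    """Mean of a float64 column, read chunk by chunk from a memory-mapped file."""
    chunk_bytes = chunk_rows * 8  # 8 bytes per float64
    total, count = 0.0, 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), chunk_bytes):
                block = mm[offset:offset + chunk_bytes]  # copies this chunk only
                values = array("d")
                values.frombytes(block)
                total += sum(values)
                count += len(values)
    return total / count


path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_doubles(path, [1.0, 2.0, 3.0, 4.0])
print(mapped_mean(path, chunk_rows=3))
```

Only one chunk is held in Python memory at a time; the operating system pages the rest of the file in and out as needed.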
pip install vaex
conda install -c conda-forge vaex
Vaex lets you manipulate, filter, group, and aggregate large datasets efficiently. It evaluates expressions lazily, computing results only when they are needed, which keeps memory usage low. Vaex works with NumPy arrays and offers a pandas-like syntax.
import vaex
df = vaex.from_csv('data.csv', convert=True)
print(df.head())
Loads a CSV file into a Vaex DataFrame. `convert=True` also writes an HDF5 copy of the data, so later loads can memory-map that file instead of parsing the CSV again.
filtered = df[df['age'] > 30]
print(filtered.head())
Filters rows where the 'age' column is greater than 30 without loading the full dataset into memory.
agg = df.groupby('department', agg={'avg_salary': vaex.agg.mean('salary')})
print(agg)
Performs a group-by operation and calculates the average salary per department.
df['bmi'] = df['weight'] / (df['height']/100)**2
print(df[['weight','height','bmi']].head())
Creates a virtual column 'bmi' from existing columns. Vaex stores only the expression and evaluates it when the values are needed, so the new column takes no extra memory.
import matplotlib.pyplot as plt
agg = df.count(binby=df['age'], limits=[0,100], shape=100)
plt.plot(agg)
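plt.show()
Generates a histogram of the 'age' column using Vaex's fast binning. `df.count` returns one count per bin (100 bins between 0 and 100), which Matplotlib then plots as a line.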
filtered.export_csv('filtered.csv')
filtered.export_hdf5('filtered.hdf5')
Exports a Vaex DataFrame, here the filtered one, to CSV or HDF5 for further use.
Use HDF5 or Arrow format for very large datasets for faster access.
Leverage virtual columns to avoid unnecessary memory usage.
Apply filtering and aggregation lazily to scale computations efficiently.
Use `vaex.open()` on an HDF5 or Arrow file, or `vaex.from_csv()` with `convert=True`, to avoid re-parsing the same CSV on repeated loads.
Combine with visualization tools like Matplotlib or Bokeh for interactive plotting of large datasets.