Introduction#

Daft is a high-performance data engine designed for AI and multimodal workloads, providing simple, reliable data processing for images, audio, video, and structured data at any scale.

READ ANYTHING

Daft reads raw unstructured and multimodal data collected from application systems (a read sketch follows this list):

Object storage (AWS S3, GCS, R2)
Event bus (Kafka)
Data lakes (Iceberg, Delta Lake)
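
For example, reading Parquet from object storage is a one-liner. A minimal sketch (the bucket and paths are placeholders):

```python
import daft

# Read Parquet straight from object storage; the bucket and prefix are placeholders.
df = daft.read_parquet("s3://my-bucket/raw-events/*.parquet")

# Table formats read the same way, e.g. a Delta Lake table (placeholder path):
# df = daft.read_deltalake("s3://my-bucket/tables/events")

df.show()
```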

EXPENSIVE TRANSFORMATIONS

Build efficient Daft data pipelines around heavy transformations (a UDF sketch follows this list):

GPU models
User-provided Python code
External LLM APIs
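
User-provided Python code runs as batch UDFs over whole columns. A minimal sketch (the function and column names are illustrative):

```python
import daft
from daft import col

# A batch UDF: Daft passes entire columns (daft.Series), not single rows.
@daft.udf(return_dtype=daft.DataType.string())
def shout(texts):
    return [t.upper() for t in texts.to_pylist()]

df = daft.from_pydict({"text": ["hello", "world"]})
df = df.with_column("shouted", shout(col("text")))
df.show()
```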

WRITE

Daft lands data into specialized data systems for downstream use cases (a write sketch follows this list):

Search (full-text search and vector DBs)
Applications (SQL/NoSQL databases)
Analytics (data warehouses)
Model training (S3 object storage)
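
Writing mirrors reading. A minimal sketch that lands results as Parquet in object storage (the path is a placeholder):

```python
import daft

df = daft.from_pydict({"id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})

# Write Parquet for downstream training or analytics; the path is a placeholder.
df.write_parquet("s3://my-bucket/curated/scores")
```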

Why Daft?#

Unified multimodal data processing

While traditional dataframes struggle with anything beyond tables, Daft natively handles tables, images, text, and embeddings through a single Python API. No more stitching together specialized tools for different data types.
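
For instance, images are ordinary column expressions. A minimal sketch with a placeholder URL:

```python
import daft
from daft import col

df = daft.from_pydict({"url": ["https://example.com/cat.jpg"]})  # placeholder URL

# Download the bytes, then decode them into a first-class image column.
# on_error="null" keeps the pipeline going if a download fails.
df = df.with_column("image", col("url").url.download(on_error="null").image.decode())
df.show()
```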

Python-native, no JVM required

Built for modern AI/ML workflows with Python at its core and Rust under the hood. Skip the JVM complexity, version conflicts, and memory tuning, and get up to 20x faster start times: the performance without the Java tax.

Seamless scaling, from laptop to cluster

Start local, scale global—without changing a line of code. Daft's Rust-powered engine delivers blazing performance on a single machine and effortlessly extends to distributed clusters when you need more horsepower.
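
For example, retargeting the same query at a Ray cluster is a single configuration call. A sketch assuming a reachable Ray cluster (the address and path are placeholders):

```python
import daft

# The default runner executes locally; this call retargets the same code
# at a Ray cluster. The address and data path below are placeholders.
daft.context.set_runner_ray(address="ray://my-cluster:10001")

df = daft.read_parquet("s3://my-bucket/raw-events/*.parquet")
print(df.count_rows())
```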

Key Features#

  • Native multimodal processing: Process any data type—from structured tables to unstructured text and rich media—with native support for images, audio, video, and embeddings in a single, unified framework.

  • Built-in AI operations: Transform data with AI natively: run LLM prompts with structured outputs, generate embeddings, and classify images or text using models from OpenAI, Transformers, or your own custom providers, all optimized for batch processing.

  • Rust-powered performance: Experience breakthrough speed from a Rust foundation that delivers vectorized execution and non-blocking I/O, running the same queries with up to 5x less memory while often outperforming comparable engines by an order of magnitude.

  • Universal data connectivity: Access data anywhere it lives—cloud storage (S3, Azure, GCS, Hugging Face), modern table formats (Apache Iceberg, Delta Lake, Apache Hudi), or enterprise catalogs (Unity Catalog, AWS Glue)—all with zero configuration (a connectivity sketch follows this list).

  • Push your code to your data: Bring your Python functions directly to your data with zero-copy UDFs powered by Apache Arrow, eliminating data movement overhead and accelerating processing speeds.

  • Out of the box reliability: Deploy with confidence—intelligent memory management prevents OOM errors while sensible defaults eliminate configuration headaches, letting you focus on results, not infrastructure.
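
To illustrate the universal data connectivity bullet above, the same read API spans storage backends. A sketch in which every path is a placeholder:

```python
import daft

s3_df = daft.read_parquet("s3://my-bucket/data/*.parquet")      # AWS S3
hf_df = daft.read_parquet("hf://datasets/username/my-dataset")  # Hugging Face
local_df = daft.read_csv("data/sample.csv")                     # local disk
```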

Looking to get started with Daft ASAP?

If you are ready to jump into code, take a look at these resources:

  1. Quickstart: Itching to run some Daft code? Hit the ground running with our 10-minute quickstart.

  2. Examples: See Daft in action with use cases across text, images, audio, and more.

  3. API Documentation: Searchable documentation and reference material for Daft's public API.

Contribute to Daft#

If you're interested in hands-on learning about Daft internals and would like to contribute to our project, join us on GitHub 🚀

Take a look at the many issues tagged with good first issue in our repo. If any of them interest you, feel free to chime in on the issue itself, or join our Distributed Data Slack Community and send us a message in #daft-dev. Daft team members will be happy to assign an issue to you and provide guidance if needed!