A curated list of tools, frameworks, platforms, architectures, and learning resources for data engineering, covering data ingestion, transformation, storage, orchestration, and reliable data infrastructure at scale.
## Contents

- Foundations & Concepts
- Data Ingestion & Integration
- Streaming & Event Processing
- Data Transformation & Modeling
- Workflow Orchestration
- Storage, Warehousing & Lakehouses
- Query Engines & Analytics
- NoSQL & Specialized Datastores
- Data Quality, Governance & Lineage
- Observability & Reliability
- Infrastructure & Platforms
- Data Engineering on the Cloud
- Learning Resources
- Related Awesome Lists

## Foundations & Concepts

- Data Engineering Explained – Overview of data engineering roles, responsibilities, and workflows.
- Modern Data Stack – Overview of modern analytics and data engineering tooling.
- Data Lake vs Data Warehouse – Comparison of storage architectures for analytics.
- CAP Theorem – Fundamental trade-offs in distributed data systems.
- Event-Driven Architecture – Architectural style for real-time data systems.

## Data Ingestion & Integration

- Apache Kafka Connect – Framework for moving data between Kafka and external systems.
- Apache NiFi – Visual data ingestion and flow automation platform.
- Airbyte – Open-source data integration platform for ELT pipelines.
- Fivetran – Managed data connectors for analytics and warehousing.
- Singer – Open-source standard for data extraction and loading.
- Debezium – Change data capture (CDC) platform for databases.
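CDC tools like Debezium and ELT platforms like Airbyte share one core idea: on each sync, extract only the rows that changed since the last run, tracked by a cursor. A minimal sketch of cursor-based incremental extraction, using an in-memory SQLite table as a stand-in source (all table, column, and function names here are illustrative, not any tool's API):

```python
import sqlite3

def incremental_extract(conn, last_cursor):
    """Fetch only rows updated after the stored cursor, then advance the cursor."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_cursor,),
    ).fetchall()
    new_cursor = rows[-1][2] if rows else last_cursor
    return rows, new_cursor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "ada", 100), (2, "bob", 200), (3, "cleo", 300)])

# First sync pulls everything; the second pulls only rows changed since then.
batch1, cursor = incremental_extract(conn, 0)
conn.execute("INSERT INTO users VALUES (4, 'dan', 400)")
batch2, cursor = incremental_extract(conn, cursor)
```

Real CDC systems read the database's write-ahead log instead of polling a timestamp column, which also captures deletes, but the cursor-and-resume shape is the same.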

## Streaming & Event Processing

- Apache Kafka – Distributed event streaming platform.
- Apache Pulsar – Cloud-native pub/sub and streaming platform.
- Apache Flink – Stream-first processing framework with low latency.
- Kafka Streams – Stream processing library built on Kafka.
- Apache Storm – Real-time computation system for stream processing.
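A central primitive in engines like Flink and Kafka Streams is windowed aggregation over an unbounded event stream. A toy sketch of tumbling (fixed, non-overlapping) window counts in plain Python — the event shape and window size are illustrative, not any engine's API:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Bucket (timestamp_ms, key) events into fixed-size windows and count keys."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_ms)  # align to window boundary
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}

events = [(1000, "click"), (1500, "view"), (2500, "click"), (3100, "click")]
result = tumbling_window_counts(events, window_ms=1000)
```

Production engines add what this sketch omits: incremental state kept per window, watermarks to decide when a window is complete despite late events, and fault-tolerant checkpointing of that state.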

## Data Transformation & Modeling

- dbt – SQL-based transformation and analytics engineering tool.
- Apache Spark – Distributed engine for large-scale data processing.
- Apache Beam – Unified programming model for batch and streaming pipelines.
- Dask – Parallel computing library for scalable Python data processing.
- SQLMesh – Versioned, testable SQL transformations.
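Tools like dbt and SQLMesh express transformations as layered SQL models, conventionally raw → staging → mart. The pattern itself is just SQL; a minimal sketch using SQLite in place of a warehouse (the table and column names are illustrative, and real tools would materialize each model from its own versioned file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT);
    INSERT INTO raw_orders VALUES
        (1, 'acme', 1250, 'complete'),
        (2, 'acme', 400,  'cancelled'),
        (3, 'zeta', 900,  'complete');

    -- Staging model: clean, rename, and filter the raw data.
    CREATE VIEW stg_orders AS
        SELECT id, customer, amount_cents / 100.0 AS amount
        FROM raw_orders
        WHERE status = 'complete';

    -- Mart model: aggregate the staging layer for analytics consumers.
    CREATE VIEW fct_revenue_by_customer AS
        SELECT customer, SUM(amount) AS revenue
        FROM stg_orders
        GROUP BY customer;
""")
revenue = dict(conn.execute("SELECT customer, revenue FROM fct_revenue_by_customer"))
```

The layering matters because each model depends only on the one below it: cleaning logic lives once in staging, and every mart inherits it.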

## Workflow Orchestration

- Apache Airflow – Platform for scheduling and monitoring data workflows.
- Dagster – Data orchestration platform with strong observability and testing.
- Prefect – Workflow orchestration system for data pipelines.
- Luigi – Python package for building complex pipelines.
- Argo Workflows – Kubernetes-native workflow engine.
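All of the orchestrators above model a pipeline as a DAG of tasks and run each task only after its upstream dependencies succeed. A stripped-down sketch of that core loop using the standard library's topological sort — task names and the `deps` mapping are illustrative, not any orchestrator's API:

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Execute callables in dependency order; deps maps task -> set of upstreams."""
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        # Each task receives the results of everything that ran before it.
        results[name] = tasks[name](results)
    return order, results

tasks = {
    "extract":   lambda r: [1, 2, 3],
    "transform": lambda r: [x * 10 for x in r["extract"]],
    "load":      lambda r: sum(r["transform"]),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
order, results = run_pipeline(tasks, deps)
```

Real orchestrators layer scheduling, retries, parallelism, and persisted state on top, but dependency-ordered execution is the kernel they all share; `TopologicalSorter` also raises on cycles, which is exactly the validation a DAG definition needs.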

## Storage, Warehousing & Lakehouses

- Amazon S3 – Object storage widely used as a data lake.
- Google Cloud Storage – Scalable object storage for analytics workloads.
- Azure Data Lake Storage – Optimized storage for analytics on Azure.
- Snowflake – Cloud-native data warehouse.
- BigQuery – Serverless analytics data warehouse.
- Delta Lake – Open-source storage layer enabling lakehouse architecture.
- Apache Iceberg – Table format for large-scale analytic datasets.
- Apache Hudi – Incremental data processing and lakehouse framework.
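Table formats like Iceberg, Hudi, and Delta Lake sit on top of object stores and organize data files so queries can skip irrelevant partitions. The underlying hive-style path layout, and the pruning it enables, is easy to sketch (bucket, keys, and file names here are illustrative):

```python
def partition_path(table_root, partition_values, file_name):
    """Build a hive-style path like root/year=2024/month=05/file.parquet."""
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_root.rstrip("/")] + parts + [file_name])

path = partition_path(
    "s3://lake/events",
    {"year": "2024", "month": "05"},
    "part-0001.parquet",
)

files = [
    "s3://lake/events/year=2024/month=04/part-0000.parquet",
    path,
]
# Partition pruning: a query filtered on month = '05' never reads other files.
pruned = [f for f in files if "/month=05/" in f]
```

Modern table formats go further by keeping partition and column statistics in metadata files rather than encoding them only in paths, which is what makes schema evolution and hidden partitioning possible.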

## Query Engines & Analytics

- Trino – Distributed SQL query engine for large datasets.
- Presto – Distributed SQL engine for interactive analytics; Trino originated as a fork of Presto.
- Spark SQL – SQL analytics module built on Apache Spark.
- DuckDB – In-process analytical SQL engine.
- ClickHouse – Column-oriented OLAP database.

## NoSQL & Specialized Datastores

- Apache Cassandra – Distributed wide-column NoSQL database.
- MongoDB – Document-oriented NoSQL database.
- Apache HBase – Distributed wide-column store built on HDFS.
- Amazon DynamoDB – Managed NoSQL key-value store.
- Redis – In-memory data store for caching and streaming use cases.

## Data Quality, Governance & Lineage

- Great Expectations – Data quality validation framework.
- Apache Atlas – Metadata management and data governance platform.
- OpenLineage – Open standard for capturing data lineage.
- DataHub – Open-source metadata and data catalog.
- Amundsen – Data discovery and metadata engine.
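At their core, data quality frameworks like Great Expectations run declarative checks against a batch of records and report which records failed which check. A minimal sketch of that pattern — the check names and record shape are illustrative, not Great Expectations' API:

```python
def validate(records, checks):
    """Run each named check over all records; collect failing records per check."""
    failures = {}
    for name, predicate in checks.items():
        bad = [r for r in records if not predicate(r)]
        if bad:
            failures[name] = bad
    return failures

records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]
checks = {
    "email_not_null":   lambda r: r["email"] is not None,
    "age_non_negative": lambda r: r["age"] >= 0,
}
failures = validate(records, checks)
```

A pipeline would typically run such a suite after each load and fail or quarantine the batch when `failures` is non-empty, which is how quality gates keep bad data out of downstream models.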

## Observability & Reliability

- Monte Carlo – Data observability platform for pipelines.
- Bigeye – Data quality monitoring and alerting.
- Prometheus – Metrics and monitoring system.
- Grafana – Visualization platform for observability.
- OpenTelemetry – Observability framework for distributed systems.

## Infrastructure & Platforms

- Kubernetes – Container orchestration for data workloads.
- Ray – Distributed computing framework for scalable data processing.
- Terraform – Infrastructure as code for data platforms.
- Apache Mesos – Distributed systems kernel for resource management.

## Data Engineering on the Cloud

- Databricks – Unified analytics platform built on Apache Spark.
- AWS EMR – Managed big data platform on AWS.
- Google Dataproc – Managed Spark and Hadoop service.
- Azure Synapse Analytics – Integrated analytics service.
- Snowflake Data Cloud – Platform for data sharing and analytics.

## Learning Resources

- Data Engineering Zoomcamp – Free hands-on data engineering course.
- Apache Spark Documentation – Official Spark guides and examples.
- Kafka Documentation – Official Kafka tutorials.
- Designing Data-Intensive Applications – Foundational book on scalable data systems.
- Streaming Systems – Concepts and architectures for stream processing.
- Data Engineering Best Practices – Modern data engineering workflows.
- Data Engineering Fundamentals – Core data pipeline concepts.
- Streaming Data Engineering – Real-time data processing architectures.
- Cloud Data Engineering – Building scalable pipelines in the cloud.

## Contributing

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.