A curated list of tools, frameworks, platforms, architectures, and learning resources for data engineering, covering data ingestion, transformation, storage, orchestration, and reliable data infrastructure at scale.
## Contents

- Foundations & Concepts
- Data Ingestion & Integration
- Streaming & Event Processing
- Data Transformation & Modeling
- Workflow Orchestration
- Storage, Warehousing & Lakehouses
- Query Engines & Analytics
- NoSQL & Specialized Datastores
- Data Quality, Governance & Lineage
- Observability & Reliability
- Infrastructure & Platforms
- Data Engineering on the Cloud
- Learning Resources
- Related Awesome Lists

## Foundations & Concepts

- Data Engineering Explained – Overview of data engineering roles, responsibilities, and workflows.
- Modern Data Stack – Overview of modern analytics and data engineering tooling.
- Data Lake vs Data Warehouse – Comparison of storage architectures for analytics.
- CAP Theorem – Fundamental trade-offs in distributed data systems.
- Event-Driven Architecture – Architectural style for real-time data systems.

## Data Ingestion & Integration

- Apache Kafka Connect – Framework for moving data between Kafka and external systems.
- Apache NiFi – Visual data ingestion and flow automation platform.
- Airbyte – Open-source data integration platform for ELT pipelines.
- Fivetran – Managed data connectors for analytics and warehousing.
- Singer – Open-source standard for data extraction and loading.
- Debezium – Change data capture (CDC) platform for databases.
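CDC tools like Debezium and ELT platforms like Airbyte share one core idea: on each sync, extract only the rows that changed since the last run, tracked by a cursor. A minimal sketch of cursor-based incremental extraction, using an in-memory SQLite table as a stand-in source (all table, column, and function names here are illustrative, not any tool's API):

```python
import sqlite3

def incremental_extract(conn, last_cursor):
    """Fetch only rows updated after the stored cursor, then advance the cursor."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_cursor,),
    ).fetchall()
    new_cursor = rows[-1][2] if rows else last_cursor
    return rows, new_cursor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "ada", 100), (2, "bob", 200), (3, "cleo", 300)])

# First sync pulls everything; the second pulls only rows changed since then.
batch1, cursor = incremental_extract(conn, 0)
conn.execute("INSERT INTO users VALUES (4, 'dan', 400)")
batch2, cursor = incremental_extract(conn, cursor)
```

Real CDC systems read the database's write-ahead log instead of polling a timestamp column, which also captures deletes, but the cursor-and-resume shape is the same.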

## Streaming & Event Processing

- Apache Kafka – Distributed event streaming platform.
- Apache Pulsar – Cloud-native pub/sub and streaming platform.
- Apache Flink – Stream-first processing framework with low latency.
- Kafka Streams – Stream processing library built on Kafka.
- Apache Storm – Real-time computation system for stream processing.
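A central primitive in engines like Flink and Kafka Streams is windowed aggregation over an unbounded event stream. A toy sketch of tumbling (fixed, non-overlapping) window counts in plain Python — the event shape and window size are illustrative, not any engine's API:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Bucket (timestamp_ms, key) events into fixed-size windows and count keys."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_ms)  # align to window boundary
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}

events = [(1000, "click"), (1500, "view"), (2500, "click"), (3100, "click")]
result = tumbling_window_counts(events, window_ms=1000)
```

Production engines add what this sketch omits: incremental state kept per window, watermarks to decide when a window is complete despite late events, and fault-tolerant checkpointing of that state.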

## Data Transformation & Modeling

- dbt – SQL-based transformation and analytics engineering tool.
- Apache Spark – Distributed engine for large-scale data processing.
- Apache Beam – Unified programming model for batch and streaming pipelines.
- Dask – Parallel computing library for scalable Python data processing.
- SQLMesh – Versioned, testable SQL transformations.
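Tools like dbt and SQLMesh express transformations as layered SQL models, conventionally raw → staging → mart. The pattern itself is just SQL; a minimal sketch using SQLite in place of a warehouse (the table and column names are illustrative, and real tools would materialize each model from its own versioned file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT);
    INSERT INTO raw_orders VALUES
        (1, 'acme', 1250, 'complete'),
        (2, 'acme', 400,  'cancelled'),
        (3, 'zeta', 900,  'complete');

    -- Staging model: clean, rename, and filter the raw data.
    CREATE VIEW stg_orders AS
        SELECT id, customer, amount_cents / 100.0 AS amount
        FROM raw_orders
        WHERE status = 'complete';

    -- Mart model: aggregate the staging layer for analytics consumers.
    CREATE VIEW fct_revenue_by_customer AS
        SELECT customer, SUM(amount) AS revenue
        FROM stg_orders
        GROUP BY customer;
""")
revenue = dict(conn.execute("SELECT customer, revenue FROM fct_revenue_by_customer"))
```

The layering matters because each model depends only on the one below it: cleaning logic lives once in staging, and every mart inherits it.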

## Workflow Orchestration

- Apache Airflow – Platform for scheduling and monitoring data workflows.
- Dagster – Data orchestration platform with strong observability and testing.
- Prefect – Workflow orchestration system for data pipelines.
- Luigi – Python package for building complex pipelines.
- Argo Workflows – Kubernetes-native workflow engine.
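All of the orchestrators above model a pipeline as a DAG of tasks and run each task only after its upstream dependencies succeed. A stripped-down sketch of that core loop using the standard library's topological sort — task names and the `deps` mapping are illustrative, not any orchestrator's API:

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps):
    """Execute callables in dependency order; deps maps task -> set of upstreams."""
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        # Each task receives the results of everything that ran before it.
        results[name] = tasks[name](results)
    return order, results

tasks = {
    "extract":   lambda r: [1, 2, 3],
    "transform": lambda r: [x * 10 for x in r["extract"]],
    "load":      lambda r: sum(r["transform"]),
}
deps = {"transform": {"extract"}, "load": {"transform"}}
order, results = run_pipeline(tasks, deps)
```

Real orchestrators layer scheduling, retries, parallelism, and persisted state on top, but dependency-ordered execution is the kernel they all share; `TopologicalSorter` also raises on cycles, which is exactly the validation a DAG definition needs.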

## Storage, Warehousing & Lakehouses

- Amazon S3 – Object storage widely used as a data lake.
- Google Cloud Storage – Scalable object storage for analytics workloads.
- Azure Data Lake Storage – Optimized storage for analytics on Azure.
- Snowflake – Cloud-native data warehouse.
- BigQuery – Serverless analytics data warehouse.
- Delta Lake – Open-source storage layer enabling lakehouse architecture.
- Apache Iceberg – Table format for large-scale analytic datasets.
- Apache Hudi – Incremental data processing and lakehouse framework.
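Table formats like Iceberg, Hudi, and Delta Lake sit on top of object stores and organize data files so queries can skip irrelevant partitions. The underlying hive-style path layout, and the pruning it enables, is easy to sketch (bucket, keys, and file names here are illustrative):

```python
def partition_path(table_root, partition_values, file_name):
    """Build a hive-style path like root/year=2024/month=05/file.parquet."""
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_root.rstrip("/")] + parts + [file_name])

path = partition_path(
    "s3://lake/events",
    {"year": "2024", "month": "05"},
    "part-0001.parquet",
)

files = [
    "s3://lake/events/year=2024/month=04/part-0000.parquet",
    path,
]
# Partition pruning: a query filtered on month = '05' never reads other files.
pruned = [f for f in files if "/month=05/" in f]
```

Modern table formats go further by keeping partition and column statistics in metadata files rather than encoding them only in paths, which is what makes schema evolution and hidden partitioning possible.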

## Query Engines & Analytics

- Trino – Distributed SQL query engine for large datasets.
- Presto – Distributed SQL engine for interactive analytics; Trino originated as a fork of Presto.
- Spark SQL – SQL analytics module built on Apache Spark.
- DuckDB – In-process analytical SQL engine.
- ClickHouse – Column-oriented OLAP database.

## NoSQL & Specialized Datastores

- Apache Cassandra – Distributed wide-column NoSQL database.
- MongoDB – Document-oriented NoSQL database.
- Apache HBase – Distributed wide-column store built on HDFS.
- Amazon DynamoDB – Managed NoSQL key-value store.
- Redis – In-memory data store for caching and streaming use cases.

## Data Quality, Governance & Lineage

- Great Expectations – Data quality validation framework.
- Apache Atlas – Metadata management and data governance platform.
- OpenLineage – Open standard for capturing data lineage.
- DataHub – Open-source metadata and data catalog.
- Amundsen – Data discovery and metadata engine.
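At their core, data quality frameworks like Great Expectations run declarative checks against a batch of records and report which records failed which check. A minimal sketch of that pattern — the check names and record shape are illustrative, not Great Expectations' API:

```python
def validate(records, checks):
    """Run each named check over all records; collect failing records per check."""
    failures = {}
    for name, predicate in checks.items():
        bad = [r for r in records if not predicate(r)]
        if bad:
            failures[name] = bad
    return failures

records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]
checks = {
    "email_not_null":   lambda r: r["email"] is not None,
    "age_non_negative": lambda r: r["age"] >= 0,
}
failures = validate(records, checks)
```

A pipeline would typically run such a suite after each load and fail or quarantine the batch when `failures` is non-empty, which is how quality gates keep bad data out of downstream models.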

## Observability & Reliability

- Monte Carlo – Data observability platform for pipelines.
- Bigeye – Data quality monitoring and alerting.
- Prometheus – Metrics and monitoring system.
- Grafana – Visualization platform for observability.
- OpenTelemetry – Observability framework for distributed systems.

## Infrastructure & Platforms

- Kubernetes – Container orchestration for data workloads.
- Ray – Distributed computing framework for scalable data processing.
- Terraform – Infrastructure as code for data platforms.
- Apache Mesos – Distributed systems kernel for resource management.

## Data Engineering on the Cloud

- Databricks – Unified analytics platform built on Apache Spark.
- AWS EMR – Managed big data platform on AWS.
- Google Dataproc – Managed Spark and Hadoop service.
- Azure Synapse Analytics – Integrated analytics service.
- Snowflake Data Cloud – Platform for data sharing and analytics.

## Learning Resources

- Data Engineering Zoomcamp – Free hands-on data engineering course.
- Apache Spark Documentation – Official Spark guides and examples.
- Kafka Documentation – Official Kafka tutorials.
- Designing Data-Intensive Applications – Foundational book on scalable data systems.
- Streaming Systems – Concepts and architectures for stream processing.
- Data Engineering Best Practices – Modern data engineering workflows.
- Data Engineering Fundamentals – Core data pipeline concepts.
- Streaming Data Engineering – Real-time data processing architectures.
- Cloud Data Engineering – Building scalable pipelines in the cloud.

## Contributing

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.