brandonhimpfen/awesome-data-engineering

Awesome Data Engineering

A curated list of tools, frameworks, platforms, architectures, and learning resources for data engineering, covering data ingestion, transformation, storage, orchestration, and reliable data infrastructure at scale.

Foundations & Concepts

Data Ingestion & Integration

  • Apache Kafka Connect – Framework for moving data between Kafka and external systems.
  • Apache NiFi – Visual data ingestion and flow automation platform.
  • Airbyte – Open-source data integration platform for ELT pipelines.
  • Fivetran – Managed data connectors for analytics and warehousing.
  • Singer – Open-source standard for data extraction and loading.
  • Debezium – Change data capture (CDC) platform for databases.

Streaming & Event Processing

  • Apache Kafka – Distributed event streaming platform.
  • Apache Pulsar – Cloud-native pub/sub and streaming platform.
  • Apache Flink – Stream-first processing framework with low latency.
  • Kafka Streams – Stream processing library built on Kafka.
  • Apache Storm – Real-time computation system for stream processing.

Data Transformation & Modeling

  • dbt – SQL-based transformation and analytics engineering tool.
  • Apache Spark – Distributed engine for large-scale data processing.
  • Apache Beam – Unified programming model for batch and streaming pipelines.
  • Dask – Parallel computing library for scalable Python data processing.
  • SQLMesh – Versioned, testable SQL transformations.

Workflow Orchestration

  • Apache Airflow – Platform for scheduling and monitoring data workflows.
  • Dagster – Data orchestration platform with strong observability and testing.
  • Prefect – Workflow orchestration system for data pipelines.
  • Luigi – Python package for building complex pipelines of batch jobs.
  • Argo Workflows – Kubernetes-native workflow engine.

Storage, Warehousing & Lakehouses

Query Engines & Analytics

  • Trino – Distributed SQL query engine for large datasets.
  • Presto – High-performance distributed SQL engine.
  • Spark SQL – SQL analytics module built on Apache Spark.
  • DuckDB – In-process analytical SQL engine.
  • ClickHouse – Column-oriented OLAP database.

NoSQL & Specialized Datastores

Data Quality, Governance & Lineage

  • Great Expectations – Data quality validation framework.
  • Apache Atlas – Metadata management and data governance platform.
  • OpenLineage – Open standard for capturing data lineage.
  • DataHub – Open-source metadata and data catalog.
  • Amundsen – Data discovery and metadata engine.

Observability & Reliability

  • Monte Carlo – Data observability platform for pipelines.
  • Bigeye – Data quality monitoring and alerting.
  • Prometheus – Metrics and monitoring system.
  • Grafana – Visualization platform for observability.
  • OpenTelemetry – Observability framework for distributed systems.

Infrastructure & Platforms

  • Kubernetes – Container orchestration for data workloads.
  • Ray – Distributed computing framework for scalable data processing.
  • Terraform – Infrastructure as code for data platforms.
  • Apache Mesos – Distributed systems kernel for resource management.

Data Engineering on the Cloud

Learning Resources

Tutorials

Guides

Courses

  • Data Engineering Fundamentals – Core data pipeline concepts.
  • Streaming Data Engineering – Real-time data processing architectures.
  • Cloud Data Engineering – Building scalable pipelines in the cloud.

Related Awesome Lists

Contribute

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.

Pull requests that do not adhere to the contribution guidelines may be closed.

License

CC0
