brandonhimpfen/awesome-big-data

Awesome Big Data

A curated list of frameworks, platforms, tools, architectures, and learning resources for Big Data, covering distributed storage, batch and stream processing, analytics engines, and large-scale data infrastructure.

Contents

Foundations & Concepts

Distributed Storage & File Systems

  • HDFS – Distributed file system designed for large-scale data storage.
  • Amazon S3 – Object storage widely used as a data lake backend.
  • Google Cloud Storage – Scalable object storage for analytics workloads.
  • Azure Data Lake Storage – Optimized storage for big data analytics on Azure.
  • Ceph – Distributed storage system for object, block, and file data.

Data Processing Engines

  • Apache Hadoop – Framework for distributed storage and batch processing.
  • Apache Spark – Unified analytics engine for batch and stream processing.
  • Apache Flink – Stream-first data processing framework with low latency.
  • Apache Beam – Unified programming model for batch and streaming data.
  • Dask – Parallel computing library for scalable data processing in Python.

Streaming & Real-Time Processing

  • Apache Kafka – Distributed event streaming platform.
  • Apache Pulsar – Cloud-native pub/sub and streaming platform.
  • Apache Storm – Real-time computation system for stream processing.
  • Kafka Streams – Stream processing library built on Kafka.
  • Redpanda – Kafka-compatible streaming platform with simplified operations.

Query Engines & SQL on Big Data

  • Trino – Distributed SQL query engine for large datasets.
  • Presto – Distributed SQL query engine for interactive analytics; Trino originated as a fork of it.
  • Apache Hive – SQL-like query system for Hadoop.
  • Apache Drill – Schema-free SQL query engine.
  • Spark SQL – SQL module for Apache Spark.

Data Warehousing & OLAP

NoSQL Databases

Data Ingestion & Integration

  • Apache Kafka Connect – Framework for moving data between Kafka and external systems.
  • Apache NiFi – Visual data ingestion and flow management tool.
  • Apache Sqoop – Bulk data transfer between Hadoop and relational databases (retired to the Apache Attic in 2021).
  • Airbyte – Open-source data integration platform.
  • Fivetran – Managed ELT pipelines for analytics teams.

Workflow Orchestration

  • Apache Airflow – Platform for programmatically authoring and scheduling workflows.
  • Dagster – Data orchestration platform with strong observability.
  • Luigi – Python module for building complex pipelines.
  • Prefect – Workflow orchestration system for data engineering.
  • Argo Workflows – Kubernetes-native workflow engine.

Resource Management & Cluster Computing

  • YARN – Resource management layer for Hadoop clusters.
  • Kubernetes – Container orchestration platform increasingly used for big data workloads.
  • Apache Mesos – Distributed systems kernel for resource isolation and sharing.
  • Ray – Distributed computing framework for scalable applications.

Monitoring, Governance & Quality

  • Apache Atlas – Metadata management and data governance platform.
  • Great Expectations – Data quality and validation framework.
  • OpenLineage – Open standard for data lineage collection.
  • Prometheus – Monitoring and alerting toolkit for distributed systems.
  • Grafana – Visualization and observability platform.

Cloud Big Data Platforms

Learning Resources

Tutorials

Guides

Courses

  • Big Data Fundamentals – Distributed systems and data processing basics.
  • Spark & Streaming Analytics – Batch and real-time data processing.
  • Cloud Big Data Engineering – Building scalable analytics on cloud platforms.

Related Awesome Lists

Contribute

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.

Pull requests that do not adhere to the contribution guidelines may be closed.

License

CC0
