# Awesome Big Data

A curated list of frameworks, platforms, tools, architectures, and learning resources for Big Data, covering distributed storage, batch and stream processing, analytics engines, and large-scale data infrastructure.
## Contents

- Foundations & Concepts
- Distributed Storage & File Systems
- Data Processing Engines
- Streaming & Real-Time Processing
- Query Engines & SQL on Big Data
- Data Warehousing & OLAP
- NoSQL Databases
- Data Ingestion & Integration
- Workflow Orchestration
- Resource Management & Cluster Computing
- Monitoring, Governance & Quality
- Cloud Big Data Platforms
- Learning Resources
- Related Awesome Lists
## Foundations & Concepts

- Big Data Explained – Overview of big data characteristics, use cases, and architectures.
- Lambda Architecture – Architecture combining batch and stream processing.
- Kappa Architecture – Stream-first alternative to Lambda architecture.
- CAP Theorem – Trade-offs in distributed systems consistency, availability, and partition tolerance.
- Data Lake Architecture – Centralized storage for raw and processed data.
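The Lambda architecture above pairs a batch layer over the full dataset with a speed layer over recent events, merged at query time. A minimal pure-Python sketch of the idea; the event data and layer functions are illustrative, not any framework's API:

```python
from collections import Counter

# Hypothetical click events; the data and names are illustrative.
master_dataset = [("alice", 3), ("bob", 2), ("alice", 1)]
recent_events = [("bob", 4), ("carol", 5)]  # not yet absorbed by the batch layer

def batch_layer(events):
    """Periodically recompute an exact view from the full master dataset."""
    view = Counter()
    for user, clicks in events:
        view[user] += clicks
    return view

def speed_layer(events):
    """Incrementally aggregate only the events the last batch run missed."""
    return batch_layer(events)  # same logic, applied to recent data only

def serving_layer(batch_view, realtime_view):
    """Merge both views so queries see complete and fresh results."""
    return batch_view + realtime_view

merged = serving_layer(batch_layer(master_dataset), speed_layer(recent_events))
print(dict(merged))  # {'alice': 4, 'bob': 6, 'carol': 5}
```

The Kappa architecture drops the batch layer entirely and recomputes views by replaying the event log through the streaming path.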
## Distributed Storage & File Systems

- HDFS – Distributed file system designed for large-scale data storage.
- Amazon S3 – Object storage widely used as a data lake backend.
- Google Cloud Storage – Scalable object storage for analytics workloads.
- Azure Data Lake Storage – Optimized storage for big data analytics on Azure.
- Ceph – Distributed storage system for object, block, and file data.
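HDFS stores a file as fixed-size blocks replicated across datanodes. A toy sketch of the splitting step, with a deliberately tiny block size (HDFS defaults to 128 MB):

```python
BLOCK_SIZE = 4  # bytes here, purely for illustration; HDFS defaults to 128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file into fixed-size blocks, as HDFS does before
    replicating each block across datanodes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"hello big data")
print(blocks)  # [b'hell', b'o bi', b'g da', b'ta']
```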
## Data Processing Engines

- Apache Hadoop – Framework for distributed storage and batch processing.
- Apache Spark – Unified analytics engine for batch and stream processing.
- Apache Flink – Stream-first data processing framework with low latency.
- Apache Beam – Unified programming model for batch and streaming data.
- Dask – Parallel computing library for scalable data processing in Python.
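Engines like Hadoop and Spark express batch jobs as map and reduce stages over key-value pairs. The classic word count, sketched here with the Python standard library rather than any engine's API:

```python
from collections import defaultdict
from itertools import chain

lines = ["big data", "big pipelines", "data lakes"]

# "Map" stage: emit a (word, 1) pair for every word in every line.
pairs = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# "Shuffle + reduce" stage: group pairs by word and sum the counts.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'pipelines': 1, 'lakes': 1}
```

The engines above distribute exactly these stages across a cluster, with the shuffle moving same-keyed pairs to the same worker.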
## Streaming & Real-Time Processing

- Apache Kafka – Distributed event streaming platform.
- Apache Pulsar – Cloud-native pub/sub and streaming platform.
- Apache Storm – Real-time computation system for stream processing.
- Kafka Streams – Stream processing library built on Kafka.
- Redpanda – Kafka-compatible streaming platform with simplified operations.
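Stream processors such as Flink and Kafka Streams commonly aggregate events into time windows. A simplified tumbling-window sum in plain Python; the timestamps and window size are illustrative:

```python
from collections import defaultdict

# Hypothetical stream of (event_time_seconds, value) records.
events = [(1, 10), (4, 20), (6, 5), (11, 7), (13, 3)]
WINDOW = 5  # tumbling-window size in seconds

windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW  # assign the event to its window
    windows[window_start] += value

print(dict(windows))  # {0: 30, 5: 5, 10: 10}
```

Real engines add the hard parts this sketch omits: out-of-order events, watermarks, and fault-tolerant state.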
## Query Engines & SQL on Big Data

- Trino – Distributed SQL query engine for large datasets.
- Presto – Distributed SQL engine for interactive analytics; the project from which Trino was forked.
- Apache Hive – SQL-like query system for Hadoop.
- Apache Drill – Schema-free SQL query engine.
- Spark SQL – SQL module for Apache Spark.
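These engines expose standard SQL over distributed data. The shape of a typical analytical query, shown here with SQLite as a single-node stand-in; the table and rows are illustrative:

```python
import sqlite3

# SQLite stands in for a distributed engine such as Trino or Hive;
# the table and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("eu", 100), ("us", 250), ("eu", 50)],
)

# The kind of group-by aggregation these engines run over huge datasets.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM events GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('eu', 150), ('us', 250)]
```

The distributed engines run the same query by scanning partitions in parallel and merging partial aggregates.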
## Data Warehousing & OLAP

- Snowflake – Cloud-native data warehouse for analytics.
- Google BigQuery – Serverless enterprise data warehouse.
- Amazon Redshift – Managed data warehouse on AWS.
- ClickHouse – Column-oriented OLAP database for real-time analytics.
- Apache Druid – High-performance OLAP database for event data.
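Column-oriented stores like ClickHouse and Druid lay each column out contiguously, so aggregations scan only the columns they need. A toy illustration of the row-to-column pivot:

```python
# Row-oriented records, as a transactional database would store them.
rows = [
    {"ts": 1, "country": "de", "clicks": 3},
    {"ts": 2, "country": "us", "clicks": 5},
    {"ts": 3, "country": "de", "clicks": 2},
]

# Pivot to a columnar layout: an aggregate now scans one contiguous
# list instead of touching every field of every row.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(sum(columns["clicks"]))  # 10
```

Columnar layouts also compress far better, since each column holds values of one type.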
## NoSQL Databases

- Apache Cassandra – Distributed wide-column NoSQL database.
- Apache HBase – NoSQL database built on HDFS.
- MongoDB – Document-oriented NoSQL database.
- Amazon DynamoDB – Fully managed NoSQL key-value database.
- ScyllaDB – High-performance Cassandra-compatible database.
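Distributed NoSQL stores such as Cassandra and DynamoDB route each row to a node by hashing its partition key. A simplified sketch; real systems use token rings and virtual nodes, and the node names here are made up:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def owner(partition_key: str) -> str:
    """Map a partition key to a node with a stable hash. Real systems
    use token rings and virtual nodes, but the routing idea is the same."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]

# The same key always lands on the same node, with no central lookup table.
assert owner("user:42") == owner("user:42")
```

This is why choosing a good partition key matters: it determines how evenly data and load spread across nodes.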
## Data Ingestion & Integration

- Apache Kafka Connect – Framework for moving data between Kafka and external systems.
- Apache NiFi – Visual data ingestion and flow management tool.
- Apache Sqoop – Bulk data transfer between Hadoop and relational databases (retired to the Apache Attic).
- Airbyte – Open-source data integration platform.
- Fivetran – Managed ELT pipelines for analytics teams.
## Workflow Orchestration

- Apache Airflow – Platform for programmatically authoring and scheduling workflows.
- Dagster – Data orchestration platform with strong observability.
- Luigi – Python module for building complex pipelines.
- Prefect – Workflow orchestration system for data engineering.
- Argo Workflows – Kubernetes-native workflow engine.
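All of these orchestrators model a pipeline as a DAG of tasks and run it in dependency order. The core scheduling idea, sketched with Python's standard-library graphlib; the task names are illustrative:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task mapped to the tasks it depends on.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# Orchestrators execute tasks in an order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

On top of this ordering, the real tools add scheduling, retries, backfills, and observability.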
## Resource Management & Cluster Computing

- YARN – Resource management layer for Hadoop clusters.
- Kubernetes – Container orchestration platform increasingly used for big data workloads.
- Apache Mesos – Distributed systems kernel for resource isolation and sharing.
- Ray – Distributed computing framework for scalable applications.
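Frameworks like Ray generalize the pattern of mapping a function over many inputs in parallel. The single-node analogue with a standard-library thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Distributed frameworks scale this map pattern across machines;
# a thread pool is the single-node analogue.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(5)))
print(results)  # [0, 1, 4, 9, 16]
```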
## Monitoring, Governance & Quality

- Apache Atlas – Metadata management and data governance platform.
- Great Expectations – Data quality and validation framework.
- OpenLineage – Open standard for data lineage collection.
- Prometheus – Monitoring and alerting toolkit for distributed systems.
- Grafana – Visualization and observability platform.
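Tools like Great Expectations express data quality as declarative checks over a dataset. A minimal sketch of that idea in plain Python; the check functions and records are illustrative, not the library's API:

```python
# Illustrative records with one deliberate quality problem (a null revenue).
records = [
    {"id": 1, "revenue": 120.0},
    {"id": 2, "revenue": 80.5},
    {"id": 3, "revenue": None},
]

def expect_not_null(rows, column):
    """Fail any row where the column is missing a value."""
    failures = [r for r in rows if r[column] is None]
    return {"check": f"{column} not null",
            "passed": not failures, "failures": len(failures)}

def expect_between(rows, column, low, high):
    """Fail any non-null value outside the expected range."""
    failures = [r for r in rows
                if r[column] is not None and not (low <= r[column] <= high)]
    return {"check": f"{column} in [{low}, {high}]",
            "passed": not failures, "failures": len(failures)}

results = [
    expect_not_null(records, "revenue"),
    expect_between(records, "revenue", 0, 1000),
]
for result in results:
    print(result)
```

In a real pipeline, such checks run as a gate between ingestion and downstream consumers.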
## Cloud Big Data Platforms

- Databricks – Unified analytics platform built on Apache Spark.
- AWS EMR – Managed big data platform on AWS.
- Google Dataproc – Managed Spark and Hadoop service.
- Azure Synapse Analytics – Integrated analytics service for big data and warehousing.
- Alibaba Cloud MaxCompute – Large-scale data warehousing and analytics platform.
## Learning Resources

- Spark Documentation – Official guides for Apache Spark.
- Kafka Documentation – Official Kafka concepts and tutorials.
- Hadoop Documentation – Official documentation for the Hadoop ecosystem.
- Designing Data-Intensive Applications – Foundational book on scalable data systems.
- Big Data Architecture Patterns – Common patterns for big data solutions.
- Streaming Systems – Concepts and architectures for stream processing.
- Big Data Fundamentals – Distributed systems and data processing basics.
- Spark & Streaming Analytics – Batch and real-time data processing.
- Cloud Big Data Engineering – Building scalable analytics on cloud platforms.
## Contributing

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.