# Awesome Big Data

A curated list of frameworks, platforms, tools, architectures, and learning resources for Big Data, covering distributed storage, batch and stream processing, analytics engines, and large-scale data infrastructure.
## Contents

- Foundations & Concepts
- Distributed Storage & File Systems
- Data Processing Engines
- Streaming & Real-Time Processing
- Query Engines & SQL on Big Data
- Data Warehousing & OLAP
- NoSQL Databases
- Data Ingestion & Integration
- Workflow Orchestration
- Resource Management & Cluster Computing
- Monitoring, Governance & Quality
- Cloud Big Data Platforms
- Learning Resources
- Related Awesome Lists
## Foundations & Concepts

- Big Data Explained – Overview of big data characteristics, use cases, and architectures.
- Lambda Architecture – Architecture combining batch and stream processing.
- Kappa Architecture – Stream-first alternative to Lambda architecture.
- CAP Theorem – Trade-offs in distributed systems consistency, availability, and partition tolerance.
- Data Lake Architecture – Centralized storage for raw and processed data.
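The Lambda architecture above pairs a batch layer over the full dataset with a speed layer over recent events, merged at query time. A minimal pure-Python sketch of the idea; the event data and layer functions are illustrative, not any framework's API:

```python
from collections import Counter

# Hypothetical click events; the data and names are illustrative.
master_dataset = [("alice", 3), ("bob", 2), ("alice", 1)]
recent_events = [("bob", 4), ("carol", 5)]  # not yet absorbed by the batch layer

def batch_layer(events):
    """Periodically recompute an exact view from the full master dataset."""
    view = Counter()
    for user, clicks in events:
        view[user] += clicks
    return view

def speed_layer(events):
    """Incrementally aggregate only the events the last batch run missed."""
    return batch_layer(events)  # same logic, applied to recent data only

def serving_layer(batch_view, realtime_view):
    """Merge both views so queries see complete and fresh results."""
    return batch_view + realtime_view

merged = serving_layer(batch_layer(master_dataset), speed_layer(recent_events))
print(dict(merged))  # {'alice': 4, 'bob': 6, 'carol': 5}
```

The Kappa architecture drops the batch layer entirely and recomputes views by replaying the event log through the streaming path.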
## Distributed Storage & File Systems

- HDFS – Distributed file system designed for large-scale data storage.
- Amazon S3 – Object storage widely used as a data lake backend.
- Google Cloud Storage – Scalable object storage for analytics workloads.
- Azure Data Lake Storage – Optimized storage for big data analytics on Azure.
- Ceph – Distributed storage system for object, block, and file data.
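HDFS stores a file as fixed-size blocks replicated across datanodes. A toy sketch of the splitting step, with a deliberately tiny block size (HDFS defaults to 128 MB):

```python
BLOCK_SIZE = 4  # bytes here, purely for illustration; HDFS defaults to 128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file into fixed-size blocks, as HDFS does before
    replicating each block across datanodes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"hello big data")
print(blocks)  # [b'hell', b'o bi', b'g da', b'ta']
```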
## Data Processing Engines

- Apache Hadoop – Framework for distributed storage and batch processing.
- Apache Spark – Unified analytics engine for batch and stream processing.
- Apache Flink – Stream-first data processing framework with low latency.
- Apache Beam – Unified programming model for batch and streaming data.
- Dask – Parallel computing library for scalable data processing in Python.
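Engines like Hadoop and Spark express batch jobs as map and reduce stages over key-value pairs. The classic word count, sketched here with the Python standard library rather than any engine's API:

```python
from collections import defaultdict
from itertools import chain

lines = ["big data", "big pipelines", "data lakes"]

# "Map" stage: emit a (word, 1) pair for every word in every line.
pairs = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# "Shuffle + reduce" stage: group pairs by word and sum the counts.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'pipelines': 1, 'lakes': 1}
```

The engines above distribute exactly these stages across a cluster, with the shuffle moving same-keyed pairs to the same worker.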
## Streaming & Real-Time Processing

- Apache Kafka – Distributed event streaming platform.
- Apache Pulsar – Cloud-native pub/sub and streaming platform.
- Apache Storm – Real-time computation system for stream processing.
- Kafka Streams – Stream processing library built on Kafka.
- Redpanda – Kafka-compatible streaming platform with simplified operations.
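Stream processors such as Flink and Kafka Streams commonly aggregate events into time windows. A simplified tumbling-window sum in plain Python; the timestamps and window size are illustrative:

```python
from collections import defaultdict

# Hypothetical stream of (event_time_seconds, value) records.
events = [(1, 10), (4, 20), (6, 5), (11, 7), (13, 3)]
WINDOW = 5  # tumbling-window size in seconds

windows = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW  # assign the event to its window
    windows[window_start] += value

print(dict(windows))  # {0: 30, 5: 5, 10: 10}
```

Real engines add the hard parts this sketch omits: out-of-order events, watermarks, and fault-tolerant state.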
## Query Engines & SQL on Big Data

- Trino – Distributed SQL query engine for large datasets.
- Presto – Distributed SQL engine for interactive analytics; the project from which Trino was forked.
- Apache Hive – SQL-like query system for Hadoop.
- Apache Drill – Schema-free SQL query engine.
- Spark SQL – SQL module for Apache Spark.
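These engines expose standard SQL over distributed data. The shape of a typical analytical query, shown here with SQLite as a single-node stand-in; the table and rows are illustrative:

```python
import sqlite3

# SQLite stands in for a distributed engine such as Trino or Hive;
# the table and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("eu", 100), ("us", 250), ("eu", 50)],
)

# The kind of group-by aggregation these engines run over huge datasets.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM events GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('eu', 150), ('us', 250)]
```

The distributed engines run the same query by scanning partitions in parallel and merging partial aggregates.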
## Data Warehousing & OLAP

- Snowflake – Cloud-native data warehouse for analytics.
- Google BigQuery – Serverless enterprise data warehouse.
- Amazon Redshift – Managed data warehouse on AWS.
- ClickHouse – Column-oriented OLAP database for real-time analytics.
- Apache Druid – High-performance OLAP database for event data.
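Column-oriented stores like ClickHouse and Druid lay each column out contiguously, so aggregations scan only the columns they need. A toy illustration of the row-to-column pivot:

```python
# Row-oriented records, as a transactional database would store them.
rows = [
    {"ts": 1, "country": "de", "clicks": 3},
    {"ts": 2, "country": "us", "clicks": 5},
    {"ts": 3, "country": "de", "clicks": 2},
]

# Pivot to a columnar layout: an aggregate now scans one contiguous
# list instead of touching every field of every row.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(sum(columns["clicks"]))  # 10
```

Columnar layouts also compress far better, since each column holds values of one type.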
## NoSQL Databases

- Apache Cassandra – Distributed wide-column NoSQL database.
- Apache HBase – NoSQL database built on HDFS.
- MongoDB – Document-oriented NoSQL database.
- Amazon DynamoDB – Fully managed NoSQL key-value database.
- ScyllaDB – High-performance Cassandra-compatible database.
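Distributed NoSQL stores such as Cassandra and DynamoDB route each row to a node by hashing its partition key. A simplified sketch; real systems use token rings and virtual nodes, and the node names here are made up:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def owner(partition_key: str) -> str:
    """Map a partition key to a node with a stable hash. Real systems
    use token rings and virtual nodes, but the routing idea is the same."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]

# The same key always lands on the same node, with no central lookup table.
assert owner("user:42") == owner("user:42")
```

This is why choosing a good partition key matters: it determines how evenly data and load spread across nodes.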
## Data Ingestion & Integration

- Apache Kafka Connect – Framework for moving data between Kafka and external systems.
- Apache NiFi – Visual data ingestion and flow management tool.
- Apache Sqoop – Bulk data transfer between Hadoop and relational databases (retired to the Apache Attic).
- Airbyte – Open-source data integration platform.
- Fivetran – Managed ELT pipelines for analytics teams.
## Workflow Orchestration

- Apache Airflow – Platform for programmatically authoring and scheduling workflows.
- Dagster – Data orchestration platform with strong observability.
- Luigi – Python module for building complex pipelines.
- Prefect – Workflow orchestration system for data engineering.
- Argo Workflows – Kubernetes-native workflow engine.
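All of these orchestrators model a pipeline as a DAG of tasks and run it in dependency order. The core scheduling idea, sketched with Python's standard-library graphlib; the task names are illustrative:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task mapped to the tasks it depends on.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# Orchestrators execute tasks in an order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

On top of this ordering, the real tools add scheduling, retries, backfills, and observability.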
## Resource Management & Cluster Computing

- YARN – Resource management layer for Hadoop clusters.
- Kubernetes – Container orchestration platform increasingly used for big data workloads.
- Apache Mesos – Distributed systems kernel for resource isolation and sharing.
- Ray – Distributed computing framework for scalable applications.
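Frameworks like Ray generalize the pattern of mapping a function over many inputs in parallel. The single-node analogue with a standard-library thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Distributed frameworks scale this map pattern across machines;
# a thread pool is the single-node analogue.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(5)))
print(results)  # [0, 1, 4, 9, 16]
```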
## Monitoring, Governance & Quality

- Apache Atlas – Metadata management and data governance platform.
- Great Expectations – Data quality and validation framework.
- OpenLineage – Open standard for data lineage collection.
- Prometheus – Monitoring and alerting toolkit for distributed systems.
- Grafana – Visualization and observability platform.
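Tools like Great Expectations express data quality as declarative checks over a dataset. A minimal sketch of that idea in plain Python; the check functions and records are illustrative, not the library's API:

```python
# Illustrative records with one deliberate quality problem (a null revenue).
records = [
    {"id": 1, "revenue": 120.0},
    {"id": 2, "revenue": 80.5},
    {"id": 3, "revenue": None},
]

def expect_not_null(rows, column):
    """Fail any row where the column is missing a value."""
    failures = [r for r in rows if r[column] is None]
    return {"check": f"{column} not null",
            "passed": not failures, "failures": len(failures)}

def expect_between(rows, column, low, high):
    """Fail any non-null value outside the expected range."""
    failures = [r for r in rows
                if r[column] is not None and not (low <= r[column] <= high)]
    return {"check": f"{column} in [{low}, {high}]",
            "passed": not failures, "failures": len(failures)}

results = [
    expect_not_null(records, "revenue"),
    expect_between(records, "revenue", 0, 1000),
]
for result in results:
    print(result)
```

In a real pipeline, such checks run as a gate between ingestion and downstream consumers.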
## Cloud Big Data Platforms

- Databricks – Unified analytics platform built on Apache Spark.
- AWS EMR – Managed big data platform on AWS.
- Google Dataproc – Managed Spark and Hadoop service.
- Azure Synapse Analytics – Integrated analytics service for big data and warehousing.
- Alibaba Cloud MaxCompute – Large-scale data warehousing and analytics platform.
## Learning Resources

- Spark Documentation – Official guides for Apache Spark.
- Kafka Documentation – Official Kafka concepts and tutorials.
- Hadoop Documentation – Official documentation for the Hadoop ecosystem.
- Designing Data-Intensive Applications – Foundational book on scalable data systems.
- Big Data Architecture Patterns – Common patterns for big data solutions.
- Streaming Systems – Concepts and architectures for stream processing.
- Big Data Fundamentals – Distributed systems and data processing basics.
- Spark & Streaming Analytics – Batch and real-time data processing.
- Cloud Big Data Engineering – Building scalable analytics on cloud platforms.
## Contributing

Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.