Overview
Amazon EMR (formerly Amazon Elastic MapReduce) is a managed cluster platform for processing and analyzing large volumes of data. It supports a range of data processing engines, including Apache Hadoop, Apache Spark, Apache Hive, and Apache Flink.
AWS EMR clusters send metrics to CloudWatch at 5-minute intervals. According to the Amazon EMR documentation, this interval is not configurable.
Collect EMR Hadoop 2.x metrics from AWS EMR clusters.
Steps/Tasks
- Set up an AWS EMR cluster
  - Set up an EMR cluster with Hadoop 2.x.
- Fetch metrics using the generic cloudwatch Metricbeat module
  - Use the cloudwatch Metricbeat module to fetch the required EMR Hadoop 2.x metrics.
- Create the AWS EMR metrics integration
  - Integrate the collected metrics into our existing monitoring infrastructure.
  - Ensure proper handling and processing of EMR Hadoop 2.x metrics.
- Create AWS EMR metrics data stream documentation
  - Document the newly created data stream for EMR Hadoop 2.x metrics.
  - Include comprehensive information about metric names, dimensions, and their meanings.
- Add pipeline and pipeline tests (if needed)
  - Implement a pipeline, if necessary, to process the collected metrics.
  - Develop tests to verify the correctness of the pipeline.
- Add system tests using Terraform
  - Use Terraform to create automated system tests that validate end-to-end metrics collection.
  - Verify that the metrics are collected and ingested accurately.
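To make the fetch step more concrete, here is a rough sketch of the kind of CloudWatch query that collecting EMR metrics comes down to. It is not the Metricbeat module's implementation. It only builds the request payload for the CloudWatch `GetMetricData` API, using the `AWS/ElasticMapReduce` namespace and the `JobFlowId` dimension. The helper name `build_metric_queries`, the cluster ID `j-EXAMPLE`, and the short metric list are assumptions made for illustration; the final set of metrics is still to be defined in the Docs section.

```python
# Sketch only: builds CloudWatch GetMetricData queries for an EMR cluster.
# The helper name and the example cluster ID are hypothetical.

EMR_NAMESPACE = "AWS/ElasticMapReduce"

# A small, non-exhaustive sample of EMR Hadoop 2.x metric names.
EXAMPLE_METRICS = ["IsIdle", "AppsRunning", "ContainerAllocated", "HDFSUtilization"]


def build_metric_queries(cluster_id, metric_names, period_seconds=300, stat="Average"):
    """Return one MetricDataQuery dict per metric, scoped to a single cluster."""
    queries = []
    for index, name in enumerate(metric_names):
        queries.append(
            {
                # Query IDs must start with a lowercase letter.
                "Id": f"m{index}",
                "Label": name,
                "MetricStat": {
                    "Metric": {
                        "Namespace": EMR_NAMESPACE,
                        "MetricName": name,
                        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
                    },
                    # 300 seconds matches the 5-minute EMR reporting interval.
                    "Period": period_seconds,
                    "Stat": stat,
                },
                "ReturnData": True,
            }
        )
    return queries


if __name__ == "__main__":
    for query in build_metric_queries("j-EXAMPLE", EXAMPLE_METRICS):
        print(query["Id"], query["MetricStat"]["Metric"]["MetricName"])
```

With real AWS credentials, the returned list could be passed as `MetricDataQueries` to boto3's `get_metric_data` on a CloudWatch client, together with a `StartTime` and `EndTime`. That call is left out here so the sketch runs without AWS access.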
Docs
Metrics and dimensions that should be collected for Hadoop 2.x clusters: