{"id":241928,"date":"2024-10-04T04:42:06","date_gmt":"2024-10-04T04:42:06","guid":{"rendered":"https:\/\/dev.allcode.com\/?page_id=241928"},"modified":"2024-10-04T04:42:06","modified_gmt":"2024-10-04T04:42:06","slug":"aws-devops-distributed-tracing","status":"publish","type":"page","link":"https:\/\/dev.allcode.com\/aws-devops-distributed-tracing\/","title":{"rendered":"AWS DevOps Distributed Tracing"},"content":{"rendered":"<p>AllCode leverages AWS X-Ray to analyze and debug production applications across microservices, including Lambda Functions and Step Functions, using AWS CDK. Additionally, we collaborate with customers to integrate distributed tracing into their preferred tools, such as Splunk and Dynatrace, if needed.<\/p>\n<p>We used X-Ray in combination with AWS CloudWatch and AWS CloudTrail.<\/p>\n<p>We started by:<\/p>\n<p>1. Defining, Collecting, and Analyzing Workload Health Metrics<br \/>\nAWS Services: Use AWS CloudWatch to monitor Lambda functions and SQS queues.<br \/>\n3rd Party Tools: Integrate with seed.run for CI\/CD and log analysis.<\/p>\n<p>2. Exporting Standard Application Logs<br \/>\nAWS Services: Use AWS CloudTrail and CloudWatch Logs to capture all API calls and standard application logs.<br \/>\n3rd Party Tools: Seed.run checks Lambda logs for errors and sends email notifications.<\/p>\n<p>3. 
Defining Thresholds for Operational Metrics<br \/>\nThresholds: Define CloudWatch Alarms for key metrics like error rates, latency, and SQS Dead Letter Queue (DLQ) sizes.<\/p>\n<p>Customer Example: Monitoring and Alerts for Lambda-SQS Architecture<\/p>\n<p>Workload Health Metrics:<br \/>\nUsed AWS CloudWatch to set up metrics for Lambda error rates, SQS queue depth, and DLQ sizes.<br \/>\nIntegrated seed.run to monitor logs and automatically notify the person in charge when errors are detected.<\/p>\n<p>Standard Application Logs:<br \/>\nEnabled CloudTrail to capture API calls made to the Lambda functions and SQS queues.<br \/>\nUsed seed.run to provide an overview of past errors, their timestamps, and corresponding X-Ray traces.<\/p>\n<p>Thresholds for Alerts:<br \/>\nCloudWatch Alarms were set for:<br \/>\nLambda error rates above 1%.<br \/>\nSQS queue depth exceeding 1000 messages.<br \/>\nMore than 5 messages in the DLQ.<br \/>\nWhen any of these alarms is triggered, an email is sent to the corresponding person in charge.<\/p>\n<p>With these KPIs and metrics in place, the monitoring and alerting system supports quick error detection and resolution.<\/p>\n<p>Evidence:<\/p>\n<p>Standardized Document:<br \/>\nA comprehensive guide detailing the above KPIs and metrics is available in our internal wiki.<\/p>\n<p>Customer Example Implementation:<br \/>\nIn a recent customer engagement, we implemented the above monitoring and KPIs for a system that used Lambda functions orchestrated with SQS.<\/p>\n<p>By following these practices, we maintain consistent health monitoring and respond quickly to operational events.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>AllCode leverages AWS X-Ray to analyze and debug production applications across microservices, including Lambda Functions and Step Functions, using AWS CDK. 
Additionally, we collaborate with customers to integrate distributed tracing into their preferred tools, such as Splunk and Dynatrace, if needed. We used X-Ray in combination with AWS CloudWatch and AWS CloudTrail. We started by [&hellip;]<\/p>\n","protected":false},"author":15,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","_et_gb_content_width":"","inline_featured_image":false,"footnotes":""},"class_list":["post-241928","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/pages\/241928","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/comments?post=241928"}],"version-history":[{"count":1,"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/pages\/241928\/revisions"}],"predecessor-version":[{"id":241929,"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/pages\/241928\/revisions\/241929"}],"wp:attachment":[{"href":"https:\/\/dev.allcode.com\/wp-json\/wp\/v2\/media?parent=241928"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}