We reviewed 100+ incident reports from Q4. Here are the top 5 preventable outages we found, and what actually works against them.

1. Expired TLS certificates (the root cause in 20+ incidents)
Why it happens:
– Alerts sent to dead Slack channels
– Zero rotation visibility
– No automated expiry checks
What works:
– Auto-renew via Let's Encrypt or ACM
– Alert routing with a human fallback
– Pre-expiry alerting with escalation logic (a minimal check sketch follows below this post)

2. Stale Kubernetes secrets (we saw secrets get rotated in Vault... but never reloaded in the pod)
How to fix it:
– Sync secrets via GitOps
– Automate post-rotation reloads
– Add secret freshness checks to readiness probes

3. Leaked or forgotten IAM tokens (an unused root token made it into a CI job and got exploited)
What prevents this:
– Force expiry on all tokens
– Audit inactive access keys weekly
– Use temporary roles, not static secrets

4. Misleading health checks (half the team didn't even know what the probes were configured to test)
The better way:
– Test real request paths, not dependencies
– Validate probes against failure modes
– Treat probes like production traffic—because they are

5. DNS misconfiguration during infra cutovers (teams changed IPs or load balancers but forgot DNS TTLs)
Prevention checklist:
– Lower TTLs before change windows
– Validate DNS propagation with external checks
– Monitor the traffic split between new and old endpoints

What's one outage you've seen that was totally preventable but still took down production? I'd love to hear your stories.

♻️ Repost so others can learn.
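Here is a minimal sketch of the pre-expiry check from point 1, assuming the endpoint is directly reachable over HTTPS; the 21-day threshold and the example hostname are arbitrary placeholders, and the print statements stand in for whatever paging path has a human fallback.

```python
# Minimal TLS expiry check (illustrative sketch, not a full monitoring setup).
# Assumes direct HTTPS reachability; the 21-day threshold is an arbitrary example.
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Jun  1 12:00:00 2026 GMT'; parse it as GMT.
    expires_ts = ssl.cert_time_to_seconds(cert["notAfter"])
    expires = datetime.fromtimestamp(expires_ts, tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ["example.com"]:  # replace with your endpoints
        remaining = days_until_expiry(host)
        if remaining < 21:
            # Escalate here: page a human, not just a Slack channel.
            print(f"ALERT: {host} certificate expires in {remaining} days")
        else:
            print(f"OK: {host} has {remaining} days left")
```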
-
⚠️ Another month, another critical vulnerability? Definitely not as critical as the recent findings by Dirk-jan Mollema, but still a significant finding because of how simple it was to exploit. I'm extremely proud of our team at The Collective Consulting (and Bob Bracke in particular), who discovered this flaw and worked with MSRC to report it and get it fixed.

⚡️ TL;DR: due to a missing verification step, cross-tenant access to Azure Event Grid was possible -- no authentication required. Microsoft has fixed the vulnerability -- no action required.

❓ What happened: While building a multi-tenant event-capture service, The Collective spotted that creating an Event Grid System Topic scoped at the management-group level let them see event subscriptions from other tenants if the management-group ID was reused.

At the root: management group IDs are not globally unique (unlike subscription IDs). Microsoft's filtering logic assumed uniqueness and failed to isolate tenants properly.

👀 What could be seen: number of events, error counts, types of Azure Policy events being tracked, delivery endpoints (webhooks, function apps), authentication headers, and related settings.

⚠️ Impact: All tenants using Event Grid subscriptions at management-group scope were potentially exposed. Because there was no telemetry or alerting for this kind of cross-tenant visibility, it's unclear whether and how it was exploited. Based on feedback from MSRC, no signs of active abuse were reported. The practical exploitation "reach" may have been limited (an Event Grid subscription had to be configured at that scope), but the design flaw is significant.

👉🏻 Takeaways
Be careful with constructs that assume global uniqueness but rely on values that are only locally unique. This is true everywhere, not just in Azure! (A small illustration of this pitfall follows below.)
Keep an eye on vendor and patch disclosures—here, Microsoft's MSRC responded quickly after disclosure.

Why does this matter (especially for my network, given our focus on security)? If you're managing or designing services that span tenants (e.g., service providers, ISVs, MSPs), this kind of flaw shows how subtle scope and configuration assumptions can lead to cross-tenant visibility. In regulated environments (finance, insurance, high-security), extra scrutiny of event streaming, telemetry, and subscription isolation is critical. It reinforces the notion that even "built-in" cloud mechanisms (like Event Grid) require careful architecture and threat modelling—especially where they touch multi-tenant boundaries.

For more information, check out our blog with more details: https://lnkd.in/eGKBH3av

#microsoft #security #azure
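To make the "locally unique vs. globally unique" takeaway concrete, here is a deliberately simplified, hypothetical filter sketch (not Event Grid's actual logic): keying tenant isolation on a management-group ID alone lets two tenants that happen to reuse the same ID see each other's subscriptions, while keying on the (tenant ID, management-group ID) pair does not.

```python
# Hypothetical illustration of the uniqueness pitfall -- not Microsoft's real code.
subscriptions = [
    {"tenant_id": "tenant-A", "mg_id": "mg-landingzone", "endpoint": "https://a.example/hook"},
    {"tenant_id": "tenant-B", "mg_id": "mg-landingzone", "endpoint": "https://b.example/hook"},
]

def visible_subscriptions_buggy(caller_mg_id):
    # BUG: assumes mg_id is globally unique, so tenant-B's data leaks to tenant-A.
    return [s for s in subscriptions if s["mg_id"] == caller_mg_id]

def visible_subscriptions_fixed(caller_tenant_id, caller_mg_id):
    # Scope by the truly unique pair: (tenant, management group).
    return [s for s in subscriptions
            if s["tenant_id"] == caller_tenant_id and s["mg_id"] == caller_mg_id]

print(len(visible_subscriptions_buggy("mg-landingzone")))              # 2 -> cross-tenant leak
print(len(visible_subscriptions_fixed("tenant-A", "mg-landingzone")))  # 1 -> properly isolated
```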
-
#Kubernetes security awareness: the ability to start a Pod in a Namespace implies the ability to read Secrets in the same Namespace. 😱 Even if you have RBAC rules against it. Let's try to understand why! 🤔

How do you properly protect Secrets in Kubernetes? Say, the credentials to a production database. Tons of sensitive data in there. Developers get a Role that specifically does not allow them to "get" the Secret, because they shouldn't be able to just do that. But they should be able to start Pods in the "production" Namespace. And those Pods need to be fed the contents of the Secret, so they can connect to the database. Kubernetes has been designed to make this work. Perhaps you never thought about that strangeness? 🤷‍♂️

But you can already see the problem: if you let someone start a Pod that references a Secret, they can include code that either sends that Secret to them (use Network Policies to prevent such data exfiltration, BTW) or simply dumps the Secret's contents into the logs, where they can read it. A minimal demonstration follows below.

So if you have a Secret and your threat model says you have to protect it from your internal staff, make sure they cannot deploy Pods, either. This is actually a people problem (the threat of insiders) rather than a purely technical one. So you can't solve it with tech alone. But you can enhance a people-centric solution with technical guardrails!

How? Use a GitOps approach like Argo CD and code reviews. That way, developers don't get to start Pods themselves; they can only ask the Argo ServiceAccount to do it for them, after having their changes reviewed by their team members. There is no way to exfiltrate data unless you fool two of your team members as well. Nobody can easily slip in data-exfiltration code without a proper security code review (I'm assuming security-conscious companies review commits for this). And of course you need Network Policies, too. Can't lose data to the outside if a firewall eats the network packets. 😅

Follow me (Lars) if you find #DevOps, #DevSecOps, #Kubernetes, and #CompassionateLeadership interesting.
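A minimal sketch of the point above, using the official kubernetes Python client: a user who may create Pods (but not "get" Secrets) launches a Pod that receives a hypothetical "db-credentials" Secret as an environment variable and writes it to its logs. The namespace, Secret name, and key are made-up examples.

```python
# Sketch: pod-create rights effectively grant secret-read, despite RBAC.
# Assumes a kubeconfig context allowed to create Pods in "production" and a
# Secret named "db-credentials" with key "password" existing there (both hypothetical).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="secret-peek"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="peek",
                image="busybox",
                command=["sh", "-c", "echo leaked: $DB_PASSWORD"],  # ends up in the Pod logs
                env=[
                    client.V1EnvVar(
                        name="DB_PASSWORD",
                        value_from=client.V1EnvVarSource(
                            secret_key_ref=client.V1SecretKeySelector(
                                name="db-credentials", key="password"
                            )
                        ),
                    )
                ],
            )
        ],
    ),
)

core.create_namespaced_pod(namespace="production", body=pod)
# Afterwards, ordinary log access reveals the value: kubectl -n production logs secret-peek
```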
-
A security researcher uncovered a quiet way to walk into any Microsoft Entra tenant—no alerts, no logs, no noise. By chaining Microsoft's internal "Actor tokens" with a validation flaw in the Azure AD Graph API, an attacker could pose as any user, even Global Admins, for 24 hours across tenants.

That's a big deal because identity is the key we trust most. If changes show up under a real admin's name, how quickly would your team catch it?

Here's the simple version of how it worked: Actor tokens weren't documented, didn't follow normal security policies, and requests for them weren't logged. The Azure AD Graph API also lacked API-level logging. With a token, an attacker could read user and group details, conditional access policies, app permissions, device info, and even BitLocker keys synced to Entra. If they impersonated a Global Admin, they could change those settings—and it would look like a normal change made by a trusted account.

The researcher reported the issue in July 2025. Microsoft moved fast, rolled out fixes and mitigations, and issued a CVE on September 4 saying customers don't need to take action. There's no evidence it was exploited in the wild.

Still, this is a wake-up call: even the biggest platforms can hide deep, quiet risk. Build for resilience, assume silent failure modes, and consider reducing single-vendor dependence where it makes sense. Identity is your front door; treat it as mission-critical.

#EntraID #IdentitySecurity #CloudSecurity #ChangeYourPassword

Follow me for clear Microsoft identity security breakdowns and practical takeaways your team can use right away.
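One practical answer to "how quickly would your team catch it?" is to routinely review the Entra directory audit log. The hedged sketch below pulls recent entries from the Microsoft Graph directoryAudits endpoint and prints who changed what; it assumes an access token obtained elsewhere (e.g., an app registration with AuditLog.Read.All), and the environment variable name and page size are placeholders.

```python
# Sketch: review recent Entra directory changes via the Microsoft Graph audit log.
# Assumes GRAPH_ACCESS_TOKEN was provisioned out of band with AuditLog.Read.All.
import os
import requests

ACCESS_TOKEN = os.environ["GRAPH_ACCESS_TOKEN"]  # placeholder; acquire via your usual auth flow
URL = "https://graph.microsoft.com/v1.0/auditLogs/directoryAudits?$top=50"

resp = requests.get(URL, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"}, timeout=30)
resp.raise_for_status()

for entry in resp.json().get("value", []):
    who = (entry.get("initiatedBy", {}).get("user") or {}).get("userPrincipalName", "unknown")
    print(f'{entry.get("activityDateTime")}  {entry.get("category")}  '
          f'{entry.get("activityDisplayName")}  by {who}')
    # Route role, policy, and app-permission changes to a human for review.
```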
-
DevOps & SRE Perspective: Lessons from the Amazon Web Services US-East-1 Outage!

1. Outage context
• AWS reported "increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region," later identifying issues around the Amazon DynamoDB API endpoint and DNS resolution as the likely root cause.
• The region is a critical hub for many global workloads — meaning any failure has broad impact.
• From the trenches: "Just got woken up to multiple pages. No services are loading in east-1, can't see any of my resources. Getting alerts lambdas are failing, etc."

2. What this means for SRE/DevOps teams
• Single-region risk: Relying heavily on one region (or one availability zone) is a brittle strategy. Global services, control planes, and identity/auth systems often converge here — so when it fails, the blast radius is massive.
• DNS and foundational services matter: It's not always the compute layer that fails first. DNS, global system endpoints, and shared services (like DynamoDB and IAM) can be the weak link.
• Cascading dependencies: A failure in one service can ripple through many others. E.g., if control-plane endpoints are impacted, your fail-over mechanisms may not even activate.
• Recovery ≠ full resolution: Even after the main fault is resolved, backlogs, latencies, and unknown-state issues persist. Teams need to monitor until steady state is confirmed.

3. Practical takeaways & actions
• Adopt a multi-region / multi-AZ fallback strategy: Ensure critical workloads can shift automatically (or manually) to secondary regions or providers.
• Architect global state & control-plane resilience: Make sure services like IAM, identity auth, configuration, and global databases don't concentrate in one point of failure.
• Simulate DNS and control-plane failures in chaos testing: Practice what happens when DNS fails, when endpoint resolution slows, and when the control plane is unreachable.
• Improve monitoring + alerting on "meta-services": Don't just monitor your app metrics — watch DNS latency/resolve errors, endpoint access times, and control-plane API errors.
• Communicate clearly during incidents: Transparency and frequent updates matter. Teams downstream depend on accurate context.
• Expect eventual consistency & backlog states post-recovery: After the main fix, watch for delayed processing, stuck queues, and prolonged latencies, and reconcile state when needed.

4. Final thought
This outage is a stark reminder: being cloud-native doesn't eliminate infrastructure risk — it changes its shape. As practitioners in DevOps and SRE, our job isn't just to prevent failure (impossible) but to anticipate, survive, and recover effectively. Let's use this as an impetus to elevate our game, architect with failure in mind, and build systems that fail gracefully.

#DevOps #SRE #CloudReliability #AWS #Outage #IncidentManagement #Resilience
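As one small, concrete instance of the "monitor meta-services" point, here is a sketch that times DNS resolution for a few regional endpoints and flags slow or failing lookups; the endpoint list and the 500 ms threshold are illustrative placeholders, not recommendations.

```python
# Sketch: watch DNS resolution health for "meta" endpoints, not just app metrics.
# Endpoint list and threshold are illustrative examples only.
import socket
import time

ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "sts.amazonaws.com",
]

def check_resolution(host: str, threshold_ms: float = 500.0) -> None:
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, 443)
    except socket.gaierror as exc:
        print(f"ALERT: DNS resolution failed for {host}: {exc}")
        return
    elapsed_ms = (time.monotonic() - start) * 1000
    status = "ALERT" if elapsed_ms > threshold_ms else "OK"
    print(f"{status}: {host} resolved in {elapsed_ms:.0f} ms")

for host in ENDPOINTS:
    check_resolution(host)
```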
-
I recently led a couple of cloud-incident workshops, got a lot of great questions, had wonderful exchanges, frankly learned a lot myself, and wanted to share a few takeaways:

• Assume breach - seriously: Treat "when, not if" as an operating principle and design for resilience.
• Clarify shared responsibility: Most gaps aren't exotic zero-days - they're governance gray zones, handoffs, and multi-cloud inconsistencies.
• Identity is the control plane: MFA everywhere (necessary, but not enough), least privilege by default, regular access reviews, strong secrets management, and a push to passwordless.
• Make forensics cloud-ready: Extend log retention, preserve and analyze on copies, verify what your CSP actually provides, and rehearse with legal and IR together.
• Detect across providers: Aggregate logs (AWS/Azure/GCP/Oracle), layer in behavior-based analytics/CDR, and keep a cloud-specific IR/DR runbook ready to execute.
• Bonus reality check: host/VM escapes are rare - but possible. Don't build your program around unicorns; prioritize immutable builds, hardening, and hygiene first.

If you'd like my cloud IR readiness checklist or the TM approach I've been using, drop a comment and we'll share. Let's raise the bar together.

#CloudSecurity #IncidentResponse #ThreatModeling #CISO #DevSecOps #DigitalForensics #MDR
EPAM Systems Eugene Dzihanau Chris Thatcher Adam Bishop Julie Hansberry, MBA Ken Gordon Sharon Nimirovski Aviv Srour
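For the "regular access reviews" bullet, here is a sketch using boto3 that flags AWS IAM access keys not used in the last N days; the 90-day window is an arbitrary example, and read-only IAM permissions are assumed.

```python
# Sketch: flag IAM access keys unused for N days (read-only audit, AWS example).
# Assumes credentials with iam:ListUsers, iam:ListAccessKeys, iam:GetAccessKeyLastUsed.
from datetime import datetime, timezone, timedelta
import boto3

STALE_AFTER = timedelta(days=90)  # arbitrary example window
iam = boto3.client("iam")
now = datetime.now(timezone.utc)

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
        for key in keys:
            last = iam.get_access_key_last_used(AccessKeyId=key["AccessKeyId"])
            used = last["AccessKeyLastUsed"].get("LastUsedDate")  # absent if never used
            if used is None or now - used > STALE_AFTER:
                print(f'Review: {user["UserName"]} key {key["AccessKeyId"]} '
                      f'(last used: {used or "never"})')
```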
-
A critical security flaw has been discovered in certain Azure Active Directory (AAD) setups where appsettings.json files—meant for internal application configuration—have been inadvertently published in publicly accessible locations. These files include sensitive credentials: the ClientId and ClientSecret.

Why it's dangerous: with these exposed credentials, an attacker can:
1. Authenticate via Microsoft's OAuth 2.0 client credentials flow
2. Generate valid access tokens
3. Impersonate legitimate applications
4. Access Microsoft Graph APIs to enumerate users, groups, and directory roles (especially when the application has been granted high permissions such as Directory.Read.All or Mail.Read)

Potential damage:
• Unauthorized access or data harvesting from SharePoint, OneDrive, and Exchange Online
• Deployment of malicious applications under existing trusted app identities
• Escalation to full access across Microsoft 365 tenants

Suggested mitigations:
• Immediately review and remove any publicly exposed configuration files (e.g., appsettings.json containing AAD credentials).
• Secure application secrets using secret management tools like Azure Key Vault or environment-based configuration.
• Audit the permissions granted to AAD applications—minimize scope and avoid overly permissive roles.
• Monitor tenant activity and access via Microsoft Graph to detect unauthorized app access or impersonation.

https://lnkd.in/e3CZ9Whx
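One way to act on the "use a secret manager" mitigation: a sketch that reads the client secret from Azure Key Vault at runtime via azure-identity and azure-keyvault-secrets instead of shipping it in appsettings.json. The vault URL and secret name are placeholders for illustration.

```python
# Sketch: load an app secret from Azure Key Vault instead of appsettings.json.
# Requires the azure-identity and azure-keyvault-secrets packages; the vault URL
# and secret name below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://<your-vault-name>.vault.azure.net"  # placeholder

credential = DefaultAzureCredential()  # managed identity, env vars, or dev login
client = SecretClient(vault_url=VAULT_URL, credential=credential)

client_secret = client.get_secret("aad-client-secret").value  # placeholder secret name
# Keep the value in memory only; never write it back into a config file or repo.
```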
-
Security starts with the right permissions. Running pods as root will cause you so much headache. Here is an excellent lab to help you see the impact. You can run it locally with Minikube!

Here are some things I have been learning while studying for the CKS that can help:

• PodSecurity Standards and admission controllers: Leverage Kubernetes-native features like PodSecurity admission to enforce non-root execution policies, so compliance is baked into your cluster setup.
• CI/CD pipeline security scans: Integrate security scanners like Trivy or Kubeaudit into your CI/CD pipelines to identify and block image builds that require root privileges, ensuring only compliant images are created.
• Container security contexts: Use the Kubernetes securityContext to explicitly define non-root user settings (runAsUser, runAsGroup, allowPrivilegeEscalation) in your manifests, and enforce their use through tools like OPA/Gatekeeper policies in your GitOps workflow.
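A small sketch of the securityContext bullet using the kubernetes Python client; the image, user, and group IDs are arbitrary examples, and the same fields map one-to-one onto a YAML manifest's securityContext block.

```python
# Sketch: enforce non-root execution via securityContext (IDs and image are examples).
from kubernetes import client

secure_container = client.V1Container(
    name="app",
    image="registry.example.com/app:1.0",  # placeholder image built to run as non-root
    security_context=client.V1SecurityContext(
        run_as_non_root=True,             # kubelet refuses to start a root container
        run_as_user=10001,                # arbitrary non-zero UID
        run_as_group=10001,               # arbitrary non-zero GID
        allow_privilege_escalation=False,
        read_only_root_filesystem=True,
        capabilities=client.V1Capabilities(drop=["ALL"]),
    ),
)

pod_spec = client.V1PodSpec(
    containers=[secure_container],
    security_context=client.V1PodSecurityContext(fs_group=10001),
)
# Pair this with the "restricted" Pod Security Standard on the namespace and an
# OPA/Gatekeeper policy so non-compliant manifests are rejected at admission.
```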