@vinicpires - Site Reliability Engineer

Available for hire

Years of experience: 4+ years
Experience level: Senior
Available for: Full-time

Site Reliability Engineer with 4+ years of experience managing high-scale multi-cluster Kubernetes environments (100+ nodes) for 1M+ concurrent users. Proven track record in cloud cost optimization, generating $140k+ in annual savings through FinOps and infrastructure right-sizing. Expert in GitOps, CI/CD automation, and enterprise-grade observability (OpenTelemetry, Datadog) to drive high availability and drastically reduce MTTR in distributed systems.

Employment History

Site Reliability Engineer at Kaizen Gaming 2025 - Now

- Engineered high-performance GitLab CI/CD pipelines, cutting lead time for changes by 60% and implementing automated canary deployments for zero-downtime releases.
- Orchestrated multi-region OpenShift clusters (100+ nodes) across on-premises and cloud (Azure) environments, supporting a high-traffic gaming platform with 1M+ concurrent users and maintaining 99.9% availability for high-memory workloads.
- Migrated Helm-based deployments to ArgoCD (GitOps), establishing a centralized and auditable multi-cluster deployment model.
- Contributed to cloud cost optimization initiatives, right-sizing overprovisioned instances and reducing monthly infrastructure spend from ~€32k to an estimated ~€20k (~€144k annual savings) for a single Kubernetes cluster.
- Designed and operated full-stack observability platforms (OpenTelemetry, Prometheus, Grafana, VictoriaMetrics, OpenSearch), reducing troubleshooting time and improving operational visibility across distributed systems.
- Developed a Python automation tool to map GitLab source code to live Helm/ArgoCD deployments on OpenShift, eliminating 'orphan' applications and reducing cross-team troubleshooting time by 40%.

Observability Engineer at Appoena (allocated at MARS Inc.) 2024 - 2025

- Contributed to an enterprise-wide observability consolidation initiative, migrating multiple legacy monitoring stacks into a unified Datadog platform and establishing a single source of truth across services and infrastructure.
- Optimized Datadog usage (log pipelines, retention strategy, and metric cardinality), reducing annual platform costs by approximately $20,000 without compromising visibility.
- Identified and eliminated unused Azure ExpressRoute circuits, preventing over $100k in unnecessary annual cloud expenses.
- Redesigned alerting and observability workflows, reducing troubleshooting time (MTTR) from ~20 minutes to under 2 minutes by improving signal quality and actionable monitoring standards.

DevOps Engineer at Jack Experts 2023 - 2024

- Drove Kubernetes (EKS) cost optimization efforts across multi-cloud environments (AWS, Azure, GCP), contributing to a ~20% reduction in infrastructure costs through workload tuning and resource right-sizing.
- Designed an automated incident management workflow by integrating Zabbix, Rundeck, and an ITSM tool, reducing manual intervention for recurring production alerts by 40% and significantly accelerating MTTR.
- Designed and implemented standardized CI/CD pipelines (GitLab CI, GitHub Actions), improving deployment reliability and release safety.
- Applied FinOps practices to improve cost visibility and eliminate inefficient resource allocation, increasing cloud spending predictability.
- Implemented Infrastructure as Code (Terraform, Ansible) and operated production-grade Kubernetes platforms with full-stack observability (Prometheus, Grafana, Zabbix, Loki, and Promtail).

Support Engineer at Acelera Litoral 2022 - 2023

Provided Tier 2 technical support for multiple clients, stabilizing network infrastructure and reducing ticket resolution time by improving monitoring alerts and documentation.

Education

Bachelor of Computer Science at Universidade São Judas Tadeu 2022 - 2025
