You're a tech lead who's just joined a team, and on your first day the CTO tells you:
"Our video processing API is very slow, looks like it's a problem with AWS EFS that we are using. Take a look and fix it."
Story time.
You are a tech lead who's tasked with backing up 1.5 TB (yes, not a typo, terabytes) of data DAILY to S3.
You don't have any other background on this requirement, so you start asking some questions:
1. Where is the data produced, and where is it currently stored?
You're a team lead who's just back from a vacation.
In the first standup after you return, here's the conversation with your teammate (T):
Teammate (T): Hey, I have been blocked for the last 2 days because I don't have AWS access credentials. I am unable to proceed with any work.
Story time.
A CTO of a company calls you. They just migrated from Heroku to AWS on EKS.
He's happy with the migration but wants you to build Heroku's "Ephemeral Preview Apps" on Kubernetes.
You know you can use ArgoCD here, but you're in for some surprises and complications!
You're woken up by a p90 latency-related alert.
This alert is for the main API service, so you start investigating right away.
Your first thought is: it was working well so far, so what changed, a deployment or a config? Hours later, you'd find out that it was neither.
Story time.
Here's a story of how a pragmatic tech lead who understands networking fundamentals like
- iptables
- packet routing and
- NAT
saved thousands of $$$ on cloud costs.
If you think low-level fundamentals don't matter in 2023, be ready for an awakening!
A short debugging story to start off the new year.
Developer (D): Hey, can you join a call? I need some help in debugging a webhook API connectivity issue. The customer team is also on the call.
Team lead (You): Okay sure, add me to the call.
You: So, what are you trying to do?
You're an SRE responsible for a VictoriaMetrics deployment with 30 million time series/min.
The CTO wants you to drastically reduce the costs for this infra without compromising reliability.
You come up with a solution that looks ridiculous at first, but makes total sense.
You're a lead SRE, and the CTO asks you to manage and scale a self-managed 6-node MySQL cluster with 1.5+ TB of data in production.
You do what it takes, a few months pass, but now, it's time to move to a managed service.
You think this should be straightforward, but it's not so easy.
I have been working heavily with databases recently. Here's a reading list I'd suggest to people interested in learning about databases and data modeling in general.
- learning.oreilly.com/library/view/d… Designing Data-Intensive Applications
- learning.oreilly.com/library/view/d… Database Internals
-
A founder of a recently funded company calls you.
He wants you to come in and fix webapp performance problems.
You have been in similar situations before and want another thrill, so you say - "Yes, I'll take a look".
But this isn't the typical problem you've faced before 👇
Recap:
You went from
- debugging EFS
- to tailing logs to find API latency
- to debugging the DB schema and queries
- and finally landed on the config file fix
which worked.
Real-life debugging is pretty messy. The more situations you have to debug, the better you become.
A new exercise is available on the Go bootcamp.
Learn to build your own toy Redis in Go. This exercise will help you learn core systems programming concepts related to TCP and concurrency, as well as data structures.
We'll plan a live coding session about this problem soon 🤞