Niranjan
Essay III · June 2021 · 5 min read

How I accidentally became a DevOps engineer

Inheriting an AWS account at a small fintech after the vendor was fired, with no handover and no AWS on my resume.

In 2018, the small fintech I worked at had its AWS run by an offshore firm. Most things worked. Services ran, bills got paid. Then the bill started climbing past what any of us thought the workload justified. Requests for explanations got slow replies, or none at all. Management asked me to poke at the account informally a couple of times, and what I saw made them less confident rather than more. One morning the contract was terminated. Not gracefully. No handover.

I took it on. Frontend engineer at the time, around ten of us on the engineering team, and I'd had some GCP exposure from a Google Summer of Code project a couple of years earlier. Nothing deep, no certifications, no AWS on paper. But someone had to, and the next-closest person had less. The case for me was mostly that the learning was going to happen one way or another, and I was the closest to where it had to start.

For the first few days I was a headless chicken. No runbooks, nothing tagged. A root account, a handful of IAM users with long-lived keys, and a console full of things I'd never seen before. What got me moving was drawing parallels with GCP: IAM vs IAM, S3 vs GCS, EC2 vs Compute Engine, Security Groups vs firewall rules, VPC vs VPC (close enough). The names and defaults differ in small ways that matter, but the mental model mostly carries. A week in, I stopped needing the translation for most things.

Access & security first. Rotated every key. Dropped IAM users who didn't need to exist. Pulled buckets off the public internet where they shouldn't have been. Tightened security groups a notch at a time, and watched Slack for a day to see what broke.
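The rotation pass boils down to a filter over key metadata. A minimal sketch of that decision logic — in practice the data would come from IAM's `ListAccessKeys` call via boto3; the dict shape mirrors what IAM returns, but the 90-day threshold is my own illustrative choice, not anything the account enforced:

```python
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)  # illustrative threshold, not a hard rule

def keys_to_rotate(access_keys, now=None):
    """Return IDs of keys that are active but older than MAX_KEY_AGE.

    `access_keys` is a list of dicts shaped like IAM's key metadata:
    {"AccessKeyId": str, "Status": "Active"|"Inactive", "CreateDate": datetime}.
    """
    now = now or datetime.now(timezone.utc)
    return [
        k["AccessKeyId"]
        for k in access_keys
        if k["Status"] == "Active" and now - k["CreateDate"] > MAX_KEY_AGE
    ]

# Made-up example keys: one fresh, one long-lived.
now = datetime(2018, 6, 1, tzinfo=timezone.utc)
keys = [
    {"AccessKeyId": "AKIA_FRESH", "Status": "Active",
     "CreateDate": now - timedelta(days=10)},
    {"AccessKeyId": "AKIA_STALE", "Status": "Active",
     "CreateDate": now - timedelta(days=400)},
]
print(keys_to_rotate(keys, now))  # ['AKIA_STALE']
```

The same filter-then-act shape applies to the unused IAM users: list, flag, confirm, delete.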

Rightsizing second. EC2 instance types matched to CloudWatch metrics I could actually see, not to whatever the previous vendor had guessed. A lot of m4.2xlarges doing the work of m4.larges. That move alone paid for several months of my learning curve.
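The rightsizing call was never sophisticated: if an instance's CPU never gets near its ceiling over a long window, it's a downsizing candidate. A hedged sketch of that rule — the instance-family ladder and the 40% headroom cutoff are illustrative, and real utilization numbers would come from CloudWatch's `GetMetricStatistics`:

```python
# Next step down within the same family (illustrative subset of m4 sizes).
DOWNSIZE = {
    "m4.2xlarge": "m4.xlarge",
    "m4.xlarge": "m4.large",
}

def rightsize(instance_type, cpu_p95_percent, headroom=40.0):
    """Suggest one size smaller when p95 CPU leaves ample headroom.

    Halving an instance roughly doubles its relative load, so only
    step down when p95 CPU would still sit under ~80% afterwards.
    """
    smaller = DOWNSIZE.get(instance_type)
    if smaller and cpu_p95_percent < headroom:
        return smaller
    return instance_type

print(rightsize("m4.2xlarge", cpu_p95_percent=12.0))  # m4.xlarge
print(rightsize("m4.xlarge", cpu_p95_percent=70.0))   # m4.xlarge (busy, keep)
```

One step at a time, then watch the metrics for a week before stepping again — the same cautious loop as the security-group tightening.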

Then the culling, which was the part I was least proud of at the time. Anything with no identifiable owner, no traffic, and no explanation got stopped. Not terminated, just stopped. Then I'd wait for a cry on Slack. Most of the time nothing came. Occasionally something did, and I'd start it back up and learn what it was. Trial & error, with Slack as the monitoring plane. Not elegant. But the alternative was paying for services nobody knew existed, indefinitely.
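Stop-and-wait is crude but it's also easy to make systematic. A sketch of the selection step, assuming owner tags and some traffic count have already been gathered per instance — the `Owner` tag key and the zero-traffic threshold are my own illustration, not a convention the account had:

```python
def cull_candidates(instances, traffic_threshold=0):
    """Instances with no Owner tag and no observed traffic are stop candidates.

    `instances`: list of dicts like
    {"id": str, "tags": dict, "requests_last_30d": int}.
    Stopping (not terminating) keeps the disks, so a wrong call is
    reversible: start it back up when someone shouts on Slack.
    """
    return [
        i["id"]
        for i in instances
        if "Owner" not in i.get("tags", {})
        and i.get("requests_last_30d", 0) <= traffic_threshold
    ]

fleet = [
    {"id": "i-api", "tags": {"Owner": "payments"}, "requests_last_30d": 90_000},
    {"id": "i-mystery", "tags": {}, "requests_last_30d": 0},
]
print(cull_candidates(fleet))  # ['i-mystery']
```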

Reserved Instances came later, and not quickly. RIs are a one-year lock-in at minimum, and a small startup that might pivot its workload in six months doesn't commit lightly to that. I waited until I was confident which instances were truly baseline load before buying anything. Spot went in where we could tolerate interruption, mostly batch workloads.
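The hesitation over RIs is just break-even arithmetic: a one-year commitment only wins if the instance actually runs for most of the year. A sketch with made-up prices (real rates vary by region, instance type, and payment option):

```python
HOURS_PER_YEAR = 8760

def ri_saves_money(on_demand_hourly, ri_hourly, expected_utilization):
    """Compare a year of on-demand (paid only while running) against a
    one-year no-upfront RI (paid regardless).

    `expected_utilization` is the fraction of the year the instance
    actually runs; this is the number a pivot can wreck.
    """
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * expected_utilization
    ri_cost = ri_hourly * HOURS_PER_YEAR  # committed whether it runs or not
    return ri_cost < on_demand_cost

# Illustrative prices: $0.10/hr on-demand vs $0.06/hr reserved.
print(ri_saves_money(0.10, 0.06, expected_utilization=1.0))  # True: baseline load
print(ri_saves_money(0.10, 0.06, expected_utilization=0.5))  # False: risky commit
```

With those numbers the crossover sits at 60% utilization, which is why I waited until I knew which instances were genuinely always-on before buying.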

Two, maybe three months. The bill came down by about half, from somewhere in the forty-thousand-a-month range to the low twenties. The infra stopped being a mystery. The frontend engineer who was going to do this for a quarter didn't go back.

The part I don't dwell on: I wrote almost none of this down while it was happening. Every decision lived in my head, including the reasons things were the size they were. Bus factor of one. Good for the cost line, not for me. Every infra question has funnelled to me since, regardless of whether I'm on call or on holiday. Two-and-change years on, it hasn't fully unwound. If I were doing it again, I'd lose a day a week to writing things down from the first week. I didn't, and I'm the one who still pays for it.

It's fair to ask whether any of this was real expertise or just a low bar set by the previous vendor. Honest answer, mostly the latter for the first round of fixes. Anyone who'd touched AWS before would have caught the public S3 buckets and the oversized instances. The RI strategy and the rightsizing based on actual load patterns took real learning. But the 50% headline was easy because the bar was on the floor.

That's how I became the infra person. Proximity, mostly. I was the closest one to the problem, willing to learn on the job, and the company was small enough that volunteering and doing weren't really separate acts. I think small teams underrate this. Hiring for a gap is expensive: weeks to run a process, more weeks before the hire is productive, and the problem keeps costing money the whole time. Meanwhile somebody already in the room has offered to do it. Saying yes is close to free, and it says something to the rest of the team about how the place actually rewards initiative.

Thanks for reading. Questions, disagreements, or corrections are welcome.