Saurabh Ahuja started programming in C during college coding competitions, some of them overnight, while his classmates were chasing MBAs. Two decades later, he’s still solving infrastructure problems, just at a different scale.
At Amazon, he wrote the code that opens Locker doors when customers pick up packages, a system that serves millions of retrievals a year and has to account for jammed doors, ADA compliance, and making sure auto-generated pickup codes don’t accidentally spell something offensive.
At McAfee, he spent nine years building whitelisting technology that locked down bank ATMs, preventing any unauthorized software from running on machines handling real money. He’s worked across security, e-commerce, and enterprise SaaS, and he now serves as a Principal Member of Technical Staff with twenty years of experience spanning infrastructure, Kubernetes, Terraform, and multi-cloud architecture.
We talked with Saurabh about what keeps him writing code after all these years, the gap between cloud documentation and production reality, why he thinks most migration plans underestimate legacy code, and how completing a 320-mile Ultraman triathlon taught him something useful about system reliability.
You’ve spent two decades building infrastructure at companies like McAfee, Amazon, and Salesforce. What drew you to this particular corner of engineering, and what’s kept you there?
I got fascinated by computers back when I was in 10th grade. I did a summer camp on DOS and was immediately hooked. I was good at mathematics and found computers interesting, so I decided to pursue computer science engineering. I worked hard in high school to get into a good university and was admitted to a bachelor's program in computer engineering at one of the best universities in India.
I was among the best programmers in my class, and back then I wrote programs in C. I participated in programming competitions, some of them overnight. It was so much fun.
Fast forward to my final year and the couple of years after graduation: most of my friends opted for MBAs from reputed universities, because an MBA was in demand then the way AI is today. I enjoyed programming and problem solving, and I was fascinated by the idea that companies would hire us to write code, which was fun for me, and pay us for it. I found that interesting and got hired as a software engineer. I was never interested in an MBA.
Since then, I have kept enjoying problem solving and programming. I never got bored, because the technology kept evolving.
I transitioned from C to Java to Go.
The nature of the problems kept evolving too: from scaling 32-bit processes, to vertical scaling, to horizontal scaling; from small functions and modules to large system designs; from projects lasting a few weeks, to a few months, to multiyear efforts. We went from scaling systems for thousands of users, to tens of millions, to hundreds of millions, and the technology has evolved to support billions.
Many of the fundamentals are still the same; we just keep building more abstractions on top of them. Most public cloud workloads today, for example, still run on some variation of Linux servers.
As a principal engineer, most of my time goes into designing systems and mentoring junior engineers, but I still find time to code and stay hands-on. Every few weeks I get a chance to debug a difficult issue, and that's the part I enjoy most.
At Amazon, you wrote the code that opens locker doors when customers retrieve packages. Walk us through what that project taught you about building systems that have to work every single time.
Amazon Locker is an interesting project because it solves a real-life customer problem with technology. Millions of packages are delivered through Amazon Lockers every year, and customers interact with the lockers directly: they enter a code, the locker door opens, and they pick up their packages.
Instead of starting from the technology, we started from the customer experience and worked backwards. Lockers have physical doors, so what if a door is jammed and doesn't open? We added intelligence to retry opening the door several times before giving up. If it still fails to open, we monitor those lockers in real time and get them serviced, so that other customers don't run into the same problem.
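As a rough illustration of that retry-then-escalate idea, here is a minimal Go sketch; it is not Amazon's actual code, and the DoorController interface and all names are hypothetical:

```go
package locker

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical hardware interface; the real locker firmware API is not public.
type DoorController interface {
	TriggerOpen(slotID string) error // fires the latch; may fail if the door is jammed
	IsOpen(slotID string) (bool, error)
}

// OpenWithRetry retries the latch a few times before giving up and
// reporting the slot for servicing, so the next customer isn't blocked.
func OpenWithRetry(ctrl DoorController, slotID string, attempts int, wait time.Duration) error {
	for i := 0; i < attempts; i++ {
		if err := ctrl.TriggerOpen(slotID); err == nil {
			if open, _ := ctrl.IsOpen(slotID); open {
				return nil // success
			}
		}
		time.Sleep(wait) // brief pause before retrying a possibly jammed door
	}
	// All retries failed: surface the failure so monitoring can flag
	// the locker for service and reassign the package if needed.
	return fmt.Errorf("slot %s did not open after %d attempts: %w",
		slotID, attempts, errors.New("door jammed"))
}
```

If the door still won't open after the retries, the returned error is what a monitoring pipeline would pick up to flag the locker for servicing.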
We also realized customers ship multiple packages to Amazon Lockers, so we built intelligence so that one code works for multiple packages in different lockers, and we consolidate multiple packages into a single locker when we can; that, again, requires intelligent algorithms.
The technology also has to work for all kinds of customers. We made sure the lockers are ADA compliant and accessible to differently abled people, for example by placing their packages behind the lower locker doors. And the pickup codes we generate (English letters and numbers) have to be safe for customers and must never spell an offensive word.
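The offensive-word check is conceptually simple: generate a code from a restricted alphabet and reject anything that matches a blocklist. Here is a minimal Go sketch with an illustrative alphabet and an empty placeholder blocklist; the real code format and word list are Amazon's and are not shown here:

```go
package pickupcode

import (
	"crypto/rand"
	"math/big"
	"strings"
)

// An alphabet without vowels makes accidental words unlikely in the first
// place; the blocklist check is the backstop. Both are illustrative only.
const codeAlphabet = "BCDFGHJKLMNPQRSTVWXZ23456789"

var blocklist = []string{ /* offensive substrings, maintained separately */ }

// NewPickupCode returns a random n-character code that contains
// no blocklisted substring, regenerating until one passes.
func NewPickupCode(n int) (string, error) {
	for {
		var b strings.Builder
		for i := 0; i < n; i++ {
			idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(codeAlphabet))))
			if err != nil {
				return "", err
			}
			b.WriteByte(codeAlphabet[idx.Int64()])
		}
		code := b.String()
		if !containsBlocked(code) {
			return code, nil
		}
	}
}

func containsBlocked(code string) bool {
	for _, w := range blocklist {
		if strings.Contains(code, w) {
			return true
		}
	}
	return false
}
```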
These are just a few examples; to make the technology work, we had to come up with a lot of innovative ideas, implement them, and stitch them together into a pleasant customer experience.
You’ve worked across security (McAfee), e-commerce (Amazon), and enterprise software. How has moving between these domains shaped how you approach infrastructure problems?
I have worked across multiple companies and have realized the fundamentals are the same. At McAfee, we built whitelisting technology to protect systems from viruses. NCR, which builds bank ATMs, was our customer; we locked down the ATMs so that no processes, software, or code could run other than the whitelisted components required to operate the ATM.
At Amazon, we built ATM-like kiosks, except customers pick up packages instead of withdrawing money.
In enterprise software, as well, we run on Linux servers, and we have to keep them secure, run application software on them, and keep that software protected from malicious software.
Fundamentally, you need an operating system to run on the hardware, you run applications on the operating system to solve business problems, and you have to keep those applications and their data protected from malicious actors.
Some things are well known and work well: the principle of least privilege; data integrity and tamper-proofing; and confidentiality, meaning only authorized people and services can see the data.
The cloud is simply someone else's computer, with many computers connected together, so networking models come into the picture as well.
Security is not a separate job; it should be present at every step. Security by design when designing the infrastructure, threat modeling, and penetration testing have become more and more important because the stakes are high.
Building zero-trust infrastructure is important, and data should be encrypted at rest and in transit.
Multi-cloud environments have become the norm for large enterprises but managing them is notoriously complex. What are the most common mistakes you see organizations make when operating across multiple cloud providers?
The cloud is complex because it's publicly accessible, so you have to think about security models and isolation. Each cloud has a different IAM (Identity and Access Management) model.
Organizations try to build an abstraction over multiple clouds with generic interfaces, but that is not always possible. At the other extreme, having separate teams for different clouds, say one for AWS and another for GCP, when the applications are fundamentally the same, is not a good use of resources and not the most efficient way to operate.
It's important to find the right balance between generic and specific systems, and to keep the systems simple without over-engineering. One example: an S3 bucket in AWS and a GCS bucket in Google Cloud both store objects.
It's easy to create a single object-storage interface for multi-cloud, and it works for the simple cases. But take data replication: in AWS, replication to a different region has to be set up explicitly, while Google Cloud can replicate data across multiple regions depending on the location chosen for the bucket. Cost can be a factor as well when weighing AWS's explicit replication against GCS's implicit replication.
Replication features for buckets are specific to each cloud, so it may be hard to build a generic interface for them; in that case it's better to keep the functionality specific to the particular cloud.
That's why finding the right balance between generic and specific is important when adopting a multi-cloud strategy.
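One way to picture that balance is a minimal Go sketch (the interface and config structs below are illustrative, not a real library): keep a generic interface for the operations that genuinely behave the same on both clouds, and keep replication configuration cloud-specific.

```go
package storage

import "context"

// ObjectStore covers the operations that behave the same on S3 and GCS,
// so application code can stay cloud-agnostic for these.
type ObjectStore interface {
	Put(ctx context.Context, bucket, key string, data []byte) error
	Get(ctx context.Context, bucket, key string) ([]byte, error)
	Delete(ctx context.Context, bucket, key string) error
}

// Replication deliberately stays cloud-specific: AWS needs an explicit
// cross-region replication rule, while GCS replication is implied by the
// bucket's location type. These option structs are illustrative only.

type S3ReplicationConfig struct {
	SourceBucket      string
	DestinationBucket string // bucket in another region, created explicitly
	DestinationRegion string
	IAMRoleARN        string // role assumed to copy objects across regions
}

type GCSLocationConfig struct {
	LocationType string // e.g. "region", "dual-region", or "multi-region"
	Location     string // e.g. "us" for multi-region; replication is implicit
}
```

Application code depends on ObjectStore for reads and writes, while the provisioning layer that creates buckets works with the cloud-specific structs.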
Kubernetes and Terraform have become foundational tools for infrastructure teams. For engineers earlier in their careers, what should they understand about these technologies beyond the documentation?
One thing: deploying changes safely and in a backward-compatible way.
Learning the fundamentals from the documentation is important, but documentation describes the straight path, and real-life technology and implementations are messy. Businesses need to keep running and, at the same time, keep evolving to keep up with the technology. Writing logical and maintainable code is one of the skills I used to assess when conducting interviews for one of the mega tech companies.
Migrating legacy code to the latest technology is hard, and poor design choices lead to tech debt, more operational work, and lower productivity.
Here's a real-life example. Say your company has a closed ecosystem and you are building and managing a platform that executes Terraform on Kubernetes for your developers.
Different teams onboard to Terraform at different times and end up using different provider versions.
One approach is to bake the provider versions into the container image and keep updating the image whenever a new version is released, so that every provider version is available at Terraform execution time. Initially that works fine, but over time the image grows larger and larger, and Terraform runs take longer because downloading the ever-growing image takes more time. This is the path of least resistance.
The better solution is to keep provider versions in a private registry (because of the closed ecosystem), sync that private registry with the public registry, and have the Terraform execution fetch the right provider version at runtime. That keeps the container image footprint small and execution fast. Yes, it requires spinning up a private registry first and some additional work, but it's more logical and maintainable in the long term.
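To make the runtime-fetch idea concrete, here is a minimal Go sketch of a hypothetical helper such a platform might use. The registry URL layout and cache path are assumptions for illustration; a real setup would more likely lean on Terraform's own provider_installation CLI configuration with a network or filesystem mirror rather than hand-rolled downloads.

```go
package platform

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// ensureProvider makes a single provider version available before a
// Terraform run, fetching it from a private registry only when the
// shared cache doesn't already have it. URL scheme and layout are
// illustrative, not a real registry API.
func ensureProvider(cacheDir, registryBaseURL, name, version string) (string, error) {
	dest := filepath.Join(cacheDir, name, version, name+".zip")
	if _, err := os.Stat(dest); err == nil {
		return dest, nil // already cached from a previous run
	}

	if err := os.MkdirAll(filepath.Dir(dest), 0o755); err != nil {
		return "", err
	}

	// Fetch only the version this workspace pinned, instead of baking
	// every historical version into the container image.
	url := fmt.Sprintf("%s/%s/%s/%s.zip", registryBaseURL, name, version, name)
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("registry returned %s for %s", resp.Status, url)
	}

	out, err := os.Create(dest)
	if err != nil {
		return "", err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return dest, err
}
```

The point is the shape of the solution: the image stays small, and each run pulls only the provider version it actually pins.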
So my advice to early-career engineers: write logical, maintainable code, and design and architect systems for the long term. Sometimes short-term solutions are unavoidable due to business requirements, but as engineers we should always advocate for the right design and implementation instead of hacking the system for the short term.
You’ve led cloud migration efforts at enterprise scale. What separates migrations that succeed from those that stall or fail?
The first is to never underestimate legacy code. When we plan migrations, we go by principles, do a thorough analysis, and come up with the right design, but then we start the real work and get stumped: legacy code that has evolved over many years, shaped by business requirements and technical limitations, doesn't follow the straight path. It has many hacks that we need to handle when we migrate.
Second is assuming the public cloud and the latest technology have a solution for everything. When I first worked on a cloud migration back in 2018, technologies like Docker and Kubernetes were only four or five years old and still evolving, and some features simply weren't available. For example, the Docker daemon runs as root by default for various reasons; that isn't acceptable to security, so we had to mitigate the risk in some other way.
Public cloud means your software is running on someone else's computer. Even though AWS and GCP have come a long way, there are still limits to the functionality and capabilities they provide. There are soft limits and hard limits: soft limits can generally be raised by requesting a quota increase (for example, the default number of customer managed policies), but hard limits cannot be changed. We need to architect, or re-architect, with both in mind.
Third, there is no such thing as perfect design or code; it's all about tradeoffs to meet the business requirements. We all know the CAP theorem: a system can be AP (availability and partition tolerance) or CP (consistency and partition tolerance), because partitions are inevitable in distributed systems. We always have to make tradeoffs to meet the business requirements.
Outside of work, you compete in triathlons, including completing an Ultraman. How does endurance athletics influence how you think about engineering challenges?
- The power of small, consistent improvements: there is no silver bullet that achieves things overnight.
Ultraman – An Ultraman is a three-day event where you swim, bike, and run 320 miles, but getting there takes many years of small, consistent improvements that compound over time.
Engineering – Similarly, to achieve massive scale, we keep making small, consistent improvements every day: improving the security posture, improving build times, refining processes, and automating everyday tasks.
Discipline and process: show up every day. Consistency beats intensity. There will be bad days, but a bad day is not the destination. Keep going and trust the process.
- Observability is very important
Ultraman – While preparing for the Ultraman, I had a structured training plan. We logged and monitored daily and weekly mileage, heart rate, calorie intake, and wattage, and kept course-correcting by focusing on the weak areas.
Engineering – We continuously log and monitor the health of our systems. I prefer a proactive approach over a reactive one: as soon as something isn't right, we get an alert and fix the system proactively.
- Thinking long term
Ultraman – I learned that I need to follow the process, because ignoring a small nag (a blister, a missed hydration, skipping stretching after a long workout, a pacing error) leads to catastrophic failure hours or days later. I suffered a knee injury on two separate occasions because I didn't stretch after a 20-mile run and then sat at the office for eight straight hours.
Engineering – We prioritize long-term stability and clean code over short-term feature velocity because we know the race is long, and taking shortcuts creates more technical debt and pain in the future.
After 20 years in this field, what’s one piece of conventional wisdom about infrastructure that you think is wrong or outdated?
Infrastructure is the foundation layer; it's not customer-facing, it sits one layer below. The conventional wisdom is that we can design systems assuming stable infrastructure, but the truth is that systems fail all the time. In distributed systems we scale horizontally to handle load, and anything can fail: the network partitions, machines die, deployments go wrong, and dependencies fail.
Failure is inevitable; we should design systems for failure.
One example from my own experience: an entire company's applications were running on a Kubernetes cluster, we needed to upgrade the cluster, the upgrade went wrong, and the applications were impacted. These are hard problems to solve, and we should think ahead about the impact and how to recover the system when things go south.
This discussion continues on LinkedIn with industry peers and decision-makers.

