
How Nitish Mane Balances Speed, Cost, and Reliability in Modern Cloud Systems

As cloud infrastructure becomes the backbone of everything from connected vehicles to live video streaming, engineering teams are under constant pressure to move faster, spend less, and still deliver systems that never fail. The challenge is no longer choosing the right tools, but designing platforms and operating models that scale reliably under real-world constraints, regulatory requirements, and unpredictable demand. For many organizations, the hardest part is learning how to balance speed, cost, and reliability without trading one for the other.

In this exclusive interview, we spoke to Nitish Mane, a backend infrastructure and cloud platform engineer with more than a decade of experience building and operating large-scale distributed systems across AWS, GCP, Azure, and OCI. Drawing on his work, Nitish shares hard-won lessons from managing real-time video streaming platforms, global vehicle telemetry systems, and regulated government cloud environments. He explains how practices like infrastructure as code, GitOps, observability, and AI-assisted tooling help teams control cloud spend, reduce operational risk, and design systems that remain resilient under pressure. His perspective offers a practical look at what modern infrastructure engineering really looks like when reliability is non-negotiable and scale is the default.

You often describe your work as the science of making technology run faster and cheaper without sacrificing reliability. How do you define that balance in practice, and where do teams most often get it wrong?

In practice, that balance comes down to right-sizing infrastructure to real demand, automating to reduce human error, and investing in observability so issues are caught early. At Lucid Motors, this approach reduced infrastructure provisioning time from weeks to hours using Cluster API, ArgoCD, Terraform, and Atlantis.

Speed did not come at the expense of reliability because everything ran through GitOps workflows where changes were reviewed, versioned, and reversible. Cost savings came from eliminating overprovisioning and implementing efficient data archiving, resulting in more than $100,000 per month in savings.

Teams tend to get this wrong by over-engineering for scale they do not yet need, under-investing in observability, relying on manual processes that slow delivery, and treating cost optimization as a one-time effort instead of an ongoing practice. The goal is to move fast because guardrails exist, not despite them.

Many companies struggle to control cloud spend as systems scale. Based on your experience across AWS, GCP, Azure, and OCI, what are the most common sources of waste you see, and which changes tend to deliver the biggest cost reductions without hurting performance?

Across cloud providers, the same waste patterns repeat. These include overprovisioned compute, orphaned resources, inefficient storage tiering, idle non-production environments, and unnecessary data transfer across regions or availability zones.

The highest impact changes are usually straightforward. Right-sizing based on real usage metrics delivers immediate results. Automated archiving moves logs and telemetry to cheaper storage tiers. Reserved capacity and committed use discounts reduce costs for predictable workloads.

Consolidating environments through Kubernetes namespaces avoids duplicated clusters, and enforcing infrastructure as code makes all resources visible and reviewable. The most durable savings come from pairing technical improvements with process changes that make cost visibility part of everyday planning.

You have delivered significant cloud savings, including more than one hundred thousand dollars per month at Lucid Motors. What planning decisions or architectural principles made those results possible, and how early should teams be thinking about cost efficiency?

Three principles drove those results. The first was visibility. Comprehensive monitoring with the Kube Prometheus Stack, Thanos, and Grafana made resource usage measurable. The second was infrastructure as code with governance. Refactoring the infrastructure repository and implementing Terraform with Atlantis ensured that every change was reviewed before resources were created. The third was efficient data lifecycle management. Automated archiving and careful use of multi-region MongoDB Atlas allowed telemetry data to be aggregated where necessary instead of replicated everywhere.

Teams should think about cost efficiency from day one. When building OCI infrastructure in the me-jeddah-1 region for KSA government workloads, cost awareness was embedded from the start through right-sizing, proper tiering, and automated cleanup. Waiting until cloud costs become alarming usually means inefficiency is already baked into both architecture and culture.

At Lucid Motors, you architected and deployed cloud infrastructure that now connects vehicles sold across the Middle East and the EU. What unique reliability and latency challenges come with building infrastructure that directly supports connected vehicles at that scale?

Connected vehicles introduce constraints that traditional software systems rarely face. Latency is critical because vehicles require near real-time responses for diagnostic and safety-related features. Deployments were therefore localized, with OCI me-jeddah-1 serving KSA government workloads and AWS me-south-1 supporting other regions to minimize round-trip latency.

Reliability challenges included intermittent connectivity as vehicles move through low coverage areas, global data aggregation across multi-region MongoDB Atlas clusters, and strict security requirements. These were addressed through resilient ingestion pipelines, secure VPN connectivity, private Kubernetes endpoints, and disciplined on-call and postmortem practices aligned with Google SRE standards. When vehicles are involved, the cost of failure is far higher than a frustrated user.

How did working in a regulated government cloud region influence your approach to security, resilience, and operational discipline compared to more typical commercial cloud environments?

Working in a regulated government cloud required stricter security, resilience, and operational controls. Data sovereignty was non-negotiable; access controls and audit logging were significantly tighter; and container images had to be copied to OCI using custom automation. Because a managed certificate service was unavailable at the time, an internal certificate generator was built using cert-manager and Let’s Encrypt to enable secure, automated TLS in Kubernetes.

From a resilience standpoint, systems had to be more self-contained, and disaster recovery plans could not rely on easy cross-region failover. Operational discipline increased through formal change management, reversible deployments, ArgoCD-based consistency, and stricter on-call procedures. The experience reinforced that constraints often lead to better engineering.

In your current role, you manage a large-scale video streaming infrastructure. What lessons from operating real-time media pipelines have shaped how you think about observability, failure recovery, and system design?

Operating live video streaming pipelines at Philo reinforced how unforgiving real-time systems are. Degradation must be detected in seconds, not minutes, and traditional metrics alone are insufficient. Distributed tracing and MCP integrations for observability tools enable faster root-cause identification during incidents.

Failure recovery is constrained because live streams cannot be retried. Migrations to EKS and GitOps-based deployments improved rollback speed and reliability. These systems reinforced core design principles such as idempotency, backpressure handling, and clear ownership boundaries so that on-call engineers know exactly where to look during incidents.

You have worked extensively with infrastructure as code, GitOps, and modern CI/CD tooling. How do better tools change not just systems, but the way engineering teams collaborate and make decisions?

Modern tooling shifts teams from relying on tribal knowledge to making codified decisions. Infrastructure as code documents intent in pull requests, improves onboarding, and increases transparency. Self-service provisioning replaces gatekeeping, unblocks developers, and shifts ownership to teams.

GitOps enables proactive operations by making the desired state visible and detecting drift automatically. Postmortem culture reduces reliance on individual heroes. AI-assisted tooling further amplifies these effects. At Philo, tools like Claude Code, MCP servers, and internal GitLab integrations allow engineers to query infrastructure state, debug issues, and navigate CI/CD workflows using natural language. These tools do not replace expertise; they amplify it.

Looking back across your career in infrastructure, SRE, and platform engineering, what advice would you give teams that want to improve performance and reliability while keeping cloud complexity and costs under control as they grow?

Start with observability before optimization. Treat infrastructure as code as non-negotiable. Design for the majority of workloads, not edge cases. Build cost awareness into workflows by tagging, setting budgets, and providing visibility in pull requests. Invest in deployment pipelines and treat postmortems as learning opportunities. Finally, embrace constraints and use AI-assisted tooling to democratize infrastructure expertise. The future of infrastructure engineering is systems that are understandable, queryable, and improvable by the entire team.

You often describe your work as the science of making technology run faster and cheaper without sacrificing reliability. How do you define that balance in practice, and where do teams most often get it wrong?

In practice, I define this balance through three principles: right-sizing infrastructure to actual demand, automation that reduces human error, and observability that catches issues before they become outages.

At Lucid Motors, I improved infrastructure provisioning time from weeks to hours using Cluster API, ArgoCD, Terraform, and Atlantis. That’s speed. But we didn’t sacrifice reliability—we built in GitOps workflows so every change was reviewed, versioned, and reversible. Cost savings came from removing overprovisioning and implementing efficient archiving strategies, which saved over $100,000 per month.

Where teams get it wrong:

Over-engineering for scale they don’t have yet, leading to unnecessary cost

Under-investing in observability, so they can’t distinguish between real issues and noise

Manual processes that seem “safe” but introduce inconsistency and slow down delivery

Treating cost optimization as a one-time project rather than continuous practice

The key is building systems where you can move fast because you have guardrails—not despite them.

Many companies struggle to control cloud spend as systems scale. Based on your experience across AWS, GCP, Azure, and OCI, what are the most common sources of waste you see, and which changes tend to deliver the biggest cost reductions without hurting performance?

Across AWS, GCP, Azure, and OCI, I consistently see these waste patterns:

Common sources of waste:

1. Overprovisioned compute – instances sized for peak load that rarely occurs

2. Orphaned resources – load balancers, EBS volumes, snapshots nobody remembers creating

3. Inefficient storage tiering – keeping cold data in hot storage classes

4. Idle development environments – non-prod clusters running 24/7

5. Data transfer costs – cross-region or cross-AZ traffic that could be avoided with better architecture

Highest-impact changes:

Right-sizing based on actual metrics – use Kubecost or AWS CloudWatch data to pick the right instance size (a minimal sketch follows at the end of this answer).

Automated archiving strategies – moving telemetry and log data to cheaper tiers programmatically.

Reserved capacity and committed use discounts for predictable workloads

Consolidating environments using namespaces in Kubernetes rather than separate clusters

Infrastructure as Code – Terraform with Atlantis helps prevent “shadow resources” by making all infrastructure visible and reviewable

The biggest wins come from combining technical changes with process changes—making cost visibility part of how teams plan, not just an afterthought.
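
To make the right-sizing point above concrete, here is a minimal boto3 sketch that pulls two weeks of average CPU utilization for an EC2 instance from CloudWatch and flags it as a downsizing candidate. The instance ID, region, and the 20 percent threshold are illustrative assumptions, not values from any real environment.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption

def average_cpu(instance_id: str, days: int = 14) -> float:
    """Return the average CPU utilization (percent) over the last `days` days."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,          # one datapoint per hour
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0

# Hypothetical instance ID, used purely for illustration.
if average_cpu("i-0123456789abcdef0") < 20.0:
    print("Consider a smaller instance type for this workload.")
```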

You have delivered significant cloud savings, including more than $100,000 per month at Lucid Motors. What planning decisions or architectural principles made those results possible, and how early should teams be thinking about cost efficiency?

Three architectural principles drove those results:

1. Visibility first

We deployed comprehensive monitoring using the KPS Stack (Kube-Prometheus-Stack) with Thanos for scaling Prometheus and Grafana for visualization. You can’t optimize what you can’t measure. This gave us granular insight into actual resource utilization across all our environments.
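
As a rough illustration of that kind of measurement, the snippet below queries the Prometheus HTTP API exposed by a kube-prometheus-stack deployment for week-long average CPU usage per pod. The endpoint and the query window are placeholders, not the actual monitoring setup.

```python
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # placeholder endpoint

# Average CPU actually used per pod over the last 7 days.
QUERY = 'avg_over_time(rate(container_cpu_usage_seconds_total[5m])[7d:1h])'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    cores = float(result["value"][1])
    print(f'{labels.get("namespace", "?")}/{labels.get("pod", "?")}: {cores:.3f} cores')
```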

2. Infrastructure as Code with governance

By refactoring our infrastructure repository and implementing Terraform with Atlantis, every change went through review. This prevented the “just spin up another instance” mentality and made costs visible in pull requests before resources were created.

3. Efficient data lifecycle management

Vehicle telemetry generates massive data volumes. We implemented automated archiving strategies and used multi-region MongoDB Atlas intelligently—aggregating data where needed rather than replicating everything everywhere.
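
As one hedged example of what automated archiving can look like on AWS, the boto3 snippet below applies an S3 lifecycle rule that moves objects under a logs/ prefix to Glacier after 30 days and expires them after a year. The bucket name, prefix, and retention windows are assumptions for illustration, not the policy we actually ran.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and retention windows, for illustration only.
s3.put_bucket_lifecycle_configuration(
    Bucket="vehicle-telemetry-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                # Move cold data to a cheaper tier after 30 days...
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # ...and delete it entirely after a year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```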

How early should teams think about cost efficiency?

Day one. At Lucid, I established the OCI infrastructure in the me-jeddah-1 region from scratch for the KSA government cloud. Because we built cost-awareness into the architecture from the start—right-sizing, proper tiering, automated cleanup—we avoided the technical debt that makes optimization painful later.

The mistake I see is teams waiting until cloud bills become a crisis. By then, inefficiency is baked into architecture and culture.

At Lucid Motors, you architected and deployed cloud infrastructure that now connects vehicles sold across the Middle East and the EU. What unique reliability and latency challenges come with building infrastructure that directly supports connected vehicles at that scale?

Connected vehicles have zero tolerance for certain failures and unique constraints:

Latency challenges:

– Vehicles need near-real-time responses for safety-critical features—you can’t have a 500ms delay when the car is requesting diagnostic data

– We deployed in OCI me-jeddah-1 specifically to serve KSA government workloads and in AWS me-south-1 for other markets, minimizing round-trip times

– The Vehicle Telemetry Pipeline I deployed collected signal data for near-real-time analysis by hardware engineers—this required careful architecture to handle burst traffic when thousands of vehicles check in simultaneously

Reliability challenges:

Intermittent connectivity – cars go through tunnels, parking garages, and areas with poor coverage. Systems must handle disconnection gracefully and reconcile data when connectivity returns

Global distribution – we managed multi-region multi-cloud MongoDB Atlas clusters to aggregate vehicle data across regions while maintaining consistency

Security at scale – we deployed OpenVPN networks for cars to send data securely, plus Private Endpoints on Kubernetes to collect data without exposing services publicly

Operational discipline:

– Participated in regular on-call rotation and implemented postmortem culture adhering to Google SRE standards

– When vehicles are involved, every outage is scrutinized—people depend on these systems

The mindset shift is recognizing that your “users” are machines moving at highway speeds. The cost of failure isn’t a frustrated customer—it’s potentially much worse.

How did working in a regulated government cloud region influence your approach to security, resilience, and operational discipline compared to more typical commercial cloud environments?

Establishing Lucid’s OCI infrastructure in me-jeddah-1 for KSA government customers required fundamental shifts:

Security:

– Data sovereignty was non-negotiable—all data had to remain in-region

– We implemented stricter access controls and audit logging.

– Container images needed to be copied to OCI using automation I built with boto3, ensuring we controlled exactly what ran in that environment

– I also developed an internal certificate generator using cert-manager and Let’s Encrypt. This was necessary to bridge the gap since there was no AWS ACM equivalent in OCI at the time. Rather than waiting for OCI to provide a managed certificate service, we built a self-service solution that allowed teams to request and automatically renew TLS certificates within our Kubernetes clusters—maintaining security standards without manual certificate management overhead.
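
The internal generator itself is not public, but a minimal sketch of the self-service idea, using the official Kubernetes Python client to create a cert-manager Certificate backed by a Let's Encrypt ClusterIssuer, might look like this. The issuer name, namespace, and domain are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

def request_certificate(name: str, namespace: str, dns_name: str) -> None:
    """Create a cert-manager Certificate that is issued and renewed automatically."""
    cert = {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "secretName": f"{name}-tls",       # where the signed certificate is stored
            "dnsNames": [dns_name],
            "issuerRef": {
                "name": "letsencrypt-prod",    # placeholder ClusterIssuer name
                "kind": "ClusterIssuer",
            },
        },
    }
    api.create_namespaced_custom_object(
        group="cert-manager.io",
        version="v1",
        namespace=namespace,
        plural="certificates",
        body=cert,
    )

# Example request for a hypothetical internal service.
request_certificate("telemetry-api", "telemetry", "telemetry.example.internal")
```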

Resilience:

– We couldn’t assume we’d have every managed service available, so we built more self-contained systems

– Backup and disaster recovery plans had to account for the fact that we couldn’t just “fail over to another region”

Operational discipline:

– Change management became more formal—every deployment was documented and reversible

– The App of Apps framework I deployed using ArgoCD helped maintain consistency across orgs while keeping clear audit trails

– On-call procedures were more rigorous because the blast radius of mistakes in a government environment extends beyond just our company

The experience taught me that constraints breed better engineering. When you can’t take shortcuts, you build systems that are more robust by default.

In your current role, you manage large-scale video streaming infrastructure. What lessons from operating real-time media pipelines have shaped how you think about observability, failure recovery, and system design?

At Philo, I deploy and maintain video streaming pipelines for live channels using AWS Elemental MediaLive. Real-time media is unforgiving—you can’t buffer your way out of problems when content is live.

Observability:

– In streaming, you need to detect degradation in seconds, not minutes. Traditional metrics aren’t enough; you need distributed tracing to understand where latency is introduced

– I’ve leveraged MCP (Model Context Protocol) integrations for Grafana, Alertmanager, AWS CloudWatch, and Kubernetes to enable faster debugging. This allows me to query metrics, logs, and cluster state conversationally—dramatically reducing the time to identify root causes during incidents and turning observability data into actionable insights more quickly
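
As an illustration of the kind of MCP integration described above, here is a toy FastMCP server exposing a single tool that reports crash-looping pods through the Kubernetes Python client. It is a sketch of the pattern, not the integration we run in production, and the server name is made up.

```python
from fastmcp import FastMCP
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig is available locally
mcp = FastMCP("k8s-observability")  # illustrative server name

@mcp.tool()
def crashlooping_pods(namespace: str = "default") -> list[str]:
    """List pods whose containers are currently in CrashLoopBackOff."""
    v1 = client.CoreV1Api()
    broken = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            waiting = status.state.waiting
            if waiting and waiting.reason == "CrashLoopBackOff":
                broken.append(pod.metadata.name)
    return broken

if __name__ == "__main__":
    mcp.run()  # serves the tool to any MCP-capable AI client
```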

Failure recovery:

– Live streaming means you can’t retry—if a frame is dropped, it’s gone

– The migration from self-hosted KOPS to EKS and from Jenkins+ERB to ArgoCD+Helm was partly about improving our ability to recover quickly. GitOps means rollback is a single commit revert

System design principles:

Idempotency everywhere – components should handle duplicate events gracefully (a simplified sketch appears at the end of this answer)

Backpressure handling – when downstream systems slow down, upstream needs to adapt rather than queue infinitely

Clear ownership boundaries – when something fails at 2 AM, the on-call engineer needs to know exactly which component to investigate

The core lesson: observability isn’t a feature you add later—it’s foundational infrastructure that determines how fast you can identify and fix problems.
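
To make the idempotency and backpressure principles concrete, here is a deliberately simplified sketch: a consumer that drops duplicate events by ID and reads from a bounded queue, so a slow downstream blocks producers instead of letting memory grow without limit. It is a teaching example, not part of any streaming pipeline.

```python
import queue
import threading

# Bounded queue: when the downstream is slow, put() blocks, which is the
# backpressure signal to whatever is producing events.
events = queue.Queue(maxsize=1000)

processed_ids = set()
lock = threading.Lock()

def handle(event: dict) -> None:
    """Process an event at most once, even if it is delivered several times."""
    with lock:
        if event["id"] in processed_ids:  # duplicate delivery: safe to ignore
            return
        processed_ids.add(event["id"])
    # The actual side effect goes here (write to storage, emit a metric, etc.)

def worker() -> None:
    while True:
        handle(events.get())
        events.task_done()

threading.Thread(target=worker, daemon=True).start()

# Producer side: blocks when the queue is full instead of queueing without bound.
events.put({"id": "evt-123", "payload": "example"})  # hypothetical event
events.join()  # wait for the worker to drain what has been enqueued
```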

You have worked extensively with infrastructure as code, GitOps, and modern CI/CD tooling. How do better tools change not just systems, but the way engineering teams collaborate and make decisions?

The tooling evolution I’ve driven—Terraform, Atlantis, ArgoCD, Helm, GitOps workflows—changes team dynamics in several ways. More recently, I’ve been pushing the boundaries further by integrating AI-assisted tooling into infrastructure workflows.

From tribal knowledge to codified decisions:

– When infrastructure is code, decisions are documented in pull requests

– New team members can understand why something is configured a certain way by reading commit history

– At Lucid, refactoring the infrastructure repo and deploying the App of Apps framework meant anyone could understand the full deployment topology

From gatekeeping to self-service:

– I improved provisioning time from weeks to hours—that’s not just faster infrastructure, it’s developers unblocked

– When deploying an environment requires a PR rather than a ticket to another team, ownership shifts. Teams become responsible for their own infrastructure decisions

From reactive to proactive:

– At Philo, using Terraform with Atlantis across multiple environments means we can plan changes and see their impact before applying

– GitOps with ArgoCD means the desired state is always visible—drift is detected automatically

From hero culture to sustainable operations:

– Implementing postmortem culture at Lucid following Google SRE standards changed how we talked about failures

– Better tooling means less reliance on individuals who “know where the bodies are buried”

AI-augmented infrastructure operations:

– At Philo, I’ve embraced AI tools like Claude Code and Claude Desktop to accelerate infrastructure development and troubleshooting

– I’ve deployed MCP (Model Context Protocol) servers for our core infrastructure tools—Terraform, Kubernetes, Grafana, ArgoCD, Alertmanager, and AWS. This allows engineers to query infrastructure state, debug issues, and even draft configuration changes using natural language, dramatically lowering the barrier to entry for complex systems

– I built an internal GitLab MCP server using FastMCP to integrate our source control and CI/CD pipelines into this AI-assisted workflow. Now engineers can query merge request status, pipeline failures, and repository state conversationally—reducing context-switching and speeding up incident response

– These AI integrations don’t replace expertise—they amplify it. Senior engineers can move faster, and junior engineers can onboard more quickly because they can ask questions of the infrastructure itself

The shift is from infrastructure as a bottleneck to infrastructure as an enabler. When tools—including AI-assisted ones—make safe changes easy and complex systems queryable, teams make more changes, iterate faster, and build better systems. The next frontier isn’t just Infrastructure as Code—it’s Infrastructure as Conversation.

Looking back across your career in infrastructure, SRE, and platform engineering, what advice would you give teams that want to improve performance and reliability while keeping cloud complexity and costs under control as they grow?

Based on my experience across infrastructure, SRE, and platform engineering—including recent work integrating AI into operations:

Start with observability, not optimization:

You can’t improve what you can’t see. Before making changes, instrument your systems. Use modern stacks: I’ve deployed OpenTelemetry, the Grafana LGTM stack, and Thanos-scaled Prometheus. This investment pays dividends forever.
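
A minimal sketch of what instrumenting first looks like with the OpenTelemetry Python SDK is below; the service name and span names are placeholders, and a real setup would export to a collector rather than the console.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider once, at service start-up.
provider = TracerProvider(resource=Resource.create({"service.name": "telemetry-ingest"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def ingest(batch: list[dict]) -> None:
    # Each unit of work becomes a span, so latency can be attributed precisely.
    with tracer.start_as_current_span("ingest-batch") as span:
        span.set_attribute("batch.size", len(batch))
        for event in batch:
            with tracer.start_as_current_span("process-event"):
                pass  # real processing goes here

ingest([{"id": 1}, {"id": 2}])
```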

Embrace Infrastructure as Code religiously:

Every environment I’ve improved started with getting infrastructure into Terraform, version-controlled, and reviewed. Atlantis for Terraform, ArgoCD for Kubernetes. No exceptions. Shadow infrastructure is where cost and complexity hide.

Design for the 90%, optimize for the 10%:

Most workloads don’t need exotic architecture. Use managed services where they make sense, reserve complexity for problems that actually require it. The $100k/month I saved at Lucid wasn’t from clever tricks—it was from removing unnecessary complexity.

Build cost awareness into the workflow:

Don’t wait for monthly bills. Tag resources, set budgets, make costs visible in pull requests. When engineers see the cost impact of their decisions in real-time, behavior changes.
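
One hedged example of making cost ownership visible: the snippet below uses the AWS Resource Groups Tagging API through boto3 to list EC2 instances missing an owner tag, the kind of check that can run in CI or on a schedule. The tag key and resource filter are assumptions.

```python
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

def untagged_instances(required_key: str = "owner") -> list[str]:
    """Return ARNs of EC2 instances missing the required cost-attribution tag."""
    missing = []
    paginator = tagging.get_paginator("get_resources")
    for page in paginator.paginate(ResourceTypeFilters=["ec2:instance"]):
        for resource in page["ResourceTagMappingList"]:
            keys = {tag["Key"] for tag in resource.get("Tags", [])}
            if required_key not in keys:
                missing.append(resource["ResourceARN"])
    return missing

for arn in untagged_instances():
    print(f"Missing 'owner' tag: {arn}")
```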

Invest in your deployment pipeline:

The migration from Jenkins+ERB to ArgoCD+Helm at Philo wasn’t just about tooling preferences. Faster, safer deployments mean you can fix problems quickly, iterate on improvements, and maintain reliability under change.

Treat postmortems as investments:

The postmortem culture I implemented at Lucid wasn’t about blame—it was about learning. Every incident is an opportunity to make systems more resilient. Document them, share them, act on them.

Plan for constraints:

My experience in regulated environments (KSA government cloud) taught me that constraints force better engineering. Apply that mindset even when you don’t have to. Build as if resources are limited, and you’ll make better decisions by default.

Leverage AI to democratize infrastructure expertise:

This is my newest and perhaps most impactful advice. At Philo, I’ve integrated AI tools deeply into our infrastructure workflows:

Claude Code and Claude Desktop have become force multipliers for writing Terraform modules, debugging Kubernetes manifests, and drafting runbooks. What used to take hours of documentation reading now takes minutes of conversation.

MCP servers for infrastructure tools—I’ve deployed MCP integrations for Terraform, Kubernetes, Grafana, ArgoCD, Alertmanager, and AWS. During incidents, engineers can query “what pods are crashlooping in production?” or “show me the alert history for the streaming pipeline” without memorizing kubectl commands or navigating multiple dashboards. This reduces mean-time-to-detection and mean-time-to-resolution significantly.

Building internal tooling with FastMCP—I used FastMCP to build a GitLab MCP server internally, connecting our source control and CI/CD system to AI-assisted workflows. Engineers can now ask about pipeline failures, recent deployments, and merge request status in natural language. This kind of internal tooling investment pays off quickly in reduced friction and faster debugging.
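
The internal server is not public, so the following is only a rough sketch of the shape such a tool can take, pairing FastMCP with the python-gitlab client; the GitLab URL, token handling, and tool surface are all assumptions.

```python
import os

import gitlab
from fastmcp import FastMCP

mcp = FastMCP("gitlab")  # illustrative server name

# Placeholder URL; a real deployment would point at the internal GitLab instance.
gl = gitlab.Gitlab("https://gitlab.example.com", private_token=os.environ["GITLAB_TOKEN"])

@mcp.tool()
def latest_pipeline_status(project_path: str, ref: str = "main") -> str:
    """Return the status of the most recent CI pipeline for a project and ref."""
    project = gl.projects.get(project_path)
    pipelines = project.pipelines.list(ref=ref, per_page=1, get_all=False)
    return pipelines[0].status if pipelines else "no pipelines found"

if __name__ == "__main__":
    mcp.run()  # exposes the tool to any MCP-capable AI client
```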

The democratization effect—AI doesn’t replace the need for deep expertise, but it does make that expertise more accessible. Junior engineers can troubleshoot issues that previously required escalation. Senior engineers spend less time answering repetitive questions and more time on architectural improvements. The entire team levels up.

The teams that will thrive in the next decade are the ones that treat AI not as a threat to expertise but as an amplifier of it. Combine strong fundamentals—observability, IaC, GitOps, cost discipline—with AI-assisted tooling, and you get teams that can manage increasingly complex systems without proportionally increasing headcount or burnout.

The future of infrastructure engineering isn’t just about building reliable systems—it’s about building systems that are understandable, queryable, and improvable by the entire team. AI tooling is the bridge that makes that possible.
