
How Nitish Mane Balances Speed, Cost, and Reliability in Modern Cloud Systems

As cloud infrastructure becomes the backbone of everything from connected vehicles to live video streaming, engineering teams are under constant pressure to move faster, spend less, and still deliver systems that never fail. The challenge is no longer choosing the right tools, but designing platforms and operating models that scale reliably under real-world constraints, regulatory requirements, and unpredictable demand. For many organizations, the hardest part is learning how to balance speed, cost, and reliability without trading one for the other.

In this exclusive interview, we spoke to Nitish Mane, a backend infrastructure and cloud platform engineer with more than a decade of experience building and operating large-scale distributed systems across AWS, GCP, Azure, and OCI. Drawing on his work, Nitish shares hard-won lessons from managing real-time video streaming platforms, global vehicle telemetry systems, and regulated government cloud environments. He explains how practices like infrastructure as code, GitOps, observability, and AI-assisted tooling help teams control cloud spend, reduce operational risk, and design systems that remain resilient under pressure. His perspective offers a practical look at what modern infrastructure engineering really looks like when reliability is non-negotiable and scale is the default.

You often describe your work as the science of making technology run faster and cheaper without sacrificing reliability. How do you define that balance in practice, and where do teams most often get it wrong?

In practice, that balance comes down to right-sizing infrastructure to real demand, automating to reduce human error, and investing in observability so issues are caught early. At Lucid Motors, this approach reduced infrastructure provisioning time from weeks to hours using Cluster API, ArgoCD, Terraform, and Atlantis.
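
To make the right-sizing idea concrete, here is a minimal sketch of the kind of check it implies, not the actual tooling at Lucid: a script that compares the CPU workloads actually use against the CPU they request, per namespace, via the Prometheus HTTP API. The endpoint, queries, and 30 percent threshold are illustrative assumptions.

```python
import requests  # pip install requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical Prometheus endpoint

def prom_query(expr: str) -> dict:
    """Run an instant query against the Prometheus HTTP API and key results by label values."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    return {tuple(r["metric"].values()): float(r["value"][1])
            for r in resp.json()["data"]["result"]}

# CPU actually used (7-day average rate) vs. CPU requested, aggregated per namespace.
used = prom_query('sum by (namespace) (rate(container_cpu_usage_seconds_total[7d]))')
requested = prom_query('sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})')

for (namespace,), req in requested.items():
    utilization = used.get((namespace,), 0.0) / req if req else 0.0
    if utilization < 0.3:  # assumed threshold: flag namespaces using under 30% of requests
        print(f"{namespace}: {utilization:.0%} of requested CPU used -> candidate for right-sizing")
```

A report like this is only a starting point; the point of the practice he describes is that such data is reviewed continuously rather than in a one-off audit.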

Speed did not come at the expense of reliability because everything ran through GitOps workflows where changes were reviewed, versioned, and reversible. Cost savings came from eliminating overprovisioning and implementing efficient data archiving, resulting in more than $100,000 per month in savings.

Teams tend to get this wrong by over-engineering for scale they do not yet need, under-investing in observability, relying on manual processes that slow delivery, and treating cost optimization as a one-time effort instead of an ongoing practice. The goal is to move fast because guardrails exist, not despite them.

You have delivered significant cloud savings, including more than one hundred thousand dollars per month at Lucid Motors. What planning decisions or architectural principles made those results possible, and how early should teams be thinking about cost efficiency?

Three principles drove those results. The first was visibility. Comprehensive monitoring with the Kube Prometheus Stack, Thanos, and Grafana made resource usage measurable. The second was infrastructure as code with governance. Refactoring the infrastructure repository and implementing Terraform with Atlantis ensured that every change was reviewed before resources were created. The third was efficient data lifecycle management. Automated archiving and careful use of multi-region MongoDB Atlas allowed telemetry data to be aggregated where necessary instead of replicated everywhere.
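
For the data-lifecycle piece, here is a hedged sketch of what automated archiving can look like on AWS; the bucket name, prefixes, and retention windows are assumptions for illustration, not the actual Lucid configuration.

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")

# Hypothetical policy: move raw telemetry to cheaper storage classes as it ages,
# and expire it once aggregated data has replaced it.
s3.put_bucket_lifecycle_configuration(
    Bucket="vehicle-telemetry-raw",  # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-telemetry",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm -> infrequent access
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold archive
                ],
                "Expiration": {"Days": 365},  # drop raw data once aggregates exist
            }
        ]
    },
)
```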

Teams should think about cost efficiency from day one. When building OCI infrastructure in the me-jeddah-1 region for KSA government workloads, cost awareness was embedded from the start through right-sizing, proper tiering, and automated cleanup. Waiting until cloud costs become alarming usually means inefficiency is already baked into both architecture and culture.

At Lucid Motors, you architected and deployed cloud infrastructure that now connects vehicles sold across the Middle East and the EU. What unique reliability and latency challenges come with building infrastructure that directly supports connected vehicles at that scale?

Connected vehicles introduce constraints that traditional software systems rarely face. Latency is critical because vehicles require near real-time responses for diagnostic and safety-related features. Deployments were therefore localized, with OCI me-jeddah-1 serving KSA government workloads and AWS me-south-1 supporting other regions to minimize round-trip latency.

Reliability challenges included intermittent connectivity as vehicles move through low coverage areas, global data aggregation across multi-region MongoDB Atlas clusters, and strict security requirements. These were addressed through resilient ingestion pipelines, secure VPN connectivity, private Kubernetes endpoints, and disciplined on-call and postmortem practices aligned with Google SRE standards. When vehicles are involved, the cost of failure is far higher than a frustrated user.
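
As a simplified sketch of the resilient-ingestion idea, not the production pipeline, the essentials are retries with backoff on the sending side and idempotent writes on the receiving side, so messages replayed after a coverage gap never double-count. The message shape, key, and in-memory store below are invented for illustration.

```python
import time
import random

def send_with_backoff(send, payload, max_attempts=5):
    """Retry a flaky send with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... plus jitter

# Receiving side: deduplicate on a natural key so retries are harmless (idempotent upsert).
_store: dict[tuple[str, int], dict] = {}

def ingest(event: dict) -> None:
    key = (event["vehicle_id"], event["timestamp"])  # assumed natural key for telemetry
    _store[key] = event  # upsert: a replayed event overwrites itself, never duplicates
```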

How did working in a regulated government cloud region influence your approach to security, resilience, and operational discipline compared to more typical commercial cloud environments?

Working in a regulated government cloud required stricter security, resilience, and operational controls. Data sovereignty was non-negotiable; access controls and audit logging were significantly tighter; and container images had to be copied to OCI using custom automation. Because a managed certificate service was unavailable at the time, an internal certificate generator was built using cert-manager and Let’s Encrypt to enable secure, automated TLS in Kubernetes.
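
The workflow he describes is built on cert-manager; as a generic, hedged sketch rather than the internal tool itself, and assuming cert-manager and a Let's Encrypt ClusterIssuer named letsencrypt-prod are already installed, requesting an automated TLS certificate from inside Kubernetes can look like this. The namespace, DNS name, and secret name are assumptions.

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

# cert-manager Certificate custom resource; issuer and DNS name are illustrative.
certificate = {
    "apiVersion": "cert-manager.io/v1",
    "kind": "Certificate",
    "metadata": {"name": "telemetry-api-tls", "namespace": "telemetry"},
    "spec": {
        "secretName": "telemetry-api-tls",          # where the signed cert ends up
        "dnsNames": ["telemetry.example.internal"],
        "issuerRef": {"name": "letsencrypt-prod", "kind": "ClusterIssuer"},
    },
}

api.create_namespaced_custom_object(
    group="cert-manager.io",
    version="v1",
    namespace="telemetry",
    plural="certificates",
    body=certificate,
)
```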

From a resilience standpoint, systems had to be more self-contained, and disaster recovery plans could not rely on easy cross-region failover. Operational discipline increased through formal change management, reversible deployments, ArgoCD-based consistency, and stricter on-call procedures. The experience reinforced that constraints often lead to better engineering.

In your current role, you manage a large-scale video streaming infrastructure. What lessons from operating real-time media pipelines have shaped how you think about observability, failure recovery, and system design?

Operating live video streaming pipelines at Philo reinforced how unforgiving real-time systems are. Degradation must be detected in seconds, not minutes, and traditional metrics alone are insufficient. Distributed tracing and MCP integrations for observability tools enable faster root-cause identification during incidents.
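
To make the tracing point concrete, here is a minimal, hedged sketch of instrumenting one pipeline stage with OpenTelemetry. The stage names are invented, and a real deployment would export spans to a tracing backend rather than the console.

```python
from opentelemetry import trace  # pip install opentelemetry-sdk
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for illustration only.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("video.pipeline")

def transcode_segment(segment_id: str) -> None:
    # Each stage gets its own span, so a slow or failing hop is visible in seconds.
    with tracer.start_as_current_span("transcode") as span:
        span.set_attribute("segment.id", segment_id)
        with tracer.start_as_current_span("package"):
            pass  # packaging work would happen here

transcode_segment("segment-0001")
```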

Failure recovery is constrained because live streams cannot be retried. Migrations to EKS and GitOps-based deployments improved rollback speed and reliability. These systems reinforced core design principles such as idempotency, backpressure handling, and clear ownership boundaries so that on-call engineers know exactly where to look during incidents.

You have worked extensively with infrastructure as code, GitOps, and modern CI/CD tooling. How do better tools change not just systems, but the way engineering teams collaborate and make decisions?

Modern tooling shifts teams from relying on tribal knowledge to making codified decisions. Infrastructure as code documents intent in pull requests, improves onboarding, and increases transparency. Self-service provisioning replaces gatekeeping, unblocks developers, and shifts ownership to teams.

GitOps enables proactive operations by making the desired state visible and detecting drift automatically. Postmortem culture reduces reliance on individual heroes. AI-assisted tooling further amplifies these effects. At Philo, tools like Claude Code, MCP servers, and internal GitLab integrations allow engineers to query infrastructure state, debug issues, and navigate CI/CD workflows using natural language. These tools do not replace expertise; they amplify it.
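
Tools like ArgoCD detect drift natively; purely as a hedged illustration of the underlying idea, a standalone check might compare the images Git says should be running against what the cluster is actually running. The deployment names, namespaces, and image references below are invented.

```python
from kubernetes import client, config  # pip install kubernetes

# Desired state, normally read from the Git repository (values here are assumptions).
DESIRED_IMAGES = {
    ("telemetry", "ingest-api"): "registry.example.internal/ingest-api:1.4.2",
}

config.load_kube_config()
apps = client.AppsV1Api()

for (namespace, name), desired in DESIRED_IMAGES.items():
    deployment = apps.read_namespaced_deployment(name, namespace)
    live = deployment.spec.template.spec.containers[0].image
    if live != desired:
        print(f"DRIFT {namespace}/{name}: running {live}, Git says {desired}")
```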

Looking back across your career in infrastructure, SRE, and platform engineering, what advice would you give teams that want to improve performance and reliability while keeping cloud complexity and costs under control as they grow?

Start with observability before optimization. Treat infrastructure as code as non-negotiable. Design for the majority of workloads, not edge cases. Build cost awareness into workflows by tagging, setting budgets, and providing visibility in pull requests. Invest in deployment pipelines and treat postmortems as learning opportunities. Finally, embrace constraints and use AI-assisted tooling to democratize infrastructure expertise. The future of infrastructure engineering is systems that are understandable, queryable, and improvable by the entire team.
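
As one small, hedged example of building cost awareness into workflows, a scheduled job or CI step could flag resources that are missing ownership tags. The required tag keys here are an assumed policy, not one drawn from the interview.

```python
import boto3  # pip install boto3

REQUIRED_TAGS = {"team", "cost-center"}  # assumed tagging policy

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_instances")

for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            missing = REQUIRED_TAGS - tags
            if missing:
                print(f"{instance['InstanceId']} is missing tags: {sorted(missing)}")
```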

About Author
Tanya Roy
Tanya is a technology journalist with over three years of experience covering the latest trends and developments in the tech industry. She has a keen eye for spotting emerging technologies and a deep understanding of the business and cultural impact of technology. Share your article ideas and news story pitches at contact@alltechmagazine.com