With over 18 years of experience spanning enterprise applications to AI infrastructure, Rajesh Kesavalalji has built systems that power everything from warehouse operations to GPU-intensive machine learning workloads. His work has focused on solving the practical challenges of scaling AI systems, including monitoring GPU efficiency, designing event-driven architectures, and building observability platforms that turn raw telemetry into actionable insights.
In this interview, Rajesh discusses the evolution of system architecture in the AI era, the hidden challenges of GPU monitoring, and how traditional backend engineering principles apply to modern machine learning infrastructure. He shares specific strategies for balancing cost with performance, the role of event sourcing in AI systems, and what engineering teams should prioritize when integrating AI capabilities into existing platforms.
1. With 18+ years of experience spanning traditional enterprise applications to cutting-edge AI infrastructure, how has your perspective on system architecture evolved as you’ve moved into AI-focused engineering roles?
With over 18 years of experience, my perspective on system architecture has evolved from building traditional enterprise applications focused on business logic and stability to designing infrastructure that is optimized for the dynamic demands of AI workloads. In my earlier roles, the emphasis was on modularity, transaction safety, and long-term maintainability.
As I transitioned into AI-focused engineering, especially in startup environments, architectural priorities shifted toward low-cost, high-throughput data pipelines, GPU observability, distributed compute efficiency, and real-time metrics at scale. This evolution required a deeper integration of infrastructure with model execution patterns, performance tuning, and cost-aware scalability, pushing me to approach architecture not just as software design, but as a system of orchestrated compute, data, and operational intelligence tailored for AI.
2. You’ve led initiatives to reduce GPU inefficiencies through comprehensive monitoring strategies. What are the most critical performance metrics organizations should track when scaling AI workloads, and how do you balance cost optimization with computational performance?
When scaling AI workloads, the most critical GPU performance metrics to track include:
- GPU utilization (%), which measures how actively the GPU is being used and is key for identifying idle or underutilized hardware.
- Memory usage (VRAM capacity and bandwidth), which indicates whether models are memory-bound and helps right-size GPU resources.
- GPU temperature and power draw, which are useful for preventing thermal throttling and understanding energy efficiency.

Ultimately, the goal is to create a feedback loop between monitoring, scheduling, and cost modeling, allowing engineering teams to make informed trade-offs between time-to-result and infrastructure spend.
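As a rough illustration of how these metrics can be sampled, the sketch below uses NVIDIA's NVML bindings (pynvml) to read utilization, memory, temperature, and power for each GPU. The 30-second interval is an illustrative assumption, not a value from any specific production setup.

```python
# Minimal GPU metric sampler using NVIDIA's NVML bindings (pynvml).
# The sampling interval below is an illustrative assumption.
import time
import pynvml

pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()

def sample_gpu_metrics():
    """Return utilization, memory, temperature, and power for every visible GPU."""
    samples = []
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)       # .gpu / .memory in %
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)               # bytes
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts
        samples.append({
            "gpu": i,
            "util_pct": util.gpu,
            "vram_used_pct": 100.0 * mem.used / mem.total,
            "temp_c": temp,
            "power_w": power_w,
        })
    return samples

if __name__ == "__main__":
    while True:
        for s in sample_gpu_metrics():
            print(s)  # in practice, push to a metrics backend instead of printing
        time.sleep(30)
```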
3. GPU resource allocation is becoming increasingly crucial as AI workloads scale. What are the biggest challenges in monitoring GPU utilization, and how do out-of-band metrics provide insights that traditional monitoring might miss?
One of the key challenges in scaling AI workloads is ensuring continuous and effective monitoring of GPU resources. Relying solely on in-band metrics such as standard GPU utilization, thermal readings, and power consumption can provide only limited visibility into deeper inefficiencies. In our early deployments, we discovered that performance degradation was often caused by thermal and power-related throttling. However, in-band metrics alone cannot reliably detect underlying hardware issues, such as failing fans or GPUs silently reaching thermal or power limits.
To overcome this, we introduced out-of-band (OOB) GPU telemetry, which enabled us to monitor hardware health independently of the operating system and workload layers. OOB monitoring offers several advantages:
- Persistent hardware-level visibility, even during OS failures or job crashes
- Zero impact on running workloads, since telemetry is collected outside the application path
- Cross-layer correlation, allowing us to link hardware events with software behavior for deeper diagnostics
- Enhanced security and isolation, especially valuable in regulated or multi-tenant environments
By integrating OOB metrics with our observability stack, we gained a more complete and reliable view of GPU health, leading to faster root cause analysis and more efficient resource allocation.
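To make the out-of-band path concrete, the sketch below polls GPU-adjacent thermal sensors from a BMC over the Redfish API, which is one common OOB route. The BMC address, credentials, chassis path, and sensor names all vary by vendor and are placeholders here, not a specific vendor's schema.

```python
# Hypothetical sketch: reading thermal sensors out-of-band via a BMC's Redfish API,
# independent of the host OS and the workload. Endpoint path, credentials, and
# field availability differ between vendors; the values below are placeholders.
import requests

BMC_URL = "https://bmc.example.internal"        # assumption: BMC address
SENSOR_PATH = "/redfish/v1/Chassis/1/Thermal"   # assumption: vendor-specific path
AUTH = ("oob-reader", "change-me")              # assumption: read-only BMC account

def read_oob_thermal():
    """Fetch temperature readings directly from the BMC, even if the host OS is down."""
    resp = requests.get(BMC_URL + SENSOR_PATH, auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    readings = []
    for sensor in resp.json().get("Temperatures", []):
        readings.append({
            "name": sensor.get("Name"),
            "celsius": sensor.get("ReadingCelsius"),
            "critical": sensor.get("UpperThresholdCritical"),
        })
    return readings

if __name__ == "__main__":
    for r in read_oob_thermal():
        print(r)  # forward into the observability stack for cross-layer correlation
```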
4. You’ve implemented observability solutions using tools like Grafana, Mimir, and Loki. How do you design these monitoring strategies to provide actionable insights for optimizing AI infrastructure, rather than just collecting data in complex distributed systems?
In complex distributed AI systems, collecting data is only valuable if it leads to actionable outcomes. We often encounter two common challenges: either data is available from both in-band and out-of-band sources but lacks context, or it’s collected but never used effectively. To address this, we focus on building purpose-driven observability pipelines that transform raw telemetry into actionable insights and informed decisions.
We begin by instrumenting key system components such as GPU health, job performance, and scheduling latency across both in-band and out-of-band layers. Then, using tools like Grafana, Mimir, and Loki, we create dashboards tailored for specific internal teams (platform, ML, hardware ops), enabling them to explore trends and quickly identify anomalies.
The goal is not just visualization, but action enablement. We integrate alerting mechanisms to flag issues such as thermal throttling, underutilized GPUs, or workload imbalance. These insights drive concrete actions, such as node scaling, hardware replacements, and workload reshuffling. Over time, the usage of these dashboards and alerts becomes embedded in operational workflows, ensuring that monitoring continuously improves system reliability, efficiency, and responsiveness.
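As one way to feed such dashboards and alerts, the sketch below exposes GPU telemetry in Prometheus format so a Mimir-backed Grafana stack can store, visualize, and alert on it. The metric names, scrape port, and GPU count are illustrative assumptions, and the read function is a stub standing in for a real NVML or out-of-band source.

```python
# Minimal sketch of a Prometheus exporter for GPU telemetry that Grafana/Mimir
# can scrape, dashboard, and alert on. Metric names, port, and GPU count are assumptions.
import random
import time
from prometheus_client import Gauge, start_http_server

gpu_util = Gauge("gpu_utilization_percent", "GPU core utilization", ["gpu"])
gpu_temp = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu"])

def read_gpu(gpu_id: int):
    """Placeholder for a real NVML or out-of-band read; returns (util %, temp C)."""
    return random.uniform(0, 100), random.uniform(40, 85)

if __name__ == "__main__":
    start_http_server(9400)        # assumption: port scraped by the Prometheus/Mimir agent
    while True:
        for gpu_id in range(8):    # assumption: 8 GPUs per node
            util, temp = read_gpu(gpu_id)
            gpu_util.labels(gpu=str(gpu_id)).set(util)
            gpu_temp.labels(gpu=str(gpu_id)).set(temp)
        time.sleep(15)
```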
5. Having worked with event sourcing applications for enterprise integration, how do event-driven architectures support the data flow requirements of modern AI systems, and what data integrity principles are essential for reliable AI infrastructure?
Event-driven architectures are a natural fit for modern AI systems because they allow data to flow asynchronously and reactively between loosely coupled components (data ingestion, feature stores, training pipelines, and inference services). By capturing every state change as an immutable event, we create durable audit trails that support reproducibility, time-travel debugging, and incremental retraining, all of which are critical for reliable AI. To maintain data integrity across this flow, I rely on principles such as idempotency (so events can be replayed safely), schema versioning (to support model evolution without breaking producers/consumers), strict ordering/causality guarantees (especially for temporal features), and exactly-once processing semantics to ensure that downstream AI pipelines operate on consistent, trustworthy data.
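A minimal sketch of the idempotency principle is shown below, assuming events carry a producer-stamped unique ID and using an in-memory set where production systems would use a durable processed-event store. The event shape and helper names are illustrative, not a specific system's schema.

```python
# Sketch of an idempotent event handler: each event carries a unique ID, so replayed
# deliveries are safely ignored. The event shape and storage are assumptions.
import json

processed_ids = set()   # in production: a durable store, not process memory

def handle_event(raw: bytes) -> None:
    """Apply an event exactly once from the consumer's point of view."""
    event = json.loads(raw)
    event_id = event["event_id"]          # assumption: producers stamp a unique ID
    if event_id in processed_ids:
        return                            # replayed event: no side effects
    apply_state_change(event)             # e.g., update a feature store row
    processed_ids.add(event_id)           # record only after a successful apply

def apply_state_change(event: dict) -> None:
    # Placeholder for the real downstream write (feature store, registry, etc.).
    print(f"applied {event['event_type']} for entity {event.get('entity_id')}")

# Example replay: the second delivery of the same event is a no-op.
msg = json.dumps({"event_id": "e-1", "event_type": "feature_updated", "entity_id": "u-42"}).encode()
handle_event(msg)
handle_event(msg)
```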
6. You’ve led migrations from monolithic architectures to microservices. How do you approach designing cloud-native AI systems that can handle both batch processing and real-time inference workloads?
To design cloud-native AI systems that support both batch processing and real-time inference, I separate the architecture into decoupled microservices with clear contracts but optimize each path differently. Batch workloads (e.g., training, large ETL jobs) are routed through asynchronous, queue-based pipelines with elastic compute backing (Spot instances, distributed schedulers), where throughput and fault-tolerance matter more than latency. Real-time inference traffic is served via lightweight stateless microservices fronted by autoscaling GPU/CPU pools and low-latency caches, where latency SLOs drive resource allocation. Shared observability, feature stores, and model registries ensure consistency across both paths, while concepts like idempotent event handling, versioned artifacts, and circuit-breaker patterns preserve reliability as the system scales.
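For the real-time path specifically, the circuit-breaker pattern mentioned above can be sketched in a few lines: after repeated downstream failures the breaker opens and fails fast rather than letting latency-sensitive requests pile up. The failure threshold and cooldown below are illustrative assumptions.

```python
# Minimal circuit breaker for a real-time inference path. Thresholds and the
# cooldown period are illustrative assumptions, not production values.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow a trial request
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Usage: wrap the call to a downstream model server (hypothetical function).
# breaker = CircuitBreaker()
# prediction = breaker.call(model_server.predict, features)
```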
7. With your extensive experience in ETL pipeline frameworks using Spark, Cassandra, and Kafka, how do these data engineering principles evolve when supporting AI and machine learning workflows versus traditional business intelligence?
Traditional ETL for BI focuses on periodic extraction, cleaning, and aggregation of structured data for reporting. When supporting AI/ML workflows, those same pipelines must become real-time, feature-aware, and model-centric. Technologies like Spark, Kafka, and Cassandra are still used, but their purpose evolves from aggregating facts after the fact to delivering high-throughput, low-latency streaming data that fuels model training and inference.
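To illustrate that shift, the sketch below uses PySpark Structured Streaming to read events from Kafka and maintain a windowed per-user feature, rather than aggregating facts in a nightly batch. The broker address, topic name, payload fields, and console sink are illustrative assumptions; a real pipeline would write to a feature store or Cassandra.

```python
# Sketch of a streaming, feature-aware pipeline: PySpark Structured Streaming reads
# events from Kafka and maintains a 10-minute rolling count per user.
# Broker, topic, payload fields, and sink are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-features").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumption
    .option("subscribe", "user-events")                # assumption: topic name
    .load()
)

# Parse the Kafka value payload and compute a windowed event count per user.
parsed = events.selectExpr("CAST(value AS STRING) AS json", "timestamp")
features = (
    parsed.withColumn("user_id", F.get_json_object("json", "$.user_id"))
    .withWatermark("timestamp", "15 minutes")
    .groupBy(F.window("timestamp", "10 minutes"), "user_id")
    .count()
)

query = (
    features.writeStream.outputMode("update")
    .format("console")   # in practice: a feature store / Cassandra sink
    .start()
)
query.awaitTermination()
```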
8. What are the key differences between monitoring traditional enterprise applications versus AI/ML workloads, and how should engineering teams adapt their observability strategies for AI infrastructure?
Traditional enterprise monitoring focuses on service uptime, CPU/DB health, and transaction errors, whereas AI/ML workloads require visibility into data freshness, model performance, and GPU-driven compute efficiency. In AI infrastructure, observability must cover both the infrastructure layer (GPU utilization, memory bandwidth, data pipeline latency) and the model layer (drift, inference latency, training convergence).
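A simple example of a model-layer check that traditional monitoring would miss is a drift test comparing live feature values against the training baseline; one minimal sketch using a two-sample Kolmogorov-Smirnov test is shown below, with the threshold and simulated data as illustrative assumptions.

```python
# Sketch of a model-layer drift check: compare the live feature distribution against
# the training baseline. The p-value threshold and data are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def check_drift(baseline: np.ndarray, live: np.ndarray, p_threshold: float = 0.01) -> bool:
    """Return True if the live distribution has drifted from the training baseline."""
    statistic, p_value = ks_2samp(baseline, live)
    return p_value < p_threshold

# Example: a simulated shift in a single numeric feature.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=10_000)   # training-time distribution
live = rng.normal(loc=0.4, scale=1.0, size=5_000)        # recent inference traffic
print("drift detected:", check_drift(baseline, live))
```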
9. For organizations looking to optimize their AI infrastructure spending while maintaining performance, what metrics and strategies would you recommend for making data-driven decisions about resource allocation?
To make smart, data-driven decisions about AI infrastructure spend without sacrificing performance, I recommend tracking metrics across three dimensions: GPU efficiency and utilization, cost-to-performance ratio, and elasticity and scheduling efficiency. Each dimension should then be tied to a clear resource-allocation strategy, so that the numbers directly inform decisions such as right-sizing, scheduling, and scaling.
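As a simple illustration of the cost-to-performance dimension, the sketch below compares the raw cost per 1,000 served inferences with the share of the hourly bill spent on idle capacity. All prices, throughput figures, and utilization numbers are illustrative assumptions.

```python
# Sketch of two cost metrics used to compare allocation options. All numbers below
# (prices, throughput, utilization) are illustrative assumptions.
def cost_per_1k_inferences(gpu_hourly_usd: float, inferences_per_hour: float) -> float:
    """Raw cost per 1,000 served inferences for one GPU."""
    return 1000.0 * gpu_hourly_usd / inferences_per_hour

def idle_spend_per_hour(gpu_hourly_usd: float, avg_utilization: float) -> float:
    """Portion of the hourly GPU bill spent on idle capacity."""
    return gpu_hourly_usd * (1.0 - avg_utilization)

# Example: a large GPU running half-idle vs. a smaller GPU kept busy.
print(cost_per_1k_inferences(gpu_hourly_usd=4.00, inferences_per_hour=90_000))  # ~$0.044
print(idle_spend_per_hour(gpu_hourly_usd=4.00, avg_utilization=0.35))           # $2.60/hr idle
print(cost_per_1k_inferences(gpu_hourly_usd=1.20, inferences_per_hour=30_000))  # $0.040
print(idle_spend_per_hour(gpu_hourly_usd=1.20, avg_utilization=0.85))           # $0.18/hr idle
```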
10. Looking ahead, how do you see the role of traditional backend engineers evolving as AI infrastructure becomes a core competency, and what advice would you give to teams beginning to integrate AI capabilities into their existing systems?
As AI becomes core infrastructure, traditional backend engineers are increasingly expected to understand not just services and APIs, but also data flows, GPU scheduling, and model lifecycle mechanics. Their role evolves from building CRUD systems to designing platforms where data pipelines, feature stores, and inference services behave like first-class production workloads. For teams beginning this transition, my advice is: start by treating AI pipelines like products, with clear SLAs, versioning, and observability; build decoupled interfaces between data, training, and serving layers; and invest early in reusable platform abstractions (e.g., feature stores, model registries, inference gateways) so that ML can plug into existing systems without creating brittle point-to-point integrations.
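As a small example of the kind of reusable abstraction mentioned above, the sketch below defines a minimal model-registry interface that existing services can code against instead of wiring point-to-point integrations. The method names and fields are illustrative assumptions; production backends might be MLflow, a database, or an internal service.

```python
# Sketch of a reusable platform abstraction: a minimal model-registry interface.
# Method names and dataclass fields are illustrative assumptions.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ModelVersion:
    name: str
    version: str
    artifact_uri: str

class ModelRegistry(Protocol):
    def register(self, model: ModelVersion) -> None: ...
    def latest(self, name: str) -> ModelVersion: ...

class InMemoryRegistry:
    """Toy implementation; a real backend could be MLflow, a database, or a service."""
    def __init__(self) -> None:
        self._models: dict[str, list[ModelVersion]] = {}

    def register(self, model: ModelVersion) -> None:
        self._models.setdefault(model.name, []).append(model)

    def latest(self, name: str) -> ModelVersion:
        return self._models[name][-1]

registry: ModelRegistry = InMemoryRegistry()
registry.register(ModelVersion("churn", "1.0.0", "s3://models/churn/1.0.0"))
print(registry.latest("churn"))
```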
