Amit Kumar Ray has spent nearly two decades architecting network security systems at companies where downtime costs millions and vulnerabilities can compromise entire customer infrastructures.
As Sr. Principal/Architect Engineer at Palo Alto Networks, he designs security products that protect critical systems around the clock. His work requires constant attention to three principles: resilience, security, and simplicity. Before Palo Alto Networks, Amit built massively scaled systems at Amazon and Cisco. He’s seen firsthand how distributed systems fail and what it takes to prevent those failures.
Now he’s focused on a newer challenge: integrating machine learning into security infrastructure without sacrificing the stability and deterministic controls that enterprises depend on. We spoke with Amit about transitioning from traditional network architectures to cloud-native systems, the evolution of AI in cybersecurity, and practical strategies for building infrastructure that scales across AWS and GCP. He shares specific approaches to ML integration, real-time threat detection, and designing systems that assume failure will happen and prepare for it.
What key lessons have you learned from architecting systems at scale across different organizations like Palo Alto Networks, Amazon, and Cisco?
One of the key lessons I have learnt from designing systems at scale is that failure is not just a possibility; it is inevitable. Companies like Palo Alto Networks, Amazon, and Cisco may build systems that serve different purposes or address specific segments of the market, but they all follow the same foundational principles: resilience, security, and simplicity.
At Amazon and Cisco, I built systems that scale massively. A few hours of downtime can mean millions of dollars in losses. At that scale, we have to expect that failure can occur at any point, so the focus is on identifying all the possible failure points and building in resiliency. At Palo Alto Networks, the system design adopts a zero-trust architecture at its core. Security in the product is not an afterthought but rather built into every phase of the software design life cycle. Because customers rely on Palo Alto Networks' security products to protect their infrastructure around the clock, any failure in the system may leave their infrastructure vulnerable to various threats. Hence, along with security, I also focus on resilient system design.
The most important lessons I have learnt working in these companies are to stay customer-obsessed and to design systems that are simple yet powerful enough to solve customers' problems. A layer of monitoring, alerting, and a responsive feedback loop must be built into every system. This lays the foundation for improving the system over time and making it more resilient, secure, and horizontally scalable.
How do you approach integrating machine learning capabilities into network security and infrastructure systems while maintaining performance and reliability?
We take a very cautious approach when integrating machine learning algorithms into our network security and infrastructure systems. Massive customer infrastructure relies on our network security and control-plane infrastructure, so we must have deterministic guardrails to enforce bounds and fallback paths. The following are the high-level steps of the ML integration process:
- Identify a specific outcome – What are we trying to achieve with the ML model? For example, anomaly detection, threat prioritization, etc. Start with a small segment of the network security and infrastructure system, then build on that gradually.
- Input data – What data is shared with the machine learning model? Don't share PII (Personally Identifiable Information) such as email or physical addresses.
- Measure model performance – Use precision, recall, F1, ROC-AUC/PR-AUC for classification, or MAE/RMSE for regression, with special attention to class imbalance and calibration (e.g., reliability curves, Brier score) to measure the model performance.
- Deploy in shadow mode – Initially, deploy the model in out-of-band mode so that the model's outcome does not impact traffic flow. Use it only to enrich events until it is stable enough.
- Explainability – To gain confidence, understand how the ML system makes decisions. Use tools like SHAP or LIME to explain individual predictions, and log those explanations.
- Determine guardrails – Keep the enforcement engine deterministic and centralized unless the ML integration is mature enough.
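The steps above can be sketched in miniature. The snippet below is an illustrative shadow-mode sketch, assuming a hypothetical `rule_engine` and `ml_model` (neither is a real Palo Alto Networks API): the deterministic rule engine makes the enforcement decision, while the model's score is only logged for enrichment and never changes the verdict.

```python
# Shadow-mode sketch: deterministic rules enforce; the ML score only enriches.
# `rule_engine` and `ml_model` are illustrative stand-ins, not real APIs.

def rule_engine(event):
    # Deterministic guardrail: block known-risky ports, allow everything else.
    return "block" if event.get("dst_port") in {23, 445} else "allow"

def ml_model(event):
    # Placeholder anomaly score in [0, 1]; a trained model would go here.
    return 0.9 if event.get("bytes_out", 0) > 1_000_000 else 0.1

def process(event, shadow_log):
    decision = rule_engine(event)        # enforcement stays deterministic
    score = ml_model(event)              # out-of-band: enrichment only
    shadow_log.append({**event, "decision": decision, "ml_score": score})
    return decision                      # the ML score never changes this

log = []
assert process({"dst_port": 23}, log) == "block"
assert process({"dst_port": 443, "bytes_out": 2_000_000}, log) == "allow"
```

The point of the structure is that promoting the model later only means changing `process`, not the guardrails.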
What are the biggest challenges organizations face when transitioning from traditional network architectures to cloud-native, distributed systems, and how can they overcome them?
The transition from traditional networks to cloud-native architecture is not only a technical change but also a cultural and operational shift. The following are some of the challenges faced in the process:
- Architectural mindset: In a traditional network landscape, the focus is often limited to solving business problems within the confines of fixed parameters and tightly coupled systems. But in a cloud-native distributed architecture, those assumptions don’t hold anymore.
- Failure handling: Cloud-native distributed systems are built on the principle that failure can occur in any part of the network at any time. Cascading failures, partial outages, and degraded systems are common phenomena in distributed systems. Organizations often face challenges migrating traditional network services to cloud-native distributed frameworks because those services were simply not designed for this.
- Zero trust: In a traditional network architecture, applications often run within the perimeter of a trusted boundary. In a cloud native distributed system, that perimeter dissolves, and hence a zero-trust system design is a must requirement, not an afterthought.
- Resiliency: Resiliency is a critical component of a cloud-native network architecture and must be embedded across networking, compute, and data. Multi-AZ or multi-region designs are very common. Designing a network system with a disaster recovery framework is a necessity for modern system architecture.
- Robust pipelines: Unlike traditional networks, cloud-native architectures require robust pipelines that deploy changes to infrastructure, applications, and models consistently and repeatably. Canary deployments, continuous integration and deployment, and automated rollbacks are basic needs of a distributed cloud-native network solution.
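Failure handling of the kind described above usually starts with retries and fallbacks. Below is a minimal sketch, assuming a generic flaky call, of jittered exponential backoff with a deterministic fallback value; real systems would add circuit breakers, timeouts, and idempotency checks on top of this.

```python
import random
import time

def with_retries(call, attempts=4, base_delay=0.1, fallback=None):
    """Retry a flaky call with jittered exponential backoff, then fall back."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                return fallback  # degrade gracefully instead of cascading
            # Full jitter: sleep between 0 and base_delay * 2^attempt.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# A call that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

assert with_retries(flaky) == "ok"
```

The fallback return is what keeps a partial outage from becoming a cascading failure: callers get a degraded but deterministic answer.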
In your experience, how has the role of AI and machine learning evolved in network security, and what practical applications are delivering the most value today?
In my experience, the role of AI and machine learning in network security has evolved from "AI-curious" to "AI-capable". Initially, AI and machine learning frameworks were deployed mostly out-of-band and were not involved in the data path for decision-making. That has changed over the last few years. Machine learning algorithms are now used in integrated decision-support systems that sometimes override traditional controls. The role of explainable AI (XAI) has become crucial in network security architectures: it helps convert opaque ML scores into trusted, actionable decisions.
Earlier adoption of machine learning and AI was mostly in pure anomaly detection. Now it works much closer to operations: analyzing east-west traffic, identifying unusual network behavior, prioritizing serious alerts, and spotting automated attacks. In IDS/IPS and SIEM pipelines, traditional rule-based correlation does not scale well against the huge volumes of logs from networks, endpoints, and other sources. ML algorithms help group related events, detect behavioral anomalies across multiple data sources, and rank them by risk.
What strategies do you recommend for building scalable architectures in multi-cloud environments (AWS, GCP) that can handle unpredictable traffic patterns and security threats?
One of the advantages of cloud-native architectures is the availability of features that help organizations build scalable systems. It is just a matter of understanding them and using them where required. The following are some strategies that can help with unpredictable traffic patterns and security threats:
- Stateless services: Deploy services as stateless. This makes horizontal scaling and failover easier.
- Autoscale for infra: Use data layers that are horizontally scalable. Most managed database services provide autoscaling replicas. This is one of the best strategies for scaling only when unpredictable traffic actually arrives.
- Autoscale for services: Use load balancers and compute autoscalers (such as the Kubernetes Horizontal Pod Autoscaler).
- Resiliency: Use active-active architectures wherever possible in the stack. Most cloud services come with ready-made disaster recovery features that provide zonal and regional redundancy.
- Zero trust: Design with the least-privilege principle. Authenticate service-to-service calls with mTLS. A service mesh like Istio provides these capabilities out of the box.
- Use cache: Utilize caching heavily in the system. CDNs, edge caching, and in-memory caching are all good options to integrate into the system.
- Avoid cross-cloud: Minimize synchronous calls over the internet and across clouds. Plan for eventual or strong consistency depending on product requirements and design.
- Restricted database access: Keep database access restricted with table-level and even column-level RBAC. Don't expose PII.
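As one concrete example of the caching strategy above, here is a minimal in-memory TTL cache sketch with lazy eviction. The key name is hypothetical, and a production system would typically reach for Redis, a CDN, or edge caching rather than a toy class like this.

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry (illustrative sketch)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if time.monotonic() >= expires:  # lazy eviction on read
            del self._store[key]
            return default
        return value

cache = TTLCache(ttl_seconds=0.05)
cache.set("geoip:10.0.0.1", "us-west")   # hypothetical enrichment lookup
assert cache.get("geoip:10.0.0.1") == "us-west"
time.sleep(0.06)
assert cache.get("geoip:10.0.0.1") is None  # expired
```

The TTL bounds staleness, which matters for security metadata such as threat-intel or GeoIP lookups that change over time.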
How do you balance the need for innovation in AI-driven systems with the critical requirement for stability and security in enterprise networking?
Balancing AI-driven innovation with the stability and security demands of enterprise networking can be achieved through controlled adoption. I always treat AI systems as a smart companion rather than a tool that replaces proven network and security controls. This fundamental strategy lets us adopt AI and ML models for enrichment and for augmenting capability. We should not open up the entire system to AI/ML algorithms; rather, use them in specific network segments. Deploy them in shadow mode and observe the insights. Use explainable AI tools like SHAP or LIME to understand model output and measure its accuracy. Strong observability and explainability boost confidence and build trust.
What are the most important considerations when designing distributed systems that need to process and analyze data in real-time for security applications?
- Define SLA: First, identify the end-to-end SLA target of the system. This typically defines the time budget across data ingestion → detection → action.
- Implement a backpressure model: Distributed system resources, and hence processing capacity, are always limited. Design systems that apply backpressure (e.g., throttling, load shedding) to slow down traffic flow so that spikes and attacks don't collapse the entire system.
- Use streaming: Technologies like Kafka, Google Pub/Sub, and Amazon Kinesis scale massively. Always choose managed cloud-native solutions if it is a cloud architecture.
- Enrich once and reuse: We often see applications in a distributed system build event context multiple times in a pipeline, which degrades performance. Streamline the pipeline so that context is built once at the entry point and carried over to the downstream applications.
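The backpressure principle above can be illustrated with a classic token bucket, which admits bursts up to a cap and then throttles to a sustained rate. This is a sketch with invented parameters, not a production rate limiter; real deployments would shed or queue the rejected events.

```python
import time

class TokenBucket:
    """Token-bucket throttle: sustained rate with bounded bursts (sketch)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller sheds or queues the event

# A burst of 100 events against a bucket allowing 10/sec with burst of 5:
bucket = TokenBucket(rate_per_sec=10, burst=5)
admitted = sum(bucket.allow() for _ in range(100))
assert 5 <= admitted <= 10  # the burst is absorbed, the flood is shed
```

Rejected events can be dropped (load shedding) or pushed to a durable queue, keeping the spike from collapsing the downstream pipeline.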
Looking ahead, what emerging technologies or trends do you see reshaping the intersection of AI, networking, and cybersecurity, and how should enterprise architects prepare?
AI is maturing at a rapid pace, and as a result, it is evolving into an intelligent companion across various technology stacks. MCP server capabilities are being built into applications and services, enabling AI chat clients to use application APIs to answer various questions. In cybersecurity, AI is used to recognize threat patterns and make predictions in real time.
At the same time, enterprise architects need to be careful about granting AI applications access to data. Attackers are using AI to scale their attacks, and new AI-generated attack patterns are forcing defenders to raise the bar. Hence, zero-trust architectures are the new default, not an afterthought. Enterprise architects should also integrate policy-as-code, automation, and standardized observability across networks and clouds.
How do you approach database management and data architecture decisions for AI/ML workloads in production environments?
Data architecture and database management are tightly coupled with the application or the services offered. One database does not work well for all use cases, so data architecture requires understanding how the data will be used, and the database should be chosen based on its performance characteristics for that workload. Large files and massive log volumes should be stored raw in the filesystem; they are typically picked up by an analytics engine whose output is stored in a data warehouse. Data that is referenced often should be stored in a cache. Time-series data is a core part of many AI/ML systems, especially in cybersecurity, and should be stored in a purpose-built time-series database such as Prometheus or InfluxDB.
Some applications, like anomaly detection or security analysis, rely on embeddings. Vector databases should be used as a specialized layer here: they enable similarity search, which is crucial for any AI-driven application.
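As a toy illustration of what a vector database provides, here is a brute-force cosine-similarity search over invented 3-dimensional event embeddings. Real systems use much higher-dimensional embeddings and approximate indexes (e.g., HNSW) instead of a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, embeddings, k=1):
    """Brute-force nearest neighbors by cosine similarity: the operation a
    vector database performs at scale with approximate indexes."""
    ranked = sorted(embeddings.items(),
                    key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy 3-d embeddings of security events (invented values).
events = {
    "port_scan_a": [0.9, 0.1, 0.0],
    "port_scan_b": [0.8, 0.2, 0.1],
    "dns_tunnel":  [0.0, 0.1, 0.9],
}
assert top_k([0.85, 0.15, 0.05], events, k=2) == ["port_scan_a", "port_scan_b"]
```

A new event embedded near the two port-scan vectors retrieves both, which is how similarity search surfaces related incidents for analysts.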
What best practices would you recommend for organizations looking to implement AI-powered threat detection and response systems at scale?
- Data quality: AI-powered threat detection is effective only if it is trained on high-quality data. Standardized telemetry across data sources significantly boosts model performance.
- Zero trust: Don’t trust by default. Implement AI-powered threat detection that requires continuous verification of identity.
- Build trust over time: Deploy AI capabilities in the threat detection system in shadow mode and gain confidence over time. Use them only to enrich events until they are stable enough.
- Use streaming: Technologies like Kafka, Google Pub/Sub, and Amazon Kinesis scale massively. Always choose managed cloud-native solutions if it is a cloud architecture.
- Measure model performance: Regularly measure model performance and drift. Use explainable AI to get insight into why something was flagged. This helps build confidence overall.
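Measuring model performance in production can start with something as simple as batch precision and recall over labeled verdicts, tracked over time to spot drift. The sketch below uses invented predictions and labels, with 1 meaning "threat".

```python
def precision_recall(predictions, labels):
    """Precision and recall for binary threat verdicts (pure-Python sketch).

    Precision: of the events flagged, how many were real threats.
    Recall: of the real threats, how many were flagged.
    """
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented batch: the model flagged 4 events; 3 were real, and it missed 1.
preds  = [1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
assert precision_recall(preds, labels) == (0.75, 0.75)
```

Plotting these numbers per batch over time is a cheap first-order drift signal: a falling precision or recall on fresh labels suggests the model or the traffic has shifted.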

