Christophe Ploujoux, Co-founder and CTO of Blaxel, is setting a new benchmark for how AI agents will run at scale. Backed by a $7.3 million seed round led by First Round Capital, Blaxel is building a purpose-built cloud that can spin up secure, sandboxed microVMs in under 25 milliseconds and orchestrate workloads across the globe.
Drawing on his deep expertise in distributed systems and large-scale data pipelines, Ploujoux has led his team in developing a platform architecture that blends speed, resilience, and modularity for a new class of high-frequency, ephemeral workloads.
In this interview, he reveals the strategies, design patterns, and infrastructure choices powering Blaxel’s vision for billions of AI agents.
How have your experiences at OVH and ForePaaS shaped your strategy for building microservices architectures that support LLM APIs, sandbox environments, MCP gateways, observability, and batch jobs at Blaxel?
First, at ForePaaS, my work on our multi-cloud platform gave me the core vision. Operating across the major clouds made me realize that a new architectural approach was needed to unlock the full potential of AI agents. It became clear that a platform purpose-built for high-frequency, ephemeral, and dynamic workloads could create a step-change in performance and efficiency for this new class of applications.
Then, at OVH, I acquired the technical capability to realize that mission. My work was focused on building hyper-efficient, reliable infrastructure at a massive scale, learning to optimize everything from the hardware up. This hands-on experience with bare-metal provisioning and automation is precisely how we can engineer extreme performance, like our sub-25ms sandbox boot times, at Blaxel.
Blaxel’s strategy directly combines these experiences. We are using the deep infrastructure expertise from OVH to build the new, purpose-built cloud for AI agents that my time at ForePaaS inspired.
Blaxel supports a plug-and-play model for serverless agents across various frameworks—can you walk us through how this architecture enables rapid deployments and scalability?
Of course. Our architecture is built to deliver both speed for developers and massive scale for agentic workloads.
For rapid deployments, we provide total abstraction. Developers focus on agent logic, not infrastructure. They bring their code, regardless of the framework, and our platform handles the complex provisioning and integration work automatically. This reduces deployment time from days to minutes.
For scalability, our architecture is serverless and globally distributed. We spin up fresh, isolated sandboxes for each agent task, enabling massive parallel processing. This is a true pay-as-you-go model, as you only pay for the milliseconds of compute you use. This is amplified by our global footprint; deploying worldwide provides low latency and a vast, resilient capacity pool to handle unpredictable workloads at any scale.
The “plug-and-play” aspect is the glue. Using standards like the Model Context Protocol (MCP) or Model Gateway, we ensure any agent can connect seamlessly and leverage the full speed, scale, and efficiency of our purpose-built, global platform.
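For readers unfamiliar with MCP, the minimal sketch below shows how a tool is exposed as an independent, plug-and-play component using the open-source MCP Python SDK. It is illustrative only, not Blaxel’s internal code; the server name and tool are placeholders.

```python
# Minimal MCP tool server sketch (illustrative; not Blaxel's implementation).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")  # hypothetical server name


@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())


if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

Because the tool is described by the protocol rather than hard-wired into the agent, any MCP-compatible agent can discover and call it without code changes on the agent side.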
What strategies and patterns have you implemented to manage latency, service versioning, and observability across distributed microservices in production?
We have a deliberate strategy for each of these core operational pillars.
For latency: Our approach is two-pronged. Globally, our distributed infrastructure runs agent sandboxes in the region closest to the user to cut down round-trip time. Locally, within each region, we co-locate key components like the agent sandbox and MCP gateways. This minimizes internal network hops, making communication within the agent’s environment incredibly fast.
For versioning: Our strategy here is focused on our users. We provide a robust revision system for agents and MCP servers. When a developer deploys an agent, we create a new, immutable revision. This allows them to manage their deployments safely, enabling patterns like A/B testing. Most importantly, if a new revision is faulty, they can perform an instant rollback to a previous, stable revision with zero downtime. This same principle applies to their MCP servers.
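To make the revision model concrete, here is a small, hypothetical sketch (not Blaxel’s actual API): deploys append immutable revisions, and rollback simply repoints the active revision, so nothing is rebuilt or redeployed.

```python
# Hypothetical revision store: deploys append immutable revisions,
# rollback just repoints routing to an older one (zero downtime).
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Revision:
    """An immutable snapshot of a deployed agent."""
    number: int
    image: str  # e.g. a sandbox/container image reference


@dataclass
class AgentDeployment:
    name: str
    revisions: List[Revision] = field(default_factory=list)
    active: int = 0  # index of the revision currently serving traffic

    def deploy(self, image: str) -> Revision:
        rev = Revision(number=len(self.revisions) + 1, image=image)
        self.revisions.append(rev)
        self.active = len(self.revisions) - 1  # new revision takes traffic
        return rev

    def rollback(self, to_number: int) -> Revision:
        # No rebuild or restart: routing simply points at an older revision.
        self.active = to_number - 1
        return self.revisions[self.active]


agent = AgentDeployment("support-bot")
agent.deploy("agent:v1")
agent.deploy("agent:v2")        # turns out to be faulty
agent.rollback(to_number=1)     # instant revert to the stable revision
```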
For observability: We treat it as a core product feature, not an afterthought. Our strategy is built on the three pillars:
- Structured Logs: Every action, from an API call to an agent’s tool usage, generates structured logs that are both machine- and human-readable. This allows for powerful, high-speed querying to diagnose issues.
- Metrics: We use a time-series database to track the health and performance of every component in our system. We have dashboards and automated alerting on key indicators like error rates, resource consumption, and, of course, latency.
- Distributed Tracing: This is non-negotiable for microservices. We use distributed tracing to follow a single agent’s request as it travels through our entire system, allowing us to pinpoint the exact service that’s causing a bottleneck or failure.
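As an illustration of the tracing pillar, the generic OpenTelemetry sketch below (not Blaxel’s instrumentation; the span names and console exporter are assumptions) shows how nested spans let you follow one agent request across internal hops and pinpoint the slow one.

```python
# Generic OpenTelemetry tracing sketch (illustrative only).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-platform")


def handle_agent_request(prompt: str) -> str:
    # One parent span per agent request; child spans mark each internal hop.
    with tracer.start_as_current_span("agent.request") as span:
        span.set_attribute("agent.prompt_length", len(prompt))
        with tracer.start_as_current_span("sandbox.boot"):
            pass  # placeholder for booting the sandbox
        with tracer.start_as_current_span("tool.call"):
            pass  # placeholder for an MCP tool invocation
        return "done"


handle_agent_request("summarize this document")
```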
Drawing from your background in distributed systems, how do you frame the design of microservices for AI agents to balance resilience, performance, and modularity?
Our platform is designed to provide developers with the building blocks for agents that are modular, performant, and resilient by default.
We enable modularity through the Model Context Protocol (MCP), which allows developers to define tools as independent, plug-and-play components. This makes their agents far more adaptable and easier to maintain without code changes.
For performance, developers get it automatically: our global, serverless infrastructure provides low-latency execution by co-locating agents with their tools, and it scales instantly to meet any demand.
Finally, we bake in resilience at the platform level: our revision system allows developers to perform instant, zero-downtime rollbacks to a previous stable version if anything goes wrong. We handle the hard parts so they can focus on their agent’s logic.
Booting sandboxed VMs in under 25 ms is technically impressive. What challenges did you face in achieving that, and how did your experience in microservices and large-scale data pipelines help you overcome them?
The raw speed comes from using microVM technology based on Firecracker, but you’re right, the sub-25ms boot time of a single VM is only the starting point. The real challenge was turning that into a reliable system that could handle massive, concurrent workloads.
The hardest part was plugging everything together and engineering the resilience to handle the deployment of thousands of workloads in a very short period. This is where our deep background in distributed Kubernetes was invaluable. Kubernetes taught us the patterns for building large-scale, resilient orchestration systems. We applied those same principles to design our Sandbox Orchestrator. It’s not just about starting VMs, but about managing state, scheduling capacity, and handling failures gracefully when dealing with a huge burst of requests.
The second major difficulty was networking. We had to design and implement a high-performance, secure network fabric that could instantly connect thousands of ephemeral sandboxes to agents, and agents to MCP servers. Again, our experience with Kubernetes, particularly with custom CNI (Container Network Interface) plugins and complex service meshes, gave us the blueprint. We knew how to build the dynamic routing and network isolation required to handle the high churn of workloads being created and destroyed every second.
So while Firecracker provides the fast-booting primitive, our experience with Kubernetes gave us the ability to solve the much harder systems-level problems of orchestration and networking at a massive scale.
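For context on that fast-booting primitive: Firecracker is driven through a small REST API exposed on a Unix socket. The sketch below shows the minimal call sequence to configure and start one microVM. It is illustrative only, with assumed file paths and no jailer, networking, or snapshot handling, and it presumes a Firecracker process is already listening on the socket.

```python
# Minimal Firecracker boot sequence sketch (illustrative; paths are assumptions).
import http.client
import json
import socket

API_SOCKET = "/tmp/firecracker.sock"  # assumed path to a running Firecracker process


class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection over the Unix domain socket Firecracker listens on."""

    def __init__(self, path: str):
        super().__init__("localhost")
        self._path = path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._path)


def api_put(path: str, body: dict) -> None:
    conn = UnixHTTPConnection(API_SOCKET)
    conn.request("PUT", path, json.dumps(body), {"Content-Type": "application/json"})
    resp = conn.getresponse()
    resp.read()
    conn.close()
    assert resp.status in (200, 204), f"{path} -> HTTP {resp.status}"


# Configure the kernel, root filesystem, and machine size, then start the microVM.
api_put("/boot-source", {"kernel_image_path": "vmlinux",
                         "boot_args": "console=ttyS0 reboot=k panic=1"})
api_put("/drives/rootfs", {"drive_id": "rootfs", "path_on_host": "rootfs.ext4",
                           "is_root_device": True, "is_read_only": False})
api_put("/machine-config", {"vcpu_count": 1, "mem_size_mib": 128})
api_put("/actions", {"action_type": "InstanceStart"})
```

The orchestration and networking layers described above sit on top of this primitive: they decide when and where these calls happen, at a rate of thousands per second.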
Agentic workloads introduce complex security considerations. How have you adapted microservices and sandboxing principles to support robust zero-trust models for autonomous AI agents?
Our security model is built on Zero Trust, meaning we assume no implicit trust anywhere. The cornerstone of this is our sandboxing. Every agent task can run in a fresh, ephemeral microVM that provides complete isolation. This contains any potential compromise to a single, temporary task that is destroyed seconds later.
For external access, control is placed entirely in the developer’s hands through a tool-based architecture. By default, an agent in its sandbox has no ability to interact with the outside world. A developer grants specific capabilities by equipping their agent with explicit tools. For example, a tool to call a particular API or a tool to browse a website. Our platform acts as the secure intermediary, ensuring the agent can only use the tools it has been explicitly given.
Ultimately, our Zero Trust strategy combines platform-level isolation with developer-defined control. The sandbox contains the agent, and the tool system ensures it can only perform actions its creator has authorized. The agent never has implicit access; every capability is a deliberate choice made by the developer, enforcing a true least-privilege model.
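A toy sketch of that least-privilege idea follows. It is purely illustrative, not Blaxel’s enforcement layer: an agent can only invoke tools its developer explicitly granted, and any other call is rejected.

```python
# Hypothetical least-privilege tool registry (illustrative only).
from typing import Callable, Dict


class ToolRegistry:
    """An agent can only call tools its developer explicitly granted."""

    def __init__(self):
        self._tools: Dict[str, Callable[..., str]] = {}

    def grant(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            raise PermissionError(f"tool '{name}' was not granted to this agent")
        return self._tools[name](**kwargs)


# The developer grants exactly one capability to this agent.
registry = ToolRegistry()
registry.grant("fetch_weather", lambda city: f"sunny in {city}")

print(registry.call("fetch_weather", city="Paris"))  # allowed
registry.call("send_email", to="x@example.com")      # raises PermissionError
```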
When dealing with autonomous agents at scale, what patterns do you rely on to ensure observability, monitoring, and graceful failover across microservices?
To manage agents at scale, our strategy is to build for transparency and automatic resilience. For transparency, we rely on the three pillars of observability. We use structured logs for key events and metrics for high-level trends, but the most critical tool is distributed tracing. A trace provides a replayable story of an agent’s “thought process”, showing every decision it makes, which is the only way to truly debug unpredictable behavior.
Our monitoring is then built on top of these observability tools. We set up proactive alerts on agent-centric metrics like task success rates, not just system health. This focus on user impact tells us precisely when an issue needs investigation, and it acts as the direct trigger for our automated failover systems.
When a failure is detected, we rely on graceful failover at multiple layers. The sandbox itself is the ultimate bulkhead for any single agent failure. Our services run with redundancy within each region, and our multi-region architecture protects against large-scale outages. This multi-layered defense ensures the platform remains stable and highly available, no matter what the agents do.
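A simplified sketch of that layered failover pattern is shown below, with hypothetical region names and a simulated sandbox failure: retry within a region (each attempt gets a fresh sandbox), then fail over to the next region.

```python
# Simplified multi-layer failover sketch (illustrative; regions and failure are simulated).
import random


class SandboxError(Exception):
    """Raised when an ephemeral sandbox fails mid-task (simulated here)."""


def execute_in_sandbox(task: str, region: str) -> str:
    # Stand-in for dispatching the task to a fresh sandbox in `region`;
    # fails randomly to exercise the failover path.
    if random.random() < 0.3:
        raise SandboxError(f"sandbox crashed in {region}")
    return f"'{task}' completed in {region}"


REGIONS = ["us-east", "eu-west", "ap-southeast"]  # hypothetical region names


def run_with_failover(task: str, attempts_per_region: int = 2) -> str:
    """Retry within a region first, then fail over to the next region."""
    for region in REGIONS:
        for _ in range(attempts_per_region):
            try:
                return execute_in_sandbox(task, region)
            except SandboxError:
                continue  # the sandbox is the bulkhead: discard it and try again
    raise RuntimeError("task failed in every region")


print(run_with_failover("summarize report"))
```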
Blaxel recently closed a $7.3 million seed round led by First Round Capital. How will this new capital fuel your roadmap for building infrastructure tailored to AI agents, and what key technical priorities will you tackle first?
We see this funding as a powerful endorsement of our vision to build the defining infrastructure for AI agents. It allows our team to lead from the front in a rapidly evolving market. The agent ecosystem is innovating at an incredible speed, and this capital ensures we can be a standard-bearer, shaping the future of these systems.
Our strategy involves fundamental investments in our core infrastructure, enhancing our network for lower latency and greater reliability. This foundational work will support our key focus areas, like enabling full App Hosting and seamlessly integrating existing Database projects right onto our architecture. This creates the perfect environment for building and managing complex, multi-step Agentic Workflows. We also envision hosting specialized models that are far more optimized for inference, not training, to give our users a decisive performance advantage.