Infrastructure as Code (IaC) is an approach where the entire infrastructure is defined and managed through code rather than manual configuration. This enables the automation of deployment, configuration, and management processes, significantly improving speed, reliability, and scalability. Oleksandr Shevchenko, a Site Reliability Engineer with over 10 years of experience, shares how he implements IaC, the tools he uses to accelerate operations, and how he tailors his solutions to meet regulatory requirements.
Oleksandr, you work with Infrastructure as Code in a multi-cloud environment. What are the first tasks you automate when launching a new project, and why?
The first thing I do is build the foundational infrastructure: the Virtual Private Cloud (VPC), IAM roles and policies, storage, CI/CD integrations, and monitoring. I make sure networks are isolated, access rights are correctly assigned, and centralized logging is properly configured. If these aren’t in place from the beginning, you’re likely to run into problems early on: open ports, excessive permissions granted to individual employees, no logging for critical system components, and so on.
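As a rough illustration of what that baseline layer can look like in Terraform, here is a minimal sketch for the AWS side; the resource names, CIDR range, retention period, and the CI service principal are placeholders rather than the project’s actual configuration.

```hcl
# Minimal sketch of the baseline layer: an isolated network, a central log
# destination, and a narrowly scoped role for the CI pipeline.

variable "region" {
  type    = string
  default = "eu-central-1" # illustrative region
}

provider "aws" {
  region = var.region
}

# Isolated network: nothing is reachable until subnets and routing allow it.
resource "aws_vpc" "core" {
  cidr_block           = "10.20.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = { Name = "core-vpc" }
}

# Centralized log group for infrastructure components.
resource "aws_cloudwatch_log_group" "infra" {
  name              = "/infra/core"
  retention_in_days = 90
}

# Least-privilege starting point: a role assumable only by the CI service,
# with permissions attached separately as specific needs arise.
resource "aws_iam_role" "ci_deployer" {
  name = "ci-deployer"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "codebuild.amazonaws.com" }
    }]
  })
}
```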
After that, I deploy Kubernetes clusters (EKS on AWS, GKE on GCP) that all subsequent applications will run on. This stage gives us a clear entry point and reproducible environments. Without it, we risk losing valuable time if the infrastructure goes down or a force majeure event occurs.
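A minimal Terraform sketch of that cluster layer, using EKS as the AWS example (the GKE side is analogous); the cluster name, Kubernetes version, node sizes, and the IAM roles and subnets passed in as variables are illustrative assumptions, with the latter expected to come from the baseline layer.

```hcl
variable "cluster_subnet_ids" {
  type = list(string)
}

variable "cluster_role_arn" {
  type = string
}

variable "node_role_arn" {
  type = string
}

# Control plane, pinned to a specific version so environments stay reproducible.
resource "aws_eks_cluster" "platform" {
  name     = "platform"
  role_arn = var.cluster_role_arn
  version  = "1.29"

  vpc_config {
    subnet_ids              = var.cluster_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = false
  }
}

# A default managed node group sized for the initial workloads.
resource "aws_eks_node_group" "default" {
  cluster_name    = aws_eks_cluster.platform.name
  node_group_name = "default"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.cluster_subnet_ids
  instance_types  = ["m6i.large"]

  scaling_config {
    desired_size = 3
    max_size     = 6
    min_size     = 3
  }
}
```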
To manage cloud resources, I use Terraform and Terragrunt. These tools help automate deployments, structure the code, and maintain transparency throughout development. This speeds up the work considerably and keeps every team member informed about all significant changes.
So, you’re actively using Terraform and Terragrunt in both GCP and AWS. Based on your experience, what makes this combination particularly effective in a multi-cloud architecture?
This combination offers both flexibility and structure. Terraform on its own is great for rapidly deploying infrastructure and capturing environment configuration as code. Even if something happens on the cloud provider’s side, you can quickly recover by redeploying the infrastructure in another region, or even with another provider, using Terraform. It also keeps resources organized, which simplifies optimization and iteration.
However, as you scale — especially across multiple clouds — an additional tool becomes essential: Terragrunt. It centralizes variables, enforces a DRY (Don’t Repeat Yourself) code structure, and — most importantly — helps manage differences between environments and providers.
This allows us to use a unified structure and configuration logic. While Terraform modules are cloud-specific (AWS or GCP), Terragrunt enables us to call the necessary modules with the right set of variables and maintain a coherent project structure. This simplifies onboarding for new teams and reduces the risk of human error when switching between projects.
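To make the DRY structure concrete, here is a hedged sketch of a root terragrunt.hcl that every component includes; the state bucket, lock table, and shared inputs are placeholders, not the project’s real values.

```hcl
# live/terragrunt.hcl (root): backend and shared settings defined exactly once.
# Every child configuration includes this file, which is where the DRY benefit
# comes from.

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "example-terraform-state"   # hypothetical bucket
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks"   # hypothetical lock table
  }
}

# Common inputs merged into every child configuration.
inputs = {
  org = "example-org"
  default_tags = {
    ManagedBy = "terragrunt"
  }
}
```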
How do you structure your infrastructure code: do you prefer a monorepo or a modular approach with separated environments? How does this impact scalability and maintenance?
I prefer a modular approach with clear separation by environment: dev, staging, and prod. Each environment resides in its own directory, and modules are versioned. This greatly speeds up scaling. For example, if we need to deploy infrastructure in a new region, we can simply reuse the existing modules with the new parameters.
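A sketch of what that layout and module reuse can look like with Terragrunt; the directory paths, module repository, and version tag here are hypothetical.

```hcl
# Illustrative layout (paths are hypothetical):
#   live/
#     dev/eu-central-1/eks/terragrunt.hcl
#     staging/eu-central-1/eks/terragrunt.hcl
#     prod/eu-central-1/eks/terragrunt.hcl
#     prod/eu-west-1/eks/terragrunt.hcl   <- new region = new folder, same module
#
# live/prod/eu-west-1/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  # The module is versioned; rolling out to a new region reuses the same tag.
  source = "git::https://example.com/infra-modules.git//aws/eks?ref=v1.4.0"
}

inputs = {
  environment  = "prod"
  region       = "eu-west-1"
  cluster_name = "platform-prod-euw1"
}
```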
It also makes infrastructure maintenance easier, since changes only affect local components. As a result, auditing or incident root cause analysis (RCA) takes less time and effort.
In my view, how you structure infrastructure depends on the scale and maturity of the project. A monorepo is suitable for startups or small teams where speed is of the essence. But when it comes to large systems with multiple teams and strict stability requirements, a modular approach is the only viable solution. It provides the scalability, security, and clear separation of responsibilities that enterprise-grade systems demand.
What measures do you implement to ensure IaC security — from validation and static analysis to working with secrets and access control?
Security starts with automating infrastructure and application deployment through a GitOps workflow: all code goes through pull requests and reviews. To catch issues before they even reach the plan stage, we use static analysis tools like TFLint, Checkov, and tfsec. These tools automatically inspect Terraform configurations for misconfigurations and vulnerabilities.
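As one example of wiring in those checks, a minimal .tflint.hcl might look like the sketch below; the plugin version and rule selection are illustrative, and Checkov and tfsec are configured separately in their own formats.

```hcl
# .tflint.hcl -- illustrative TFLint configuration; pin the plugin to the
# version actually used in your pipeline.
plugin "aws" {
  enabled = true
  version = "0.31.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Flag variables, locals, and data sources that are declared but never used.
rule "terraform_unused_declarations" {
  enabled = true
}

# Enforce consistent naming across resources and variables.
rule "terraform_naming_convention" {
  enabled = true
}
```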
We manage all sensitive data through Secret Manager (on GCP) or AWS Secrets Manager (on AWS), integrated via Terraform data sources. This keeps secret values out of the repository and out of plain-text configuration. Access is strictly limited based on the principle of least privilege: each service gets only the access it truly needs. We also implement automatic secret rotation, so updates happen without manual intervention. It’s simple: minimum access, maximum control.
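A minimal sketch of that pattern on the AWS side, assuming a hypothetical secret name and database resource. One caveat worth noting: values read through data sources are still recorded in Terraform state, so the state backend itself needs to be encrypted and access-restricted.

```hcl
# Pull the secret at plan/apply time instead of hard-coding it anywhere.
# The secret name and the consuming database are hypothetical.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/app/db-password"
}

# The value is only referenced, never written into the repository. It does
# appear in Terraform state, so the state backend must be encrypted and
# tightly access-controlled.
resource "aws_db_instance" "app" {
  identifier          = "app-prod"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 20
  username            = "app"
  password            = data.aws_secretsmanager_secret_version.db_password.secret_string
  skip_final_snapshot = false
}
```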
Additionally, we use Policy-As-Code tools like OPA and Sentinel to manage security policies. These practices reduce the risk of data leakage due to human error or compliance violations.
As I understand, you’re currently working in a bank with strict regulatory requirements. How have you adapted your IaC practices to meet these conditions, and what compromises have been necessary?
Yes, our security and audit requirements meet the highest standards in the banking sector. This means we log every action during development, use versioning, and strictly separate environments.
Naturally, this sometimes requires compromises. For instance, we must be selective about the managed services we adopt. If a service doesn’t provide the fine-grained access control or detailed audit logs our regulators demand, we often have to build and manage the solution ourselves on core infrastructure, with precise data access controls that prevent a bad actor from damaging the infrastructure.
Another challenge is redundancy. To pass audits, we need to duplicate infrastructure across multiple regions. But IaC helps manage this — the documentation is embedded in the code itself, and reproducibility reduces the risk of human error. So redundancy is implemented quickly and with minimal cost.
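A sketch of one way to express that duplication in Terraform, reusing a single module under two provider aliases; the module path, regions, and the environment variable are illustrative assumptions.

```hcl
# Two provider configurations, one per region.
provider "aws" {
  alias  = "primary"
  region = "eu-central-1"
}

provider "aws" {
  alias  = "secondary"
  region = "eu-west-1"
}

# The same module deployed twice: identical code, different region.
module "core_primary" {
  source      = "./modules/core" # hypothetical local module
  providers   = { aws = aws.primary }
  environment = "prod"
}

module "core_secondary" {
  source      = "./modules/core"
  providers   = { aws = aws.secondary }
  environment = "prod"
}
```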
What business outcomes do you associate with implementing IaC? Have you been able to speed up releases, reduce incidents, or cut costs?
Thanks to IaC, we’ve reduced infrastructure provisioning time from several hours to just minutes. The entire process has become significantly faster and more predictable by minimizing manual intervention.
What personal approach helps you stay focused and manage stress during incidents — especially when an IaC mistake can affect production?
The key is having a well-defined rollback plan — a procedure that lets you quickly return to the previous system state if something goes wrong. Every infrastructure pull request must be reversible and tested in staging, so we can recover quickly in case of an emergency.
To minimize the potential impact of any single change, I stick to the “one change, one PR” rule. This makes it easier to spot vulnerabilities and mistakes early and address them quickly.
I view incidents as opportunities to learn and improve. The focus should never be on blaming the individual who made a mistake, but on identifying where the system broke down and why. We even conduct incident analysis without naming anyone responsible. This reduces pressure on the team and fosters a healthy environment for growth. Like in any technically demanding field, there’s always stress — but with the right team support and well-established processes, it becomes genuinely manageable.