Infrastructure Drift Detection
Advanced techniques for detecting and managing configuration drift in your cloud environments.
📋 Prerequisites
- Expertise in an Infrastructure as Code (IaC) tool like Terraform or CloudFormation.
- Deep understanding of cloud provider APIs and eventing systems (e.g., EventBridge, Cloud Logging).
- Experience with policy-as-code engines (OPA, Sentinel) and scripting languages (Python, Go).
- Familiarity with cloud provider security and configuration management tools.
💡 What is Infrastructure Drift?
Infrastructure drift occurs when the real-world state of your live environment deviates from the state defined in your Infrastructure as Code (IaC). This is often caused by manual changes made directly in the cloud console, out-of-band scripts, or even automated actions by cloud provider services. Drift undermines the reliability, security, and compliance of your infrastructure.
Infrastructure Drift Detection Methods: Stateful vs Stateless vs Event-Driven
There are three primary strategies for detecting drift. A comprehensive approach often involves a combination of all three.
Stateful Comparison
How it works: Compare the last "known good" state (e.g., a Terraform state file) with the live environment.
Tooling: `terraform plan` is the classic example. It refreshes its state against the cloud provider APIs and shows you the difference.
Pros: Very precise, shows exactly what changed.
Cons: Only works for resources managed by that specific state file.
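The comparison itself can be illustrated with a minimal sketch. It assumes both the recorded state and the live configuration have already been loaded into nested dictionaries; the `diff_state` helper and the sample attributes are hypothetical and are not part of Terraform.

```python
from typing import Any


def diff_state(recorded: dict[str, Any], live: dict[str, Any], prefix: str = "") -> list[str]:
    """Return one line per attribute that differs between recorded and live state."""
    changes: list[str] = []
    for key in sorted(set(recorded) | set(live)):
        path = f"{prefix}.{key}" if prefix else key
        if key not in live:
            changes.append(f"- {path}: removed (was {recorded[key]!r})")
        elif key not in recorded:
            changes.append(f"+ {path}: added ({live[key]!r})")
        elif isinstance(recorded[key], dict) and isinstance(live[key], dict):
            # Recurse into nested blocks such as tags
            changes.extend(diff_state(recorded[key], live[key], path))
        elif recorded[key] != live[key]:
            changes.append(f"~ {path}: {recorded[key]!r} -> {live[key]!r}")
    return changes


# Hypothetical sample data: someone resized the instance and added a tag by hand
recorded = {"instance_type": "t3.micro", "tags": {"env": "prod"}}
live = {"instance_type": "t3.large", "tags": {"env": "prod", "debug": "true"}}
for line in diff_state(recorded, live):
    print(line)
```

Anything absent from the recorded state cannot show up as a difference, which is the limitation noted under Cons.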
Stateless Scanning
How it works: Periodically scan the live cloud environment and evaluate resources against a set of policies that define the desired configuration.
Tooling: Cloud Custodian, Steampipe, or custom scripts that use OPA.
Pros: Can discover unmanaged ("shadow IT") resources and checks the entire environment.
Cons: Can be resource-intensive; drift is only detected at the time of the scan.
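A stateless scan can be sketched as a loop that applies policy functions to an inventory of live resources. This is only an illustration: the resource shape, the `managed-by` tag, and both policy functions are assumptions, and a real scanner would pull the inventory from the cloud provider's APIs.

```python
from typing import Callable, Optional

# A policy inspects one resource and returns a message if it violates the desired configuration
Policy = Callable[[dict], Optional[str]]


def require_encryption(resource: dict) -> Optional[str]:
    if resource.get("type") == "s3_bucket" and not resource.get("encrypted", False):
        return "bucket is not encrypted"
    return None


def require_managed_tag(resource: dict) -> Optional[str]:
    # Assumption: every IaC-managed resource carries a "managed-by" tag
    if "managed-by" not in resource.get("tags", {}):
        return "no 'managed-by' tag; resource may be unmanaged"
    return None


def scan(resources: list[dict], policies: list[Policy]) -> list[tuple[str, str]]:
    """Evaluate every policy against every resource and collect the findings."""
    findings = []
    for resource in resources:
        for policy in policies:
            message = policy(resource)
            if message is not None:
                findings.append((resource["id"], message))
    return findings


# Hypothetical inventory
inventory = [
    {"id": "bucket-a", "type": "s3_bucket", "encrypted": True, "tags": {"managed-by": "terraform"}},
    {"id": "bucket-b", "type": "s3_bucket", "encrypted": False, "tags": {}},
]
for resource_id, message in scan(inventory, [require_encryption, require_managed_tag]):
    print(f"{resource_id}: {message}")
```

Because the scan starts from the live inventory rather than from a state file, the untagged bucket is reported even though no IaC definition mentions it.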
Event-Driven Detection
How it works: Use cloud provider event streams (like AWS CloudTrail or the Azure Activity Log) to trigger a policy check in near real time whenever a resource is modified.
Tooling: AWS Config, EventBridge rules that trigger Lambda/OPA.
Pros: Near real-time detection.
Cons: Can be complex to set up; potential for high volume of events.
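One way to keep event volume manageable is to filter narrowly at the event bus. The sketch below shows an EventBridge-style event pattern for a single CloudTrail API call, written as a Python dictionary, together with a deliberately simplified local matcher for trying patterns against sample events. The matcher handles only exact-value lists and nested objects; it is not EventBridge's full matching logic.

```python
import json

# Event pattern in EventBridge's JSON syntax: each leaf is a list of accepted values
RDS_MODIFY_PATTERN = {
    "source": ["aws.rds"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["rds.amazonaws.com"],
        "eventName": ["ModifyDBInstance"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical sample event, trimmed to the fields the pattern inspects
sample_event = {
    "source": "aws.rds",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "rds.amazonaws.com", "eventName": "ModifyDBInstance"},
}
print(json.dumps(RDS_MODIFY_PATTERN, indent=2))
print(matches(RDS_MODIFY_PATTERN, sample_event))
```

The printed JSON is the form in which the pattern would be supplied when creating the rule.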
Terraform Drift Detection Tutorial: Automated Configuration Monitoring Tools
Let's see how these strategies are implemented with specific tools.
Terraform: Stateful Comparison
The simplest way to detect drift is to run `terraform plan` in a CI/CD job on a regular schedule. If the plan is not empty (and no code has changed), it indicates drift.
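The exit-code logic behind such a job can be sketched in Python. With `-detailed-exitcode`, `terraform plan` exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The function names and status labels below are illustrative, and `check_drift` assumes the `terraform` binary is on the PATH and the working directory has already been initialised.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if code == 0:
        return "in-sync"  # plan succeeded and is empty
    if code == 2:
        return "drift"    # plan succeeded and contains changes
    return "error"        # exit code 1 (or anything unexpected)


def check_drift(workdir: str) -> str:
    """Run the plan in `workdir` and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Treating every code other than 0 and 2 as an error keeps a failed plan from being mistaken for a clean one.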
📋 Scheduled Terraform Plan Check
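A GitHub Actions workflow that runs nightly to detect drift. The alert step fires on exit code 2, which `terraform plan -detailed-exitcode` returns when the plan contains changes; the `exitcode` step output comes from the wrapper script that `hashicorp/setup-terraform` installs by default.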
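name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 2 * * *' # Run at 2 AM UTC every day
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      # Cloud provider and backend credentials must be made available to the job (e.g., via repository secrets)
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        id: plan
        # Exit codes: 0 = no changes, 1 = error, 2 = changes present
        run: terraform plan -no-color -detailed-exitcode
        continue-on-error: true
      # Without this step, a failed plan would be silently ignored
      - name: Fail on Plan Error
        if: steps.plan.outputs.exitcode == 1
        run: exit 1
      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == 2
        run: |
          echo "🚨 Drift Detected! The terraform plan is not empty."
          # Add commands to send a Slack/Teams notification
          exit 1
OPA: Stateless Scanning of Live Resources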
This approach involves writing a script to fetch the current configuration of a resource from the cloud provider's API, then using OPA to evaluate that configuration against a policy.
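The fetch-and-evaluate script might look like the following sketch. It assumes the three arguments are the responses of boto3's `get_bucket_versioning`, `get_bucket_encryption`, and `get_public_access_block` calls, that an OPA server is listening on `localhost:8181`, and that the policy in this section is loaded under the package `drift.aws.s3`. The function names are illustrative, and the input field names mirror the ones that policy reads.

```python
import json
import urllib.request


def build_opa_input(versioning: dict, encryption: dict, public_access: dict) -> dict:
    """Normalise S3 API responses into the snake_case document the policy expects."""
    rules = encryption.get("ServerSideEncryptionConfiguration", {}).get("Rules", [])
    default = rules[0].get("ApplyServerSideEncryptionByDefault", {}) if rules else {}
    block = public_access.get("PublicAccessBlockConfiguration", {})
    return {
        "versioning": {"enabled": versioning.get("Status") == "Enabled"},
        "server_side_encryption_configuration": {
            "rule": {
                "apply_server_side_encryption_by_default": {
                    # None (JSON null) when no default encryption rule is present
                    "sse_algorithm": default.get("SSEAlgorithm"),
                }
            }
        },
        "public_access_block_configuration": {
            "block_public_acls": block.get("BlockPublicAcls", False),
        },
    }


def query_opa(opa_input: dict,
              url: str = "http://localhost:8181/v1/data/drift/aws/s3/deny") -> list:
    """POST the input to OPA's Data API and return the list of deny messages."""
    body = json.dumps({"input": opa_input}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("result", [])
```

Filling in explicit defaults (`False`, `None`) for missing settings means an absent configuration still reaches the policy as a concrete value rather than as an undefined field, which a `!=` comparison in Rego would silently skip.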
🛡️ OPA Policy for Live S3 Bucket
This policy defines the desired state for a critical S3 bucket, including encryption and versioning.
package drift.aws.s3
# Desired state is defined in an external data file
import data.desired_state
# Compare the live config (input) to the desired state
deny[msg] {
    input.versioning.enabled != desired_state.s3_buckets.critical_data.versioning
    msg := "S3 Versioning has drifted from desired state."
}
deny[msg] {
    input.server_side_encryption_configuration.rule.apply_server_side_encryption_by_default.sse_algorithm != "AES256"
    msg := "S3 Encryption has drifted from AES256."
}
deny[msg] {
    input.public_access_block_configuration.block_public_acls != true
    msg := "S3 public access block has drifted."
}
Real-Time Infrastructure Drift Detection: Event-Driven Cloud Monitoring Setup
The most advanced strategy uses cloud events to get near-immediate notification of changes. This lets you detect and respond to drift within seconds to minutes of the change, rather than waiting hours for the next scheduled scan.
A Manual Change
An administrator manually makes an RDS database publicly accessible via the AWS Console.
Event is Logged
This action generates a `ModifyDBInstance` event in AWS CloudTrail.
Trigger is Fired
An EventBridge rule is configured to match this specific event and invokes a Lambda function.
Policy is Evaluated
The Lambda passes the event payload to a policy that checks whether `PubliclyAccessible` was changed to `true`. If so, the policy returns a "drift detected" decision.
Alert / Remediate
Based on the policy decision, the system sends a high-priority alert to the security team and can optionally trigger an automated remediation workflow.
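The flow above can be sketched as a small Lambda handler. The watched parameter and the value that counts as drift are supplied through a rule table, so the same handler works for whichever setting the policy cares about. The `RULES` table is hypothetical, and the `publiclyAccessible` key is an assumption about how CloudTrail spells the request parameter; check a real record before relying on it.

```python
import json

# Hypothetical rule table: CloudTrail event name -> {request parameter: value that counts as drift}
RULES = {"ModifyDBInstance": {"publiclyAccessible": True}}


def evaluate(detail: dict, rules: dict = RULES) -> dict:
    """Return a policy decision for one CloudTrail record (the EventBridge `detail` field)."""
    watched = rules.get(detail.get("eventName"), {})
    params = detail.get("requestParameters") or {}
    violations = [
        f"{name} set to {params[name]!r}"
        for name, forbidden in watched.items()
        if name in params and params[name] == forbidden
    ]
    return {
        "drift": bool(violations),
        "violations": violations,
        "actor": (detail.get("userIdentity") or {}).get("arn", "unknown"),
    }


def lambda_handler(event, context):
    """Entry point invoked by the EventBridge rule."""
    decision = evaluate(event.get("detail", {}))
    if decision["drift"]:
        # Placeholder for the alert / remediation step; here the decision is only logged
        print(json.dumps(decision))
    return decision
```

Keeping `evaluate` free of AWS calls makes the policy logic easy to unit test with recorded events before it is deployed.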
💡 Infrastructure Drift Management Best Practices: Prevention and Remediation Guide
Key Takeaways
- IaC is the Single Source of Truth: Enforce a strict GitOps culture where all changes *must* go through your IaC pipeline. This is the best way to prevent drift in the first place.
- Lock Down Manual Access: Use policies to restrict direct console or API access for most users. Provide read-only access by default and require a "break-glass" procedure for emergency write access.
- Choose the Right Detection for the Job: Use scheduled Terraform plans for your core IaC-managed resources. Use stateless scanning to find unmanaged resources. Use event-driven detection for your most critical, high-security assets.
- Tag Everything: A good tagging strategy is essential. Tags should identify the owner, the IaC workspace managing the resource, and the desired state. This helps your detection tools know what to compare against.
- Automate the Response: Detecting drift is only half the battle. Your detection should trigger an automated workflow to either alert the resource owner, create a ticket, or (for safe changes) automatically remediate the drift.
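As a rough illustration of the last two points, a response router can use a finding's tags to decide what happens next. Everything here is hypothetical: the finding shape, the `owner` tag, the check names, and the set of checks considered safe to fix automatically.

```python
def choose_response(finding: dict, safe_checks: frozenset = frozenset({"missing-tag"})) -> str:
    """Pick an automated response for a single drift finding."""
    if finding["check"] in safe_checks:
        return "remediate"      # low-risk drift is fixed automatically
    if finding.get("tags", {}).get("owner"):
        return "alert-owner"    # a known owner is notified directly
    return "create-ticket"      # otherwise the finding goes to a triage queue


# Hypothetical findings
findings = [
    {"check": "missing-tag", "tags": {}},
    {"check": "encryption-disabled", "tags": {"owner": "data-team"}},
    {"check": "encryption-disabled", "tags": {}},
]
for finding in findings:
    print(finding["check"], "->", choose_response(finding))
```

Defaulting to a ticket when no owner tag exists keeps untagged resources from falling through without any follow-up.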