Incident Response Procedures
Create and automate incident response procedures using policy-as-code to rapidly contain and recover from security events.
📋 Prerequisites
- Experience with the incident response lifecycle (e.g., NIST SP 800-61).
- Proficiency with policy-as-code (OPA, Sentinel).
- Deep knowledge of cloud logging, monitoring, and security services (e.g., CloudTrail, GuardDuty, Security Hub).
- Familiarity with scripting and automation tools (e.g., Lambda, Azure Functions, Python).
💡 From Reactive to Proactive
Policy-as-code shifts incident response from a purely reactive, manual process to a proactive, automated discipline. By enforcing a secure and observable baseline, policies reduce the likelihood of incidents and provide the automation hooks needed for rapid, consistent response when they do occur.
Policy-as-Code Incident Response: Automating Security Event Management Lifecycle
Policy-as-code can be integrated into nearly every phase of the standard NIST Incident Response lifecycle.
1. Preparation
Goal: Ensure systems are ready to be defended.
Policy's Role: Enforce the presence of necessary security controls. Policies verify that logging is enabled, security agents are installed, and network configurations are correct *before* an incident happens.
2. Detection & Analysis
Goal: Identify that an incident has occurred.
Policy's Role: Act as the detection mechanism. A policy violation (e.g., "S3 bucket made public") is not just a misconfiguration; it's a security event that can trigger an automated alert or response.
3. Containment
Goal: Limit the scope and magnitude of the incident.
Policy's Role: Provide the automation hooks for containment actions. The output of a policy decision can trigger a workflow that isolates a host, revokes credentials, or reverts a network change.
4. Eradication & Recovery
Goal: Remove the threat and restore systems to a known-good state.
Policy's Role: Define what "good" looks like. Policies can be used to validate that a restored system is compliant before it's brought back online, preventing re-infection from a misconfigured backup.
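The recovery gate described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the check names and the configuration fields (`logging_enabled`, `security_agent`, `ingress_cidrs`) are hypothetical, and a real deployment would more likely evaluate the same rules through a policy engine such as OPA.

```python
from typing import Any, Callable

# Hypothetical baseline checks: each maps a check name to a predicate over a
# resource's configuration dictionary. All field names are assumptions.
BASELINE_CHECKS: dict[str, Callable[[dict[str, Any]], bool]] = {
    "logging_enabled": lambda cfg: cfg.get("logging_enabled") is True,
    "security_agent_installed": lambda cfg: cfg.get("security_agent") == "installed",
    "no_public_ingress": lambda cfg: "0.0.0.0/0" not in cfg.get("ingress_cidrs", []),
}


def failed_checks(config: dict[str, Any]) -> list[str]:
    """Return the names of the baseline checks that the configuration fails."""
    return [name for name, check in BASELINE_CHECKS.items() if not check(config)]


def ready_for_service(config: dict[str, Any]) -> bool:
    """A restored system may go back online only when no check fails."""
    return not failed_checks(config)


# Example inputs: a compliant restore and a backup taken before hardening.
restored = {
    "logging_enabled": True,
    "security_agent": "installed",
    "ingress_cidrs": ["10.0.0.0/8"],
}
stale_backup = {
    "logging_enabled": False,
    "security_agent": "installed",
    "ingress_cidrs": ["0.0.0.0/0"],
}
```

Because the same `failed_checks` function can run before deployment and again after a restore, "known-good" means the same thing in both phases.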
Automated Incident Response Examples: Policy-Driven Security Event Handling
Let's explore practical examples of how policies can automate different phases of incident response.
Preparation: Enforcing Logging on All Resources
You cannot respond to what you cannot see. This policy ensures all critical resources have logging enabled, providing the necessary data for forensic analysis.
📋 Rego Policy for Universal Logging
{`package ir.preparation

import rego.v1

# Deny if the S3 bucket has no logging configuration.
# A missing block is undefined in Rego, so comparing it to null never matches.
deny contains msg if {
    not input.resource.aws_s3_bucket.main.logging
    msg := "S3 bucket logging is not enabled."
}

# Deny if ELB access logs are explicitly disabled
deny contains msg if {
    input.resource.aws_elb.main.access_logs[_].enabled == false
    msg := "ELB access logs are not enabled."
}

# Deny if CloudTrail logging is turned off
deny contains msg if {
    input.resource.aws_cloudtrail.main.enable_logging == false
    msg := "CloudTrail logging is disabled."
}`}
Detection & Containment: Exposed IAM Credentials
This is a classic incident scenario. An access key is leaked, and Amazon GuardDuty detects anomalous use of it. A policy-driven workflow can provide immediate, automated containment.
Detection
GuardDuty generates a CredentialAccess:IAMUser/AnomalousBehavior finding.
Trigger
An EventBridge rule captures the finding and triggers a Lambda function.
Policy Decision
The Lambda queries a policy: allow_credential_revocation(finding). The policy checks if the finding is high severity and not on an exception list.
Enforcement
If the policy returns allow, the Lambda function makes an API call to AWS to immediately deactivate the compromised access key.
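The four steps above can be sketched as a small Lambda handler in Python. Treat this as an illustration under stated assumptions: the exception list, the severity threshold constant, and the `RecordingIAM` stand-in are hypothetical, and the event shape assumes a GuardDuty finding delivered by EventBridge, with the finding under `detail` and the key under `resource.accessKeyDetails`. The only AWS call used is the IAM `update_access_key` operation.

```python
from typing import Any

HIGH_SEVERITY = 7.0  # GuardDuty's high severity range starts at 7.0
EXCEPTION_USERS = {"ci-deployer"}  # hypothetical exception list


def allow_credential_revocation(finding: dict[str, Any]) -> bool:
    """Policy decision: revoke only for high-severity, non-exempt findings."""
    key = finding.get("resource", {}).get("accessKeyDetails", {})
    user, key_id = key.get("userName"), key.get("accessKeyId")
    return (
        bool(user and key_id)
        and finding.get("severity", 0) >= HIGH_SEVERITY
        and user not in EXCEPTION_USERS
    )


def contain(event: dict[str, Any], iam_client: Any) -> str:
    """Deactivate the access key named in the finding if the policy allows it."""
    finding = event.get("detail", {})
    if not allow_credential_revocation(finding):
        return "skipped"
    key = finding["resource"]["accessKeyDetails"]
    # Deactivate rather than delete, so the key stays available as evidence.
    iam_client.update_access_key(
        UserName=key["userName"], AccessKeyId=key["accessKeyId"], Status="Inactive"
    )
    return "deactivated"


class RecordingIAM:
    """Stand-in for the boto3 IAM client that records calls (local dry runs)."""

    def __init__(self) -> None:
        self.calls: list[dict[str, str]] = []

    def update_access_key(self, **kwargs: str) -> None:
        self.calls.append(kwargs)


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, str]:
    import boto3  # imported here so the policy logic can be tested without AWS

    return {"action": contain(event, boto3.client("iam"))}
```

Keeping the decision (`allow_credential_revocation`) separate from the enforcement (`contain`) lets the policy be exercised in drills with `RecordingIAM` before it is trusted with a real IAM client.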
Break-Glass Access Policies: Emergency Privilege Management with Policy-as-Code
During a major incident, administrators may need temporary, elevated privileges to resolve an issue. A "break-glass" procedure uses policy to grant this access in a secure, audited, and time-bound manner.
🛡️ Sentinel Policy for Emergency Role Assumption
This policy checks whether a user is trying to assume the highly privileged `EmergencyAdminRole`. It allows the action only when the requested session lasts one hour or less. The policy does not write a log entry itself, so auditing has to come from a separate alert.
{`import "tfplan/v2" as tfplan

# Get the role being assumed from the input
assumed_role_arn = tfplan.variables.assumed_role.value
session_duration = tfplan.variables.duration_seconds.value

# Rule: allow if NOT the emergency role
is_normal_role = rule {
    assumed_role_arn is not "arn:aws:iam::123456789012:role/EmergencyAdminRole"
}

# Rule: allow if it IS the emergency role, but only for 1 hour
is_valid_break_glass = rule {
    assumed_role_arn is "arn:aws:iam::123456789012:role/EmergencyAdminRole" and
    session_duration <= 3600
}

# Main rule passes if it's a normal role OR a valid break-glass session
main = rule {
    is_normal_role or is_valid_break_glass
}`}
This policy would be paired with a high-priority alert that fires every time the `is_valid_break_glass` rule evaluates to true, notifying the security team of the emergency access.
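One way to build that alert is to watch CloudTrail for `AssumeRole` events that target the emergency role. The Python sketch below assumes the standard CloudTrail record fields (`eventName`, `requestParameters.roleArn`, `requestParameters.durationSeconds`, `userIdentity.arn`). The alert's own field names are hypothetical, and delivery to a paging or chat system is left out.

```python
from typing import Any, Optional

EMERGENCY_ROLE_ARN = "arn:aws:iam::123456789012:role/EmergencyAdminRole"
MAX_SESSION_SECONDS = 3600


def break_glass_alert(event: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return an alert for an emergency-role AssumeRole event, else None."""
    params = event.get("requestParameters") or {}
    if event.get("eventName") != "AssumeRole":
        return None
    if params.get("roleArn") != EMERGENCY_ROLE_ARN:
        return None
    # STS applies a 3600-second default when durationSeconds is omitted.
    duration = params.get("durationSeconds", 3600)
    return {
        "priority": "high",
        "caller": event.get("userIdentity", {}).get("arn", "unknown"),
        "duration_seconds": duration,
        # Mirrors the one-hour limit in the Sentinel rule.
        "within_policy": duration <= MAX_SESSION_SECONDS,
    }
```

This sketch alerts on every emergency-role assumption and records whether the session respected the one-hour limit, so an out-of-policy session is still visible to the security team.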
Incident Response Automation Best Practices: Policy-as-Code Security Playbooks
💡 Best Practices
- Automate Containment, Not Eradication: It's generally safe to automate containment actions (isolating a host, disabling a key). Be very cautious about automating eradication (deleting a host), as you may destroy crucial forensic evidence.
- Prepare Your Policies Before the Incident: Your IR policies for logging, security tooling, and baseline configurations must be in place and enforced *before* an incident. Logging that is enabled after a compromise cannot recover the events that were never recorded.
- Use Policy Evaluation as a Trigger: The real power comes from using the *result* of a policy evaluation to kick off a workflow. A `deny` decision should be an event that your automation platform (like a SOAR tool, Lambda, or other scripts) can act on.
- Create an Incident Response "Playbook" Library: Codify your response to common incidents. For "malware detected," your playbook might be: 1. Query policy to get network details. 2. Trigger policy to apply "isolate" network tag. 3. Trigger policy to take a disk snapshot for forensics.
- Test Your IR Automation: Regularly run drills and simulations (like chaos engineering) to ensure your automated response playbooks work as expected.
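The playbook library and drill practices above could look roughly like this in Python. It is a sketch only: the three step functions are placeholders that return a description of what they would do, the `malware_detected` key and the `host` context field are hypothetical, and a real playbook would call your policy engine and cloud APIs inside each step.

```python
from typing import Any, Callable

Step = Callable[[dict[str, Any]], str]


def get_network_details(ctx: dict[str, Any]) -> str:
    return f"looked up network details for {ctx['host']}"


def apply_isolate_tag(ctx: dict[str, Any]) -> str:
    return f"applied 'isolate' network tag to {ctx['host']}"


def snapshot_disk(ctx: dict[str, Any]) -> str:
    return f"requested forensic disk snapshot of {ctx['host']}"


# Playbook library: incident type -> ordered containment steps.
# No step deletes anything, in line with "contain, don't eradicate".
PLAYBOOKS: dict[str, list[Step]] = {
    "malware_detected": [get_network_details, apply_isolate_tag, snapshot_disk],
}


def run_playbook(incident: str, ctx: dict[str, Any], dry_run: bool = True) -> list[str]:
    """Run each step in order; in dry-run mode only report what would run."""
    log = []
    for step in PLAYBOOKS[incident]:
        log.append(f"DRY RUN: {step.__name__}" if dry_run else step(ctx))
    return log
```

Defaulting `dry_run` to `True` means a drill can walk the full playbook without touching production, and a live run has to be requested explicitly.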