AWS Cost Management & FinOps Policies
Implement AWS cost governance with budgets, cost allocation tags, rightsizing policies, and automated cost optimization strategies.
Prerequisites
- Deep knowledge of AWS Billing, Cost Explorer, and Cost & Usage Reports (CUR).
- Strong experience with IAM Policies, Service Control Policies (SCPs), and AWS Budgets.
- Proficiency with Terraform for deploying governance and automation.
- Familiarity with AWS Lambda (Python) and EventBridge for building event-driven workflows.
FinOps: Beyond Cost Savings to Cloud Financial Management
FinOps is a cultural practice that brings financial accountability to the variable spend model of the cloud. It's not just about cutting costs; it's about making data-driven spending decisions to maximize business value. In AWS, this is achieved by combining visibility (Cost Explorer, CUR), proactive controls (Budgets Actions, SCPs), and automated optimization (Lambda, Compute Optimizer) into a continuous lifecycle.
The FinOps Lifecycle in AWS
FinOps operates in a continuous loop of three phases: Inform, Optimize, and Operate. A mature FinOps practice maps AWS services to each phase to create an automated, data-driven system.
Inform
Gaining visibility and understanding costs. This phase is about accurate allocation, benchmarking, and forecasting.
Services: Cost & Usage Reports (CUR), Cost Explorer, Cost Allocation Tags.
Optimize
Acting on the data to find efficiencies. This involves rightsizing, eliminating waste, and optimizing pricing models.
Services: Compute Optimizer, Trusted Advisor, Savings Plans, Reserved Instances.
Operate
Implementing policies and automation to enforce decisions and maintain efficiency continuously.
Services: AWS Budgets, Service Control Policies (SCPs), Lambda, EventBridge.
Proactive Controls: AWS Budgets with Automated Actions
**AWS Budgets Actions** let you respond programmatically when a budget threshold is crossed. Instead of only sending an alert, the budget can automatically apply a restrictive IAM policy or SCP to prevent further spending.
HCL & JSON: Deploying a Budget with a Restrictive Action
This Terraform code creates a budget that, upon reaching 100% of its limit, triggers an action to attach the `Deny-EC2-Creation-Budget-Action` policy to a developer role. The referenced IAM roles are assumed to be defined elsewhere in the configuration.
# budgets.tf

# Account ID used by the budget action below
data "aws_caller_identity" "current" {}

# 1. The Budget itself, tracking a specific cost center tag
resource "aws_budgets_budget" "cost_center_123" {
  name         = "CostCenter-123-Budget"
  budget_type  = "COST"
  limit_amount = "1000.0"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Recent AWS provider versions use cost_filter blocks in place of the older cost_filters map
  cost_filter {
    name   = "TagKeyValue"
    values = ["user:CostCenter$123"]
  }
}

# 2. The Budget Action, linked to the budget
resource "aws_budgets_budget_action" "ec2_lockdown" {
  account_id         = data.aws_caller_identity.current.account_id
  budget_name        = aws_budgets_budget.cost_center_123.name
  action_type        = "APPLY_IAM_POLICY"
  approval_model     = "AUTOMATIC"
  execution_role_arn = aws_iam_role.budget_action_role.arn # Must have permissions to attach policies
  notification_type  = "ACTUAL"

  action_threshold {
    action_threshold_type  = "PERCENTAGE"
    action_threshold_value = 100
  }

  definition {
    iam_action_definition {
      policy_arn = aws_iam_policy.deny_ec2_creation.arn
      roles      = [aws_iam_role.developer_role.name]
    }
  }

  # The resource requires at least one subscriber; the SNS topic is assumed to be defined elsewhere
  subscriber {
    subscription_type = "SNS"
    address           = aws_sns_topic.finops_alerts.arn
  }
}

# 3. The restrictive IAM Policy
resource "aws_iam_policy" "deny_ec2_creation" {
  name   = "Deny-EC2-Creation-Budget-Action"
  policy = data.aws_iam_policy_document.deny_ec2_creation_doc.json
}

data "aws_iam_policy_document" "deny_ec2_creation_doc" {
  statement {
    effect = "Deny"
    actions = [
      "ec2:RunInstances",
      "ec2:CreateVolume"
    ]
    resources = ["*"]
  }
}
Detective Controls: Cost Anomaly Detection
A fixed budget can't catch a sudden, unexpected spike in a normally low-cost service. **AWS Cost Anomaly Detection** uses machine learning to find these outliers and alert you immediately.
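To make the gap concrete, here is a minimal sketch comparing a fixed-budget check with a simple per-service baseline check. The z-score test is a stand-in chosen for illustration only; it is not the model AWS Cost Anomaly Detection uses, and the function names and spend figures are hypothetical.

```python
from statistics import mean, stdev


def breaches_budget(month_to_date_total: float, budget_limit: float) -> bool:
    """Fixed-budget check: only the aggregate spend is compared to the limit."""
    return month_to_date_total >= budget_limit


def is_anomalous(history: list, today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z_threshold standard
    deviations above the mean of the service's own history."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Flat history: any increase over the baseline counts as unusual
        return today > baseline
    return (today - baseline) / spread > z_threshold


# Hypothetical daily spend (USD) for a normally low-cost service
daily_history = [2.0, 2.5, 1.8, 2.2, 2.1, 2.4, 1.9]
todays_spend = 40.0

# Hypothetical account-wide month-to-date total, including today's spike
account_total = 600.0 + todays_spend

print("Budget breached:", breaches_budget(account_total, 1000.0))
print("Service anomaly:", is_anomalous(daily_history, todays_spend))
```

In this made-up scenario the account stays under its $1,000 budget, so the budget never fires, while the per-service check flags the jump from roughly $2 to $40 a day.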
HCL: Deploying a Cost Anomaly Monitor and Subscription
This Terraform sets up a monitor for all services and a subscription that sends immediate alerts to SNS for any anomaly with a total impact of at least $50.
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "All-Services-Monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "email_alerts" {
  name             = "HighImpact-Anomaly-Alerts"
  frequency        = "IMMEDIATE" # IMMEDIATE delivery requires an SNS subscriber
  monitor_arn_list = [aws_ce_anomaly_monitor.service_monitor.arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.finops_alerts.arn
  }

  # Alert only when the anomaly's absolute total impact is at least $50.
  # A single condition does not need an enclosing "and" block.
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      match_options = ["GREATER_THAN_OR_EQUAL"]
      values        = ["50"]
    }
  }
}
Automated Optimization & Rightsizing Workflows
The "Optimize" phase involves acting on recommendations. You can automate the process of notifying resource owners about rightsizing opportunities from **AWS Compute Optimizer**.
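Python: Lambda for Rightsizing Notification
This Lambda is triggered by an EventBridge rule that delivers over-provisioned EC2 findings from Compute Optimizer. It parses the finding and publishes a formatted message to SNS. The handler assumes the event `detail` carries `finding`, `resourceArn`, `instanceDetails`, and `recommendations` fields, so confirm these names against the events your rule actually receives.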
import os

import boto3

sns = boto3.client('sns')
SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']


def lambda_handler(event, context):
    detail = event['detail']

    # Ignore anything that is not an over-provisioned finding
    if detail['finding'] != 'Overprovisioned':
        return {"status": "SKIPPED"}

    instance_arn = detail['resourceArn']
    instance_id = instance_arn.split('/')[-1]  # last ARN segment is the instance ID
    current_type = detail['instanceDetails']['instanceType']

    # Treat the first recommendation as the top option
    recommendations = detail['recommendations']
    top_recommendation = recommendations[0]['instanceType']
    estimated_savings = recommendations[0]['estimatedMonthlySavings']['value']

    message = f"""
[ACTION REQUIRED] Rightsizing Recommendation for {instance_id}
- Account: {event['account']}
- Instance ID: {instance_id}
- Current Type: {current_type}
- Recommended Type: {top_recommendation}
- Estimated Monthly Savings: ${estimated_savings}
Please evaluate this recommendation.
"""
    sns.publish(
        TopicArn=SNS_TOPIC_ARN,
        Message=message,
        Subject=f"Rightsizing Alert for {instance_id}",
    )
    return {"status": "SUCCESS"}
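Because the handler mixes parsing with the SNS call, it is awkward to exercise without AWS credentials. One option, sketched below, is to pull the parsing into a pure function and feed it a hand-built event. The `summarize_finding` helper and the sample event are hypothetical; the event only mirrors the fields the handler reads and is not a documented Compute Optimizer payload.

```python
from typing import Optional


def summarize_finding(event: dict) -> Optional[dict]:
    """Extract the fields needed for a rightsizing notice.

    Returns None when the finding is not over-provisioned or carries
    no recommendations, so the caller can skip the notification.
    """
    detail = event["detail"]
    if detail["finding"] != "Overprovisioned":
        return None
    recommendations = detail.get("recommendations") or []
    if not recommendations:
        return None
    top = recommendations[0]
    return {
        "account": event["account"],
        "instance_id": detail["resourceArn"].split("/")[-1],
        "current_type": detail["instanceDetails"]["instanceType"],
        "recommended_type": top["instanceType"],
        "estimated_monthly_savings": top["estimatedMonthlySavings"]["value"],
    }


# Hypothetical event containing only the fields the handler reads
sample_event = {
    "account": "111122223333",
    "detail": {
        "finding": "Overprovisioned",
        "resourceArn": "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",
        "instanceDetails": {"instanceType": "m5.2xlarge"},
        "recommendations": [
            {"instanceType": "m5.xlarge", "estimatedMonthlySavings": {"value": 140.16}}
        ],
    },
}

summary = summarize_finding(sample_event)
print(summary)
```

With this split, the Lambda handler would only need to format the returned dictionary and call `sns.publish`, and the parsing can be unit tested locally.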
Troubleshooting Common FinOps Challenges
Implementing a FinOps practice often reveals complex issues. Here's how to debug common challenges.
Budget Action Fails to Execute
- Symptom: A budget threshold is breached and you get an SNS alert, but the IAM policy or SCP is not applied.
- Root Cause: The role specified in the budget action's `execution_role_arn` lacks the necessary permissions (e.g., `iam:AttachRolePolicy`) OR its trust policy does not allow the Budgets service principal (`budgets.amazonaws.com`) to assume it.
- Solution: Verify the execution role has both the correct permissions to perform the action and a trust policy allowing `budgets.amazonaws.com` to assume it.
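The trust-policy half of that check can be sketched as a small helper that inspects a role's trust policy document. The function name is hypothetical, and the check is deliberately simple: it looks only for an `Allow` statement granting `sts:AssumeRole` to the Budgets service principal and does not evaluate `Condition` blocks or `Deny` statements.

```python
BUDGETS_PRINCIPAL = "budgets.amazonaws.com"


def _as_list(value):
    """IAM JSON allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def trust_allows_budgets(trust_policy: dict) -> bool:
    """Return True if some Allow statement lets the Budgets service
    principal call sts:AssumeRole. Conditions are not evaluated."""
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(statement.get("Action")):
            continue
        principal = statement.get("Principal")
        if not isinstance(principal, dict):
            continue  # e.g. a bare "*" principal; treated as not matching here
        if BUDGETS_PRINCIPAL in _as_list(principal.get("Service")):
            return True
    return False


good_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "budgets.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# A role that only EC2 can assume would cause the budget action to fail
bad_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(trust_allows_budgets(good_policy))
print(trust_allows_budgets(bad_policy))
```

The input would be the role's trust policy document, for example the `AssumeRolePolicyDocument` field returned by IAM's `GetRole` call.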
Finding Owners of Untagged Resources
- Symptom: Your Cost & Usage Report (CUR) shows significant costs from resources with no `Owner` or `Project` tag.
- Root Cause: Lack of tag enforcement.
- Solution: This requires correlating data. Get the `line_item_resource_id` from the CUR. Then, query your centralized CloudTrail logs (e.g., using Amazon Athena) for an event like `RunInstances` where one of the `responseElements.instancesSet.items[].instanceId` values matches your resource ID. A single call can launch several instances, so do not check only the first item. The `userIdentity` object in that CloudTrail event will show the ARN of the principal that created the resource.
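The CloudTrail lookup described above can be sketched as a query builder. It assumes a CloudTrail table created in Athena with the commonly used schema, where `useridentity` is a struct and `responseelements` is a JSON string; the table name `cloudtrail_logs` and the function name are hypothetical.

```python
import re


def build_owner_lookup_query(instance_id: str, table: str = "cloudtrail_logs") -> str:
    """Build an Athena query that finds the RunInstances event which
    created the given EC2 instance and returns the caller's ARN."""
    # Validate inputs before interpolating them into SQL text
    if not re.fullmatch(r"i-[0-9a-f]{8,17}", instance_id):
        raise ValueError(f"not an EC2 instance ID: {instance_id!r}")
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_.]*", table):
        raise ValueError(f"unexpected table name: {table!r}")

    return (
        "SELECT eventtime, useridentity.arn AS creator_arn, awsregion\n"
        f"FROM {table}\n"
        "WHERE eventsource = 'ec2.amazonaws.com'\n"
        "  AND eventname = 'RunInstances'\n"
        # responseelements is a JSON string, so a substring match catches
        # the instance wherever it sits in the instancesSet list
        f"  AND responseelements LIKE '%{instance_id}%'\n"
        "ORDER BY eventtime\n"
        "LIMIT 10"
    )


query = build_owner_lookup_query("i-0123456789abcdef0")
print(query)
```

On a large trail it is worth adding a filter on the table's partition columns (for example region and date) so Athena scans less data; those columns depend on how the table was defined, so they are left out here.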
Expert-Level FinOps Best Practices
- Tag Everything, Enforce with Policy: A successful FinOps practice is built on a foundation of consistent tagging. Use SCPs and AWS Config rules to enforce your tagging strategy.
- Automate Proactive Controls: Don't just rely on alerts. Use AWS Budgets Actions to automatically apply restrictions when actual or forecasted costs cross a budget threshold, preventing major overruns.
- Centralize Your Data: Federate all Cost and Usage Reports (CUR) to a central data lake account. Use Amazon Athena and QuickSight to build unified, organization-wide cost dashboards.
- Close the Loop on Optimization: Create automated workflows (EventBridge + Lambda) to act on recommendations from Compute Optimizer and Trusted Advisor, ensuring that optimization is a continuous process.
- Embed Cost in CI/CD: Use tools like Infracost to show developers the cost impact of their infrastructure changes directly in their pull requests, shifting cost awareness left.
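As one illustration of the tag-enforcement practice in the first bullet, the sketch below builds an SCP document that denies `ec2:RunInstances` when a required tag is missing from the request. The function name is hypothetical and `CostCenter` is used only as an example key; a real policy would usually cover more actions and resource types.

```python
import json


def build_require_tag_scp(tag_key: str) -> dict:
    """Return an SCP that denies launching EC2 instances unless the
    request includes the given tag key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRunInstancesWithoutTag",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # The Null operator is true when the tag key is absent from the request
                "Condition": {"Null": {f"aws:RequestTag/{tag_key}": "true"}},
            }
        ],
    }


scp = build_require_tag_scp("CostCenter")
print(json.dumps(scp, indent=2))
```

The resulting JSON could be attached to an organizational unit as a service control policy. Because an SCP only blocks new untagged launches, an AWS Config rule is still useful for finding resources that already exist without the tag.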