Monday, September 22, 2025

CloudWatch Alarms | Overview & Hands-On.


Amazon CloudWatch Alarms - Overview & Hands-On.

 Scope:

  • Intro,
  • Key Features and Types,
  • Configuration and Management,
  • Link to official documentation,
  • The Concept: CloudWatch Alarms,
  • Alarm States,
  • Architecture & Flow,
  • Key Features,
  • Alarm Actions,
  • Advanced Use Cases,
  • Best Practices,
  • Final thoughts,
  • Hands-On.

Intro:

    • Amazon CloudWatch alarms are a feature of the Amazon CloudWatch monitoring service. An alarm watches a single metric, a metric math expression, or (for composite alarms) the states of other alarms.
    • When the monitored value breaches a specified threshold for a configured number of evaluation periods, the alarm changes state (e.g., from OK to ALARM) and can perform automated actions, such as sending notifications or stopping an EC2 instance.
Key Features and Types
    • Standard Alarms: These are based on a single metric, like CPU utilization or the number of Lambda function errors.
    • Metric Math Expressions: twtech can create alarms based on the results of mathematical expressions involving multiple metrics.
    • Composite Alarms: These evaluate a rule expression over the states of other alarms (their child alarms), allowing for more complex alerting scenarios, such as alarming only when both CPU usage and disk space are high.
    • Anomaly Detection: CloudWatch can use machine learning to set dynamic, data-driven thresholds, alerting twtech when a metric's behavior is unusual rather than just exceeding a static value.
    • Log Alarming: twtech can define metric filters from log data and create alarms based on those filtered metrics.
    • Recommended Alarms: CloudWatch provides out-of-the-box alarm recommendations for various AWS services, helping twtech quickly set up essential monitoring. 
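
The composite alarm idea above can be sketched in a few lines of Python. This is only a local model of the Boolean rule, not a call to AWS: the child alarm names ("cpu-high", "disk-high") and their states are hypothetical, and the function mirrors a rule of the form ALARM("cpu-high") AND ALARM("disk-high").

```python
# Hypothetical child alarm names and states, for illustration only.
child_states = {"cpu-high": "ALARM", "disk-high": "OK"}


def in_alarm(name, states):
    """Return True when the named child alarm is currently in the ALARM state."""
    return states[name] == "ALARM"


def composite_both_high(states):
    """Local model of the rule: ALARM("cpu-high") AND ALARM("disk-high")."""
    return in_alarm("cpu-high", states) and in_alarm("disk-high", states)


# Only one child is in ALARM, so the composite stays quiet.
print(composite_both_high(child_states))
# Both children in ALARM, so the composite would alarm.
print(composite_both_high({"cpu-high": "ALARM", "disk-high": "ALARM"}))
```

Because the composite only fires when both children are in ALARM, a short CPU spike on its own does not page anyone, which is the noise reduction the feature is meant to give.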
Configuration and Management
    • States: Alarms can be in one of three states: OK, ALARM, or INSUFFICIENT_DATA.
    • Actions: When an alarm state changes, it can trigger actions such as:
      • Sending notifications to an Amazon SNS topic.
      • Initiating Auto Scaling actions.
      • Stopping, terminating, rebooting, or recovering an Amazon EC2 instance.
      • Invoking an AWS Lambda function.
      • Sending events to Amazon EventBridge for more complex automations.
    • Missing Data Treatment: twtech can configure how the alarm treats missing data points (as good/not breaching, as bad/breaching, ignore and maintain the current state, or missing) to prevent false alarms or missed alerts on sparse metrics.
    • Management: Alarms can be managed through the CloudWatch console, the AWS CLI, the AWS SDKs, and AWS CloudFormation templates.
Link to official documentation:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

1. The Concept: CloudWatch Alarms

    • CloudWatch Alarms monitor a single CloudWatch metric (or a math expression involving metrics) over a period of time and take action when the value breaches a defined threshold.
    • They are reactive automation triggers tied to metrics.
      • Metrics source: EC2, RDS, Lambda, S3, DynamoDB, Load Balancers, Custom Metrics, etc.
      • Actions on state change: SNS notifications, Auto Scaling, EC2 stop/terminate/reboot, Systems Manager OpsItems, EventBridge integration.

2. Alarm States

    • OK: Metric is within the defined threshold.
    • ALARM: Metric is outside the threshold.
    • INSUFFICIENT_DATA: Not enough data points yet (new alarm, missing data, or insufficient samples).

NB:

twtech can configure how missing data should be treated: as good (notBreaching), as bad (breaching), ignore (keep the current state), or missing (the default).
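
To make the four treatments concrete, here is a simplified Python sketch of how one datapoint could be classified. It is an illustrative model under stated assumptions, not CloudWatch's actual evaluation code: the function name and return labels are hypothetical, and the setting values follow the TreatMissingData names (breaching, notBreaching, ignore, missing).

```python
def classify_datapoint(value, threshold, treat_missing="missing"):
    """Classify one datapoint; value=None stands for a missing datapoint.

    Returns 'breaching', 'not_breaching', or 'skipped' (hypothetical labels).
    """
    if value is not None:
        return "breaching" if value > threshold else "not_breaching"
    if treat_missing == "breaching":
        return "breaching"        # missing data counts as bad
    if treat_missing == "notBreaching":
        return "not_breaching"    # missing data counts as good
    # "ignore" and "missing": the datapoint is left out of the comparison
    return "skipped"


print(classify_datapoint(95, 90))
print(classify_datapoint(None, 90, "notBreaching"))
print(classify_datapoint(None, 90, "ignore"))
```

In this simplified model "ignore" and "missing" look the same for a single datapoint; the difference in CloudWatch shows up at the alarm level, where "ignore" keeps the current state and "missing" can move the alarm to INSUFFICIENT_DATA when every datapoint in the evaluation range is absent.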

3. Architecture & Flow

    1. Metrics are generated (e.g., CPUUtilization, Lambda Errors, API Gateway Latency).
    2. Metrics are stored in CloudWatch Metrics (log data reaches it through metric filters on CloudWatch Logs).
    3. Alarms evaluate metrics periodically (based on the period and the evaluation periods).
    4. The alarm transitions between states: OK, ALARM, INSUFFICIENT_DATA.
    5. The action is executed (via SNS, Auto Scaling, Systems Manager, EventBridge, etc.).
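
The flow above can be modelled as a small state machine. The sketch below is a deliberately simplified assumption (every present datapoint in a window must exceed the threshold to alarm, and an all-missing window means INSUFFICIENT_DATA); the function names and the sample CPU windows are hypothetical. The key point it illustrates is that actions run on state changes, not on every evaluation.

```python
def next_state(datapoints, threshold):
    """Evaluate one window of datapoints; None entries are missing."""
    present = [d for d in datapoints if d is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(d > threshold for d in present) else "OK"


def run_alarm(windows, threshold, on_change):
    """Walk through successive windows and call on_change only on transitions."""
    state = "INSUFFICIENT_DATA"  # a newly created alarm starts here
    for window in windows:
        new_state = next_state(window, threshold)
        if new_state != state:
            on_change(state, new_state)
            state = new_state
    return state


transitions = []
# Hypothetical CPU readings, one list per evaluation window.
windows = [[None], [40, 55], [95, 97], [96, 99], [30, 20]]
final_state = run_alarm(windows, 90, lambda old, new: transitions.append((old, new)))

for old, new in transitions:
    print(f"{old} -> {new}")
print("final:", final_state)
```

Note that the fourth window keeps the alarm in ALARM without recording a second transition, so an action such as "stop the instance" would not be repeated for it.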

4. Key Features

  • Standard & Composite Alarms
    • Standard Alarm: Evaluates a single metric or math expression.
    • Composite Alarm: Combines multiple alarms with Boolean logic (AND, OR, NOT) to reduce alert noise.
  • Anomaly Detection Alarms
    • Uses ML to automatically set dynamic thresholds instead of static ones.
    • Detects deviations from expected metric patterns.
  • Evaluation Periods & Datapoints
    • Example: “Trigger ALARM if CPUUtilization > 80% for 3 out of the last 5 one-minute periods” (datapoints to alarm = 3, evaluation periods = 5).
  • Treating Missing Data
    • Prevents false positives/negatives by defining how missing metrics behave.
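
The "3 out of 5" rule can be checked by hand with a short Python sketch. The function name and the CPU samples are hypothetical, and missing data is not modelled here; the sketch only shows the M-out-of-N counting.

```python
def breaches_m_of_n(values, threshold, datapoints_to_alarm, evaluation_periods):
    """Return True if at least M of the last N values exceed the threshold."""
    recent = values[-evaluation_periods:]
    breaching = sum(1 for v in recent if v > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical one-minute CPUUtilization samples, oldest first.
busy = [70, 85, 60, 91, 82, 88]
calm = [70, 85, 60, 75, 82, 60]

print(breaches_m_of_n(busy, 80, 3, 5))  # last five: 85, 60, 91, 82, 88
print(breaches_m_of_n(calm, 80, 3, 5))  # last five: 85, 60, 75, 82, 60
```

Requiring 3 of 5 rather than 5 of 5 lets the alarm tolerate a single dip below the threshold during an otherwise sustained spike.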

5. Alarm Actions

  • SNS Topic Notifications
    • Email, SMS, Lambda, PagerDuty, Slack, etc.
  • EC2 Actions
    • Stop, terminate, reboot, or recover instances.
  • Auto Scaling Policies
    • Scale out/in based on metrics.
  • EventBridge
    • Trigger workflows, automation runbooks, ticketing systems.
  • Systems Manager OpsCenter
    • Create OpsItems for incident response.
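
EC2 alarm actions are referenced by ARNs of the form arn:aws:automate:<region>:ec2:<action>. The helper below is a small sketch for building them; the function name is hypothetical and the region in the example is an assumption.

```python
EC2_ALARM_ACTIONS = {"stop", "terminate", "reboot", "recover"}


def ec2_action_arn(region, action):
    """Build the ARN that an alarm uses to stop/terminate/reboot/recover an instance."""
    if action not in EC2_ALARM_ACTIONS:
        raise ValueError(f"unsupported EC2 alarm action: {action}")
    return f"arn:aws:automate:{region}:ec2:{action}"


# Region is an assumption for the example.
print(ec2_action_arn("us-east-1", "stop"))
```

The resulting string goes into the alarm's list of alarm actions, next to any SNS topic ARNs.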

6. Advanced Use Cases

  • Dynamic Scaling
    • Scale ECS/Fargate or EC2 fleets automatically using CPU, Memory, or custom app metrics.
  • Proactive Cost Control
    • Alarm on billing metrics (e.g., the EstimatedCharges metric in the AWS/Billing namespace).
  • SLA (Service-Level Agreement) Monitoring
    • Alarm on API latency or error rate.
  • Security Monitoring
    • Alarm on unusual IAM activity (via CloudTrail metrics in CloudWatch).
  • Health & Recovery
    • Auto-recover critical EC2 instances if system status check fails.
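
As one example of these use cases, the cost-control alarm could be described with parameters like the ones below. This is a sketch, not a definitive setup: it only builds the keyword arguments for the CloudWatch PutMetricAlarm API, and the alarm name, threshold, account ID, and SNS topic are hypothetical. The 6-hour period and the Maximum statistic are assumptions that match how billing metrics are typically alarmed on.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Build keyword arguments for PutMetricAlarm on estimated charges."""
    return {
        "AlarmName": "Prod/Billing/EstimatedChargesHigh",  # hypothetical name
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",           # assumption
        "Period": 21600,                  # 6 hours, assumption
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical account ID and topic name.
params = billing_alarm_params(100, "arn:aws:sns:us-east-1:123456789012:billing-alerts")
print(params["Namespace"], params["MetricName"], params["Threshold"])
```

With boto3 installed and credentials configured, these could be passed as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`; billing metrics are published in us-east-1 and require billing alerts to be enabled on the account.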

7. Best Practices

    • Use composite alarms to reduce noise.
    • Leverage anomaly detection instead of static thresholds where workloads are variable.
    • Group alarms into dashboards for observability.
    • Apply naming conventions (Prod/AppName/CPUHighAlarm) for clarity.
    • Use Tags to organize alarms by environment, team, or application.
    • Ensure alarm action permissions (SNS topics, EC2 actions) are properly configured.

Final thoughts:

    • CloudWatch Alarms are the automation backbone of monitoring in AWS.
    • CloudWatch Alarms transform raw metrics into actionable triggers, enabling proactive scaling, recovery, security, and incident response.


Project: Hands-On

  • How twtech creates CloudWatch Alarms and uses them to monitor its resources (logs + metrics), transforming raw metrics into actionable triggers that enable proactive scaling, recovery, security, and incident response.

Search for aws service: CloudWatch

  • Go to the EC2 instance UI to: Create or restart an instance

  • Keep the instance in a running state.

  • Create an alarm to monitor an EC2 instance (docker-trivy): The idea is to stop the instance if CPUUtilization is greater than the set threshold of 90%

  • Specify metric and conditions

  • Select metric: search by instance ID (copy it with Ctrl+C)

  • Paste the instance ID and press Enter on the keyboard to: search

  • Click on: EC2 > Per-Instance Metrics (17)

  • Navigate through the 17 metrics to locate and select: CPUUtilization


  • Configure actions

  • Select Action: EC2 action


Add alarm details: Name and description

  • Alarm name: twtechDockerTrivyAlarm

Preview and create Alarm: twtechDockerTrivyAlarm

  • Conditions: CPUUtilization > 90

  • Configure actions: SNS topic & alarm details

  • Create alarm: twtechDockerTrivyAlarm

  • To see the alarm created: CloudWatch > Alarms > All alarms

  • Details of the Alarm: twtechDockerTrivyAlarm
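
For readers who prefer the SDK to the console, the alarm configured above could be described roughly as follows. This is a sketch under assumptions: the walkthrough does not state the statistic, period, or number of evaluation periods, so Average, 300 seconds, and 1 are guesses, and the instance ID, region, account ID, and function name are hypothetical placeholders. The code only builds the PutMetricAlarm parameters and does not call AWS.

```python
def docker_trivy_alarm_params(instance_id, region, sns_topic_arn):
    """Build PutMetricAlarm keyword arguments for the hands-on alarm."""
    return {
        "AlarmName": "twtechDockerTrivyAlarm",
        "AlarmDescription": "Stop docker-trivy when CPUUtilization exceeds 90%",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",      # assumption, not stated in the walkthrough
        "Period": 300,               # assumption, seconds
        "EvaluationPeriods": 1,      # assumption
        "Threshold": 90.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [
            f"arn:aws:automate:{region}:ec2:stop",  # EC2 action: stop the instance
            sns_topic_arn,                          # notification
        ],
    }


# Hypothetical instance ID, region, and account ID.
alarm = docker_trivy_alarm_params(
    "i-0123456789abcdef0",
    "us-east-1",
    "arn:aws:sns:us-east-1:123456789012:Default_CloudWatch_Alarms_Topic",
)
print(alarm["AlarmName"], alarm["Threshold"], alarm["AlarmActions"][0])
```

With boto3 and credentials available, `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)` would create the same alarm that the console steps produce.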


How twtech uses the CLI command “set-alarm-state” to test the configuration of the alarm created.

  • The idea is to test whether the alarm actions would run if the condition set for the EC2 instance (docker-trivy) were met (CPUUtilization above the 90% threshold).
  • Click on CloudShell to: Make API calls to CloudWatch for the alarm that monitors CPUUtilization on the docker-trivy instance

Google search for the command to use: aws cloudwatch set alarm state

  • Link to official documentation:

https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/set-alarm-state.html

  • Navigate to the section with the command: Copy, configure and run

aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"


# Sample twtech command:

aws cloudwatch set-alarm-state --alarm-name "twtechDockerTrivyAlarm" --state-value ALARM --state-reason "twtech alarm testing purposes"
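
The same test can be scripted. The sketch below wraps the SetAlarmState call behind a small function that accepts any client object exposing `set_alarm_state`, such as `boto3.client("cloudwatch")`. The function name and the `RecordingClient` stand-in are hypothetical; the stand-in exists only so the example runs without AWS credentials.

```python
VALID_STATES = ("OK", "ALARM", "INSUFFICIENT_DATA")


def force_alarm_state(client, alarm_name, state="ALARM", reason="testing purposes"):
    """Temporarily force an alarm into a state to exercise its actions."""
    if state not in VALID_STATES:
        raise ValueError(f"invalid alarm state: {state}")
    client.set_alarm_state(AlarmName=alarm_name, StateValue=state, StateReason=reason)


class RecordingClient:
    """Stand-in for a CloudWatch client; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def set_alarm_state(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
force_alarm_state(client, "twtechDockerTrivyAlarm", reason="twtech alarm testing purposes")
print(client.calls[0])
```

The forced state is temporary: at its next evaluation the alarm returns to whatever state the real metric data supports, so the instance stop and the notification are the lasting evidence that the actions fired.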

  • Go back and refresh the twtechDockerTrivyAlarm page in CloudWatch: Navigate to Ac


  • Actions: Actions enabled

    • Notification (Type): When alarm transitions to "In alarm", send message to topic "Default_CloudWatch_Alarms_Topic". (Config: -)
    • EC2 (Type): When alarm transitions to "In alarm", stop the instance with id "i-02609a928xxx". (Config: -)

  • How twtech checks history of Alarm: History
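
The History tab corresponds to the DescribeAlarmHistory API, whose items carry a HistoryItemType (ConfigurationUpdate, StateUpdate, or Action) and a HistorySummary. The sketch below filters such items locally; the function name is hypothetical and the sample items, including their summary text, are made up for illustration rather than copied from a real alarm.

```python
def summarize_history(items, item_type="StateUpdate"):
    """Return the summaries of history items of one type, in the given order."""
    return [i["HistorySummary"] for i in items if i["HistoryItemType"] == item_type]


# Hypothetical history items shaped like DescribeAlarmHistory results.
history_items = [
    {"HistoryItemType": "ConfigurationUpdate",
     "HistorySummary": 'Alarm "twtechDockerTrivyAlarm" created'},
    {"HistoryItemType": "StateUpdate",
     "HistorySummary": "Alarm updated from OK to ALARM"},
    {"HistoryItemType": "Action",
     "HistorySummary": "Successfully executed action arn:aws:automate:us-east-1:ec2:stop"},
]

print(summarize_history(history_items))
print(summarize_history(history_items, "Action"))
```

With boto3, the items would come from `describe_alarm_history(AlarmName="twtechDockerTrivyAlarm")["AlarmHistoryItems"]`; looking for an Action entry after a StateUpdate is a quick way to confirm the stop action actually ran.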

Finally, twtech needs to go to the EC2 instance UI to verify that the alarm action (stop instance) was applied to docker-trivy when the alarm was triggered.

Yes:

    • An SNS notification was sent through twtech's default topic to: twtech671@gmail.com
    • twtech also successfully created an alarm for its instance (docker-trivy) and configured it to trigger and automatically stop the instance when CPUUtilization exceeded the set threshold.






