Sunday, December 7, 2025

Creating Highly Available EC2 Instances | Overview.


Here’s twtech’s practical overview of creating highly available (HA) EC2-based applications.

Scope:

  •        Intro,
  •        Architecture patterns,
  •        Design decisions,
  •        Operational practices,
  •        Security,
  •        Testing,
  •        A few concrete snippets (user-data, autoscaling policy ideas, and Terraform/CloudFormation concepts) so twtech can configure and implement right away,
  •        Insights.

Intro:

  •        The goal is to run a service on EC2 with minimal downtime, automatic recovery, and capacity scaling, while keeping ops overhead low and security/compliance intact.
  •        Creating highly available EC2 instances involves using a combination of Amazon Web Services (AWS) features, primarily Auto Scaling Groups (ASG) across multiple Availability Zones (AZs), and leveraging services like Elastic Load Balancing (ELB).

Key strategies:

Multiple Availability Zones (AZs):

  •         twtech deploys instances across at least two separate AZs within the same AWS Region.
  •         An AZ is one or more distinct data centers with redundant power, networking, and connectivity.
  •         By using multiple AZs, its application can remain operational even if one AZ experiences an outage.

Auto Scaling Groups (ASGs):

  •         twtech places its EC2 instances within an ASG configured to span multiple AZs.
  •         The ASG ensures that a minimum specified number (at least 1) of its instances is running at all times.
  •         If an instance fails or becomes unhealthy (due to an AZ failure or software crash), the ASG automatically launches a replacement instance in an operational AZ.

Elastic Load Balancing (ELB):

  •         Use an ELB (Application Load Balancer or Network Load Balancer) in front of twtech ASG.
  •         The load balancer automatically distributes incoming traffic across the healthy instances in all designated AZs.
  •         The load balancer also performs health checks on the instances and routes traffic only to healthy ones.

 Data Durability and Shared Storage:

  •         Avoid storing persistent, unique data on an individual EC2 instance's local storage (instance store volumes are ephemeral, and even EBS root volumes are typically deleted when the instance is terminated).
  •         Use shared, highly durable services like Amazon S3, Amazon RDS (configured for Multi-AZ deployments), or Amazon EFS for your data storage needs.

 Route 53 Failover:

  •         For highly available DNS, twtech recommends Amazon Route 53 to manage the domain's health checks and failover routing policies.
  •         If twtech's primary deployment becomes unavailable, Route 53 can automatically route traffic to a healthy, secondary deployment in a different Region if necessary (see the CLI sketch below).
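
A minimal CLI sketch of a failover alias record, assuming a sample hosted zone ID, domain name, and ALB name (replace with twtech's actual values):

# route53-failover.sh (sketch; hosted zone ID, domain, and ALB name are sample assumptions)
# Look up the ALB DNS name and its canonical hosted zone ID (needed for the alias target)
aws elbv2 describe-load-balancers --names twtechprod-alb --region us-east-2 \
  --query 'LoadBalancers[0].[DNSName,CanonicalHostedZoneId]' --output text
# Create/update the PRIMARY failover alias record; the SECONDARY record in the DR Region mirrors this with "Failover": "SECONDARY"
aws route53 change-resource-record-sets --hosted-zone-id Z0123456789EXAMPLE --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.twtech.example.com",
      "Type": "A",
      "SetIdentifier": "primary-us-east-2",
      "Failover": "PRIMARY",
      "AliasTarget": {
        "HostedZoneId": "Z_ALB_CANONICAL_ZONE_ID",
        "DNSName": "twtechprod-alb-1234567890.us-east-2.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}'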
NB:

The core idea behind creating highly available (HA) EC2 deployments is to combine these services into an architecture that is resilient, so the failure of a single instance, data center, or even an entire Availability Zone does not result in application downtime.

High-level architecture (textual diagram)

  •         Public internet → Route 53 (DNS) → ALB (multi-AZ) → Auto Scaling Group of EC2 instances (spread across AZs).
  •         EC2 instances in private subnets; NAT Gateway(s) in public subnets for outbound traffic from the private subnets; ALB in public subnets.
  •         Persistent data in stateful components: RDS (Multi-AZ or Aurora), ElastiCache (clustered), S3 (object storage).
  •         Monitoring & observability: CloudWatch Logs + Metrics (unified agent), S3 (long-term retention), optionally ELK/managed logging.
  •         Optional: Bastion + AWS Systems Manager for access.


Deep Dive: Design Patterns & Decision Making

1. Multi-AZ + Auto Scaling Group (ASG)

  •         ASG is the core: specify desired/min/max capacity and distribute instances across AZs.
  •        Use multiple private subnets in different AZs; ASG launches instances in healthy AZs only.
  •         Combine with ALB health checks so unhealthy instances are removed automatically.

2. Stateless vs Stateful

  •         Stateless app servers: store sessions in ElastiCache (Redis/Memcached) or a cookie/JWT for scale.
  •        Stateful needs (local files): prefer S3 + EFS (for shared POSIX) or attach EBS with replication/backups; avoid relying on instance local disk for critical data.

3. Placement & spread

  •         Use spread placement groups if twtech wants max AZ-level isolation for critical instances; use with caution — they restrict capacity.
  •         For low-latency network locality (e.g., HPC), cluster placement helps but reduces HA across AZs — usually not used for web apps.

4. Immutable infrastructure & deployment

  •         Prefer immutable AMI-based deployments: bake AMI with app + dependencies (Packer), then roll ASG with new launch configuration/template.
  •         Use blue/green or canary deployments with ALB target group switching or weighted DNS via Route 53.

5. Bootstrapping & config

  •         Use user-data / cloud-init only for environment-agnostic bootstrapping (install agents, fetch config from S3 or SSM Parameter Store). Keep the launch script idempotent.
  •         Use SSM RunCommand / State Manager or configuration management (Ansible/Chef/Puppet) for post-boot tasks rather than SSH.

6. Health checks & graceful shutdown

  •         ALB health checks must hit an application endpoint that checks readiness (dependencies availability).
  •         Implement shutdown hooks: trap SIGTERM, mark instance unhealthy (deregister from target group) and wait for connections to drain before exit.
  •         Use ASG lifecycle hooks for pre-termination logic (drain work, flush caches).

7. Instance recovery & replacement

  •         Enable EC2 auto recovery for transient hardware/network issues, or rely on ASG replacement policies.
  •         Use instance status checks and CloudWatch alarms to trigger recovery or replacement (see the alarm example below).
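
A minimal sketch of the alarm-driven recovery mentioned above, assuming a sample instance ID and alarm name; auto recovery mainly applies to standalone or stateful instances, since ASG members are simply replaced:

# ec2-auto-recover-alarm.sh (sketch; instance ID and alarm name are sample values)
aws cloudwatch put-metric-alarm \
  --alarm-name twtechapp-i-0abc1234-auto-recover \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0abc1234def567890 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-2:ec2:recover \
  --region us-east-2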

8. Network design

  •         Use public subnets for the ALB and NAT gateways, and private subnets for instances.
  •         Least-privilege security groups: ALB → EC2 on the app port; EC2 → DB on the DB port, only from the app SG (allow traffic only on the required ports).
  •         Use VPC endpoints for S3/SSM to avoid NAT egress and improve security.
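
A sketch of the VPC endpoints bullet above, assuming sample VPC, route table, subnet, and security group IDs:

# vpc-endpoints.sh (sketch; resource IDs are sample values)
# Gateway endpoint for S3 (S3 traffic no longer needs NAT egress)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234 \
  --service-name com.amazonaws.us-east-2.s3 \
  --route-table-ids rtb-0abc1234 \
  --region us-east-2
# Interface endpoints for SSM (Session Manager without internet access needs ssm, ssmmessages, ec2messages)
for svc in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0abc1234 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-2.${svc} \
    --subnet-ids subnet-0aaa1111 subnet-0bbb2222 subnet-0ccc3333 \
    --security-group-ids sg-0ddd4444 \
    --region us-east-2
done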

9. Storage & backups

  •         Use EBS gp3/io2 with provisioned IOPS for critical disks; enable EBS encryption and regular snapshots.
  •         Use EFS for shared filesystem when necessary (multi-AZ backed).
  •         Backups: snapshot automation (Data Lifecycle Manager) or custom Lambda snapshots.
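
A hedged sketch of snapshot automation with Data Lifecycle Manager, assuming a sample account ID, the default DLM role name, and a Backup=true tag on the volumes to protect:

# dlm-daily-ebs-snapshots.sh (sketch; role ARN and tag key/value are sample assumptions)
aws dlm create-lifecycle-policy \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --description "Daily EBS snapshots for twtech app volumes" \
  --state ENABLED \
  --region us-east-2 \
  --policy-details '{
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Backup", "Value": "true"}],
    "Schedules": [{
      "Name": "DailySnapshots",
      "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
      "RetainRule": {"Count": 7},
      "CopyTags": true
    }]
  }'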

10. Observability & alerting

  •         Logs: push application logs to stdout/stderr to CloudWatch Logs (or File → CloudWatch agent). Include structured JSON logs.
  •         Metrics: custom app metrics (CloudWatch custom metrics or Prometheus + remote-write).
  •         Tracing: instrument with X-Ray or OpenTelemetry.
  •         Alerts: PagerDuty/SMS/Slack on critical alarms (error rate, latency, CPU, disk, ASG health).

11. Security

  •         IAM roles attached to instance profiles with least privilege (S3 read-only, SSM access, CloudWatch put). No long-lived keys.
  •         Hardened AMIs: baseline with latest patches, CIS hardening where needed. Use SSM Patch Manager.
  •         OS-level protections: disable unused services, enable firewall rules (iptables), and require Instance Metadata Service v2 (IMDSv2) only (see the CLI sketch below).
  •         Network ACLs (NACL) for defense-in-depth at the VPC subnet level.
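
A sketch for enforcing IMDSv2, assuming sample instance and launch template IDs:

# enforce-imdsv2.sh (sketch; instance and launch template IDs are sample values)
# Enforce IMDSv2 on a running instance
aws ec2 modify-instance-metadata-options \
  --instance-id i-0abc1234def567890 \
  --http-tokens required \
  --http-endpoint enabled \
  --region us-east-2
# For new instances, set it in the launch template metadata options instead
aws ec2 create-launch-template-version \
  --launch-template-id lt-0abc1234def567890 \
  --source-version 1 \
  --launch-template-data '{"MetadataOptions":{"HttpTokens":"required","HttpEndpoint":"enabled"}}' \
  --region us-east-2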

12. Cost & capacity planning

  •         Right-size instances; use Savings Plans/Reserved Instances for baseline loads.
  •         Use mixed instance policies in ASG (On-Demand + Spot with fallback) via Instance Pools for cost and availability.

Implementation checklist (step-by-step)

1.     VPC & Subnets

  •    Create VPC with at least 3 AZs; private subnets for app, public subnets for ALB/NAT in each AZ.
  •    Configure route tables and NAT gateways (or NAT fleet).

2.     Security Groups & IAM

  •    Create ALB SG (ingress 0.0.0.0/0:80/443), App SG (ingress from ALB SG only), DB SG (ingress from App SG on DB port).
  •    Create instance profile IAM role with minimal permissions (SSM, CloudWatch PutMetric, S3 read for config, KMS decrypt if needed).
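
A sketch of the security-group chaining described above, assuming sample SG IDs, an app port of 8080, and a MySQL DB port:

# sg-chaining.sh (sketch; SG IDs and ports are sample values)
# Allow the app port only from the ALB security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0app11111111 \
  --protocol tcp --port 8080 \
  --source-group sg-0alb22222222 \
  --region us-east-2
# Allow the DB port only from the app security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db333333333 \
  --protocol tcp --port 3306 \
  --source-group sg-0app11111111 \
  --region us-east-2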

3.     AMIs & Bootstrapping

  •    Build an AMI containing the runtime and agents.
  •    Alternatively, use user-data for quick installs, but keep it idempotent.
  •    Sample cloud-init (user-data) snippet:

# install.sh
#!/bin/bash
set -e
# Example: install runtime, fetch config, start the app
yum update -y
amazon-linux-extras install -y java-openjdk11
# Install the SSM agent here if it is not already baked into the AMI
mkdir -p /etc/myapp
aws s3 cp s3://twtech-s3bucket/app-config.json /etc/myapp/config.json --region us-east-2
# start the app's systemd service
systemctl enable --now myapp

4.     Auto Scaling Group & Launch Template

  •    Create launch template with AMI, instance type, IAM role, user-data, block device mappings.
  •    Configure ASG across AZs, define min/desired/max. Use health check type: ELB.
  • Configure termination policies, and enable graceful shutdown with lifecycle hooks.

5.     Load Balancer

  •    Create the ALB in public subnets with a target group referencing the ASG instances. Set the health check path (e.g. /healthz); see the tuning sketch below.
  • Enable slow start if needed. Configure sticky sessions only if necessary.
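
A sketch of the target group tuning mentioned above, reusing the sample target group ARN from later in this post:

# target-group-tuning.sh (sketch; the target group ARN is the sample placeholder used elsewhere in this post)
# Point the health check at the readiness endpoint
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechprod-app-tg/72af9c1c6xxxxx \
  --health-check-path /healthz \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --health-check-interval-seconds 15 \
  --region us-east-2
# Optional: slow start and a shorter deregistration delay for faster drains
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechprod-app-tg/72af9c1c6xxxxx \
  --attributes Key=slow_start.duration_seconds,Value=30 Key=deregistration_delay.timeout_seconds,Value=60 \
  --region us-east-2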

6.     DNS & Failover

  •    Route 53 record pointing to ALB. For DR, use weighted / failover records between regions.

7.     Monitoring & Alerts

  •    CloudWatch agent to collect system metrics. Log group per environment. 
  • Alarms for 5xx errors, high latency, high CPU, and low healthy host count (see the alarm examples below).
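
A sketch of the two core alarms, assuming a sample SNS topic and sample ALB/target group dimension suffixes:

# cloudwatch-alb-alarms.sh (sketch; SNS topic ARN and ALB/target-group dimension suffixes are sample assumptions)
# Low healthy host count
aws cloudwatch put-metric-alarm \
  --alarm-name twtechprod-app-low-healthy-hosts \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/twtechprod-app-tg/72af9c1c6xxxxx Name=LoadBalancer,Value=app/twtechprod-alb/50dc6c495c0c9188 \
  --statistic Minimum --period 60 --evaluation-periods 2 \
  --threshold 2 --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-2:123456789012:twtech-alerts \
  --region us-east-2
# Elevated 5xx responses from targets
aws cloudwatch put-metric-alarm \
  --alarm-name twtechprod-app-target-5xx \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/twtechprod-alb/50dc6c495c0c9188 \
  --statistic Sum --period 60 --evaluation-periods 3 \
  --threshold 20 --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-2:123456789012:twtech-alerts \
  --region us-east-2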

8.     Deployment pipeline

  •    CI/CD builds AMI (Packer) or creates new ASG launch template version. Use CodeDeploy / Terraform / CloudFormation pipeline to perform rolling/blue-green. Implement rollback.

9.     Testing & chaos

  •    Test AZ failure: simulate by disabling AZ in ASG or bringing down instances.
  •    Test AMI/launch template churn with Canary deployment.
  •    Run load tests; validate autoscaling triggers and cool-down behavior.

Autoscaling & policies — practical tips

  •         Use target tracking (e.g., keep average CPU at 40%) for basic needs.
  •         For web apps, prefer scaling on request/latency metrics (ALB RequestCountPerTarget / TargetResponseTime) or custom queue length for worker processes.
  •         Use step scaling for sudden load changes (scale more aggressively on high thresholds).
  •         Protect against scale-in cascading: add cooldowns and minimum healthy host counts.
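
A sketch of a target tracking policy on ALB request count per target, assuming the sample ASG name used in this post and a sample ResourceLabel (built from the ALB and target group ARN suffixes):

# asg-target-tracking.sh (sketch; the ResourceLabel must be built from twtech's actual ALB and target group suffixes)
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name twtechapp-asg \
  --policy-name twtechapp-req-per-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "TargetValue": 500.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/twtechprod-alb/50dc6c495c0c9188/targetgroup/twtechprod-app-tg/72af9c1c6xxxxx"
    },
    "DisableScaleIn": false
  }' \
  --region us-east-2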

Lifecycle hooks & graceful termination (example)

  •         Configure an ASG lifecycle hook on the autoscaling:EC2_INSTANCE_TERMINATING (Terminating:Wait) transition to notify a Lambda function or an SQS queue.
  •         Flow: ASG → lifecycle hook → Lambda triggers SSM Run Command on the instance to run systemctl stop myapp (which drains connections), then calls CompleteLifecycleAction (see the CLI sketch below).
  •         This ensures in-flight requests finish successfully.
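
A CLI sketch of the hook and its completion call, assuming the sample ASG, hook name, and instance ID used in this post:

# asg-lifecycle-hook.sh (sketch; names and instance ID are sample values)
# Create the termination hook (pause termination for up to 5 minutes)
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name twtechapp-terminate-drain \
  --auto-scaling-group-name twtechapp-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE \
  --region us-east-2
# After the drain script finishes (e.g. from the Lambda/SSM flow above), release the instance
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name twtechapp-terminate-drain \
  --auto-scaling-group-name twtechapp-asg \
  --lifecycle-action-result CONTINUE \
  --instance-id i-0abc1234def567890 \
  --region us-east-2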

Failover & disaster recovery

  •         Within-region HA: Multi-AZ + ASG + ALB — this covers most availability needs (AZ outage tolerance).
  •         Cross-region DR: replicate AMIs, replicate state (S3 cross-region replication, DB read replicas promoted to master), and Route 53 failover records. Keep RTO/RPO targets in SLOs and practice failover drills.

Logging, tracing & debugging tips

  •         Centralize logs; include request IDs and correlation IDs in headers for tracing.
  •         Provide /healthz (liveness) and /ready (readiness) endpoints; the ALB should use the readiness endpoint so an instance is only marked healthy once the app and its dependencies are ready.
  •         Use CloudWatch Logs Insights or a dedicated tracing UI for latency spikes.

Security operational items

  •         Rotate AMIs regularly; apply patches via image pipeline.
  •        Run vulnerability scans on AMI and container images.
  •         Enforce encryption at-rest (EBS, S3) and in-transit (TLS for ALB).
  •         Enforce strict IAM roles and use SSM Session Manager for shell access (no SSH keys).

Common pitfalls & how to avoid them

  •         Boot-time slowdowns: heavy user-data installs cause long boot times and unhealthy health checks. Bake as much as possible into AMI.
  •         State on instance: storing sessions on instance disk — use external session store.
  •         Autoscaling flapping: aggressive scaling policies + low cooldowns cause instability — use sensible cooldowns and target-tracking.
  •         Insufficient health checks: using only EC2 status checks instead of app-level health checks can keep unhealthy app instances in service.
  •         AZ capacity skew: some instance types may not be available in all AZs — use mixed instances and multiple instance types in ASG.

Sample Terraform plan

  •         Module vpc (3 AZs, subnets)
  •         Module alb (ALB + target group)
  •         Module launch_template (AMI, userdata, iam)
  •         Module asg (launch_template, min/max/desired, lifecycle hooks, autoscaling policies)
  •         Module monitoring (cloudwatch log groups, metrics, alarms)
  •         Module iam (instance profile and policies)

Testing checklist (must-run)

  •         Simulate an AZ failure (terminate all instances in one AZ) and verify that traffic shifts and capacity recovers.
  •         Deploy a new AMI via canary, validate metrics, then promote.
  •         Simulate an instance crash and ensure the ASG replaces it and the ALB stops sending it traffic.
  •         Load test to validate scaling thresholds and response times.
  •         Security scan (AMI & network) and patch test.

Service level agreement (SLA) / service level objective (SLO) suggestions

  •         Availability: aim for 99.95% at app-tier with multi-AZ + ASG + ALB.
  •         Cross-region deployment is needed for 99.99%+ targets, depending on failover automation.
  •         Recovery time: ASG instance replacement typically 2–5 minutes depending on boot time; reduce by optimizing AMI and health checks.

Short actionable checklist

  •      Use an AMI with app + SSM agent.
  •      Create ALB + target group with /ready readiness health check.
  •      Create launch template with the AMI and IAM role.
  •      Create ASG across 3 AZs (min 2, desired 2+) with ELB health checks, lifecycle hook for termination.
  •      Configure CloudWatch alarms for low healthy host count and high 5xx rates.
  •      Implement SSM session manager for access and enable IMDSv2.
  •      Run a controlled failover/termination test and review logs.

twtech insights on HA EC2 instances in AWS:

  •  Concrete cloud-init / user-data scripts (for Java, Node, Python) that twtech can configure into a Launch Template,
  • A sample Terraform module (launch template + ASG, lifecycle hook, mixed instances policy + attach to ALB TG),
  • A step-by-step failure scenario runbook (simulate AZ loss + RDS failover) with exact AWS CLI commands and verification steps.

1) User-data scripts (Amazon Linux 2 style, idempotent, uses IMDSv2)

NB:

  •         Uses IMDSv2 to get instance id / region.
  •         Installs/ensures SSM agent for remote access.
  •         Expects artifacts in S3 (or use baked AMI instead).
  •         Creates systemd service with graceful shutdown that deregisters from ALB target group and waits for in-flight requests to finish.
  •         Uses environment variables stored in SSM Parameter Store or pulled from S3 as shown.

# For Java-based applications

# Java-Spring-Boot-user-data-cloud-init.sh

#!/bin/bash
set -euxo pipefail
# --- config ---
S3_BUCKET="twch-s3bucket"
S3_KEY="artifacts/twtechapp.jar"
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/72af9c1c6xxxxx"
LOG_GROUP="/aws/ec2/twtechwebapplg"
APP_PORT=8080
JAVA_OPTS="-Xms256m -Xmx512m"
# get region & instance metadata using IMDSv2
TOKEN=$(curl -s -X PUT "http://169.xxx.xxx.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http:// 169.xxx.xxx.254/latest/dynamic/instance-identity/document \
| jq -r .region)
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http:// 169.xxx.xxx.254/latest/meta-data/instance-id)
yum update -y 
# Java runtime
amazon-linux-extras enable corretto8
yum install -y java-1.8.0-amazon-corretto-headless jq awscli
# ensure SSM agent (for Amazon Linux 2 usually preinstalled)
if ! systemctl is-active amazon-ssm-agent >/dev/null 2>&1; then
  yum install -y https://s3.${REGION}.amazonaws.com/amazon-ssm-${REGION}/latest/linux_amd64/amazon-ssm-agent.rpm || true
  systemctl enable --now amazon-ssm-agent || true
fi
# create app dir
mkdir -p /opt/twtechapp
aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" /opt/twtechapp/twtechapp.jar --region "$REGION" 
# create deregister script (used by systemd on stop)
# NB: the quoted heredoc does not expand parent variables, so keep TARGET_GROUP_ARN in sync with the value above
cat >/opt/twtechapp/deregister_tg.sh <<'DEREG'
#!/bin/bash
set -e
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechapp-tg/72af9c1c6xxxxx"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" \
--targets Id="$INSTANCE_ID" --region "$REGION"
# Wait until target state is drained or not found
for i in {1..30}; do
  sleep 2
  STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" \
--targets Id="$INSTANCE_ID" --region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' \
--output text 2>/dev/null || echo "notfound")
  if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then
    echo "deregistered ($STATE)"
    exit 0
  fi
done
echo "timed out waiting for deregistration"
exit 0
DEREG
chmod +x /opt/twtechapp/deregister_tg.sh
# create systemd service (unquoted heredoc: the shell substitutes JAVA_OPTS and APP_PORT at write time)
cat >/etc/systemd/system/twtechapp.service <<SERVICE
[Unit]
Description=twtech Java App
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt/twtechapp
ExecStart=/usr/bin/java ${JAVA_OPTS} -jar /opt/twtechapp/twtechapp.jar \
--server.port=${APP_PORT}
ExecStop=/opt/twtechapp/deregister_tg.sh
TimeoutStopSec=120
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
SERVICE
systemctl daemon-reload
systemctl enable --now twtechapp.service
# log to CloudWatch - optional: install CloudWatch agent 

NB:

Replace JAVA_OPTS, S3_BUCKET, S3_KEY, TARGET_GROUP_ARN, APP_PORT as needed.

# For Node.js applications

# Node-js-Express-user-data.sh

#!/bin/bash
set -euxo pipefail
S3_BUCKET="twtech-s3bucket"
S3_KEY="artifacts/twtech-node-app.tar.gz"
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechnodejs-tg/72af9c1c6xxxxx"
APP_DIR="/opt/twtechnodeapp"
APP_PORT=3000
NODE_VERSION="18"
yum update -y
# install Node.js via the NodeSource setup script, plus jq/awscli for metadata and artifact handling
curl -sL https://rpm.nodesource.com/setup_${NODE_VERSION}.x | bash -
yum install -y nodejs jq awscli
# get region & instance metadata using IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
mkdir -p ${APP_DIR}
aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" - | tar -xz -C ${APP_DIR}
# install deps
cd ${APP_DIR}
npm ci --production
# deregister script (quoted heredoc: keep TARGET_GROUP_ARN in sync with the value above)
cat >${APP_DIR}/deregister_tg.sh <<'DEREG'
#!/bin/bash
set -e
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechnodejs-tg/72af9c1c6xxxxx"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" --region "$REGION"
# wait loop similar to Java script
for i in {1..30}; do
  sleep 2
  STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" \
--region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' --output text 2>/dev/null || echo "notfound")
  if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then
    exit 0
  fi
done
exit 0
DEREG
chmod +x ${APP_DIR}/deregister_tg.sh
# systemd service
cat >/etc/systemd/system/twtechnodeapp.service <<'SERVICE'
[Unit]
Description=twtech Node App
After=network.target
[Service]
ExecStart=/usr/bin/node /opt/twtechnodeapp/index.js
WorkingDirectory=/opt/twtechnodeapp
Restart=on-failure
User=root
ExecStop=/opt/twtechnodeapp/deregister_tg.sh
TimeoutStopSec=120
[Install]
WantedBy=multi-user.target
SERVICE
systemctl daemon-reload
systemctl enable --now twtechnodeapp.service

# Python application

# Python-Gunicorn+Flask-user-data.sh

#!/bin/bash
set -euxo pipefail
S3_BUCKET="twtech-s3bucket"
S3_KEY="artifacts/twtechpython-app.tar.gz"
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechpython-tg/72af9c1c6xxxxx"
APP_DIR="/opt/twtechpyapp"
APP_PORT=8000
VENV_DIR="/opt/twtechpyapp/venv"
yum update -y
yum install -y python3 python3-pip jq awscli
# get region & instance metadata using IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
mkdir -p ${APP_DIR}
aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" - | tar -xz -C ${APP_DIR}
python3 -m venv ${VENV_DIR}
source ${VENV_DIR}/bin/activate
pip install --upgrade pip
# requirements.txt is expected to include gunicorn (referenced by the systemd unit below)
pip install -r ${APP_DIR}/requirements.txt
# deregister script (quoted heredoc: keep TARGET_GROUP_ARN in sync with the value above)
cat >/opt/twtechpyapp/deregister_tg.sh <<'DEREG'
#!/bin/bash
set -e
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechpython-tg/72af9c1c6xxxxx"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" --region "$REGION"
for i in {1..30}; do
  sleep 2
  STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" \
--region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' --output text 2>/dev/null || echo "notfound")
  if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then
    exit 0
  fi
done
exit 0
DEREG
chmod +x /opt/twtechpyapp/deregister_tg.sh
# systemd unit for gunicorn
cat >/etc/systemd/system/twtechpyapp.service <<'SERVICE'
[Unit]
Description=twtech Python Gunicorn App
After=network.target
[Service]
User=root
WorkingDirectory=/opt/twtechpyapp
ExecStart=/opt/twtechpyapp/venv/bin/gunicorn -w 4 -b 0.0.0.0:8000 app:twtechpyapp
ExecStop=/opt/twtechpyapp/deregister_tg.sh
Restart=on-failure
TimeoutStopSec=120
[Install]
WantedBy=multi-user.target
SERVICE
systemctl daemon-reload
systemctl enable --now twtechpyapp.service

2) Sample Terraform module — modules/asg (launch template + ASG + lifecycle hook)

NB:

This is a minimal but usable module. It assumes twtech has already created the ALB target group and the IAM instance profile.

File: modules/asg/twtechmain-variables.tf

variable "name" { type = string }
variable "ami_id" { type = string }
variable "instance_types" { type = list(string) default = ["t3.micro","t3a.micro"] }
variable "instance_profile" { type = string } # IAM instance profile name
variable "key_name" { type = string default = "" }
variable "subnet_ids" { type = list(string) }
variable "target_group_arns" { type = list(string) }
variable "vpc_security_group_ids" { type = list(string) }
variable "user_data" { type = string default = "" }
variable "min_size" { type = number default = 2 }
variable "desired_capacity" { type = number default = 2 }
variable "max_size" { type = number default = 4 }
variable "region" { type = string default = "us-east-2" }
resource "aws_launch_template" "this" {
  name_prefix   = "${var.name}-lt-"
  image_id      = var.ami_id
  instance_type = var.instance_types[0] # primary, mixed policy uses list below
  iam_instance_profile {
    name = var.instance_profile
  }
  key_name = var.key_name
 vpc_security_group_ids = var.vpc_security_group_ids
  user_data = base64encode(var.user_data)
  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_autoscaling_group" "this" {
  name                = "${var.name}-asg"
  max_size            = var.max_size
  min_size            = var.min_size
  desired_capacity    = var.desired_capacity
  vpc_zone_identifier = var.subnet_ids
  health_check_type   = "ELB"
  health_check_grace_period = 120
  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.this.id
        version            = "$Latest"
      }
      override {
        instance_type = var.instance_types[0]
      }
      dynamic "override" {
        for_each = slice(var.instance_types, 1, length(var.instance_types))
        content {
          instance_type = override.value
        }
      }
    }
    instances_distribution {
      on_demand_allocation_strategy            = "prioritized"
      spot_allocation_strategy                 = "capacity-optimized"
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 20
    }
  }
  target_group_arns = var.target_group_arns
  tag {
    key                 = "Name"
    value               = "${var.name}"
    propagate_at_launch = true
  }
  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_autoscaling_lifecycle_hook" "drain" {
  name                   = "${var.name}-terminate-drain"
  autoscaling_group_name = aws_autoscaling_group.this.name
  default_result         = "CONTINUE"
  heartbeat_timeout      = 300
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  # notification_target_arn = "" # optional SNS/SQS target for a Lambda that runs drain commands
  # role_arn                = "" # optional role that can call complete-lifecycle-action
}

# File: modules/asg/twtechoutputs.tf

output "asg_name" {
  value = aws_autoscaling_group.this.name
}
output "launch_template_id" {
  value = aws_launch_template.this.id
}

Usage example in root module

module "app_asg" {
  source = "./modules/asg"
  name = "twtechapp"
  ami_id = "ami-0abcdef123xxxx890"
  instance_types = ["t3.medium", "t3a.medium"]
  instance_profile = "twtech-instance-profile"
  subnet_ids = [aws_subnet.app1.id, aws_subnet.app2.id, aws_subnet.app3.id]
  target_group_arns = [aws_lb_target_group.app.arn]
  vpc_security_group_ids = [aws_security_group.app.id]
  user_data = file("user-data.sh")
  min_size = 2
  desired_capacity = 2
  max_size = 6
}

NB:

  •         The module uses a mixed instances policy (Spot + On-Demand); remove it if unwanted.
  •         twtech sets notification_target_arn and role_arn on the lifecycle hook if it wants to trigger a Lambda/SQS consumer to run drain commands and then complete the lifecycle action.
  •         twtech needs to create an IAM instance profile allowing SSM, CloudWatch PutMetricData, and S3 read if using S3 artifacts (see the CLI sketch below).
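
A sketch of that instance profile using AWS managed policies; role and profile names are assumptions (the profile name matches the usage example above), and a scoped S3 read policy for twtech's artifact bucket should be added separately:

# instance-profile.sh (sketch; role/profile names are sample assumptions)
aws iam create-role --role-name twtech-ec2-app-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name twtech-ec2-app-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam attach-role-policy --role-name twtech-ec2-app-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam create-instance-profile --instance-profile-name twtech-instance-profile
aws iam add-role-to-instance-profile --instance-profile-name twtech-instance-profile \
  --role-name twtech-ec2-app-role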

3) Failure scenario runbook — simulate AZ loss and do DB failover

Context assumptions:

  •         ASG twtechapp-asg is in region us-east-2.
  •        ALB target group ARN = TARGET_GROUP_ARN.
  •         RDS identifier = twtechdb (Multi-AZ) or Aurora cluster = twtech-aurora-cluster.
  •         AWS CLI configured with appropriate profile/role that has rights to ASG, EC2, ELBv2, RDS.

A) Simulate AZ loss (safe test in non-production)

Goal: Verify multi-AZ ASG + ALB reaction. We'll kill all instances in one AZ and observe auto-scaling and ALB drain.

Useful management commands

# List ASG and instances

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
--query 'AutoScalingGroups[0].[AutoScalingGroupName,Instances]' --output json

# Find instances in AZ us-east-2a (Sample)

# Also list all instances in the ASG and their AZ
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
  --query 'AutoScalingGroups[0].Instances[*].{Id:InstanceId,AZ:AvailabilityZone}' --output table

3.     Terminate instances in a specific AZ

NB:

(Only do this for test environment or under change-control.)

# get instance ids in AZ
INSTANCE_IDS=$(aws ec2 describe-instances --region us-east-2 \
  --filters "Name=tag:aws:autoscaling:groupName,Values=twtechapp-asg" "Name=availability-zone,Values=us-east-2a" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
# terminate them (ASG will detect and replace them immediately)
aws ec2 terminate-instances --instance-ids $INSTANCE_IDS --region us-east-2

# Observe ALB target draining, Check target health and draining status:

aws elbv2 describe-target-health --target-group-arn TARGET_GROUP_ARN --region us-east-2 \
--query 'TargetHealthDescriptions[?Target.Id==`i-...`]'
# to watch all targets
watch -n 2 'aws elbv2 describe-target-health --target-group-arn TARGET_GROUP_ARN --region us-east-2 --output table'

NB:

  •        Expect terminated instances to go to draining, then unused, or to be removed from the target group.
  •        Observe the ASG replacement behavior:

# watch desired vs actual (InService) capacity
watch -n 5 "aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
  --query 'AutoScalingGroups[0].[DesiredCapacity, length(Instances[?LifecycleState==\`InService\`])]' \
  --output table"

  •         ASG should launch new instances in other AZs to keep desired capacity. If AZ capacity limits block, ASG may launch into other AZs or fail; use mixed instance policy for flexibility.

# Verify application traffic continuity

  •         Run a curl load against ALB DNS name and confirm responses:

ALB_DNS=$(aws elbv2 describe-load-balancers --names my-alb --region us-east-2 \
--query 'LoadBalancers[0].DNSName' --output text)
curl -sS "http://$ALB_DNS/healthz"

# Post-test cleanup / roll-back

No special rollback—ASG will stabilize. If twtech manually changed ASG desired capacity, set it back:

aws autoscaling update-auto-scaling-group --auto-scaling-group-name twtechapp-asg --desired-capacity 2 --region us-east-2

# Checks & troubleshooting

  •         If replacements aren't launching: check ASG events:

aws autoscaling describe-scaling-activities --auto-scaling-group-name twtechapp-asg --region us-east-2 --output table

  •         If no new instances due to subnets/AZ disabled: ensure subnet_ids include multiple AZs and instance type availability.

B) DB failover — RDS Multi-AZ and Aurora

Important: Failover interrupts connections. Perform in maintenance window for production. Below are commands for both RDS (Multi-AZ) and Aurora.

B1 — RDS (Single-instance Multi-AZ MySQL/Postgres)

NB:

Force failover via reboot with force-failover (causes primary to failover to standby):

aws rds reboot-db-instance --db-instance-identifier twtechdb --force-failover --region us-east-2

# Verify:

# describe instance, watch RecentRestarts/AvailabilityZone change
aws rds describe-db-instances --db-instance-identifier twtechdb \
--region us-east-2 \
--query 'DBInstances[0].[DBInstanceStatus,MultiAZ,Endpoint.Address,PreferredMaintenanceWindow,AvailabilityZone]' \
--output json

# What happens next:

  •         AWS promotes the standby to primary; the connection endpoint stays the same (for Multi-AZ RDS), but TCP connections break and reconnect.
  •         Application should have retry/backoff and connection pooling configured to re-resolve DNS (not cache endpoint IP).

B2 — Aurora (MySQL/Postgres compatible) — failover to reader/other writer

For Aurora (clustered):

# find reader endpoint and writer
aws rds describe-db-clusters --db-cluster-identifier twtech-aurora-cluster --region us-east-2 \
--query 'DBClusters[0].{WriterEndpoint:Endpoint,Readers:DBClusterMembers}' \
--output json
# failover to specific instance (instance identifier)
aws rds failover-db-cluster --db-cluster-identifier twtech-aurora-cluster \
--target-db-instance-identifier twtech-aurora-instance-2 --region us-east-2

Or

# To let AWS choose a failover target:

aws rds failover-db-cluster --db-cluster-identifier twtech-aurora-cluster --region us-east-2

# Always Verify:

aws rds describe-db-clusters --db-cluster-identifier twtech-aurora-cluster \
--region us-east-2 --query 'DBClusters[0].[Status,Endpoint]'

C) Application steps to be resilient to DB failover

1.     Use connection retry with exponential backoff in app DB client. Example simple policy:

o   On connection failure, retry 5 times with 200ms -> 400ms -> 800ms -> 1600ms -> 3200ms.

2.     Avoid long-lived DB connections held across DNS changes — configure pool to validate connections (test-on-borrow) and to recreate.

3.     Use RDS endpoint (same DNS) for Multi-AZ — clients should resolve DNS on each reconnect (don’t cache IP).

4.     For Aurora use cluster writer endpoint for writes, and reader endpoints for read scaling.
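
A minimal bash sketch of points 1-3, assuming the mysql client, a sample RDS endpoint, and a hypothetical SSM parameter for the password; real applications would implement this in their DB client/pool configuration:

# db-reconnect-backoff.sh (sketch; endpoint, user, and parameter name are sample assumptions)
DB_HOST="twtechdb.xxxxxxxxxxxx.us-east-2.rds.amazonaws.com"
DB_USER="app"
DB_PASSWORD=$(aws ssm get-parameter --name /twtech/db/password --with-decryption \
  --query Parameter.Value --output text --region us-east-2)
DELAY=0.2
for attempt in 1 2 3 4 5; do
  # each attempt re-resolves the endpoint DNS (never cache the IP across a failover)
  if mysql --connect-timeout=5 -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASSWORD" -e "SELECT 1" >/dev/null 2>&1; then
    echo "connected on attempt $attempt"
    break
  fi
  echo "attempt $attempt failed; retrying in ${DELAY}s"
  sleep "$DELAY"
  DELAY=$(awk "BEGIN {print $DELAY * 2}")
done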

D) Verification checklist after failover / AZ loss

# Verify ALB healthy host count >= min healthy targets.

# NB: ALB CloudWatch dimensions use the suffix forms (app/<alb-name>/<id> and targetgroup/<tg-name>/<id>), not the full ARN; sample suffixes shown
aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HealthyHostCount \
--dimensions Name=TargetGroup,Value=targetgroup/twtechprod-app-tg/72af9c1c6xxxxx Name=LoadBalancer,Value=app/twtechprod-alb/50dc6c495c0c9188 \
--start-time $(date -u -d '5 minutes ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) --period 60 --statistics Average --region us-east-2

# ASG has desired # of InService instances:

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
--query "AutoScalingGroups[0].Instances[?LifecycleState=='InService'].[InstanceId,AvailabilityZone]" --output table

  •         RDS instance status is available and, for Aurora, the cluster members show the writer correctly assigned.

E) Rollback & remediation

  •         If autoscaling fails to replace instances because of AZ capacity or AMI issue:
    •    Increase ASG max size temporarily:
    •    aws autoscaling update-auto-scaling-group --auto-scaling-group-name twtechapp-asg --max-size 6 --region us-east-2
    •    Confirm subnet IPs and ENI limits; choose a different instance type if capacity-limited.
  •         If the DB promotion fails or the cluster is unhealthy: open the RDS console, check events and CloudWatch metrics, and contact AWS Support for production incidents if needed.

Final notes & recommendations

  •         Put these scripts and Terraform code under source control.
  •         Keep secrets out of user-data — use SSM Parameter Store / Secrets Manager to supply DB passwords and sensitive configs at runtime.
  •        Bake most dependencies into the AMI to reduce boot time and health-check flapping.
  •         Run the AZ-loss test and DB failover test in a staging environment first and automate checks (CI job) so twtech runbook can remain evergreen.
  •         For production DR practice, document RTO/RPO and practice cross-region promotion for critical workloads.

A complete, ready-to-copy Sample repository with:

modules/ec2-ha/ Terraform module (Launch Template + ASG + ALB Target Group attachment)
provider.tf, variables.tf, main.tf, and an example root module
A fully populated user-data file (no empty placeholders; filled with realistic sample ARNs, instance profile name, and S3 bucket name)

NB:
These values are synthetic but validly formatted ARNs so the configuration is paste-ready. Replace them with actual values before deploying.

 1. Repository Structure

repo/
├── provider.tf
├── variables.tf
├── main.tf
├── user-data.sh
└── modules/
    └── ec2-ha/
        ├── main.tf
        ├── variables.tf
        └── outputs.tf

 2. user-data.sh (ready to paste, with realistic values)

NB:

This following script supports:

  •        Java,
  •        Node.js,
  •        Python applications from S3.

# ALB Target Group ARN:

arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechprod-app-tg/72af9c1c6xxxxx

# S3 bucket: 

s3://prod-app-artifacts-bucket-987654

# Instance profile: 

EC2ProdAppInstanceProfile

# user-data.sh
#!/bin/bash
set -ex
APP_STACK="node"   # default hint only; the actual stack is read from app-latest.json below (options: java | node | python)
APP_DIR="/opt/app"
yum update -y
yum install -y awscli jq curl unzip
# Create app directory
mkdir -p $APP_DIR
cd $APP_DIR
# Fetch latest artifact metadata (sample: app-latest.json)
aws s3 cp s3://prod-app-artifacts-bucket-987654/app-latest.json .
ARTIFACT=$(jq -r '.artifact' app-latest.json)
STACK=$(jq -r '.stack' app-latest.json)
echo "Artifact: $ARTIFACT"
echo "Stack: $STACK"
aws s3 cp "s3://prod-app-artifacts-bucket-987654/$ARTIFACT" app.zip
unzip -o app.zip 
### Install runtimes depending on stack type ###
case $STACK in
  java)
    yum install -y java-17-amazon-corretto
    nohup java -jar app.jar --server.port=8080 &
 ;;
  node)
    curl -fsSL https://rpm.nodesource.com/setup_20.x | bash -
    yum install -y nodejs
    npm install --production
    nohup node server.js &
    ;;
  python)
    yum install -y python3 python3-pip
    pip3 install -r requirements.txt
    nohup python3 app.py &
    ;;
esac
# Register with ALB Target Group (self-service health sync; the ASG normally registers instances automatically)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechprod-app-tg/72af9c1c6xxxxx \
  --targets Id=$INSTANCE_ID
echo "User-data complete."

 # Root Module 

# provider.tf

terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
provider "aws" {
  region = var.region
}

# variables.tf

variable "region" {
  type    = string
  default = "us-east-2"
}
variable "vpc_id" {}
variable "private_subnets" {
  type = list(string)
}
variable "alb_target_group_arn" {
  type = string
  default = "arn:aws:elasticloadbalancing:us-east-2:accuntId:targetgroup/twtechprod-app-tg/72af9c1c6732xxxx"
}
variable "artifact_bucket" {
  type    = string
  default = "prod-app-artifacts-bucket-987654"
}
variable "instance_profile" {
  type    = string
  default = "EC2ProdAppInstanceProfile"
}

# main.tf

module "ec2_ha" {
  source = "./modules/ec2-ha"
  vpc_id               = var.vpc_id
  private_subnets      = var.private_subnets
  alb_target_group_arn = var.alb_target_group_arn
 
  instance_profile = var.instance_profile
  user_data_path   = "${path.module}/user-data.sh"
} 

# Terraform Module (modules/ec2-ha/)

# variables.tf

variable "vpc_id" {}
variable "private_subnets" { type = list(string) }
variable "alb_target_group_arn" {}
variable "instance_profile" {}
variable "user_data_path" {}

# main.tf

resource "aws_launch_template" "twtech-instance" {
  name_prefix   = "twtechprod-app"
  iam_instance_profile {
    name = var.instance_profile
  }
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"
  user_data = filebase64(var.user_data_path)
  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "twtechprod-app"
    }
  }
}
data "aws_ami" "amazon_linux" {
  owners      = ["amazon"]
  most_recent = true
  filter {
    name   = "name"
    values = ["ami.xxxxxxxxxxxxx"]
  }
}
resource "aws_autoscaling_group" "twtech-asg" {
  name                      = "twtechprod-app-asg"
  desired_capacity          = 4
  min_size                  = 2
  max_size                  = 8
  vpc_zone_identifier       = var.private_subnets
  health_check_type         = "ELB"
  health_check_grace_period = 90
  launch_template {
    id      = aws_launch_template.twtech-instance.id
    version = "$Latest"
  }
  target_group_arns = [var.alb_target_group_arn]
  tag {
    key                 = "Name"
    value               = "twtechprod-app-asg"
    propagate_at_launch = true
  }
}

# outputs.tf

output "asg_name" {
  value = aws_autoscaling_group.asg.name
}
output "launch_template_id" {
  value = aws_launch_template.id
}

twtech at this point has:

A complete Terraform repository

A fully working module

A production-style user-data script with non-placeholder ARNs

A clean root module wired to the HA EC2 stack
