Sunday, December 7, 2025

Creating Highly Available (HA) EC2 Instances In AWS | Overview.

Scope:

  • Intro,
  • Key strategies,
  • Multiple Availability Zones (AZs),
  • Auto Scaling Groups (ASGs),
  • Elastic Load Balancing (ELB),
  • Data Durability and Shared Storage,
  • Route 53 Failover,
  • High-level architecture (flow),
  • Architecture patterns,
  • Deep Dive: Design Patterns & Decision Making,
  • Operational practices,
  • Security,
  • Testing,
  • Sample snippets (user-data, autoscaling policy ideas, and Terraform/CloudFormation concepts) that can be implemented right away,
  • Insights.

Intro:

    • The goal of HA is to:
      • Run a service on EC2 with minimal downtime,
      • Recover automatically from failures,
      • Scale capacity while keeping ops overhead low and security/compliance intact.
    • Creating highly available EC2 instances involves combining AWS features, primarily:
      • Auto Scaling Groups (ASGs) spanning multiple Availability Zones (AZs),
      • Elastic Load Balancing (ELB) in front of them.

Key strategies:

Multiple Availability Zones (AZs):

    • twtech deploys instances across two or more separate AZs within the same AWS Region.
    • An AZ is one or more distinct data centers with redundant power, networking, and connectivity.
    • By using multiple AZs, the application can remain operational even if one AZ experiences an outage.

Auto Scaling Groups (ASGs):

    • twtech places its EC2 instances within an ASG configured to span multiple AZs.
    • The ASG ensures that a minimum specified number (at least 1) of its instances is running at all times.
    • If an instance fails or becomes unhealthy (due to an AZ failure or software crash), the ASG automatically launches a replacement instance in an operational AZ.

Elastic Load Balancing (ELB):

    • twtech uses an ELB (Application Load Balancer or Network Load Balancer) in front of its ASG.
    • The load balancer automatically distributes incoming traffic across the healthy instances in all designated AZs.
    • The load balancer also performs health checks on the instances and routes traffic only to healthy instances.

 Data Durability and Shared Storage:

    • twtech avoids storing persistent, unique data on individual EC2 instance local storage.
      • Local storage is ephemeral and is lost when the instance is terminated.
    • twtech instead uses shared, highly durable services for its data storage needs:
      • Amazon S3,
      • Amazon RDS (configured for Multi-AZ deployments),
      • Amazon EFS.

 Route 53 Failover:

    • For highly available DNS, twtech recommends Amazon Route 53 to:
      • Manage the domain's health checks,
      • Apply failover routing policies.
    • If twtech's primary deployment becomes unavailable, Route 53 can automatically route traffic to:
      • A healthy, secondary deployment in a different region if necessary.
NB:

  • The core idea behind creating EC2 instances with High Availability (HA) is to combine services in an architecture that makes the deployment resilient, so that the failure of:
    • A single instance,
    • A data center,
    • Or even an entire Availability Zone does not result in application downtime.

High-level architecture (flow)

    • Public internet → Route 53 (DNS) → ALB (multi-AZ) → Auto Scaling Group of EC2s (spread across AZs).
    • EC2s in private subnets; NAT Gateway(s) for public subnet outbound traffic; ALB in public subnets.
    • Stateful components for Persistent data: 
      • RDS (multi-AZ or Aurora), 
      • ElastiCache (clustered), 
      • S3 (object storage).
    • Logs/metrics for Monitoring & Observability: 
      • CloudWatch Logs + Metrics (Unified agent)
      • S3 (long-term), 
      • optionally ELK/managed logging.
    • Optional: 
      • Bastion + AWS Systems Manager for access.


Deep Dive: Design Patterns & Decision Making

1. Multi-AZ + Auto Scaling Group (ASG)

    •  ASG is the core: specify desired/min/max capacity and distribute instances across AZs.
    • Use multiple private subnets in different AZs; ASG launches instances in healthy AZs only.
    • Combine with ALB health checks so unhealthy instances are removed automatically.

2. Stateless vs Stateful

    • Stateless app servers: store sessions in ElastiCache (Redis/Memcached) or a cookie/JWT for scale.
    • Stateful needs (local files): prefer S3 + EFS (for shared POSIX) or attach EBS with replication/backups; avoid relying on instance local disk for critical data.

3. Placement & spread

    • Use spread placement groups if twtech wants max AZ-level isolation for critical instances; use with caution — they restrict capacity.
    • For low-latency network locality (e.g., HPC), cluster placement helps but reduces HA across AZs — usually not used for web apps.

4. Immutable infrastructure & deployment

    •  Prefer immutable AMI-based deployments: bake AMI with app + dependencies (Packer), then roll ASG with new launch configuration/template.
    •  Use blue/green or canary deployments with ALB target group switching or weighted DNS via Route 53.

5. Bootstrapping & config

    • Use user-data / cloud-init only for environment-agnostic bootstrapping (install agents, fetch config from S3 or SSM Parameter Store).
      • Keep launches idempotent.
    • Use SSM RunCommand / State Manager or configuration management (Ansible/Chef/Puppet) for post-boot tasks rather than SSH.
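As a sketch of the bootstrap pattern above, a user-data snippet might pull configuration from SSM Parameter Store like this (the parameter name and destination path are hypothetical):

```shell
#!/bin/bash
set -euo pipefail
# Hypothetical parameter name; adjust to your own naming convention
PARAM_NAME="/twtech/app/config"
# SecureString parameters need --with-decryption (and kms:Decrypt on the instance role)
aws ssm get-parameter --name "$PARAM_NAME" --with-decryption \
  --query 'Parameter.Value' --output text > /etc/myapp/config.json
```

Because the command only reads and writes, re-running it on every boot stays idempotent.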

6. Health checks & graceful shutdown

    •  ALB health checks must hit an application endpoint that checks readiness (dependencies availability).
    •  Implement shutdown hooks: trap SIGTERM, mark instance unhealthy (deregister from target group) and wait for connections to drain before exit.
    •  Use ASG lifecycle hooks for pre-termination logic (drain work, flush caches).
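A minimal local sketch of the shutdown-hook idea: trap SIGTERM, deregister, drain, then exit. The deregistration call is stubbed out here; in production it would be the `aws elbv2 deregister-targets` call shown later in this post.

```shell
#!/bin/bash
# Sketch: graceful-shutdown wrapper. deregister_from_target_group is a stand-in.
drained=0
deregister_from_target_group() {
  # in production: aws elbv2 deregister-targets ... then poll describe-target-health
  echo "deregistering from target group"
}
on_term() {
  deregister_from_target_group
  sleep 0.2          # stand-in for waiting on in-flight requests to finish
  drained=1
}
trap on_term TERM
kill -TERM $$        # simulate the ASG terminating the instance
wait                 # in a real app this would be the main serving loop
echo "drained=$drained"
```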

7. Instance recovery & replacement

    • Enable EC2 Auto Recovery for transient hardware/network issues or ASG replacement policies.
    •  Use instance status checks and CloudWatch alarms to trigger replacement.
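A hedged sketch of wiring EC2 Auto Recovery to a status-check alarm (the instance ID is a placeholder); the `arn:aws:automate:<region>:ec2:recover` action restarts the instance on new hardware:

```shell
# Alarm on system status-check failure, triggering the built-in recover action
aws cloudwatch put-metric-alarm \
  --alarm-name twtechapp-recover \
  --namespace AWS/EC2 --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum --period 60 --evaluation-periods 2 \
  --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-east-2:ec2:recover \
  --region us-east-2
```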

8. Network design

    • Use Public subnets for ALB/NAT & Private subnets for instances.
    • Least-privilege security groups: ALB → EC2 on the app port; EC2 → DB on the DB port, only from the app SG (allow traffic on only the required ports).
    • Use VPC endpoints for S3/SSM to avoid NAT egress and improve security.
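A sketch of creating those endpoints (VPC, subnet, route table, and SG IDs are placeholders): S3 uses a gateway endpoint, SSM an interface endpoint.

```shell
# Gateway endpoint for S3 (attached to the private route table)
aws ec2 create-vpc-endpoint --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-2.s3 \
  --route-table-ids rtb-0abc123 --region us-east-2
# Interface endpoint for SSM (lives in the private subnets)
aws ec2 create-vpc-endpoint --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-2.ssm \
  --subnet-ids subnet-0abc123 subnet-0def456 \
  --security-group-ids sg-0abc123 --region us-east-2
```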

9. Storage & backups

    • Use EBS gp3/io2 with provisioned IOPS for critical disks; enable EBS encryption and regular snapshots.
    • Use EFS for shared filesystem when necessary (multi-AZ backed).
    • Backups: snapshot automation (Data Lifecycle Manager) or custom Lambda snapshots.
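A hedged sketch of the Data Lifecycle Manager option (the role ARN, tag key, and schedule values are assumptions): snapshot every volume tagged `Backup=true` every 12 hours and keep the last 14 snapshots.

```shell
aws dlm create-lifecycle-policy \
  --execution-role-arn arn:aws:iam::accountId:role/AWSDataLifecycleManagerDefaultRole \
  --description "twtech EBS snapshots" --state ENABLED \
  --policy-details '{
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Backup", "Value": "true"}],
    "Schedules": [{
      "Name": "every-12h",
      "CreateRule": {"Interval": 12, "IntervalUnit": "HOURS", "Times": ["03:00"]},
      "RetainRule": {"Count": 14}
    }]
  }' --region us-east-2
```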

10. Observability & alerting

    • Logs: write application logs to stdout/stderr and ship them to CloudWatch Logs (via the CloudWatch agent). Include structured JSON logs.
    • Metrics: custom app metrics (CloudWatch custom metrics or Prometheus + remote-write).
    • Tracing: instrument with X-Ray or OpenTelemetry.
    • Alerts: PagerDuty/SMS/Slack on critical alarms (error rate, latency, CPU, disk, ASG health).

11. Security

    • IAM roles attached to instance profiles with least privilege (S3 read-only, SSM access, CloudWatch put)
      • No long-lived keys.
    • Hardened AMIs: baseline with latest patches, CIS hardening where needed. 
      • Use SSM Patch Manager.
    • OS-level protections: disable unused services, enable firewall rules (iptables), Instance Metadata Service v2 (IMDSv2) only.
    • Network ACLs (NACL) for defense-in-depth at the VPC subnet level.

12. Cost & capacity planning

    • Right-size instances; use Savings Plans/Reserved Instances for baseline loads.
    • Use mixed instance policies in ASG (On-Demand + Spot with fallback) via Instance Pools for cost and availability.

Implementation checklist (step-by-step)

A.     VPC & Subnets

    • Create VPC with at least 3 AZs; 
      • private subnets for app, 
      • public subnets for ALB/NAT in each AZ.
    • Configure route tables and NAT gateways (or NAT fleet).

B.     Security Groups & IAM

    • Create:
      • ALB SG (ingress 0.0.0.0/0:80/443), 
      • App SG (ingress from ALB SG only), 
      • DB SG (ingress from App SG on DB port).
    • Create instance profile IAM role with minimal permissions for:
      • SSM, 
      • CloudWatch PutMetric, 
      • S3 read for config, 
      • KMS decrypt if needed.
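A sketch of the SG chain above (VPC ID and ports are placeholders): the ALB is open to the internet, the app SG admits only the ALB SG, and the DB SG admits only the app SG.

```shell
# Create the three security groups
ALB_SG=$(aws ec2 create-security-group --group-name twtech-alb-sg \
  --description "ALB" --vpc-id vpc-0abc123 --query GroupId --output text)
APP_SG=$(aws ec2 create-security-group --group-name twtech-app-sg \
  --description "App" --vpc-id vpc-0abc123 --query GroupId --output text)
DB_SG=$(aws ec2 create-security-group --group-name twtech-db-sg \
  --description "DB" --vpc-id vpc-0abc123 --query GroupId --output text)
# Chain the ingress rules: internet -> ALB -> app -> DB
aws ec2 authorize-security-group-ingress --group-id "$ALB_SG" \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$APP_SG" \
  --protocol tcp --port 8080 --source-group "$ALB_SG"
aws ec2 authorize-security-group-ingress --group-id "$DB_SG" \
  --protocol tcp --port 3306 --source-group "$APP_SG"
```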

C.     AMIs & Bootstrapping

    • Build an AMI containing runtime and agents. 
    • Alternatively, use user-data for quick installs, but keep it idempotent.
    • Sample cloud-init (user-data) snippet:

# install.sh
#!/bin/bash
set -e
# Sample: install runtime and agents, fetch config, start app
yum update -y
amazon-linux-extras install -y java-openjdk11
# Install SSM agent here (if not baked into the AMI)
# Fetch app config (example)
aws s3 cp s3://twtech-s3bucket/app-config.json /etc/myapp/config.json \
--region us-east-2
# start app systemd service
systemctl enable --now myapp

D. Auto Scaling Group & Launch Template

    •  Create launch template with AMI, instance type, IAM role, user-data, block device mappings.
    •    Configure ASG across AZs, define min/desired/max. Use health check type: ELB.
    • Configure termination policies and graceful shutdown with lifecycle hooks.
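The steps above can be sketched with the CLI (template file, subnet IDs, and target group ARN are placeholders):

```shell
# Launch template from a prepared JSON body
aws ec2 create-launch-template --launch-template-name twtechapp-lt \
  --launch-template-data file://launch-template.json --region us-east-2
# ASG spanning three AZ subnets, with ELB health checks
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name twtechapp-asg \
  --launch-template LaunchTemplateName=twtechapp-lt,Version='$Latest' \
  --min-size 2 --desired-capacity 2 --max-size 6 \
  --vpc-zone-identifier "subnet-0a,subnet-0b,subnet-0c" \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechapp-tg/72af9c1c6xxxxx \
  --health-check-type ELB --health-check-grace-period 120 \
  --region us-east-2
```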

E. Load Balancer

    • Create ALB in public subnets with target groups referencing the ASG instances. Set the health check path to /healthz.
    • Enable slow start if needed. Configure sticky sessions only if necessary.
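A sketch of those steps (VPC, subnet, and SG IDs are placeholders): create the target group with the /healthz check, then the ALB and a listener forwarding to it.

```shell
TG_ARN=$(aws elbv2 create-target-group --name twtechapp-tg \
  --protocol HTTP --port 8080 --vpc-id vpc-0abc123 \
  --health-check-path /healthz --health-check-interval-seconds 15 \
  --healthy-threshold-count 2 --unhealthy-threshold-count 3 \
  --query 'TargetGroups[0].TargetGroupArn' --output text)
ALB_ARN=$(aws elbv2 create-load-balancer --name twtech-alb \
  --subnets subnet-pub-a subnet-pub-b subnet-pub-c \
  --security-groups sg-0abc123 \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)
aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
  --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn="$TG_ARN"
```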

F. DNS & Failover

    •  Route 53 record pointing to ALB. For DR, use weighted / failover records between regions.
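A hedged sketch of the failover pair's PRIMARY record (hosted zone ID, health check ID, domain, and ALB values are placeholders; a matching SECONDARY record in the DR region completes the pair):

```shell
aws route53 change-resource-record-sets --hosted-zone-id Z0HOSTEDZONE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
        "AliasTarget": {
          "HostedZoneId": "ALB_HOSTED_ZONE_ID",
          "DNSName": "twtech-alb-123456.us-east-2.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        }
      }
    }]
  }'
```

Note that `AliasTarget.HostedZoneId` is the load balancer's own hosted zone ID, not the domain's.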

G. Monitoring & Alerts

    • CloudWatch agent to collect system metrics. Log group per environment. 
    • Alarms for 5xx errors, high latency, high CPU, low healthy host count.
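As a sketch of the 5xx alarm (the SNS topic ARN and dimension suffixes are placeholders; ALB metric dimensions use ARN suffixes, not full ARNs):

```shell
aws cloudwatch put-metric-alarm \
  --alarm-name twtechapp-5xx \
  --namespace AWS/ApplicationELB --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/twtech-alb/50dc6c495c0c9188 \
               Name=TargetGroup,Value=targetgroup/twtechapp-tg/72af9c1c6xxxxx \
  --statistic Sum --period 60 --evaluation-periods 3 \
  --threshold 10 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-2:accountId:twtech-alerts \
  --region us-east-2
```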

H.  Deployment pipeline

    •  CI/CD builds AMI (Packer) or creates new ASG launch template version. Use CodeDeploy / Terraform / CloudFormation pipeline to perform rolling/blue-green. Implement rollback.

I.  Testing & chaos

    • Test AZ failure: simulate by disabling AZ in ASG or bringing down instances.
    • Test AMI/launch template churn with Canary deployment.
    • Run load tests; validate autoscaling triggers and cool-down behavior.

Autoscaling & policies — practical tips

    •  Use target tracking (e.g., keep average CPU at 40%) for basic needs.
    •  For web apps, prefer scaling on request/latency metrics (ALB RequestCountPerTarget / TargetResponseTime) or custom queue length for worker processes.
    •  Use step scaling for sudden load changes (scale more aggressively on high thresholds).
    •  Protect against scale-in cascading: add cooldowns and minimum healthy host counts.
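A sketch of a target-tracking policy on ALB requests per target (the resource label is a placeholder of the form app/&lt;alb-name&gt;/&lt;id&gt;/targetgroup/&lt;tg-name&gt;/&lt;id&gt;; the target value of 500 requests/target is an assumption to tune per workload):

```shell
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name twtechapp-asg \
  --policy-name twtechapp-req-per-target \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/twtech-alb/50dc6c495c0c9188/targetgroup/twtechapp-tg/72af9c1c6xxxxx"
    },
    "TargetValue": 500.0
  }' --region us-east-2
```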

Lifecycle hooks & graceful termination (example)

    •  Configure ASG lifecycle hook Terminating: Wait to call a Lambda or SQS.
    •  Flow: ASG -> Lifecycle hook -> Lambda triggers SSM Run Command on instance to systemctl stop myapp which drains, then CompleteLifecycleAction.
    •  This ensures in-flight requests finish successfully.
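The hook and its completion can be sketched with the CLI (the instance ID is a placeholder; the completion call would normally come from the Lambda after draining finishes):

```shell
# Pause termination for up to 300s so drain logic can run
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name twtechapp-terminate-drain \
  --auto-scaling-group-name twtechapp-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 --default-result CONTINUE \
  --region us-east-2
# After draining, release the instance for termination
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name twtechapp-terminate-drain \
  --auto-scaling-group-name twtechapp-asg \
  --lifecycle-action-result CONTINUE \
  --instance-id i-0123456789abcdef0 \
  --region us-east-2
```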

Failover & disaster recovery

    •  Within-region HA: Multi-AZ + ASG + ALB — this covers most availability needs (AZ outage tolerance).
    •  Cross-region DR: replicate AMIs, replicate state (S3 cross-region replication, DB read replicas promoted to master), and Route 53 failover records. Keep RTO/RPO targets in SLOs and practice failover drills.

Logging, tracing & debugging tips

    •  Centralize logs; include request IDs and correlation IDs in headers for tracing.
    •  Provide /healthz (liveness) and /ready (readiness) endpoints: ALB should use readiness so an instance is only marked healthy once the app and its dependencies are ready.
    •  Use CloudWatch Logs Insights or a dedicated tracing UI for latency spikes.

Security operational items

    • Rotate AMIs regularly; apply patches via image pipeline.
    • Run vulnerability scans on AMI and container images.
    • Enforce encryption at-rest (EBS, S3) and in-transit (TLS for ALB).
    • Enforce strict IAM roles and use SSM Session Manager for shell access (no SSH keys).

Common pitfalls & how to avoid them

    • Boot-time slowdowns: heavy user-data installs cause long boot times and failed health checks.
      • Bake as much as possible into the AMI.
    • State on instance: storing sessions on instance disk.
      • Use an external session store.
    • Autoscaling flapping: aggressive scaling policies + low cooldowns cause instability.
      • Use sensible cooldowns and target tracking.
    • Insufficient health checks: using only EC2 status checks instead of app-level health checks can keep unhealthy app instances in service.
    • AZ capacity skew: some instance types may not be available in all AZs.
      • Use mixed instances and multiple instance types in the ASG.

Sample Terraform plan

    •  Module vpc (3 AZs, subnets)
    •  Module alb (ALB + target group)
    •  Module launch_template (AMI, userdata, iam)
    •  Module asg (launch_template, min/max/desired, lifecycle hooks, autoscaling policies)
    •  Module monitoring (cloudwatch log groups, metrics, alarms)
    •  Module iam (instance profile and policies)

Testing checklist (must-run)

    • Simulate AZ failure (terminate all instances in one AZ); verify traffic shifts and capacity remains.
    • Deploy a new AMI via canary; validate metrics, then promote.
    • Simulate instance crash and ensure ASG replaces it and ALB stops sending traffic.
    •  Load test to validate scaling thresholds and response times.
    • Security scan (AMI & network) and patch test.

Service level agreement (SLA) / service level objective (SLO) suggestions

    • Availability: aim for 99.95% at app-tier with multi-AZ + ASG + ALB.
    • Cross-region needed for 99.99%+ depending on failover automation.
    • Recovery time: ASG instance replacement typically 2–5 minutes depending on boot time; reduce by optimizing AMI and health checks.

Short actionable checklist

    • Use an AMI with app + SSM agent.
    • Create ALB + target group with /ready readiness health check.
    • Create launch template with the AMI and IAM role.
    • Create ASG across 3 AZs (min 2, desired 2+) with ELB health checks, lifecycle hook for termination.
    • Configure CloudWatch alarms for low healthy host count and high 5xx rates.
    • Implement SSM session manager for access and enable IMDSv2.
    • Run a controlled failover/termination test and review logs.

twtech insights on HA EC2 instances in AWS:

    •  Concrete cloud-init / user-data scripts (for Java, Node, Python) that twtech can configure into a Launch Template,
    • A sample Terraform module (launch template + ASG, lifecycle hook, mixed instances policy + attach to ALB TG),
    • A step-by-step failure scenario runbook (simulate AZ loss + RDS failover) with exact AWS CLI commands and verification steps.

1) User-data scripts (Amazon Linux 2 style, idempotent, uses IMDSv2)

NB:

    • Uses IMDSv2 to get instance id / region.
    • Installs/ensures SSM agent for remote access.
    • Expects artifacts in S3 (or use baked AMI instead).
    • Creates systemd service with graceful shutdown that deregisters from ALB target group and waits for in-flight requests to finish.
    • Uses environment variables stored in SSM Parameter Store or pulled from S3 as shown.

# For Java-based applications

# Java-Spring-Boot-user-data-cloud-init.sh

#!/bin/bash
set -euxo pipefail
# --- config ---
S3_BUCKET="twtech-s3bucket"
S3_KEY="artifacts/twtechapp.jar"
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/72af9c1c6xxxxx"
LOG_GROUP="/aws/ec2/twtechwebapplg"
APP_PORT=8080
JAVA_OPTS="-Xms256m -Xmx512m"
# get region & instance metadata using IMDSv2
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
yum update -y 
# Java runtime
amazon-linux-extras enable corretto8
yum install -y java-1.8.0-amazon-corretto-headless jq awscli
# ensure SSM agent (for Amazon Linux 2 usually preinstalled)
if ! systemctl is-active amazon-ssm-agent >/dev/null 2>&1; then
  yum install -y https://s3.${REGION}.amazonaws.com/amazon-ssm-${REGION}/latest/linux_amd64/amazon-ssm-agent.rpm || true
  systemctl enable --now amazon-ssm-agent || true
fi
# create app dir
mkdir -p /opt/twtechapp
aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" /opt/twtechapp/twtechapp.jar --region "$REGION" 
# create deregister script (used by systemd on stop)
cat >/opt/twtechapp/deregister_tg.sh <<'DEREG'
#!/bin/bash
set -e
# Heredoc is quoted, so define the ARN here (outer-script variables do not expand)
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/72af9c1c6xxxxx"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" \
--targets Id="$INSTANCE_ID" --region "$REGION"
# Wait until target state is drained or not found
for i in {1..30}; do
  sleep 2
  STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" \
--targets Id="$INSTANCE_ID" --region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' \
--output text 2>/dev/null || echo "notfound")
  if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then
    echo "deregistered ($STATE)"
    exit 0
  fi
done
echo "timed out waiting for deregistration"
exit 0
DEREG
chmod +x /opt/twtechapp/deregister_tg.sh
# create systemd service
# unquoted heredoc so ${JAVA_OPTS} and ${APP_PORT} expand when the unit is written
cat >/etc/systemd/system/twtechapp.service <<SERVICE
[Unit]
Description=twtech Java App
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt/twtechapp
ExecStart=/usr/bin/java ${JAVA_OPTS} -jar /opt/twtechapp/twtechapp.jar \
--server.port=${APP_PORT}
ExecStop=/opt/twtechapp/deregister_tg.sh
TimeoutStopSec=120
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
SERVICE
systemctl daemon-reload
systemctl enable --now twtechapp.service
# log to CloudWatch - optional: install CloudWatch agent 

NB:

Replace JAVA_OPTS, S3_BUCKET, S3_KEY, TARGET_GROUP_ARN, APP_PORT as needed.

# For Node.js applications

# Node-js-Express-user-data.sh

#!/bin/bash
set -euxo pipefail
S3_BUCKET="twtech-s3bucket"
S3_KEY="artifacts/twtech-node-app.tar.gz"
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechnodejs-tg/72af9c1c6xxxxx"
APP_DIR="/opt/twtechnodeapp"
APP_PORT=3000
NODE_VERSION="18"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
yum update -y
# install node via nvmless method (nodesource)
curl -sL https://rpm.nodesource.com/setup_${NODE_VERSION}.x | bash -
yum install -y nodejs jq awscli
mkdir -p ${APP_DIR}
aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" - | tar -xz -C ${APP_DIR}
# install deps
cd ${APP_DIR}
npm ci --production
# deregister script
cat >/opt/twtechnodeapp/deregister_tg.sh <<'DEREG'
#!/bin/bash
set -e
# Heredoc is quoted, so define the ARN here (outer-script variables do not expand)
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechnodejs-tg/72af9c1c6xxxxx"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" --region "$REGION"
# wait loop similar to Java script
for i in {1..30}; do
  sleep 2
  STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" \
--region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' --output text 2>/dev/null || echo "notfound")
  if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then
    exit 0
  fi
done
exit 0
DEREG
chmod +x /opt/twtechnodeapp/deregister_tg.sh
# systemd service
cat >/etc/systemd/system/twtechnodeapp.service <<'SERVICE'
[Unit]
Description=twtech Node App
After=network.target
[Service]
ExecStart=/usr/bin/node /opt/twtechnodeapp/index.js
WorkingDirectory=/opt/twtechnodeapp
Restart=on-failure
User=root
ExecStop=/opt/twtechnodeapp/deregister_tg.sh
TimeoutStopSec=120
[Install]
WantedBy=multi-user.target
SERVICE
systemctl daemon-reload
systemctl enable --now twtechnodeapp.service

# Python application

# Python-Gunicorn+Flask-user-data.sh

#!/bin/bash
set -euxo pipefail
S3_BUCKET="twtech-s3bucket"
S3_KEY="artifacts/twtechpython-app.tar.gz"
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechpython-tg/72af9c1c6xxxxx"
APP_DIR="/opt/twtechpyapp"
APP_PORT=8000
VENV_DIR="/opt/twtechpyapp/venv"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
yum update -y
yum install -y python3 python3-pip jq awscli  # venv ships with python3 on Amazon Linux 2
mkdir -p ${APP_DIR}
aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" - | tar -xz -C ${APP_DIR}
python3 -m venv ${VENV_DIR}
source ${VENV_DIR}/bin/activate
pip install --upgrade pip
pip install -r ${APP_DIR}/requirements.txt  # requirements.txt should include gunicorn
# deregister script
cat >/opt/twtechpyapp/deregister_tg.sh <<'DEREG'
#!/bin/bash
set -e
# Heredoc is quoted, so define the ARN here (outer-script variables do not expand)
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechpython-tg/72af9c1c6xxxxx"
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document \
| jq -r .region)
aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" --region "$REGION"
for i in {1..30}; do
  sleep 2
  STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" \
--region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' --output text 2>/dev/null || echo "notfound")
  if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then
    exit 0
  fi
done
exit 0
DEREG
chmod +x /opt/twtechpyapp/deregister_tg.sh
# systemd unit for gunicorn
cat >/etc/systemd/system/twtechpyapp.service <<'SERVICE'
[Unit]
Description=twtech Python Gunicorn App
After=network.target
[Service]
User=root
WorkingDirectory=/opt/twtechpyapp
ExecStart=/opt/twtechpyapp/venv/bin/gunicorn -w 4 -b 0.0.0.0:8000 app:twtechpyapp
ExecStop=/opt/twtechpyapp/deregister_tg.sh
Restart=on-failure
TimeoutStopSec=120
[Install]
WantedBy=multi-user.target
SERVICE
systemctl daemon-reload
systemctl enable --now twtechpyapp.service

2) Sample Terraform module — modules/asg (launch template + ASG + lifecycle hook)

NB:

  • This is a minimal but usable module. 
  • It assumes twtech already created ALB target group and IAM instance profile.

# File: modules/asg/twtechmain-variables.tf

variable "name" { type = string }
variable "ami_id" { type = string }
variable "instance_types" { default = ["t3.micro", "t3a.micro"] }
variable "instance_profile" { type = string } # IAM instance profile name
variable "key_name" { default = "" }
variable "subnet_ids" { type = list(string) }
variable "target_group_arns" { type = list(string) }
variable "vpc_security_group_ids" { type = list(string) }
variable "user_data" { default = "" }
variable "min_size" { default = 2 }
variable "desired_capacity" { default = 2 }
variable "max_size" { default = 4 }
variable "region" { default = "us-east-2" }
resource "aws_launch_template" "this" {
  name_prefix   = "${var.name}-lt-"
  image_id      = var.ami_id
  instance_type = var.instance_types[0] # primary, mixed policy uses list below
  iam_instance_profile {
    name = var.instance_profile
  }
  key_name = var.key_name
  vpc_security_group_ids = var.vpc_security_group_ids
  user_data = base64encode(var.user_data)
  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_autoscaling_group" "this" {
  name                = "${var.name}-asg"
  max_size            = var.max_size
  min_size            = var.min_size
  desired_capacity    = var.desired_capacity
  vpc_zone_identifier = var.subnet_ids
  health_check_type   = "ELB"
  health_check_grace_period = 120
  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.this.id
        version            = "$Latest"
      }
      override {
        instance_type = var.instance_types[0]
      }
      dynamic "override" {
        for_each = slice(var.instance_types, 1, length(var.instance_types))
        content {
          instance_type = override.value
        }
      }
    }
    instances_distribution {
      on_demand_allocation_strategy            = "prioritized"
      spot_allocation_strategy                 = "capacity-optimized"
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 20
    }
  }
  target_group_arns = var.target_group_arns
  tag {
    key                 = "Name"
    value               = "${var.name}"
    propagate_at_launch = true
  }
  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_autoscaling_lifecycle_hook" "drain" {
  name                   = "${var.name}-terminate-drain"
  autoscaling_group_name = aws_autoscaling_group.this.name
  default_result         = "CONTINUE"
  heartbeat_timeout      = 300
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  # notification_target_arn = "..." # optional SNS/SQS for lambda processing
  # role_arn                = "..." # optional role that can call complete-lifecycle-action
}

# File: modules/asg/twtechoutputs.tf

output "asg_name" {
  value = aws_autoscaling_group.this.name
}
output "launch_template_id" {
  value = aws_launch_template.this.id
}

Usage example in root module

module "app_asg" {
  source = "./modules/asg"
  name = "twtechapp"
  ami_id = "ami-0abcdef123xxxx890"
  instance_types = ["t3.medium", "t3a.medium"]
  instance_profile = "twtech-instance-profile"
  subnet_ids = [aws_subnet.app1.id, aws_subnet.app2.id, aws_subnet.app3.id]
  target_group_arns = [aws_lb_target_group.app.arn]
  vpc_security_group_ids = [aws_security_group.app.id]
  user_data = file("user-data.sh")
  min_size = 2
  desired_capacity = 2
  max_size = 6
}

NB:

    • Each module uses a mixed instances policy (Spot + On-Demand). Remove if unwanted.
    • twtech Sets notification_target_arn and role_arn on the lifecycle hook if it wants to trigger a Lambda/SQS to run drain commands, then complete lifecycle.
    • twtech needs to create an IAM instance profile allowing SSM, CloudWatch PutMetric, and S3 read if using S3 artifacts.

3) Failure scenario runbook — simulate AZ loss and do DB failover

Context assumptions:

    • ASG twtechapp-asg is in region us-east-2.
    • ALB target group ARN = TARGET_GROUP_ARN.
    • RDS identifier = twtechdb (Multi-AZ) or Aurora cluster = twtech-aurora-cluster.
    • AWS CLI configured with appropriate profile/role that has rights to ASG, EC2, ELBv2, RDS.

A) Simulate AZ loss (safe test in non-production)

  • Goal: 
    • Verify multi-AZ ASG + ALB reaction. We'll kill all instances in one AZ and observe auto-scaling and ALB drain.

Useful management commands

# List ASG and instances

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
--query 'AutoScalingGroups[0].[AutoScalingGroupName,Instances]' --output json

# Find instances in AZ us-east-2a (Sample)

# Also list all instances in the ASG and their AZ
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
  --query 'AutoScalingGroups[0].Instances[*].{Id:InstanceId,AZ:AvailabilityZone}' --output table

# Terminate instances in a specific AZ

# NB:

  • twtech should only do this in a test environment or under change-control.

# get instance ids in AZ
INSTANCE_IDS=$(aws ec2 describe-instances --region us-east-2 \
  --filters "Name=tag:aws:autoscaling:groupName,Values=twtechapp-asg" "Name=availability-zone,Values=us-east-2a" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
# terminate them (ASG will detect and replace them immediately)
aws ec2 terminate-instances --instance-ids $INSTANCE_IDS --region us-east-2

# Observe ALB target draining, Check target health and draining status:

aws elbv2 describe-target-health --target-group-arn TARGET_GROUP_ARN --region us-east-2 \
--query 'TargetHealthDescriptions[?Target.Id==`i-...`]'
# to watch all targets
watch -n 2 'aws elbv2 describe-target-health --target-group-arn TARGET_GROUP_ARN --region us-east-2 --output table'

NB:

    • Expect terminated instances to go to draining then unused or be removed.
    • Observe ASG replacement behavior

# watch desired vs in-service count
watch -n 5 'aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
--query "AutoScalingGroups[0].{Desired:DesiredCapacity,InService:length(Instances[?LifecycleState==\`InService\`])}" \
--output table'

    •  ASG should launch new instances in other AZs to keep desired capacity. If AZ capacity limits block, ASG may launch into other AZs or fail; use mixed instance policy for flexibility.

# Verify application traffic continuity

    •  Run a curl load against ALB DNS name and confirm responses:

ALB_DNS=$(aws elbv2 describe-load-balancers --names my-alb --region us-east-2 \
--query 'LoadBalancers[0].DNSName' --output text)
curl -sS "http://$ALB_DNS/healthz"

# Post-test cleanup / roll-back

    • No special rollback—ASG will stabilize. 
    • If twtech manually changed ASG desired capacity, set it back:

aws autoscaling update-auto-scaling-group --auto-scaling-group-name twtechapp-asg --desired-capacity 2 --region us-east-2

# Checks & troubleshooting

    • If replacements aren't launching: check ASG events:

aws autoscaling describe-scaling-activities --auto-scaling-group-name twtechapp-asg --region us-east-2 --output table

    • If no new instances due to subnets/AZ disabled: ensure subnet_ids include multiple AZs and instance type availability.

B) DB failover — RDS Multi-AZ and Aurora

Important: 

    • Failover interrupts connections. 
    • Perform in maintenance window for production. 
    • Below are commands for both RDS (Multi-AZ) and Aurora.

B1 — RDS (Single-instance Multi-AZ MySQL/Postgres)

NB:

    • Force failover via reboot with force-failover (causes primary to failover to standby):

aws rds reboot-db-instance --db-instance-identifier twtechdb --force-failover --region us-east-2

# Verify:

# describe instance, watch RecentRestarts/AvailabilityZone change
aws rds describe-db-instances --db-instance-identifier twtechdb \
--region us-east-2 \
--query 'DBInstances[0].[DBInstanceStatus,MultiAZ,Endpoint.Address,PreferredMaintenanceWindow,AvailabilityZone]' \
--output json

# What happens next:

    •  AWS promotes standby to primary; connection endpoint stays same (for Multi-AZ RDS), but TCP connections break and reconnect.
    •  Application should have retry/backoff and connection pooling configured to re-resolve DNS (not cache endpoint IP).

B2 — Aurora (MySQL/Postgres compatible) — failover to reader/other writer

For Aurora (clustered):

# find reader endpoint and writer
aws rds describe-db-clusters --db-cluster-identifier twtech-aurora-cluster --region us-east-2 \
--query 'DBClusters[0].{WriterEndpoint:Endpoint,Readers:DBClusterMembers}' \
--output json
# failover to specific instance (instance identifier)
aws rds failover-db-cluster --db-cluster-identifier twtech-aurora-cluster \
--target-db-instance-identifier twtech-aurora-instance-2 --region us-east-2

Or

# To let AWS choose a failover target:

aws rds failover-db-cluster --db-cluster-identifier twtech-aurora-cluster --region us-east-2

# Always Verify:

aws rds describe-db-clusters --db-cluster-identifier twtech-aurora-cluster \
--region us-east-2 --query 'DBClusters[0].[Status,Endpoint]' --output json

C) Application steps to be resilient to DB failover

1. Use connection retry with exponential backoff in the app DB client. Example simple policy:

    • On connection failure, retry 5 times with 200ms -> 400ms -> 800ms -> 1600ms -> 3200ms delays.

2. Avoid long-lived DB connections held across DNS changes — configure pool to validate connections (test-on-borrow) and to recreate.

3. Use RDS endpoint (same DNS) for Multi-AZ — clients should resolve DNS on each reconnect (don’t cache IP).

4. For Aurora use cluster writer endpoint for writes, and reader endpoints for read scaling.
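The retry policy from step 1 can be sketched in shell; `retry_with_backoff` and `flaky_connect` are hypothetical names, and a real client would wrap its actual DB connect call:

```shell
#!/bin/bash
# Retry a command up to 5 times with exponential backoff:
# 200ms -> 400ms -> 800ms -> 1600ms -> 3200ms.
retry_with_backoff() {
  local cmd="$1" max=5 delay_ms=200 attempt=1
  while ! $cmd; do
    if [ "$attempt" -ge "$max" ]; then
      return 1    # give up after the final attempt
    fi
    sleep "$(awk "BEGIN { print $delay_ms / 1000 }")"
    delay_ms=$((delay_ms * 2))
    attempt=$((attempt + 1))
  done
  return 0
}

# Demo: a connection probe that fails twice, then succeeds
# (simulating the brief outage during a Multi-AZ failover).
ATTEMPTS=0
flaky_connect() { ATTEMPTS=$((ATTEMPTS + 1)); [ "$ATTEMPTS" -ge 3 ]; }
retry_with_backoff flaky_connect && echo "connected after $ATTEMPTS attempts"
```

Most production DB drivers and pools expose this as configuration (retry count, backoff, connection validation) rather than hand-rolled shell, but the timing policy is the same.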

D) Verification checklist after failover / AZ loss

# Verify ALB healthy host count >= min healthy targets.

aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HealthyHostCount \
--dimensions Name=TargetGroup,Value=TARGET_GROUP_ARN --start-time $(date -u -d '5 minutes ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) --period 60 --statistics Average --region us-east-2

# ASG has desired # of InService instances:

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \
--query "AutoScalingGroups[0].Instances[?LifecycleState=='InService'].[InstanceId,AvailabilityZone]" --output table

    • RDS instance status is "available"; for Aurora, confirm the cluster members show the expected writer (IsClusterWriter = true).

E) Rollback & remediation

  • If autoscaling fails to replace instances because of AZ capacity or an AMI issue:
    •    Temporarily increase the ASG max size:

aws autoscaling update-auto-scaling-group --auto-scaling-group-name twtechapp-asg --max-size 6 --region us-east-2

    •    Confirm subnet IP availability and ENI limits; choose a different instance type if capacity-limited.
  • If DB promotion fails or the cluster is unhealthy: check RDS events and CloudWatch metrics in the AWS Console, and contact AWS Support for production incidents.

Final notes & recommendations

    • Put these scripts and Terraform under source control.
    • Keep secrets out of user-data:
      • use SSM Parameter Store / Secrets Manager to supply:
        • DB passwords,
        • sensitive configs at runtime.
    • Bake AMIs with most dependencies pre-installed to reduce boot time and health-check flapping.
    • Run the AZ-loss test and DB failover test in a staging environment first, and automate the checks (CI job) so the twtech runbook stays evergreen.
    • For production DR practice:
      • document RTO/RPO targets,
      • practice cross-region promotion for critical workloads.

A complete, ready-to-copy Sample repository with:

    • modules/ec2-ha/ Terraform module (Launch Template + ASG + ALB Target Group attachment)
    • provider.tf, variables.tf, main.tf, and an example root module
    • A fully populated user-data file (no placeholders: filled with realistic sample ARNs, an instance profile name, and an S3 bucket name)

NB:

    • These values are synthetic but validly formatted ARNs so the configuration is paste-ready. 
    • Replace them with actual values before deploying.

 1. Repository Structure

repo/
├── provider.tf
├── variables.tf
├── main.tf
├── user-data.sh
└── modules/
    └── ec2-ha/
        ├── main.tf
        ├── variables.tf
        └── outputs.tf

 2. user-data.sh (ready to paste, with realistic values)

NB:

  • The following script supports deploying from S3 for:
    • Java,
    • Node.js,
    • Python applications.

# ALB Target Group ARN:

arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechprod-app-tg/72af9c1c6xxxxx

# S3 bucket: 

s3://prod-app-artifacts-bucket-987654

# Instance profile: 

EC2ProdAppInstanceProfile

# user-data.sh
#!/bin/bash
set -ex
APP_STACK="node"   # fallback stack if metadata omits it: java | node | python
APP_DIR="/opt/app"
yum update -y
yum install -y awscli jq curl unzip
# Create the app directory
mkdir -p "$APP_DIR"
cd "$APP_DIR"
# Fetch latest artifact metadata (sample: app-latest.json)
aws s3 cp s3://prod-app-artifacts-bucket-987654/app-latest.json .
ARTIFACT=$(jq -r '.artifact' app-latest.json)
STACK=$(jq -r '.stack // empty' app-latest.json)
STACK=${STACK:-$APP_STACK}   # fall back to the default stack if unset
echo "Artifact: $ARTIFACT"
echo "Stack: $STACK"
aws s3 cp "s3://prod-app-artifacts-bucket-987654/$ARTIFACT" app.zip
unzip -o app.zip 
### Install runtimes depending on stack type ###
case $STACK in
  java)
    yum install -y java-17-amazon-corretto
    nohup java -jar app.jar --server.port=8080 &
    ;;
  node)
    curl -fsSL https://rpm.nodesource.com/setup_20.x | bash -
    yum install -y nodejs
    npm install --production
    nohup node server.js &
    ;;
  python)
    yum install -y python3 python3-pip
    pip3 install -r requirements.txt
    nohup python3 app.py &
    ;;
esac
# Register with the ALB Target Group (self-registration)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-2:accountID:targetgroup/twtechprod-app-tg/72af9c1c673xxxxx \
  --targets Id=$INSTANCE_ID
echo "User-data complete."

 # Root Module 

# provider.tf

terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
provider "aws" {
  region = var.region
}

# variables.tf

variable "region" {
  type    = string
  default = "us-east-2"
}
variable "vpc_id" {}
variable "private_subnets" {
  type = list(string)
}
variable "alb_target_group_arn" {
  type = string
  default = "arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechprod-app-tg/72af9c1c6732xxxx"
}
variable "artifact_bucket" {
  type    = string
  default = "prod-app-artifacts-bucket-987654"
}
variable "instance_profile" {
  type    = string
  default = "EC2ProdAppInstanceProfile"
}

# main.tf

module "ec2_ha" {
  source = "./modules/ec2-ha"
  vpc_id               = var.vpc_id
  private_subnets      = var.private_subnets
  alb_target_group_arn = var.alb_target_group_arn
  instance_profile = var.instance_profile
  user_data_path   = "${path.module}/user-data.sh"
} 

# Terraform Module (modules/ec2-ha/)

# variables.tf

variable "vpc_id" {}
variable "private_subnets" { type = list(string) }
variable "alb_target_group_arn" {}
variable "instance_profile" {}
variable "user_data_path" {}

# main.tf

resource "aws_launch_template" "twtech-instance" {
  name_prefix   = "twtechprod-app"
  iam_instance_profile {
    name = var.instance_profile
  }
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"
  user_data = filebase64(var.user_data_path)
  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "twtechprod-app"
    }
  }
}
data "aws_ami" "amazon_linux" {
  owners      = ["amazon"]
  most_recent = true
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}
resource "aws_autoscaling_group" "twtech-asg" {
  name                      = "twtechprod-app-asg"
  desired_capacity          = 4
  min_size                  = 2
  max_size                  = 8
  vpc_zone_identifier       = var.private_subnets
  health_check_type         = "ELB"
  health_check_grace_period = 90
  launch_template {
    id      = aws_launch_template.twtech-instance.id
    version = "$Latest"
  }
  target_group_arns = [var.alb_target_group_arn]
  tag {
    key                 = "Name"
    value               = "twtechprod-app-asg"
    propagate_at_launch = true
  }
}
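As one of the "autoscaling policy ideas" named in the scope, a target tracking policy could be attached to this ASG. A minimal sketch, assuming average CPU is a reasonable scaling signal for the workload (the 60% target is illustrative):

```hcl
resource "aws_autoscaling_policy" "twtech-cpu-target" {
  name                   = "twtechprod-app-cpu-target"
  autoscaling_group_name = aws_autoscaling_group.twtech-asg.name
  policy_type            = "TargetTrackingScaling"
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    # Scale out when average CPU rises above ~60%, scale in when it falls below.
    target_value = 60
  }
}
```

Target tracking lets the ASG manage the CloudWatch alarms itself, which is usually simpler to operate than step scaling policies.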

# outputs.tf

output "asg_name" {
  value = aws_autoscaling_group.twtech-asg.name
}
output "launch_template_id" {
  value = aws_launch_template.twtech-instance.id
}

twtech at this point should have:

    • A complete Terraform repository,
    • A fully working module,
    • A production-style user-data script with non-placeholder ARNs,
    • A clean root module wired to the HA EC2 stack.







