Here’s twtech practical Overview on creating highly-available (HA) EC2-based applications.
Scope:
- Intro,
- Architecture patterns,
- Design decisions,
- Operational practices,
- Security,
- Testing,
- A few concrete snippets (user-data, autoscaling policy ideas, and Terraform/Cfn concepts described) so twtech can configure and implement right away,
- Insights.
Intro:
- The Goal is to run a service on EC2 with minimal downtime,
automatic recovery, and capacity scaling while keeping ops overhead low and
security/compliance intact.
- Creating highly available EC2 instances involves using a combination of Amazon Web Services (AWS) features, primarily Auto Scaling Groups (ASG) across multiple Availability Zones (AZs), and leveraging services like Elastic Load Balancing (ELB).
Key strategies:
Multiple
Availability Zones (AZs):
- twtech
deploys instances across at least two or more separate AZs within the same AWS
Region.
- An
AZ is one or more distinct data centers with redundant power, networking, and
connectivity.
- By using multiple AZs, its application can remain operational even if one AZ experiences an outage.
Auto Scaling Groups (ASGs):
- twtech places its EC2 instances within an ASG configured to span multiple AZs.
- The
ASG ensures that a minimum
specified number (at least 1)
of its instances is running at all times.
- If an instance fails or becomes unhealthy (due to an AZ failure or software crash), the ASG automatically launches a replacement instance in an operational AZ.
Elastic Load Balancing (ELB):
- Use an ELB (Application Load Balancer or Network Load Balancer) in front of twtech ASG.
- The
load balancer automatically distributes
incoming traffic across the healthy instances in all designated AZs.
- Load balancer also performs health checks on the instances and routes traffic only to healthy ones.
Data Durability and Shared Storage:
- Avoid storing persistent, unique data on individual EC2 instance local storage (They are ephemeral or EBS root volumes if the instance can be terminated eventually).
- Use shared, highly durable services like Amazon S3, Amazon RDS (configured for Multi-AZ deployments), or Amazon EFS for your data storage needs.
Route 53 Failover:
- For highly available DNS, twtech recommends Amazon Route 53 to manage domain's health checks and failover routing policies.
- If twtech primary deployment becomes unavailable, Route 53 can automatically route traffic to a healthy, secondary deployment in a different region if necessary.
ThThe core idea behind creating EC2 with High Availablity (HA), is to combine services architecture that makes the instance resilient where the failure of a single instance, data center, or even an entire Availability Zone does not result in application downtime.
High-level architecture (textual
diagram)
- Public internet → Route 53 (DNS) → ALB (multi-AZ) → Auto Scaling Group of EC2s (spread across AZs)
- EC2s
in private subnets; NAT Gateway(s) for public subnet outbound traffic; ALB in public subnets.
- Stateful
components for Persistent data: RDS
(multi-AZ or Aurora), ElastiCache (clustered),
S3 (object
storage).
- Logs/metrics
for Monitoring & Obsevability: CloudWatch
Logs + Metrics (Unified agent), S3 (long-term), optionally
ELK/managed logging.
- Optional: Bastion + AWS Systems Manager for access.
Deep Dive Design Patterns & Decisions Making
1. Multi-AZ + Auto Scaling Group (ASG)
- ASG
is the core: specify desired/min/max capacity and distribute instances across
AZs.
- Use multiple private subnets in different
AZs; ASG launches instances in healthy AZs only.
- Combine with ALB health checks so unhealthy
instances are removed automatically.
2. Stateless vs Stateful
- Stateless app servers: store
sessions in ElastiCache (Redis/Memcached)
or a cookie/JWT for scale.
- Stateful needs
(local files): prefer
S3 + EFS (for
shared POSIX) or attach EBS with replication/backups; avoid relying on
instance local disk for critical data.
3. Placement & spread
- Use spread placement groups
if twtech wants max AZ-level isolation for critical instances; use with caution — they
restrict capacity.
- For low-latency network locality (e.g., HPC), cluster placement
helps
but reduces HA across AZs — usually not used for web apps.
4. Immutable infrastructure & deployment
- Prefer immutable AMI-based deployments:
bake
AMI with app +
dependencies (Packer), then roll ASG
with new launch configuration/template.
- Use blue/green
or canary deployments with ALB target group switching or weighted DNS via
Route 53.
5. Bootstrapping & config
- Use user-data / cloud-init
only
for environment-agnostic bootstrapping (install
agent, fetch config from S3 or SSMD Parameter Store). Keep launch
idempotent.
- Use SSM RunCommand / State Manager
or
configuration management (Ansible/Chef/Puppet)
for post-boot tasks rather than SSH.
6. Health checks & graceful shutdown
- ALB
health checks must hit an application endpoint that checks readiness (dependencies availability).
- Implement
shutdown hooks: trap SIGTERM, mark instance unhealthy (deregister from target
group) and wait for connections to drain before exit.
- Use ASG lifecycle hooks for pre-termination logic (drain work, flush caches).
7. Instance recovery & replacement
- Enable EC2 Auto Recovery
for
transient hardware/network issues or ASG replacement policies.
- Use
instance status checks and CloudWatch
alarms to trigger replacement.
8. Network design
- Use
Public subnets for ALB/NAT & Private subnets for instances.
- Least-privilege
security groups: ALB → EC2 on app port; EC2 →
DB on DB port only from app SG. (allow traffic to only the required port)
- Use
VPC endpoints for S3/SSM to avoid NAT egress and
improve security.
9. Storage & backups
- Use EBS gp3/io2
with
provisioned IOPS for critical disks; enable EBS encryption
and
regular snapshots.
- Use EFS
for shared filesystem when
necessary (multi-AZ backed).
- Backups:
snapshot
automation (Data Lifecycle Manager)
or custom Lambda snapshots.
10. Observability & alerting
- Logs:
push
application logs to stdout/stderr to CloudWatch Logs (or File → CloudWatch agent). Include
structured JSON logs.
- Metrics:
custom
app metrics (CloudWatch custom metrics
or Prometheus + remote-write).
- Tracing:
instrument
with X-Ray or OpenTelemetry.
- Alerts:
PagerDuty/SMS/Slack
on critical alarms (error rate, latency,
CPU, disk, ASG health).
11. Security
- IAM
roles attached to instance profiles with least privilege
(S3 read-only, SSM access, CloudWatch put). No long-lived keys.
- Hardened
AMIs: baseline with
latest patches, CIS hardening where needed. Use SSM Patch Manager.
- OS-level
protections: disable
unused services, enable firewall rules (iptables),
Instance Metadata Service v2 (IMDSv2)
only.
- Network
ACLs (NACL) for defense-in-depth at the VPC subnet level.
12. Cost & capacity planning
- Right-size
instances; use Savings Plans/Reserved Instances for baseline loads.
- Use
mixed instance policies in ASG (On-Demand + Spot with fallback) via Instance
Pools for cost and availability.
Implementation checklist (step-by-step)
1.
VPC & Subnets
- Create VPC with at least 3 AZs; private subnets for app, public
subnets for ALB/NAT in each AZ.
- Configure route tables and NAT gateways (or NAT fleet).
2.
Security Groups & IAM
- Create ALB SG (ingress
0.0.0.0/0:80/443), App SG (ingress
from ALB SG only), DB SG (ingress
from App SG on DB port).
- Create instance profile IAM role with minimal permissions (SSM, CloudWatch PutMetric, S3 read for
config, KMS decrypt if needed).
3.
AMIs & Bootstrapping
- Build an AMI containing runtime and agents.
- Alternatively use
user-data to install quickly but keep idempotent.
- Sample cloud-init (user-data)
snippet:
# install.sh#!/bin/bashset -e# Example: install SSM agent, docker, get configyum update -yamazon-linux-extras install -y java-openjdk11# Install SSM (if not baked)# Start app (example)aws s3 cp s3://twteh-s3bucket/app-config.json /etc/myapp/config.json --region us-east-2# start app systemd servicesystemctl enable --now myapp4.
Auto Scaling Group & Launch
Template
- Create launch template with AMI, instance type, IAM role,
user-data, block device mappings.
- Configure ASG across AZs, define min/desired/max. Use health check
type:
ELB. - Enable
termination policies, and
Graceful shutdownwith lifecycle hooks.
5.
Load Balancer
- Create ALB in public subnets with target groups referencing ASG
instances. Set health check path
/healthz. - Enable slow start if needed. Configure sticky sessions only if necessary.
6.
DNS & Failover
- Route 53 record pointing to ALB. For DR, use weighted / failover
records between regions.
7.
Monitoring & Alerts
- CloudWatch agent to collect system metrics. Log group per environment.
- Alarms for 5xx errors, high latency, high CPU, low healthy host
count.
8.
Deployment pipeline
- CI/CD builds AMI (Packer) or creates new ASG launch template
version. Use CodeDeploy / Terraform / CloudFormation pipeline to perform
rolling/blue-green. Implement rollback.
9.
Testing & chaos
- Test AZ failure: simulate by disabling AZ in ASG or bringing down
instances.
- Test AMI/launch template churn with Canary deployment.
- Run load tests; validate autoscaling triggers and cool-down behavior.
Autoscaling & policies — practical tips
- Use target tracking
(e.g., keep average CPU at 40%) for basic needs.
- For web apps, prefer scaling on request/latency
metrics (ALB RequestCountPerTarget
/ TargetResponseTime) or custom queue length for worker processes.
- Use step scaling for sudden
load changes (scale more aggressively on
high thresholds).
- Protect against scale-in cascading: add cooldowns and minimum
healthy host counts.
Lifecycle hooks & graceful termination (example)
- Configure
ASG lifecycle hook
Terminating:Waitto call a Lambda or SQS. - Flow:
ASG -> Lifecycle hook -> Lambda triggers SSM Run Command on instance to
systemctl stop myappwhich drains, thenCompleteLifecycleAction. - This ensures in-flight requests finish successfully.
Failover & disaster recovery
- Within-region HA:
Multi-AZ + ASG + ALB — this covers most availability needs (AZ outage tolerance).
- Cross-region DR:
replicate
AMIs, replicate state (S3 cross-region
replication, DB read replicas promoted to master), and Route 53 failover records.
Keep RTO/RPO targets in SLOs and practice failover drills.
Logging, tracing & debugging tips
- Centralize logs; include request IDs and correlation IDs in
headers for tracing.
- Provide a
/healthz(liveness) and/ready(readiness) endpoints: ALB should use readiness so an instance is only marked healthy once app and dependencies are ready. - Use CloudWatch Logs Insights or a dedicated tracing UI for latency
spikes.
Security operational items
- Rotate AMIs regularly; apply patches via image pipeline.
- Run vulnerability scans on AMI and container images.
- Enforce encryption at-rest (EBS,
S3) and in-transit (TLS for ALB).
- Enforce strict IAM roles and use SSM Session Manager for shell
access (no SSH keys).
Common pitfalls & how to avoid them
- Boot-time slowdowns:
heavy
user-data installs cause long boot times and unhealthy health checks. Bake as
much as possible into AMI.
- State on instance:
storing
sessions on instance disk — use external session store.
- Autoscaling flapping:
aggressive
scaling policies + low cooldowns cause instability — use sensible cooldowns and
target-tracking.
- Insufficient health checks:
using
only EC2 status checks instead of app-level health checks can keep unhealthy
app instances in service.
- AZ capacity skew:
some
instance types may not be available in all AZs — use mixed instances and
multiple instance types in ASG.
Sample Terraform plan
- Module
vpc(3 AZs, subnets) - Module
alb(ALB + target group) - Module
launch_template(AMI, userdata, iam) - Module
asg(launch_template, min/max/desired, lifecycle hooks, autoscaling policies) - Module
monitoring(cloudwatch log groups, metrics, alarms) - Module
iam(instance profile and policies)
Testing checklist (must-run)
- Simulate
AZ failure (terminate
all instances in one AZ) → verify traffic shifts and capacity remains.
- Deploy
new AMI via canary → validate metrics, then
promote.
- Simulate
instance crash and ensure ASG replaces it and ALB stops sending traffic.
- Load
test to validate scaling thresholds and response times.
- Security
scan (AMI &
network)
and patch test.
Service level agreement (SLA) / service
level objective (SLO) suggestions
- Availability:
aim
for 99.95% at app-tier with multi-AZ + ASG + ALB.
- Cross-region needed for 99.99%+ depending on failover automation.
- Recovery
time: ASG instance replacement typically 2–5 minutes depending on boot
time; reduce by optimizing AMI and health checks.
Short actionable checklist
- Use an AMI with app + SSM agent.
- Create ALB + target group with
/readyreadiness health check. - Create launch template with the AMI and IAM role.
- Create ASG across 3 AZs (min 2, desired 2+) with
ELB health checks, lifecycle hook for termination.
- Configure CloudWatch alarms for low healthy host count and high
5xx rates.
- Implement SSM session manager for access and enable IMDSv2.
- Run a controlled failover/termination test and review logs.
twtech insights on HA EC2 instances in AWS:
- Concrete cloud-init / user-data scripts (for Java, Node, Python) that twtech can configure into a Launch Template,
- A sample Terraform module (launch template + ASG, lifecycle hook, mixed instances policy + attach to ALB TG),
- A step-by-step failure scenario runbook (simulate AZ loss + RDS failover) with exact AWS CLI commands and verification steps.
1) User-data scripts (Amazon Linux 2 style, idempotent,
uses IMDSv2)
NB:
- Uses IMDSv2 to get instance id / region.
- Installs/ensures SSM agent for remote access.
- Expects artifacts in S3 (or
use baked AMI instead).
- Creates systemd service with graceful shutdown that deregisters
from ALB target group and waits for in-flight requests to finish.
- Uses environment variables stored in SSM Parameter Store or pulled
from S3 as shown.
# for java-base application
# Java-Spring-Boot-user-data-cloud-init.sh
#!/bin/bashset -euxo pipefail# --- config ---S3_BUCKET="twch-s3bucket"S3_KEY="artifacts/twtechapp.jar"TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/72af9c1c6xxxxx"LOG_GROUP="/aws/ec2/twtechwebapplg"APP_PORT=8080JAVA_OPTS="-Xms256m -Xmx512m"# get region & instance metadata using IMDSv2TOKEN=$(curl -s -X PUT "http://169.xxx.xxx.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http:// 169.xxx.xxx.254/latest/dynamic/instance-identity/document \| jq -r .region)INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http:// 169.xxx.xxx.254/latest/meta-data/instance-id)yum update -y # Java runtimeamazon-linux-extras enable corretto8yum install -y java-1.8.0-amazon-corretto-headless jq awscli# ensure SSM agent (for Amazon Linux 2 usually preinstalled)if ! systemctl is-active amazon-ssm-agent >/dev/null 2>&1; then yum install -y https://s3.${REGION}.amazonaws.com/amazon-ssm-${REGION}/latest/linux_amd64/amazon-ssm-agent.rpm \|| true systemctl enable --now amazon-ssm-agent || truefi# create app dirmkdir -p /opt/twtechappaws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" /opt/twtechapp/twtechapp.jar --region "$REGION" # create deregister script (used by systemd on stop)cat >/opt/myapp/deregister_tg.sh <<'DEREG'#!/bin/bash set -eTOKEN=$(curl -s -X PUT "http://169.xxx.xxx.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/meta-data/instance-id)REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \http://169.xxx.xxx.254/latest/dynamic/instance-identity/document | jq -r .region)aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" \--targets Id="$INSTANCE_ID" --region "$REGION"# Wait until target state is drained or not foundfor i in {1..30}; do sleep 2 STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" \--targets Id="$INSTANCE_ID" --region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' \--output text 2>/dev/null || echo "notfound") if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then echo "deregistered ($STATE)" exit 0 fidoneecho "timed out waiting for deregistration"exit 0DEREGchmod +x /opt/twtechapp/deregister_tg.sh# create systemd servicecat >/etc/systemd/system/twtechapp.service <<'SERVICE'[Unit]Description=twtech Java AppAfter=network.target[Service]Type=simpleUser=rootWorkingDirectory=/opt/twtechappExecStart=/usr/bin/java ${JAVA_OPTS} -jar /opt/twtechapp/twtechapp.jar \--server.port=${APP_PORT}ExecStop=/opt/twtechapp/deregister_tg.shTimeoutStopSec=120Restart=on-failureRestartSec=5[Install]WantedBy=multi-user.targetSERVICEsystemctl daemon-reloadsystemctl enable --now twtechapp.service# log to CloudWatch - optional: install CloudWatch agent NB:
Replace JAVA_OPTS,
S3_BUCKET, S3_KEY, TARGET_GROUP_ARN, APP_PORT as needed.
# For Nodejs applications
# Node-js-Express-user-data.sh
#!/bin/bashset -euxo pipefailS3_BUCKET="twtech-s3bucket"S3_KEY="artifacts/twtech-node-app.tar.gz"TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechnodejs-tg/72af9c1c6xxxxx"APP_DIR="/opt/twtechnodeapp"APP_PORT=3000NODE_VERSION="18"TOKEN=$(curl -s -X PUT "http://169.xxx.xxx.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/dynamic/instance-identity/document \| jq -r .region)INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/meta-data/instance-id)yum update -y# install node via nvmless method (nodesource)curl -sL https://rpm.nodesource.com/setup_${NODE_VERSION}.x | bash -yum install -y nodejs jq awsclimkdir -p ${ twtechnodeapp_dir}aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" - | tar -xz -C ${ twtechnodeapp_dir}# install depscd ${ twtechnodeapp_dir}npm ci --production# deregister scriptcat >/opt/nodeapp/deregister_tg.sh <<'DEREG'#!/bin/bashset -eTOKEN=$(curl -s -X PUT "http://169.xxx.xxx.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/meta-data/instance-id)REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/dynamic/instance-identity/document \| jq -r .region)aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" --region "$REGION"# wait loop similar to Java scriptfor i in {1..30}; do sleep 2 STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" \--region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' --output text 2>/dev/null || echo "notfound") if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then exit 0 fidoneexit 0DEREGchmod +x /opt/nodeapp/deregister_tg.sh# systemd servicecat >/etc/systemd/system/twtechnodeapp.service <<'SERVICE'[Unit]Description=twtech Node AppAfter=network.target[Service]ExecStart=/usr/bin/node /opt/twtechnodeapp/index.jsWorkingDirectory=/opt/twtehnodeappRestart=on-failureUser=rootExecStop=/opt/nodeapp/deregister_tg.shTimeoutStopSec=120[Install]WantedBy=multi-user.targetSERVICEsystemctl daemon-reloadsystemctl enable --now twtechnodeapp.service# Python application
# Python-Gunicorn+Flask-user-data.sh
#!/bin/bashset -euxo pipefailS3_BUCKET="twtech-s3bucket"S3_KEY="artifacts/twtechpython-app.tar.gz"TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechpython-tg/72af9c1c6xxxxx"APP_DIR="/opt/twtechpyapp"APP_PORT=8000VENV_DIR="/opt/twtechpyapp/venv"TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/dynamic/instance-identity/document \| jq -r .region)INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/meta-data/instance-id)yum update -yyum install -y python3 python3-venv python3-pip jq awsclimkdir -p ${twtechapp_dir}aws s3 cp "s3://${S3_BUCKET}/${S3_KEY}" - | tar -xz -C ${ twtechapp_dir}python3 -m venv ${VENV_DIR}source ${VENV_DIR}/bin/activatepip install --upgrade pippip install -r ${ twtechapp_dir}/requirements.txt# deregister scriptcat >/opt/twtechpyapp/deregister_tg.sh <<'DEREG'#!/bin/bashset -eTOKEN=$(curl -s -X PUT "http://169.xxx.xxx.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/meta-data/instance-id)REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.xxx.xxx.254/latest/dynamic/instance-identity/document \| jq -r .region)aws elbv2 deregister-targets --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" --region "$REGION"for i in {1..30}; do sleep 2 STATE=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" --targets Id="$INSTANCE_ID" \--region "$REGION" --query 'TargetHealthDescriptions[0].TargetHealth.State' --output text 2>/dev/null || echo "notfound") if [[ "$STATE" == "draining" || "$STATE" == "unused" || "$STATE" == "notfound" ]]; then exit 0 fidoneexit 0DEREGchmod +x /opt/twtechpyapp/deregister_tg.sh# systemd unit for gunicorncat >/etc/systemd/system/twtechpyapp.service <<'SERVICE'[Unit]Description=twtech Python Gunicorn AppAfter=network.target[Service]User=rootWorkingDirectory=/opt/twtechpyappExecStart=/opt/twtechpyapp/venv/bin/gunicorn -w 4 -b 0.0.0.0:8000 app:twtechpyappExecStop=/opt/twtechpyapp/deregister_tg.shRestart=on-failureTimeoutStopSec=120[Install]WantedBy=multi-user.targetSERVICEsystemctl daemon-reloadsystemctl enable --now twtechpyapp.service2) Sample Terraform module — modules/asg (launch
template + ASG +
lifecycle hook)
NB:
This is a minimal but usable module. It assumes
twtech already created ALB target group and IAM instance profile.
File: modules/asg/twtechmain-variables.tf
variable "name" { type = string }variable "ami_id" { type = string }variable "instance_types" { type = list(string) default = ["t3.micro","t3a.micro"] }variable "instance_profile" { type = string } # IAM instance profile namevariable "key_name" { type = string default = "" }variable "subnet_ids" { type = list(string) }variable "target_group_arns" { type = list(string) }variable "vpc_security_group_ids" { type = list(string) }variable "user_data" { type = string default = "" }variable "min_size" { type = number default = 2 }variable "desired_capacity" { type = number default = 2 }variable "max_size" { type = number default = 4 }variable "region" { type = string default = "us-east-2" }resource "aws_launch_template" "this" { name_prefix = "${var.name}-lt-" image_id = var.ami_id instance_type = var.instance_types[0] # primary, mixed policy uses list below iam_instance_profile { name = var.instance_profile } key_name = var.key_name vpc_security_group_ids = var.vpc_security_group_ids user_data = base64encode(var.user_data) lifecycle { create_before_destroy = true }}resource "aws_autoscaling_group" "this" { name = "${var.name}-asg" max_size = var.max_size min_size = var.min_size desired_capacity = var.desired_capacity vpc_zone_identifier = var.subnet_ids health_check_type = "ELB" health_check_grace_period = 120 mixed_instances_policy { launch_template { launch_template_specification { launch_template_id = aws_launch_template.this.id version = "$$Latest" } override { instance_type = var.instance_types[0] } dynamic "override" { for_each = slice(var.instance_types, 1, length(var.instance_types)) content { instance_type = override.value } } } instances_distribution { on_demand_allocation_strategy = "prioritized" spot_allocation_strategy = "capacity-optimized" on_demand_base_capacity = 0 on_demand_percentage_above_base_capacity = 20 } } target_group_arns = var.target_group_arns tag { key = "Name" value = "${var.name}" propagate_at_launch = true } lifecycle { create_before_destroy = true }}resource "aws_autoscaling_lifecycle_hook" "drain" { name = "${var.name}-terminate-drain" autoscaling_group_name = aws_autoscaling_group.this.name default_result = "CONTINUE" heartbeat_timeout = 300 lifecycle_transition = "autoscaling:EC2_INSTANCE_TERMINATING" notification_target_arn = "" # optional SNS/SQS for lambda processing role_arn = "" # optional role that can call complete-lifecycle-action}# File: modules/asg/twtechoutputs.tf
output "asg_name" { value = aws_autoscaling_group.this.name}output "launch_template_id" { value = aws_launch_template.this.id}Usage example in root module
module "app_asg" { source = "./modules/asg" name = "twtechapp" ami_id = "ami-0abcdef123xxxx890" instance_types = ["t3.medium", "t3a.medium"] instance_profile = "twtech-instance-profile" subnet_ids = [aws_subnet.app1.id, aws_subnet.app2.id, aws_subnet.app3.id] target_group_arns = [aws_lb_target_group.app.arn] vpc_security_group_ids = [aws_security_group.app.id] user_data = file("user-data.sh") min_size = 2 desired_capacity = 2 max_size = 6}NB:
- Each module uses a mixed instances policy (Spot + On-Demand). Remove if
unwanted.
- twtech Sets
notification_target_arnandrole_arnon the lifecycle hook if it wants to trigger a Lambda/SQS to run drain commands, then complete lifecycle. - twtech needs to create an IAM instance profile allowing SSM,
CloudWatch PutMetric, and S3 read if using S3 artifacts.
3) Failure scenario runbook — simulate AZ loss and do DB
failover
Context assumptions:
- ASG
twtechapp-asgis in regionus-east-2. - ALB target group ARN =
TARGET_GROUP_ARN. - RDS identifier =
twtechdb(Multi-AZ) or Aurora cluster =twtech-aurora-cluster. - AWS CLI configured with appropriate profile/role that has rights
to ASG, EC2, ELBv2, RDS.
A) Simulate AZ loss (safe test in non-production)
Goal: Verify multi-AZ ASG + ALB reaction. We'll kill all instances in one AZ
and observe auto-scaling and ALB drain.
Useful management commands
# List ASG and instances
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \--query 'AutoScalingGroups[0].[AutoScalingGroupName,Instances]' --output json# Find instances in AZ us-east-2a (Sample)
# Also list all instances in the ASG and their AZaws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \ --query 'AutoScalingGroups[0].Instances[*].{Id:InstanceId,AZ:AvailabilityZone}' --output table3.
Terminate instances in a
specific AZ
NB:
(Only
do this for test environment or
under change-control.)
# get instance ids in AZINSTANCE_IDS=$(aws ec2 describe-instances --region us-east-2 \ --filters "Name=tag:aws:autoscaling:groupName,Values=twtechapp-asg" "Name=availability-zone,Values=us-east-2a" \ --query 'Reservations[].Instances[].InstanceId' --output text)# terminate them (ASG will detect and replace them immediately)aws ec2 terminate-instances --instance-ids $INSTANCE_IDS --region us-east-2# Observe ALB target draining, Check target health and
draining status:
aws elbv2 describe-target-health --target-group-arn TARGET_GROUP_ARN --region us-east-2 \--query 'TargetHealthDescriptions[?Target.Id==`i-...`]'# to watch all targetswatch -n 2 'aws elbv2 describe-target-health --target-group-arn TARGET_GROUP_ARN --region us-east-2 --output table'NB:
- Expect
terminated instances to go to
drainingthenunusedor be removed. - Observe ASG replacement behavior
# watch desired vs actualwatch -n 5 'aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \--query "AutoScalingGroups[0].[DesiredCapacity,InServiceInstances=`length(Instances[?LifecycleState==\`InService\`])`]" \--output table'- ASG should launch new instances in other AZs to keep desired
capacity. If AZ capacity limits block, ASG may launch into other AZs or fail;
use mixed instance policy for flexibility.
#
Verify application traffic continuity
- Run a curl load against ALB DNS name and confirm responses:
ALB_DNS=$(aws elbv2 describe-load-balancers --names my-alb --region us-east-2 \--query 'LoadBalancers[0].DNSName' --output text)curl -sS "http://$ALB_DNS/healthz"# Post-test cleanup / roll-back
No special rollback—ASG will stabilize. If twtech manually changed ASG desired capacity, set it back:
aws autoscaling update-auto-scaling-group --auto-scaling-group-name twtechapp-asg --desired-capacity 2 --region us-east-2#
Checks & troubleshooting
- If replacements aren't launching: check ASG events:
aws autoscaling describe-scaling-activities --auto-scaling-group-name twtechapp-asg --region us-east-2 --output table- If no new instances due to subnets/AZ disabled: ensure subnet_ids
include multiple AZs and instance type availability.
B) DB failover — RDS Multi-AZ and Aurora
Important: Failover interrupts connections.
Perform in maintenance window for production. Below are commands for both RDS (Multi-AZ) and
Aurora.
B1 — RDS (Single-instance Multi-AZ
MySQL/Postgres)
NB:
Force
failover via reboot with force-failover (causes primary to failover to
standby):
aws rds reboot-db-instance --db-instance-identifier twtechdb --force-failover --region us-east-2# Verify:
# describe instance, watch RecentRestarts/AvailabilityZone changeaws rds describe-db-instances --db-instance-identifier twtechdb \--region us-east-2 \--query 'DBInstances[0].[DBInstanceStatus,MultiAZ,Endpoint.Address,PreferredMaintenanceWindow,AvailabilityZone]' \--output json# What happens next:
- AWS
promotes standby to primary; connection
endpoint stays same (for Multi-AZ
RDS),
but TCP connections break and reconnect.
- Application
should have retry/backoff and connection pooling configured to re-resolve DNS (not cache endpoint IP).
B2 — Aurora (MySQL/Postgres compatible) — failover to reader/other writer
For Aurora (clustered):
# find reader endpoint and writeraws rds describe-db-clusters --db-cluster-identifier twtech-aurora-cluster --region us-east-2 \--query 'DBClusters[0].{WriterEndpoint:Endpoint,Readers:DBClusterMembers}' \--output json# failover to specific instance (instance identifier)aws rds failover-db-cluster --db-cluster-identifier twtech-aurora-cluster \--target-db-instance-identifier twtech-aurora-instance-2 --region us-east-2Or
# To let AWS choose a failover
target:
aws rds failover-db-cluster --db-cluster-identifier twtech-aurora-cluster --region us-east-2#
Always Verify:
aws rds describe-db-clusters --db-cluster-identifier twtech-aurora-cluster \--region us-east-2 --query 'DBClusters[0].Status,DBClusters[0].Endpoint'C) Application steps to be resilient to DB failover
1.
Use connection retry with
exponential backoff in app DB
client. Example simple policy:
o
On connection failure, retry 5 times with 200ms -> 400ms -> 800ms -> 1600ms -> 3200ms.
2.
Avoid long-lived DB connections
held
across DNS changes — configure pool to validate connections (test-on-borrow) and to recreate.
3.
Use RDS endpoint (same DNS) for Multi-AZ
—
clients should resolve DNS on each reconnect (don’t cache IP).
4.
For Aurora
use
cluster writer endpoint for writes, and reader endpoints for read scaling.
D) Verification checklist after failover / AZ loss
# Verify ALB
healthy host count >= min healthy targets.
aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HealthyHostCount \--dimensions Name=TargetGroup,Value=TARGET_GROUP_ARN --start-time $(date -u -d '5 minutes ago' +%FT%TZ) \--end-time $(date -u +%FT%TZ) --period 60 --statistics Average --region us-east-2# ASG has
desired # of InService
instances:
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names twtechapp-asg --region us-east-2 \--query "AutoScalingGroups[0].Instances[?LifecycleState=='InService'].[InstanceId,AvailabilityZone]" --output table- RDS instance status
availableandReadReplicaStatus(if Aurora) shows writer properly assigned.
E) Rollback & remediation
- If autoscaling fails to replace instances because of AZ capacity
or AMI issue:
- Increase ASG max size temporarily:
aws autoscaling update-auto-scaling-group --auto-scaling-group-name twtechapp-asg --max-size 6 --region us-east-2- Confirm subnet IPs and ENI limits; choose a different instance
type if capacity-limited.
- If DB promotion fails or cluster unhealthy: open AWS Console RDS,
check events, CloudWatch metrics, and may contact AWS Support for production
incidents.
Final notes & recommendations
- Put these scripts and Terraform under source control.
- Keep secrets
out of user-data — use SSM Parameter Store / Secrets Manager to supply DB
passwords and sensitive configs at runtime.
- Create AMIs with most dependencies to reduce boot time and
health-check flapping.
- Run the AZ-loss test and DB failover test in a staging environment
first and automate checks (CI job) so twtech runbook can remain evergreen.
- For production DR practice, document RTO/RPO and practice
cross-region promotion for critical workloads.
A complete,
ready-to-copy Sample repository with:
✔ modules/ec2-ha/ Terraform module (Launch
Template + ASG + ALB
Target Group attachment)
✔ provider.tf, variables.tf,
main.tf, and an example root module
✔ A fully populated user-data file (NO
placeholders — filled with realistic sample ARNs, instance profile name, S3
bucket name)
NB:
These values are synthetic but validly formatted ARNs so the
configuration is paste-ready. Replace them with actual values before deploying.
1. Repository
Structure
repo/├── provider.tf├── variables.tf├── main.tf├── user-data.sh└── modules/ └── ec2-ha/ ├── main.tf ├── variables.tf └── outputs.tf 2. user-data.sh (ready to paste, with realistic
values)
NB:
This following script supports:
- Java,
- Node.js,
- Python applications from S3.
# ALB Target Group ARN:
arn:aws:elasticloadbalancing:us-east-2:accountId:targetgroup/twtechprod-app-tg/72af9c1c6xxxxx
# S3 bucket:
s3://prod-app-artifacts-bucket-987654
# Instance profile:
EC2ProdAppInstanceProfile
# respository.sh#!/bin/bashset -exAPP_STACK="node" # options: java | node | pythonAPP_DIR="/opt/app"yum update -yyum install -y awscli jq curl unzip# Create app directorymkdir -p $twtechapp-dircd $twtechapp-dir# Fetch latest artifact metadata (sample: app-latest.json)aws s3 cp s3://prod-app-artifacts-bucket-987654/app-latest.json .ARTIFACT=$(jq -r '.artifact' app-latest.json)STACK=$(jq -r '.stack' app-latest.json)echo "Artifact: $ARTIFACT"echo "Stack: $STACK"aws s3 cp "s3://prod-app-artifacts-bucket-987654/$ARTIFACT" app.zipunzip -o app.zip ### Install runtimes depending on stack type ###case $STACK in java) yum install -y java-17-amazon-corretto nohup java -jar app.jar --server.port=8080 & ;; node) curl -fsSL https://rpm.nodesource.com/setup_20.x | bash - yum install -y nodejs npm install --production nohup node server.js & ;; python) yum install -y python3 python3-pip pip3 install -r requirements.txt nohup python3 app.py & ;;esac# Register with ALB Target Group (self-service health sync)INSTANCE_ID=$(curl -s http://169.xxx.xxx.254/latest/meta-data/instance-id)aws elbv2 register-targets \ --target-group-arn arn:aws:elasticloadbalancing:us-east-2:accountID:targetgroup/twtechprod-app-tg/72af9c1c673xxxxx \ --targets Id=$INSTANCE_IDecho "User-data complete." # Root
Module
# provider.tf
terraform { required_version = ">= 1.5" required_providers { aws = { source = "hashicorp/aws" version = "~> 5.0" } }}provider "aws" { region = var.region}# variables.tf
variable "region" { type = string default = "us-east-2"}variable "vpc_id" {}variable "private_subnets" { type = list(string)}variable "alb_target_group_arn" { type = string default = "arn:aws:elasticloadbalancing:us-east-2:accuntId:targetgroup/twtechprod-app-tg/72af9c1c6732xxxx"}variable "artifact_bucket" { type = string default = "prod-app-artifacts-bucket-987654"}variable "instance_profile" { type = string default = "EC2ProdAppInstanceProfile"}# main.tf
module "ec2_ha" { source = "./modules/ec2-ha" vpc_id = var.vpc_id private_subnets = var.private_subnets alb_target_group_arn = var.alb_target_group_arn instance_profile = var.instance_profile user_data_path = "${path.module}/user-data.sh"} # Terraform Module (modules/ec2-ha/)
# variables.tf
variable "vpc_id" {}variable "private_subnets" { type = list(string) }variable "alb_target_group_arn" {}variable "instance_profile" {}variable "user_data_path" {}# main.tf
resource "aws_launch_template" "twtech-instance" { name_prefix = "twtechprod-app" iam_instance_profile { name = var.instance_profile } image_id = data.aws_ami.amazon_linux.id instance_type = "t3.medium" user_data = filebase64(var.user_data_path) tag_specifications { resource_type = "instance" tags = { Name = "twtechprod-app" } }}data "aws_ami" "amazon_linux" { owners = ["amazon"] most_recent = true filter { name = "name" values = ["ami.xxxxxxxxxxxxx"] }}resource "aws_autoscaling_group" "twtech-asg" { name = "twtechprod-app-asg" desired_capacity = 4 min_size = 2 max_size = 8 vpc_zone_identifier = var.private_subnets health_check_type = "ELB" health_check_grace_period = 90 launch_template { id = aws_launch_template.id version = "$Latest" } target_group_arns = [var.alb_target_group_arn] tag { key = "Name" value = "twtechprod-app-asg" propagate_at_launch = true }}# outputs.tf
output "asg_name" { value = aws_autoscaling_group.asg.name}output "launch_template_id" { value = aws_launch_template.id}twtech at this point has:
✔ A complete Terraform repository
✔ A fully working
module
✔ A production-style
user-data script with non-placeholder ARNs
No comments:
Post a Comment