twtech Overview of EC2 Instance Recovery:
View:
- How EC2 Instance Recovery works,
- What AWS handles for twtech,
- Best practices for using it with monitoring and automation.
1. The Concept: EC2 Instance Recovery
EC2 Instance Recovery is an automated process that brings a failed EC2 instance back into a healthy state without replacing the instance.
It’s tied to CloudWatch Alarms on EC2 system status checks.
- System Status Checks → Monitor AWS infrastructure health (host hardware, network, power, etc.).
- Instance Recovery → Kicks in when these checks fail, that is, when the fault lies in the AWS infrastructure under the instance rather than in the instance's own OS.
Important:
EC2 Instance Recovery is different from Auto Scaling replacement.
EC2 Instance Recovery restores the same instance (keeping instance ID, private/public IPs, Elastic IPs, attached volumes, and metadata intact).
2. How It Works
- CloudWatch monitors EC2 status checks (system and instance).
- If a system status check fails, a CloudWatch Alarm is triggered.
- The alarm action is "Recover this instance".
- AWS automatically moves the instance onto new healthy hardware and restarts it behind the scenes; data held only in memory is lost.
- Preserves:
  - Instance ID
  - Elastic IPs & private IPs
  - Data on attached EBS volumes
  - Instance metadata (tags, IAM role, placement group membership, etc.)
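The alarm described above can be sketched as a set of parameters for CloudWatch's PutMetricAlarm call. This is a minimal illustration, not twtech's actual configuration: the function name `recovery_alarm_params`, the alarm naming scheme, and the period and evaluation counts are assumptions chosen for the example.

```python
def recovery_alarm_params(instance_id, region, period=60, evaluation_periods=2):
    """Build keyword arguments for a CloudWatch alarm that recovers an instance.

    The alarm watches StatusCheckFailed_System and, once it fires, invokes
    the EC2 "recover" automation action for the given region.
    """
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": period,                          # seconds per evaluation window
        "EvaluationPeriods": evaluation_periods,   # consecutive failing windows
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


params = recovery_alarm_params("i-0123456789abcdef0", "us-east-1")
print(params["AlarmActions"][0])
```

With boto3 installed and credentials configured, the dictionary could be passed on as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`. The instance ID shown is a placeholder.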
3. Supported Instances
- Instance Types: Only some types support automatic recovery (most Nitro-based instances and newer generations: M5, C5, T3, R5, etc.).
- Storage: Must use EBS-backed volumes (instance store volumes are ephemeral and won’t be preserved).
- Placement Groups: Instances in placement groups can be recovered; the recovered instance stays in its placement group and in the same Availability Zone.
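Support can also be checked programmatically rather than from a list of families. The sketch below filters a response shaped like the output of EC2's DescribeInstanceTypes call, assuming each entry carries an `AutoRecoverySupported` flag; the helper name `recovery_capable_types` and the sample data are made up for illustration.

```python
def recovery_capable_types(describe_response):
    """Return the instance type names flagged as supporting auto recovery.

    Entries without the flag are treated as unsupported.
    """
    return sorted(
        info["InstanceType"]
        for info in describe_response.get("InstanceTypes", [])
        if info.get("AutoRecoverySupported", False)
    )


# Illustrative sample only; real values come from the EC2 API.
sample = {
    "InstanceTypes": [
        {"InstanceType": "m5.large", "AutoRecoverySupported": True},
        {"InstanceType": "example.unsupported", "AutoRecoverySupported": False},
        {"InstanceType": "example.unknown"},
    ]
}
print(recovery_capable_types(sample))
```

In practice the response would come from something like `boto3.client("ec2").describe_instance_types(InstanceTypes=[...])`, so the answer reflects what the API reports for the account's region.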
4. Key Features
- Automatic hardware migration → if the host fails, AWS moves the instance to a healthy host.
- No user intervention required (after alarm setup).
- Minimal downtime compared to manual stop/start.
- Preserves identity → critical for stateful workloads.
5. When Recovery Triggers
- Loss of power to the underlying hardware.
- Physical host hardware failure.
- AWS networking issue affecting host.
- Software issues on AWS’s physical host layer.
Note:
If the instance status check fails (OS crash, kernel panic, app crash), EC2 recovery will not fix it; twtech would need a reboot, a rebuild, or Auto Scaling replacement.
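The split between the two checks can be expressed as a small decision function. The sketch assumes an input shaped like one entry of `InstanceStatuses` from EC2's DescribeInstanceStatus call, where each check reports a status such as "ok" or "impaired"; the function name and the returned labels are hypothetical.

```python
def suggested_response(status_entry):
    """Map one instance-status entry to a suggested response.

    A failed system check points at AWS infrastructure, which recovery
    addresses. A failed instance check points at the OS or application,
    which recovery does not address.
    """
    system = status_entry.get("SystemStatus", {}).get("Status")
    instance = status_entry.get("InstanceStatus", {}).get("Status")
    if system == "impaired":
        return "recover"
    if instance == "impaired":
        return "reboot-or-replace"
    return "none"


entry = {
    "InstanceId": "i-0123456789abcdef0",  # placeholder ID
    "SystemStatus": {"Status": "ok"},
    "InstanceStatus": {"Status": "impaired"},
}
print(suggested_response(entry))
```

The system check is tested first because a failing host usually makes the instance check fail as well, and recovery is the response that addresses the host.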
6. Architecture Flow
- EC2 Instance → monitored by CloudWatch Status Checks
- CloudWatch Alarm (on “StatusCheckFailed_System”)
- Alarm → Recover Instance action
- AWS migrates instance → launches it on new healthy hardware → reattaches volumes, network, and metadata
7. Advanced Use Cases
- Critical Single-Instance Workloads (databases, legacy apps, etc.).
- Production Environments where instance ID persistence is required (vs Auto Scaling replacement).
- High-availability setups where recovery is a faster safety net than manual intervention.
- Cost-conscious workloads: recovery avoids re-provisioning.
8. Best Practices
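- Status check metrics are published at 1-minute granularity by default, so recovery alarms do not depend on detailed monitoring; enable detailed monitoring when other metrics (CPU, network, etc.) also need 1-minute granularity.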
- Create CloudWatch Alarms for both:
  - StatusCheckFailed_Instance (OS-level issues)
  - StatusCheckFailed_System (hardware-level issues → recovery supported)
- Combine recovery with:
  - Auto Scaling Groups for instance replacement if app fails.
  - Route 53 health checks to reroute traffic during downtime.
- Use CloudWatch Composite Alarms to escalate when multiple instances in a tier fail.
- Ensure EBS volumes are backed up (snapshots) in case recovery does not succeed.
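As one way to picture the composite-alarm practice, the sketch below builds a rule expression that is true when at least two per-instance alarms are in ALARM state at once. It writes the condition out pairwise using the `ALARM(...)`, `AND`, and `OR` rule syntax; the function name and the alarm names are assumptions for the example.

```python
from itertools import combinations


def at_least_two_in_alarm(alarm_names):
    """Build a composite alarm rule that fires when any two alarms are in ALARM."""
    if len(alarm_names) < 2:
        raise ValueError("need at least two alarm names")
    # One clause per unordered pair; the rule is true if any pair is alarming.
    clauses = (
        f'(ALARM("{a}") AND ALARM("{b}"))'
        for a, b in combinations(alarm_names, 2)
    )
    return " OR ".join(clauses)


rule = at_least_two_in_alarm(["recover-web-1", "recover-web-2", "recover-web-3"])
print(rule)
```

The resulting string could be passed as `AlarmRule` to `put_composite_alarm` on a boto3 CloudWatch client, with a notification target in `AlarmActions`. The pairwise form grows quadratically, so it suits small tiers.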
Final thoughts:
- EC2 Instance Recovery is AWS’s built-in safeguard against hardware and system-level failures, giving twtech automatic resilience without losing its instance’s identity.
- For app-level resilience, combine EC2 Instance Recovery with Auto Scaling, health checks, and backups.