AWS Systems Manager (SSM) Maintenance Windows - Overview.
Scope:
- Intro,
- Key Components,
- Benefits and Features,
- Pricing,
- The concept of SSM Maintenance Windows (Deep Dive),
- Core Architecture Components,
- Integration with Patch Manager,
- IAM & Security Model,
- Logging, Auditing & Compliance,
- Advanced Enterprise Patterns,
- Common Failure Modes (and How to Avoid Them),
- When to Use and Not Use Maintenance Windows,
- Final thoughts.
Intro:
- AWS Systems Manager (SSM) Maintenance Windows allow twtech to schedule
recurring periods for performing potentially disruptive administrative tasks
across its AWS resources.
- This SSM feature is commonly used to automate:
- Operating system patching,
- Driver updates,
- And software installations during low-traffic periods.
Key Components
Schedule:
- Defines when and how often the window runs using Cron or Rate expressions.
Duration & Cutoff:
- Specifies the total length of the window (e.g., 4 hours) and a "cutoff" time (e.g., 1 hour before the end).
- The cutoff prevents new tasks from starting as the window closes.
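The relationship between duration and cutoff is simple arithmetic. The sketch below works it through for the 4-hour window with a 1-hour cutoff mentioned above; the function name and the 02:00 start time are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def last_task_start(window_open: datetime, duration_hours: int, cutoff_hours: int) -> datetime:
    """Return the latest moment at which a new task may still start."""
    if not 0 <= cutoff_hours < duration_hours:
        raise ValueError("cutoff must be shorter than the window duration")
    window_close = window_open + timedelta(hours=duration_hours)
    return window_close - timedelta(hours=cutoff_hours)


# Hypothetical window that opens at 02:00 and stays open for 4 hours.
opens = datetime(2025, 1, 5, 2, 0)
deadline = last_task_start(opens, duration_hours=4, cutoff_hours=1)
print(deadline.strftime("%H:%M"))  # 05:00
```

Tasks that started before 05:00 in this example are still inside the window; the cutoff only blocks new starts.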
Targets:
- These are specific resources that tasks will act upon.
- These can be selected manually, via tags, or by using AWS Resource Groups.
Tasks:
- These are automated actions performed during the window.
- Supported task types include:
Run Command
- Executing configuration scripts on managed instances.
Automation
- Running multi-step Systems Manager Automation workflows.
Lambda Functions
- Triggering serverless AWS Lambda functions.
Step Functions
- Initiating AWS Step Functions state machine tasks.
Benefits & Features
Centralized History
- Systems Manager maintains a 30-day history of all maintenance window executions.
- This allows twtech to track task status without logging into individual servers.
Error Control
- twtech can set velocity and error thresholds, such as stopping a task if it fails on more than a specific number of instances.
Time Zone Support
- Windows can be scheduled in specific local time zones rather than just UTC, ensuring maintenance aligns with local business hours.
Hybrid Management
- Tasks can be scheduled for both Amazon EC2 instances and on-premises servers or virtual machines managed by Systems Manager.
Pricing
- There is no additional charge to use the Maintenance Windows feature itself.
- However, twtech has to pay for the underlying AWS resources consumed during the maintenance, such as:
- EC2 instance hours
- Or Lambda invocations
The concept of SSM Maintenance Windows (Deep Dive)
- Maintenance Windows are a Systems Manager orchestration feature that lets twtech define when and how operational tasks run on managed resources.
- Maintenance Windows answer four key enterprise questions:
1. When can work happen?
2. What tasks should run?
3. On which resources?
4. In what order and with what controls?
NB:
- Maintenance Windows are critical for:
- Patching,
- Compliance,
- Availability protection,
- And change management.
Core Architecture Components
1. Maintenance Window (MW)
NB:
The container object that defines:
- Schedule
- Cron or rate expression
- Timezone support (critical for global orgs)
- Duration
- Total time the window stays open
- Cutoff
- How long before the window ends new tasks stop being started
- Enabled / Disabled state
NB:
- Think of it as the change-approved time boundary
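The container object described above maps onto the parameters of the boto3 `create_maintenance_window` call. This is a minimal sketch that only builds the request; the helper name, window name, schedule, and time zone are assumptions chosen for illustration, not values from twtech's environment.

```python
def build_window_request(name, schedule, timezone, duration_hours, cutoff_hours):
    """Build keyword arguments for boto3's ssm.create_maintenance_window()."""
    if not 1 <= duration_hours <= 24:
        raise ValueError("Duration must be between 1 and 24 hours")
    if not 0 <= cutoff_hours < duration_hours:
        raise ValueError("Cutoff must be shorter than Duration")
    return {
        "Name": name,
        "Schedule": schedule,             # cron or rate expression
        "ScheduleTimezone": timezone,     # IANA time zone name
        "Duration": duration_hours,       # total hours the window stays open
        "Cutoff": cutoff_hours,           # hours before the end with no new tasks
        "AllowUnassociatedTargets": False,
    }


# Hypothetical example: every Sunday at 02:00 New York time, 4 hours long.
request = build_window_request(
    name="prod-linux-patching",
    schedule="cron(0 2 ? * SUN *)",
    timezone="America/New_York",
    duration_hours=4,
    cutoff_hours=1,
)

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   window_id = boto3.client("ssm").create_maintenance_window(**request)["WindowId"]
```

Keeping the request as a plain dictionary makes it easy to review in a change ticket before anything is created.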
2. Targets
NB:
- Targets define what resources are eligible for tasks.
Supported target types:
- EC2 instances
- On-prem servers (SSM Hybrid)
- Resource Groups
- Tags (most common in enterprises)
Sample Tag: PatchGroup = Prod-Linux
Best practice:
- Never hardcode instance IDs
- Always use Patch Groups or Environment tags
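Tag-based targeting can be sketched as a request for the boto3 `register_target_with_maintenance_window` call. The helper name and the window ID below are placeholders; the tag key and value reuse the sample tag above.

```python
def build_tag_target_request(window_id, tag_key, tag_values):
    """Build keyword arguments for ssm.register_target_with_maintenance_window()."""
    if not tag_values:
        raise ValueError("at least one tag value is required")
    return {
        "WindowId": window_id,
        "ResourceType": "INSTANCE",
        # Tag targets use the "tag:<key>" form instead of instance IDs.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": list(tag_values)}],
    }


# "mw-0123456789abcdef0" is a placeholder window ID, not a real resource.
target_request = build_tag_target_request(
    "mw-0123456789abcdef0", "PatchGroup", ["Prod-Linux"]
)
print(target_request["Targets"])
```

Because the target is resolved from the tag at run time, newly launched instances that carry the tag are picked up without editing the window.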
3. Tasks
NB:
- Tasks define what action runs during the window.
Common task types:
- AWS-RunPatchBaseline
- AWS-RunPowerShellScript
- AWS-RunShellScript
- AWS-RunAnsiblePlaybook
- AWS-StartEC2Instance
- AWS-StopEC2Instance
- Custom SSM Documents
Each task includes:
- SSM Document
- Task priority
- Max concurrency
- Max errors
- IAM service role
- Timeouts
- Invocation parameters
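The task fields listed above line up with the parameters of the boto3 `register_task_with_maintenance_window` call. The following is a sketch for a Run Command task only; the helper name, IDs, role ARN, and shell command are all hypothetical.

```python
def build_run_command_task(window_id, window_target_id, service_role_arn,
                           document, parameters, priority,
                           max_concurrency, max_errors, timeout_seconds=600):
    """Build keyword arguments for ssm.register_task_with_maintenance_window()."""
    return {
        "WindowId": window_id,
        "Targets": [{"Key": "WindowTargetIds", "Values": [window_target_id]}],
        "TaskArn": document,                 # SSM document name
        "TaskType": "RUN_COMMAND",
        "ServiceRoleArn": service_role_arn,  # IAM service role
        "Priority": priority,                # lower number runs first
        "MaxConcurrency": max_concurrency,   # string, e.g. "10%" or "5"
        "MaxErrors": max_errors,             # string, e.g. "1" or "5%"
        "TaskInvocationParameters": {
            "RunCommand": {
                "Parameters": parameters,    # document parameters
                "TimeoutSeconds": timeout_seconds,
            }
        },
    }


# All identifiers below are placeholders for illustration.
task_request = build_run_command_task(
    window_id="mw-0123456789abcdef0",
    window_target_id="11111111-2222-3333-4444-555555555555",
    service_role_arn="arn:aws:iam::111122223333:role/example-mw-service-role",
    document="AWS-RunShellScript",
    parameters={"commands": ["systemctl stop example-app"]},
    priority=2,
    max_concurrency="10%",
    max_errors="1",
)
```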
4. Task Priority & Ordering
NB:
- Tasks run in priority order (lower number = higher priority).
- Typical enterprise sequence:
1. Pre-maintenance validation
2. Stop application services
3. Apply patches
4. Reboot (if required)
5. Start services
6. Post-maintenance health checks
NB:
- This replaces fragile cron-based automation.
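The ordering rule can be illustrated with plain Python. This sketch sorts a hypothetical task list by priority and groups tasks that share a priority, on the assumption that equal-priority tasks are scheduled together; the task names are taken from the sequence above and the priority numbers are made up.

```python
from itertools import groupby


def execution_waves(tasks):
    """Group task names into waves, lowest priority number first."""
    ordered = sorted(tasks, key=lambda t: t["Priority"])
    return [
        [t["Name"] for t in group]
        for _, group in groupby(ordered, key=lambda t: t["Priority"])
    ]


# Registered in arbitrary order; the priority decides the sequence.
tasks = [
    {"Name": "Apply patches", "Priority": 3},
    {"Name": "Post-maintenance health checks", "Priority": 6},
    {"Name": "Pre-maintenance validation", "Priority": 1},
    {"Name": "Start services", "Priority": 5},
    {"Name": "Stop application services", "Priority": 2},
    {"Name": "Reboot (if required)", "Priority": 4},
]

for wave in execution_waves(tasks):
    print(wave)
```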
5. Concurrency & Error Controls
- These controls are enterprise-grade safeguards.
Max Concurrency
- Percentage or fixed number
- Sample: 10% or 5
Prevents:
- Patch storms
- Capacity collapse
- Regional brownouts
Max Errors
- Absolute or percentage
- Stops execution when threshold is hit
Critical for:
- Production blast-radius control
- Change failure containment
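To make the two controls concrete, here is a simplified simulation, not the service's actual scheduler. It assumes targets are dispatched in fixed batches, that percentages round down, and that the task stops once the error count exceeds the Max Errors value; all of these are modelling assumptions.

```python
def resolve_limit(value: str, fleet_size: int) -> int:
    """Convert a limit such as '10%' or '5' into an absolute number."""
    if value.endswith("%"):
        return fleet_size * int(value[:-1]) // 100  # assumption: round down
    return int(value)


def run_in_batches(outcomes, max_concurrency, max_errors):
    """Simulate a task over `outcomes` (True = success, in dispatch order).

    Returns (targets attempted, errors seen) when the task finishes or stops.
    """
    fleet = len(outcomes)
    batch = max(1, resolve_limit(max_concurrency, fleet))
    allowed = resolve_limit(max_errors, fleet)
    attempted = errors = 0
    while attempted < fleet:
        current = outcomes[attempted:attempted + batch]
        attempted += len(current)
        errors += current.count(False)
        if errors > allowed:
            break  # threshold exceeded: remaining targets are never touched
    return attempted, errors


# Hypothetical fleet of 20 targets where three early targets fail.
outcomes = [False, True] * 3 + [True] * 14
print(run_in_batches(outcomes, max_concurrency="10%", max_errors="1"))  # (4, 2)
```

In this run only 4 of 20 targets are touched before the task stops, which is the blast-radius control described above.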
Integration with Patch Manager
- Maintenance Windows are a primary way to schedule when Patch Manager operations actually execute.
Flow:
1. Patch baseline defines what is approved
2. Patch group defines which instances
3. Maintenance window defines when
4. Task (AWS-RunPatchBaseline) defines how
NB:
- twtech cannot do enterprise patching safely without Maintenance Windows.
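The "how" step of the flow can be sketched as a maintenance window task that runs AWS-RunPatchBaseline. The helper name, IDs, role ARN, and the concurrency and error values are placeholders; `Operation` and `RebootOption` are parameters of the AWS-RunPatchBaseline document.

```python
def build_patch_task(window_id, window_target_id, service_role_arn,
                     operation="Install", reboot_option="RebootIfNeeded"):
    """Build a register_task_with_maintenance_window() request for patching."""
    if operation not in {"Scan", "Install"}:
        raise ValueError("operation must be 'Scan' or 'Install'")
    if reboot_option not in {"RebootIfNeeded", "NoReboot"}:
        raise ValueError("reboot_option must be 'RebootIfNeeded' or 'NoReboot'")
    return {
        "WindowId": window_id,                                  # defines when
        "Targets": [{"Key": "WindowTargetIds", "Values": [window_target_id]}],
        "TaskArn": "AWS-RunPatchBaseline",                      # defines how
        "TaskType": "RUN_COMMAND",
        "ServiceRoleArn": service_role_arn,
        "Priority": 3,
        "MaxConcurrency": "10%",
        "MaxErrors": "1",
        "TaskInvocationParameters": {
            "RunCommand": {
                "Parameters": {
                    "Operation": [operation],
                    "RebootOption": [reboot_option],
                }
            }
        },
    }


# Placeholder identifiers, for illustration only.
patch_task = build_patch_task(
    "mw-0123456789abcdef0",
    "11111111-2222-3333-4444-555555555555",
    "arn:aws:iam::111122223333:role/example-mw-service-role",
)
```

A scan-only window can reuse the same helper with `operation="Scan"` and `reboot_option="NoReboot"`.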
IAM & Security Model
Required IAM Roles
1. Maintenance Window Service Role
- Allows SSM to:
- Run commands
- Access logs
- Interact with:
- EC2,
- S3,
- CloudWatch
2. Instance Profile Role
- SSM Agent permissions
- Access to patch repos
- S3 / KMS if encrypted artifacts are used
Security best practices:
- Separate roles for Prod vs Non-Prod
- Use least-privilege policies
- Use KMS encryption for logs and outputs
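For the Maintenance Window service role, the starting point is a trust policy that lets Systems Manager assume the role. The sketch below builds only that trust document; the function name is an assumption, and the permissions attached to the role are deliberately left out because they depend on what the tasks do.

```python
import json


def maintenance_window_trust_policy():
    """Trust policy allowing the Systems Manager service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy = maintenance_window_trust_policy()
print(json.dumps(trust_policy, indent=2))
```

Creating one such role per environment is one way to keep Prod and Non-Prod separated, as recommended above.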
Logging, Auditing & Compliance
- Maintenance Windows integrate deeply with:
- CloudWatch Logs
- S3 command output
- SSM Compliance
- AWS Config
- CloudTrail
twtech gets:
- Who executed what
- When it ran
- Which instances succeeded or failed
- Patch compliance evidence (SOX, PCI, HIPAA)
NB:
- This is often used as audit-proof change execution.
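Execution history can be turned into simple audit evidence. The sketch below summarizes records shaped like the `WindowExecutions` entries returned by the boto3 `describe_maintenance_window_executions` call; the sample records and the choice of which statuses count as problems are assumptions.

```python
from collections import Counter

# Statuses treated as needing follow-up in this sketch (an assumption).
PROBLEM_STATUSES = {"FAILED", "TIMED_OUT", "CANCELLED"}


def summarize_executions(executions):
    """Return (count per status, IDs of executions that need follow-up)."""
    counts = Counter(e["Status"] for e in executions)
    problems = [
        e["WindowExecutionId"] for e in executions
        if e["Status"] in PROBLEM_STATUSES
    ]
    return dict(counts), problems


# Made-up sample data, not real execution history.
sample = [
    {"WindowExecutionId": "exec-001", "Status": "SUCCESS"},
    {"WindowExecutionId": "exec-002", "Status": "FAILED"},
    {"WindowExecutionId": "exec-003", "Status": "SUCCESS"},
    {"WindowExecutionId": "exec-004", "Status": "TIMED_OUT"},
]

counts, problems = summarize_executions(sample)
print(counts)
print(problems)
```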
Advanced Enterprise Patterns
A. Environment-Based Windows
NB:
- Same documents and same baselines, but different windows.
B. Follow-the-Sun Patching
- Region-specific timezones
- Staggered maintenance windows
- Global fleet coverage with zero overlap
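A zero-overlap claim can be checked by converting each regional window to UTC. This sketch uses fixed UTC offsets to stay self-contained, so it ignores daylight saving time; the regions, offsets, start hours, and durations are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from itertools import combinations


def utc_interval(local_start_hour, utc_offset_hours, duration_hours,
                 day=datetime(2025, 1, 5)):
    """Return the (start, end) of a window in UTC for a given local start hour."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    start = day.replace(hour=local_start_hour, tzinfo=local_tz)
    start_utc = start.astimezone(timezone.utc)
    return start_utc, start_utc + timedelta(hours=duration_hours)


def overlaps(a, b):
    """True when two (start, end) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


# Hypothetical regional windows: 02:00 local time, 4 hours long.
windows = {
    "tokyo": utc_interval(2, +9, 4),
    "frankfurt": utc_interval(2, +1, 4),
    "new-york": utc_interval(2, -5, 4),
}

conflicts = [
    (a, b) for a, b in combinations(windows, 2)
    if overlaps(windows[a], windows[b])
]
print(conflicts)  # []
```

With real schedules, IANA time zone names would be the better input, since they account for daylight saving changes.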
C. Blue/Green or Tiered Patching
- App tier A patched first
- Validation tasks
- App tier B patched second
NB:
- Reduces availability risk without load balancer changes.
D. Change-Managed Automation
- Maintenance Window ID referenced in:
- Change tickets
- Incident runbooks
- Compliance reports
NB:
- Some orgs require MW ID before approving patching.
Common Failure Modes (and How to Avoid Them)
| Issue | Root Cause | Fix |
| Tasks don’t run | No targets resolved | Verify tags / resource groups |
| Instances skipped | Wrong Patch Group | Standardize tagging |
| Reboots missed | Missing reboot option | Use |
| Timeout failures | Window too short | Increase duration |
| Permissions denied | Wrong IAM role | Validate service role |
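Several of these failure modes can be caught before the window opens. The following pre-flight check is a sketch over a hypothetical dictionary describing a planned window; every field name in it is an assumption, not part of any AWS API.

```python
def preflight(window):
    """Return a list of human-readable problems found in a planned window."""
    problems = []
    if not window.get("resolved_targets"):
        problems.append("no targets resolved: verify tags / resource groups")
    if window["cutoff_hours"] >= window["duration_hours"]:
        problems.append("cutoff leaves no time to start tasks")
    if window["estimated_task_hours"] > window["duration_hours"]:
        problems.append("window too short: increase duration")
    if not window.get("service_role_arn"):
        problems.append("no service role set: validate service role")
    return problems


# Hypothetical plan with two of the issues from the table.
plan = {
    "resolved_targets": [],
    "duration_hours": 2,
    "cutoff_hours": 1,
    "estimated_task_hours": 3,
    "service_role_arn": "arn:aws:iam::111122223333:role/example-mw-service-role",
}

for problem in preflight(plan):
    print(problem)
```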
When to Use and Not Use Maintenance Windows
- When to Use Maintenance Windows
- Running planned, repeatable ops
- Enforcing change windows
- Managing large fleets
- Meeting compliance requirements
- When NOT to Use Maintenance Windows
- Immediate break/fix response (use Run Command directly)
- Event-driven automation (use EventBridge + Automation)
Final thoughts
- Maintenance Windows are:
- The execution engine of enterprise operations
- The guardrail for patching and compliance
- A replacement for brittle cron + SSH workflows
NB:
- In mature AWS Prod environments, nothing touches resources outside a Maintenance Window.