An Overview of the Maintenance Windows Feature of AWS Systems Manager (SSM).
Focus:
- Tailored for enterprise cloud engineering (aligned with operations, regulations, and large-scale environments).
Breakdown:
- Intro,
- Key Components,
- Benefits and Features,
- Pricing,
- The concept: Maintenance Windows,
- Core Architecture Components,
- Integration with Patch Manager,
- IAM & Security Model,
- Logging, Auditing & Compliance,
- Advanced Enterprise Patterns,
- Common Failure Modes (and How to Avoid Them),
- When to Use and Not Use Maintenance Windows,
- Final thoughts.
Intro:
- Maintenance Windows, a feature of AWS Systems Manager (SSM), allows twtech to schedule recurring periods for performing potentially disruptive administrative tasks across its AWS resources.
- This SSM feature is commonly used to automate operating system patching, driver updates, and software installations during low-traffic periods.
Key Components
Schedule:
- Defines when and how often the window runs using Cron or Rate expressions.
Duration and Cutoff:
- Specifies the total length of the window (e.g., 4 hours) and a "cutoff" time (e.g., 1 hour before the end) to prevent new tasks from starting as the window closes.
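The relationship between duration and cutoff is plain arithmetic. The sketch below uses a hypothetical helper, `last_task_start` (not part of any AWS SDK), to work out the latest moment a new task may start for the 4-hour window with a 1-hour cutoff used as the example above.

```python
from datetime import datetime, timedelta


def last_task_start(window_open: datetime, duration_hours: int, cutoff_hours: int) -> datetime:
    """Return the latest time at which a new task may still start.

    Illustrative helper only: the window closes at window_open + duration,
    and no new task starts during the final cutoff_hours.
    """
    if not 0 <= cutoff_hours < duration_hours:
        raise ValueError("cutoff must be non-negative and shorter than the duration")
    window_close = window_open + timedelta(hours=duration_hours)
    return window_close - timedelta(hours=cutoff_hours)


# A 4-hour window opening at 02:00 with a 1-hour cutoff.
opens = datetime(2025, 1, 15, 2, 0)
deadline = last_task_start(opens, duration_hours=4, cutoff_hours=1)
print(deadline.isoformat())  # 2025-01-15T05:00:00
```

In this example the window stays open until 06:00, but nothing new is started after 05:00.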
Targets:
- The specific resources the tasks will act upon. These can be selected manually, via tags, or by using AWS Resource Groups.
Tasks:
- The automated actions performed during the window. Supported task types include:
Run Command:
- Executing configuration scripts on managed instances.
Automation:
- Running multi-step Systems Manager Automation workflows.
Lambda Functions:
- Triggering serverless AWS Lambda functions.
Step Functions:
- Initiating AWS Step Functions state machine tasks.
Benefits and Features
Centralized History:
- Systems Manager maintains a 30-day history of all maintenance window executions, allowing twtech to track task status without logging into individual servers.
Error Control:
- twtech can set velocity and error thresholds, such as stopping a task if it fails on more than a specific number of instances.
Time Zone Support:
- Windows can be scheduled in specific local time zones rather than just UTC, ensuring maintenance aligns with local business hours.
Hybrid Management:
- Tasks can be scheduled for both Amazon EC2 instances and on-premises servers or virtual machines managed by Systems Manager.
Pricing
- There is no additional charge to use the Maintenance Windows feature itself.
- However, twtech has to pay for the underlying AWS resources consumed during the maintenance, such as EC2 instance hours or Lambda invocations.
The concept: Maintenance Windows (SSM)
- Maintenance Windows are a Systems Manager orchestration feature that lets you define when and how operational tasks run on managed resources.
They answer four key enterprise questions:
1. When can work happen?
2. What tasks should run?
3. On which resources?
4. In what order and with what controls?
NB:
- They are critical for patching, compliance, availability protection, and change management.
Core Architecture Components
1. Maintenance Window (MW)
NB:
The container object that defines:
- Schedule
  - Cron or rate expression
  - Timezone support (critical for global orgs)
- Duration
  - Total time the window stays open
- Cutoff
  - How long before the window ends that new tasks stop being started
- Enabled / Disabled state
NB:
Think of it as the change-approved time boundary
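As a minimal sketch, the window properties listed above map onto a request like the one below. The field names follow the SSM `CreateMaintenanceWindow` API as exposed by boto3; the helper `build_window_request`, the window name, the cron expression, and the local range checks are assumptions made for illustration, and nothing is sent to AWS.

```python
def build_window_request(name, schedule, schedule_timezone, duration_hours, cutoff_hours):
    """Return kwargs shaped like the SSM CreateMaintenanceWindow request.

    The range checks are local sanity checks assumed for this sketch.
    """
    if not 1 <= duration_hours <= 24:
        raise ValueError("duration must be between 1 and 24 hours")
    if not 0 <= cutoff_hours < duration_hours:
        raise ValueError("cutoff must be shorter than the duration")
    return {
        "Name": name,
        "Schedule": schedule,                   # cron(...) or rate(...) expression
        "ScheduleTimezone": schedule_timezone,  # IANA name, e.g. "Europe/Berlin"
        "Duration": duration_hours,             # hours the window stays open
        "Cutoff": cutoff_hours,                 # hours before close with no new task starts
        "AllowUnassociatedTargets": False,
    }


request = build_window_request(
    name="prod-linux-patching",       # hypothetical window name
    schedule="cron(0 2 ? * SUN *)",   # illustrative: Sundays at 02:00
    schedule_timezone="Europe/Berlin",
    duration_hours=4,
    cutoff_hours=1,
)
print(request)
```

With credentials configured, a dictionary like this could be passed as `boto3.client("ssm").create_maintenance_window(**request)`; verify the schedule expression against the current AWS documentation before relying on it.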
2. Targets
NB:
- Targets define what resources are eligible for tasks.
Supported target types:
- EC2 instances
- On-prem servers (SSM Hybrid)
- Resource Groups
- Tags (most common in enterprises)
Example:
Tag: PatchGroup = Prod-Linux
Best practice:
- Never hardcode instance IDs
- Always use Patch Groups or Environment tags
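The tag-based approach can be sketched as a target registration request. The field names follow the SSM `RegisterTargetWithMaintenanceWindow` API; the helper `build_target_registration` and the window ID are hypothetical, and the tag key and value are taken from the example above.

```python
def build_target_registration(window_id, tag_key, tag_values):
    """Return kwargs shaped like RegisterTargetWithMaintenanceWindow.

    Targets are selected by tag, so instances that come and go are picked
    up automatically without listing instance IDs.
    """
    if not tag_values:
        raise ValueError("at least one tag value is required")
    return {
        "WindowId": window_id,
        "ResourceType": "INSTANCE",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": list(tag_values)}],
    }


registration = build_target_registration(
    window_id="mw-0123456789abcdef0",  # hypothetical window ID
    tag_key="PatchGroup",
    tag_values=["Prod-Linux"],
)
print(registration["Targets"])
```

The same shape works for an environment tag; only `tag_key` and `tag_values` change.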
3. Tasks
NB:
Tasks define what action runs during the window.
Common task documents:
- AWS-RunPatchBaseline
- AWS-RunPowerShellScript
- AWS-RunShellScript
- AWS-RunAnsiblePlaybook
- AWS-StartEC2Instance
- AWS-StopEC2Instance
- Custom SSM Documents
Each task includes:
- SSM Document
- Task priority
- Max concurrency
- Max errors
- IAM service role
- Timeouts
- Invocation parameters
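To show how those task fields fit together, here is a hedged sketch of a Run Command task registration. The field names follow the SSM `RegisterTaskWithMaintenanceWindow` API; the helper `build_task_registration`, all IDs, the role ARN, and the shell command are made up for illustration.

```python
def build_task_registration(window_id, window_target_id, service_role_arn, commands,
                            priority=1, max_concurrency="10%", max_errors="1",
                            timeout_seconds=600):
    """Return kwargs shaped like RegisterTaskWithMaintenanceWindow for a shell task."""
    return {
        "WindowId": window_id,
        "Targets": [{"Key": "WindowTargetIds", "Values": [window_target_id]}],
        "TaskArn": "AWS-RunShellScript",     # SSM Document
        "TaskType": "RUN_COMMAND",
        "ServiceRoleArn": service_role_arn,  # IAM service role
        "Priority": priority,                # task priority
        "MaxConcurrency": max_concurrency,   # passed as a string
        "MaxErrors": max_errors,             # passed as a string
        "TaskInvocationParameters": {
            "RunCommand": {
                "Parameters": {"commands": list(commands)},  # invocation parameters
                "TimeoutSeconds": timeout_seconds,           # timeout
            }
        },
    }


task = build_task_registration(
    window_id="mw-0123456789abcdef0",                              # hypothetical
    window_target_id="11111111-2222-3333-4444-555555555555",       # hypothetical
    service_role_arn="arn:aws:iam::123456789012:role/mw-service",  # hypothetical
    commands=["systemctl is-active myapp"],                        # hypothetical check
)
print(task["TaskType"], task["Priority"], task["MaxConcurrency"])
```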
4. Task Priority & Ordering
NB:
Tasks run in priority order (lower number = higher priority).
Typical enterprise sequence:
1. Pre-maintenance validation
2. Stop application services
3. Apply patches
4. Reboot (if required)
5. Start services
6. Post-maintenance health checks
NB:
- This replaces fragile cron-based automation.
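The ordering rule can be illustrated with a few lines of plain Python. The task names come from the sequence above, while the priority numbers and the helper `execution_stages` are assumptions for this sketch. Tasks that share a priority are grouped into one stage, reflecting that Maintenance Windows schedules equal-priority tasks in parallel.

```python
from itertools import groupby

# (task name, priority) pairs in arbitrary registration order.
registered = [
    ("Apply patches", 3),
    ("Pre-maintenance validation", 1),
    ("Post-maintenance health checks", 6),
    ("Stop application services", 2),
    ("Start services", 5),
    ("Reboot (if required)", 4),
]


def execution_stages(tasks):
    """Group tasks into stages ordered by priority (lower number runs first)."""
    ordered = sorted(tasks, key=lambda task: task[1])
    return [
        (priority, [name for name, _ in group])
        for priority, group in groupby(ordered, key=lambda task: task[1])
    ]


for priority, names in execution_stages(registered):
    print(priority, names)
```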
5. Concurrency & Error Controls
- These controls are enterprise-grade safeguards.
Max Concurrency
- Percentage or fixed number
- Example: 10% or 5
Prevents:
- Patch storms
- Capacity collapse
- Regional brownouts
Max Errors
- Absolute or percentage
- Stops execution when threshold is hit
Critical for:
- Production blast-radius control
- Change failure containment
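The effect of these two controls can be approximated with a small simulation. This is a simplified batch model written for illustration, not the scheduler AWS actually runs: the helpers `resolve_limit` and `run_batches`, the rounding of percentages, and the instance names are all assumptions. The stop rule mirrors the Run Command behavior of halting once the number of errors exceeds the configured maximum.

```python
import math


def resolve_limit(value: str, fleet_size: int, minimum: int = 0) -> int:
    """Turn a setting such as "10%" or "5" into an instance count."""
    if value.endswith("%"):
        count = math.ceil(fleet_size * int(value[:-1]) / 100)  # rounding is assumed
    else:
        count = int(value)
    return max(minimum, count)


def run_batches(instances, failing, max_concurrency, max_errors):
    """Process instances in batches and stop once errors exceed the budget."""
    batch_size = resolve_limit(max_concurrency, len(instances), minimum=1)
    error_budget = resolve_limit(max_errors, len(instances))
    attempted, errors = [], 0
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        attempted.extend(batch)
        errors += sum(1 for instance in batch if instance in failing)
        if errors > error_budget:
            break  # remaining instances are never touched
    return attempted, errors


fleet = [f"i-{n:02d}" for n in range(20)]  # hypothetical instance names
attempted, errors = run_batches(fleet, failing={"i-02", "i-05"},
                                max_concurrency="10%", max_errors="1")
print(len(attempted), errors)  # 6 2
```

In this run only 6 of the 20 instances are attempted before the second failure stops the rollout, which is the blast-radius containment the controls are meant to provide.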
Integration with Patch Manager
- Maintenance Windows are one of the main ways Patch Manager operations are scheduled and executed.
Flow:
1. Patch baseline defines what is approved
2. Patch group defines which instances
3. Maintenance window defines when
4. Task (AWS-RunPatchBaseline) defines how
NB:
- twtech cannot do enterprise patching safely without Maintenance Windows.
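For the "how" step of the flow, the sketch below builds the task fields for `AWS-RunPatchBaseline`. The `Operation` and `RebootOption` parameters and their values belong to that document; the helper name `patch_task_fields` is hypothetical, and the result is only a fragment that would be merged into a full task registration.

```python
VALID_OPERATIONS = {"Scan", "Install"}
VALID_REBOOT_OPTIONS = {"RebootIfNeeded", "NoReboot"}


def patch_task_fields(operation="Install", reboot_option="RebootIfNeeded"):
    """Return the task fields that select and parameterize AWS-RunPatchBaseline."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    if reboot_option not in VALID_REBOOT_OPTIONS:
        raise ValueError(f"unknown reboot option: {reboot_option}")
    return {
        "TaskArn": "AWS-RunPatchBaseline",
        "TaskType": "RUN_COMMAND",
        "TaskInvocationParameters": {
            "RunCommand": {
                # SSM document parameters are passed as lists of strings.
                "Parameters": {
                    "Operation": [operation],
                    "RebootOption": [reboot_option],
                }
            }
        },
    }


scan_only = patch_task_fields(operation="Scan", reboot_option="NoReboot")
print(scan_only["TaskInvocationParameters"]["RunCommand"]["Parameters"])
```

A scan-only task like `scan_only` reports compliance without installing anything, which can be useful ahead of the install window.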
IAM & Security Model
Required IAM Roles
1. Maintenance Window Service Role
- Allows SSM to:
  - Run commands
  - Access logs
  - Interact with EC2, S3, CloudWatch
2. Instance Profile Role
- SSM Agent permissions
- Access to patch repos
- S3 / KMS if encrypted artifacts are used
Security best practices:
- Separate roles for Prod vs Non-Prod
- Least-privilege policies
- Use KMS encryption for logs and outputs
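As a rough sketch of the service role setup, the policy documents below show a trust policy that lets Systems Manager assume the role, plus a least-privilege `iam:PassRole` statement for whoever registers tasks. The helper names and the role ARN are hypothetical; the permissions the role itself needs depend on the tasks twtech runs and are not shown.

```python
import json


def maintenance_window_trust_policy():
    """Trust policy allowing the Systems Manager service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def pass_role_statement(role_arn):
    """Statement letting an operator pass only this role, and only to SSM."""
    return {
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": role_arn,
        "Condition": {"StringEquals": {"iam:PassedToService": "ssm.amazonaws.com"}},
    }


prod_role = "arn:aws:iam::123456789012:role/mw-service-prod"  # hypothetical ARN
print(json.dumps(maintenance_window_trust_policy(), indent=2))
print(json.dumps(pass_role_statement(prod_role), indent=2))
```

Generating one such pair per environment is one way to keep Prod and Non-Prod roles separate.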
Logging, Auditing & Compliance
Maintenance Windows integrate deeply with:
- CloudWatch Logs
- S3 command output
- SSM Compliance
- AWS Config
- CloudTrail
twtech gets:
- Who executed what
- When it ran
- Which instances succeeded or failed
- Patch compliance evidence (SOX, PCI, HIPAA)
NB:
- This record is often used as audit-ready evidence of change execution.
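Command output only reaches S3 and CloudWatch Logs if the task asks for it. A minimal sketch of the relevant Run Command task parameters follows; the field names come from the SSM maintenance window Run Command parameters, while the helper `output_settings`, the bucket, the prefix, and the log group are hypothetical.

```python
def output_settings(bucket, key_prefix, log_group):
    """Return Run Command task parameters that persist command output."""
    return {
        "OutputS3BucketName": bucket,
        "OutputS3KeyPrefix": key_prefix,
        "CloudWatchOutputConfig": {
            "CloudWatchLogGroupName": log_group,
            "CloudWatchOutputEnabled": True,
        },
    }


run_command_params = {
    "Parameters": {"Operation": ["Install"]},
    **output_settings(
        bucket="twtech-ssm-output",         # hypothetical bucket
        key_prefix="maintenance-windows/",  # hypothetical prefix
        log_group="/twtech/ssm/patching",   # hypothetical log group
    ),
}
print(sorted(run_command_params))
```

The resulting dictionary would sit under `TaskInvocationParameters["RunCommand"]` when the task is registered.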
Advanced Enterprise Patterns
1. Environment-Based Windows
| Environment | Window |
| Dev | Daily |
| QA | Weekly |
| Staging | Bi-weekly |
| Prod | Monthly |
NB:
- Same documents, same baselines — different windows.
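One way to express "same documents, different windows" is a single task definition reused across a table of schedules. The schedule expressions below are illustrative guesses at daily, weekly, bi-weekly, and monthly cadences and should be checked against the AWS schedule expression reference; `windows_for_environments` and the `Task` key are a local planning structure, not an AWS request shape.

```python
# Illustrative schedule per environment (assumed values, verify before use).
ENVIRONMENT_SCHEDULES = {
    "Dev": "cron(0 2 ? * * *)",       # daily at 02:00
    "QA": "cron(0 2 ? * SUN *)",      # weekly on Sunday
    "Staging": "rate(14 days)",       # every two weeks
    "Prod": "cron(0 2 ? * SUN#2 *)",  # second Sunday of each month
}

# One shared task definition reused by every window.
SHARED_TASK = {"TaskArn": "AWS-RunPatchBaseline", "TaskType": "RUN_COMMAND"}


def windows_for_environments(schedules, task):
    """Pair each environment's schedule with a copy of the shared task."""
    return [
        {"Name": f"patch-{env.lower()}", "Schedule": expression, "Task": dict(task)}
        for env, expression in schedules.items()
    ]


plan = windows_for_environments(ENVIRONMENT_SCHEDULES, SHARED_TASK)
for window in plan:
    print(window["Name"], window["Schedule"])
```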
2. Follow-the-Sun Patching
- Region-specific timezones
- Staggered maintenance windows
- Global fleet coverage with zero overlap
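Whether staggered regional windows really have zero overlap can be checked by converting each local window to UTC. The sketch below assumes fixed UTC offsets (so it ignores daylight saving time), a 02:00 local start, and a 4-hour duration; the region-to-offset mapping and both helper names are illustrative assumptions.

```python
from datetime import date, datetime, timedelta, timezone
from itertools import combinations


def utc_interval(local_date, local_hour, utc_offset_hours, duration_hours):
    """Convert a local window start to a (start, end) pair in UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(local_date.year, local_date.month, local_date.day,
                     local_hour, tzinfo=tz).astimezone(timezone.utc)
    return start, start + timedelta(hours=duration_hours)


def overlapping_pairs(windows):
    """Return the names of every pair of windows whose UTC intervals intersect."""
    clashes = []
    for (name_a, (start_a, end_a)), (name_b, (start_b, end_b)) in combinations(windows.items(), 2):
        if start_a < end_b and start_b < end_a:
            clashes.append((name_a, name_b))
    return clashes


day = date(2025, 1, 15)
# Assumed January offsets from UTC, in hours.
offsets = {"ap-northeast-1": 9, "eu-central-1": 1, "us-east-1": -5}
windows = {region: utc_interval(day, 2, offset, 4) for region, offset in offsets.items()}
print(overlapping_pairs(windows))  # []
```

For production use, real time zone data (for example Python's `zoneinfo`) would be a better fit than fixed offsets.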
3. Blue/Green or Tiered Patching
- App tier A patched first
- Validation tasks
- App tier B patched second
NB:
Reduces availability risk without load balancer changes.
4. Change-Managed Automation
- Maintenance Window ID referenced in:
  - Change tickets
  - Incident runbooks
  - Compliance reports
NB:
- Some orgs require MW ID before approving patching.
Common Failure Modes (and How to Avoid Them)
| Issue | Root Cause | Fix |
| Tasks don’t run | No targets resolved | Verify tags / resource groups |
| Instances skipped | Wrong Patch Group | Standardize tagging |
| Reboots missed | Missing reboot option | Use |
| Timeout failures | Window too short | Increase duration |
| Permissions denied | Wrong IAM role | Validate service role |
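Several of these failure modes can be caught before a window ever opens. The sketch below is a hypothetical local lint, `preflight`, over simplified window and task dictionaries; the flattened key layout and the heuristics (for example, comparing the task timeout with the time left before the cutoff) are assumptions, not AWS validation rules.

```python
def preflight(window, task):
    """Return a list of likely problems found in a window/task configuration."""
    problems = []

    # "Tasks don't run": no targets, or targets with no values to resolve.
    targets = task.get("Targets") or []
    if not any(target.get("Values") for target in targets):
        problems.append("no targets resolved")

    # "Timeout failures": the task may need more time than the window offers.
    usable_seconds = (window["Duration"] - window["Cutoff"]) * 3600
    if task.get("TimeoutSeconds", 0) > usable_seconds:
        problems.append("task timeout exceeds usable window time")

    # "Permissions denied": the service role is missing or not a role ARN.
    role = task.get("ServiceRoleArn", "")
    if not (role.startswith("arn:") and ":role/" in role):
        problems.append("service role ARN missing or malformed")

    return problems


window = {"Duration": 2, "Cutoff": 1}
bad_task = {"Targets": [], "TimeoutSeconds": 7200, "ServiceRoleArn": ""}
good_task = {
    "Targets": [{"Key": "tag:PatchGroup", "Values": ["Prod-Linux"]}],
    "TimeoutSeconds": 1800,
    "ServiceRoleArn": "arn:aws:iam::123456789012:role/mw-service",  # hypothetical
}
print(preflight(window, bad_task))
print(preflight(window, good_task))  # []
```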
When to Use and Not Use Maintenance Windows
Use when:
- Running planned, repeatable ops
- Enforcing change windows
- Managing large fleets
- Meeting compliance requirements
Don’t use when:
- Responding to immediate break/fix incidents (use Run Command directly)
- Building event-driven automation (use EventBridge + Automation)
Final thoughts
Maintenance Windows are:
- The execution engine of enterprise operations
- The guardrail for patching and compliance
- A replacement for brittle cron + SSH workflows
NB:
- In mature AWS environments, nothing touches production outside a Maintenance Window.