Tuesday, December 16, 2025

Maintenance Windows, a feature of Systems Manager | Overview.


An overview of Maintenance Windows, a feature of AWS Systems Manager (SSM).

Focus:

  •        Tailored for enterprise and cloud-engineering teams (aligned with operations, regulatory requirements, and large-scale environments).

Breakdown:

  •        Intro,
  •        Key Components,
  •        Benefits and Features,
  •        Pricing,
  •        The concept:  Maintenance Windows,
  •        Core Architecture Components,
  •        Integration with Patch Manager,
  •        IAM & Security Model,
  •        Logging, Auditing & Compliance,
  •        Advanced Enterprise Patterns,
  •        Common Failure Modes (and How to Avoid Them),
  •        When to Use and Not Use Maintenance Windows,
  •        Final thoughts.

Intro:

  •        Maintenance Windows, a feature of AWS Systems Manager (SSM), allow twtech to schedule recurring periods for performing potentially disruptive administrative tasks across its AWS resources.
  •        This SSM feature is commonly used to automate operating system patching, driver updates, and software installations during low-traffic periods.

Key Components

Schedule:

  •          Defines when and how often the window runs using Cron or Rate expressions.

Duration and Cutoff:

  •          Specifies the total length of the window (e.g., 4 hours) and a "cutoff" time (e.g., 1 hour before the end) to prevent new tasks from starting as the window closes.
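The schedule, duration, and cutoff settings above map onto the parameters of the `create_maintenance_window` call in boto3, the AWS SDK for Python. The sketch below only builds and validates the request and does not call AWS. The window name, the Sunday 02:00 cron expression, and the `America/New_York` time zone are hypothetical examples.

```python
def build_window_request(name: str, schedule: str, duration_hours: int,
                         cutoff_hours: int, timezone: str = "UTC") -> dict:
    """Build keyword arguments for boto3's ssm.create_maintenance_window (not called here)."""
    if not (schedule.startswith("cron(") or schedule.startswith("rate(")):
        raise ValueError("schedule must be a cron(...) or rate(...) expression")
    # The API accepts a window duration of 1 to 24 hours.
    if not 1 <= duration_hours <= 24:
        raise ValueError("duration must be between 1 and 24 hours")
    # This sketch requires the cutoff to be shorter than the window;
    # otherwise no task could ever start.
    if not 0 <= cutoff_hours < duration_hours:
        raise ValueError("cutoff must be at least 0 and less than the duration")
    return {
        "Name": name,
        "Schedule": schedule,
        "ScheduleTimezone": timezone,
        "Duration": duration_hours,
        "Cutoff": cutoff_hours,
        # Only run tasks against explicitly registered targets.
        "AllowUnassociatedTargets": False,
    }


# Hypothetical example: a 4-hour window every Sunday at 02:00 New York time,
# with no new tasks started during the final hour.
request = build_window_request(
    "prod-linux-patching", "cron(0 2 ? * SUN *)", 4, 1, "America/New_York"
)
start_window_hours = request["Duration"] - request["Cutoff"]
print(f"New tasks may start during the first {start_window_hours} hours")
```

With boto3 installed and credentials configured, `boto3.client("ssm").create_maintenance_window(**request)` would submit the same request. The response contains the new `WindowId`.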

Targets:

  •          The specific resources the tasks will act upon. These can be selected manually, via tags, or by using AWS Resource Groups.

Tasks:

  •          The automated actions performed during the window. Supported task types include:

Run Command:

  •     Executing configuration scripts on managed instances.

Automation:

  •     Running multi-step Systems Manager Automation workflows.

Lambda Functions:

  •     Triggering serverless AWS Lambda functions.

Step Functions:

  •     Starting AWS Step Functions state machine executions.

Benefits and Features

Centralized History:

  •          Systems Manager maintains a 30-day history of all maintenance window executions, allowing twtech to track task status without logging into individual servers.

Error Control:

  •          twtech can set rate controls (concurrency) and error thresholds, such as stopping a task once it fails on more than a set number or percentage of targets.

Time Zone Support:

  •          Maintenance windows can be scheduled in a specific time zone rather than only UTC, so maintenance aligns with local business hours.

Hybrid Management:

  •          Tasks can be scheduled for both Amazon EC2 instances and on-premises servers or virtual machines managed by Systems Manager. 

Pricing

  •        There is no additional charge to use the Maintenance Windows feature itself.
  •        However, twtech pays for the underlying AWS resources consumed during maintenance, such as EC2 instance hours or Lambda invocations.

The concept:  Maintenance Windows (SSM)

  • Maintenance Windows are a Systems Manager orchestration feature that lets you define when and how operational tasks run on managed resources.

They answer four key enterprise questions:

1.     When can work happen?

2.     What tasks should run?

3.     On which resources?

4.     In what order and with what controls?

NB:

  • They are critical for patching, compliance, availability protection, and change management.

Core Architecture Components

1. Maintenance Window (MW)

NB:

The container object that defines:

  •         Schedule
    •    Cron or rate expression
    •    Timezone support (critical for global orgs)
  •         Duration
    •    Total time the window stays open
  •         Cutoff
    •    Hours before the window ends after which no new tasks are started
  •         Enabled / Disabled state

NB:

Think of it as the change-approved time boundary.

2. Targets

NB:

  • Targets define what resources are eligible for tasks.

Supported target types:

  •         EC2 instances
  •         On-prem servers (SSM Hybrid)
  •         Resource Groups
  •         Tags (most common in enterprises)

Example:

Tag: PatchGroup = Prod-Linux

Best practice:

  •         Never hardcode instance IDs
  •         Always use Patch Groups or Environment tags
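The tag-based targeting recommended above is expressed in boto3's `register_target_with_window` call as a `tag:<key>` filter. The sketch builds that request. It also previews locally which instances the filter would match, using a hypothetical in-memory inventory. The window ID and instance IDs are made up.

```python
def build_target_registration(window_id: str, tag_key: str, tag_values: list[str],
                              name: str) -> dict:
    """Keyword arguments for boto3's ssm.register_target_with_window (not called here)."""
    if not tag_values:
        raise ValueError("at least one tag value is required")
    return {
        "WindowId": window_id,
        "ResourceType": "INSTANCE",
        # Tag filters use the "tag:<key>" form instead of hardcoded instance IDs.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": list(tag_values)}],
        "Name": name,
    }


def preview_matches(inventory: dict[str, dict[str, str]], tag_key: str,
                    tag_values: list[str]) -> list[str]:
    """Return IDs of instances whose tag matches one of the values (local preview only)."""
    return sorted(iid for iid, tags in inventory.items() if tags.get(tag_key) in tag_values)


inventory = {  # hypothetical fleet
    "i-0aaa": {"PatchGroup": "Prod-Linux", "Environment": "Prod"},
    "i-0bbb": {"PatchGroup": "Prod-Linux", "Environment": "Prod"},
    "i-0ccc": {"PatchGroup": "Dev-Linux", "Environment": "Dev"},
}
registration = build_target_registration(
    "mw-0123456789abcdef0", "PatchGroup", ["Prod-Linux"], "prod-linux-fleet"
)
print(registration["Targets"])
print(preview_matches(inventory, "PatchGroup", ["Prod-Linux"]))
```

An empty preview is the local equivalent of the "no targets resolved" failure mode listed later in this post.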

3. Tasks

NB:

Tasks define what action runs during the window.

Common documents run as tasks:

  •         AWS-RunPatchBaseline
  •         AWS-RunPowerShellScript
  •         AWS-RunShellScript
  •         AWS-RunAnsiblePlaybook
  •         AWS-StartEC2Instance
  •         AWS-StopEC2Instance
  •         Custom SSM Documents

Each task includes:

  •         SSM Document
  •         Task priority
  •         Max concurrency
  •         Max errors
  •         IAM service role
  •         Timeouts
  •         Invocation parameters
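The task settings listed above correspond to fields of boto3's `register_task_with_maintenance_window` request. The sketch assembles a Run Command task for `AWS-RunPatchBaseline`. The window ID, window target ID, role ARN, and chosen limits are hypothetical placeholders, and nothing is sent to AWS.

```python
def build_patch_task(window_id: str, window_target_id: str, service_role_arn: str,
                     priority: int = 1, max_concurrency: str = "10%",
                     max_errors: str = "1", timeout_seconds: int = 3600) -> dict:
    """Keyword arguments for boto3's ssm.register_task_with_maintenance_window (not called here)."""
    return {
        "WindowId": window_id,
        # Point the task at a target group already registered with the window.
        "Targets": [{"Key": "WindowTargetIds", "Values": [window_target_id]}],
        "TaskArn": "AWS-RunPatchBaseline",  # SSM document name
        "TaskType": "RUN_COMMAND",
        "ServiceRoleArn": service_role_arn,
        "Priority": priority,               # lower number runs earlier
        "MaxConcurrency": max_concurrency,  # e.g. "10%" or "5"
        "MaxErrors": max_errors,            # e.g. "1" or "5%"
        "TaskInvocationParameters": {
            "RunCommand": {
                # Document parameters are passed as lists of strings.
                "Parameters": {
                    "Operation": ["Install"],
                    "RebootOption": ["RebootIfNeeded"],
                },
                "TimeoutSeconds": timeout_seconds,
            }
        },
        "Name": "patch-install",
    }


task = build_patch_task(
    "mw-0123456789abcdef0",
    "e32eecb2-646c-4f4b-8ed1-205fbEXAMPLE",
    "arn:aws:iam::111122223333:role/MaintenanceWindowRole",
)
print(task["TaskInvocationParameters"]["RunCommand"]["Parameters"])
```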

4. Task Priority & Ordering

NB:

Tasks run in priority order (lower number = higher priority); tasks that share the same priority are scheduled in parallel.

Typical enterprise sequence:

1.     Pre-maintenance validation
2.     Stop application services
3.     Apply patches
4.     Reboot (if required)
5.     Start services
6.     Post-maintenance health checks
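The ordering rule can be illustrated locally: tasks are sorted by priority, and tasks that share a priority form one parallel stage. The task names are hypothetical labels for the six-step sequence. Two health checks deliberately share a priority.

```python
from itertools import groupby


def execution_stages(tasks: list[dict]) -> list[list[str]]:
    """Group task names into stages: ascending priority, same priority = same stage."""
    # sorted() is stable, so tasks keep their input order within a stage.
    ordered = sorted(tasks, key=lambda t: t["Priority"])
    return [[t["Name"] for t in group]
            for _, group in groupby(ordered, key=lambda t: t["Priority"])]


tasks = [  # hypothetical tasks, listed out of order on purpose
    {"Name": "apply-patches", "Priority": 3},
    {"Name": "pre-validation", "Priority": 1},
    {"Name": "stop-services", "Priority": 2},
    {"Name": "reboot-if-needed", "Priority": 4},
    {"Name": "start-services", "Priority": 5},
    {"Name": "http-health-check", "Priority": 6},
    {"Name": "db-health-check", "Priority": 6},
]
for stage, names in enumerate(execution_stages(tasks), start=1):
    print(stage, names)
```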

NB:

  • This replaces fragile cron-based automation.

5. Concurrency & Error Controls

  • These controls are enterprise-grade safeguards.

Max Concurrency

  •         Percentage or fixed number
  •         Example: 10% or 5

Prevents:

  •         Patch storms
  •         Capacity collapse
  •         Regional brownouts

Max Errors

  •         Absolute or percentage
  •         Stops scheduling the task on further targets once the error count exceeds the threshold

Critical for:

  •         Production blast-radius control
  •        Change failure containment
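The combined effect of these two settings can be approximated with a small simulation. This is a simplification. The real service keeps up to MaxConcurrency invocations in flight as others finish, rather than running fixed batches. The percentage rounding here (round down, with a minimum of one target at a time) is also an assumption of the sketch. The sketch does follow the documented rule that execution stops once the error count exceeds MaxErrors. The instance IDs and failure set are hypothetical.

```python
import math


def resolve_limit(value: str, total: int) -> int:
    """Convert "10%" or "5" into an absolute count (round-down is an assumption)."""
    if value.endswith("%"):
        return math.floor(total * float(value[:-1]) / 100)
    return int(value)


def simulate_rollout(targets: list[str], failing: set[str],
                     max_concurrency: str, max_errors: str):
    """Run targets in fixed-size batches; stop once errors exceed the limit."""
    batch_size = max(1, resolve_limit(max_concurrency, len(targets)))
    error_limit = resolve_limit(max_errors, len(targets))
    attempted, errors = [], 0
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        attempted.extend(batch)
        errors += sum(1 for t in batch if t in failing)
        if errors > error_limit:
            return attempted, errors, "STOPPED"
    return attempted, errors, "COMPLETED"


fleet = [f"i-{n:02d}" for n in range(20)]  # hypothetical 20-instance fleet
attempted, errors, status = simulate_rollout(fleet, {"i-02", "i-05"}, "10%", "1")
print(status, len(attempted), errors)
```

Here the second failure pushes the error count past the limit of 1, so the remaining 14 instances are never touched. That is the blast-radius containment described above.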

Integration with Patch Manager

  • Maintenance Windows are the usual way to run Patch Manager operations on a controlled schedule.

Flow:

1.     Patch baseline defines what is approved
2.     Patch group defines which instances
3.     Maintenance window defines when
4.     Task (AWS-RunPatchBaseline) defines how
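The four-part flow can be modeled as a small lookup. In the sketch, each instance's patch group tag selects a registered baseline; in AWS that registration is done with boto3's `register_patch_baseline_for_patch_group`. Instances without a matching group fall back to a default baseline. The real service picks a default per operating system, which this sketch simplifies to a single value. Baseline IDs, instance IDs, and the window name are hypothetical.

```python
def plan_patch_run(instances: dict[str, dict[str, str]],
                   group_baselines: dict[str, str], default_baseline: str,
                   window_name: str) -> list[dict]:
    """Answer what / which / when / how for every instance (local model only)."""
    plan = []
    for instance_id, tags in sorted(instances.items()):
        # Patch Manager recognizes either "PatchGroup" or "Patch Group" as the tag key.
        group = tags.get("PatchGroup", tags.get("Patch Group"))
        plan.append({
            "instance": instance_id,
            "what": group_baselines.get(group, default_baseline),  # approved patches
            "which": group,                                        # patch group
            "when": window_name,                                   # maintenance window
            "how": "AWS-RunPatchBaseline",                         # task document
        })
    return plan


instances = {"i-0aaa": {"PatchGroup": "Prod-Linux"}, "i-0ccc": {"Environment": "Dev"}}
baselines = {"Prod-Linux": "pb-0prodlinuxEXAMPLE"}
for row in plan_patch_run(instances, baselines, "pb-0defaultEXAMPLE", "prod-linux-patching"):
    print(row)
```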

NB:

  • twtech cannot do enterprise patching safely without Maintenance Windows.

IAM & Security Model

Required IAM Roles

1.     Maintenance Window Service Role

  •    Allows SSM to:
    •   Run commands
    •   Access logs
    •   Interact with EC2, S3, CloudWatch

2.     Instance Profile Role

  •    SSM Agent permissions
  •    Access to patch repos
  •    S3 / KMS if encrypted artifacts are used

Security best practices:

  •         Separate roles for Prod vs Non-Prod
  •         Least-privilege policies
  •         Use KMS encryption for logs and outputs
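The two roles differ mainly in who may assume them. The sketch produces trust policies for both. Systems Manager (`ssm.amazonaws.com`) assumes the maintenance window service role. EC2 (`ec2.amazonaws.com`) assumes the instance profile role, which is commonly given the AWS managed policy `AmazonSSMManagedInstanceCore`. The role names and the Prod/Non-Prod split are hypothetical, and permission policies beyond the trust relationship are left out.

```python
import json

SSM_CORE_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"


def trust_policy(service_principal: str) -> dict:
    """IAM trust policy letting one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }


def role_definitions(environment: str) -> list[dict]:
    """Per-environment role specs, so Prod and Non-Prod never share a role."""
    return [
        {
            "RoleName": f"mw-service-role-{environment}",
            "AssumeRolePolicyDocument": json.dumps(trust_policy("ssm.amazonaws.com")),
        },
        {
            "RoleName": f"ssm-instance-role-{environment}",
            "AssumeRolePolicyDocument": json.dumps(trust_policy("ec2.amazonaws.com")),
            # Field of this sketch only; not an argument of iam.create_role.
            "ManagedPolicyArns": [SSM_CORE_POLICY_ARN],
        },
    ]


for env in ("prod", "nonprod"):
    for role in role_definitions(env):
        print(role["RoleName"])
```

`RoleName` and `AssumeRolePolicyDocument` match the arguments of boto3's `iam.create_role`. The managed policy would be attached separately with `iam.attach_role_policy`, and the instance role still needs an instance profile before EC2 can use it.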

Logging, Auditing & Compliance

Maintenance Windows integrate deeply with:

  •         CloudWatch Logs
  •         S3 command output
  •        SSM Compliance
  •         AWS Config
  •         CloudTrail

twtech gets:

  •         Who executed what
  •         When it ran
  •         Which instances succeeded or failed
  •         Patch compliance evidence (SOX, PCI, HIPAA)

NB:

  • Together, these records are often used as audit evidence that changes were executed in a controlled way.
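For Run Command tasks, S3 and CloudWatch Logs output are configured on the task's `RunCommand` invocation parameters, through `OutputS3BucketName`, `OutputS3KeyPrefix`, and `CloudWatchOutputConfig`. The helper below adds those fields to an existing parameter dictionary without mutating it. The bucket, prefix, and log group names are hypothetical.

```python
import copy


def with_output_logging(run_command: dict, bucket: str, prefix: str,
                        log_group: str) -> dict:
    """Return a copy of RunCommand parameters with S3 and CloudWatch output enabled."""
    params = copy.deepcopy(run_command)  # leave the caller's dict untouched
    params["OutputS3BucketName"] = bucket
    params["OutputS3KeyPrefix"] = prefix
    params["CloudWatchOutputConfig"] = {
        "CloudWatchLogGroupName": log_group,
        "CloudWatchOutputEnabled": True,
    }
    return params


base = {"Parameters": {"Operation": ["Install"]}, "TimeoutSeconds": 3600}
logged = with_output_logging(base, "example-mw-output-bucket",
                             "patching/prod", "/ssm/maintenance-windows/prod")
print(sorted(logged))
```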

Advanced Enterprise Patterns

1. Environment-Based Windows

Environment → Window:

  •         Dev: Daily
  •         QA: Weekly
  •         Staging: Bi-weekly
  •         Prod: Monthly

NB:

  • Same documents, same baselines, different windows.
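One way to apply the "same documents, different windows" pattern is to generate one window definition per environment from a single table. The cron and rate expressions are illustrative assumptions. Bi-weekly is approximated with `rate(14 days)` and monthly with the first Sunday of the month. Check them against the Systems Manager cron and rate expression reference before use.

```python
# Hypothetical schedules; verify syntax against the Systems Manager
# cron/rate reference before using them.
ENVIRONMENT_SCHEDULES = {
    "dev": "cron(0 2 ? * * *)",        # daily at 02:00
    "qa": "cron(0 2 ? * SUN *)",       # weekly, Sunday 02:00
    "staging": "rate(14 days)",        # roughly bi-weekly
    "prod": "cron(0 2 ? * SUN#1 *)",   # monthly, first Sunday 02:00
}


def environment_windows(duration_hours: int = 4, cutoff_hours: int = 1,
                        timezone: str = "UTC") -> list[dict]:
    """One create_maintenance_window request per environment; only the schedule differs."""
    return [
        {
            "Name": f"patching-{env}",
            "Schedule": schedule,
            "ScheduleTimezone": timezone,
            "Duration": duration_hours,
            "Cutoff": cutoff_hours,
            "AllowUnassociatedTargets": False,
        }
        for env, schedule in ENVIRONMENT_SCHEDULES.items()
    ]


for window in environment_windows():
    print(window["Name"], window["Schedule"])
```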

2. Follow-the-Sun Patching

  •         Region-specific timezones
  •         Staggered maintenance windows
  •         Global fleet coverage with zero overlap
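A quick way to confirm that staggered regional windows do not overlap is to convert each local start time to UTC and compare the intervals on a 24-hour clock. The sketch assumes daily windows and fixed UTC offsets, ignoring daylight saving time. The region labels and times are hypothetical.

```python
from itertools import combinations


def utc_interval(local_start_hour: int, utc_offset_hours: int,
                 duration_hours: int) -> tuple[int, int]:
    """(UTC start hour, duration) for a window defined in local time."""
    return (local_start_hour - utc_offset_hours) % 24, duration_hours


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two daily windows share any hour, allowing for wrap past midnight."""
    (start_a, dur_a), (start_b, dur_b) = a, b
    return (start_b - start_a) % 24 < dur_a or (start_a - start_b) % 24 < dur_b


def find_overlaps(windows: dict[str, tuple[int, int]]) -> list[tuple[str, str]]:
    """Every pair of named windows that overlap, in sorted name order."""
    return [(x, y) for x, y in combinations(sorted(windows), 2)
            if overlaps(windows[x], windows[y])]


windows = {  # hypothetical: every region patches at 02:00 local time for 4 hours
    "sydney": utc_interval(2, 10, 4),     # 16:00-20:00 UTC
    "frankfurt": utc_interval(2, 1, 4),   # 01:00-05:00 UTC
    "virginia": utc_interval(2, -5, 4),   # 07:00-11:00 UTC
}
print(find_overlaps(windows))
```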

3. Blue/Green or Tiered Patching

  •         App tier A patched first
  •        Validation tasks
  •         App tier B patched second

NB:

Reduces availability risk without load balancer changes.
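Within a single window, tiered patching can be expressed as ascending task priorities: patch tier A, validate, then patch tier B. The sketch only computes that ordering. The validation document name and window target IDs are hypothetical. Whether a failed validation actually prevents tier B from being patched depends on how the validation and the window are designed, and the sketch does not model that.

```python
def tiered_tasks(tier_targets: dict[str, str], validation_doc: str) -> list[dict]:
    """Interleave patch tasks per tier with validation tasks, using ascending priorities."""
    tasks, priority = [], 1
    tiers = list(tier_targets)  # dict order defines the tier order
    for index, tier in enumerate(tiers):
        targets = [{"Key": "WindowTargetIds", "Values": [tier_targets[tier]]}]
        tasks.append({"Name": f"patch-{tier}", "TaskArn": "AWS-RunPatchBaseline",
                      "Priority": priority, "Targets": targets})
        priority += 1
        if index < len(tiers) - 1:  # validate before moving on to the next tier
            tasks.append({"Name": f"validate-{tier}", "TaskArn": validation_doc,
                          "Priority": priority, "Targets": targets})
            priority += 1
    return tasks


plan = tiered_tasks({"tier-a": "target-id-a", "tier-b": "target-id-b"},
                    "Custom-ValidateAppHealth")
for task in plan:
    print(task["Priority"], task["Name"])
```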

4. Change-Managed Automation

  •        Maintenance Window ID referenced in:
    •    Change tickets
    •    Incident runbooks
    •    Compliance reports

NB:

  • Some organizations require the Maintenance Window ID before approving a patching change.

Common Failure Modes (and How to Avoid Them)

  •         Tasks don’t run
    •    Root cause: No targets resolved
    •    Fix: Verify tags / resource groups
  •         Instances skipped
    •    Root cause: Wrong Patch Group
    •    Fix: Standardize tagging
  •         Reboots missed
    •    Root cause: Missing reboot option
    •    Fix: Set RebootOption to RebootIfNeeded (AWS-RunPatchBaseline)
  •         Timeout failures
    •    Root cause: Window too short
    •    Fix: Increase duration
  •         Permissions denied
    •    Root cause: Wrong IAM role
    •    Fix: Validate the service role
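Several of these failure modes can be caught before the window opens. The preflight check below runs against a locally assembled description of the window. That description is a plain dictionary whose field names are invented for this sketch, not an AWS API shape. IAM problems are left out because checking them needs live credentials. The timeout check is only a rough heuristic.

```python
def preflight(window: dict, inventory: dict[str, dict[str, str]]) -> list[str]:
    """Return problems that mirror rows of the failure-mode list above."""
    problems = []
    key, values = window["target_tag_key"], window["target_tag_values"]
    matched = {iid for iid, tags in inventory.items() if tags.get(key) in values}
    if not matched:
        problems.append("Tasks don't run: no targets resolved; verify tags / resource groups")
    # Instances in the same environment that the tag filter misses.
    skipped = sorted(iid for iid, tags in inventory.items()
                     if tags.get("Environment") == window["environment"]
                     and iid not in matched)
    if skipped:
        problems.append(f"Instances skipped: {skipped} lack the expected {key} tag")
    if window.get("reboot_option") != "RebootIfNeeded":
        problems.append("Reboots missed: set RebootOption to RebootIfNeeded")
    # Rough heuristic: the expected task run time should fit inside the window.
    if window["expected_task_hours"] > window["duration_hours"]:
        problems.append("Timeout failures: window too short; increase duration")
    return problems


inventory = {  # hypothetical fleet
    "i-0aaa": {"Environment": "Prod", "PatchGroup": "Prod-Linux"},
    "i-0bbb": {"Environment": "Prod"},  # missing PatchGroup tag
}
window = {
    "environment": "Prod",
    "target_tag_key": "PatchGroup",
    "target_tag_values": ["Prod-Linux"],
    "reboot_option": "NoReboot",
    "duration_hours": 2,
    "expected_task_hours": 3,
}
for problem in preflight(window, inventory):
    print(problem)
```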

When to Use and Not Use Maintenance Windows

Use when:

  •         Running planned, repeatable ops
  •         Enforcing change windows
  •         Managing large fleets
  •         Meeting compliance requirements

Don’t use when:

  •         Immediate break/fix response (use Run Command directly)
  •         Event-driven automation (use EventBridge + Automation)

Final thoughts

Maintenance Windows are:

  •         The execution engine of enterprise operations
  •         The guardrail for patching and compliance
  •         A replacement for brittle cron + SSH workflows

NB:

  • In mature AWS environments, nothing touches production outside a Maintenance Window.
