Wednesday, August 13, 2025

Amazon DynamoDB | Time To Live (TTL).

 

A deep dive into DynamoDB Time to Live (TTL)

1. The concept: TTL

TTL in DynamoDB is a per-item expiration feature.
twtech sets an attribute (a timestamp in UNIX epoch time, in seconds) that tells DynamoDB when that item becomes eligible for expiration; DynamoDB then deletes it some time after that point.

What it is not:

  • Not a hard delete → Items don’t vanish at the exact second of expiry. The deletion is asynchronous.
  • Not a retention policy → It’s for automatic cleanup, not for strict compliance retention.
  • Not instant → Deletion can be delayed for hours (sometimes up to 48 hrs, but usually less).

2. How DynamoDB Time to Live (TTL) Works

  1. Enable TTL on a table and choose one attribute name (e.g., expireAt).
  2. twtech inserts items with that attribute set to a UNIX timestamp in seconds (see the sketch after this list).
  3. After the current time passes that timestamp:
    • DynamoDB marks the item as eligible for expiration.
    • A background process scans partitions and removes expired items gradually.
  4. The deletion is a soft, background purge:
    • No provisioned RCUs/WCUs are consumed for the delete.
    • TTL deletions don’t show up in table metrics as DeleteItem calls.
  5. Streams + TTL: when DynamoDB Streams is enabled, TTL deletions show up with:
    • eventName: REMOVE
    • userIdentity → principalId: "dynamodb.amazonaws.com"
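
The steps above can be scripted. Here is a minimal boto3 sketch (Python) for steps 1 and 2, assuming a hypothetical table named sessions that uses expireAt as its TTL attribute; both names are assumptions for illustration.

import boto3

dynamodb = boto3.client("dynamodb")

# Step 1: enable TTL on the table and name the attribute DynamoDB should watch.
dynamodb.update_time_to_live(
    TableName="sessions",  # placeholder table name
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expireAt",
    },
)

# Confirm the TTL status (ENABLING/ENABLED) and the configured attribute.
print(dynamodb.describe_time_to_live(TableName="sessions")["TimeToLiveDescription"])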

3. Data Modeling with TTL

TTL is typically used for:

  • Session data (auto-expire old sessions)
  • Caching layers (short-lived data that should drop off)
  • Event deduplication windows
  • IoT ingestion (expire readings after a time window)

Example table item:

{
  "PK": "User#123",
  "SK": "Session#abc",
  "expireAt": 1736822400,  // Tue, 14 Jan 2025 02:40:00 UTC
  "data": "Some session info"
}
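
A hedged sketch of how such an item could be written with boto3, assuming a hypothetical sessions table and a 24-hour session lifetime (both are assumptions for illustration):

import time

import boto3

table = boto3.resource("dynamodb").Table("sessions")  # placeholder table name

# TTL expects epoch *seconds*; int(time.time()) is already in seconds.
expire_at = int(time.time()) + 24 * 60 * 60  # assumed 24-hour session lifetime

table.put_item(
    Item={
        "PK": "User#123",
        "SK": "Session#abc",
        "expireAt": expire_at,  # Number attribute that the TTL process reads
        "data": "Some session info",
    }
)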

4. Internals & Gotchas (what to watch out for)

Topic | Details
Attribute type | Must be a Number holding UNIX epoch time in seconds (not milliseconds).
Multiple TTL attributes | No; only one TTL attribute per table.
Partial deletes | Only items with the attribute set to a past timestamp get deleted; others stay.
Delay window | Usually minutes to hours; up to ~48 hrs in the worst case.
GSI behavior | Expired items are removed from GSIs automatically.
Restore from backups | Restored items keep their original TTL attributes; if already expired, they may vanish soon after restore.
Streams | TTL deletions appear in Streams if enabled; useful for cleanup in downstream systems.
Capacity cost | TTL deletions do not consume provisioned throughput, but the reads/writes that set the TTL attribute do.
Transactions | TTL deletions are not transactional; they happen outside normal writes.

5. Operational Considerations

  • Monitoring
    • CloudWatch metric: TimeToLiveDeletedItemCount
    • Streams: monitor REMOVE events
  • Backups
    • Expired items are still present in point-in-time recovery snapshots if they existed at that time.
  • Disaster Recovery
    • If twtech restores an old backup, expired data may be reinserted and immediately expire again.

6. twtech Best Practices

  1. Uses TTL for ephemeral data, not for critical compliance deletes.
  2. Stores timestamps as UTC epoch seconds; they are easier to compare.
  3. Doesn’t rely on precise timing; if twtech needs exact removal, it does that in its app logic.
  4. Considers Streams + Lambda if twtech wants extra cleanup actions.
  5. Validates TTL values on writes, ensuring twtech code isn’t setting milliseconds by mistake (a common bug; see the sketch below).
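
For point 5, a small guard like the following sketch can catch millisecond timestamps before they reach DynamoDB. The 10**12 cutoff is a heuristic (epoch seconds are currently 10 digits, epoch milliseconds 13), and the function name is made up for illustration.

import time

def validate_expire_at(expire_at: int) -> int:
    # Reject TTL values that look like milliseconds or are already in the past.
    if expire_at >= 10**12:  # ~13 digits: almost certainly milliseconds
        raise ValueError(f"expireAt {expire_at} looks like milliseconds, not seconds")
    if expire_at <= int(time.time()):
        raise ValueError(f"expireAt {expire_at} is already in the past")
    return expire_at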

Diving into the failure modes for DynamoDB TTL.

1. TTL Workers Lag

How it happens

  • TTL deletions are done by background workers that sweep partitions.
  • If twtech’s table:
    • Is very large
    • Has lots of expired items at once
    • Has high write activity
    …those workers might take longer to catch up.

Symptoms

  • Items remain long after their expireAt time.
  • CloudWatch TimeToLiveDeletedItemCount shows bursty behavior rather than steady.
  • GSIs still contain expired entries until the deletion propagates.

Impact

  • Unexpected storage bloat (especially for hot partitions).
  • Queries/Scans return “dead” data for longer than expected.
  • If twtech relies on TTL for cost control (e.g., IoT ingestion), storage costs can spike.

Mitigation

  • Architect for eventual cleanup, not precise deletion timing.
  • Use Streams + Lambda for proactive deletion if timing matters.
  • Keep partition key design healthy so workers don’t have “skew hotspots.”

2. Streams Failures with TTL

TTL deletions appear in Streams if twtech has them enabled, but Stream records aren’t retained forever.

Scenarios

  1. Lambda Subscriber Fails
    • twtech’s downstream processing (e.g., cleanup in S3 or other systems) misses TTL removals if Lambda errors repeatedly and the retry window passes.
  2. Stream Retention Window Passes
    • DynamoDB Streams retain records for only 24 hours; longer retention requires capturing changes with Kinesis Data Streams for DynamoDB instead (retention there is configurable from 24 hours up to 7 days, or up to a year with long-term retention).
    • If twtech’s processing pipeline is down longer than the retention window, TTL deletion events are lost forever.
  3. No Stream Filter
    • If twtech doesn’t filter out TTL deletions (userIdentity.principalId == "dynamodb.amazonaws.com"), twtech’s consumer might treat TTL deletes the same as user-initiated deletes, potentially causing unintended behavior.

Mitigation

  • If the downstream is critical, capture changes with Kinesis Data Streams for DynamoDB, which supports longer retention than the fixed 24-hour DynamoDB Streams window.
  • Design idempotent consumers.
  • Add filtering logic to distinguish TTL deletes from manual deletes (see the sketch below).
  • Add Dead Letter Queues (DLQs) for Lambda errors.
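
A minimal sketch of a Lambda consumer that applies that filtering, separating TTL deletes from user-initiated deletes; the two helper functions are hypothetical placeholders for twtech’s own downstream logic.

def do_ttl_cleanup(keys):
    # Hypothetical placeholder: e.g., remove derived objects in S3 or caches.
    print("TTL expiry, cleaning up:", keys)

def handle_user_delete(keys):
    # Hypothetical placeholder for user-initiated delete handling.
    print("User delete:", keys)

# Lambda handler for a DynamoDB Streams event source mapping.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue  # only deletions matter here

        identity = record.get("userIdentity") or {}
        if identity.get("principalId") == "dynamodb.amazonaws.com":
            do_ttl_cleanup(record["dynamodb"]["Keys"])  # deleted by the TTL process
        else:
            handle_user_delete(record["dynamodb"]["Keys"])  # deleted by an application call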

3. Restored Backups with Expired Data

TTL doesn’t retroactively “remove” data during restore — the attribute is just metadata, not a live countdown.

What Happens

  • twtech restores a PITR or backup from, say, 2 weeks ago.
  • Many of those items have expireAt timestamps in the past.
  • The moment the table comes online, TTL workers start scanning and will eventually delete those items — but not instantly.

Risks

  • twtech application might read stale or sensitive data before TTL kicks in.
  • If twtech is restoring for compliance audits, it might have data it wasn’t supposed to keep.
  • TTL deletions after restore might cause sudden bursts of Stream REMOVE events, overwhelming consumers.

Mitigation

  • After restore, manually purge expired items via a Scan + BatchWrite/Delete before making the table “live.”
  • If restoring into a different environment (e.g., staging), disable TTL temporarily if twtech wants to inspect expired data.

4. Partial Deletes & Inconsistent Views

TTL deletions are eventually consistent:

  • If twtech queries before TTL deletes an expired item, it will still see it.
  • Reads from different replicas (in multi-region/global tables) may show different states during the deletion window.

Extra twist for Global Tables:

  • TTL deletions do replicate as delete events across regions.
  • If the TTL worker lags in one region, the delete may arrive from another region first.

Mitigation:

  • Always filter out expired data in twtech’s app logic if correctness is critical (see the sketch below).
  • Treat TTL as storage hygiene, not business logic enforcement.
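
One way to apply that read-time filter, sketched with boto3; the sessions table and the PK/SK/expireAt names are carried over from the earlier item example and are assumptions, not a prescribed schema.

import time

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("sessions")  # placeholder table name
now = int(time.time())

# Query one partition, but drop items whose TTL has already passed,
# even if the background deletion has not caught up yet.
resp = table.query(
    KeyConditionExpression=Key("PK").eq("User#123"),
    FilterExpression=Attr("expireAt").not_exists() | Attr("expireAt").gt(now),
)
live_items = resp["Items"]
# Note: FilterExpression filters after the read, so expired items still consume read capacity.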

5. Observability Gaps

  • There’s no “TTL lag” metric — twtech can’t directly see how far behind the worker is.
  • CloudWatch TimeToLiveDeletedItemCount only counts deleted items; it says nothing about deletion latency.
  • If twtech misses that, it may not notice lag until storage costs spike.

Mitigation:

  • Track your own expireAt-minus-now lag for items returned in queries (a sketch follows this list).
  • Create alarms when too many “dead” items are still being returned.
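
A sketch of the first idea: compute how stale the already-expired items returned by a query are, and publish the worst lag as a custom CloudWatch metric. The Custom/DynamoDBTTL namespace and ExpiredItemLagSeconds metric name are made up for illustration.

import time

import boto3

cloudwatch = boto3.client("cloudwatch")

def report_ttl_lag(items, table_name):
    # Publish the worst observed TTL lag (seconds) among already-expired items.
    now = int(time.time())
    lags = [now - int(i["expireAt"]) for i in items if "expireAt" in i and int(i["expireAt"]) < now]
    if not lags:
        return
    cloudwatch.put_metric_data(
        Namespace="Custom/DynamoDBTTL",  # made-up namespace
        MetricData=[{
            "MetricName": "ExpiredItemLagSeconds",  # made-up metric name
            "Dimensions": [{"Name": "TableName", "Value": table_name}],
            "Value": max(lags),
            "Unit": "Seconds",
        }],
    )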

Key Takeaway

TTL is lazy garbage collection for DynamoDB, not a guaranteed real-time deletion mechanism.
If twtech needs precise, auditable, instant deletions, it does that itself in the app layer and treats TTL as a “safety net” or cost-control measure.

Below is an SRE-style DynamoDB TTL Failure-Mode & Recovery Runbook that twtech can keep handy in its operational playbooks.

DynamoDB TTL Failure-Mode & Recovery Runbook

Purpose:
Guide SREs/DevOps on identifying, diagnosing, and mitigating issues when DynamoDB Time to Live (TTL) misbehaves (lag, missed deletes, stale data after restore, or stream issues).

1. Failure Modes Overview

Failure Mode | Typical Symptoms | Risk
TTL Worker Lag | Expired items remain hours to days after expireAt; storage cost spike; queries return stale data | Higher costs, incorrect reads
Stream Loss / Consumer Failure | Missing REMOVE events downstream; replay gap exceeds the stream retention window | Downstream cleanup fails
Backup Restore with Expired Data | Restored table shows old expired items; burst of deletes post-restore | Data exposure, incorrect analytics
Partial Deletes / Global Table Skew | Items removed in one region before another; inconsistent queries | Conflicting application logic
Observability Blind Spots | No TTL latency metric; no alarms on lag | Silent accumulation of expired items

2. Detection & Monitoring

Core Metrics & Logs

  • TimeToLiveDeletedItemCount (CloudWatch): look for sudden drops or bursts.
  • DynamoDB Streams: monitor REMOVE events where:

"userIdentity": {
  "principalId": "dynamodb.amazonaws.com"
}

  • Table storage size growth (e.g., TableSizeBytes from DescribeTable).
  • App-level query filters showing items with expireAt < now() still being returned.

Alarms

  • Alert if >X% of queried items are expired.
  • Alert if storage size > N% baseline without proportional write activity.
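
One concrete way to implement the “sudden drops” check from the Core Metrics list: a hedged put_metric_alarm sketch that fires when TimeToLiveDeletedItemCount stays at zero for several hours on a table that is expected to expire data continuously. The alarm name, table name, periods, and SNS topic ARN are all placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="sessions-ttl-stalled",  # placeholder alarm name
    Namespace="AWS/DynamoDB",
    MetricName="TimeToLiveDeletedItemCount",
    Dimensions=[{"Name": "TableName", "Value": "sessions"}],  # placeholder table
    Statistic="Sum",
    Period=3600,  # 1-hour buckets
    EvaluationPeriods=6,  # no TTL deletes for 6 hours straight
    Threshold=0,
    ComparisonOperator="LessThanOrEqualToThreshold",
    TreatMissingData="breaching",  # no data points also means nothing was deleted
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ttl-alerts"],  # placeholder ARN
)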

3. Mitigation Playbook

A. TTL Worker Lag

  1. Confirm TTL is enabled and the attribute name matches the table configuration.
  2. Sample items → check expireAt vs. now.
  3. If lag is confirmed:
    • Trigger manual cleanup (Scan + BatchDelete in small chunks; see the sketch after this list).
    • Investigate partition key skew (hot partitions slow the TTL sweep).
  4. Consider Streams + Lambda proactive deletion for time-sensitive data.
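
A minimal sketch of that manual cleanup, assuming the hypothetical sessions table with PK/SK keys and the expireAt attribute; batch_writer flushes deletes in chunks of 25 automatically, and the paginated scan can be stopped and resumed.

import time

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("sessions")  # placeholder table name
now = int(time.time())

scan_kwargs = {
    "FilterExpression": Attr("expireAt").lt(now),  # already expired
    "ProjectionExpression": "PK, SK",  # keys only, to keep the scan cheap
}

# Paginate through the table and delete expired items in small batches.
with table.batch_writer() as batch:
    while True:
        resp = table.scan(**scan_kwargs)
        for item in resp["Items"]:
            batch.delete_item(Key={"PK": item["PK"], "SK": item["SK"]})
        if "LastEvaluatedKey" not in resp:
            break
        scan_kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]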

B. Streams Failure

  1. Check the consumer’s iterator age (Lambda’s IteratorAge metric for DynamoDB Streams consumers, or GetRecords.IteratorAgeMilliseconds for Kinesis Data Streams).
  2. If the gap exceeds the retention window:
    • Identify missed expired items via a Scan.
    • Run manual cleanup.
  3. Restart consumers and reprocess the backlog if still within the retention window.

C. Backup Restore with Expired Data

  1. After restore, pause app writes.
  2. Scan for expired items (expireAt < now()).
  3. Batch delete expired items before opening the table to production traffic.

D. Global Table Partial Deletes

  1. Compare the same PK/SK in each region.
  2. Force a sync via a manual delete in the lagging region if needed.

4. Escalation Path

  1. The on-call SRE validates the TTL configuration and lag.
  2. A DBA/Cloud Engineer engages if:
    • TTL is not deleting after >48 h.
    • Streams event loss exceeds the retention window.
  3. Open an AWS Support ticket if:
    • TTL workers appear stalled (no deletes in >24 h despite expired data).
    • Global Tables TTL replication anomalies persist after 24 h.

5. Preventative Actions

  • Use app-level filters to ignore expired items during reads.
  • Keep partition keys evenly distributed.
  • Enable longer stream retention (e.g., Kinesis Data Streams for DynamoDB) if TTL-driven downstream actions are critical.
  • Document restore procedures to purge expired data before production use.
  • Periodically audit TTL effectiveness (compare expireAt lag).
