A deep dive into DynamoDB Time to Live (TTL)
1. The concept: TTL
TTL in DynamoDB is a per-item expiration feature. twtech sets an attribute (a timestamp in UNIX epoch time, in seconds) that tells DynamoDB when an item becomes eligible for expiration; DynamoDB then deletes the item some time after that timestamp passes.
What it is not:
- Not a hard delete → Items don’t vanish at the exact second of expiry; the deletion is asynchronous.
- Not a retention policy → It’s for automatic cleanup, not for strict compliance retention.
- Not instant → Deletion can be delayed for hours (sometimes up to 48 hours, but usually less).
2. How DynamoDB Time to Live (TTL) Works
- Enable TTL on a table and choose one attribute name (e.g., expireAt); see the sketch after this list.
- twtech inserts items with that attribute set to a UNIX timestamp (in seconds).
- After the current time passes that timestamp:
  - DynamoDB marks the item as eligible for expiration.
  - A background process scans partitions and removes expired items gradually.
- The deletion is a soft, background purge:
  - No provisioned RCUs/WCUs are consumed for the delete.
  - TTL deletions don’t show up in table metrics as DeleteItem calls.
- Streams + TTL → when DynamoDB Streams is enabled, TTL deletions show up with:
  - eventName: REMOVE
  - userIdentity → principalId: "dynamodb.amazonaws.com"
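As a concrete illustration of the enable step, here is a minimal boto3 sketch; the table name Sessions and the attribute name expireAt are assumptions carried through the rest of this post:

import boto3

dynamodb = boto3.client('dynamodb')

# Tell DynamoDB which attribute holds the expiry time (epoch seconds).
dynamodb.update_time_to_live(
    TableName='Sessions',              # assumed table name
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': 'expireAt',   # assumed attribute name
    },
)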
3. Data Modeling with TTL
TTL is typically used for:
- Session data (auto-expire old sessions)
- Caching layers (short-lived data that should drop off)
- Event deduplication windows
- IoT ingestion (expire readings after a time window)
Example table item:
{
  "PK": "User#123",
  "SK": "Session#abc",
  "expireAt": 1736812800,   // Tue, 14 Jan 2025 00:00:00 UTC
  "data": "Some session info"
}
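A hedged sketch of writing such an item with boto3; the 24-hour session lifetime is an arbitrary example value and Sessions is the assumed table name:

import time
import boto3

table = boto3.resource('dynamodb').Table('Sessions')       # assumed table name

SESSION_LIFETIME_SECONDS = 24 * 60 * 60                     # example: expire after 24 hours

table.put_item(Item={
    'PK': 'User#123',
    'SK': 'Session#abc',
    'expireAt': int(time.time()) + SESSION_LIFETIME_SECONDS,  # epoch SECONDS, not milliseconds
    'data': 'Some session info',
})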
4. Internals & Gotchas (what to watch out for)
Topic | Details
Attribute type | Must be a Number representing UNIX epoch time in seconds (not milliseconds).
Multiple TTL attributes | Nope — only one per table.
Partial deletes | Only items with the attribute set and in the past get deleted; others stay.
Delay window | Usually minutes to hours; up to ~48 hours in the worst case.
GSI behavior | Expired items are removed from GSIs automatically.
Restore from backups | Restored items keep their original TTL attributes — they may vanish soon after restore if already expired.
Streams | TTL deletions appear in Streams if enabled — useful for cleanup in downstream systems.
Capacity cost | TTL deletions do not consume provisioned throughput, but the reads/writes that set the TTL attribute do.
Transactions | TTL deletions are not transactional — they happen outside of normal writes.
5. Operational Considerations
- Monitoring
  - CloudWatch metric: TimeToLiveDeletedItemCount
  - Streams: monitor REMOVE events
- Backups
  - Expired items are still present in point-in-time recovery snapshots if they existed at that time.
- Disaster Recovery
  - If twtech restores an old backup, expired data may be reinserted and immediately expire again.
6. twtech Best Practices
- Uses TTL for ephemeral data, not for critical compliance deletes.
- Stores timestamps in UTC seconds — easier to compare.
- Doesn’t rely on precise timing — if twtech needs exact removal, it does that in its app logic.
- Considers Streams + Lambda if twtech wants extra cleanup actions.
- Validates TTL in writes — ensure twtech code isn’t setting milliseconds by mistake (a common bug); see the sketch below.
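One hedged way to implement that last check is a small guard in the write path; the "100 years out" threshold is an arbitrary example:

import time

# Anything more than ~100 years in the future almost certainly means the caller
# passed epoch milliseconds instead of seconds.
MILLISECOND_SUSPICION_THRESHOLD = int(time.time()) + 100 * 365 * 24 * 3600

def validate_expire_at(expire_at: int) -> int:
    """Reject TTL values that look like epoch milliseconds (a common bug)."""
    if expire_at > MILLISECOND_SUSPICION_THRESHOLD:
        raise ValueError(
            f"expireAt={expire_at} looks like milliseconds; "
            "DynamoDB TTL expects epoch time in seconds"
        )
    return expire_at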
Diving into the failure modes of DynamoDB TTL.
1. TTL Workers Lag
How it happens
- TTL deletions are done by background workers that sweep partitions.
- If the twtech table:
  - is very large,
  - has lots of expired items at once, or
  - has high write activity,
  …those workers can take longer to catch up.
Symptoms
- Items remain long after their expireAt time.
- CloudWatch TimeToLiveDeletedItemCount shows bursty rather than steady behavior.
- GSIs still contain expired entries until the deletion propagates.
Impact
- Unexpected storage bloat (especially for hot partitions).
- Queries/Scans return “dead” data for longer than expected.
- If twtech relies on TTL for cost control (e.g., IoT ingestion), storage costs can spike.
Mitigation
- Architect for eventual cleanup, not precise timing.
- Use Streams + Lambda for proactive deletion if timing matters (a scheduled-sweep alternative is sketched below).
- Keep partition key design healthy so workers don’t hit “skew hotspots.”
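Streams + Lambda reacts to changes as they happen; for items that are already sitting around expired, a scheduled sweep is the simpler proactive option. A minimal sketch of such a sweep for one partition, reusing the assumed Sessions table and PK/SK schema from the earlier example (it could run from a scheduled Lambda; pagination is omitted for brevity):

import time
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('Sessions')   # assumed table name

def purge_expired(partition_key: str) -> None:
    """Delete already-expired items in one partition instead of waiting on TTL workers."""
    now = int(time.time())
    resp = table.query(
        KeyConditionExpression=Key('PK').eq(partition_key),
        FilterExpression=Attr('expireAt').lte(now),
    )
    with table.batch_writer() as batch:
        for item in resp['Items']:
            batch.delete_item(Key={'PK': item['PK'], 'SK': item['SK']})

purge_expired('User#123')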
2. Streams Failures with TTL
TTL deletions appear in Streams if twtech has them enabled, but Stream records aren’t retained forever.
Scenarios
- Lambda Subscriber Fails
  - twtech’s downstream processing (e.g., cleanup in S3 or other systems) misses TTL removals if the Lambda errors repeatedly and the retry window passes.
- Stream Retention Window Passes
  - DynamoDB Streams retain records for only 24 hours (up to 7 days is possible with extended retention on a Kinesis Data Stream for the table).
  - If the twtech processing pipeline is down longer than that, the TTL deletion events are lost for good.
- No Stream Filter
  - If twtech doesn’t filter TTL deletions (userIdentity.principalId == "dynamodb.amazonaws.com"), twtech consumers might treat TTL deletes the same as user-initiated deletes — potentially causing unintended behavior.
Mitigation
- Enable extended stream retention if the downstream is critical.
- Design idempotent consumers.
- Add filtering logic for TTL vs manual deletes (see the handler sketch below).
- Add Dead Letter Queues (DLQs) for Lambda errors.
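A hedged sketch of that filtering logic inside a DynamoDB Streams-triggered Lambda handler; the two cleanup hooks are placeholders for whatever downstream action twtech actually performs:

TTL_PRINCIPAL = 'dynamodb.amazonaws.com'

def handler(event, context):
    for record in event.get('Records', []):
        if record.get('eventName') != 'REMOVE':
            continue
        keys = record['dynamodb']['Keys']
        principal = record.get('userIdentity', {}).get('principalId')
        if principal == TTL_PRINCIPAL:
            # Deleted by the TTL background process.
            handle_ttl_expiry(keys)
        else:
            # Deleted by an application or operator.
            handle_user_delete(keys)

def handle_ttl_expiry(keys):
    print('TTL expiry:', keys)      # placeholder for downstream cleanup

def handle_user_delete(keys):
    print('manual delete:', keys)   # placeholder for delete handling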
3. Restored Backups with Expired Data
TTL doesn’t retroactively “remove” data during restore — the attribute is just metadata, not a live countdown.
What Happens
- twtech restores a PITR snapshot or backup from, say, 2 weeks ago.
- Many of those items have expireAt timestamps in the past.
- The moment the table comes online, TTL workers start scanning and will eventually delete those items — but not instantly.
Risks
- The twtech application might read stale or sensitive data before TTL kicks in.
- If twtech is restoring for a compliance audit, it might end up holding data it wasn’t supposed to keep.
- TTL deletions after restore might cause sudden bursts of Stream REMOVE events, overwhelming consumers.
Mitigation
- After restore, manually purge expired items via a Scan + BatchWrite delete before making the table “live” (see the sketch below).
- If restoring into a different environment (e.g., staging), disable TTL temporarily if twtech wants to inspect the expired data.
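A hedged sketch of that post-restore purge, assuming the Sessions table and PK/SK key schema used earlier; run it before pointing production traffic at the restored table:

import time
import boto3

table = boto3.resource('dynamodb').Table('Sessions')   # assumed restored table name
now = int(time.time())

scan_kwargs = {
    'FilterExpression': 'expireAt <= :now',
    'ExpressionAttributeValues': {':now': now},
}

# Page through the table and batch-delete anything already past its expiry.
with table.batch_writer() as batch:
    while True:
        page = table.scan(**scan_kwargs)
        for item in page['Items']:
            batch.delete_item(Key={'PK': item['PK'], 'SK': item['SK']})
        if 'LastEvaluatedKey' not in page:
            break
        scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

Unlike TTL deletions, this scan and these deletes consume normal read and write capacity, so run the purge in small chunks on large tables.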
4. Partial Deletes & Inconsistent Views
TTL deletions are eventually consistent:
- If twtech queries before TTL deletes an expired item, it will still see it.
- Reads from different replicas (in multi-region/global tables) may show different states during the deletion window.
Extra twist for Global Tables:
- TTL deletions do replicate as delete events across regions.
- If the TTL worker lags in one region, the delete may arrive from another region first.
Mitigation:
- Always filter out expired data in twtech app logic if correctness is critical (see the query sketch below).
- Treat TTL as storage hygiene, not business-logic enforcement.
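A hedged sketch of that app-level filter on reads, again assuming the Sessions table and key schema from the earlier example:

import time
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('Sessions')   # assumed table name

def live_sessions(user_pk: str):
    """Return only items whose TTL has not passed, even if workers haven't purged them yet."""
    now = int(time.time())
    resp = table.query(
        KeyConditionExpression=Key('PK').eq(user_pk),
        FilterExpression=Attr('expireAt').gt(now),
    )
    return resp['Items']

The filter is applied after the items are read, so it does not save read capacity; it only keeps expired rows out of application results.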
5. Observability Gaps
- There’s no “TTL lag” metric — twtech can’t directly see how far behind the workers are.
- CloudWatch TimeToLiveDeletedItemCount only counts deleted items; it says nothing about latency.
- Without that visibility, twtech may not notice lag until storage costs spike.
Mitigation:
- Track expireAt minus now for items returned in queries (a custom-metric sketch follows below).
- Create alarms when too many “dead” items are still being returned.
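A hedged sketch of that tracking, publishing a custom CloudWatch metric for items returned by the boto3 resource interface; the namespace twtech/DynamoDB and metric name ExpiredItemsReturned are made-up names for illustration:

import time
import boto3

cloudwatch = boto3.client('cloudwatch')

def report_expired_items(items, table_name='Sessions'):
    """Record how many already-expired items a query still returned."""
    now = int(time.time())
    dead = [i for i in items if 'expireAt' in i and int(i['expireAt']) <= now]
    cloudwatch.put_metric_data(
        Namespace='twtech/DynamoDB',            # assumed custom namespace
        MetricData=[{
            'MetricName': 'ExpiredItemsReturned',
            'Dimensions': [{'Name': 'TableName', 'Value': table_name}],
            'Value': len(dead),
            'Unit': 'Count',
        }],
    )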
Key Takeaway
TTL is lazy garbage collection for DynamoDB, not a guaranteed real-time deletion mechanism.
If twtech needs precise, auditable, instant deletions, it does that itself in the app layer and treats TTL as a “safety net” or cost-control measure.
Below is an SRE-style DynamoDB TTL Failure-Mode & Recovery Runbook that twtech can keep handy in its operational playbooks.
DynamoDB TTL Failure-Mode & Recovery Runbook
Purpose:
Guide SREs/DevOps in identifying, diagnosing, and mitigating issues when DynamoDB Time to Live (TTL) misbehaves (lag, missed deletes, stale data after restore, or stream issues).
1. Failure Modes Overview
Failure Mode | Typical Symptoms | Risk
TTL Worker Lag | Expired items remain hours–days after expiry | Higher costs, incorrect reads
Stream Loss / Consumer Failure | Missing REMOVE events downstream; replay gap > 24h (or > 7d for extended streams) | Downstream cleanup fails
Backup Restore with Expired Data | Restored table shows old expired items; burst of deletes post-restore | Data exposure, incorrect analytics
Partial Deletes / Global Table Skew | Items removed in one region before another; inconsistent queries | Conflicting application logic
Observability Blind Spots | No TTL latency metric; no alarms on lag | Silent accumulation of expired items
2. Detection & Monitoring
Core Metrics & Logs
- TimeToLiveDeletedItemCount (CloudWatch) – look for sudden drops or bursts.
- DynamoDB Streams – monitor REMOVE events where:
  "userIdentity": {
    "principalId": "dynamodb.amazonaws.com"
  }
- Storage size growth in CloudWatch.
- App-level query filters showing expireAt < now() items being returned.
Alarms
- Alert if >X% of queried items are expired (an alarm sketch follows below).
- Alert if storage size grows >N% over baseline without proportional write activity.
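As one hedged example of wiring such an alarm, this builds on the custom ExpiredItemsReturned metric sketched earlier; the alarm name, threshold, and periods are placeholder values to tune per table:

import boto3

boto3.client('cloudwatch').put_metric_alarm(
    AlarmName='Sessions-ttl-expired-items-returned',   # assumed alarm name
    Namespace='twtech/DynamoDB',                        # custom namespace from the earlier sketch
    MetricName='ExpiredItemsReturned',
    Dimensions=[{'Name': 'TableName', 'Value': 'Sessions'}],
    Statistic='Sum',
    Period=300,                   # 5-minute buckets (example)
    EvaluationPeriods=3,
    Threshold=100,                # example threshold; tune to the table's traffic
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
)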
3. Mitigation Playbook
A. TTL Worker Lag
1. Confirm TTL is enabled and the attribute name matches the config.
2. Sample items → check expireAt vs now (a check sketch follows below).
3. If lag is confirmed:
   - Trigger manual cleanup (Scan + BatchDelete in small chunks).
   - Investigate partition key skew (hot partitions slow the TTL sweep).
4. Consider Streams + Lambda proactive deletion for time-sensitive data.
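Steps 1 and 2 can be scripted; a minimal sketch, assuming the Sessions table and expireAt attribute used throughout this post:

import time
import boto3

client = boto3.client('dynamodb')
table = boto3.resource('dynamodb').Table('Sessions')    # assumed table name

# Step 1: confirm TTL is ENABLED and points at the attribute the app writes.
ttl = client.describe_time_to_live(TableName='Sessions')['TimeToLiveDescription']
print(ttl.get('TimeToLiveStatus'), ttl.get('AttributeName'))

# Step 2: sample items and report how far past expiry they are.
now = int(time.time())
for item in table.scan(Limit=50)['Items']:
    expire_at = int(item.get('expireAt', 0))
    if 0 < expire_at <= now:
        print(item['PK'], item['SK'], 'lag seconds:', now - expire_at)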
B. Streams Failure
1. Check GetRecords.IteratorAgeMilliseconds in the Kinesis/stream metrics.
2. If the gap exceeds the retention window:
   - Identify missed expired items via a Scan.
   - Run manual cleanup.
3. Restart consumers and reprocess the backlog if still within the retention window.
C. Backup Restore with Expired Data
1. After restore, pause app writes.
2. Scan for expired items (expireAt < now()).
3. Batch-delete expired items before opening the table to production traffic.
D. Global Table Partial Deletes
1. Compare the same PK/SK in each region (a comparison sketch follows below).
2. Force a sync via a manual delete in the lagging region if needed.
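A minimal sketch of step 1, comparing the same key across two assumed replica regions of a global table:

import boto3

TABLE = 'Sessions'                                       # assumed global table name
KEY = {'PK': {'S': 'User#123'}, 'SK': {'S': 'Session#abc'}}
REGIONS = ('us-east-1', 'eu-west-1')                     # assumed replica regions

for region in REGIONS:
    client = boto3.client('dynamodb', region_name=region)
    resp = client.get_item(TableName=TABLE, Key=KEY, ConsistentRead=True)
    state = 'still present' if 'Item' in resp else 'already deleted'
    print(f'{region}: {state}')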
4. Escalation Path
1. On-call SRE validates TTL config and lag.
2. DBA/Cloud Engineer engages if:
   - TTL is not deleting after >48h.
   - Streams event loss exceeds the retention window.
3. AWS Support ticket if:
   - TTL workers appear stalled (no deletes in >24h despite expired data).
   - Global Tables TTL replication anomalies persist after 24h.
5. Preventative Actions
- Use app-level filters to ignore expired items during reads.
- Keep partition keys evenly distributed.
- Enable extended stream retention if TTL-driven downstream actions are critical.
- Document restore procedures to purge expired data before production use.
- Periodically audit TTL effectiveness (compare expireAt lag).