A deep dive into DynamoDB Time to Live (TTL)
1. The concept: TTL
TTL in DynamoDB is a per-item expiration feature. twtech sets an attribute (a timestamp in UNIX epoch time, in seconds) that tells DynamoDB when an item becomes eligible for expiration; DynamoDB then deletes the item some time after that timestamp passes.
What it is not:
- Not a hard delete → Items don’t vanish at the exact second of expiry; the deletion is asynchronous.
- Not a retention policy → It’s for automatic cleanup, not for strict compliance retention.
- Not instant → Deletion can be delayed for hours (sometimes up to 48 hours, but usually less).
2. How DynamoDB Time to Live (TTL) Works
- Enable TTL on a table and choose one attribute name (e.g., expireAt); see the sketch after this list.
- twtech inserts items with that attribute set to a UNIX timestamp (in seconds).
- After the current time passes that timestamp:
  - DynamoDB marks the item as eligible for expiration.
  - A background process scans partitions and removes expired items gradually.
- The deletion is a soft, background purge:
  - No provisioned RCUs/WCUs are consumed for the delete.
  - TTL deletions don’t show up in table metrics as DeleteItem calls.
- Streams + TTL → when DynamoDB Streams is enabled, TTL deletions show up with:
  - eventName: REMOVE
  - userIdentity → principalId: "dynamodb.amazonaws.com"
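As a concrete illustration of the enable step, here is a minimal boto3 sketch; the table name Sessions and the attribute name expireAt are assumptions carried through the rest of this post:

import boto3

dynamodb = boto3.client('dynamodb')

# Tell DynamoDB which attribute holds the expiry time (epoch seconds).
dynamodb.update_time_to_live(
    TableName='Sessions',              # assumed table name
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': 'expireAt',   # assumed attribute name
    },
)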
3. Data Modeling with TTL
TTL is typically used for:
- Session data (auto-expire old sessions)
- Caching layers (short-lived data that should drop off)
- Event deduplication windows
- IoT ingestion (expire readings after a time window)
Example table item:
{
  "PK": "User#123",
  "SK": "Session#abc",
  "expireAt": 1736812800,   // Tue, 14 Jan 2025 00:00:00 UTC
  "data": "Some session info"
}
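A hedged sketch of writing such an item with boto3; the 24-hour session lifetime is an arbitrary example value and Sessions is the assumed table name:

import time
import boto3

table = boto3.resource('dynamodb').Table('Sessions')       # assumed table name

SESSION_LIFETIME_SECONDS = 24 * 60 * 60                     # example: expire after 24 hours

table.put_item(Item={
    'PK': 'User#123',
    'SK': 'Session#abc',
    'expireAt': int(time.time()) + SESSION_LIFETIME_SECONDS,  # epoch SECONDS, not milliseconds
    'data': 'Some session info',
})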
4. Internals & Gotchas (what to watch out for)
Topic | Details
Attribute type | Must be a Number representing UNIX epoch time in seconds (not milliseconds).
Multiple TTL attributes | Nope — only one per table.
Partial deletes | Only items with the attribute set and in the past get deleted; others stay.
Delay window | Usually minutes to hours; up to ~48 hours in the worst case.
GSI behavior | Expired items are removed from GSIs automatically.
Restore from backups | Restored items keep their original TTL attributes — they may vanish soon after restore if already expired.
Streams | TTL deletions appear in Streams if enabled — useful for cleanup in downstream systems.
Capacity cost | TTL deletions do not consume provisioned throughput, but the reads/writes that set the TTL attribute do.
Transactions | TTL deletions are not transactional — they happen outside of normal writes.
5. Operational Considerations
- Monitoring
  - CloudWatch metric: TimeToLiveDeletedItemCount
  - Streams: monitor REMOVE events
- Backups
  - Expired items are still present in point-in-time recovery snapshots if they existed at that time.
- Disaster Recovery
  - If twtech restores an old backup, expired data may be reinserted and immediately expire again.
6. twtech Best Practices
- Uses TTL for ephemeral data, not for critical compliance deletes.
- Stores timestamps in UTC seconds — easier to compare.
- Doesn’t rely on precise timing — if twtech needs exact removal, it does that in its app logic.
- Considers Streams + Lambda if twtech wants extra cleanup actions.
- Validates TTL in writes — ensure twtech code isn’t setting milliseconds by mistake (a common bug); see the sketch below.
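One hedged way to implement that last check is a small guard in the write path; the "100 years out" threshold is an arbitrary example:

import time

# Anything more than ~100 years in the future almost certainly means the caller
# passed epoch milliseconds instead of seconds.
MILLISECOND_SUSPICION_THRESHOLD = int(time.time()) + 100 * 365 * 24 * 3600

def validate_expire_at(expire_at: int) -> int:
    """Reject TTL values that look like epoch milliseconds (a common bug)."""
    if expire_at > MILLISECOND_SUSPICION_THRESHOLD:
        raise ValueError(
            f"expireAt={expire_at} looks like milliseconds; "
            "DynamoDB TTL expects epoch time in seconds"
        )
    return expire_at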
Diving into the failure modes of DynamoDB TTL.
1. TTL Workers Lag
How it happens
- TTL deletions are done by background workers that sweep partitions.
- If the twtech table:
  - is very large,
  - has lots of expired items at once, or
  - has high write activity,
  …those workers can take longer to catch up.
Symptoms
- Items remain long after their expireAt time.
- CloudWatch TimeToLiveDeletedItemCount shows bursty rather than steady behavior.
- GSIs still contain expired entries until the deletion propagates.
Impact
- Unexpected storage bloat (especially for hot partitions).
- Queries/Scans return “dead” data for longer than expected.
- If twtech relies on TTL for cost control (e.g., IoT ingestion), storage costs can spike.
Mitigation
- Architect for eventual cleanup, not precise timing.
- Use Streams + Lambda for proactive deletion if timing matters (a scheduled-sweep alternative is sketched below).
- Keep partition key design healthy so workers don’t hit “skew hotspots.”
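Streams + Lambda reacts to changes as they happen; for items that are already sitting around expired, a scheduled sweep is the simpler proactive option. A minimal sketch of such a sweep for one partition, reusing the assumed Sessions table and PK/SK schema from the earlier example (it could run from a scheduled Lambda; pagination is omitted for brevity):

import time
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('Sessions')   # assumed table name

def purge_expired(partition_key: str) -> None:
    """Delete already-expired items in one partition instead of waiting on TTL workers."""
    now = int(time.time())
    resp = table.query(
        KeyConditionExpression=Key('PK').eq(partition_key),
        FilterExpression=Attr('expireAt').lte(now),
    )
    with table.batch_writer() as batch:
        for item in resp['Items']:
            batch.delete_item(Key={'PK': item['PK'], 'SK': item['SK']})

purge_expired('User#123')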
2. Streams Failures with TTL
TTL deletions appear in Streams if twtech has them enabled, but Stream records aren’t retained forever.
Scenarios
- Lambda Subscriber Fails
  - twtech’s downstream processing (e.g., cleanup in S3 or other systems) misses TTL removals if the Lambda errors repeatedly and the retry window passes.
- Stream Retention Window Passes
  - DynamoDB Streams retain records for only 24 hours (up to 7 days is possible with extended retention on a Kinesis Data Stream for the table).
  - If the twtech processing pipeline is down longer than that, the TTL deletion events are lost for good.
- No Stream Filter
  - If twtech doesn’t filter TTL deletions (userIdentity.principalId == "dynamodb.amazonaws.com"), twtech consumers might treat TTL deletes the same as user-initiated deletes — potentially causing unintended behavior.
Mitigation
- Enable extended stream retention if the downstream is critical.
- Design idempotent consumers.
- Add filtering logic for TTL vs manual deletes (see the handler sketch below).
- Add Dead Letter Queues (DLQs) for Lambda errors.
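A hedged sketch of that filtering logic inside a DynamoDB Streams-triggered Lambda handler; the two cleanup hooks are placeholders for whatever downstream action twtech actually performs:

TTL_PRINCIPAL = 'dynamodb.amazonaws.com'

def handler(event, context):
    for record in event.get('Records', []):
        if record.get('eventName') != 'REMOVE':
            continue
        keys = record['dynamodb']['Keys']
        principal = record.get('userIdentity', {}).get('principalId')
        if principal == TTL_PRINCIPAL:
            # Deleted by the TTL background process.
            handle_ttl_expiry(keys)
        else:
            # Deleted by an application or operator.
            handle_user_delete(keys)

def handle_ttl_expiry(keys):
    print('TTL expiry:', keys)      # placeholder for downstream cleanup

def handle_user_delete(keys):
    print('manual delete:', keys)   # placeholder for delete handling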
3. Restored Backups with Expired Data
TTL doesn’t retroactively “remove” data during restore — the attribute is just metadata, not a live countdown.
What Happens
- twtech restores a PITR snapshot or backup from, say, 2 weeks ago.
- Many of those items have expireAt timestamps in the past.
- The moment the table comes online, TTL workers start scanning and will eventually delete those items — but not instantly.
Risks
- The twtech application might read stale or sensitive data before TTL kicks in.
- If twtech is restoring for a compliance audit, it might end up holding data it wasn’t supposed to keep.
- TTL deletions after restore might cause sudden bursts of Stream REMOVE events, overwhelming consumers.
Mitigation
- After restore, manually purge expired items via a Scan + BatchWrite delete before making the table “live” (see the sketch below).
- If restoring into a different environment (e.g., staging), disable TTL temporarily if twtech wants to inspect the expired data.
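A hedged sketch of that post-restore purge, assuming the Sessions table and PK/SK key schema used earlier; run it before pointing production traffic at the restored table:

import time
import boto3

table = boto3.resource('dynamodb').Table('Sessions')   # assumed restored table name
now = int(time.time())

scan_kwargs = {
    'FilterExpression': 'expireAt <= :now',
    'ExpressionAttributeValues': {':now': now},
}

# Page through the table and batch-delete anything already past its expiry.
with table.batch_writer() as batch:
    while True:
        page = table.scan(**scan_kwargs)
        for item in page['Items']:
            batch.delete_item(Key={'PK': item['PK'], 'SK': item['SK']})
        if 'LastEvaluatedKey' not in page:
            break
        scan_kwargs['ExclusiveStartKey'] = page['LastEvaluatedKey']

Unlike TTL deletions, this scan and these deletes consume normal read and write capacity, so run the purge in small chunks on large tables.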
4. Partial Deletes & Inconsistent Views
TTL deletions are eventually consistent:
- If twtech queries before TTL deletes an expired item, it will still see it.
- Reads from different replicas (in multi-region/global tables) may show different states during the deletion window.
Extra twist for Global Tables:
- TTL deletions do replicate as delete events across regions.
- If the TTL worker lags in one region, the delete may arrive from another region first.
Mitigation:
- Always filter out expired data in twtech app logic if correctness is critical (see the query sketch below).
- Treat TTL as storage hygiene, not business-logic enforcement.
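A hedged sketch of that app-level filter on reads, again assuming the Sessions table and key schema from the earlier example:

import time
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('Sessions')   # assumed table name

def live_sessions(user_pk: str):
    """Return only items whose TTL has not passed, even if workers haven't purged them yet."""
    now = int(time.time())
    resp = table.query(
        KeyConditionExpression=Key('PK').eq(user_pk),
        FilterExpression=Attr('expireAt').gt(now),
    )
    return resp['Items']

The filter is applied after the items are read, so it does not save read capacity; it only keeps expired rows out of application results.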
5. Observability Gaps
- There’s no “TTL lag” metric — twtech can’t directly see how far behind the workers are.
- CloudWatch TimeToLiveDeletedItemCount only counts deleted items; it says nothing about latency.
- Without that visibility, twtech may not notice lag until storage costs spike.
Mitigation:
- Track expireAt minus now for items returned in queries (a custom-metric sketch follows below).
- Create alarms when too many “dead” items are still being returned.
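A hedged sketch of that tracking, publishing a custom CloudWatch metric for items returned by the boto3 resource interface; the namespace twtech/DynamoDB and metric name ExpiredItemsReturned are made-up names for illustration:

import time
import boto3

cloudwatch = boto3.client('cloudwatch')

def report_expired_items(items, table_name='Sessions'):
    """Record how many already-expired items a query still returned."""
    now = int(time.time())
    dead = [i for i in items if 'expireAt' in i and int(i['expireAt']) <= now]
    cloudwatch.put_metric_data(
        Namespace='twtech/DynamoDB',            # assumed custom namespace
        MetricData=[{
            'MetricName': 'ExpiredItemsReturned',
            'Dimensions': [{'Name': 'TableName', 'Value': table_name}],
            'Value': len(dead),
            'Unit': 'Count',
        }],
    )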
Key Takeaway
TTL is lazy garbage collection for DynamoDB, not a guaranteed real-time deletion mechanism.
If twtech needs precise, auditable, instant deletions, it does that itself in the app layer and treats TTL as a “safety net” or cost-control measure.
Below is an SRE-style DynamoDB TTL Failure-Mode & Recovery Runbook that twtech can keep handy in its operational playbooks.
DynamoDB TTL Failure-Mode & Recovery Runbook
Purpose:
Guide SREs/DevOps in identifying, diagnosing, and mitigating issues when DynamoDB Time to Live (TTL) misbehaves (lag, missed deletes, stale data after restore, or stream issues).
1. Failure Modes Overview
Failure Mode | Typical Symptoms | Risk
TTL Worker Lag | Expired items remain hours–days after expiry | Higher costs, incorrect reads
Stream Loss / Consumer Failure | Missing REMOVE events downstream; replay gap > 24h (or > 7d for extended streams) | Downstream cleanup fails
Backup Restore with Expired Data | Restored table shows old expired items; burst of deletes post-restore | Data exposure, incorrect analytics
Partial Deletes / Global Table Skew | Items removed in one region before another; inconsistent queries | Conflicting application logic
Observability Blind Spots | No TTL latency metric; no alarms on lag | Silent accumulation of expired items
2. Detection & Monitoring
Core Metrics & Logs
- TimeToLiveDeletedItemCount (CloudWatch) – look for sudden drops or bursts.
- DynamoDB Streams – monitor REMOVE events where:
  "userIdentity": {
    "principalId": "dynamodb.amazonaws.com"
  }
- Storage size growth in CloudWatch.
- App-level query filters showing expireAt < now() items being returned.
Alarms
- Alert if >X% of queried items are expired (an alarm sketch follows below).
- Alert if storage size grows >N% over baseline without proportional write activity.
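As one hedged example of wiring such an alarm, this builds on the custom ExpiredItemsReturned metric sketched earlier; the alarm name, threshold, and periods are placeholder values to tune per table:

import boto3

boto3.client('cloudwatch').put_metric_alarm(
    AlarmName='Sessions-ttl-expired-items-returned',   # assumed alarm name
    Namespace='twtech/DynamoDB',                        # custom namespace from the earlier sketch
    MetricName='ExpiredItemsReturned',
    Dimensions=[{'Name': 'TableName', 'Value': 'Sessions'}],
    Statistic='Sum',
    Period=300,                   # 5-minute buckets (example)
    EvaluationPeriods=3,
    Threshold=100,                # example threshold; tune to the table's traffic
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
)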
3. Mitigation Playbook
A. TTL Worker Lag
1. Confirm TTL is enabled and the attribute name matches the config.
2. Sample items → check expireAt vs now (a check sketch follows below).
3. If lag is confirmed:
   - Trigger manual cleanup (Scan + BatchDelete in small chunks).
   - Investigate partition key skew (hot partitions slow the TTL sweep).
4. Consider Streams + Lambda proactive deletion for time-sensitive data.
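Steps 1 and 2 can be scripted; a minimal sketch, assuming the Sessions table and expireAt attribute used throughout this post:

import time
import boto3

client = boto3.client('dynamodb')
table = boto3.resource('dynamodb').Table('Sessions')    # assumed table name

# Step 1: confirm TTL is ENABLED and points at the attribute the app writes.
ttl = client.describe_time_to_live(TableName='Sessions')['TimeToLiveDescription']
print(ttl.get('TimeToLiveStatus'), ttl.get('AttributeName'))

# Step 2: sample items and report how far past expiry they are.
now = int(time.time())
for item in table.scan(Limit=50)['Items']:
    expire_at = int(item.get('expireAt', 0))
    if 0 < expire_at <= now:
        print(item['PK'], item['SK'], 'lag seconds:', now - expire_at)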
B. Streams Failure
1. Check GetRecords.IteratorAgeMilliseconds in the Kinesis/stream metrics.
2. If the gap exceeds the retention window:
   - Identify missed expired items via a Scan.
   - Run manual cleanup.
3. Restart consumers and reprocess the backlog if still within the retention window.
C. Backup Restore with Expired Data
1. After restore, pause app writes.
2. Scan for expired items (expireAt < now()).
3. Batch-delete expired items before opening the table to production traffic.
D. Global Table Partial Deletes
1. Compare the same PK/SK in each region (a comparison sketch follows below).
2. Force a sync via a manual delete in the lagging region if needed.
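A minimal sketch of step 1, comparing the same key across two assumed replica regions of a global table:

import boto3

TABLE = 'Sessions'                                       # assumed global table name
KEY = {'PK': {'S': 'User#123'}, 'SK': {'S': 'Session#abc'}}
REGIONS = ('us-east-1', 'eu-west-1')                     # assumed replica regions

for region in REGIONS:
    client = boto3.client('dynamodb', region_name=region)
    resp = client.get_item(TableName=TABLE, Key=KEY, ConsistentRead=True)
    state = 'still present' if 'Item' in resp else 'already deleted'
    print(f'{region}: {state}')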
4. Escalation Path
1. On-call SRE validates TTL config and lag.
2. DBA/Cloud Engineer engages if:
   - TTL is not deleting after >48h.
   - Streams event loss exceeds the retention window.
3. AWS Support ticket if:
   - TTL workers appear stalled (no deletes in >24h despite expired data).
   - Global Tables TTL replication anomalies persist after 24h.
5. Preventative Actions
- Use app-level filters to ignore expired items during reads.
- Keep partition keys evenly distributed.
- Enable extended stream retention if TTL-driven downstream actions are critical.
- Document restore procedures to purge expired data before production use.
- Periodically audit TTL effectiveness (compare expireAt lag).