Tuesday, August 12, 2025

DynamoDB Accelerator (DAX) - Deep Dive.

 

Amazon DynamoDB Accelerator (DAX).

Here’s a breakdown of:

  • what it is,
  • why it exists,
  • how it works,
  • what to watch out for.

1.  The Concept:  DAX

DAX (DynamoDB Accelerator) is an in-memory caching service specifically designed for DynamoDB.

Think of it as a fully managed, highly available, write-through/read-through cache that sits between the twtech application and its DynamoDB tables.
Its purpose is to dramatically reduce read latency, from DynamoDB’s ~10ms (typical) down to microseconds for cache hits.

It’s not a generic cache like Redis or Memcached — it’s purpose-built to integrate seamlessly with DynamoDB’s API.

2. Why DAX Exists

DynamoDB is fast, but for workloads with:

  • Repeated reads of the same items (hot keys)
  • Read-heavy workloads (especially eventually consistent reads)
  • High request rates
    … even 10ms is "too slow" when twtech is doing it millions of times per second.

DAX removes the need for twtech to manually manage an external cache (e.g., Redis cluster) for DynamoDB data.

DAX:

  • Speaks the same DynamoDB API (no app logic changes except client library)
  • Handles cache invalidation and consistency automatically
  • Scales with twtech traffic

3. DAX Architecture

At a high level:

# text

twtechApp  →  DAX Cluster (in-memory)  →  DynamoDB Table

Key points:

  • DAX is deployed as a cluster of 1–10 nodes.
  • Each node keeps an in-memory copy of frequently accessed items.
  • All reads/writes go through the DAX endpoint (single DNS name).
  • Nodes are spread across multiple Availability Zones for HA (high availability).
  • Replication keeps cache data consistent across nodes.
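
A cluster like this can be provisioned with the AWS SDK. The following boto3 sketch is illustrative only: the cluster name, node type, subnet group, security group, and IAM role ARN are placeholder values, not settings from this post.

# python

import boto3

dax_admin = boto3.client('dax', region_name='us-east-2')

response = dax_admin.create_cluster(
    ClusterName='twtech-dax-cluster',
    NodeType='dax.r5.large',              # node size determines in-memory capacity
    ReplicationFactor=3,                  # 3 nodes, spread across AZs for HA
    IamRoleArn='arn:aws:iam::123456789012:role/twtech-dax-role',
    SubnetGroupName='twtech-dax-subnet-group',
    SecurityGroupIds=['sg-0123456789abcdef0'],
)

# The single cluster (discovery) endpoint that applications connect to:
print(response['Cluster']['ClusterDiscoveryEndpoint'])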

4. How DAX (DynamoDB Accelerator) Works

Read Path

  1. App sends a GetItem or Query request to the DAX endpoint.
  2. DAX checks if the item(s) are in cache:
    • Cache hit → returns from memory (microseconds)
    • Cache miss → fetches from DynamoDB, stores in cache, returns to client
  3. Supports eventual consistency for reads; strong consistency still queries DynamoDB.

Write Path

  • Write-through cache:
    • App sends a write (PutItem, UpdateItem, DeleteItem) → DAX writes to DynamoDB and invalidates/updates cache entry.
  • This ensures the cache doesn’t serve stale data after a write (a conceptual sketch of both the read and write paths follows below).
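
The read and write paths above can be sketched conceptually in Python. This is not DAX’s real implementation, just a minimal read-through/write-through illustration; `backing_table` stands in for a boto3 DynamoDB Table object:

# python

class ReadWriteThroughCache:
    """Conceptual sketch of DAX-style caching (not the actual implementation)."""

    def __init__(self, backing_table):
        self.backing_table = backing_table   # authoritative DynamoDB table
        self.cache = {}                      # in-memory item cache: key -> item

    def get_item(self, key, consistent_read=False):
        # Strongly consistent reads bypass the cache and go straight to DynamoDB.
        if consistent_read:
            return self.backing_table.get_item(Key=key, ConsistentRead=True)['Item']
        cache_key = tuple(sorted(key.items()))
        if cache_key in self.cache:          # cache hit: served from memory
            return self.cache[cache_key]
        item = self.backing_table.get_item(Key=key)['Item']   # cache miss
        self.cache[cache_key] = item         # populate cache for later reads
        return item

    def put_item(self, key, item):
        # Write-through: DynamoDB is written first, then the cache entry is refreshed.
        self.backing_table.put_item(Item=item)
        self.cache[tuple(sorted(key.items()))] = item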

5. Performance Gains (benefits)

  • DynamoDB: ~10ms read latency
  • DAX: microsecond-level latency for cache hits (served from memory)
  • Can handle millions of requests per second without overloading DynamoDB.
  • Helps reduce RCU consumption (Read Capacity Units) for read-heavy workloads.
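
A quick, back-of-the-envelope way to see both effects, using assumed numbers (a 95% hit rate, ~0.5 ms per cache hit, ~10 ms per miss that reaches DynamoDB):

# python

# Assumed figures for illustration only.
hit_rate = 0.95          # fraction of reads answered from the DAX cache
hit_latency_ms = 0.5     # in-memory response
miss_latency_ms = 10.0   # cache miss that falls through to DynamoDB

effective_latency_ms = hit_rate * hit_latency_ms + (1 - hit_rate) * miss_latency_ms
rcu_consuming_fraction = 1 - hit_rate   # only misses consume read capacity

print(f'Effective read latency: {effective_latency_ms:.2f} ms')           # ~0.98 ms
print(f'Reads that consume DynamoDB RCUs: {rcu_consuming_fraction:.0%}')  # 5%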

6. Data Consistency

  • Eventually consistent reads: fully benefit from DAX.
  • Strongly consistent reads: bypass cache, go directly to DynamoDB (no latency gain).
  • Write-through updates ensure fresh data for subsequent eventually consistent reads (example below).
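
For example, assuming `table` is a DAX Table object like the one in the integration example below, the standard ConsistentRead flag is what determines whether the cache can be used:

# python

# Eventually consistent read: eligible to be served from the DAX item cache.
cached = table.get_item(Key={'id': 'twtech-123'})

# Strongly consistent read: DAX passes it through to DynamoDB, so no caching benefit.
fresh = table.get_item(Key={'id': 'twtech-123'}, ConsistentRead=True)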

7. Integration

  • DAX uses a custom SDK (software development kit) client for Java, Python, Go, .NET, etc.
  • No change to your table schema.
  • twtech points its DynamoDB calls at the DAX client instead of the standard DynamoDB client.

Example (Python - Boto3 style):

# python

import amazondax

# The resource interface mirrors boto3's DynamoDB resource.
# Use dax:// for unencrypted clusters, daxs:// for TLS-enabled ones.
dax = amazondax.AmazonDaxClient.resource(
    endpoint_url='dax://twtech-dax-cluster.abc123.clustercfg.dax.use2.amazonaws.com:8111'
)

table = dax.Table('twtech-dax-table')

response = table.get_item(Key={'id': 'twtech-123'})

print(response['Item'])

8. Security & Access

  • Integrated with IAM for access control.
  • Encryption in-transit (TLS) and at-rest supported.
  • Runs within a VPC, so you control subnets and security groups.
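
As an illustration, an IAM inline policy granting an application role data-plane access to one DAX cluster might look roughly like this (the account ID, region, role name, and cluster name are placeholders):

# python

import json
import boto3

# Illustrative policy: allow DynamoDB-style data operations against a single DAX cluster.
dax_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': [
            'dax:GetItem', 'dax:BatchGetItem', 'dax:Query', 'dax:Scan',
            'dax:PutItem', 'dax:UpdateItem', 'dax:DeleteItem', 'dax:BatchWriteItem'
        ],
        'Resource': 'arn:aws:dax:us-east-2:123456789012:cache/twtech-dax-cluster'
    }]
}

iam = boto3.client('iam')
iam.put_role_policy(
    RoleName='twtech-app-role',          # hypothetical application role
    PolicyName='twtech-dax-access',
    PolicyDocument=json.dumps(dax_policy),
)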

9. Limitations & Gotchas

  • Not a generic cache: Only supports DynamoDB operations, no arbitrary data storage.
  • Strongly consistent reads still hit DynamoDB directly — no performance boost.
  • Large queries (scan-heavy) won’t see huge benefit if data isn’t hot in cache.
  • TTL (Time to Live) for cached items is fixed per-cluster — not per-item.
  • Cluster warm-up time: Empty cache at startup → initial requests still go to DynamoDB until populated.
  • Write-heavy workloads might not benefit much (as writes still require hitting DynamoDB).

10. When to Use DAX

✅  Good fit for:

  • Read-heavy, repeated access to the same items
  • Microsecond read latency requirements
  • High request rates that risk throttling DynamoDB
  • Applications already in AWS VPC

❌  Not ideal for:

  • Mostly write-heavy workloads
  • Strong consistency requirements
  • Data changes too often to stay hot in cache

11. Pricing

  • Billed per node-hour (size and count of nodes)
  • No RCU cost for cached reads (read capacity is consumed only when the request actually hits DynamoDB)
  • Remember: bigger nodes = more in-memory capacity

12. Real-World Example

Assume twtech is running a gaming leaderboard:

  • Players refresh ranks multiple times per second
  • Rankings update every few minutes
  • Without DAX: Every refresh hits DynamoDB → millions of RCUs consumed
  • With DAX: Player rank data served from in-memory cache, DynamoDB hit only when cache expires → 90%+ cost reduction and 10× speed-up

DAX Internal Flow — Request Routing, Replication & TTL

This section contains a full internal diagram and a step-by-step explanation of how DynamoDB Accelerator (DAX) handles requests, replicates cache entries across nodes, and expires entries (TTL).

The twtech flow diagram visualizes:

  • Read/Write paths,
  • Cache lifecycle,
  • Replication guarantees,
  • Failure behavior.

Components

  • Client App: Uses the DAX SDK and issues DynamoDB-style operations (GetItem, Query, PutItem, UpdateItem, DeleteItem).
  • DAX Endpoint (DNS): Single cluster endpoint (logical). Resolves to cluster nodes and load-balances.
  • DAX Cluster Nodes: 1–10 nodes, distributed across AZs. Each node holds an in-memory cache and participates in replication and consensus for cache metadata.
  • DynamoDB Table: The authoritative datastore.
  • Control Plane: Managed by AWS for cluster membership, node replacement, and parameter sets.
  • VPC / Security Groups / IAM: Network & access controls that gate client ↔ DAX and DAX ↔ DynamoDB.

High-level read / write flow (summary)

  1. Client → DAX endpoint (single DNS).
  2. The DAX node receiving the request checks its local cache for the keys:
    • If cache hit and the request is an eventually consistent read → return the item from memory.
    • If cache miss (or strong consistency requested) → DAX forwards to DynamoDB, stores the result in cache, then returns.
  3. Writes (PutItem/UpdateItem/DeleteItem): DAX performs a write-through strategy:
    • DAX sends the write to DynamoDB first (synchronously) and waits for confirmation.
    • On success, DAX invalidates or updates the affected cached entry across the cluster.
    • The client receives the confirmed write result.

Sequence diagram: request routing, replication and TTL

# mermaid (sample sequence diagram)
sequenceDiagram
    participant C as Client
    participant E as DAX Endpoint
    participant N1 as DAX Node-A (leader)
    participant N2 as DAX Node-B
    participant DB as DynamoDB
    C->>E: GetItem(Key)
    E->>N1: Route request
    alt cache hit
        N1-->>C: Item (in-memory)
    else cache miss
        N1->>DB: GetItem(Key)
        DB-->>N1: Item
        N1->>N2: Replicate cache entry (async)
        N1-->>C: Item
    end
    C->>E: PutItem(Key,Value)
    E->>N1: Route write
    N1->>DB: PutItem
    DB-->>N1: Success
    N1->>N2: Invalidate/Update cache for Key (sync/async)
    N1-->>C: Write success
    Note over N1,N2: Cache entries have metadata {Key, Value, InsertTS, TTL}
    %% TTL expiration
    Note over N1: Background janitor checks TTL -> removes expired key
    N1->>N2: Propagate delete for expired key

Cache entry lifecycle (per-key)

  1. Creation
    • On a cache miss, the serving node fetches the item from DynamoDB and inserts an entry: {Key, Value, InsertTS, TTL}. TTL is computed from cluster config + item metadata.
  2. Replication
    • The serving node asynchronously (or via a lightweight replication protocol) propagates the entry metadata and value to other nodes.
    • Replication is optimized for low-latency eventual consistency — other nodes may see the new entry shortly after the first node.
  3. Access / Refresh
    • Subsequent reads to any node that has the entry return it from memory. Reads may update access metadata (for LRU) locally and partially sync access stats across nodes for eviction decisions.
  4. Invalidation on Write
    • Writes are sent to DynamoDB first. After success, the coordinator node sends invalidation or update messages for the key to all nodes so they don’t serve stale data.
    • Invalidation may be synchronous for correctness-critical paths or eventually propagated, depending on configuration and operation type.
  5. TTL Expiration (see the janitor sketch after this list)
    • Each node runs a local background janitor that checks InsertTS + TTL and purges expired items.
    • The node then broadcasts the delete to cluster peers to maintain reasonable cross-node cache consistency.
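
A minimal sketch of the TTL step, assuming the per-entry metadata described above; this is purely conceptual, and the `peers` objects with their `delete_key` method are hypothetical stand-ins for the cluster's delete propagation:

# python

import time

def purge_expired(cache, peers):
    """Conceptual per-node janitor: drop entries whose TTL has elapsed."""
    now = time.time()
    expired = [key for key, entry in cache.items()
               if now >= entry['insert_ts'] + entry['ttl_seconds']]
    for key in expired:
        del cache[key]                 # local expiry
        for peer in peers:
            peer.delete_key(key)       # broadcast the delete to cluster peers

# Expected cache shape: {key: {'value': ..., 'insert_ts': epoch_seconds, 'ttl_seconds': ...}}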

Replication details & guarantees

  • Replication model: Optimized for fast propagation with eventual convergence. Not a strict consensus log like a database cluster; it prioritizes low read latency.
  • Durability: DAX is an in-memory cache — not durable. Losing all nodes means a cold cache; the underlying DynamoDB table remains authoritative.
  • Write ordering: Writes to DynamoDB are serialized by DynamoDB itself. DAX relies on DynamoDB’s success responses and then applies cache invalidation/updates.
  • Cross-AZ replication: Nodes in other AZs will receive the updates, but there is a small window where a node in another AZ may still have old data until it receives the invalidation.

TTL (time-to-live) semantics and corner cases

  • Cluster-level TTL config: The DAX cluster has TTL behavior that determines how long cached items live. TTL may be influenced by the requesting client (if the item carries its own expiry metadata in the value) but is ultimately governed by DAX cluster settings.
  • Clock skew: Each node uses its own clock for TTL checks. AWS mitigates this with NTP-synced hosts; minor skew can produce small windows where nodes disagree on expiry.
  • Expired but still-served corner case: If node A expires a key and node B hasn’t yet, a request to B may return an item that A regards as expired. Invalidation broadcasts and short TTLs reduce this window.

Failure modes and behavior

  • Node failure: Other nodes continue serving; the cluster rebalances. Keys cached mainly on the failed node stay cold until re-fetched or replicated.
  • Network partition: An isolated partition may serve stale entries for eventually consistent reads. Writes still won’t be acknowledged unless they can reach DynamoDB (DAX forwards writes to DynamoDB).
  • Cluster replacement / scaling: A newly joined node requests snapshot/metadata to warm up; it starts cold and populates as requests hit it.

Operational concerns & tuning

  • Node size & count: More/larger nodes = higher in-memory capacity and replication fan-out.
  • TTL tuning: Shorter TTL → fresher reads but more DynamoDB load on misses. Longer TTL → lower DynamoDB RCU usage but a longer staleness window (see the sketch after this list).
  • Eviction policy: Typically LRU or a variant; tuning impacts which keys stay in memory.
  • Monitoring: Watch cache hit rate, replication lag (if exposed), DynamoDB RCU consumption, and error rates.
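
In practice, the item-cache and query-cache TTLs are set on the cluster’s parameter group rather than per item. A hedged boto3 sketch (the parameter group name is a placeholder and the five-minute values are arbitrary examples):

# python

import boto3

dax_admin = boto3.client('dax', region_name='us-east-2')

# Set both the item cache and query cache TTLs to 5 minutes (300,000 ms).
dax_admin.update_parameter_group(
    ParameterGroupName='twtech-dax-params',   # placeholder parameter group
    ParameterNameValues=[
        {'ParameterName': 'record-ttl-millis', 'ParameterValue': '300000'},
        {'ParameterName': 'query-ttl-millis',  'ParameterValue': '300000'},
    ],
)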

Insights: Legend / Notation

  • Eventually consistent reads: served from cache when available.
  • Strongly consistent reads: bypass cache and read from DynamoDB.
  • Write-through: write to DynamoDB first, then update/invalidate cache.

