Amazon EMR (Elastic MapReduce) with Node Types & Purchasing Options - Overview.
Scope:
- Intro to Amazon EMR Architecture Basics,
- EMR Node Types,
- Purchasing Options for EMR Nodes,
- Node Placement Strategies in Real Workloads,
- Practical Considerations,
- Final Tips,
- Insights.
Intro to Amazon EMR Architecture Basics
- Amazon EMR is a managed big data framework (Hadoop, Spark, Hive, Presto, etc.) that runs on top of EC2 instances.
- The cluster is organized into nodes grouped by function, and each node can be tied to a different EC2 purchasing model.
EMR Node Types
- Master Node
- Orchestrates the cluster.
- Runs the YARN ResourceManager, HDFS NameNode, the Spark driver (in client deploy mode), and cluster-management daemons.
- HDFS (Hadoop Distributed File System) is the primary storage layer of Apache Hadoop: an open-source, distributed file system designed to store and manage very large datasets across a cluster of low-cost commodity hardware.
- Decides which job runs where.
- Usually a single node, but EMR supports a multi-master (HA) mode for production.
- Critical: in single-master mode, if it fails the whole cluster is down.
- Should run on On-Demand or Reserved, not Spot.
- Core Node
- Executes tasks and stores data in HDFS.
- Runs the YARN NodeManager and HDFS DataNode.
- Long-lived storage role.
- Typically scaled based on workload and storage requirements.
- Should run On-Demand or Reserved if twtech relies on HDFS (storage consistency).
- Spot can cause data loss if the instance is terminated.
- Task Node (a.k.a. Task-Only / Compute-Only)
- Executes processing tasks but does not store HDFS blocks.
- Purely compute; an ephemeral role.
- Ideal for burst scaling with Spot Instances.
- Can be dynamically added/removed with EMR Auto Scaling.
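The three node roles map directly onto EMR's instance-group model. A minimal sketch of that layout (the instance types and counts are illustrative assumptions, not recommendations):

```python
# Instance-group layout mirroring the three EMR node types.
# Types and counts below are illustrative assumptions.
instance_groups = [
    # Orchestration: never Spot in single-master mode.
    {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    # HDFS DataNodes: keep stable if the cluster relies on HDFS.
    {"InstanceRole": "CORE", "Market": "ON_DEMAND",
     "InstanceType": "m5.xlarge", "InstanceCount": 3},
    # Stateless compute: safe to run on Spot.
    {"InstanceRole": "TASK", "Market": "SPOT",
     "InstanceType": "m5.xlarge", "InstanceCount": 6},
]
```

In a real launch this list would be passed under `Instances={"InstanceGroups": instance_groups}` in an EMR RunJobFlow request (e.g. via boto3).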
Purchasing Options for EMR Nodes
- On-Demand Instances
- Pay per second (with a 1-minute minimum).
- Best for: Master nodes (to ensure stability), and Core nodes if twtech needs consistent availability.
- Predictable cost, no interruption risk.
- Spot Instances
- Up to 70–90% cheaper than On-Demand.
- Risk: AWS can reclaim the instance with a 2-minute notice if the capacity is needed.
- Best for: Task nodes (since they're stateless compute).
- Possible but risky for Core nodes (loss of HDFS data if terminated).
- Reserved Instances (RI)
- Commit to a 1-year or 3-year term for predictable workloads.
- Provides a significant discount (up to ~70%).
- Applies to the EC2 instances under EMR, not to the EMR service fee itself.
- Best for: long-running clusters or known steady-state workloads.
- Typically applied to Master + Core nodes.
- Savings Plans
- Flexible discount model compared with RIs.
- Commit to spending a certain $/hour over a 1- or 3-year term.
- Compute Savings Plans cover EC2 usage (including EMR's underlying instances) regardless of instance family or Region.
- Good for mixed and evolving workloads.
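A quick back-of-envelope comparison of the three models for one node over a month. The $0.192/hr On-Demand rate is a hypothetical figure for illustration, not a quoted AWS price:

```python
# Monthly cost of one node under each purchasing model (hypothetical rates).
HOURS_PER_MONTH = 730
on_demand_rate = 0.192                 # assumed On-Demand $/hr
spot_rate = on_demand_rate * 0.30      # ~70% off: the low end of the 70-90% range
ri_rate = on_demand_rate * 0.40        # ~60% effective discount on a long RI term

on_demand_cost = on_demand_rate * HOURS_PER_MONTH   # 140.16, no interruption risk
spot_cost = spot_rate * HOURS_PER_MONTH             # ~42.05, but interruptible
ri_cost = ri_rate * HOURS_PER_MONTH                 # ~56.06, needs 1-3 yr commitment
```

Spot is the cheapest but carries interruption risk; RI undercuts On-Demand only if the commitment actually gets used.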
Node Placement Strategies in Real Workloads
- Production, steady workload
- Master: On-Demand or Reserved
- Core: Reserved (if long-lived) or On-Demand
- Task: Spot
- Burst / analytics workloads
- Master: On-Demand
- Core: On-Demand
- Task: Mostly Spot
- High Availability (multi-master)
- Masters: 3x On-Demand (spread across AZs)
- Core: On-Demand or Reserved
- Task: Spot
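The three strategies above, encoded as a simple lookup (markets only; instance counts such as the 3x HA masters are omitted, and "RESERVED" here just means On-Demand capacity covered by an RI or Savings Plan):

```python
# Purchasing model per node role for each placement strategy above.
PLACEMENT = {
    "production_steady": {"MASTER": "ON_DEMAND", "CORE": "RESERVED",  "TASK": "SPOT"},
    "burst_analytics":   {"MASTER": "ON_DEMAND", "CORE": "ON_DEMAND", "TASK": "SPOT"},
    "high_availability": {"MASTER": "ON_DEMAND", "CORE": "ON_DEMAND", "TASK": "SPOT"},
}
```

Note the constant: Task nodes ride Spot in every pattern, because they hold no state.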
Practical Considerations
- Auto Scaling: EMR can scale Task nodes up/down based on metrics (YARN queue depth, Spark pending jobs, etc.).
- Data durability: if using S3 via EMRFS (the EMR File System) instead of HDFS, twtech can safely run all Core nodes on Spot, because the data is externalized.
- Instance fleets & EMR managed scaling: Allow mixing On-Demand + Spot across multiple instance types, reducing the chance of Spot interruptions.
- Cost vs reliability tradeoff:
- Mission-critical → more On-Demand/Reserved.
- Cost-sensitive analytics → maximize Spot for Task nodes.
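Instance fleets express that On-Demand + Spot mix declaratively. A sketch of a Task fleet diversified across three instance types (the types and capacities are illustrative assumptions):

```python
# Task instance fleet: a small On-Demand floor plus a larger Spot target,
# spread across several instance types to reduce interruption risk.
task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 2,   # guaranteed baseline
    "TargetSpotCapacity": 8,       # cheap burst capacity
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge",  "WeightedCapacity": 1},
        {"InstanceType": "m5a.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5.xlarge",  "WeightedCapacity": 1},
    ],
}
```

More instance types means EMR can draw from more Spot capacity pools, which lowers the chance that one reclaimed pool takes out the whole fleet.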
Final Tips:
- Master → stable (On-Demand / Reserved)
- Core → stable if HDFS (On-Demand / Reserved); Spot if using S3
- Task → elastic compute (Spot)
Below are real-world EMR architecture patterns showing how node types and purchasing options shift depending on whether twtech uses S3 as a data lake or runs HDFS-heavy EMR.
Pattern 1: S3-Based Data Lake (Most Common Modern Setup)
- Data is stored in Amazon S3, and EMR clusters are mostly ephemeral (spin up, process, terminate).
Architecture
- Storage: S3 (via EMRFS, with the Glue Data Catalog for schema)
- Processing: Spark, Presto, Hive on EMR
- HDFS: Used minimally (temporary shuffle or scratch space)
Node Role Strategy
- Master Node
- On-Demand (or Reserved for always-on clusters).
- No tolerance for interruption.
- Core Nodes
- Since data lives in S3, Core nodes don't need to persist HDFS data.
- They can run on Spot safely (no real data loss if interrupted).
- A common mix is 30% On-Demand + 70% Spot for balance.
- Task Nodes
- 100% Spot (pure compute, safe to lose).
- Elastic scaling during heavy queries.
Purchasing Mix
- Master → On-Demand
- Core → Mostly Spot (cheap, disposable)
- Task → Spot (aggressive savings)
NB:
- This model minimizes cost — great for ad hoc analytics, ETL, ML preprocessing.
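Pattern 1's "30% On-Demand + 70% Spot" Core mix can be expressed as instance-fleet target capacities (the 10-unit total is an illustrative assumption):

```python
# Split a Core fleet's capacity 30/70 between On-Demand and Spot.
core_units = 10
on_demand_units = round(core_units * 0.30)   # 3 anchored units
spot_units = core_units - on_demand_units    # 7 interruptible units

core_fleet = {
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": on_demand_units,
    "TargetSpotCapacity": spot_units,
}
```

This split only makes sense because the data lives in S3; in Pattern 2 below the same ratio would put HDFS blocks at risk.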
Pattern 2: HDFS-Heavy EMR (Traditional Hadoop-Style)
- Data is stored in HDFS across Core nodes (instead of S3).
- This means node loss = data loss.
Architecture
- Storage: HDFS (blocks distributed across Core nodes)
- Processing: Spark, Hadoop MapReduce, Hive
- Cluster: Long-lived, sometimes weeks/months
Node Role Strategy
- Master Node
- Always On-Demand or Reserved.
- In HA mode → 3 masters across AZs, all On-Demand.
- Core Nodes
- Must be stable because they store HDFS blocks.
- Run On-Demand or Reserved for durability.
- Spot is risky: if AWS reclaims the instance, HDFS blocks can be lost or left under-replicated.
- Task Nodes
- Still flexible: they can run on Spot, since they don't store HDFS blocks.
Purchasing Mix
- Master → Reserved (predictable long-term workloads)
- Core → Reserved (durable HDFS storage)
- Task → Spot (for scaling)
NB:
- This setup is more expensive but necessary for HDFS-native workloads (legacy Hadoop, Spark caching, or when low-latency storage is required).
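Because every HDFS block is stored on multiple Core nodes, raw disk does not equal usable capacity, which is one reason HDFS-heavy clusters cost more. A rough sizing helper (the node count, disk size, and replication factor of 3 are assumptions for illustration):

```python
# Usable HDFS capacity: raw Core-node disk divided by the replication factor,
# since each block is written `replication` times across the fleet.
def usable_hdfs_tb(core_nodes: int, disk_tb_per_node: float,
                   replication: int = 3) -> float:
    return core_nodes * disk_tb_per_node / replication

usable_hdfs_tb(6, 4)   # 6 nodes x 4 TB / 3 replicas = 8.0 TB usable
```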
Pattern 3: Hybrid (HDFS for hot data + S3 for bulk data)
Used when:
- S3 holds raw/archival data
- HDFS holds intermediate working sets for faster performance
Node Role Strategy
- Master Node → On-Demand/Reserved
- Core Nodes → Mix of On-Demand (for HDFS reliability) + Spot (for transient spillover)
- Task Nodes → Spot
Purchasing Mix Example
- 20% Core nodes On-Demand (anchor HDFS durability)
- 80% Core nodes Spot (extra storage/compute that can vanish safely)
- Task nodes 100% Spot
NB:
- This balances performance + cost.
- Common in ML training pipelines and streaming + batch hybrid architectures.
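Pattern 3's 20/80 anchor-vs-spillover split for Core nodes, as a tiny helper (the fraction and node total are illustrative, and `core_split` is our own name, not an AWS API):

```python
# Split a Core fleet into an On-Demand "anchor" and a Spot "spillover" share.
def core_split(total_nodes: int, on_demand_frac: float) -> tuple[int, int]:
    anchored = round(total_nodes * on_demand_frac)
    return anchored, total_nodes - anchored

core_split(10, 0.20)   # (2, 8): 2 anchored On-Demand nodes, 8 Spot nodes
```

The anchored share is what keeps HDFS durable when Spot capacity vanishes.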
Pattern 4: Always-On Streaming + Batch (e.g., Kafka + Spark Structured Streaming)
- Some orgs run EMR clusters 24/7 for continuous ingest + analytics.
Node Role Strategy
- Master → On-Demand/Reserved (must stay alive)
- Core → Mostly Reserved (because it's a 24/7 workload)
- Task → Mix of Spot + On-Demand (streaming jobs can't be disrupted as easily as batch)
Purchasing Mix
- Master → Reserved
- Core → Reserved
- Task → 50% On-Demand, 50% Spot (to balance cost and stability)
NB:
- This is a good fit for financial services, IoT, fraud detection pipelines.
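The 50/50 Task mix lands at a blended hourly rate halfway between the two models. Both rates below are hypothetical, carried over from the earlier cost sketch:

```python
# Blended $/hr for a 50% On-Demand / 50% Spot Task fleet (hypothetical rates).
on_demand_hr = 0.192
spot_hr = 0.058                                    # roughly 70% off the assumed rate
blended_hr = 0.5 * on_demand_hr + 0.5 * spot_hr    # ~0.125 $/hr per Task node
```

The On-Demand half is the price paid for keeping streaming jobs alive through Spot reclaims.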
Pattern 5: Short-Lived EMR Clusters (Transient, Job-Specific)
- Cluster spins up → processes data from S3 → shuts down.
Node Role Strategy
- Master → On-Demand (for job orchestration)
- Core → Spot (safe because no persistent HDFS use)
- Task → Spot
Purchasing Mix
- Master → On-Demand
- Core → Spot
- Task → Spot
NB:
- This is the lowest-cost architecture for ETL-style workflows or scheduled nightly jobs.
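A transient cluster is just a launch request that does not stay alive once its steps finish. A sketch of such a request (the cluster name, release label, and fleet sizes are illustrative assumptions):

```python
# Transient EMR cluster request: terminates itself when its steps complete,
# so the Spot-heavy fleet only exists for the duration of the job.
transient_cluster = {
    "Name": "nightly-etl",                      # hypothetical job name
    "ReleaseLabel": "emr-6.15.0",               # assumed EMR release
    "Instances": {
        "KeepJobFlowAliveWhenNoSteps": False,   # auto-terminate after steps
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core on Spot is safe here: the data lives in S3, not HDFS.
            {"InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
    },
}
```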
Summary by Storage Choice

| Storage Model | Master | Core | Task |
| --- | --- | --- | --- |
| S3-based (stateless) | On-Demand | Spot / Mix | Spot |
| HDFS-heavy | On-Demand / Reserved | On-Demand / Reserved | Spot |
| Hybrid (HDFS + S3) | On-Demand | Mix (Anchor = Reserved, Extra = Spot) | Spot |
| Always-On (Streaming) | Reserved | Reserved | Mix (Spot + On-Demand) |
| Transient EMR (ephemeral) | On-Demand | Spot | Spot |
So the big pivot point is:
- If S3 = primary storage, twtech can aggressively use Spot Instances for Core/Task.
- If HDFS = primary storage, twtech must anchor Core nodes with On-Demand/Reserved.
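That pivot fits in one line of logic. A sketch of the decision (the function name is ours, not an AWS API):

```python
# Pick the Core-node purchasing model from the primary storage layer.
def core_market(primary_storage: str) -> str:
    if primary_storage.lower() == "s3":
        return "SPOT"        # data is externalized to S3; interruptions are safe
    return "ON_DEMAND"       # HDFS blocks live on Core nodes; anchor them
```

Task nodes stay on Spot either way; only the Core decision hinges on storage.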