Wednesday, September 3, 2025

Amazon EMR (Elastic MapReduce) with Node Types & Purchasing Options | Overview.

Amazon EMR (Elastic MapReduce) with Node Types & Purchasing Options - Overview.

 Scope:

  • Intro to Amazon EMR Architecture Basics,
  • EMR Node Types,
  • Purchasing Options for EMR Nodes,
  • Node Placement Strategies in Real Workloads,
  • Practical Considerations,
  • Final Tips,
  • Insights.

Intro to Amazon EMR Architecture Basics

    • Amazon EMR is a managed big data framework (Hadoop, Spark, Hive, Presto, etc.) that runs on top of EC2 instances.
    • The cluster is organized into nodes grouped by function, and each node can be tied to a different EC2 purchasing model.

 EMR Node Types

  1. Master Node
    • Orchestrates the cluster.
    • Runs the YARN ResourceManager, HDFS NameNode, the Spark driver (in client deploy mode), and cluster management daemons.
    • HDFS (Hadoop Distributed File System) is the primary storage component of the Apache Hadoop framework.
    • It is an open-source, distributed file system designed to store and manage very large datasets across a cluster of low-cost, commodity hardware.
    • Decides what job runs where.
    • Usually a single node, but EMR supports multi-master (HA) mode for production.
    • Critical — if it fails (in single-master mode), the whole cluster is down.
    • Should run on On-Demand or Reserved, not Spot.
  2. Core Node
    • Executes tasks and stores data in HDFS (Hadoop Distributed File System).
    • Runs YARN NodeManager + HDFS DataNode.
    • Long-lived storage role.
    • Typically scaled based on workload and storage requirements.
    • Should run On-Demand or Reserved if twtech relies on HDFS (storage consistency). 
    • Spot can cause data loss if terminated.
  3. Task Node (a.k.a. Task-Only / Compute-Only)
    • Executes processing tasks but does not store HDFS blocks.
    • Purely compute; ephemeral role.
    • Ideal for burst scaling with Spot Instances.
    • Can be dynamically added/removed with EMR Auto Scaling.
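
The three node roles above map directly onto the EMR API. A minimal boto3-style sketch (instance types and counts are illustrative assumptions, not recommendations):

```python
# Sketch of an EMR "InstanceGroups" request body mapping each node type to a
# purchasing model. Instance types and counts are assumed placeholders.
instance_groups = [
    {"Name": "Primary", "InstanceRole": "MASTER",
     "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "Core", "InstanceRole": "CORE",
     "Market": "ON_DEMAND", "InstanceType": "m5.2xlarge", "InstanceCount": 3},
    {"Name": "Task", "InstanceRole": "TASK",
     "Market": "SPOT", "InstanceType": "m5.2xlarge", "InstanceCount": 6},
]

# This list is what boto3's emr.run_job_flow(Instances={"InstanceGroups": ...})
# expects; the actual API call is omitted because it needs AWS credentials.
spot_roles = [g["InstanceRole"] for g in instance_groups if g["Market"] == "SPOT"]
print(spot_roles)  # only the stateless task group runs on Spot
```

Note that only the stateless TASK group is placed on Spot; MASTER and CORE stay On-Demand, matching the guidance above.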

 

Purchasing Options for EMR Nodes

  1. On-Demand Instances
    • Pay per second (with a 1-minute minimum).
    • Best for: Master nodes (to ensure stability), Core nodes if twtech needs consistent availability.
    • Predictable cost, no interruption risk.
  2. Spot Instances
    • Up to 70–90% cheaper than On-Demand.
    • Risk: AWS can reclaim the instance with 2-minute notice if capacity is needed.
    • Best for: Task nodes (since they’re stateless compute).
    • Possible but risky for Core nodes (loss of HDFS data if terminated).
  3. Reserved Instances (RI)
    • Commit to 1-year or 3-year term for predictable workloads.
    • Provides significant discount (up to ~70%).
    • Discounts apply to the underlying EC2 instances, not to the EMR per-instance service charge.
    • Best for: long-running clusters or known steady-state workloads.
    • Typically applied to Master + Core nodes.
  4. Savings Plans
    • Flexible discount model vs. RI.
    • Commitment to spend a certain $/hour over 1–3 years.
    • Compute Savings Plans apply across instance families and Regions (and also cover Fargate/Lambda); EC2 Instance Savings Plans are tied to one instance family in one Region.
    • Good for mixed and evolving workloads.
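
As a back-of-envelope illustration of the discount tiers above, the monthly cost of one node under each option can be compared. All prices below are assumed placeholders, not AWS quotes:

```python
# Illustrative hourly rates for a single node (assumed values, not AWS prices):
# On-Demand baseline, Spot at an assumed 70% discount, RI at an assumed 40%.
on_demand = 0.192            # $/hour, assumed baseline
spot = on_demand * 0.30      # assuming a 70% Spot discount
reserved = on_demand * 0.60  # assuming a 40% 1-year RI discount

def monthly_cost(rate_per_hour, hours=730):
    """Approximate monthly cost (730 hours/month) at a given hourly rate."""
    return round(rate_per_hour * hours, 2)

print(monthly_cost(on_demand))  # baseline
print(monthly_cost(spot))       # cheapest, but interruptible
print(monthly_cost(reserved))   # discounted, requires a 1-3 year commitment
```

The point of the arithmetic is the shape, not the numbers: Spot dominates on price for interruptible work, while RIs/Savings Plans reward predictable, long-lived capacity.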

 Node Placement Strategies in Real Workloads

  • Production, steady workload
    • Master: On-Demand or Reserved
    • Core: Reserved (if long-lived) or On-Demand
    • Task: Spot
  • Burst / analytics workloads
    • Master: On-Demand
    • Core: On-Demand
    • Task: Mostly Spot
  • High Availability (multi-master)
    • Masters: 3x On-Demand (spread across AZs)
    • Core: On-Demand or Reserved
    • Task: Spot

 Practical Considerations

    • Auto Scaling: EMR can scale Task Nodes up/down based on metrics (YARN queue, Spark pending jobs, etc.).
    • Data durability: if twtech uses S3 (via EMRFS, the EMR File System) instead of HDFS, Core nodes can run on Spot with far less risk, because the data is externalized.
    • Instance fleets & EMR managed scaling: Allow mixing On-Demand + Spot across multiple instance types, reducing the chance of Spot interruptions.
    • Cost vs reliability tradeoff:
      • Mission-critical → more On-Demand/Reserved.
      • Cost-sensitive analytics → maximize Spot for Task nodes.
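
The instance-fleet idea above can be sketched as a boto3-style request body. Instance types, weights, and target capacities are illustrative assumptions:

```python
# Sketch of an EMR task instance fleet mixing On-Demand and Spot across
# several instance types, so a Spot shortage in one capacity pool is less
# disruptive. Types, weights, and capacities are assumed placeholders.
task_fleet = {
    "Name": "TaskFleet",
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 2,   # small On-Demand anchor
    "TargetSpotCapacity": 8,       # bulk of the compute on Spot
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "c5.4xlarge", "WeightedCapacity": 4},
    ],
}

# Passed as run_job_flow(Instances={"InstanceFleets": [task_fleet, ...]});
# the call itself is omitted because it needs AWS credentials.
total = task_fleet["TargetOnDemandCapacity"] + task_fleet["TargetSpotCapacity"]
spot_share = task_fleet["TargetSpotCapacity"] / total
print(f"Spot share of task capacity: {spot_share:.0%}")
```

Listing several instance types lets EMR fill the Spot target from whichever pool currently has capacity, which is exactly how interruption risk is reduced.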

Final Tips:

    • Master → stable (On-Demand / Reserved)
    • Core → stable if HDFS (On-Demand / Reserved); Spot if using S3
    • Task → elastic compute (Spot)
twtech-Insights:
    • Real-world EMR architecture patterns, and how node type + purchasing options shift depending on whether twtech uses S3 as a data lake or runs an HDFS-heavy EMR cluster.

 Pattern 1: S3-Based Data Lake (Most Common Modern Setup)

  • Data is stored in Amazon S3, and EMR clusters are mostly ephemeral (spin up, process, terminate).

Architecture

    • Storage: S3 (via EMRFS or Glue Data Catalog for schema)
    • Processing: Spark, Presto, Hive on EMR
    • HDFS: Used minimally (temporary shuffle or scratch space)

Node Role Strategy

  • Master Node
    • On-Demand (or Reserved for always-on clusters).
    • No tolerance for interruption.
  • Core Nodes
    • Since data lives in S3, Core nodes don’t need to persist HDFS data.
    • They can run Spot safely (if interrupted, no real data loss).
    • A common mix is 30% On-Demand + 70% Spot for balance.
  • Task Nodes
    • 100% Spot (pure compute, safe to lose).
    • Elastic scaling during heavy queries.

Purchasing Mix

    • Master: On-Demand
    • Core: mostly Spot (cheap, disposable)
    • Task: Spot (aggressive savings)

NB:

  • This model minimizes cost — great for ad hoc analytics, ETL, ML preprocessing.
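
For this S3-based pattern, EMR managed scaling is what delivers the "elastic scaling during heavy queries". A sketch of the policy body (all limit values are assumptions):

```python
# Sketch of an EMR managed-scaling policy for an S3-backed cluster: the
# cluster can grow aggressively on Spot while On-Demand and Core capacity
# stay bounded. All limit values are illustrative assumptions.
managed_scaling_policy = {
    "ComputeLimits": {
        "UnitType": "Instances",
        "MinimumCapacityUnits": 3,          # floor: master + small core group
        "MaximumCapacityUnits": 20,         # overall ceiling
        "MaximumOnDemandCapacityUnits": 6,  # growth beyond this uses Spot
        "MaximumCoreCapacityUnits": 6,      # extra capacity goes to task nodes
    }
}

# boto3's emr.put_managed_scaling_policy(ClusterId=..., ManagedScalingPolicy=...)
# would apply it; the call is omitted since it needs a live cluster.
limits = managed_scaling_policy["ComputeLimits"]
headroom = limits["MaximumCapacityUnits"] - limits["MinimumCapacityUnits"]
print(f"Scaling headroom: {headroom} instances")
```

Capping On-Demand and Core units steers all burst capacity toward Spot task nodes, which is safe here precisely because the data lives in S3.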

 Pattern 2: HDFS-Heavy EMR (Traditional Hadoop-Style)

    • Data is stored in HDFS across Core nodes (instead of S3).
    • This means node loss = data loss.

Architecture

    • Storage: HDFS (blocks distributed across Core nodes)
    • Processing: Spark, Hadoop MapReduce, Hive
    • Cluster: Long-lived, sometimes weeks/months

Node Role Strategy

  • Master Node
    • Always On-Demand or Reserved.
    • In HA mode → 3 masters across AZs, all On-Demand.
  • Core Nodes
    • Must be stable because they store HDFS blocks.
    • Run On-Demand or Reserved for durability.
    • Spot is risky: if AWS reclaims an instance, HDFS blocks become under-replicated and data can be lost.
  • Task Nodes
    • Still flexible; they can be Spot, since they don’t store HDFS blocks.

Purchasing Mix

    • Master: Reserved (predictable long-term workloads)
    • Core: Reserved (durable HDFS storage)
    • Task: Spot (for scaling)

NB:

    • This setup is more expensive but necessary for HDFS-native workloads (legacy Hadoop, Spark caching, or when low-latency storage is required).
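
When HDFS is the primary store, block replication is the main durability lever. A sketch of the EMR Configurations entry that pins it explicitly (3 is the common Hadoop default; shown here for illustration):

```python
# Sketch of an EMR "Configurations" entry setting HDFS block replication,
# so losing one Core node does not lose the only copy of a block.
# A factor of 3 is the common Hadoop default, shown here explicitly.
hdfs_config = [
    {
        "Classification": "hdfs-site",
        "Properties": {"dfs.replication": "3"},
    }
]

# Passed as run_job_flow(Configurations=hdfs_config, ...). With replication 3,
# any single block survives the simultaneous loss of up to two Core nodes.
replication = int(hdfs_config[0]["Properties"]["dfs.replication"])
print(f"HDFS replication factor: {replication}")
```

Replication protects against individual node failures, but note it does not make Spot safe for Core nodes: a large Spot reclaim can still take out all replicas of a block at once.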

 Pattern 3: Hybrid (HDFS for hot data + S3 for bulk data)

Used when:

    • S3 holds raw/archival data
    • HDFS holds intermediate working sets for faster performance

Node Role Strategy

    • Master Node: On-Demand/Reserved
    • Core Nodes: mix of On-Demand (for HDFS reliability) + Spot (for transient spillover)
    • Task Nodes: Spot

Purchasing Mix Example

    • 20% of Core nodes: On-Demand (anchor HDFS durability)
    • 80% of Core nodes: Spot (extra storage/compute that can vanish safely)
    • Task nodes: 100% Spot

NB:

    • This balances performance + cost. 
    • Common in ML training pipelines and streaming + batch hybrid architectures.
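
The 20/80 anchor split above can be sketched as a small helper (the 20% share and the ceiling-based rounding are assumptions of this sketch):

```python
import math

def core_split(total_core, on_demand_share=0.20):
    """Return (on_demand, spot) node counts for a hybrid Core group.

    on_demand_share is the fraction anchored On-Demand for HDFS durability;
    math.ceil guarantees at least one anchor node whenever total_core > 0.
    """
    anchors = math.ceil(total_core * on_demand_share) if total_core else 0
    return anchors, total_core - anchors

print(core_split(10))  # (2, 8)
print(core_split(3))   # (1, 2) - small clusters still keep one anchor
```

Rounding the anchor count up rather than down matters at small cluster sizes: a 3-node Core group with a strict 20% share would otherwise get zero On-Demand anchors.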

 Pattern 4: Always-On Streaming + Batch (e.g., Kafka + Spark Structured Streaming)

  • Some orgs run EMR clusters 24/7 for continuous ingest + analytics.

Node Role Strategy

    • Master: On-Demand/Reserved (must stay alive)
    • Core: mostly Reserved (because it’s a 24/7 workload)
    • Task: mix of Spot + On-Demand (streaming jobs can’t be disrupted as easily as batch)

Purchasing Mix

    • Master: Reserved
    • Core: Reserved
    • Task: 50% On-Demand, 50% Spot (to balance cost and stability)

NB:

  • This is a good fit for financial services, IoT, fraud detection pipelines.

 Pattern 5: Short-Lived EMR Clusters (Transient, Job-Specific)

  • The cluster spins up → processes data from S3 → shuts down.

Node Role Strategy

    • Master: On-Demand (for job orchestration)
    • Core: Spot (safe because there is no persistent HDFS use)
    • Task: Spot

Purchasing Mix

    • Master: On-Demand
    • Core: Spot
    • Task: Spot

NB:

  • This is the lowest-cost architecture for ETL-style workflows or scheduled nightly jobs.
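
The transient pattern hinges on one flag: the cluster terminates itself when its last step finishes. A boto3-style sketch (bucket, script path, release label, and instance counts are hypothetical placeholders):

```python
# Sketch of a transient-cluster request: run one Spark step against S3,
# then self-terminate. Bucket, script path, release label, and instance
# counts are hypothetical placeholders.
transient_cluster = {
    "Name": "nightly-etl",
    "ReleaseLabel": "emr-7.1.0",
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.2xlarge", "InstanceCount": 4},
        ],
        # Key setting: terminate when the last step completes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [{
        "Name": "run-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
        },
    }],
}

# boto3's emr.run_job_flow(**transient_cluster) would launch it; the call
# is omitted because it needs AWS credentials.
print(transient_cluster["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

Because the cluster never outlives its job, billing stops automatically, and Core nodes on Spot are acceptable since a reclaimed node just means re-running the step.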

 Summary by Storage Choice

Storage Model              | Master               | Core                                  | Task
S3-based (stateless)       | On-Demand            | Spot / Mix                            | Spot
HDFS-heavy                 | On-Demand / Reserved | On-Demand / Reserved                  | Spot
Hybrid (HDFS + S3)         | On-Demand            | Mix (Anchor = Reserved, Extra = Spot) | Spot
Always-On (Streaming)      | Reserved             | Reserved                              | Mix (Spot + On-Demand)
Transient EMR (ephemeral)  | On-Demand            | Spot                                  | Spot

 So the big pivot point is:

  • If S3 = primary storage, twtech can aggressively use Spot Instances for Core/Task.
  • If HDFS = primary storage, twtech must anchor Core nodes with On-Demand/Reserved.

