Monday, December 22, 2025

AWS Batch with Examples | Overview.


An Overview of AWS Batch.

Focus:

  •        Tailored for DevOps / Cloud / DevSecOps engineers.
  •        With real-world examples, architecture details, and design trade-offs.

Breakdown:

  •        Intro,
  •        The concept: AWS Batch (Beyond the Marketing),
  •        Core AWS Batch Architecture,
  •        Job Definitions – The “Execution Contract”,
  •        Job Queues – Priority & Scheduling Control,
  •        Compute Environments – Where the Magic Happens,
  •        AWS Batch vs ECS vs EKS (When to Use What),
  •        Real-World Example #1 – Large-Scale ETL Pipeline,
  •        Real-World Example #2 – DevOps Automation at Scale,
  •        Array Jobs – Massive Parallelism,
  •        Dependency Graphs – Workflow Orchestration,
  •        Observability & Operations,
  •        Security & IAM (DevSecOps Angle),
  •        Cost Optimization Strategies,
  •        When NOT to Use AWS Batch,
  •        AWS Batch in One Sentence.

Intro:

  •        AWS Batch is a fully managed service that enables developers, scientists, and engineers to run large-scale batch computing workloads on the AWS Cloud.
  •        AWS Batch dynamically provisions the optimal amount of compute resources (e.g., CPU or memory-optimized instances) and eliminates the need to manage the underlying infrastructure.

Core Components and Workflow

Jobs:

  •          A unit of work (e.g., a shell script, a Docker container executable) that twtech submits to AWS Batch.
  •         Jobs are specified by a job definition.

Job Definitions:

  •          A blueprint for twtech jobs, specifying runtime parameters, container images, instance types, IAM roles, and environment variables.

Job Queues:

  •          A holding area where submitted jobs reside until they are scheduled to run.
  •         twtech can configure queues with different priorities.

Compute Environments

  •          The underlying infrastructure (Amazon EC2, AWS Fargate, Amazon EKS) where jobs are executed.
  •         AWS Batch manages the provisioning and scaling of these resources.

Scheduler:

  •          Continuously monitors the job queues and dispatches jobs to optimal compute resources within the linked compute environments.

Sample Walkthrough for "Hello World" on AWS Fargate

NB:

  •        This simple example, adapted from the official documentation, uses the AWS Management Console to run a basic "Hello World" job on AWS Fargate.

Create a Compute Environment:

  •        Navigate to the AWS Batch console and select Compute environments.
  •        Choose Create and select AWS Fargate as the configuration type.
  •        Name the environment (e.g., first-fargate-ce) and leave the default settings for a quick start, allowing AWS to create the necessary roles automatically.

Create a Job Queue:

  •        Go to Job queues and select Create.
  •        Name the queue (e.g., first-fargate-queue) and link the compute environment you just created. Set a priority (e.g., 900).

 Create a Job Definition:

  •        Go to Job definitions and click Create.
  •        Select Single-node for the job type.
  •        Name the definition (e.g., first-fargate-job-def).
  •        In the Container configuration section, use the default busybox image and, in the Command field, enter echo Hello world from twtech Batch Team as an override.
  •        Ensure an execution role is created/selected (AWS can create one automatically with default permissions).
  •        Configure other optional settings like memory and vCPU requirements as needed.

 Submit the Job:

  •        Go to Jobs and select Submit new job.
  •        Name the job (e.g., twtech-hello-world-job), then select the job definition and job queue you created.
  •        Click Submit.

View the Output:

  •        In the Jobs table, monitor the status. Once the status is SUCCEEDED, select the job name.
  •        In the job details pane, choose the Log stream name link.
  •        This opens Amazon CloudWatch Logs, where twtech should see the "Hello world from twtech Batch Team" message.
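The same walkthrough can be scripted end to end. A minimal CLI sketch, reusing the names from this example (the execution role name is an assumption; AWS can create an equivalent role automatically):

# bash

# Register a minimal Fargate job definition (busybox echo), then submit it.
aws batch register-job-definition \
  --job-definition-name first-fargate-job-def \
  --type container \
  --platform-capabilities FARGATE \
  --container-properties '{
    "image": "busybox",
    "command": ["echo", "Hello world from twtech Batch Team"],
    "resourceRequirements": [
      {"type": "VCPU", "value": "0.25"},
      {"type": "MEMORY", "value": "512"}
    ],
    "executionRoleArn": "arn:aws:iam::accountID:role/ecsTaskExecutionRole",
    "networkConfiguration": {"assignPublicIp": "ENABLED"}
  }'

# Submit the job to the queue created earlier.
aws batch submit-job \
  --job-name twtech-hello-world-job \
  --job-queue first-fargate-queue \
  --job-definition first-fargate-job-def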

Common Use Cases and Deep Dive

  • AWS Batch is suitable for various compute-intensive workloads: 

High Performance Computing (HPC):

  •          Running scientific simulations (e.g., genomics, fluid dynamics) using multi-node parallel jobs.

Machine Learning:

  •          Training models, hyperparameter tuning, and large-scale data analysis.

Media Processing:

  •        Video transcoding, image processing, and animation rendering. 

NB:

For advanced use cases and detailed examples, refer to the official documentation.

Link to Official documentation:

https://docs.aws.amazon.com/batch/

1. The concept: AWS Batch (Beyond the Marketing)

AWS Batch is a managed batch job scheduler that:

  •         Provisions compute automatically (EC2, Spot, or Fargate)
  •         Schedules containerized batch jobs
  •         Optimizes placement, scaling, retries, and queueing
  •         Integrates tightly with ECS, IAM, CloudWatch, and S3

NB:

  •  Think of AWS Batch as "ECS + Auto Scaling + Job Scheduler + Retry Logic", purpose-built for non-interactive workloads.

Typical use cases:

  •         Data processing / ETL
  •         Media rendering
  •         Financial risk modeling
  •         ML model training or inference
  •         Scientific simulations
  •         Large-scale DevOps automation jobs

2. Core AWS Batch Architecture 

Key Components

Component               Purpose
Job Definition          How the job runs (image, vCPU, memory, retries)
Job Queue               Where jobs wait, with priority
Compute Environment     Where jobs run (EC2, Spot, Fargate)
ECS (under the hood)    Actually runs containers

3. Job Definitions – The “Execution Contract”

  • A Job Definition is similar to a Kubernetes Pod spec.

# Sample Batch Job Definition

# json
{
  "jobDefinitionName": "twtechimage-processing-job",
  "type": "container",
  "containerProperties": {
    "image": "accoutID.dkr.ecr.us-east-2.amazonaws.com/image-processor:latest",
    "vcpus": 2,
    "memory": 4096,
    "command": ["python", "process.py", "--input", "s3://raw-images", "--output", "s3://processed-images"],
    "jobRoleArn": "arn:aws:iam::accoutID:role/BatchJobRole"
  },
  "retryStrategy": {
    "attempts": 3
  }
}

# Key Concepts

  •         Immutable versioning (new revision each change)
  •         IAM Role per job → least privilege access
  •         Retry strategies baked in (no custom retry code needed)
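One way to register the definition above from the CLI, assuming the JSON is saved locally as job-definition.json:

# bash

# Each registration under the same name creates a new, immutable revision.
aws batch register-job-definition --cli-input-json file://job-definition.json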

4. Job Queues – Priority & Scheduling Control

  • twtech can define multiple job queues with priorities.

Sample

Queue                  Priority    Purpose
critical-etl           100         Financial data
standard-processing    50          Daily batch
low-priority           10          Backfills

Scheduler behavior

  •         Higher priority queues are drained first
  •         Lower priority jobs wait even if submitted earlier

NB:

  •  This is powerful for enterprise multi-team environments.
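A minimal sketch of creating the critical-etl queue from the CLI (the compute environment name twtech-spot-ce is an assumption):

# bash

# Create a high-priority queue and map it to an existing compute environment.
aws batch create-job-queue \
  --job-queue-name critical-etl \
  --priority 100 \
  --state ENABLED \
  --compute-environment-order order=1,computeEnvironment=twtech-spot-ce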

5. Compute Environments – Where the Magic Happens

Compute Environments define:

  •         Instance type
  •         On-Demand vs Spot
  •         Scaling limits
  •         Networking

Types

Type               Use Case
EC2 (On-Demand)    Predictable, SLA-critical jobs
EC2 Spot           Cost-optimized, fault-tolerant workloads
Fargate            No instance management
Fargate Spot       Lowest ops overhead + cheap


Sample: Spot-based Compute Environment

# json
{
  "type": "MANAGED",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 256,
    "instanceTypes": ["m5.large", "m5.xlarge"],
    "subnets": ["subnet-ID"],
    "securityGroupIds": ["sg-ID"],
    "instanceRole": "ecsInstanceRole"
  }
}

# Spot Best Practice

  •         Combine Spot with a retryStrategy (see the sketch after this list)
  •         Enable checkpointing (write progress to S3/DynamoDB)
  •         Use multi-instance-type pools
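One way to wire retries for Spot reclaims is evaluateOnExit in the retry strategy. A sketch (the job name twtech-spot-etl is an assumption; queue and definition names reuse earlier examples):

# bash

# Retry up to 5 times, but only when the host was reclaimed (Spot interruption);
# any other failure exits immediately instead of burning retry attempts.
aws batch submit-job \
  --job-name twtech-spot-etl \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job \
  --retry-strategy '{
    "attempts": 5,
    "evaluateOnExit": [
      {"onStatusReason": "Host EC2*", "action": "RETRY"},
      {"onReason": "*", "action": "EXIT"}
    ]
  }'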

6. AWS Batch vs ECS vs EKS (When to Use What)

Feature              Batch          ECS           EKS
Job Scheduling       ✅ Native      ❌ Manual     ❌ Manual
Queueing             ✅ Built-in    ❌            ❌
Retry Logic          ✅ Built-in    ❌            ❌
Spot Optimization    ✅ Built-in    ⚠️ Manual     ⚠️ Manual
Kubernetes API       ❌             ❌            ✅
Best for Batch       ⭐⭐⭐⭐⭐     ⭐⭐          ⭐⭐⭐

 Rule of thumb

  •         Batch → offline, compute-heavy, job-based workloads
  •         ECS/EKS → long-running services or microservices

7. Real-World Example #1 – Large-Scale ETL Pipeline

Scenario

  • Daily ingestion of 10 TB of logs, transformed into analytics-ready Parquet files.

Flow

1.     Logs land in S3

2.     Lambda submits 1 job per partition (see the sketch after this flow)

3.     AWS Batch:

    •    Scales EC2 Spot automatically
    •    Runs 1,000+ containers in parallel

4.     Results written back to S3
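A sketch of step 2 from the CLI (queue, job definition, and the 1,000-partition layout are assumptions; in the real pipeline a Lambda function makes the equivalent SubmitJob API calls):

# bash

# One job per partition, passing the partition id via an environment override.
# Section 9's array jobs are often a cheaper way to fan out like this.
for p in $(seq -f "%04g" 0 999); do
  aws batch submit-job \
    --job-name "twtech-etl-part-$p" \
    --job-queue critical-etl \
    --job-definition twtech-etl-job \
    --container-overrides "{\"environment\":[{\"name\":\"PARTITION\",\"value\":\"$p\"}]}"
done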

Why Batch (benefits) 

  •         Automatic scaling,
  •         Spot savings (70–90%),
  •         Retry failed partitions only.

8. Real-World Example #2 – DevOps Automation at Scale

Scenario

Security team runs:

  •         Terraform drift detection
  •         AMI vulnerability scanning
  •         CIS benchmark checks

Architecture

  •         Batch jobs triggered nightly
  •         Jobs pull configs from Git
  •         Results stored in DynamoDB + S3
  •         Slack notifications on failure
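A sketch of the nightly trigger with EventBridge (rule name, role, and job definition are assumptions; the target ARN is the job queue):

# bash

# Nightly at 02:00 UTC, EventBridge submits the drift-detection job to Batch.
aws events put-rule \
  --name twtech-nightly-drift-check \
  --schedule-expression "cron(0 2 * * ? *)"

aws events put-targets \
  --rule twtech-nightly-drift-check \
  --targets '[{
    "Id": "drift-check",
    "Arn": "arn:aws:batch:us-east-2:accountID:job-queue/standard-processing",
    "RoleArn": "arn:aws:iam::accountID:role/EventBridgeBatchRole",
    "BatchParameters": {
      "JobDefinition": "twtech-drift-detection-job",
      "JobName": "nightly-drift-check"
    }
  }]'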

Benefit

  •         No always-on compute
  •         Easy job isolation
  •         Clean IAM boundaries per job

9. Array Jobs – Massive Parallelism

  • Array jobs let twtech run N identical jobs with different indices.

# Sample

# bash

aws batch submit-job \
  --job-name twtech-image-array \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job \
  --array-properties size=1000

# Inside container:

AWS_BATCH_JOB_ARRAY_INDEX=42
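
A minimal sketch of how a container entrypoint might map that index to its slice of work (the shard-per-prefix layout is an assumption; buckets reuse the job definition from section 3):

# bash

# entrypoint.sh - each of the 1,000 child jobs processes one input shard.
INDEX="${AWS_BATCH_JOB_ARRAY_INDEX:-0}"
aws s3 cp "s3://raw-images/shard-${INDEX}/" /tmp/input/ --recursive
python process.py --input /tmp/input --output "s3://processed-images/shard-${INDEX}/"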

Use cases:

  •         Monte Carlo simulations
  •         Image/video frame processing
  •         Large backfills

10. Dependency Graphs – Workflow Orchestration

  • Batch supports job dependencies.

Sample
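
A two-stage sketch: an aggregation job that starts only after the array job from section 9 succeeds (twtech-aggregate-job is an assumption):

# bash

# Stage 1: fan out across 1,000 shards and capture the job ID.
ARRAY_JOB_ID=$(aws batch submit-job \
  --job-name twtech-image-array \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job \
  --array-properties size=1000 \
  --query jobId --output text)

# Stage 2: aggregate, starting only after every child index succeeds.
aws batch submit-job \
  --job-name twtech-image-aggregate \
  --job-queue standard-processing \
  --job-definition twtech-aggregate-job \
  --depends-on "jobId=$ARRAY_JOB_ID"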

Dependencies can be:

  •         A plain job ID (the child starts only after the parent succeeds)
  •         SEQUENTIAL (array job children run one index at a time)
  •         N_TO_N (index i of a child array job waits on index i of the parent)

NB:

  •  For complex workflows, pair Batch with Step Functions.

11. Observability & Operations

Monitoring

  •         CloudWatch Logs per job
  •         Job state transitions
  •         Failed attempts visibility

Metrics

  •         vCPU usage
  •         Job run time
  •         Queue depth
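
A quick way to chase a single job's logs from the CLI ($JOB_ID is a placeholder; /aws/batch/job is the default log group):

# bash

# Look up the job's CloudWatch log stream, then pull its events.
LOG_STREAM=$(aws batch describe-jobs --jobs "$JOB_ID" \
  --query 'jobs[0].container.logStreamName' --output text)

aws logs get-log-events \
  --log-group-name /aws/batch/job \
  --log-stream-name "$LOG_STREAM"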

Common Failure Patterns

Issue                     Fix
Jobs stuck in RUNNABLE    Increase max vCPUs
Spot interruptions        Increase retries + checkpoint
Slow startup              Pre-pull images

12. Security & IAM (DevSecOps Angle)

Best practices:

  •         One IAM role per job type
  •         No wildcard S3 permissions
  •         Encrypt:
    •    S3 (SSE-KMS)
    •    EBS volumes
  •         Scan container images (ECR + Inspector)
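
A sketch of a per-job policy scoped to exactly the prefixes one job needs (role and bucket names reuse the job definition from section 3):

# bash

# Tightly scoped inline policy: read input, write output, nothing else.
aws iam put-role-policy \
  --role-name BatchJobRole \
  --policy-name image-processor-s3 \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::raw-images/*"},
      {"Effect": "Allow", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::processed-images/*"}
    ]
  }'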

13. Cost Optimization Strategies

Technique                   Savings
Spot instances              70–90%
Array jobs                  Reduced scheduling overhead
Right-sizing vCPU/memory    10–30%
Fargate Spot                No idle cost

14. When NOT to Use AWS Batch

  •         Long-running APIs
  •         Low-latency workloads
  •         Highly interactive tasks
  •         Kubernetes-native ecosystems

15. AWS Batch in One Sentence

  • AWS Batch is the best way to run massive, containerized, fault-tolerant batch workloads on AWS without managing infrastructure or schedulers.

 
