An Overview of AWS Batch
Focus:
- Tailored for DevOps / Cloud / DevSecOps engineers.
- With real-world examples, architecture details, and design trade-offs.
Breakdown:
- Intro
- The concept: AWS Batch (Beyond the Marketing)
- Core AWS Batch Architecture
- Job Definitions – The “Execution Contract”
- Job Queues – Priority & Scheduling Control
- Compute Environments – Where the Magic Happens
- AWS Batch vs ECS vs EKS (When to Use What)
- Real-World Example #1 – Large-Scale ETL Pipeline
- Real-World Example #2 – DevOps Automation at Scale
- Array Jobs – Massive Parallelism
- Dependency Graphs – Workflow Orchestration
- Observability & Operations
- Security & IAM (DevSecOps Angle)
- Cost Optimization Strategies
- When NOT to Use AWS Batch
- AWS Batch in One Sentence
Intro:
- AWS Batch is a fully managed service that enables developers, scientists, and engineers to run large-scale batch computing workloads on the AWS Cloud.
- AWS Batch dynamically provisions the optimal amount of compute resources (e.g., CPU- or memory-optimized instances) and eliminates the need to manage the underlying infrastructure.
Core Components and Workflow
Jobs:
- A unit of work (e.g., a shell script or a Docker container executable) that twtech submits to AWS Batch.
- Jobs are specified by a job definition.
Job Definitions:
- A blueprint for twtech jobs, specifying runtime parameters, container images, instance types, IAM roles, and environment variables.
Job Queues:
- A holding area where submitted jobs reside until they are scheduled to run.
- twtech can configure queues with different priorities.
Compute Environments:
- The underlying infrastructure (Amazon EC2, AWS Fargate, Amazon EKS) where jobs are executed.
- AWS Batch manages the provisioning and scaling of these resources.
Scheduler:
- Continuously monitors the job queues and dispatches jobs to optimal compute resources within the linked compute environments (see the CLI sketch below).
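Putting these components together end to end, a minimal CLI sketch (names, IDs, and the role ARN are placeholders; networking and permissions details are simplified):
# bash
# 1. Compute environment: the infrastructure jobs run on (Fargate here, so no instances to manage)
aws batch create-compute-environment \
  --compute-environment-name my-fargate-ce \
  --type MANAGED \
  --compute-resources '{"type":"FARGATE","maxvCpus":16,"subnets":["subnet-ID"],"securityGroupIds":["sg-ID"]}'

# 2. Job queue: where submitted jobs wait until scheduled onto the compute environment
aws batch create-job-queue \
  --job-queue-name my-queue \
  --priority 100 \
  --compute-environment-order order=1,computeEnvironment=my-fargate-ce

# 3. Job definition: the reusable blueprint (image, command, resources, roles)
aws batch register-job-definition \
  --job-definition-name my-job-def \
  --type container \
  --platform-capabilities FARGATE \
  --container-properties '{"image":"busybox","command":["echo","hello"],"resourceRequirements":[{"type":"VCPU","value":"0.25"},{"type":"MEMORY","value":"512"}],"executionRoleArn":"arn:aws:iam::123456789012:role/ecsTaskExecutionRole"}'

# 4. Job: one unit of work submitted against the definition and queue
aws batch submit-job \
  --job-name my-first-job \
  --job-queue my-queue \
  --job-definition my-job-def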
Sample Walkthrough for "Hello World" on AWS Fargate
NB:
- This simple example is adapted from the official documentation and uses the AWS Management Console to run a basic "Hello World" job on AWS Fargate.
Create a Compute Environment:
- Navigate to the AWS Batch console and select Compute environments.
- Choose Create and select AWS Fargate as the configuration type.
- Name the environment (e.g., first-fargate-ce) and leave the default settings for a quick start, allowing AWS to create the necessary roles automatically.
Create a Job Queue:
- Go to Job queues and select Create.
- Name the queue (e.g., first-fargate-queue) and link the compute environment you just created. Set a priority (e.g., 900).
Create a Job Definition:
- Go to Job definitions and click Create.
- Select Single-node for the job type.
- Name the definition (e.g., first-fargate-job-def).
- In the Container configuration section, use the default busybox image and, in the Command field, enter "echo Hello world from twtech Batch Team" as an override.
- Ensure an execution role is created/selected (AWS can create one automatically with default permissions).
- Configure other optional settings like memory and vCPU requirements as needed.
Submit the Job:
- Go to Jobs and select Submit new job.
- Name the job (e.g., twtech-hello-world-job), then select the job definition and job queue you created.
- Click Submit.
View the Output:
- In the Jobs table, monitor the status. Once the status is SUCCEEDED, select the job name.
- In the job details pane, choose the Log stream name link.
- This opens Amazon CloudWatch Logs, where twtech should see the "Hello world from twtech Batch Team" message.
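The same output can also be pulled from the command line; a quick sketch (the job ID and log stream name are placeholders):
# bash
# Look up the job's log stream name from its details
aws batch describe-jobs --jobs <job-id> \
  --query 'jobs[0].container.logStreamName'

# Batch container logs land in the /aws/batch/job log group by default
aws logs get-log-events \
  --log-group-name /aws/batch/job \
  --log-stream-name <log-stream-name> \
  --query 'events[].message'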
Common Use Cases and Deep Dive
- AWS Batch is suitable for various compute-intensive workloads:
High Performance Computing (HPC):
- Running scientific simulations (e.g., genomics, fluid dynamics) using multi-node parallel jobs.
Machine Learning:
- Training models, hyperparameter tuning, and large-scale data analysis.
Media Processing:
- Video transcoding, image processing, and animation rendering.
NB:
- For advanced use cases and detailed examples, refer to the official documentation: https://docs.aws.amazon.com/batch/
1. The concept: AWS Batch (Beyond the Marketing)
AWS Batch is a managed batch job scheduler that:
- Provisions compute automatically (EC2, Spot, or Fargate)
- Schedules containerized batch jobs
- Optimizes placement, scaling, retries, and queueing
- Integrates tightly with ECS, IAM, CloudWatch, and S3
NB:
- Think of AWS Batch as “ECS + Auto Scaling + Job Scheduler + Retry Logic”, purpose-built for non-interactive workloads.
Typical use cases:
- Data processing / ETL
- Media rendering
- Financial risk modeling
- ML model training or inference
- Scientific simulations
- Large-scale DevOps automation jobs
2. Core AWS Batch Architecture
Key Components
| Component | Purpose |
| --- | --- |
| Job Definition | How the job runs (image, vCPU, memory, retries) |
| Job Queue | Where jobs wait, with priority |
| Compute Environment | Where jobs run (EC2, Spot, Fargate) |
| ECS (under the hood) | Actually runs containers |
3. Job Definitions – The “Execution Contract”
- A Job Definition is similar to a Kubernetes Pod spec.
# Sample Batch Job Definition
# json{ "jobDefinitionName": "twtechimage-processing-job", "type": "container", "containerProperties": { "image": "accoutID.dkr.ecr.us-east-2.amazonaws.com/image-processor:latest", "vcpus": 2, "memory": 4096, "command": ["python", "process.py", "--input", "s3://raw-images", "--output", "s3://processed-images"], "jobRoleArn": "arn:aws:iam::accoutID:role/BatchJobRole" }, "retryStrategy": { "attempts": 3 }}# Key
Concepts
- Immutable versioning (new revision each change)
- IAM Role per job → least privilege access
- Retry strategies baked in (no custom retry code needed)
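A sketch of registering the definition above from a file; each re-registration with the same name returns a new, immutable revision (the file name job-def.json and the revision number are illustrative):
# bash
# Register (or re-register) the job definition; the response includes an incremented revision
aws batch register-job-definition --cli-input-json file://job-def.json

# Pin a specific revision explicitly when submitting
aws batch submit-job \
  --job-name image-processing-run \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job:3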
4. Job Queues – Priority & Scheduling Control
- twtech can define multiple job queues with priorities.
Sample
| Queue | Priority | Purpose |
| --- | --- | --- |
|  | 100 | Financial data |
|  | 50 | Daily batch |
|  | 10 | Backfills |
Scheduler behavior
- Higher priority queues are drained first
- Lower priority jobs wait even if submitted earlier
NB:
- This is powerful for enterprise multi-team environments.
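A sketch of creating two queues with different priorities against the same compute environment (queue and environment names are placeholders):
# bash
# High-priority queue for time-sensitive financial jobs
aws batch create-job-queue \
  --job-queue-name critical-financial \
  --priority 100 \
  --compute-environment-order order=1,computeEnvironment=spot-ce

# Lower-priority queue for daily batch work on the same environment
aws batch create-job-queue \
  --job-queue-name daily-batch \
  --priority 50 \
  --compute-environment-order order=1,computeEnvironment=spot-ce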
5. Compute Environments – Where the Magic Happens
Compute Environments define:
- Instance type
- On-Demand vs Spot
- Scaling limits
- Networking
Types
| Type | Use Case |
| --- | --- |
| EC2 (On-Demand) | Predictable, SLA-critical jobs |
| EC2 Spot | Cost-optimized, fault-tolerant workloads |
| Fargate | No instance management |
| Fargate Spot | Lowest ops overhead + cheap |
Sample: Spot-based Compute Environment
# json{ "type": "MANAGED", "computeResources": { "type": "SPOT", "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED", "minvCpus": 0, "maxvCpus": 256, "instanceTypes": ["m5.large", "m5.xlarge"], "subnets": ["subnet-ID"], "securityGroupIds": ["sg-ID"], "instanceRole": "ecsInstanceRole" }}# Spot Best
Practice
- Combine with a retryStrategy (interrupted attempts are retried automatically)
- Enable checkpointing (write progress to S3/DynamoDB; see the sketch below)
- Use multi-instance-type pools
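A minimal checkpointing sketch a job's entrypoint might use so a retried attempt resumes instead of restarting from scratch; the bucket name and the process-chunk.sh helper are hypothetical:
# bash
#!/usr/bin/env bash
set -euo pipefail

CHECKPOINT="s3://my-batch-checkpoints/${AWS_BATCH_JOB_ID}/progress.txt"

# Resume from the last completed step if a previous (interrupted) attempt saved one
if aws s3 cp "$CHECKPOINT" ./progress.txt 2>/dev/null; then
  START=$(cat ./progress.txt)
else
  START=0
fi

for step in $(seq "$START" 99); do
  ./process-chunk.sh "$step"              # do one unit of work (hypothetical helper)
  echo "$((step + 1))" > ./progress.txt
  aws s3 cp ./progress.txt "$CHECKPOINT"  # persist progress after every chunk
done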
6. AWS Batch vs ECS vs EKS (When to Use What)
| Feature | Batch | ECS | EKS |
| --- | --- | --- | --- |
| Job Scheduling | ✅ Native | ❌ Manual | ❌ Manual |
| Queueing | ✅ | ❌ | ❌ |
| Retry Logic | ✅ Built-in | ❌ | ❌ |
| Spot Optimization | ✅ | ⚠️ Manual | ⚠️ Manual |
| Kubernetes API | ❌ | ❌ | ✅ |
| Best for Batch | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
Rule of thumb
- Batch → offline, compute-heavy, job-based
- ECS/EKS → long-running services or microservices
7. Real-World Example #1 – Large-Scale ETL Pipeline
Scenario
- Daily ingestion of 10TB logs → transform → analytics-ready parquet files.
Flow
1. Logs land in S3
2. Lambda submits 1 job per partition (see the submit sketch below)
3. AWS Batch:
   - Scales EC2 Spot automatically
   - Runs 1,000+ containers in parallel
4. Results written back to S3
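A sketch of the per-partition submit call the Lambda would make, shown here as the equivalent CLI; bucket names, the queue, the job definition, and the partition layout are placeholders:
# bash
# One Batch job per S3 partition; each container transforms only its own prefix
for partition in dt=2024-01-01 dt=2024-01-02 dt=2024-01-03; do
  aws batch submit-job \
    --job-name "etl-${partition//=/-}" \
    --job-queue etl-spot-queue \
    --job-definition etl-transform-job \
    --container-overrides '{"command":["python","transform.py","--input","s3://raw-logs/'"$partition"'","--output","s3://analytics-parquet/'"$partition"'"]}'
done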
Why Batch (benefits)
- Automatic scaling
- Spot savings (70–90%)
- Retry failed partitions only
8. Real-World Example #2 – DevOps Automation at Scale
Scenario
Security team runs:
- Terraform drift detection
- AMI vulnerability scanning
- CIS benchmark checks
Architecture
- Batch jobs triggered nightly (see the trigger sketch below)
- Jobs pull configs from Git
- Results stored in DynamoDB + S3
- Slack notifications on failure
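One way to wire the nightly trigger is an EventBridge schedule that targets the Batch job queue; a sketch with placeholder names, ARNs, and account ID (the role must allow EventBridge to submit Batch jobs):
# bash
# Nightly at 02:00 UTC
aws events put-rule \
  --name nightly-security-scans \
  --schedule-expression "cron(0 2 * * ? *)"

# Target the Batch job queue; EventBridge submits the job definition on each firing
aws events put-targets \
  --rule nightly-security-scans \
  --targets '[{
    "Id": "terraform-drift",
    "Arn": "arn:aws:batch:us-east-2:123456789012:job-queue/security-scan-queue",
    "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeBatchSubmitRole",
    "BatchParameters": {"JobDefinition": "terraform-drift-detect", "JobName": "nightly-terraform-drift"}
  }]'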
Benefits
- No always-on compute
- Easy job isolation
- Clean IAM boundaries per job
9. Array Jobs – Massive Parallelism
- Array jobs let twtech run N identical jobs with different indices.
# Sample
# bash
aws batch submit-job \
  --job-name twtech-image-array \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job \
  --array-properties size=1000
# Inside the container, each child job sees its own index:
AWS_BATCH_JOB_ARRAY_INDEX=42
Use cases:
- Monte Carlo simulations
- Image/video frame processing
- Large backfills
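A sketch of how a container entrypoint might map its index to one unit of work; the manifest file and process.py arguments are hypothetical:
# bash
#!/usr/bin/env bash
set -euo pipefail

# Each of the 1,000 child jobs receives a unique index from 0 to 999
INDEX="${AWS_BATCH_JOB_ARRAY_INDEX}"

# Pick this child's shard, e.g., line N+1 of a manifest listing 1,000 input prefixes
INPUT=$(sed -n "$((INDEX + 1))p" /app/manifest.txt)

echo "Child ${INDEX} processing ${INPUT}"
python /app/process.py --input "$INPUT" --output "s3://processed-images/${INDEX}/"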
10. Dependency Graphs – Workflow Orchestration
- Batch supports job dependencies via the dependsOn parameter at submit time.
Dependency types:
- Plain job-ID dependencies (the job starts only after the listed job succeeds)
- SEQUENTIAL (each child of an array job waits for the previous index)
- N_TO_N (index N of one array job waits for index N of another)
Sample
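A minimal sketch of both styles; job names reuse the queue and definition from the examples above, and the job IDs are placeholders:
# bash
# Job B starts only after job A succeeds
aws batch submit-job \
  --job-name transform-step \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job \
  --depends-on jobId=<job-id-of-A>

# Index-to-index dependency between two array jobs of the same size
aws batch submit-job \
  --job-name postprocess-array \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job \
  --array-properties size=1000 \
  --depends-on jobId=<upstream-array-job-id>,type=N_TO_N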
NB:
- For complex workflows, pair Batch with Step Functions.
11. Observability & Operations
Monitoring
- CloudWatch Logs per job
- Job state transitions
- Failed attempts visibility
Metrics
- vCPU usage
- Job run time
- Queue depth
Common Failure Patterns
| Issue | Fix |
| --- | --- |
| Jobs stuck in RUNNABLE | Increase max vCPUs |
| Spot interruptions | Increase retries + checkpoint |
| Slow startup | Pre-pull images |
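A quick way to spot the first pattern from the CLI; a sketch with a placeholder queue name and job ID:
# bash
# Count jobs waiting for capacity; a persistently long list usually means
# the compute environment's maxvCpus (or Spot capacity) is the bottleneck
aws batch list-jobs \
  --job-queue etl-spot-queue \
  --job-status RUNNABLE \
  --query 'jobSummaryList[].jobId'

# Inspect a specific job's status reason and attempt history
aws batch describe-jobs --jobs <job-id> \
  --query 'jobs[0].{status:status,reason:statusReason,attempts:attempts}'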
12. Security & IAM (DevSecOps Angle)
Best practices:
- One IAM role per job type
- No wildcard S3 permissions (see the sketch below)
- Encrypt:
  - S3 (SSE-KMS)
  - EBS volumes
- Scan container images (ECR + Inspector)
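A sketch of a tightly scoped inline policy attached to the job role from the earlier job definition; the role, policy, and bucket names are illustrative:
# bash
# Grant the image-processing job role access to exactly the prefixes it needs
aws iam put-role-policy \
  --role-name BatchJobRole \
  --policy-name image-processing-s3-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      { "Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::raw-images/*" },
      { "Effect": "Allow", "Action": ["s3:PutObject"], "Resource": "arn:aws:s3:::processed-images/*" }
    ]
  }'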
13. Cost Optimization Strategies
| Technique | Savings |
| --- | --- |
| Spot instances | 70–90% |
| Array jobs | Reduced scheduling overhead |
| Right-sizing vCPU/memory | 10–30% |
| Fargate Spot | No idle cost |
14. When NOT to Use AWS Batch
❌ Long-running APIs
❌ Low-latency workloads
❌ Highly interactive tasks
❌ Kubernetes-native ecosystems
15. AWS Batch in One Sentence
- AWS Batch is the best way to run massive, containerized, fault-tolerant batch workloads on AWS without managing infrastructure or schedulers.