AWS Batch - Overview.
Focus:
- Tailored for DevOps, Cloud, and DevSecOps engineers.
- With real-world examples, architecture details, and design trade-offs.
Scope:
- Intro,
- Core Components and Workflow,
- The concept of AWS Batch (Beyond the Marketing) Deep Dive,
- Core AWS Batch Architecture,
- Job Definitions – The “Execution Contract”,
- Sample Batch Job Definition,
- Job Queues – Priority & Scheduling Control,
- Compute Environments – Where the Magic Happens,
- AWS Batch vs ECS vs EKS (When to Use What),
- Real-World Example #1 – Large-Scale ETL Pipeline
- Real-World Example #2 – DevOps Automation at Scale,
- Array Jobs – Massive Parallelism,
- Dependency Graphs – Workflow Orchestration,
- Observability & Operations,
- Security & IAM (DevSecOps Angle),
- Cost Optimization Strategies,
- When NOT to Use AWS Batch,
- AWS Batch in One Sentence.
Intro:
- AWS Batch is a fully managed service that enables developers, scientists, and engineers to run large-scale batch computing workloads on the AWS Cloud.
- AWS Batch dynamically provisions the optimal amount of compute resources (e.g., CPU- or memory-optimized instances), eliminating the need to manage the underlying infrastructure.
Core Components and Workflow
Jobs:
- A unit of work (e.g., a shell script or a Docker container executable) that twtech submits to AWS Batch.
- Jobs are specified by a job definition.
Job Definitions:
- A blueprint for twtech jobs, specifying runtime parameters, container images, instance types, IAM roles, and environment variables.
Job Queues:
- A holding area where submitted jobs reside until they are scheduled to run.
- twtech can configure queues with different priorities.
Compute Environments:
- The underlying infrastructure (Amazon EC2, AWS Fargate, Amazon EKS) where jobs are executed.
- AWS Batch manages the provisioning and scaling of these resources.
Scheduler:
- Continuously monitors the job queues and dispatches jobs to optimal compute resources within the linked compute environments.
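The components above all meet at job submission: a job references a job definition and lands in a queue, and the scheduler does the rest. A minimal boto3-style sketch of that hand-off (queue and job-definition names here are illustrative placeholders, not resources defined in this guide):

```python
# Sketch: how Jobs, Job Definitions, and Job Queues meet at submission time.
# Queue and job-definition names are illustrative placeholders.

def build_submit_request(job_name, queue, job_definition, command=None):
    """Assemble the request the scheduler acts on: which queue the job
    waits in, which job definition to run, plus optional overrides."""
    request = {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
    }
    if command:  # per-submission override of the container command
        request["containerOverrides"] = {"command": command}
    return request

request = build_submit_request(
    "hello-world", "standard-queue", "hello-job-def",
    command=["echo", "Hello", "world"],
)
# With AWS credentials configured, submission is a single call:
#   import boto3
#   boto3.client("batch").submit_job(**request)
```

From there the scheduler matches the queued job to capacity in a linked compute environment.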
Sample Walkthrough for "Hello World" on AWS Fargate
NB:
- This simple example, adapted from the official documentation, uses the AWS Management Console to run a basic "Hello World" job on AWS Fargate.
Create a Compute Environment:
- Navigate to the AWS Batch console and select Compute environments.
- Choose Create and select AWS Fargate as the configuration type.
- Name the environment (e.g., first-fargate-ce) and leave the default settings for a quick start, allowing AWS to create the necessary roles automatically.
Create a Job Queue:
- Go to Job queues and select Create.
- Name the queue (e.g., first-fargate-queue) and link the compute environment you just created. Set a priority (e.g., 900).
Create a Job Definition:
- Go to Job definitions and click Create.
- Select Single-node for the job type.
- Name the definition (e.g., first-fargate-job-def).
- In the Container configuration section, use the default busybox image and, in the Command field, enter echo Hello world from twtech Batch Team as an override.
- Ensure an execution role is created/selected (AWS can create one automatically with default permissions).
- Configure other optional settings like memory and vCPU requirements as needed.
Submit the Job:
- Go to Jobs and select Submit new job.
- Name the job (e.g., twtech Hello-world-job), select the job definition and job queue you created.
- Click Submit.
View the Output:
- In the Jobs table, monitor the status. Once the status is SUCCEEDED, select the job name.
- In the job details pane, choose the Log stream name link.
- This opens Amazon CloudWatch Logs, where twtech should see the "Hello world from twtech Batch Team" message.
Common Use Cases and Deep Dive
- AWS Batch is suitable for various compute-intensive workloads:
High Performance Computing (HPC):
- Running scientific simulations (e.g., genomics, fluid dynamics) using multi-node parallel jobs.
Machine Learning:
- Training models, hyperparameter tuning, and large-scale data analysis.
Media Processing:
- Video transcoding, image processing, and animation rendering.
NB:
- For advanced use cases and detailed examples, refer to the official documentation
Link to Official documentation:
1. The concept of AWS Batch (Beyond the Marketing) Deep Dive,
AWS Batch is a managed batch job scheduler that:
- Provisions compute automatically (EC2, Spot, or Fargate)
- Schedules containerized batch jobs
- Optimizes placement, scaling, retries, and queueing
- Integrates tightly with ECS, IAM, CloudWatch, and S3
NB:
- Think of AWS Batch as “ECS + Auto Scaling + Job Scheduler + Retry Logic”, purpose-built for non-interactive workloads.
Typical use cases:
- Data processing / ETL
- Media rendering
- Financial risk modeling
- ML model training or inference
- Scientific simulations
- Large-scale DevOps automation jobs
2. Core AWS Batch Architecture
Key Components

| Component | Purpose |
|---|---|
| Job Definition | How the job runs (image, vCPU, memory, retries) |
| Job Queue | Where jobs wait, with priority |
| Compute Environment | Where jobs run (EC2, Spot, Fargate) |
| ECS (under the hood) | Actually runs containers |
3. Job Definitions – The “Execution Contract”
- A Job Definition is similar to a Kubernetes Pod spec.
# Sample Batch Job Definition
# json
{
  "jobDefinitionName": "twtech-image-processing-job",
  "type": "container",
  "containerProperties": {
    "image": "accountID.dkr.ecr.us-east-2.amazonaws.com/image-processor:latest",
    "vcpus": 2,
    "memory": 4096,
    "command": ["python", "process.py", "--input", "s3://raw-images", "--output", "s3://processed-images"],
    "jobRoleArn": "arn:aws:iam::accountID:role/BatchJobRole"
  },
  "retryStrategy": { "attempts": 3 }
}
# Explanation
- Immutable versioning (new revision each change)
- IAM Role per job → least privilege access
- Retry strategies baked in (no custom retry code needed)
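Registering a definition like the sample above can also be done programmatically. A hedged boto3 sketch (the `<accountID>` placeholders stand in for a real account; re-registering the same name is what produces a new immutable revision):

```python
# Sketch: registering the sample job definition with boto3. <accountID> is a
# placeholder. Each call under the same name creates a new revision, which is
# how Batch keeps definitions immutable.
job_definition = {
    "jobDefinitionName": "twtech-image-processing-job",
    "type": "container",
    "containerProperties": {
        "image": "<accountID>.dkr.ecr.us-east-2.amazonaws.com/image-processor:latest",
        "vcpus": 2,
        "memory": 4096,
        "command": ["python", "process.py",
                    "--input", "s3://raw-images",
                    "--output", "s3://processed-images"],
        "jobRoleArn": "arn:aws:iam::<accountID>:role/BatchJobRole",
    },
    "retryStrategy": {"attempts": 3},  # retries handled by Batch, not app code
}
# With credentials configured:
#   import boto3
#   boto3.client("batch").register_job_definition(**job_definition)
```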
4. Job Queues – Priority & Scheduling Control
- twtech can define multiple job queues with priorities.
Sample

| Queue | Priority | Purpose |
|---|---|---|
| (e.g., critical) | 100 | Financial data |
| (e.g., standard) | 50 | Daily batch |
| (e.g., backfill) | 10 | Backfills |
Scheduler behavior
- Higher-priority queues are drained first
- Lower-priority jobs wait even if submitted earlier
NB:
- This is powerful for enterprise multi-team environments.
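A tiered setup like this can be built programmatically too; a minimal sketch, assuming illustrative queue names and a placeholder compute-environment ARN:

```python
# Sketch: three priority queues draining into one shared compute environment.
# Queue names and the compute-environment ARN are illustrative placeholders.

def build_queue(name, priority, compute_env_arn):
    return {
        "jobQueueName": name,
        "state": "ENABLED",
        "priority": priority,  # higher number = drained first
        "computeEnvironmentOrder": [
            {"order": 1, "computeEnvironment": compute_env_arn},
        ],
    }

ce_arn = "arn:aws:batch:us-east-2:<accountID>:compute-environment/shared-ce"
queues = [
    build_queue("critical-queue", 100, ce_arn),  # financial data
    build_queue("standard-queue", 50, ce_arn),   # daily batch
    build_queue("backfill-queue", 10, ce_arn),   # backfills
]
# With credentials: boto3.client("batch").create_job_queue(**q) for q in queues
```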
5. Compute Environments – Where the Magic Happens
Compute Environments define:
- Instance type
- On-Demand vs Spot
- Scaling limits
- Networking
Types

| Type | Description |
|---|---|
| EC2 (On-Demand) | Predictable capacity and pricing |
| EC2 Spot | Deep discounts; capacity can be reclaimed |
| Fargate | Serverless containers; no instances to manage |
| Fargate Spot | Serverless with Spot pricing |
Sample: Spot-based Compute Environment
# json
{
  "type": "MANAGED",
  "computeResources": {
    "type": "SPOT",
    "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
    "minvCpus": 0,
    "maxvCpus": 256,
    "instanceTypes": ["m5.large", "m5.xlarge"],
    "subnets": ["twtech-subnet-ID"],
    "securityGroupIds": ["twtech-sg-ID"],
    "instanceRole": "ecsInstanceRole"
  }
}
# Spot Best Practices
- Combine with a retryStrategy
- Enable checkpointing (write progress to S3/DynamoDB)
- Use multi-instance-type pools
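Checkpointing can be as small as persisting the set of finished work items after each unit, so a retried job resumes instead of restarting. A minimal sketch of that resume logic, with the S3/DynamoDB reads and writes left as injected functions (an assumption to be filled in with real boto3 calls):

```python
# Sketch: resume-from-checkpoint logic for Spot-friendly jobs. In practice
# load/save would hit S3 or DynamoDB; they are injected here so the resume
# behaviour itself is plain, testable Python.

def run_with_checkpoints(items, load_checkpoint, save_checkpoint, process):
    """Process items, persisting progress after each one; on retry, already
    finished items are skipped, so a Spot reclaim loses at most one unit."""
    done = set(load_checkpoint())
    for item in items:
        if item in done:
            continue  # finished before the interruption
        process(item)
        done.add(item)
        save_checkpoint(done)
    return done

# In-memory stand-in for the checkpoint store:
store = {"done": {"part-0"}}  # a previous attempt finished part-0
processed = []
run_with_checkpoints(
    ["part-0", "part-1", "part-2"],
    load_checkpoint=lambda: store["done"],
    save_checkpoint=lambda d: store.update(done=set(d)),
    process=processed.append,
)
# processed is ["part-1", "part-2"]: part-0 was skipped on the retry
```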
6. AWS Batch vs ECS vs EKS (When to Use What)
| Feature | Batch | ECS | EKS |
|---|---|---|---|
| Job Scheduling | ✅ Native | ❌ Manual | ❌ Manual |
| Queueing | ✅ | ❌ | ❌ |
| Retry Logic | ✅ Built-in | ❌ | ❌ |
| Spot Optimization | ✅ | ⚠️ Manual | ⚠️ Manual |
| Kubernetes API | ❌ | ❌ | ✅ |
| Best for Batch | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
Rule of thumb
- Batch → offline, compute-heavy, job-based
- ECS/EKS → long-running services or microservices
7. Real-World Example #1 – Large-Scale ETL Pipeline
Scenario
- Daily ingestion of 10TB logs → transform → analytics-ready parquet files.
Flow
1. Logs land in S3
2. Lambda submits 1 job per partition
3. AWS Batch:
   - Scales EC2 Spot automatically
   - Runs 1,000+ containers in parallel
4. Results written back to S3
Why Batch (benefits)
- Automatic scaling,
- Spot savings (70–90%),
- Retry failed partitions only.
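Step 2 of the flow (Lambda submitting one job per partition) might look like this sketch; the queue and definition names, the transform.py entrypoint, and the event shape are all assumptions for illustration:

```python
# Sketch of "Lambda submits 1 job per partition". Queue and definition names,
# the transform.py entrypoint, and the event shape are illustrative.
import re

def partition_jobs(partitions, queue, job_definition):
    """One submit_job request per S3 partition; Batch fans them out in
    parallel and retries failed partitions independently."""
    return [
        {
            # job names only allow letters, digits, '-' and '_'
            "jobName": "etl-" + re.sub(r"[^A-Za-z0-9_-]", "-", p),
            "jobQueue": queue,
            "jobDefinition": job_definition,
            "containerOverrides": {
                "command": ["python", "transform.py", "--partition", p],
            },
        }
        for p in partitions
    ]

def handler(event, context):
    requests = partition_jobs(event["partitions"], "etl-queue", "etl-job-def")
    # With credentials configured inside the Lambda:
    #   batch = boto3.client("batch")
    #   for req in requests:
    #       batch.submit_job(**req)
    return {"submitted": len(requests)}
```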
8. Real-World Example #2 – DevOps Automation at Scale
Scenario
Security team runs:
- Terraform drift detection
- AMI vulnerability scanning
- CIS benchmark checks
Architecture
- Batch jobs triggered nightly
- Jobs pull configs from Git
- Results stored in DynamoDB + S3
- Slack notifications on failure
Benefits
- No always-on compute
- Easy job isolation
- Clean IAM boundaries per job
9. Array Jobs – Massive Parallelism
- Array jobs let twtech run N identical jobs, each with a different index.
# Sample
# bash
aws batch submit-job \
  --job-name twtech-image-array \
  --job-queue standard-processing \
  --job-definition twtech-image-processing-job \
  --array-properties size=1000
# Inside container:
AWS_BATCH_JOB_ARRAY_INDEX=42
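Each of the N children can turn that index into a disjoint slice of the work; a small sketch (the environment variable is the one Batch actually sets, while the frame list is an illustrative stand-in for real inputs):

```python
# Sketch: each array child reads AWS_BATCH_JOB_ARRAY_INDEX and claims a
# disjoint, contiguous shard of the work.
import os

def my_shard(items, array_size, index):
    """Split items into array_size contiguous shards; return shard #index."""
    per_shard = -(-len(items) // array_size)  # ceiling division
    return items[index * per_shard:(index + 1) * per_shard]

index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
frames = [f"frame-{i:05d}" for i in range(10_000)]  # illustrative inputs
for frame in my_shard(frames, array_size=1000, index=index):
    pass  # process(frame), e.g., transcode or analyze
```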
Use cases:
- Monte Carlo simulations
- Image/video frame processing
- Large backfills
10. Dependency Graphs – Workflow Orchestration
- Batch supports job dependencies.
Dependencies can be:
- SEQUENTIAL (array jobs: index N starts after index N−1 succeeds)
- N_TO_N (array jobs: index N waits on index N of the job it depends on)
- Plain dependsOn (a job starts only after its dependency reaches SUCCEEDED)
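Submitting a two-step chain can be sketched as follows (queue and definition names are illustrative; the parent job ID would come back from the first submit_job call at run time):

```python
# Sketch: "transform starts only after extract SUCCEEDED" via dependsOn.
# Queue/definition names are illustrative placeholders.

def dependent_request(name, queue, job_definition, parent_job_id):
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
        # plain dependency: start only after the parent reaches SUCCEEDED
        "dependsOn": [{"jobId": parent_job_id}],
    }

# With credentials configured:
#   batch = boto3.client("batch")
#   extract = batch.submit_job(jobName="extract", jobQueue="etl-queue",
#                              jobDefinition="etl-job-def")
#   batch.submit_job(**dependent_request("transform", "etl-queue",
#                                        "etl-job-def", extract["jobId"]))
request = dependent_request("transform", "etl-queue", "etl-job-def", "job-1234")
```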
NB:
- For complex workflows, pair Batch with Step Functions.
11. Observability & Operations
Monitoring
- CloudWatch Logs per job
- Job state transitions
- Failed-attempt visibility
Metrics
- vCPU usage
- Job run time
- Queue depth
Common Failure Patterns

| Symptom | Likely Cause |
|---|---|
| Jobs stuck in RUNNABLE | No capacity in the compute environment, or subnet/IAM misconfiguration |
| Exit code 137 | Container killed after exceeding its memory limit |
| Repeated retries | Spot reclamation without checkpointing |
| CannotPullContainerError | Wrong image URI or missing ECR permissions |
12. Security & IAM (DevSecOps Angle)
Best practices:
- One IAM role per job type
- No wildcard S3 permissions
- Encrypt:
- S3 (SSE-KMS)
- EBS volumes
- Scan container images (ECR + Inspector)
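The "no wildcard S3 permissions" rule, sketched as a per-job-type policy document (bucket names reuse the earlier ETL example and are illustrative):

```python
# Sketch of "one role per job type, no wildcard S3": a policy scoped to
# exactly the objects this job reads and writes (bucket names illustrative).

def job_s3_policy(read_bucket, write_bucket):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # read side: only GetObject, only the input bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{read_bucket}/*"],
            },
            {   # write side: only PutObject, only the output bucket
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{write_bucket}/*"],
            },
        ],
    }

policy = job_s3_policy("raw-images", "processed-images")
# Attach via the jobRoleArn in the job definition; note there is no "s3:*"
# action and no account-wide "Resource": "*" anywhere in the document.
```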
13. Cost Optimization Strategies
| Strategy | Effect |
|---|---|
| Use Spot compute environments | 70–90% compute savings |
| Set minvCpus to 0 | No idle cost between runs |
| Allow multiple instance types | Better Spot availability and pricing |
| Retry only failed partitions (array jobs + checkpoints) | Avoids re-running entire pipelines |
14. When NOT to Use AWS Batch
❌ Long-running APIs
❌ Low-latency workloads
❌ Highly interactive tasks
❌ Kubernetes-native ecosystems
15. AWS Batch in One Sentence
- AWS Batch is the best way to run massive, containerized, fault-tolerant batch workloads on AWS without managing infrastructure or schedulers.