AWS Glue Job Bookmarks, DataBrew, Studio & Streaming ETL - Overview.
Scope:
- Intro,
- AWS Glue – Key Features Overview,
- Quick Comparison,
- Features that together make Glue versatile,
- Flow diagram,
- How AWS Glue Job Bookmarks, DataBrew, Studio, Streaming ETL fit together in the Glue landscape.
Intro:
- AWS offers several data integration and ETL (Extract, Transform, Load) tools with features designed to help twtech prepare and process its data efficiently:
- AWS Glue Job Bookmarks,
- AWS Glue DataBrew,
- AWS Glue Studio,
- AWS Glue Streaming ETL
AWS Glue – Key Features Overview
1. Job Bookmarks
- Purpose: Track job state and prevent reprocessing of the same data.
- How it works:
- When a Glue ETL job runs, it stores checkpoint metadata about processed files/partitions.
- On the next run, Glue uses bookmarks to process only new/unprocessed data.
- Modes:
- Job-bookmark enabled: Incremental processing (new data only).
- Job-bookmark disabled: Full reload (processes everything again).
- Use Case: Efficiently process growing data lakes incrementally (e.g., daily ingestion of S3 logs).
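The incremental-processing idea behind bookmarks can be sketched locally. This is a minimal simulation, not the Glue API: real Glue stores bookmark state internally per job, keyed on file timestamps/paths, and the function and storage names here are illustrative only.

```python
# Minimal local sketch of the job-bookmark idea: remember which input
# files a run has already processed, and only pick up new ones next time.
# Illustrative only -- real Glue tracks bookmark state per job internally.

def run_job(available_files, bookmark, enabled=True):
    """Return the files this run processes; update the bookmark in place."""
    if enabled:
        to_process = sorted(set(available_files) - bookmark)  # new data only
    else:
        to_process = sorted(available_files)                  # full reload
    bookmark.update(to_process)
    return to_process

bookmark = set()
first = run_job(["s3://logs/d1.json", "s3://logs/d2.json"], bookmark)
# A later run sees one new file and skips the two already processed.
second = run_job(["s3://logs/d1.json", "s3://logs/d2.json",
                  "s3://logs/d3.json"], bookmark)
```

With bookmarks disabled (`enabled=False`), the second run would reprocess all three files, which is the "full reload" mode above.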
2. AWS Glue DataBrew
- Visual, no-code data preparation tool.
- Focused on data cleaning, transformation, and profiling without writing Spark/PySpark code.
- Key Features:
- 250+ pre-built transformations (filter, merge, pivot, enrich).
- Data quality profiling (detect anomalies, missing values, patterns).
- Integration with S3, Redshift, RDS, Athena, etc.
- Use Case: Business/data analysts preparing data for ML or BI dashboards.
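To make "data quality profiling" concrete, here is a tiny stdlib-only sketch of the kind of per-column statistics DataBrew computes automatically (row counts, missing values, distinct values). The real service profiles datasets in S3/Redshift/RDS with no code at all; this function and its column names are assumptions for illustration.

```python
# Tiny local sketch of the per-column profiling DataBrew automates.
# Pure stdlib; real DataBrew profiling needs no code and also detects
# patterns, outliers, and value distributions.

def profile(rows):
    """Per-column stats: row count, missing (None/empty) count, distinct values."""
    stats = {}
    for row in rows:
        for col, val in row.items():
            s = stats.setdefault(col, {"rows": 0, "missing": 0, "distinct": set()})
            s["rows"] += 1
            if val in (None, ""):
                s["missing"] += 1          # treat None/empty string as missing
            else:
                s["distinct"].add(val)
    return {c: {"rows": s["rows"], "missing": s["missing"],
                "distinct": len(s["distinct"])} for c, s in stats.items()}

report = profile([
    {"country": "DE", "amount": 10},
    {"country": None, "amount": 10},
    {"country": "FR", "amount": ""},
])
```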
3. AWS Glue Studio
- Low-code/no-code environment for building ETL pipelines.
- Drag-and-drop interface to create Glue jobs (batch ETL, streaming ETL).
- Generates PySpark or Scala code behind the scenes.
- Key Features:
- Visual job authoring.
- Built-in monitoring and job run history.
- Simplifies working with sources (S3, JDBC, Kinesis) and sinks (S3, Redshift, Snowflake).
- Use Case: Data engineers building repeatable ETL workflows without deep Spark knowledge.
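Conceptually, the visual nodes Studio lets you drag together compile down to a source → transform → sink pipeline. The plain-Python sketch below shows that shape only; the code Studio actually generates uses PySpark DynamicFrames, and the node names and fields here are illustrative assumptions.

```python
# Plain-Python sketch of the source -> transform -> sink shape that a
# Glue Studio visual job compiles down to. Illustrative only: real
# generated code uses PySpark DynamicFrames, not Python lists.

def source(records):               # e.g. an S3 or JDBC reader node
    return list(records)

def transform(records):            # e.g. a Filter node plus a field-rename node
    return [{"user": r["user_id"], "spend": r["amount"]}
            for r in records if r["amount"] > 0]

def sink(records, target):         # e.g. an S3/Redshift writer node
    target.extend(records)
    return len(records)

target = []
written = sink(transform(source([
    {"user_id": "u1", "amount": 5},
    {"user_id": "u2", "amount": 0},
])), target)
```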
4. AWS Glue Streaming ETL
- Extends Glue to real-time streaming data processing.
- Based on Apache Spark Structured Streaming under the hood.
- Sources: Kinesis Data Streams, Apache Kafka, Amazon MSK.
- Targets: S3 (Parquet/JSON/CSV), Redshift, OpenSearch, etc.
- Key Features:
- Supports schema inference + Data Catalog integration.
- Checkpointing provides fault tolerance and "exactly-once" semantics.
- Scales automatically with streaming throughput.
- Use Case: Real-time analytics (e.g., clickstream logs, IoT telemetry, fraud detection).
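The checkpointing mechanism behind those "exactly-once" semantics can be sketched in a few lines: each micro-batch is written and its offset committed, so a restart resumes from the last checkpoint instead of re-emitting records. Real Glue streaming jobs checkpoint to S3 via Structured Streaming; this in-memory version and its names are illustrative assumptions.

```python
# Local sketch of micro-batches plus a checkpoint, the mechanism behind
# Structured Streaming's fault tolerance: commit the offset after each
# batch, and a restart resumes there instead of re-emitting records.
# Real Glue streaming jobs checkpoint to S3; this is in-memory only.

def process_stream(events, checkpoint, sink):
    """Consume events from the last checkpointed offset in micro-batches of 2."""
    offset = checkpoint.get("offset", 0)
    while offset < len(events):
        batch = events[offset:offset + 2]       # one micro-batch
        sink.extend(batch)                      # write transformed results
        offset += len(batch)
        checkpoint["offset"] = offset           # commit the new offset
    return offset

events = ["click1", "click2", "click3"]
checkpoint, sink = {}, []
process_stream(events, checkpoint, sink)        # processes all 3 events
process_stream(events, checkpoint, sink)        # restart: nothing re-emitted
```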
5. Quick Comparison

| Feature | Audience | Mode | Best For |
| --- | --- | --- | --- |
| Job Bookmarks | Data engineers | Batch/Stream | Incremental data ingestion |
| DataBrew | Analysts | Batch (interactive) | No-code cleaning & profiling |
| Glue Studio | Data engineers | Batch/Stream | Visual ETL pipeline building |
| Streaming ETL | Data engineers | Real-time | Real-time ingestion & analytics |
✅ Features that together make Glue versatile:
- DataBrew → Prep data visually.
- Studio → Build/manage pipelines.
- Job Bookmarks → Enable incremental processing.
- Streaming ETL → Handle real-time workloads.
Flow diagram:
How AWS Glue Job Bookmarks, DataBrew, Studio, Streaming ETL fit together in the Glue landscape
- Glue Studio → Build/manage ETL pipelines visually.
- DataBrew → No-code cleaning & prep before ingestion.
- Streaming ETL → Real-time transformations.
- Job Bookmarks → Power all Glue jobs by ensuring incremental data processing.
- Data Lake (S3 + Catalog) → Central store for raw/processed data.
- Downstream Services (Athena, Redshift Spectrum, EMR) → Consume the transformed data.