AWS Glue Job Bookmarks, DataBrew, Studio & Streaming ETL - Overview.
Scope:
- Intro,
- AWS Glue – Key Features Overview,
- Quick Comparison,
- Features that together make Glue versatile,
- Flow diagram,
- How AWS Glue Job Bookmarks, DataBrew, Studio, Streaming ETL fit together in the Glue landscape.
Intro:
- AWS offers several data integration and ETL (Extract, Transform, Load) tools with features designed to help twtech prepare and process its data efficiently:
- AWS Glue Job Bookmarks,
- AWS Glue DataBrew,
- AWS Glue Studio,
- AWS Glue Streaming ETL
AWS Glue – Key Features Overview
1. Job Bookmarks
- Purpose: Track job state and prevent reprocessing of the same data.
- How it works:
- When a Glue ETL job runs, it stores checkpoint metadata about processed files/partitions.
- On the next run, Glue uses bookmarks to process only new/unprocessed data.
- Modes:
- Job-bookmark enabled: Incremental processing (new data only).
- Job-bookmark disabled: Full reload (processes everything again).
- Use Case: Efficiently process growing data lakes incrementally (e.g., daily ingestion of S3 logs).
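The incremental-processing idea behind bookmarks can be sketched locally. This is a minimal simulation, not the Glue API: real Glue stores bookmark state internally per job, keyed on file timestamps/paths, and the function and storage names here are illustrative only.

```python
# Minimal local sketch of the job-bookmark idea: remember which input
# files a run has already processed, and only pick up new ones next time.
# Illustrative only -- real Glue tracks bookmark state per job internally.

def run_job(available_files, bookmark, enabled=True):
    """Return the files this run processes; update the bookmark in place."""
    if enabled:
        to_process = sorted(set(available_files) - bookmark)  # new data only
    else:
        to_process = sorted(available_files)                  # full reload
    bookmark.update(to_process)
    return to_process

bookmark = set()
first = run_job(["s3://logs/d1.json", "s3://logs/d2.json"], bookmark)
# A later run sees one new file and skips the two already processed.
second = run_job(["s3://logs/d1.json", "s3://logs/d2.json",
                  "s3://logs/d3.json"], bookmark)
```

With bookmarks disabled (`enabled=False`), the second run would reprocess all three files, which is the "full reload" mode above.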
2. AWS Glue DataBrew
- Visual, no-code data preparation tool.
- Focused on data cleaning, transformation, and profiling without writing Spark/PySpark code.
- Key Features:
- 250+ pre-built transformations (filter, merge, pivot, enrich).
- Data quality profiling (detect anomalies, missing values, patterns).
- Integration with S3, Redshift, RDS, Athena, etc.
- Use Case: Business/data analysts preparing data for ML or BI dashboards.
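To make "data quality profiling" concrete, here is a tiny stdlib-only sketch of the kind of per-column statistics DataBrew computes automatically (row counts, missing values, distinct values). The real service profiles datasets in S3/Redshift/RDS with no code at all; this function and its column names are assumptions for illustration.

```python
# Tiny local sketch of the per-column profiling DataBrew automates.
# Pure stdlib; real DataBrew profiling needs no code and also detects
# patterns, outliers, and value distributions.

def profile(rows):
    """Per-column stats: row count, missing (None/empty) count, distinct values."""
    stats = {}
    for row in rows:
        for col, val in row.items():
            s = stats.setdefault(col, {"rows": 0, "missing": 0, "distinct": set()})
            s["rows"] += 1
            if val in (None, ""):
                s["missing"] += 1          # treat None/empty string as missing
            else:
                s["distinct"].add(val)
    return {c: {"rows": s["rows"], "missing": s["missing"],
                "distinct": len(s["distinct"])} for c, s in stats.items()}

report = profile([
    {"country": "DE", "amount": 10},
    {"country": None, "amount": 10},
    {"country": "FR", "amount": ""},
])
```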
3. AWS Glue Studio
- Low-code/no-code environment for building ETL pipelines.
- Drag-and-drop interface to create Glue jobs (batch ETL, streaming ETL).
- Generates PySpark or Scala code behind the scenes.
- Key Features:
- Visual job authoring.
- Built-in monitoring and job run history.
- Simplifies working with sources (S3, JDBC, Kinesis) and sinks (S3, Redshift, Snowflake).
- Use Case: Data engineers building repeatable ETL workflows without deep Spark knowledge.
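Conceptually, the visual nodes Studio lets you drag together compile down to a source → transform → sink pipeline. The plain-Python sketch below shows that shape only; the code Studio actually generates uses PySpark DynamicFrames, and the node names and fields here are illustrative assumptions.

```python
# Plain-Python sketch of the source -> transform -> sink shape that a
# Glue Studio visual job compiles down to. Illustrative only: real
# generated code uses PySpark DynamicFrames, not Python lists.

def source(records):               # e.g. an S3 or JDBC reader node
    return list(records)

def transform(records):            # e.g. a Filter node plus a field-rename node
    return [{"user": r["user_id"], "spend": r["amount"]}
            for r in records if r["amount"] > 0]

def sink(records, target):         # e.g. an S3/Redshift writer node
    target.extend(records)
    return len(records)

target = []
written = sink(transform(source([
    {"user_id": "u1", "amount": 5},
    {"user_id": "u2", "amount": 0},
])), target)
```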
4. AWS Glue Streaming ETL
- Extends Glue to real-time streaming data processing.
- Based on Apache Spark Structured Streaming under the hood.
- Sources: Kinesis Data Streams, Apache Kafka, Amazon MSK.
- Targets: S3 (Parquet/JSON/CSV), Redshift, OpenSearch, etc.
- Key Features:
- Supports schema inference + Data Catalog integration.
- Checkpointing provides fault tolerance and "exactly-once" semantics.
- Scales automatically with streaming throughput.
- Use Case: Real-time analytics (e.g., clickstream logs, IoT telemetry, fraud detection).
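The checkpointing mechanism behind those "exactly-once" semantics can be sketched in a few lines: each micro-batch is written and its offset committed, so a restart resumes from the last checkpoint instead of re-emitting records. Real Glue streaming jobs checkpoint to S3 via Structured Streaming; this in-memory version and its names are illustrative assumptions.

```python
# Local sketch of micro-batches plus a checkpoint, the mechanism behind
# Structured Streaming's fault tolerance: commit the offset after each
# batch, and a restart resumes there instead of re-emitting records.
# Real Glue streaming jobs checkpoint to S3; this is in-memory only.

def process_stream(events, checkpoint, sink):
    """Consume events from the last checkpointed offset in micro-batches of 2."""
    offset = checkpoint.get("offset", 0)
    while offset < len(events):
        batch = events[offset:offset + 2]       # one micro-batch
        sink.extend(batch)                      # write transformed results
        offset += len(batch)
        checkpoint["offset"] = offset           # commit the new offset
    return offset

events = ["click1", "click2", "click3"]
checkpoint, sink = {}, []
process_stream(events, checkpoint, sink)        # processes all 3 events
process_stream(events, checkpoint, sink)        # restart: nothing re-emitted
```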
5. Quick Comparison

| Feature | Audience | Mode | Best For |
| --- | --- | --- | --- |
| Job Bookmarks | Data engineers | Batch/Stream | Incremental data ingestion |
| DataBrew | Analysts | Batch (interactive) | No-code cleaning & profiling |
| Glue Studio | Data engineers | Batch/Stream | Visual ETL pipeline building |
| Streaming ETL | Data engineers | Real-time | Real-time ingestion & analytics |
✅ Features that together make Glue versatile:
- DataBrew → Prep data visually.
- Studio → Build/manage pipelines.
- Job Bookmarks → Enable incremental processing.
- Streaming ETL → Handle real-time workloads.
Flow diagram:
How AWS Glue Job Bookmarks, DataBrew, Studio, Streaming ETL fit together in the Glue landscape
- Glue Studio → Build/manage ETL pipelines visually.
- DataBrew → No-code cleaning & prep before ingestion.
- Streaming ETL → Real-time transformations.
- Job Bookmarks → Power all Glue jobs by ensuring incremental data processing.
- Data Lake (S3 + Catalog) → Central store for raw/processed data.
- Downstream Services (Athena, Redshift Spectrum, EMR) → Consume the transformed data.