Sunday, September 7, 2025

AWS Glue Job Bookmarks, DataBrew, Studio & Streaming ETL | Overview.



 Scope:

  • Intro,
  • AWS Glue – Key Features Overview,
  • Quick Comparison,
  • Features that together make Glue versatile,
  • Flow diagram,
  • How AWS Glue Job Bookmarks, DataBrew, Studio, Streaming ETL fit together in the Glue landscape.

Intro:

  • AWS Glue bundles several data integration and ETL (Extract, Transform, Load) features that help twtech prepare and process its data efficiently:

    •  AWS Glue Job Bookmarks, 

    • AWS Glue DataBrew,

    • AWS Glue Studio, 

    • AWS Glue Streaming ETL

AWS Glue – Key Features Overview

1. Job Bookmarks

  • Purpose: Track state and prevent reprocessing of the same data.
  • How it works:
    • When a Glue ETL job runs, it stores checkpoint metadata about processed files/partitions.
    • On the next run, Glue uses bookmarks to process only new/unprocessed data.
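  • Modes (set with the --job-bookmark-option job argument):
    • job-bookmark-enable: Incremental processing (new data only).
    • job-bookmark-disable: Full reload (processes everything again).
    • job-bookmark-pause: Processes data added since the last bookmark without advancing the bookmark.
  • Use Case: Incrementally process a growing data lake (e.g., daily ingestion of new log files from S3).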
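
The bookmark behaviour described above can be modelled in a few lines of plain Python. This is only a conceptual sketch, not how Glue implements it: BookmarkStore, run_job, and the file names are hypothetical, and in a real job the state is kept by the Glue service and committed when the script calls job.commit().

```python
from dataclasses import dataclass, field


@dataclass
class BookmarkStore:
    """Hypothetical stand-in for the state Glue persists between job runs."""
    processed: set = field(default_factory=set)


def run_job(available_files, store, bookmark_enabled=True):
    """Return the files this run processes (hypothetical helper)."""
    if bookmark_enabled:
        # Incremental mode: skip anything a previous run already handled,
        # then record the new files so the next run skips them too.
        todo = [f for f in available_files if f not in store.processed]
        store.processed.update(todo)
    else:
        # Bookmark disabled: the stored state is ignored and everything is reprocessed.
        todo = list(available_files)
    return todo


store = BookmarkStore()
first_run = run_job(["logs/day1.json", "logs/day2.json"], store)
second_run = run_job(["logs/day1.json", "logs/day2.json", "logs/day3.json"], store)
full_reload = run_job(
    ["logs/day1.json", "logs/day2.json", "logs/day3.json"],
    store,
    bookmark_enabled=False,
)

print(first_run)    # both files: nothing was bookmarked yet
print(second_run)   # only the file that arrived after the first run
print(full_reload)  # all three files again
```

The second run picks up only logs/day3.json, which is the incremental behaviour twtech gets with the bookmark enabled; the third run shows the full reload that happens when it is disabled.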

2. AWS Glue DataBrew

  • Visual, no-code data preparation tool.
  • Focused on data cleaning, transformation, and profiling without writing Spark/PySpark code.
  • Key Features:
    • 250+ pre-built transformations (filter, merge, pivot, enrich).
    • Data quality profiling (detect anomalies, missing values, patterns).
    • Integration with S3, Redshift, RDS, Athena, etc.
  • Use Case: Business/data analysts preparing data for ML or BI dashboards.
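
To make "data quality profiling" concrete, here is a rough idea of the kind of per-column summary a profile produces. It is a hand-rolled illustration in plain Python, not the DataBrew API: the profile function, the column names, and the sample rows are all assumptions.

```python
# Hypothetical sample data; empty strings and None both count as missing.
rows = [
    {"customer_id": "c1", "country": "US", "age": "34"},
    {"customer_id": "c2", "country": "", "age": "29"},
    {"customer_id": "c3", "country": "US", "age": None},
]


def profile(records):
    """Count missing and distinct values per column (illustrative only)."""
    columns = sorted({key for record in records for key in record})
    summary = {}
    for column in columns:
        values = [record.get(column) for record in records]
        present = [v for v in values if v not in (None, "")]
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary


report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

DataBrew produces this sort of summary (and much more) from the console without any code; the sketch only shows what the numbers mean.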

3. AWS Glue Studio

  • Low-code/no-code environment for building ETL pipelines.
  • Drag-and-drop interface to create Glue jobs (batch ETL, streaming ETL).
  • Generates PySpark or Scala code behind the scenes.
  • Key Features:
    • Visual job authoring.
    • Built-in monitoring and job run history.
    • Simplifies working with sources (S3, JDBC, Kinesis) and sinks (S3, Redshift, Snowflake).
  • Use Case: Data engineers building repeatable ETL workflows without deep Spark knowledge.
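
A visual job is essentially a graph of source, transform, and target nodes that gets turned into an ordered script. The sketch below mimics that idea in plain Python under stated assumptions: the node names, the nodes dictionary layout, and run_graph are hypothetical and are not what Glue Studio actually generates.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical visual job: each node names its parents and a function to apply.
nodes = {
    "read_orders": {
        "parents": [],
        "apply": lambda inputs: [{"id": 1, "total": 40}, {"id": 2, "total": 5}],
    },
    "filter_large": {
        "parents": ["read_orders"],
        "apply": lambda inputs: [r for r in inputs[0] if r["total"] >= 10],
    },
    "write_output": {
        "parents": ["filter_large"],
        "apply": lambda inputs: inputs[0],  # stand-in for writing to a sink
    },
}


def run_graph(graph):
    """Run every node after its parents and return the order plus the results."""
    dependencies = {name: set(spec["parents"]) for name, spec in graph.items()}
    results = {}
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        spec = graph[name]
        results[name] = spec["apply"]([results[p] for p in spec["parents"]])
        executed.append(name)
    return executed, results


executed, results = run_graph(nodes)
print(executed)
print(results["write_output"])
```

Ordering the nodes by their dependencies is what lets a drag-and-drop canvas become a linear script that can be run and rerun.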

4. AWS Glue Streaming ETL

  • Extends Glue to real-time streaming data processing.
  • Based on Apache Spark Structured Streaming under the hood.
  • Sources: Amazon Kinesis Data Streams, Apache Kafka, Amazon MSK (Managed Streaming for Apache Kafka).
  • Targets: S3 (Parquet/JSON/CSV), Redshift, OpenSearch, etc.
  • Key Features:
    • Supports schema inference + Data Catalog integration.
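    • Checkpointing (streaming jobs track progress with Spark checkpoints rather than job bookmarks) provides fault tolerance; end-to-end “exactly-once” delivery also depends on the sink.
    • Can scale workers with streaming throughput when Glue Auto Scaling is enabled.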
  • Use Case: Real-time analytics (e.g., clickstream logs, IoT telemetry, fraud detection).
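
Spark Structured Streaming processes a stream as a series of micro-batches and records its progress in a checkpoint, so a restarted job resumes where it stopped. The following plain-Python sketch imitates that loop; process_stream, the in-memory checkpoint dictionary, and the sample events are assumptions for illustration, not Glue or Spark APIs.

```python
def process_stream(records, checkpoint, sink, batch_size=2, fail_after_batches=None):
    """Consume records in micro-batches, resuming from the saved offset."""
    batches_done = 0
    offset = checkpoint.get("offset", 0)
    while offset < len(records):
        if fail_after_batches is not None and batches_done == fail_after_batches:
            raise RuntimeError("simulated worker failure")
        batch = records[offset:offset + batch_size]
        sink.extend(event.upper() for event in batch)  # the "transform" and write
        offset += len(batch)
        checkpoint["offset"] = offset  # persist progress after each batch
        batches_done += 1


events = ["click", "view", "click", "purchase", "view"]
checkpoint = {}  # hypothetical durable store (a real job would use a checkpoint location)
sink = []

try:
    # First attempt crashes after one micro-batch has been written and checkpointed.
    process_stream(events, checkpoint, sink, fail_after_batches=1)
except RuntimeError:
    pass

# The restart picks up from the checkpointed offset instead of starting over.
process_stream(events, checkpoint, sink)
print(sink)
print(checkpoint)
```

In this sketch the write and the checkpoint update are separate steps, so a crash between them would replay one batch on restart. That is why exactly-once results in practice also need an idempotent or transactional sink.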

5.  Quick Comparison

Feature       | Audience       | Mode                | Best For
--------------|----------------|---------------------|--------------------------------
Job Bookmarks | Data engineers | Batch               | Incremental data ingestion
DataBrew      | Analysts       | Batch (interactive) | No-code cleaning & profiling
Glue Studio   | Data engineers | Batch/Stream        | Visual ETL pipeline building
Streaming ETL | Data engineers | Real-time           | Real-time ingestion & analytics

✅  Features that together make Glue versatile:

    • DataBrew: Prep data visually.
    • Studio: Build/manage pipelines.
    • Job Bookmarks: Enable incremental processing.
    • Streaming ETL: Handle real-time workloads.

Flow diagram:

 How AWS Glue Job Bookmarks, DataBrew, Studio, and Streaming ETL fit together in the Glue landscape:

    • Glue Studio: Build/manage ETL pipelines visually.
    • DataBrew: No-code cleaning & prep before ingestion.
    • Streaming ETL: Real-time transformations.
    • Job Bookmarks: Enable incremental data processing for batch Glue jobs.
    • Data Lake (S3 + Catalog): Central store for raw/processed data.
    • Downstream Services (Athena, Redshift Spectrum, EMR): Consume the transformed data.


