Sunday, September 7, 2025

AWS Glue | Deep Dive.

twtech deep dive into AWS Glue.

Scope:

  • The Concept: AWS Glue,
  • Key Highlights,
  • Key Components of AWS Glue,
  • Glue Architecture,
  • AWS Glue ETL Process (Step-by-Step),
  • Key Features and Advanced Concepts,
  • Pricing Model,
  • Use Cases,
  • Best Practices.

1. The Concept: AWS Glue

  • AWS Glue is a fully managed ETL (Extract, Transform, Load) service provided by AWS.
  • AWS Glue allows twtech to prepare and transform data for analytics without managing servers.
  • Glue integrates well with other AWS services like S3, Redshift, Athena, RDS, and many more.

Key Highlights:

    • Serverless: No infrastructure to manage.
    • Supports structured, semi-structured, and unstructured data.
    • Can handle large-scale datasets.
    • Provides a data catalog for metadata management.

2. Key Components of AWS Glue

a) Glue Data Catalog

    • Centralized metadata repository.
    • Stores metadata like table definitions, schema, and location of data.
    • Acts as a central catalog for Athena, Redshift Spectrum, EMR, and Glue ETL jobs.
    • Features:
      • Database and table management.
      • Schema versioning.
      • Crawlers for automatic metadata extraction.
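
To see what the catalog holds, twtech can query it programmatically. A minimal sketch with boto3 follows; the database and table names (twtech_analytics, orders) are hypothetical placeholders:

    import boto3

    glue = boto3.client("glue")

    # List the databases registered in the Data Catalog
    for db in glue.get_databases()["DatabaseList"]:
        print("Database:", db["Name"])

    # Inspect the columns of one catalog table
    table = glue.get_table(DatabaseName="twtech_analytics", Name="orders")["Table"]
    for col in table["StorageDescriptor"]["Columns"]:
        print(col["Name"], col["Type"])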

b) Glue Crawlers

    • Automatically scan twtech data sources.
    • Infer schema and store it in the Data Catalog.
    • Can crawl S3, JDBC, or other supported data stores.
    • Example: Scan an S3 bucket of JSON files → Create a table in Glue Data Catalog.
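
A minimal boto3 sketch of that example, creating and starting a crawler over an S3 prefix of JSON files (the bucket, role ARN, and database names are hypothetical):

    import boto3

    glue = boto3.client("glue")

    # Define a crawler that scans the JSON files and registers tables in the catalog
    glue.create_crawler(
        Name="twtech-json-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="twtech_analytics",
        Targets={"S3Targets": [{"Path": "s3://twtech-raw-data/events/"}]},
    )

    # Run it on demand; results appear as tables in the Glue Data Catalog
    glue.start_crawler(Name="twtech-json-crawler")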

c) Glue Jobs

    • ETL scripts that process and transform data.
    • Supports Python (PySpark) or Scala.
    • Can run on a serverless Spark environment managed by Glue.

Job Types:

    1. Spark ETL Jobs – large-scale transformations.
    2. Python Shell Jobs – lightweight scripts for small tasks.
    3. Streaming Jobs – process data in near real-time (with Glue Streaming).
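
A minimal sketch of a Spark ETL job script (job type 1 above), assuming a catalog table named orders in a hypothetical twtech_analytics database:

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Data Catalog as a DynamicFrame
    source = glue_context.create_dynamic_frame.from_catalog(
        database="twtech_analytics", table_name="orders"
    )

    # Rename/cast fields, then write the result to S3 as Parquet
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("order_id", "string", "order_id", "string"),
            ("amount", "string", "amount", "double"),
        ],
    )
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://twtech-curated/orders/"},
        format="parquet",
    )

    job.commit()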

d) Glue Triggers

    • Schedule or run jobs based on events.
    • Types:
      • On-demand – manually start jobs.
      • Scheduled – cron-like scheduling.
      • Event-based – triggered by S3 object creation or job completion.
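
A minimal boto3 sketch of a scheduled trigger that starts a (hypothetical) job every night:

    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="twtech-nightly-trigger",
        Type="SCHEDULED",
        Schedule="cron(0 2 * * ? *)",          # 02:00 UTC every day
        Actions=[{"JobName": "twtech-orders-etl"}],
        StartOnCreation=True,
    )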

e) Glue Workflows

    • Orchestrates complex ETL pipelines.
    • Can integrate multiple jobs, crawlers, and triggers.
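
A minimal boto3 sketch wiring a crawler and a job into one workflow, reusing the hypothetical crawler and job names from the sketches above:

    import boto3

    glue = boto3.client("glue")

    # An empty workflow container
    glue.create_workflow(Name="twtech-orders-pipeline")

    # A scheduled trigger inside the workflow that starts the crawler
    glue.create_trigger(
        Name="start-crawl",
        WorkflowName="twtech-orders-pipeline",
        Type="SCHEDULED",
        Schedule="cron(0 1 * * ? *)",
        Actions=[{"CrawlerName": "twtech-json-crawler"}],
    )

    # A conditional trigger that runs the ETL job once the crawler succeeds
    glue.create_trigger(
        Name="run-etl-after-crawl",
        WorkflowName="twtech-orders-pipeline",
        Type="CONDITIONAL",
        Predicate={"Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "twtech-json-crawler",
            "CrawlState": "SUCCEEDED",
        }]},
        Actions=[{"JobName": "twtech-orders-etl"}],
    )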

3. Glue Architecture

    1. Data Sources: S3, RDS, Redshift, DynamoDB, JDBC.
    2. Glue Crawlers: Extract metadata and populate the Data Catalog.
    3. Glue Jobs: Transform and clean data using Spark.
    4. Data Targets: Store transformed data in S3, Redshift, or other warehouses.
    5. Triggers & Workflows: Automate ETL processes.

Architecture Note: 

AWS Glue is serverless Spark under the hood, so scaling is handled automatically.

4. AWS Glue ETL Process (Step-by-Step):

  1. Crawl twtech data → Create metadata tables in the Data Catalog.
  2. Author ETL Jobs:
    • Transform data (clean, enrich, join, filter).
    • Use DynamicFrames (AWS Glue abstraction) or Spark DataFrames.
  3. Run & Monitor Jobs:
    • Check job run status and logs in CloudWatch (a monitoring sketch follows these steps).
  4. Store or Load Data:
    • Write to S3, Redshift, or external DBs.
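
For step 3, a minimal boto3 sketch that starts a run and polls its status (the job name is hypothetical; detailed logs land in CloudWatch):

    import time
    import boto3

    glue = boto3.client("glue")

    # Kick off a run of the job and remember its run id
    run_id = glue.start_job_run(JobName="twtech-orders-etl")["JobRunId"]

    # Poll until the run reaches a terminal state
    while True:
        run = glue.get_job_run(JobName="twtech-orders-etl", RunId=run_id)["JobRun"]
        print("Job state:", run["JobRunState"])
        if run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(30)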

5. Key Features and Advanced Concepts

a) DynamicFrames

    • Glue-specific abstraction over Spark DataFrames.
    • Handles semi-structured data more flexibly.
    • Built-in schema evolution and transformations.
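
A minimal sketch of moving between the two abstractions inside a Glue job (catalog names are hypothetical):

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="twtech_analytics", table_name="orders"
    )

    # Drop down to a Spark DataFrame for SQL-style transformations...
    df = dyf.toDF().filter("amount > 100")

    # ...and back to a DynamicFrame to keep using Glue writers and transforms
    dyf_filtered = DynamicFrame.fromDF(df, glue_context, "orders_over_100")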

b) Schema Evolution

    • Glue supports changes in schema over time.
    • DynamicFrames can handle missing fields or type changes automatically.
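
Continuing the sketch above, resolveChoice is the usual way to pin down a field whose type has drifted across files (for example, amount arriving as a string in older data and a double in newer data):

    # Cast the ambiguous field to a single type; records missing it simply yield nulls
    resolved = dyf.resolveChoice(specs=[("amount", "cast:double")])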

c) Glue Studio

    • Visual interface for building ETL pipelines.
    • Drag-and-drop transformations.
    • Easier for non-coders.

d) Glue Elastic Views

    • Combines data from multiple sources into materialized views.
    • Useful for analytics without manual ETL coding.

e) Glue DataBrew

    • No-code visual data preparation.
    • Clean, normalize, and enrich data.
    • Generate transformation recipes.

6. Pricing Model

    • Glue Jobs: Pay per Data Processing Unit (DPU) hour of job run time.
    • Glue Crawlers: Pay per DPU-hour of crawler run time.
    • Data Catalog: Charged by the number of objects stored and the requests made against it (with a free tier).
    • Serverless: no upfront cost; twtech pays only for what jobs and crawlers actually consume.
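
As an illustrative calculation only (rates vary by region and Glue version; check current AWS pricing): a Spark job that runs on 10 DPUs for 30 minutes consumes 10 × 0.5 = 5 DPU-hours, which at an assumed rate of $0.44 per DPU-hour comes to roughly $2.20 for that run.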

7. Use Cases

    1. Data Lake ETL: Transform raw S3 data to analytics-ready formats.
    2. Data Warehousing: Load and transform data for Redshift.
    3. Streaming ETL: Near real-time data processing (e.g., Kinesis → S3/Redshift).
    4. Schema Discovery: Automatically catalog structured and semi-structured datasets.
    5. Cross-account/region ETL: Combine datasets across AWS environments.

8. Best Practices

    • Use crawlers to automatically keep metadata updated.
    • Use DynamicFrames for semi-structured data; DataFrames for structured.
    • Partition S3 data for performance (a partitioned-write sketch follows this list).
    • Monitor jobs in CloudWatch.
    • Keep ETL code modular and use Glue Workflows for orchestration.
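
For the partitioning point, a minimal sketch of a partitioned Parquet write, reusing the mapped frame and glue_context from the job sketch earlier (the path and partition keys are hypothetical):

    # Partition the output by date columns so downstream queries can prune partitions
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={
            "path": "s3://twtech-curated/orders/",
            "partitionKeys": ["year", "month", "day"],
        },
        format="parquet",
    )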



