twtech deep dive into AWS Glue.
Scope:
- The Concept: AWS Glue,
- Key Highlights,
- Key Components of AWS Glue,
- Glue Architecture,
- AWS Glue ETL Process (Step-by-Step),
- Key Features and Advanced Concepts,
- Pricing Model,
- Use Cases,
- Best Practices.
1. The Concept: AWS Glue
- AWS Glue is a fully managed ETL (Extract, Transform, Load) service provided by AWS.
- It lets twtech prepare and transform data for analytics without managing servers.
- Glue integrates well with other AWS services such as S3, Redshift, Athena, RDS, and many more.
Key Highlights:
- Serverless: No infrastructure to manage.
- Supports structured, semi-structured, and unstructured data.
- Can handle large-scale datasets.
- Provides a data catalog for metadata management.
2. Key Components of AWS Glue
a) Glue Data Catalog
- Centralized metadata repository.
- Stores metadata like table definitions, schema, and location of data.
- Acts as a central catalog for Athena, Redshift Spectrum, EMR, and Glue ETL jobs.
- Features:
- Database and table management.
- Schema versioning.
- Crawlers for automatic metadata extraction.
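As a quick illustration, the catalog can be queried programmatically. A minimal boto3 sketch, assuming a database named twtech_analytics (a placeholder):

```python
import boto3

glue = boto3.client("glue")

# List every table registered in one catalog database, printing
# each table's name and the S3 location of its underlying data
for table in glue.get_tables(DatabaseName="twtech_analytics")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```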
b) Glue Crawlers
- Automatically scan twtech data sources.
- Infer schema and store it in the Data Catalog.
- Can crawl S3, JDBC, or other supported data stores.
- Example: Scan an S3 bucket of JSON files → Create a table in Glue Data Catalog.
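A minimal boto3 sketch of that example; the crawler name, IAM role ARN, database, and bucket path are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler over an S3 prefix of JSON files; on each run it infers
# the schema and creates/updates a table in the Glue Data Catalog
glue.create_crawler(
    Name="twtech-json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="twtech_analytics",
    Targets={"S3Targets": [{"Path": "s3://twtech-raw-data/json/"}]},
)

glue.start_crawler(Name="twtech-json-crawler")
```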
c) Glue Jobs
- ETL scripts that process and transform data.
- Supports Python (PySpark) or Scala.
- Can run on a serverless Spark environment managed by Glue.
Job Types:
- Spark ETL Jobs – large-scale transformations.
- Python Shell Jobs – lightweight scripts for small tasks.
- Streaming Jobs – process data in near real-time (with Glue Streaming).
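A minimal Spark ETL job sketch using the standard Glue job boilerplate; the database, table, bucket path, and status field are illustrative placeholders:

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: resolve arguments, build contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Data Catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="twtech_analytics", table_name="raw_events"
)

# Keep only active records (field name is illustrative)
active = Filter.apply(frame=dyf, f=lambda row: row["status"] == "active")

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=active,
    connection_type="s3",
    connection_options={"path": "s3://twtech-curated/events/"},
    format="parquet",
)

job.commit()
```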
d) Glue Triggers
- Schedule or run jobs based on events.
- Types:
- On-demand – manually start jobs.
- Scheduled – cron-like scheduling.
- Event-based – triggered by S3 object creation or job completion.
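A boto3 sketch of a scheduled trigger (job and trigger names are placeholders); the cron expression fires nightly at 02:00 UTC:

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: start the ETL job every night at 02:00 UTC
glue.create_trigger(
    Name="twtech-nightly-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "twtech-etl-job"}],
    StartOnCreation=True,
)
```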
e) Glue Workflows
- Orchestrates complex ETL pipelines.
- Can integrate multiple jobs, crawlers, and triggers.
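A minimal sketch of wiring a crawler into a workflow with boto3 (all names are placeholders); CONDITIONAL triggers inside the workflow could then chain the ETL job on crawler success:

```python
import boto3

glue = boto3.client("glue")

# A workflow groups jobs, crawlers, and triggers into one orchestrated pipeline
glue.create_workflow(Name="twtech-etl-workflow")

# An on-demand trigger inside the workflow starts the crawler first
glue.create_trigger(
    Name="twtech-start-trigger",
    WorkflowName="twtech-etl-workflow",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "twtech-json-crawler"}],
)
```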
3. Glue Architecture
- Data Sources: S3, RDS, Redshift, DynamoDB, JDBC.
- Glue Crawlers: Extract metadata → populate Data Catalog.
- Glue Jobs: Transform and clean data using Spark.
- Data Targets: Store transformed data in S3, Redshift, or other warehouses.
- Triggers & Workflows: Automate ETL processes.
Architecture Note:
AWS Glue is serverless Spark under the hood, so scaling is handled automatically.
4. AWS Glue ETL Process (Step-by-Step)
- Crawl twtech data → create metadata tables.
- Author ETL Jobs:
- Transform data (clean, enrich, join, filter).
- Use DynamicFrames (an AWS Glue abstraction) or Spark DataFrames.
- Run & Monitor Jobs:
- Check job runs and logs in CloudWatch.
- Store or Load Data:
- Write to S3, Redshift, or external DBs.
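A sketch of the transform step, assuming two DynamicFrames (orders_dyf, customers_dyf) were already read from the catalog; all field names are illustrative:

```python
from awsglue.transforms import ApplyMapping, Join

# Clean: rename and cast columns in one pass
mapped = ApplyMapping.apply(
    frame=orders_dyf,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amt", "string", "amount", "double"),
        ("customer_id", "string", "customer_id", "string"),
    ],
)

# Enrich: join orders to customer attributes on the shared key
enriched = Join.apply(mapped, customers_dyf, "customer_id", "customer_id")
```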
5. Key Features and Advanced Concepts
a) DynamicFrames
- Glue-specific abstraction over Spark DataFrames.
- Handles semi-structured data more flexibly.
- Built-in schema evolution and transformations.
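A minimal sketch of moving between the two abstractions, assuming a glueContext and a DynamicFrame dyf as in the job skeleton above:

```python
from awsglue.dynamicframe import DynamicFrame

# Drop down to a Spark DataFrame for operations Glue does not expose directly
df = dyf.toDF().dropDuplicates(["id"])  # "id" is an illustrative column

# Wrap back into a DynamicFrame before writing through Glue sinks
dyf_clean = DynamicFrame.fromDF(df, glueContext, "dyf_clean")
```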
b) Schema Evolution
- Glue supports changes in schema over time.
- DynamicFrames can handle missing fields or type changes automatically.
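For example, if a field arrived as a string in some files and a double in others, resolveChoice can reconcile the readings (the field name is illustrative):

```python
# Cast every reading of "price" to double; values that cannot cast become null
resolved = dyf.resolveChoice(specs=[("price", "cast:double")])

# Alternative: keep both readings side by side as a struct for inspection
# resolved = dyf.resolveChoice(specs=[("price", "make_struct")])
```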
c) Glue Studio
- Visual interface for building ETL pipelines.
- Drag-and-drop transformations.
- Easier for non-coders.
d) Glue Elastic Views
- Combined data from multiple sources into materialized views.
- Was aimed at analytics without manual ETL coding.
- Note: Elastic Views never moved beyond preview; AWS has since discontinued the service.
e) Glue DataBrew
- No-code visual data preparation.
- Clean, normalize, and enrich data.
- Generate transformation recipes.
6. Pricing Model
- Glue Jobs: pay per Data Processing Unit (DPU) hour.
- Glue Crawlers: also billed per DPU-hour while a crawl runs.
- Glue Data Catalog: billed by the number of objects stored and by API requests (both with a monthly free tier).
- Serverless → no upfront cost; cost depends on job execution and scale.
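Illustrative back-of-the-envelope: assuming a rate of $0.44 per DPU-hour (check current regional pricing), a Spark job running on 10 DPUs for 30 minutes costs 10 × 0.5 × $0.44 ≈ $2.20.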
7. Use Cases
- Data Lake ETL: transform raw S3 data into analytics-ready formats.
- Data Warehousing: Load and transform data for Redshift.
- Streaming ETL: Real-time data processing (Kinesis → S3/Redshift).
- Schema Discovery: Automatically catalog structured and semi-structured datasets.
- Cross-account/region ETL: Combine datasets across AWS environments.
8. Best Practices
- Use crawlers to automatically keep metadata updated.
- Use DynamicFrames for semi-structured data; DataFrames for structured.
- Partition S3 data for performance (see the partitioned-write sketch after this list).
- Monitor jobs in CloudWatch.
- Keep ETL code modular and use Glue Workflows for orchestration.
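A sketch of the partitioned-write best practice, continuing the job skeleton above (the bucket path and partition columns are illustrative):

```python
# Write Parquet partitioned by date columns so Athena and Redshift Spectrum
# can prune partitions instead of scanning the whole dataset
glueContext.write_dynamic_frame.from_options(
    frame=dyf_clean,
    connection_type="s3",
    connection_options={
        "path": "s3://twtech-curated/orders/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
)
```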