AWS Glue Data Catalog for Datasets - Deep Dive.
Scope:
- Intro,
- The Concept: Glue Data Catalog,
- Core Components,
- How the Glue Data Catalog Works,
- Sample Table Definition in the Data Catalog,
- Integrations with Other Services,
- Best Practices,
- High-Level Architecture.
Intro:
- AWS Glue Data Catalog is one of the core services that ties together datasets, metadata, and analytics tools in AWS.
1. The Concept: Glue Data Catalog.
- A centralized metadata repository that stores structured information (schemas, tables, partitions) about datasets.
- Think of the Glue Data Catalog as a database of metadata, not the actual data itself.
- Provides a Hive Metastore–compatible service that integrates with Athena, Redshift Spectrum, EMR, and Glue ETL jobs.
2. Core Components
Databases
- Logical grouping of tables (like schemas in a relational database).
- E.g., twtech-raw_db, twtech-curated_db
Tables
- Metadata definitions that describe datasets.
- A table contains:
  - Column names & data types
  - Data location (usually an S3 path)
  - File format (CSV, JSON, Parquet, ORC, Avro, etc.)
  - SerDe (serializer/deserializer) info
  - Partition keys (e.g., year, month)
Partitions
- Subdivisions of table data that improve query performance.
- Example S3 layout:
s3://twtech-datalake/sales/year=2025/month=09/day=07/
- Glue Catalog stores partition metadata so queries can prune irrelevant files.
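To make partition pruning concrete, here is a minimal Python sketch of the idea (the partition values and bucket layout are hypothetical, modeled on the example above): the engine asks the "catalog" which partitions match the query's filter and only scans those S3 prefixes.

```python
# Toy model of partition pruning: the "catalog" is a list of partition
# value dicts; only matching partitions' S3 prefixes are ever scanned.

PARTITIONS = [  # partition metadata as the Glue Catalog would store it
    {"year": 2025, "month": 8, "day": 31},
    {"year": 2025, "month": 9, "day": 6},
    {"year": 2025, "month": 9, "day": 7},
]

BASE = "s3://twtech-datalake/sales"

def prune(partitions, **predicate):
    """Return S3 prefixes only for partitions matching the predicate."""
    matches = [
        p for p in partitions
        if all(p[k] == v for k, v in predicate.items())
    ]
    return [
        f"{BASE}/year={p['year']}/month={p['month']:02d}/day={p['day']:02d}/"
        for p in matches
    ]

# A query filtered on year=2025 AND month=9 never touches the August prefix.
print(prune(PARTITIONS, year=2025, month=9))
```

In a real deployment, Athena or Spark does this lookup against the Glue Catalog's partition list instead of an in-memory list, but the pruning logic is the same.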
Connections
- Metadata definitions for external sources (JDBC, RDS, Redshift, etc.).
- Secure connection details for Glue jobs to read/write data.
Crawlers
- Automated tools that scan data sources and create/update tables in the catalog.
- Detect schema, format, and partitions.
- Update when schema changes (if configured).
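As a sketch, a crawler for the example bucket could be defined as below; the crawler name and IAM role ARN are placeholder assumptions, and the dict mirrors the parameters boto3's `create_crawler` accepts.

```python
# Sketch of a Glue crawler definition (name and role ARN are hypothetical).

crawler_config = {
    "Name": "twtech-sales-crawler",  # hypothetical crawler name
    "Role": "arn:aws:iam::123456789012:role/twtech-glue-role",  # placeholder
    "DatabaseName": "twtech-raw_db",
    "Targets": {"S3Targets": [{"Path": "s3://twtech-datalake/sales/"}]},
    "SchemaChangePolicy": {  # update tables when the schema changes
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}

# With AWS credentials configured, this would register the crawler:
# import boto3
# boto3.client("glue").create_crawler(**crawler_config)
```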
3. How the Glue Data Catalog Works
- Data Discovery
  - Crawlers or manual table definitions register metadata in the catalog.
- Metadata Storage
  - Data stays in S3 (or other sources).
  - Only schema and location references are stored in the Catalog.
- Querying
  - Athena, Redshift Spectrum, and EMR reference the Catalog to understand the structure of the underlying files.
  - Glue ETL jobs use the catalog to load sources/targets without hardcoding schemas.
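The querying step can be modeled in a few lines: an engine resolves a (database, table) pair in the catalog and gets back the schema and S3 location it needs to plan a scan. This is a toy in-memory catalog with made-up entries, not the real Glue API, but it shows why no schema ever needs hardcoding in the query engine or ETL job.

```python
# Toy in-memory catalog: engines resolve (database, table) to a
# schema + data location instead of hardcoding file structure.

CATALOG = {
    ("twtech-curated_db", "twtech-sales_table"): {
        "columns": [("order_id", "string"), ("amount", "double")],
        "location": "s3://twtech-datalake/curated/twtechsales/",
        "format": "parquet",
    }
}

def resolve_table(database, table):
    """Return the metadata an engine needs to plan a scan."""
    meta = CATALOG.get((database, table))
    if meta is None:
        raise KeyError(f"Table {database}.{table} not found in catalog")
    return meta

meta = resolve_table("twtech-curated_db", "twtech-sales_table")
print(meta["location"], [name for name, _ in meta["columns"]])
```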
4. Sample Table Definition in the Data Catalog
A Parquet table registered in the Catalog may look like:

{
  "Name": "twtech-sales_table",
  "DatabaseName": "twtech-curated_db",
  "StorageDescriptor": {
    "Columns": [
      {"Name": "order_id", "Type": "string"},
      {"Name": "amount", "Type": "double"},
      {"Name": "order_date", "Type": "timestamp"}
    ],
    "Location": "s3://twtech-datalake/curated/twtechsales/",
    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  },
  "PartitionKeys": [
    {"Name": "year", "Type": "int"},
    {"Name": "month", "Type": "int"}
  ]
}
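A definition like this can also be registered programmatically. Assuming boto3 and suitable credentials, a sketch might look as follows; the actual API call is left commented out, and the table/database names follow the example above.

```python
# Sketch: the sample Parquet table expressed as a boto3 TableInput dict.
# glue.create_table() takes DatabaseName plus this TableInput.

table_input = {
    "Name": "twtech-sales_table",
    "TableType": "EXTERNAL_TABLE",  # data lives in S3, outside Glue
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "amount", "Type": "double"},
            {"Name": "order_date", "Type": "timestamp"},
        ],
        "Location": "s3://twtech-datalake/curated/twtechsales/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
    "PartitionKeys": [
        {"Name": "year", "Type": "int"},
        {"Name": "month", "Type": "int"},
    ],
}

# With AWS credentials configured:
# import boto3
# boto3.client("glue").create_table(
#     DatabaseName="twtech-curated_db", TableInput=table_input
# )
```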
5. Integrations with Other Services
- Athena → Queries data via the Catalog (no ETL required).
- Redshift Spectrum → External schemas use the Glue Catalog for S3 tables.
- EMR (Hive/Spark) → Glue Catalog acts as a Hive Metastore.
- Glue Jobs → Source and target definitions are pulled from the Catalog.
- Lake Formation → Extends Glue Catalog with fine-grained data permissions.
6. Best Practices
✅ Use partitioning and bucketing to optimize queries.
✅ Enable schema versioning to track changes over time.
✅ Secure with Lake Formation access control (IAM + fine-grained).
✅ Automate metadata refresh with Glue Crawlers + triggers.
✅ For large catalogs, use resource tagging and naming standards.
7. High-Level Architecture
Flow:
S3 / Databases → Glue Crawler → Glue Data Catalog → (Athena, Redshift Spectrum, EMR, Glue Jobs, Lake Formation)