Sunday, September 7, 2025

AWS Glue Data Catalog for Datasets | Deep Dive.

Scope:

  • Intro,
  • The Concept: Glue Data Catalog,
  • Core Components,
  • How the Glue Data Catalog Works,
  • Sample Table Definition in the Data Catalog,
  • Integrations with Other Services,
  • Best Practices,
  • High-Level Architecture.

Intro:

    • AWS Glue Data Catalog is one of the core services that ties together datasets, metadata, and analytics tools in AWS.

1. The Concept: Glue Data Catalog.

    • A centralized metadata repository that stores structured information (schemas, tables, partitions) about datasets.
    • Think of Glue Data Catalog as a database of metadata, not the actual data itself.
    • Glue Data Catalog provides a Hive Metastore–compatible service that integrates with Athena, Redshift Spectrum, EMR, and Glue ETL jobs.

2. Core Components

 Databases

    • Logical grouping of tables (like schemas in a relational database).
    • E.g., twtech-raw_db, twtech-curated_db

 Tables

    • Metadata definitions that describe datasets.
    • A table contains:
        • Column names & data types
        • Data location (usually an S3 path)
        • File format (CSV, JSON, Parquet, ORC, Avro, etc.)
        • SerDe (serializer/deserializer) info
        • Partition keys (e.g., year, month)

 Partitions

    • Subdivisions of table data that improve query performance.
    • Example S3 layout:

        s3://twtech-datalake/sales/year=2025/month=09/day=07/

    • Glue Catalog stores partition metadata so queries can prune irrelevant files.
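To make the pruning idea concrete, here is a minimal offline sketch in plain Python. It does not call AWS; it only shows how Hive-style key=value path segments map to partition values and how a filter on those values skips whole prefixes. The function names (parse_partition_path, prune) and the second example prefix are hypothetical, chosen for illustration.

```python
from urllib.parse import urlparse


def parse_partition_path(s3_uri):
    """Return the Hive-style key=value segments of an S3 prefix as a dict."""
    path = urlparse(s3_uri).path
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def prune(prefixes, **wanted):
    """Keep only the prefixes whose partition values match every filter."""
    kept = []
    for prefix in prefixes:
        values = parse_partition_path(prefix)
        if all(values.get(k) == v for k, v in wanted.items()):
            kept.append(prefix)
    return kept


# Two example prefixes; the first one is an assumed sibling partition.
prefixes = [
    "s3://twtech-datalake/sales/year=2025/month=08/day=31/",
    "s3://twtech-datalake/sales/year=2025/month=09/day=07/",
]

print(parse_partition_path(prefixes[1]))
print(prune(prefixes, year="2025", month="09"))
```

Query engines do the equivalent against the partition metadata held in the Catalog, so a filter on year and month never has to list or open files under non-matching prefixes.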

 Connections

    • Metadata definitions for external sources (JDBC, RDS, Redshift, etc.).
    • Store connection details securely so Glue jobs can read from and write to those sources.

 Crawlers

    • Automated tools that scan data sources and create/update tables in the catalog.
    • Detect schema, format, and partitions.
    • Update existing table definitions when the schema changes (if configured).
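A crawler can also be defined from code. The sketch below builds the arguments for the Glue CreateCrawler API and keeps the AWS call behind a client parameter, so the request can be inspected without credentials. The crawler name, the IAM role ARN, and the S3 path are assumptions made up for this example; the client is expected to be boto3.client("glue").

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # "If configured": update tables in place when the schema changes,
        # and only log objects that have disappeared from the source.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


def create_and_start(glue, request):
    """Create the crawler and start one run; glue is a boto3 Glue client."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical names and ARN, for illustration only.
request = crawler_request(
    name="twtech-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/twtech-glue-crawler-role",
    database="twtech-raw_db",
    s3_path="s3://twtech-datalake/sales/",
)
print(request)

# With credentials configured, the call would look like:
#   import boto3
#   create_and_start(boto3.client("glue"), request)
```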

3. How the Glue Data Catalog Works

  1. Data Discovery
    • Crawlers or manual table definitions register metadata in the catalog.
  2. Metadata Storage
    • Data stays in S3 (or other sources).
    • Only schema and location references are stored in the Catalog.
  3. Querying
    • Athena, Redshift Spectrum, and EMR reference the Catalog to understand the structure of underlying files.
    • Glue ETL jobs use the catalog to load sources/targets without hardcoding schemas.
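The lookup step can be sketched as follows: a consumer asks the Catalog for a table and gets back only the schema and the data location, never the data itself. The describe_table helper is a hypothetical name; it relies on the Glue GetTable API as exposed by boto3 (get_table). The StubGlue class is an offline stand-in with a canned response so the sketch runs without AWS access, and its table name and columns are assumed values.

```python
def describe_table(glue, database, table):
    """Look up a table in the Catalog and return its location and columns."""
    response = glue.get_table(DatabaseName=database, Name=table)
    descriptor = response["Table"]["StorageDescriptor"]
    columns = [(c["Name"], c["Type"]) for c in descriptor["Columns"]]
    return descriptor["Location"], columns


class StubGlue:
    """Offline stand-in for boto3.client("glue"); returns a canned response."""

    def get_table(self, DatabaseName, Name):
        return {
            "Table": {
                "Name": Name,
                "DatabaseName": DatabaseName,
                "StorageDescriptor": {
                    "Location": "s3://twtech-datalake/curated/twtechsales/",
                    "Columns": [
                        {"Name": "order_id", "Type": "string"},
                        {"Name": "amount", "Type": "double"},
                    ],
                },
            }
        }


location, columns = describe_table(StubGlue(), "twtech-curated_db", "twtech-sales_table")
print(location)
print(columns)

# Against a real account, pass boto3.client("glue") instead of StubGlue().
```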

4. Sample Table Definition in the Data Catalog

# A Parquet table registered in the Catalog may look like:

{
  "Name": "twtech-sales_table",
  "DatabaseName": "twtech-curated_db",
  "StorageDescriptor": {
    "Columns": [
      {"Name": "order_id", "Type": "string"},
      {"Name": "amount", "Type": "double"},
      {"Name": "order_date", "Type": "timestamp"}
    ],
    "Location": "s3://twtech-datalake/curated/twtechsales/",
    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  },
  "PartitionKeys": [
    {"Name": "year", "Type": "int"},
    {"Name": "month", "Type": "int"}
  ]
}
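A definition of this shape can be registered manually instead of through a crawler. One detail worth showing: in the Glue CreateTable API the database name is passed as its own argument, not inside TableInput. The sketch below splits a definition accordingly. The helper names (to_table_input, register_table) are hypothetical, and the TableType and classification defaults are conventional values assumed here for an S3-backed Parquet table.

```python
import copy


def to_table_input(definition):
    """Split a catalog-style definition into (database name, TableInput)."""
    table_input = copy.deepcopy(definition)  # leave the caller's dict untouched
    database = table_input.pop("DatabaseName")
    # Assumed defaults for an external Parquet table on S3.
    table_input.setdefault("TableType", "EXTERNAL_TABLE")
    table_input.setdefault("Parameters", {})
    table_input["Parameters"].setdefault("classification", "parquet")
    return database, table_input


def register_table(glue, definition):
    """Create the table; glue is expected to be boto3.client("glue")."""
    database, table_input = to_table_input(definition)
    glue.create_table(DatabaseName=database, TableInput=table_input)


definition = {
    "Name": "twtech-sales_table",
    "DatabaseName": "twtech-curated_db",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "amount", "Type": "double"},
            {"Name": "order_date", "Type": "timestamp"},
        ],
        "Location": "s3://twtech-datalake/curated/twtechsales/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
    "PartitionKeys": [
        {"Name": "year", "Type": "int"},
        {"Name": "month", "Type": "int"},
    ],
}

database, table_input = to_table_input(definition)
print(database)
print(sorted(table_input))
```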

5. Integrations with Other Services

    • Athena: Queries data via the Catalog (no ETL required).
    • Redshift Spectrum: External schemas use the Glue Catalog for S3 tables.
    • EMR (Hive/Spark): Glue Catalog acts as a Hive Metastore.
    • Glue Jobs: Source and target definitions are pulled from the Catalog.
    • Lake Formation: Extends Glue Catalog with fine-grained data permissions.
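As an illustration of the Athena integration, the sketch below builds a request for Athena's StartQueryExecution API that filters on partition columns, so the Catalog's partition metadata can be used for pruning. The helper names, the selected columns, and the results bucket (s3://twtech-athena-results/) are assumptions for this example. The table name is wrapped in double quotes on the assumption that a hyphenated name needs quoting in a SELECT statement.

```python
def athena_request(database, table, year, month, output_location):
    """Build keyword arguments for Athena's StartQueryExecution API."""
    # int() keeps the partition filters numeric and rejects stray input.
    query = (
        f'SELECT order_id, amount FROM "{table}" '
        f"WHERE year = {int(year)} AND month = {int(month)}"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(athena, request):
    """Submit the query; athena is expected to be boto3.client("athena")."""
    return athena.start_query_execution(**request)["QueryExecutionId"]


request = athena_request(
    database="twtech-curated_db",
    table="twtech-sales_table",
    year=2025,
    month=9,
    output_location="s3://twtech-athena-results/",  # hypothetical bucket
)
print(request["QueryString"])
```

Only the query text and the database name are supplied; Athena resolves the columns, file format, and S3 location from the Catalog entry.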

6. Best Practices

    • Use partitioning and bucketing to optimize queries.
    • Enable schema versioning to track changes over time.
    • Secure the catalog with Lake Formation access control (IAM plus fine-grained permissions).
    • Automate metadata refresh with Glue Crawlers and triggers.
    • For large catalogs, use resource tagging and naming standards.

7. High-Level Architecture

Flow:

S3 / Databases → Glue Crawler → Glue Data Catalog → (Athena, Redshift Spectrum, EMR, Glue Jobs, Lake Formation)


