AWS Glue Data Catalog for Datasets - Deep Dive.
Scope:
- Intro,
- The Concept: Glue Data Catalog,
- Core Components,
- How the Glue Data Catalog Works,
- Sample Table Definition in the Data Catalog,
- Integrations with Other Services,
- Best Practices,
- High-Level Architecture.
Intro:
- AWS Glue Data Catalog is one of the core services that ties together datasets, metadata, and analytics tools in AWS.
1. The Concept: Glue Data Catalog.
- A centralized metadata repository that stores structured information (schemas, tables, partitions) about datasets.
- Think of the Glue Data Catalog as a database of metadata, not the actual data itself.
- Provides a Hive Metastore–compatible service that integrates with Athena, Redshift Spectrum, EMR, and Glue ETL jobs.
2. Core Components
Databases
- Logical grouping of tables (like schemas in a relational database).
- E.g., twtech-raw_db, twtech-curated_db
Tables
- Metadata definitions that describe datasets.
- A table contains:
  - Column names & data types
  - Data location (usually an S3 path)
  - File format (CSV, JSON, Parquet, ORC, Avro, etc.)
  - SerDe (serializer/deserializer) info
  - Partition keys (e.g., year, month)
Partitions
- Subdivisions of table data that improve query performance.
- Example S3 layout:
s3://twtech-datalake/sales/year=2025/month=09/day=07/
- Glue Catalog stores partition metadata so queries can prune irrelevant files.
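To make partition pruning concrete, here is a minimal Python sketch of the idea (the partition values and bucket layout are hypothetical, modeled on the example above): the engine asks the "catalog" which partitions match the query's filter and only scans those S3 prefixes.

```python
# Toy model of partition pruning: the "catalog" is a list of partition
# value dicts; only matching partitions' S3 prefixes are ever scanned.

PARTITIONS = [  # partition metadata as the Glue Catalog would store it
    {"year": 2025, "month": 8, "day": 31},
    {"year": 2025, "month": 9, "day": 6},
    {"year": 2025, "month": 9, "day": 7},
]

BASE = "s3://twtech-datalake/sales"

def prune(partitions, **predicate):
    """Return S3 prefixes only for partitions matching the predicate."""
    matches = [
        p for p in partitions
        if all(p[k] == v for k, v in predicate.items())
    ]
    return [
        f"{BASE}/year={p['year']}/month={p['month']:02d}/day={p['day']:02d}/"
        for p in matches
    ]

# A query filtered on year=2025 AND month=9 never touches the August prefix.
print(prune(PARTITIONS, year=2025, month=9))
```

In a real deployment, Athena or Spark does this lookup against the Glue Catalog's partition list instead of an in-memory list, but the pruning logic is the same.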
Connections
- Metadata definitions for external sources (JDBC, RDS, Redshift, etc.).
- Secure connection details for Glue jobs to read/write data.
Crawlers
- Automated tools that scan data sources and create/update tables in the catalog.
- Detect schema, format, and partitions.
- Update when schema changes (if configured).
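As a sketch, a crawler for the example bucket could be defined as below; the crawler name and IAM role ARN are placeholder assumptions, and the dict mirrors the parameters boto3's `create_crawler` accepts.

```python
# Sketch of a Glue crawler definition (name and role ARN are hypothetical).

crawler_config = {
    "Name": "twtech-sales-crawler",  # hypothetical crawler name
    "Role": "arn:aws:iam::123456789012:role/twtech-glue-role",  # placeholder
    "DatabaseName": "twtech-raw_db",
    "Targets": {"S3Targets": [{"Path": "s3://twtech-datalake/sales/"}]},
    "SchemaChangePolicy": {  # update tables when the schema changes
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}

# With AWS credentials configured, this would register the crawler:
# import boto3
# boto3.client("glue").create_crawler(**crawler_config)
```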
3. How the Glue Data Catalog Works
- Data Discovery
  - Crawlers or manual table definitions register metadata in the catalog.
- Metadata Storage
  - Data stays in S3 (or other sources).
  - Only schema and location references are stored in the Catalog.
- Querying
  - Athena, Redshift Spectrum, and EMR reference the Catalog to understand the structure of the underlying files.
  - Glue ETL jobs use the catalog to load sources/targets without hardcoding schemas.
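The querying step can be modeled in a few lines: an engine resolves a (database, table) pair in the catalog and gets back the schema and S3 location it needs to plan a scan. This is a toy in-memory catalog with made-up entries, not the real Glue API, but it shows why no schema ever needs hardcoding in the query engine or ETL job.

```python
# Toy in-memory catalog: engines resolve (database, table) to a
# schema + data location instead of hardcoding file structure.

CATALOG = {
    ("twtech-curated_db", "twtech-sales_table"): {
        "columns": [("order_id", "string"), ("amount", "double")],
        "location": "s3://twtech-datalake/curated/twtechsales/",
        "format": "parquet",
    }
}

def resolve_table(database, table):
    """Return the metadata an engine needs to plan a scan."""
    meta = CATALOG.get((database, table))
    if meta is None:
        raise KeyError(f"Table {database}.{table} not found in catalog")
    return meta

meta = resolve_table("twtech-curated_db", "twtech-sales_table")
print(meta["location"], [name for name, _ in meta["columns"]])
```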
4. Sample Table Definition in the Data Catalog
A Parquet table registered in the Catalog may look like:

{
  "Name": "twtech-sales_table",
  "DatabaseName": "twtech-curated_db",
  "StorageDescriptor": {
    "Columns": [
      {"Name": "order_id", "Type": "string"},
      {"Name": "amount", "Type": "double"},
      {"Name": "order_date", "Type": "timestamp"}
    ],
    "Location": "s3://twtech-datalake/curated/twtechsales/",
    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  },
  "PartitionKeys": [
    {"Name": "year", "Type": "int"},
    {"Name": "month", "Type": "int"}
  ]
}
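A definition like this can also be registered programmatically. Assuming boto3 and suitable credentials, a sketch might look as follows; the actual API call is left commented out, and the table/database names follow the example above.

```python
# Sketch: the sample Parquet table expressed as a boto3 TableInput dict.
# glue.create_table() takes DatabaseName plus this TableInput.

table_input = {
    "Name": "twtech-sales_table",
    "TableType": "EXTERNAL_TABLE",  # data lives in S3, outside Glue
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "string"},
            {"Name": "amount", "Type": "double"},
            {"Name": "order_date", "Type": "timestamp"},
        ],
        "Location": "s3://twtech-datalake/curated/twtechsales/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
    "PartitionKeys": [
        {"Name": "year", "Type": "int"},
        {"Name": "month", "Type": "int"},
    ],
}

# With AWS credentials configured:
# import boto3
# boto3.client("glue").create_table(
#     DatabaseName="twtech-curated_db", TableInput=table_input
# )
```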
5. Integrations with Other Services
- Athena → Queries data via the Catalog (no ETL required).
- Redshift Spectrum → External schemas use the Glue Catalog for S3 tables.
- EMR (Hive/Spark) → Glue Catalog acts as a Hive Metastore.
- Glue Jobs → Source and target definitions are pulled from the Catalog.
- Lake Formation → Extends Glue Catalog with fine-grained data permissions.
6. Best Practices
✅ Use partitioning and bucketing to optimize queries.
✅ Enable schema versioning to track changes over time.
✅ Secure with Lake Formation access control (IAM + fine-grained).
✅ Automate metadata refresh with Glue Crawlers + triggers.
✅ For large catalogs, use resource tagging and naming standards.
7. High-Level Architecture
Flow:
S3 / Databases → Glue Crawler → Glue Data Catalog → (Athena, Redshift Spectrum, EMR, Glue Jobs, Lake Formation)