Sunday, September 7, 2025

How AWS Glue Converts Data into Parquet Format | Overview.

Scope:

  • Why Use Parquet,
  • AWS Glue Components in the Flow,
  • Step-by-Step Conversion to Parquet in Glue,
  • Using a Glue Job (PySpark or Scala),
  • Using Glue Studio (User Interface),
  • Optimizations for Parquet Conversion,
  • Common Use Cases,
  • End-to-End Data Flow (Visual).

1. Why Use Parquet

    • Columnar format: Efficient for analytics workloads (Athena, Redshift Spectrum, EMR, Spark).
    • Compression: Built-in compression reduces storage costs.
    • Schema evolution: Supports changes like adding new columns without rewriting the whole dataset.
    • Predicate pushdown: query engines can skip row groups that don't match filter predicates, so less data is scanned.

2. AWS Glue Components in the Flow

  1. Data Source
    • Raw data (CSV, JSON, Avro, ORC, logs) in Amazon S3, JDBC databases, Kinesis, etc.
  2. Glue Crawler (Optional)
    • Scans raw data and creates a table schema in the AWS Glue Data Catalog (a minimal boto3 sketch follows this list).
  3. Glue Job (ETL)
    • A Spark-based job that:
      • Reads data from the source (using Data Catalog or direct connection).
      • Transforms/cleanses if required.
      • Writes the result into Parquet format in Amazon S3.
  4. Glue Data Catalog (Metadata Store)
    • Stores schema of both source and transformed (Parquet) datasets.
    • Enables query engines like Athena, EMR, and Redshift Spectrum to access data.
  5. Workflow & Triggers
    • Automates pipelines: e.g., run crawler → transform to Parquet → update catalog.
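
To make the crawler step concrete, here is a minimal boto3 sketch that creates and starts a crawler over the raw data prefix. The crawler name, IAM role ARN, region, and S3 path are assumptions for illustration; only the twtech-raw_db database name comes from the job example later in this post.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Create a crawler that catalogs the raw data into twtech-raw_db.
glue.create_crawler(
    Name="twtech-raw-crawler",                               # hypothetical name
    Role="arn:aws:iam::123456789012:role/twtech-glue-role",  # hypothetical role
    DatabaseName="twtech-raw_db",
    Targets={"S3Targets": [{"Path": "s3://twtech-s3bucket/raw/"}]},  # hypothetical path
)

# Run it so the table schema appears in the Data Catalog.
glue.start_crawler(Name="twtech-raw-crawler")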

3. Step-by-Step Conversion to Parquet in Glue

(A) Using a Glue Job (PySpark or Scala)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue job
args = getResolvedOptions(sys.argv, ['twtechJOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['twtechJOB_NAME'], args)

# Load data from the Glue Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="twtech-raw_db",
    table_name="twtechinput_csv_table"
)

# Convert to a Spark DataFrame and write out as Parquet
output = datasource.toDF()
output.write.mode("overwrite").parquet("s3://twtech-s3bucket/twtechparquet-output/")

job.commit()
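
Once this script is saved as a Glue job, a run can be started programmatically. Below is a minimal boto3 sketch; the job name twtech-parquet-job is hypothetical, and the custom --twtechJOB_NAME argument matches the getResolvedOptions() call in the script above.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Start the job and pass the custom argument expected by getResolvedOptions().
run = glue.start_job_run(
    JobName="twtech-parquet-job",                         # hypothetical job name
    Arguments={"--twtechJOB_NAME": "twtech-parquet-job"}
)
print("Started run:", run["JobRunId"])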

  • DynamicFrame vs DataFrame (conversion sketch below)
    • DynamicFrame: Glue abstraction that handles semi-structured data.
    • DataFrame: standard Spark object that supports advanced transformations.
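
As a quick illustration of moving between the two, the sketch below reuses the datasource and glueContext objects from the script above and writes Parquet through the Glue sink instead of the Spark writer; the dropDuplicates() step and the S3 path are illustrative.

from awsglue.dynamicframe import DynamicFrame

# DynamicFrame -> DataFrame for standard Spark transformations
df = datasource.toDF().dropDuplicates()

# DataFrame -> DynamicFrame so Glue sinks and transforms can be used
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write Parquet through the Glue S3 sink
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://twtech-s3bucket/twtechparquet-output/"},
    format="parquet"
)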

(B) Using Glue Studio (User Interface)

    • Drag-and-drop flow: source → Apply Mapping → target.
    • Select Parquet as output format when defining the S3 target.

4. Optimizations for Parquet Conversion

  1. Partitioning
    • Organize output by keys like year=2025/month=09/ to improve query performance in Athena (see the combined sketch after this list).
  2. Compression
    • Parquet supports SNAPPY (default), GZIP, and LZO.
    • Example:

output.write.parquet("s3://twtech-s3bucket/output/", compression="snappy")

  3. Schema Evolution
    • Enable the "merge schema" option if your source data evolves.
    • In Spark:

spark.conf.set("spark.sql.parquet.mergeSchema", "true")

  4. Job Bookmarking
    • Glue can track processed files and only convert new ones (also shown in the sketch below).
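
Putting these optimizations together, here is a minimal sketch that builds on the job script above. The year and month partition columns are assumptions for illustration, and job bookmarks additionally require the job to be created with --job-bookmark-option set to job-bookmark-enable.

# Re-read through the Data Catalog with a transformation_ctx so job
# bookmarks can track which files have already been processed.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="twtech-raw_db",
    table_name="twtechinput_csv_table",
    transformation_ctx="datasource"
)

# Allow Parquet schema merging when the source schema evolves.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")

# Write partitioned, Snappy-compressed Parquet.
(datasource.toDF()
    .write
    .mode("overwrite")
    .partitionBy("year", "month")   # assumed partition columns
    .parquet("s3://twtech-s3bucket/twtechparquet-output/", compression="snappy"))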

5. Common Use Cases

    • Data Lake Ingestion: Raw CSV → Parquet → query with Athena (see the sketch below).
    • Data Warehouse Prep: Convert to Parquet for Redshift Spectrum queries.
    • Analytics Pipelines: Kinesis stream → Glue → Parquet for ML training.
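
For the data lake case, once a crawler has cataloged the Parquet output, it can be queried from Athena. A minimal boto3 sketch follows; the database, table, and results location are assumptions for illustration.

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption

# Run a simple query against the cataloged Parquet table (names are illustrative).
resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM twtechparquet_output_table",
    QueryExecutionContext={"Database": "twtech-analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://twtech-s3bucket/athena-results/"},
)
print("Query execution id:", resp["QueryExecutionId"])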

6. End-to-End Data Flow (Visual)

Source (CSV/JSON/Logs in S3) → [Crawler] → Glue Job (ETL) → Parquet in S3 → Data Catalog → Query (Athena/Redshift/EMR).
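
One way to automate this flow is a scheduled Glue trigger that starts the conversion job. The sketch below is an assumption-heavy illustration: the trigger and job names are hypothetical, and the cron expression simply runs the job daily at 02:00 UTC.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Create a scheduled trigger that starts the Parquet conversion job daily.
glue.create_trigger(
    Name="twtech-parquet-daily-trigger",          # hypothetical name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",                 # 02:00 UTC every day
    Actions=[{"JobName": "twtech-parquet-job"}],  # hypothetical job name
    StartOnCreation=True
)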



