How AWS Glue Converts Data into Parquet Format - Overview.
Scope:
- Why Use Parquet,
- AWS Glue Components in the Flow,
- Step-by-Step Conversion to Parquet in Glue:
  - Using a Glue Job (PySpark or Scala),
  - Using Glue Studio (User Interface),
- Optimizations for Parquet Conversion,
- Common Use Cases,
- End-to-End Data Flow (Visual).
1. Why Use Parquet
- Columnar format: Efficient for analytics workloads (Athena, Redshift Spectrum, EMR, Spark).
- Compression: Built-in compression reduces storage costs.
- Schema evolution: Supports changes like adding new columns without rewriting the whole dataset.
- Predicate pushdown: Column statistics let query engines skip irrelevant row groups, so queries read only the data they need.
2. AWS Glue Components in the Flow
- Data Source
  - Raw data (CSV, JSON, Avro, ORC, logs) in Amazon S3, JDBC databases, Kinesis, etc.
- Glue Crawler (Optional)
  - Scans raw data and creates a table schema in the AWS Glue Data Catalog.
- Glue Job (ETL)
  - A Spark-based job that:
    - Reads data from the source (using the Data Catalog or a direct connection).
    - Transforms/cleanses the data if required.
    - Writes the result in Parquet format to Amazon S3.
- Glue Data Catalog (Metadata Store)
  - Stores the schema of both the source and the transformed (Parquet) datasets.
  - Enables query engines like Athena, EMR, and Redshift Spectrum to access the data.
- Workflow & Triggers
  - Automates pipelines: e.g., run crawler → transform to Parquet → update catalog.
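The crawler and trigger pieces of this flow can also be created programmatically. Below is a minimal boto3 sketch, not taken from this post: the crawler name, IAM role ARN, and S3 path are placeholder assumptions.

import boto3

glue = boto3.client("glue")  # assumes AWS credentials and region are already configured

# Create a crawler that catalogs the raw files into the raw database (placeholder name, ARN, and path)
glue.create_crawler(
    Name="twtech-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="twtech-raw_db",
    Targets={"S3Targets": [{"Path": "s3://twtech-s3bucket/raw/"}]},
)

# Run the crawler so the inferred table schema lands in the Glue Data Catalog
glue.start_crawler(Name="twtech-raw-crawler")

A Glue trigger or workflow can then chain the crawler run to the ETL job that writes Parquet.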
3. Step-by-Step Conversion to Parquet in Glue
(A) Using a Glue Job (PySpark or Scala)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Glue
args = getResolvedOptions(sys.argv, ['twtechJOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['twtechJOB_NAME'], args)

# Load data from the catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="twtech-raw_db",
    table_name="twtechinput_csv_table"
)

# Convert the DynamicFrame to a DataFrame and write it as Parquet
output = datasource.toDF()
output.write.mode("overwrite").parquet("s3://twtech-s3bucket/twtechparquet-output/")

job.commit()
- DynamicFrame vs DataFrame
  - DynamicFrame (Glue abstraction, handles semi-structured data).
  - DataFrame (standard Spark object, supports advanced transformations).
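A minimal sketch of moving between the two, and of writing Parquet directly from a DynamicFrame; datasource and glueContext come from the job above, and the output path is a placeholder:

from awsglue.dynamicframe import DynamicFrame

# DynamicFrame -> DataFrame for advanced Spark transformations
df = datasource.toDF()

# DataFrame -> DynamicFrame to use Glue's writers and transforms
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write Parquet without leaving the DynamicFrame API
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://twtech-s3bucket/twtechparquet-output/"},
    format="parquet"
)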
(B) Using Glue Studio (User Interface)
- Drag-and-drop source → apply mapping → target.
- Select Parquet as output format when defining the S3 target.
4. Optimizations for Parquet Conversion
- Partitioning
  - Organize by keys like year=2025/month=09/ → improves query performance in Athena.
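  - Example (a minimal sketch; assumes the DataFrame has year and month columns, and the bucket path is a placeholder):
    output.write.mode("overwrite").partitionBy("year", "month").parquet("s3://twtech-s3bucket/twtechparquet-output/")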
- Compression
  - Parquet supports SNAPPY (default), GZIP, LZO.
  - Example:
    output.write.parquet("s3://twtech-s3bucket/output/", compression="snappy")
- Schema Evolution
  - Enable the “merge schema” option if your source data evolves.
  - In Spark:
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
- Job Bookmarking
  - Glue can track processed files and only convert new ones.
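  - Example (a sketch; bookmarks must also be enabled on the job via the --job-bookmark-option job-bookmark-enable parameter, and the catalog names reuse the example above). Each source needs a transformation_ctx so Glue can track its state:
    datasource = glueContext.create_dynamic_frame.from_catalog(
        database="twtech-raw_db",
        table_name="twtechinput_csv_table",
        transformation_ctx="datasource"
    )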
5. Common Use Cases
- Data Lake Ingestion: Raw CSV → Parquet → Query with Athena.
- Data Warehouse Prep: Convert to Parquet for Redshift Spectrum queries.
- Analytics Pipelines: Kinesis stream → Glue → Parquet for ML training.
6. End-to-End Data Flow (Visual)
Source (CSV/JSON/Logs in S3) → [Crawler] → Glue Job (ETL) → Parquet in S3 → Data Catalog → Query (Athena/Redshift/EMR).