Think - with -Tech: Redshift Cluster | Deep Dive.

Tuesday, September 2, 2025

Redshift Cluster | Deep Dive.

Scope:

Intro,
The Concept: Amazon Redshift,
Redshift Cluster Architecture (components of Redshift cluster),
Data Ingestion into Redshift (Common ingestion patterns),
Data Distribution & Storage,
Query Execution Model,
Scaling,
Security,
Monitoring,
Advanced Features,
Best Practices.

Intro:

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service from Amazon Web Services (AWS), and the cluster is its core unit of deployment.
A Redshift cluster consists of a set of compute nodes that store data and perform query operations, coordinated by a leader node.

1. The Concept: Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for OLAP (Online Analytical Processing) workloads.
It’s built on MPP (Massively Parallel Processing) and columnar storage, optimized for analytics, reporting, and data warehousing.

2. Redshift Cluster Architecture (components of Redshift cluster):

Leader Node

Manages client connections & query parsing.
Distributes work to compute nodes.
Aggregates results & returns to the client.
Handles metadata and coordination only; it stores no user table data.

Compute Nodes

Store data in slices (distributed across nodes).
Execute queries in parallel.
Data is stored in columnar format.
Each node is divided into multiple slices; the number of slices depends on the node type and size.

Cluster Types

Dense Compute (DC2) → Fast SSD-based, optimized for performance.
Dense Storage (DS2 – legacy) → HDD-based, high capacity but slower.
RA3 (Recommended) → Separates compute from managed storage (S3-based), so each scales independently.

3. Data Ingestion into Redshift (Common ingestion patterns):

Amazon S3 → COPY command: most common, bulk loads.
Amazon Data Firehose (formerly Kinesis Data Firehose) → Redshift: near-real-time streaming delivery.
AWS Glue / Amazon EMR → Redshift: ETL pipelines.
Redshift Spectrum: query external data in S3 without loading it. (Federated Query is a separate feature for querying live data in operational databases such as Amazon RDS and Aurora.)
AWS DMS (Database Migration Service): migrate databases.
JDBC (Java Database Connectivity) / ODBC (Open Database Connectivity) clients: manual inserts/updates.
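
As a rough illustration of the S3 → COPY pattern, the sketch below assembles a COPY statement for gzip-compressed CSV files. The helper name, table, bucket, and role ARN are all hypothetical placeholders, and the function does no escaping, so it is only suitable for trusted, hard-coded inputs.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, gzip=True, ignore_header=1):
    """Return a Redshift COPY statement for CSV files under an S3 prefix.

    Hypothetical helper for illustration; values are interpolated as-is,
    so pass only trusted strings.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if gzip:
        # Tell COPY the source files are gzip-compressed.
        parts.append("GZIP")
    if ignore_header:
        # Skip header rows at the top of each file.
        parts.append(f"IGNOREHEADER {ignore_header}")
    return "\n".join(parts) + ";"


# Placeholder names: replace with a real table, bucket, and role.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/2025/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The resulting statement would then be run through any SQL client connected to the cluster; COPY loads all files under the prefix in parallel across the compute nodes.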

4. Data Distribution & Storage

Columnar storage: Optimized for compression & analytics.
Distribution Styles: Determines how data is spread across compute nodes.

AUTO → Redshift chooses the distribution style for you.
KEY → Based on column value (good for joins).
ALL → Full table copy on all nodes (good for small lookup tables).
EVEN → Round-robin distribution.
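
To make the difference between these styles concrete, here is a small simulation. It is only a sketch: the `distribute` function and its buckets are hypothetical, a bucket stands in for a slice (EVEN/KEY) or a node (ALL), and `zlib.crc32` is a stand-in hash, not the hash function Redshift actually uses internally.

```python
import zlib


def distribute(rows, style, num_buckets, key=None):
    """Assign rows to buckets according to a distribution style.

    Returns {bucket_id: [rows]}. Illustrative model only.
    """
    buckets = {i: [] for i in range(num_buckets)}
    for n, row in enumerate(rows):
        if style == "EVEN":
            # Round-robin: row n goes to bucket n mod num_buckets.
            buckets[n % num_buckets].append(row)
        elif style == "KEY":
            # Same key value always hashes to the same bucket.
            h = zlib.crc32(str(row[key]).encode("utf-8"))
            buckets[h % num_buckets].append(row)
        elif style == "ALL":
            # Every bucket receives a full copy of the table.
            for bucket_rows in buckets.values():
                bucket_rows.append(row)
        else:
            raise ValueError(f"unsupported style: {style}")
    return buckets


rows = [{"order_id": i, "customer_id": i % 5} for i in range(20)]

even = distribute(rows, "EVEN", 4)
by_key = distribute(rows, "KEY", 4, key="customer_id")
replicated = distribute(rows, "ALL", 4)

print({b: len(r) for b, r in even.items()})
print({b: sorted({r["customer_id"] for r in rs}) for b, rs in by_key.items()})
```

With KEY, all rows for a given `customer_id` land in one bucket, which is why joins on that column can run without moving data between nodes. The trade-off is that a skewed key can leave some buckets much fuller than others, which EVEN avoids.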

Sort Keys: Improve query performance by defining the order in which rows are stored on disk, so range-restricted scans can skip blocks.
Compression Encoding: Saves space & improves I/O.
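
Sort keys help because Redshift keeps min/max metadata per storage block (zone maps) and can skip blocks whose range cannot match a filter. The sketch below is a simplified model of that idea, with hypothetical function names and a toy block size of 100 values rather than Redshift's real block layout.

```python
import random


def build_zone_map(values, block_size):
    """Split values into fixed-size blocks and record (min, max) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the filter [low, high]."""
    return sum(1 for lo, hi in zone_map if hi >= low and lo <= high)


random.seed(0)
# Toy column: day-of-year for 10,000 rows, in arrival (unsorted) order.
unsorted_days = [random.randint(1, 365) for _ in range(10_000)]
sorted_days = sorted(unsorted_days)

# Filter: WHERE day BETWEEN 100 AND 110
unsorted_scan = blocks_to_scan(build_zone_map(unsorted_days, 100), 100, 110)
sorted_scan = blocks_to_scan(build_zone_map(sorted_days, 100), 100, 110)

print("blocks scanned, unsorted:", unsorted_scan)
print("blocks scanned, sorted:  ", sorted_scan)
```

In the unsorted layout nearly every block spans the whole value range, so almost nothing can be skipped; once the column is sorted, only a handful of adjacent blocks overlap the filter.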

5. Query Execution Model

Client sends query → Leader Node parses → generates query plan.
Query plan distributed to Compute Nodes.
Each node executes on its slices (parallel execution).
Compute Nodes return intermediate results to the Leader Node, which aggregates them and returns the final result to the Client.
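
The flow above is essentially a scatter-gather pattern. The following sketch mimics it for an AVG query, using threads as a stand-in for slices; the function names and data are hypothetical, and real Redshift compiles and ships query segments rather than Python callables.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_partial(rows):
    """Work done on one slice: a partial SUM and COUNT over its local rows."""
    return sum(rows), len(rows)


def leader_average(slices):
    """Leader-side step: fan work out to slices, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(slice_partial, slices))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    # Merge SUM and COUNT, not per-slice averages, so uneven slices
    # still give the correct overall result.
    return total / count


# Three slices holding different numbers of rows.
slices = [[10, 20, 30], [40, 50], [60, 70, 80, 90]]
average = leader_average(slices)
print(average)  # 50.0
```

Note that each slice returns a sum and a count instead of its own average: averaging the per-slice averages would give a wrong answer whenever slices hold different numbers of rows.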

6. Scaling

Elastic Resize → Add/remove nodes quickly (minutes).
Classic Resize → Migration to a new cluster (hours).
Concurrency Scaling → Automatically adds capacity for high workloads.
RA3 with Managed Storage → Storage automatically grows.

7. Security

VPC isolation
IAM integration (fine-grained access)
KMS encryption (at rest)
SSL/TLS (in transit)
Audit logging → CloudWatch, S3

8. Monitoring

Amazon CloudWatch metrics
Query Monitoring Rules (QMRs)
Workload Management (WLM) → Prioritize queries by queue.
Query and performance monitoring in the Redshift console, plus system tables and views (STL, SVL, SYS)

9. Advanced Features

Redshift Spectrum → Query directly on S3 without loading.
Materialized Views → Precomputed query results.
Data Sharing → Share data across clusters/accounts.
Cross-Region Snapshots → Backup & disaster recovery.
Integration with QuickSight, Tableau, Power BI.

10. Best Practices

Use RA3 nodes for modern workloads.
Use COPY with compression for bulk loads.
Optimize sort keys & distribution styles.
Use WLM (Workload Management) queues for workload isolation.
Regularly VACUUM & ANALYZE to maintain performance.
Use Spectrum for infrequently queried S3 data.
