Tuesday, September 2, 2025

Redshift Cluster | Deep Dive.

Scope:

  • Intro,
  • The Concept: Amazon Redshift,
  • Redshift Cluster Architecture (components of Redshift cluster),
  • Data Ingestion into Redshift (Common ingestion patterns),
  • Data Distribution & Storage,
  • Query Execution Model,
  • Scaling,
  • Security,
  • Monitoring,
  • Advanced Features,
  • Best Practices.

Intro:

    • Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service from Amazon Web Services (AWS).
    • A Redshift cluster consists of a set of compute nodes that store data and perform query operations, coordinated by a leader node.

 1. The Concept: Amazon Redshift

    • Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for OLAP (Online Analytical Processing) workloads.
    • It’s built on MPP (Massively Parallel Processing) and columnar storage, optimized for analytics, reporting, and data warehousing.

 2. Redshift Cluster Architecture (components of Redshift cluster):

  • Leader Node
    • Manages client connections & query parsing.
    • Distributes work to compute nodes.
    • Aggregates results & returns to the client.
    • Handles only metadata and coordination; it stores no user table data.
  • Compute Nodes
    • Store data in slices (distributed across nodes).
    • Execute queries in parallel.
    • Data is stored in columnar format.
    • Each node is divided into multiple slices (the number depends on the node type and size).
  • Node Types
    • Dense Compute (DC2): Fast SSD-based, optimized for performance.
    • Dense Storage (DS2, legacy): HDD-based, high capacity but slower.
    • RA3 (Recommended): Separates compute from managed storage (backed by S3), so the two scale independently.


3. Data Ingestion into Redshift (Common ingestion patterns):

    • Amazon S3 → COPY command: the most common pattern, used for bulk loads.
    • Kinesis Data Firehose → Redshift: near-real-time streaming.
    • AWS Glue / EMR (Elastic MapReduce) → Redshift: ETL pipelines.
    • Redshift Spectrum: query external data in S3 without loading it (Federated Query covers operational databases such as RDS and Aurora).
    • AWS DMS (Database Migration Service): migrate databases.
    • JDBC (Java Database Connectivity) / ODBC (Open Database Connectivity) clients: manual inserts/updates.
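
The S3 COPY pattern above can be sketched as a small helper that assembles the statement text. This is only an illustration: the function name, table name, bucket, and IAM role ARN are made-up placeholders, and the helper builds the SQL string without connecting to a cluster.

```python
def build_copy_statement(table, s3_uri, iam_role_arn, file_format="CSV", gzip=True):
    """Return a Redshift COPY statement as text (nothing is executed here)."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    parts = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS {file_format}",
    ]
    if gzip:
        # GZIP tells COPY that the input files are gzip-compressed.
        parts.append("GZIP")
    return "\n".join(parts) + ";"


# Hypothetical table, bucket, and role, for illustration only.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/2025/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

The resulting text would then be run through any SQL client connected to the cluster. Pointing COPY at a prefix that holds several files lets the compute nodes load them in parallel.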

4. Data Distribution & Storage

    • Columnar storage: Optimized for compression & analytics.
    • Distribution Styles: Determines how data is spread across compute nodes.
      • AUTO: Redshift chooses the style.
      • KEY: Rows are placed by the value of one column (good for joins).
      • ALL: Full table copy on all nodes (good for small lookup tables).
      • EVEN: Round-robin distribution.
    • Sort Keys: Improve query performance by defining ordering on disk.
    • Compression Encoding: Saves space & improves I/O.
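
To make the distribution styles concrete, here is a toy simulation of how rows might land on nodes under EVEN, KEY, and ALL. It is a simplified model, not Redshift's actual placement logic: the `distribute` function is hypothetical, and CRC32 stands in for whatever hash Redshift uses internally.

```python
import zlib


def distribute(rows, style, num_nodes, key=None):
    """Assign rows to nodes under a simplified model of a distribution style."""
    nodes = [[] for _ in range(num_nodes)]
    for index, row in enumerate(rows):
        if style == "EVEN":
            # Round-robin: row i goes to node i mod N.
            nodes[index % num_nodes].append(row)
        elif style == "KEY":
            # Equal key values hash to the same node, so joins on the key stay local.
            digest = zlib.crc32(str(row[key]).encode("utf-8"))
            nodes[digest % num_nodes].append(row)
        elif style == "ALL":
            # Every node receives a full copy of the table.
            for node in nodes:
                node.append(row)
        else:
            raise ValueError(f"unsupported style: {style}")
    return nodes


rows = [{"order_id": i, "customer_id": i % 5} for i in range(20)]
even_nodes = distribute(rows, "EVEN", num_nodes=4)
key_nodes = distribute(rows, "KEY", num_nodes=4, key="customer_id")
all_nodes = distribute(rows, "ALL", num_nodes=4)

for name, nodes in [("EVEN", even_nodes), ("KEY", key_nodes), ("ALL", all_nodes)]:
    print(name, [len(node) for node in nodes])
```

With KEY, a low-cardinality or skewed column can leave some nodes holding far more rows than others, which is why the choice of distribution column matters.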

5. Query Execution Model

    • Client sends query → Leader Node parses it and generates a query plan.
    • The query plan is distributed to the Compute Nodes.
    • Each node executes its part of the plan on its slices (parallel execution).
    • Intermediate results are sent back to the Leader Node, which aggregates them and returns the final result to the client.
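
The flow above is a scatter-gather pattern, sketched here for an average. This is a toy model under stated assumptions: in-memory lists stand in for slices, threads stand in for compute nodes, and the function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_partial(values):
    """Work done on one slice: a partial (sum, count) over its local rows."""
    return (sum(values), len(values))


def leader_average(slices):
    """Leader role: fan the work out, then merge the partial aggregates."""
    if not slices:
        return None
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(slice_partial, slices))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else None


# Three hypothetical slices, each holding part of one column.
slices = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
average = leader_average(slices)
print(average)  # 35.0
```

Each slice returns a sum and a count rather than its own average, because averaging the per-slice averages would give the wrong answer when slices hold different numbers of rows.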

 6. Scaling

    • Elastic Resize: Add/remove nodes quickly (minutes).
    • Classic Resize: Migration to a new cluster (hours).
    • Concurrency Scaling: Automatically adds temporary capacity when many queries run concurrently.
    • RA3 with Managed Storage: Storage grows automatically.

 7. Security

    • VPC isolation
    • IAM integration (fine-grained access)
    • KMS encryption (at rest)
    • SSL/TLS (in transit)
    • Audit logging to CloudWatch and S3

 8. Monitoring

    • Amazon CloudWatch metrics
    • Query Monitoring Rules (QMRs)
    • Workload Management (WLM): Prioritize queries by queue.
    • Query and load performance data in the Redshift console and system tables

 9. Advanced Features

    • Redshift Spectrum: Query data directly in S3 without loading it.
    • Materialized Views: Precomputed query results.
    • Data Sharing: Share data across clusters/accounts.
    • Cross-Region Snapshots: Backup & disaster recovery.
    • Integration with QuickSight, Tableau, Power BI.

 10. Best Practices

    • Use RA3 nodes for modern workloads.
    • Use COPY with compression for bulk loads.
    • Optimize sort keys & distribution styles.
    • Use WLM (Workload Management) queues for workload isolation.
    • Regularly VACUUM & ANALYZE to maintain performance.
    • Use Spectrum for infrequently queried S3 data.
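
As a rough illustration of the sort key, distribution style, and VACUUM/ANALYZE practices above, the sketch below generates the SQL text for a table definition and its routine maintenance. The helper names, the table, and its columns are hypothetical, and nothing is executed against a cluster.

```python
def build_table_ddl(table, columns, distkey, sortkeys):
    """Return CREATE TABLE text with KEY distribution and a compound sort key."""
    names = [name for name, _ in columns]
    for column in [distkey, *sortkeys]:
        if column not in names:
            raise ValueError(f"unknown column: {column}")
    column_sql = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n{column_sql}\n)\n"
        f"DISTSTYLE KEY\n"
        f"DISTKEY ({distkey})\n"
        f"COMPOUND SORTKEY ({', '.join(sortkeys)});"
    )


def maintenance_statements(table):
    """VACUUM re-sorts rows and reclaims space; ANALYZE refreshes planner statistics."""
    return [f"VACUUM {table};", f"ANALYZE {table};"]


# Hypothetical schema: distribute on the join column, sort on the filter column.
ddl = build_table_ddl(
    table="sales",
    columns=[
        ("order_id", "BIGINT"),
        ("customer_id", "BIGINT"),
        ("sold_at", "TIMESTAMP"),
        ("amount", "DECIMAL(12,2)"),
    ],
    distkey="customer_id",
    sortkeys=["sold_at"],
)
maintenance = maintenance_statements("sales")
print(ddl)
print("\n".join(maintenance))
```

Picking the most common join column as the distribution key and the most common range-filter column as the sort key is a typical starting point, but the right choice depends on the actual query mix.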
