Tuesday, September 2, 2025

Amazon Redshift | Deep Dive.

Amazon Redshift - Deep Dive. 

Scope:

  • Intro,
  • The Concept: Redshift,
  • Core Architecture,
  • Data Storage & Distribution,
  • Query Processing,
  • Performance Features,
  • Security,
  • Modern Enhancements,
  • Integrations,
  • Best Practices,
  • When to Use & When to Avoid Redshift.
Intro:
    • Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service from Amazon Web Services (AWS).
    • Amazon Redshift is designed for analyzing large datasets using standard SQL and existing business intelligence (BI) tools. 
    • Amazon Redshift is optimized for online analytical processing (OLAP).
    • Amazon Redshift handles complex queries against large amounts of data efficiently.
    • https://aws.amazon.com/redshift/

1. The Concept: Redshift

    • Amazon Redshift is AWS’s fully managed, petabyte-scale cloud data warehouse designed for OLAP (Online Analytical Processing) workloads.
    • Amazon Redshift allows twtech to run complex SQL queries across structured/semi-structured data with very high performance and integrates tightly with the AWS ecosystem.

2. Core Architecture

Redshift is built on MPP (Massively Parallel Processing) principles.

  • Leader Node
    • Manages query parsing, optimization, and coordination.
    • Does not store user data.
  • Compute Nodes
    • Store data in slices (distributed across nodes).
    • Execute queries in parallel.
  • Cluster
    • Collection of nodes (1 Leader + multiple Compute nodes).
    • Can scale up/down depending on workload.
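
To see this layout in a live cluster, the STV_SLICES system view maps each data slice to its compute node (a minimal sketch; exact columns can vary by node type):

    -- List each data slice and the compute node that owns it
    SELECT node, slice
    FROM stv_slices
    ORDER BY node, slice;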

3. Data Storage & Distribution

Redshift uses columnar storage for efficiency.

  • Columnar Format: Improves compression & speeds up analytical queries.
  • Distribution Styles: Control how data is spread across compute nodes (see the DDL sketch after this list).
    • AUTO (Redshift decides)
    • KEY (based on column)
    • ALL (replicates table to all nodes)
    • EVEN (round-robin distribution)
  • Sort Keys:
    • Compound Sort Key: Optimized for range queries.
    • Interleaved Sort Key: Balances across multiple columns.
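
A minimal DDL sketch of these options (the sales and region tables and their columns are hypothetical):

    -- Fact table: distribute on the join key, sort by date for range filters
    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id INT,
        region_id   INT,
        sale_date   DATE,
        amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    COMPOUND SORTKEY (sale_date, customer_id);

    -- Small dimension table: replicate to every node to avoid shuffling on joins
    CREATE TABLE region (
        region_id   INT,
        region_name VARCHAR(64)
    )
    DISTSTYLE ALL;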

4. Query Processing

    • Uses PostgreSQL-compatible SQL (with some differences).
    • The Leader Node parses the SQL and builds the query execution plan.
    • Sub-queries are pushed down to the Compute Nodes, which execute them in parallel.
    • Results are aggregated and returned to the client (see the EXPLAIN example below).
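
To inspect the plan the Leader Node produces, run EXPLAIN on any query (a sketch using the hypothetical tables from section 3):

    -- Show the execution plan, including join strategy and data-redistribution steps
    EXPLAIN
    SELECT r.region_name, SUM(s.amount) AS revenue
    FROM sales s
    JOIN region r ON r.region_id = s.region_id
    GROUP BY r.region_name;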

5. Performance Features

    • Materialized Views: Pre-computed query results (see the sketch after this list).
    • Result Caching: Reuses results for repeated queries.
    • Concurrency Scaling: Automatically adds transient clusters during high concurrency.
    • Spectrum: Query data directly in S3 without loading it into Redshift.
    • Automatic Vacuum & Analyze (for table health and stats).
    • Workload Management (WLM): Query prioritization & queues.
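
A sketch of a materialized view over the hypothetical sales table, plus a manual refresh (Redshift can also auto-refresh eligible views):

    -- Pre-compute daily revenue so repeated dashboard queries read a stored result
    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT sale_date, SUM(amount) AS revenue
    FROM sales
    GROUP BY sale_date;

    -- Refresh after new data is loaded
    REFRESH MATERIALIZED VIEW daily_revenue;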

6. Security

    • IAM integration for authentication.
    • VPC isolation and private clusters.
    • Encryption: KMS or HSM integration.
    • Row-Level Security (RLS) and column-level access.
    • Audit Logging via CloudTrail + CloudWatch.
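
A sketch of column-level and row-level access control on the hypothetical sales table (the user, role, and policy names are made up, and the RLS statements assume Redshift's CREATE RLS POLICY syntax):

    -- Column-level access: this user may read only non-sensitive columns
    GRANT SELECT (sale_id, sale_date, amount) ON sales TO analyst_user;

    -- Row-level security: restrict a role to one region's rows
    CREATE RLS POLICY policy_eu_only
    WITH (region_id INT)
    USING (region_id = 2);

    ATTACH RLS POLICY policy_eu_only ON sales TO ROLE eu_analyst;
    ALTER TABLE sales ROW LEVEL SECURITY ON;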

7. Modern Enhancements

    • RA3 Nodes: Separate compute & storage, with managed storage backed by S3.
    • AQUA (Advanced Query Accelerator): Hardware-accelerated cache layer.
    • Data Sharing: Share live data across Redshift clusters without copies.
    • Semi-structured Data: Support via the SUPER data type + PartiQL queries for JSON-like data (see the sketch after this list).
    • ML Integration: Train/deploy ML models inside Redshift using SageMaker.
    • Federated Query: Query data in RDS, Aurora, or other sources from Redshift.
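
A sketch of the SUPER data type with PartiQL navigation (the events table and JSON shape are hypothetical):

    -- A SUPER column holds semi-structured JSON ingested with JSON_PARSE
    CREATE TABLE events (
        event_id BIGINT,
        payload  SUPER
    );

    INSERT INTO events VALUES
        (1, JSON_PARSE('{"type": "purchase", "user": {"id": 42}, "items": [{"sku": "A-100"}]}'));

    -- PartiQL dot/bracket notation navigates into the SUPER value
    SELECT e.payload."user"."id"  AS user_id,
           e.payload.items[0].sku AS first_sku
    FROM events AS e
    WHERE e.payload."type" = 'purchase';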

8. Integrations

    • ETL/ELT: Glue, EMR, Lambda, Kinesis.
    • BI & Analytics: QuickSight, Tableau, Looker, Power BI.
    • Data Lake Integration: Redshift Spectrum on S3 (see the sketch after this list).
    • Streaming Ingest: Kinesis Data Firehose to Redshift.
    • Automation: CloudFormation, CDK, Terraform.
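
A sketch of the Spectrum/data-lake path (the Glue database, IAM role ARN, and external table are placeholders):

    -- Map a Glue Data Catalog database into Redshift as an external schema
    CREATE EXTERNAL SCHEMA spectrum_lake
    FROM DATA CATALOG
    DATABASE 'lake_db'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';

    -- Query files in S3 without loading them into the cluster
    SELECT h.sale_date, SUM(h.amount) AS revenue
    FROM spectrum_lake.historical_sales AS h
    GROUP BY h.sale_date;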

9. Best Practices

    • Choose RA3 nodes for separation of compute & storage.
    • Use DISTKEY + SORTKEY wisely for joins and filters.
    • Leverage Spectrum for infrequent/large historical data.
    • Keep tables ANALYZED & VACUUMED for query optimizer efficiency.
    • Partition large datasets by time (with DATE columns).
    • Monitor performance with Redshift Console + CloudWatch + system tables (STL, SVL, STV).
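
A sketch of routine maintenance plus a quick look at recent queries through a system view (table names follow the earlier examples; SVL_QLOG columns can vary by release):

    -- Reclaim space, re-sort rows, and refresh optimizer statistics
    VACUUM sales;
    ANALYZE sales;

    -- Recent queries and their runtimes
    SELECT query, starttime, elapsed, substring
    FROM svl_qlog
    ORDER BY starttime DESC
    LIMIT 10;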

10. When to Use & When to Avoid Redshift

When to use Redshift:

    • Large-scale analytical queries on structured/semi-structured data.
    • Scenarios needing tight AWS ecosystem integration.
    • Multi-tenant, secure, enterprise-grade data warehousing.

When to avoid Redshift:

    • twtech needs high-speed OLTP (Online Transaction Processing); use RDS/Aurora instead.
    • twtech workloads are small-scale; Athena or Aurora may be more cost-effective.

