Amazon Redshift - Deep Dive.
Scope:
- Intro,
- The Concept: Redshift,
- Core Architecture,
- Data Storage & Distribution,
- Query Processing,
- Performance Features,
- Security,
- Modern Enhancements,
- Integrations,
- Best Practices,
- When to Use & when to avoid Redshit.
Intro:
- Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service from Amazon Web Services (AWS).
- Amazon Redshift is designed for analyzing large datasets using standard SQL and existing business intelligence (BI) tools.
- Amazon Redshift is Optimized for online analytical processing (OLAP).
- Amazon Redshift handles complex queries against large amounts of data efficiently.
- https://aws.amazon.com/redshift/
1. The Concept: Redshift
- Amazon Redshift is AWS’s fully managed, petabyte-scale cloud data warehouse designed for OLAP (Online Analytical Processing) workloads.
- Amazon Redshift allows twtech to run complex SQL queries across structured/semi-structured data with very high performance and integrates tightly with the AWS ecosystem.
2. Core Architecture
Redshift is built on MPP
(Massively Parallel Processing) principles.
- Leader Node
- Manages query parsing, optimization, and coordination.
- Does not store user data.
- Compute Nodes
- Store data in slices (distributed across nodes).
- Execute queries in parallel.
- Cluster
- Collection of nodes (1 Leader + multiple Compute nodes).
- Can scale up/down depending on workload.
3. Data Storage & Distribution
Redshift uses columnar storage
for efficiency.
- Columnar Format:
Improves compression & speeds up analytical queries.
- Distribution Styles:
Controls how data is spread across compute nodes.
- AUTO
(Redshift decides)
- KEY
(based on column)
- ALL
(replicates table to all nodes)
- EVEN
(round-robin distribution)
- Sort Keys:
- Compound Sort Key:
Optimized for range queries.
- Interleaved Sort Key: Balances across multiple columns.
4. Query Processing
- Uses PostgreSQL-compatible SQL (with
differences).
- Leader Node parses SQL, creates query execution plan.
- Pushes sub-queries to Compute Nodes.
- Results aggregated and sent back to client.
5. Performance Features
- Materialized Views
→ Pre-computed query results.
- Result Caching → Reuse results for repeated queries.
- Concurrency Scaling → Auto add transient clusters for high concurrency.
- Spectrum → Query directly from S3 without loading into Redshift.
- Automatic Vacuum & Analyze (for table health and stats).
- Workload Management (WLM) → Query prioritization & queues.
6. Security
- IAM integration
for authentication.
- VPC isolation and private clusters.
- Encryption: KMS or HSM integration.
- Row-Level Security (RLS) and column-level access.
- Audit Logging via CloudTrail + CloudWatch.
7. Modern Enhancements
- RA3 Nodes
(separate compute & storage, managed storage on S3).
- AQUA (Advanced Query Accelerator) → Hardware-accelerated cache layer.
- Data Sharing → Share live data across Redshift clusters without copies.
- Semi-structured Data support with SUPER data type + PartiQL queries (like JSON).
- ML Integration → Train/deploy ML models inside Redshift using SageMaker.
- Federated Query → Query data in RDS, Aurora, or other sources via Redshift.
8. Integrations
- ETL/ELT
→ Glue, EMR, Lambda, Kinesis.
- BI & Analytics → QuickSight, Tableau, Looker, Power BI.
- Data Lake Integration → Redshift Spectrum on S3.
- Streaming Ingest → Kinesis Data Firehose to Redshift.
- Automation → CloudFormation, CDK, Terraform.
9. Best Practices
- Choose RA3 nodes for separation of compute &
storage.
- Use DISTKEY + SORTKEY wisely for joins and filters.
- Leverage Spectrum for infrequent/large historical data.
- Keep tables ANALYZED & VACUUMED for query optimizer efficiency.
- Partition large datasets by time (with DATE columns).
- Monitor performance with Redshift Console + CloudWatch + system tables (STL, SVL, STV).
10. When to Use & when to avoid Redshit
✅ when to use Redshit:
- Large-scale analytical queries on structured/semi-structured
data.
- Scenarios needing tight AWS ecosystem integration.
- Multi-tenant, secure, enterprise-grade data warehousing.
❌ whn to Avoid Redshift:
- twech needs high-speed OLTP (Online Analytics Processing), instead use RDS/Aurora instead.
- twtech workloads are small-scale , instead use Athena or Aurora that are more cost-effective.
No comments:
Post a Comment