Redshift Cluster - Deep Dive.
Scope:
- Intro,
- The Concept: Amazon Redshift,
- Redshift Cluster Architecture (components of Redshift cluster),
- Data Ingestion into Redshift (Common ingestion patterns),
- Data Distribution & Storage,
- Query Execution Model,
- Scaling,
- Security,
- Monitoring,
- Advanced Features,
- Best Practices.
Intro:
- Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service from Amazon Web Services (AWS).
- A Redshift cluster consists of a set of compute nodes that store data and run queries, coordinated by a leader node.
1. The Concept: Amazon Redshift
- Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for OLAP (Online Analytical Processing) workloads.
- It’s built on MPP (Massively Parallel Processing) and columnar storage, optimized for analytics, reporting, and data warehousing.
2. Redshift Cluster Architecture (components of Redshift cluster):
- Leader Node
- Manages client connections & query parsing.
- Distributes work to compute nodes.
- Aggregates results & returns to the client.
- Handles only metadata and coordination; it stores no user table data.
- Compute Nodes
- Store data in slices (distributed across nodes).
- Execute queries in parallel.
- Data is stored in columnar format.
- Each node is divided into multiple slices (the count depends on the node type and size).
- Node Types
- Dense Compute (DC2) → Fast SSD-based, optimized for performance.
- Dense Storage (DS2, legacy) → HDD-based, high capacity but slower.
- RA3 (Recommended) → Separates compute from managed storage (backed by S3), so each scales independently.
3. Data Ingestion into Redshift (Common ingestion patterns):
- Amazon S3 → COPY command: most common, bulk loads.
- Kinesis Data Firehose → Redshift: near-real-time streaming (Firehose stages the data in S3, then issues a COPY).
- AWS Glue / EMR (Elastic MapReduce) → Redshift: ETL pipelines.
- Redshift Spectrum: query external data in S3 in place. (Federated Query is a separate feature for querying operational databases such as Amazon RDS and Aurora.)
- AWS DMS (Database Migration Service): migrate databases.
- JDBC (Java Database Connectivity) / ODBC (Open Database Connectivity) clients: manual inserts/updates.
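As a rough illustration of the S3 → COPY pattern above, the sketch below assembles a COPY statement as a string. The table name, S3 prefix, and IAM role ARN are placeholders, and `build_copy_statement` is a hypothetical helper written for this post, not part of any AWS library.

```python
def build_copy_statement(table, s3_path, iam_role_arn, gzip=True):
    """Return a Redshift COPY statement for CSV files under an S3 prefix.

    Hypothetical helper for illustration; values are interpolated directly,
    so only pass trusted identifiers.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if gzip:
        parts.append("GZIP")  # input files are gzip-compressed
    return "\n".join(parts) + ";"


statement = build_copy_statement(
    table="sales",                              # placeholder table
    s3_path="s3://example-bucket/sales/2024/",  # placeholder prefix
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Pointing COPY at a prefix that holds several files, rather than one large file, lets the load be spread across slices in parallel.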
4. Data Distribution & Storage
- Columnar storage: Optimized for compression & analytics.
- Distribution Styles: Determines how data is spread across compute nodes.
- AUTO → Redshift chooses the distribution style and adjusts it as the table grows.
- KEY → Based on column value (good for joins).
- ALL → Full table copy on all nodes (good for small lookup tables).
- EVEN → Round-robin distribution.
- Sort Keys: Improve query performance by defining the order of rows on disk, so range filters can skip blocks that cannot match.
- Compression Encoding: Saves space & improves I/O.
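To make the distribution styles more concrete, here is a toy simulation of EVEN, KEY, and ALL placement. It is only a sketch: the slice count, node count, and sample rows are made up, and `zlib.crc32` stands in for Redshift's internal hash function, which this post does not describe.

```python
import zlib
from collections import Counter
from itertools import cycle

NUM_SLICES = 4  # assumed slice count for the toy cluster


def distribute_even(rows):
    """EVEN: assign rows to slices round-robin."""
    slices = cycle(range(NUM_SLICES))
    return [(next(slices), row) for row in rows]


def distribute_key(rows, key):
    """KEY: rows with the same key value always land on the same slice."""
    # crc32 is a stand-in hash chosen for this sketch only.
    return [(zlib.crc32(str(row[key]).encode()) % NUM_SLICES, row) for row in rows]


def distribute_all(rows, num_nodes=2):
    """ALL: every node receives a full copy of the table."""
    return {node: list(rows) for node in range(num_nodes)}


orders = [{"order_id": i, "customer_id": i % 3} for i in range(12)]

even = distribute_even(orders)
keyed = distribute_key(orders, "customer_id")
copies = distribute_all(orders)

print("EVEN rows per slice:", Counter(s for s, _ in even))
print("KEY slice per customer:", {r["customer_id"]: s for s, r in keyed})
print("ALL rows per node:", {n: len(rs) for n, rs in copies.items()})
```

Because KEY places matching values together, two tables distributed on the same join column can be joined without moving rows between nodes. The trade-off is possible skew when a few key values dominate.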
5. Query Execution Model
- Client sends query → Leader Node parses it → generates a query plan.
- The query plan is distributed to the Compute Nodes.
- Each node executes its part on its slices (parallel execution).
- Compute Nodes send intermediate results to the Leader Node, which aggregates them and returns the final result to the client.
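The flow above is a scatter-gather pattern, which can be sketched in a few lines. This is a toy model, not Redshift's engine: the `SLICES` data, `compute_partial`, and `leader_execute` are invented for illustration, and threads stand in for compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data: each inner list is the rows held by one slice (amounts only).
SLICES = [
    [10, 20, 30],
    [5, 15],
    [40],
    [25, 35, 45, 55],
]


def compute_partial(slice_rows):
    """Work done on one slice: a partial SUM and COUNT over local rows."""
    return sum(slice_rows), len(slice_rows)


def leader_execute(slices):
    """Leader role: fan the plan out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(compute_partial, slices))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return {"sum": total, "count": count, "avg": total / count}


result = leader_execute(SLICES)
print(result)
```

Note that the average is computed from partial sums and counts rather than by averaging per-slice averages, which would give a wrong answer when slices hold different numbers of rows.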
6. Scaling
- Elastic Resize → Add/remove nodes quickly (minutes).
- Classic Resize → Migration to a new cluster (hours).
- Concurrency Scaling → Automatically adds temporary cluster capacity to absorb spikes in concurrent queries.
- RA3 with Managed Storage → Storage automatically grows.
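For the elastic resize case, a minimal sketch using boto3's `resize_cluster` call might look like the following. The cluster identifier is a placeholder and both helper functions are hypothetical; the actual request is left commented out because it needs boto3, AWS credentials, and a real cluster.

```python
def elastic_resize_params(cluster_id, number_of_nodes):
    """Build keyword arguments for boto3's Redshift resize_cluster call."""
    if number_of_nodes < 1:
        raise ValueError("node count must be positive")
    return {
        "ClusterIdentifier": cluster_id,
        "NumberOfNodes": number_of_nodes,
        "Classic": False,  # False requests an elastic resize
    }


def request_resize(params):
    """Send the resize request; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("redshift")
    return client.resize_cluster(**params)


params = elastic_resize_params("example-cluster", 4)  # placeholder identifier
print(params)
# request_resize(params)  # uncomment to run against a real cluster
```

Setting `Classic` to `True` in the same call would request a classic resize instead.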
7. Security
- VPC isolation
- IAM integration (fine-grained access)
- KMS encryption (at rest)
- SSL/TLS (in transit)
- Audit logging → CloudWatch, S3
8. Monitoring
- Amazon CloudWatch metrics
- Query Monitoring Rules (QMRs)
- Workload Management (WLM) → Prioritize queries by queue.
- System tables and views (STL, SVL, SYS) and the query monitoring pages in the Redshift console
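To show roughly how a Query Monitoring Rule behaves, the sketch below models a rule as a set of metric predicates plus an action. It is a toy evaluator, not the WLM engine: the `Rule` class, the thresholds, and the sample metrics are all invented for illustration, though the metric names mirror ones QMRs expose (`query_execution_time`, `scan_row_count`).

```python
import operator
from dataclasses import dataclass

OPERATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


@dataclass
class Rule:
    name: str
    predicates: list  # (metric, operator, value) tuples; all must match
    action: str       # "log", "hop", or "abort"


def evaluate(rules, metrics):
    """Return (rule name, action) for every rule whose predicates all hold."""
    triggered = []
    for rule in rules:
        if all(OPERATORS[op](metrics[m], v) for m, op, v in rule.predicates):
            triggered.append((rule.name, rule.action))
    return triggered


rules = [
    Rule("long_running", [("query_execution_time", ">", 120)], "log"),
    Rule("runaway_scan", [("query_execution_time", ">", 600),
                          ("scan_row_count", ">", 1_000_000_000)], "abort"),
]

# Hypothetical metrics sampled for one running query.
sample = {"query_execution_time": 300, "scan_row_count": 2_000_000_000}
print(evaluate(rules, sample))
```

A common approach is to start new rules with the `log` action, review what they catch, and only then switch the disruptive ones to `abort`.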
9. Advanced Features
- Redshift Spectrum → Query directly on S3 without loading.
- Materialized Views → Precomputed query results.
- Data Sharing → Share data across clusters/accounts.
- Cross-Region Snapshots → Backup & disaster recovery.
- Integration with QuickSight, Tableau, Power BI.
10. Best Practices
- Use RA3 nodes for modern workloads.
- Use COPY with compression for bulk loads.
- Optimize sort keys & distribution styles.
- Use WLM (Workload Management) queues for workload isolation.
- Regularly VACUUM & ANALYZE to maintain performance.
- Use Spectrum for infrequently queried S3 data.
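Two of the practices above, choosing distribution/sort keys and running VACUUM and ANALYZE, come down to a handful of SQL statements. The sketch below generates them as strings; the table and column names are placeholders and both functions are hypothetical helpers, so treat it as a starting point rather than a tuned design.

```python
def create_table_ddl(table, columns, dist_key=None, sort_keys=()):
    """Build CREATE TABLE DDL with optional DISTKEY and compound SORTKEY."""
    cols = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE TABLE {table} (\n{cols}\n)"
    if dist_key:
        ddl += f"\nDISTSTYLE KEY\nDISTKEY ({dist_key})"
    if sort_keys:
        ddl += f"\nCOMPOUND SORTKEY ({', '.join(sort_keys)})"
    return ddl + ";"


def maintenance_statements(table):
    """Statements to re-sort/reclaim space and refresh planner statistics."""
    return [f"VACUUM {table};", f"ANALYZE {table};"]


ddl = create_table_ddl(
    "sales",  # placeholder table
    [("sale_id", "BIGINT"), ("customer_id", "INTEGER"), ("sale_date", "DATE")],
    dist_key="customer_id",   # assumed to be the common join column
    sort_keys=("sale_date",),  # assumed to be the common range filter
)
print(ddl)
for stmt in maintenance_statements("sales"):
    print(stmt)
```

The key choices here are assumptions about the workload: distribute on the column most often used in joins, and sort on the column most often used in range filters.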