Amazon Redshift Spectrum - Overview.
Scope:
- Intro,
- The Concept: Redshift Spectrum,
- Spectrum Architecture,
- Workflow,
- Key Features,
- Performance Optimization,
- Security & Access,
- Use Cases,
- Pricing,
- Final thoughts.
Intro:
- Amazon Redshift Spectrum is a feature of Amazon Redshift that allows users to query data files located in an Amazon S3 data lake without first needing to load or move the data into the Amazon Redshift cluster's storage.
- It is one of the most powerful features of Redshift, capable of querying exabyte-scale data in S3 without loading it into cluster storage.
1. The Concept: Redshift Spectrum
- Redshift Spectrum is an extension of Amazon Redshift that allows twtech to query structured and semi-structured data directly in Amazon S3 using SQL.
- Spectrum uses the Redshift SQL engine but delegates scanning and filtering to Spectrum worker nodes, which operate in a massively parallel way.
- Spectrum eliminates the need to ETL all data into Redshift: twtech can keep “cold” data (infrequently accessed) in S3 and load only “hot” data (frequently accessed and highly active) into Redshift tables.
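For a first feel of what this looks like, here is a minimal sketch of a Spectrum query; the schema and table names (spectrum_demo, clickstream_logs) are hypothetical, and the sections below show how such an external table gets defined:

  -- Querying an external table backed by files in S3, like any other SQL query.
  SELECT eventdate, count(*) AS events
  FROM spectrum_demo.clickstream_logs   -- hypothetical external table
  WHERE eventdate >= '2024-01-01'       -- filter is pushed down to Spectrum workers
  GROUP BY eventdate;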
2. Spectrum Architecture
- Redshift Cluster
  - The leader node parses SQL queries.
  - Compute nodes handle query execution inside the warehouse.
  - If a query involves external tables, the cluster pushes the request to Spectrum.
- Spectrum Layer (separate fleet of workers)
  - Runs outside twtech's Redshift cluster and scales automatically.
  - Reads data directly from S3.
  - Applies predicate pushdown (filtering at the source).
- Data Sources
  - Amazon S3 (Parquet, ORC, Avro, JSON, CSV, TSV, text).
  - Can integrate with AWS Glue Data Catalog or a Hive metastore for schema definitions.
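Wiring the cluster to a catalog happens through an external schema. A hedged sketch, assuming a Glue Data Catalog database named demo_db and an IAM role with S3 and Glue access (both names are placeholders):

  -- Map a Glue Data Catalog database into Redshift as an external schema.
  CREATE EXTERNAL SCHEMA spectrum_demo
  FROM DATA CATALOG
  DATABASE 'demo_db'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'   -- placeholder ARN
  CREATE EXTERNAL DATABASE IF NOT EXISTS;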
3. Workflow
- Data is stored in S3 (e.g., raw logs, Parquet files).
- Define external tables in an external schema that points to the Glue/Hive catalog.
- Query external tables with SQL (SELECT * FROM external_table).
- Spectrum workers fetch data → filter/project → return to Redshift compute nodes.
- Results can be joined with internal Redshift tables.
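Putting the workflow together, a minimal sketch; the bucket, path, and column names are hypothetical:

  -- Define an external table over Parquet files in S3.
  CREATE EXTERNAL TABLE spectrum_demo.sales (
    sale_id     bigint,
    customer_id bigint,
    amount      decimal(10,2),
    region      varchar(32)
  )
  STORED AS PARQUET
  LOCATION 's3://twtech-demo-bucket/sales/';

  -- Query it like a regular table; Spectrum workers scan the files.
  SELECT region, sum(amount) AS revenue
  FROM spectrum_demo.sales
  GROUP BY region;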
4. Key Features
- Separation of storage & compute → Pay only for the scanned data.
- Supports multiple formats → Parquet & ORC (columnar) are recommended.
- Joins across sources → Can join S3 external tables with Redshift managed tables (see the example after this list).
- Elastic scaling → Spectrum automatically adds compute capacity for queries.
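For instance, the hypothetical external sales table defined above can be joined directly with a managed table inside Redshift (public.customers is also a placeholder):

  SELECT c.customer_name, sum(s.amount) AS total
  FROM spectrum_demo.sales s      -- external table in S3
  JOIN public.customers c         -- managed table inside Redshift
    ON s.customer_id = c.customer_id
  GROUP BY c.customer_name;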
5. Performance Optimization
- Use columnar formats (Parquet/ORC) instead of row-based formats (CSV/JSON).
- Partition data in S3 (e.g., by date, region) → Spectrum will prune partitions (sketched after this list).
- Compress data (Snappy, GZIP, ZSTD).
- Avoid small files (“small file problem”) → Combine into larger files (100MB–1GB).
- Store metadata in Glue Catalog for easier management.
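As a sketch of how partition pruning works (bucket and names again hypothetical), the external table declares a partition column and each partition's S3 location is registered explicitly:

  CREATE EXTERNAL TABLE spectrum_demo.events (
    event_id bigint,
    payload  varchar(65535)
  )
  PARTITIONED BY (event_date date)
  STORED AS PARQUET
  LOCATION 's3://twtech-demo-bucket/events/';

  -- Register one partition; a Glue crawler can automate this step.
  ALTER TABLE spectrum_demo.events
  ADD IF NOT EXISTS PARTITION (event_date = '2024-01-01')
  LOCATION 's3://twtech-demo-bucket/events/event_date=2024-01-01/';

  -- Filtering on the partition column lets Spectrum skip every other partition.
  SELECT count(*) FROM spectrum_demo.events
  WHERE event_date = '2024-01-01';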
6. Security & Access
- IAM policies control access to S3 data.
- Redshift Spectrum honors Lake Formation permissions.
- Data in S3 should be encrypted (KMS, SSE-S3, SSE-C).
- Spectrum queries can be logged in CloudTrail & CloudWatch.
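On the cluster side, access to an external schema is granted like any other schema; the group name below is a placeholder:

  -- Allow a (hypothetical) analysts group to query tables in the external schema.
  GRANT USAGE ON SCHEMA spectrum_demo TO GROUP analysts;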
7. Use Cases
- Data Lake Analytics → Query raw data in S3 without loading.
- Cold/Hot Data Architecture → Keep recent data in Redshift, historical in S3 (see the sketch after this list).
- Ad-hoc Exploration → Analysts can query logs or JSON directly.
- Cost Optimization → Store infrequently queried data cheaply in S3.
- Big Data Federation → Join transactional warehouse data with large S3 datasets.
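A hedged sketch of that cold/hot pattern, assuming a managed public.sales_recent table with the same columns as the hypothetical external spectrum_demo.sales table: a late-binding view presents both as one dataset (WITH NO SCHEMA BINDING is required for views that reference external tables):

  CREATE VIEW all_sales AS
  SELECT sale_id, customer_id, amount, region
  FROM public.sales_recent          -- hot data, managed by Redshift
  UNION ALL
  SELECT sale_id, customer_id, amount, region
  FROM spectrum_demo.sales          -- cold data, in S3 via Spectrum
  WITH NO SCHEMA BINDING;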
8. Pricing
- Billed per TB of data scanned by Spectrum.
- Partitioning, compression, and columnar formats reduce scan size → reduce cost.
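One way to keep an eye on scan costs is the SVL_S3QUERY_SUMMARY system view, which records rows and bytes scanned per Spectrum query step; aggregating per query surfaces the most expensive scans:

  SELECT query,
         sum(s3_scanned_rows)  AS rows_scanned,
         sum(s3_scanned_bytes) AS bytes_scanned
  FROM svl_s3query_summary
  GROUP BY query
  ORDER BY bytes_scanned DESC
  LIMIT 10;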
Final thoughts:
- Amazon Redshift Spectrum extends Redshift to the data lake in S3, enabling hybrid querying of warehouse + S3 data.
- Amazon Redshift Spectrum is cost-efficient and highly scalable.
- Amazon Redshift Spectrum is ideal for cold data queries, data lakes, and big data analytics.