Amazon Redshift Spectrum - Overview.
Scope:
- Intro,
- The Concept: Redshift Spectrum,
- Spectrum Architecture,
- Workflow,
- Key Features,
- Performance Optimization,
- Security & Access,
- Use Cases,
- Pricing,
- Final thoughts.
Intro:
- Amazon Redshift Spectrum is a feature of Amazon Redshift that allows users to query data files located in an Amazon S3 data lake without first needing to load or move the data into the Amazon Redshift cluster's storage.
- It is one of the most powerful features of Redshift, capable of querying exabyte-scale data in S3 without loading it into cluster storage.
1. The Concept: Redshift Spectrum
- Redshift Spectrum is an extension of Amazon Redshift that allows twtech to query structured and semi-structured data directly in Amazon S3 using SQL.
- Spectrum uses the Redshift SQL engine but delegates scanning and filtering to Spectrum worker nodes, which operate in a massively parallel way.
- Spectrum eliminates the need to ETL all data into Redshift: twtech can keep “cold” data (infrequently accessed) in S3 and load only “hot” data (frequently accessed and highly active) into Redshift tables.
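For a first feel of what this looks like, here is a minimal sketch of a Spectrum query; the schema and table names (spectrum_demo, clickstream_logs) are hypothetical, and the sections below show how such an external table gets defined:

  -- Querying an external table backed by files in S3, like any other SQL query.
  SELECT eventdate, count(*) AS events
  FROM spectrum_demo.clickstream_logs   -- hypothetical external table
  WHERE eventdate >= '2024-01-01'       -- filter is pushed down to Spectrum workers
  GROUP BY eventdate;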
2. Spectrum Architecture
- Redshift Cluster
  - The leader node parses SQL queries.
  - Compute nodes handle query execution inside the warehouse.
  - If a query involves external tables, the cluster pushes the request to Spectrum.
- Spectrum Layer (separate fleet of workers)
  - Runs outside twtech's Redshift cluster and scales automatically.
  - Reads data directly from S3.
  - Applies predicate pushdown (filtering at the source).
- Data Sources
  - Amazon S3 (Parquet, ORC, Avro, JSON, CSV, TSV, text).
  - Can integrate with AWS Glue Data Catalog or a Hive metastore for schema definitions.
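Wiring the cluster to a catalog happens through an external schema. A hedged sketch, assuming a Glue Data Catalog database named demo_db and an IAM role with S3 and Glue access (both names are placeholders):

  -- Map a Glue Data Catalog database into Redshift as an external schema.
  CREATE EXTERNAL SCHEMA spectrum_demo
  FROM DATA CATALOG
  DATABASE 'demo_db'
  IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'   -- placeholder ARN
  CREATE EXTERNAL DATABASE IF NOT EXISTS;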
3. Workflow
- Data is stored in S3 (e.g., raw logs, Parquet files).
- Define external tables in an external schema that points to the Glue/Hive catalog.
- Query external tables with SQL (SELECT * FROM external_table).
- Spectrum workers fetch data → filter/project → return to Redshift compute nodes.
- Results can be joined with internal Redshift tables.
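Putting the workflow together, a minimal sketch; the bucket, path, and column names are hypothetical:

  -- Define an external table over Parquet files in S3.
  CREATE EXTERNAL TABLE spectrum_demo.sales (
    sale_id     bigint,
    customer_id bigint,
    amount      decimal(10,2),
    region      varchar(32)
  )
  STORED AS PARQUET
  LOCATION 's3://twtech-demo-bucket/sales/';

  -- Query it like a regular table; Spectrum workers scan the files.
  SELECT region, sum(amount) AS revenue
  FROM spectrum_demo.sales
  GROUP BY region;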
4. Key Features
- Separation of storage & compute → Pay only for the scanned data.
- Supports multiple formats → Parquet & ORC (columnar) are recommended.
- Joins across sources → Can join S3 external tables with Redshift managed tables (see the example after this list).
- Elastic scaling → Spectrum automatically adds compute capacity for queries.
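For instance, the hypothetical external sales table defined above can be joined directly with a managed table inside Redshift (public.customers is also a placeholder):

  SELECT c.customer_name, sum(s.amount) AS total
  FROM spectrum_demo.sales s      -- external table in S3
  JOIN public.customers c         -- managed table inside Redshift
    ON s.customer_id = c.customer_id
  GROUP BY c.customer_name;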
5. Performance Optimization
- Use columnar formats (Parquet/ORC) instead of row-based formats (CSV/JSON).
- Partition data in S3 (e.g., by date, region) → Spectrum will prune partitions (sketched after this list).
- Compress data (Snappy, GZIP, ZSTD).
- Avoid small files (“small file problem”) → Combine into larger files (100MB–1GB).
- Store metadata in Glue Catalog for easier management.
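As a sketch of how partition pruning works (bucket and names again hypothetical), the external table declares a partition column and each partition's S3 location is registered explicitly:

  CREATE EXTERNAL TABLE spectrum_demo.events (
    event_id bigint,
    payload  varchar(65535)
  )
  PARTITIONED BY (event_date date)
  STORED AS PARQUET
  LOCATION 's3://twtech-demo-bucket/events/';

  -- Register one partition; a Glue crawler can automate this step.
  ALTER TABLE spectrum_demo.events
  ADD IF NOT EXISTS PARTITION (event_date = '2024-01-01')
  LOCATION 's3://twtech-demo-bucket/events/event_date=2024-01-01/';

  -- Filtering on the partition column lets Spectrum skip every other partition.
  SELECT count(*) FROM spectrum_demo.events
  WHERE event_date = '2024-01-01';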
6. Security & Access
- IAM policies control access to S3 data.
- Redshift Spectrum honors Lake Formation permissions.
- Data in S3 should be encrypted (KMS, SSE-S3, SSE-C).
- Spectrum queries can be logged in CloudTrail & CloudWatch.
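On the cluster side, access to an external schema is granted like any other schema; the group name below is a placeholder:

  -- Allow a (hypothetical) analysts group to query tables in the external schema.
  GRANT USAGE ON SCHEMA spectrum_demo TO GROUP analysts;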
7. Use Cases
- Data Lake Analytics → Query raw data in S3 without loading.
- Cold/Hot Data Architecture → Keep recent data in Redshift, historical in S3 (see the sketch after this list).
- Ad-hoc Exploration → Analysts can query logs or JSON directly.
- Cost Optimization → Store infrequently queried data cheaply in S3.
- Big Data Federation → Join transactional warehouse data with large S3 datasets.
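A hedged sketch of that cold/hot pattern, assuming a managed public.sales_recent table with the same columns as the hypothetical external spectrum_demo.sales table: a late-binding view presents both as one dataset (WITH NO SCHEMA BINDING is required for views that reference external tables):

  CREATE VIEW all_sales AS
  SELECT sale_id, customer_id, amount, region
  FROM public.sales_recent          -- hot data, managed by Redshift
  UNION ALL
  SELECT sale_id, customer_id, amount, region
  FROM spectrum_demo.sales          -- cold data, in S3 via Spectrum
  WITH NO SCHEMA BINDING;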
8. Pricing
- Billed per TB of data scanned by Spectrum.
- Partitioning, compression, and columnar formats reduce scan size → reduce cost.
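One way to keep an eye on scan costs is the SVL_S3QUERY_SUMMARY system view, which records rows and bytes scanned per Spectrum query step; aggregating per query surfaces the most expensive scans:

  SELECT query,
         sum(s3_scanned_rows)  AS rows_scanned,
         sum(s3_scanned_bytes) AS bytes_scanned
  FROM svl_s3query_summary
  GROUP BY query
  ORDER BY bytes_scanned DESC
  LIMIT 10;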
Final thoughts:
- Amazon Redshift Spectrum extends Redshift to the data lake in S3, enabling hybrid querying of warehouse + S3 data.
- Amazon Redshift Spectrum is cost-efficient and highly scalable.
- Amazon Redshift Spectrum is ideal for cold data queries, data lakes, and big data analytics.