Loading Large Data into Amazon Redshift - Overview
Scope:
- Intro,
- Methods of Loading Data (Recommended for Bulk Inserts),
- Other Methods to Load large datasets into Amazon Redshift,
- Best Practices for Large Data Loads,
- Sample COPY Command (S3 → Redshift) & Key Options,
- Handling Very Large Loads,
- Performance Tuning & Maintenance,
- Common Pitfalls,
- Final thought.
Intro:
- Loading large datasets into Amazon Redshift efficiently involves using specific tools and strategies designed to leverage Redshift's parallel processing architecture.
- The primary method recommended by AWS is the COPY command, typically loading data staged in Amazon S3.
1. Methods of Loading Data (Recommended for Bulk Inserts)
- COPY command (fastest; parallel load).
- Reads data in parallel from multiple sources into the compute nodes.
- Sources: Amazon S3, DynamoDB, EMR, SSH (remote hosts); Kinesis Data Firehose delivers to Redshift by issuing COPY behind the scenes.
- The COPY command supports compressed input files (gzip, bzip2, lzop, zstd); Snappy is handled automatically inside Parquet/ORC files.
- INSERT / UPDATE / DELETE → fine for small volumes, not for large-scale loads.
- AWS DMS → continuous replication from RDS and other databases.
- Glue ETL / EMR → transformation before loading.
2. Best Practices for Large Data Loads
a) Use COPY instead of INSERT
- COPY is parallelized across all nodes and slices.
- INSERT is row-by-row and much slower for millions of rows.
b) Split Files for Parallelism
- Place data in multiple files (ideally matching the number of slices in the cluster).
- Example: a cluster with 8 nodes × 16 slices per node = 128 slices → split into 128 files.
- File size: 100 MB – 1 GB each (after compression) for balance.
c) Use Column Compression
- Redshift can recommend optimal column encodings with ANALYZE COMPRESSION (COPY also applies automatic compression when loading into an empty table).
- Loading compressed files reduces network + I/O cost.
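As a sketch, running the recommendation report against the sample table from the COPY section below (substitute your own table name):

```sql
-- Report recommended compression encodings per column for an existing table.
ANALYZE COMPRESSION twtechsales;
```

The output lists a suggested encoding and estimated space saving for each column; apply the encodings in the table DDL.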
d) Optimize Distribution & Sort Keys
- Ensures data is well spread across slices → avoids data skew.
- Matching distribution keys to join/filter columns reduces shuffling.
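A minimal DDL sketch of the idea; the column names are illustrative, not from the original post. It distributes on a hypothetical join key and sorts on a hypothetical filter column:

```sql
-- Illustrative schema: DISTKEY on the common join column spreads rows
-- evenly and co-locates joins; SORTKEY on the frequent filter column
-- lets Redshift skip blocks during range scans.
CREATE TABLE twtechsales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);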
e) Staging Tables
- Load raw data into staging tables (no constraints, no sort/distribution keys).
- Apply transformations and then insert into final tables.
- Prevents locks/contention on production tables.
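The staging pattern can be sketched as follows; table names, bucket, and role ARN are placeholders mirroring the sample below. Note that `CREATE TEMP TABLE ... (LIKE ...)` inherits the source table's definition, so define the staging table explicitly if you want it without sort/distribution keys:

```sql
-- 1. Stage the raw data in a session-scoped temp table.
CREATE TEMP TABLE twtechsales_staging (LIKE twtechsales);

COPY twtechsales_staging
FROM 's3://twtech-s3bucket/twtechsales_data/'
IAM_ROLE 'arn:aws:iam::accountID:role/RedshiftCopyRole'
FORMAT AS PARQUET;

-- 2. Transform and move into the production table in one transaction,
--    keeping lock time on the final table short.
BEGIN;
INSERT INTO twtechsales
SELECT * FROM twtechsales_staging;   -- transformations would go here
COMMIT;
```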
3. Sample COPY Command (S3 → Redshift) & Key Options
-- sql
COPY twtechsales
FROM 's3://twtech-s3bucket/twtechsales_data/'
IAM_ROLE 'arn:aws:iam::accountID:role/RedshiftCopyRole'
FORMAT AS PARQUET;
Key Options:
- DELIMITER → for CSV (comma-separated) and TSV (tab-separated) files.
- GZIP, BZIP2, ZSTD → load compressed files.
- PARQUET, ORC → columnar formats, efficient; AVRO → row-based binary format.
- IGNOREHEADER → skip header rows in CSV files.
- ACCEPTINVCHARS → handle invalid character encoding.
- MAXERROR → tolerate up to N bad rows before the load fails.
- COMPUPDATE OFF → skip automatic compression analysis if encodings are already set.
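An illustrative CSV load combining several of the options above; bucket, prefix, and role ARN are placeholders mirroring the Parquet sample:

```sql
-- Load gzipped CSV files, skipping one header row per file and
-- tolerating up to 100 bad rows before aborting.
COPY twtechsales
FROM 's3://twtech-s3bucket/twtechsales_csv/'
IAM_ROLE 'arn:aws:iam::accountID:role/RedshiftCopyRole'
DELIMITER ','
IGNOREHEADER 1
GZIP
ACCEPTINVCHARS
MAXERROR 100
COMPUPDATE OFF;
```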
4. Handling Very Large Loads
- Multi-step pipeline:
- Land raw data in S3.
- Split into optimal file chunks.
- COPY into staging → transform → move to final tables.
- Concurrency Scaling:
- Helps during peak ingest times.
- WLM Queues:
- Assign COPY jobs to dedicated queues to avoid competing with queries.
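One way to route a load into a dedicated WLM queue is to tag the session with a query group that the queue is configured to match; the group name 'load' here is illustrative:

```sql
-- Tag this session so WLM routes the COPY to the queue matching 'load'.
SET query_group TO 'load';

COPY twtechsales
FROM 's3://twtech-s3bucket/twtechsales_data/'
IAM_ROLE 'arn:aws:iam::accountID:role/RedshiftCopyRole'
FORMAT AS PARQUET;

RESET query_group;
```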
5. Performance Tuning & Maintenance
- VACUUM → Reclaims space & sorts data after large deletes/updates.
- ANALYZE → Updates table statistics for query planner.
- DISTKEY & SORTKEY tuning → Prevents data skew & speeds queries.
- Workload Management (WLM) → Isolate COPY workload.
- COPY with COMPUPDATE & STATUPDATE → Can auto-tune compression/stats (but disable if already optimized).
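A minimal post-load maintenance sketch for the sample table (VACUUM FULL both reclaims deleted space and re-sorts; lighter variants such as VACUUM DELETE ONLY exist if only one is needed):

```sql
-- Re-sort rows and reclaim space, then refresh planner statistics.
VACUUM FULL twtechsales;
ANALYZE twtechsales;
```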
6. Common Pitfalls
- Too few large files → underutilizes slices.
- Too many tiny files → Overhead, slower performance.
- Uncompressed loads → Increases cost/time.
- Loading directly into production tables → Causes locking & poor query performance.
- Skewed data distribution → one node is overloaded, poor parallelism.
twtech Final thought:
- For large data inserts, always land data in S3 → COPY → staging tables → optimized schema, ensuring parallel load, compression, and proper distribution.