An overview of Transferring Large Amounts of Data into AWS (TB → PB scale).
Scope:
- Performance
- Cost considerations
- Architecture
- Tuning
- When to use specific services
Breakdown:
- Online Network Transfer
- Optimized Transfer Services
- Offline Physical Transfer Devices (When Network Is Too Slow)
- Hybrid Cloud Replication Services
- VM, Backup, and Large-Scale Migration Tools
- Performance Architecture: End-to-End Flow
- Choosing the Right Method
- Performance Optimization Techniques
- Security
- Cost Considerations
Intro:
- Transferring large amounts of data into AWS can be done through online and offline methods.
- The right method depends primarily on data volume, available network bandwidth, and time constraints (how quickly the transfer must complete).
NB:
The online methods below use twtech's existing network connection to move data to AWS.
AWS DataSync:
- A managed file transfer service designed for automating and accelerating data movement between on-premises storage (NFS, SMB shares, Hadoop, etc.) and AWS storage services (Amazon S3, EFS, FSx).
- Can be up to 10x faster than open-source tools; handles many tasks automatically, including data integrity validation and encryption.
- Best for: one-time migrations, recurring data processing, and automated replication when network bandwidth is available.
AWS Transfer Family:
- Provides fully managed support for transferring files into and out of Amazon S3 using standard file transfer protocols (SFTP, FTPS, and FTP).
- Best for: seamlessly migrating existing file transfer workflows that rely on these protocols without changing client-side configurations.
Amazon S3 Transfer Acceleration:
- Optimizes public internet transfers to Amazon S3 by leveraging Amazon CloudFront's global edge locations to minimize the effect of latency and maximize available bandwidth.
- Best for: accelerating uploads over the public internet when dealing with high latency over long distances.
AWS CLI:
- For manual or scripted transfers, the AWS CLI provides s3 cp and s3 sync commands, which can be tuned to use multipart uploads for large files; see the sketch after this list item.
- Best for: users who require scripting capabilities and can manage network optimization manually.
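A minimal sketch of a tuned, scripted transfer (the bucket name is a placeholder; concurrency and chunk size should be sized to the available bandwidth and memory):

  # Raise parallelism and multipart chunk size for large files
  aws configure set default.s3.max_concurrent_requests 20
  aws configure set default.s3.multipart_threshold 64MB
  aws configure set default.s3.multipart_chunksize 128MB
  # Sync a local directory to S3; re-runs only upload new or changed files
  aws s3 sync /data/export s3://example-ingest-bucket/export/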
AWS Direct Connect:
- Establishes a dedicated private network connection from twtech premises to AWS, which can provide a more consistent and higher-throughput experience than an internet connection.
- Best for: high-throughput, reliable, and secure data transfer for ongoing hybrid cloud operations.
Offline transfer:
- When twtech has extremely large datasets (terabytes to petabytes), limited or no network bandwidth, or the data is not needed immediately, AWS provides physical storage devices.
AWS Snowball Edge:
- A rugged, secure device with significant storage (up to 210 TB NVMe SSD) and optional compute capabilities.
- twtech orders a device via the AWS Console, copies data to it on-premises, and ships it back to AWS, where the data is loaded into a twtech S3 bucket.
- Best for: bulk data migrations (terabytes to petabytes) where shipping the data is faster or more cost-effective than online transfer.
AWS Snowcone:
- The smallest member of the Snow Family, offering up to 8 TB of usable storage per device.
- Best for: smaller data migration requirements or edge computing in remote locations.
AWS Snowmobile:
- A literal semi-trailer for exabyte-scale data migration.
- Best for: extremely large, multi-petabyte/exabyte-scale data center migrations.
There are 5 primary categories:
- Online Network Transfer (Direct Connect, VPN, Internet)
- Optimized Transfer Services (AWS DataSync, S3 Transfer Acceleration)
- Physical Offline Devices (Snowcone, Snowball, Snowmobile)
- Hybrid Software Replication (Storage Gateway, FSx, RDS/Aurora tools)
- Application-Level Migration (Databases, VMs, Files, Streams)
1. Online Network Transfer
1.1 Direct Connect (DX)
Best for: predictable, high-bandwidth, petabyte-scale continuous transfer.
Capacities:
- Dedicated DX: 1 Gbps, 10 Gbps, 100 Gbps
- Hosted DX: 50 Mbps – 10 Gbps
Use cases:
- Datacenter → AWS data ingestion
- Real-time replication
- Long-term hybrid architectures
Throughput:
- With TCP tuning (window size, parallel streams), twtech can achieve ~70–90% of line rate on optimized circuits.
Deep considerations:
- Use Jumbo Frames (MTU 9001)
- Tune TCP window size > 16 MB
- Use parallel transfers for S3
- Combine with AWS DataSync for protocol optimization
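As an illustration, a minimal Linux-side tuning sketch for a host feeding a DX circuit (the interface name eth0 and the 256 MB buffer ceiling are assumptions; size buffers to the circuit's actual bandwidth-delay product):

  # Enable jumbo frames on the interface facing the DX circuit (eth0 is assumed)
  sudo ip link set dev eth0 mtu 9001
  # Raise maximum socket buffer sizes to 256 MB so TCP windows can grow past 16 MB
  sudo sysctl -w net.core.rmem_max=268435456
  sudo sysctl -w net.core.wmem_max=268435456
  # min / default / max TCP buffer sizes in bytes
  sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
  sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
  # Window scaling is required for windows larger than 64 KB (default on modern kernels)
  sudo sysctl -w net.ipv4.tcp_window_scaling=1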
1.2 Site-to-Site VPN
Used when DX doesn't exist or for temporary migrations.
- 1.25 Gbps per VPN tunnel (usually less, ~300–400 Mbps real-world)
- Can use Equal-Cost Multi-Path (ECMP) to parallelize multiple tunnels
Not ideal for PB-scale, but workable for incremental syncs.
1.3 Public Internet
If twtech's environment has:
- 1–100 Gbps internet
- CDN offload
- WAN acceleration
It becomes viable, but performance is variable.
2. Optimized Transfer Services
2.1 AWS DataSync - Most popular for terabyte (TB) to petabyte (PB) transfers
Purpose-built for moving file systems → S3, EFS, FSx:
- Agents use parallelism + compression + delta transfers
- Handles 10 Gbps+ per agent
- Deploy multiple agents for >100 Gbps total
DataSync advantages:
- Checksums + integrity validation
- Built-in retry, encryption
- No protocol overhead like NFS/SCP/rsync
- Up to 10× faster than rsync
Architecture:
On-Prem File System → DataSync Agent → AWS PrivateLink → S3 / EFS / FSx / EC2 / Lambda
Best for:
- PB-scale file migrations
- Large dataset ingestion
- Media, HPC, research, logs, backups
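A hedged sketch of wiring up such a migration with the AWS CLI (hostnames, ARNs, and bucket names are placeholders; it assumes a DataSync agent has already been deployed and activated on-premises):

  # Source: an on-prem NFS export, reached through the activated agent
  aws datasync create-location-nfs \
    --server-hostname nfs.example.internal \
    --subdirectory /export/data \
    --on-prem-config AgentArns=arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0
  # Destination: an S3 bucket, accessed via an IAM role DataSync can assume
  aws datasync create-location-s3 \
    --s3-bucket-arn arn:aws:s3:::example-ingest-bucket \
    --s3-config BucketAccessRoleArn=arn:aws:iam::111122223333:role/datasync-s3-access
  # Create the task from the two location ARNs returned above, then start it
  aws datasync create-task \
    --source-location-arn <source-location-arn> \
    --destination-location-arn <destination-location-arn>
  aws datasync start-task-execution --task-arn <task-arn>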
2.2 S3 Transfer Acceleration (S3-TA)
Uses CloudFront’s global edge network to accelerate long-distance uploads.
Performance:
- 50–500% faster for cross-continent transfers
- Little to no benefit when the source is already close to the destination AWS Region
Best for:
- Uploading from globally distributed sources
- Media ingestion
- Web apps with global users
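Enabling acceleration is a one-time bucket setting; a short sketch (the bucket name is a placeholder):

  # Enable Transfer Acceleration on the bucket
  aws s3api put-bucket-accelerate-configuration \
    --bucket example-ingest-bucket \
    --accelerate-configuration Status=Enabled
  # Route subsequent CLI transfers through the accelerate endpoint
  aws configure set default.s3.use_accelerate_endpoint true
  aws s3 cp ./bigfile.tar s3://example-ingest-bucket/bigfile.tar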
2.3 S3 Multipart Upload
For large objects (>5 GB, above the single-PUT limit), use:
- Multi-threading
- Parallelism
- Chunk sizes (64–256 MB)
- Resume support
Throughput can reach multi-Gbps with enough threads.
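For illustration, the raw API calls behind a multipart upload (the high-level aws s3 commands issue these automatically; bucket, key, and file names are placeholders):

  # 1. Start the upload; the response contains an UploadId
  aws s3api create-multipart-upload --bucket example-ingest-bucket --key backups/big.tar
  # 2. Upload parts (in parallel from separate shells); every part but the last must be >= 5 MB
  aws s3api upload-part --bucket example-ingest-bucket --key backups/big.tar \
    --part-number 1 --body big.tar.part1 --upload-id <upload-id>
  # 3. Complete with the {PartNumber, ETag} pairs collected from step 2
  aws s3api complete-multipart-upload --bucket example-ingest-bucket --key backups/big.tar \
    --upload-id <upload-id> --multipart-upload file://parts.json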
3. Offline Physical Transfer Devices (When Network Is Too Slow)
3.1 Snowcone
- 8 TB usable storage
- Small edge device
3.2 Snowball Edge (standard for PB transfers)
Two variants:
- Snowball Edge Storage Optimized (~80 TB usable)
- Snowball Edge Compute Optimized (~40 TB usable + compute)
Security:
- AES-256 encryption
- TPM
- Tamper evident
- Encrypted end-to-end
Typical ingestion workflow:
- AWS ships device
- twtech loads data locally (40–80 TB per device)
- Ship back to AWS
- AWS ingests to S3
- Verification completed
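A loosely sketched CLI version of the ordering step (all IDs and ARNs are placeholders, and the capacity and shipping values are assumptions; most teams order devices through the console):

  # Create an import job; the device ships to the address registered with create-address
  aws snowball create-job \
    --job-type IMPORT \
    --resources '{"S3Resources":[{"BucketArn":"arn:aws:s3:::example-ingest-bucket"}]}' \
    --address-id ADID1234ab-1234-abcd-1234-123456789012 \
    --role-arn arn:aws:iam::111122223333:role/snowball-import-role \
    --snowball-capacity-preference T80 \
    --shipping-option SECOND_DAY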
Scale:
- 10 devices = 0.8 PB
- 100 devices = 8 PB
- Multi-day turnaround depending on shipping
🚚 3.3 AWS Snowmobile (100 PB per truck)
Industrial-scale data migration solution.
Best for:
- 10 PB – 1 EB (massive archives, media libraries, seismic data)
Transfer rate:
- ~1 Tbps aggregated internal write capacity
4. Hybrid Cloud Replication Services
4.1 AWS Storage Gateway
Used for:
- File gateway → S3-backed SMB/NFS
- Volume gateway → Backup/DR
- Tape gateway → Replace physical tape libraries
Not designed for one-time bulk loads, but good for gradual ingestion.
4.2 FSx Services
- FSx for NetApp ONTAP
- FSx for Windows
- FSx for Lustre
Each provides native replication tools (SnapMirror, robocopy, HSM
workflows).
4.3 Database Migration Services
Tools:
- AWS DMS
- Babelfish
- Oracle RMAN to S3
- PostgreSQL pg_dump / pg_basebackup
- MySQL logical/physical dumps
- DynamoDB import from S3
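As one example of the PostgreSQL route, a logical dump can be streamed straight into S3 without staging on local disk (database and bucket names are placeholders):

  # Stream a compressed custom-format dump directly to S3; "-" makes aws s3 cp read stdin
  pg_dump -Fc -d exampledb | aws s3 cp - s3://example-ingest-bucket/dumps/exampledb.dump
  # Restore later by streaming back down
  aws s3 cp s3://example-ingest-bucket/dumps/exampledb.dump - | pg_restore -d exampledb_restored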
5. VM, Backup, and Large-Scale Migration Tools
5.1 VM Migration
- AWS MGN (Application Migration Service)
- AWS Server Migration Service (legacy)
- VMware HCX → AWS
HCX can transfer hundreds of VMs via WAN-optimized links.
5.2 Backup Tools
- Veeam → AWS (S3, Glacier, VTL)
- CommVault
- Rubrik
- NetBackup
These can hydrate backups directly into AWS.
6. Performance Architecture: End-to-End Flow
At a high level, data flows: on-prem source storage → transfer path (DataSync agent over DX/VPN/internet, or a Snow device shipped offline) → Amazon S3 / EFS / FSx → downstream processing.
7. Choosing the Right Method
| Data Volume | Network Available | Best Option |
|---|---|---|
| < 5 TB | Good internet | Direct upload / DataSync |
| 5–50 TB | 1–10 Gbps | DataSync / DX |
| 50–500 TB | < 5 Gbps | Snowball Edge |
| 500 TB – 10 PB | DX < 10 Gbps | Snowball (multiple) |
| > 10 PB | DX insufficient | Snowmobile |
| Continuous replication | 10+ Gbps DX | Direct Connect + DataSync |
| Global upload | Distributed users | S3 Transfer Acceleration |
8. Performance Optimization Techniques
TCP Tuning
- TCP Window Size: 16–256 MB
- Increase buffer sizes
- Enable BDP-based tuning
Parallelism
- 10–100 parallel upload streams for S3
- Multi-threaded DataSync
Compression
- Use at source if CPU available
- DataSync compresses automatically
Chunking
- S3: 64–256 MB multipart chunks
9. Security
- In-flight encryption (TLS)
- At-rest encryption with KMS keys
- Snowball/Snowmobile: AES-256 XTS encryption
- IAM and S3 bucket policies for access control
- VPC endpoints / PrivateLink
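For the last point, a small sketch of keeping S3 transfer traffic off the public internet with a gateway VPC endpoint (the VPC and route-table IDs are placeholders):

  # Create a gateway endpoint so S3 traffic from the VPC stays on the AWS network
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-0123456789abcdef0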
10. Cost Considerations
| Method | Cost Type |
|---|---|
| DataSync | Per-GB ($0.0125/GB) |
| Snowball | Per device + shipping |
| Snowmobile | Contracted, event-based |
| Direct Connect | Port-hour + data transfer |
| Transfer Acceleration | Premium per-GB |
| S3 Storage | Standard S3 pricing |