Sunday, November 30, 2025

Transferring Large Amounts of Data into AWS | Overview.

An overview of transferring large amounts of data into AWS (TB–PB scale).

Scope:

  •        Performance,
  •        Cost consideration,
  •        Architecture,
  •        Tuning,
  •        When to use specific services.

Breakdown:

  •        Online Network Transfer,
  •        Optimized Transfer Services,
  •        Offline Physical Transfer Devices (When Network Is Too Slow),
  •        Hybrid Cloud Replication Services,
  •        VM, Backup, and Large-Scale Migration Tools,
  •        Performance Architecture: End-to-End Flow,
  •        Choosing the Right Method,
  •        Performance Optimization Techniques,
  •        Security,
  •        Cost Considerations.

Intro:

  •        Transferring large amounts of data into AWS can be done through online and offline methods.
  •         The right method depends primarily on the data volume, the available network bandwidth, and time constraints (how quickly the transfer must complete).
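The interplay of volume, bandwidth, and time can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `transfer_days` and the default 80% link utilization are assumptions, not figures from any AWS tool, and terabytes are treated as decimal (1 TB = 10^12 bytes).

```python
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Estimate the days needed to move data_tb terabytes over a link_gbps link.

    utilization is the assumed fraction of line rate actually achieved.
    """
    if data_tb < 0 or link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("expected data_tb >= 0, link_gbps > 0, 0 < utilization <= 1")
    bits = data_tb * 1e12 * 8                    # decimal terabytes -> bits
    usable_bps = link_gbps * 1e9 * utilization   # bits per second actually usable
    return bits / usable_bps / 86_400            # seconds -> days


if __name__ == "__main__":
    for tb, gbps in [(10, 1), (100, 1), (1000, 10)]:
        print(f"{tb} TB over {gbps} Gbps: {transfer_days(tb, gbps):.1f} days")
```

Under these assumptions, 100 TB over a 1 Gbps link takes roughly 11.6 days, which is the kind of figure that starts to make an offline device worth considering.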

Online Transfer Methods

NB:

These methods use twtech's existing network connection to move data to AWS.

        AWS DataSync:
  • A managed file transfer service designed for automating and accelerating data movement between on-premises storage (NFS, SMB shares, Hadoop, etc.) and AWS storage services (Amazon S3, EFS, FSx).
   Pros:
    • Can be up to 10x faster than open-source tools; handles many tasks automatically, including data integrity validation and encryption.
   Best for:
    • One-time migrations, recurring data processing, and automated replication when you have available network bandwidth.
        AWS Transfer Family:
  • Provides fully managed support for transferring files into and out of Amazon S3 using standard file transfer protocols (SFTP, FTPS, and FTP).
   Best for:
    • Seamlessly migrating existing file transfer workflows that rely on these protocols without changing client-side configurations.
        Amazon S3 Transfer Acceleration:
  • This feature optimizes public internet transfers to Amazon S3 by leveraging Amazon CloudFront's global edge locations to minimize the effect of latency and maximize available bandwidth.
   Best for:
    • Accelerating uploads when using the public internet and dealing with high latency over long distances.
        AWS Command Line Interface (CLI) / SDKs:
  • For manual or scripted transfers, the AWS CLI provides the s3 cp and s3 sync commands, which use multipart uploads for large files and can be tuned (concurrency, chunk size).
   Best for:
    • Users who require scripting capabilities and can manage network optimization manually.
        AWS Direct Connect:
  • Establishes a dedicated private network connection from your premises to AWS, which can provide a more consistent and higher-throughput experience than an internet connection.
   Best for:
    • High-throughput, reliable, and secure data transfer for ongoing hybrid cloud operations.

Offline Transfer Methods (AWS Snow Family)

  • When twtech has extremely large datasets (terabytes to petabytes), limited or no network bandwidth, or data that is not needed immediately, AWS provides physical storage devices.
        AWS Snowball Edge:
  • A rugged, secure device with significant storage (up to 210 TB NVMe SSD) and optional compute capabilities.
   Process:
    • twtech orders a device via the AWS Console, copies its data onto it on-premises, and ships it back to AWS, where the data is loaded into twtech's S3 bucket.
   Best for:
    • Bulk data migrations (terabytes to petabytes) where shipping the data is faster or more cost-effective than online transfer.
        AWS Snowcone:
  • The smallest member of the Snow Family, offering up to 8 TB of usable storage per device.
   Best for:
    • Smaller data migration requirements or edge computing in remote locations.
        AWS Snowmobile:
  • A literal semi-trailer for exabyte-scale data migration.
   Best for:
    • Extremely large, multi-petabyte/exabyte-scale data center migrations.

There are five primary categories:

  1.      Online Network Transfer (Direct Connect, VPN, Internet)
  2.      Optimized Transfer Services (AWS DataSync, S3 Transfer Acceleration)
  3.      Physical Offline Devices (Snowcone, Snowball, Snowmobile)
  4.      Hybrid Software Replication (Storage Gateway, FSx, RDS/Aurora tools)
  5.      Application-Level Migration (Databases, VMs, Files, Streams)

1. Online Network Transfer

1.1 Direct Connect (DX)

Best for: predictable, high-bandwidth, petabyte-scale continuous transfer.

Capacities:

  •         Dedicated DX: 1 Gbps, 10 Gbps, 100 Gbps
  •         Hosted DX: 50 Mbps – 10 Gbps

Use cases:

  •         Datacenter → AWS data ingestion
  •         Real-time replication
  •         Long-term hybrid architectures

Throughput:

With TCP tuning (window size, parallel streams), twtech can achieve:

  •         ~70–90% of line rate on optimized circuits.

Deep considerations:

  •         Use Jumbo Frames (MTU 9001)
  •         Tune TCP window size > 16 MB
  •         Use parallel transfers for S3
  •         Combine with AWS DataSync for protocol optimization

1.2 Site-to-Site VPN

Used when DX doesn't exist or for temporary migrations.

  •         1.25 Gbps per VPN tunnel (usually less, ~300–400 Mbps real-world)
  •         Can use Equal-Cost Multi-Path (ECMP) to parallelize multiple tunnels

Not ideal for PB-scale, but workable for incremental syncs.

1.3 Public Internet

If twtech's environment has:

  •         1–100 Gbps internet
  •         CDN offload
  •         WAN acceleration

then the public internet becomes a viable path, but performance is variable.

2. Optimized Transfer Services

 2.1 AWS DataSync - Most popular for Terabytes (TB) – Petabytes (PB) transfers

Purpose-built for moving file systems → S3, EFS, FSx:

  •         Agents use parallelism + compression + delta transfers
  •         Handles up to about 10 Gbps per agent
  •         Deploy multiple agents for >100 Gbps total

DataSync advantages:

  •         Checksums + integrity validation
  •         Built-in retry, encryption
  •         No protocol overhead like NFS/SCP/rsync
  •         Up to 10× faster than rsync

Architecture:

On-Prem File System → DataSync Agent → AWS PrivateLink → S3 / EFS / FSx / EC2 / Lambda

Best for:

  •         PB-scale file migrations
  •         Large dataset ingestion
  •         Media, HPC, research, logs, backups

 2.2 S3 Transfer Acceleration (S3-TA)

Uses CloudFront’s global edge network to accelerate long-distance uploads.

Performance:

  •         50–500% faster for cross-continent transfers
  •         If the source is already close to the target AWS Region, twtech will see little or no benefit.

Best for:

  •         Uploading from globally distributed sources
  •         Media ingestion
  •         Web apps with global users

 2.3 S3 Multi-Part Upload

For large objects (multipart upload is required above 5 GB and recommended from about 100 MB), use:

  •         Multi-threading
  •         Parallelism
  •         Chunk sizes (64–256MB)
  •         Resume support

Throughput can reach multi-Gbps with enough threads.
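As a rough illustration of the chunk-size guidance above, the sketch below plans part sizes within S3's multipart limits (at most 10,000 parts, minimum 5 MiB per part except the last) and passes the result to boto3's managed transfer. The helper names `plan_parts` and `upload_large_file`, the 128 MiB default, and the thread count are assumptions chosen for the example; the upload function needs boto3 installed and AWS credentials configured.

```python
import math
import os

MIB = 1024 * 1024
MAX_PARTS = 10_000      # S3 limit on parts per multipart upload
MIN_PART = 5 * MIB      # S3 minimum part size (the last part may be smaller)


def plan_parts(object_bytes: int, chunk_bytes: int = 128 * MIB) -> tuple[int, int]:
    """Return (part_size, part_count), growing the part size if 10,000 parts is not enough."""
    if object_bytes <= 0:
        raise ValueError("object_bytes must be positive")
    part_size = max(chunk_bytes, MIN_PART)
    # Grow the part size until the whole object fits in MAX_PARTS parts.
    part_size = max(part_size, math.ceil(object_bytes / MAX_PARTS))
    return part_size, math.ceil(object_bytes / part_size)


def upload_large_file(path: str, bucket: str, key: str, threads: int = 16) -> None:
    """Upload one file with parallel multipart parts (hypothetical wrapper around boto3)."""
    import boto3                                  # imported lazily: optional dependency
    from boto3.s3.transfer import TransferConfig

    part_size, _ = plan_parts(max(os.path.getsize(path), 1))
    config = TransferConfig(
        multipart_threshold=part_size,   # switch to multipart at this size
        multipart_chunksize=part_size,   # size of each uploaded part
        max_concurrency=threads,         # parts uploaded in parallel
    )
    boto3.client("s3").upload_file(path, bucket, key, Config=config)
```

For a 1 TiB object the planner keeps the 128 MiB chunk (8,192 parts); for a 5 TiB object it enlarges the parts so that the upload still fits in 10,000 parts.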

3. Offline Physical Transfer Devices (When Network Is Too Slow)

3.1 Snowcone

  •         8 TB usable storage
  •         Small edge device
  •         USB-C powered

Used for remote, rugged, bandwidth-limited sites.

 3.2 Snowball Edge (Standard for PB transfers)

Two variants:

  •         Snowball Edge Storage Optimized (~80 TB usable)
  •         Snowball Edge Compute Optimized (~40 TB usable + compute)

Security:

  •         AES-256 encryption
  •         TPM
  •         Tamper evident
  •         Encrypted end-to-end

Typical ingestion workflow:

  1.      AWS ships device
  2.      twtech loads data locally (40–80 TB per device)
  3.      Ship back to AWS
  4.      AWS ingests to S3
  5.      Verification completed

Scale:

  •         10 devices = 0.8 PB
  •         100 devices = 8 PB
  •         Multi-day turnaround depending on shipping
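The scale figures above follow from dividing the dataset by the usable capacity per device. A minimal sketch of that arithmetic, assuming the ~80 TB usable figure quoted for the Storage Optimized variant; the function names and the `devices_per_round` parameter (how many devices a site can fill at once) are hypothetical.

```python
import math


def snowball_devices(dataset_tb: float, usable_tb_per_device: float = 80.0) -> int:
    """Number of devices needed to hold dataset_tb terabytes."""
    if dataset_tb <= 0 or usable_tb_per_device <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(dataset_tb / usable_tb_per_device)


def shipping_rounds(devices: int, devices_per_round: int) -> int:
    """Number of ship-load-return cycles if only devices_per_round can be on site at once."""
    if devices <= 0 or devices_per_round <= 0:
        raise ValueError("counts must be positive")
    return math.ceil(devices / devices_per_round)


if __name__ == "__main__":
    needed = snowball_devices(800)   # 0.8 PB
    print(needed, "devices,", shipping_rounds(needed, 4), "rounds at 4 devices per round")
```

Because each round adds a full shipping turnaround, the number of rounds often matters more to the schedule than the raw device count.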

🚚 3.3 AWS Snowmobile (100 PB per truck)

Industrial-scale data migration solution.

Best for:

  •         10 PB – 1 EB (massive archives, media libraries, seismic data)

Transfer rate:

  •         ~1 Tbps aggregated internal write capacity

4. Hybrid Cloud Replication Services

4.1 AWS Storage Gateway

Used for:

  •         File gateway → S3-backed SMB/NFS
  •         Volume gateway → Backup/DR
  •         Tape gateway → Replace physical tape libraries

Not designed for one-time bulk loads, but good for gradual ingestion.

4.2 FSx Services

  •         FSx for NetApp ONTAP
  •         FSx for Windows
  •         FSx for Lustre

Each provides native replication tools (SnapMirror, robocopy, HSM workflows).

4.3 Database Migration Services

Tools:

  •         AWS DMS
  •         Babelfish
  •         Oracle RMAN to S3
  •         PostgreSQL pg_dump / pg_basebackup
  •         MySQL logical/physical dumps
  •         DynamoDB import from S3

5. VM, Backup, and Large-Scale Migration Tools

5.1 VM Migration

  •         AWS MGN (Application Migration Service)
  •         AWS Server Migration Service (legacy)
  •         VMware HCX → AWS

HCX can transfer hundreds of VMs via WAN-optimized links.

5.2 Backup Tools

  •         Veeam → AWS (S3, Glacier, VTL)
  •         CommVault
  •         Rubrik
  •         NetBackup

These can hydrate backups directly into AWS.

6. Performance Architecture: End-to-End Flow

[Figure: advanced view of how data flows into AWS]

7. Choosing the Right Method

Data Volume            | Network Available | Best Option
< 5 TB                 | Good Internet     | Direct upload / DataSync
5–50 TB                | 1–10 Gbps         | DataSync / DX
50–500 TB              | <5 Gbps           | Snowball Edge
500 TB – 10 PB         | DX < 10 Gbps      | Snowball (multiple)
>10 PB                 | DX insufficient   | Snowmobile
Continuous Replication | 10+ Gbps DX       | Direct Connect + DataSync
Global Upload          | Distributed users | S3 Transfer Acceleration
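The volume-based rows of the table above can be expressed as a small lookup. This is only a sketch of the table, not an AWS-provided rule: the name `recommend`, the boundary handling (lower bound inclusive, upper bound exclusive), and treating "Good Internet" and "DX insufficient" as having no numeric bandwidth test are all assumptions. When the table has no row for a combination, the function returns None rather than guessing.

```python
from typing import Callable, Optional

# (min_tb inclusive, max_tb exclusive, bandwidth test in Gbps, best option)
Row = tuple[float, float, Callable[[float], bool], str]

TABLE: list[Row] = [
    (0, 5, lambda g: True, "Direct upload / DataSync"),     # "Good Internet": no numeric bound given
    (5, 50, lambda g: 1 <= g <= 10, "DataSync / DX"),
    (50, 500, lambda g: g < 5, "Snowball Edge"),
    (500, 10_000, lambda g: g < 10, "Snowball (multiple)"),
    (10_000, float("inf"), lambda g: True, "Snowmobile"),   # "DX insufficient": no numeric bound given
]


def recommend(volume_tb: float, link_gbps: float) -> Optional[str]:
    """Return the table's best option, or None if the table does not cover the case."""
    for lo, hi, bandwidth_ok, option in TABLE:
        if lo <= volume_tb < hi:
            return option if bandwidth_ok(link_gbps) else None
    return None


if __name__ == "__main__":
    print(recommend(200, 1))    # Snowball Edge
    print(recommend(200, 10))   # None: the table has no row for this combination
```

The None result is deliberate: with 200 TB and a 10 Gbps link, an online transfer may well be the better choice, but the table itself does not say so.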

8. Performance Optimization Techniques

TCP Tuning

  •         TCP Window Size: 16–256 MB
  •         Increase buffer sizes
  •         Enable BDP-based tuning
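BDP (bandwidth-delay product) is the amount of data that must be in flight to keep a link full: bandwidth multiplied by round-trip time. The sketch below shows where window sizes in the tens of megabytes come from, and how many parallel streams are needed when the window is smaller than the BDP. The function names and the 10 Gbps / 50 ms example values are assumptions for illustration.

```python
import math


def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes: the TCP window needed to fill the link with one stream."""
    return link_gbps * 1e9 * (rtt_ms / 1000) / 8


def streams_needed(link_gbps: float, rtt_ms: float, window_bytes: int) -> int:
    """Parallel TCP streams needed to fill the link when each stream is capped by window_bytes."""
    per_stream_bps = window_bytes * 8 / (rtt_ms / 1000)   # one window per round trip
    return max(1, math.ceil(link_gbps * 1e9 / per_stream_bps))


if __name__ == "__main__":
    window = 16 * 1024 * 1024   # 16 MiB window
    print(f"BDP: {bdp_bytes(10, 50) / 1e6:.1f} MB")
    print(f"Streams needed with a 16 MiB window: {streams_needed(10, 50, window)}")
```

For a 10 Gbps link with a 50 ms round trip the BDP is 62.5 MB, so a single stream with a 16 MiB window cannot fill the link; about four such streams can.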

Parallelism

  •         10–100 parallel upload streams for S3
  •         Multi-threaded DataSync

Compression

  •         Use at source if CPU available
  •         DataSync compresses automatically

Chunking

  •         S3: 64–256 MB multipart chunks

9. Security

  •         In-flight encryption (TLS)
  •         At-rest encryption with KMS keys
  •         Snowball/Snowmobile: AES-256 XTS encryption
  •         IAM and S3 bucket policies for access control
  •         VPC endpoints / PrivateLink

10. Cost Considerations

Method                | Cost Type
DataSync              | Per-GB ($0.0125/GB)
Snowball              | Per device + shipping
Snowmobile            | Contracted event-based
Direct Connect        | Port-hour + data transfer
Transfer Acceleration | Premium per-GB
S3 Storage            | Standard S3 pricing
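The DataSync row is the only one with a concrete rate, so it is the only one that can be estimated here. A minimal sketch using the $0.0125/GB figure listed above and decimal units (1 TB = 1,000 GB); the function name is hypothetical, and the result covers the DataSync transfer fee only, not S3 storage, requests, or network charges.

```python
DATASYNC_USD_PER_GB = 0.0125   # rate quoted in the cost table above


def datasync_transfer_cost_usd(data_tb: float) -> float:
    """Estimated DataSync per-GB fee for moving data_tb terabytes (decimal units)."""
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    return data_tb * 1000 * DATASYNC_USD_PER_GB


if __name__ == "__main__":
    for tb in (10, 100, 1000):
        print(f"{tb} TB: ${datasync_transfer_cost_usd(tb):,.2f}")
```

At this rate, 100 TB comes to about $1,250 and 1 PB to about $12,500 in DataSync fees. Actual pricing varies by Region and over time, so the constant should be checked against the current price list.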

 
