Monday, September 8, 2025

Amazon Apache Flink | Overview & Hands-On.

Scope:

  • Intro,
  • Useful link to documentation,
  • The Concept: Amazon Apache Flink,
  • Key Features,
  • How It Works (Flow),
  • Architecture Overview,
  • Security & Permissions,
  • Operational Deep Dive,
  • Sample Use Cases,
  • Best Practices,
  • Insights,
  • Project: Hands-On.

Intro:

    • Amazon Managed Service for Apache Flink is a fully managed, serverless service from Amazon Web Services (AWS).
    • It allows twtech to build and run applications using the Apache Flink framework.
    • It simplifies low-latency processing of real-time streaming data, without requiring twtech to manage the underlying infrastructure or clusters.
    • The service was formerly known as Kinesis Data Analytics for Apache Flink.

Useful link to documentation
https://aws.amazon.com/blogs/aws/announcing-amazon-managed-service-for-apache-flink-renamed-from-amazon-kinesis-data-analytics/

 The Concept: Amazon Apache Flink

    • Amazon Managed Service for Apache Flink lets twtech build and run real-time stream processing applications using Apache Flink, an open-source distributed stream processing framework.
    • It’s fully managed, meaning AWS handles provisioning, scaling, patching, and fault-tolerance, so twtech can focus on building event-driven pipelines instead of managing infrastructure.

 Key Features

1.     Streaming ETL (extract, transform, load)

    •    Transform, enrich, and aggregate streaming data in real time.
    •    Example: Join a Kinesis data stream with reference data from S3 or DynamoDB.

2.     Event-time Processing

    •   Supports out-of-order and late-arriving events using Flink’s watermarks.

3.     Stateful Applications

    •    Manages application state (windows, aggregations, joins) with checkpointing and savepoints.
    •    State is stored in Amazon S3 (durable, highly available).

4.     Integrations

    •    Sources: Kinesis Data Streams, Amazon MSK (Managed Streaming for Apache Kafka), self-managed Kafka, Amazon S3.
    •    Sinks: Kinesis Data Streams, Amazon Data Firehose (formerly Kinesis Data Firehose), Amazon S3, OpenSearch, DynamoDB, Redshift, custom sinks.

5.     Elastic Scaling

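    •    Auto scaling adjusts application parallelism as throughput changes.
    •    twtech can also change the parallelism of a running application; the service applies the change by restarting the job from a snapshot, so state is preserved but processing pauses briefly.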

6.     Monitoring

    •    Amazon CloudWatch for logs and metrics.
    •    AWS CloudTrail for auditing.
    •    Managed Prometheus/Grafana for detailed Flink metrics.
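
To make the event-time feature above more concrete, here is a minimal pure-Python sketch of how a watermark lets a tumbling window tolerate out-of-order events. It does not use the Flink API. The function name `tumbling_window_counts`, the integer timestamps, and the 5-second out-of-orderness bound are all assumptions made for illustration; the watermark rule (highest timestamp seen minus a fixed bound) mimics the idea behind Flink's bounded-out-of-orderness strategy in a simplified form.

```python
from collections import defaultdict


def tumbling_window_counts(timestamps, window_size, max_out_of_orderness):
    """Count events per event-time tumbling window.

    timestamps: event times in ARRIVAL order (may be out of order).
    Returns (emitted, dropped): emitted is a list of (window_start, count)
    in firing order, dropped lists events whose window had already fired.
    """
    open_windows = defaultdict(int)  # window start -> running count
    emitted, dropped = [], []
    max_seen = None

    for ts in timestamps:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        # Watermark: "no event older than this is expected any more".
        watermark = max_seen - max_out_of_orderness
        window_start = ts - (ts % window_size)

        if window_start + window_size <= watermark:
            dropped.append(ts)  # window already fired: the event is late
            continue
        open_windows[window_start] += 1

        # Fire every window whose end is at or before the watermark.
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for start in ready:
            emitted.append((start, open_windows.pop(start)))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        emitted.append((start, open_windows[start]))
    return emitted, dropped


if __name__ == "__main__":
    arrivals = [1, 4, 3, 11, 7, 16, 2, 21]  # hypothetical event times (seconds)
    print(tumbling_window_counts(arrivals, window_size=10, max_out_of_orderness=5))
```

In this sketch the event with timestamp 7 arrives after 11 but is still counted, because the watermark has not yet passed the end of its window. The event with timestamp 2 arrives after the window [0, 10) has fired, so it is treated as late. A real Flink job could route such events to a side output instead of dropping them.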

 How Apache Flink Works (Flow)

1.     Data Sources (Kinesis Data Streams, MSK, Kafka, S3)
            ⬇

2.     Amazon Managed Service for Apache Flink

    •    Runs Flink applications with a managed runtime
    •    Provides windowing, filtering, joins, aggregations
    •    Maintains application state in S3
            ⬇

3.     Data Sinks (Kinesis Firehose, S3, DynamoDB, OpenSearch, Redshift, custom apps)
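
The source, processing, and sink flow above can be sketched in plain Python as a chain of generators. This is only an illustration of the shape of a streaming ETL step, not Flink code. The record fields (`device_id`, `temp_c`), the `REFERENCE` lookup table (standing in for reference data held in S3 or DynamoDB), and the list used as a sink are all hypothetical.

```python
# Hypothetical reference data, standing in for a lookup table in S3 or DynamoDB.
REFERENCE = {
    "dev-1": {"site": "plant-a"},
    "dev-2": {"site": "plant-b"},
}


def source(records):
    """Stand-in for a streaming source: yields one record at a time."""
    for record in records:
        yield record


def enrich(records, reference):
    """Join each record with reference data and add a derived field."""
    for record in records:
        device = reference.get(record["device_id"])
        if device is None:
            continue  # unknown device: skipped in this sketch
        yield {
            **record,
            "site": device["site"],
            "temp_f": record["temp_c"] * 9 / 5 + 32,
        }


def sink(records, destination):
    """Stand-in for a sink: appends each processed record to a list."""
    for record in records:
        destination.append(record)


if __name__ == "__main__":
    incoming = [
        {"device_id": "dev-1", "temp_c": 20},
        {"device_id": "dev-9", "temp_c": 25},  # no reference entry
        {"device_id": "dev-2", "temp_c": 30},
    ]
    out = []
    sink(enrich(source(incoming), REFERENCE), out)
    print(out)
```

Because each stage is a generator, records move through the pipeline one at a time rather than as a batch, which is the property the managed runtime provides at scale.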

 Architecture Overview

Producers → Kinesis Data Streams / MSK / Kafka / IoT Core

Amazon Managed Service for Apache Flink (Core Layer)

    •       Job Manager (coordinates tasks)
    •       Task Managers (execute stream tasks)
    •       State backend on S3 (durable checkpoints)
    •       Auto scaling + high availability

Consumers → Firehose, Redshift, DynamoDB, OpenSearch, S3
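
As a rough illustration of how work is spread across Task Managers, the sketch below partitions records by key across a fixed number of parallel subtasks, so that every record with the same key lands on the same subtask and can share state there. This is a simplification: Flink itself assigns keys to key groups using its own hashing, and the CRC32 hash, the `assign_subtask` helper, and the sample records here are assumptions for illustration only.

```python
import zlib


def assign_subtask(key, parallelism):
    """Map a key to a subtask index with a stable hash (simplified)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def partition(records, key_field, parallelism):
    """Group records into one list per subtask, keyed by key_field."""
    subtasks = [[] for _ in range(parallelism)]
    for record in records:
        index = assign_subtask(record[key_field], parallelism)
        subtasks[index].append(record)
    return subtasks


if __name__ == "__main__":
    clicks = [
        {"user": "alice", "page": "/home"},
        {"user": "bob", "page": "/cart"},
        {"user": "alice", "page": "/checkout"},
        {"user": "carol", "page": "/home"},
    ]
    for index, batch in enumerate(partition(clicks, "user", parallelism=3)):
        print(index, batch)
```

Raising `parallelism` spreads keys over more subtasks, but a single hot key still maps to exactly one subtask, which is one reason skewed keys can cause backpressure.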

Security & Permissions
    • IAM: Fine-grained permissions for reading/writing streams, S3, DynamoDB, etc.
    • Encryption:
      • At rest (KMS) for S3 state, streams, and sinks.
      • In transit with TLS.
    • VPC integration: Run Flink apps in private subnets, connect to MSK, RDS, etc.
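
As an example of the fine-grained IAM permissions mentioned above, the following sketch builds a least-privilege policy document that lets an application role read one Kinesis stream and write to another. The ARNs, account ID, and stream names are hypothetical placeholders, and the exact set of actions a given connector version needs may differ, so treat the action list as a starting point to verify rather than a definitive policy.

```python
import json


def build_stream_policy(source_stream_arn, sink_stream_arn):
    """Return an IAM policy document (as a dict) scoped to two streams."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSourceStream",
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStreamSummary",
                    "kinesis:ListShards",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                ],
                "Resource": source_stream_arn,
            },
            {
                "Sid": "WriteSinkStream",
                "Effect": "Allow",
                "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
                "Resource": sink_stream_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARNs: replace the region, account ID, and stream names.
    policy = build_stream_policy(
        "arn:aws:kinesis:us-east-1:123456789012:stream/twtech-input",
        "arn:aws:kinesis:us-east-1:123456789012:stream/twtech-output",
    )
    print(json.dumps(policy, indent=2))
```

Scoping each statement to a single stream ARN, rather than `*`, keeps the application role from reading or writing streams it does not own.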

 Operational Deep Dive

  •  Checkpointing
    •    Automatic, periodic snapshots of state written to S3.
    •    Used for fault tolerance: jobs can restart from the latest checkpoint.
  •  Savepoints
    •    Manual snapshots for version upgrades or app migration.
    •    Ensures state continuity across deployments.
  •  Scaling
    •    Horizontal scaling: add more Task Managers (more KPUs).
    •    Vertical scaling: run more parallel subtasks on each Task Manager (higher parallelism per KPU).
    •    Elastic scaling: AWS-managed auto-scaling policies.
  •  Failure Recovery
    •    If a Task Manager fails, Flink restarts the affected tasks from the latest checkpoint.
    •    If the Job Manager fails, the service launches a new one automatically.
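
The checkpoint and recovery behaviour described above can be illustrated with a small simulation. The sketch below is not Flink code: `CountingJob`, its in-memory "checkpoint", and the simulated failure are hypothetical. It shows the core idea that a checkpoint stores the operator state together with the input position, so replaying from that position after a failure yields the same result as an uninterrupted run.

```python
class CountingJob:
    """Counts occurrences of each key and checkpoints every N records."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.counts = {}
        self.offset = 0  # position of the next record to read
        self.last_checkpoint = {"offset": 0, "counts": {}}

    def process(self, stream, fail_at=None):
        while self.offset < len(stream):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated task failure")
            key = stream[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                # Snapshot state AND input position together.
                self.last_checkpoint = {
                    "offset": self.offset,
                    "counts": dict(self.counts),
                }

    def restore(self):
        """Roll back to the last checkpoint, discarding later progress."""
        self.offset = self.last_checkpoint["offset"]
        self.counts = dict(self.last_checkpoint["counts"])


def run_with_failure(stream, checkpoint_every, fail_at):
    job = CountingJob(checkpoint_every)
    try:
        job.process(stream, fail_at=fail_at)
    except RuntimeError:
        job.restore()        # recover from the latest checkpoint
        job.process(stream)  # replay from the checkpointed offset
    return job.counts


if __name__ == "__main__":
    events = ["a", "b", "a", "c", "b", "a", "a"]
    print(run_with_failure(events, checkpoint_every=3, fail_at=5))
```

Records processed after the last checkpoint but before the failure are read again after recovery. Because the state is rolled back at the same time, they are not double counted.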

Sample Use Cases

1.     Real-Time Analytics

  •    Process clickstreams, IoT sensor data, log events, or financial transactions.

2.     Fraud Detection

  •    Detect anomalies in payments in real time using event-time joins & ML inference.

3.     Streaming ETL

  •    Cleanse, normalize, and enrich IoT or log data before storing in Redshift/S3.

4.     Monitoring & Alerting

  •    Feed processed data into OpenSearch/Grafana dashboards.
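
To illustrate the fraud-detection use case, here is a minimal sketch of a per-card "velocity" rule: raise an alert when a card makes more than a given number of transactions inside a sliding time window. It keeps keyed state (one deque of timestamps per card), which is the kind of state a Flink keyed operator would manage. The function name, the thresholds, and the sample transactions are hypothetical, and the input is assumed to be ordered by event time.

```python
from collections import defaultdict, deque


def flag_rapid_transactions(transactions, window_seconds, max_in_window):
    """Return (card_id, timestamp, count) alerts for bursts of activity.

    transactions: iterable of (card_id, timestamp) ordered by timestamp.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still in the window
    alerts = []
    for card_id, ts in transactions:
        window = recent[card_id]
        window.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) > max_in_window:
            alerts.append((card_id, ts, len(window)))
    return alerts


if __name__ == "__main__":
    txs = [("c1", 0), ("c2", 5), ("c1", 10), ("c1", 20),
           ("c1", 25), ("c2", 100), ("c1", 200)]
    print(flag_rapid_transactions(txs, window_seconds=60, max_in_window=3))
```

Evicting old timestamps on every event keeps the per-key state bounded, in line with the advice to avoid unbounded state.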

 Best Practices

    •         Use event-time semantics (with watermarks) to handle late data.
    •         Enable checkpoints frequently (every 30–60s).
    •         Store state in S3 for durability.
    •         Use RocksDB state backend for large stateful apps.
    •         Leverage savepoints when upgrading apps.
    •         Deploy via CI/CD (CloudFormation, CDK, Terraform) to ensure reproducibility.
    •         Monitor backpressure & checkpoint durations to fine-tune parallelism.
    •         Keep state size manageable (avoid unbounded windows).
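
Several of the practices above (checkpoint interval, parallelism, metrics level) map onto application settings. The sketch below builds a settings dictionary in the shape that, to the best of my understanding, the `kinesisanalyticsv2` API expects under `FlinkApplicationConfiguration`; verify the field names against the API reference before use. The helper names, default values, and the KPU estimate (parallelism divided by parallelism per KPU, rounded up) are assumptions for illustration, and no AWS call is made.

```python
import math


def build_flink_configuration(checkpoint_interval_seconds=60,
                              parallelism=4,
                              parallelism_per_kpu=1):
    """Return a dict of Flink application settings (no AWS call is made)."""
    if checkpoint_interval_seconds <= 0:
        raise ValueError("checkpoint interval must be positive")
    if parallelism < 1 or parallelism_per_kpu < 1:
        raise ValueError("parallelism values must be at least 1")
    return {
        "CheckpointConfiguration": {
            "ConfigurationType": "CUSTOM",
            "CheckpointingEnabled": True,
            "CheckpointInterval": checkpoint_interval_seconds * 1000,  # ms
            "MinPauseBetweenCheckpoints": 5000,  # ms
        },
        "MonitoringConfiguration": {
            "ConfigurationType": "CUSTOM",
            "MetricsLevel": "TASK",
            "LogLevel": "INFO",
        },
        "ParallelismConfiguration": {
            "ConfigurationType": "CUSTOM",
            "Parallelism": parallelism,
            "ParallelismPerKPU": parallelism_per_kpu,
            "AutoScalingEnabled": True,
        },
    }


def estimated_processing_kpus(config):
    """Rough KPU estimate: parallelism / parallelism per KPU, rounded up."""
    p = config["ParallelismConfiguration"]
    return math.ceil(p["Parallelism"] / p["ParallelismPerKPU"])


if __name__ == "__main__":
    cfg = build_flink_configuration(checkpoint_interval_seconds=30,
                                    parallelism=8, parallelism_per_kpu=2)
    print(cfg["CheckpointConfiguration"]["CheckpointInterval"])
    print(estimated_processing_kpus(cfg))
```

Keeping these settings in code (or in CloudFormation, CDK, or Terraform) rather than editing them in the console makes deployments reproducible, as recommended above.
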

Insights:
    • Apache Flink can read from Kinesis Data Streams (KDS).
    • Apache Flink cannot read from Amazon Data Firehose; Firehose can only be used as a sink.

Project: Hands-On

  • How twtech creates and uses an Amazon Managed Service for Apache Flink application to process real-time streaming data with low latency, without managing the underlying infrastructure or clusters.

Search for service: Kinesis (Real-Time Streaming)

  • Click on the general menu tab to expand it and see all features

  • From Data Streams, click on: Managed Apache Flink

How it works:

YouTube resource link: https://youtu.be/vI1GiMSHuxM

    •        Amazon Managed Service for Apache Flink makes it easy to build and run real-time streaming applications using Apache Flink.
    •        Amazon Managed Service for Apache Flink takes care of everything required to run streaming applications.
    •        There are no servers or clusters to manage and no compute or storage infrastructure to set up, and twtech pays only for the resources it uses.
    •        twtech can easily set up and integrate data sources or destinations with minimal code, process data continuously with sub-second latencies, and respond to events in real time.
    •        Amazon Managed Service for Apache Flink takes care of the critical tasks of keeping twtech's system secure, updated, compliant, and optimized, so twtech can focus on building applications.


  • Create streaming application:

  • Managed Service for Apache Flink continuously reads and analyzes data from a connected streaming source in real time. 
  • Managed Service for Apache Flink resources are not covered under the AWS Free Tier, and usage-based charges apply. 



  • Alternatively:

  • At this point, twtech can run the application (twtech-springapp) and monitor metrics via: Open Apache Flink dashboard





