Sunday, August 31, 2025

Amazon Athena Federated Query | Deep Dive.


Scope:

  • Intro,
  • The Concept: Athena Federated Query,
  • How it Works,
  • Architecture & Key Components,
  • Flow,
  • Supported Data Sources,
  • Use Cases,
  • Performance & Optimization,
  • Security,
  • Pricing,
  • Best Practices.

Intro:

    • Amazon Athena Federated Query is built on top of standard Athena.
    • It extends querying capabilities beyond Amazon S3.

1. The Concept: Athena Federated Query

Athena Federated Query enables twtech to run SQL queries across multiple data sources, not just Amazon S3.

    • Uses the same Presto-based Athena engine.
    • Supports structured, semi-structured, and unstructured data.
    • Queries are executed in-place — data is read directly from the source.
    • Great for single-pane-of-glass analytics across diverse stores.
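The single-pane-of-glass idea can be sketched with one query that joins an S3 data lake table to a federated relational source. The catalog, database, and table names below (twtech_lake, twtech_mysql, etc.) are hypothetical placeholders for whatever data sources twtech actually registers:

-- Join an S3 data lake table with a federated MySQL source in one statement
SELECT c.customer_id, c.name, o.order_total
FROM "AwsDataCatalog"."twtech_lake"."customers" c   -- table backed by S3
JOIN "twtech_mysql"."sales"."orders" o              -- Lambda-connector data source
  ON c.customer_id = o.customer_id;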

2. How it Works

  1. Data Sources: twtech can connect Athena to RDS, Redshift, DynamoDB, Aurora, CloudWatch Logs, and even external sources like MySQL, PostgreSQL, and SaaS apps.
  2. Connectors: Uses Athena Query Federation SDK and Lambda-based data source connectors.
    • AWS provides prebuilt connectors (S3, Redshift, DynamoDB, DocumentDB, CloudWatch, etc.).
    • twtech can also build custom connectors for any JDBC-compliant source.
  3. Glue Data Catalog: Manages schemas and table definitions across sources.
  4. Query Execution: Athena’s Presto engine coordinates with the connectors; the connectors fetch data from the source, and results are streamed back into Athena.
  5. Outputs: Results are stored in S3 and accessible to BI (Business Intelligence) tools such as QuickSight, Tableau, and Power BI.
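Once a connector is registered as a data source, it is addressed like any other catalog in SQL. A minimal sketch, assuming a DynamoDB connector registered under the hypothetical name twtech_ddb (the DynamoDB connector exposes tables under a default schema):

SELECT * FROM "twtech_ddb"."default"."orders" LIMIT 10;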

3. Architecture & Key Components:

    • Athena Engine (Presto/Trino)
    • AWS Glue Data Catalog (metadata)
    • Federated Query Connectors (Lambda functions)
    • Source Databases (RDS, Redshift, DynamoDB, external JDBC, SaaS APIs)
    • S3 (query results storage)
    • BI Tools (QuickSight, Tableau, etc.)

Flow:

Athena Query → Glue Catalog → Presto Engine → Lambda Connector → Source DB → Results → S3 + BI (Business Intelligence) Tools.

4. Supported Data Sources

Out-of-the-box Athena Federation Connectors include:

  • AWS Sources:
    • DynamoDB
    • RDS (MySQL, PostgreSQL)
    • Aurora
    • Redshift
    • DocumentDB
    • CloudWatch Logs
    • CloudWatch Metrics
  • External Sources:
    • MySQL
    • PostgreSQL
    • SAP HANA
    • Teradata
    • Oracle
    • Google BigQuery
    • Snowflake
    • Custom JDBC

5. Use Cases

    • Unified Analytics: Query across S3 (data lake) + DynamoDB (NoSQL) + RDS (transactional DB).
    • Ad-Hoc Queries: Business analysts query without ETL pipelines.
    • Data Lake Federation: Extend your data lake to operational databases.
    • Cross-System Joins: e.g., join CloudWatch Logs with RDS transactions for troubleshooting.
    • Compliance & Audits: Query multiple systems for reports without data duplication.

6. Performance & Optimization

    • Pushdown Processing: Whenever possible, queries are pushed down to the source (e.g., WHERE clauses applied in the DB).
    • Partitioning in Source: Helps minimize scanned data.
    • Limit Data Movement: Joins across sources may cause large data transfers. Best to pre-aggregate in source systems.
    • Connector Tuning: Lambda-based connectors need concurrency and timeout tuning for large queries.
    • Caching / CTAS: Store federated results in S3 (Parquet/ORC) for repeat queries.
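The Caching / CTAS point can be sketched as follows; the database, table, and source names are illustrative, and the twtech_cache database is assumed to already exist:

-- Materialize a federated aggregation into S3 as Parquet for cheap repeat queries
CREATE TABLE twtech_cache.daily_orders
WITH (
  format = 'PARQUET',
  external_location = 's3://twtechs3/cache/daily_orders/'
)
AS
SELECT order_date, count(*) AS orders
FROM "twtech_mysql"."sales"."orders"   -- hypothetical federated source
GROUP BY order_date;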

7. Security

    • IAM + Lake Formation: Controls access to Athena and connectors.
    • Encryption: Data in transit via TLS; results encrypted in S3.
    • Granular Access: Use fine-grained permissions in Lake Formation & source DB.

8. Pricing

    • Athena charges $5 per TB scanned, across all federated sources (e.g., a query scanning 200 GB costs about $1).
    • Additional charges:
      • Lambda execution time (for connectors).
      • Data transfer costs if cross-region queries are executed.

9. Best Practices

    • Use federated queries for exploration & occasional joins, not heavy analytics.
    • For repeated workloads, export data to S3 in Parquet via CTAS (Create Table As Select).
    • Keep connectors optimized (configure memory, concurrency, timeout).
    • Secure and audit with CloudTrail + Lake Formation.
    • Monitor query performance with Athena Workgroups + CloudWatch.

Final thoughts:

  • Athena Federated Query = Athena + Lambda Connectors + Glue Catalog: it lets twtech query any data source (cloud DBs, logs, SaaS, S3) with one SQL interface, without ETL (Extract, Transform, Load) pipelines.



Amazon Athena | Deep Dive & Hands-On.


Scope:

  • The Concept of Athena,
  • How Athena Works,
  • What Amazon Athena is often used for,
  • Query Flow,
  • Architecture,
  • Key Features,
  • Performance Optimization,
  • Integrations,
  • Pricing,
  • Use Cases,
  • Best Practices,
  • Final thoughts,
  • Project: Hands-On.

Intro:

    • Amazon Athena is a serverless, interactive query service that enables data analysts to analyze data directly in Amazon S3 using standard SQL. 
    • Amazon Athena is designed for ad-hoc analysis and data exploration without needing to manage any infrastructure or perform complex ETL (Extract, Transform, Load) jobs.

1. The Concept of Athena.

Amazon Athena is a serverless, interactive query service that lets twtech run SQL queries directly on data stored in Amazon S3.

    • No infrastructure to manage (serverless).
    • Uses Presto/Trino SQL engine under the hood.
    • twtech pays only for the data scanned by queries.

What Amazon Athena is often used for:

    • Ad-hoc data exploration
    • Log analytics
    • Data lake queries
    • BI tool integration
    • Pre-ETL exploration

2. How Athena Works

    1. Data Storage: Athena does not store data itself; it queries data directly in S3.
    2. Schema Definition: Metadata (tables, columns, partitions) is stored in the AWS Glue Data Catalog.
    3. Query Engine: SQL queries are executed using Presto (Trino-based distributed SQL engine).
    4. Results: Query results are stored back into S3 (in a results bucket).

 Query Flow

S3 (raw data) → Athena engine (SQL query) → Results stored in S3

 3. Architecture


  • Data Source Layer:
    • Amazon S3 (primary source)
    • Federated Queries (Athena can query RDS, Redshift, DynamoDB, etc. via connectors)
  • Metadata Layer:
    • AWS Glue Data Catalog (shared metadata across Athena, Redshift Spectrum, EMR, Lake Formation)
  • Query Execution Layer:
    • Presto engine, distributed across multiple nodes
  • Output Layer:
    • Query results stored in S3
    • Integration with QuickSight, Tableau, Power BI, etc.

4. Key Features

    • Serverless: No clusters or infrastructure to manage.
    • Standard SQL: ANSI SQL support, joins, aggregations, window functions.
    • Multiple Formats: Supports CSV, JSON, ORC, Avro, Parquet (columnar = better performance).
    • Federated Queries: Query across relational, NoSQL, and custom sources using Athena connectors.
    • Partitioning: Improves query performance by scanning fewer files.
    • CTAS (Create Table As Select): Write results back into S3 in optimized formats.
    • Security: IAM for auth, Lake Formation for fine-grained access, encryption (S3 SSE-KMS).

5. Performance Optimization

    • Use Columnar Formats: Converting data to Parquet/ORC reduces data scanned & cost.
    • Partition Data: Partitioning by date, region, etc. limits scanned data.
    • Compression: Store compressed files (Snappy, Gzip).
    • Optimize File Sizes: Avoid too many small files (“small file problem”); target 128 MB – 1 GB per file.
    • Predicate Pushdown: Write queries that leverage filters early.
    • CTAS / INSERT INTO: Pre-transform raw data into analytics-ready datasets.
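Several of these tips combine naturally in one CTAS statement. A hedged sketch, assuming a raw_events table already exists in twtech_database (partition columns must come last in the SELECT list):

CREATE TABLE twtech_database.events_optimized
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://twtechs3/optimized/events/',
  partitioned_by = ARRAY['event_date']
)
AS
SELECT user_id, event_type, event_time,
       date(event_time) AS event_date   -- partition column, listed last
FROM twtech_database.raw_events;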

6. Integrations

    • AWS Glue (metadata catalog, ETL jobs)
    • Amazon QuickSight (visualizations)
    • AWS Lambda (serverless data pipelines)
    • Amazon Redshift Spectrum (hybrid queries with Redshift)
    • Third-party BI Tools: Tableau, Looker, Power BI

7. Pricing

    • $5 per TB scanned (compressed & columnar = huge savings).
    • Partitioning and Parquet can reduce cost by 90% or more.
    • twtech is also charged for:
      • S3 storage
      • Glue Data Catalog requests
      • Result storage in S3

8. Use Cases

    • Log Analysis: Query CloudTrail, ELB, VPC Flow Logs, WAF logs.
    • Data Lakes: Explore S3 data without ETL.
    • Ad-hoc Queries: Business analysts can run queries without needing a Redshift cluster.
    • Federated Analytics: Query multiple databases from a single SQL interface.
    • ETL Alternative: Use CTAS to transform raw data into optimized datasets in S3.

9. Best Practices

    • Store data in open, columnar formats (Parquet/ORC).
    • Partition wisely (avoid too many tiny partitions).
    • Use Glue crawlers for schema inference.
    • Secure access with IAM + Lake Formation.
    • Monitor performance with CloudWatch + Athena Workgroups.
    • For heavy workloads, consider using Redshift (Athena is more for interactive/ad-hoc).

Final thoughts:

  •  Athena = Serverless SQL-on-S3 + Glue Data Catalog + Pay-per-query model.
  • Athena is excellent for exploratory analytics, log querying, and powering dashboards without building full ETL(Extract, Transform, Load) pipelines.


Project: Hands-On

How twtech uses Amazon Athena Federated Query to ingest data and store it in its S3 bucket

  • Search for aws service: Athena.

How it works:

  • Launch query editor: to Query data from S3 (twtech-S3bucket)

Set up a query location in Amazon S3 by editing the settings:

S3 bucket in Amazon: twtechs3

Reference the s3 bucket and make sure it is: accessible

Alternatively:

  • Browse to choose the S3 data set: twtechs3

Save the location of the query in s3 bucket selected:


twtech needs to run queries for logs in the s3 bucket configured: twtechs3

  • Query logs for all objects in the s3 bucket: twtechs3
  • Run the code to create a database: create database twtech_s3_access_logs_db;

From:

To:

A query to create a table in Amazon Athena to access S3 logs

  • In Amazon Athena, twtech creates tables using DDL statements that define metadata in the AWS Glue Data Catalog (or Athena’s internal catalog). 
  • The table itself doesn’t store data.
  • Athena queries data directly in Amazon S3.

Here’s a simple example query to create a table:

CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs_db (
  id              INT,
  name            STRING,
  age             INT,
  created_at      TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://twtechs3/';
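Once created, the table can be queried like any other, assuming CSV objects matching this schema actually exist under s3://twtechs3/:

SELECT name, age
FROM twtech_s3_access_logs_db
WHERE age > 30
LIMIT 10;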

Breakdown for table: twtech_s3_access_logs_db

  • twtech_s3_access_logs_db → Database and table name.
  • Columns → Define schema (Athena supports types like string, int, bigint, boolean, double, timestamp, date, etc.).
  • ROW FORMAT / SERDE → How data is interpreted (CSV, JSON, Parquet, ORC, Avro).
  • LOCATION → S3 path where twtech data lives.
  • TBLPROPERTIES → Optional metadata flags (e.g., partitioned data, encryption).
  • Table name must be unique.

To:

More detailed logs table:

  • # AWS official link to the Athena querying documentation:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html

# Query code for Athena and s3 bucket

CREATE EXTERNAL TABLE twtech_s3_access_logs_db(
         bucket string,
         key string,
         version_id string,
         is_latest boolean,
         is_delete_marker boolean,
         size bigint,
         last_modified_date timestamp,
         e_tag string,
         storage_class string,
         is_multipart_uploaded boolean,
         replication_status string,
         encryption_status string,
         object_lock_retain_until_date bigint,
         object_lock_mode string,
         object_lock_legal_hold_status string,
         intelligent_tiering_access_tier string,
         bucket_key_status string,
         checksum_algorithm string,
         object_access_control_list string,
         object_owner string
) PARTITIONED BY (
        dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://twtechs3/'
  TBLPROPERTIES (
    "projection.enabled" = "true",
    "projection.dt.type" = "date",
    "projection.dt.format" = "yyyy-MM-dd-HH-mm",
    "projection.dt.range" = "2022-01-01-00-00,NOW",
    "projection.dt.interval" = "1",
    "projection.dt.interval.unit" = "HOURS"
  );
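Because partition projection is enabled in TBLPROPERTIES, twtech can filter on dt immediately, with no MSCK REPAIR TABLE or manual ALTER TABLE ADD PARTITION; Athena derives partition locations from the projection settings. The date value below is just an example that matches the yyyy-MM-dd-HH-mm format:

SELECT key, size, last_modified_date
FROM twtech_s3_access_logs_db
WHERE dt = '2025-01-01-00-00'
LIMIT 100;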

  • A more detailed logs table for: s3_access_logs_db

  • Breakdown for table: s3_access_logs_db
    • s3_access_logs_db → Database and table name.
    • Columns → Define schema (Athena supports types like string, int, bigint, boolean, double, timestamp, date, etc.).
    • ROW FORMAT / SERDE → How data is interpreted (CSV, JSON, Parquet, ORC, Avro).
    • LOCATION → S3 path where twtech data lives.
    • TBLPROPERTIES → Optional metadata flags (e.g., partitioned data, encryption).
    • Table name must be unique.
  • Sample for a Parquet table (faster & cheaper queries):

CREATE EXTERNAL TABLE IF NOT EXISTS twtech_database.parquet_table (
  user_id        BIGINT,
  event_type     STRING,
  event_time     TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://twtechs3/data/events/';

# Or:

CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs (
    `host` STRING,
    `identity` STRING,
    `user` STRING,
    `time` STRING,
    `request` STRING,
    `status` INT,
    `size` BIGINT,
    `referer` STRING,
    `useragent` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex'='([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)
LOCATION 's3://twtechs3/';

  • How twtech previews the logs table created: s3_access_logs_db

  • How twtech does analytics for log aggregation on its logs in S3: twtechs3

SELECT requesturi_operation, httpstatus, count(*) FROM "twtech_s3_access_logs"."twtechs3_logs"
GROUP BY requesturi_operation, httpstatus;
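A similar aggregation can surface the most frequently requested objects, assuming the table exposes a key column as S3 server access logs do:

SELECT key, count(*) AS requests
FROM "twtech_s3_access_logs"."twtechs3_logs"
GROUP BY key
ORDER BY requests DESC
LIMIT 10;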


  • How twtech gets unauthorized queries: is someone trying to query data

SELECT * FROM " s3_access_logs_db", "twtechs3_logs"  where httpstatus = '403'

Alternatively:

How twtech queries S3 data from Athena using SQL

  • Create a catalog table for Netflix-listings: create an AWS Glue Crawler (scans the data in S3 and creates a table)

  • Set crawler properties

  • Choose data sources and classifiers: Data source configuration


  • Configure security settings: IAM role


  • Set output and scheduling:
  • Create a database for: twtech-netflix-db


  • twtech Verifies that the database is created: twtech-netflix-db

  • Go ahead refresh output configuration settings and select the database created: twtech-netflix-db


Review and create


  • twtech Runs the Crawler: twtech-Netflix-Crawler

From: creating a table

To: completed

  • twtech verifies the table created on Glue:


twtech Verifies the table created in aws Athena:

How twtech queries the table: twtech-netflix-dbnetflix_listings

  • Link to download sample listings for the Netflix movies database (files should be downloaded and uploaded into a folder created in the S3 bucket):

https://github.com/Devopspat35/twtech-netflix-imdb

  • First, configure the Athena query location where the results of the query will be stored: settings

  • Manage settings:

Manage settings: s3://twtechs3/netflix-query-store/

Query result location and encryption

Location of query result - optional

Enter an S3 prefix in the current region where the query result will be saved as an object.

From: s3://twtechs3

To: s3://twtechs3/netflix-query-store/


  • How twtech previews the table:


  • Query results:


How twtech performs fine-grained filters for query: Title

SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where title = 'The Lord of the Rings: The Return of the King'

SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where directors = 'Sam Mendes'


SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where genres = 'Adventure'





