Amazon Athena - Deep Dive & Hands-On.
scope:
- The Concept of Athena,
- How Athena Works,
- What Amazon Athena is often used for,
- Query Flow,
- Architecture,
- Key Features,
- Performance Optimization,
- Integrations,
- Pricing,
- Use Cases,
- Best Practices,
- Final thoughts,
- Project: Hands-On.
Intro:
- Amazon Athena is a serverless, interactive query service that enables data analysts to analyze data directly in Amazon S3 using standard SQL.
- Amazon Athena is designed for ad-hoc analysis and data exploration without needing to manage any infrastructure or perform complex ETL (Extract, Transform, Load) jobs.
1. The Concept of Athena.
Amazon Athena is a serverless, interactive query service that lets twtech run SQL queries directly on data stored in Amazon S3.
- No infrastructure to manage (serverless).
- Uses Presto/Trino SQL engine under the hood.
- twtech pays only for the data scanned by queries.
What Amazon Athena is often used for:
- Ad-hoc data exploration
- Log analytics
- Data lake queries
- BI tool integration
- Pre-ETL exploration
2. How Athena Works
- Data Storage: Athena does not store data itself; it queries data directly in S3.
- Schema Definition: Metadata (tables, columns, partitions) is stored in the AWS Glue Data Catalog.
- Query Engine: SQL queries are executed using Presto (Trino-based distributed SQL engine).
- Results: Query results are stored back into S3 (in a results bucket).
Query Flow
S3 (raw data) ➝ Athena engine ➝ SQL query ➝ Results stored in S3
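The flow above can be illustrated with a minimal query (the table and bucket names below are assumptions for illustration, not from the walkthrough):

```sql
-- Assumes a table twtech_db.raw_events has already been defined over s3://twtechs3/raw/
SELECT event_type, count(*) AS events
FROM twtech_db.raw_events
GROUP BY event_type
ORDER BY events DESC;
-- Athena scans the S3 objects, executes the SQL,
-- and writes the result set to the configured S3 results location.
```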
3. Architecture
- Data Source Layer:
- Amazon S3 (primary source)
- Federated Queries (Athena can query RDS, Redshift, DynamoDB, etc. via connectors)
- Metadata Layer:
- AWS Glue Data Catalog (shared metadata across Athena, Redshift Spectrum, EMR, Lake Formation)
- Query Execution Layer:
- Presto engine, distributed across multiple nodes
- Output Layer:
- Query results stored in S3
- Integration with QuickSight, Tableau, Power BI, etc.
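A federated query across the layers above might look like the sketch below; `rds_mysql` is a hypothetical catalog name registered through an Athena data source connector, and `twtech_db.events` is an assumed Glue Catalog table:

```sql
-- Joins S3 data (via the Glue Data Catalog) with a relational source through a connector.
-- "rds_mysql" is a hypothetical federated catalog; twtech_db.events is assumed.
SELECT c.customer_name, count(e.event_id) AS events
FROM rds_mysql.sales.customers c
JOIN twtech_db.events e
  ON e.customer_id = c.customer_id
GROUP BY c.customer_name;
```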
4. Key Features
- Serverless: No clusters or infrastructure to manage.
- Standard SQL: ANSI SQL support, joins, aggregations, window functions.
- Multiple Formats: Supports CSV, JSON, ORC, Avro, Parquet (columnar = better performance).
- Federated Queries: Query across relational, NoSQL, and custom sources using Athena connectors.
- Partitioning: Improves query performance by scanning fewer files.
- CTAS (Create Table As Select): Write results back into S3 in optimized formats.
- Security: IAM for auth, Lake Formation for fine-grained access, encryption (S3 SSE-KMS).
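As a sketch of the CTAS feature listed above (the bucket path and table names are assumptions), raw data can be rewritten into partitioned, compressed Parquet:

```sql
-- Rewrites an assumed raw table into Snappy-compressed Parquet, partitioned by dt.
-- Partition columns must come last in the SELECT list.
CREATE TABLE twtech_db.events_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://twtechs3/optimized/events/',
  partitioned_by = ARRAY['dt']
)
AS SELECT user_id, event_type, event_time, dt
FROM twtech_db.events_raw;
```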
5. Performance Optimization
- Use Columnar Formats: Convert data to Parquet/ORC → reduces data scanned & cost.
- Partition Data: Partition by date, region, etc. → limits scanned data.
- Compression: Store compressed files (Snappy, Gzip).
- Optimize File Sizes: Avoid too many small files (“small file problem”); target 128 MB – 1 GB per file.
- Predicate Pushdown: Write queries that leverage filters early.
- CTAS / INSERT INTO: Pre-transform raw data into analytics-ready datasets.
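Partitioning and predicate pushdown from the list above work together: a query that filters on the partition key only scans the matching S3 prefixes (table and partition names are assumed for illustration):

```sql
-- Only objects under .../dt=2025-01-01/ are scanned; other partitions are pruned.
SELECT event_type, count(*) AS events
FROM twtech_db.events_parquet
WHERE dt = '2025-01-01'   -- partition predicate enables pruning
GROUP BY event_type;
```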
6. Integrations
- AWS Glue (metadata catalog, ETL jobs)
- Amazon QuickSight (visualizations)
- AWS Lambda (serverless data pipelines)
- Amazon Redshift Spectrum (hybrid queries with Redshift)
- Third-party BI Tools: Tableau, Looker, Power BI
7. Pricing
- $5 per TB scanned (compressed & columnar = huge savings).
- Partitioning and Parquet can reduce cost by 90% or more.
- twtech is also charged for:
- S3 storage
- Glue Data Catalog requests
- Result storage in S3
8. Use Cases
- Log Analysis: Query CloudTrail, ELB, VPC Flow Logs, WAF logs.
- Data Lakes: Explore S3 data without ETL.
- Ad-hoc Queries: Business analysts can run queries without needing a Redshift cluster.
- Federated Analytics: Query multiple databases from a single SQL interface.
- ETL Alternative: Use CTAS to transform raw → optimized datasets in S3.
9. Best Practices
- Store data in open, columnar formats (Parquet/ORC).
- Partition wisely (avoid too many tiny partitions).
- Use Glue crawlers for schema inference.
- Secure access with IAM + Lake Formation.
- Monitor performance with CloudWatch + Athena Workgroups.
- For heavy workloads → consider using Redshift (Athena is more for interactive/ad-hoc).
Final thoughts:
- Athena = Serverless SQL-on-S3 + Glue Data Catalog + Pay-per-query model.
- Athena is excellent for exploratory analytics, log querying, and powering dashboards without building full ETL (Extract, Transform, Load) pipelines.
Project: Hands-On
How twtech uses Amazon Athena to query log data stored in its S3 bucket
- In the AWS console, search for the service: Athena.
How it works:
- Launch the query editor to query data from S3 (twtech-S3bucket).
Set up a query result location in Amazon S3 by editing the settings:
S3 bucket: twtechs3
Reference the S3 bucket and make sure it is accessible.
Alternatively:
- Browse to choose the S3 dataset: twtechs3
Save the query result location in the selected S3 bucket.
twtech needs to run queries against the logs in the configured S3 bucket: twtechs3
- Query logs for all objects in the S3 bucket: twtechs3
- Run the statement to create a database: CREATE DATABASE twtech_s3_access_logs_db;
A query to create a table in Amazon Athena to access S3 logs
- In Amazon Athena, twtech creates tables using DDL statements that define metadata in the AWS Glue Data Catalog (or Athena’s internal catalog).
- The table itself doesn’t store data.
- Athena queries data directly in Amazon S3.
Here’s a simple example query to create a table:
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs_db.sample_table (
id INT,
name STRING,
age INT,
created_at TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
)
LOCATION 's3://twtechs3/';
Breakdown for the table DDL above:
- twtech_s3_access_logs_db.sample_table → database and table name (sample_table is an illustrative name).
- Columns → define the schema (Athena supports types like string, int, bigint, boolean, double, timestamp, date, etc.).
- ROW FORMAT / SERDE → how data is interpreted (CSV, JSON, Parquet, ORC, Avro).
- LOCATION → S3 path where twtech data lives.
- TBLPROPERTIES → optional metadata flags (e.g., partitioned data, encryption).
- The table name must be unique within the database.
More detailed logs table:
- AWS official documentation for querying an S3 Inventory report with Athena:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html
# DDL for the Athena table over the S3 bucket
CREATE EXTERNAL TABLE twtech_s3_access_logs_db.s3_inventory(
bucket string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string,
object_lock_retain_until_date bigint,
object_lock_mode string,
object_lock_legal_hold_status string,
intelligent_tiering_access_tier string,
bucket_key_status string,
checksum_algorithm string,
object_access_control_list string,
object_owner string
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://twtechs3/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.dt.type" = "date",
"projection.dt.format" = "yyyy-MM-dd-HH-mm",
"projection.dt.range" = "2022-01-01-00-00,NOW",
"projection.dt.interval" = "1",
"projection.dt.interval.unit" = "HOURS"
);
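With partition projection enabled as above, no MSCK REPAIR TABLE is needed; queries can filter directly on dt. A sketch (shown here against a table named twtech_s3_access_logs_db.s3_inventory; adjust to the table name actually created, and to a dt value that exists in the inventory):

```sql
-- Summarizes object count and total size for one projected partition value.
SELECT dt, count(*) AS objects, sum(size) AS total_bytes
FROM twtech_s3_access_logs_db.s3_inventory
WHERE dt = '2025-01-01-00-00'   -- matches the projection.dt.format pattern
GROUP BY dt;
```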
- Sample for a Parquet table (faster & cheaper queries):
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_database.parquet_table (
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://twtechs3/data/events/';
# Or, a table over S3 server access logs using a regex SerDe:
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs (
`host` STRING,
`identity` STRING,
`user` STRING,
`time` STRING,
`request` STRING,
`status` INT,
`size` BIGINT,
`referer` STRING,
`useragent` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)
LOCATION 's3://twtechs3/';
- How twtech previews the logs table created: twtech_s3_access_logs_db
- How twtech does analytics to aggregate the logs in its S3 bucket: twtechs3
SELECT requesturi_operation, httpstatus, count(*) FROM "twtech_s3_access_logs_db"."twtechs3_logs"
GROUP BY requesturi_operation, httpstatus;
- How twtech detects unauthorized queries (is someone trying to query its data?):
SELECT * FROM "twtech_s3_access_logs_db"."twtechs3_logs" WHERE httpstatus = '403'
Alternatively:
How twtech queries S3 data from Athena using SQL
- Create a catalog table for the Netflix listings: create an AWS Glue crawler (it scans the data in S3 and creates a table)
- Set crawler properties
- Choose data sources and classifiers: Data source configuration
- Configure security settings: IAM role
- Set output and scheduling:
- Create a database for: twtech-netflix-db
- twtech Verifies that the database is created: twtech-netflix-db
- Refresh the output configuration settings and select the database created: twtech-netflix-db
Review and create
- twtech runs the crawler: twtech-Netflix-Crawler
The crawler status changes from "creating a table" to "completed".
- twtech verifies the table created in AWS Glue.
- twtech verifies the table created in Amazon Athena.
How twtech queries the table: twtech-netflix-dbnetflix_listings
- Link to download sample listings for the Netflix movies database (the files should be downloaded and uploaded into a folder created in the S3 bucket):
https://github.com/Devopspat35/twtech-netflix-imdb
- First, configure the Athena query location where the results of the query will be stored: settings
- Manage settings:
Manage settings: s3://twtechs3/netflix-query-store/
Query result location and encryption
Location of query result - optional
Enter an S3 prefix in the current region where the query result will be saved as an object.
From: s3://twtechs3
To: s3://twtechs3/netflix-query-store/
- How twtech previews the table:
- Query results:
How twtech performs fine-grained filtering in a query: Title
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where title = 'The Lord of the Rings: The Return of the King'
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where directors = 'Sam Mendes'
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where genres = 'Adventure'
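If the genres column stores comma-separated values (e.g., 'Action, Adventure'), an exact match may miss rows; an alternative, assuming that layout, is a substring filter:

```sql
-- Matches rows where 'Adventure' appears anywhere in the genres string.
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings"
WHERE genres LIKE '%Adventure%'
```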