Amazon Athena - Deep Dive & Hands-On.
scope:
- The Concept of Athena,
- How Athena Works,
- What Amazon Athena is often used for,
- Query Flow,
- Architecture,
- Key Features,
- Performance Optimization,
- Integrations,
- Pricing,
- Use Cases,
- Best Practices,
- Final thoughts,
- Project: Hands-On.
Intro:
- Amazon Athena is a serverless, interactive query service that enables data analysts to analyze data directly in Amazon S3 using standard SQL.
- Amazon Athena is designed for ad-hoc analysis and data exploration without needing to manage any infrastructure or perform complex ETL (Extract, Transform, Load) jobs.
1. The Concept of Athena.
Amazon Athena is a serverless, interactive query service that lets twtech run SQL queries directly on data stored in Amazon S3.
- No infrastructure to manage (serverless).
- Uses Presto/Trino SQL engine under the hood.
- twtech Pays only for the data scanned by queries.
What Amazon Athena is often used for:
- Ad-hoc data exploration
- Log analytics
- Data lake queries
- BI tool integration
- Pre-ETL exploration
2. How Athena Works
- Data Storage: Athena does not store data itself; it queries data directly in S3.
- Schema Definition: Metadata (tables, columns, partitions) is stored in the AWS Glue Data Catalog.
- Query Engine: SQL queries are executed by the Presto/Trino distributed SQL engine.
- Results: Query results are stored back in S3 (in a results bucket).
Query Flow
SQL query ➝ Athena engine ➝ scans S3 (raw data) ➝ results stored in S3
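A minimal sketch of this flow, assuming a hypothetical external table twtech_events has already been defined over data in s3://twtechs3/ and a query results location is configured (the table and column names are assumptions, not from AWS documentation):
-- Illustrative names only (twtech_events and its columns are assumptions).
SELECT *
FROM twtech_events
WHERE event_type = 'login'
LIMIT 10;
-- Athena scans the matching objects in S3 and writes the result set
-- to the configured S3 results location.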
3. Architecture
- Data Source Layer:
  - Amazon S3 (primary source)
  - Federated Queries (Athena can query RDS, Redshift, DynamoDB, etc. via connectors - see the sketch after this list)
- Metadata Layer:
  - AWS Glue Data Catalog (shared metadata across Athena, Redshift Spectrum, EMR, Lake Formation)
- Query Execution Layer:
  - Presto engine, distributed across multiple nodes
- Output Layer:
  - Query results stored in S3
  - Integration with QuickSight, Tableau, Power BI, etc.
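A hedged federated-query sketch: it assumes a relational data source has already been registered in Athena as a catalog named twtech_rds through a Lambda-based connector, and reuses the Parquet events table defined later in the hands-on section; the schema, table, and join columns on the relational side are placeholders:
-- Join S3-backed data with a relational source through a registered connector
-- (twtech_rds, public.subscriptions, and plan_name are assumptions).
SELECT events.user_id, subs.plan_name, count(*) AS events
FROM twtech_database.parquet_table events
JOIN "twtech_rds"."public"."subscriptions" subs
  ON events.user_id = subs.user_id
GROUP BY events.user_id, subs.plan_name;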

4. Key Features
- Serverless: No clusters or infrastructure to manage.
- Standard SQL: ANSI SQL support, joins, aggregations, window functions.
- Multiple Formats: Supports CSV, JSON, ORC, Avro, Parquet (columnar = better performance).
- Federated Queries: Query across relational, NoSQL, and custom sources using Athena connectors.
- Partitioning: Improves query performance by scanning fewer files.
- CTAS (Create Table As Select): Write results back into S3 in optimized formats (see the sketch after this list).
- Security: IAM for auth, Lake Formation for fine-grained access, encryption (S3 SSE-KMS).
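A minimal CTAS sketch, assuming a raw CSV-backed table twtech_s3_access_logs_db.raw_events already exists (table, column, and bucket paths are illustrative assumptions):
CREATE TABLE twtech_s3_access_logs_db.events_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://twtechs3/data/events-parquet/',
  partitioned_by = ARRAY['event_date']
) AS
SELECT user_id, event_type, event_time, event_date   -- partition columns must come last
FROM twtech_s3_access_logs_db.raw_events;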
5. Performance Optimization
- Use Columnar Formats: Convert data to Parquet/ORC → reduces data scanned & cost.
- Partition Data: Partition by date, region, etc. → limits scanned data (see the sketch after this list).
- Compression: Store compressed files (Snappy, Gzip).
- Optimize File Sizes: Avoid too many small files (“small file problem”); target 128 MB – 1 GB per file.
- Predicate Pushdown: Write queries that leverage filters early.
- CTAS / INSERT INTO: Pre-transform raw data into analytics-ready datasets.
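A hedged sketch of partition-aware querying, assuming a hypothetical table events_by_date partitioned by an event_date string column (with Hive-style prefixes like event_date=2024-01-01/ the partitions could also be loaded with MSCK REPAIR TABLE; all names and paths here are assumptions):
-- Register a partition explicitly (illustrative names and paths).
ALTER TABLE twtech_s3_access_logs_db.events_by_date
  ADD IF NOT EXISTS PARTITION (event_date = '2024-01-01')
  LOCATION 's3://twtechs3/data/events/event_date=2024-01-01/';

-- Filtering on the partition column lets Athena skip every other prefix,
-- so only the 2024-01-01 objects are scanned and billed.
SELECT event_type, count(*) AS events
FROM twtech_s3_access_logs_db.events_by_date
WHERE event_date = '2024-01-01'
GROUP BY event_type;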
6. Integrations
- AWS Glue (metadata catalog, ETL jobs)
- Amazon QuickSight (visualizations)
- AWS Lambda (serverless data pipelines)
- Amazon Redshift Spectrum (hybrid queries with Redshift)
- Third-party BI Tools: Tableau, Looker, Power BI
7. Pricing
- $5 per TB scanned (compressed & columnar = huge savings).
- Partitioning and Parquet can reduce cost by 90% or more (see the worked example after this list).
- twtech is also charged for:
  - S3 storage
  - Glue Data Catalog requests
  - Result storage in S3
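A rough worked example using the $5/TB figure above (the data sizes are illustrative assumptions):
- Scanning 1 TB of raw CSV: 1 TB × $5/TB = $5.00 for the query.
- If Parquet + partitioning cut the scan for the same query to ~100 GB: 0.1 TB × $5/TB = $0.50, i.e. roughly the 90% saving noted above.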
8. Use Cases
- Log Analysis: Query CloudTrail, ELB, VPC Flow Logs, WAF logs (see the sketch after this list).
- Data Lakes: Explore S3 data without ETL.
- Ad-hoc Queries: Business analysts can run queries without needing a Redshift cluster.
- Federated Analytics: Query multiple databases from a single SQL interface.
- ETL Alternative: Use CTAS to transform raw → optimized datasets in S3.
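A hedged log-analytics sketch, assuming a VPC Flow Logs table has already been defined over its S3 log prefix; the column names follow the default flow-log fields, but the table name and setup are assumptions:
-- Top source/destination pairs by rejected traffic (illustrative table name).
SELECT srcaddr, dstaddr, count(*) AS rejected_flows
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY srcaddr, dstaddr
ORDER BY rejected_flows DESC
LIMIT 20;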
9. Best Practices
- Store data in open, columnar formats (Parquet/ORC).
- Partition wisely (avoid too many tiny partitions).
- Use Glue crawlers for schema inference.
- Secure access with IAM + Lake Formation.
- Monitor performance with CloudWatch + Athena Workgroups.
- For heavy workloads → consider using Redshift (Athena is more for interactive/ad-hoc).
Final thoughts:
- Athena = Serverless SQL-on-S3 + Glue Data Catalog + Pay-per-query model.
- Athena is excellent for exploratory analytics, log querying, and powering dashboards without building full ETL (Extract, Transform, Load) pipelines.

Project: Hands-On
How twtech uses Amazon Athena to query data and store query results in its S3 bucket.
- Search for the AWS service: Athena.
How it works:
- Launch the query editor to query data from S3 (the twtechs3 bucket).
- Set up a query result location in Amazon S3 by editing the settings:
  - S3 bucket: twtechs3
  - Reference the S3 bucket and make sure it is accessible.
Alternatively:
- Browse to and choose the S3 data set: twtechs3
- Save the query result location in the selected S3 bucket.
twtech needs to run queries against the logs in the configured S3 bucket: twtechs3
- Query logs for all objects in the S3 bucket: twtechs3
- Run the DDL to create a database: CREATE DATABASE twtech_s3_access_logs_db;
A query to create a table in Amazon Athena to access S3 logs:
- In Amazon Athena, twtech creates tables using DDL statements that define metadata in the AWS Glue Data Catalog (or Athena’s internal catalog).
- The table itself doesn’t store data.
- Athena queries data directly in Amazon S3.
Here’s a simple example query to create a table:
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs_db.twtech_s3_access_logs (
  id INT,
  name STRING,
  age INT,
  created_at TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://twtechs3/';
Breakdown for the table above:
- twtech_s3_access_logs_db.twtech_s3_access_logs → database and table name.
- Columns → Define the schema (Athena supports types like string, int, bigint, boolean, double, timestamp, date, etc.).
- ROW FORMAT / SERDE → How data is interpreted (CSV, JSON, Parquet, ORC, Avro).
- LOCATION → S3 path where twtech data lives.
- TBLPROPERTIES → Optional metadata flags (e.g., partitioned data, encryption).
- The table name must be unique within the database.
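Once the table exists, a quick preview (a sketch; the qualified name matches the example above):
SELECT * FROM twtech_s3_access_logs_db.twtech_s3_access_logs LIMIT 10;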
More detailed logs table:
- AWS official documentation for querying S3 Inventory with Athena:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html
# Query code for Athena and the S3 bucket
CREATE EXTERNAL TABLE twtech_s3_access_logs_db (
  bucket string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date bigint,
  object_lock_mode string,
  object_lock_legal_hold_status string,
  intelligent_tiering_access_tier string,
  bucket_key_status string,
  checksum_algorithm string,
  object_access_control_list string,
  object_owner string
)
PARTITIONED BY (
  dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://twtechs3/'
TBLPROPERTIES (
  "projection.enabled" = "true",
  "projection.dt.type" = "date",
  "projection.dt.format" = "yyyy-MM-dd-HH-mm",
  "projection.dt.range" = "2022-01-01-00-00,NOW",
  "projection.dt.interval" = "1",
  "projection.dt.interval.unit" = "HOURS"
);
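With partition projection enabled as above, a query can filter on dt without adding partitions first. A sketch (the dt value simply follows the yyyy-MM-dd-HH-mm format declared in TBLPROPERTIES):
SELECT key, size, storage_class
FROM twtech_s3_access_logs_db
WHERE dt = '2022-01-01-00-00'
LIMIT 10;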
Breakdown for the more detailed logs table above:
- The same elements apply: the columns define the schema, ROW FORMAT / SERDE controls how the data is interpreted, LOCATION points to the S3 path where twtech data lives, and the table name must be unique within its database.
- TBLPROPERTIES here enables partition projection on the dt partition column, so Athena derives the hourly partitions automatically instead of requiring them to be added to the catalog.
- Sample for a Parquet table (faster & cheaper queries):
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_database.parquet_table (
  user_id BIGINT,
  event_type STRING,
  event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://twtechs3/data/events/';
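A sample aggregation against that Parquet table (a sketch; only the columns defined above are used, and the timestamp cutoff is an arbitrary example):
SELECT user_id, count(*) AS events
FROM twtech_database.parquet_table
WHERE event_time >= timestamp '2024-01-01 00:00:00'
GROUP BY user_id
ORDER BY events DESC
LIMIT 10;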
# Or:
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs (
  `host` STRING,
  `identity` STRING,
  `user` STRING,
  `time` STRING,
  `request` STRING,
  `status` INT,
  `size` BIGINT,
  `referer` STRING,
  `useragent` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)
LOCATION 's3://twtechs3/';
- How twtech previews the logs table created: s3_access_logs_db
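A preview sketch, assuming the logs table was created as twtechs3_logs inside the s3_access_logs_db database (matching the names used in the queries below):
SELECT * FROM "s3_access_logs_db"."twtechs3_logs" LIMIT 10;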
- How twtech does analytics for logs aggregation on its logs in S3 (twtechs3):
SELECT requesturi_operation, httpstatus, count(*)
FROM "twtech_s3_access_logs"."twtechs3_logs"
GROUP BY requesturi_operation, httpstatus;
- How twtech finds unauthorized requests (is someone trying to query its data?):
SELECT * FROM "s3_access_logs_db"."twtechs3_logs" WHERE httpstatus = '403';
Alternatively:
How twtech queries S3 data from Athena using SQL
- Create a catalog table for the Netflix listings: create an AWS Glue Crawler (it scans the data in S3 and creates a table).
- Choose data sources and classifiers: data source configuration.
- Configure security settings: IAM role.
- Set output and scheduling:
  - Create a database: twtech-netflix-db
  - twtech verifies that the database is created: twtech-netflix-db
  - Refresh the output configuration settings and select the database created: twtech-netflix-db
- Review and create.
- twtech runs the crawler: twtech-Netflix-Crawler
- The crawler status goes from "creating a table" to "completed".
- twtech verifies the table created in Glue.
- twtech verifies the table created in AWS Athena.
How twtech queries the table: twtech-netflix-dbnetflix_listings
- Link to download sample listings for the Netflix movies database (the files should be downloaded and uploaded into a folder created in the S3 bucket):
https://github.com/Devopspat35/twtech-netflix-imdb
- First, configure the Athena query result location where the results of the query will be stored: Settings
- Manage settings: s3://twtechs3/netflix-query-store/
- Query result location and encryption:
  - Location of query result - optional
  - Enter an S3 prefix in the current Region where the query result will be saved as an object.
- The query result location changes from s3://twtechs3 to s3://twtechs3/netflix-query-store/
- How twtech previews the table:
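A preview sketch (the database and table names match the filter queries below):
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" LIMIT 10;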
How twtech performs fine-grained filters in queries (by title, directors, genres):
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" WHERE title = 'The Lord of the Rings: The Return of the King';
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" WHERE directors = 'Sam Mendes';
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" WHERE genres = 'Adventure';