Amazon Athena - Deep Dive & Hands-On.
scope:
- The Concept of Athena,
- How Athena Works,
- What Amazon Athena is often used for,
- Query Flow,
- Architecture,
- Key Features,
- Performance Optimization,
- Integrations,
- Pricing,
- Use Cases,
- Best Practices,
- Final thoughts,
- Project: Hands-On.
Intro:
- Amazon Athena is a serverless, interactive query service that enables data analysts to analyze data directly in Amazon S3 using standard SQL.
- Amazon Athena is designed for ad-hoc analysis and data exploration without needing to manage any infrastructure or perform complex ETL (Extract, Transform, Load) jobs.
1. The Concept of Athena.
Amazon Athena is a serverless, interactive query service that lets twtech run SQL queries directly on data stored in Amazon S3.
- No infrastructure to manage (serverless).
- Uses Presto/Trino SQL engine under the hood.
- twtech pays only for the data scanned by queries.
What Amazon Athena is often used for:
- Ad-hoc data exploration
- Log analytics
- Data lake queries
- BI tool integration
- Pre-ETL exploration
2. How Athena Works
- Data Storage: Athena does not store data itself; it queries data directly in S3.
- Schema Definition: Metadata (tables, columns, partitions) is stored in the AWS Glue Data Catalog.
- Query Engine: SQL queries are executed using Presto (Trino-based distributed SQL engine).
- Results: Query results are stored back into S3 (in a results bucket).
Query Flow
S3 (raw data) ➝ Athena engine ➝ SQL query ➝ Results stored in S3
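The flow above can be illustrated with a minimal query (the table and bucket names below are assumptions for illustration, not from the walkthrough):

```sql
-- Assumes a table twtech_db.raw_events has already been defined over s3://twtechs3/raw/
SELECT event_type, count(*) AS events
FROM twtech_db.raw_events
GROUP BY event_type
ORDER BY events DESC;
-- Athena scans the S3 objects, executes the SQL,
-- and writes the result set to the configured S3 results location.
```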
3. Architecture
- Data Source Layer:
- Amazon S3 (primary source)
- Federated Queries (Athena can query RDS, Redshift, DynamoDB, etc. via connectors)
- Metadata Layer:
- AWS Glue Data Catalog (shared metadata across Athena, Redshift Spectrum, EMR, Lake Formation)
- Query Execution Layer:
- Presto engine, distributed across multiple nodes
- Output Layer:
- Query results stored in S3
- Integration with QuickSight, Tableau, Power BI, etc.
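A federated query across the layers above might look like the sketch below; `rds_mysql` is a hypothetical catalog name registered through an Athena data source connector, and `twtech_db.events` is an assumed Glue Catalog table:

```sql
-- Joins S3 data (via the Glue Data Catalog) with a relational source through a connector.
-- "rds_mysql" is a hypothetical federated catalog; twtech_db.events is assumed.
SELECT c.customer_name, count(e.event_id) AS events
FROM rds_mysql.sales.customers c
JOIN twtech_db.events e
  ON e.customer_id = c.customer_id
GROUP BY c.customer_name;
```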
4. Key Features
- Serverless: No clusters or infrastructure to manage.
- Standard SQL: ANSI SQL support, joins, aggregations, window functions.
- Multiple Formats: Supports CSV, JSON, ORC, Avro, Parquet (columnar = better performance).
- Federated Queries: Query across relational, NoSQL, and custom sources using Athena connectors.
- Partitioning: Improves query performance by scanning fewer files.
- CTAS (Create Table As Select): Write results back into S3 in optimized formats.
- Security: IAM for auth, Lake Formation for fine-grained access, encryption (S3 SSE-KMS).
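As a sketch of the CTAS feature listed above (the bucket path and table names are assumptions), raw data can be rewritten into partitioned, compressed Parquet:

```sql
-- Rewrites an assumed raw table into Snappy-compressed Parquet, partitioned by dt.
-- Partition columns must come last in the SELECT list.
CREATE TABLE twtech_db.events_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://twtechs3/optimized/events/',
  partitioned_by = ARRAY['dt']
)
AS SELECT user_id, event_type, event_time, dt
FROM twtech_db.events_raw;
```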
5. Performance Optimization
- Use Columnar Formats: Convert data to Parquet/ORC → reduces data scanned & cost.
- Partition Data: Partition by date, region, etc. → limits scanned data.
- Compression: Store compressed files (Snappy, Gzip).
- Optimize File Sizes: Avoid too many small files (“small file problem”); target 128 MB – 1 GB per file.
- Predicate Pushdown: Write queries that leverage filters early.
- CTAS / INSERT INTO: Pre-transform raw data into analytics-ready datasets.
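Partitioning and predicate pushdown from the list above work together: a query that filters on the partition key only scans the matching S3 prefixes (table and partition names are assumed for illustration):

```sql
-- Only objects under .../dt=2025-01-01/ are scanned; other partitions are pruned.
SELECT event_type, count(*) AS events
FROM twtech_db.events_parquet
WHERE dt = '2025-01-01'   -- partition predicate enables pruning
GROUP BY event_type;
```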
6. Integrations
- AWS Glue (metadata catalog, ETL jobs)
- Amazon QuickSight (visualizations)
- AWS Lambda (serverless data pipelines)
- Amazon Redshift Spectrum (hybrid queries with Redshift)
- Third-party BI Tools: Tableau, Looker, Power BI
7. Pricing
- $5 per TB scanned (compressed & columnar = huge savings).
- Partitioning and Parquet can reduce cost by 90% or more.
- twtech is also charged for:
- S3 storage
- Glue Data Catalog requests
- Result storage in S3
8. Use Cases
- Log Analysis: Query CloudTrail, ELB, VPC Flow Logs, WAF logs.
- Data Lakes: Explore S3 data without ETL.
- Ad-hoc Queries: Business analysts can run queries without needing a Redshift cluster.
- Federated Analytics: Query multiple databases from a single SQL interface.
- ETL Alternative: Use CTAS to transform raw → optimized datasets in S3.
9. Best Practices
- Store data in open, columnar formats (Parquet/ORC).
- Partition wisely (avoid too many tiny partitions).
- Use Glue crawlers for schema inference.
- Secure access with IAM + Lake Formation.
- Monitor performance with CloudWatch + Athena Workgroups.
- For heavy workloads → consider using Redshift (Athena is more for interactive/ad-hoc).
Final thoughts:
- Athena = Serverless SQL-on-S3 + Glue Data Catalog + Pay-per-query model.
- Athena is excellent for exploratory analytics, log querying, and powering dashboards without building full ETL (Extract, Transform, Load) pipelines.
Project: Hands-On
How twtech uses Amazon Athena to query log data stored in its S3 bucket
- In the AWS console, search for the service: Athena.
How it works:
- Launch the query editor to query data from S3 (twtech-S3bucket).
Set up a query result location in Amazon S3 by editing the settings:
S3 bucket: twtechs3
Reference the S3 bucket and make sure it is accessible.
Alternatively:
- Browse to choose the S3 dataset: twtechs3
Save the query result location in the selected S3 bucket.
twtech needs to run queries against the logs in the configured S3 bucket: twtechs3
- Query logs for all objects in the S3 bucket: twtechs3
- Run the statement to create a database: CREATE DATABASE twtech_s3_access_logs_db;
A query to create a table in Amazon Athena to access S3 logs
- In Amazon Athena, twtech creates tables using DDL statements that define metadata in the AWS Glue Data Catalog (or Athena’s internal catalog).
- The table itself doesn’t store data.
- Athena queries data directly in Amazon S3.
Here’s a simple example query to create a table:
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs_db.sample_table (
id INT,
name STRING,
age INT,
created_at TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
)
LOCATION 's3://twtechs3/';
Breakdown for the table DDL above:
- twtech_s3_access_logs_db.sample_table → database and table name (sample_table is an illustrative name).
- Columns → define the schema (Athena supports types like string, int, bigint, boolean, double, timestamp, date, etc.).
- ROW FORMAT / SERDE → how data is interpreted (CSV, JSON, Parquet, ORC, Avro).
- LOCATION → S3 path where twtech data lives.
- TBLPROPERTIES → optional metadata flags (e.g., partitioned data, encryption).
- The table name must be unique within the database.
More detailed logs table:
- AWS official documentation for querying an S3 Inventory report with Athena:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html
# DDL for the Athena table over the S3 bucket
CREATE EXTERNAL TABLE twtech_s3_access_logs_db.s3_inventory(
bucket string,
key string,
version_id string,
is_latest boolean,
is_delete_marker boolean,
size bigint,
last_modified_date timestamp,
e_tag string,
storage_class string,
is_multipart_uploaded boolean,
replication_status string,
encryption_status string,
object_lock_retain_until_date bigint,
object_lock_mode string,
object_lock_legal_hold_status string,
intelligent_tiering_access_tier string,
bucket_key_status string,
checksum_algorithm string,
object_access_control_list string,
object_owner string
) PARTITIONED BY (
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://twtechs3/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.dt.type" = "date",
"projection.dt.format" = "yyyy-MM-dd-HH-mm",
"projection.dt.range" = "2022-01-01-00-00,NOW",
"projection.dt.interval" = "1",
"projection.dt.interval.unit" = "HOURS"
);
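With partition projection enabled as above, no MSCK REPAIR TABLE is needed; queries can filter directly on dt. A sketch (shown here against a table named twtech_s3_access_logs_db.s3_inventory; adjust to the table name actually created, and to a dt value that exists in the inventory):

```sql
-- Summarizes object count and total size for one projected partition value.
SELECT dt, count(*) AS objects, sum(size) AS total_bytes
FROM twtech_s3_access_logs_db.s3_inventory
WHERE dt = '2025-01-01-00-00'   -- matches the projection.dt.format pattern
GROUP BY dt;
```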
- Sample for a Parquet table (faster & cheaper queries):
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_database.parquet_table (
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://twtechs3/data/events/';
# Or, a table over S3 server access logs using a regex SerDe:
CREATE EXTERNAL TABLE IF NOT EXISTS twtech_s3_access_logs (
`host` STRING,
`identity` STRING,
`user` STRING,
`time` STRING,
`request` STRING,
`status` INT,
`size` BIGINT,
`referer` STRING,
`useragent` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"'
)
LOCATION 's3://twtechs3/';
- How twtech previews the logs table created: twtech_s3_access_logs_db
- How twtech does analytics to aggregate the logs in its S3 bucket: twtechs3
SELECT requesturi_operation, httpstatus, count(*) FROM "twtech_s3_access_logs_db"."twtechs3_logs"
GROUP BY requesturi_operation, httpstatus;
- How twtech detects unauthorized queries (is someone trying to query its data?):
SELECT * FROM "twtech_s3_access_logs_db"."twtechs3_logs" WHERE httpstatus = '403'
Alternatively:
How twtech queries S3 data from Athena using SQL
- Create a catalog table for the Netflix listings: create an AWS Glue crawler (it scans the data in S3 and creates a table)
- Set crawler properties
- Choose data sources and classifiers: Data source configuration
- Configure security settings: IAM role
- Set output and scheduling:
- Create a database for: twtech-netflix-db
- twtech Verifies that the database is created: twtech-netflix-db
- Refresh the output configuration settings and select the database created: twtech-netflix-db
Review and create
- twtech runs the crawler: twtech-Netflix-Crawler
The crawler status changes from "creating a table" to "completed".
- twtech verifies the table created in AWS Glue.
- twtech verifies the table created in Amazon Athena.
How twtech queries the table: twtech-netflix-dbnetflix_listings
- Link to download sample listings for the Netflix movies database (the files should be downloaded and uploaded into a folder created in the S3 bucket):
https://github.com/Devopspat35/twtech-netflix-imdb
- First, configure the Athena query location where the results of the query will be stored: settings
- Manage settings:
Manage settings: s3://twtechs3/netflix-query-store/
Query result location and encryption
Location of query result - optional
Enter an S3 prefix in the current region where the query result will be saved as an object.
From: s3://twtechs3
To: s3://twtechs3/netflix-query-store/
- How twtech previews the table:
- Query results:
How twtech performs fine-grained filtering in a query: Title
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where title = 'The Lord of the Rings: The Return of the King'
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where directors = 'Sam Mendes'
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings" where genres = 'Adventure'
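If the genres column stores comma-separated values (e.g., 'Action, Adventure'), an exact match may miss rows; an alternative, assuming that layout, is a substring filter:

```sql
-- Matches rows where 'Adventure' appears anywhere in the genres string.
SELECT * FROM "twtech-netflix-db"."twtech-netflix-dbnetflix_listings"
WHERE genres LIKE '%Adventure%'
```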