An Overview of AWS Well-Architected Tool (WAT).
Focus:
- Relevant to DevOps / DevSecOps / Cloud Engineers.
Breakdown:
- Intro,
- Key Features and Functionality of WAT,
- The Six Pillars of AWS,
- The concept: AWS Well-Architected,
- Core Concepts twtech Must Understand,
- Review Process – Step by Step (Real-World),
- Pillar-Specific Deep Insights (Beyond Basics),
- Reports & Artifacts twtech Can Generate,
- Automation & Integration (Advanced Use),
- Anti-Patterns to Avoid,
- Best-Practice Cadence (What Teams Should Do),
- How This Helps in twtech Career Path,
- Insights.
Intro:
- The AWS Well-Architected Tool (WAT) is a service that helps twtech review its workloads against current AWS architectural best practices and provides guidance for improvement / remediation.
- The tool is based on the AWS Well-Architected Framework, which consists of six pillars.
Key Features and Functionality
Workload Reviews:
- The tool guides twtech through a set of questions aligned with the six pillars of the Framework to evaluate its architecture.
Risk Identification:
- After a review, twtech receives an improvement plan that highlights high-risk issues (HRIs) and medium-risk issues (MRIs), along with recommended remediation steps.
Milestone Tracking:
- twtech can save the state of a workload at specific points in time as "milestones" to measure and track progress as it implements improvements.
Custom Lenses:
- The tool allows twtech to create and share custom lenses, which are extensions of the core framework tailored to specific industry or internal organizational best practices.
Integrations:
- The tool can integrate with other AWS services, such as AWS Trusted Advisor, to automatically incorporate operational checks and provide a more comprehensive view of twtech's environment.
Collaboration:
- Workload reviews can be shared across teams or with external AWS Partners to enhance collaboration and ensure consistent application of best practices.
The Six Pillars of AWS
- The framework and the tool organize best practices around six foundational pillars:
Operational Excellence:
- Focuses on running and managing systems that deliver business value, including aspects like automation and runbooks.
Security:
- Centers on protecting data, systems, and assets, covering areas like identity and access management, detection, and incident response.
Reliability:
- Ensures a workload performs its intended function correctly and consistently when expected, focusing on foundations, change management, and failure handling.
Performance Efficiency:
- Addresses the efficient use of computing resources to meet system requirements and maintain that efficiency as demand changes.
Cost Optimization:
- Provides guidance on avoiding unneeded costs and managing expenditure effectively in the cloud.
Sustainability:
- Focuses on reducing energy consumption and increasing efficiency in the cloud to minimize environmental impacts.
NB:
- To get started with the AWS Well-Architected Tool (WAT), see the tutorial in the AWS Well-Architected Tool documentation:
https://docs.aws.amazon.com/wellarchitected/latest/userguide/tutorial.html
1. The concept: AWS Well-Architected
The AWS Well-Architected Tool is not just a questionnaire. It is:
- A structured risk discovery system
- A design review framework aligned to AWS best practices
- A continuous improvement mechanism, not a one-time audit
- A governance artifact that can support compliance, leadership reporting, and funding decisions
NB:
- It helps twtech identify architectural risks and map them to AWS-recommended improvements across the six Well-Architected pillars.
2. Core Concepts twtech Must Understand
2.1 Workloads
A workload is a collection of AWS resources that delivers business value.
Examples:
- A production e-commerce platform
- A CI/CD pipeline + EKS cluster
- A serverless data ingestion platform
Each workload:
- Has an owner
- Has an environment (Prod / Non-Prod)
- Is reviewed pillar by pillar
2.2 Pillars & Lenses
Pillars (Built-in)
1. Operational Excellence
2. Security
3. Reliability
4. Performance Efficiency
5. Cost Optimization
6. Sustainability
Lenses
Lenses extend the core pillars for specific domains:
- Serverless Lens
- SaaS Lens
- Financial Services Lens
- Data Analytics Lens
- Machine Learning Lens
- Custom Lenses (very powerful for enterprises)
NB:
Advanced practice: Many enterprises create custom lenses for:
- DevSecOps standards
- Regulatory requirements
- Internal platform rules
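To make the idea of a custom lens more concrete, the sketch below builds a minimal lens definition as a Python dictionary and serializes it to JSON. The field names (schemaVersion, pillars, questions, choices, riskRules) and the risk values are assumptions based on the custom lens JSON format described in the AWS documentation and should be verified against the current format specification before importing anything. The lens name, pillar, question, and choice IDs are all hypothetical.

```python
import json

# Hypothetical DevSecOps lens; every ID and title here is illustrative.
# Field names are assumed from the AWS custom lens JSON format.
custom_lens = {
    "schemaVersion": "2021-11-01",
    "name": "twtech DevSecOps Lens",
    "description": "Internal DevSecOps standards for twtech workloads",
    "pillars": [
        {
            "id": "pipeline_security",
            "name": "Pipeline Security",
            "questions": [
                {
                    "id": "secrets_handling",
                    "title": "How are secrets handled in CI/CD pipelines?",
                    "description": "Covers storage and scanning of secrets.",
                    "choices": [
                        {"id": "secrets_manager",
                         "title": "Secrets are stored in a managed secrets service"},
                        {"id": "secret_scanning",
                         "title": "Repositories are scanned for leaked secrets"},
                    ],
                    # Rules are listed from best case to the default fallback.
                    "riskRules": [
                        {"condition": "secrets_manager && secret_scanning",
                         "risk": "NO_RISK"},
                        {"condition": "secrets_manager", "risk": "MEDIUM_RISK"},
                        {"condition": "default", "risk": "HIGH_RISK"},
                    ],
                }
            ],
        }
    ],
}

lens_json = json.dumps(custom_lens, indent=2)
print(lens_json)
```

The point of the structure is that each question carries its own risk rules, so an internal standard can decide which combinations of practices count as high, medium, or no risk.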
3. Review Process – Step by Step (Real-World)
Step 1: Define the Workload
twtech provides:
- Business value
- Architecture type
- Industry
- Environment
- Review owner
NB:
- This context influences risk weighting.
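The workload definition can also be prepared programmatically. The following is a minimal sketch, assuming the parameter names of the Well-Architected Tool CreateWorkload API as exposed by boto3 (WorkloadName, Description, Environment, AwsRegions, ReviewOwner, Lenses). The workload name, description, regions, and owner address are hypothetical, and the helper function itself is not part of any AWS library.

```python
ALLOWED_ENVIRONMENTS = {"PRODUCTION", "PREPRODUCTION"}


def build_workload_request(name, description, environment, regions, review_owner):
    """Return keyword arguments for a CreateWorkload call (hypothetical helper)."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"environment must be one of {sorted(ALLOWED_ENVIRONMENTS)}")
    return {
        "WorkloadName": name,
        "Description": description,        # a natural place for the business value
        "Environment": environment,
        "AwsRegions": list(regions),
        "ReviewOwner": review_owner,
        "Lenses": ["wellarchitected"],     # assumed alias of the core framework lens
    }


request = build_workload_request(
    name="twtech-webapp-prod",                       # hypothetical workload
    description="Customer-facing web application",   # hypothetical
    environment="PRODUCTION",
    regions=["us-east-2"],
    review_owner="cloud-team@example.com",           # hypothetical owner
)
print(request)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("wellarchitected").create_workload(**request)
```

Keeping the request as plain data makes it easy to review in a pull request before anything is created in the account.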
Step 2: Answer Pillar Questions
Each pillar has:
- Design principles
- Best-practice questions
- Multiple-choice answers
Example (Security Pillar):
- How does twtech manage identities for people and machines?
Answers are mapped internally to:
- Best practices
- Risk levels
Step 3: Risk Identification (High / Medium / None)
After answering:
- AWS flags High-Risk Issues (HRIs)
- Each HRI is tied to:
- A specific question
- A specific architectural gap
NB:
- HRIs are the currency of the Well-Architected Tool.
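Because every answer carries a risk level, the results of a review can be summarized per pillar with a few lines of code. This is a minimal sketch that assumes answer records shaped like the AnswerSummaries entries returned by the ListAnswers API (PillarId, QuestionId, Risk). The sample records and question IDs below are hypothetical.

```python
from collections import Counter, defaultdict


def risk_counts_by_pillar(answer_summaries):
    """Count answers per risk level for each pillar."""
    counts = defaultdict(Counter)
    for answer in answer_summaries:
        counts[answer["PillarId"]][answer["Risk"]] += 1
    return {pillar: dict(levels) for pillar, levels in counts.items()}


# Hypothetical answers; real ones would come from the tool's API or an export.
sample_answers = [
    {"PillarId": "security", "QuestionId": "identities", "Risk": "HIGH"},
    {"PillarId": "security", "QuestionId": "data-at-rest", "Risk": "NONE"},
    {"PillarId": "reliability", "QuestionId": "backups", "Risk": "MEDIUM"},
    {"PillarId": "reliability", "QuestionId": "dr-plan", "Risk": "HIGH"},
]

counts = risk_counts_by_pillar(sample_answers)
print(counts)
```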
Step 4: Improvement Plan Generation
For every HRI, the tool provides:
- Recommended actions
- AWS documentation links
- Service-specific guidance
twtech can:
- Mark items as Not Applicable
- Track Improvement Status
- Assign Owners
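Tracking status and ownership is easy to model outside the console as well. The sketch below is one possible in-house representation, not the tool's own data model: the ImprovementItem class, its status values, and the sample items are all hypothetical. It lists high-risk items that nobody owns yet.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImprovementItem:
    question: str
    risk: str                     # "HIGH" or "MEDIUM"
    owner: Optional[str] = None   # team or person accountable for the fix
    status: str = "NOT_STARTED"   # illustrative status value


def unowned_high_risk(items: List[ImprovementItem]) -> List[ImprovementItem]:
    """Return HIGH-risk items that are still open and have no owner."""
    return [
        item for item in items
        if item.risk == "HIGH" and item.owner is None and item.status != "COMPLETE"
    ]


plan = [
    ImprovementItem("Root account MFA", "HIGH", owner="security-team", status="IN_PROGRESS"),
    ImprovementItem("Centralized logging", "HIGH"),
    ImprovementItem("Cost allocation tags", "MEDIUM"),
]

for item in unowned_high_risk(plan):
    print(f"Needs an owner: {item.question}")
```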
4. Pillar-Specific Deep Insights (Beyond Basics)
4.1 Operational Excellence
Focus areas:
- Observability
- Change management
- Incident response
- Automation
Advanced signals:
- No runbooks → HRI
- Manual deployments → HRI
- No game days → HRI
DevOps tie-in:
This pillar heavily rewards:
- CI/CD
- Infrastructure as Code
- Automated rollbacks
4.2 Security
Focus areas:
- Identity & access
- Detection
- Infrastructure protection
- Data protection
- Incident response
High-impact HRIs:
- No MFA on root
- Over-permissive IAM
- No centralized logging
- No automated security monitoring
DevSecOps insight:
Security HRIs often correlate directly with:
- IAM anti-patterns
- Lack of SCPs
- Weak secrets management
4.3 Reliability
Focus areas:
- Fault isolation
- Recovery planning
- Change management
- Capacity planning
Red flags:
- Single AZ deployments
- No DR strategy
- No backup testing
- Tight coupling between services
SRE mapping:
This pillar aligns closely with:
- Error budgets
- MTTR reduction
- Chaos engineering
4.4 Performance Efficiency
Focus areas:
- Compute selection
- Scaling strategies
- Monitoring & tuning
Advanced signals:
- Static instance sizing
- No auto scaling
- Poor data access patterns
Cloud engineering angle:
This is where the following shine when architected properly:
- Graviton
- Spot
- Serverless
4.5 Cost Optimization
Focus areas:
- Visibility
- Resource right-sizing
- Demand management
Common HRIs:
- No cost allocation tags
- No budgets/alerts
- Idle resources
- Over-provisioned storage
Leadership impact:
This pillar often justifies:
- FinOps initiatives
- Platform teams
- Executive dashboards
4.6 Sustainability
Focus areas:
- Energy-efficient services
- Right-sizing
- Managed services
Signals:
- Long-running idle compute
- Inefficient storage tiers
- No lifecycle policies
Forward-looking insight:
This pillar is increasingly tied to:
- ESG reporting
- Regulatory pressure
- Enterprise procurement decisions
5. Reports & Artifacts twtech Can Generate
The tool generates:
- Executive summary
- HRI list
- Improvement plan
- Pillar heat maps
These are commonly used for:
- Architecture reviews
- Funding requests
- Compliance evidence
- Cloud Center of Excellence (CCoE) reporting
6. Automation & Integration (Advanced Use)
6.1 APIs & Export
twtech can:
- Export reviews
- Track changes over time
- Feed data into dashboards
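As an illustration of feeding review data into a dashboard, the sketch below flattens improvement records into CSV text that most BI tools can ingest. It assumes records shaped like the ImprovementSummaries entries returned by the ListLensReviewImprovements API (PillarId, QuestionTitle, Risk). The sample rows are hypothetical, and the helper function is not part of any AWS library.

```python
import csv
import io


def improvements_to_csv(improvements):
    """Render improvement records as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["PillarId", "QuestionTitle", "Risk"],
        extrasaction="ignore",   # drop any extra keys in the records
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(improvements)
    return buffer.getvalue()


# Hypothetical records; real ones would come from the API or a saved export.
sample_improvements = [
    {"PillarId": "security", "QuestionTitle": "How do you manage identities?", "Risk": "HIGH"},
    {"PillarId": "costOptimization", "QuestionTitle": "How do you monitor usage?", "Risk": "MEDIUM"},
]

csv_text = improvements_to_csv(sample_improvements)
print(csv_text)
```

Running the same export after each review gives a simple time series of risk per pillar.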
6.2 CI/CD Integration (Advanced Pattern)
Some teams:
- Run Well-Architected reviews at release milestones
- Fail promotion if new HRIs appear
- Track architectural debt like technical debt
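The "fail promotion if new HRIs appear" pattern can be sketched as a small gate that compares the current review against a saved baseline, such as the answers captured at the last milestone. This is a minimal sketch under assumptions: the record shape (QuestionId, Risk) mirrors the tool's answer summaries, while the question IDs and the gate function are hypothetical.

```python
def new_high_risk_questions(baseline, current):
    """Return question IDs that are HIGH risk now but were not in the baseline."""
    baseline_high = {a["QuestionId"] for a in baseline if a["Risk"] == "HIGH"}
    current_high = {a["QuestionId"] for a in current if a["Risk"] == "HIGH"}
    return sorted(current_high - baseline_high)


def gate(baseline, current):
    """Return an exit code: 0 allows promotion, 1 blocks it."""
    regressions = new_high_risk_questions(baseline, current)
    for question_id in regressions:
        print(f"New HRI since baseline: {question_id}")
    return 1 if regressions else 0


# Hypothetical baseline (last milestone) and current review answers.
baseline_answers = [
    {"QuestionId": "identities", "Risk": "HIGH"},
    {"QuestionId": "backups", "Risk": "NONE"},
]
current_answers = [
    {"QuestionId": "identities", "Risk": "HIGH"},   # known debt, not a regression
    {"QuestionId": "backups", "Risk": "HIGH"},      # newly introduced risk
]

exit_code = gate(baseline_answers, current_answers)
print("exit code:", exit_code)
```

Only newly introduced HRIs block the release here; existing ones remain visible as architectural debt without stopping every deployment.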
7. Anti-Patterns to Avoid
❌ Treating it as a checkbox exercise
❌ Ignoring HRIs because “it works today”
❌ Doing reviews only once
❌ Not assigning owners to improvements
8. Best-Practice Cadence (What Teams Should Do)
| Environment | Review Frequency |
| Production | Every 6–12 months |
| Major redesign | Before launch |
| Regulated workloads | Quarterly |
| Platform services | After major changes |
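The time-based rows of this cadence can be turned into a simple overdue check. The sketch below is illustrative only: it approximates "every 6–12 months" by its upper bound of 365 days and "quarterly" by 90 days, and the category names are hypothetical. The event-driven rows (major redesign, platform changes) have no fixed interval, so they are left out.

```python
from datetime import date, timedelta

# Maximum days between reviews, approximated from the cadence above.
MAX_INTERVAL_DAYS = {
    "production": 365,   # upper bound of "every 6-12 months"
    "regulated": 90,     # roughly one quarter
}


def review_overdue(category, last_review, today):
    """Return True if the last review is older than the allowed interval."""
    return today - last_review > timedelta(days=MAX_INTERVAL_DAYS[category])


last_review = date(2024, 1, 1)   # hypothetical review date
today = date(2024, 5, 1)         # 121 days later

print(review_overdue("regulated", last_review, today))
print(review_overdue("production", last_review, today))
```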
9. How This Helps in twtech Career Path
Given a DevOps / DevSecOps / Cloud Engineering focus:
- Strengthens architecture credibility
- Aligns with AWS Partner & enterprise standards
- Positions you for:
- Principal Engineer
- Cloud Architect
- Platform Lead
- SRE / DevSecOps leadership
Insights:
- A full, end-to-end AWS Well-Architected review walkthrough for a modern SaaS platform using EKS + Serverless.
Focus:
- Executed as it would be in a real enterprise or AWS Partner engagement.
- What a Principal Cloud / DevSecOps Engineer would actually do.
- Example workload: SaaS Platform (EKS + Serverless)
1. Workload Definition (Foundation)
- Workload Name: acme-saas-platform-prod
- Environment: Production
- Industry: Software / SaaS
- Regions: us-east-2 (primary), us-west-1 (DR)
- Business Value:
- Multi-tenant B2B SaaS
- 99.9% customer SLA (Service Level Agreement)
- Regulated customer data (SOC 2)
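The 99.9% SLA translates directly into a downtime budget, which is useful context for the reliability questions later in the review. The calculation below is a worked example. The 30-day month and 365-day year are simplifying assumptions, and the function name is illustrative.

```python
def allowed_downtime_minutes(sla_percent, period_days):
    """Minutes of downtime permitted by an availability SLA over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


monthly = allowed_downtime_minutes(99.9, 30)    # roughly 43.2 minutes
yearly = allowed_downtime_minutes(99.9, 365)    # roughly 525.6 minutes (about 8.76 hours)

print(f"Per 30-day month: {monthly:.1f} minutes")
print(f"Per 365-day year: {yearly / 60:.2f} hours")
```

A budget of well under an hour per month leaves little room for untested restores or manual failover.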
Architecture Overview
- Amazon EKS (multi-AZ)
- ALB Ingress Controller
- Lambda (async processing, webhooks)
- API Gateway
- Aurora PostgreSQL (Multi-AZ)
- S3 (object storage + backups)
- CloudFront
- Cognito (customer auth)
- GitHub Actions → ArgoCD
- Terraform + Helm
- Datadog + CloudWatch
- AWS WAF + Shield
2. Operational Excellence Pillar – Deep Review
Key Questions & Answers
Q: How do you manage and automate changes?
- ✅ GitOps (ArgoCD)
- ✅ IaC (Terraform)
- ❌ No automated rollback validation
Q: How do you respond to incidents?
- ❌ No formal runbooks
- ❌ No game days
- ✅ PagerDuty + Datadog alerts
High-Risk Issues (HRIs)
| Risk | Description |
| HRI-OE-01 | Incident response not codified |
| HRI-OE-02 | No failure injection testing |
Recommended Improvements
- Create runbooks (per service)
- Implement automated rollback checks
- Run quarterly game days
DevOps insight:
- Most SaaS teams fail this pillar not due to tooling, but due to process gaps.
3. Security Pillar – DevSecOps Lens
Key Questions & Answers
Q: How are identities managed?
- ✅ IAM roles for service accounts (IRSA)
- ❌ No SCPs
- ❌ Broad admin roles
Q: How do you detect security events?
- ✅ GuardDuty
- ❌ No Security Hub aggregation
- ❌ Logs not centralized cross-account
Q: How is data protected?
- ✅ KMS encryption (S3, RDS)
- ❌ Secrets partially stored in env vars
HRIs
| Risk | Description |
| HRI-SEC-01 | Over-permissive IAM |
| HRI-SEC-02 | Weak secrets management |
| HRI-SEC-03 | Incomplete security monitoring |
Improvements
- Enforce least privilege IAM
- Migrate to AWS Secrets Manager
- Enable Security Hub + Org aggregation
DevSecOps note:
This pillar directly exposes IAM debt, which is often invisible until reviewed.
4. Reliability Pillar – SRE-Focused
Key Questions & Answers
Q: How do you manage failures?
- ✅ Multi-AZ EKS
- ❌ Single-region active
- ❌ No DR testing
Q: How do you back up data?
- ✅ Automated RDS backups
- ❌ Restore not tested
- ❌ No RPO/RTO defined
HRIs
| Risk | Description |
| HRI-REL-01 | No DR strategy |
| HRI-REL-02 | Backup restore unvalidated |
Improvements
- Define RTO/RPO
- Implement pilot-light DR
- Test restores quarterly
SRE insight:
Reliability HRIs almost always correlate with business SLA risk.
5. Performance Efficiency Pillar
Key Questions & Answers
Q: How is compute selected?
- ❌ Fixed node groups
- ❌ No Graviton
- ❌ Lambda memory not tuned
Q: How do you scale?
- ✅ HPA enabled
- ❌ No KEDA for async
- ❌ No load testing
HRIs
| Risk | Description |
| HRI-PERF-01 | Sub-optimal compute |
| HRI-PERF-02 | Scaling not validated |
Improvements
- Introduce Graviton node groups
- Use KEDA for event-driven scaling
- Run load & stress tests
Platform insight:
Performance inefficiencies often inflate cost and reduce reliability.
6. Cost Optimization Pillar – FinOps View
Key Questions & Answers
Q: How do you track cost?
- ❌ Incomplete cost allocation tags
- ❌ No budgets or alerts
- ❌ No rightsizing reviews
Q: How does twtech optimize resources?
- ❌ Idle dev clusters
- ❌ Over-provisioned RDS
HRIs
| Risk | Description |
| HRI-COST-01 | No cost visibility |
| HRI-COST-02 | No governance |
Improvements
- Enforce mandatory tagging
- Create AWS Budgets
- Enable Compute Optimizer
Leadership takeaway:
Cost HRIs are often the fastest ROI fixes.
7. Sustainability Pillar
Key Questions & Answers
Q: How do you reduce energy usage?
- ❌ No lifecycle policies
- ❌ Always-on non-prod clusters
- ❌ Inefficient storage tiers
HRIs
| Risk | Description |
| HRI-SUS-01 | Inefficient resource usage |
Improvements
- Auto-shutdown non-prod
- Add S3 lifecycle rules
- Prefer managed services
Enterprise trend:
Sustainability findings increasingly show up in executive reviews.
8. Final Review Output (What Leadership Sees)
HRI Summary
| Pillar | HRIs |
| Operational Excellence | 2 |
| Security | 3 |
| Reliability | 2 |
| Performance | 2 |
| Cost | 2 |
| Sustainability | 1 |
| Total | 12 HRIs |
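The totals in this summary are straightforward to recompute, which is a useful sanity check when the numbers are copied into a leadership deck. The snippet below uses the per-pillar counts from the summary above. The variable names are illustrative.

```python
# Per-pillar HRI counts taken from the summary above.
hri_by_pillar = {
    "Operational Excellence": 2,
    "Security": 3,
    "Reliability": 2,
    "Performance": 2,
    "Cost": 2,
    "Sustainability": 1,
}

total_hris = sum(hri_by_pillar.values())
worst_pillar = max(hri_by_pillar, key=hri_by_pillar.get)

print(f"Total HRIs: {total_hris}")           # 12
print(f"Most HRIs: {worst_pillar}")          # Security
```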
Improvement Plan
- 30-day: IAM, logging, tagging
- 60-day: DR design, secrets migration
- 90-day: Game days, Graviton, FinOps cadence
9. How Mature Teams Use This Review
✔ Feed HRIs into Jira
✔ Assign owners per pillar
✔ Track reduction over time
✔ Re-review before major releases
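Feeding HRIs into Jira usually means converting each finding into an issue payload. The sketch below builds payloads in the shape accepted by the Jira REST "create issue" endpoint (a "fields" object with project, summary, issuetype, and labels). The project key, issue type, and labels are assumptions that would differ per Jira instance, and nothing is sent over the network here.

```python
def hri_to_jira_issue(hri_id, description, project_key):
    """Build a Jira create-issue payload for one HRI (hypothetical helper)."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": f"[{hri_id}] {description}",
            "issuetype": {"name": "Task"},            # assumed issue type
            "labels": ["well-architected", "hri"],    # assumed labels
        }
    }


# HRIs from the Security pillar of the example review.
security_hris = [
    ("HRI-SEC-01", "Over-permissive IAM"),
    ("HRI-SEC-02", "Weak secrets management"),
    ("HRI-SEC-03", "Incomplete security monitoring"),
]

# "CLOUD" is a hypothetical Jira project key.
payloads = [hri_to_jira_issue(hri_id, text, "CLOUD") for hri_id, text in security_hris]
for payload in payloads:
    print(payload["fields"]["summary"])
```

Keeping the HRI identifier in the summary makes it possible to match tickets back to the review when tracking reduction over time.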
10. Why This Matters for You
For a DevOps / DevSecOps / Cloud Engineer, running reviews like this:
- Demonstrates architecture ownership
- Bridges engineering ↔ leadership
- Aligns with AWS Partner standards
- Builds Principal / Architect credibility
Project: Hands-On
How twtech uses the AWS Well-Architected Tool (WAT) in its environment to:
- Review workloads against current AWS architectural best practices and get guidance for improvement / remediation.
- Search for the AWS service: Well-Architected Tool (WAT)
How Well-Architected Tool (WAT) works
Benefits and features
Step-1: Define workload: Specify properties
Step-2: Define workload: Apply lenses
Step-3: Start reviewing the workload through the lens of the AWS Well-Architected Framework.
Step-4: Determine twtech priorities
- Select from the options available on the console.
OPS 2: How does twtech structure its organization to support its business outcomes?
OPS 3: How does twtech's organizational culture support its business outcomes?
OPS 4: How does twtech implement observability in its workload?
OPS 5: How does twtech reduce defects, ease remediation, and improve flow into production?
OPS 6: How does twtech mitigate deployment risks?
OPS 7: How does twtech know that it is ready to support a workload?
OPS 8: How does twtech utilize workload observability in its organization?
OPS 9: How does twtech understand the health of its operations?
OPS 10: How does twtech manage workload and operations events?
OPS 11: How does twtech evolve operations?
SEC 1: How does twtech securely operate its workload?
SEC 2: How does twtech manage identities for people and machines?
SEC 3: How does twtech manage permissions for people and machines?
SEC 4: How does twtech detect and investigate security events?
SEC 5: How does twtech protect its network resources?
SEC 6: How does twtech protect its compute resources?
SEC 7: How does twtech classify its data?
SEC 8: How does twtech protect its data at rest?
SEC 9: How does twtech protect its data in transit?
SEC 10: How does twtech anticipate, respond to, and recover from incidents?
SEC 11: How does twtech incorporate and validate the security properties of applications throughout the design, development, and deployment lifecycle?
REL 1: How does twtech manage service quotas and constraints?
REL 2: How does twtech plan its network topology?
NB:
- At this point, twtech can save and exit to review the results.
- However, as a best practice, twtech recommends answering the questions for all six pillars of the framework, especially for the prod environment.
Review results
How twtech gets to the recommendations from the AWS Well-Architected Tool.
Navigate down to the recommendations and possible remediation from AWS.
OPS01-BP02 Evaluate internal customer needs
https://docs.aws.amazon.com/wellarchitected/2025-02-25/framework/ops_priorities_int_cust_needs.html
Implementation guidance (remediation)
OPS01-BP05 Evaluate threat landscape
Implementation guidance
Finally, twtech can continue reviewing its best practices as follows.