Thursday, December 25, 2025

AWS Trusted Advisor (TA) | Deep Dive & Hands-On.

An Overview of AWS Trusted Advisor (TA).

Focus:

  •        Framed for DevOps / Cloud / SRE
  •        Aligned with Well-Architected thinking.

Breakdown:

  •        Intro,
  •        Key Features and Functionality,
  •        Integration with Other Services,
  •        Accessing Trusted Advisor,
  •        The concept: AWS Trusted Advisor,
  •        Trusted Advisor vs Well-Architected Tool (Quick Context)
  •        The 5 Trusted Advisor Check Categories (Deep Dive)
  •        Cost Optimization,
  •        Security,
  •        Fault Tolerance (Reliability),
  •        Performance,
  •        Service Limits (Quotas),
  •        Support Plan Impact (Very Important),
  •        Automation & Integrations (Where TA Shines),
  •        Sample DevOps Automation Flow,
  •        Trusted Advisor vs Third-Party Tools,
  •        When Should twtech Rely on Trusted Advisor,
  •        When Should twtech NOT Rely on Trusted Advisor,
  •        twtech Recommendation  (DevOps/SRE Playbook),
  •        Insights.

Intro:

  •        AWS Trusted Advisor is a web service that inspects twtech's AWS environment and provides real-time recommendations based on best practices across six categories:
    •    cost optimization,
    •    performance,
    •    resilience,
    •    security,
    •    operational excellence,
    •    and service limits.
  •        Trusted Advisor draws upon best practices learned from serving hundreds of thousands of AWS customers to identify opportunities to save money, improve availability and performance, and help close security gaps.

Key Features and Functionality

Best Practice Checks:

  •          Trusted Advisor continuously evaluates your AWS environment using a set of automated checks and then recommends actions to remediate any deviations from best practices.

Support Plan Integration:

  •          The number of checks available depends on your AWS Support plan.
  •    Basic and Developer Support plans have access to all service limits checks and selected security and fault tolerance checks.
  •    Business, Enterprise On-Ramp, and Enterprise Support plans have access to the full suite of checks.

Organizational View:

  •          For organizations using AWS Organizations, the organizational view provides a consolidated view of Trusted Advisor recommendations across all accounts.

Integration with Other Services:

AWS Support API:

  •     Allows programmatic access to check results.

Amazon CloudWatch and EventBridge:

  •    twtech can create alarms and rules to monitor Trusted Advisor metrics and check status changes.
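A minimal sketch of the kind of event pattern such an EventBridge rule might use. The `source` and `detail-type` values follow the Trusted Advisor event format as documented by AWS; the helper name `build_ta_event_pattern` and the default status filter are assumptions made for illustration, so compare the pattern against a real event before relying on it.

```python
import json


def build_ta_event_pattern(statuses=("ERROR", "WARN")):
    """Return an EventBridge event pattern (as a dict) for Trusted Advisor.

    Hypothetical helper: matches check-item refresh events whose status
    is one of `statuses`.
    """
    return {
        "source": ["aws.trustedadvisor"],
        "detail-type": ["Trusted Advisor Check Item Refresh Notification"],
        "detail": {"status": list(statuses)},
    }


# EventBridge expects the pattern as a JSON string.
pattern_json = json.dumps(build_ta_event_pattern(), indent=2)
print(pattern_json)
```

The resulting string could be passed as `EventPattern` to the boto3 `events` client's `put_rule` call. Trusted Advisor events are typically delivered in US East (N. Virginia), so the rule would most likely need to live in that Region.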

AWS Well-Architected Tool:

  •     Integrates with the tool to evaluate workloads and provide data-driven insights.

AWS Config:

  •     Many new checks are powered by AWS Config managed rules, enhancing the monitoring of operational excellence.

Prioritized Recommendations:

  •          Available to Enterprise Support customers, Trusted Advisor Priority highlights the most critical recommendations, often including context-driven insights from your AWS account team. 

Accessing Trusted Advisor 

The concept: AWS Trusted Advisor

  • AWS Trusted Advisor is a real-time advisory service that continuously evaluates twtech's AWS environment against AWS best practices and surfaces actionable recommendations. This deep dive covers five of the six categories listed in the intro (operational excellence is not covered separately):

1.     Cost Optimization
2.     Security
3.     Fault Tolerance
4.     Performance
5.     Service Limits

Think of AWS Trusted Advisor (TA) as:

A continuously running automated cloud review engine ...not a one-time audit.

Trusted Advisor vs Well-Architected Tool (Quick Context)

Trusted Advisor         | Well-Architected Tool
Continuous              | Point-in-time review
Automated checks        | Architect-led assessment
Resource-level findings | Design-level questions
Ops-focused             | Architecture-focused

In practice:

  •         Trusted Advisor → day-to-day hygiene
  •         Well-Architected → quarterly / major-change reviews

The 5 Trusted Advisor Check Categories (Deep Dive)

1. Cost Optimization

  • Identifies waste, overprovisioning, and idle resources.

Common Checks

  •         Idle EC2 instances
  •         Underutilized EBS volumes
  •         Idle Load Balancers
  •         Low-utilization RDS
  •         Unassociated Elastic IPs
  •         Reserved Instance & Savings Plan optimization

Sample Finding

  • “12 EBS volumes unattached for over 30 days; estimated monthly savings: $480”

DevOps Best Practice

  •         Integrate TA cost checks into FinOps dashboards
  •         Use tagging enforcement + TA findings to assign ownership
  •         Auto-remediate using Lambda where safe
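As a rough illustration of how a finding like the one above could be reproduced and costed before any auto-remediation, here is a small sketch. It assumes volume records in the dict shape returned by EC2 `DescribeVolumes`, and the price per GiB-month is an assumed flat figure for illustration only (real EBS pricing varies by volume type and Region). `CreateTime` is used as a proxy for age because the API does not report when a volume was detached.

```python
from datetime import datetime, timedelta, timezone

# Assumed price for illustration only; not an actual AWS price quote.
ASSUMED_PRICE_PER_GB_MONTH = 0.08


def stale_unattached_volumes(volumes, now, min_age_days=30):
    """Return unattached ('available') volumes created more than min_age_days ago."""
    cutoff = now - timedelta(days=min_age_days)
    return [
        v for v in volumes
        if v["State"] == "available"
        and not v.get("Attachments")
        and v["CreateTime"] <= cutoff
    ]


def estimated_monthly_savings(volumes, price_per_gb_month=ASSUMED_PRICE_PER_GB_MONTH):
    """Total size in GiB multiplied by the assumed flat price per GiB-month."""
    return round(sum(v["Size"] for v in volumes) * price_per_gb_month, 2)


# Sample data: twelve 500 GiB unattached volumes plus one volume still in use.
now = datetime(2025, 12, 25, tzinfo=timezone.utc)
sample = [
    {"VolumeId": f"vol-example{i:02d}", "State": "available", "Attachments": [],
     "Size": 500, "CreateTime": now - timedelta(days=45)}
    for i in range(12)
] + [
    {"VolumeId": "vol-inuse", "State": "in-use",
     "Attachments": [{"InstanceId": "i-example"}],
     "Size": 100, "CreateTime": now - timedelta(days=400)},
]

stale = stale_unattached_volumes(sample, now)
print(len(stale), estimated_monthly_savings(stale))
```

In a real account the input would come from something like `boto3.client("ec2").describe_volumes()["Volumes"]` (with pagination). Deleting volumes automatically is only safe once ownership and snapshots have been confirmed.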

2. Security

  • Maps closely to CIS benchmarks and AWS security best practices.

Common Checks

  •         S3 buckets with public access
  •         Security groups allowing 0.0.0.0/0 on sensitive ports
  •         IAM users with:
    •    No MFA
    •    Unused access keys
    •    Passwords older than policy
  •         Root account without MFA
  •         Exposed RDS snapshots

Sample Finding

  • Security Group sg-xxxx allows SSH from 0.0.0.0/0
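The logic behind a finding like this can be sketched in a few lines. The example below assumes a security group record in the dict shape returned by EC2 `DescribeSecurityGroups`; the set of "sensitive" ports is an illustrative choice, not the exact list Trusted Advisor uses, and only IPv4 ranges are inspected.

```python
SENSITIVE_PORTS = {22, 3389, 3306, 5432}  # illustrative choice, not TA's exact list


def open_sensitive_ports(security_group, sensitive_ports=SENSITIVE_PORTS):
    """Return the sensitive ports that a security group exposes to 0.0.0.0/0."""
    exposed = set()
    for perm in security_group.get("IpPermissions", []):
        world_open = any(
            r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])
        )
        if not world_open:
            continue
        protocol = perm.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            exposed |= set(sensitive_ports)
            continue
        if protocol not in ("tcp", "udp"):
            continue  # port ranges are only meaningful for TCP/UDP
        low, high = perm.get("FromPort"), perm.get("ToPort")
        if low is None or high is None:
            continue
        exposed |= {p for p in sensitive_ports if low <= p <= high}
    return sorted(exposed)


sample_sg = {
    "GroupId": "sg-example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ],
}
print(open_sensitive_ports(sample_sg))
```

Here SSH is flagged because it is open to the world, while PostgreSQL is not because it is limited to a private CIDR range.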

DevSecOps Tie-in

  •         Treat TA findings as security debt
  •         Send findings to:
    •    Security Hub
    •    Jira / ServiceNow
    •    SIEM (via EventBridge)

3. Fault Tolerance (Reliability)

  • Focuses on resilience and availability.

Common Checks

  •         EC2 instances without EBS-backed volumes
  •         Single-AZ RDS databases
  •         ELBs without multiple targets
  •         Auto Scaling groups without health checks
  •         Missing backups

Sample Finding

  • “RDS instance is running in a single Availability Zone”

SRE Angle

    •         TA highlights fragile infrastructure
    •         Use it to prioritize:
      •    Multi-AZ
      •    Auto Scaling
      •    Backup policies

4. Performance

  • Ensures services are appropriately sized and configured.

Common Checks

  •         EC2 instances with high CPU or memory pressure
  •         Classic Load Balancer usage (legacy)
  •         CloudFront configuration inefficiencies
  •         Suboptimal EBS volume types

Sample Finding

  • Instance t3.micro experiencing sustained CPU throttling

Platform Engineering Use

    •         Feed TA signals into:
      •    Capacity planning
      •    Instance family modernization
      •    Graviton adoption programs

5. Service Limits (Quotas)

  • Prevents scaling failures caused by quota exhaustion.

Common Checks

  •         EC2 instance limits
  •         VPC limits
  •         EIP limits
  •         Load balancer limits
  •         Lambda concurrency limits

Sample Finding

  • EC2 On-Demand instance usage at 85% of quota

Ops Impact

    •         One of the highest-value checks
    •         Prevents:
      •    Failed deployments
      •    Incident escalations
    •         Should be monitored like alerts
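The arithmetic behind a quota finding is simple: utilization is usage divided by the quota, flagged once it passes 80 percent. The sketch below uses a hypothetical list of dicts with `name`, `usage`, and `limit` keys; those names and the sample numbers are assumptions, not the shape of any AWS API response.

```python
def quota_alerts(quotas, threshold=0.80):
    """Return (name, utilization) pairs above the threshold, highest first."""
    alerts = []
    for q in quotas:
        if q["limit"] <= 0:
            continue  # skip unusable limits rather than divide by zero
        utilization = q["usage"] / q["limit"]
        if utilization > threshold:
            alerts.append((q["name"], round(utilization, 2)))
    return sorted(alerts, key=lambda a: a[1], reverse=True)


sample_quotas = [
    {"name": "EC2 On-Demand instances", "usage": 272, "limit": 320},
    {"name": "Elastic IPs", "usage": 2, "limit": 5},
    {"name": "Lambda concurrent executions", "usage": 950, "limit": 1000},
]
for name, utilization in quota_alerts(sample_quotas):
    print(f"{name}: {utilization:.0%} of quota")
```

Treating the returned list like any other alert source (paging or ticketing on new entries) is one way to monitor quotas like alerts.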

Support Plan Impact (Very Important)

Support Plan      | Checks Available
Basic / Developer | Limited checks only
Business          | Full Trusted Advisor
Enterprise        | Full + prioritized support

NB:

  •  Full value requires Business or Enterprise support

Automation & Integrations (Where TA Shines)

Event-Driven Ops

  •         TA publishes findings to Amazon EventBridge
  •         Enables:
    •    Auto-ticket creation
    •    Slack notifications
    •    Auto-remediation
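One way to wire this up is a small Lambda function behind the EventBridge rule. The sketch below is an assumption about how such a handler could look: it reads the `check-name` and `status` fields from the event's `detail` and only decides on an action. The routing policy (ERROR to ticket, WARN to notification) is illustrative, and a real handler would call Slack, a ticketing API, or a remediation function instead of returning a dict.

```python
def route_ta_event(event):
    """Decide what to do with one Trusted Advisor EventBridge event."""
    detail = event.get("detail", {})
    check = detail.get("check-name", "unknown check")
    status = detail.get("status", "UNKNOWN")
    if status == "ERROR":
        action = "ticket"
    elif status == "WARN":
        action = "notify"
    else:
        action = "ignore"
    account = event.get("account", "unknown")
    return {
        "action": action,
        "message": f"[Trusted Advisor] {check}: {status} in account {account}",
    }


def lambda_handler(event, context):
    """Lambda entry point; EventBridge delivers one event per invocation."""
    return route_ta_event(event)


# Hand-written sample event for local testing (values are made up).
sample_event = {
    "source": "aws.trustedadvisor",
    "detail-type": "Trusted Advisor Check Item Refresh Notification",
    "account": "123456789012",
    "detail": {"check-name": "Example security group check", "status": "ERROR"},
}
print(lambda_handler(sample_event, None))
```

Keeping the decision logic in a pure function such as `route_ta_event` makes the policy easy to unit test without deploying anything.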

Security Hub

  •         TA security checks can flow into AWS Security Hub
  •         Unified security posture view

API & CLI

  •         Query findings programmatically
  •         Build custom dashboards
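A minimal sketch of a dashboard-style summary built on the AWS Support API. It assumes a client that behaves like `boto3.client("support", region_name="us-east-1")` and uses the `describe_trusted_advisor_checks` and `describe_trusted_advisor_check_result` operations; the `FakeSupportClient` class and its check IDs are hypothetical stand-ins so the example runs without credentials. Programmatic access through the Support API generally requires a Business-tier plan or higher.

```python
from collections import Counter


def summarize_by_category(support_client, language="en"):
    """Count Trusted Advisor check statuses per category."""
    checks = support_client.describe_trusted_advisor_checks(language=language)["checks"]
    summary = {}
    for check in checks:
        result = support_client.describe_trusted_advisor_check_result(
            checkId=check["id"], language=language
        )["result"]
        summary.setdefault(check["category"], Counter())[result["status"]] += 1
    return summary


class FakeSupportClient:
    """Offline stand-in for the boto3 Support client (hypothetical data)."""

    def describe_trusted_advisor_checks(self, language):
        return {"checks": [
            {"id": "check-1", "name": "Example cost check", "category": "cost_optimizing"},
            {"id": "check-2", "name": "Example security check", "category": "security"},
            {"id": "check-3", "name": "Another security check", "category": "security"},
        ]}

    def describe_trusted_advisor_check_result(self, checkId, language):
        statuses = {"check-1": "warning", "check-2": "error", "check-3": "ok"}
        return {"result": {"checkId": checkId, "status": statuses[checkId]}}


summary = summarize_by_category(FakeSupportClient())
for category, counts in summary.items():
    print(category, dict(counts))
```

Swapping `FakeSupportClient()` for a real Support client should produce the same per-category counts that the console dashboard shows, subject to API throttling on accounts with many checks.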

Sample DevOps Automation Flow

Trusted Advisor vs Third-Party Tools

TA is:

  •         Native
  •         Free with support
  •         Low false positives

But:

  •         Not deeply customizable
  •         Doesn’t replace:
    •    CSPM tools
    •    Advanced cost optimization platforms

NB:

  • Best used as a baseline control plane.

When Should twtech Rely on Trusted Advisor

  •         Daily operational hygiene
  •         Security posture monitoring
  •         Cost waste detection
  •         Pre-incident prevention
  •         Leadership dashboards

❌   When Should twtech NOT Rely on Trusted Advisor

  •         Application logic issues
  •         Custom compliance frameworks
  •         Deep performance profiling

twtech Recommendation  (DevOps/SRE Playbook)

1.     Enable full TA (Business Support)

2.     Export findings via EventBridge

3.     Classify findings:

    •    Auto-fix
    •    Ticket
    •    Ignore (with justification)

4.     Review trends monthly

5.     Map findings to Well-Architected Pillars 
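Steps 3 and 5 of the playbook can be sketched as a small policy table. Everything named here is hypothetical: the check labels, the suppression entry, and the category-to-pillar mapping are judgment calls for illustration, not an official AWS mapping.

```python
# Hypothetical policy tables; the check names are labels chosen for this sketch.
AUTO_FIX = {"Unattached EBS volumes", "Idle load balancers", "Unused Elastic IPs"}
IGNORED = {
    "Low-utilization EC2: batch-node": "Sized for the nightly batch peak",
}
# Rough mapping of Trusted Advisor categories to Well-Architected pillars.
PILLAR = {
    "cost_optimizing": "Cost Optimization",
    "security": "Security",
    "fault_tolerance": "Reliability",
    "performance": "Performance Efficiency",
    "service_limits": "Reliability",
}


def classify(finding):
    """Annotate a finding with a disposition, justification, and pillar."""
    name = finding["name"]
    if name in IGNORED:
        disposition, justification = "ignore", IGNORED[name]
    elif name in AUTO_FIX:
        disposition, justification = "auto-fix", None
    else:
        disposition, justification = "ticket", None  # default: a human decides
    return {
        **finding,
        "disposition": disposition,
        "justification": justification,
        "pillar": PILLAR.get(finding["category"], "Unmapped"),
    }


findings = [
    {"name": "Unattached EBS volumes", "category": "cost_optimizing"},
    {"name": "IAM user without MFA", "category": "security"},
    {"name": "Low-utilization EC2: batch-node", "category": "cost_optimizing"},
]
for f in findings:
    print(classify(f))
```

Defaulting unknown findings to a ticket keeps the automation conservative: nothing is fixed or suppressed unless someone has explicitly listed it.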

Insight:

Trusted Advisor Review — EKS-Based SaaS (Production)

  •        A realistic, end-to-end Trusted Advisor (TA) review for a production EKS-based SaaS, with a quick contrast against serverless infrastructure.
  •        Tailored for DevOps / SRE / Platform leads.

Scenario

  •         Multi-tenant SaaS
  •         Amazon EKS (managed node groups + Fargate)
  •         ALB Ingress Controller
  •         RDS Aurora (Multi-AZ)
  •         S3 + CloudFront
  •         CI/CD via GitHub Actions
  •         Business Support enabled

Step 1: Open Trusted Advisor (What twtech Actually Sees)

In the AWS Console:

Support → Trusted Advisor → Dashboard

twtech sees:

  •         Overall check summary
  •         Counts per category
  •         Red / Yellow / Green indicators

Sample snapshot:

Category          | Status
Cost Optimization | 🔴 8
Security          | 🔴 3
Fault Tolerance   | 🟡 5
Performance       | 🟢 1
Service Limits    | 🟡 2

Step 2: Cost Optimization Findings (EKS Reality)

 Finding 1: Underutilized EC2 Instances (Worker Nodes)

TA Output

  • “5 EC2 instances with average CPU utilization below 10% over 14 days”

Why This Happens in EKS

  •         Static node groups
  •         Poor pod bin-packing
  •         No Cluster Autoscaler or misconfigured limits

Action

  •         Enable Cluster Autoscaler
  •         Right-size node groups
  •         Use multiple instance types
  •         Add pod requests/limits

SRE Note

  • This is a platform problem, not an app problem.

 Finding 2: Idle Load Balancer

TA Output

  • “1 Application Load Balancer with no active targets”

Root Cause

  •         Old Ingress left behind
  •         Blue/green deployment cleanup failure

Action

  •         Validate Ingress ownership via tags
  •         Delete unused ALB
  •         Add CI/CD cleanup checks

 Finding 3: Unattached EBS Volumes

TA Output

  • 12 EBS volumes unattached for 30+ days

Common EKS Cause

  •         PVC deleted
  •         Volume left behind due to reclaim policy

Action

  •         Audit Retain vs Delete
  •         Use CSI driver lifecycle policies

Step 3: Security Findings (High Signal)

Finding 4: Security Group Allows 0.0.0.0/0 on Port 443

TA Output

  • “Security group allows unrestricted access”

Reality Check

  •         ALB SG intentionally public
  •         But backend node SG also exposed ❌

Action

  •         ALB SG: allow inbound from 0.0.0.0/0 (intentional)
  •         Node SG: allow inbound from the ALB SG only
  •         Lock down NodePort ranges

 Finding 5: IAM User Without MFA

TA Output

  • “IAM user has console access without MFA”

Root Cause

  •         Legacy CI user
  •         Someone bypassed IAM roles

Action

  •         Kill static users
  •         Enforce:
    •    IAM roles
    •    OIDC (GitHub Actions / IRSA)
  •         SCP: DenyWithoutMFA

DevSecOps Callout

  • This is a release-blocking issue in mature orgs.

 Finding 6: S3 Bucket Allows Public Access

TA Output

  • “S3 bucket allows public access”

False Positive? Maybe.

  •         Static assets behind CloudFront
  •         But Block Public Access disabled ❌

Action

  •         Enable Block Public Access
  •         Use Origin Access Control (OAC)
  •         Restrict bucket policy to CloudFront only
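For the last action item, a bucket policy that admits only CloudFront (via Origin Access Control) usually follows the shape below. This is a sketch: the bucket name, account ID, and distribution ID are placeholder values, and the helper name `cloudfront_only_bucket_policy` is an assumption made for illustration.

```python
import json


def cloudfront_only_bucket_policy(bucket, account_id, distribution_id):
    """Build a bucket policy allowing reads only from one CloudFront distribution."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCloudFrontServicePrincipalReadOnly",
            "Effect": "Allow",
            "Principal": {"Service": "cloudfront.amazonaws.com"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {"StringEquals": {
                # Limits access to requests made on behalf of this distribution.
                "AWS:SourceArn": (
                    f"arn:aws:cloudfront::{account_id}:distribution/{distribution_id}"
                )
            }},
        }],
    }


# Placeholder values for illustration only.
policy = cloudfront_only_bucket_policy(
    "example-static-assets", "111122223333", "EDFDVBD6EXAMPLE"
)
print(json.dumps(policy, indent=2))
```

With a policy like this in place and Block Public Access enabled, the bucket is no longer publicly readable while CloudFront can still fetch the static assets.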

Step 4: Fault Tolerance (Where Incidents Are Born)

 Finding 7: Auto Scaling Group in Single AZ

TA Output

  • “Auto Scaling group spans a single Availability Zone”

Impact

  •         AZ outage = platform outage

Action

  •         Spread node groups across ≥2 AZs
  •         Verify pod anti-affinity rules
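A quick way to verify the first action item is to count distinct Availability Zones per group. The sketch below assumes records in the dict shape returned by Auto Scaling `DescribeAutoScalingGroups`; the group names and zones in the sample are made up.

```python
def single_az_groups(auto_scaling_groups, min_zones=2):
    """Return names of groups that span fewer than min_zones Availability Zones."""
    return [
        g["AutoScalingGroupName"]
        for g in auto_scaling_groups
        if len(set(g.get("AvailabilityZones", []))) < min_zones
    ]


sample_groups = [
    {"AutoScalingGroupName": "eks-nodes-a", "AvailabilityZones": ["us-east-1a"]},
    {"AutoScalingGroupName": "eks-nodes-b",
     "AvailabilityZones": ["us-east-1a", "us-east-1b", "us-east-1c"]},
]
print(single_az_groups(sample_groups))
```

Real input could come from `boto3.client("autoscaling").describe_auto_scaling_groups()["AutoScalingGroups"]`. Note that spreading nodes across zones does not by itself spread pods, which is why the anti-affinity check still matters.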

 Finding 8: RDS Backup Retention Low

TA Output

  • “RDS backup retention less than 7 days”

Reality

  •         Dev/test DB accidentally promoted

Action

    •         Enforce via:
      •    AWS Config
      •    Terraform guardrails
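Before a Config rule or Terraform guardrail is in place, the same condition can be checked ad hoc. This sketch assumes records in the dict shape returned by RDS `DescribeDBInstances`; the 7-day minimum mirrors the finding above, and the instance names are invented.

```python
def low_retention_instances(db_instances, min_days=7):
    """Return (identifier, retention_days) for instances below min_days."""
    return [
        (db["DBInstanceIdentifier"], db.get("BackupRetentionPeriod", 0))
        for db in db_instances
        if db.get("BackupRetentionPeriod", 0) < min_days
    ]


sample_dbs = [
    {"DBInstanceIdentifier": "orders-prod", "BackupRetentionPeriod": 14},
    {"DBInstanceIdentifier": "promoted-from-dev", "BackupRetentionPeriod": 1},
]
print(low_retention_instances(sample_dbs))
```

A retention period of 0 means automated backups are disabled, so those instances are reported as well.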

Step 5: Performance Findings

 Finding 9: No Major Issues

  • Typical for EKS because TA doesn’t deeply inspect Kubernetes internals.

NB:

  •         TA won’t see:
    •    Pod CPU throttling
    •    Memory OOMs
    •    API server saturation

Use:

  •         Prometheus
  •         Karpenter metrics
  •         CloudWatch Container Insights

Step 6: Service Limits (Critical but Ignored)

 Finding 10: EC2 Instance Limit at 80%

TA Output

  • “EC2 On-Demand instance usage approaching limit”

Why This Matters

  •         Scaling events will fail
  •         Deployments stall during incidents

Action

  •         Request quota increase
  •         Migrate to:
    •    Spot
    •    Graviton
    •    Fargate where possible

Step 7: Prioritization (How Pros Do It)

Category (highest priority first) | Action
Security                          | Fix immediately
Service Limits                    | Fix before next deploy
Cost                              | Schedule within sprint
Fault Tolerance                   | Roadmap
Performance                       | Monitor
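The prioritization above can be encoded as a simple ordering so that a findings backlog sorts itself. This is a sketch under the assumption that each finding carries a `category` field using the labels from this step; the sample findings are taken from the walkthrough.

```python
# Priority order and default action per category, as listed in this step.
PRIORITY = {
    "Security": (0, "Fix immediately"),
    "Service Limits": (1, "Fix before next deploy"),
    "Cost": (2, "Schedule within sprint"),
    "Fault Tolerance": (3, "Roadmap"),
    "Performance": (4, "Monitor"),
}


def prioritize(findings):
    """Sort findings by category priority and attach the default action."""
    fallback = (len(PRIORITY), "Triage manually")  # unknown categories go last
    ranked = sorted(findings, key=lambda f: PRIORITY.get(f["category"], fallback)[0])
    return [
        {**f, "action": PRIORITY.get(f["category"], fallback)[1]} for f in ranked
    ]


backlog = [
    {"title": "Idle load balancer", "category": "Cost"},
    {"title": "IAM user without MFA", "category": "Security"},
    {"title": "EC2 quota at 80%", "category": "Service Limits"},
    {"title": "ASG in single AZ", "category": "Fault Tolerance"},
]
for item in prioritize(backlog):
    print(f"{item['category']:<16} {item['action']:<24} {item['title']}")
```

Because Python's sort is stable, findings within the same category keep their original order.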

Step 8: Automation Pipeline (Real World)

Auto-Fix Candidates

  •         Unattached EBS
  •         Idle ELBs
  •         Unused EIPs
  •         IAM access key rotation reminders

Serverless SaaS Differences (Quick Contrast)

Area            | EKS               | Serverless
Compute cost    | Underutilized EC2 | Lambda duration
Security        | SG + IAM          | IAM + resource policies
Fault tolerance | AZ spread         | Mostly managed
Service limits  | EC2, ENIs         | Lambda concurrency

Common serverless TA findings:

  •         Lambda concurrency limits
  •         Public S3 buckets
  •         Idle API Gateway stages
  •         Underutilized provisioned concurrency

Final takeaway (What to Tell Organization Leadership)

  •        “Trusted Advisor identified 3 security risks, 2 scaling blockers, and ~$1,200/month in waste. All critical security findings were remediated within 24 hours.”

 Links to useful resources:

https://aws.amazon.com/architecture/

https://aws.amazon.com/solutions/

Project: Hands-On

How twtech uses AWS Trusted Advisor in its environment to:

  • Provide real-time recommendations based on best practices across six categories:
    •    cost optimization,
    •    performance,
    •    resilience,
    •    security,
    •    operational excellence,
    •    and service limits.
  •        Draw upon best practices (learned from serving AWS customers) to identify opportunities to save money, improve availability and performance, and help close security gaps.

  • Log in to the AWS account and open AWS Trusted Advisor (TA) at https://console.aws.amazon.com/trustedadvisor/home.

  • Upgrade twtech's AWS Support Plan to get all Trusted Advisor checks.

NB:

  • Without an upgrade, twtech's AWS Support Plan includes only a limited set of Trusted Advisor checks.

Service limits

  •        twtech chooses a check name to see recommendations for services that use more than 80 percent of a service quota.
  •        The check results are based on a snapshot, so twtech's current usage might differ.
  •        Quota and usage data can take up to 24 hours to reflect any changes.

NB:

  • twtech needs to pay for a support plan when it upgrades its AWS Support Plan to get the full set of Trusted Advisor checks.

Addendum:

Links to More Architecture Examples

Link to AWS Certified Solutions Architect - Associate (Exam)



