Thursday, September 18, 2025

CloudWatch Logs | Overview & Hands-On.


 Scope:

  • Intro,
  • Key Features & Concepts,
  • Link to official documentation,
  • Sources of Log Data,
  • Log Ingestion,
  • Log Storage & Management,
  • Processing & Analysis,
  • Destinations & Integrations,
  • Monitoring & Alerting,
  • Security & Access Control,
  • Final thoughts,
  • Project: Hands-On,
  • Keywords to filter searches in logs,
  • Official documentation link for queries.
Intro:
    • Amazon CloudWatch Logs is a service for monitoring and troubleshooting twtech applications and systems using their log data.
    • Amazon CloudWatch Logs enables twtech to:
      • Centralize the logs from all of its systems, applications, and the AWS services that twtech runs, regardless of where they are running.
      • It also provides features for:
        • Searching,
        • Analysis,
        • Storage,
        • Protection of that data.
Key Features and Concepts
Log Ingestion and Management
    • twtech can send logs from various sources to CloudWatch Logs, including Amazon EC2 instances, AWS Lambda functions, containerized applications (ECS, EKS), and more, using the CloudWatch agent or direct API calls.
Log Groups and Log Streams
    • Logs are organized into logical log groups, which are collections of log streams that share the same retention, monitoring, and access control settings.
Search and Analysis with Logs Insights:
    •  CloudWatch Logs Insights provides a powerful, SQL-like query language to interactively search and analyze twtech log data, helping it to troubleshoot operational issues more quickly.
Monitoring and Alarms:
    •  twtech can create metric filters based on log content to extract numerical values and use them to generate CloudWatch metrics and trigger alarms, proactively notifying it of specific events (e.g., error counts exceeding a threshold).
Data Protection and Security
    • The service includes features to help protect sensitive data by automatically masking personally identifiable information (PII) and encrypting log data using AWS KMS keys.
Retention and Archival
    • twtech has control over how long log events are stored. 
    • twtech can retain logs for a specified period or archive them to Amazon S3 for long-term storage and compliance purposes.
Real-time Processing:
    •  Using subscription filters, twtech can stream log data in real time to Amazon Kinesis, Amazon Kinesis Data Firehose, or AWS Lambda for custom processing, analysis, or integration with third-party tools.
Anomaly Detection:
    •  Machine learning-powered capabilities help summarize thousands of log entries into patterns and detect anomalies, reducing the manual effort in log analysis.
Link to official documentation:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html

1. Sources of Log Data

Logs can be ingested from multiple places:

  • AWS Services
    • Lambda function logs (stdout: standard output, stderr: standard error)
    • ECS / EKS / Fargate container logs
    • API Gateway execution/access logs
    • VPC Flow Logs, Route 53 Resolver Logs
    • CloudTrail events
  • EC2 Instances
    • CloudWatch Logs Agent
    • CloudWatch Unified Agent
  • On-premises / Hybrid
    • Using the unified agent or Kinesis Agent
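
On EC2 or on-premises hosts, the unified agent reads a JSON configuration file that tells it which files to ship and where. Below is a minimal sketch of the "logs" section of that configuration, built as a Python dict so the structure is explicit; the file path and log group name are assumptions for illustration, not values from this walkthrough.

```python
import json

# Minimal sketch of the CloudWatch unified agent "logs" section.
# The file path and log group name below are hypothetical examples.
agent_config = {
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/app/app.log",   # hypothetical application log
                        "log_group_name": "twtech-app-logs",   # hypothetical log group
                        "log_stream_name": "{instance_id}",    # agent substitutes the EC2 instance ID
                        "timezone": "UTC",
                    }
                ]
            }
        }
    }
}

# The agent loads this JSON from its configuration directory on the host.
print(json.dumps(agent_config, indent=2))
```

Using `{instance_id}` as the stream name gives one log stream per instance, which matches the one-stream-per-source model described below.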

2. Log Ingestion

  • Logs are collected and pushed into Log Groups (logical containers).
  • Each log-producing entity writes to a Log Stream (sequence of log events).
  • Events are timestamped, stored, and indexed.
  • Retention policies can be set (from 1 day to indefinite).
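
The ingestion model above can be sketched as a PutLogEvents-style request: a log group, a log stream, and a batch of timestamped events. The group and stream names here are assumptions; timestamps are milliseconds since the Unix epoch, and events in a batch must be in chronological order.

```python
import json
import time

# Sketch of a PutLogEvents-style request body (group/stream names are illustrative).
now_ms = int(time.time() * 1000)
put_events_request = {
    "logGroupName": "twtech-app-logs",   # logical container
    "logStreamName": "web-server-1",     # one stream per log-producing entity
    "logEvents": [                       # must be in chronological order
        {"timestamp": now_ms - 1000, "message": "INFO request started"},
        {"timestamp": now_ms, "message": "INFO request completed"},
    ],
}

# With AWS credentials configured, this dict could be passed to boto3:
#   boto3.client("logs").put_log_events(**put_events_request)
print(json.dumps(put_events_request)[:60])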

3. Log Storage & Management

  • Log Groups
    • Organize logs by application, service, or environment.
    • Can set retention policies.
  • Log Streams
    • Represent individual sources (e.g., per Lambda instance, per EC2).
  • Subscriptions
    • Define streaming of logs to other services.

4. Processing & Analysis

  • CloudWatch Logs Insights
    • Purpose-built query engine.
    • SQL-like queries for searching, filtering, and aggregating logs.
    • Supports visualization and dashboard integration.
  • Metric Filters
    • Convert specific patterns in logs into CloudWatch Metrics.
    • Example: Count ERROR occurrences → metric → alarm.

5. Destinations & Integrations

  • Kinesis Data Streams / Firehose
    • For near real-time log streaming to S3, Redshift, Elasticsearch/OpenSearch, or 3rd-party tools.
  • Lambda
    • Trigger functions on specific log events.
  • S3 (via Firehose)
    • Long-term storage and archival.
  • OpenSearch Service
    • For log search and visualization.
  • Security & Compliance Integrations
    • With services like GuardDuty, Security Hub, SIEM tools.

6. Monitoring & Alerting

  • CloudWatch Metrics + Alarms
    • From metric filters or generated by services.
    • Can trigger notifications (SNS), scaling actions, or custom automation.
  • Dashboards
    • Combine logs, metrics, and alarms for observability.

7. Security & Access Control

  • Encryption
    • Logs at rest encrypted with KMS keys.
  • IAM Policies
    • Control who can read/write/stream logs.
  • Cross-Account Sharing
    • Subscription filters can send logs to another account.
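
As a sketch of the IAM side, the policy below grants read-only access to a single log group. The account ID, Region, and log group name are placeholders, not values from this walkthrough; the listed actions are the standard CloudWatch Logs read actions.

```python
import json

# Sketch of an IAM policy granting read-only access to one log group.
# The account ID, Region, and group name in the ARN are placeholders.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
            ],
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:twtech-app-logs:*",
        }
    ],
}
print(json.dumps(read_only_policy, indent=2))
```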

Final thoughts:

    •   CloudWatch Logs acts as the central nervous system for AWS observability:
    •    Ingesting logs from AWS services, applications, and on-premises systems,
    •    Storing and indexing them, and providing query and alerting features,
    •    Integrating with downstream analytics, monitoring, and SIEM (Security Information and Event Management) systems.


Project: Hands-On

  • How twtech creates and uses CloudWatch Logs for monitoring and observability of its metrics.
Search for aws service: CloudWatch

  • How twtech accesses CloudWatch Logs: Log groups



  • Log groups are created by different: services

  • Each log group has its own embedded: log streams

  • Within log streams are also embedded: Log events

  • From within logs events, twtech filters (with key words) to get insights of the logs.
Keyword: Error



  • Keyword: http


  • Keyword: exceptions




Other keywords to filter searches in logs include:
  • How twtech filters with keywords/terms in CloudWatch Logs filters (for metric filters, subscription filters, and searching logs).
  • Breakdown of the main ones and their meanings:
  • CloudWatch Logs uses a simple filter pattern language to match log events.
Sample keywords and operators twtech uses:
1. Terms / Strings

    • "ERROR" – Matches any log event containing the exact string ERROR.
    • "200" – Matches log events containing the string 200.
NB:
  • Strings are case-sensitive.
2. Space-separated Terms (AND logic)
    • ERROR Timeout Matches logs that contain both ERROR and Timeout (quoting the pair would instead match the exact phrase).
3. Quoted Strings (exact match including spaces)
    • "\"User login failed\"" Matches the exact phrase User login failed.
4. Negation
    • -ERROR Matches logs that do not contain ERROR.
5. OR Operator
    • ERROR || Exception Matches if either ERROR or Exception is present.
6. Fields (Structured JSON / Key-Value logs)
NB:
  • When logs are JSON, twtech can filter on fields:
    • { $.status = 404 } Matches if the JSON field status is 404.
    • { $.latency > 500 } Matches if latency field is greater than 500.
    • { $.user != "admin" } Matches if user is not admin.
7. Numeric Ranges
    • { $.bytes >= 1000 } Matches if the bytes field is 1000 or more.
8. Exists Operator
    • { $.requestId = * } Matches logs where the requestId field exists (any value).
9. Wildcard Matching
    • "?ERROR*" Matches any string where ERROR appears with prefix/suffix wildcards.
    • ? single character wildcard.
    • * multi-character wildcard.
10. Boolean Expressions (Grouping)
    • (ERROR || Exception) && Timeout Matches logs with Timeout and either ERROR or Exception.
11. Numbers Without JSON
NB:
  • Even without JSON, twtech can filter numeric-looking text:
    • [ip, user, status=404, bytes>1000] Matches space-delimited fields with conditions.
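
To make the JSON field semantics above concrete, here is a toy evaluator written for this post; it is not the AWS implementation, only a sketch of how conditions like { $.status = 404 } behave against a JSON log event, including the exists operator.

```python
import json
import operator

# Toy evaluator for a single JSON-field condition, illustrating the semantics of
# patterns like { $.status = 404 }. This is NOT the AWS implementation.
OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt,
       ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def matches(event_json, field, op, value):
    """True if the top-level `field` of the JSON event satisfies the condition.
    A value of "*" means "field exists with any value" (the exists operator)."""
    event = json.loads(event_json)
    if field not in event:
        return False
    if value == "*":          # { $.field = * }  -> existence check
        return True
    return OPS[op](event[field], value)

log = '{"status": 404, "latency": 623, "user": "guest"}'
print(matches(log, "status", "=", 404))     # { $.status = 404 }
print(matches(log, "latency", ">", 500))    # { $.latency > 500 }
print(matches(log, "user", "!=", "admin"))  # { $.user != "admin" }
print(matches(log, "requestId", "=", "*"))  # { $.requestId = * } -> field absent
```

The first three conditions match this event and the last does not, mirroring the examples in sections 6 through 8 above.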
twtech-Summary Table

Keyword / Pattern        Meaning
ERROR                    Contains the word ERROR
-ERROR                   Does not contain ERROR
ERROR Timeout            Must contain both terms
ERROR || Exception       Matches either term
"Exact Match"            Matches the exact phrase
{ $.field = value }      JSON field equals value
{ $.field > 100 }        JSON field greater than 100
{ $.field != value }     JSON field not equal
{ $.field = * }          Field exists (any value)
? and *                  Wildcards (single/multi-char)
( )                      Grouping for AND/OR logic



  • How twtech creates metric filters: to match filter keywords







  • Test the pattern: Results

Next: Assign metric
  • twtech creates a filter name: twtechFilterMetric
  • Log events that match the pattern twtech defines are recorded to the metric that it specifies.
  • twtech can therefore graph the metric and set alarms for notification.

Metric details
  • Metric namespace: twtechFilterMetricNs
  • Namespaces let twtech group similar metrics

  • Review and create Metric filter

  • Create metric filter:
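
The console steps above map to a single API call. The sketch below assembles the request parameters as a dict, reusing the hands-on names (twtechFilterMetric, twtechFilterMetricNs); the "ERROR" pattern and the metric name twtechErrorCount are assumptions for illustration.

```python
import json

# Request parameters for the metric filter from the hands-on steps.
# The filter pattern and metricName below are assumptions, not from the console run.
metric_filter = {
    "logGroupName": "twtechMetricLG",
    "filterName": "twtechFilterMetric",
    "filterPattern": "ERROR",                    # hypothetical keyword pattern
    "metricTransformations": [{
        "metricName": "twtechErrorCount",        # hypothetical metric name
        "metricNamespace": "twtechFilterMetricNs",
        "metricValue": "1",                      # add 1 to the metric per matching event
        "defaultValue": 0,                       # emit 0 when nothing matches
    }],
}

# With AWS credentials configured:
#   boto3.client("logs").put_metric_filter(**metric_filter)
print(json.dumps(metric_filter, indent=2))
```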


  • How twtech accesses the metric created in CloudWatch: Metrics / All metrics

  • Access:  graphed metrics

To

  • Create alarms on top of the metrics: create alarm to notify if metric exceeds value set.



  • Conditions: Threshold type

Configure actions: Alarm state trigger
  • Define the alarm state that will trigger this action.
  • From: create SNS Topic

To:

  • Check email and: confirm subscription

  • Confirm subscription:
NB:
  • A distribution email (group email) is recommended if several twtech recipients should be notified by CloudWatch at the same time.



Add alarm details
  • Name and description
  • Alarm name: twtechMetricsAlarm

  • Preview and create


  • Create Alarm: twtechMetricsAlarm

  • Details of Alarm: twtechMetricsAlarm
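
The alarm configured above can likewise be sketched as request parameters. The alarm name and namespace come from the hands-on steps; the metric name, period, threshold, and SNS topic ARN are assumptions for illustration.

```python
# Parameters for a CloudWatch alarm on the metric created by the filter.
# Metric name, threshold, and the SNS topic ARN are placeholders/assumptions.
alarm_params = {
    "AlarmName": "twtechMetricsAlarm",
    "Namespace": "twtechFilterMetricNs",
    "MetricName": "twtechErrorCount",   # hypothetical metric from the filter
    "Statistic": "Sum",
    "Period": 300,                      # evaluate over 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 5,                     # alarm if more than 5 matches in a period
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching", # quiet periods do not trigger the alarm
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:twtech-alerts"],  # placeholder topic
}

# With AWS credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
print(alarm_params["AlarmName"])
```

Setting TreatMissingData to notBreaching is a common choice for error-count metrics, since a period with no matching log events should not look like an outage.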



  • How twtech creates Subscription filters for:  Log groups
  • Create Subscription filters for:  Create Lambda subscription filter
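
The Lambda subscription filter above reduces to the parameters below. The filter name and function ARN are placeholders; note that CloudWatch Logs must also be granted permission (via lambda:AddPermission) to invoke the target function.

```python
# Parameters for a Lambda subscription filter (names and ARN are placeholders).
# An empty filterPattern streams every log event in the group to the function.
subscription_filter = {
    "logGroupName": "twtechMetricLG",
    "filterName": "twtech-to-lambda",   # hypothetical filter name
    "filterPattern": "",                # empty pattern = forward all events
    "destinationArn": "arn:aws:lambda:us-east-1:123456789012:function:twtech-log-processor",
}

# With AWS credentials configured:
#   boto3.client("logs").put_subscription_filter(**subscription_filter)
print(subscription_filter["filterName"])
```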

  • How twtech may edit log retention settings: the duration twtech prefers to keep the logs

  • Retention period: 1 day – 10 years
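
That 1 day to 10 years range maps to a PutRetentionPolicy call. The sketch below validates only the overall range before building the request; note the API actually accepts a fixed list of allowed day values (1, 3, 5, 7, 14, 30, and so on), not arbitrary integers, so this check is a simplification.

```python
# Sketch of a retention-policy request. 3653 days is roughly the 10-year
# console maximum; the real API accepts only an enumerated set of day values,
# so this range check is a simplification for illustration.
def retention_request(log_group, days):
    if not 1 <= days <= 3653:
        raise ValueError("retention must be between 1 day and 10 years")
    return {"logGroupName": log_group, "retentionInDays": days}

params = retention_request("twtechMetricLG", 30)
# With AWS credentials configured:
#   boto3.client("logs").put_retention_policy(**params)
print(params)
```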

  • How twtech may export data into an S3 bucket: twtechs3Bucket

Export data to Amazon S3: twtech-s3bucket
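
The S3 export above corresponds to a CreateExportTask request covering a time window. The task name and prefix are assumptions; fromTime and to are milliseconds since the Unix epoch, and the destination bucket must have a policy allowing CloudWatch Logs to write to it.

```python
import time

# Sketch of a CreateExportTask request to archive a 3-day window to S3.
# Task name and prefix are assumptions; the bucket must permit logs.amazonaws.com writes.
now_ms = int(time.time() * 1000)
export_task = {
    "taskName": "twtech-export",                    # hypothetical task name
    "logGroupName": "twtechMetricLG",
    "fromTime": now_ms - 3 * 24 * 60 * 60 * 1000,   # 3 days ago, in ms since epoch
    "to": now_ms,
    "destination": "twtech-s3bucket",
    "destinationPrefix": "cloudwatch-exports",      # hypothetical key prefix
}

# With AWS credentials configured:
#   boto3.client("logs").create_export_task(**export_task)
print(export_task["destination"])
```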

  • How twtech creates log groups: twtechMetricLG

  • Create Log group (LG)


  • How twtech uses CloudWatch Logs Insights for: in-depth analysis

  • How twtech runs Query for specific log groups: Select log group

Select log group and : Run Query
  • # using the Query language
fields @timestamp, @message, @logStream, @log
| sort @timestamp desc
| limit 10000
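
The same query can be run from code: StartQuery returns a queryId, and GetQueryResults is polled until the status is "Complete". The log group name and the 3-day window mirror the hands-on steps; here startTime and endTime are seconds since the Unix epoch.

```python
import time

# Running a Logs Insights query programmatically (sketch).
# Log group and time window mirror the hands-on steps above.
query = """fields @timestamp, @message, @logStream, @log
| sort @timestamp desc
| limit 10000"""

start_params = {
    "logGroupName": "twtechMetricLG",
    "startTime": int(time.time()) - 3 * 24 * 60 * 60,  # 3 days ago, seconds since epoch
    "endTime": int(time.time()),
    "queryString": query,
}

# With AWS credentials configured:
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**start_params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until status == "Complete"
print(start_params["logGroupName"])
```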


  • Run Query for the past: Custom, 3 days

  • Logs created in the past 3 days: 94
  • How twtech saves its query results:
  • Assign a query name: twtechMetericLogQuery




  • The Official documentation link to get Queries:

Common queries used to get logs for:
  •        Lambda,
  •        VPC Flow Logs,
  •        CloudTrail,
  •        NetworkFirewall,
  •        Route53,
  •        AWS AppSync,
  •        NAT Gateway,
  •        IoT,
  •        Elemental MediaPackage V2 Access Logs,
  •        SES Mail Manager,
  •        Amazon Q Business Conversation Log.

      General queries

  •       # To Find the 25 most recently added log events.

fields @timestamp, @message | sort @timestamp desc | limit 25

  • # To Get a list of the number of exceptions per hour.

filter @message like /Exception/ 
    | stats count(*) as exceptionCount by bin(1h)
    | sort exceptionCount desc

  • # To Get a list of log events that aren't exceptions.

fields @message | filter @message not like /Exception/

  • # To Get the most recent log event for each unique value of the server field.

fields @timestamp, server, severity, message 
| sort @timestamp asc 
| dedup server

  • # To Get the most recent log event for each unique value of the server field for each severity type.

fields @timestamp, server, severity, message 
| sort @timestamp desc 
| dedup server, severity

Queries for Lambda logs

  • # To Determine the amount of overprovisioned memory.

filter @type = "REPORT"
    | stats max(@memorySize / 1000 / 1000) as provisonedMemoryMB,
        min(@maxMemoryUsed / 1000 / 1000) as smallestMemoryRequestMB,
        avg(@maxMemoryUsed / 1000 / 1000) as avgMemoryUsedMB,
        max(@maxMemoryUsed / 1000 / 1000) as maxMemoryUsedMB,
        provisonedMemoryMB - maxMemoryUsedMB as overProvisionedMB 

  • # To Create a latency report.

filter @type = "REPORT" |
    stats avg(@duration), max(@duration), min(@duration) by bin(5m)

  • # To Search for slow function invocations, and eliminate duplicate requests that can arise from retries or client-side code. In this query, @duration is in milliseconds.

fields @timestamp, @requestId, @message, @logStream 
| filter @type = "REPORT" and @duration > 1000
| sort @timestamp desc
| dedup @requestId 
| limit 20

Queries for Amazon VPC flow logs

  • # To Find the top 15 packet transfers across hosts:

stats sum(packets) as packetsTransferred by srcAddr, dstAddr
    | sort packetsTransferred  desc
    | limit 15

  • # To Find the top 15 byte transfers for hosts on a given subnet.

filter isIpv4InSubnet(srcAddr, "192.0.2.0/24")
    | stats sum(bytes) as bytesTransferred by dstAddr
    | sort bytesTransferred desc
    | limit 15

  • # To Find the IP addresses that use UDP as a data transfer protocol.

filter protocol=17 | stats count(*) by srcAddr

  • # To Find the IP addresses where flow records were skipped during the capture window.

filter logStatus="SKIPDATA"
    | stats count(*) by bin(1h) as t
    | sort t

  • # To Find a single record for each connection, to help troubleshoot network connectivity issues.

fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, bytes 
| filter logStream = 'vpc-flow-logs' and interfaceId = 'eni-0123456789abcdef0' 
| sort @timestamp desc 
| dedup srcAddr, dstAddr, srcPort, dstPort, protocol 
| limit 20

Queries for Route 53 logs

# To Find the distribution of records per hour by query type.

stats count(*) by queryType, bin(1h)

  • # To Find the 10 DNS resolvers with the highest number of requests.

stats count(*) as numRequests by resolverIp
    | sort numRequests desc
    | limit 10

  • # To Find the number of records by domain and subdomain where the server failed to complete the DNS request.

filter responseCode="SERVFAIL" | stats count(*) by queryName

Queries for CloudTrail logs

# To Find the number of log entries for each service, event type, and AWS Region.

stats count(*) by eventSource, eventName, awsRegion

  • # To Find the Amazon EC2 hosts that were started or stopped in a given AWS Region.

filter (eventName="StartInstances" or eventName="StopInstances") and awsRegion="us-east-2"

# To Find the AWS Regions, user names, and ARNs of newly created IAM users.

filter eventName="CreateUser"
    | fields awsRegion, requestParameters.userName, responseElements.user.arn

  • # To Find the number of records where an exception occurred while invoking the API UpdateTrail.

filter eventName="UpdateTrail" and ispresent(errorCode)
    | stats count(*) by errorCode, errorMessage

  • # To Find log entries where TLS 1.0 or 1.1 was used

filter tlsDetails.tlsVersion in [ "TLSv1", "TLSv1.1" ]
| stats count(*) as numOutdatedTlsCalls by userIdentity.accountId, recipientAccountId, eventSource, eventName, awsRegion, tlsDetails.tlsVersion, tlsDetails.cipherSuite, userAgent
| sort eventSource, eventName, awsRegion, tlsDetails.tlsVersion

  • # To Find the number of calls per service that used TLS versions 1.0 or 1.1

filter tlsDetails.tlsVersion in [ "TLSv1", "TLSv1.1" ]
| stats count(*) as numOutdatedTlsCalls by eventSource
| sort numOutdatedTlsCalls desc     

Queries for Amazon API Gateway

# To Find the last 10 4XX errors

fields @timestamp, status, ip, path, httpMethod
| filter status>=400 and status<=499
| sort @timestamp desc
| limit 10

  • # To Identify the 10 longest-running Amazon API Gateway requests in twtech Amazon API Gateway access log group

fields @timestamp, status, ip, path, httpMethod, responseLatency
| sort responseLatency desc
| limit 10

  • # To Return the list of the most popular API paths in twtech Amazon API Gateway access log group

stats count(*) as requestCount by path
| sort requestCount desc
| limit 10

  • # To Create an integration latency report for twtech Amazon API Gateway access log group

filter status=200
| stats avg(integrationLatency), max(integrationLatency), 
min(integrationLatency) by bin(1m)

Queries for NAT gateway

 # If twtech notices higher than normal costs in its AWS bill, it can use CloudWatch Logs Insights to find the top contributors.
 # See the official documentation for more information about the following query commands.

NB

  • In the following query commands, replace "x.x.x.x" with the private IP of twtech NAT gateway, and replace "y.y" with the first two octets of its VPC CIDR range.
# To Find the instances that are sending the most traffic through twtech NAT gateway.

filter (dstAddr like 'x.x.x.x' and srcAddr like 'y.y.') 
| stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
| sort bytesTransferred desc
| limit 10

# To Determine the traffic that's going to and from the instances in twtech NAT gateways.

filter (dstAddr like 'x.x.x.x' and srcAddr like 'y.y.') or (srcAddr like 'x.x.x.x' and dstAddr like 'y.y.')
| stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
| sort bytesTransferred desc
| limit 10

  • # To Determine the internet destinations that the instances in twtech VPC communicate with most often for uploads and downloads.

# For uploads

filter (srcAddr like 'x.x.x.x' and dstAddr not like 'y.y.') 
| stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
| sort bytesTransferred desc
| limit 10

# For downloads

filter (dstAddr like 'x.x.x.x' and srcAddr not like 'y.y.') 
| stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
| sort bytesTransferred desc
| limit 10

Queries for Apache server logs

 # twtech can use CloudWatch Logs Insights to query Apache server logs.
 # To Find the most relevant fields, so twtech can review its access logs and check for traffic in the /admin path of its application.

fields @timestamp, remoteIP, request, status, filename
| sort @timestamp desc
| filter filename="/var/www/html/admin"
| limit 20

  • # To Find the number of unique GET requests that accessed the main page with status code "200" (success).

fields @timestamp, remoteIP, method, status
| filter status="200" and referrer="http://34.250.27.141/" and method="GET"
| stats count_distinct(remoteIP) as UniqueVisits
| limit 10

  • # To Find the number of times the Apache service restarted.

fields @timestamp, function, process, message
| filter message like "resuming normal operations"
| sort @timestamp desc
| limit 20

Queries for Amazon EventBridge

# To Get the number of EventBridge events grouped by event detail type

fields @timestamp, @message
| stats count(*) as numberOfEvents by `detail-type`
| sort numberOfEvents desc

Examples of the parse command.

# To Use a glob expression to extract the fields @user, @method, and @latency from the log field @message and return the average latency for each unique combination of @method and @user.

parse @message "user=*, method:*, latency := *" as @user,
    @method, @latency | stats avg(@latency) by @method,
    @user

  • # To Use a regular expression to extract the fields @user2, @method2, and @latency2 from the log field @message and return the average latency for each unique combination of @method2 and @user2.

parse @message /user=(?<user2>.*?), method:(?<method2>.*?),
    latency := (?<latency2>.*?)/ | stats avg(latency2) by @method2, 
    @user2

  • # To Extract the fields loggingTime, loggingType, and loggingMessage, filter down to log events that contain ERROR or INFO strings, and then display only the loggingMessage and loggingType fields for events that contain an ERROR string.

FIELDS @message
| PARSE @message "* [*] *" as loggingTime, loggingType, loggingMessage
| FILTER loggingType IN ["ERROR", "INFO"]
| DISPLAY loggingMessage, loggingType = "ERROR" as isError






Amazon EventBridge - Overview. Scope: Intro, Core Concepts, Key Benefits, Link to official documentation, Insights. Intro: Amazon EventBridg...