Amazon CloudWatch Logs - Overview & Hands-On.
Scope:
- Intro,
- Key Features & Concepts,
- Link to official documentation,
- Sources of Log Data,
- Log Ingestion,
- Log Storage & Management,
- Processing & Analysis,
- Destinations & Integrations,
- Monitoring & Alerting,
- Security & Access Control,
- Final thoughts,
- Project: Hands-On,
- Keywords to filter and search logs,
- Official documentation link to get Queries.
- Amazon CloudWatch Logs is a service for monitoring and troubleshooting twtech applications and systems using their log data.
- Amazon CloudWatch Logs enables twtech to:
- Centralize the logs from all of its systems,
- Applications,
- AWS services that twtech runs,
- Regardless of where they are running,
- Provides features for:
- Searching,
- Analysis,
- Storage,
- Protection of that data.
- twtech can send logs from various sources to CloudWatch Logs, including Amazon EC2 instances, AWS Lambda functions, containerized applications (ECS, EKS), and more, using the CloudWatch agent or direct API calls.
- Logs are organized into logical log groups, which are collections of log streams that share the same retention, monitoring, and access control settings.
- CloudWatch Logs Insights provides a powerful, SQL-like query language to interactively search and analyze twtech log data, helping it to troubleshoot operational issues more quickly.
- twtech can create metric filters based on log content to extract numerical values and use them to generate CloudWatch metrics and trigger alarms, proactively notifying it of specific events (e.g., error counts exceeding a threshold).
- The service includes features to help protect sensitive data by automatically masking personally identifiable information (PII) and encrypting log data using AWS KMS keys.
- twtech has control over how long log events are stored.
- twtech can retain logs for a specified period or archive them to Amazon S3 for long-term storage and compliance purposes.
- Using subscription filters, twtech can stream log data in real time to Amazon Kinesis, Amazon Kinesis Data Firehose, or AWS Lambda for custom processing, analysis, or integration with third-party tools.
- Machine learning-powered capabilities help summarize thousands of log entries into patterns and detect anomalies, reducing the manual effort in log analysis.
1. Sources of Log Data
Logs can be ingested from multiple places:
- AWS Services
- Lambda function logs (stdout: standard output, stderr: standard error)
- ECS / EKS / Fargate container logs
- API Gateway execution/access logs
- VPC Flow Logs, Route 53 Resolver Logs
- CloudTrail events
- EC2 Instances
- CloudWatch Logs Agent
- CloudWatch Unified Agent
- On-premises / Hybrid
- Using the unified agent or Kinesis Agent
2. Log Ingestion
- Logs are collected and pushed into Log Groups (logical containers).
- Each log-producing entity writes to a Log Stream (a sequence of log events).
- Events are timestamped, stored, and indexed.
- Retention policies can be set (from 1 day to indefinite).
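The ingestion flow above can be sketched in Python. This is an illustrative toy only: it shows the event-batch shape the PutLogEvents API expects (timestamp in milliseconds plus message, sorted ascending); the boto3 call in the comment is an assumed usage, not twtech's actual code.

```python
import time

def build_put_log_events_batch(messages):
    """Build a batch in the shape PutLogEvents expects: a list of
    {timestamp, message} dicts, with timestamps in milliseconds,
    sorted in ascending order (out-of-order batches are rejected)."""
    now_ms = int(time.time() * 1000)
    return [{"timestamp": now_ms + i, "message": m}
            for i, m in enumerate(messages)]

batch = build_put_log_events_batch(["app started", "request served"])
# With boto3, the batch would then be sent to a log stream, e.g.:
# logs.put_log_events(logGroupName="twtechMetricLG",
#                     logStreamName="stream-1", logEvents=batch)
```

Each event in the batch is what later appears as one log event inside a log stream.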
3. Log Storage & Management
- Log Groups
- Organize logs by application, service, or environment.
- Can set retention policies.
- Log Streams
- Represent individual sources (e.g., per Lambda instance, per EC2).
- Subscriptions
- Define streaming of logs to other services.
4. Processing & Analysis
- CloudWatch Logs Insights
- Purpose-built query engine.
- SQL-like queries for searching, filtering, and aggregating logs.
- Supports visualization and dashboard integration.
- Metric Filters
- Convert specific patterns in logs into CloudWatch Metrics.
- Example: Count ERROR occurrences → metric → alarm.
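The ERROR-counting example can be modeled in a few lines of Python. This is a toy sketch of what a metric filter does, not the AWS implementation: count matching events in a batch, and that count becomes a metric datapoint.

```python
def error_metric_value(log_lines, term="ERROR"):
    """Toy model of a metric filter: count events in a batch that
    contain the term; the count becomes a metric datapoint."""
    return sum(1 for line in log_lines if term in line)

lines = [
    "2025-01-01T00:00:00Z INFO started",
    "2025-01-01T00:00:01Z ERROR db timeout",
    "2025-01-01T00:00:02Z ERROR retry failed",
]
value = error_metric_value(lines)  # this value would feed the CloudWatch metric
```

An alarm on the resulting metric then fires when the count crosses a threshold.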
5. Destinations & Integrations
- Kinesis Data Streams / Firehose
- For near real-time log streaming to S3, Redshift, Elasticsearch/OpenSearch, or 3rd-party tools.
- Lambda
- Trigger functions on specific log events.
- S3 (via Firehose)
- Long-term storage and archival.
- OpenSearch Service
- For log search and visualization.
- Security & Compliance Integrations
- With services like GuardDuty, Security Hub, and SIEM tools.
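When a subscription filter targets Lambda, the function receives log events as a base64-encoded, gzip-compressed JSON payload. A minimal Python sketch of decoding it (illustrative only, not twtech's actual handler; the sample log group name is just an example):

```python
import base64
import gzip
import json

def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a Lambda
    subscription-filter target: base64-encoded, gzip-compressed JSON
    containing logGroup, logStream, and a list of logEvents."""
    raw = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(raw))

# Build a fake payload locally to exercise the decoder.
sample = {
    "logGroup": "twtechMetricLG",
    "logEvents": [{"id": "1", "timestamp": 0, "message": "ERROR db timeout"}],
}
blob = base64.b64encode(gzip.compress(json.dumps(sample).encode()))
decoded = decode_subscription_event({"awslogs": {"data": blob}})
```

Inside a real handler, the decoded `logEvents` list is what gets forwarded or analyzed.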
6. Monitoring & Alerting
- CloudWatch Metrics + Alarms
- From metric filters or generated by services.
- Can trigger notifications (SNS), scaling actions, or custom automation.
- Dashboards
- Combine logs, metrics, and alarms for observability.
7. Security & Access Control
- Encryption
- Logs at rest encrypted with KMS keys.
- IAM Policies
- Control who can read/write/stream logs.
- Cross-Account Sharing
- Subscription filters can send logs to another account.
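As an illustration of the IAM control above, a minimal policy that lets an application role write events into one specific log group might look like the following (the region, account ID, and log group name are placeholders/assumptions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:twtechMetricLG:*"
    }
  ]
}
```

Scoping the `Resource` to a single log group keeps producers from writing into (or reading) unrelated groups.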
Final thoughts:
- CloudWatch Logs acts as the central nervous system for AWS observability.
- It ingests logs from:
- AWS services,
- Applications,
- On-premises systems.
- It stores and indexes them, and provides query and alerting features.
- It integrates with downstream analytics, monitoring, and SIEM (Security Information and Event Management) systems.
Project: Hands-On
- How twtech creates and uses CloudWatch Logs for monitoring and observability of its metrics.
- How twtech accesses CloudWatch Logs: Log groups
- Log groups are created by different services.
- Each log group has its own embedded log streams.
- Within log streams are embedded log events.
- From within log events, twtech filters (with keywords) to get insights from the logs.
- Key word: http,
- Key word: exceptions.
- How twtech filters keywords/terms in CloudWatch Logs filters (for metric filters, subscription filters, and searching logs).
- Breakdown of the main ones and their meaning:
- CloudWatch Logs uses a simple filter pattern language to match log events.
1. Terms / Strings
- "ERROR" – Matches any log event containing the exact string ERROR.
- "200" – Matches log events containing the string 200.
- Strings are case sensitive.
- "ERROR Timeout" → Matches logs that contain both ERROR and Timeout.
- "\"User login failed\"" → Matches the exact phrase User login failed.
- -ERROR → Matches logs that do not contain ERROR.
- ERROR || Exception → Matches if either ERROR or Exception is present.
- When logs are JSON, twtech can filter on fields:
- { $.status = 404 } → Matches if the JSON field status is 404.
- { $.latency > 500 } → Matches if latency field is greater than 500.
- { $.user != "admin" } → Matches if user is not admin.
- { $.bytes >= 1000 } → Matches if the bytes field is 1000 or more.
- { $.requestId = * } → Matches logs where the requestId field exists (any value).
- "?ERROR*" → Matches any string where ERROR appears with prefix/suffix wildcards.
- ? → single character wildcard.
- * → multi-character wildcard.
- (ERROR || Exception) && Timeout → Matches logs with Timeout and either ERROR or Exception.
NB:
- Even without JSON, twtech can filter numeric-looking text:
- [ip, user, status=404, bytes>1000] → Matches space-delimited fields with conditions.
Keyword / Pattern | Meaning
ERROR | Contains the word ERROR
-ERROR | Does not contain ERROR
ERROR Timeout | Must contain both terms
ERROR || Exception | Contains either term
"Exact Match" | Matches the exact phrase
{ $.field = value } | JSON field equals value
{ $.field > 100 } | JSON field greater than 100
{ $.field != value } | JSON field not equal
{ $.field = * } | Field exists (any value)
? and * | ? marks optional terms; * matches any value
() | Grouping for AND/OR logic
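To build intuition for these patterns, here is a toy Python re-implementation of two of them (simplified on purpose; the real filter syntax is richer than this):

```python
import json

def matches_terms(include, exclude, line):
    """Toy model of plain-text patterns: every included term must
    appear in the event, and no excluded term may appear."""
    return (all(t in line for t in include)
            and not any(t in line for t in exclude))

def json_field_equals(line, field, value):
    """Toy model of the JSON pattern { $.field = value }."""
    try:
        return json.loads(line).get(field) == value
    except ValueError:
        return False

# "ERROR Timeout" must contain both terms; "-ERROR" excludes the term.
both = matches_terms(["ERROR", "Timeout"], [], "ERROR: read Timeout")
excluded = matches_terms([], ["ERROR"], "all good here")
json_hit = json_field_equals('{"status": 404}', "status", 404)
```

Non-JSON events simply never match a JSON pattern, which the `except ValueError` branch mirrors.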
- How twtech creates metric filters (to match filter keywords/patterns)
- Test the pattern: Results
- Next: Assign metric
- twtech creates a filter name: twtechFilterMetric
- Log events that match the pattern twtech defines are recorded to the metric that it specifies.
- twtech can therefore graph the metric and set alarms for notification.
Metric details
- Metric namespace: twtechFilterMetricNs
- Namespaces let twtech group similar metrics
- Review and create Metric filter
- Create metric filter:
- How twtech accesses the metric created in CloudWatch: Metrics / All metrics
- Access: graphed metrics
- Create alarms on top of the metrics: create an alarm to notify when the metric exceeds a set value.
- Conditions: Threshold type
Configure actions: Alarm state trigger
- Define the alarm state that will trigger this action.
- From: create SNS topic
- To:
- Check email and confirm subscription.
- Confirm subscription:
- A distribution email (group email) is recommended if several twtech recipients should be notified at the same time.
Add alarm details
- Name and description
- Alarm name: twtechMetricsAlarm
- Preview and create
- Create Alarm: twtechMetricsAlarm
- Details of Alarm: twtechMetricsAlarm
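The alarm behavior just configured can be sketched in Python. This is a simplified model under stated assumptions (a GreaterThanThreshold condition, consecutive breaching periods); real CloudWatch alarms also support INSUFFICIENT_DATA, M-out-of-N datapoints, and more:

```python
def alarm_state(datapoints, threshold, periods=3):
    """Toy model of alarm evaluation: go to ALARM when the most
    recent `periods` datapoints all exceed the threshold,
    otherwise stay OK."""
    recent = datapoints[-periods:]
    if len(recent) == periods and all(v > threshold for v in recent):
        return "ALARM"
    return "OK"

state = alarm_state([1, 2, 9, 9, 9], threshold=5)
```

In the real service, the ALARM transition is what triggers the SNS notification configured above.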
- How twtech creates subscription filters for log groups:
- Create subscription filter: Create Lambda subscription filter
- How twtech may edit log retention settings: the duration twtech prefers to keep the logs.
- Retention period: 1 day – 10 years
- How twtech may export data into an S3 bucket: twtechs3Bucket
Export data to Amazon S3: twtech-s3bucket
- How twtech creates log groups: twtechMetricLG
- Create Log group (LG)
- How twtech uses CloudWatch Logs Insights for in-depth analysis.
- How twtech runs a query for specific log groups: select the log group, then Run query.
- # Using the query language
fields @timestamp, @message
| sort @timestamp desc
| limit 10000
- Run query for the past: Custom, 3 days
- Logs created in the past 3 days: 94
- How twtech saves its query results:
- Assign a query name: twtechMetricLogQuery
- The Official documentation link to get Queries:
Common queries used to get logs for:
- Lambda,
- VPC Flow Logs,
- CloudTrail,
- NetworkFirewall,
- Route53,
- AWS AppSync,
- NAT Gateway,
- IoT,
- Elemental MediaPackage V2 Access Logs,
- SES Mail Manager,
- Amazon Q Business Conversation Log.
General queries
- # To Find the 25 most recently added log events.
fields @timestamp, @message | sort @timestamp desc | limit 25
- # To Get a list of the number of exceptions per hour.
filter @message like /Exception/ | stats count(*) as exceptionCount by bin(1h) | sort exceptionCount desc
- # To Get a list of log events that aren't exceptions.
fields @message | filter @message not like /Exception/
- # To Get the most recent log event for each unique value of the server field.
fields @timestamp, server, severity, message | sort @timestamp desc | dedup server
- # To Get the most recent log event for each unique value of the server field for each severity type.
fields @timestamp, server, severity, message | sort @timestamp desc | dedup server, severity
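The sort/dedup semantics used in these queries can be modeled in Python (illustrative only, not the Insights engine): after sorting newest-first, dedup keeps only the first event seen per unique key combination.

```python
def sort_dedup(events, keys):
    """Toy model of `sort @timestamp desc | dedup k1, k2`: after
    sorting newest-first, keep only the first event seen for each
    unique combination of key fields."""
    seen, out = set(), []
    for e in sorted(events, key=lambda ev: ev["timestamp"], reverse=True):
        k = tuple(e[f] for f in keys)
        if k not in seen:
            seen.add(k)
            out.append(e)
    return out

events = [
    {"timestamp": 1, "server": "a", "message": "old"},
    {"timestamp": 2, "server": "a", "message": "new"},
    {"timestamp": 1, "server": "b", "message": "only"},
]
latest = sort_dedup(events, ["server"])
```

This is why the sort direction matters: with ascending order, dedup would keep the oldest event per server instead of the newest.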
Queries for Lambda logs
- # To Determine the amount of overprovisioned memory.
filter @type = "REPORT" | stats max(@memorySize / 1000 / 1000) as provisonedMemoryMB, min(@maxMemoryUsed / 1000 / 1000) as smallestMemoryRequestMB, avg(@maxMemoryUsed / 1000 / 1000) as avgMemoryUsedMB, max(@maxMemoryUsed / 1000 / 1000) as maxMemoryUsedMB, provisonedMemoryMB - maxMemoryUsedMB as overProvisionedMB
- # To Create a latency report.
filter @type = "REPORT" | stats avg(@duration), max(@duration), min(@duration) by bin(5m)
- # To Search for slow function invocations, and eliminate duplicate requests that can arise from retries or client-side code. In this query, @duration is in milliseconds.
fields @timestamp, @requestId, @message, @logStream | filter @type = "REPORT" and @duration > 1000 | sort @timestamp desc | dedup @requestId | limit 20
Queries for Amazon VPC flow logs
- # To Find the top 15 packet transfers across hosts:
stats sum(packets) as packetsTransferred by srcAddr, dstAddr | sort packetsTransferred desc | limit 15
- # To Find the top 15 byte transfers for hosts on a given subnet.
filter isIpv4InSubnet(srcAddr, "192.0.2.0/24") | stats sum(bytes) as bytesTransferred by dstAddr | sort bytesTransferred desc | limit 15
- # To Find the IP addresses that use UDP as a data transfer protocol.
filter protocol=17 | stats count(*) by srcAddr
- # To Find the IP addresses where flow records were skipped during the capture window.
filter logStatus="SKIPDATA" | stats count(*) by bin(1h) as t | sort t
- # To Find a single record for each connection, to help troubleshoot network connectivity issues.
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, bytes | filter logStream = 'vpc-flow-logs' and interfaceId = 'eni-0123456789abcdef0' | sort @timestamp desc | dedup srcAddr, dstAddr, srcPort, dstPort, protocol | limit 20
Queries for Route 53 logs
# To Find the distribution of records per hour by query type.
stats count(*) by queryType, bin(1h)
- # To Find the 10 DNS resolvers with the highest number of requests.
stats count(*) as numRequests by resolverIp | sort numRequests desc | limit 10
- # To Find the number of records by domain and subdomain where the server failed to complete the DNS request.
filter responseCode="SERVFAIL" | stats count(*) by queryName
Queries for CloudTrail logs
# To Find the number of log entries for each service, event type, and AWS Region.
stats count(*) by eventSource, eventName, awsRegion
- # To Find the Amazon EC2 hosts that were started or stopped in a given AWS Region.
filter (eventName="StartInstances" or eventName="StopInstances") and awsRegion="us-east-2"
# To Find the AWS Regions, user names, and ARNs of newly created IAM users.
filter eventName="CreateUser" | fields awsRegion, requestParameters.userName, responseElements.user.arn
- # To Find the number of records where an exception occurred
while invoking the API
UpdateTrail.
filter eventName="UpdateTrail" and ispresent(errorCode) | stats count(*) by errorCode, errorMessage
- # To Find log entries where TLS 1.0 or 1.1 was used
filter tlsDetails.tlsVersion in [ "TLSv1", "TLSv1.1" ]
| stats count(*) as numOutdatedTlsCalls by userIdentity.accountId, recipientAccountId, eventSource, eventName, awsRegion, tlsDetails.tlsVersion, tlsDetails.cipherSuite, userAgent| sort eventSource, eventName, awsRegion, tlsDetails.tlsVersion
- # To Find the number of calls per service that used TLS versions 1.0 or 1.1
filter tlsDetails.tlsVersion in [ "TLSv1", "TLSv1.1" ]| stats count(*) as numOutdatedTlsCalls by eventSource| sort numOutdatedTlsCalls desc
Queries for Amazon API Gateway
# To Find the last 10 4XX errors
fields @timestamp, status, ip, path, httpMethod| filter status>=400 and status<=499| sort @timestamp desc| limit 10
- # To Identify the 10 longest-running Amazon API Gateway requests in twtech Amazon API Gateway access log group
fields @timestamp, status, ip, path, httpMethod, responseLatency| sort responseLatency desc| limit 10
- # To Return the list of the most popular API paths in twtech Amazon API Gateway access log group
stats count(*) as requestCount by path| sort requestCount desc| limit 10
- # To Create an integration latency report for twtech Amazon API Gateway access log group
filter status=200| stats avg(integrationLatency), max(integrationLatency), min(integrationLatency) by bin(1m)
Queries for NAT gateway
- # If twtech notices higher than normal costs in its AWS bill, it can use CloudWatch Logs Insights to find the top contributors.
NB:
- In the following queries, replace "x.x.x.x" with the private IP of the twtech NAT gateway, and replace "y.y" with the first two octets of its VPC CIDR range.
- # To Find the instances that send the most traffic through the NAT gateway.
filter (dstAddr like 'x.x.x.x' and srcAddr like 'y.y.') | stats sum(bytes) as bytesTransferred by srcAddr, dstAddr | sort bytesTransferred desc | limit 10
- # To Determine the traffic that's going to and from the instances through the twtech NAT gateway.
filter (dstAddr like 'x.x.x.x' and srcAddr like 'y.y.') or (srcAddr like 'x.x.x.x' and dstAddr like 'y.y.') | stats sum(bytes) as bytesTransferred by srcAddr, dstAddr | sort bytesTransferred desc | limit 10
- # To Determine the internet destinations that the instances in twtech VPC communicate with most often for uploads and downloads.
# For uploads
filter (srcAddr like 'x.x.x.x' and dstAddr not like 'y.y.') | stats sum(bytes) as bytesTransferred by srcAddr, dstAddr| sort bytesTransferred desc| limit 10
# For downloads
filter (dstAddr like 'x.x.x.x' and srcAddr not like 'y.y.') | stats sum(bytes) as bytesTransferred by srcAddr, dstAddr| sort bytesTransferred desc| limit 10
Queries for Apache server logs
- twtech can use CloudWatch Logs Insights to query Apache server logs.
- # To Find the most relevant fields, so twtech can review its access logs and check for traffic in the /admin path of its application.
fields @timestamp, remoteIP, request, status, filename | sort @timestamp desc | filter filename="/var/www/html/admin" | limit 20
- # To Find the number of unique GET requests that accessed the main page with status code "200" (success).
fields @timestamp, remoteIP, method, status
| filter status="200" and referrer="http://34.250.27.141/" and method="GET" | stats count_distinct(remoteIP) as UniqueVisits | limit 10
- # To Find the number of times the Apache service restarted.
fields @timestamp, function, process, message
| filter message like "resuming normal operations"| sort @timestamp desc| limit 20
Queries for Amazon EventBridge
# To Get the number of EventBridge events grouped by event detail type
fields @timestamp, @message| stats count(*) as numberOfEvents by `detail-type`| sort numberOfEvents desc
Examples of the parse command.
- # To Use a glob expression to extract the fields @user, @method, and @latency from the log field @message and return the average latency for each unique combination of @method and @user.
parse @message "user=*, method:*, latency := *" as @user, @method, @latency | stats avg(@latency) by @method, @user
- # To Use a regular expression to extract the fields @user2, @method2, and @latency2 from the log field @message and return the average latency for each unique combination of @method2 and @user2.
parse @message /user=(?<user2>.*?), method:(?<method2>.*?), latency := (?<latency2>.*?)/ | stats avg(latency2) by @method2, @user2
- # To Extract the fields loggingTime, loggingType, and loggingMessage, filter down to log events that contain ERROR or INFO strings, and then display only the loggingMessage and loggingType fields for events that contain an ERROR string.
FIELDS @message| PARSE @message "* [*] *" as loggingTime, loggingType, loggingMessage
| FILTER loggingType IN ["ERROR", "INFO"]
| DISPLAY loggingMessage, loggingType = "ERROR" as isError
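The glob pattern in the first parse example corresponds roughly to a regular expression with named capture groups. A small Python sketch (illustrative only; field names mirror the example):

```python
import re

# The Insights glob "user=*, method:*, latency := *" maps roughly to
# this regex with three named capture groups.
PATTERN = re.compile(
    r"user=(?P<user>.*?), method:(?P<method>.*?), latency := (?P<latency>.*)")

def parse_line(line):
    """Return the extracted fields as a dict, or None if no match."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None

parsed = parse_line("user=alice, method:GET, latency := 120")
```

Lazy quantifiers (`.*?`) stop each field at the next literal delimiter, which is the same role the `*` wildcards play in the glob form.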