CloudWatch, X-Ray & Observability

13 min read · CloudWatch · X-Ray · CloudTrail · Observability · DVA-C02

A comprehensive deep dive into AWS observability — CloudWatch Metrics, Alarms, Logs, Logs Insights, Dashboards, EMF, AWS X-Ray distributed tracing, CloudTrail, and the three pillars of observability for the DVA-C02 exam.

Observability Mental Model

The three pillars of observability (metrics, logs, and traces) tell you what your system is doing, why it broke, and where the bottleneck is.


Part 1 — Amazon CloudWatch Metrics

How Metrics Work

CloudWatch stores metrics as time-series data. Each metric is identified by a namespace, metric name, and zero or more dimensions.


Concept	Detail
Namespace	Container for metrics — e.g., `AWS/EC2`, `AWS/Lambda`, `MyApp/Orders`
Metric name	e.g., `CPUUtilization`, `Duration`, `ErrorCount`
Dimension	Key-value filter — e.g., `FunctionName=my-fn`, `InstanceId=i-123`
Resolution	Standard: 60s minimum; High-resolution: 1s (StorageResolution=1)
Retention	3h for 1s, 15 days for 1m, 63 days for 5m, 15 months for 1h
Statistics	Average, Sum, Min, Max, SampleCount, pNN.NN (percentile)
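
Because a metric is identified by its namespace, name, and complete set of dimensions, publishing the same metric name with a different dimension set creates a separate metric. The sketch below illustrates that rule with a hypothetical `metricKey` helper; it is not part of any AWS SDK and only models the identity rule described above.

```javascript
// Hypothetical helper: builds a string key that models how a metric is
// identified (namespace + metric name + the complete dimension set).
function metricKey(namespace, metricName, dimensions = {}) {
  const dims = Object.entries(dimensions)
    .sort(([a], [b]) => a.localeCompare(b)) // dimension order does not matter
    .map(([name, value]) => `${name}=${value}`)
    .join(',');
  return `${namespace}|${metricName}|${dims}`;
}

const envOnly = metricKey('MyApp/Orders', 'OrdersProcessed', { Environment: 'prod' });
const envAndRegion = metricKey('MyApp/Orders', 'OrdersProcessed', {
  Environment: 'prod',
  Region: 'us-east-1',
});
const reordered = metricKey('MyApp/Orders', 'OrdersProcessed', {
  Region: 'us-east-1',
  Environment: 'prod',
});

console.log(envOnly === envAndRegion);   // false: an extra dimension is a different metric
console.log(envAndRegion === reordered); // true: same dimensions, different order
```

In practice this means a query or alarm must name the same dimension set that was used when the data was published.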

Publishing Custom Metrics

javascript

import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({ region: 'us-east-1' });

// Standard resolution (60s granularity).
// Custom metrics are billed per metric at either resolution; they are not free.
await cw.send(new PutMetricDataCommand({
  Namespace: 'MyApp/Orders',
  MetricData: [
    {
      MetricName: 'OrdersProcessed',
      Value: 42,
      Unit: 'Count',
      Dimensions: [
        { Name: 'Environment', Value: 'prod' },
        { Name: 'Region', Value: 'us-east-1' },
      ],
      Timestamp: new Date(),
    },
    {
      MetricName: 'OrderProcessingLatency',
      Value: 234,
      Unit: 'Milliseconds',
      Dimensions: [{ Name: 'Environment', Value: 'prod' }],
    },
  ],
}));

// High-resolution metric (1s granularity), roughly $0.30/metric/month
await cw.send(new PutMetricDataCommand({
  Namespace: 'MyApp/Payments',
  MetricData: [{
    MetricName: 'PaymentErrors',
    Value: 1,
    Unit: 'Count',
    StorageResolution: 1,     // 1 = high-resolution; 60 = standard
  }],
}));

Embedded Metric Format (EMF)

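EMF lets you emit custom metrics directly from Lambda log lines, with no `PutMetricData` API calls. CloudWatch extracts the metrics asynchronously from the logs; the resulting metrics are still billed as custom metrics.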

javascript

// Lambda: emit metrics via a structured log line (EMF format)
export async function handler(event) {
  const start = Date.now();

  // ... do work ...

  const duration = Date.now() - start;

  // EMF structured log: CloudWatch automatically extracts the metrics
  console.log(JSON.stringify({
    '_aws': {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: 'MyApp/Lambda',
        // Two dimension sets: one keyed by FunctionName, one by Environment
        Dimensions: [['FunctionName'], ['Environment']],
        Metrics: [
          { Name: 'ProcessingTime', Unit: 'Milliseconds' },
          { Name: 'ItemsProcessed', Unit: 'Count' },
        ],
      }],
    },
    // Every dimension and metric named above needs a value at the root.
    FunctionName: process.env.AWS_LAMBDA_FUNCTION_NAME,
    // JSON.stringify drops undefined values, so fall back to a placeholder
    Environment: process.env.NODE_ENV ?? 'unknown',
    ProcessingTime: duration,
    ItemsProcessed: event.Records?.length ?? 1,
  }));
}
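
An EMF document is only useful if every dimension and metric it declares also has a value at the root of the JSON object. As a minimal sketch, the hypothetical `buildEmf` helper below (not part of any AWS library) assembles the same structure as the handler above and fails fast when a value is missing. It assumes a single dimension set containing all supplied dimension keys.

```javascript
// Hypothetical helper: builds an EMF log object and throws when a declared
// dimension or metric has no usable value at the root of the document.
function buildEmf({ namespace, dimensions, metrics, timestamp = Date.now() }) {
  const doc = {
    _aws: {
      Timestamp: timestamp,
      CloudWatchMetrics: [{
        Namespace: namespace,
        Dimensions: [Object.keys(dimensions)], // one dimension set with all keys
        Metrics: Object.entries(metrics).map(([Name, { unit }]) => ({ Name, Unit: unit })),
      }],
    },
  };
  for (const [name, value] of Object.entries(dimensions)) {
    if (value === undefined || value === null) {
      throw new Error(`Missing dimension value: ${name}`);
    }
    doc[name] = String(value);
  }
  for (const [name, { value }] of Object.entries(metrics)) {
    if (typeof value !== 'number') {
      throw new Error(`Metric ${name} must be a number`);
    }
    doc[name] = value;
  }
  return doc;
}

// Usage: one JSON document per log line.
console.log(JSON.stringify(buildEmf({
  namespace: 'MyApp/Lambda',
  dimensions: { FunctionName: 'my-fn' },
  metrics: { ProcessingTime: { unit: 'Milliseconds', value: 12 } },
})));
```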

Metric Math

Combine metrics with mathematical expressions:

bash

# Error rate = errors / invocations * 100
# In CloudWatch console or via API:
# EXPRESSION: (m1/m2)*100
# m1 = Errors metric
# m2 = Invocations metric

Useful functions: SUM(), AVG(), MAX(), MIN(), RATE(), DIFF(), FILL(), SEARCH(), IF()
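
To make the arithmetic concrete, here is a small sketch of the `(m1/m2)*100` expression applied datapoint by datapoint. The function name `errorRatePercent` is hypothetical, and returning `null` for periods with zero invocations is a choice made for this sketch rather than a statement about how CloudWatch handles division by zero.

```javascript
// Sketch of (m1/m2)*100 applied per period.
// errors = m1 datapoints, invocations = m2 datapoints, aligned by period.
function errorRatePercent(errors, invocations) {
  return errors.map((errorCount, i) => {
    const total = invocations[i];
    // No invocations in the period: emit no value instead of NaN or Infinity.
    return total > 0 ? (errorCount / total) * 100 : null;
  });
}

console.log(errorRatePercent([2, 0, 5], [4, 50, 0])); // [ 50, 0, null ]
```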

Part 2 — CloudWatch Alarms

Alarm States


State	Meaning
`OK`	Metric is within threshold
`ALARM`	Metric has breached threshold
`INSUFFICIENT_DATA`	Not enough data to evaluate (startup, gaps)

Creating an Alarm

javascript

import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

await cw.send(new PutMetricAlarmCommand({
  AlarmName: 'HighLambdaErrorRate',
  // This alarm watches the error count (Sum of Errors), not a percentage.
  AlarmDescription: 'More than 10 Lambda errors per minute in 3 of the last 5 minutes',
  Namespace: 'AWS/Lambda',
  MetricName: 'Errors',
  Dimensions: [{ Name: 'FunctionName', Value: 'my-function' }],
  Statistic: 'Sum',
  Period: 60,              // length of each evaluation period, in seconds
  EvaluationPeriods: 5,    // number of most recent periods to evaluate (N)
  DatapointsToAlarm: 3,    // alarm if 3 of those 5 periods breach (M of N)
  Threshold: 10,
  ComparisonOperator: 'GreaterThanThreshold',
  TreatMissingData: 'notBreaching',  // breaching | notBreaching | ignore | missing
  AlarmActions: [
    'arn:aws:sns:us-east-1:123456789012:ops-alerts',
    'arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:...',
  ],
  OKActions: ['arn:aws:sns:us-east-1:123456789012:ops-alerts'],
}));
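
The interaction between `EvaluationPeriods`, `DatapointsToAlarm`, and `TreatMissingData` is easier to see in code. The following is a deliberately simplified model, not CloudWatch's actual evaluation algorithm: `evaluateAlarm` is a hypothetical function, it only handles `GreaterThanThreshold`, and it does not model the `ignore` option (which keeps the current state).

```javascript
// Simplified M-of-N model. `datapoints` holds recent values, oldest first;
// null marks a period with no data.
function evaluateAlarm(datapoints, options) {
  const { threshold, evaluationPeriods, datapointsToAlarm, treatMissingData } = options;
  const recent = datapoints.slice(-evaluationPeriods); // the last N periods

  let breaching = 0;
  let present = 0;
  for (const value of recent) {
    if (value === null) {
      if (treatMissingData === 'breaching') breaching++;
      continue;
    }
    present++;
    if (value > threshold) breaching++; // GreaterThanThreshold
  }

  if (breaching >= datapointsToAlarm) return 'ALARM';
  if (present === 0 && treatMissingData === 'missing') return 'INSUFFICIENT_DATA';
  return 'OK';
}

const config = {
  threshold: 10,
  evaluationPeriods: 5,
  datapointsToAlarm: 3,
  treatMissingData: 'notBreaching',
};

console.log(evaluateAlarm([12, 3, 15, null, 11], config)); // ALARM (3 of 5 breach)
console.log(evaluateAlarm([12, 3, 4, null, 11], config));  // OK (only 2 of 5 breach)
```

With M of N, a single quiet minute in the middle of an incident does not reset the alarm, which is why it tends to be less flappy than requiring N consecutive breaches.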

Alarm Actions

Action type	Use case
SNS notification	Email, Slack webhook, PagerDuty, on-call
Auto Scaling policy	Scale in/out EC2 fleet
EC2 action	Reboot, stop, terminate, recover instance
Systems Manager OpsItem	Create incident ticket
Lambda	Custom remediation automation

Composite Alarms

Composite alarms combine the states of other alarms with AND, OR, and NOT logic, which reduces alert fatigue:

bash

aws cloudwatch put-composite-alarm \
  --alarm-name "ProductionOutage" \
  --alarm-rule "ALARM(HighErrorRate) AND ALARM(HighLatency)" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:pagerduty

Part 3 — CloudWatch Logs

Log Hierarchy


Level	Scope	Retention set at
Log Group	Named container (e.g., `/aws/lambda/my-fn`)	Log Group
Log Stream	Sequence of events from one source instance	Inherited from group
Log Event	Single timestamped log line	N/A

Retention Policy

bash

# Set 30-day retention on a log group (default = never expire = expensive!)
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Valid values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365,
#               400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653

Metric Filters

Extract metrics from log text patterns — no code changes needed:

bash

# Create a metric filter: count ERROR occurrences per minute
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorCount \
  --filter-pattern "[timestamp, requestId, level=ERROR, ...]" \
  --metric-transformations \
    metricName=LambdaErrors,metricNamespace=MyApp/Lambda,metricValue=1,defaultValue=0

# JSON log filter: match specific field values
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name PaymentFailures \
  --filter-pattern '{ $.level = "ERROR" && $.event = "payment_failed" }' \
  --metric-transformations \
    metricName=PaymentFailures,metricNamespace=MyApp/Payments,metricValue=1

Subscription Filters (Real-time Log Streaming)


bash

# Stream logs matching ERROR to a Lambda function.
# The target function must allow logs.amazonaws.com to invoke it
# (grant this with `aws lambda add-permission`).
aws logs put-subscription-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorsToLambda \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:lambda:us-east-1:123456789012:function:log-processor

Each log group supports up to two subscription filters. To fan out to more consumers, send the logs to a Kinesis data stream and attach the consumers there.
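
When the subscription target is a Lambda function, the log batch arrives in `event.awslogs.data` as base64-encoded, gzip-compressed JSON. The sketch below shows one way to decode it using only Node's built-in `zlib` module. The `decodeLogsEvent` and `handler` names are illustrative, and the handler simply prints each event rather than doing real processing.

```javascript
const zlib = require('zlib');

// Decodes the payload CloudWatch Logs delivers to a Lambda subscription
// target: base64 -> gunzip -> JSON.
function decodeLogsEvent(event) {
  const compressed = Buffer.from(event.awslogs.data, 'base64');
  return JSON.parse(zlib.gunzipSync(compressed).toString('utf8'));
}

// Illustrative handler for a log-processing function.
async function handler(event) {
  const { logGroup, logStream, logEvents } = decodeLogsEvent(event);
  for (const { timestamp, message } of logEvents) {
    console.log(`${logGroup}/${logStream} @ ${timestamp}: ${message}`);
  }
  return logEvents.length; // number of log events in this batch
}
```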

CloudWatch Logs Insights

Logs Insights is an interactive query engine for log data:

bash

# Find the 10 slowest Lambda invocations in the last hour
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 10

# Count errors by error type
fields @message
| filter @message like /ERROR/
| parse @message "ERROR * -" as errorType
| stats count(*) as errorCount by errorType
| sort errorCount desc

# P99 latency over time (5-minute buckets)
fields @timestamp, @duration
| filter @type = "REPORT"
| stats pct(@duration, 99) as p99 by bin(5m)

# Find cold starts
filter @message like /Init Duration/
| stats count() as coldStarts, avg(@initDuration) as avgInitMs by bin(1h)

Supported log types	Auto-parsed fields
Lambda	`@timestamp`, `@message`, `@requestId`, `@duration`, `@billedDuration`, `@initDuration`, `@maxMemoryUsed`
VPC Flow Logs	`srcAddr`, `dstAddr`, `srcPort`, `dstPort`, `protocol`, `action` (discovered log fields carry no `@` prefix)
CloudTrail	`eventName`, `userIdentity.arn`, `sourceIPAddress` (JSON fields, no `@` prefix)
Any JSON	Fields auto-extracted from JSON keys

Part 4 — AWS X-Ray

How X-Ray Works


Core Concepts

Concept	Definition
Trace	The complete end-to-end journey of a single request across all services
Segment	One service's contribution to a trace (e.g., Lambda, EC2, API Gateway)
Subsegment	Granular unit within a segment — DB call, HTTP call, custom block
Trace Header	`X-Amzn-Trace-Id` HTTP header — carries the trace ID between services
Annotations	Indexed key-value pairs (max 50) — searchable in X-Ray console/API
Metadata	Non-indexed key-value pairs — larger payloads, not searchable
Service Map	Visual graph of all services and their connections for a time window
Groups	Saved filter expressions for segmenting traces
Insights	Anomaly detection — automatically surfaces unusual fault/latency spikes

X-Ray SDK Usage (Node.js)

javascript

import AWSXRay from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import https from 'https';

// Instrument an AWS SDK v3 client. Each client is wrapped individually,
// and the returned (instrumented) client is the one to use for calls.
const ddb = AWSXRay.captureAWSv3Client(new DynamoDBClient({}));

// Instrument outbound HTTPS calls by patching the https module in place
AWSXRay.captureHTTPsGlobal(https);

export async function handler(event) {
  // Create a custom subsegment for a logical operation
  const segment = AWSXRay.getSegment();
  const subsegment = segment.addNewSubsegment('processOrder');

  try {
    // Add searchable annotations (indexed: use for filtering)
    subsegment.addAnnotation('orderId', event.orderId);
    subsegment.addAnnotation('userId', event.userId);
    subsegment.addAnnotation('tier', 'premium');

    // Add metadata (not indexed: use for debugging details)
    subsegment.addMetadata('orderPayload', event);
    subsegment.addMetadata('processingConfig', { retries: 3, timeout: 5000 });

    // processOrder is the application's own business logic (not shown)
    return await processOrder(event);
  } catch (err) {
    subsegment.addError(err);
    throw err;
  } finally {
    // Always close the subsegment so it is sent, on success or failure
    subsegment.close();
  }
}

Sampling Rules

X-Ray does not record every request — it samples to control cost and noise.


Setting	Default	Meaning
Reservoir	1 req/sec per rule	First N requests per second always recorded
Fixed rate	5%	% of requests beyond reservoir that are sampled
Custom rules	Priority-ordered	Match by service name, URL path, HTTP method, etc.
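
The reservoir and fixed rate work together: the reservoir guarantees a minimum number of traces per second, and the fixed rate samples a percentage of whatever exceeds it. The sketch below is a simplified single-process model of one rule, not the X-Ray SDK's implementation. `createSampler` is a hypothetical name, and the random source is injectable so the behavior is deterministic in the example.

```javascript
// Simplified model of one sampling rule: a per-second reservoir plus a
// fixed rate applied to requests beyond the reservoir.
function createSampler({ reservoirSize, fixedRate }, random = Math.random) {
  let currentSecond = null;
  let usedThisSecond = 0;

  return function shouldSample(nowMs) {
    const second = Math.floor(nowMs / 1000);
    if (second !== currentSecond) {
      currentSecond = second; // a new second starts: refill the reservoir
      usedThisSecond = 0;
    }
    if (usedThisSecond < reservoirSize) {
      usedThisSecond++;
      return true; // within the reservoir: always recorded
    }
    return random() < fixedRate; // beyond the reservoir: sampled at the fixed rate
  };
}

// Default-style rule (1 req/sec reservoir, 5% fixed rate) with a stubbed
// random source that always returns 0.5.
const shouldSample = createSampler({ reservoirSize: 1, fixedRate: 0.05 }, () => 0.5);
console.log(shouldSample(1000)); // true  (first request in this second)
console.log(shouldSample(1200)); // false (0.5 is not below 0.05)
console.log(shouldSample(2000)); // true  (reservoir refilled)
```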

bash

# Create a custom sampling rule for /orders*: record the first 5 requests
# per second, then 10% of any additional requests
aws xray create-sampling-rule --cli-input-json '{
  "SamplingRule": {
    "RuleName": "OrdersHighSampling",
    "Priority": 1,
    "ReservoirSize": 5,
    "FixedRate": 0.10,
    "URLPath": "/orders*",
    "ServiceName": "*",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "*",
    "ResourceARN": "*",
    "Version": 1
  }
}'

Enabling X-Ray per Service

Service	How to enable
Lambda	Set `TracingConfig.Mode = Active` in function config (`PassThrough` only follows the sampling decision of an upstream trace header)
API Gateway	Enable X-Ray tracing on the stage settings
EC2	Install and run the X-Ray daemon (`xray -b 127.0.0.1:2000`)
ECS	Add X-Ray daemon as a sidecar container in the task definition
Elastic Beanstalk	Enable in `.ebextensions/xray-daemon.config`
App Mesh	Envoy proxy emits X-Ray segments once tracing is enabled on the proxy (`ENABLE_ENVOY_XRAY_TRACING=1`) and the X-Ray daemon runs alongside it

ECS task definition with the X-Ray daemon as a sidecar container:

json

{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "my-app:latest",
      "environment": [
        { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000" }
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "cpu": 32,
      "memoryReservation": 256
    }
  ]
}

Trace Header

text

X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1

Field	Meaning
`Root`	Trace ID — unique per request, same across all services
`Parent`	Segment ID of the upstream caller
`Sampled=1`	This request is being recorded
`Sampled=0`	This request is NOT being recorded
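
For services that propagate the header by hand, the value is a semicolon-separated list of `key=value` fields. A minimal parsing sketch follows. `parseTraceHeader` is a hypothetical helper (the X-Ray SDK normally handles this for you), and the sample value uses the `1-<8 hex>-<24 hex>` trace ID layout.

```javascript
// Splits an X-Amzn-Trace-Id header value into its fields.
function parseTraceHeader(header) {
  const fields = {};
  for (const part of header.split(';')) {
    const idx = part.indexOf('=');
    if (idx === -1) continue; // skip malformed parts
    fields[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return {
    traceId: fields.Root ?? null,
    parentId: fields.Parent ?? null,
    // Sampled can be absent when no sampling decision has been made yet
    sampled: fields.Sampled === undefined ? null : fields.Sampled === '1',
  };
}

const parsed = parseTraceHeader(
  'Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1'
);
console.log(parsed.traceId);  // 1-5759e988-bd862e3fe1be46a994272793
console.log(parsed.parentId); // 53995c3f42cd8ad8
console.log(parsed.sampled);  // true
```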

Part 5 — AWS CloudTrail

CloudTrail records API activity in your AWS account: who called what, when, from where, and with what result. Management events are logged by default; data events and Insights events must be enabled explicitly.


Event type	Scope	Cost
Management events	Control-plane actions (CreateBucket, DescribeInstances, etc.)	Free (first copy)
Data events	S3 object operations, Lambda invocations, DynamoDB item-level	Charged per event
Insights events	Unusual API activity (automated anomaly detection)	Charged

CloudTrail Log Event Structure

json

{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "AIDA1234567890EXAMPLE",
    "arn": "arn:aws:iam::123456789012:user/alice",
    "accountId": "123456789012",
    "userName": "alice"
  },
  "eventTime": "2024-01-15T14:32:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteObject",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.42",
  "requestParameters": {
    "bucketName": "my-critical-bucket",
    "key": "production/config.json"
  },
  "responseElements": null,
  "errorCode": "AccessDenied",
  "errorMessage": "Access Denied"
}

CloudTrail + EventBridge for Real-Time Alerts

bash

# Match DeleteBucket calls recorded by CloudTrail.
# The rule only matches events; attach an SNS or Lambda target to be alerted.
aws events put-rule \
  --name "S3BucketDeletion" \
  --event-pattern '{
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventSource": ["s3.amazonaws.com"],
      "eventName": ["DeleteBucket"]
    }
  }'
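
An event pattern like the one above matches when every key in the pattern is present in the event and the event's value is one of the listed values, with nested objects checked recursively. The sketch below models only that exact-match behavior. `matchesPattern` is a hypothetical function, and content filters such as prefix or anything-but matching are deliberately left out.

```javascript
// Simplified matcher for exact-value event patterns: arrays list allowed
// values, nested objects are matched recursively.
function matchesPattern(pattern, event) {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event?.[key];
    if (Array.isArray(expected)) return expected.includes(actual);
    if (expected !== null && typeof expected === 'object') {
      return actual !== null && typeof actual === 'object' && matchesPattern(expected, actual);
    }
    return false; // bare scalars are not valid pattern values in this model
  });
}

const pattern = {
  source: ['aws.s3'],
  'detail-type': ['AWS API Call via CloudTrail'],
  detail: { eventSource: ['s3.amazonaws.com'], eventName: ['DeleteBucket'] },
};

const deleteBucketEvent = {
  source: 'aws.s3',
  'detail-type': 'AWS API Call via CloudTrail',
  detail: { eventSource: 's3.amazonaws.com', eventName: 'DeleteBucket', awsRegion: 'us-east-1' },
};

console.log(matchesPattern(pattern, deleteBucketEvent)); // true
console.log(matchesPattern(pattern, {
  ...deleteBucketEvent,
  detail: { ...deleteBucketEvent.detail, eventName: 'CreateBucket' },
})); // false
```

Extra fields in the event (such as `awsRegion` here) do not prevent a match; only the keys named in the pattern are checked.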

Part 6 — CloudWatch Dashboards & Container Insights

Dashboards

javascript

import { CloudWatchClient, PutDashboardCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

await cw.send(new PutDashboardCommand({
  DashboardName: 'ProductionOverview',
  // DashboardBody is a JSON string describing the widgets
  DashboardBody: JSON.stringify({
    widgets: [
      {
        type: 'metric',
        properties: {
          title: 'Lambda Errors',
          region: 'us-east-1',
          metrics: [
            // Per-metric options override the widget-level stat and period
            ['AWS/Lambda', 'Errors', 'FunctionName', 'my-function', { stat: 'Sum', period: 60 }],
          ],
          period: 300,
          stat: 'Sum',
          view: 'timeSeries',
        },
      },
      {
        type: 'alarm',
        properties: {
          title: 'Active Alarms',
          alarms: ['arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate'],
        },
      },
    ],
  }),
}));

Container Insights

Container Insights collects enhanced metrics and logs from ECS, EKS, and Kubernetes:

Service	What it collects
ECS	CPU, memory, network per task/service/cluster
EKS	Pod/node CPU+memory, cluster health
Kubernetes (EC2)	Same as EKS via CloudWatch agent DaemonSet

Enable on ECS cluster:

bash

aws ecs update-cluster-settings \
  --cluster my-cluster \
  --settings name=containerInsights,value=enabled

CloudTrail vs CloudWatch vs X-Ray

	CloudTrail	CloudWatch	X-Ray
Purpose	API audit trail	Metrics + logs + alarms	Distributed request tracing
Answers	Who called what API, when?	Is the system healthy?	Where is the latency/error?
Data type	API call records (JSON)	Numeric time-series + log text	Traces, segments, subsegments
Latency	~15 min to S3	Near real-time	Near real-time
Retention	90 days (console), indefinite (S3)	Configurable (default: never expire)	30 days
Searchable	Athena queries on S3	Logs Insights, Metric Math	Annotations + filter expressions
Trigger alarms	Via EventBridge → CloudWatch	Native	Via CloudWatch on X-Ray metrics
Use in DVA-C02	Security/compliance questions	Performance + operational questions	Latency + distributed debugging

DVA-C02 Quick Reference

Topic	Key Fact
Custom metric API	`PutMetricData`
High-resolution metric granularity	1 second (StorageResolution=1)
EMF	Emit metrics from log lines — no extra API call
Alarm states	OK / ALARM / INSUFFICIENT_DATA
Composite alarm	Combines alarms with AND / OR logic
TreatMissingData options	`breaching`, `notBreaching`, `ignore`, `missing`
M of N alarms	`DatapointsToAlarm` — e.g., 3 of 5 periods must breach
Log Group retention default	Never expire (set a retention policy!)
Subscription filter limit	2 per log group (fan out via Kinesis for more consumers)
Logs Insights P99 function	`pct(@duration, 99)`
X-Ray trace header name	`X-Amzn-Trace-Id`
X-Ray annotations vs metadata	Annotations: indexed, searchable (max 50); Metadata: not indexed
X-Ray default sampling	1 req/sec reservoir + 5% of additional
X-Ray on Lambda	Enable `TracingConfig.Mode = Active`
X-Ray on ECS	Add X-Ray daemon as sidecar container
X-Ray daemon port	UDP 2000 on localhost
X-Ray trace retention	30 days
CloudTrail management events cost	Free for first copy per region
CloudTrail data events	Charged — S3 object ops, Lambda invokes
CloudTrail log delivery to S3	Up to 15 minutes delay
CloudTrail + real-time alerts	Route to EventBridge → SNS/Lambda
Container Insights	Enhanced metrics for ECS, EKS, Kubernetes

Practice Questions (8)

easy

Q1. A production Lambda function is intermittently failing. The developer needs to inspect the function's log output from recent invocations and review its error-count and invocation metrics to diagnose the issue. Which service holds this information?