
CloudWatch, X-Ray & Observability


A comprehensive deep dive into AWS observability — CloudWatch Metrics, Alarms, Logs, Logs Insights, Dashboards, EMF, AWS X-Ray distributed tracing, CloudTrail, and the three pillars of observability for the DVA-C02 exam.


Observability Mental Model

The three pillars of observability are metrics, logs, and traces. Metrics tell you what your system is doing, logs tell you why it broke, and traces show where the bottleneck is.


Part 1 — Amazon CloudWatch Metrics

How Metrics Work

CloudWatch stores metrics as time-series data. Each metric is identified by a namespace, metric name, and zero or more dimensions.

| Concept | Detail |
| --- | --- |
| Namespace | Container for metrics, e.g., AWS/EC2, AWS/Lambda, MyApp/Orders |
| Metric name | e.g., CPUUtilization, Duration, ErrorCount |
| Dimension | Key-value filter, e.g., FunctionName=my-fn, InstanceId=i-123 |
| Resolution | Standard: 60s minimum; High-resolution: 1s (StorageResolution=1) |
| Retention | 3h for 1s, 15 days for 1m, 63 days for 5m, 15 months for 1h |
| Statistics | Average, Sum, Min, Max, SampleCount, pNN.NN (percentile) |
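
The retention tiers above decide the finest granularity still available for a datapoint of a given age. The following is a minimal sketch of that lookup, not an AWS API: the function name `finestPeriodSeconds` is hypothetical, and "15 months" is approximated as 455 days.

```javascript
// Retention tiers from the table above, finest first.
// Assumption: "15 months" is approximated as 455 days.
const HOUR = 3600;
const DAY = 24 * HOUR;
const RETENTION_TIERS = [
  { periodSeconds: 1, maxAgeSeconds: 3 * HOUR }, // high-resolution metrics only
  { periodSeconds: 60, maxAgeSeconds: 15 * DAY },
  { periodSeconds: 300, maxAgeSeconds: 63 * DAY },
  { periodSeconds: 3600, maxAgeSeconds: 455 * DAY },
];

// Returns the smallest period (in seconds) still queryable for a datapoint
// of the given age, or null once the datapoint has expired.
function finestPeriodSeconds(ageSeconds, isHighResolution = false) {
  for (const tier of RETENTION_TIERS) {
    if (tier.periodSeconds < 60 && !isHighResolution) continue;
    if (ageSeconds <= tier.maxAgeSeconds) return tier.periodSeconds;
  }
  return null;
}

console.log(finestPeriodSeconds(2 * HOUR, true)); // 1
console.log(finestPeriodSeconds(2 * HOUR));       // 60
console.log(finestPeriodSeconds(30 * DAY));       // 300
```

In practice this means a query over last month's data cannot use a period finer than 5 minutes, whatever resolution the metric was published at.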

Publishing Custom Metrics

javascript
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({ region: 'us-east-1' });

// Standard resolution (60s granularity).
// Custom metrics are billed per metric per month; they are not free.
await cw.send(new PutMetricDataCommand({
  Namespace: 'MyApp/Orders',
  MetricData: [
    {
      MetricName: 'OrdersProcessed',
      Value: 42,
      Unit: 'Count',
      Dimensions: [
        { Name: 'Environment', Value: 'prod' },
        { Name: 'Region', Value: 'us-east-1' },
      ],
      Timestamp: new Date(),
    },
    {
      MetricName: 'OrderProcessingLatency',
      Value: 234,
      Unit: 'Milliseconds',
      Dimensions: [{ Name: 'Environment', Value: 'prod' }],
    },
  ],
}));

// High-resolution metric (1s granularity).
// Billed at the same per-metric rate as standard resolution (about $0.30/metric/month in the first tier).
await cw.send(new PutMetricDataCommand({
  Namespace: 'MyApp/Payments',
  MetricData: [{
    MetricName: 'PaymentErrors',
    Value: 1,
    Unit: 'Count',
    StorageResolution: 1,     // 1 = high-resolution; 60 = standard
  }],
}));

Embedded Metric Format (EMF)

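EMF lets you emit custom metrics as structured log lines, for example from Lambda. CloudWatch extracts the metrics from the logs asynchronously, so the function makes no PutMetricData calls. The extracted metrics are still billed as custom metrics, and the log lines incur normal log ingestion charges.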

javascript
// Lambda: emit metrics via a structured log line (EMF format)
export async function handler(event) {
  const start = Date.now();

  // ... do work ...

  const duration = Date.now() - start;

  // EMF structured log: CloudWatch automatically extracts the metrics
  console.log(JSON.stringify({
    '_aws': {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: 'MyApp/Lambda',
        // Two dimension sets, each containing a single dimension
        Dimensions: [['FunctionName'], ['Environment']],
        Metrics: [
          { Name: 'ProcessingTime', Unit: 'Milliseconds' },
          { Name: 'ItemsProcessed', Unit: 'Count' },
        ],
      }],
    },
    FunctionName: process.env.AWS_LAMBDA_FUNCTION_NAME,
    Environment: process.env.NODE_ENV,
    ProcessingTime: duration,
    ItemsProcessed: event.Records?.length ?? 1,
  }));
}
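
When several handlers emit EMF, the envelope is easier to keep consistent if it is built in one place. The sketch below is one possible shape, under stated assumptions: `buildEmfLog` is a hypothetical helper (not part of any AWS library), and unlike the handler above it publishes a single dimension set that contains all dimension names together.

```javascript
// Hypothetical helper: builds one EMF log object.
// `dimensions` maps dimension name -> value.
// `metrics` maps metric name -> { value, unit }.
function buildEmfLog(namespace, dimensions, metrics, timestamp = Date.now()) {
  const entry = {
    _aws: {
      Timestamp: timestamp, // epoch milliseconds
      CloudWatchMetrics: [{
        Namespace: namespace,
        // One dimension set containing all dimension names together
        Dimensions: [Object.keys(dimensions)],
        Metrics: Object.entries(metrics).map(([Name, m]) => ({ Name, Unit: m.unit })),
      }],
    },
  };
  // Dimension and metric values live at the top level, keyed by name
  for (const [name, value] of Object.entries(dimensions)) entry[name] = String(value);
  for (const [name, m] of Object.entries(metrics)) entry[name] = m.value;
  return entry;
}

const line = buildEmfLog(
  'MyApp/Lambda',
  { FunctionName: 'my-fn', Environment: 'prod' },
  { ProcessingTime: { value: 234, unit: 'Milliseconds' } },
  1700000000000,
);
console.log(JSON.stringify(line));
```

Printing the result with `console.log(JSON.stringify(...))` is all a Lambda function needs to do; each distinct combination of dimension values becomes its own metric.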

Metric Math

Combine metrics with mathematical expressions:

bash
# Error rate = errors / invocations * 100
# In CloudWatch console or via API:
# EXPRESSION: (m1/m2)*100
# m1 = Errors metric
# m2 = Invocations metric

Useful functions: SUM(), AVG(), MAX(), MIN(), RATE(), DIFF(), FILL(), SEARCH(), IF()
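
Via the API, the expression above is sent as one entry in the `MetricDataQueries` array of a GetMetricData request, next to the two metrics it refers to. The following is a sketch of that request shape only; the helper names and the ids `m1`, `m2`, `e1` are illustrative, and no AWS call is made.

```javascript
// Builds the MetricDataQueries array for a GetMetricData request.
// Function names and query ids are illustrative choices.
function errorRateQueries(functionName, periodSeconds = 60) {
  const lambdaMetric = (id, metricName) => ({
    Id: id,
    MetricStat: {
      Metric: {
        Namespace: 'AWS/Lambda',
        MetricName: metricName,
        Dimensions: [{ Name: 'FunctionName', Value: functionName }],
      },
      Period: periodSeconds,
      Stat: 'Sum',
    },
    ReturnData: false, // inputs only; do not return the raw series
  });
  return [
    lambdaMetric('m1', 'Errors'),
    lambdaMetric('m2', 'Invocations'),
    { Id: 'e1', Expression: '(m1/m2)*100', Label: 'ErrorRatePercent' },
  ];
}

// Local equivalent of the expression for a single datapoint, guarding
// against periods with no invocations.
function errorRatePercent(errors, invocations) {
  return invocations === 0 ? 0 : (errors / invocations) * 100;
}

console.log(errorRateQueries('my-function').map((q) => q.Id));
console.log(errorRatePercent(25, 100)); // 25
```

Setting `ReturnData: false` on the inputs keeps the response to the computed series only.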


Part 2 — CloudWatch Alarms

Alarm States

| State | Meaning |
| --- | --- |
| OK | Metric is within threshold |
| ALARM | Metric has breached threshold |
| INSUFFICIENT_DATA | Not enough data to evaluate (startup, gaps) |

Creating an Alarm

javascript
import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

await cw.send(new PutMetricAlarmCommand({
  AlarmName: 'HighLambdaErrorRate',
  AlarmDescription: 'More than 10 Lambda errors per minute in 3 of the last 5 minutes',
  Namespace: 'AWS/Lambda',
  MetricName: 'Errors',
  Dimensions: [{ Name: 'FunctionName', Value: 'my-function' }],
  Statistic: 'Sum',
  Period: 60,              // length of each evaluation period, in seconds
  EvaluationPeriods: 5,    // number of most recent periods to evaluate (N)
  DatapointsToAlarm: 3,    // alarm only if 3 of those 5 periods breach (M of N)
  Threshold: 10,
  ComparisonOperator: 'GreaterThanThreshold',
  TreatMissingData: 'notBreaching',  // breaching | notBreaching | ignore | missing
  AlarmActions: [
    'arn:aws:sns:us-east-1:123456789012:ops-alerts',
    'arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:...',
  ],
  OKActions: ['arn:aws:sns:us-east-1:123456789012:ops-alerts'],
}));
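
The interaction between EvaluationPeriods, DatapointsToAlarm, and TreatMissingData is easier to see in a small simulation. The sketch below is a deliberately simplified model, not CloudWatch's actual algorithm: `evaluateAlarm` is a hypothetical function, it only handles a greater-than threshold, and the `ignore` option is left out because it depends on the alarm's previous state.

```javascript
// Simplified M-of-N alarm evaluation.
// `datapoints` is ordered oldest to newest; null marks a missing period.
function evaluateAlarm(datapoints, options) {
  const { threshold, evaluationPeriods, datapointsToAlarm, treatMissingData = 'missing' } = options;
  const recent = datapoints.slice(-evaluationPeriods);
  let breaching = 0;
  let present = 0;
  for (const value of recent) {
    if (value === null || value === undefined) {
      // Missing period: counts as a breach only when configured that way
      if (treatMissingData === 'breaching') breaching += 1;
      continue;
    }
    present += 1;
    if (value > threshold) breaching += 1;
  }
  if (breaching >= datapointsToAlarm) return 'ALARM';
  if (present === 0 && treatMissingData === 'missing') return 'INSUFFICIENT_DATA';
  return 'OK';
}

const settings = { threshold: 10, evaluationPeriods: 5, datapointsToAlarm: 3 };
console.log(evaluateAlarm([12, 3, 15, 11, 2], settings)); // ALARM (3 of 5 breach)
console.log(evaluateAlarm([12, 3, 1, 11, 2], settings));  // OK (only 2 of 5 breach)
```

With `notBreaching`, a function that receives no traffic stays in OK rather than flipping to INSUFFICIENT_DATA, which is usually what you want for an error-count alarm.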

Alarm Actions

| Action type | Use case |
| --- | --- |
| SNS notification | Email, Slack webhook, PagerDuty, on-call |
| Auto Scaling policy | Scale in/out EC2 fleet |
| EC2 action | Reboot, stop, terminate, recover instance |
| Systems Manager OpsItem | Create incident ticket |
| Lambda | Custom remediation automation |

Composite Alarms

Composite alarms combine the states of other alarms with AND, OR, and NOT logic. Paging only when several symptoms occur together reduces alert fatigue:

bash
aws cloudwatch put-composite-alarm \
  --alarm-name "ProductionOutage" \
  --alarm-rule "ALARM(HighErrorRate) AND ALARM(HighLatency)" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:pagerduty

Part 3 — CloudWatch Logs

Log Hierarchy

| Level | Scope | Retention set at |
| --- | --- | --- |
| Log Group | Named container (e.g., /aws/lambda/my-fn) | Log Group |
| Log Stream | Sequence of events from one source instance | Inherited from group |
| Log Event | Single timestamped log line | N/A |

Retention Policy

bash
# Set 30-day retention on a log group (default = never expire = expensive!)
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Valid values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365,
#               400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653

Metric Filters

Extract metrics from log text patterns — no code changes needed:

bash
# Create a metric filter: count ERROR occurrences per minute
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorCount \
  --filter-pattern "[timestamp, requestId, level=ERROR, ...]" \
  --metric-transformations \
    metricName=LambdaErrors,metricNamespace=MyApp/Lambda,metricValue=1,defaultValue=0

# JSON log filter: match specific field values
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name PaymentFailures \
  --filter-pattern '{ $.level = "ERROR" && $.event = "payment_failed" }' \
  --metric-transformations \
    metricName=PaymentFailures,metricNamespace=MyApp/Payments,metricValue=1
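
To make the JSON filter's behaviour concrete, the sketch below reproduces it locally: each log event whose JSON body matches contributes `metricValue=1`, and the values are summed per minute. This is an illustration of the semantics only, not how CloudWatch implements filters; the function names are hypothetical.

```javascript
// Mirrors the pattern { $.level = "ERROR" && $.event = "payment_failed" }.
function matchesPaymentFailure(message) {
  let parsed;
  try {
    parsed = JSON.parse(message);
  } catch {
    return false; // non-JSON lines never match a JSON pattern
  }
  return parsed !== null && typeof parsed === 'object'
    && parsed.level === 'ERROR' && parsed.event === 'payment_failed';
}

// Adds 1 per matching event into one-minute buckets keyed by epoch millis.
function countPerMinute(logEvents) {
  const buckets = new Map();
  for (const { timestamp, message } of logEvents) {
    if (!matchesPaymentFailure(message)) continue;
    const minute = Math.floor(timestamp / 60000) * 60000;
    buckets.set(minute, (buckets.get(minute) ?? 0) + 1);
  }
  return buckets;
}

const sampleEvents = [
  { timestamp: 0, message: '{"level":"ERROR","event":"payment_failed"}' },
  { timestamp: 30000, message: '{"level":"ERROR","event":"payment_failed"}' },
  { timestamp: 45000, message: '{"level":"INFO","event":"payment_ok"}' },
  { timestamp: 61000, message: '{"level":"ERROR","event":"payment_failed"}' },
  { timestamp: 62000, message: 'plain text ERROR line' },
];
console.log(countPerMinute(sampleEvents)); // first minute: 2, second minute: 1
```

Note that the plain-text line containing "ERROR" does not count: a JSON pattern only matches events that parse as JSON.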

Subscription Filters (Real-time Log Streaming)

bash
# Stream logs matching ERROR to a Lambda function
aws logs put-subscription-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorsToLambda \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:lambda:us-east-1:123456789012:function:log-processor

Each log group supports up to two subscription filters. To fan out to more consumers, send the logs to a Kinesis data stream and attach multiple readers to the stream.

CloudWatch Logs Insights

Logs Insights is an interactive query engine for log data:

bash
# Find the 10 slowest Lambda invocations in the last hour
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 10

# Count errors by error type
fields @message
| filter @message like /ERROR/
| parse @message "ERROR * -" as errorType
| stats count(*) as errorCount by errorType
| sort errorCount desc

# P99 latency over time (5-minute buckets)
fields @timestamp, @duration
| filter @type = "REPORT"
| stats pct(@duration, 99) as p99 by bin(5m)

# Find cold starts
filter @message like /Init Duration/
| stats count() as coldStarts, avg(@initDuration) as avgInitMs by bin(1h)

| Supported log types | Auto-discovered fields |
| --- | --- |
| Lambda | @timestamp, @message, @requestId, @duration, @billedDuration, @initDuration, @maxMemoryUsed |
| VPC Flow Logs | srcAddr, dstAddr, srcPort, dstPort, protocol, action |
| CloudTrail | eventName, userIdentity (nested fields such as userIdentity.arn), sourceIPAddress |
| Any JSON | Fields auto-extracted from JSON keys |
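
The Lambda fields used in the queries above come from the REPORT line that Lambda writes at the end of each invocation. The sketch below shows roughly what that extraction looks like, assuming the usual tab-separated REPORT layout; `parseReportLine` is a hypothetical helper and the sample values are made up.

```javascript
// Example REPORT line; the values are made up for illustration.
// Assumption: fields are separated by tab characters.
const reportLine =
  'REPORT RequestId: 8f5a1c2e-0000-4000-8000-123456789abc\t' +
  'Duration: 102.25 ms\tBilled Duration: 103 ms\tMemory Size: 128 MB\t' +
  'Max Memory Used: 58 MB\tInit Duration: 150.32 ms';

function parseReportLine(line) {
  if (!line.startsWith('REPORT')) return null;
  // Anchor each label to a preceding tab so "Duration" does not also
  // match "Billed Duration" or "Init Duration".
  const num = (label, unit) => {
    const m = line.match(new RegExp(`(?:^|\\t)${label}: ([\\d.]+) ${unit}`));
    return m ? Number(m[1]) : undefined;
  };
  return {
    requestId: (line.match(/RequestId: (\S+)/) || [])[1],
    duration: num('Duration', 'ms'),
    billedDuration: num('Billed Duration', 'ms'),
    maxMemoryUsedMb: num('Max Memory Used', 'MB'),
    initDuration: num('Init Duration', 'ms'), // only present on cold starts
  };
}

console.log(parseReportLine(reportLine));
```

Because `Init Duration` only appears on cold starts, filtering on it (as the last query does) is a cheap way to count them.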

Part 4 — AWS X-Ray

How X-Ray Works


Core Concepts

| Concept | Definition |
| --- | --- |
| Trace | The complete end-to-end journey of a single request across all services |
| Segment | One service's contribution to a trace (e.g., Lambda, EC2, API Gateway) |
| Subsegment | Granular unit within a segment: DB call, HTTP call, custom block |
| Trace Header | X-Amzn-Trace-Id HTTP header; carries the trace ID between services |
| Annotations | Indexed key-value pairs (max 50 per trace), searchable in the X-Ray console/API |
| Metadata | Non-indexed key-value pairs; larger payloads, not searchable |
| Service Map | Visual graph of all services and their connections for a time window |
| Groups | Saved filter expressions for segmenting traces |
| Insights | Anomaly detection; automatically surfaces unusual fault/latency spikes |

X-Ray SDK Usage (Node.js)

javascript
import AWSXRay from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import https from 'https';

// Instrument this AWS SDK v3 client; use the returned client for all calls
const ddb = AWSXRay.captureAWSv3Client(new DynamoDBClient({}));

// Patch the https module globally so outbound HTTPS calls are traced
AWSXRay.captureHTTPsGlobal(https);

export async function handler(event) {
  // Create a custom subsegment for a logical operation
  const segment = AWSXRay.getSegment();
  const subsegment = segment.addNewSubsegment('processOrder');

  try {
    // Add searchable annotations (indexed: use for filtering)
    subsegment.addAnnotation('orderId', event.orderId);
    subsegment.addAnnotation('userId', event.userId);
    subsegment.addAnnotation('tier', 'premium');

    // Add metadata (not indexed: use for debugging details)
    subsegment.addMetadata('orderPayload', event);
    subsegment.addMetadata('processingConfig', { retries: 3, timeout: 5000 });

    // processOrder is the application's own function (not shown here);
    // it would use the instrumented `ddb` client
    return await processOrder(event);
  } catch (err) {
    subsegment.addError(err);
    throw err;
  } finally {
    // Always close the subsegment, on success and on failure
    subsegment.close();
  }
}

Sampling Rules

X-Ray does not record every request — it samples to control cost and noise.

| Setting | Default | Meaning |
| --- | --- | --- |
| Reservoir | 1 req/sec per rule | First N requests per second always recorded |
| Fixed rate | 5% | Percentage of requests beyond the reservoir that are sampled |
| Custom rules | Priority-ordered | Match by service name, URL path, HTTP method, etc. |

bash
# Custom sampling rule for /orders*: record up to 5 requests per second,
# then 10% of the remaining requests
aws xray create-sampling-rule --cli-input-json '{
  "SamplingRule": {
    "RuleName": "OrdersHighSampling",
    "Priority": 1,
    "ReservoirSize": 5,
    "FixedRate": 0.10,
    "URLPath": "/orders*",
    "ServiceName": "*",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "*",
    "ResourceARN": "*",
    "Version": 1
  }
}'
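
The reservoir-plus-fixed-rate decision can be sketched in a few lines. This is a simplified, single-process model for intuition only: in X-Ray the reservoir is shared across all instances of a service and coordinated by the X-Ray service, and `createSampler` is a hypothetical name. The random source is injectable so the behaviour can be checked deterministically.

```javascript
// Simplified single-process model of one sampling rule.
function createSampler({ reservoirSize, fixedRate, random = Math.random }) {
  let currentSecond = -1;
  let usedThisSecond = 0;
  return function shouldSample(nowMs) {
    const second = Math.floor(nowMs / 1000);
    if (second !== currentSecond) {
      currentSecond = second; // new second: refill the reservoir
      usedThisSecond = 0;
    }
    if (usedThisSecond < reservoirSize) {
      usedThisSecond += 1;
      return true; // within the reservoir: always recorded
    }
    return random() < fixedRate; // beyond the reservoir: sample a fraction
  };
}

// Same numbers as the rule above: reservoir of 5, then 10%.
const shouldSample = createSampler({ reservoirSize: 5, fixedRate: 0.10 });
let recorded = 0;
for (let i = 0; i < 1000; i += 1) {
  if (shouldSample(1000)) recorded += 1; // 1000 requests in the same second
}
console.log(`recorded ${recorded} of 1000 requests`);
```

With these settings roughly 5 + 10% of 995 requests are recorded in a busy second, while a low-traffic endpoint (5 req/sec or fewer) is traced in full.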

Enabling X-Ray per Service

| Service | How to enable |
| --- | --- |
| Lambda | Set TracingConfig.Mode = Active in the function config (PassThrough only follows the sampling decision of an incoming trace header) |
| API Gateway | Enable X-Ray tracing in the stage settings |
| EC2 | Install and run the X-Ray daemon (xray -b 127.0.0.1:2000) |
| ECS | Add the X-Ray daemon as a sidecar container in the task definition |
| Elastic Beanstalk | Enable in .ebextensions/xray-daemon.config |
| App Mesh | Envoy proxy emits X-Ray segments once X-Ray tracing is enabled for the proxy |

ECS task definition with the X-Ray daemon as a sidecar container:

json
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "my-app:latest",
      "environment": [
        { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000" }
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "cpu": 32,
      "memoryReservation": 256
    }
  ]
}

Trace Header

text
X-Amzn-Trace-Id: Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1

| Field | Meaning |
| --- | --- |
| Root | Trace ID: unique per request, same across all services. Format: version, 8 hex digits of epoch time, 24 hex digits of unique ID |
| Parent | Segment ID of the upstream caller |
| Sampled=1 | This request is being recorded |
| Sampled=0 | This request is NOT being recorded |
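
The X-Ray SDK reads and propagates this header for you, but parsing it by hand is occasionally useful, for example to copy the trace ID into application logs. A minimal sketch follows; `parseTraceHeader` is a hypothetical helper and the header value is only an example.

```javascript
// Splits an X-Amzn-Trace-Id header value into its fields.
function parseTraceHeader(header) {
  const fields = {};
  for (const part of header.split(';')) {
    const idx = part.indexOf('=');
    if (idx === -1) continue; // skip malformed fragments
    fields[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return {
    traceId: fields.Root,
    parentId: fields.Parent,
    // undefined means the caller left the sampling decision open
    sampled: fields.Sampled === undefined ? undefined : fields.Sampled === '1',
  };
}

const parsedHeader = parseTraceHeader(
  'Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1',
);
console.log(parsedHeader.traceId);  // 1-5759e988-bd862e3fe1be46a994272793
console.log(parsedHeader.sampled);  // true
```

Logging `traceId` alongside each log line lets you jump from a Logs Insights result to the matching trace.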

Part 5 — AWS CloudTrail

CloudTrail records API activity in your AWS account: who called what, when, from where, and with what result. Management events are recorded by default; data events must be enabled explicitly on a trail.

| Event type | Scope | Cost |
| --- | --- | --- |
| Management events | Control-plane actions (CreateBucket, DescribeInstances, etc.) | Free (first copy) |
| Data events | S3 object operations, Lambda invocations, DynamoDB item-level | Charged per event |
| Insights events | Unusual API activity (automated anomaly detection) | Charged |

CloudTrail Log Event Structure

json
{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "AIDA1234567890EXAMPLE",
    "arn": "arn:aws:iam::123456789012:user/alice",
    "accountId": "123456789012",
    "userName": "alice"
  },
  "eventTime": "2024-01-15T14:32:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteObject",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.42",
  "requestParameters": {
    "bucketName": "my-critical-bucket",
    "key": "production/config.json"
  },
  "responseElements": null,
  "errorCode": "AccessDenied",
  "errorMessage": "Access Denied"
}
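
When scanning many events, it helps to reduce each record to the "who, what, when, where, result" answer that CloudTrail exists to provide. The sketch below does that for the fields shown above; `summarizeEvent` is a hypothetical helper, and it treats the presence of `errorCode` as the failure signal.

```javascript
// Reduces a CloudTrail record to a one-line audit summary.
function summarizeEvent(e) {
  const who = e.userIdentity?.arn ?? e.userIdentity?.principalId ?? 'unknown principal';
  const outcome = e.errorCode ? `failed (${e.errorCode})` : 'succeeded';
  return `${e.eventTime} ${who} called ${e.eventSource}:${e.eventName} ` +
    `from ${e.sourceIPAddress} in ${e.awsRegion}: ${outcome}`;
}

const sampleRecord = {
  userIdentity: { type: 'IAMUser', arn: 'arn:aws:iam::123456789012:user/alice' },
  eventTime: '2024-01-15T14:32:00Z',
  eventSource: 's3.amazonaws.com',
  eventName: 'DeleteObject',
  awsRegion: 'us-east-1',
  sourceIPAddress: '203.0.113.42',
  errorCode: 'AccessDenied',
};
console.log(summarizeEvent(sampleRecord));
```

For the record above, the summary makes it immediately visible that alice's DeleteObject attempt was denied.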

CloudTrail + EventBridge for Real-Time Alerts

bash
# Alert when anyone calls DeleteBucket
aws events put-rule \
  --name "S3BucketDeletion" \
  --event-pattern '{
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventSource": ["s3.amazonaws.com"],
      "eventName": ["DeleteBucket"]
    }
  }'
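
An event pattern like the one above matches when every field named in the pattern is present in the event with one of the listed values. The sketch below models only that basic case (exact-value arrays and nested objects); EventBridge supports many more operators, and `matchesPattern` is a hypothetical function written for illustration.

```javascript
// Simplified matcher: supports only exact-value arrays and nested objects.
function matchesPattern(pattern, event) {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event?.[key];
    if (Array.isArray(expected)) return expected.includes(actual);
    if (expected !== null && typeof expected === 'object') {
      return actual !== null && typeof actual === 'object' && matchesPattern(expected, actual);
    }
    return false; // in this sketch, pattern leaves must be arrays
  });
}

const deleteBucketPattern = {
  source: ['aws.s3'],
  'detail-type': ['AWS API Call via CloudTrail'],
  detail: { eventSource: ['s3.amazonaws.com'], eventName: ['DeleteBucket'] },
};

const incomingEvent = {
  source: 'aws.s3',
  'detail-type': 'AWS API Call via CloudTrail',
  detail: { eventSource: 's3.amazonaws.com', eventName: 'DeleteBucket', awsRegion: 'us-east-1' },
};

console.log(matchesPattern(deleteBucketPattern, incomingEvent)); // true
```

Extra fields in the event (such as `awsRegion` here) are ignored; only the fields named in the pattern are checked.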

Part 6 — CloudWatch Dashboards & Container Insights

Dashboards

javascript
import { CloudWatchClient, PutDashboardCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

await cw.send(new PutDashboardCommand({
  DashboardName: 'ProductionOverview',
  DashboardBody: JSON.stringify({
    widgets: [
      {
        type: 'metric',
        properties: {
          title: 'Lambda Errors',
          region: 'us-east-1',
          metrics: [
            // [namespace, metric name, dimension name, dimension value]
            ['AWS/Lambda', 'Errors', 'FunctionName', 'my-function'],
          ],
          period: 60,   // one datapoint per minute
          stat: 'Sum',
          view: 'timeSeries',
        },
      },
      {
        type: 'alarm',
        properties: {
          title: 'Active Alarms',
          alarms: ['arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate'],
        },
      },
    ],
  }),
}));

Container Insights

Container Insights collects enhanced metrics and logs from ECS, EKS, and Kubernetes:

| Service | What it collects |
| --- | --- |
| ECS | CPU, memory, network per task/service/cluster |
| EKS | Pod/node CPU+memory, cluster health |
| Kubernetes (EC2) | Same as EKS via CloudWatch agent DaemonSet |

Enable on ECS cluster:

bash
aws ecs update-cluster-settings \
  --cluster my-cluster \
  --settings name=containerInsights,value=enabled

CloudTrail vs CloudWatch vs X-Ray

| | CloudTrail | CloudWatch | X-Ray |
| --- | --- | --- | --- |
| Purpose | API audit trail | Metrics + logs + alarms | Distributed request tracing |
| Answers | Who called what API, when? | Is the system healthy? | Where is the latency/error? |
| Data type | API call records (JSON) | Numeric time-series + log text | Traces, segments, subsegments |
| Latency | ~15 min to S3 | Near real-time | Near real-time |
| Retention | 90 days (console), indefinite (S3) | Configurable (default: never expire) | 30 days |
| Searchable | Athena queries on S3 | Logs Insights, Metric Math | Annotations + filter expressions |
| Trigger alarms | Via EventBridge → CloudWatch | Native | Via CloudWatch on X-Ray metrics |
| Use in DVA-C02 | Security/compliance questions | Performance + operational questions | Latency + distributed debugging |

DVA-C02 Quick Reference

| Topic | Key Fact |
| --- | --- |
| Custom metric API | PutMetricData |
| High-resolution metric granularity | 1 second (StorageResolution=1) |
| EMF | Emit metrics from log lines, with no PutMetricData call |
| Alarm states | OK / ALARM / INSUFFICIENT_DATA |
| Composite alarm | Combines alarms with AND / OR / NOT logic |
| TreatMissingData options | breaching, notBreaching, ignore, missing |
| M of N alarms | DatapointsToAlarm, e.g., 3 of 5 periods must breach |
| Log Group retention default | Never expire (set a retention policy!) |
| Subscription filter limit | 2 per log group (fan out through Kinesis for more consumers) |
| Logs Insights P99 function | pct(@duration, 99) |
| X-Ray trace header name | X-Amzn-Trace-Id |
| X-Ray annotations vs metadata | Annotations: indexed, searchable (max 50); Metadata: not indexed |
| X-Ray default sampling | 1 req/sec reservoir + 5% of additional |
| X-Ray on Lambda | Enable TracingConfig.Mode = Active |
| X-Ray on ECS | Add X-Ray daemon as sidecar container |
| X-Ray daemon port | UDP 2000 on localhost |
| X-Ray trace retention | 30 days |
| CloudTrail management events cost | Free for first copy per region |
| CloudTrail data events | Charged: S3 object ops, Lambda invokes |
| CloudTrail log delivery to S3 | Up to 15 minutes delay |
| CloudTrail + real-time alerts | Route to EventBridge → SNS/Lambda |
| Container Insights | Enhanced metrics for ECS, EKS, Kubernetes |

Practice Questions (5)

Q1 (easy). A developer notices that a Lambda function's error rate is increasing but the default CloudWatch metrics do not show which specific code path is failing. Which AWS service provides request-level tracing to visualize the full call graph including downstream API calls?

Q2 (medium). A developer creates a custom CloudWatch metric to track the number of failed payment transactions per minute. The application should trigger an auto-scaling action when the metric exceeds 100 failures in 5 minutes. What should the developer create?

Q3 (medium). A developer uses X-Ray to trace a Lambda → DynamoDB call. The DynamoDB calls do not appear as subsegments in the trace. What is required to capture downstream DynamoDB calls?

Q4 (medium). A Lambda function logs thousands of lines per invocation. The developer wants to extract the count of "ERROR" occurrences per minute and display it on a CloudWatch dashboard. What is the most efficient approach?

Q5 (medium). A developer deploys a new Lambda version and immediately receives CloudWatch alarms on error rate. They want to instantly revert to the previous version with zero downtime. Which Lambda feature should the developer use?