CloudWatch, X-Ray & Observability
A comprehensive deep dive into AWS observability — CloudWatch Metrics, Alarms, Logs, Logs Insights, Dashboards, EMF, AWS X-Ray distributed tracing, CloudTrail, and the three pillars of observability for the DVA-C02 exam.
Observability Mental Model
The three pillars of observability are metrics, logs, and traces: metrics tell you what your system is doing, logs tell you why it broke, and traces show where the bottleneck is.
Part 1 — Amazon CloudWatch Metrics
How Metrics Work
CloudWatch stores metrics as time-series data. Each metric is identified by a namespace, metric name, and zero or more dimensions.
| Concept | Detail |
|---|---|
| Namespace | Container for metrics — e.g., AWS/EC2, AWS/Lambda, MyApp/Orders |
| Metric name | e.g., CPUUtilization, Duration, ErrorCount |
| Dimension | Key-value filter — e.g., FunctionName=my-fn, InstanceId=i-123 |
| Resolution | Standard: 60s minimum; High-resolution: 1s (StorageResolution=1) |
| Retention | 3h for 1s, 15 days for 1m, 63 days for 5m, 15 months for 1h |
| Statistics | Average, Sum, Min, Max, SampleCount, pNN.NN (percentile) |
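Because a metric is identified by its namespace, name, and dimensions together, each distinct set of dimensions is a separate metric. The sketch below illustrates this with a hypothetical `metricKey` helper (not part of any AWS SDK) that builds a comparable identity string.

```javascript
// Hypothetical helper: builds a stable string key for a metric identity.
// Dimension order does not matter, so dimension names are sorted first.
function metricKey(namespace, metricName, dimensions = {}) {
  const dims = Object.keys(dimensions)
    .sort()
    .map((name) => `${name}=${dimensions[name]}`)
    .join(',');
  return `${namespace}|${metricName}|${dims}`;
}

const a = metricKey('MyApp/Orders', 'OrdersProcessed', { Environment: 'prod' });
const b = metricKey('MyApp/Orders', 'OrdersProcessed', { Environment: 'prod', Region: 'us-east-1' });
const c = metricKey('MyApp/Orders', 'OrdersProcessed', { Region: 'us-east-1', Environment: 'prod' });

console.log(a === b); // false: an extra dimension makes a different metric
console.log(b === c); // true: same dimension set, different order
```

In practice this means a query for `Environment=prod` alone will not return data that was published with both `Environment` and `Region`.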
Publishing Custom Metrics
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({ region: 'us-east-1' });

// Standard resolution (60s granularity). Custom metrics are billed per metric
// at either resolution; only the free-tier allowance is free.
await cw.send(new PutMetricDataCommand({
  Namespace: 'MyApp/Orders',
  MetricData: [
    {
      MetricName: 'OrdersProcessed',
      Value: 42,
      Unit: 'Count',
      Dimensions: [
        { Name: 'Environment', Value: 'prod' },
        { Name: 'Region', Value: 'us-east-1' },
      ],
      Timestamp: new Date(),
    },
    {
      MetricName: 'OrderProcessingLatency',
      Value: 234,
      Unit: 'Milliseconds',
      Dimensions: [{ Name: 'Environment', Value: 'prod' }],
    },
  ],
}));

// High-resolution metric (1s granularity)
await cw.send(new PutMetricDataCommand({
  Namespace: 'MyApp/Payments',
  MetricData: [{
    MetricName: 'PaymentErrors',
    Value: 1,
    Unit: 'Count',
    StorageResolution: 1, // 1 = high-resolution; 60 = standard
  }],
}));

Embedded Metric Format (EMF)
EMF lets you emit custom metrics directly from Lambda log lines, with no PutMetricData calls. The extracted metrics are still billed as custom metrics, and the log lines incur normal log ingestion charges.
// Lambda: emit metrics via a structured log line (EMF format)
export async function handler(event) {
  const start = Date.now();

  // ... do work ...

  const duration = Date.now() - start;

  // EMF structured log: CloudWatch automatically extracts the metrics
  console.log(JSON.stringify({
    '_aws': {
      Timestamp: Date.now(),
      CloudWatchMetrics: [{
        Namespace: 'MyApp/Lambda',
        // Two dimension sets: each metric is published once per FunctionName
        // and once per Environment
        Dimensions: [['FunctionName'], ['Environment']],
        Metrics: [
          { Name: 'ProcessingTime', Unit: 'Milliseconds' },
          { Name: 'ItemsProcessed', Unit: 'Count' },
        ],
      }],
    },
    FunctionName: process.env.AWS_LAMBDA_FUNCTION_NAME,
    // Fall back to a placeholder so the dimension key is never dropped from the JSON
    Environment: process.env.NODE_ENV ?? 'unknown',
    ProcessingTime: duration,
    ItemsProcessed: event.Records?.length ?? 1,
  }));
}

Metric Math
Combine metrics with mathematical expressions:
# Error rate = errors / invocations * 100
# In CloudWatch console or via API:
# EXPRESSION: (m1/m2)*100
# m1 = Errors metric
# m2 = Invocations metric

Useful functions: SUM(), AVG(), MAX(), MIN(), RATE(), DIFF(), FILL(), SEARCH(), IF()
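To make the error-rate expression concrete, here is a minimal sketch that applies the same arithmetic to two aligned arrays of datapoints. The function name `errorRatePercent` and the choice to return `null` for periods with no invocations are assumptions made for illustration; they are not a description of how CloudWatch itself handles empty periods.

```javascript
// Applies (m1/m2)*100 to two aligned series.
// errors[i] and invocations[i] are assumed to cover the same period.
function errorRatePercent(errors, invocations) {
  return errors.map((e, i) => {
    const n = invocations[i];
    // No invocations in the period: return null instead of dividing by zero.
    // Multiplying before dividing keeps integer inputs exact.
    return n > 0 ? (e * 100) / n : null;
  });
}

const rates = errorRatePercent([2, 0, 5], [100, 0, 50]);
console.log(rates); // [ 2, null, 10 ]
```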
Part 2 — CloudWatch Alarms
Alarm States
| State | Meaning |
|---|---|
| OK | Metric is within threshold |
| ALARM | Metric has breached threshold |
| INSUFFICIENT_DATA | Not enough data to evaluate (startup, gaps) |
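The three states can be pictured as the result of checking a window of recent datapoints against a threshold. The sketch below is a deliberately simplified model, assuming a `GreaterThanThreshold` comparison and an "M of N" rule; the function `evaluateAlarm` is hypothetical, and the real CloudWatch evaluation has more nuance. The `ignore` mode for missing data is left out because it depends on the alarm's previous state.

```javascript
// Simplified M-of-N alarm evaluation (illustration only).
// datapoints: the last N values, with null standing for a missing datapoint.
function evaluateAlarm(datapoints, { threshold, datapointsToAlarm, treatMissingData = 'missing' }) {
  let breaching = 0;
  let present = 0;
  for (const value of datapoints) {
    if (value === null) {
      // Only 'breaching' counts a missing datapoint toward the alarm.
      if (treatMissingData === 'breaching') breaching += 1;
      continue;
    }
    present += 1;
    if (value > threshold) breaching += 1; // GreaterThanThreshold
  }
  if (breaching >= datapointsToAlarm) return 'ALARM';
  if (present === 0 && treatMissingData === 'missing') return 'INSUFFICIENT_DATA';
  return 'OK';
}

const settings = { threshold: 10, datapointsToAlarm: 3, treatMissingData: 'notBreaching' };
console.log(evaluateAlarm([12, 3, 15, null, 11], settings)); // ALARM (3 of 5 breach)
console.log(evaluateAlarm([12, 3, 4, null, 11], settings));  // OK (only 2 of 5 breach)
```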
Creating an Alarm
import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

await cw.send(new PutMetricAlarmCommand({
  AlarmName: 'HighLambdaErrorRate',
  AlarmDescription: 'More than 10 Lambda errors per minute in 3 of the last 5 minutes',
  Namespace: 'AWS/Lambda',
  MetricName: 'Errors',
  Dimensions: [{ Name: 'FunctionName', Value: 'my-function' }],
  Statistic: 'Sum',
  Period: 60,               // length of each evaluation period in seconds
  EvaluationPeriods: 5,     // number of most recent periods to evaluate (N)
  DatapointsToAlarm: 3,     // alarm only if 3 of those 5 periods breach (M of N)
  Threshold: 10,
  ComparisonOperator: 'GreaterThanThreshold',
  TreatMissingData: 'notBreaching', // breaching | notBreaching | ignore | missing
  AlarmActions: [
    'arn:aws:sns:us-east-1:123456789012:ops-alerts',
    'arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:...',
  ],
  OKActions: ['arn:aws:sns:us-east-1:123456789012:ops-alerts'],
}));

Alarm Actions
| Action type | Use case |
|---|---|
| SNS notification | Email, Slack webhook, PagerDuty, on-call |
| Auto Scaling policy | Scale in/out EC2 fleet |
| EC2 action | Reboot, stop, terminate, recover instance |
| Systems Manager OpsItem | Create incident ticket |
| Lambda | Custom remediation automation |
Composite Alarms
Composite alarms combine the states of other alarms with AND, OR, and NOT logic, which reduces alert fatigue:
aws cloudwatch put-composite-alarm \
  --alarm-name "ProductionOutage" \
  --alarm-rule "ALARM(HighErrorRate) AND ALARM(HighLatency)" \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:pagerduty

Part 3 — CloudWatch Logs
Log Hierarchy
| Level | Scope | Retention set at |
|---|---|---|
| Log Group | Named container (e.g., /aws/lambda/my-fn) | Log Group |
| Log Stream | Sequence of events from one source instance | Inherited from group |
| Log Event | Single timestamped log line | N/A |
Retention Policy
# Set 30-day retention on a log group (default = never expire = expensive!)
aws logs put-retention-policy --log-group-name /aws/lambda/my-function --retention-in-days 30

# Valid values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365,
# 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, 3653

Metric Filters
Extract metrics from log text patterns — no code changes needed:
# Create a metric filter: count ERROR occurrences per minute
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorCount \
  --filter-pattern "[timestamp, requestId, level=ERROR, ...]" \
  --metric-transformations metricName=LambdaErrors,metricNamespace=MyApp/Lambda,metricValue=1,defaultValue=0

# JSON log filter: match specific field values
aws logs put-metric-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name PaymentFailures \
  --filter-pattern '{ $.level = "ERROR" && $.event = "payment_failed" }' \
  --metric-transformations metricName=PaymentFailures,metricNamespace=MyApp/Payments,metricValue=1

Subscription Filters (Real-time Log Streaming)
# Stream logs matching ERROR to a Lambda function
aws logs put-subscription-filter \
  --log-group-name /aws/lambda/my-function \
  --filter-name ErrorsToLambda \
  --filter-pattern "ERROR" \
  --destination-arn arn:aws:lambda:us-east-1:123456789012:function:log-processor

Each log group supports up to two subscription filters. To fan out further, send the logs to a Kinesis data stream and attach multiple consumers.
CloudWatch Logs Insights
Logs Insights is an interactive query engine for log data:
# Find the 10 slowest Lambda invocations in the last hour
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 10

# Count errors by error type
fields @message
| filter @message like /ERROR/
| parse @message "ERROR * -" as errorType
| stats count(*) as errorCount by errorType
| sort errorCount desc

# P99 latency over time (5-minute buckets)
fields @timestamp, @duration
| filter @type = "REPORT"
| stats pct(@duration, 99) as p99 by bin(5m)

# Find cold starts
filter @message like /Init Duration/
| stats count() as coldStarts, avg(@initDuration) as avgInitMs by bin(1h)

| Supported log types | Auto-parsed fields |
|---|---|
| Lambda | @timestamp, @message, @requestId, @duration, @billedDuration, @initDuration, @maxMemoryUsed |
| VPC Flow Logs | srcAddr, dstAddr, srcPort, dstPort, protocol, action |
| CloudTrail | eventName, userIdentity (nested fields), sourceIPAddress |
| Any JSON | Fields auto-extracted from JSON keys |
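To show what a query such as `stats pct(@duration, 99) by bin(5m)` computes, the sketch below reproduces the idea in plain JavaScript: group events into fixed-width time buckets, then take a percentile per bucket. The nearest-rank percentile method and the function names are assumptions for illustration; Logs Insights may compute percentiles differently.

```javascript
// Nearest-rank percentile (an assumption; not necessarily what Logs Insights uses).
function percentile(values, p) {
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Groups events into fixed-width time buckets, similar to bin(5m),
// then computes the 99th percentile of duration within each bucket.
function p99ByBin(events, binMs = 5 * 60 * 1000) {
  const bins = new Map();
  for (const { timestamp, duration } of events) {
    const start = Math.floor(timestamp / binMs) * binMs;
    if (!bins.has(start)) bins.set(start, []);
    bins.get(start).push(duration);
  }
  return [...bins.entries()]
    .sort(([x], [y]) => x - y)
    .map(([binStart, durations]) => ({ binStart, p99: percentile(durations, 99) }));
}

// Timestamps are milliseconds; the first two events share a 5-minute bucket.
const sample = [
  { timestamp: 0, duration: 120 },
  { timestamp: 60000, duration: 900 },
  { timestamp: 310000, duration: 45 },
];
const buckets = p99ByBin(sample);
console.log(buckets);
```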
Part 4 — AWS X-Ray
How X-Ray Works
Core Concepts
| Concept | Definition |
|---|---|
| Trace | The complete end-to-end journey of a single request across all services |
| Segment | One service's contribution to a trace (e.g., Lambda, EC2, API Gateway) |
| Subsegment | Granular unit within a segment — DB call, HTTP call, custom block |
| Trace Header | X-Amzn-Trace-Id HTTP header — carries the trace ID between services |
| Annotations | Indexed key-value pairs (max 50) — searchable in X-Ray console/API |
| Metadata | Non-indexed key-value pairs — larger payloads, not searchable |
| Service Map | Visual graph of all services and their connections for a time window |
| Groups | Saved filter expressions for segmenting traces |
| Insights | Anomaly detection — automatically surfaces unusual fault/latency spikes |
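The difference between annotations and metadata often comes down to the shape of the value: annotations hold simple values (strings, numbers, booleans) under alphanumeric or underscore keys, while metadata can hold arbitrary objects. The sketch below uses a hypothetical `splitFields` helper to sort a bag of fields accordingly. Treating every simple value as an annotation is an assumption made for brevity; in practice you would annotate only the fields you intend to filter on.

```javascript
// Hypothetical helper: routes fields to annotations or metadata by value type.
function splitFields(fields) {
  const annotations = {};
  const metadata = {};
  for (const [key, value] of Object.entries(fields)) {
    const isSimple = ['string', 'number', 'boolean'].includes(typeof value);
    const keyOk = /^[A-Za-z0-9_]+$/.test(key); // letters, digits, underscore only
    if (isSimple && keyOk) annotations[key] = value;
    else metadata[key] = value;
  }
  return { annotations, metadata };
}

const split = splitFields({
  orderId: 'o-123',
  tier: 'premium',
  orderPayload: { items: 3 },
  'bad-key': 'x',
});
console.log(split.annotations); // { orderId: 'o-123', tier: 'premium' }
console.log(Object.keys(split.metadata)); // [ 'orderPayload', 'bad-key' ]
```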
X-Ray SDK Usage (Node.js)
import AWSXRay from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import https from 'https';

// Instrument this AWS SDK v3 client. Use the returned client for all
// DynamoDB calls so they appear as subsegments.
const dynamo = AWSXRay.captureAWSv3Client(new DynamoDBClient({}));

// Instrument outbound calls made through the https module
AWSXRay.captureHTTPs(https);

export async function handler(event) {
  // Create a custom subsegment for a logical operation
  const segment = AWSXRay.getSegment();
  const subsegment = segment.addNewSubsegment('processOrder');

  try {
    // Add searchable annotations (indexed: use for filtering)
    subsegment.addAnnotation('orderId', event.orderId);
    subsegment.addAnnotation('userId', event.userId);
    subsegment.addAnnotation('tier', 'premium');

    // Add metadata (not indexed: use for debugging details)
    subsegment.addMetadata('orderPayload', event);
    subsegment.addMetadata('processingConfig', { retries: 3, timeout: 5000 });

    // processOrder is the application's own function (not shown here)
    const result = await processOrder(event);

    subsegment.close();
    return result;
  } catch (err) {
    subsegment.addError(err);
    subsegment.close();
    throw err;
  }
}

Sampling Rules
X-Ray does not record every request — it samples to control cost and noise.
| Setting | Default | Meaning |
|---|---|---|
| Reservoir | 1 req/sec per rule | First N requests per second always recorded |
| Fixed rate | 5% | % of requests beyond reservoir that are sampled |
| Custom rules | Priority-ordered | Match by service name, URL path, HTTP method, etc. |
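The reservoir and fixed-rate settings can be read as a two-step decision: take the first few requests each second unconditionally, then sample the remainder at the fixed rate. The sketch below models that for a single rule in a single process. The `createSampler` function is hypothetical, and it ignores how X-Ray coordinates the reservoir across multiple instances.

```javascript
// Simplified sampler for one rule (illustration only).
// random is injectable so the behaviour can be made deterministic.
function createSampler({ reservoirSize = 1, fixedRate = 0.05, random = Math.random } = {}) {
  let currentSecond = null;
  let usedThisSecond = 0;
  return function shouldSample(nowMs) {
    const second = Math.floor(nowMs / 1000);
    if (second !== currentSecond) {
      // A new second starts with a fresh reservoir.
      currentSecond = second;
      usedThisSecond = 0;
    }
    if (usedThisSecond < reservoirSize) {
      usedThisSecond += 1;
      return true; // reservoir: always recorded
    }
    return random() < fixedRate; // beyond the reservoir: sampled at the fixed rate
  };
}

const shouldSample = createSampler({ reservoirSize: 1, fixedRate: 0.05, random: () => 0.5 });
console.log(shouldSample(0));    // true: first request of the second
console.log(shouldSample(10));   // false: reservoir used, 0.5 is not below 0.05
console.log(shouldSample(1000)); // true: next second, reservoir refilled
```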
# Create a custom sampling rule: for /orders*, record up to 5 requests per second plus 10% of the rest
aws xray create-sampling-rule --cli-input-json '{
  "SamplingRule": {
    "RuleName": "OrdersHighSampling",
    "Priority": 1,
    "ReservoirSize": 5,
    "FixedRate": 0.10,
    "URLPath": "/orders*",
    "ServiceName": "*",
    "ServiceType": "*",
    "Host": "*",
    "HTTPMethod": "*",
    "ResourceARN": "*",
    "Version": 1
  }
}'

Enabling X-Ray per Service
| Service | How to enable |
|---|---|
| Lambda | Set TracingConfig.Mode = Active (or PassThrough) in function config |
| API Gateway | Enable X-Ray tracing on the stage settings |
| EC2 | Install and run the X-Ray daemon (xray -b 127.0.0.1:2000) |
| ECS | Add X-Ray daemon as a sidecar container in the task definition |
| Elastic Beanstalk | Enable in .ebextensions/xray-daemon.config |
| App Mesh | Enable X-Ray tracing on the Envoy proxy (ENABLE_ENVOY_XRAY_TRACING=1) and run the X-Ray daemon alongside it |
ECS task definition with the X-Ray daemon as a sidecar:

{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "my-app:latest",
      "environment": [
        { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000" }
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [{ "containerPort": 2000, "protocol": "udp" }],
      "cpu": 32,
      "memoryReservation": 256
    }
  ]
}

Trace Header
X-Amzn-Trace-Id: Root=1-5e272ff5-1234abcd5678ef012345;Parent=53995c3f42cd8ad8;Sampled=1

| Field | Meaning |
|---|---|
| Root | Trace ID, unique per request and the same across all services |
| Parent | Segment ID of the upstream caller |
| Sampled=1 | This request is being recorded |
| Sampled=0 | This request is NOT being recorded |
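The header is a semicolon-separated list of `key=value` pairs, so it can be read with a few lines of string handling. The sketch below is a minimal parser under that assumption; `parseTraceHeader` is a hypothetical helper, and in practice the X-Ray SDK reads and propagates this header for you.

```javascript
// Minimal parser for the X-Amzn-Trace-Id header value (illustration only).
function parseTraceHeader(header) {
  const fields = {};
  for (const part of header.split(';')) {
    const idx = part.indexOf('=');
    if (idx === -1) continue; // skip anything that is not key=value
    fields[part.slice(0, idx).trim()] = part.slice(idx + 1).trim();
  }
  return {
    root: fields.Root ?? null,
    parent: fields.Parent ?? null,
    // true/false when the sampling decision is present, null when it is absent
    sampled: fields.Sampled === '1' ? true : fields.Sampled === '0' ? false : null,
  };
}

const parsed = parseTraceHeader(
  'Root=1-5e272ff5-1234abcd5678ef012345;Parent=53995c3f42cd8ad8;Sampled=1'
);
console.log(parsed.root);    // 1-5e272ff5-1234abcd5678ef012345
console.log(parsed.parent);  // 53995c3f42cd8ad8
console.log(parsed.sampled); // true
```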
Part 5 — AWS CloudTrail
CloudTrail records API activity in your AWS account: who called what, when, from where, and with what result. Management events are recorded by default; data events and Insights events must be enabled explicitly.
| Event type | Scope | Cost |
|---|---|---|
| Management events | Control-plane actions (CreateBucket, DescribeInstances, etc.) | Free (first copy) |
| Data events | S3 object operations, Lambda invocations, DynamoDB item-level | Charged per event |
| Insights events | Unusual API activity (automated anomaly detection) | Charged |
CloudTrail Log Event Structure
{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "AIDA1234567890EXAMPLE",
    "arn": "arn:aws:iam::123456789012:user/alice",
    "accountId": "123456789012",
    "userName": "alice"
  },
  "eventTime": "2024-01-15T14:32:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteObject",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.42",
  "requestParameters": {
    "bucketName": "my-critical-bucket",
    "key": "production/config.json"
  },
  "responseElements": null,
  "errorCode": "AccessDenied",
  "errorMessage": "Access Denied"
}

CloudTrail + EventBridge for Real-Time Alerts
# Alert when anyone calls DeleteBucket
# (the rule only matches events; attach a target such as an SNS topic with `aws events put-targets`)
aws events put-rule --name "S3BucketDeletion" --event-pattern '{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["DeleteBucket"]
  }
}'

Part 6 — CloudWatch Dashboards & Container Insights
Dashboards
import { CloudWatchClient, PutDashboardCommand } from '@aws-sdk/client-cloudwatch';

const cw = new CloudWatchClient({});

await cw.send(new PutDashboardCommand({
  DashboardName: 'ProductionOverview',
  DashboardBody: JSON.stringify({
    widgets: [
      {
        type: 'metric',
        properties: {
          title: 'Lambda Errors', // plots the error count, not a rate
          region: 'us-east-1',
          metrics: [
            ['AWS/Lambda', 'Errors', 'FunctionName', 'my-function', { stat: 'Sum', period: 60 }],
          ],
          period: 300,
          stat: 'Sum',
          view: 'timeSeries',
        },
      },
      {
        type: 'alarm',
        properties: {
          title: 'Active Alarms',
          alarms: ['arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate'],
        },
      },
    ],
  }),
}));

Container Insights
Container Insights collects enhanced metrics and logs from ECS, EKS, and Kubernetes:
| Service | What it collects |
|---|---|
| ECS | CPU, memory, network per task/service/cluster |
| EKS | Pod/node CPU+memory, cluster health |
| Kubernetes (EC2) | Same as EKS via CloudWatch agent DaemonSet |
Enable on ECS cluster:
aws ecs update-cluster-settings --cluster my-cluster --settings name=containerInsights,value=enabled

CloudTrail vs CloudWatch vs X-Ray
| | CloudTrail | CloudWatch | X-Ray |
|---|---|---|---|
| Purpose | API audit trail | Metrics + logs + alarms | Distributed request tracing |
| Answers | Who called what API, when? | Is the system healthy? | Where is the latency/error? |
| Data type | API call records (JSON) | Numeric time-series + log text | Traces, segments, subsegments |
| Latency | ~15 min to S3 | Near real-time | Near real-time |
| Retention | 90 days (console), indefinite (S3) | Logs: configurable (default: never expire); metrics: up to 15 months | 30 days |
| Searchable | Athena queries on S3 | Logs Insights, Metric Math | Annotations + filter expressions |
| Trigger alarms | Via EventBridge → CloudWatch | Native | Via CloudWatch on X-Ray metrics |
| Use in DVA-C02 | Security/compliance questions | Performance + operational questions | Latency + distributed debugging |
DVA-C02 Quick Reference
| Topic | Key Fact |
|---|---|
| Custom metric API | PutMetricData |
| High-resolution metric granularity | 1 second (StorageResolution=1) |
| EMF | Emit metrics from log lines — no extra API call |
| Alarm states | OK / ALARM / INSUFFICIENT_DATA |
| Composite alarm | Combines alarms with AND / OR logic |
| TreatMissingData options | breaching, notBreaching, ignore, missing |
| M of N alarms | DatapointsToAlarm — e.g., 3 of 5 periods must breach |
| Log Group retention default | Never expire (set a retention policy!) |
| Subscription filter limit | 2 per log group (fan out via Kinesis for more consumers) |
| Logs Insights P99 function | pct(@duration, 99) |
| X-Ray trace header name | X-Amzn-Trace-Id |
| X-Ray annotations vs metadata | Annotations: indexed, searchable (max 50); Metadata: not indexed |
| X-Ray default sampling | 1 req/sec reservoir + 5% of additional |
| X-Ray on Lambda | Enable TracingConfig.Mode = Active |
| X-Ray on ECS | Add X-Ray daemon as sidecar container |
| X-Ray daemon port | UDP 2000 on localhost |
| X-Ray trace retention | 30 days |
| CloudTrail management events cost | Free for first copy per region |
| CloudTrail data events | Charged — S3 object ops, Lambda invokes |
| CloudTrail log delivery to S3 | Up to 15 minutes delay |
| CloudTrail + real-time alerts | Route to EventBridge → SNS/Lambda |
| Container Insights | Enhanced metrics for ECS, EKS, Kubernetes |
Practice Questions
Q1. A developer notices that a Lambda function's error rate is increasing but the default CloudWatch metrics do not show which specific code path is failing. Which AWS service provides request-level tracing to visualize the full call graph including downstream API calls?
Q2. A developer creates a custom CloudWatch metric to track the number of failed payment transactions per minute. The application should trigger an auto-scaling action when the metric exceeds 100 failures in 5 minutes. What should the developer create?
Q3. A developer uses X-Ray to trace a Lambda → DynamoDB call. The DynamoDB calls do not appear as subsegments in the trace. What is required to capture downstream DynamoDB calls?
Q4. A Lambda function logs thousands of lines per invocation. The developer wants to extract the count of "ERROR" occurrences per minute and display it on a CloudWatch dashboard. What is the most efficient approach?
Q5. A developer deploys a new Lambda version and immediately receives CloudWatch alarms on error rate. They want to instantly revert to the previous version with zero downtime. Which Lambda feature should the developer use?