Cool Things You Can Do with Metrics on AWS

Public cloud environments are heavily instrumented and can give you metrics at practically any level of the infrastructure. AWS is no exception. Metrics are not only useful for monitoring and troubleshooting issues in a cloud environment; they can also be tied directly to automated actions, so you can use them to remediate issues as they happen.

Introduction

In this article, we describe CloudWatch, the central monitoring system used by AWS services. We also go into detail on the useful metrics you can leverage for Amazon Elastic Compute Cloud (EC2), for EC2 Spot Fleet (a system used to manage Amazon spot instances), for Amazon Elastic Block Store (EBS), and for the popular serverless runtime, AWS Lambda.

What Is CloudWatch Monitoring?

Amazon CloudWatch is a native monitoring tool designed to provide real-time monitoring of Amazon Web Services (AWS) resources and applications. The purpose of this tool is to provide visibility into application performance, operational health, and resource utilization.

Dashboards and metrics

CloudWatch lets you collect and track a wide range of metrics, each providing different information about your components. By default, the CloudWatch homepage displays metrics about all AWS services connected to your account.

It is also possible to create custom dashboards, in which you can view information about specific applications and resources, using custom metrics as needed.

Alarms and notifications

CloudWatch lets you create alarms that track specific metrics. Once an alarm is set up, CloudWatch sends notifications when it is triggered. You can also define thresholds per resource. When a threshold is breached, CloudWatch can automatically trigger actions, such as scaling, to optimize utilization.
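To make the threshold idea concrete, here is a minimal sketch of how an alarm state could be derived from recent data points. It is a simplification for illustration only and not CloudWatch's actual evaluation logic (which also handles missing data); the function name and its parameters are our own assumptions.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                datapoints_to_alarm: int, evaluation_periods: int) -> str:
    """Simplified alarm evaluation (illustrative, not the CloudWatch implementation).

    Looks at the last `evaluation_periods` data points and reports "ALARM"
    when at least `datapoints_to_alarm` of them exceed `threshold`.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical CPU readings: the last three points all exceed the threshold of 80.
print(alarm_state([40, 85, 90, 95], threshold=80,
                  datapoints_to_alarm=2, evaluation_periods=3))  # prints ALARM
```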

CloudWatch Metrics for EC2

Amazon Elastic Compute Cloud (EC2) provides scalable, cloud-based compute resources. EC2 lets you provision on-demand virtual servers, called EC2 instances. There are several types of EC2 instances, each providing different levels of storage, network capacity, memory, and central processing units (CPUs).

To ensure the health of your EC2 instances, you can monitor several metrics, including NetworkIn, NetworkOut, NetworkPacketsIn, NetworkPacketsOut, CPUCreditBalance, and CPUUtilization.

Recommended reading: How to reduce your EC2 costs.

NetworkIn, NetworkOut, NetworkPacketsIn, & NetworkPacketsOut

Here are several metrics that can help you track instance health and application behaviors:

NetworkIn and NetworkOut: these two metrics measure network traffic in bytes.
NetworkPacketsIn and NetworkPacketsOut: here, network traffic is measured as a number of packets.

You can use the above metrics to watch for sharp increases or decreases in the network traffic of your instance and detect anomalous application behavior.
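One simple way to spot a sharp increase or decrease is to compare the latest value with a band around the recent average. The sketch below is our own approximation of that idea, not a CloudWatch feature; the function name and the default band width of three standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, band_width=3.0):
    """Flag `latest` when it falls outside mean +/- band_width * stdev of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    center = mean(history)
    spread = stdev(history)
    return abs(latest - center) > band_width * spread


# Hypothetical NetworkIn samples (bytes per period) followed by a sudden spike.
recent_traffic = [100, 110, 90, 105, 95]
print(is_anomalous(recent_traffic, 500))
print(is_anomalous(recent_traffic, 104))
```

A wider band produces fewer alerts at the cost of missing smaller shifts, so the band width is worth tuning per workload.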

Here are several aspects to consider when creating CloudWatch alarms for the metrics above:

Create one alarm per metric.
By default, evaluate the alarm over two data points of anomalous values.
If you enable detailed monitoring, you can evaluate up to five data points.
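As a rough illustration of the "one alarm per metric" advice, the sketch below builds one alarm definition for each of the four network metrics. The dictionary keys follow the parameter names of the CloudWatch PutMetricAlarm API, but the helper name, the alarm naming scheme, the static thresholds, and the instance ID are hypothetical, and a static threshold is a simplification of anomaly-based alerting.

```python
NETWORK_METRICS = ["NetworkIn", "NetworkOut", "NetworkPacketsIn", "NetworkPacketsOut"]


def build_network_alarms(instance_id, thresholds, datapoints=2):
    """Return one alarm definition per network metric (illustrative values only)."""
    alarms = []
    for metric in NETWORK_METRICS:
        alarms.append({
            "AlarmName": f"{instance_id}-{metric}-high",  # hypothetical naming scheme
            "Namespace": "AWS/EC2",
            "MetricName": metric,
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Sum",
            "Period": 300,  # seconds
            "EvaluationPeriods": datapoints,
            "DatapointsToAlarm": datapoints,
            "Threshold": thresholds[metric],  # assumed static threshold per metric
            "ComparisonOperator": "GreaterThanThreshold",
        })
    return alarms


# Hypothetical instance ID and thresholds.
example_thresholds = {name: 1_000_000 for name in NETWORK_METRICS}
for alarm in build_network_alarms("i-0123456789abcdef0", example_thresholds):
    print(alarm["AlarmName"])
```

If you use boto3, each dictionary could be passed to the CloudWatch client's put_metric_alarm call as keyword arguments; that step is left out here so the sketch runs without AWS credentials.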

CPUCreditBalance

AWS lets you use CPU credits to temporarily burst EC2 instances above their baseline CPU performance. This option is not available for all instance types, but for supported (burstable) instances, CPU credits can be highly useful in meeting spikes in demand.

To rely on bursting, you need to track the CPU credit balance of each instance, and this is what the CPUCreditBalance metric measures. Because credits are spent whenever an instance runs above its baseline, you can also use this metric to get alerts when an instance is using too many CPU resources.

You can create a CloudWatch alarm for CPUCreditBalance by setting a minimum threshold of 25% of the credit balance, using the Average statistic. If the average balance falls below this threshold, the alarm is triggered.
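The 25% rule can be sketched as follows. We assume here that the percentage is taken relative to the maximum credit balance the instance can accrue, which the article does not state explicitly; the function names and the example maximum of 144 credits are illustrative.

```python
def credit_alarm_threshold(max_credit_balance, minimum_fraction=0.25):
    """Credit balance below which the alarm should fire (assumes a known maximum)."""
    return max_credit_balance * minimum_fraction


def credit_balance_low(average_balance, max_credit_balance, minimum_fraction=0.25):
    """True when the Average of CPUCreditBalance is under the minimum threshold."""
    return average_balance < credit_alarm_threshold(max_credit_balance, minimum_fraction)


# Illustrative maximum of 144 credits; check the limits for your instance type.
print(credit_alarm_threshold(144))   # prints 36.0
print(credit_balance_low(20, 144))   # prints True
```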

CPUUtilization

To monitor CPU utilization per instance, you can use the CPUUtilization metric, which tracks the percentage of the total CPU resource used by your instance. Note that this metric often fluctuates, so anomalous CPUUtilization results do not necessarily indicate a problem.

You can still use this metric for correlation, and rely on CPUCreditBalance for detecting critical issues. To create a CloudWatch alarm for CPUUtilization, apply the same approximate anomaly detection approach used for the network metrics above. You can configure its notifications to be sent with low priority.

CloudWatch Metrics for EC2 Spot Fleet

Spot instances use spare EC2 capacity, offered at highly reduced prices, but they can be interrupted at short notice. Amazon provides Spot Fleet, a service that can launch multiple spot instances and on-demand instances automatically, helping you balance capacity across different instance types.

Amazon EC2 provides CloudWatch metrics you can use to monitor your Spot Fleet. Spot Fleet metrics provide valuable insights into how the Spot Fleet bidding process works:

AvailableInstancePoolsCount: the number of instance pools included in the Spot Fleet request. An instance pool is defined by an availability zone (AZ) and instance type.
EligibleInstancePoolCount: the number of instance pools available and matching spot instance requests. Instance pools are considered available when the spot price is lower than the on-demand price, and the user’s bidding price is higher than the spot price.
BidsSubmittedForCapacity: the number of bids submitted for spot instances.
FulfilledCapacity: the total capacity provisioned by the Spot Fleet.
PercentCapacityAllocation: the percentage of capacity allocated to a specific dimension. It can be combined with instance type, for example, to see what percentage of instances are spot instances vs. on-demand.
PendingCapacity: the difference between the target capacity of the Spot Fleet and the current provisioned capacity.
TargetCapacity: the target capacity requested for the current Spot Fleet.
TerminatingCapacity: the capacity in the Spot Fleet that is currently being terminated.

These metrics allow you to see the overall health and performance of a Spot Fleet. You can see how the Spot Fleet is automatically bidding for resources, and how many instances of each type are fulfilled as a result of your request criteria.
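The relationship between the capacity metrics can be expressed in a few lines. This is a minimal sketch with a hypothetical helper name and made-up numbers; it only restates the definition of PendingCapacity given in the list above.

```python
def fleet_summary(target_capacity, fulfilled_capacity):
    """Derive pending capacity and percent fulfilled from two Spot Fleet metrics."""
    pending = target_capacity - fulfilled_capacity
    if target_capacity:
        percent_fulfilled = 100.0 * fulfilled_capacity / target_capacity
    else:
        percent_fulfilled = 0.0  # nothing was requested
    return {"pending": pending, "percent_fulfilled": percent_fulfilled}


# Hypothetical fleet: 20 units requested, 15 provisioned so far.
print(fleet_summary(20, 15))  # prints {'pending': 5, 'percent_fulfilled': 75.0}
```

A pending value that stays above zero for a long time suggests that the request criteria (price, instance types, zones) are too narrow for the available pools.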

EBS Metrics in CloudWatch

Amazon Elastic Block Store (EBS) provides durable block storage for EC2 instances. There are two main types of EBS volumes: solid state drives (SSDs) and hard disk drives (HDDs). EBS volumes are an important tier in many AWS deployments because they provide a reliable long-term storage solution for EC2 instances.

You can use the following EBS metrics to gain visibility over the storage aspects of your EC2 deployments.

VolumeTotalReadTime & VolumeTotalWriteTime

VolumeTotalReadTime and VolumeTotalWriteTime measure the total amount of time spent in read or write operations on the volume. By themselves, they are not particularly valuable but can be used to calculate the disk latency for your volume.

Latency is calculated with the following formula:

(VolumeTotalReadTime + VolumeTotalWriteTime) / (VolumeReadOps + VolumeWriteOps)

To monitor latency, you should create a CloudWatch alarm by using the approximate anomaly detection method on your calculated latency metric, using the Sum statistic for all metrics in the formula.
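The latency formula can be sketched as a small function. We assume the inputs are Sum statistics taken over the same period and that the two time metrics are reported in seconds, so the result is converted to milliseconds; the function name and sample values are hypothetical.

```python
def average_latency_ms(read_time_s, write_time_s, read_ops, write_ops):
    """Average time per I/O operation in milliseconds, or None for an idle volume."""
    total_ops = read_ops + write_ops
    if total_ops == 0:
        return None  # no operations in the period, so latency is undefined
    total_time_s = read_time_s + write_time_s
    return 1000.0 * total_time_s / total_ops


# Hypothetical five-minute sums: 2 seconds of I/O time across 4,000 operations.
print(average_latency_ms(1.5, 0.5, 3000, 1000))  # prints 0.5
```

Returning None for an idle period avoids a division by zero and keeps idle periods from being mistaken for zero latency.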

VolumeQueueLength

Measures the number of pending disk operations. Low, non-zero values are normal; a value of zero means the disk is completely idle. Spikes in this metric can indicate slower disk access, which can impact application performance.

Create a CloudWatch alarm for this metric with a threshold reflecting abnormally high queue length, as this indicates an overloaded disk.

VolumeThroughputPercentage

Measures what proportion of its provisioned IOPS your volume is receiving. It may not be equal to the number of IOPS the volume is currently using (this is the idea behind “provisioned” IOPS). The metric can help you identify if Amazon is meeting its commitment to allocate a certain performance level to your EBS volumes.

AWS states that provisioned throughput should be within 10% of expected values 99.9% of the time. If it occasionally goes outside the expected range, this can affect application performance. Consider creating a low-priority CloudWatch alarm on the Average statistic, to be notified when this metric goes under 90% of its expected value.
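The 90% check on the Average statistic could look like the sketch below. The function name and sample values are our own, and the averaging is done locally here only to illustrate the comparison that a CloudWatch alarm would perform.

```python
def throughput_below_commitment(samples, minimum_percent=90.0):
    """True when the average VolumeThroughputPercentage is under `minimum_percent`."""
    if not samples:
        return False  # no data points, nothing to report
    average = sum(samples) / len(samples)
    return average < minimum_percent


# Hypothetical per-period percentages for one volume.
print(throughput_below_commitment([100.0, 98.0, 70.0, 80.0]))  # prints True
print(throughput_below_commitment([100.0, 99.0, 97.0]))        # prints False
```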

Recommended reading: Getting started with AWS Cloudwatch.

AWS Lambda Metrics

AWS Lambda is an event-driven computing service that allows you to run code without provisioning and managing server resources. You can configure Lambda functions to run in response to an event, or to an API call through Amazon API Gateway.

The following metrics can help you monitor Lambda performance and health, and prevent production issues that commonly affect serverless applications.

Duration

Measures the time in milliseconds it takes a Lambda function to execute. This measures the overall operation (from call to completion) and can be used to measure performance, similar to the latency metric for traditional applications.

Errors

Tracks the number of Lambda runs that ended with errors. If you see an increase in the error rate, check the Lambda logs to investigate. The logs can indicate the exact cause of the problem (out-of-memory exception, timeout, permission error, etc.).
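To turn the raw error count into an error rate, you can divide it by the number of invocations over the same period. The sketch below assumes you also collect Lambda's Invocations metric, which is not covered in this article; the function name and sample numbers are hypothetical.

```python
def error_rate(errors, invocations):
    """Fraction of invocations that ended in an error (0.0 when there were none)."""
    if invocations == 0:
        return 0.0
    return errors / invocations


# Hypothetical sums for one period: 5 errors out of 200 invocations.
rate = error_rate(5, 200)
print(f"{rate:.1%}")  # prints 2.5%
```

Alerting on the rate rather than the count keeps the alarm meaningful when traffic volume changes.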

Throttles

Measures the number of invocation attempts that were rejected because the concurrency limit was exceeded. AWS Lambda has a configurable concurrency limit for each Lambda function, and if your application attempts to invoke too many instances of the same function, the extra invocations are throttled.

Tracking this metric will help you adjust your concurrency limits to best suit your Lambda workloads. For example, a high number of throttles may indicate that the concurrency limit needs to be increased to reduce the number of failed invocations.

ProvisionedConcurrencyUtilization

You define a certain level of provisioned concurrency per AWS Lambda function, and there is a charge for provisioned concurrency. This metric monitors the utilization of provisioned concurrency by your functions. It shows:

Overprovisioning: functions that never reach their concurrency limit, meaning provisioned concurrency should be reduced to conserve costs
Underprovisioning: functions that use all of their provisioned concurrency, meaning concurrency may need to be increased
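The two cases above can be sketched as a simple classification of utilization samples. We assume the metric is reported as a fraction between 0 and 1; the function name and the cut-off values of 0.5 and 0.9 are hypothetical and should be tuned to your workload.

```python
def classify_provisioning(utilization_samples, low=0.5, high=0.9):
    """Classify a function from its peak provisioned concurrency utilization."""
    if not utilization_samples:
        return "no data"
    peak = max(utilization_samples)
    if peak >= high:
        return "underprovisioned"  # regularly at or near the provisioned limit
    if peak < low:
        return "overprovisioned"   # never comes close to the provisioned limit
    return "ok"


# Hypothetical utilization samples for three functions.
print(classify_provisioning([0.10, 0.20, 0.15]))  # prints overprovisioned
print(classify_provisioning([0.60, 0.95, 1.00]))  # prints underprovisioned
print(classify_provisioning([0.55, 0.70, 0.65]))  # prints ok
```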

Recommended reading: How to monitor AWS Lambda.

Conclusion

In this article, we have reviewed numerous metrics provided by CloudWatch for popular Amazon services: Amazon EC2, EC2 Spot Fleet, Amazon EBS, and AWS Lambda. We have also shown how you can use these metrics to identify issues in production systems and act to remediate them, whether manually or automatically.

We hope this will be helpful as you build visibility and automation into your cloud deployments.

Finally, you can use MetricFire’s cloud-hosted solution to visualize your data without any setup hassles. Sign up for your free trial to get started, or contact us for a quick and easy demo!

MetricFire

MetricFire provides a complete infrastructure and application monitoring platform from a suite of open source monitoring tools. Depending on your setup, choose Hosted Prometheus or Graphite and view your metrics on beautiful Grafana dashboards in real-time.