Graphite Dropping Metrics: MetricFire can Help!

Sometimes a seemingly well-configured and fully functional monitoring system can malfunction and lose metrics. As a result, you get a distorted picture of what is happening with the system you are monitoring. In this article, we will look at the possible causes of Graphite dropping metrics and how to avoid them.

Introduction

MetricFire specializes in monitoring systems. You can use our product with minimal configuration to gain in-depth insight into your environments. If you would like to learn more about it, please book a demo with us, or sign up for the free trial today.

Dropping metrics: problems and solutions

In order to effectively manage your monitoring system, you need to understand its principle of operation, what parts it consists of, and how they interact with each other. This knowledge often helps to avoid most problems. In our article Monitoring with Graphite: Architecture and Concepts, we talked about how Graphite works and highlighted important concepts.

Correct installation and configuration of the monitoring system will also help you avoid some issues. Read our article Monitoring with Graphite: Installation and Setup in order not to miss out on anything important.

Use a Graphite architecture that suits your needs

To ingest data efficiently, Graphite caches metrics in memory before writing them to storage. But in an environment with a large number of clients, or one that requires a high frequency of data ingestion, a single server instance may not be enough to support the load.

To solve this problem, you can run one Graphite instance as a load-balancing front end for a set of other Graphite instances. This instance runs a special daemon called carbon-relay, which is responsible for collecting data from clients and dispatching it to the other instances without any additional processing.

In other words, it acts as a load balancer that redirects incoming data to other instances according to the rules predefined in its configuration. Since carbon-relay performs far less processing than carbon-cache, this architecture allows you to easily scale your Graphite environment to support a huge number of metrics.
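One common way for a relay to decide where a metric goes is to hash the metric name, so that the same metric always lands on the same instance. The following is a minimal sketch of that idea, not carbon-relay's actual code: the HashRing class, the replica count, and the destination addresses are all hypothetical names chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring that maps metric names to destinations."""

    def __init__(self, destinations, replicas=100):
        # Each destination gets several points on the ring to even out the load.
        self._ring = sorted(
            (self._position(f"{dest}:{i}"), dest)
            for dest in destinations
            for i in range(replicas)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _position(key):
        # Hash the key to an integer position on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def destination_for(self, metric_name):
        # Pick the first ring point at or after the metric's position, wrapping around.
        index = bisect.bisect_left(self._positions, self._position(metric_name))
        return self._ring[index % len(self._ring)][1]


# Hypothetical back-end instances behind the relay.
ring = HashRing(["cache-a:2004", "cache-b:2004", "cache-c:2004"])
for name in ["web.visitors", "web.latency", "db.connections"]:
    print(name, "->", ring.destination_for(name))
```

Because the mapping depends only on the metric name, every data point for a given metric reaches the same back-end instance, which keeps each metric's history in one place.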

Set up the same metric intervals for Carbon and your metric collector

Let's imagine that you collect your metrics using StatsD. It has a parameter called flush interval. The default flush interval in StatsD is 10 seconds. This means that every ten seconds, StatsD sends its latest data to Graphite.

For a given metric name, the last value obtained during a given time interval will overwrite any previous values obtained during that time interval.

What happens if StatsD sends stats faster than Graphite's highest resolution?

Graphite will start dropping metrics! For example, suppose you are tracking the number of visitors on your website. StatsD's flush interval is set to 10 seconds, and Graphite's storage interval for the metric is set to 1 minute. In this case, StatsD sends the number of visitors every 10 seconds: 15, 10, 17, 20, 37, 25.

Graphite will receive the value 15, then overwrite it with 10, then overwrite that with 17, and so on. In the end, it will store only the value 25 for the 1-minute interval, dropping all the previous values.

To avoid this problem, ensure that your StatsD flush interval is at least as long as the finest resolution at which Graphite stores the metric.
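The overwrite behaviour can be reproduced with a few lines of Python. This is only a sketch of the "last value in an interval wins" rule described above, not Graphite's storage code; the store function and the timestamps are hypothetical.

```python
def store(points, resolution):
    """Keep one value per time bucket; later points overwrite earlier ones."""
    buckets = {}
    for timestamp, value in points:
        bucket_start = timestamp - (timestamp % resolution)
        buckets[bucket_start] = value  # last write wins
    return buckets


# StatsD flushes every 10 seconds: the visitor counts from the example above.
flushes = list(zip(range(0, 60, 10), [15, 10, 17, 20, 37, 25]))

print(store(flushes, resolution=60))  # {0: 25}
print(store(flushes, resolution=10))  # {0: 15, 10: 10, 20: 17, 30: 20, 40: 37, 50: 25}
```

With a 60-second resolution only the last value survives, while a 10-second resolution that matches the flush interval keeps all six values.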

Use whitelist and blacklist

One of the common reasons for Graphite dropping metrics is an overloaded Graphite instance. This can happen if too many metrics are sent to Graphite. But in some cases, you don’t need to monitor all these metrics.

For cases where senders emit useless or invalid metrics, Graphite provides whitelist and blacklist functionality. It allows any of the carbon daemons to accept only metrics that are explicitly whitelisted and/or to reject blacklisted metrics.

This is enabled in carbon.conf with the USE_WHITELIST flag. GRAPHITE_CONF_DIR is searched for whitelist.conf and blacklist.conf. Each file contains one regular expression per line, which is matched against metric names. If the whitelist configuration is missing or empty, all metrics will be passed through by default.
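The filtering logic can be sketched as follows. This is an illustration of the semantics described above rather than Carbon's own implementation; the function names, the example patterns, the skipping of comment lines, and the choice to match on the metric name are assumptions of this sketch.

```python
import re


def compile_patterns(lines):
    """Compile one regular expression per non-empty, non-comment line."""
    stripped = (line.strip() for line in lines)
    return [re.compile(line) for line in stripped
            if line and not line.startswith("#")]


def is_accepted(metric_name, whitelist, blacklist):
    # An empty whitelist lets everything through.
    if whitelist and not any(p.search(metric_name) for p in whitelist):
        return False
    # A blacklist match always rejects the metric.
    return not any(p.search(metric_name) for p in blacklist)


# Hypothetical contents of whitelist.conf and blacklist.conf.
whitelist = compile_patterns([r"^web\.", r"^db\."])
blacklist = compile_patterns([r"\.debug\."])

for name in ["web.visitors", "web.debug.trace", "cron.jobs"]:
    print(name, is_accepted(name, whitelist, blacklist))
```

Here web.visitors is accepted, web.debug.trace is rejected by the blacklist, and cron.jobs is rejected because it matches no whitelist pattern.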

Use non-original implementations of some Graphite parts

Dropping metrics can also be related to high CPU usage.

The original Graphite is written entirely in Python. Because of the global interpreter lock (GIL) in the standard Python interpreter, a single Python process executes Python code on only one CPU core at a time. On a large, multi-core server, you therefore need to run several relay processes and add a load balancer that balances across them.

Some parts of Graphite have been reimplemented in programming languages that do not have this limitation. Examples include carbon-c-relay (written in C), carbon-relay-ng, and go-carbon (both written in Go).

Configure your Graphite parts correctly

Sometimes even powerful hardware will not protect you from Graphite dropping metrics. This can happen when Graphite and its components are configured incorrectly.

The Graphite configuration files contain parameters that impose limits and can help prevent your hardware from being overloaded. Sometimes you need to experiment with these parameters to find the configuration that suits your needs.

Some of the important parameters in carbon.conf are MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE, and MAX_QUEUE_SIZE.

Each time Carbon tries to write data to disk, it doesn't actually immediately go to disk; the kernel puts it in a buffer and it gets written to disk later. The kernel does this for efficiency and it results in very low write latency as long as there is free memory for the kernel to allocate buffers in. The reason this is so important to Carbon is that Carbon depends on I/O latency being very low. If the I/O latency increases significantly, the rate at which Carbon writes data drops dramatically. This causes the cache to grow and Carbon starts dropping metrics. Our task is to set up limitations that will ensure low I/O latency and will not lead to cache overflow.

Don't set MAX_CACHE_SIZE = inf, because in case of a serious I/O latency problem the cache will keep growing until the process runs out of memory and crashes. With a finite limit, we will lose data for some period of time, but the system will keep working.
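A rough simulation shows why a finite limit trades some data loss for stability. This is a simplified model, not Carbon's cache: the simulate_cache function and all the rates and sizes are made-up numbers for illustration.

```python
def simulate_cache(incoming_per_sec, written_per_sec, max_cache_size, seconds):
    """Return (cache_size, dropped) after the given number of seconds."""
    cache_size = 0
    dropped = 0
    for _ in range(seconds):
        cache_size += incoming_per_sec
        # The writer drains as many points as the disk allows this second.
        cache_size -= min(cache_size, written_per_sec)
        if cache_size > max_cache_size:
            # Points that do not fit into the cache are lost.
            dropped += cache_size - max_cache_size
            cache_size = max_cache_size
    return cache_size, dropped


# Healthy disk: writes keep up with incoming data points.
print(simulate_cache(1000, 1000, 8000, 30))          # (0, 0)
# Slow disk with a finite cache: memory stays bounded, some points are dropped.
print(simulate_cache(1000, 200, 8000, 30))           # (8000, 16000)
# Slow disk with an unlimited cache: nothing is dropped, but memory keeps growing.
print(simulate_cache(1000, 200, float("inf"), 30))   # (24000, 0)
```

In the last case nothing is dropped during the 30 simulated seconds, but the cache grows by 800 points every second for as long as the disk stays slow.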

Don’t set MAX_UPDATES_PER_SECOND very high. The idea of this limit is to slow down the rate of write calls in order to avoid I/O cache starvation. Essentially, it lets you strike a balance between the use of Carbon's caching mechanism and the kernel's buffering mechanism.

The first time Carbon receives a new metric, it has to allocate a new Whisper file. This creation process increases I/O latency. When hundreds of new metrics arrive at once, I/O latency increases rapidly.

Setting MAX_CREATES_PER_MINUTE to a high value (such as "inf" for infinity) will cause Graphite to create the files quickly, but at the risk of slowing I/O down considerably for a while. This leads to an increase in the size of the cache.
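The effect of a per-minute creation limit can be illustrated with a small rate limiter. The sketch below uses a fixed-window counter, chosen for brevity; it is not necessarily how Carbon implements the limit, and the CreateLimiter class and the numbers are hypothetical.

```python
class CreateLimiter:
    """Allow at most `max_per_minute` new metric files in each one-minute window."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self.window_start = None
        self.created = 0

    def allow(self, now):
        # Start a fresh window once a minute has passed since the current one began.
        if self.window_start is None or now - self.window_start >= 60:
            self.window_start = now
            self.created = 0
        if self.created < self.max_per_minute:
            self.created += 1
            return True
        return False


limiter = CreateLimiter(max_per_minute=50)
# 120 new metrics show up at once; only 50 files are created in the first minute.
created_now = sum(limiter.allow(now=0) for _ in range(120))
# A minute later, the remaining 70 are retried; 50 more files are created.
created_later = sum(limiter.allow(now=60) for _ in range(70))
print(created_now, created_later)  # 50 50
```

A low limit spreads file creation over several minutes and keeps I/O latency steady, at the cost of new metrics taking longer to appear.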

The MAX_QUEUE_SIZE setting is used by carbon-relay to determine how many data points to queue up when the carbon-cache is not receiving them quickly enough. This does not affect the performance of carbon-cache but might end up being the reason for losing data when carbon-cache is busy for a long time.

Use alerts to minimize data losses

Sometimes it is not possible to fully prevent the loss of metrics, or the loss is not related to the monitoring system itself. Examples include equipment failure, an operating system freeze, and loss of communication.

Since the value of metrics generally lies in their historical analysis, reacting quickly to their absence is important. The sooner we can restore the supply of metrics, the less data we will lose.

In this situation, creating alerts for the loss of metrics will help. If you are using Graphite as the backend and Grafana as the frontend, please read our article Grafana alerting to find out how to create and configure Grafana alert rules.
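The core of a "missing metric" alert is a staleness check: compare the time of each metric's latest data point with the current time. Here is a minimal sketch of that check, independent of any particular alerting tool; the stale_metrics function, the metric names, and the timestamps are hypothetical.

```python
def stale_metrics(last_seen, now, max_age_seconds):
    """Return the names of metrics whose latest data point is older than max_age_seconds."""
    return sorted(
        name for name, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )


# Timestamp (in seconds) of the most recent data point for each metric.
last_seen = {"web.visitors": 1000, "db.connections": 940, "cron.jobs": 700}

print(stale_metrics(last_seen, now=1010, max_age_seconds=60))
# ['cron.jobs', 'db.connections']
```

The threshold should be comfortably larger than the metric's reporting interval, so that a single late data point does not trigger a false alarm.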

About MetricFire

MetricFire provides a wide range of monitoring services through a platform built on the open-source Graphite, Prometheus, and Grafana. MetricFire can run on-premises or in the cloud and offers a whole suite of support options, from monitoring and architecture to design and analytics.

MetricFire is a team of professionals that works hard to ensure effective monitoring of your systems. You won't have to build your own monitoring solution or hire your own team of engineers. MetricFire can develop an effective monitoring strategy that suits your business needs.

MetricFire will provide an individual monitoring architecture for you to help you avoid dropped metrics. MetricFire can also help you set up effective alerts to avoid the loss of your data. With MetricFire’s Hosted Graphite, you do not need to set up and maintain Graphite yourself. MetricFire does everything for you, eliminating the possibility of a Graphite malfunction.

Final thoughts

In this article, we reviewed the possible reasons why Graphite can drop metrics. We also provided some solutions that will help you avoid these issues. Now you understand how the correct configuration and choice of Graphite architecture can affect the performance of your monitoring system.

At MetricFire, we provide a Hosted version of Graphite which includes storing your data for two years, a complete tool, Grafana, for data visualization, and much more. You can also use our product with minimal configuration to gain in-depth insight into your environments. If you would like to learn more about it, please book a demo with us, or sign up for the free trial today.

MetricFire

MetricFire provides a complete infrastructure and application monitoring platform from a suite of open source monitoring tools. Depending on your setup, choose Hosted Prometheus or Graphite and view your metrics on beautiful Grafana dashboards in real-time.