Prometheus Failure Modes

October 16, 2022

Preface

Having a clear understanding of your monitoring pipeline’s failure modes is as important as understanding how your business applications can fail.

Yes, you only annoy your users or lose money directly when your actual business is down. However, flying blind without observability makes it much harder (or even impossible) to fix these outages, so you end up losing more money and annoying your users even more.

Nowadays, most infrastructures have Prometheus running on them, dutifully scraping all of their targets. Depending on the infrastructure, Prometheus can be managed by the Prometheus Operator, deployed via the popular kube-prometheus-stack, or deployed by one of the more traditional/on-prem-like methods.

To add another option, since v2.32.0, Prometheus can be deployed as an Agent.

What does this mean?

This means that we can now run a stripped-down version of Prometheus that focuses on the remote_write capabilities of the original. You cannot run queries against it, and you won’t have alerting capabilities either. What you do get is the same scraping logic and service discovery as before, but with a new TSDB implementation that removes data once it has been successfully delivered, which makes it more lightweight.

It’s easy to focus only on the facts that Agent mode can consume fewer resources and that its architecture is better suited to sending metrics to a remote location via remote_write, but does that mean that you should turn all of your Prometheis into Agents?

Well, it depends. It’s all about trade-offs.

Understanding the limitations of each approach and their trade-offs is one of the most important things in software engineering.

Let’s take a look at various Prometheus failure modes, and see what will be different when running Prometheus with the new Agent mode enabled compared to the traditional server model!

Defining our environment and our goals

Let’s imagine an environment where it makes sense to forward metrics across various clusters. Having a centralized observability cluster that collects all the metrics from workload clusters is a good use case.

Our goal is simple: We want to know if our apps are working properly. That means consuming the metrics ingested by Prometheus instances. We can use the query interface of Prometheus, Grafana, or a Thanos Query.

We define failure as not reaching this goal: being unable to access the metrics of a remote Prometheus from the centralized location.

For the regular use case, let’s have a plain old Prometheus that has a Thanos sidecar running next to it. For the Agent scenario, we will have Prometheus running in Agent mode, without sidecars.

In the former case, the sidecar is responsible for making Prometheus instances queryable by implementing Thanos’ Store API on top of Prometheus’ remote-read API. This makes it possible to run queries against the sidecar-injected instances via Thanos Query. Additionally, the sidecar can optionally push these metrics into a remote object storage, e.g. an S3 bucket, MinIO, or another compatible storage backend. This push happens every 2h. The Prometheus itself is a fully functional Prometheus, meaning you can also use its query and alerting capabilities.

In the case of using the Agent mode, things are a bit different. Here, you are pushing your metrics to a remote storage, e.g. Thanos Receive, or other compatible backends via remote_write. It’s quite similar to running a regular Prometheus and using remote_write. The differences include not being able to query this particular instance by itself and not being able to use recording rules either.

How can Prometheus fail?

Let’s look into some common failure modes that can happen anytime!

Running without persistent storage, the pod crashes

Prometheus w/ Thanos sidecar

Here, you are running a single Prometheus instance that is “Thanos-injected”, and you don’t have persistent storage attached to this Prometheus pod. If your pod gets killed, for example because it reached its memory limit, you can lose up to 2h of metrics in your centralized system (plus the time needed to spin up a new instance and start ingesting again).

Note: customizing this 2h interval is not supported at the moment. A workaround might be creating snapshots via preStop hooks as mentioned here: https://github.com/prometheus-junkyard/tsdb/issues/346.

If you want to be realistic, you should be prepared to lose up to 2h (+ boot up time) of metrics.

Prometheus in Agent mode

In this case, when Prometheus crashes, you will potentially lose only the metrics from after the crash. Agent mode forwards metrics up until the moment of the crash, so you should have approximately all the metrics before that point safely stored at your remote location.

Conclusion: Agent mode can be better, but you should not run Prometheus without persistent storage in the first place if you can avoid it. I only included this scenario because it’s a use-case where Agent can be a better solution.
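To put this first scenario into numbers, here is a rough back-of-the-envelope sketch. It only encodes the reasoning above: without persistent storage, the sidecar setup can lose everything since the last upload (up to 2h) plus the boot time, while the Agent loses roughly the boot time only. The function names and the 5-minute boot time are my own assumptions for illustration, not something Prometheus or Thanos exposes.

```python
from datetime import timedelta

# The sidecar uploads blocks every 2h (value taken from the text above).
UPLOAD_INTERVAL = timedelta(hours=2)


def loss_sidecar_no_pv(boot_time: timedelta,
                       since_last_upload: timedelta = UPLOAD_INTERVAL) -> timedelta:
    """Metrics missing centrally after a crash, sidecar setup, no persistent volume.

    Everything collected since the last upload is gone, plus the time
    during which nothing was scraped while the new pod was starting.
    """
    # The not-yet-uploaded window can never exceed one upload interval.
    since_last_upload = min(since_last_upload, UPLOAD_INTERVAL)
    return since_last_upload + boot_time


def loss_agent_no_pv(boot_time: timedelta) -> timedelta:
    """Metrics missing centrally after a crash, Agent mode, no persistent volume.

    Simplification: assumes everything scraped before the crash was
    already delivered via remote_write, so only the downtime is lost.
    """
    return boot_time


if __name__ == "__main__":
    boot = timedelta(minutes=5)  # hypothetical boot time
    print("sidecar worst case:", loss_sidecar_no_pv(boot))
    print("agent worst case:  ", loss_agent_no_pv(boot))
```

The model is deliberately crude (it ignores samples that were still in flight at the moment of the crash), but it shows why the gap between the two setups is dominated by the 2h upload interval.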

Running with persistent storage, the pod crashes

Prometheus w/ Thanos sidecar

Your Prometheus is crashing, but you have persistent storage attached to it. Let’s say the crash happens at the worst possible moment, right before the next push to the object store (almost 2h after the previous one, by default). Since you have a volume attached to the instance, Prometheus can replay the WAL and restore the data. Once the WAL is replayed, the sidecar should be able to push all the data to the object store.

Prometheus in Agent mode

In this case, Prometheus is crashing, but the data forwarded up to the point of the crash is safely stored in the remote destination. You have a volume attached, so once the WAL is replayed, Prometheus can push the metrics to the remote endpoint. The WAL is also easier to recover, as once something is successfully delivered it can be removed from the TSDB.

Conclusion: As you can see, by adding persistent storage you can limit the data loss to the timeframe when the (Prometheus) Server/Agent is not ingesting data because it is booting up. If Prometheus is killed frequently before it can get up and running again, Agent mode will be more robust, though.

Network outage, lost connection to the remote destination/object store

Prometheus w/ Thanos sidecar

Here, you are running a regular Prometheus with a sidecar, and the object store becomes unavailable. In this case, the data is stored locally (with a persistent volume attached), while the sidecar keeps trying to push it to the store.

Prometheus in Agent mode

In the case of leveraging remote_write, things are a bit different.

Prometheus, with or without Agent mode, can only buffer up to 2h of data for remote_write (or the value set via storage.tsdb.min-block-duration=2h and storage.tsdb.max-block-duration=2h).

Hopefully this limitation will be lifted soon. Additionally, out-of-order sample support is finally here, so some historical issues can now be solved.

Conclusion: this is a clear win for the sidecar use-case as you are able to store more data until the store becomes available again, plus you can also connect to the Server Prometheus instance (or a local Thanos Query instance), and query its store locally. However, with remote_write you cannot access the metrics of the Agent instance in this case, and the local buffer is also limited to 2h.
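The difference becomes easier to see with a small sketch. The 2h buffer comes from the text above; the local retention of the Server is a parameter I am assuming for illustration (15 days is the Prometheus default for storage.tsdb.retention.time), and both function names are hypothetical.

```python
from datetime import timedelta

# remote_write can only buffer about 2h of data (limit described above).
REMOTE_WRITE_BUFFER = timedelta(hours=2)


def agent_loss(outage: timedelta,
               buffer: timedelta = REMOTE_WRITE_BUFFER) -> timedelta:
    """Data that can no longer be delivered once the remote endpoint is back.

    Anything older than the buffer has already been dropped locally.
    """
    return max(outage - buffer, timedelta(0))


def sidecar_loss(outage: timedelta, local_retention: timedelta) -> timedelta:
    """Data that can no longer be uploaded once the object store is back.

    Data stays on the persistent volume until local retention removes it,
    so nothing is lost while the outage is shorter than the retention.
    """
    return max(outage - local_retention, timedelta(0))


if __name__ == "__main__":
    outage = timedelta(hours=6)          # hypothetical outage length
    retention = timedelta(days=15)       # assumed local retention
    print("agent loses:  ", agent_loss(outage))
    print("sidecar loses:", sidecar_loss(outage, retention))
```

With these assumptions, any outage longer than 2h starts to cost the Agent setup data, while the sidecar setup only loses data if the outage outlasts the local retention.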

Summary

As I mentioned, it really is about trade-offs.

You have to take all of these scenarios into consideration when you are architecting your monitoring stack, and attach probabilities to each of these risks.

What is more likely? Having your Prometheus in a CrashLoopBackOff state for hours, or having an AZ or region down for more than 2 hours? Let’s say your object store is not available, but you still have access to your Prometheus instance. Is this more valuable than having fresher data available at the remote location?
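One way to make "attaching probabilities" concrete is a tiny expected-loss calculation. Everything in this sketch is made up for illustration: the probabilities, the outage length, and the Scenario type are my assumptions, and boot time is ignored to keep it short. Plug in your own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    probability: float     # chance of this happening in your planning period (hypothetical)
    loss_sidecar_h: float  # hours of metrics missing centrally with Server + sidecar
    loss_agent_h: float    # hours of metrics missing centrally with Agent mode


def expected_loss_hours(scenarios: list[Scenario]) -> tuple[float, float]:
    """Return (sidecar, agent) expected hours of lost metrics."""
    sidecar = sum(s.probability * s.loss_sidecar_h for s in scenarios)
    agent = sum(s.probability * s.loss_agent_h for s in scenarios)
    return sidecar, agent


# Illustrative numbers only, not measurements.
SCENARIOS = [
    # Crash without persistent storage: up to 2h lost with the sidecar.
    Scenario("crash, no persistent storage", 0.5, 2.0, 0.0),
    # 6h outage of the remote side: the Agent can only buffer 2h of it.
    Scenario("6h remote/object store outage", 0.125, 0.0, 4.0),
]

if __name__ == "__main__":
    sidecar, agent = expected_loss_hours(SCENARIOS)
    print(f"expected loss, sidecar: {sidecar:.2f}h")
    print(f"expected loss, agent:   {agent:.2f}h")
```

The result is only as good as the probabilities you feed in, and it says nothing about the value of being able to query the instance locally, but it forces you to write your assumptions down.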

You also have to have automation or at least runbooks for all of these scenarios, and you should know the limitations of the chosen solution.

Thanks to Wiard van Rij and Bartłomiej Płotka for the review!