Sizing Thanos Receive (and Prometheus) storage

Preface

Let’s say you are building a global monitoring platform with Thanos, but you have egress-only Prometheus instances or ones that are managed by other teams. If switching to a push-based model is feasible for you, Thanos Receive is the component you are looking for.

Behind the scenes, Receive implements the Prometheus remote_write API and exposes the StoreAPI (used across Thanos components), making its time series data queryable by Thanos Query instances.
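For illustration, here is a minimal sketch of the remote_write section you would add to a Prometheus configuration to push samples to Receive (the hostname is a placeholder; 19291 is Receive’s default remote-write port):

remote_write:
  - url: http://thanos-receive.example.com:19291/api/v1/receive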

Local TSDB under the hood

As Receive uses the same TSDB under the hood as Prometheus, we can estimate its resource usage in a similar fashion.

We have the following flags to configure Receive’s TSDB.

usage: thanos receive [<flags>]
Accept Prometheus remote write API requests and write to local tsdb.
Flags:
...
      --tsdb.retention=15d       How long to retain raw samples on local
                                 storage. 0d - disables this retention.
      --tsdb.wal-compression     Compress the tsdb WAL.
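As a rough sketch, a Receive instance with a 15-day local retention and a compressed WAL could be started roughly like this (the TSDB path, bucket config file, and address are placeholders, and other required flags are omitted):

thanos receive \
  --tsdb.path=/var/thanos/receive \
  --tsdb.retention=15d \
  --tsdb.wal-compression \
  --objstore.config-file=bucket.yaml \
  --remote-write.address=0.0.0.0:19291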

By default, Receive ships blocks to the object store every 2h, and also during shutdowns and rollout restarts, because it needs to flush the WAL to a TSDB block to avoid data loss.

This behaviour is controlled by the parameters below. However, as Prometheus’s defaults are well optimized, you really shouldn’t change these under normal circumstances.

func (rc *receiveConfig) registerFlag(cmd extkingpin.FlagClause) {
...
	rc.tsdbMinBlockDuration = extkingpin.ModelDuration(cmd.Flag("tsdb.min-block-duration", "Min duration for local TSDB blocks").Default("2h").Hidden())
	rc.tsdbMaxBlockDuration = extkingpin.ModelDuration(cmd.Flag("tsdb.max-block-duration", "Max duration for local TSDB blocks").Default("2h").Hidden())
...

The most recent 2h of data can be queried directly from Receive’s local TSDB, and after the blocks are shipped, they can be accessed through the Store Gateway from the object store of your choice.

Bytes on disk

When head compaction happens every 2h, Prometheus writes the in-memory data to disk as a persistent block.

Based on the official Prometheus docs:

Prometheus stores an average of only 1-2 bytes per sample. Thus, to plan the capacity of a Prometheus server, you can use the rough formula:

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

We can use this estimation to size the storage properly.
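As a quick worked example (the ingestion rate here is made up for illustration): with 15 days of retention (1296000 seconds), 100000 ingested samples per second, and ~1.87 bytes per sample (the value measured below), this gives:

needed_disk_space = 1296000 * 100000 * 1.87 ≈ 242 GB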

After some research, I’ve made small adjustments and came up with this formula to estimate the storage needed for a Prometheus instance.

(time() - prometheus_tsdb_lowest_timestamp_seconds) # retention
 * rate(prometheus_tsdb_head_samples_appended_total[1h]) # ingested samples per second
 * rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h]) # bytes per sample
 * 1.2 # 10% + 10% because of retention + compaction
 * 1.05 # 5% for WAL

Note: you can estimate the bytes needed without having reached the desired retention by substituting the size of the window in seconds for the first term, e.g. 1296000 for 15d.
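For example, targeting 15 days of retention, the substituted query would look like this:

1296000 # 15d in seconds
 * rate(prometheus_tsdb_head_samples_appended_total[1h])
 * rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
 * 1.2 * 1.05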

Let’s take a closer look at my formula.

To get retention_time_seconds, we can use the age of the oldest time series data with (time() - prometheus_tsdb_lowest_timestamp_seconds), or we can use the desired retention in seconds.

The next component is the number of samples ingested per second. Fortunately, we can get this with the query rate(prometheus_tsdb_head_samples_appended_total[1h]), as mentioned in this article by Robust Perception.

The final piece of the original formula is bytes_per_sample.

We can use this snippet to calculate this value: rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h]).

In my example, that’s ~1.87 bytes per sample.

At this point, we have close estimates of each component of the expression mentioned in the Storage section of the docs.

We can go a bit further and add the adjustments from Robust Perception’s post: for example, 2 * 10% to take compaction and retention into consideration, and an additional 5% to estimate the size of the WAL, based on this other article by RP. The WAL is about a 10% increase in a similar configuration, but since Receive is also able to compress the write-ahead log when the --tsdb.wal-compression flag is set, we can roughly take half of this, as the article also says.
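Taken together, these adjustment factors multiply out to 1.2 * 1.05 = 1.26, i.e. roughly 26% on top of the raw retention * ingestion rate * bytes-per-sample estimate.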

Conclusion

The method above enables proper sizing for a single Receive instance (or a vanilla Prometheus), and you can repeat this strategy for all your remote write tenants to get the final size of persistent storage for a given retention window.
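If all of your Receive instances are scraped by the same Prometheus, one way to do that (assuming a 15-day window, as above) is to sum the per-instance estimate, for example:

sum(
    1296000
  * rate(prometheus_tsdb_head_samples_appended_total[1h])
  * rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h])
  * 1.2 * 1.05
)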

In (partially related) other news, the proposal to have a “split/dual-mode” for Receive has been accepted, implemented, and is available in the latest release.