Solving the per-store TLS limitation in Thanos Query

Preface

Thanos Query currently does not support configuring store endpoints with dedicated TLS configurations, so you will run into problems if you want to mix TLS and plaintext connections within a cluster or across clusters.

The problem

Let’s take a simple example where you might want this functionality.

You are building a centralized monitoring platform with one Observer cluster and multiple Observee clusters. If you want meta-monitoring (that is, in-cluster connections) integrated into this solution, you might want to use plaintext communication or self-signed certificates in-cluster, rather than the certificates you use cross-cluster.

Currently, when you mix TLS and plaintext store endpoints and grpc-client-tls-secure is set to false (either by passing the argument or by setting it in the Helm configuration), you cannot access the stores behind TLS. Query logs a warning like this:

level=warn ts=2021-07-01T10:21:29.690134519Z caller=storeset.go:487 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from remote.thanos.cluster.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=remote.thanos.cluster.com:443

If you set it to true, you will be able to connect to the secure endpoint; however, you won’t be able to access the local ones (the ones specified via dnssrv+_grpc._tcp.) without the same certificates in place.

This happens because the gRPC client configuration cannot currently be set per store: all of the clients use the same configuration.

	dialOpts, err := extgrpc.StoreClientGRPCOpts(logger, reg, tracer, secure, skipVerify, cert, key, caCert, serverName)
	if err != nil {
		return errors.Wrap(err, "building gRPC client")
	}
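
To make the consequence of a single shared client configuration concrete, here is a minimal sketch that models it. It is not Thanos code: the `clientTLS` and `store` types, the helper functions, and the `local-sidecar:10901` address are hypothetical, and the model simplifies things by assuming that local stores serve plaintext and that a client can only reach a store when both sides agree on TLS versus plaintext.

```go
package main

import "fmt"

// clientTLS stands in for Query's global gRPC client TLS setting.
// Hypothetical type for illustration only; not part of Thanos.
type clientTLS struct {
	Secure bool
}

// store is a hypothetical description of a store endpoint.
type store struct {
	Addr      string
	ServesTLS bool
}

// reachable reports whether a client using cfg can talk to s.
// Assumption: client and server must agree on TLS versus plaintext.
func reachable(cfg clientTLS, s store) bool {
	return cfg.Secure == s.ServesTLS
}

// unreachable lists the stores that one shared client config cannot reach.
func unreachable(cfg clientTLS, stores []store) []string {
	var out []string
	for _, s := range stores {
		if !reachable(cfg, s) {
			out = append(out, s.Addr)
		}
	}
	return out
}

func main() {
	stores := []store{
		{Addr: "remote.thanos.cluster.com:443", ServesTLS: true},
		{Addr: "local-sidecar:10901", ServesTLS: false}, // hypothetical local store
	}
	// Try both values of the single global setting.
	for _, secure := range []bool{false, true} {
		cfg := clientTLS{Secure: secure}
		fmt.Printf("secure=%v unreachable=%v\n", secure, unreachable(cfg, stores))
	}
}
```

Whichever value the shared setting takes, one of the two stores ends up in the unreachable list, which is the behaviour described above.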

The solution

To work around this limitation, you can drop TLS and use plaintext for all integrations (not the best idea), move TLS termination to additional/edge proxies, or stack Queries in-cluster.

Note: bear in mind that my example below references configuration available in the current version of the bitnami/thanos chart, so if you’re using another deployment method you might need to make adjustments.

When implementing the third option, the basic idea is to have a dedicated Query deployment that will only contain the remote endpoint(s) of Observee Queries (or sidecars, if those are exposed), so TLS can be enabled freely.

# remote-tls-thanos-query
query:
  stores:
    - remote.thanos.cluster.com:443

  grpcTLS:
    client:
      secure: true
      servername: remote.thanos.cluster.com
      autoGenerated: false
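
Before pointing the dedicated Query at the remote endpoint, it can help to confirm that the endpoint completes a TLS handshake for the expected server name. The following is a minimal sketch using only the Go standard library; `checkTLS` and `dialTimeout` are hypothetical names, and it assumes the remote certificate chains to a CA in the system trust store (with a private CA you would also have to set `RootCAs`).

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// dialTimeout bounds the connection attempt and the TLS handshake as a whole.
const dialTimeout = 3 * time.Second

// checkTLS attempts a TLS handshake with addr, verifying the certificate
// against serverName, and returns the negotiated ALPN protocol.
func checkTLS(addr, serverName string) (string, error) {
	dialer := &net.Dialer{Timeout: dialTimeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{
		ServerName: serverName,
		NextProtos: []string{"h2"}, // gRPC runs over HTTP/2
	})
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return conn.ConnectionState().NegotiatedProtocol, nil
}

func main() {
	// Address and server name taken from the example configuration above.
	proto, err := checkTLS("remote.thanos.cluster.com:443", "remote.thanos.cluster.com")
	if err != nil {
		fmt.Println("TLS handshake failed:", err)
		return
	}
	fmt.Printf("TLS handshake succeeded, ALPN protocol %q\n", proto)
}
```

A successful handshake only shows that TLS and the server name are set up consistently; it does not prove that a Thanos Store API is answering behind the endpoint.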

Additionally, you can have a centralized (Query) instance, which will include all the local components and the sidecars running next to the meta-monitoring Prometheus instances.

# thanos-query
query:
  dnsDiscovery:
    enabled: true
    sidecarsService: kube-prometheus-stack-thanos-discovery
    sidecarsNamespace: monitoring

  stores:
    - remote-tls-thanos-query.monitoring.svc.cluster.local:10901

This is the final architecture.

[Figure: thanos-per-tls]

Conclusion

If you need this functionality as soon as possible, this workaround can be useful. Note, however, that a formal proposal was filed in the last couple of days to address this issue, and there is already a WIP PR implementing the functionality.