Prometheus, but bigger

Cluster capacity overview

Up until January 2021, I have been using an enterprise monitoring solution to monitor Kubernetes clusters, the same one used for APM.

It felt natural, the integration with Kubernetes was quite easy, only minor tweaks needed, and APM and Infrastructure metrics could be integrated, really nice and magical.

But despite the ease of collecting and storing data, creating alerts using metrics had huge query limitations, we would often end up with alerts that differed from what our dashboards were showing. Not to mention that with 6 clusters, the number of metrics being collected and stored was pretty big, adding a huge cost to our monthly expenses.

After some considerations, the downsides were bigger than the upsides. It was time to change our monitoring solution!

But, what to use? Grafana was the obvious choice for the visual part, but what could we use for the "backend" that had the resilience and availability we needed?

"Pure" OpenTSDB installations demanded too much work and attention, Standalone Prometheus did not offer replication and I would end up with multiple databases, TimeScaleDB looked nice, but I am no Postgres administrator.

After some experimentation with the above solutions, I looked to the CNCF website for more options and found Thanos! It ticked all the right boxes: Longtime retention, Replication, High Availability, microservice approach, and global view for all my clusters using the same database!

The Architecture

There is no persistent storage available on the clusters (a choice made to keep everything stateless), so the default Prometheus + Thanos sidecar approach was not an option, the metric storage must reside outside the clusters. Also, binding any Thanos component to a specific set of clusters is not possible, each cluster is isolated from the others, everything has to be monitored from "outside".

With the above said, high availability in mind, and Thanos possibility to run on virtual machines, the final architecture came to this:

Thanos architecture

As shown in the diagram, everything is running in a multi-datacenter architecture, where each site has its own set of Grafana + Query servers bundled together, its own set of store servers, and three receive servers (half of the cluster).

There is also an AWS RDS for our Grafana database. It doesn't need to be a huge database (low cost), and managing MySQL is out of our team's scope.

From all the components available with Thanos, 4 of them are implemented:

  • Receive: This component is responsible for the TSDB, it also manages replication between all the servers running receive and TSBD block uploads to S3.
  • Query: As the name implies, this component is responsible for querying the receive database.
  • Store: This component reads S3 for long-term metrics that are no longer stored in the receive.
  • Compactor: This component manages data downsampling and compaction for the TSDB blocks stored in S3

Data ingestion

All the cluster data ingestion is managed by a dedicated Prometheus POD running inside the clusters. It collects metrics from the Control Plate (API Server, Controller, and Scheduler), from the ETCD clusters, and from PODs inside the clusters that have metrics relevant to the infrastructure and Kubernetes itself (Kube-proxy, Kubelet, Node Exporter, State Metrics, Metrics Server, and other PODs that have the scraping annotation)

The Prometheus POD then sends the information to one of the receive servers managing the TSDB using the remote storage configuration.

Data ingestion

All data is sent to a single server and then replicated to the other servers. The DNS address that Prometheus uses is a DNS GSLB that probes each receive server and balances the DNS resolution between the healthy ones, sharing the load across all servers, since the DNS resolution only gives one IP per DNS query.

It is also important to mention that the data must be sent to a single receive instance and let it manage the replication, sending the same metric leads to replication failures and misbehaving.

At this level, metrics are also uploaded to the S3 Bucket for long time retention. Receive uploads the blocks every 2 hours (when each TSDB block is closed) and these metrics are available for querying using the Store component.

Here we can also set the retention for local data. In this case, all local data is kept for 30 days for daily use and troubleshooting, this allows faster querying.

Data older than 30 days is available exclusively on S3 for up to 1 year for long period evaluations and comparisons.

Data querying

With data being collected and stored in the receivers, everything is ready to be queried. This part is also set for multi-datacenter availability.

Each server is running both Grafana and Query, this makes it easier for us to identify and remove a server from the load balancer if one of them starts to malfunction (or both). In Grafana, the data source is configured as localhost, so it always uses the local Query for data.

For the query configuration, it has to know all the servers that have metrics stored (Receiver and Store). The query knows which server is online and is able to gather metrics from them.

Data querying

It also manages data deduplication, since it is querying all servers and with replication configured, all metrics have multiple copies. This is can be done using labels assigned to the metrics and a parameter for the query (--query.replica-label=QUERY.REPLICA-LABEL). With these configured, the query component knows if the metrics gathered from the receivers and stores are duplicated and uses only one data point.

Long term data

As mentioned, data is kept locally for a maximum period of 30 days, everything else is stored on S3. This allows for a reduced amount of space needed on the receivers and reduced costs, as block storage is more expensive than object storage. Not to mention that querying data older than 30 days is not very usual and is mostly used for resource usage history and projections.

The data stored on the S3 bucket can be queried using the store component. It is statically configured to only serve data older than 30 days (using the current time as reference) since the query component sends the queries to all data components (store and receiver).

Remote data querying

The store also keeps a local copy of the indexes of each TSDB block stored on the S3 bucket, so if a query needs data older than 30 days, it knows which blocks to download and use for serving the data.

Let's talk numbers

Considering all clusters there are:

  • 6 Kubernetes clusters being monitored;
  • 670 services have their metrics collected;
  • 246 servers monitored using node exporter;
  • ~270k metrics being collected each minute;
  • ~7.3 GB of data being ingested each day, or ~226.3 GB of data per month;
  • 40 dashboards created just for Kubernetes components;
  • 116 alerts created on Grafana;

For the monthly expenses, with most of the components running on-premises, there was a 90.61% cost reduction, going from US$ 38,421.25 monthly to US$ 3,608.99, including the AWS services cost.

Considerations

Configuring and setting up everything mentioned above took around a month or so, including testing some other solutions, validating the architecture, implementation, enabling all data collection from our clusters, and creating all dashboards.

In the first week, the benefits were obvious. Monitoring the clusters became much easier, dashboards could be built and customized in a fast manner, and collecting metrics is almost plug & play, with most applications exporting metrics in the Prometheus format and collection being automatic based on annotations.

Also, using Grafana's LDAP integration allowed access to different teams in a granular way. Developers and SREs have access to a huge set of dashboards, with relevant metrics about their namespaces, ingresses, and more.

And the best part, at a fraction of the cost.

Working with technology for +14 years, passionate about Open Source Software, 5 years working exclusively with containers (Mesos and Kubernetes).