Introducing Thanos: Prometheus at scale

05.18.18


Fabian Reinartz is a software engineer who enjoys building systems in Go and chasing tough problems. He is a Prometheus maintainer and co-founder of the Kubernetes SIG instrumentation. In the past, he was a production engineer at SoundCloud and led the monitoring team at CoreOS. Nowadays he works at Google.

Bartek Plotka is an Improbable infrastructure software engineer. He’s passionate about emerging technologies and distributed systems problems. With a low-level coding background at Intel, previous experience as a Mesos contributor, and production, global-scale SRE experience at Improbable, he is focused on improving the world of microservices. His three loves are Golang, open-source software and volleyball.

As you might guess from our flagship product SpatialOS, Improbable has a need for highly dynamic cloud infrastructure at a global scale, running dozens of Kubernetes clusters. To stay on top of them, we were early adopters of the Prometheus monitoring system. Prometheus is capable of tracking millions of measurements in real time and comes with a powerful query language to extract meaningful insights from them.

Prometheus’s simple and reliable operational model is one of its major selling points. However, past a certain scale, we’ve identified a few shortcomings. To resolve them, we’re today officially announcing Thanos, an open-source project by Improbable that seamlessly transforms existing Prometheus deployments in clusters around the world into a unified monitoring system with unbounded historical data storage. It’s available on GitHub.


Our goals with Thanos

At a certain cluster scale, problems arise that go beyond the capabilities of a vanilla Prometheus setup. How can we store petabytes of historical data in a reliable and cost-efficient way? Can we do so without sacrificing responsive query times? Can we access all our metrics from different Prometheus servers from a single query API? And can we somehow merge replicated data collected via Prometheus HA setups?

We built Thanos as the solution to these questions. In the next sections, we describe how we mitigated the lack of these features prior to Thanos and explain the goals we had in mind in detail.

Global query view

Prometheus encourages a functional sharding approach. Even a single Prometheus server provides enough scalability to free users from the complexity of horizontal sharding in virtually all use cases.

While this is a great deployment model, you often want to access all the data through the same API or UI – that is, a global view. For example, you can render multiple queries in a Grafana graph, but each query can be done only against a single Prometheus server. With Thanos, on the other hand, you can query and aggregate data from multiple Prometheus servers, because all of them are available from a single endpoint.
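
To make this concrete, here is a minimal sketch of what a global query looks like from a client’s point of view. Because the Querier speaks the standard Prometheus HTTP API, any existing client or dashboard can point at it unchanged; the thanos-querier address and the example PromQL expression are placeholders for this sketch.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// The Thanos Querier exposes the standard Prometheus HTTP API, so a
	// single /api/v1/query call can return series collected by every
	// connected Prometheus server.
	params := url.Values{}
	params.Set("query", `sum(rate(http_requests_total[5m])) by (cluster)`)

	// "thanos-querier:9090" is a placeholder address for this sketch.
	resp, err := http.Get("http://thanos-querier:9090/api/v1/query?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // one JSON result spanning all clusters
}
```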

Previously, to enable a global view at Improbable, we arranged our Prometheus instances in a multi-level hierarchical federation. That meant setting up a single meta-Prometheus server that scraped a portion of the metrics from each “leaf” server.

Meta-Prometheus server

This proved to be problematic. It increased the configuration burden, added a potential failure point and required complex rules to expose only certain data on the federated endpoint. In addition, that kind of federation does not allow a truly global view, since not all data is available from a single query API.

Closely related to this is a unified view of data collected by high-availability (HA) pairs of Prometheus servers. Prometheus’s HA model independently collects data twice, which is as simple as it could be. However, a merged and de-duplicated view of both data streams is a huge usability improvement.

Undoubtedly, there is a need for highly available Prometheus servers. At Improbable, we are serious about monitoring every minute of data, but having a single Prometheus instance per cluster is a single point of failure. Any configuration error or hardware failure could potentially result in the loss of important insights. Even a simple rollout can cause a small gap in metric collection, because a restart can take significantly longer than the scrape interval.
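
To make the deduplication idea concrete, here is a simplified sketch of one possible merge strategy for an HA pair: prefer one replica’s samples and fill gaps from the other. The types and the gap heuristic are invented for illustration; this is not Thanos’s actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one timestamped value scraped by a replica.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// dedupe merges two replicas' streams of the same series: it prefers
// replica A and only falls back to replica B where A has no sample
// within gapMs of B's timestamp. Simplified illustration only.
func dedupe(a, b []Sample, gapMs int64) []Sample {
	out := append([]Sample{}, a...)
	for _, sb := range b {
		covered := false
		for _, sa := range a {
			d := sa.TimestampMs - sb.TimestampMs
			if d < 0 {
				d = -d
			}
			if d <= gapMs {
				covered = true
				break
			}
		}
		if !covered {
			out = append(out, sb)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].TimestampMs < out[j].TimestampMs })
	return out
}

func main() {
	a := []Sample{{0, 1}, {30_000, 2}} // replica A missed the scrape at t=15s
	b := []Sample{{0, 1}, {15_000, 1.5}, {30_000, 2}}
	fmt.Println(dedupe(a, b, 5_000)) // gap at t=15s is filled from replica B
}
```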

Reliable historical data storage

One of our dreams (shared by most Prometheus users) is cheap, responsive, long-term metric storage. At Improbable, using Prometheus 1.8, we were forced to set our metric retention to an embarrassing nine days. This places an obvious limit on how far back our graphs can look.

Prometheus 2.0 helps a lot in this area, as the total number of time series no longer impacts overall server performance (see Fabian’s KubeCon keynote about Prometheus 2). Still, Prometheus stores metric data on its local disk. While highly efficient data compression can get significant mileage out of a local SSD, there is ultimately a limit on how much historical data can be stored.

Additionally, at Improbable, we care about reliability, simplicity and cost. Larger local disks are harder to operate and back up; they are more expensive and require extra backup tooling, which introduces unnecessary complexity.

Downsampling

Once we started querying historical data, we soon realized that there are fundamental big-O complexities that make queries slower and slower as we retrieve weeks, months and ultimately years’ worth of data.

The usual solution to this problem is called downsampling, a process of reducing the sampling rate of a signal. With downsampled data, we can “zoom out” to a larger time range while maintaining the same number of samples, thus keeping queries responsive.

Downsampling old data is an inevitable requirement of any long-term storage solution and is beyond the scope of vanilla Prometheus.
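
As a rough illustration of what downsampling means in practice, the sketch below buckets raw samples into fixed five-minute windows and keeps only a handful of aggregates per window. The types and function names are invented for this example and are not Thanos code.

```go
package main

import "fmt"

// Sample is a raw timestamped value.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// Aggregate summarises every raw sample that falls into one window.
type Aggregate struct {
	WindowStartMs int64
	Count         int
	Sum, Min, Max float64
}

// downsample buckets raw samples (assumed sorted by timestamp) into
// windows of windowMs milliseconds, keeping count/sum/min/max so that
// queries over long ranges touch far fewer data points.
func downsample(samples []Sample, windowMs int64) []Aggregate {
	var out []Aggregate
	for _, s := range samples {
		start := s.TimestampMs - s.TimestampMs%windowMs
		if len(out) == 0 || out[len(out)-1].WindowStartMs != start {
			out = append(out, Aggregate{WindowStartMs: start, Min: s.Value, Max: s.Value})
		}
		a := &out[len(out)-1]
		a.Count++
		a.Sum += s.Value
		if s.Value < a.Min {
			a.Min = s.Value
		}
		if s.Value > a.Max {
			a.Max = s.Value
		}
	}
	return out
}

func main() {
	raw := []Sample{{0, 1}, {15_000, 3}, {310_000, 2}}
	fmt.Println(downsample(raw, 300_000)) // five-minute windows
}
```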

Additional goals

One of the initial goals of the Thanos project was to integrate with any existing Prometheus setups seamlessly. A second goal was that operations should be simple, with a minimal barrier to entry. If there are any dependencies, they should be easy to satisfy for small- and large-scale users alike, which also implies a negligible baseline cost.


The architecture of Thanos

With our goals enumerated in the previous section, let’s work down the list and see how Thanos tackles them.

Global view

To get a global view on top of an existing Prometheus setup, we need to interconnect a central query layer with all our servers. The Thanos Sidecar component does just that and is deployed next to each running Prometheus server. It acts as a proxy that serves Prometheus’s local data over Thanos’s canonical gRPC-based Store API, which allows selecting time series data by labels and time range.

On the other end stands a horizontally scalable and stateless Querier component, which does little more than answer PromQL queries via the standard Prometheus HTTP API. Queriers, Sidecars and other Thanos components communicate via a gossip protocol.

Query sidecar

  1. When the Querier receives a request, it fans out to relevant Store API servers, i.e. our Sidecars, and fetches the time series data from their Prometheus servers.
  2. It aggregates the responses together and evaluates the PromQL query against them. It can aggregate disjoint data as well as duplicated data from Prometheus high-availability pairs.
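
Conceptually, every Store API server answers the same question: “give me the series matching these label matchers within this time range”. The Go interface below is a heavily simplified, non-streaming sketch of that idea; the real Store API is a gRPC service, and all names here are illustrative only.

```go
package store

// LabelMatcher selects series by label, e.g. {job="api", region=~"eu-.*"}.
type LabelMatcher struct {
	Name  string
	Value string // exact value or regular expression, depending on Type
	Type  string // "=", "!=", "=~" or "!~"
}

// Series is a labelled stream of (timestamp, value) samples.
type Series struct {
	Labels  map[string]string
	Samples [][2]float64 // [timestamp in ms, value] pairs, for brevity
}

// Store is a simplified sketch of what a Store API server provides.
// Because every data source (Sidecar, Store Gateway, Ruler) answers the
// same question, the Querier can fan out to all of them uniformly.
type Store interface {
	// Series returns all series matching the given label matchers whose
	// samples fall within [minTimeMs, maxTimeMs].
	Series(matchers []LabelMatcher, minTimeMs, maxTimeMs int64) ([]Series, error)

	// LabelValues returns the known values for a label name, useful for
	// things like dashboard drop-downs.
	LabelValues(name string) ([]string, error)
}
```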

This solves a central piece of our puzzle by unifying well-separated Prometheus deployments into a global view of our data. In fact, Thanos can be deployed this way to make use of only these features if desired. No changes to the existing Prometheus servers are necessary at all!

Unlimited retention!

Sooner or later, however, we will want to preserve some data beyond Prometheus’s regular retention time. To do this, we settled on an object storage system for backing up our historical data. It is widely available in every cloud and even most on-premise data centres, and is extremely cost efficient. Furthermore, virtually every object storage solution can be accessed through the well known S3 API.

Prometheus’s storage engine writes its recent in-memory data to disk about every two hours. A block of persisted data contains all the data for a fixed time range and is immutable. This is rather useful since the Thanos Sidecar can simply watch Prometheus’s data directory and upload new blocks into an object storage bucket as they appear.
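
Below is a minimal sketch of that watch-and-ship loop: list the Prometheus data directory and hand any block directory we have not seen before to an uploader. The uploadBlock function, the directory checks and the polling interval are placeholders rather than the Sidecar’s real implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// uploadBlock is a placeholder for copying a finished TSDB block
// directory (chunks, index, meta.json) into an object storage bucket.
func uploadBlock(dir string) error {
	fmt.Println("uploading", dir)
	return nil
}

func main() {
	dataDir := "/prometheus"      // Prometheus --storage.tsdb.path
	uploaded := map[string]bool{} // blocks we have already shipped

	for {
		entries, err := os.ReadDir(dataDir)
		if err != nil {
			panic(err)
		}
		for _, e := range entries {
			// Persisted blocks are immutable directories; skip the
			// write-ahead log and anything we already uploaded.
			if !e.IsDir() || e.Name() == "wal" || uploaded[e.Name()] {
				continue
			}
			if err := uploadBlock(filepath.Join(dataDir, e.Name())); err == nil {
				uploaded[e.Name()] = true
			}
		}
		time.Sleep(30 * time.Second) // polling interval chosen arbitrarily
	}
}
```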

Prometheus backup

An additional advantage of having the Sidecar upload metric blocks to the object store as soon as they are written to disk is that it keeps the “scraper” (Prometheus with the Thanos Sidecar) lightweight, which simplifies maintenance, cost and system design.

Backing up our data is easy. What about querying data from the object store again?

The Thanos Store component acts as a data retrieval proxy for the data inside our object storage. Just like the Thanos Sidecars, it participates in the gossip cluster and implements the Store API. This way, existing Queriers can treat it just like a Sidecar: as another source of time series data – no special handling is required.

Thanos store component

The blocks of time series data consist of several large files. Downloading them on-demand would be rather inefficient and caching them locally would require huge memory and disk space.

Instead, the Store Gateway knows how to deal with the data format of the Prometheus storage engine. Through smart query planning and by only caching the necessary index parts of blocks, it can reduce complex queries to a minimal number of HTTP range requests against files in the object storage. This way it can reduce the number of naive requests by four to six orders of magnitude and achieve response times that are, in the big picture, hard to distinguish from queries against data on a local SSD.
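
The underlying primitive is nothing more exotic than an HTTP range request against an object in the bucket. The sketch below fetches a single slice of a hypothetical index file to show the mechanism; the URL and byte offsets are made up, and real object stores expose the same capability through their S3-compatible APIs.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical URL of one index file inside a bucket.
	url := "https://storage.example.com/thanos-bucket/01ABCDEF/index"

	// Ask only for the byte range we actually need, e.g. one postings
	// section, instead of downloading the whole multi-gigabyte file.
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Range", "bytes=1048576-1114111") // a 64 KiB slice

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A server that honours the header answers 206 Partial Content.
	part, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.StatusCode, len(part), "bytes fetched")
}
```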

Thanos querier

As shown in the diagram above, the Thanos Querier significantly reduces the per-request costs of object storage offerings by leveraging the Prometheus storage format, which co-locates related data in a block file. With that in mind, we can aggregate multiple byte fetches into a minimal number of bulk calls.

Compaction & downsampling

The moment a new block of time series data is successfully uploaded to object storage, we treat it as “historical” data that is immediately available via the Store Gateway.

However, over time, the blocks from a single source (i.e. a Prometheus server with a Sidecar) accumulate and simply do not use the full potential of the index. To solve this, we introduced a separate singleton component called the Compactor. It simply applies Prometheus’s local compaction mechanism to the historical data in object storage and can be run as a simple periodic batch job.

Prometheus compactor

Thanks to Prometheus’s efficient sample compression, querying many series from storage over a long time range is not problematic from a data-size perspective. However, the potential cost of decompressing billions of samples and running them through query processing inevitably causes a drastic increase in query latency. Moreover, with hundreds of data points per available screen pixel, it becomes impossible to even render the full-resolution data. Thus, downsampling is not only feasible, it also involves no noticeable loss of precision.

Downsampling Prometheus

To produce downsampled data, the Compactor continuously aggregates series down to five-minute and one-hour resolutions. For each raw chunk, encoded with TSDB’s XOR compression, it stores several types of aggregates, e.g. min, max and sum, in a single block. This allows the Querier to automatically choose the aggregate that is appropriate for a given PromQL query.

From the user’s perspective, no special configuration is required to use downsampled data. The Querier automatically switches between the different resolutions and raw data as the user zooms in and out. Optionally, the user can control this directly by specifying a custom “step” in the query parameters.
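
For example, explicitly widening the step of a range query is just a matter of the standard query_range parameters. Here is a sketch; the Querier address is a placeholder and the one-hour step is chosen purely to match the coarsest downsampled resolution described above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Query a whole year of data, but ask for one point per hour: with
	// such a wide step the Querier can serve coarser, downsampled series
	// instead of decompressing every raw sample.
	params := url.Values{}
	params.Set("query", `avg_over_time(node_load1[1h])`)
	params.Set("start", fmt.Sprint(time.Now().AddDate(-1, 0, 0).Unix()))
	params.Set("end", fmt.Sprint(time.Now().Unix()))
	params.Set("step", "3600") // seconds; the custom "step" mentioned above

	// "thanos-querier:9090" is a placeholder address for this sketch.
	resp, err := http.Get("http://thanos-querier:9090/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```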

Since the storage cost per GB is marginal, Thanos by default keeps the raw, five-minute and one-hour resolutions in storage; there is no need to delete the original data.

Recording rules

Even with Thanos, recording rules remain an essential part of the monitoring stack. They reduce query complexity, latency and cost, and provide users with convenient shortcuts to important aggregated views of metric data. Thanos builds upon vanilla Prometheus instances, so it is perfectly valid to keep recording and alerting rules in the existing Prometheus server. However, this might not be enough in the following cases:

  • Global alerts and rules (e.g. alerting when a service is down in more than two of three clusters).
  • Rules that go beyond a single Prometheus server’s local data retention.
  • The desire to store all rules and alerts in a single place.

Prometheus ruler

For all these cases, Thanos includes a separate component called the Ruler, which evaluates rules and alerts against Thanos Queriers. Because the Ruler exposes the well-known Store API, Querier nodes can access the freshly evaluated metrics. Later, they are also backed up in the object store and become accessible via the Store Gateway.

The power of Thanos

Thanos is flexible enough to be set up differently to fit your use case, and it is particularly useful when migrating from plain Prometheus. Let’s quickly recap what we learned about the Thanos components through a quick example. Here’s how to take your own vanilla Prometheus setup into our shiny “unlimited retention metric” world:

Prometheus global diagram

  1. Add Thanos Sidecar to your Prometheus servers – for example, a neighbouring container in the Kubernetes pod.
  2. Deploy a few replica Thanos Queriers to enable data browsing. At this point, it’s easy to set up gossip between your scrapers and Queriers. Use the thanos_cluster_members metric to ensure all the components are connected.

Notably, these two steps alone are enough to enable a global view and seamless deduplication of results from potential Prometheus HA replicas! Just connect your dashboards to the Querier HTTP endpoint or use the Thanos UI directly.

However, if you want metric data backup and long-term retention, three more steps are needed:

  1. Create either an AWS S3 or a GCS bucket and configure your Sidecars to back up data there. You can now also reduce local retention to a minimum.
  2. Deploy a Store Gateway and connect it to your existing gossip cluster. With that, queries can access the backed-up data as well!
  3. Deploy the Compactor to improve the responsiveness of long-term queries by applying compaction and downsampling.

If you want to learn more, feel free to check out our example Kubernetes manifests and the getting started page!

With just five steps, we transformed our Prometheus servers into a robust monitoring system that gives you a global view, unlimited retention and the potential for highly available metrics.

Pull request: we need you!

Thanos has been an open-source project from the very beginning. Seamless integration with Prometheus and the ability to use only part of the project make it a perfect choice if you want to scale your monitoring system without superhuman effort.

GitHub pull requests and issues are very welcome. At the same time, do not hesitate to contact us via GitHub issues or the #thanos channel on the Improbable-eng Slack if you have any questions or feedback, or even if you want to share your use case! And if you like what we do at Improbable, reach out to us – we are always hiring!
