Exporters: detect micro-incidents and improve storage performance

Sep 27, 2019, in Cloud

Open source, transparency, and infrastructure are intrinsically tied together at Gandi. One of our recent objectives was to detect localized problems impacting the quality of service offered to our users more quickly and effectively.

Nicolas Belouin, System Engineer in Gandi’s Simple Hosting team, has developed several tools for improving our tracking of storage unit performance.

For those interested, these open source tools are available on Gandi’s GitHub.

Summary of the organization of Gandi’s storage infrastructure

Gandi’s storage infrastructure consists of two environments: one for IaaS and one for PaaS. Both rely on FreeBSD-based storage units (filers), which store each volume (disk) as a ZFS volume.

There are two different methods of exposing these volumes to users:

  • For Gandi Cloud (IaaS), we use iSCSI, which lets us expose a “block” type volume directly to the user. That way, our customers can use their volume as they see fit. On FreeBSD, the service (daemon) that manages this is called “CTLD”.
  • For Simple Hosting (PaaS), we chose to export a file system to the user via NFS. Here, we use “Ganesha” as the NFS daemon.

Why choose to work with these exporters?

For the maintenance of these two services (CTLD and Ganesha), Gandi already has tools for detecting major incidents (e.g. “the storage unit stopped responding”). On the other hand, there was no simple solution for detecting minor or localized incidents (e.g. “slow performance on a storage unit”). We needed a system to inform us of any abnormal drop in the quantity of client data in transit, and therefore of a probable incident. For the moment, the goal is to detect the point at which there are no more reads or writes on the storage unit: the daemon is still active, but in an unstable state.

Of course, we don’t directly monitor each and every volume in a filer, since a customer can have a volume that they don’t do anything with. The monitoring is done at the level of the filer: we check whether the amount of “actions” falls to a level that’s too low, or falls too quickly.
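To make the filer-level check more concrete, here is a minimal sketch in Go of the two conditions described above: an aggregate rate that is too low, and a rate that drops too quickly. The type name dropDetector, its fields, and the threshold values are all hypothetical and chosen for illustration; this is not the alerting logic actually used at Gandi.

```go
package main

import "fmt"

// dropDetector flags a filer whose aggregate operation rate falls below an
// absolute floor, or drops too sharply between two consecutive samples.
// The type and both thresholds are hypothetical, for illustration only.
type dropDetector struct {
	minOpsPerSec float64 // absolute floor for the whole filer
	maxDropRatio float64 // largest tolerated relative drop between samples (0..1)
	prev         float64 // previous sample
	hasPrev      bool    // false until the first sample has been observed
}

// observe records a new aggregate ops/s sample and returns a non-empty
// reason when the sample looks abnormal.
func (d *dropDetector) observe(opsPerSec float64) string {
	reason := ""
	switch {
	case opsPerSec < d.minOpsPerSec:
		reason = "too low"
	case d.hasPrev && d.prev > 0 && (d.prev-opsPerSec)/d.prev > d.maxDropRatio:
		reason = "dropped too quickly"
	}
	d.prev, d.hasPrev = opsPerSec, true
	return reason
}

func main() {
	// Made-up samples of the total ops/s seen on one filer.
	d := &dropDetector{minOpsPerSec: 100, maxDropRatio: 0.5}
	for _, sample := range []float64{1200, 1100, 400, 380, 20} {
		fmt.Printf("%6.0f ops/s -> %q\n", sample, d.observe(sample))
	}
}
```

Working on the filer's total rather than on individual volumes means an idle customer volume never triggers anything on its own; only a change in the overall activity does.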

And that’s exactly the goal of the Ganesha and CTLD exporters: to facilitate better and quicker detection of localized problems in order to improve the quality of the service we offer to our customers.

They also bring a few bonus effects:

  • Having a finer view of the capacity of our filers
  • Easily obtaining certain metrics that used to be more complex to get, like the precise, real-time number of volumes being used on a filer
  • Predicting more quickly when a filer will reach saturation

Finally, with regard to the choice of exporters, we chose Prometheus exporters because Prometheus is also what we use internally at Gandi.

How do exporters work?

We wrote the exporters in Go (Golang). It is a statically compiled language, which simplifies deployment. Prometheus also has a native Go client library, and Go integrates easily with the other technologies needed to query the daemons.

In the case of the Ganesha NFS server, integration is relatively simple because Ganesha exposes its statistics over D-Bus.

For CTLD, the exporter uses system calls to get information.

Concretely, both exporters work on the same principle: each creates an HTTP server on a specific, standardized port. The Prometheus project has a wiki page where exporter creators register the port they use for their project, in order to coordinate across the community.

The exporter then listens on this “reserved” port. When it gets a request, it queries the underlying system (Ganesha or CTLD) for statistics, formats them, and sends them back to the requester.

The exporter is installed on all storage units and communicates on the selected standard port.

The data is collected by the Prometheus daemon, which regularly retrieves a list of network addresses and then queries all the filers in order to gather the data in a single place.

Finally, the data is displayed in our Grafana instance, which gives teams a visual dashboard.

We’re now working on refining the alert thresholds. Since their implementation three months ago, the exporters have already helped with analyzing incidents as they’ve arisen.

Xen exporter

In order to improve tracking of the use of “Host” servers for our PaaS and IaaS solutions, the hosting team also worked on a Xen exporter. We already had data on the VMs, among other reasons in order to run our billing service for Cloud resources, but we didn’t have enough data about the health of our Host servers.

For security reasons, Gandi doesn’t use Hyper-Threading on the processors in our IaaS fleet. We also needed to track performance more precisely in order to measure the impact of that choice.

The role of the Xen exporter is to gather all the data regarding the Host and the VMs.

A Xen exporter already exists, but it’s specific to XenServer (the commercial version of Xen distributed by Citrix), whereas we use libXL, Xen’s base-level interface.

We therefore developed the Xen exporter to interface with this base-level interface.

This exporter was also written in Go, since Xen provides Go interfaces for libXL. These interfaces, while currently rather limited, are sufficient for the needs of the exporter.

Finally, for those interested in using it: the Xen exporter doesn’t currently compile against a standard Xen, since it requires certain additions that we have submitted to the Xen project and that are still waiting to be published.

Gandi projects on GitHub

Open source is in our DNA

By making our projects public, we allow other companies and individuals that have the same problem to use the work we’ve done and contribute to improving it. Another objective is to build a community of contributors that lets a project continue to live independently of Gandi.

In general, with these exporters, we’ve tried to stay as close as possible to what’s done in the community. You can find a lot of exporters on the internet that don’t necessarily follow the conventions, which makes them harder to reuse and share.

That’s why, to make them easy to reuse, we tried to be as standard as possible and to maintain maximum flexibility (so that they aren’t tied to the specific way we use CTLD or Ganesha).

Do you have questions for the Hosting team? Leave a comment on this article or contact our team at https://help.gandi.net!

Find out more about our hosting services.
