Exporters: detecting micro-incidents and improving storage performance

Open source, transparency and infrastructure are intrinsically linked at Gandi. One of our recent goals has been to detect, more quickly and more effectively, localized problems that affect the quality of service we provide to our users.

Nicolas Belouin, a systems engineer on Gandi's Simple Hosting team, has developed several tools to improve how we monitor the performance of our storage units.

For those interested, these open source tools are available on Gandi's GitHub.

Overview of how Gandi's storage infrastructure is organized

Gandi's storage infrastructure consists of two environments: one for IaaS and one for PaaS. Both rely on FreeBSD-based storage units (filers), which store each volume (disk) as a ZFS volume.

There are two different methods of exposing a volume to users:

For Gandi Cloud (IaaS), we use iSCSI. This lets us expose a «block» type volume directly to the user, so our customers can use their volume as they see fit. On FreeBSD, the service (daemon) that manages this is called «CTLD».
For Simple Hosting (PaaS), we chose to export a file system over NFS and make it available to the user. Here, we use «Ganesha» as the NFS daemon.

Why work with these exporters?

To maintain these two services (CTLD and Ganesha), Gandi already has tools for detecting major incidents (e.g. «the storage unit stopped responding»). There was, however, no simple solution for detecting minor or localized incidents (e.g. «slow performance on a storage unit»). We needed a system that would alert us to any abnormal drop in the volume of customer data in transit, and therefore to a probable incident. For now, the goal is to detect the moment when reads and writes stop on a storage unit while the daemon is still running, but in an unstable state.

Of course, we do not monitor each and every volume on a filer directly, since a customer may have a volume they do nothing with. Monitoring is done at the filer level: we look at whether the number of «actions» drops too quickly or falls to too low a level.
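
As an illustration only, the filer-level check described above could be sketched in Go as follows. The `thresholds` type, the `checkActivity` function and every numeric value are assumptions made for this example; they are not taken from Gandi's actual alerting rules.

```go
package main

import "fmt"

// thresholds holds hypothetical limits chosen for illustration only.
type thresholds struct {
	minOpsPerSec float64 // below this rate the filer is considered too quiet
	maxDropRatio float64 // largest tolerated relative drop between two samples
}

// checkActivity compares two consecutive filer-wide operation rates and
// returns a non-empty reason when the newer sample looks suspicious.
func checkActivity(prev, curr float64, th thresholds) string {
	if curr < th.minOpsPerSec {
		return "rate too low"
	}
	if prev > 0 && (prev-curr)/prev > th.maxDropRatio {
		return "rate dropped too quickly"
	}
	return ""
}

func main() {
	th := thresholds{minOpsPerSec: 10, maxDropRatio: 0.5}
	// Made-up operations-per-second samples for a single filer.
	samples := []float64{900, 850, 300, 280, 4}
	for i := 1; i < len(samples); i++ {
		if reason := checkActivity(samples[i-1], samples[i], th); reason != "" {
			fmt.Printf("sample %d: %s\n", i, reason)
		}
	}
}
```

Working on the aggregate rate of the whole filer, rather than per volume, is what keeps idle customer volumes from triggering false alerts.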

And that is exactly the goal of the Ganesha and CTLD exporters: to make it easier and faster to detect localized problems, in order to improve the quality of the service we offer our customers.

They also bring a few additional benefits:

Having a more precise view of the capacity of our filers
Easily obtaining certain metrics that used to be harder to get, such as the exact, real-time number of volumes in use on a filer
Predicting more quickly when a filer will reach saturation

Finally, as for the type of exporter, we went with Prometheus exporters because Prometheus is what we already use internally at Gandi.

How do the exporters work?

We wrote our exporters in Go. It is a statically compiled language, which simplifies deployment. The Go library comes from the Prometheus project itself, so it integrates easily with the other technologies needed to query the daemons.

In the case of the Ganesha NFS server, integration is relatively simple because it uses D-Bus to export statistics.

For CTLD, the exporter uses system calls to get the information.

Concretely, both exporters work on the same principle: they start an HTTP server on a specific, standardized port. The Prometheus project maintains a wiki page where exporter authors register the port their project uses, so that the community can coordinate. Each exporter therefore listens on its «reserved» port. When it receives a request, it queries the underlying system (Ganesha or CTLD) for statistics, formats them and sends them back to the requester.
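
The request cycle described above can be sketched in a few lines of Go. This is a minimal illustration that uses only the standard library rather than the Prometheus client library, and it is not the code of the Gandi exporters. The `statsSource` type, the metric names and the values are hypothetical stand-ins for the real D-Bus or system call queries. So that the sketch can run from start to finish, `main` binds to an arbitrary free local port, scrapes itself once and exits, whereas a real exporter would keep serving on its registered port.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"sort"
	"strings"
)

// statsSource stands in for the underlying daemon (Ganesha or CTLD).
// Hypothetical: a real exporter would query D-Bus or the kernel here.
type statsSource func() map[string]float64

// formatMetrics renders samples in the Prometheus text exposition format,
// one "name value" line per metric, sorted by name for stable output.
func formatMetrics(samples map[string]float64) string {
	names := make([]string, 0, len(samples))
	for name := range samples {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "%s %g\n", name, samples[name])
	}
	return b.String()
}

// metricsHandler queries the source on every request and writes the result.
func metricsHandler(src statsSource) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, formatMetrics(src()))
	})
}

func main() {
	// Made-up fixed values in place of real daemon statistics.
	src := func() map[string]float64 {
		return map[string]float64{
			"demo_read_ops_total":  1200,
			"demo_write_ops_total": 300,
		}
	}
	mux := http.NewServeMux()
	mux.Handle("/metrics", metricsHandler(src))

	// Port 0 asks the OS for any free port (assumption for the demo only).
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	defer srv.Close()

	// Play the role of the requester: one scrape, then print the answer.
	resp, err := http.Get("http://" + ln.Addr().String() + "/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```

Because the statistics are fetched inside the handler, each scrape reflects the state of the daemon at the moment of the request rather than a cached value.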

The exporter is installed on every storage unit and listens on the standard port chosen for it.

The data is collected by the Prometheus daemon, which regularly retrieves a list of network addresses. It then queries all the filers in order to gather the data in a single place.

Finally, this data is displayed in our Grafana, which gives teams a visual dashboard.

We are now working on refining the alert thresholds. Since the exporters were deployed three months ago, they have already helped us analyze incidents as they arose.

Xen exporter

In order to better track the usage of the «Host» servers behind our PaaS and IaaS solutions, the hosting team also worked on a Xen exporter. We already had data on the VMs, among other reasons to run our billing service for cloud resources, but we did not have enough data on the health of our host servers.

For security reasons, Gandi does not use Hyper-Threading on the processors in our IaaS fleet. We also needed to track performance more precisely in order to measure the impact of that choice.

The role of the Xen exporter is to gather all the data relating to the host and the VMs.

A Xen exporter already exists, but it is specifically for XenServer (the commercial version of Xen distributed by Citrix), whereas we use libXL, which is Xen's low-level interface.

We therefore developed our Xen exporter to talk to this low-level interface.

This exporter was also written in Go, since Xen provides Go bindings for libXL. These bindings, while currently rather limited, are sufficient for the needs of the exporter.

Finally, for those interested in using it, the Xen exporter does not currently compile against a standard Xen, because it requires some additions that we have proposed to the Xen project and that have yet to be released.

Gandi projects on GitHub

Xen exporter: https://github.com/Gandi/xenlight_exporter
CTLD exporter: https://github.com/Gandi/ctld_exporter
Ganesha exporter: https://github.com/Gandi/ganesha_exporter

Open source is in our DNA

Because we make our projects public, other companies and individuals who face the same problem can use the work we have done and help improve it. Another goal is to build a community of contributors that allows a project to live on independently of Gandi.

In general, with these exporters we have tried to stay as close as possible to community practice. Many exporters available on the internet do not necessarily follow the conventions, which makes them harder to reuse and share.

That is why, to make reuse easy, we tried to be as standard as possible and to keep the exporters as flexible as possible, so that they are not tied to the specific way we use CTLD or Ganesha.

Glossary

NFS: Network File System. See https://es.wikipedia.org/wiki/Network_File_System
ZFS: Zettabyte File System, a file system. See https://es.wikipedia.org/wiki/ZFS_(sistema_de_archivos)
iSCSI: Internet Small Computer System Interface. See https://es.wikipedia.org/wiki/ISCSI
CTLD: CAM Target Layer Daemon, FreeBSD software that manages iSCSI connections. See https://www.freebsd.org/cgi/man.cgi?query=ctld&sektion=8&manpath=freebsd-release-ports
Ganesha: a user-space NFS server. See https://nfs-ganesha.github.io/.
FreeBSD: a BSD operating system. See https://www.freebsd.org/

Have questions for the Hosting team? Leave a comment on this article or contact our team at https://help.gandi.net!

Learn more about our hosting services.
