Stateful Monitoring, netps

A year ago we started reaching the limits of what our Icinga 2 monitoring solution could do with the architecture I'd chosen. While Icinga does offer the ability to run it as a scalable ‘cluster’, I made the decision not to use these features, and instead limited ourselves to multiple disconnected, single instances of Icinga 2. I'd like to outline the reasoning that went into this decision, and how we made it scale as far as possible.

Disclaimer

※ I'm using the word ‘state’ here quite inappropriately. Icinga is stateful, and there's no getting around that. On a fundamental level, Icinga has to be stateful to know when a service has moved from a healthy state to a non-healthy state. My use of ‘state’ is rather more about the use of a configuration that defines a persistent topology where hosts are not expected to move or change.

State

With an Icinga cluster, you have the concepts of ‘masters’, ‘satellites’ and ‘agents’. Briefly, ‘masters’ form an active-backup pair that act as your source of truth. ‘Satellites’ act as execution agents that run checks, or delegate checks to other agents that report their results upwards. ‘Agents’ run checks on the local host.

Masters

The active master is where your check results end up, and it serves the important task of translating those results into notifications. You only have one active master, and which master is active is ultimately determined by a mutex on the database the masters connect to. If the active master dies, the backup master seizes the lock.

Satellites

Satellites report results from either their own checks or the checks of their children to the active master (or theoretically, masters, if you have multiple clusters). Losing a satellite loses visibility of everything the satellite is a parent of. You can have an agent (or another satellite) report to multiple satellites.

Agents

Agents run checks on the local host and report their results to their parent. That parent can be a satellite or a master. It's worth being clear that you do not need a satellite to have an agent. In our migration to containerised applications, this did make Icinga a potential solution.

Why we stopped using a distributed setup

Our configuration initially looked a lot like the official documentation, although we had two satellites per datacentre. We ran into a handful of issues that, while entirely resolvable, provided the grounds to change things dramatically. These issues were:

  1. We ended up with a lot of checks on the master anyway. Things like certificates and domain expiration, the presence of DNS records, and checks to third party services we depend on aren't easily applicable to a specific satellite.
  2. Detecting exactly what was happening in large events such as network partitions became increasingly difficult. If the satellites in DC A can talk to the active master in DC F, but nothing else, we'd still see OK results from DC A. We'd either have to wait for customers to report the issue, or see the issue ourselves in the process of investigating seemingly unrelated checks in DC B.
  3. It's a lot of hosts, and a non-trivial topology with a lot of moving pieces. This is where the title of the post comes in. We had a lot of ‘state’ in play, and that made reasoning about what was going on difficult, and misconfiguration deadly.

As little state as possible

While all of the problems above are (again) solvable, I wanted something simpler. This initially started out as two masters, each running clustered MariaDB. This worked, but introduced concerns such as replication health and replication delay, and didn't address problems such as wanting to verify the health of a specific service from multiple locations. So I went further, and split the two masters entirely. In technical terms, we went from a distributed cluster to running multiple individual, disconnected clusters.

Advantages

We had visibility of all of our services as seen from multiple locations, and the failure of any one location would be reported by the others. The configuration became entirely flat, with every host running the same configuration pulled from git. While we lost the ability to have a single IcingaWeb2 interface that reflected the overall state of our services, we gained the ability to view the state of every service as seen from DC X, DC Y, or DC Z alone.

Perhaps the biggest advantage is that we went from a configuration tied to specific hosts, to something we could either deploy on an arbitrary host, or even inside a pod on kubernetes. Setting up a new Icinga host became “install the packages, initialize the database, pull the config from git into /etc and done”.

More abstractly, we also had something that was very easy to expand, maintain, and explain to new staff. “This host runs these scripts on a loop and alerts if one fails” is decidedly simpler than “This host (if it's the active master) talks to these satellites in other datacentres who then perform checks provided by the master, and in turn provide checks from the master to child agents”.

Disadvantages

We lost the single instance of IcingaWeb2 that provided a unified interface to results from each datacentre. We didn't lose a unified view entirely, though, as we already used the lovely Nagstamon.

We lost scalability. While the single-active-master-notifying architecture has inevitable limits on scaling, it's far better in that respect than single-active-master-checking-and-notifying. Checking is arguably the heavy part of monitoring, and we had a lot of checks. There are only two solutions to this: run fewer checks, or run “cheaper” checks.

Once scaling became a concern, we reviewed what checks we actually had. Our process for handling a fault ends in the question “why didn't something alert about this?”, and is normally followed up with writing a new check. This builds up, edge cases end up getting their own checks, and when these checks apply to a class of host that you have a lot of, they really add up. So eventually, you have to take the time to review what you're actually checking, and collate checks that serve similar purposes. As an easy working example, there's no benefit to separately checking that a service is running and that it responds: the latter by necessity confirms the former.

Making checks “cheaper” can be harder. We found ourselves using a lot of SSH in our checks. While that's great in some ways (guaranteed encryption being one of them), establishing an SSH connection is not a cheap process, and touches on several systems such as PAM, NSS, potentially Kerberos, the shell initialisation process, and more.

One example I can provide is checking whether a process is running. While some processes listen and provide a networked service, not all processes do, and we still need to check that the process is running. Setting up an SSH connection every few minutes to run ps is overkill. So let's not use SSH.

netps

netps is a tiny service that does one thing - serve the current process list (and the start time of each process) in JSON, over HTTPS, to clients that provide a valid client certificate. It's written in Go, so it's easy to slap on any (Linux) host provided you've built it on at least one other. It comes in at less than 100 lines, with whitespace and comments, thanks to the wonderfully broad set of standard libraries that come with Go.

I should thank Mitchell Hashimoto for his work on go-ps, which provides the bulk of the code for this service. I have semi-forked it to expose more of the information contained in /proc/pid/stat - specifically the process start time, so I can detect when a process has restarted (although converting this value into a timestamp does require knowing how many ticks per second your kernel is running at - which may be why the original author didn't include it). This made writing netps (with little knowledge of Go) over an afternoon possible.
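The conversion looks roughly like this - statStartTime and startUnix are my own names for illustration. Field 22 of /proc/&lt;pid&gt;/stat is the start time in clock ticks since boot, the boot time itself is the btime line in /proc/stat, and the tick rate is sysconf(_SC_CLK_TCK), almost always 100 on Linux:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statStartTime extracts field 22 (starttime, in clock ticks since
// boot) from the contents of /proc/<pid>/stat. The executable name
// (field 2) can itself contain spaces and parentheses, so we parse
// from the last ')' rather than naively splitting the whole line.
func statStartTime(stat string) (int64, error) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(stat[i+1:])
	// fields[0] is field 3 of the file, so field 22 is fields[19].
	if len(fields) < 20 {
		return 0, fmt.Errorf("short stat line")
	}
	return strconv.ParseInt(fields[19], 10, 64)
}

// startUnix turns the tick count into a Unix timestamp, given the
// kernel boot time (btime from /proc/stat) and the tick rate.
func startUnix(ticks, btime, hz int64) int64 {
	return btime + ticks/hz
}

func main() {
	// A representative stat line; field 22 here is 1234500 ticks,
	// i.e. the process started 12345 seconds after boot at 100 Hz.
	line := "1 (init) S 0 1 1 0 -1 4194560 1 2 3 4 5 6 7 8 20 0 1 0 1234500 0 0"
	ticks, err := statStartTime(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(startUnix(ticks, 1614406188, 100)) // → 1614418533
}
```

A change in this value for a given pid is enough to tell you the process was restarted, even if the pid was reused.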

[0] nilgiri.piconet.co.uk:checks> curl --cacert ca.pem --cert cert.pem --key key.pem -s https://ruhuna.piconet.co.uk:1846/v1/json | jq .
{
  "1": {
    "executable": "init",
    "parent": 0,
    "starttime": 1614418533
  },
  "10": {
    "executable": "watchdog/2",
    "parent": 1,
    "starttime": 1614418533
  },