rs-collector


rs-collector is a Bosun-compatible collector for various services that are not covered by scollector and that we use at CenterDevice.

Attention: Although we have been running rs-collector successfully on our production systems for months, it is not yet stable software.


Collectors

  1. Galera - Collects metrics about the cluster status and cluster sync performance of a Percona Galera MySQL cluster.
  2. HasIpAddr - Checks if a host has bound specific IPv4 addresses.
  3. JVM - Collects garbage collection statistics.
  4. MongoDB - Collects replicaset metrics.
  5. Postfix - Collects queue lengths for all postfix queues.
  6. rs-collector - Collects internal metrics for rs-collector.

See below for details about the collectors.

Galera

The Galera collector collects metrics about the cluster status and cluster sync performance of a Percona Galera MySQL cluster. We use it to detect cluster split brain and general degradation. A full list of all available metrics is in galera.rs, function metadata.

Example Alarms

```
alert galera.cluster.state.uuid.no.consensus {
    template = ...
    critNotification = default

    $metric = avg:galera.wsrep.cluster.state.uuid{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $a = avg($q)
    $f = first($q)
    $qalert = ($a - $f) != 0
    crit = $qalert
}

alert galera.cluster.state.not.primary {
    template = ...
    critNotification = default

    $metric = sum:galera.wsrep.cluster.status{host=wildcard(*),domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $t = t(last($q), "domain")
    $qalert = sum($t)
    $primaryValue = 0
    crit = $qalert != $primaryValue
}

alert galera.local.state.not.synced {
    template = ...
    critNotification = default

    $metric = zimsum:5m-avg:galera.wsrep.local.state{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $qalert = last($q)
    $syncedValue = 12
    crit = $qalert != $syncedValue
}

alert galera.cluster.size.degraded {
    template = ...
    critNotification = default

    $metric = avg:galera.wsrep.cluster.size{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $qalert = last($q)
    $critValue = 3
    crit = $qalert != $critValue
}
```

HasIpAddr

The HasIpAddr collector sends 1 if a host has bound a specific IPv4 address and 0 if it has not. This is helpful in cases where hosts bind or release IPv4 addresses dynamically. For example, in a keepalived VRRP cluster it allows Bosun to check whether, and on how many hosts, a virtual, highly available IP address is bound.

In our production clusters we have observed situations when none of the cluster members has bound the virtual IP address. This collector allows us to define an alarm for such cases.
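The 1-or-0 behaviour described above can be sketched in a few lines of Rust. This is only an illustration and not rs-collector's actual implementation: the function name `has_ipv4_samples` and its parameters are hypothetical, and the set of bound addresses is assumed to have been gathered elsewhere.

```rust
use std::collections::HashSet;
use std::net::Ipv4Addr;

/// Hypothetical helper: for every watched address, emit 1 if the host
/// currently has it bound and 0 otherwise.
fn has_ipv4_samples(bound: &HashSet<Ipv4Addr>, watched: &[Ipv4Addr]) -> Vec<(Ipv4Addr, u8)> {
    watched
        .iter()
        .map(|ip| (*ip, if bound.contains(ip) { 1 } else { 0 }))
        .collect()
}
```

Summing these values across all hosts for one address yields the number of hosts holding it, which is the quantity the alarm in the next section compares against the expected value of 1.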

Example Alarm

```
alert os.net.vrrp-vip-failed {
    template = ...
    critNotification = default

    $metric = sum:os.net.has_ipv4s{ipv4=wildcard(*)}

    $q_alert = sum(t(last(q("$metric", "5m", "")), "ipv4"))

    $expected = 1
    $critValue = $expected
    crit = $q_alert != $critValue
}
```

JVM

The JVM collector collects garbage collection statistics, i.e., those that jstat -gc reveals, for each specified, running JVM. This collector has been tested with OpenJDK "7u51-2.4.6-1ubuntu4" and Oracle JDK "1.8.0_121". JVMs are identified by a regular expression that matches the class name or the command line arguments and ass

This collector only collects statistics for the specified JVMs; cf. example configuration. It currently does not distinguish between multiple instances of the same identified JVM.
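To illustrate what "those that jstat -gc reveals" means in practice: jstat -gc prints a header line of column names followed by a line of values. The following sketch pairs the two. It is an assumption about how such output could be parsed, not a copy of rs-collector's code; the function name `parse_jstat_gc` is hypothetical.

```rust
/// Hypothetical helper: pairs each column name of `jstat -gc` output with
/// its numeric value. Returns None if the header or value line is missing,
/// the column counts differ, or a value is not a number.
fn parse_jstat_gc(output: &str) -> Option<Vec<(String, f64)>> {
    // Skip blank lines; the first remaining line is the header,
    // the second holds the values.
    let mut lines = output.lines().filter(|l| !l.trim().is_empty());
    let names: Vec<&str> = lines.next()?.split_whitespace().collect();
    let values: Vec<&str> = lines.next()?.split_whitespace().collect();
    if names.len() != values.len() {
        return None;
    }
    names
        .iter()
        .zip(values.iter())
        .map(|(name, value)| value.parse::<f64>().ok().map(|v| (name.to_string(), v)))
        .collect()
}
```

Each resulting pair, such as the young generation collection count YGC or the total collection time GCT, could then be sent as one metric per identified JVM.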

Mongo

The Mongo collector collects MongoDB replicaset and cluster metrics. We use it to detect cluster split brain and general degradation. A full list of all available metrics is in galera.rs, function metadata.

Especially the following two metrics are helpful:

Example Alarms

```
alert mongo.replicaset.state.unexpected {
    template = ...
    critNotification = default

    $metric = sum:mongo.replicasets.members.mystate{host=wildcard(*),replicaset=wildcard(*)}
    $q = q("$metric", "5m", "")
    $t = t(last($q), "replicaset")
    $qalert = sum($t)
    $critValue = 5
    crit = $qalert != $critValue
}
```

Postfix

The Postfix collector collects metrics about the Postfix queues. This helps to monitor how the queues fill and empty over time, and whether they are emptied at all, so that an alarm can be raised when mail delivery stalls. A full list of all available metrics is in galera.rs, function metadata.

Example Alarms

```
alert postfix.mailqueue.deferred.too.long {
    template = ...
    critNotification = default
    warnNotification = default

    $metric = sum:5m-min:postfix.queues.deferred{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $t = t(last($q), "domain")
    $q_alert = sum($t)
}

alert postfix.mailqueue.deferred.unchanged {
    template = ...
    warnNotification = default

    $period = 4h
    $metric = postfix.queues.deferred{domain=wildcard(*)}
    $qmin = q("min:$metric", "$period", "")
    $qmax = q("max:$metric", "$period", "")

    $minqueuelen = min($qmin)
    $maxqueuelen = max($qmax)

    $qalert = $minqueuelen > 0 && $maxqueuelen == $minqueuelen
    warn = $qalert
}
```
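The condition of the postfix.mailqueue.deferred.unchanged alarm can be restated outside Bosun as a small Rust function. This is only a sketch to make the rule explicit; the function name `deferred_queue_stalled` is hypothetical, and the samples are assumed to be the deferred queue lengths observed during the period.

```rust
/// Hypothetical helper: the deferred queue counts as stalled when it was
/// never empty during the period and its length never changed, i.e. the
/// minimum is greater than zero and equals the maximum.
fn deferred_queue_stalled(samples: &[u64]) -> bool {
    match (samples.iter().min(), samples.iter().max()) {
        (Some(&min), Some(&max)) => min > 0 && min == max,
        // No samples: nothing can be said, so do not raise the alarm.
        _ => false,
    }
}
```

A queue that stays at 7 mails for the whole period triggers the warning, while a queue that stays empty or that shrinks from 7 to 3 does not.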

rs-collector Internal Metrics

These metrics can also be used to check the liveness of rs-collector and as a heartbeat.

Configuration

Please see this example.

Installation

Ubuntu

Please add my PackageCloud open source repository and install rs-collector via apt.

```bash
curl -s https://packagecloud.io/install/repositories/lukaspustina/opensource/script.deb.sh | sudo bash
sudo apt-get install rs-collector
```

From Source

Please install Rust via rustup and then run

```bash
cargo install rs-collector
```

Ansible

There is also an Ansible role available at Ansible Galaxy that automates the installation of rs-collector.

Known Issues

Roadmap

Please see Todos.