rs-collector is a Bosun-compatible collector for various services that are not covered by scollector and that we use at CenterDevice.
Attention: Although we have been running rs-collector successfully on our production systems for months, it is not yet stable software.
See below for details about the collectors.
The Galera collector collects metrics about the cluster status and cluster sync performance of a Percona Galera MySQL cluster. We use it to monitor for cluster split-brain and general degradation. There is a full list of all available metrics in galera.rs, function `metadata`.
```
alert galera.cluster.state.uuid.no.consensus {
    template = ...
    critNotification = default

    $metric = avg:galera.wsrep.cluster.state.uuid{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $a = avg($q)
    $f = first($q)
    $qalert = ($a - $f) != 0
    crit = $qalert
}

alert galera.cluster.state.not.primary {
    template = ...
    critNotification = default

    $metric = sum:galera.wsrep.cluster.status{host=wildcard(*),domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $t = t(last($q), "domain")
    $qalert = sum($t)
    $primaryValue = 0
    crit = $qalert != $primaryValue
}

alert galera.local.state.not.synced {
    template = ...
    critNotification = default

    $metric = zimsum:5m-avg:galera.wsrep.local.state{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $qalert = last($q)
    $syncedValue = 12
    crit = $qalert != $syncedValue
}

alert galera.cluster.size.degraded {
    template = ...
    critNotification = default

    $metric = avg:galera.wsrep.cluster.size{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $qalert = last($q)
    $critValue = 3
    crit = $qalert != $critValue
}
```
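The value 12 in the `galera.local.state.not.synced` alert can be read as a sum over all nodes. The following Python sketch illustrates that reading. It assumes a three-node cluster (in line with the cluster size alert) and that every healthy node reports `wsrep_local_state` 4, which is Galera's code for "Synced". The function names are hypothetical and not part of rs-collector.

```python
# Galera's numeric code for a node in state "Synced".
WSREP_LOCAL_STATE_SYNCED = 4


def expected_synced_sum(cluster_size):
    """Sum of wsrep_local_state over all nodes when every node is synced."""
    return cluster_size * WSREP_LOCAL_STATE_SYNCED


def not_synced(local_states, cluster_size=3):
    """Return True if the summed local states differ from the expected sum.

    `local_states` holds one wsrep_local_state value per reporting node.
    A missing node or a node in another state changes the sum.
    """
    return sum(local_states) != expected_synced_sum(cluster_size)
```

For a cluster of a different size, the expected sum changes accordingly, so the constant in the alert would need to be adapted.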
The HasIpAddr collector sends 1 if a host has bound a specific IPv4 address and 0 otherwise. This is helpful in cases where hosts bind or release IPv4 addresses dynamically. For example, in a keepalived VRRP cluster it allows Bosun to check whether, and on how many hosts, a virtual, highly available IP address is bound.
In our production clusters we have observed situations when none of the cluster members has bound the virtual IP address. This collector allows us to define an alarm for such cases.
```
alert os.net.vrrp-vip-failed {
    template = ...
    critNotification = default

    $metric = sum:os.net.has_ipv4s{ipv4=wildcard(*)}
    $q_alert = sum(t(last(q("$metric", "5m", "")), "ipv4"))

    $expected = 1
    $critValue = $expected
    crit = $q_alert != $critValue
}
```
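The 1-or-0 sample and the check performed by the alert can be sketched in Python as follows. This is only an illustration: the function names and the input shapes (a list of bound addresses per host, a mapping of host to sample) are assumptions and do not reflect how rs-collector is implemented.

```python
import ipaddress


def has_ipv4(bound_addresses, wanted):
    """Return 1 if `wanted` is among the addresses bound on a host, else 0."""
    target = ipaddress.IPv4Address(wanted)
    bound = {ipaddress.IPv4Address(addr) for addr in bound_addresses}
    return 1 if target in bound else 0


def vip_alert(samples_per_host, expected=1):
    """Return True if the number of hosts holding the address is unexpected.

    `samples_per_host` maps a host name to its 0/1 sample for one address.
    The sum is 0 if no host holds the address and greater than 1 if
    several hosts hold it at the same time.
    """
    return sum(samples_per_host.values()) != expected
```

With `expected = 1`, the check fires both when no cluster member has bound the virtual address and when more than one member has.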
The JVM collector collects garbage collection statistics, i.e., those that `jstat -gc` reveals, for each specified, running JVM. This collector has been tested with OpenJDK "7u51-2.4.6-1ubuntu4" and Oracle JDK "1.8.0_121". JVMs are identified by a regular expression that matches the class name or the command line arguments and ass
This collector only collects statistics for the specified JVMs; cf. the example configuration. It currently does not distinguish between multiple instances of the same identified JVM.
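Identification by regular expression, and the limitation regarding multiple instances, can be illustrated with a short Python sketch. The function name, the `(pid, command line)` input, and the name-to-pattern mapping are hypothetical; they are not taken from the rs-collector configuration format.

```python
import re


def identify_jvms(processes, patterns):
    """Group process ids by the configured name whose pattern matches.

    `processes` is a list of (pid, command_line) tuples.
    `patterns` maps a configured name to a regular expression.
    Processes that match no pattern are ignored.
    """
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    found = {}
    for pid, command_line in processes:
        for name, rx in compiled.items():
            if rx.search(command_line):
                found.setdefault(name, []).append(pid)
    return found
```

If two running JVMs match the same pattern, both end up under the same name. A collector that reports metrics per name can then no longer tell the instances apart, which is the limitation described above.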
The Mongo collector collects MongoDB replica set and cluster metrics. We use it to monitor for cluster split-brain and general degradation. There is a full list of all available metrics in galera.rs, function `metadata`.
The following two metrics are especially helpful:

* `mongo.replicasets.members.mystate` collects the "myState" variable from each replica set member. This makes it possible to compute whether the particular replica set is in a sane state.
* `mongo.replicasets.oplog_lag.[min,avg,max]` collects the min, avg, and max oplog replication lag between a replica set's primary and the corresponding secondaries. These values are measured only on the currently active primary.

```
alert mongo.replicaset.state.unexpected {
    template = ...
    critNotification = default

    $metric = sum:mongo.replicasets.members.mystate{host=wildcard(*),replicaset=wildcard(*)}
    $q = q("$metric", "5m", "")
    $t = t(last($q), "replicaset")
    $qalert = sum($t)
    $critValue = 5
    crit = $qalert != $critValue
}
```
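As a rough illustration of the oplog lag metrics, the following Python sketch derives min, avg, and max lag from the optimes of the primary and its secondaries. It assumes optimes are given as seconds since the epoch; the function name and the input shape are hypothetical, and the actual computation in rs-collector may differ.

```python
def oplog_lag_stats(primary_optime, secondary_optimes):
    """Return (min, avg, max) lag in seconds behind the primary.

    `primary_optime` is the primary's last applied operation time and
    `secondary_optimes` holds one such time per secondary.
    Returns None if there are no secondaries to compare against.
    """
    if not secondary_optimes:
        return None
    lags = [primary_optime - optime for optime in secondary_optimes]
    return min(lags), sum(lags) / len(lags), max(lags)
```

Because the lag is computed relative to the primary's optime, it is only meaningful on the member that currently is the primary, which matches the note that these values are measured only there.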
The Postfix collector collects metrics about Postfix's queues. This helps to monitor how the queues fill and empty over time, and whether they are emptied at all, so that an alarm can be raised when mail delivery stalls. There is a full list of all available metrics in galera.rs, function `metadata`.
```
alert postfix.mailqueue.deferred.too.long {
    template = ...
    critNotification = default
    warnNotification = default

    $metric = sum:5m-min:postfix.queues.deferred{domain=wildcard(*)}
    $q = q("$metric", "5m", "")
    $t = t(last($q), "domain")
    $q_alert = sum($t)
}

alert postfix.mailqueue.deferred.unchanged {
    template = ...
    warnNotification = default

    $period = 4h
    $metric = postfix.queues.deferred{domain=wildcard(*)}
    $qmin = q("min:$metric", "$period", "")
    $qmax = q("max:$metric", "$period", "")

    $minqueuelen = min($qmin)
    $maxqueuelen = max($qmax)

    $qalert = $minqueuelen > 0 && $maxqueuelen == $minqueuelen
    warn = $qalert
}
```
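The condition used by `postfix.mailqueue.deferred.unchanged` can be restated in a few lines of Python. This is a sketch of the logic only; the function name and the plain list of queue length samples are assumptions made for illustration.

```python
def deferred_queue_stalled(queue_lengths):
    """Return True if the deferred queue was non-empty and never changed.

    `queue_lengths` holds the deferred queue length samples observed
    during the period under consideration, e.g. the last four hours.
    """
    if not queue_lengths:
        return False
    shortest = min(queue_lengths)
    longest = max(queue_lengths)
    # Non-empty all the time and identical minimum and maximum means
    # that no mail left or entered the queue during the period.
    return shortest > 0 and longest == shortest
```

An empty queue does not trigger the condition, and neither does a queue whose length fluctuates, since both indicate that mail is still being delivered.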
* `rs-collector.stats.rss` collects the resident set size (physical memory) in KB consumed by rs-collector; only supported on Linux.
* `rs-collector.stats.samples` collects the number of transmitted samples.
* `rs-collector.versio` collects the version 'x.y.z' of rs-collector as x * 1.000.0000 + y * 1000 + z.

These metrics can also be used to check the liveness of rs-collector and as a heartbeat.
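The numeric version encoding can be sketched in Python as follows. The factor for the minor version (1000) is taken from the description; the factor for the major version is written ambiguously there, so the sketch assumes 1,000,000 and exposes it as a parameter. The function names are hypothetical.

```python
def encode_version(version, major_factor=1_000_000, minor_factor=1_000):
    """Encode a version string 'x.y.z' as a single integer.

    `major_factor` is an assumption; adjust it if rs-collector uses
    a different multiplier for the major version.
    """
    x, y, z = (int(part) for part in version.split("."))
    return x * major_factor + y * minor_factor + z


def decode_version(value, major_factor=1_000_000, minor_factor=1_000):
    """Decode an integer produced by encode_version back to 'x.y.z'."""
    x, rest = divmod(value, major_factor)
    y, z = divmod(rest, minor_factor)
    return f"{x}.{y}.{z}"
```

Encoding the version as a number lets it be stored as an ordinary metric value, so that dashboards and alerts can compare versions across hosts.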
Please see this example.
Please add my [PackageCloud] open source repository and install rs-collector via apt.
```bash
curl -s https://packagecloud.io/install/repositories/lukaspustina/opensource/script.deb.sh | sudo bash
sudo apt-get install rs-collector
```
Please install Rust via rustup and then run
```bash
cargo install rs-collector
```
There is also an Ansible role available at Ansible Galaxy that automates the installation of rs-collector.
General: Minor memory leak in chan::tick -- cf. Roadmap.
JVM: Does not distinguish between JVMs with the same name assigned via configuration, i.e., multiple instances of the same Java application.
Please see Todos.