Incident description
A missing upgrade to the Sensu DB backend caused our monitoring backend and agents to misbehave, which led us to investigate the issue on several VMs comparable to our Sensu server.
Solving this required an upgrade of the puppet module. The complication is that this module is shared among several applications, so the upgrade procedure involves Sensu, Poller, and Brian.
Brian uses its own implementation of Sensu, so I had to verify whether it showed the same issue and whether the new puppet module would impact Sensu. The effort needed to solve the issue was huge (the new puppet module had several differences, and the perimeter firewall was denying access to the Sensu API). While the fix was being attempted on several servers, InfluxDB received a major upgrade (from version 1.8 to version 2.0).
The second cause was the lack of package pinning for the application.
Incident severity: CRITICAL
Data loss: YES
Timeline
Time (CET) | Event |
---|---|
27 Jan, 00:11 | Time reported in /var/log/yum.log when the update was triggered |
27 Jan, 03:00 | At about 3 am the problem was solved |
27 Jan, 09:17 | Time reported in /var/log/yum.log when the Influx version was rolled back |
Total downtime: 09:06 hours.
Proposed Solution
Package pinning is always a good practice for our core applications.
The InfluxDB package is now pinned everywhere, and further upgrades must be applied deliberately by changing the version number in puppet.
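
As an illustration of pinning a package version through puppet, a minimal sketch could look like the following. The class name `profile::influxdb_pin`, the parameter name, and the version string `1.8.10` are assumptions made for this example and are not taken from our actual module; the package name `influxdb` is also assumed to match what the repository ships.

```puppet
# Hypothetical profile class: the name and the default version are placeholders.
class profile::influxdb_pin (
  # Assumed 1.8.x release; set this to the version actually deployed.
  String $influxdb_version = '1.8.10',
) {
  # With an explicit version in `ensure`, puppet enforces that exact version
  # instead of merely requiring that some version is installed.
  package { 'influxdb':
    ensure => $influxdb_version,
  }
}
```

With a declaration of this kind, an upgrade only takes effect once the version string is changed in puppet. Note that a manual `yum update` could still move the package until the next puppet run reverts or flags it, so a yum-level lock may be worth considering as an additional safeguard.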
Further investigation of other packages will be conducted through the action described in ticket SWD-31.