...
- Open one or more of the following RabbitMQ management consoles. (Credentials are in the "GÉANT Dashboard v3" LastPass folder)
- https://prod-noc-alarms01.geant.org:15672/
- https://prod-noc-alarms02.geant.org:15672/
- https://prod-noc-alarms03.geant.org:15672/
- Scroll down to the "Nodes" section
- There should be 3 rows in the table and all status icons should be green. (Currently a red bar shows a deprecated node; this will be removed when possible.) The expected node names are:
- rabbit@prod-noc-alarms01
- rabbit@prod-noc-alarms02
- rabbit@prod-noc-alarms03
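The manual check above can also be approximated from a shell. The sketch below is only an illustration: it assumes `curl` and `jq` are installed, that the management API endpoint `/api/nodes` is reachable on the same port as the console, and that the credentials are exported in the hypothetical variables `RABBIT_USER` and `RABBIT_PASS`. The expected node names are taken from the list above and should be adjusted if they differ.

```shell
#!/bin/sh
# Expected node names, as read from the list above (adjust if they differ).
EXPECTED_NODES="rabbit@prod-noc-alarms01 rabbit@prod-noc-alarms02 rabbit@prod-noc-alarms03"

# Read node names from stdin (one per line) and print every expected
# node that is absent from that input.
missing_nodes() {
    seen=$(cat)
    for node in $EXPECTED_NODES; do
        printf '%s\n' "$seen" | grep -qxF "$node" || printf '%s\n' "$node"
    done
}

# Print the names of the nodes that one management console reports as running.
# RABBIT_USER and RABBIT_PASS are hypothetical variable names.
running_nodes() {
    curl -sf -u "${RABBIT_USER}:${RABBIT_PASS}" "https://$1:15672/api/nodes" |
        jq -r '.[] | select(.running) | .name'
}
```

For example, `running_nodes prod-noc-alarms01.geant.org | missing_nodes` prints nothing when all three nodes are running. If the console cannot be reached at all, every expected node is listed, so an all-missing result should be confirmed in the browser.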
Solution
- If one of the 3 nodes is failing or missing from the list, log into the failing server via ssh and restart the RabbitMQ service:
sudo systemctl restart dashboard-docker.service
- After a minute or two the management consoles should show the cluster is restored.
...
Solution #2
- If all 3 nodes appear in the list, but the state of the nodes differs depending on which node's administration GUI you are logged into:
- follow these instructions to restart/rebootstrap the cluster
Collectors have stopped working
Analysis
- Open the Correlation status dashboard: https://net-alarms-monitoring.geant.org/d/hESYQotZz/correlation-services?orgId=1
- Scroll down to the "Collectors" panel
- Check that the graph shows a non-zero rate of traps being processed
...
- On each of the following servers:
- net-alarms01.geant.org
- net-alarms02.geant.org
- net-alarms03.geant.org
- Log in via ssh and execute the following command:
sudo systemctl restart trap_collector
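The per-server restart above can be run as a loop from a workstation. This is a minimal sketch, assuming ssh access to all three hosts and sudo that does not prompt for a password. The function name `restart_on_all` and the `SSH_CMD` override are illustrative and not part of the existing tooling.

```shell
#!/bin/sh
# Servers taken from the list above.
HOSTS="net-alarms01.geant.org net-alarms02.geant.org net-alarms03.geant.org"
# Command used to reach each host; override with SSH_CMD=echo for a dry run.
SSH_CMD="${SSH_CMD:-ssh}"

# Restart the given systemd service on every host in HOSTS.
# Returns non-zero if any restart failed, but still tries the remaining hosts.
restart_on_all() {
    service="$1"
    rc=0
    for host in $HOSTS; do
        # $SSH_CMD is left unquoted on purpose so that "ssh -t" also works.
        if ! $SSH_CMD "$host" "sudo systemctl restart $service"; then
            echo "restart of $service failed on $host" >&2
            rc=1
        fi
    done
    return $rc
}
```

Calling `restart_on_all trap_collector` performs the restart. With `SSH_CMD=echo` the commands are printed instead of executed. If sudo asks for a password, `SSH_CMD="ssh -t"` may be needed.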
Possible Cause: Correlators have stopped working
Analysis
- Open the Correlation status dashboard: https://net-alarms-monitoring.geant.org/d/hESYQotZz/correlation-services?orgId=1
- Scroll down to the "Collectors" panel
- Check that the graph shows the leader collector process processing a non-zero rate of traps. Note that it is normal for only one of the collectors to be processing traps; the other line should remain at zero. The current leader can be identified by the FORWARDER with state 2 in the "Raft States" panel.
Solution
- On each of the following servers:
- net-alarms01.geant.org
- net-alarms02.geant.org
- net-alarms03.geant.org
- Log in via ssh and execute the following command:
sudo systemctl restart trap_correlator
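After restarting, it can help to confirm that the service came back on every server before returning to the dashboard. The sketch below uses `systemctl is-active` for that, assuming ssh access to the three hosts. `report_states` and `SSH_CMD` are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Servers taken from the list above.
HOSTS="net-alarms01.geant.org net-alarms02.geant.org net-alarms03.geant.org"
# Command used to reach each host; can be replaced by a stub for testing.
SSH_CMD="${SSH_CMD:-ssh}"

# Print one line per host: "<host> <service> <state>".
# The state is whatever "systemctl is-active" prints (for example "active"
# or "failed"), or "unreachable" if the command produced no output.
report_states() {
    service="$1"
    for host in $HOSTS; do
        state=$($SSH_CMD "$host" "systemctl is-active $service" 2>/dev/null) || true
        printf '%s %s %s\n' "$host" "$service" "${state:-unreachable}"
    done
}
```

For example, `report_states trap_correlator` lists the state of the correlator on each server. Any line that does not end in `active` points at the server to inspect first.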
...