OpsDB Recovery Troubleshooting Guide

General Information

Usual configuration - opsdb.dante.net points to the primary instance.
Primary Instance - http://prod-opsdb01.geant.net/
Secondary Instance - http://prod-opsdb02.geant.net/

OpsDB runs on two VMs in each environment (Prod, UAT and Test):

xxxx-opsdb01.geant.net and xxxx-opsdb02.geant.net, where xxxx = prod, uat, or test

OpsDB is written in PHP 5.3.3, HTML, and JavaScript, and runs on Linux (CentOS).

CentOS - CentOS 6 receives updates until November 30, 2020.

PHP 5.3.3 is no longer officially supported upstream, but it is still covered by the CentOS backporting of PHP security releases. Its end of life is therefore the same as that of CentOS 6.

HTML / JavaScript are currently supported and have no planned support end dates. In fact, older versions are more widely supported than the latest ones.

First Steps

If for any reason the system becomes unavailable:

Check whether the primary instance is available by going to: http://prod-opsdb01.geant.net/
- If yes, this suggests a DNS issue with opsdb.dante.net - the OC should be able to deal with it.
If the primary instance of OpsDB is not available, check whether the secondary instance is available by going to: http://prod-opsdb02.geant.net/
- If yes, switch the DNS entry for OpsDB from the ‘Primary(01)’ instance to the ‘Secondary(02)’ instance. This allows general users to continue working on OpsDB whilst we investigate why the primary instance went down.
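
The checks above can also be run from a command line. The following is a minimal sketch rather than an agreed procedure: it assumes curl is installed on the machine you are working from, and the function names (instance_up, probe, next_step) and the 10 second timeout are illustrative choices, not part of OpsDB.

```shell
#!/bin/sh
# Sketch only: probe an OpsDB instance and suggest the next step.

# Succeeds if the URL returns a non-error HTTP response within 10 seconds.
# --fail makes curl exit non-zero on HTTP 4xx/5xx responses.
instance_up() {
    curl --silent --fail --output /dev/null --max-time 10 "$1"
}

# Probe one URL and print "up" or "down".
probe() {
    if instance_up "$1"; then echo up; else echo down; fi
}

# Map the two probe results ("up" or "down") to the action in this guide.
next_step() {
    primary=$1
    secondary=$2
    if [ "$primary" = "up" ]; then
        echo "Primary is reachable: suspect DNS for opsdb.dante.net, contact the OC"
    elif [ "$secondary" = "up" ]; then
        echo "Primary is down, secondary is up: switch the DNS entry to prod-opsdb02"
    else
        echo "Both instances are down: continue with Further Investigation"
    fi
}
```

Example: next_step "$(probe http://prod-opsdb01.geant.net/)" "$(probe http://prod-opsdb02.geant.net/)"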

Change the Domain Name System (DNS) entry for OpsDB

Change the CNAME opsdb.dante.net in Infoblox to point to prod-opsdb02.geant.net.

Once this has been done the system should then be available to the users once again whilst more detailed investigation takes place into why the Primary instance has become unavailable.
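
To confirm that the change has taken effect, the CNAME can be queried directly. This is a sketch that assumes the dig utility is available on your machine; the function names are illustrative. Note that a resolver may keep returning the old record until its cached copy expires.

```shell
#!/bin/sh
# Sketch only: confirm where the OpsDB alias points after the Infoblox change.

# Remove the trailing dot that dig prints on fully qualified names.
strip_trailing_dot() {
    sed 's/\.$//'
}

# Print the CNAME target currently returned for a name.
cname_target() {
    dig +short "$1" CNAME | strip_trailing_dot
}

# Compare the actual target with the expected one and report the result.
report_cname() {
    actual=$1
    expected=$2
    if [ "$actual" = "$expected" ]; then
        echo "OK: alias points to $expected"
    else
        echo "NOT YET: alias points to ${actual:-nothing}, expected $expected"
        return 1
    fi
}
```

Example: report_cname "$(cname_target opsdb.dante.net)" prod-opsdb02.geant.net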

Please do not forget to inform the users that OpsDB is back up once this has been done.

Further Investigation

Check the VM is running

If out of hours and the server can't be pinged, log into vCenter (please use your win/adm-xxxx account) and check whether the VMs are running.

E.g. log into Frankfurt and select the top level (fra-prd-vc01.win.dante.org.uk). Select the VMs tab and use the search bar at the top to search for the VM.

If the status of the VM is stopped, restart it using the green button.

If there are networking issues, the OC will be able to troubleshoot them.

If the machine is running, follow the steps below:

Check Apache

Has Apache failed? Is it running?

Log into the appropriate VM.

As the root user, issue the following command at the command line:

systemctl status httpd

(or)

service httpd status

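(If you are not the root user, prefix either command with sudo. On CentOS 6 only the service form is available, because systemctl requires systemd.)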

You should see output similar to this:

[mark.golder@test-opsdb01 ~]$ sudo service httpd status

httpd (pid 18768) is running...

[mark.golder@test-opsdb01 ~]$

Start / Restart Apache

If you need to start or restart the httpd (Apache) server, issue the following command at the command line:

systemctl restart httpd

(or)

service httpd restart

(If you are not the root user, prefix either command with sudo)

This should start or restart the HTTP server (Apache) on the VM. Please perform this on both VMs separately.
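
The status check and restart can be combined into one helper. This is a sketch under the assumption that it runs as root on the VM; the function names svc and ensure_running are illustrative. It uses systemctl where that command exists and falls back to service otherwise.

```shell
#!/bin/sh
# Sketch only: check a service and restart it if it is not running.

# Run an action (status, restart) on a service with whichever tool exists.
svc() {
    action=$1
    name=$2
    if command -v systemctl >/dev/null 2>&1; then
        systemctl "$action" "$name"
    else
        service "$name" "$action"
    fi
}

# Restart the service only if its status check fails.
ensure_running() {
    if svc status "$1" >/dev/null 2>&1; then
        echo "$1 is running"
    else
        echo "$1 is not running, restarting"
        svc restart "$1"
    fi
}
```

Example: ensure_running httpd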

Check MySQL

Is the MySQL instance running?

Log into the appropriate VM.

As the root user, issue the following command at the command line:

systemctl status mysqld

(or)

service mysqld status

(If you are not the root user, prefix either command with sudo)

Start / Restart MySQL

If you need to start or restart MySQL, issue the following command at the command line:

systemctl restart mysqld

(or)

service mysqld restart

(If you are not the root user, prefix either command with sudo)

This should start or restart MySQL on the VM. Please perform this on both VMs separately.
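
After a restart it can be useful to confirm that the server actually answers, not only that the process exists. The sketch below assumes the mysqladmin client is installed on the VM; the function names are illustrative. mysqladmin ping exits with status 0 when the server is running, even if the login itself is refused, so no credentials are needed for this check.

```shell
#!/bin/sh
# Sketch only: confirm that the local MySQL server responds.

# Succeeds if a MySQL server answers on this host.
mysql_alive() {
    mysqladmin ping >/dev/null 2>&1
}

# Print a one-line summary of the result.
report_mysql() {
    if mysql_alive; then
        echo "MySQL is responding"
    else
        echo "MySQL is not responding"
    fi
}
```

Example: report_mysql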

Check Disk Usage

Follow the steps here: clean up big files

Disk usage should already be monitored and reported on as a disk fills up, so this scenario should never occur.
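
For a quick manual check, the output of df can be filtered for file systems that are nearly full. This is a sketch only: the function name full_filesystems and the 90% default threshold are illustrative assumptions, not values taken from the monitoring setup. It expects the POSIX output format produced by df -P.

```shell
#!/bin/sh
# Sketch only: list file systems at or above a usage threshold.

# Read `df -P` output on stdin and print "mountpoint usage" for every
# file system whose usage is at or above the threshold (percent).
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" 'NR > 1 {
        use = $5                 # capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}
```

Example: df -P | full_filesystems 90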

Final Step

Please raise a ticket with Software Development Support and include details of the steps taken out of business hours, so that a detailed analysis of the failure can be carried out.
