Chronological Timeline (UK Time - GMT)
25-10-2020 12:00 - Node5 (server) is rebooted for planned and scheduled maintenance. This was routine maintenance that had already been completed successfully on other nodes in previous days.
25-10-2020 12:15 - The server does not power back online.
25-10-2020 12:45 - Fosshost engineers raise a NOC (network operations center) incident with FDCServers.net (FDC) to investigate.
25-10-2020 13:10 - FDC engineers respond to advise that the server has entered grub rescue mode.
25-10-2020 14:00 - IPMI (virtual KVM over IP / console) connectivity to the node is established, following delays in connecting caused by the legacy Java web applet technology.
25-10-2020 14:30 - Commands are run on the server, including a reboot and reads of the file systems and partitions, to determine whether data integrity has been affected.
25-10-2020 15:00 - It is determined that no hard drives have been affected; however, the server is failing to boot into Proxmox OS.
25-10-2020 15:30 - Third line support case is raised with Proxmox Support Team.
25-10-2020 16:00 - Fosshost engineers try to boot the server using a Linux live CD to repair the GRUB boot records.
25-10-2020 18:00 - This is unsuccessful due to the IPMI technology timing out.
25-10-2020 20:00 - The investigation logs to date are written up in full and shared with FDC engineers to investigate.
25-10-2020 23:00 - Independent advice is sought from the GRUB community, which suggests that the issue could be related to software configuration.
Further delays outside of our control are experienced (IPMI, third-party delays, SLAs).
26-10-2020 02:00 - We attempt to recover the ZFS pool (zpool), but the attempt fails because the Debian live image does not support ZFS.
26-10-2020 08:00 - Third line support case is raised with Proxmox Support Team.
26-10-2020 20:00 - Following further contact with FDC engineers, Fosshost engineers boot with an Ubuntu live image that supports ZFS file systems. The ZFS pool is intact, and offsite backups are taken to preserve data. This is successful.
26-10-2020 21:00 - Tests restoring VMDK images on another live node prove successful.
26-10-2020 23:00 - Node5 is rebuilt from the ground up and Proxmox is reinstalled.
27-10-2020 01:00 - All VMs are recovered: we reconstruct the VMs from their images and confirm that they are operational.
27-10-2020 03:00 - US mirrors are in the process of rebuilding because they have become outdated; this will take time.
27-10-2020 06:00 - All VMs (virtual machines) are restored. Node5 is fully operational.
This issue was opened retrospectively.