Major Incident - Postmortem Report


Chronological Timeline (UK Time - GMT)

25-10-2020 12:00 - Node5 (server) is rebooted for planned and scheduled maintenance. This was routine maintenance that had already been completed successfully on other nodes in previous days.

25-10-2020 12:15 - The server does not power back online.

25-10-2020 12:45 - Fosshost engineers raise a NOC (network operations center) incident with FDC to investigate.

25-10-2020 13:10 - FDC engineers respond to advise that the server has entered GRUB rescue mode.

25-10-2020 14:00 - IPMI (virtual KVM over IP / console) connectivity to the node is established, after delays in connecting caused by the legacy Java web applet technology.

25-10-2020 14:30 - Commands are run on the server, including a reboot and an inspection of the file systems and partitions, to determine whether data integrity has been impacted.

25-10-2020 15:00 - It is determined that no hard drives have been impacted; however, the server is failing to boot into the Proxmox OS.

25-10-2020 15:30 - A third-line support case is raised with the Proxmox Support Team.

25-10-2020 16:00 - Fosshost engineers try to boot the server from a Linux live CD to repair the GRUB boot records.

25-10-2020 18:00 - This is unsuccessful because the IPMI connection times out.

25-10-2020 20:00 - Full logs of the investigation so far are written up and shared with FDC engineers to investigate.

25-10-2020 23:00 - Independent advice is sought from the GRUB community - this suggests that the issue could be related to software configuration.

Further delays outside of our control are experienced (IPMI, third-party delays, SLAs).

26-10-2020 02:00 - We attempt to recover the ZFS pool (zpool), but this fails because the Debian live image does not support ZFS.

26-10-2020 08:00 - A third-line support case is raised with the Proxmox Support Team.

26-10-2020 20:00 - Following further contact with FDC engineers, Fosshost engineers boot the server from an Ubuntu live image that supports ZFS file systems. The ZFS pool is intact, and offsite backups are taken successfully to preserve data.

26-10-2020 21:00 - Tests restoring the VMDK images on another live node prove successful.

26-10-2020 23:00 - Node5 is rebuilt from the ground up and Proxmox is reinstalled.

27-10-2020 01:00 - All VMs are recovered: we reconstruct the VMs from their images and confirm that they are operational.

27-10-2020 03:00 - The US mirrors have become outdated and are in the process of rebuilding; this will take time.

27-10-2020 06:00 - All VMs (virtual machines) are restored. Node5 is fully operational.


This issue was opened retrospectively.