Cause of today’s outage found


It appears that today’s outage may have been caused by a failed network connection in the internal private switch network that DRBD uses to transfer data. The internal virtual bridge is connected to a channel-bonded interface, which is in turn connected to two physical interfaces, so that if a physical network connection is lost or an adapter fails, the slave interface takes over. However, when leda and himalia were installed the switch infrastructure needed to support this was not in place, so the secondary interface was never plumbed in. At 9:32pm last night leda lost its only active connection, which stopped DRBD sync between leda and himalia. I am unsure why this then caused the I/O errors I saw on the screen before rebooting this morning at 10:30, which I unfortunately neglected to log.

To prevent this from happening again, I have dual connected both the internal and external bonded interfaces on leda, himalia, oberon and titania. All other machines are dual connected, with the exception of abbott and costello, which cannot be used to host VMs (they have been marked as “drained nodes” in ganeti) and will be redeployed as dev machines.
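
One way to confirm that a bond really has two live links is to read the status file the Linux bonding driver exposes under /proc/net/bonding/. The sketch below is only an illustration, not the check used on these machines: the bond name bond0 and the helper names parse_bond_slaves and is_redundant are assumptions, and the parser relies on the usual "Slave Interface:" / "MII Status:" lines in that file.

```python
import sys
from pathlib import Path


def parse_bond_slaves(text):
    """Return {slave_name: mii_status} parsed from bonding driver status text.

    The bond-level "MII Status" line that appears before the first
    "Slave Interface" line is deliberately ignored.
    """
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and lines without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            slaves[current] = value
    return slaves


def is_redundant(text):
    """True if at least two slave interfaces report MII status 'up'."""
    slaves = parse_bond_slaves(text)
    return sum(1 for status in slaves.values() if status == "up") >= 2


if __name__ == "__main__":
    # "bond0" is an assumed default; pass the real bond name as an argument.
    bond = sys.argv[1] if len(sys.argv) > 1 else "bond0"
    path = Path("/proc/net/bonding") / bond
    if not path.exists():
        sys.exit(f"{path} not found; is the bonding driver loaded?")
    status = path.read_text()
    for name, state in parse_bond_slaves(status).items():
        print(f"{name}: {state}")
    if not is_redundant(status):
        sys.exit(f"WARNING: {bond} has fewer than two active links")
```

Run on each node (for example from cron or a monitoring check), this would flag a bond with a single working link, which is the state leda was in before the outage.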