Feedback about last night's outage
Written by Jerome Granados on
From 2:00 AM to 9:30 AM CEST, GoodBarber and WMaker services were severely disrupted.
Our server hosting team of 5 people (Greg, Pierre-Laurent, Sébastien, Jérôme and Dumè) was in Paris all week to set up new hardware in a second data center, Global Switch, located in the suburbs of Paris. This work is part of a project to extend our infrastructure, initiated several months ago by the technical team, which we plan to communicate about once the full deployment is complete. This operation is not related to the problem we encountered last night.
Paradoxically, however, the presence of our engineers in Paris slowed down our responsiveness, as they were on their way back to Ajaccio during the incident. In addition, to carry out the intervention at Global Switch, we had suspended part of our alert system. This delayed the identification of the problem by several hours. Our customers in the Pacific reported the problem to us via private messages on Facebook and Twitter.
Alongside the operation at Global Switch, we carried out a routine visit to DC1, our data center located in the 19th district of Paris. When inspecting a machine, we discovered that APC-21, one of the power distribution units (PDUs), was experiencing a malfunction in its remote monitoring and control management system.
We ordered new equipment from our supplier and installed it to replace APC-21. We reconnected all the machines that were powered by APC-21 to this new equipment, APC-24, except for switch-nas11.
PDUs are designed to keep powering machines even when their management system is out of order. That's why we didn't disconnect switch-nas11 from APC-21. Doing so would have caused significant downtime, and carrying out such an operation as an emergency measure, without planning it and notifying our users, was not an option.
During the night, for some unknown reason, APC-21 stopped powering switch-nas11. When the OVH operator came to move the power supply of switch-nas11 from APC-21 to APC-24, the switch did not boot. This is a Cisco switch, equipment renowned for its reliability. We do not yet have an explanation for why it failed to start.
We told the operator to use a backup switch that was held in reserve in the bay. Installing this switch extended the maintenance because all the machines first had to be rewired. When the backup switch was turned on, we discovered a problem with two network cards of the main server (master sql). We then had to rewrite all the routing rules. It is very likely that the problem on APC-21 caused the cascading failures of switch-nas11 and the two network cards.
Since 9:30 AM, all services have been up. If we hadn't replaced APC-21 yesterday morning, its failure last night could have had even more serious consequences. A large part of the bay would have lost power abruptly. This could have been very bad (temporary data loss, broken machines, ...) and could have caused an even longer downtime (machine replacement, reconfiguration, recovery from data backups, ...).
In the coming weeks, we will plan an additional operation to replenish the backup hardware inventory in the bay. We will also proactively replace the hardware that is the same age as the equipment that failed last night.