The OVHCloud datacentre campus fire in Strasbourg, France, sent shockwaves through the hyperscale cloud community when it happened in early March 2021, but the industry-wide after-effects of the event could be transformational, both in addressing shortcomings in enterprise attitudes towards cloud backups and disaster recovery, and in changing the way that datacentre operators worldwide approach fire suppression.
The fire occurred in the early hours of Wednesday 10 March 2021, with the firm’s five-storey SBG2 datacentre destroyed outright during the blaze, while another facility – dubbed SBG1 – incurred some damage. Two other datacentres at the site – known as SBG3 and SBG4 – were switched off as a post-fire precaution and were reportedly undamaged by the incident.
Even so, OVHCloud customers across Europe were affected by service interruptions and downtime, and in the weeks that have followed the firm has been racing to bring their applications and workloads back online.
These efforts have included embarking on a widescale clean-up of the datacentre campus, but – simultaneously – the firm has been drawing on the fact it builds all its own servers in-house to rapidly replace the server capacity destroyed during the fire.
The company operates 15 datacentres in Europe, and also moved to make any spare capacity within these sites available to affected customers. At the time of writing, OVHCloud’s service status page for the Strasbourg facility stated that the firm is still in the throes of rolling out replacement server capacity at alternative datacentre locations for customers who had workloads housed in SBG2 and the partially destroyed parts of SBG1.
Both facilities housed a mix of public cloud, bare metal and virtual private server (VPS) services, with the company confirming that 80% of the public cloud virtual machines these datacentres hosted are back online, as of Tuesday 6 April 2021. Meanwhile, 25% of its bare metal services have been restored, and 34% of its bare metal-based VPS services are also back online.
In SBG1 specifically, 35% of the bare metal cloud servers were back online as of Tuesday 6 April 2021, the company’s service status site confirmed, with OVHCloud stating its hope to have 95% of services back in action by the end of this week.
Availability for customers
The update further confirmed that SBG4 and SBG3 are operating at 99% availability for customers.
In a video update, posted on 22 March 2021, OVHCloud founder and chairman Octave Klaba shared details of how efforts to restore services for affected customers were progressing, but also confirmed the root cause of the fire is still the subject of an ongoing investigation that is set to run for a while yet.
“The investigation is ongoing,” he said, and involves law enforcement, insurance personnel and other assorted financial experts. “It will take a few months to have the conclusion of this investigation, and once we have it all, we’ll share it with you.”
Initial reports in the wake of the event, however, have suggested the onset of the blaze may have been linked to work carried out on an Uninterruptible Power Supply (UPS) at the site on the day leading up to the fire.
“Early indicators point to the failure of a UPS, causing a fire that spread quickly,” said Andy Lawrence, executive director of research at the datacentre resiliency think tank, the Uptime Institute, in a March 2021 blog post. “At least one of the UPSs had been extensively worked on earlier in the day, suggesting maintenance issues may have been a main contributor.”
Although there is no way of knowing for sure at this point, it is possible the UPS in question may have been deployed next to a battery cabinet that may have overheated and caused a fire, offered Lawrence.
“Although it is not best practice, battery cabinets (when using valve-regulated lead-acid or VRLA batteries) are often installed next to the UPS units themselves,” he wrote. “This may not have been the case at SBG2, [but] this type of configuration can create a situation where a UPS fire heats up batteries until they start to burn and can cause fire to spread rapidly.”
Raising standards for fire detection
While the investigation into the cause of the fire continues, Klaba said during the video update that the company is committed to using the incident to develop new industry standards, setting out how best to tackle fires within datacentres.
Presently, best practice techniques and standards for fire detection, suppression and extinguishment within datacentres vary according to not only the location of the datacentre itself, but also the type of equipment deployed in each room, he said.
“[There are] different kinds of fire [extinguishment techniques] for an electrical fire and a different kind for a fire coming from the servers. Whatever the standard is… we [have] decided to over secure all our datacentres,” said Klaba.
In addition to this, he continued, OVHCloud has set itself a goal of creating a fire testing laboratory, within which the firm will test how fires progress within different datacentre settings, and has committed to sharing the findings from that work with the wider industry.
“We decided to create a lab where I want to test. I want to see how the fire is going in the different kinds of the rooms, and to find the best way to extinguish the fire in all kinds of these situations. I want to also to share the conclusion that we will have in this lab with all industry,” he said.
“Because we don’t want to have this kind of the incident in our datacentre, but also nobody wants to have this kind of an incident in [their] datacentre at all, and the industry has to evolve, and to evolve their standards.”
Fires are a mercifully rare occurrence in the datacentre industry, but that does not stop them being a constant concern for operators, stated the Uptime Institute’s Lawrence in an April 2021 blog post about the frequency of such incidents.
“Uptime Institute’s database of abnormal incidents, which documents over 8,000 incidents shared by members since its inception in 1994, records 11 fires in datacentres – less than 0.5 per year,” wrote Lawrence. “All of these were successfully contained, causing minimal damage and disruption.”
Lawrence goes on to observe in the post that the systems put in place to suppress fires tend to do more damage in datacentres than actual fires.
“In recent years, accidental discharge of fire suppression systems, especially high pressure clean agent gas systems, has actually caused significantly more serious disruption than fires, with some banking and financial trading datacentres affected by this issue,” wrote Lawrence.
He also offers operators some fire prevention advice, in terms of the steps they should take to ensure the relatively low incidence of fires reported in the sector continues.
“Responsibility for fire regulation is covered by the local authority having jurisdiction, and requirements are usually strict, but rules may be stricter for newer facilities, so good operational management is critical for older datacentres,” he said.
“Uptime Institute advises that all datacentres use very early smoke detection apparatus systems and maintain appropriate fire barriers and separation of systems. Well-maintained water sprinkler or low-pressure clean agent fire suppression systems are preferred. Risk assessments primarily aimed at reducing the likelihood of outages will also pick up obvious issues with these systems.”
Moving data to the cloud is not the same as backing it up
While the OVHCloud datacentre fire can serve as a cautionary tale for other operators about how to avoid their facilities suffering a similar fate, what about the firm’s customers who have experienced a prolonged period of service disruption as a result of the incident? What lessons can they learn from all this?
According to Christophe Bertrand, senior analyst at TechTarget-owned Enterprise Strategy Group, the number one lesson that enterprises need to learn from this incident – regardless of whether they are an OVHCloud customer or not – is the importance of backing up their data.
“Whatever you do as a business, you are always responsible for your data. From a compliance and governance standpoint, you – as a business – are responsible for securing the ability to recover your own data,” he told Computer Weekly.
“Just because you have placed data with a third-party software as a service (SaaS) or cloud infrastructure provider, you’re still responsible for your data,” said Bertrand. “If something happens, and anything could happen, on your premises or with the cloud service you use, you should always be in a position to recover your data.
“What we have [with OVHCloud] is possibly a situation where maybe people thought, because it was with a third-party provider, it was automatically protected and backed-up,” he said. “[So] tough luck, because the data is your data and it’s on you – as a business – if you don’t have a backup somewhere else.”
For some of the firms affected by the fire, the lack of backup could be fatal, said Bertrand. “I really feel for the small companies that were affected by it, because [the fire] is certainly not their fault, but if they didn’t have a backup that was strategically thought through and placed somewhere where they could recover their data, then they made a mistake. And it may be a fatal one. I think some businesses will close based on that.
“They may also now incur some additional issues as well,” he said. “They have a liability to their end users, or maybe some business partners, and maybe some compliance exposures too? Compliance exposures, for sure, because you’re not really supposed to lose data.”
A common misconception among IT buyers is that, because their data in the cloud is accessible from anywhere, it must also be backed-up and will always be available in the event of an outage, said Bertrand.
“My research shows this big disconnect in terms of protection of data that’s in cloud environments… because somehow people conflate availability with protection,” he said.
OVHCloud’s Klaba made a similar observation during one of his post-fire video updates, where he made a public commitment to provide the firm’s customers with free data backups in future as standard, rather than as a paid-for add-on.
“It seems globally, the customers understand what we are delivering, but some customers don’t understand exactly what they have bought, so we don’t want to jump into this discussion by saying we will explain better what we are delivering. What we are doing is we will increase security, and we will deliver the higher security of backups for all customers in different datacentres,” he said.
And, in Klaba’s view, this could lead other cloud firms to follow suit in due course. “This incident will change our way of delivering the services, but I believe it will also change the standards of the industry and the market,” he said, in a video update to customers dated 16 March 2021.
Jon Healy, operations director at datacentre management services provider Keysource, said the entire incident serves to reinforce why disaster recovery is something neither datacentre operators nor cloud users can afford to overlook.
“One hundred percent service availability is an expected standard today but putting this in place for some requires comprehensive planning and can have both technical and commercial implications which need to be considered in order for it to be effective,” he said.
Given the average lifespan of a datacentre, there is every chance that – while fires might be scarce now – this could change in the future.
“Given the exponential increase in facilities built in the early noughties, the core infrastructure reaches end of life in 10-to-20 years, and the capital investment to replace or upgrade remains high, will we see more events like this and what will this mean for the industry?”
One area that ESG’s Bertrand and others have commended OVHCloud on is the transparency and openness of its communications with customers in the wake of the fire, which have included regular video updates from Klaba, as well as daily despatches on the situation via his Twitter feed and service status updates published directly on the company’s web pages.
“They seem to have been very transparent, communications-wise, which is a real sign of maturity,” said Bertrand. “There is probably only so much they can share, and they have to be cautious because of this process in place to figure out what happened, but you don’t get the sense that they’re hiding anything.”