What enterprises can learn from the extended Facebook outage

A recent six-hour blackout of Facebook and its associated services created massive chaos among users and pushed businesses to reflect on alternative real-time collaboration methods

What enterprises can learn from the extended Facebook outage - CIO&Leader

Early this month, social media giant Facebook and its associated services, Instagram and WhatsApp, suffered a massive outage that halted them for almost six hours at a stretch and sent many users into a frenzy. Users could not send or receive messages, nor could they make or receive calls. Facebook had earlier experienced a prolonged outage in 2019, when its services and applications were down for about 24 hours.

While lengthy outages at technology companies are not unusual, they can now create massive chaos among users because of the digital-first environment we have recently transitioned into. The event also demonstrated our excessive dependence on a single company to connect and collaborate with colleagues and families in real time. That Facebook's top-notch IT talent could not resolve the issue promptly, and that no solid backup strategy for the backbone was in place to get things moving, is a concern the social media giant will want to address.

The disruption also dented Facebook's appeal on the stock market, where its shares dropped by nearly 5% within a single day. The blackout of WhatsApp, the most popular messaging service in over 100 countries with over 2 billion active users, caused particular anguish for enterprises and users alike. The blackout also affected several apps that use Facebook login and prevented users from using its much-touted payments service.

Enterprises that use WhatsApp and Facebook Messenger had no indication of when the social media giant would restore its services.

What went wrong?

Facebook attributed the outage to changes in its server configuration. It appears that, because of the defective configuration changes, the backbone routers that coordinate network traffic between Facebook's data centers could no longer identify the location of those data centers. Many technology experts were quick to suggest that the problem was associated with the Border Gateway Protocol (BGP) routes that cover the IP addresses of its DNS servers.

Facebook explains that, during routine maintenance, these configuration changes caused the DNS servers to go offline. Since the blackout also affected Facebook's internal systems, even its employees could not collaborate effectively through Facebook Workplace or, in many cases, log in with their work email.

"During one of the routine maintenance jobs, a command was issued to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command," explains Santosh Janardhan, VP - Engineering and Infrastructure, Facebook, in the company's official blog post.

This is consistent with the finding cited in the IT Process Institute's Visible Ops Handbook that 80% of unplanned outages are caused by ill-planned changes made by administrators (operations staff or developers).
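The kind of pre-execution audit Janardhan describes can be illustrated with a minimal sketch. It is not Facebook's tooling: the `Link` type, the site names, and the `audit_change` function are all hypothetical, and the only rule assumed here is that a maintenance command should be rejected if it would leave any data center unreachable over the backbone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A hypothetical backbone link between two sites (assumption)."""
    a: str
    b: str


def is_connected(sites: set, links: set) -> bool:
    """Return True if every site can still reach every other site."""
    if not sites:
        return True
    adjacency = {site: set() for site in sites}
    for link in links:
        adjacency[link.a].add(link.b)
        adjacency[link.b].add(link.a)
    # Depth-first walk from an arbitrary site.
    start = next(iter(sites))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == sites


def audit_change(sites: set, links: set, to_disable: set) -> bool:
    """Approve a change only if the backbone stays fully connected."""
    remaining = links - to_disable
    return is_connected(sites, remaining)


# Illustrative topology: three sites joined in a triangle.
SITES = {"dc1", "dc2", "dc3"}
LINKS = {Link("dc1", "dc2"), Link("dc2", "dc3"), Link("dc1", "dc3")}
```

With this toy topology, `audit_change(SITES, LINKS, {Link("dc1", "dc3")})` approves the change because the other two links keep all sites reachable, while `audit_change(SITES, LINKS, LINKS)` rejects a command that would take down every link. A bug in a check of this kind is what, by Facebook's account, let the faulty command through.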

Janardhan adds that the change completely severed the connections between Facebook's data centers and the internet, and that this total loss of connection caused a second issue that made things worse.

Many speculative theories suggested that Facebook's services were down as a result of hacking or other malicious activity, a claim that the social media major refuted vehemently. While Facebook said there was no evidence that user data was compromised during the downtime, the company has a history of failing to protect its customers' data, and that is enough to ring alarm bells in the minds of many users.

Significant impact on enterprises

In recent years, many businesses across the globe have moved from email to WhatsApp as their primary channel for seamless two-way communication with customers and employees, replacing the phone lines and website-based channels of earlier years. It has become a preferred medium for a significant population to connect, collaborate and share ideas instantly. Grocery apps, online retailers, and financial institutions increasingly reach out to their customers through WhatsApp messages to collect timely feedback and verify the delivery of products and services.

Use of the WhatsApp Business API has also grown significantly, allowing enterprises to build more impactful customer engagements through a range of customizations. Many in the organizational workforce also depend heavily on WhatsApp groups for real-time internal deliberations and operations management. Many web-goers in India use these social media platforms but do not otherwise use the wider internet, which indicates the influence Facebook and WhatsApp hold in the Indian market.

"The disruption was panicking for me as a CEO of a mid-level recruitment firm. For several minutes, I was under the impression that there was a problem with our internet connectivity. When I spoke with my team over a call, I realized that it was a mass outage," says Gaurav Kumar, Managing Partner, G9 Recruiters.

Most small enterprise users had no plan to work around such downtime or mitigate the severity of its impact. Even for day-to-day brand-building efforts, they rely heavily on social media channels. "Our people depend significantly on WhatsApp, Facebook, and LinkedIn to collaborate, generate leads, answer queries, and schedule meetings. During the disruption, we decided to leverage traditional mobile SMS service and emails until the services get restored, but that's not real-time and time taking," says Deepak Pandey, Director - Strategy and Technology of Propack Electronics, an Electrostatic Discharge and Clean Room hazard solution provider.

A robust strategy to prevent business paralysis

The outage has again highlighted that despite substantial efforts to strengthen IT infrastructure, enterprises cannot evade downtime incidents and need a robust backup strategy to escape significant losses. On the face of it, using a free app such as WhatsApp for internal or stakeholder communication is certainly not a viable long-term solution for businesses.

As the use of collaboration tools grows and employees increasingly share documents through these apps, it becomes imperative for organizations that want a greater level of privacy and security to deploy enterprise-grade messaging apps, which also support business continuity. Such apps also enable organizations to schedule maintenance and updates at a time convenient for them.

Additionally, relatively small companies may not be immediately ready to deploy an alternate messaging strategy, but they should consider fallback options that reduce their dependency on a standalone tool and lower the overall risk in the longer term. For Facebook, too, this could be a significant lesson, one that compels it to revisit its strategy of keeping all of its DNS servers within its internal network.
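For teams that automate customer notifications, a fallback of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: `send_with_fallback`, `ChannelUnavailable`, and the stub senders are hypothetical names, and a real deployment would replace the stubs with calls to whichever messaging, SMS, or email providers the business actually uses.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A sender takes (recipient, message) and raises ChannelUnavailable on failure.
Sender = Callable[[str, str], None]


class ChannelUnavailable(Exception):
    """Raised by a sender when its channel cannot deliver (hypothetical)."""


def send_with_fallback(
    recipient: str,
    message: str,
    channels: Sequence[Tuple[str, Sender]],
) -> Optional[str]:
    """Try each channel in order; return the name of the one that worked."""
    for name, send in channels:
        try:
            send(recipient, message)
        except ChannelUnavailable:
            continue  # This channel is down, so try the next one.
        return name
    return None  # Every channel failed.


# Stub senders standing in for real provider integrations (assumptions).
SMS_OUTBOX: List[Tuple[str, str]] = []


def chat_app_stub(recipient: str, message: str) -> None:
    """Simulates the primary chat channel during an outage."""
    raise ChannelUnavailable("primary chat channel is unreachable")


def sms_stub(recipient: str, message: str) -> None:
    """Simulates a working SMS channel by recording the message."""
    SMS_OUTBOX.append((recipient, message))
```

Calling `send_with_fallback("customer-1", "Your order has shipped", [("chat", chat_app_stub), ("sms", sms_stub)])` skips the failing chat stub and delivers through the SMS stub. The ordered list keeps the preferred channel first while making the dependency on any single tool explicit.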

