Facebook Engineering Explains “Worst Outage We’ve Had in Over Four Years”

Facebook was down for two and a half hours earlier today for many of its 500-some million users around the world, in what the company describes as “the worst outage we’ve had in over four years.” As part of the downtime, social plugins such as the Like button, and the developer platform, were also not accessible. The site also went down yesterday, but apparently for less time and fewer people.

As the engineering team details in a post this afternoon following the outage, a cache configuration problem cascaded into a major system failure, and Facebook ended up having to turn off the site for many if not all users. The company tells us it doesn’t “have exact numbers, but this [was] very widespread.” From the post:

The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
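
The excerpt does not say what the redesigned configuration system will look like. As a rough illustration of one common way to deal "more gracefully with feedback loops and transient spikes," here is a minimal Python sketch of capped exponential backoff with jitter, so that clients seeing a bad value do not all retry against the same database at once. This is not Facebook's design; the names `backoff_delays`, `fetch_with_backoff`, `fetch`, `is_valid` and `sleep` are hypothetical and exist only for this example.

```python
import random


def backoff_delays(base=0.1, cap=30.0, attempts=6, rng=None):
    """Yield one randomized delay (in seconds) per retry attempt.

    The upper bound doubles each attempt and is capped at `cap`; the
    actual delay is drawn uniformly below that bound ("full jitter"),
    which spreads retries from many clients over time.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


def fetch_with_backoff(fetch, is_valid, sleep, **backoff_options):
    """Call the hypothetical `fetch()` until `is_valid(value)` is true.

    `sleep` is injected so callers can pass `time.sleep` in real use or
    a stub in tests. Returns None if every attempt yields a bad value,
    rather than retrying forever.
    """
    for delay in backoff_delays(**backoff_options):
        value = fetch()
        if is_valid(value):
            return value
        sleep(delay)
    return None
```

The bounded number of attempts matters as much as the jitter: a client that gives up, instead of retrying indefinitely, cannot keep feeding load into a cluster that is already struggling.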

While Facebook has had occasional site performance problems, in general it has managed to stay up for almost all users almost all of the time, with performance improving in the most recent years.


2 Responses to “Facebook Engineering Explains “Worst Outage We’ve Had in Over Four Years””

  1. John Johnston says:

    Nasty business. Must have been some pretty stressed engineers there having to take the site down for that long.

  2. essay_writing says:

    As I understand, the errors have been accumulating and this led to the system breakdown? If yes, I guess more tests and better monitoring the system all the time it was working could have avoided the breakdown.
