Facebook Engineering Explains “Worst Outage We’ve Had in Over Four Years”

Facebook was down for two and a half hours earlier today for many of its 500-some million users around the world, in what the company describes as “the worst outage we’ve had in over four years.” Social plugins such as the Like button, along with the developer platform, were also inaccessible during the downtime. The site also went down yesterday, though apparently for less time and for fewer people.

As the engineering team details in a post this afternoon, a cache configuration problem cascaded into a major system failure that forced Facebook to turn off the site for many, if not all, users. The company tells us it doesn’t “have exact numbers, but this [was] very widespread.” From the post:

The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.

While Facebook has had occasional site performance problems, it has generally managed to stay up for almost all users almost all of the time, with performance improving in recent years.

2 Responses to “Facebook Engineering Explains “Worst Outage We’ve Had in Over Four Years””

  1. John Johnston says:

    Nasty business. Must have been some pretty stressed engineers there having to take the site down for that long.

  2. essay_writing says:

    As I understand, the errors have been accumulating and this led to the system breakdown? If yes, I guess more tests and better monitoring the system all the time it was working could have avoided the breakdown.
