Facebook Engineering Explains “Worst Outage We’ve Had in Over Four Years”

Facebook was down for two and a half hours earlier today for many of its 500-some million users around the world, in what the company describes as “the worst outage we’ve had in over four years.” Social plugins such as the Like button, along with the developer platform, were also inaccessible during the downtime. The site also went down yesterday, but apparently for a shorter time and for fewer people.

As the engineering team details in a post published this afternoon, a cache configuration problem cascaded into a major system failure, and Facebook ended up having to turn off the site for many, if not all, users. The company tells us it doesn’t “have exact numbers, but this [was] very widespread.” From the post:

The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
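
The excerpt doesn’t spell out how the feedback cycle kept itself going. As a rough illustration, here is a toy Python model, not Facebook’s code; the function `simulate` and its parameters `capacity`, `fix_tick` and `coalesce` are hypothetical. It assumes that clients which can’t find a valid cached value all query the database, and that a failed query is handled like an invalid value, which deletes the cache key again. Under those assumptions, fixing the root cause isn’t enough on its own: the load only drops once traffic is cut, or once repairs are coalesced so that a single client refreshes the value for everyone.

```python
from __future__ import annotations


def simulate(clients_per_tick: list[int], capacity: int, fix_tick: int,
             coalesce: bool = False) -> list[int]:
    """Toy model of a cache-repair feedback loop; returns DB queries per tick.

    Hypothetical assumptions, not taken from Facebook's post:
      * while the cache holds no valid value, clients query the database to
        repair it: every client on the site, or just one if `coalesce` is set;
      * the database answers at most `capacity` queries per tick; if asked
        for more, every query in that tick fails;
      * a failed query is treated like an invalid value, so the cache key is
        deleted and the next tick misses again;
      * the bad value in the persistent store is corrected at tick `fix_tick`.
    """
    cache_has_valid_value = False
    queries_per_tick = []
    for tick, clients in enumerate(clients_per_tick):
        if cache_has_valid_value:
            queries_per_tick.append(0)  # every client is served from the cache
            continue
        # Each cache miss becomes a repair query, unless repairs are coalesced
        # so that a single client refreshes the value on everyone's behalf.
        queries = min(clients, 1) if coalesce else clients
        queries_per_tick.append(queries)
        store_is_fixed = tick >= fix_tick
        if 0 < queries <= capacity and store_is_fixed:
            cache_has_valid_value = True  # a repair finally sticks
        # Otherwise the key stays deleted and every client misses next tick.
    return queries_per_tick


# Full traffic keeps the database swamped even after the fix at tick 2.
print(simulate([1000] * 10, capacity=100, fix_tick=2))
# Cutting traffic to zero, then admitting a few clients, breaks the loop.
print(simulate([1000] * 6 + [0, 0] + [10, 1000, 1000], capacity=100, fix_tick=2))
# Coalesced repairs never overload the database and recover at the fix.
print(simulate([1000] * 6, capacity=100, fix_tick=2, coalesce=True))
```

In this model the first run issues 1,000 queries on every tick, the second drops to zero once the 10 re-admitted clients repopulate the cache, and the third never sends more than one query per tick. Request coalescing is only one familiar way to damp such a loop; exponential backoff and serving the last known good value on errors are others. The post doesn’t say which patterns Facebook intends to adopt.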

While Facebook has had occasional site performance problems, in general it has managed to stay up for almost all users almost all of the time, with performance improving in recent years.

2 Responses to “Facebook Engineering Explains ‘Worst Outage We’ve Had in Over Four Years’”

  1. John Johnston says:

    Nasty business. Must have been some pretty stressed engineers there having to take the site down for that long.

  2. essay_writing says:

    As I understand it, the errors had been accumulating and this led to the system breakdown? If so, I guess more testing and better monitoring of the system while it was running could have prevented the breakdown.

Mediabistro, a division of Prometheus Global Media. Copyright © 2014 Mediabistro Inc.