Facebook’s “Worst Outage in Over Four Years”

One social networking site and 500 million friends later, Facebook suffered the biggest downtime of its life yesterday morning. Read on for the post-mortem report from the network's Software Engineering department.
Yesterday, the Facebook network experienced downtime that began at around 11:30 am PST. According to Facebook's Director of Software Engineering, Robert Johnson, the downtime was caused by a mishandled error condition.
The incident involved an automated system designed to verify configuration values in the cache. A configuration value was interpreted as invalid, which meant that every single client saw the invalid value and attempted to fix it. Because the fix involves a query to a cluster of databases, the database cluster was overwhelmed. Even worse, after the original flaw had been fixed, the stream of queries continued, because clients kept interpreting the configuration value as invalid.
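The feedback loop described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Facebook's actual code: the names `Database`, `is_valid`, `run_round` and `simulate`, the client count, and the database capacity are all hypothetical. It also assumes that a client treats a failed database query the same way as an invalid value and clears the cached entry, which is one plausible reading of why the queries kept coming after the fix.

```python
from dataclasses import dataclass


@dataclass
class Database:
    """Hypothetical stand-in for the database cluster."""
    capacity: int      # queries it can answer per round (assumed figure)
    value: str         # persistent copy of the configuration value
    served: int = 0    # queries received in the current round

    def start_round(self) -> None:
        self.served = 0

    def query(self) -> str:
        self.served += 1
        if self.served > self.capacity:
            raise TimeoutError("database overloaded")
        return self.value


def is_valid(value) -> bool:
    # Assumption: a missing entry or the marker "BAD" counts as invalid.
    return value is not None and value != "BAD"


def run_round(cache: dict, db: Database, clients: int) -> int:
    """Run one round of client checks; return the number of database queries."""
    db.start_round()
    # All clients read the cache at about the same time, so they all
    # see the same (possibly invalid) value before anyone repairs it.
    seen = cache.get("config")
    repairing = clients if not is_valid(seen) else 0
    for _ in range(repairing):
        try:
            cache["config"] = db.query()
        except TimeoutError:
            # Assumption: an error is handled like an invalid value,
            # so the client clears the cached entry.
            cache.pop("config", None)
    return repairing


def simulate(clients: int = 1000, capacity: int = 100) -> list:
    db = Database(capacity=capacity, value="GOOD")
    cache = {"config": "GOOD"}
    history = [run_round(cache, db, clients)]      # healthy system
    db.value = "BAD"                               # the bad configuration change
    cache["config"] = "BAD"
    history.append(run_round(cache, db, clients))  # every client tries to fix it
    db.value = "GOOD"                              # the original flaw is fixed
    history.append(run_round(cache, db, clients))  # the queries continue anyway
    return history


if __name__ == "__main__":
    print(simulate())  # [0, 1000, 1000]
```

In this sketch the third round still produces 1000 queries even though the stored value is valid again: the overloaded database keeps returning errors, the errors keep clearing the cache, and every client keeps retrying.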