Cloudflare CEO Matthew Prince Explains, in Detail, and Apologizes for Yesterday’s Global Outage

Cloudflare CEO Matthew Prince:

The issue was not caused, directly or indirectly, by a cyber
attack or malicious activity of any kind. Instead, it was
triggered by a change to one of our database systems’ permissions
which caused the database to output multiple entries into a
“feature file” used by our Bot Management system. That feature
file, in turn, doubled in size. The larger-than-expected feature
file was then propagated to all the machines that make up our
network.

The software running on these machines to route traffic across our
network reads this feature file to keep our Bot Management system
up to date with ever changing threats. The software had a limit on
the size of the feature file that was below its doubled size. That
caused the software to fail.

After we initially wrongly suspected the symptoms we were seeing
were caused by a hyper-scale DDoS attack, we correctly
identified the core issue and were able to stop the propagation
of the larger-than-expected feature file and replace it with an
earlier version of the file. Core traffic was largely flowing as
normal by 14:30. We worked over the next few hours to mitigate
increased load on various parts of our network as traffic rushed
back online. As of 17:06 all systems at Cloudflare were
functioning as normal.

We are sorry for the impact to our customers and to the Internet
in general. Given Cloudflare’s importance in the Internet
ecosystem any outage of any of our systems is unacceptable. That
there was a period of time where our network was not able to route
traffic is deeply painful to every member of our team. We know we
let you down today.

This post is an in-depth recount of exactly what happened and what
systems and processes failed. It is also the beginning, though not
the end, of what we plan to do in order to make sure an outage
like this will not happen again.
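
The failure mode Prince describes (a consumer with a hard limit on the feature file's size, handed a file that had doubled) can be sketched in a few lines. This is not Cloudflare's code. The names `parse_features` and `reload_features`, the limit of 4, and the one-feature-per-line format are all assumptions made up for illustration. The sketch shows one possible way to handle an oversized file: reject it and keep serving the last known-good feature set rather than failing outright.

```python
from typing import Iterable, List

MAX_FEATURES = 4  # hypothetical limit, chosen only for illustration


class FeatureFileTooLarge(Exception):
    """Raised when a feature file holds more entries than the limit allows."""


def parse_features(lines: Iterable[str], limit: int = MAX_FEATURES) -> List[str]:
    """Parse one feature name per line; reject files over the limit."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > limit:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceeds limit of {limit}"
        )
    return features


def reload_features(
    lines: Iterable[str], current: List[str], limit: int = MAX_FEATURES
) -> List[str]:
    """Return the newly parsed features, or keep `current` if the file is rejected."""
    try:
        return parse_features(lines, limit)
    except FeatureFileTooLarge:
        # Keep the last known-good set instead of letting the error propagate.
        return current


good = ["f1", "f2", "f3"]
doubled = good + good  # duplicated entries double the file's size

active = reload_features(good, [])
active_after_bad = reload_features(doubled, active)
```

In this sketch the doubled file is rejected and `active_after_bad` still holds the earlier three features, which is the same outcome Cloudflare reached by hand when it replaced the bad file with an earlier version.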

Everything about this incident exemplifies why Cloudflare is one of my favorite companies in the world. Ideally, it wouldn’t have happened, but shit does happen. Among the things to note about Cloudflare’s response:

They identified and fixed the issue quickly.
They issued frequent updates to their status site while the incident remained ongoing.
They published this postmortem within 24 hours. (That’s remarkable, given the technical breadth of the postmortem. Publishing this tomorrow, within 48 hours of the incident, would have been a praiseworthy accomplishment.)
The postmortem starts with a cogent, well-written layperson’s explanation of what happened and why.
The postmortem expands to include very specific technical details, including source code.

Lastly, it’s worth noting that Prince put his own name on the postmortem (and wrote much of it himself, using BBEdit), and closed with this apology, taking personal responsibility:

An outage like today is unacceptable. We’ve architected our
systems to be highly resilient to failure to ensure traffic will
always continue to flow. When we’ve had outages in the past it’s
always led to us building new, more resilient systems.

On behalf of the entire team at Cloudflare, I would like to
apologize for the pain we caused the Internet today.

This is how it’s done.
