
Amazon unwisely lets the nerds apologize for its AWS outage

Image: Kinvey

Much of the Internet was down for three hours last week. Because of an outage at Amazon Web Services (AWS), Amazon’s cloud-based service that powers Internet companies, hundreds of services like Medium, Slack, Quora, Reddit, and Kickstarter stopped working or worked poorly. Amazon’s Web geeks then bungled the response, focusing on jargon rather than clarity and empathy.

Here’s what actually happened: at 9:37 Pacific Time on February 28, an AWS tech entered a command with a typo in it. Instead of taking just a few cloud servers offline for maintenance, this took down an entire AWS data center in Virginia, causing the outage. One system that stopped working was the AWS health dashboard, which continued to show green “working” status across the board for services that were failing or had failed.

Only other nerds can understand Amazon’s explanation

There is an AWS blog, but it has been mute on the topic of the outage. So has the main Amazon blog. Amazon’s press room has no release about the outage, and it’s missing from the “AWS in the News” page, even though the media was all over it. The only communication was a 900-word announcement on the AWS messaging system, written in technical language from a nerd’s perspective. When analyzing this communication, keep in mind that there are two audiences. One is the technical users of the AWS system, whom this announcement serves, albeit poorly. The second is the users at AWS’s customers (for example, people reading Medium blogs or contributing to Kickstarter). For that second, much larger group, this explanation is completely worthless.

Let’s look at that announcement. I’ve put the jargon in bold, the passive voice in bold italic, and the weasel words in italic. I have deleted some passages that are similar to the ones here, because I know your jargon tolerance has limits. My own translation follows each section:

Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

Translation: We typed the wrong thing and mistakenly brought down the system.

[Similarly techie passage that explains how they restarted things deleted.]

We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. [More jargon about the fix follows.]

Translation: We’ll fix things so this won’t happen again.

From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.

Translation: We were so screwed up even the dashboard that shows how screwed up we were was screwed up.

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.

Translation: Let’s throw an apology in at the end.

What’s wrong with this “apology” and how to fix it

This is an example of terrible communications. It’s not in the normal communications channels like the blog. It fails to apologize up-front to the millions of users affected. It is self-centered, because it focuses on the details of what happened rather than the effects. It is so full of jargon that it is incomprehensible to a nontechnical reader, such as those millions of users of Quora, Kickstarter, and the other services.

Passive voice has infected this apology, as it does so many technical explanations. We learned that the inputs were entered incorrectly, the servers were removed, the placement subsystem is used during PUT requests, and the subsystems were restarted. What we don’t know is who did these things. This lack of actors makes the explanation fuzzy and hard to read, and it adds to the clarity problems.

The technical readers of this blog may be saying, “Wait, we need jargon to be precise here.” I don’t disagree. But always state what happened in clear and simple language first, then use jargon to explain the details to more technical readers. That way, the nontechnical readers can understand what’s happening, while the technical readers can get a clearer explanation.

The final problem here is the order. Don’t talk about yourself first. Start with the apology. Only then explain what happened and how you will fix it.

With these concepts in mind, and using subheadings and bullets to make this easier to parse, here’s what AWS should have posted.

We’re sorry about the outage in AWS. Here’s what happened

Due to a mistake one of our team members made, the AWS service stopped functioning properly in our Northern Virginia data center (US-EAST-1). Significant parts of our service were down for nearly three hours, from 9:37AM PT to 12:26PM PT. We know that hundreds of our customers — and millions of their users — depend on AWS, and that our service failure disrupted their businesses, messed up their users’ experiences, and cost them money. We apologize for causing this problem.

Here’s an explanation of what happened, how we fixed it, and the steps we’re taking to prevent future, similar problems.

What caused the outage and how we recovered

One of our team members used a command intended to take a small number of servers offline for a routine service issue. Because they typed a parameter incorrectly, the command ended up taking a larger set of systems down. While this error originally affected only the S3 billing process, it also ended up cascading into problems with several other subsystems that depend on that process:

  • The index subsystem, which manages the metadata and location information of all S3 objects in the region.
  • All GET, LIST, PUT, and DELETE requests, which depend on that index subsystem.
  • The placement subsystem, which manages allocation of new storage.
  • PUT requests to allocate storage for new objects, which depend on the placement subsystem.
  • The AWS Service Health Dashboard, which depends in part on these systems.
  • The S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes, and AWS Lambda, which were unavailable while we were restarting the other systems.

Normally, a problem with a few servers on one system like this wouldn’t cause such a major impact, but we’d never had to restart so many systems at once, and it took longer than we expected. Some systems came back online by 12:26PM PT, and they were all working by 1:54PM PT.

How we’re fixing things

We’ve fixed the immediate cause of the problem by modifying the tool we use to take capacity offline; it will now remove capacity more slowly and won’t take any subsystem below its minimum required capacity. We will also:

  • Improve the recovery time of key S3 subsystems.
  • Further partition the index subsystem to reduce the “blast radius” of failures like this. This is part of a continuing effort to limit the effects of failures to as small a set of servers and systems as possible.

We know that you depend on the constant, consistent availability of our systems. That is our goal as well, and we continue to work hard to improve it.

What you can learn from this

There are so many lessons here. Here are a few of the big ones:

  • When apologizing for a problem you caused, put the apology first, not the explanation.
  • For a problem that affects a large number of people, respond in a widely read channel, like a blog.
  • Team communications professionals (PR people) with technical people to explain technical failures. Don’t let the nerds twist in the wind alone.
  • Use a simple explanation in plain language first, even if it is imprecise. After that, explain further with technical detail that is more accurate, using jargon if necessary.
  • Use headings to break up difficult communications into easily parsed pieces.
  • Use bullets to make it easier to read (or skip) technical details in a list or sequence.

Your users will thank you for it. So will your fellow nerds, both coworkers and customers.
