Fastly runs a CDN — a content delivery network that makes delivery of Web pages faster globally. Due to a bug, major Fastly clients including Amazon, Reddit, Spotify, eBay, and Pinterest were unable to serve pages to customers for about an hour Tuesday. Fastly apologized — and its apology is a clinic in how to apologize for a technical problem.
Here’s Fastly’s full blog post published Tuesday.
Summary of June 8 outage
Published June 8, 2021
Nick Rockwell, Senior Vice President of Engineering and Infrastructure
We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change. We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.
This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.
On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.
Here’s a timeline of the day’s activity (all times are in UTC):
09:47 Initial onset of global disruption
09:48 Global disruption identified by Fastly monitoring
09:58 Status post is published
10:27 Fastly Engineering identified the customer configuration
10:36 Impacted services began to recover
11:00 Majority of services recovered
12:35 Incident mitigated
12:44 Status post resolved
17:25 Bug fix deployment began
Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25.
Where do we go from here?
In the short term:
- We’re deploying the bug fix across our network as quickly and safely as possible.
- We are conducting a complete post mortem of the processes and practices we followed during this incident.
- We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.
- We’ll evaluate ways to improve our remediation time.
We have been — and will continue to — innovate and invest in fundamental changes to the safety of our underlying platforms. Broadly, this means fully leveraging the isolation capabilities of WebAssembly and Compute@Edge to build greater resiliency from the ground up. We’ll continue to update our community as we make progress toward this goal.
Even though there were specific conditions that triggered this outage, we should have anticipated it. We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support. Customers should always feel free to email firstname.lastname@example.org for more information.
Why this is the perfect apology
You can’t control when and how problems with your company affect customers. What you can control is what you do when there is a problem. Communication with customers should be fast, clear, contrite, and not overly defensive. Fastly’s blog post is practically a template for how to do it right. Take note of these elements of the post:
- The title is clear and neutral — it’s about the outage.
- The lede describes what happened in neutral terms: “We experienced a global outage” — and what happened when Fastly fixed it “Within 49 minutes, 95% of our network was operating as normal.”
- There is a clear apology. It doesn’t evade who is responsible, and is directed to those who were harmed — Fastly’s customers and their customers’ customers: “We’re truly sorry for the impact to our customers and everyone who relies on them.” It’s neither minimizes the issue nor blubbers — it’s just a sincere apology.
- There is a detailed description of what happened and in what order, starting with introducing a bug and then the way that a single customer caused that bug to affect customers globally. Notice that there is no attempt to avoid blame: “We began a software deployment that introduced a bug.”
- The post includes a public plan describing follow up, including how to avoid similar bugs in the future and speed recovery.
- It ends with taking responsibility and outreach to customers: “[W]e should have anticipated it. We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support.”
If you’re going to screw up, this is how to do it
Practice writing apologies and post mortems like this.
It will give your customers confidence in your worst moments.