Amazon’s Prime Day 2018 outage lasted just 75 minutes, but it cost the company around $90 million in lost revenue – about $1.2 million per minute.
No service or platform is immune to the realities and limitations of technology, or of the engineers who maintain it. Mistakes, incidents, and failures are a fact of (tech) life.
At Fast, we prioritize learning from every issue – and creating a safe place for our engineers to learn, too.
It’s not just about the people involved; it’s about the process
Engineering, like every other department, is made up of humans. But giving credit for good work (which we do generously at Fast) means nothing if we point fingers angrily when something goes wrong, instead of analyzing why it happened at all.
Often, problems develop because it’s easy for human error to affect the system. When a system is reliable, it’s because the people who built it accounted for their own capacity for error and removed themselves from the loop at every possible turn.
At Fast, we focused on creating a lightweight, clearly outlined process for analyzing incidents: one that is not only complete, but can also scale with us as we continue to expand our engineering teams.
Enter blameless culture
Omer Malik, one of Fast’s senior engineering managers, introduced blameless culture to Fast.
“The concept around blameless incidents and postmortems is an industry standard, but executed in very different ways,” said Malik. “For Fast, we wanted to build a tradition of running incident ceremonies with a Team, Selfless, and Customers First attitude.”
For example, Customer Support should see what’s happening in real time, along with a remediation timeline, so they can keep customers informed. Sales or Product can use insights from incidents to inform their own roadmaps. And Engineering can evaluate whether it’s time to shift priorities to reliability for a while, instead of focusing on shipping new features.
But if there’s a high price to pay for taking responsibility or vocalizing concerns, human nature says engineers are more likely to stay quiet.
“We leverage these pillars during the incident, its postmortem, and the action items that come out of it. At every stage of the incident management process, we create a safe zone for engineers to share their experiences, exchange opportunities, and receive feedback to create a favorable outcome for our users and customers,” Malik said.
Here are the core questions Fast’s Engineering team now asks when an issue arises:
- Who are the people most affected and relevant to the issue?
- How do we communicate what happened, and did we resolve it for those affected?
- Are we actually learning from our mistakes, or merely fixing them?
Our engineers appreciate an environment with a focus on neutral resolution. They quickly adjusted to the process of engaging in a true analysis phase for an incident, versus what companies typically expect engineers to do: code a fix and move on to the next user story.
Net results for Fast’s engineering teams
Communication = transparency = insight
Open, real-time communication builds transparency into the overall company process, including roadmapping, and gives every department outside Engineering the ability to find out for themselves what’s happening at Fast.
For example: what kind of impact did a certain outage have on the buyer experience? With the open communication a blameless culture embraces, engineers may learn there was actually no negative impact.
Customer empathy beyond the support team
Fast engineers don’t just build and deploy: they’re on call for our releases. This means they see and hear firsthand how our product affects our users, and how those users feel. Our engineering culture already prioritizes customer empathy, and every opportunity to strengthen that empathy also strengthens the feedback loop, allowing our engineers to build even better (and faster).
The Fast metric for blameless culture success
We continue to ask two questions of our processes:
Is the process fair?
Incidents almost always have a clear business impact.
“When something difficult occurs, like a system going down, you want to feel like the processes are in place to work on the right things,” said Tyler Julian, software engineer on Fast’s Site Reliability Engineering Team. “A blameless culture gives engineers a place to voice concerns regarding what the problem areas are in the technology, for example. It’s one thing for an engineer to point out an instability, but another to not have any qualitative way to describe the business impact.”
Does it feel like the right thing is being done?
“Implementing a blameless culture empowers us to link incidents and impact and articulate engineering needs for operational excellence – which we focus on quite a bit as a team here – because we now have the data to back it up. Instead of fixing the code and carrying on, we complete the postmortem cycle by digging through the problems and roadmapping solutions to prevent them from happening again,” said Julian.
A clear business and engineering management process makes it easier for engineers to articulate and justify the kinds of operational excellence and reliability work that’s so important to Fast.
“A strong, complete process to understand exactly how and where you need to refine your product secures the link between users and engineers, providing everyone with a voice,” said Julian.
“And that’s a powerful thing.”
Fast is hiring! Join us.