This week, Anurag spoke at the CTO Summit on Reliability to share his new talk “Why systems fail and what you can do about it.” The talk covers four categories of system failures and mitigation approaches for each based on Anurag’s background at AWS running analytic and database services.
At AWS, operations leaders met weekly to review the prior week's issues: they categorized each failure, identified its cause, and proposed mitigations so it wouldn't recur.
Deployments are the most common source of outage minutes for most companies.
Since deployment challenges are a process problem, you can also solve them with process.
Amazon has a strong belief in moving from good intentions toward mechanisms, because mechanisms can be iteratively improved and maintain a collective memory.
Via process improvements, AWS was able to reduce deployment failures by ~50x.
One artifact of this process was a deployment doc that would be reviewed by a skilled operator outside the service team.
This created a virtuous cycle. As deployments became more reliable, they were done more often, which made them smaller and therefore more reliable.
There’s a debate between the auto-rollback and roll-forward-only camps. Anurag’s take: if you can roll back automatically, why wouldn’t you? Many companies make rollbacks work by doing things like splitting a change into multiple deployments:
This is a general application of the methodology where you make the initial interface change in the provider, then update the consumer, then remove the stale interface from the provider. Distributed systems require this type of thinking in any case. You can’t update everything simultaneously; you need to support old and new interfaces and make gradual transitions to new versions.
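As a concrete illustration (the field names and payload shape are assumptions, not from the talk), the provider-first step of such a split might look like this:

```python
from dataclasses import dataclass

# Hypothetical sketch of splitting one interface change into three deployments.
# The field rename (user_id -> customer_id) is illustrative, not from the talk.

@dataclass
class Order:
    customer_id: str
    amount: float

# Deployment 1 (provider): write the new field alongside the old one and accept
# either on read, so old and new consumers keep working.
def serialize_order(order: Order) -> dict:
    return {
        "user_id": order.customer_id,      # stale field, kept for old consumers
        "customer_id": order.customer_id,  # new field
        "amount": order.amount,
    }

def parse_order(payload: dict):
    customer_id = payload.get("customer_id", payload.get("user_id"))
    return customer_id, payload["amount"]

# Deployment 2 (consumer): switch all readers to "customer_id" only.
# Deployment 3 (provider): drop "user_id" once nothing reads it.
# Each step is small, independently deployable, and safe to roll back.
```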
The largest outages Anurag has seen were either operator errors or cascading failures with bad remediations.
Humans intrinsically have a 1% error rate, particularly when doing repetitive mundane tasks.
Ops orchestration tools should guard against this human error rate by default.
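As a rough sketch of what "by default" could mean in practice (the command, service name, and thresholds are assumptions, not from the talk): batch the work, pace it, and stop on the first failure so a mistake can't sweep the whole fleet.

```python
import subprocess
import time

# Hypothetical guardrails an orchestration tool could apply by default: touch a
# small batch first, pace the rollout, and halt on the first failure so a bad
# command or bad change can't hit every host at once.
def rolling_restart(hosts, batch_size=1, pause_seconds=60):
    for i in range(0, len(hosts), batch_size):
        for host in hosts[i:i + batch_size]:
            result = subprocess.run(
                ["ssh", host, "systemctl restart myservice"],  # illustrative command
                capture_output=True,
            )
            if result.returncode != 0:
                # Stop the rollout instead of cascading the failure.
                raise RuntimeError(f"restart failed on {host}; halting rollout")
        time.sleep(pause_seconds)  # give monitoring time to surface regressions
```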
25% of the Large-Scale Events Anurag saw involved databases.
The same is true of a lot of things other than databases, like edge routers and cloud services, so look inside your environment for systems with similar characteristics and take appropriate action.
AWS likes to avoid relational databases in its control planes, opting for DynamoDB or other home-grown tech. That's not because NoSQL is intrinsically more reliable, but because it fails in pieces, one table at a time rather than as a whole. It's less functional and less expressive, but that means the remaining functionality is expressed in your own code, which gives you control.
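A small sketch of what "failing in pieces" buys you (the table and function names are hypothetical): each control-plane operation depends only on the table it needs, so losing one table degrades one feature instead of the whole plane, and the fallback is your own code.

```python
# Hypothetical control-plane functions, each tied to a single table/store.
class TableUnavailable(Exception):
    """Raised when one table is throttled or down."""

def describe_instance(instances_table, instance_id):
    # Depends only on the instances table; unaffected if backups are down.
    return instances_table.get(instance_id)

def list_backups(backups_table, instance_id):
    try:
        return backups_table.query(instance_id)
    except TableUnavailable:
        # Degrade just this feature; the rest of the control plane keeps working,
        # and the fallback behavior lives in code you control.
        return []
```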
Try to build “escalators”, not “elevators”. Escalators perform at a lower rate in normal operation, but they also degrade gracefully to a lower rate. Elevators perform better in the normal case, but fail absolutely and degrade under load.
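One way to express the escalator idea in code (a sketch; the concurrency limit and fallback are assumptions): bound the work in flight and return a degraded answer under overload, rather than queueing without limit and then failing absolutely.

```python
import threading

MAX_IN_FLIGHT = 64  # assumed capacity limit, tuned per service
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(request, compute_fresh, cached_or_default):
    # Escalator-style: run at a bounded rate; when over capacity, degrade to a
    # stale or partial answer instead of falling over entirely.
    if not _slots.acquire(blocking=False):
        return cached_or_default(request)
    try:
        return compute_fresh(request)
    finally:
        _slots.release()
```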
In 2020, Google reported 200 minutes of downtime across 150 large-scale events on GCP. That's a lot of downtime for a well-run, SRE-driven organization.
Everything eventually fails, and this is where we separate failures into commonplace failures and first-time failures.
Runbooks reduce human error during commonplace failure scenarios. But runbooks still leave humans in the loop, where the time to respond is an hour or two even for a well-understood issue like a full disk.
For well-understood problems, Anurag found the only way around long remediation periods and downtime is to automate remediation. Every week they chose a problem or set of problems that would yield the greatest gains in productivity or availability and automated them.
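For the full-disk example above, an automated remediation could be as simple as the sketch below (the log path, threshold, and retention window are assumptions); the point is that it runs the moment monitoring fires, with no human in the loop.

```python
import os
import shutil
import time

LOG_DIR = "/var/log/myservice"   # assumed path
THRESHOLD = 0.90                 # remediate when the disk is 90% full
MAX_AGE_DAYS = 7                 # keep a week of rotated logs

def remediate_full_disk():
    usage = shutil.disk_usage(LOG_DIR)
    if usage.used / usage.total < THRESHOLD:
        return  # disk is healthy, nothing to do
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(LOG_DIR):
        path = os.path.join(LOG_DIR, name)
        # Only delete rotated, compressed logs older than the retention window.
        if name.endswith(".gz") and os.path.getmtime(path) < cutoff:
            os.remove(path)
```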
Solving first-time failures is challenging because observability tools have lag, and dashboards and logs often lack data for a new event. To debug, operators often end up opening a blizzard of SSH windows.
Production ops is a real-time distributed systems problem, and it requires a platform built for that.
The challenge with automating remediation of common failures is that each remediation tends to be a custom, multi-month project. Shoreline's platform makes it easy to build automated remediations with only shell scripting skills. In the same time it takes to debug the system, you can create an automation that handles the problem forever.
Shoreline is the product Anurag wishes he’d had managing large fleets at AWS.