Runbooks reduce toil and standardize processes across a team. But creating your runbooks is only the first step. Automating runbook execution based on an alarm, without human intervention, is the real goal.
Runbooks that kick off automated remediations let operators get ahead of ticket counts and incidents. Eliminating the need for humans to wake up and press a button or copy and paste a script increases availability, even as your fleets expand and applications grow in complexity.
So let's look at the how and why of automating runbooks.
Runbooks are a critical part of the third stage of operational maturity - they are a “defined” and proactive documentation of your tasks, problems, and procedures. Runbooks generally happen after stage two - creating a general repeatability of tasks - and just before stage four - implementation of automation. Here are some advantages to using runbooks:
Runbooks requiring manual intervention are a great step away from the pain of memorizing the functions and "isms" of an organization’s technologies and technical debts. Runbooks reliant on manual intervention usually come in the form of a wiki with a search bar and a hope (and/or dream) that a team member can find the documentation when it’s needed.
But as you mature, you'll start looking closer at operational maturity. ITIL defines the final level of maturity as an optimized process that is repeatable, easily defined, and widely automated.
Without manager-speak, this means working toward automating yourself out of a job.
Let's look at an example of this journey to maturity with runbooks:
This is a world where an operator identifies an opportunity to improve the infrastructure one server at a time. This person articulates how to do this process, democratizes the process to help offload the toil to make better use of time, and then entirely removes the team from the toil altogether by moving the execution to automation.
The operator has now sped up a process, introduced a system where automation can manage these units of focus, and as a result has gained back working hours.
One of the promises of Kubernetes is that it automates the deployment of our infrastructure. Once we have Kubernetes properly configured, we’ve removed a certain class of problems. But there’s a whole class of problems that restarting a container or rolling back a configuration won’t solve. Examples include rotating security certificates, cordoning and draining a bad node, and other issues that may involve state.
Expanding the range of maintenance issues and incidents we can resolve automatically beyond restarts expands our operational capability and gives operators superpowers. Automating the creation of runbooks is a good first step, but the end goal should be to offload more and more of the maintenance and incident remediation to our machines.
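To make one of those remediations concrete, here is a minimal sketch of a "cordon and drain a bad node" runbook written as a script. It assumes `kubectl` is installed and already configured for the cluster. The function names and the node name `worker-node-1` are hypothetical, chosen only for illustration. By default the script runs in dry-run mode and just prints the steps it would take.

```python
import subprocess
from typing import List


def cordon_and_drain_commands(node: str) -> List[List[str]]:
    """Build the kubectl commands that cordon and then drain a node."""
    return [
        # Mark the node unschedulable so no new pods land on it.
        ["kubectl", "cordon", node],
        # Evict the pods already running there. DaemonSet pods are skipped,
        # and pods using emptyDir volumes are allowed to lose that data.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


def remediate_bad_node(node: str, dry_run: bool = True) -> List[str]:
    """Run each step (or only print it when dry_run is True).

    Returns the steps as strings so the caller can log or review them.
    """
    steps = []
    for cmd in cordon_and_drain_commands(node):
        line = " ".join(cmd)
        if dry_run:
            print(f"[dry-run] {line}")
        else:
            # check=True stops the runbook if a step fails.
            subprocess.run(cmd, check=True)
        steps.append(line)
    return steps


if __name__ == "__main__":
    # "worker-node-1" is a made-up node name for illustration.
    remediate_bad_node("worker-node-1")
```

Once a procedure like this is captured as a function instead of a wiki page, it can be called from whatever system raises the alarm, which is the step that takes the human out of the loop.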
So let’s take a look at various tools that get us here.
Several products exist on the market to help you automate your runbooks. They can simplify the identification of problems and trigger prescribed steps for the solutions needed.
These products are the robots that we want taking over our work. Specifically, we want them taking over the work of humans repeatedly hammering on uninteresting problems that have already been solved.
These products break down into a few categories.
Confluence and Wiki.js serve as great tools for documenting processes as wikis, but don’t offer automation capabilities.
Rundeck and Transposit take metric signals from monitoring and observability platforms and then provide predefined runbooks for a human to follow when remediating an incident.
Shoreline collects real-time resource metrics across your fleet and can trigger configured remediation actions based on these metrics to resolve incidents without human intervention. An automatically created runbook still requires a human to process, execute, and review it. An automatically executed runbook takes data as input, analyzes it, and makes a decision, just as a human would, but faster and more consistently.
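The difference between a runbook a human follows and one that executes itself can be sketched in a few lines. The example below is a generic illustration, not Shoreline's API or any other product's. The `Rule` class, the `disk_used_pct` metric name, the threshold, and the `clear_tmp_files` action are all assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical alarm: when `metric` exceeds `threshold`, run `remediate`."""
    name: str
    metric: str
    threshold: float
    remediate: Callable[[str], str]


def clear_tmp_files(host: str) -> str:
    # Placeholder action. A real remediation would run the documented
    # runbook steps against the host instead of returning a message.
    return f"cleared temp files on {host}"


def evaluate(rules: List[Rule], fleet_metrics: Dict[str, Dict[str, float]]) -> List[str]:
    """Check every host against every rule and run the matching remediations."""
    results = []
    for host, metrics in fleet_metrics.items():
        for rule in rules:
            value = metrics.get(rule.metric)
            # Only act when the metric is present and over the threshold.
            if value is not None and value > rule.threshold:
                results.append(rule.remediate(host))
    return results


if __name__ == "__main__":
    rules = [Rule("disk-nearly-full", "disk_used_pct", 90.0, clear_tmp_files)]
    # Illustrative sample readings, not real measurements.
    fleet = {"web-1": {"disk_used_pct": 93.0}, "web-2": {"disk_used_pct": 41.0}}
    for outcome in evaluate(rules, fleet):
        print(outcome)
```

In this sketch the decision a human would make ("is the disk too full, and if so which cleanup do I run?") lives in the rule, so the same check can run continuously across every host without anyone being paged.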
Different teams might take different approaches or gain different benefits from such systems.
When you’re able to automate task execution, it’s easier to justify taking those extra steps to ensure stability or to provide analytics that might otherwise seem like a waste of time. These extra steps often reduce execution time, shorten the impact of incidents, reduce service requests (by trusting teams to self-service and execute complex tasks), and ultimately protect your workforce from the trials and tribulations of technology.
Runbooks do reduce effort, but creating them is only a stop on the road. Automating runbook execution gives operators leverage to eliminate redundant tickets and maintenance work, even as environments scale in size and complexity.
Shoreline is building out-of-the-box remediations based on our beta customer feedback, and you can join the beta here. We’d love to learn about the challenges you’re facing in your infrastructure as we continue to build.