tl;dr -- The growing fleet sizes and complexity of production environments have created an explosion in on-call incidents. You can dramatically reduce on-call fatigue and improve availability using Shoreline’s incident automation platform.
When I started building software, we used Waterfall. Back in those days, I’d spend two months writing a design doc, four months coding, and maybe wait another six months for the release.
The Internet and SaaS changed all that. Suddenly, we were releasing once a week. Faster releases meant testing needed to be done faster and better, leading to QA automation using Agile and pipelines.
The cloud moved us to microservices, 10x larger fleets, and 10x more deployments. This required automating configuration and deployment, using CI/CD, GitOps, and Infrastructure-as-Code.
At AWS, where I ran transactional database and analytic services, on any given day, we’d be doing 6-12 production deployments of services to the hosts in our AZs and Regions. In 2014, Werner Vogels disclosed that Amazon was doing 50M deployments each year to development, testing, and production hosts. By now, everyone knows how to approach automating deployments.
So, why does production operations feel harder than ever?
You can’t choose when a disk fills up, a JVM goes into a hard GC loop, a certificate expires, or any of the thousand issues that happen in production operations. Keeping the lights on is a 24x7 problem.
It’s really tough. At AWS, I’d see our fleets growing far faster than the service teams operating them. Without automation to squash tickets, on-call would grow longer and ticket queues deeper. It’s not just AWS - my friends at GCP and Azure describe the same thing. If they’re struggling with this, what chance do the rest of us have?
There are many observability and incident management tools out there. They’re good at telling you what’s going on in your systems and at assembling the people needed to fix issues. These are necessary parts of your production ops toolchain.
But I never got excited by one more dashboard to look at or one more process optimization tool telling me what to do next. I did get excited when someone told me that an issue we would see again and again had now been automated away.
To automate incidents, we need to solve two big problems. For new issues, we often lack telemetry, and SSHing into node after node to find the needle in the haystack takes time we don’t have. For repetitive issues, safely automating the repair can be a months-long dev project - and who has time for that when there are hundreds of such issues out there?
We start with the belief that operators know how to administer a single box - the challenge is extending that to diagnosing and repairing a large fleet. We created an elegant DSL called Op that provides a simple pipe-delimited syntax, integrating real-time resources and metrics with the ability to execute anything you can run at the Linux command prompt. Now, you can run simple one-liners to debug and fix your fleet in about the time it takes to debug a single box. You tell us what to do; we figure out how to run it in parallel, distributed across your fleet.
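The parallel fan-out idea can be sketched as follows. This is a minimal illustration assuming a plain list of host names and a local shell stand-in for a real per-host agent or SSH connection - it is not Shoreline’s actual implementation, and it does not use Op syntax:

```python
# Sketch: run one command across a "fleet" in parallel.
# Hypothetical stand-in: each host's command runs in a local shell,
# where a real system would dispatch over SSH or to a host agent.
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_on_host(host, command):
    # In a real fleet this would execute on `host`; here it runs
    # locally so the sketch is self-contained.
    result = subprocess.run(["sh", "-c", command],
                            capture_output=True, text=True, timeout=30)
    return host, result.stdout.strip()

def run_fleet(hosts, command, max_workers=32):
    # Fan the same one-liner out to every host in parallel and
    # gather the per-host results as {host: output}.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda h: run_on_host(h, command), hosts))
```

The point of the abstraction is that the operator writes the single-box command once, and the platform handles distribution, parallelism, and result aggregation.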
We also made it easy to create remediation loops that check for issues, collect diagnostics, and apply repairs automatically in the background on your hosts - each and every second! These are defined using the same Op language used in incident debugging. This eliminates the difference between in-the-moment debugging and automation. That matters because, to make a meaningful dent in repetitive incidents, it can’t take longer to fix something once and for all than it takes to fix it once.
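The check-diagnose-repair loop described above amounts to something like this sketch, assuming the three steps are supplied as callables (a hypothetical structure, not Shoreline’s actual remediation engine):

```python
# Sketch of a background remediation loop: evaluate a condition, and
# when it fires, capture diagnostics before applying the repair.
# (Hypothetical structure; step functions are caller-supplied.)
import time

def remediation_loop(check, collect_diagnostics, repair,
                     iterations, interval=1.0):
    fired = 0
    for _ in range(iterations):
        if check():
            collect_diagnostics()  # snapshot state while the issue is live
            repair()               # then apply the fix automatically
            fired += 1
        time.sleep(interval)  # re-evaluate on a fixed cadence
    return fired
```

A disk-full check paired with a log-pruning action would slot in as the `check` and `repair` callables; capturing diagnostics before repairing is what preserves the evidence for later root-cause analysis.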
Over time, you’ll build a collective memory for production ops since the resource queries, metric queries, and actions are all named and easily accessible right in the UI and CLI your operators use to debug, repair, and automate.
When I describe what we’re doing here at Shoreline, people often ask me why it doesn’t exist already. I get it. Shoreline is the tool I wish I’d had at AWS - it would have saved endless hours of repetitive work.
It’s actually a really hard problem to solve. Probably the hardest I’ve worked on in my career! Let’s look at some of the problems you need to solve...
There’s a lot more inside Shoreline, but that should give you a sense of the platform.
Automated remediation is no substitute for root cause analysis. But you still need to alleviate the immediate problem facing your customers while your dev team schedules the root cause fix. If you go to the ER with a heart attack, it’s the wrong time to hear about your diet or high cholesterol.
And operator fatigue is a real problem. Repetitive work done by hand carries roughly a 1% error rate, so doing this work manually while waiting for a code fix brings meaningful risk. That’s why we automated testing, configuration, and deployment - and that’s why we also need to automate repetitive incidents.
Shoreline also makes root-causing transient issues much easier by capturing all the debugging information whenever a bad condition is observed.
Our goal at Shoreline is to radically increase system availability and reduce operator toil through incident automation. Today, we’re launching publicly, but we’re just getting started. We’ll keep improving our automation engine and language. We support Kubernetes and AWS VMs today and are building Azure and GCP support - we’re eager to provide cross-cloud debugging and automation! We’re also building out-of-the-box Op Packs for the commonplace issues operators see - we have several already and look forward to shipping a hundred more, so operators at one company benefit from what others have already seen.
Want to learn more or try it out? Reach out to me at anurag@shoreline.io.