“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Google SRE Handbook, Ch 5.
The Automation Stack
As you can see in the diagram below, there are many layers in the automation stack found in a modern operations engineer’s toolkit. In this blog, we are going to focus on the top layer, ‘event-driven automation’, an area that is rapidly emerging as a focus for innovation.
Firstly, what do we mean by an “event”? In ITIL terminology, an “event” is a “change of state that has significance for the management of an IT service or other configuration item (CI)”. This is just a fancy way of saying, “something happened that you probably need to be aware of…”. In itself, an event isn’t good or bad. You’ll often see events ranked in a hierarchy, from informational to warning to error and so on.
Monitoring events (perhaps with a monitoring tool like Datadog) is a key part of day-to-day operations, helping to ensure that these events don’t lead to “incidents”, i.e. “an unplanned interruption to or quality reduction of an IT service”. Monitoring tools let you set all sorts of rules to determine which events are and aren’t incidents, and whether to raise an “alert”. An alert normally involves some sort of visual indication on a dashboard (turning an icon amber or red) and some type of notification (email, SMS, Slack message, mobile push notification, etc.). The alert normally means someone has to investigate and take corrective action if necessary.
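To make the idea of alerting rules concrete, here is a minimal sketch in Python. It is purely illustrative and is not how Datadog or any other monitoring tool works internally: the Severity levels, the Event fields, and the threshold parameters are all assumptions made up for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical event hierarchy: informational -> warning -> error."""
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Event:
    """A change of state worth knowing about (field names are assumptions)."""
    source: str
    metric: str
    value: float


def classify(event: Event, warn_at: float, error_at: float) -> Severity:
    """Map a metric reading onto the severity hierarchy using two thresholds."""
    if event.value >= error_at:
        return Severity.ERROR
    if event.value >= warn_at:
        return Severity.WARNING
    return Severity.INFO


def should_alert(severity: Severity) -> bool:
    """Example rule: informational events are recorded but never raise an alert."""
    return severity >= Severity.WARNING


if __name__ == "__main__":
    reading = Event(source="AppX", metric="cpu_percent", value=91.0)
    severity = classify(reading, warn_at=75.0, error_at=90.0)
    print(severity.name, should_alert(severity))
```

The point is simply that an event only becomes an alert once a rule says it matters; everything below the threshold stays informational.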
Eliminating toil via automation
What if we could automate that action, and eliminate some of the toil for the support teams? Enter event-driven automation tools like Puppet Relay.
Relay, and similar tools, use the concepts of triggers and workflows to automate routine activities. When an event occurs (the trigger), it starts a workflow. The workflow is a series of automated steps, run one after another, that perform the desired actions in response. (For a wider overview of the event-driven automation market, read this blog post from Puppet’s Kenaz Kwa.)
Relay, like business-focused automation tools before it such as Zapier or IFTTT (“IF-This-Then-That”), provides a library of reusable triggers and actions. In Relay’s case, these triggers and actions are specific to the needs of DevOps & Cloud teams. These building blocks can be combined into workflows via a drag-and-drop interface (or via YAML code). For example, a common trigger might be a Datadog alert firing for AppX. The actions in response might be “Send a message to Slack”, then “scale out AppX by adding a new server” (autoscaling) and then “close the alert in Datadog”.
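The trigger-and-workflow pattern from that example can be sketched in a few lines of Python. This is a toy model, not Relay’s implementation or its YAML syntax: the Workflow class, the event fields, and the three step functions are hypothetical names invented for illustration, and the steps only record what they would do rather than calling the real Slack, cloud, or Datadog APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, str]
Step = Callable[[Event, List[str]], None]


@dataclass
class Workflow:
    """A trigger predicate plus an ordered list of steps (hypothetical model)."""
    name: str
    trigger: Callable[[Event], bool]
    steps: List[Step] = field(default_factory=list)

    def handle(self, event: Event) -> List[str]:
        """Run every step in order if the trigger matches; return the action log."""
        log: List[str] = []
        if self.trigger(event):
            for step in self.steps:
                step(event, log)
        return log


# Stub steps: each one logs the action instead of calling a real service.
def notify_slack(event: Event, log: List[str]) -> None:
    log.append(f"slack: alert {event['alert_id']} fired for {event['app']}")


def scale_out(event: Event, log: List[str]) -> None:
    log.append(f"scale: added one server to {event['app']}")


def close_alert(event: Event, log: List[str]) -> None:
    log.append(f"datadog: closed alert {event['alert_id']}")


appx_workflow = Workflow(
    name="appx-scale-out",
    trigger=lambda e: e.get("source") == "datadog" and e.get("app") == "AppX",
    steps=[notify_slack, scale_out, close_alert],
)

if __name__ == "__main__":
    alert = {"source": "datadog", "app": "AppX", "alert_id": "example-1"}
    for line in appx_workflow.handle(alert):
        print(line)
```

An event that does not match the trigger (say, an alert for a different app) produces an empty log, which mirrors how a workflow simply stays idle until its trigger fires.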
Automating these actions helps ensure that emerging issues are handled quickly and effectively, with minimal downtime or disruption for users. This eases the burden of 24×7 availability by reducing the number of events that require out-of-hours engineering support. Fewer call-outs lead to less stress and burnout for IT staff, and increase job satisfaction.
Puppetize Digital 2020
If you want to learn more about event-driven automation, join us as we sponsor Puppetize Digital 2020 on Nov 19th. Come along to the DevOpsGroup booth to find out more about DevOps and how we can accelerate your Cloud & DevOps journey.