The Top 10 DevOps Operational Requirements

Introduction

One of the key tenets of DevOps is to involve operations teams in the full software development lifecycle (SDLC) and ensure that ‘operational requirements’ (ORs, formerly known as ‘non-functional requirements/NFRs’) are incorporated into the design and build phases.

But how does this work in practice? To help answer that question we have compiled this list of the Top Ten DevOps Operational Requirements. We’re pretty sure that if you deliver on them, your operations people will be very happy indeed.

10 – Instrumentation

Would you drive a car with a blacked-out windscreen and no speedometer? No, didn’t think so, but operations teams are often expected to run applications in production in pretty much the same fashion.

Instrumenting your application with metrics and performance counters gives the operations people a way to know what’s happening before the application drives off a cliff. Some basic counters include things like ‘transactions per second’ (useful for capacity) and ‘transaction time’ (useful for performance).
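
As a rough illustration, the two counters mentioned above could be kept in-process like this. It is a minimal sketch, not a recommendation of any particular metrics library: `TransactionMetrics` and its method names are hypothetical, and the class is not thread-safe.

```python
import time
from contextlib import contextmanager


class TransactionMetrics:
    """Hypothetical in-process counters for throughput and latency."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the counters can be tested deterministically.
        self._clock = clock
        self._started = clock()
        self.count = 0
        self.total_seconds = 0.0

    @contextmanager
    def timed(self):
        """Time one transaction and record it, even if it raises."""
        start = self._clock()
        try:
            yield
        finally:
            self.count += 1
            self.total_seconds += self._clock() - start

    def transactions_per_second(self):
        """Throughput since the object was created (useful for capacity)."""
        elapsed = self._clock() - self._started
        return self.count / elapsed if elapsed > 0 else 0.0

    def average_transaction_time(self):
        """Mean seconds per transaction (useful for performance)."""
        return self.total_seconds / self.count if self.count else 0.0
```

A request handler would wrap its work in `with metrics.timed():` and expose the two values through whatever monitoring endpoint operations already scrape.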

9 – Keep Track of Dependencies

“Oh yeah, I forgot to mention that it needs [dependency XYZ] installed first,” or “Yes, the system relies on [some third-party web service], can you just open up firewall port 666 right away?”

We all understand that modern web apps rely on lots of third-party controls and web services; why re-invent the wheel if someone’s already done it, right? But please keep track of the dependencies and ensure they are clearly documented (and ideally checked into source control along with your code where possible). Nothing derails live deployments like some dependency that wasn’t documented and has to be installed or configured at the last moment. It’s a recipe for disaster.
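
One way to keep the documented list honest is to make it machine-checkable. The sketch below assumes a small dependency manifest kept in source control; the manifest layout, the entries in `DEPENDENCIES` and the function name are all hypothetical, and it only covers top-level Python modules and executables on the PATH.

```python
import importlib.util
import shutil

# Hypothetical manifest; in practice this would be checked in next to the code.
DEPENDENCIES = {
    "python_modules": ["json"],
    "executables": ["python3"],
}


def missing_dependencies(manifest):
    """Return a human-readable entry for each dependency that is not installed."""
    missing = []
    for name in manifest.get("python_modules", []):
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            missing.append(f"python module: {name}")
    for name in manifest.get("executables", []):
        # shutil.which returns None when the executable is not on the PATH.
        if shutil.which(name) is None:
            missing.append(f"executable: {name}")
    return missing
```

Running `missing_dependencies(DEPENDENCIES)` as a pre-deployment step would surface a forgotten dependency before the release window rather than during it.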

8 – Code Defensively and Degrade Gracefully

Related to #9 above: don’t assume that dependencies are always present, particularly when dealing with network resources like databases or web services. This matters even more in cloud environments, where entire servers are known to vanish in the blink of an eye.

Make sure the system copes with missing dependencies, logs the error and degrades gracefully should this situation arise.
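
A minimal sketch of that behaviour, assuming the dependency is reached through a callable that raises an `OSError` subclass (such as `ConnectionError` or `TimeoutError`) when it is unavailable. The helper name `call_with_fallback` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("app")


def call_with_fallback(operation, fallback, description):
    """Call a dependency; on a network-style failure, log it and return the fallback."""
    try:
        return operation()
    except OSError as exc:
        # ConnectionError and TimeoutError are both subclasses of OSError.
        logger.error("%s unavailable, degrading gracefully: %s", description, exc)
        return fallback
```

For example, a page could call `call_with_fallback(fetch_recommendations, [], "recommendation service")`, where `fetch_recommendations` is a hypothetical function, and simply render without recommendations when that service is down.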

7 – Backward/Forward Compatibility

Existing codebase with new database schema or stored procedure? New codebase with existing database schema or stored procedures?

Either way, forwards or backwards, it should work just fine, because if it doesn’t you introduce ‘chicken and egg’ dependencies. What this means for operations is that one part of the system has to be taken offline in order to upgrade the other part, which can impact customers and probably involves reams of paperwork to gain approval.
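
On the code side, one common way to get there is to read data tolerantly: supply a default for a column the old schema lacks, and ignore columns the code doesn’t know about yet. The sketch below assumes rows arrive as dictionaries; the `email_verified` column and the function name are hypothetical.

```python
def customer_from_row(row):
    """Build a customer record from a row produced by either the old or the new schema."""
    return {
        "id": row["id"],
        "name": row["name"],
        # 'email_verified' is a hypothetical column added by the new schema;
        # default it so this code still works while the old schema is live.
        "email_verified": row.get("email_verified", False),
        # Any columns not listed here are ignored, so a newer schema
        # does not break this (older) code either.
    }
```

With this shape, the database and the application can be upgraded in either order without taking one of them offline for the other.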

6 – Configurability

Where possible, avoid hard-coding values into the code; they should be in external configuration files that are loaded (and cached) at system initialisation. This is particularly important as we move the application between environments (Dev, Test, Staging etc.) and need to configure the application for each environment.

That said, an ‘overly configurable’ system can also create a support nightmare as tracking down a setting which has been misconfigured can be extremely painful.
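
A small sketch of both points, assuming a JSON file per environment: defaults live in one place, the file overrides them, and an unrecognised key is rejected so that a mistyped setting fails loudly at start-up. The key names in `DEFAULTS` and the function name are hypothetical.

```python
import json

# Hypothetical settings; every supported key is listed here with its default.
DEFAULTS = {"db_host": "localhost", "db_port": 5432, "cache_ttl_seconds": 60}


def load_config(path, defaults=DEFAULTS):
    """Load per-environment overrides from a JSON file and merge them over the defaults."""
    with open(path, encoding="utf-8") as handle:
        overrides = json.load(handle)
    unknown = set(overrides) - set(defaults)
    if unknown:
        # A mistyped key would otherwise be silently ignored.
        raise KeyError(f"unknown configuration keys: {sorted(unknown)}")
    return {**defaults, **overrides}
```

The application would call `load_config` once at initialisation and keep the returned dictionary, rather than re-reading the file on every request.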

5 – Feature Flags

A special case of configurability that deserves its own rule is ‘feature flags’.

We’re big fans of feature flags because they give operations a lot of control over how the application works. This can be used to (1) easily back out something that isn’t working without having to roll back the entire code base and (2) to help control performance and scalability.
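
In its simplest form a feature flag is just a named boolean that the code checks before taking a new path. The sketch below assumes the flags are supplied as a dictionary (for instance from the configuration file); `FeatureFlags`, `search` and the `new_search` flag are hypothetical names.

```python
class FeatureFlags:
    """Hypothetical holder for named on/off switches."""

    def __init__(self, flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        # Unknown flags are treated as off, so a missing entry fails safe.
        return bool(self._flags.get(name, False))

    def set(self, name, enabled):
        """Flip a flag at runtime, e.g. from an operations console."""
        self._flags[name] = bool(enabled)


def search(query, flags):
    """Route to the new implementation only while its flag is on."""
    if flags.is_enabled("new_search"):
        return f"new:{query}"
    return f"old:{query}"
```

If the new search misbehaves in production, operations can call `flags.set("new_search", False)` and traffic returns to the old path without a code rollback.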

4 – Horizontal Scalability (for all Tiers)

Everyone wants the product to be a success with customers, but it’s important not to waste money by over-provisioning the infrastructure upfront (we also want to be able to scale up/down if we have a spiky traffic profile).

So, the application needs to support horizontal scalability, which should be considered during application design. For example:

Don’t tie user/session state to a particular web/application server (use a shared session state mechanism).
Support for read-only replicas of the database (e.g. a separate connection string for ‘read’ versus ‘write’).
Support for multi-master or peer-to-peer replication (to avoid a bottleneck on a single master server if the application is likely to scale beyond a reasonable server specification). Think very carefully about how the data could be partitioned across servers, and about the use of IDENTITY/AUTO_INCREMENT columns, which can generate conflicting keys when more than one server accepts writes.
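
The read-replica point above can be sketched as a small router that hands out a separate connection string for reads and writes. This is an illustration under assumptions: `ConnectionRouter` is a hypothetical name, the connection strings are placeholders, and round-robin is just one possible way to spread reads.

```python
import itertools


class ConnectionRouter:
    """Pick a connection string based on whether the caller only needs to read."""

    def __init__(self, write_dsn, read_dsns):
        self.write_dsn = write_dsn
        # Fall back to the primary when no replicas are configured.
        self._read_cycle = itertools.cycle(read_dsns or [write_dsn])

    def dsn_for(self, readonly):
        """Reads rotate across replicas; writes always go to the primary."""
        return next(self._read_cycle) if readonly else self.write_dsn
```

Because the application asks the router rather than hard-coding one database, operations can add or remove replicas by changing configuration alone.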

3 – Automation and ‘Scriptability’

One of the key tenets in the CALMS DevOps Model is A for Automation (Culture-Automation-Lean-Metrics-Sharing if you want to know the others).

We want to automate the release process as much as possible, for example by packaging the application into versioned releases, or by taking an infrastructure-as-code approach using tools like Puppet and Chef for the underlying hardware. But this means that things need to be scriptable.

When designing the solution (and choosing dependencies) constantly ask yourself “Can these easily be automated for installation and configuration?”

2 – Robust Regression Test Suite

Something we love almost as much as feature flags is a decent set of regression test scripts that can be run on-demand to check/verify/validate that everything is running correctly in production.

Maintaining automated test scripts can be onerous and painful BUT automated testing is vital to an automation strategy. We need to be able to verify that an application has been deployed correctly, either as part of a software release or scaling out onto new servers, in a way that doesn’t involve laborious manual testing. Manual testing doesn’t scale!

The ideal test suite will exercise all the key parts of the application and provide helpful diagnostic messaging if something isn’t working correctly. We can combine this with instrumentation (remember #10 above), synthetic monitoring, application performance management (APM) tools (e.g. AppDynamics), infrastructure monitoring (e.g. SolarWinds) etc. This creates a comprehensive alerting and monitoring suite for the whole system.
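
A minimal sketch of an on-demand check runner with diagnostic messaging, assuming each check is a plain function that raises when something is wrong. The function name `run_checks` and the idea of passing checks as a name-to-function dictionary are assumptions made for the example.

```python
def run_checks(checks):
    """Run every named check and return one diagnostic message per failure."""
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            # Keep going so one run reports every broken part, not just the first.
            failures.append(f"{name}: {type(exc).__name__}: {exc}")
    return failures
```

An empty list means the deployment looks healthy; anything else can be logged or raised as an alert straight after a release or a scale-out.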

The goal is to ensure that we know something is wrong before the customer does.

1 – Documentation

Ideally, developers and operations teams should collaborate on product-centric documentation. A wiki platform like Atlassian’s Confluence can work well, as it gives both teams one shared place to create and maintain documentation that’s relevant to everyone.

As a minimum, operations teams need:

A high-level overview of the system (the big picture), probably in a diagram
Details on every dependency
Details on every error message
Details on every configuration option/switch/flag/key etc.
Instrumentation hooks, expected values
Assumptions, default values, etc.

Our top ten list of operational requirements is by no means exhaustive, but hopefully it gives you a good place to start.