One of the key tenets in DevOps is to involve the Operations teams in the full software development life cycle (SDLC) and in particular to ensure that “operational requirements” (“OR’s”, formerly known as “non-functional requirements”, “NFR’s”) are incorporated into the design and build phases.
In order to make your life easier the DevOpsGuys have scoured the internet to compile this list of the Top Ten DevOps Operational Requirements (ok, it was really just chatting with some of the guys down the pub BUT we’ve been doing this a long time and we’re pretty sure that if you deliver on these your Ops people will be very happy indeed!).
#10 – Instrumentation
Would you drive a car with a blacked out windscreen and no speedo? No, didn’t think so, but often Operations are expected to run applications in Production in pretty much the same fashion.
Instrumenting your application with metrics and performance counters gives the Operations people a way to know what’s happening before the application drives off a cliff.
Some basic counters include things like “transactions per second” (useful for capacity) and “transaction time” (useful for performance).
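As a rough illustration, the two counters above could be collected with something as small as the sketch below. The `TransactionMetrics` class and its method names are our own invention for this example; a real application would more likely publish these numbers through whatever metrics or performance-counter library the platform already provides.

```python
import time
from contextlib import contextmanager


class TransactionMetrics:
    """Hypothetical in-memory counters: transaction count and total time."""

    def __init__(self):
        self.started_at = time.monotonic()
        self.count = 0
        self.total_seconds = 0.0

    @contextmanager
    def timed(self):
        """Time one transaction and record it, even if it raises."""
        start = time.monotonic()
        try:
            yield
        finally:
            self.count += 1
            self.total_seconds += time.monotonic() - start

    def transactions_per_second(self):
        """Throughput since the counters were created (useful for capacity)."""
        elapsed = time.monotonic() - self.started_at
        return self.count / elapsed if elapsed > 0 else 0.0

    def average_transaction_time(self):
        """Mean seconds per transaction (useful for performance)."""
        return self.total_seconds / self.count if self.count else 0.0


metrics = TransactionMetrics()

for _ in range(3):
    with metrics.timed():
        time.sleep(0.01)  # stand-in for real work
```

Wrapping each unit of work in `with metrics.timed():` keeps the instrumentation out of the business logic, which makes it cheap to add everywhere.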
#9 – Keep track of the Dependencies!
“Oh yeah, I forgot to mention that it needs [dependency XYZ] installed first” or “Yes, the system relies on [some 3rd party web service], can you just open up firewall port 666 right away?”
Look, we all understand that modern web apps rely on lots of 3rd party controls and web services – why re-invent the wheel if someone’s already done it, right? But please keep track of the dependencies and make sure that they are clearly documented (and ideally checked into source control along with your code where possible). Nothing derails live deployments like some dependency that wasn’t documented and that has to be installed/configured/whatever at the last moment. It’s a recipe for disaster.
#8 – Code defensively & degrade gracefully
Related to #9 above – don’t always assume the dependencies are present, particularly when dealing with network resources like databases or web services and even more so in Cloud environments where entire servers are known to vanish in the blink of an Amazon’s eye.
Make sure the system copes with missing dependencies, logs the error and degrades gracefully should the situation arise!
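Here is a minimal sketch of what "logs the error and degrades gracefully" might look like in practice. The recommendation service, the `fetch_recommendations` wrapper and the fallback list are all hypothetical names made up for the example; the point is only the shape of the try/log/fall-back pattern.

```python
import logging

logger = logging.getLogger("recommendations")


def fetch_recommendations(user_id, fetch, fallback=()):
    """Call a remote dependency; log the failure and fall back if it is down.

    `fetch` is any callable that talks to the (hypothetical) remote service.
    """
    try:
        return list(fetch(user_id))
    except (ConnectionError, TimeoutError) as exc:
        # Tell Operations what went wrong, then carry on with a reduced page.
        logger.error("recommendation service unavailable for user %s: %s",
                     user_id, exc)
        return list(fallback)


def broken_service(user_id):
    """Simulates a dependency that has vanished."""
    raise ConnectionError("connection refused")


degraded = fetch_recommendations(42, broken_service, fallback=["bestsellers"])
```

The customer sees a slightly less personalised page rather than an error, and the log entry tells us exactly which dependency went missing.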
#7 – Backward/Forward Compatibility
Existing code base with new database schema or stored procedure?
New code base with existing database schema or stored procedures?
Either way, forwards or backwards, it should work just fine because if it doesn’t you introduce “chicken and the egg” dependencies. What this means for Operations is that we have to take one part of the system offline in order to upgrade the other part… and that can mean an impact on our customers and probably reams of paperwork to get it all approved.
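One way to stay compatible in both directions is for the code to check what the schema actually offers instead of assuming it. The sketch below uses SQLite from the Python standard library purely for illustration; the `customers` table and the newer `email` column are invented for the example, and other databases expose their schema differently.

```python
import sqlite3


def load_customer(conn, customer_id):
    """Read a customer whether or not the newer `email` column exists yet."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(customers)")}
    email_expr = "email" if "email" in columns else "NULL"
    row = conn.execute(
        f"SELECT id, name, {email_expr} FROM customers WHERE id = ?",
        (customer_id,),
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "name": row[1], "email": row[2]}


# Old schema: no email column.
old_db = sqlite3.connect(":memory:")
old_db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
old_db.execute("INSERT INTO customers VALUES (1, 'Ada')")

# New schema: email column added.
new_db = sqlite3.connect(":memory:")
new_db.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
new_db.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")
```

The same `load_customer` runs against either database, so the code and the schema can be upgraded in whichever order suits the release.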
#6 – Configurability
I once worked on a system where the database connection string was stored in a compiled resource DLL.
Every time we wanted to make a change to that connection string we had to get a developer to compile that DLL and then we had to deploy it… as opposed to simply just editing a text configuration file and re-starting the service. It was, quite frankly, a PITA.
Where possible avoid hard-coding values into the code; they should be in external configuration files that you load (and cache) at system initialisation. This is particularly important as we move the application between environments (Dev, Test, Staging etc) and need to configure the application for each environment.
That said, I’ve seen systems that had literally thousands of configuration options and settings, most of which weren’t documented and certainly were rarely, if ever, changed. An “overly configurable” system can also create a support nightmare as tracking down which one of those settings has been misconfigured can be extremely painful!
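For the "load (and cache) at system initialisation" advice, a sketch along these lines is usually enough. The JSON format, the `db_connection_string` key and the temporary file are assumptions made for the demo; in a real deployment the file would be laid down per environment by your release tooling.

```python
import json
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_config(path):
    """Read the JSON config once; later calls reuse the cached result."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Demo only: write a throwaway config file standing in for the Staging one.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    json.dump({"db_connection_string": "Server=staging-db;Database=app"}, handle)
    demo_path = handle.name

config = load_config(demo_path)
os.remove(demo_path)  # the cached copy survives; the file is read only once
```

Changing the connection string is now a text edit and a service restart, with no developer and no recompiled DLL involved.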
#5 – “Feature Flags”
A special case of configurability that deserves its own rule – “feature flags”.
We freakin’ love feature flags.
Why? Because they give us a lot of control over how the application works, which we can use to (1) easily back out something that isn’t working without having to roll back the entire code base and (2) help control performance and scalability.
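A feature flag does not need to be sophisticated to be useful. The sketch below is a deliberately tiny, hypothetical version: the `FeatureFlags` class, the `recommendations` flag and `render_homepage` are illustrative names, and in practice the flag values would come from configuration or a flag service instead of a hard-coded dictionary.

```python
class FeatureFlags:
    """Hypothetical flag store; unknown flags default to off."""

    def __init__(self, flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        return bool(self._flags.get(name, False))

    def set(self, name, enabled):
        """Flip a flag at runtime, e.g. from an Operations console."""
        self._flags[name] = bool(enabled)


def render_homepage(flags):
    """Build the page, skipping the expensive section when its flag is off."""
    sections = ["header", "catalogue"]
    if flags.is_enabled("recommendations"):
        sections.append("recommendations")
    return sections


flags = FeatureFlags({"recommendations": True})
with_feature = render_homepage(flags)

flags.set("recommendations", False)  # back out the feature, no redeploy
without_feature = render_homepage(flags)
```

Defaulting unknown flags to "off" means a missing or misspelled flag fails safe.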
#4 – Horizontal Scalability (for all tiers).
We all want the Product to be a success with customers BUT we don’t want to waste money by over-provisioning the infrastructure upfront (we also want to be able to scale up/down if we have a spiky traffic profile).
For that we need the application to support “horizontal scalability” and for that we need you to think about this when designing the application.
3 quick “For Examples”:
- Don’t tie user/session state to a particular web/application server (use a shared session state mechanism).
- Support for read-only replicas of the database (e.g. a separate connection string for “read” versus “write”)
- Support for multi-master or peer-to-peer replication (to avoid a bottleneck on a single “master” server if the application is likely to scale beyond a reasonable server specification). Think very carefully about how the data could be partitioned across servers, use of IDENTITY/AUTO_INCREMENT columns etc.
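To make the read-replica example concrete, here is a rough sketch of routing queries to a separate "read" or "write" connection string. The `ConnectionRouter` class and the connection names are hypothetical, and the "starts with SELECT" test is a simplification: it ignores replication lag, transactions and statements such as `SELECT ... FOR UPDATE`, all of which a real router would have to consider.

```python
import itertools


class ConnectionRouter:
    """Pick a connection string: replicas for reads, primary for writes."""

    def __init__(self, write_dsn, read_dsns):
        self.write_dsn = write_dsn
        # With no replicas configured, reads simply go to the primary.
        self.read_dsns = list(read_dsns) or [write_dsn]
        self._next_read = itertools.cycle(self.read_dsns)

    def dsn_for(self, sql):
        words = sql.split()
        is_read = bool(words) and words[0].upper() == "SELECT"
        # Round-robin reads across the replicas; everything else hits primary.
        return next(self._next_read) if is_read else self.write_dsn


router = ConnectionRouter("primary", ["replica-1", "replica-2"])
first_read = router.dsn_for("SELECT * FROM orders")
second_read = router.dsn_for("select * from customers")
write = router.dsn_for("UPDATE orders SET status = 'shipped'")
```

If the application is built this way from the start, adding another replica becomes a configuration change and not a code change.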
#3 – Automation and “scriptability”
One of the key tenets in the CALMS DevOps Model is A for Automation (Culture-Automation-Lean-Metrics-Sharing if you want to know the others).
We want to automate the release process as much as possible, for example by packaging the application into versionable releases or by taking the “infrastructure-as-code” approach using tools like Puppet & Chef for the underlying “hardware”.
But this means that things need to be scriptable!
I can remember being reduced to using keystroke macros to automate the (GUI) installer of a 3rd party dependency that didn’t have any support for silent/unattended installation. It was a painful experience and a fragile solution.
When designing the solution (and choosing your dependencies) constantly ask yourself the question “Can these easily be automated for installation and configuration?” Bonus points if, in very large scale environments (1,000s of servers), you can build in “auto-discovery” mechanisms where servers automatically get assigned roles, service auto-discovery (e.g. http://curator.apache.org/curator-x-discovery/index.htm) etc.
#2 – Robust Regression Test suite
Another thing we love, almost as much as “feature flags”, is a decent set of regression test scripts that we can run “on-demand” to help check/verify/validate everything is running correctly in Production.
We understand that maintaining automated test scripts can be onerous and painful BUT automated testing is vital to an automation strategy – we need to be able to verify that an application has been deployed correctly, either as part of a software release or “scaling out” onto new servers, in a way that doesn’t involve laborious manual testing. Manual testing doesn’t scale!
The ideal test suite will exercise all the key parts of the application and provide helpful diagnostic messaging if something isn’t working correctly. We can combine this with the instrumentation (remember #10 above), synthetic monitoring, Application Performance Management (APM) tools (e.g. AppDynamics), infrastructure monitoring (e.g. SolarWinds) etc to create a comprehensive alerting and monitoring suite for the whole system.
The goal is to ensure that we know something is wrong before the customer!
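An on-demand check script with helpful diagnostics can start out very small. In the sketch below the check names and the two check functions are placeholders we made up; real checks would exercise the key parts of your own application.

```python
def run_smoke_checks(checks):
    """Run every named check; return a list of (name, ok, message) tuples."""
    results = []
    for name, check in checks.items():
        try:
            check()
            results.append((name, True, "ok"))
        except Exception as exc:
            # Keep going, and keep the diagnostic so Ops know where to look.
            results.append((name, False, f"{type(exc).__name__}: {exc}"))
    return results


def check_database():
    """Placeholder: a real check would run a trivial query."""


def check_payment_gateway():
    """Placeholder that simulates a failing dependency."""
    raise ConnectionError("payment gateway timed out after 5s")


results = run_smoke_checks({
    "database": check_database,
    "payment_gateway": check_payment_gateway,
})
failures = [(name, message) for name, ok, message in results if not ok]
```

Because one failing check does not stop the others, a single run after a release or a scale-out gives the full picture of what is and isn't working.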
#1 – Documentation
Contrary to popular belief we (Operations people) are quite happy to RTFM.
All we ask is that you WTFM (that’s W as in WRITE!)
Ideally we’d collaborate on the product-centric documentation using a Wiki platform like Atlassian Confluence as we think that this gives everyone the easiest and best way to create – and maintain – documentation that’s relevant to everyone.
As a minimum we want to see:
- A high-level overview of the system (the “big picture”) probably in a diagram
- Details on every dependency
- Details on every error message
- Details on every configuration option/switch/flag/key etc
- Instrumentation hooks, expected values
- Assumptions, default values, etc
Hopefully this “Top Ten” list will give you a place to start when thinking about your DevOps “Operational Requirements” but it’s by no means comprehensive or exhaustive. We’d love to get your thoughts on what you think are the key OR’s for your applications!