Cloud Operating Models: How To Expand Skillsets, Practices And Procedures

Introduction

Moving to the cloud is a complex undertaking. It’s far more than a tech-centric re-platforming exercise. When we look holistically at cloud adoption, the technical aspects are only the tip of the iceberg.

An Adaptive Operating Model

If you’re serious about cloud (and modernising to remain relevant in the digital economy), you will also need to change how you operate your IT services. The key is to create an IT organisation that can adapt to the changing needs of the business and its customers in today’s fast-moving digital economy.

In the diagram below, we’ve outlined how some of the more mature cloud-based businesses we work with handle this, at a super high level. It’s not a target model in itself – the significance of each element differs depending on an organisation’s size, stage and structure. However, it provides a useful starting point from which to consider what a high-performance, adaptive operating model might look like.

You might choose to call it a Cloud, DevOps or Digital Operating Model, or something else entirely. Whatever you call it, it’s likely to have similar key features:

Product-based teams, not projects
Self-service platforms
Modern Ops & SRE
Automation as the key to everything
Agility and adaptation

Let’s look at these in turn.

Products, not Projects

The core of the model is long-lived teams aligned to ‘products’. A product is a set of self-contained goods and/or services that delivers value to your stakeholders (internal or external) and has a product lifecycle that needs to be managed over time. This contrasts with the more traditional and temporary ‘project teams’ that exist to deliver a certain outcome and/or while budget remains for that work.

Product teams should be highly autonomous, operating as independently as possible from external decisions, expertise or systems. Every additional external dependency the team relies on compounds the chance that work is delayed. You can never eradicate dependencies entirely, but your organisational model, operating model and architecture should all seek to minimise them.
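To see why dependencies compound, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each external dependency delivers on time independently and with the same probability; real dependencies are rarely that tidy, and the function name and the 90% figure are hypothetical.

```python
def on_time_probability(p_each: float, dependencies: int) -> float:
    """Chance that every dependency delivers on time.

    Assumes dependencies are independent and equally reliable,
    which is a simplification made for illustration only.
    """
    if not 0.0 <= p_each <= 1.0:
        raise ValueError("p_each must be between 0 and 1")
    if dependencies < 0:
        raise ValueError("dependencies must not be negative")
    return p_each ** dependencies


if __name__ == "__main__":
    # Hypothetical figure: each dependency is on time 90% of the time.
    for n in (0, 1, 3, 5, 10):
        print(f"{n:>2} dependencies: {on_time_probability(0.9, n):.0%} chance of no delay")
```

Under these assumptions, the chance of an undisturbed delivery shrinks with every dependency added, which is why the organisational model and architecture should both work to keep the count low.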

Product teams ideally have a ‘you build it, you run it’ mandate. As well as keeping dependencies low, it’s amazing how quickly the right behaviours emerge when we see a unified team with a mandate to make customers happy (by constantly delivering value), rather than one team battling to ship more features faster (Dev) and one battling for 100% uptime (Ops).

But even if the team doesn’t own the complete value stream, ongoing product ownership, with success measures linked to growth, customer happiness or the P&L, is beneficial. It aligns the interests of the team with the interests of the business and the customer.

This has a surprising impact on happiness and motivation of team members. When people are reconnected with customers and see the impact their work has on business success, morale skyrockets, taking productivity and staff retention with it.

If you have a start-up or a very small number of teams operating in this way, you can probably stop reading here. But as the number of teams scales, the next two features of our high-level model gain ever-increasing importance.

Self-Service Platforms

As soon as you move beyond a small handful of product teams, it’s beneficial to start introducing self-service platform capabilities. With multiple autonomous teams it’s easy to get a lot of reinvention, which means wasted time. The job of the platform teams is to prevent separate product teams reinventing the wheel, by providing capabilities that help them deliver their work.

Depending on your needs, you might have one or more teams with mandates covering:

Continuous delivery
Monitoring and observability
Security and compliance
Cloud infrastructure platform
Test automation
Data etc.

In effect, platform teams are just product teams that provide a product for internal customers. It’s important to think of the product delivery teams as customers because the platform teams should (for the most part) have no mandate to control what the product teams do or what tools they use (remember the importance of autonomy for the product team). Product teams will choose to use self-service capabilities offered by the platform teams because they’re simpler and more effective than re-inventing something which must then be maintained.

To reinforce this, it works best when combined with an enterprise open/shared source model. The platform teams provide transparency into how the self-service platforms are built and maintained by sharing the source code via a repository. The most common version control system is Git; hence the vernacular ‘git repo’ to describe where the code is stored.

The platform teams act as maintainers and core committers for this repo, but anyone can contribute to it. If a product team finds they need to extend the capabilities of something a platform team has created, they shouldn’t wait for the platform team to do it. They have access to simply make the change, use it, and submit it back to the platform team so that others can benefit too.

Modern Ops & SRE

The operations layer is typically the second challenge that starts to emerge with multiple product delivery teams. Even where teams fully embrace the ‘you build it, you run it’ mandate, incidents are going to emerge that need to be managed across multiple teams and/or aren’t the responsibility of any individual team. What’s more, when operating at scale, there is an increasing need for specialist operational expertise which might not sit naturally in any one team.

This is where some level of operational service may be beneficial. As a basic triage function, the primary mandate of this team will usually be to coordinate incidents across multiple product teams. It may also take on resolution where this can be done easily, in order to protect delivery teams from the scourge of unplanned work. And if anyone’s getting a full-on 24×7 mandate (i.e. shift work, not just being on-call) then these guys are the obvious candidates.

But be careful, as this can lead to the re-emergence of a dreaded DevOps antipattern: the siloed Ops team.

So how do you offer some degree of shared operational capability without the emergence of silos, poor quality code and deployment hell?

Enter Site Reliability Engineering (SRE).

SRE offers a very effective way to separate primary operational responsibility (but not accountability) away from the product team. This protects product teams from unplanned work, allowing them to focus on user feature development.

It’s a massive topic that can’t be fully covered here. But there are some major features that differentiate SRE from a traditional operations approach:

1. SREs run a ‘shared responsibility model’ with the product team.

They don’t become accountable for uptime and availability after transition, they just take on a lot of the work associated with it. The product team may (or may not) stay partially responsible for day-to-day operations, perhaps sharing on-call duties, so they don’t lose sight of their product’s operability needs. And if availability or performance becomes a problem, feature work goes on hold and the product team is drafted straight back in to help fix things.

2. A major focus for SREs is ‘making tomorrow better than today’.

Ideally they spend 50% of their time eliminating ‘operational toil’ (which basically means automating tasks that a machine can do better than a human). In this way SREs aren’t just an operations team, they’re building an automated operations capability. Over time, they’ll be able to take on more and more responsibility. But only if this split is maintained. Which is why it’s so important that they can push work back to the product team if it overwhelms their project work.

3. SREs embrace risk (in a sensible, controlled and structured way).

Each product will have a so-called error budget, which essentially covers allowable downtime that won’t meaningfully impact customers. There’s a lot of good thinking in how you measure this as a very reasonable proxy for customer experience. And you should make sure you spend the budget. Failure to do so points to an overly cautious approach, a lack of experimentation or an over-engineered system. If a product team isn’t spending its error budget, there’s probably scope to reduce costs or experiment more freely. But equally, overspending means unhappy customers, and that requires working proactively with the SREs to get back inside budget.
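As a minimal sketch of the arithmetic behind an error budget, assuming the budget is defined purely as allowable downtime against an availability target over a fixed window (many teams measure failed requests instead). The function names and the figures used are illustrative, not taken from any particular SRE toolkit.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowable downtime in the window for an availability target such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be greater than 0 and at most 1")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime_so_far: timedelta) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo, window)
    if budget == timedelta(0):
        raise ValueError("a 100% target leaves no error budget to spend")
    return 1.0 - downtime_so_far / budget


if __name__ == "__main__":
    window = timedelta(days=30)
    print("Budget:", error_budget(0.999, window))
    print("Remaining:", budget_remaining(0.999, window, timedelta(minutes=10)))
```

A 99.9% target over 30 days works out at roughly 43 minutes of allowable downtime. A remaining fraction that stays close to 1 month after month suggests room to experiment more freely; a negative one is the signal to pause feature work and get back inside budget.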

‘As code’ Automation – the Key to Everything

To fully unlock the benefits of cloud computing, organisations need to embrace modern automation tools and techniques. As discussed above, it’s this automation that removes wasted time and effort (toil), enabling product and platform teams to focus on creating value.

Core to modern automation is the ‘software-defined’ model – software-defined networks, software-defined infrastructure and so forth. The shorthand ‘as code’ suffix is often used to describe this approach. For example:

Infrastructure as code
Configuration as code
Policy as code

It’s this ‘as code’ automation approach that enables teams to share reusable patterns and templates (in the ‘git repo’ described earlier). It also underpins the ability to dynamically adapt the capacity of your cloud environment to current customer demands. And it’s this dynamic rightsizing and provisioning of environments that drives the cost savings of a cloud-hosted model.
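To make dynamic rightsizing a little more concrete, here is a simplified sketch of a scaling rule expressed as code. The ScalingPolicy class, its fields and the numbers are hypothetical; in practice this logic usually lives in your cloud provider’s autoscaling configuration rather than in hand-written code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical capacity rule that can be versioned and reviewed like any other code."""
    requests_per_instance: int  # load one instance is assumed to handle comfortably
    min_instances: int          # floor, kept for resilience
    max_instances: int          # ceiling, kept for cost control


def desired_instances(policy: ScalingPolicy, current_rps: float) -> int:
    """Instances needed for the current demand, clamped to the policy's limits."""
    if policy.requests_per_instance <= 0:
        raise ValueError("requests_per_instance must be positive")
    if current_rps < 0:
        raise ValueError("current_rps must not be negative")
    needed = math.ceil(current_rps / policy.requests_per_instance)
    return max(policy.min_instances, min(policy.max_instances, needed))


if __name__ == "__main__":
    policy = ScalingPolicy(requests_per_instance=100, min_instances=2, max_instances=10)
    for rps in (0, 450, 5000):
        print(f"{rps} requests/s -> {desired_instances(policy, rps)} instances")
```

Because the rule is plain code, it can sit in the shared repo, be reviewed and tested, and be reused by other teams. Capacity follows demand down as well as up, which is where the cost benefit comes from.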

Agility and Adaptation

All of this requires an agile mindset. Organisations need to embrace agile ways of working that constantly adapt to changing circumstances. As Eric Ries describes in The Lean Startup, the key to success is Build-Measure-Learn. Focus on small, incremental improvements and experiments that help you get customer feedback sooner, not multi-year projects that often fail to deliver value.

The best cloud operating models involve a series of initiatives that dovetail and reinforce each other over time. They deliver benefits individually, but the cumulative effect is truly transformational. In this way, the business is equipped with the adaptiveness and agility to thrive in the context of ever-changing demands that characterise the digital economy.