DevOpsGroup CPO, Stephen Thair, hosts a roundtable at WinOps 2018 where he discusses the future of operations and operability.
Date: 16th November 2018 | Duration: 41:27
Hello. What’s this talk about? It’s not a talk. I have no slides. I have no presentation. I actually want you guys to actually contribute to this. So, sucked in! You all came to a thing where you are actually expected to do something.
Basically I’ve been having lots of conversations with various people, Microsoft and various other companies, about what’s the role of the I.T. professional in a Cloud world? Where’s the future going? Because my suspicion is that there are a lot of I.T. people out there that don’t actually really know where their career is going at all. We knew the on-premise world. We could hug our servers.
There’s actually a really good presentation at Ignite by a guy who’s an SRE at Microsoft. Works for Microsoft IT. He gives this whole presentation about being a server hugger. How he loves hugging his servers and now he can’t hug his servers anymore.
So really this is market research for me. I want to try and understand what you guys are doing and how you’re managing systems in the Cloud. My suspicion is that when we had on-premise we had ITIL and we had all these really clearly defined roles. I was the SAN guy, the storage guy. I was the core networking guy or girl. Think of all those traditional roles that we have. My history, I was the IIS guy. That was my thing. I was the app server layer and doing all the IIS stack. That was my technical passion from IIS 3 through to IIS 7. But I don’t think that people know where these roles are going in the future.
I really just wanted to get some of these people together to have a conversation. To see if you guys are thinking the same thing. For those people who are running systems in the Cloud where do they see the evolution of the patterns and practices that they’re currently doing? Then I’m going to draw a mind map and see if we can have a conversation.
So, my first question is who is working in a company that is running Cloud systems in production at scale? Okay. That’s awesome. Well put it this way if it goes down your company loses a lot of freakin money. That’s my definition of scale. You might only have one server.
My biggest claim to fame was I built an IIS platform for BNP Paribas in 1999 that did a hundred and twenty million euro bond issue and the site was up for one day. It was literally spin up an ASP, Active Server Pages, spin up this website. It did this bond trade and then we took the whole thing down. We spent a quarter of a million on the platform. They didn’t care. They’d made shedloads of money out of the bond transactions. So yeah it’s really about value.
So, I guess my first question for those people who put their hands up is are you still following an ITIL framework model way of managing your systems?
Parts of ITIL. Release management, not so much. So we don’t really implement it fully. So, maybe not ITIL, but the same tenets.
Okay, so Incident Management you are definitely doing. Still, you’re doing some sort of release management.
Change control is harder with Cloud. Harder to keep on top of.
Does anybody else have any? Are you following a framework? Do you have an operations framework that you’re following in managing your systems? Or are you just making stuff up as you go along?
We’re definitely making it up as we go along.
Okay. So there’s a lot of making it up. How’s that working out for you?
We’re quite small so I think we get away with a lot that you wouldn’t get away with in a much larger organisation.
Small so we can. So that’s good. What do you think that you do that you wouldn’t be able to get away with in a larger organisation?
I think, in a smaller organisation, that the risks and the questions you have to answer are fewer and less awkward. It’s inherent in a smaller organisation that if you fail less bad things happen, and they happen on a smaller scale.
Is everybody familiar with the concept of VaR, value at risk? So it’s a way of basically calculating if I’m ASOS and I’m doing 3 billion a year. Forty-five orders a second at peak. Average order value of £120. I think we calculated something like three or four million for a 15 or 20 minute outage. There are basically metrics on how you calculate this. This is Value at Risk and the value of downtime and the value of reputational loss and the value of all these kind of things.
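As a rough illustration of the downtime arithmetic described here, the sketch below multiplies the quoted peak order rate by the quoted average order value and the outage length. The function name and the assumption that every order during the outage is simply lost are illustrative, not something stated in the session. Because it uses the peak rate throughout, it gives an upper bound, which is why it lands a little above the "three or four million" recalled in the room.

```python
def outage_cost(orders_per_second, average_order_value, outage_minutes):
    """Revenue exposed during an outage.

    Assumes (for illustration only) that orders arrive at a constant rate
    and that every order during the outage is lost rather than deferred.
    """
    return orders_per_second * average_order_value * outage_minutes * 60


# Figures quoted in the discussion: 45 orders/s at peak, £120 average order value.
for minutes in (15, 20):
    cost = outage_cost(45, 120, minutes)
    print(f"{minutes} minute outage at peak rate: £{cost:,.0f}")
```

A fuller value-at-risk figure would add the reputational and follow-on costs mentioned above, which this sketch deliberately leaves out.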
Is everybody familiar with GDPR? No vendor has come knocking on your door to hand you the magic GDPR in a box solution. There was a great section at the DevOps Enterprise Summit in Las Vegas where the lady, who is the Chief Risk Officer for Nike, she put up there Nike’s total value of 34 billion or something in revenue a year. Total possible fine under GDPR 4 percent of worldwide revenues. One point something billion was the value at risk. Then she worked it down, chances if something goes wrong we get caught. 100 percent. If we’re not compliant we’re going to get caught. Then work that down that they get the maximum fine with about 50/50 or something. So she worked out that her value at risk was like five hundred and thirty million dollars. That was the value of screwing GDPR up. The negative benefit basically. The Negative ROI. So if I’ve got five hundred and whatever million risk I can afford to spend a fair bit of money to get it fixed.
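The chain of reasoning in that example (maximum fine, times the chance of being caught, times the chance of receiving the maximum fine) can be sketched as a simple expected-loss calculation. The function and parameter names below are hypothetical, and the inputs are the rounded figures as recalled in the session. With exactly those rounded inputs the product is about 680 million rather than the roughly 530 million quoted, which suggests the numbers recalled here are approximate.

```python
def expected_fine(annual_revenue, max_fine_rate, p_caught, p_max_fine):
    """Expected regulatory loss: the maximum fine weighted by two probabilities.

    All four inputs are assumptions supplied by whoever is assessing the risk.
    """
    max_fine = annual_revenue * max_fine_rate
    return max_fine * p_caught * p_max_fine


# Rounded figures as quoted: ~34 billion revenue, 4% maximum fine,
# 100% chance of being caught, roughly 50/50 chance of the maximum fine.
max_fine = 34e9 * 0.04
risk = expected_fine(34e9, 0.04, p_caught=1.0, p_max_fine=0.5)
print(f"Maximum fine:  {max_fine / 1e9:.2f} billion")
print(f"Value at risk: {risk / 1e6:.0f} million")
```

Whatever the exact inputs, the point made in the room holds: a nine-figure expected loss justifies a substantial remediation budget.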
So I know you obviously Ian. You were talking about one of your goals for 2019 was to start to push more of this DevOps practice into operations. Do you feel that they understand what their job roles are in the future and where the Cloud operations are going? Do they have a Cloud operations framework?
No. We’re at a fork in the road at the moment. We’ve still got an operations function that’s fairly traditional. Still sticking to their ITIL principles. Like to see things done in the old way. But you’ve got part of that function that is actually starting to schism. They are starting to realise that actually that isn’t reality anymore. So you’ve got the guys that understand the virtual world. Understand the infrastructure as code and all that side of the world. They are starting to pivot. They are realising that actually there is a different way of doing this.
You’ve still got the other side of the function who remain quite traditional. Trying to almost maintain those barriers between operations and development and engineering. They still see themselves as being masters of everything, coordinating key activities. Actually, they’ve got less and less knowledge about those inherent systems than they ever had before.
Beforehand they used to be the people that knew exactly what were all of the servers because those servers were probably inhabited by seven or eight different kinds of applications, that came from multiple different areas. They were the only people that understood them. Whereas now they haven’t spun that infrastructure up. They’ve never seen that infrastructure. That infrastructure could be being destroyed on a regular basis. Based on a technology stack that they’ve never seen before or heard of in their lives. Yet they’re still trying to maintain the illusion that they’re still controlling that world. And it’s quite interesting for us.
So we don’t have the gatekeepers now. You don’t have that single person. Everyone is sort of equal. You don’t have anyone controlling that server.
I think the other side is our systems are so complex. It used to be Windows services. There used to be so many of them. They used to sit on three big SQL servers. The entire company used to sit on one big database that was actually replicated out for different things. I could literally put my arm around it and say that is ASOS when I joined the company. Whereas now we’ve got something like 700 VMs in the Cloud. We’ve got an infinite number of PaaS services. We’ve got 80,000 billable items in Azure. No one person can understand that. It’s physically impossible. You’ve had to drive the platforms out. You’ve had to drive that knowledge about for incident management and issue management, performance management. All that stuff has to go out to the platforms. And that means the operations teams have to operate fundamentally differently in what they do.
It reminds me of a session. I don’t know if any of you went to Pipeline a couple of years ago when Dan North did the keynote? He did a keynote called Ops and Operability. It was excellent. It was hilarious. He was talking about this problem where we’ve gone down the road of cross-functional teams that own their product and their services all the way, but we still have one shared operations team at the end of it that looks after the thing. There was a day when the operations team could define your platform, saying you’ll use SQL Server and you’ll use C#. But once you give so much power to the product teams, it means the operations team are supporting way too much stuff that no one can understand.
So it’s saying if you’re going to go down this road you need to work out how to organise. You need to not make one operations team responsible for everything. The thing is, I guess, that either you do it entirely or you don’t do it at all. Going a way where you have individual product teams that have to come up with their own infrastructure and one general operations team has to manage an infrastructure they have no part in making. That is probably a way doomed to fail.
That’s why we’re in that schism. Operations are just waking up to that. I think they’re playing catch up a little bit. We’re now redefining the boundaries of where operations and engineering teams own stuff. So the operations team used to own the physical server. They used to probably own the operating system. They used to own the patching. And then you used to put your software on top of it and that’s where you drew the line between the two.
I think we’re now drawing the line so that operations might own the underlying networking strategy of an implementation within Azure. And they might own that bit. And then the teams build the infrastructure on top of that. So now that line is starting to emerge. Whereas up to now the teams also owned some of that underlying infrastructure. We’ve got some problems because of that. Because they are not experts in networking.
The operations team have basically got a choice. Either if they want to focus on the hardware they need to end up seconding over to a product team and being the hardware expert in the product team. Or if they want to focus on the networking they stay in the operations team. Like, do they need to make an active choice which way they are going to go?
I think we’re going to give them a third choice. A couple of guys that drive up the templating, for example, might become more of a platform engineer like my team are. But yes we are starting to see those people getting direction. You can still be on the physical hardware side. We still have physical networks. We still have printers and servers. So some people are going in that direction because that’s the world they know. We’ve got people that are gravitating towards teams because actually they want to go and get closer to the operational aspect of platform teams. They can add value in that space. Or they’re becoming much more of a DevOps-y shared services function where they set the policies, strategy, enterprise-level ARM templating. All that stuff the teams can then consume. That’s the three areas we’re starting to shift into. What we haven’t got yet is a strategy for that.
It’s interesting. I used to work for Redgate. I don’t work there anymore. They used to do the kind of thing that most companies do. Every couple of years as a cycle they would have a reorganisation where they basically reassess where they are at. What the company needs now. Then they’ll try and put all the people into it. Which was always a horrible and stressful time for everybody at Redgate as they were afraid of being made redundant or put into a role they didn’t want to do. They always tried very hard to make sure everyone was happy. He always used to make the point that once you get beyond about 12 people in your company there is no good way to organise it.
Take a breath because I want to unpack. You’ve all made a lot of points and you’re going too far away and I have to bring it back. There’s a couple of things that came out of that, that I wanted to have a conversation about. Somebody said they’re no longer the gatekeepers. Do you think that the product team, the multidisciplinary team that is building the software, in an SRE model there’s an argument that they do not have an unfettered right to production? The only path to production is the SRE team can say yes or no. Because if you exceed your error budget your access to production gets cut off. So the question then becomes should those product teams have an unfettered right to production or is it partly operations’ role to say actually we are still the gatekeepers of production?
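The error-budget rule mentioned in that question can be sketched in a few lines, assuming a request-based availability target. The function names, the 99.9% target and the request counts are all illustrative assumptions, not figures from the session.

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Failures still allowed in this window before the SLO is breached.

    The budget is the fraction of requests the SLO permits to fail,
    e.g. an SLO of 0.999 allows 0.1% of requests to fail.
    """
    allowed_failures = (1 - slo) * total_requests
    return allowed_failures - failed_requests


def may_release(slo, total_requests, failed_requests):
    """Gate: releases continue only while some error budget remains."""
    return error_budget_remaining(slo, total_requests, failed_requests) > 0


# Hypothetical month: one million requests against a 99.9% target,
# so roughly 1,000 failed requests are allowed.
print(may_release(0.999, 1_000_000, failed_requests=400))    # within budget
print(may_release(0.999, 1_000_000, failed_requests=1_300))  # budget exceeded
```

The gate is mechanical rather than personal: nobody decides case by case, the agreed target does.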
That feels like a contradiction to me.
Okay, so the reason I say that is because somebody else used the phrase guardrails. So one of the things I know Microsoft’s working very hard on is stuff like Azure Blueprints. GitHub is doing GitHub blueprints. You have Azure Policy. You have HashiCorp Sentinel. AWS has policy frameworks. Everybody’s trying to look at a way to say we are going to have these policy frameworks.
Who’s ever taken their kids ten pin bowling? You have those inflatable things that they put up in the gutter. So what happens if you go a bit too far? You’ll bump into this thing and you’ll bounce back in the right way. So one of the things that we’ve been talking about was that the role of operations, as it evolves, is that you are the people who ultimately build those guardrails. You build those templates and a way to make sure that the Ops teams can’t veer off. Because at the end of the day you still have PCI compliance. If you’re in financial services you might have Sarbanes-Oxley. Or if you’re in healthcare in the US you’d have HIPAA. There are many many others. FCA compliance, GDPR. There’s still lots of compliance stuff and that’s non-negotiable. We have to get that right. Does that feel more like the operations of the future is going to be more focused on building the guardrails and then saying this is the playground in which you can operate?
That’s what we’re investing heavily in doing right now. The Azure Policy stuff. I call it safety nets. That’s what we’re trying to do. We’re trying to build these in so teams inherently can only do the right things. Within a certain scope. So they can provision any VM they like from Azure. As long as it comes from a gold image. We know it’s going to be inherently secure. It’s going to have all the right patching applied to it. If they go into a subscription we can enforce tagging so we know where that asset comes from, what it’s got in it. Those things and we can pay for it.
But that’s the other side of it. We’re investing heavily in that. We’re working heavily with the infrastructure team around those kinds of areas. So we’re talking about things like golden subscriptions now. So we can spin up a subscription, we’ve got underlying networking. It’s already built-in for you. You just build on top of it. So you don’t have to worry about that piece. But we know we’ve got the governance by design. We haven’t got to be policemen. We know it’s there inherently.
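The two guardrails described here, an approved ("gold") base image and enforced tagging, amount to a check that runs before anything is provisioned. The sketch below shows the idea in plain Python. It is not the Azure Policy API, and the image names, tag names and `VmRequest` type are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical policy data; a real estate would load this from its own source.
APPROVED_IMAGES = {"gold-windows-2016", "gold-ubuntu-1804"}
REQUIRED_TAGS = {"owner", "cost-centre"}


@dataclass
class VmRequest:
    name: str
    image: str
    tags: dict = field(default_factory=dict)


def violations(request):
    """Return a list of guardrail violations; an empty list means allowed."""
    problems = []
    if request.image not in APPROVED_IMAGES:
        problems.append(f"image '{request.image}' is not an approved gold image")
    missing = sorted(REQUIRED_TAGS - set(request.tags))
    if missing:
        problems.append("missing tags: " + ", ".join(missing))
    return problems


good = VmRequest("web-01", "gold-ubuntu-1804",
                 {"owner": "web-team", "cost-centre": "1234"})
bad = VmRequest("test-99", "random-marketplace-image", {"owner": "someone"})

print(violations(good))  # no violations
print(violations(bad))   # unapproved image and a missing tag
```

Teams stay free to choose any size or shape of VM; the check only rejects requests that fall outside the agreed scope.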
So you want to be the architects and the builders and then the product teams can decorate anything inside. You get the three-bedroom house, the four-bedroom house, the five-bedroom house and then how you decorate it inside is up to you. But you don’t get to build a completely different house.
That kind of gives us that trade-off between enterprise and platform. You have what the Enterprise are responsible for. You let the teams get on with what they need to get on with. You’ve got some confidence they’re not going to go off the reservation too far.
Anybody in this side of the room want to have a comment?
You mention different teams. I feel it should be more of a partnership. The Ops team maybe should be in charge of server-level governance but they should be working together with their engineering team or their development team. Those PCIs are a joint responsibility. Governance is a joint responsibility. It’s not the responsibility of one team. It’s everybody’s responsibility at all levels. Having people work together in a DevOps role or an SRE role. More of a product delivery team.
We do have product delivery teams. That’s how we work. However, there are some things that are enterprise concerns. Things like tagging. That’s an enterprise global problem across all platform teams. Those kinds of things we have a central function to build out those that we need for everybody. Things like PCI absolutely. I’ve got one of my platform engineers here today who looks after our PCI platform and that is massively a joint project between all areas actually to make sure that is compliant.
One of the things we talked about. Actually, if you go to the GitHub stand they’ve got a copy of Andy Oram’s book, the guide to insourcing through lightweight. But one of the concepts that we’re exploring, and which started to come through very heavily at DevOps Enterprise Summit Vegas, was the concept of enterprise open source, enterprise shared source, inner source, whatever you want to call it. Which is the idea that all of the source code in the organisation is public within that organisation. You have the same concept you have in an open-source product of you can submit a pull request and there are contributors and core committers and maintainers. There’s a level of hierarchy on who can approve and who can merge and stuff like that.
So I think that’s really interesting and I completely agree. There’s got to be that collaboration. But what I see is that the collaboration is implemented like an open-source project. So if you want to contribute there’s the Azure blueprint, there’s the framework, there’s the whatever, then pull request.
This idea that I’m going to be the only person who can touch this stuff. We talk to our customers a lot. Both AWS and Microsoft recommend that you start your Cloud journey by starting up a Cloud Centre of Excellence. To which I always respond I’ve worked in large scale enterprise financial services organisations, nobody from the Centre of Excellence ever actually came to help me do my job. They came to tell me how to do my job. So we’re trying to get them to turn the language now to being a Cloud enablement team. We are here to help you, enable you. We’re going to move these blockers out of the way. We’re going to move to a PaaS type environment. We’ll offer you all these services through a PaaS platform.
So you are talking about compliance etc. A lot of it is just NFRs and about ending the culture of NFRs. That begins with development. So the code you write does dictate the platform it sits on and the specs it needs. So we talk about shift left security. Have that conversation early on. So that culture of NFRs and formal security, that’s happened early on.
So when we talk about any NFRs we’ve banned the word NFRs with our customers. We talk about operational requirements. The minute you put the word non in front of it, mentally it’s a non-functional requirement. Doesn’t matter whether the non is a modifier to the word functional it’s just non, it’s not important anymore. It is literally the way people’s minds work.
So you mentioned the openness of opening like the configuration. But you do that for any platform when you start sharing it. It’s a communication issue. It’s about the communication where you have a control repo which controls your platform which is an enablement service. So whether you have a policy or you write rules like you don’t want things to happen, you know, it’s still part of this enablement. I don’t really like the approach where you have a governance team or person that is going to tell you that you can’t do this because it is going to cost too much. It’s more like I want to be able to do it and you’ve got to work with that team. But if it goes above this we should not need this, let’s just make sure we don’t fat-finger it. That’s what we do. But it’s about the communication enablement.
So who, of those people who are running systems at scale, have had that Cloud bill? Or three months later found there was a whole bunch of servers running there that could have, should have, easily been turned off three months ago and you’ve spent somewhere between thousands or tens of thousands?
I don’t know if gatekeeper is the right approach.
Yeah, I’m not saying it is.
I look for choke points.
I like that choke points.
Without the choke points, you have no control over the account access. It’s important to give everyone as much flexibility as they can with as little overhead if possible. But at a certain point, I don’t think it’s up to Ops to decide the gates but there’s a finance team, there’s a marketing team, there are support teams. The business owner group needs to be able to have control and be able to slow down or speed up.
Potentially a way that it could go is developing more in-house skills within IT like a business analyst or IT analyst and having that ROI component. Or even how you define what you are going to monitor.
So better analytical skills. Which comes back to the whole data driven. How do we become more data-driven more metrics-driven? How do we move away from vanity metrics and collect metrics that actually matter? Okay that’s good. Thank you.
I’m used to working in an environment where there are gatekeepers. I’m thinking about the absence of that role. I think whilst a project team might understand what needs to be done to implement that project, they may not understand the implications of what happens in a year’s time when it comes to billing and stuff like that. So data policy, data duplication, those kinds of ramifications. They may not have a picture of really how the business actually works. They may just have an idea of how their one small silo works.
That’s a super interesting point because it comes back to… So Marcello and I had this conversation a couple of weeks back when I was in Seattle. We were talking about operations as the systems thinkers. I take your point that your Ops team no longer has situational awareness of really understanding how those things do. I believe that somebody has to have that systems thinking big-picture situational awareness of what’s going on and build a framework. No individual delivery unit, two pizza team that we were talking about before, is going to understand the whole system. The complexity of the system goes up with the square of the number of moving parts.
There was a really interesting conversation I was having with a Forrester analyst, Charles Betz, and one of the things that he was talking about is that we’re pushing down MTTR and change failure rate. So it’s all going in the right direction. If you look at the sort of quadrant all of the known knowns of bugs we’re driving these out and our MTTR is going down. But our systems are getting more and more and more complex. So what happens is, is that when something goes wrong now it’s going to be in this quadrant of the unknown unknowns. You’re literally not going to have a clue what’s going on. So what is that going to do to your MTTR? Actually MTTR is going to go back up again. It has to. Which I really like because it makes you think.
John Allspaw and the guys in this thing called the STELLA report talk about this dark debt. There is technical debt in your application stack that you will only find when something goes wrong. So troubleshooting and finding that it’s not just a matter of redeploying the thing. There is something in there that’s broken you’ve got to find it. Does that make sense? Do you think that operations should be the systems thinkers? We should be the people who are maintaining that situational awareness?
It’s fine to have this awareness but it needs to be tied to the business. So if we call them the operations team then they are going to be much closer to the business than they are in the traditional model.
So you’re saying it’s not just enough to have an awareness of the complexity of your Kubernetes environment and the number of microservices and the dependencies and all this sort of stuff. You’re saying a modern I.T. organisation must understand the organisational goals and why I’m doing that stuff.
Because it’s about decisions, and decisions are driven by the business goals and the value that they seek. So they need to be very much aligned to the business. They have a choice to make, and if that’s going to happen they need to make the right choice for the business.
Making the right choices. Okay, that’s good. Thank you.
With systems thinking, doesn’t that exist in the SRE space too? If you are looking at chaos engineering you have to know all the moving parts of what it might affect. Do you think that that’s something that would just be centralised IT? Or a principle of the engineering space?
I think it has to be designed into the system. Obviously no person can understand the detail of everything. But if you’re designing the components and the way those components interact and those guardrails there should be a way that if this thing goes wrong you understand the causal chain. You understand the impact and the dependencies that that would have. You’ve designed the monitoring and you’ve designed the incident response and the observability to push back on that. One of the things that again is a topic we were talking about was, so who here has a Configuration Management Database, CMDB? Yeah. What the hell does a CMDB mean in a world where you’ve got containers spinning up every second, VMs spinning up every second?
It depends on what you mean by CMDBs. That’s a very loaded term.
ITIL has a very clear definition of CMDB I would like you to know! But my challenge back to Microsoft is what does a next-gen CMDB look like? If we want to understand these dependencies and have these systems thinking I think there’s a role. Just metadata tagging is not good enough. Why can’t I add custom attributes? So if each server is an object I can overload that object with custom attributes. In fact, I can overload that object with custom methods. So I can say cms dot server or sitecore dot server dot kill yourself. Then pass it the GUID of that server and then it kills itself because it understands how to kill itself. Which is a graceful degraded drained shut down or whatever. Or maybe it’s a cms dot server dot upgrade yourself. We don’t have an object-oriented representation of that environment that has these custom attributes and custom methods to do something against.
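To make the "object with custom attributes and custom methods" idea concrete, here is a minimal sketch of what such a representation might look like. Every class, method and attribute name is hypothetical; no existing CMDB product or cloud API is being described, and the shutdown steps are placeholders for real drain and stop operations.

```python
import uuid


class Server:
    """A configuration item that carries its own attributes and behaviour."""

    def __init__(self, role, **attributes):
        self.guid = str(uuid.uuid4())
        self.role = role
        self.attributes = dict(attributes)  # free-form custom attributes
        self.state = "running"

    def shutdown(self):
        """Default behaviour: stop immediately. Returns the steps taken."""
        self.state = "stopped"
        return ["stop"]


class CmsServer(Server):
    """A CMS node knows that it must be drained before it is stopped."""

    def shutdown(self):
        steps = ["remove from load balancer", "drain connections"]
        return steps + super().shutdown()


class Inventory:
    """Looks servers up by GUID and asks each one to act on itself."""

    def __init__(self):
        self._items = {}

    def register(self, server):
        self._items[server.guid] = server
        return server.guid

    def shutdown(self, guid):
        return self._items[guid].shutdown()


inventory = Inventory()
cms_guid = inventory.register(CmsServer("cms", owner="web-team", tier="front"))
print(inventory.shutdown(cms_guid))
```

The caller only needs the GUID; the knowledge of how a particular kind of server shuts down gracefully lives with the object itself.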
Again CMDB is difficult to determine here because there are different things. There are the configuration items which represent the intent that you have on the configurations, which is one thing. This is what is driving the change because the human changed the intent of your infrastructure. On the other side, which is what you were talking about, which is the result of this intent. Which is the live system that you are on.
So it’s the Chef cookbook vs. the InSpec tests. Do you know what I mean? There is the thing of this is the way I want it to be. This is how I’m going to validate that my intent has been correctly actioned. There’s an assertion and then there’s a validation. Is this all making sense by the way? If anybody is bored you’re more than welcome to leave! I won’t be offended.
There is an interesting line there around systems thinking between the Cloud and the enterprise. The platform in the enterprise as you described it. I know from my perspective working in an Azure team I have no idea what’s production!
Hopefully somebody does.
Could I serve you better building on that platform if I knew what production was? Yes. Would you tell me? Don’t know. Do you want to tell me? As a Cloud platform is it all a black box?
Everything’s production right? But there’s a different level of trust across all promotions, right? So then you don’t need the same level of trust. It’s about managing the trust across your estate. So everything is production to some extent. It should be treated like this. It’s just that the impact of having one stage to another stage, one environment to another environment doesn’t have the same impact. What’s important is the learning you take, the feedback loop. It’s the feedback loop which is the key to this. Everything is production, it just teaches you when it fails.
Okay, just another quick question because again one of the conversations I had with Marcello. We were talking about who treats the CI/CD platform as a business-critical, mission critical, level one thing within their organisation? It’s not something that’s dicked about by the developers. It’s treated the same way as your core SQL server environment. So we’ve got some yeses over there. Okay.
So we do but bizarrely our Ops function don’t. If we have problems with our production Octopus server, it’s not deemed to be a sev one by our operations team. Because apparently it is not stopping the business running. But you can’t actually fix anything if something went wrong. So that’s quite a bizarre conversation we have. I think we’ve convinced them now actually it is a priority one.
So I used to have this conversation with our IT team about email. Email is number one because if something breaks that’s how you learn it’s broken. So email has got to be as important as everything else we do, otherwise what’s the point?
Who does e-mail anymore?
This was a few years ago! Whatever you use to alert you that something is broken has to be as resilient as the thing it’s alerting on.
Yeah. And this is the point. We talk about this great change failure rate. We talk about these great rates of change and we talk about how infrastructure as code and configuration management is driving down our MTTR. But if that pipeline is dead then your MTTR is now infinite.
Does that mean I need to replicate my communication away from Slack?
Well, we made an interesting organisational choice, for GDPR reasons, to turn on the 30 day archiving or the 30-day thing in Slack where it automatically deletes every message older than 30 days. So our Slack lobotomises itself on a rolling 30-day basis. It’s really annoying when you know that somebody sent you a link to some really good article that you wanted to read and you’re only just getting around to and it’s not there anymore. But it does force you to go and put some more stuff into Confluence and long term storage so it has its pluses and minuses. Though I still find it intensely frustrating.
So we’ve talked a bit about observability, security compliance and governance, putting the guardrails in. We’ve talked about system thinking. Performance and scalability. Historically there was, certainly in enterprise organisations I worked with, there was a separate team whose job was performance. Then there was a separate team whose job was capacity planning basically. We actually used to have separate agents running on the box that were just capacity planning agents, so they could pull together all these stats about where the boxes were going. Does anybody want to offer an opinion on what does capacity planning and performance management mean in a Cloud world?
We still have problems with capacity planning because ultimately it’s still sitting on solid infrastructure at the back end, in some datacenter somewhere. If we want to spin up more of a certain type of VM and we’re on a certain saturation of that there is a limited capacity. We’ve bumped into that quite a lot with Microsoft. So it turns out the Cloud isn’t infinite.
Has anybody else ever had a situation where you’ve tried to scale, you’ve tried to instantiate an instance and the API comes back ‘no’?
At a super-low scale. So, a lot of our classes we do we spin up VMs for people. I recently swapped over to managing it myself, I used to outsource it to somebody else. I had a fairly new AWS account and wanted 30 VMs for the 30 people in my class. No sorry, you’ve got a limit of 10 as that’s the default when you start. Oh.
And there’s a 24 hour SLA on getting your limit changed.
Fortunately, I managed to sort it out just in time. But yes, it happens both at the massive end, but also at the smaller end.
If you’re running some of the more exotic instances. Some of the ones that have massive amounts of memory. If you need a lot of those, I hear tell that you have to make a phone call and ask very nicely.
We’re almost at the opposite end. We’ve got a lot of Cloud services, and those Cloud services have been existing for a long time, so we’re kind of pegged to the older infrastructure. So again there is only so much of that because a lot has been deprecated now. Again you try to scale it and it’s not there. So we’ve had to over-scale our infrastructure to guarantee that we’ve got that growth capacity. Which of course costs us more money, so we have interesting conversations with Microsoft about it.
We’re going to wrap it up there. Sorry. Hopefully, that was interesting. It was designed to be a free form conversation. I’ve learned a lot. I’m going to take a photo of this and write some of the stuff up but hopefully, you found that interesting. So thank you very much for coming along to my session.