In the technology world, Site Reliability Engineering is a hot topic. It’s all about applying software engineering practices to IT operations, with the aim of designing reliable software that can be scaled.
Marcelo Bellinaso, who leads the newly formed Cloud Ops Advocacy team in Azure engineering at Microsoft, presented a track session about SRE at WinOps 2018. He explored its relationship to DevOps, when a traditional IT operations team should adopt SRE concepts and practices, and the story of SRE evolution at LinkedIn.
Making cloud better
Kicking off the session, Bellinaso explained that his job is to spend time with ops teams, gather their feedback and requirements, and help make Azure a better cloud platform for them. A big part of this is working on SRE at Microsoft and across its product portfolio, including LinkedIn.
He also gave an insightful overview of SRE, referring to the definition from Ben Treynor at Google: “Site Reliability Engineering is what happens when you ask a software engineer to design an operations team.” When Google developed the concept in 2003, it needed a way to scale its operations team and meet global demand.
To Bellinaso, SRE is designed to take failure out of the system and allow the introduction of change. While Google was the first to adopt this concept, fast-growing companies like Twitter, Netflix, Dropbox, and LinkedIn are also using it to expand their platforms and stay at the cutting edge of technology.
It was great to see Bellinaso break down SRE and explain its core principles, the first being that it treats operations as a software problem. He described it as a shift in mindset and culture, moving away from the traditional operations ticketing system towards closer integration with the business. To succeed, SRE teams need people with diverse skillsets who can share metrics with development.
SRE teams also work to automate toil. They don’t accept that their job is just doing toil (manual work) all day, which is where lean principles become crucial. Finally, SRE uses service level objectives and error budgets to balance reliability and innovation. These principles complement each other and allow teams to innovate at speed.
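To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It is an illustration only and was not shown in the talk: the class name `ErrorBudget`, its fields, and the request figures in the note below are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: tracks an error budget for one measurement window."""

    slo_target: float      # assumed SLO, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over: (1 - target) * traffic.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        # Simplified policy: keep shipping changes while budget remains.
        return self.remaining > 0
```

With a 99.9% objective and one million requests, roughly 1,000 failures are allowed. A team that has seen 400 failures still has budget to spend on releases, while one that has seen 1,500 has overspent and, under this simplified policy, would pause risky changes and focus on reliability.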
LinkedIn’s SRE evolution
To illustrate the importance of this practice in modern organisations, Bellinaso spent a significant amount of time talking about the SRE evolution at LinkedIn. The story started in 2010, when the platform went through a period of hyperactive user growth and kept crashing.
What’s more, code would be thrown over the fence from Dev to Ops, and Ops would be forced to make these major releases out of hours with the near-certainty of incidents the following day. LinkedIn needed to strong-arm its platform back into shape and speed up innovation, so it adopted DevOps and Agile ways of working.
Bellinaso didn’t just give a quick overview of this undertaking – he discussed every stage. At the beginning, LinkedIn battled with reliability issues and a growing userbase, although every developer was empowered to do their bit to keep the site up. Operations was viewed as an engineering problem, too.
Once the foundations were put in place, the first (firefighter) stage involved a great deal of incident management. Support engineers were constantly on call because the site would go offline every morning. Everything was purely reactive to keep the company alive for another day.
The second (gatekeeper) stage saw change controls introduced. LinkedIn became reactive towards development plans and fostered a mentality of protecting “our site” from “them”. Meanwhile, the third (advocate) stage was about creating a “site up” culture and rebuilding the trusted relationships that were damaged in the first two stages.
In the fourth (partner) stage, there was empowerment for intelligent risk, joint planning with the development team, and collaboration to magnify impact. The fifth (engineer) stage resulted in reliability throughout the software lifecycle, a proactive plan for SRE development, the mentality that everyone is an engineer, a new training programme, and machine learning.
This hasn’t been an easy journey for LinkedIn, but it’s certainly paid off. Bellinaso concluded his talk by explaining that LinkedIn has topped Business Insider’s Digital Trust Rankings two years in a row. The rankings assess organisations across security, legitimacy, community, user experience, shareability, and relevance.
Visit our blog for more WinOps 2018 coverage.