Site Reliability Engineering
How do the tech giants manage their infrastructure? What can other organisations learn from them?
Mike Krieger, co-founder of Instagram, announced in 2012 that “2 backend engineers can scale a system to 30+ million users.” After the acquisition by Facebook, this team grew to 5.
Pinterest, another social media app, was able to handle 18 million users and 410 terabytes of data with a company of 12 people. How do they do it?
Cloud technologies are certainly fundamental to achieving such massive scale, but who implements, operates and maintains the services?
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. Originally pioneered as an engineering practice at Google, this principle-based approach radically rethinks traditional IT service management processes to deliver, scale and recover faster, with minimal reliance on human intervention.
SRE is growing in importance because organisations increasingly run their business on heavily used modern IT services that change constantly to meet new market needs, and these services must be highly reliable. SRE and its processes ensure high reliability, but NOT in a risk-averse manner that would prevent agile change and innovation from taking place.
Many organisations have been adopting this discipline. In Singapore, organisations that have adopted SRE include major banks, government organisations, systems integrators, digital food delivery providers and MNCs.
This course focuses on the practical application of the core SRE Principles and Practices and describes how an organisation can use Site Reliability Engineering to shift from traditional IT system administration towards high scalability.
This course is part of the Digital Agility series offered by NUS-ISS.
At the end of the course, participants will be able to:
- Explain the differences between traditional operations, DevOps and Site Reliability Engineering
- Understand the importance of the SRE Principles and apply them in preventing and resolving problems to increase reliability
- Select an appropriate organisation topology to enable SRE and high-performance IT
- Design a policy to ensure SRE practices are carried out
- Understand and apply the key components of the CI/CD pipeline, including canarying releases and how to design a release tool chain
- Design and implement SLOs, SLIs and Error Budgets
- Create a business case for shifting from incident escalation to swarming for problem resolution
- Conduct blameless post-mortems to determine root causes via deep analysis
- Design chaos experiments
- Reduce manual toil using automation tools – with runbooks and helpdesk chatbots as examples