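SRE and DevOps

Preface

While searching for concepts related to SRE and DevOps, I happened to find that the Google Cloud Blog had published an article dedicated to exactly this topic. Many Chinese translations exist, but none of them fully achieves the translator's ideal of "faithfulness, expressiveness and elegance" (信, 达, 雅). This post reproduces Google's official article and YouTube videos, together with a carefully done community translation, and mirrors the videos on bilibili (B站) for easier viewing. I hope it gives you a deeper understanding of SRE and DevOps.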

SRE vs. DevOps: competing standards or close friends?

Revision history

June 25, 2019 - First draft

Read the original - https://wsgzao.github.io/post...

Further reading

SRE vs. DevOps: competing standards or close friends? - https://cloud.google.com/blog...
DevOps and SRE - https://blog.alswl.com/2018/0...


English original

SRE vs. DevOps: competing standards or close friends?

Seth Vargo: Staff Developer Advocate
Liz Fong-Jones: Site Reliability Engineer
May 9, 2018

<iframe width="560" height="315" src="https://www.youtube.com/embed...; frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Site Reliability Engineering (SRE) and DevOps are two trending disciplines with quite a bit of overlap. In the past, some have called SRE a competing set of practices to DevOps. But we think they're not so different after all.

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

1. The difference between DevOps and SRE

It’s useful to start by understanding the differences and similarities between SRE and DevOps to lay the groundwork for future conversation.

The DevOps movement began because developers would write code with little understanding of how it would run in production. They would throw this code over the proverbial wall to the operations team, which would be responsible for keeping the applications up and running. This often resulted in tension between the two groups, as each group's priorities were misaligned with the needs of the business. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation. However, the DevOps movement does not explicitly define how to succeed in these areas. In this way, DevOps is like an abstract class or interface in programming. It defines the overall behavior of the system, but the implementation details are left up to the author.

SRE, which evolved at Google to meet internal needs in the early 2000s independently of the DevOps movement, happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas. For example, the table below illustrates the five DevOps pillars and the corresponding SRE practices:

| DevOps | SRE |
| --- | --- |
| Reduce organization silos | Share ownership with developers by using the same tools and techniques across the stack |
| Accept failure as normal | Have a formula for balancing accidents and failures against new releases |
| Implement gradual change | Encourage moving quickly by reducing costs of failure |
| Leverage tooling & automation | Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system |
| Measure everything | Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc. |

If you think of DevOps like an interface in a programming language, class SRE implements DevOps. While the SRE program did not explicitly set out to satisfy the DevOps interface, both disciplines independently arrived at a similar set of conclusions. But just like in programming, classes often include more behavior than just what their interface defines, or they might implement multiple interfaces. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface.
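To make the analogy concrete, here is a minimal Python sketch. Python has no `implements` keyword, so an abstract base class stands in for the interface. The method names and return strings are illustrative choices that paraphrase the five pillars listed above; they are assumptions for this example, not anything Google or the DevOps community defines.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps pillars as an abstract interface (illustrative names)."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete way to satisfy the interface."""

    def reduce_organizational_silos(self) -> str:
        return "share ownership with developers"

    def accept_failure_as_normal(self) -> str:
        return "balance accidents and failures against new releases"

    def implement_gradual_change(self) -> str:
        return "reduce the cost of failure"

    def leverage_tooling_and_automation(self) -> str:
        return "automate this year's job away"

    def measure_everything(self) -> str:
        return "measure availability, uptime, outages and toil"

    # A class may offer more than its interface requires.
    def customer_reliability_engineering(self) -> str:
        return "teach SRE practices to customers"


sre = SRE()
print(isinstance(sre, DevOps))  # True
```

Trying to instantiate `DevOps()` directly raises `TypeError`, which mirrors the point that DevOps describes the expected behavior without supplying an implementation.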

DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster. If you prefer books, check out How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) for a more thorough explanation.

2. SLIs, SLOs, and SLAs

The SRE discipline collaboratively decides on a system's availability targets and measures availability with input from engineers, product owners and customers.

<iframe width="560" height="315" src="https://www.youtube.com/embed...; frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

It can be challenging to have a productive conversation about software development without a consistent and agreed-upon way to describe a system's uptime and availability. Operations teams are constantly putting out fires, some of which end up being bugs in developers' code. But without a clear measurement of uptime and a clear prioritization on availability, product teams may not agree that reliability is a problem. This very challenge affected Google in the early 2000s, and it was one of the motivating factors for developing the SRE discipline.

SRE ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This process includes individual contributors at every level, all the way up to VPs and executives, and it creates a shared responsibility for availability across the organization. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

  • SLIs are metrics over time such as request latency, throughput of requests per second, or failures per request. These are usually aggregated over time and then converted to a rate, average or percentile subject to a threshold.
  • SLOs are targets for the cumulative success of SLIs over a window of time (like "last 30 days" or "this quarter"), agreed-upon by stakeholders.
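As a rough illustration of the two definitions, the sketch below computes an availability SLI and a latency SLI from a window of request records and compares them with an SLO target. The `Request` type, the function names and the sample numbers are all assumptions made for this example; real systems usually derive SLIs from monitoring data rather than in-memory lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests in the window that succeeded."""
    if not requests:
        return 1.0  # assumption: no traffic counts as meeting the target
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served faster than the threshold."""
    if not requests:
        return 1.0
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the SLI over the window reaches the target."""
    return sli >= slo_target


# Hypothetical window: 997 fast successes and 3 slow failures.
window = [Request(120.0, True)] * 997 + [Request(900.0, False)] * 3
sli = availability_sli(window)
print(f"availability SLI: {sli:.3f}")    # 0.997
print(meets_slo(sli, slo_target=0.995))  # True
print(meets_slo(sli, slo_target=0.999))  # False
```

The same window passes a 99.5% objective and fails a 99.9% one, which is why the target has to be agreed on by stakeholders rather than picked by one team.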

The video also discusses Service Level Agreements (SLAs). Although not specifically part of the day-to-day concerns of SREs, an SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service. SLAs are usually defined and negotiated by account executives for customers and offer a lower availability than the SLO. After all, you want to break your own internal SLO before you break a customer-facing SLA.
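The relationship between the internal SLO and the customer-facing SLA can be sketched as a simple classification. The thresholds (99.9% SLO, 99.5% SLA), the 30-day window and the function names below are hypothetical values chosen for illustration only.

```python
def achieved_availability(downtime_minutes: float, window_days: int = 30) -> float:
    """Availability over the window, given total downtime in minutes."""
    window_minutes = window_days * 24 * 60
    return 1.0 - downtime_minutes / window_minutes


def classify(availability: float, slo: float, sla: float) -> str:
    """Report which promise, if any, has been broken."""
    if sla > slo:
        raise ValueError("the SLA should not be stricter than the internal SLO")
    if availability >= slo:
        return "within SLO"
    if availability >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


for downtime in (10, 50, 300):
    a = achieved_availability(downtime)
    print(downtime, classify(a, slo=0.999, sla=0.995))
```

With these numbers, 50 minutes of downtime in a month misses the SLO while the SLA still holds, which is the early warning the looser SLA is meant to leave room for.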

SLIs, SLOs and SLAs tie back closely to the DevOps pillar of "measure everything" and one of the reasons we say class SRE implements DevOps.

3. Risk and error budgets

We focus here on measuring risk through error budgets, which are quantitative ways in which SREs collaborate with product owners to balance availability and feature development. This video also discusses why 100% is not a viable availability target.

<iframe width="560" height="315" src="https://www.youtube.com/embed...; frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Maximizing a system's stability is both counterproductive and pointless. Unrealistic reliability targets limit how quickly new features can be delivered to users, and users typically won't notice extreme availability (like 99.999999%) because the quality of their experience is dominated by less reliable components like ISPs, cellular networks or WiFi. Having a 100% availability requirement severely limits a team or developer’s ability to deliver updates and improvements to a system. Service owners who want to deliver many new features should opt for less stringent SLOs, thereby giving them the freedom to continue shipping in the event of a bug. Service owners focused on reliability can choose a higher SLO, but accept that breaking that SLO will delay feature releases. The SRE discipline quantifies this acceptable risk as an "error budget." When error budgets are depleted, the focus shifts from feature development to improving reliability.
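A small worked example may help here. Assuming the SLO is expressed as an availability fraction over a fixed window, the error budget is simply the remainder. The function names and the two-outcome policy below are simplifications invented for this sketch, not a description of Google's tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def release_policy(slo_target: float, window_days: int, downtime_minutes: float) -> str:
    """Shift focus to reliability once the budget is used up."""
    remaining = error_budget_minutes(slo_target, window_days) - downtime_minutes
    return "ship features" if remaining > 0 else "focus on reliability"


budget = error_budget_minutes(0.999, window_days=30)
print(f"{budget:.1f} minutes")                          # 43.2 minutes
print(release_policy(0.999, 30, downtime_minutes=20))   # ship features
print(release_policy(0.999, 30, downtime_minutes=50))   # focus on reliability
```

A 100% target would make the budget zero, so any incident at all would halt feature work, which is one way to see why it is not a viable objective.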

As mentioned in the second video, leadership buy-in is an important pillar in the SRE discipline. Without this cooperation, nothing prevents teams from breaking their agreed-upon SLOs, forcing SREs to work overtime or waste too much time toiling to just keep the systems running. If SRE teams do not have the ability to enforce error budgets (or if the error budgets are not taken seriously), the system fails.

Risk and error budgets quantitatively accept failure as normal and enforce the DevOps pillar to implement gradual change. Non-gradual changes risk exceeding error budgets.
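To see why gradual change protects the budget, consider a deliberately simplified model: a bad release fails some fraction of requests for the users exposed to it until it is rolled back. The model, the 50% failure rate, the 60-minute rollback time and the 43.2-minute monthly budget (a 99.9% SLO over 30 days) are all assumptions for illustration.

```python
def budget_consumed_minutes(exposed_fraction: float,
                            failure_rate: float,
                            minutes_until_rollback: float) -> float:
    """Equivalent minutes of full outage caused by a bad release."""
    return exposed_fraction * failure_rate * minutes_until_rollback


MONTHLY_BUDGET_MINUTES = 43.2  # assumed: 99.9% SLO over 30 days

all_at_once = budget_consumed_minutes(1.00, 0.5, 60)  # every user gets the release
canary = budget_consumed_minutes(0.01, 0.5, 60)       # 1% of users get it first

print(f"all at once: {all_at_once / MONTHLY_BUDGET_MINUTES:.0%} of the budget")
print(f"1% canary:   {canary / MONTHLY_BUDGET_MINUTES:.1%} of the budget")
```

Under these assumptions the same bug costs roughly 69% of the monthly budget when shipped to everyone and under 1% when caught in a 1% canary.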

4. Toil and toil budgets

An important component of the SRE discipline is toil, toil budgets and ways to reduce toil. Toil occurs each time a human operator needs to manually touch a system during normal operations—but the definition of "normal" is constantly changing.

<iframe width="560" height="315" src="https://www.youtube.com/embed...; frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Toil is not simply "work I don't like to do." For example, the following tasks are overhead, but are specifically not toil: submitting expense reports, attending meetings, responding to email, commuting to work, etc. Instead, toil is specifically tied to the running of a production service. It is work that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows. Each time an operator needs to touch a system, such as responding to a page, working a ticket or unsticking a process, toil has likely occurred.

The SRE discipline aims to reduce toil by focusing on the "engineering" component of Site Reliability Engineering. When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future. While minimizing toil is important, it's realistically impossible to completely eliminate. Google aims to ensure that at least 50% of each SRE's time is spent doing engineering projects, and these SREs individually report their toil in quarterly surveys to identify operationally overloaded teams. That being said, toil is not always bad. Predictable, repetitive tasks are great ways to onboard a new team member and often produce an immediate sense of accomplishment and satisfaction with low risk and low stress. Long-term toil assignments, however, quickly outweigh the benefits and can cause career stagnation.
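The 50% guideline lends itself to a simple check. The sketch below assumes each team reports hours of toil and hours of engineering work for the quarter; the survey format, the team names and the function names are hypothetical.

```python
def toil_fraction(toil_hours: float, engineering_hours: float) -> float:
    """Share of reported time spent on toil."""
    total = toil_hours + engineering_hours
    if total == 0:
        return 0.0
    return toil_hours / total


def overloaded_teams(survey: dict[str, tuple[float, float]],
                     max_toil: float = 0.5) -> list[str]:
    """Teams whose toil share exceeds the limit, in alphabetical order."""
    return sorted(
        team
        for team, (toil, engineering) in survey.items()
        if toil_fraction(toil, engineering) > max_toil
    )


# Hypothetical quarterly survey: team -> (toil hours, engineering hours).
survey = {
    "storage": (300.0, 180.0),
    "frontend": (120.0, 360.0),
    "payments": (240.0, 240.0),
}
print(overloaded_teams(survey))  # ['storage']
```

A team sitting exactly at 50% is not flagged here, since the stated goal is that at least half of the time goes to engineering projects.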

Toil and toil budgets are closely related to the DevOps pillars of "measure everything" and "reduce organizational silos."

5. Customer Reliability Engineering (CRE)

Finally, Customer Reliability Engineering (CRE) completes the tenets of SRE (with the help in the video of a futuristic friend). CRE aims to teach SRE practices to customers and service consumers.

<iframe width="560" height="315" src="https://www.youtube.com/embed...; frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

In the past, Google did not talk publicly about SRE. We thought of it as a competitive advantage we had to keep secret from the world. However, every time a customer had a problem because they used a system in an unexpected way, we had to stop innovating and help solve the problem. That tiny bit of friction, spread across billions of users, adds up very quickly. It became clear that we needed to start talking about SRE publicly and teaching our customers about SRE practices so they could replicate them within their organizations.

Thus, in 2016, we launched the CRE program as both a means of helping our Google Cloud Platform (GCP) customers with improving their reliability, and a means of exposing Google SREs directly to the challenges customers face. The CRE program aims to reduce customer anxiety by teaching them SRE principles and helping them adopt SRE practices.

CRE aligns with the DevOps pillars of "reduce organization silos" by forcing collaboration across organizations, and it also closely relates to the concepts of "accepting failure as normal" and "measure everything" by creating a shared responsibility among all stakeholders in the form of shared SLOs.

Looking forward with SRE

We are working on some exciting new content across a variety of mediums to help showcase how users can adopt DevOps and SRE on Google Cloud, and we cannot wait to share them with you. What SRE topics are you interested in hearing about? Please give us a tweet or watch our videos.


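Chinese translation

The translation was originally written in Traditional Chinese; I converted it to Simplified Chinese and replaced the videos with bilibili mirrors.

[[Translated article] Are you looking for SRE or DevOps? (你在找的是 SRE 還是 DevOps?)](https://medium.com/kkstream/%...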

Neil Wei in KKStream
Aug 3, 2018

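My company has been hiring aggressively over the past half year, and the openings include quite a few DevOps and SRE positions. However, HR (and colleagues in other departments) do not understand how the two differ. Some even think SRE is the same as a traditional operations unit under a new name, having merely moved from managing server rooms to managing the cloud. So what exactly is the difference between the two?

This leaves a bad impression on candidates who come in to interview, as if we ourselves were not clear about what kind of people we are looking for. So I spent some time explaining the difference to HR, and after four months of supporting the SRE team I am leaving this translation here along with a few thoughts of my own.

First, remember…

SRE is a DevOps (a banana is a fruit)

DevOps is NOT a SRE (a fruit is not a banana)

DevOps is not a job title; SRE is.

(This article is translated and published with the permission of Seth Vargo, one of the original authors.)

Original article: https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html?m=1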


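Main text

Site Reliability Engineering (SRE) and DevOps are currently very popular development and operations cultures, and they have a great deal in common. Early on, some people regarded SRE as a set of practices distinct from DevOps and believed that one had to choose between the two. Today, however, most people lean toward the view that the two are actually very similar.

What exactly do SRE and DevOps have in common? Earlier this year, Google engineers (Liz Fong-Jones and Seth Vargo) prepared a series of videos to answer these questions and to help reduce disagreement between the communities. This article summarizes the topics covered in the videos and how to go about building a more reliable system in practice.


1. The difference between SRE and DevOps

Before starting, let's look at what SRE and DevOps have in common, and where they differ.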

<iframe src="//player.bilibili.com/player.html?aid=56870162&cid=99334829&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

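The DevOps culture arose because, in the early days (about ten years ago), many developers knew little about how their programs actually ran in the real world. A developer's job was to package the program and hand it over to the operations department, at which point their part of the work cycle was over. The operations department was then responsible for installing and deploying the program on every production machine, and had to use every available method and tool to keep it running normally, even though operations knew nothing about the program's implementation details.

This way of working easily sets the two departments against each other: each has its own goals, and those goals may not align with the company's business needs. DevOps emerged to bring in a new software development culture that narrows the gap between development and operations.

However, DevOps by nature does not teach you how to succeed. It lays down some basic principles and leaves everyone to work out the rest. In programming terms, DevOps is more like an abstract class or an interface: it defines what behavior this culture should have, while the implementation is decided jointly by the members of each department. Anything that conforms to this "interface" can be called a practice of DevOps culture.

The term SRE was coined by Google. It is the set of norms and practices Google developed over more than a decade to cope with its ever-growing internal systems. Unlike DevOps, it implements the abstract methods that DevOps defines, and it goes further in prescribing how to use software engineering methods, from an operations point of view, to keep systems stable. Put simply, SRE implements the DevOps interface. The five interface methods that DevOps defines, and how SRE implements each, are listed below.

DevOps: Reduce organizational silos

SRE: Share tools and ownership with the development team throughout the development cycle. (Note: infrastructure as code, configuration as code)

DevOps: Accept failure, and treat it as a normal element of the development cycle

SRE: Establish quantifiable metrics for weighing "accidents" and "failures" against new releases

DevOps: Implement gradual change

SRE: Encourage teams to deliver quickly by reducing the cost of fixing failures (that is, there is no need to get everything right in one go; change gradually instead)

DevOps: Leverage tooling and automation

SRE: Encourage teams to automate away this year's work, minimize what has to be done by hand, and put their energy into medium- and long-term system improvements.

DevOps: Everything can be measured

SRE: Believe that operations falls within the scope of software engineering, and prescribe measurements for availability, uptime, outages, toil, and so on.

If you already accept that DevOps is an "interface", then in programming-language terms:

class SRE implements DevOps

In practice the two still have many principles of their own, and SRE does not implement every DevOps concept 1:1, but they ultimately reach the same conclusions. And as in a programming language, a class that implements an interface can extend it further and can implement other interfaces as well. SRE includes many details that DevOps never defined.

In software development and operations, DevOps and SRE are not competing to become the industry standard. On the contrary, both are approaches designed to break down barriers between organizations and deliver better software faster. If you want more detail, the book How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) is worth reading.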


2. SLIs, SLOs, and SLAs

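One of the SRE principles is to provide different measurements for different roles. For engineers, PMs and customers, how available the whole system is, and how that should be measured, are each presented differently.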

<iframe src="//player.bilibili.com/player.html?aid=56870270&cid=99335415&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

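If you cannot measure a system's uptime and availability, it is very hard to operate a system that is already live. The operations team is often stuck in permanent firefighting mode, and when the root cause is finally found, it may turn out to be a problem in the code written by the development team.

Without an agreed way to measure uptime and availability, development teams often do not see "stability" as a potential problem. This issue troubled Google for many years, and it is one of the motivations for developing the SRE principles.

SRE ensures that everyone knows how to measure reliability and what to do when a service fails. This goes into enough detail that, when a problem occurs, everyone from VPs and CxOs down to each relevant employee in the organization has a defined job to do. Which "person" does which "task" is clearly specified. SRE talks with all stakeholders to decide on the Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs are metrics of system behavior over time, such as response time, throughput per second and request volume. They are usually converted into a rate or an average.

SLOs are the level that the SLIs are expected to maintain over a time window agreed with stakeholders, for example "the level the SLIs must reach each month". They are more of an internal target.

The video also discusses Service Level Agreements (SLAs), even though this is not a number SREs deal with every day. For an online service provider, an SLA is a promise to customers about the percentage of time the service keeps running. It is usually "negotiated" with the customer, for example that downtime per year (or per month) must not exceed a certain number of minutes.

The concepts of SLI, SLO and SLA closely match the DevOps idea that "everything can be measured", which is one of the reasons for saying class SRE implements DevOps.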


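3. Risk and error budgets

We assess risk with error budgets. An error budget is a quantified value describing how long a service may be unavailable per day (or per month): if the service's SLO is 99.9%, the development team effectively has a 0.1% error budget to spend. The value is a balance struck in discussion with the product owner and the development team. The video below also explains why an error budget of zero (that is, a 100% availability target) is not a suitable value.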

<iframe src="//player.bilibili.com/player.html?aid=56870355&cid=99335555&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

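Striving to keep a system at 100% availability is exhausting and pointless. An unrealistic target limits how quickly the development team can get new features into users' hands, and most users will not notice the difference anyway (for example, 99.999999% reliability), because their ISP, 3G/4G network or home WiFi is probably less reliable than that. Striving for a 100% uninterrupted service severely limits the development team's ability to ship new features. To meet such a harsh constraint, developers tend to choose not to fix bugs, not to add features and not to improve the system. Instead, some slack should be kept so that the development team has room to work.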
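The claim that users cannot perceive extreme availability can be checked with a little arithmetic. Assuming the service and the user's network fail independently and sit in series, the availability the user experiences is the product of the two. The 99% figure for the network is a made-up illustrative number, not a measurement.

```python
import math


def end_to_end_availability(*components: float) -> float:
    """Availability of components in series, assuming independent failures."""
    return math.prod(components)


network = 0.99  # assumed availability of the user's ISP or WiFi

eight_nines = end_to_end_availability(0.99999999, network)
perfect = end_to_end_availability(1.0, network)

print(f"99.999999% service: {eight_nines:.6%}")
print(f"100% service:       {perfect:.6%}")
print(f"difference:         {perfect - eight_nines:.2e}")
```

Under these assumptions the two results differ by about one part in a hundred million, far below anything a user on a 99% network could notice.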

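One of the SRE principles is to work out a tolerable "error budget". Only once this budget is exhausted should the focus shift to improving reliability rather than continuing to develop new features.

As the second video mentions, getting management to buy in to this culture is the most important thing. The SLIs are set jointly by everyone, and if people do not play by the rules, SRE is reduced to endless grunt work just to keep the system at a certain level of stability, without anyone noticing (because no standard was set), and in the end the service is bound to fail. Risk and error budgets treat failure as normal, and one way to improve is to release new features continuously and in small increments, which is also consistent with DevOps principles.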


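4. Toil and toil budgets

Another SRE principle is keeping toil under control. How can toil be reduced? And what is toil?

Operations work that is manual, repetitive and automatable,

or one-off work with no lasting value, all counts as toil.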

<iframe src="//player.bilibili.com/player.html?aid=56870600&cid=99336041&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

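Toil, however, is not "work I don't want to do". For example, a company has many recurring routine tasks, such as meetings, communication and replying to email; none of these are toil.

By contrast, manually logging in to a machine every day, fetching a file, processing it and then emailing out a report is toil, because it is manual, repetitive and automatable.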
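As a rough idea of what automating such a task could look like, the sketch below turns a log file into a short text report. Everything in it is hypothetical: the log format (one `LEVEL message` entry per line), the file name and the report layout are invented for the example, and fetching the file and sending the email are left out.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_log(path: Path) -> Counter:
    """Count log lines per level; assumes lines look like 'LEVEL message'."""
    counts: Counter = Counter()
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            counts[line.split(maxsplit=1)[0]] += 1
    return counts


def build_report(path: Path) -> str:
    """Render the per-level counts as a plain-text report."""
    counts = summarize_log(path)
    lines = [f"Daily report for {path.name}"]
    lines += [f"{level}: {n}" for level, n in sorted(counts.items())]
    return "\n".join(lines)


if __name__ == "__main__":
    # Demo with a throwaway file standing in for the fetched log.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "app.log"
        log.write_text("ERROR disk full\nINFO started\nERROR timeout\n",
                       encoding="utf-8")
        print(build_report(log))
```

Once a script like this runs on a schedule, the daily login disappears and the remaining work is maintaining the script, which is engineering rather than toil.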

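The SRE principle is to try to eliminate this kind of work using software engineering. When SREs find that something can be automated, they set about building the automation so that the same work does not have to be done again. Minimizing toil is important, but in practice it cannot be eliminated completely. Google strives to keep each SRE's day-to-day toil below 50% so that team members can spend their time on more meaningful work, and the results are reviewed every quarter.

Toil is not entirely bad, either. For new team members, starting out on these routine tasks helps them learn what the service needs, with relatively low risk and low stress. In the long run, however, no engineer should be doing toil all the time.

Toil management also fits the DevOps principles that everything can be measured and that organizational silos should be reduced.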


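5. Customer Reliability Engineering (CRE)

Personally, I feel this topic is a bit far afield for now, so I will not translate it sentence by sentence.

In short, it is about how to spread the ideas of SRE, so that GCP customers know how to use GCP services correctly, and about promoting an SRE culture.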

<iframe src="//player.bilibili.com/player.html?aid=56870699&cid=99336238&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>


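Personal afterword

My company is in fact gradually transforming, moving from a traditional model in which development and operations are independent of each other toward practicing DevOps culture. In my four months supporting the SRE department I took part in many of the challenges that come up in reality, and worked with everyone to build automated processes and reduce existing toil as we gradually move toward a DevOps culture. I hope this helps everyone understand that:

SRE is software engineering; SREs should not be merely operations staff or system administrators.

DevOps is not a job title; SRE is. It is like going to a vegetable stall at the market: you would not tell the owner you want to buy "vegetables"; you would say whether you want cabbage or bok choy!

Ideals are always perfect, though, and reality still has to be faced. Our company is not called Google, and most people cannot get into Google. A Google SRE may well write code better than most companies' software engineers, understand networking better than network engineers, and understand operations better than operations engineers. The SRE openings around us are often far less rosy than imagined: toil and manual work may still make up the majority, barriers between departments still exist, there may be plenty of SREs who cannot write code, and operations still takes up most of the daily work.

Traditional operations staff or IT network administrators who want to move toward SRE also need to change their mindset and step out of their comfort zone. In an age when everything is "as code" and everything is "as a service", those who do not write code are just waiting to be phased out.

Change is slow and has to be cultivated gradually, so let us… hm… a P0 incident just came in! That's all for now!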
