Netflix的全周期开发者—运行您构建的内容(中英双语)

原文app

The year was 2012 and operating a critical service at Netflix was laborious. Deployments felt like walking through wet sand. Canarying was devolving into verifying endurance (“nothing broke after one week of canarying, let’s push it”) rather than correct functionality. Researching issues felt like bouncing a rubber ball between teams, hard to catch the root cause and harder yet to stop from bouncing between one another. All of these were signs that changes were needed.less

2012年的时候在Netflix运营一项关键服务很费力的。部署就像在潮湿的沙地上行走。金丝雀部署变成了对耐心的测试(“一周的金丝雀测试都没问题发生因此咱们继续推动吧”)而不是正确的功能。研究问题就好像皮球同样在团队之间被踢来踢去,很难抓住根源。全部这些迹象代表须要作一些改变了。运维

Fast forward to 2018. Netflix has grown to 125M global members enjoying 140M+ hours of viewing per day. We’ve invested significantly in improving the development and operations story for our engineering teams. Along the way we’ve experimented with many approaches to building and operating our services. We’d like to share one approach, including its pros and cons, that is relatively common within Netflix. We hope that sharing our experiences inspires others to debate the alternatives and learn from our journey.dom

时间快进到2018年。Netflix的全球用户已增至1.25亿,天天观看时长超过1.4亿小时。咱们已经在改善咱们的工程团队的开发和运营方面投入了大量的资金。在此过程当中,咱们尝试了许多方法来构建和运行咱们的服务。在此,咱们愿意把其中一种咱们内部用得相对广泛的方案,包括它的优缺点拿出来跟你们分享。咱们但愿咱们的经验分享能给你们一点启发而且讨论出可能的替代方案。ide

一个团队的旅程

Edge Engineering is responsible for the first layer of AWS services that must be up for Netflix streaming to work. In the past, Edge Engineering had ops-focused teams and SRE specialists who owned the deploy+operate+support parts of the software life cycle. Releasing a new feature meant devs coordinating with the ops team on things like metrics, alerts, and capacity considerations, and then handing off code for the ops team to deploy and operate. To be effective at running the code and supporting partners, the ops teams needed ongoing training on new features and bug fixes. The primary upside of having a separate ops team was less developer interrupts when things were going well.工具

Edge Engineering(边缘工程)负责AWS服务的第一层,Netflix流媒体必须依靠这些服务才能正常工做。在过去,Edge Engineering有专一运维的团队以及SRE(网站可靠性工程师)专家,他们负责软件生命周期的部署+运营+支撑这部分。发布一个新特性意味着开发人员要在度量、警报和容量考虑事项等方面与运维团队协调,而后将代码交给运维团队进行部署和操做。为了有效地运行代码并支持合做伙伴,运维团队须要持续接受新特性和bug修复方面的培训。拥有一个单独的运维团队的主要好处是,当事情进展顺利时,开发人员的干扰更少。oop

When things didn’t go well, the costs added up. Communication and knowledge transfers between devs and ops/SREs were lossy, requiring additional round trips to debug problems or answer partner questions. Deployment problems had a higher time-to-detect and time-to-resolve due to the ops teams having less direct knowledge of the changes being deployed. The gap between code complete and deployed was much longer than today, with releases happening on the order of weeks rather than days. Feedback went from ops, who directly experienced pains such as lack of alerting/monitoring or performance issues and increased latencies, to devs, who were hearing about those problems second-hand.性能

当事情进展不顺时,成本就会增长。开发人员和ops/SREs之间的交流和信息交换是有损耗的,须要额外的往返调试问题或回答合做伙伴的问题。部署问题由于运维团队对所部署内容的更改了解较少。因此检测和解决问题须要很长的时间。代码完成与部署之间的鸿沟变得更大,发布每每是以周为量级而不是日。反馈从运维发起,这帮人直接经历了缺乏告警/监控或者性能问题及时延增长这样的痛苦,而后再传递到开发人员这里时问题已是二手了。学习

To improve on this, Edge Engineering experimented with a hybrid model where devs could push code themselves when needed, and also were responsible for off-hours production issues and support requests. This improved the feedback and learning cycles for developers. But, having only partial responsibility left gaps. For example, even though devs could do their own deployments and debug pipeline breakages, they would often defer to the ops release specialist. For the ops-focused people, they were motivated to do the day to day work but found it hard to prioritize automation so that others didn’t need to rely on them.开发工具

为了改进这一点,Edge Engineering尝试了一种混合模型,在这种模型中,开发人员能够在须要的时候本身推送代码,同时还负责非工做时间的生产问题和支持请求。这改进了开发人员的反馈和学习周期。但这会出现部分的责任不到位的问题。例如,即便开发人员能够执行他们本身的部署和调试管道中断,他们每每也会交给运维处理。对于那些专一于运维的人来讲,他们有动力去作天天的工做,可是很难会把无需别人依赖本身的自动化放在优先考虑的位置

In search of a better way, we took a step back and decided to start from first principles. What were we trying to accomplish and why weren’t we being successful?

为了寻找更好的方法,咱们退了一步,决定从第一性原理开始。咱们想要完成什么,为何咱们没有成功?

软件生命周期

The purpose of the software life cycle is to optimize “time to value”; to effectively convert ideas into working products and services for customers. Developing and running a software service involves a full set of responsibilities:

软件生命周期的目的是优化“价值时间”;为客户有效地将想法转化为工做产品和服务。开发和运行软件服务涉及一系列职责:

We had been segmenting these responsibilities. At an extreme, this means each functional area is owned by a different person/role:

咱们一直在划分这些责任。在极端状况下,这意味着每一个功能区由不一样的人/角色拥有:

These specialized roles create efficiencies within each segment while potentially creating inefficiencies across the entire life cycle. Specialists develop expertise in a focused area and optimize what’s needed for that area. They get more effective at solving their piece of the puzzle. But software requires the entire life cycle to deliver value to customers. Having teams of specialists who each own a slice of the life cycle can create silos that slow down end-to-end progress. Grouping differing specialists together into one team can reduce silos, but having different people do each role adds communication overhead, introduces bottlenecks, and inhibits the effectiveness of feedback loops.

这些专门的角色在每个细分领域内创造出了效能,可是却有可能形成整个生命周期的低效。专家在其聚焦的领域发展专业知识并针对该领域的须要进行优化。他们在解决特定领域的难题上变得愈来愈高效。可是软件须要整个生命周期来为客户提供价值。各自精通生命周期的一小块的专家团队反而可能会制造出信息孤岛,从而减慢端到端的进度。将不一样的专家分组到一个团队中能够减小信息孤岛,可是让不一样的人担任每一个角色会增长沟通开销,引入瓶颈,并抑制反馈回环的有效性。

运行构建的内容

To rethink our approach, we drew inspiration from the principles of the devops movement. We could optimize for learning and feedback by breaking down silos and encouraging shared ownership of the full software life cycle:

为了从新思考咱们的方法,咱们从devops运动的原则中得到了灵感。咱们能够经过打破信息孤岛和鼓励共享整个软件生命周期的全部权来优化学习和反馈:

“Operate what you build” puts the devops principles in action by having the team that develops a system also be responsible for operating and supporting that system. Distributing this responsibility to each development team, rather than externalizing it, creates direct feedback loops and aligns incentives. Teams that feel operational pain are empowered to remediate the pain by changing their system design or code; they are responsible and accountable for both functions. Each development team owns deployment issues, performance bugs, capacity planning, alerting gaps, partner support, and so on.

“运营你开发的东西”经过让开发系统的团队也负责系统的运营和支持来践行devops原则。把这个责任分摊给每一支开发团队,而不是外化它,这样就创建直接反馈回环而且把激励给统一块儿来。感觉到运维痛苦的团队被赋权经过改变系统设计或代码来治疗这种痛苦;他们要负责这两种职能。每一支开发团队都要负责部署问题、性能bug、能力规划、告警差别、伙伴支持等等。

经过开发工具进行扩展

Ownership of the full development life cycle adds significantly to what software developers are expected to do. Tooling that simplifies and automates common development needs helps to balance this out. For example, if software developers are expected to manage rollbacks of their services, rich tooling is needed that can both detect and alert them of the problems as well as to aid in the rollback.

对整个开发生命周期的全部权给软件开发者显著增长了负担。这就须要有简化和自动化共同开发需求的工具来减轻负担。比方说,若是软件开发者预期要管理服务的回滚的话,就要有丰富的工具既能检测到问题并予以告警,又能辅助进行回滚才行。

Netflix created centralized teams (e.g., Cloud Platform, Performance & Reliability Engineering, Engineering Tools) with the mission of developing common tooling and infrastructure to solve problems that every development team has. Those centralized teams act as force multipliers by turning their specialized knowledge into reusable building blocks. For example:

Netflix建立了集中的团队(例如,云平台、性能和可靠性工程、工程工具),其任务是开发通用的工具和基础设施来解决每一个开发团队都有的问题。这些集中的团队将他们的专业知识转化为可重用的构建块,从而起到了力量倍增器的做用。例如:

Empowered with these tools in hand, development teams can focus on solving problems within their specific product domain. As additional tooling needs arise, centralized teams assess whether the needs are common across multiple dev teams. When they are, collaborations ensue. Sometimes these local needs are too specific to warrant centralized investment. In that case the development team decides if their need is important enough for them to solve on their own.

有了这些工具在手,开发团队能够专一于解决他们特定产品领域中的问题。随着其余工具需求的出现,集中化团队会评估多个开发团队是否也有这些需求。若是有,接着就要协做。有时,这些地方需求过于具体,没法保证集中投资。在这种状况下,开发团队决定他们的需求是否重要到足以让他们本身解决问题。

Balancing local versus central investment in similar problems is one of the toughest aspects of our approach. In our experience the benefits of finding novel solutions to developer needs are worth the risk of multiple groups creating parallel solutions that will need to converge down the road. Communication and alignment are the keys to success. By starting well-aligned on the needs and how common they are likely to be, we can better match the investment to the benefits to dev teams across Netflix.

对相似问题在局部与集中投资间进行平衡是咱们的方案当中最棘手的地方。按照咱们的经验寻找开发需求的新颖解决方案的好处,是值得冒险让多支团队同时开发在未来异曲同工的解决方案的。沟通与协调是成功的关键。经过协调好需求及其共性,咱们就能更好地将投资与跨开发团队的好处进行匹配。

全周期开发者

By combining all of these ideas together, we arrived at a model where a development team, equipped with amazing developer productivity tools, is responsible for the full software life cycle: design, development, test, deploy, operate, and support.

把全部这些想法凑到一块儿,咱们就得出了这么一个模式,在配备了出色的开发者生产力工具以后,开发团队将负责整个软件生命周期:包括设计、开发、测试、部署、运营以及支持。

Full cycle developers are expected to be knowledgeable and effective in all areas of the software life cycle. For many new-to-Netflix developers, this means ramping up on areas they haven’t focused on before. We run dev bootcamps and other forms of ongoing training to impart this knowledge and build up these skills. Knowledge is necessary but not sufficient; easy-to-use tools for deployment pipelines (e.g., Spinnaker) and monitoring (e.g., Atlas) are also needed for effective full cycle ownership.

全周期开发者须要熟悉软件生命周期各个领域而且高效。对于不少不熟悉Netflix的开发者来讲,这意味着要在本身以前不怎么关注的领域加把劲。咱们开设有dev新兵训练营及其余持续培训形式来灌输这种知识并培养技能。知识是必要非充分条件;部署管道和监控还须要有易用的工具才能支撑高效的全周期开发运营。

Full cycle developers apply engineering discipline to all areas of the life cycle. They evaluate problems from a developer perspective and ask questions like “how can I automate what is needed to operate this system?” and “what self-service tool will enable my partners to answer their questions without needing me to be involved?” This helps our teams scale by favoring systems-focused rather than humans-focused thinking and automation over manual approaches.

全周期开发者把工程规范应用到生命周期的各个领域。他们从开发者的角度去评估问题,会提出相似“我如何才能自动化该系统运营所需的东西?”以及“什么样的自服务工具能让个人伙伴回答他们的问题而不须要个人参与?”优先考虑聚焦系统的办法而不是聚焦于人的办法,优先考虑自动化而不是手工,这帮助了咱们团队实现伸缩性。

Moving to a full cycle developer model requires a mindset shift. Some developers view design+development, and sometimes testing, as the primary way that they create value. This leads to the anti-pattern of viewing operations as a distraction, favoring short term fixes to operational and support issues so that they can get back to their “real job”. But the “real job” of full cycle developers is to use their software development expertise to solve problems across the full life cycle. A full cycle developer thinks and acts like an SWE, SDET, and SRE. At times they create software that solves business problems, at other times they write test cases for that, and still other times they automate operational aspects of that system.

转向全周期开发者模式须要理念的转变。一些开发者认为设计+开发,或者有时候测试才是创造价值的主要手段。这会致使一种反模式,认为运营是分心的事情,更喜欢对运营和支持问题进行短时间性质的修补以便可以回到本身“真正的工做”上去。可是全周期开发者这项“真正的工做”是利用他们的软件开发知识去解决全生命周期的问题。全周期开发者要像SWE、SDET以及SRE同样思考和行动。有时候他们要建立软件去解决商业问题,有时候他们写相应的测试用例,还有些时候他们会对系统的运营方面进行自动化。

For this model to succeed, teams must be committed to the value it brings and be cognizant of the costs. Teams need to be staffed appropriately with enough headroom to manage builds and deployments, handle production issues, and respond to partner support requests. Time needs to be devoted to training. Tools need to be leveraged and invested in. Partnerships need to be fostered with centralized teams to create reusable components and solutions. All areas of the life cycle need to be considered during planning and retrospectives. Investments like automating alert responses and building self-service partner support tools need to be prioritized alongside business projects. With appropriate staffing, prioritization, and partnerships, teams can be successful at operating what they build. Without these, teams risk overload and burnout.

这一模式要想取得成功,团队必须为它所带来的价值作奉献而且要认识到所需的成本。团队须要预留合理的人手去管理开发和部署,处理生产问题,而且对伙伴的支持请求做出响应。须要投入时间到培训上。要利用好工具而且投资于工具。须要跟集中化团队培养合做关系来建立出可重用的组件和解决方案。规划和回顾阶段要考虑到生命周期的各个领域。除了商业项目之外,像自动化告警响应和开发自服务伙伴支持工具这样的投资须要优先考虑。有了合适的人力、恰当的优先次序,再加上合做关系,团队就能成功地运营本身开发的东西。没有这些,团队就会有负担太重精疲力竭的风险。

To apply this model outside of Netflix, adaptations are necessary. The common problems across your dev teams are likely similar — from the need for continuous delivery pipelines, monitoring/observability, and so on. But many companies won’t have the staffing to invest in centralized teams like at Netflix, nor will they need the complexity that Netflix’s scale requires. Netflix’s tools are often open source, and it may be compelling to try them as a first pass. However, other open source and SaaS solutions to these problems can meet most companies needs. Start with analysis of the potential value and count the costs, followed by the mindset-shift. Evaluate what you need and be mindful of bringing in the least complexity necessary.

在Netflix以外的地方应用这一模式须要进行必要的调整。开发团队之间的共同问题多是相似的——好比持续交付管道的需求,好比监控/可观察性等等。但不少公司并无像Netflix这样有足够的人力投资到集中化团队上,或者也不须要Netflix这种规模致使的复杂性。Netflix的工具每每是开源的,因此一开始你想尝试一下也正常。不过这些问题其余的开源和SaaS解决方案也能知足大多数公司的需求。先从分析潜在价值和计算成本开始没而后再进行观念转变。评估你须要什么,当心不要引入没必要要的复杂性。

权衡

The tech industry has a wide range of ways to solve development and operations needs (see devops topologies for an extensive list). The full cycle model described here is common at Netflix, but has its downsides. Knowing the trade-offs before choosing a model can increase the chance of success.

技术圈有很丰富的手段来解决开放和运营需求(延伸阅读:devops拓扑)。这里描述的全周期模型在Netflix很广泛,但这种模式也有缺点。在选择一种模式前先了解其中的利弊能够提升成功的概率。

With the full cycle model, priority is given to a larger area of ownership and effectiveness in those broader domains through tools. Breadth requires both interest and aptitude in a diverse range of technologies. Some developers prefer focusing on becoming world class experts in a narrow field and our industry needs those types of specialists for some areas. For those experts, the need to be broad, with reasonable depth in each area, may be uncomfortable and sometimes unfulfilling. Some at Netflix prefer to be in an area that needs deep expertise without requiring ongoing breadth and we support them in finding those roles; others enjoy and welcome the broader responsibilities.

在全周期模式下,一我的要管的事情变宽了变多了。而一些开发者偏向于专一成为比较狭窄的领域的世界级专家,在一些领域咱们也是须要那种类型的专家的。对于那些专家来讲,须要一专多能,对每一个领域都懂一些的要求可能会感受不太舒服并且有时候勉为其难。有些人宁愿呆在须要深厚知识不须要持续扩展广度的领域,咱们也支持他们去找到这样的角色;有的则享受而且欢迎承担更广的责任。

In our experience with building and operating cloud-based systems, we’ve seen effectiveness with developers who value the breadth that owning the full cycle requires. But that breadth increases each developer’s cognitive load and means a team will balance more priorities every week than if they just focused on one area. We mitigate this by having an on-call rotation where developers take turns handling the deployment + operations + support responsibilities. When done well, that creates space for the others to do the focused, flow-state type work. When not done well, teams devolve into everyone jumping in on high-interrupt work like production issues, which can lead to burnout.

根据咱们开发和运营基于云的系统的经验,咱们见识过哪些重视拥有全周期所需的广度的开发者的效能。可是这种广度增长了每一位开发者的认知负荷,这意味着团队每周将比仅关注一个领域要平衡更多的优先事项。为此咱们采起了随时待命的轮转来缓解这一点:即让开发者轮流分担部署+运营+支持责任。作得很差的状况下,就会出现人人都在当救火队员去处理生产问题等高中断的状况,致使全部人精疲力竭。

Tooling and automation help to scale expertise, but no tool will solve every problem in the developer productivity and operations space. Netflix has a “paved road” set of tools and practices that are formally supported by centralized teams. We don’t mandate adoption of those paved roads but encourage adoption by ensuring that development and operations using those technologies is a far better experience than not using them. The downside of our approach is that the ideal of “every team using every feature in every tool for their most important needs” is near impossible to achieve. Realizing the returns on investment for our centralized teams’ solutions requires effort, alignment, and ongoing adaptations.

工具和自动化有助于扩展专业知识,但没有一项工具能解决开发者生产力和运营领域的每个问题。Netflix有集中化团队支撑的现成的一套工具和实践。咱们不强求其余团队必定要用这些,可是经过确保开发和运营采用这些技术的体验要比不用好得多来鼓励他们采用。咱们的办法很差之处在于“每一支团队将每一项工具的每个功能用到其最重要的需求”上这个理想几乎是不可能实现的。须要意识到咱们集中化团队解决方案的投资回报须要努力、协调以及持续适配。

结论

The path from 2012 to today has been full of experiments, learning, and adaptations. Edge Engineering, whose earlier experiences motivated finding a better model, is actively applying the full cycle developer model today. Deployments are routine and frequent, canaries take hours instead of days, and developers can quickly research issues and make changes rather than bouncing the responsibilities across teams. Other groups are seeing similar benefits. However, we’re cognizant that we got here by applying and learning from alternate approaches. We expect tomorrow’s needs to motivate further evolution.

从2012年走到今天经历了种种实验、学习和适配的过程。Edge Engineering的早期经历刺激了寻找更好模式的需求,今后全周期开发者模式就被咱们积极地应用到今天。部署是平常,进行得很频繁,金丝雀行动只须要数小时而不是很多天了,开发者能够迅速调研问题做出变动而不是在团队之间踢皮球。其余的团队也看到了相似的好处。然而,咱们认识到咱们是经过应用替代方案并从中学习才走到今天的。咱们预期未来的需求还会推进进一步的演进。

Interested in seeing this model in action? Want to be a part of exploring how we evolve our approaches for the future? Consider joining us.

相关文章
相关标签/搜索