
Site Reliability Engineering: How Google Runs Production Systems


The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient--lessons directly applicable to your organization.

This book is divided into four sections:

Introduction--Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles--Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices--Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
Management--Explore Google's best practices for training, communication, and meetings that your organization can use



30 reviews for Site Reliability Engineering: How Google Runs Production Systems

  1. 4 out of 5

    Simon Eskildsen

    Much of the information on running production systems effectively from Google has been extremely important to how I have changed my thinking about the SRE role over the years—finally, there's one piece that has all of what you previously had to look long and hard for in various talks, papers, and abstracts: error budgets, the SRE role definition, scaling, etc. That said, this book suffers a classic problem from having too many authors write independent chapters. Much is repeated, and each chapter stands too much on its own—building from first principles each time, instead of leveraging the rest of the book. This makes the book much longer than it needs to be. Furthermore, it tries to be both technical and non-technical—this confuses the narrative of the book, and it ends up not excelling at either of them. I would love to see two books: SRE the technical parts, and SRE the non-technical parts. Overall, this book is still a goldmine of information, worthy of a 5/5—but it is exactly that, a goldmine that you'll have to put a fair amount of effort into dissecting to retrieve the most value from, because the book's structure doesn't hand it to you—that's why we land at a 3/5. When recommending this book to coworkers, which I will, it will be chapters from the book—not the book at large.

  2. 4 out of 5

    Dimitrios

    I have so many bookmarks in this book and consider it an invaluable read. While not every project or company needs to operate at Google scale, it helps streamline the process of defining SLOs/SLAs for the occasion and establishing communication channels and practices to achieve them. It helped me wrap my head around concepts for which I used to rely on intuition. I've shaped processes and created template documents (postmortem / launch coordination checklist) for work based on this book.

  3. 5 out of 5

    Michael Scott

    Site Reliability Engineering, or Google's claim to fame re: technology and concepts developed more than a decade ago by the grid computing community, is a collection of essays on the design and operation of large-scale datacenters, with the goal of making them simultaneously scalable, robust, and efficient. Overall, despite (willing?) ignorance of the history of distributed systems and in particular (grid) datacenter technology, this is an excellent book that teaches us how Google thinks (or used to think, a few years back) about its datacenters. If you're interested in this topic, you have to read this book. Period.

    Structure: The book is divided into four main parts, each comprised of several essays. Each essay is authored by what I assume is a Google engineer, and edited by one of Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. (I just hope that what I didn't like about the book can be attributed to the editors, because I really didn't like some stuff in here.) In Part I, Introduction, the authors introduce Google's Site Reliability Engineering (SRE) approach to managing global-scale IT services running in datacenters spread across the entire world. (Truly impressive achievement, no doubt about it!) After a discussion about how SRE is different from DevOps (another hot term of the day), this part introduces the core elements and requirements of SRE, which include the traditional Service Level Objectives (SLOs) and Service Level Agreements (SLAs), management of changing services and requirements, demand forecasting and capacity, provisioning and allocation, etc. Through a simple service, Shakespeare, the authors introduce the core concepts of running a workflow, which is essentially a collection of IT tasks that have inter-dependencies, in the datacenter. In Part II, Principles, the book focuses on operational and reliability risks, SLO and SLA management, the notion of toil (mundane work that scales linearly (why not super-linearly as well?!?!) with services, yet can be automated) and the need to eliminate it (through automation), how to monitor the complex system that is a datacenter, a process for automation as seen at Google, the notion of engineering releases, and, last, an essay on the need for simplicity. This rather disparate collection of notions is very useful, explained for the layman but still with enough technical content to be interesting even for the expert (practitioner or academic). In Parts III and IV, Practices and Management, respectively, the book discusses a variety of topics, from time-series analysis for anomaly detection, to the practice and management of people on-call, to various ways to prevent and address incidents occurring in the datacenter, to postmortems and root-cause analysis that could help prevent future disasters, to testing for reliability (a notoriously difficult issue), to software engineering in the SRE team, to load balancing and overload management (resource management and scheduling 101), communication between SRE engineers, etc. etc., until the predictable call for everyone to use SRE as early as possible and as often as possible. Overall, palatable material, but spread too thin and with too much overlap with prior related work of a decade ago, especially academic, and not much new insight.

    What I liked: I especially liked Part II, which in my view is one of the best introductions to datacenter management available today to the students of this and related topics (e.g., applied distributed systems, cloud computing, grid computing, etc.). Some of the topics addressed, such as risk and team practices, are rather new for many in the business. I liked the approach proposed in this book, which seemed to me above and beyond the current state of the art. Topics in reliability (correlated failures, root-cause analysis) and scheduling (overload management, load balancing, architectural issues, etc.) are currently open in both practice and academia, and this book emphasizes in my view the dearth of good solutions for all but the simplest of problems. Many of the issues related to automated monitoring and incident detection could lead in the future to better technology and much innovation, so I liked the prominence given to these topics in this book.

    What I didn't like: I thoroughly disliked the statements claiming by omission that Google has invented most of the concepts presented in the book, which of course in the academic world would have been promptly sent to the reject pile. As an anecdote, consider the sentence "Ben Treynor Sloss, Google's VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product: a system isn't very useful if nobody can use it!" I'll skip the discussion about who is the originator of the term SRE, and focus on the meat of this statement. By omission, it makes the reader think that Google, through its Ben Treynor Sloss, is the first to understand the importance of reliability for datacenter-related systems. In fact, this has been long known in the grid computing community. I found in just a few minutes explicit references from Geoffrey Fox (in 2005, on page 317 of yet another grid computing anthology, "service considers reliable delivery to be more important than timely delivery") and Alexandru Iosup (in 2007, on page 5 of this presentation, and again in 2009, in this course, "In today's grids, reliability is more important than performance!"). Of course, this notion has been explored for the general case of services much earlier... anyone familiar with air and especially space flight? The list of concepts actually not invented at Google but about which the book implies the contrary goes on and on... I also did not like some of the exaggerated claims of having found solutions for the general problems. Much remains to be done, as hiring at Google in these areas continues unabated. (There's also something called computer science, whose state of the art indicates the same.)

  4. 4 out of 5

    Michael Koltsov

    I don't normally buy paper books, which means that in the course of the last few years I've bought only one paper book even though I've read hundreds of books during that period of time. This book is the second one I've bought so far, which means a lot to me. Not to mention that Google provides it on the Internet free of charge. For me, personally, this book is a basis on which a lot of my past assumptions could be argued as viable solutions at the scale of Google. This book is not revealing any of Google's secrets (do they really have any secrets?), but it's a great start even if you don't need the scale of Google and simply want to write robust and failure-resilient apps. Technical solutions, dealing with user-facing issues, finding peers, on-call support, post-mortems, incident-tracking systems: this book has it all, though, as the chapters have been written by different people, some aspects are more emphasized than others. I wish some of the chapters had more gory production-based details than they do now. My score is 5/5.

  5. 5 out of 5

    Mircea Ŀ

    Boring as F. The main message is: oh look at us, we have super hard problems and like saying 99.999% a lot. And oh yeah... SREs are developers. We don't spend more than 50% on "toil" work. Pleeeease. The book has some interesting stories, and if you are good at reading between the lines you might learn something. Everything else is BS. Does every chapter need to start by telling us who edited the chapter? I don't give a f. The book also seems to be the product of multiple individuals (a lot of them, actually) whose sole connection is that they wrote a chapter for this book. F the reader, F structure, F focusing on the core of the issue. Let's just dump a stream-of-consciousness kind of junk and after that tell everyone how hard it is and how we care about work-life balance. Again, boring, and in general you're gonna waste your time reading this (unless you want to know what Borg, Chubby and Bigtable are).

  6. 4 out of 5

    Alexander Yakushev

    This book is great on multiple levels. First of all, it packs great content — the detailed explanation of how and why Google has internally established what we now call "the DevOps culture." The rationale, coupled with a hands-on implementation guide, provides incredible insight into creating and running an SRE team in your own company. The text quality is top-notch; the book is written with clarity in mind and thoroughly edited. I'd rate the content itself at four stars. But the book deserves the fifth star because it is a superb example of material that gives you a precise understanding of how some company (or its division) operates inside. Apparently, Google can afford to expose such secrets while not many other companies can, but we need more low-BS, to-the-point books like this to share and exchange the experience of running the most complex systems (that is, human organizations) efficiently.

  7. 5 out of 5

    James Stewart

    Loads of interesting ideas and thoughts, but a bit of a slog to get through. The approach of having different members of the team write different sections probably worked really well for engaging everyone, but it made for quite a bit of repetition. It also ends up feeling like a few books rolled into one, with one on distributed systems design, another on SRE culture and practices, and maybe another on management.

  8. 5 out of 5

    Alex Palcuie

    I think this is the best engineering book in the last decade.

  9. 5 out of 5

    Tomas Varaneckas

    This was a really hard read, in a bad sense. The first couple of dozen pages were really promising, but the book turned out to be an unnecessarily long, incredibly boring, repetitive and inconsistent mashup of random blog posts and often trivial information. It has roughly 10% of valuable content, and would greatly benefit from being reduced to a 50-pager. In its current state it seems that it was a corporate collaborative ego trip, to show potential employees how cool Google SRE is, and how majestic their scale happens to be. After reading this book, I am absolutely sure I would never ever want to work for Google.

  10. 4 out of 5

    Chris

    There's a ton of great information here, and we refer to it regularly as we're trying to change the culture at work. I gave it a 4 instead of a 5 because it does suffer a little from the style – think collection of essays rather than a unified arc – but it's really worth reading even if it requires some care to transfer to more usual environments.

  11. 4 out of 5

    Vít Listík

    I like the fact that it is written by multiple authors. Everything stated in the book seems so obvious, yet it is sad to read because it is not yet an industry standard. A must-read for every SRE.

  12. 4 out of 5

    Tim O'Hearn

    "Perfect algorithms may not have perfect implementations." And perfect books may not have perfect writers. Site Reliability Engineering is an essay collection that can be rickety at times but is steadfast in its central thesis. Google can claim credit for inventing Site Reliability Engineering and, in this book, a bunch of noteworthy engineers share their wisdom from the trenches. When it comes to software architecture and product development, I've found delight in reading about how startups' products are built because the stories are digestible. It's possible for a founder, lead engineer, or technical writer to lay down the blueprint of a small-scale product and even get into the nuts and bolts. When it comes to large tech companies, this is impossible from a technical point of view and improbable from a compliance standpoint. This is beside the purpose of the book, but arrangements like this one help bridge the gap between one's imagination and the inner workings of tech giants. There are plenty of (good!) books that tell you all about how Google the business works, but this one happens to be the best insight into how the engineering side operates. Sure, you have to connect some dots and bring with you some experience, but the result is priceless--you start to feel like you get it. The essays are almost all useful. If you haven't spent at least an internship's worth of time in the workforce, you should probably table this one until you have a bit more experience. I would have enjoyed this book as an undergraduate, no doubt, but most of it wouldn't have clicked. The Practices section--really, the meat of the book--is where the uninitiated might struggle. When I emerged on the other side I had a list of at least twenty topics that I needed to explore in more detail if I was to become truly great at what I do. I highly recommend this book to anyone on the SRE/DevOps spectrum as well as those trying to understand large-scale tech companies as a whole. See this review and others on my blog.

  13. 4 out of 5

    Ahmad hosseini

    What is SRE? Site Reliability Engineering (SRE) is Google's approach to service management. An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). Typical SRE activities fall into the following approximate categories:
    • Software engineering: Involves writing or modifying code, in addition to any associated design and documentation work.
    • Systems engineering: Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting improvements from a one-time effort.
    • Toil: Work directly tied to running a service that is repetitive, manual, etc.
    • Overhead: Administrative work not tied directly to running a service.

    Quotes: "Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn't work." – Brian Redman. "Ways in which things go right are special cases of the ways in which things go wrong." – John Allspaw

    About the book: This book is a series of essays written by members and alumni of Google's Site Reliability Engineering organization. It's much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you. "Essential reading for anyone running highly available web services at scale." – Adrian Cockcroft, Battery Ventures, former Netflix Cloud Architect

  14. 4 out of 5

    David

    The book seems largely to be a collection of essays written by disparate people within Google's SRE organization. It's as well-organized and coherent as that can be (and I think it's a good format for this -- far better than if they'd tried to create something with a more unified narrative). But it's very uneven: some chapters are terrific while some seem rather empty. I found the chapters on risk, load balancing, overload, distributed consensus, and (surprisingly) launches to be among the most useful. On the other hand, the chapter on simplicity was indeed simplistic, and the chapter on data integrity was (surprisingly) disappointing. The good: there's a lot of excellent information in this book. It's a comprehensive, thoughtful overview for anybody entering the world of distributed systems, cloud infrastructure, or network services. Despite a few misgivings, I'm pretty on board with Google's approach to SRE. It's a very thoughtful approach to the problems of operating production services, covering topics ranging from time management, prioritization, and onboarding to all the technical challenges in distributed systems. The bad: the book gets religious (about Google) at times, and some of it's pretty smug. This isn't a big deal, but it's likely to turn off people who've seen from experience how frustrating and unproductive it can be when good ideas about building systems become religion.

  15. 5 out of 5

    Scott Maclellan

    A fantastic and in-depth resource. Great for going deeper and maturing how a company builds and runs software at scale. Touches on the specific tactical actions your team can take to build more reliable products. The extended sections on culture slowed me down a lot, but have led to some very interesting conversations at work.

  16. 5 out of 5

    Tadas Talaikis

    "Boring" (at least from the outside world perspective, ok with me), basically can be much shorter. Culture, automation of everything, load balancing, monitoring, like everywhere else, except maybe Borg thing.

  17. 5 out of 5

    Luca

    There's interesting content for sure. But the writing isn't engaging (the book is long, so that becomes boring kinda fast) and some aspects of the Google culture are really creepy (best example: "humans are imperfect machines" while talking about people management...).

  18. 5 out of 5

    David Robillard

    A must read for anyone involved with online services.

  19. 5 out of 5

    Gary Boland

    A useful checklist for production engineering is tarnished by the undercurrent of marketing/recruiting. Still deserves its place on the shelf if you deliver software for a living

  20. 5 out of 5

    Sundarraj Kaushik

    A wonderful book to learn how to manage websites so that they are reliable. Some good random extracts from the book.

    Site Reliability Engineering
    1. Operations personnel should spend 50% of their time writing automation scripts and programs.
    2. The decision to stop releases for the remainder of the quarter once an error budget is depleted.
    3. An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
    4. Codified rules of engagement and principles for how SRE teams interact with their environment—not only the production environment, but also the product development teams, the testing teams, the users, and so on.
    5. Operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
    6. There are three kinds of valid monitoring output. Alerts: signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation. Tickets: signify that a human needs to take action, but not immediately; the system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result. Logging: no one needs to look at this information, but it is recorded for diagnostic or forensic purposes; the expectation is that no one reads logs unless something else prompts them to do so.
    7. Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service's efficiency.

    SLI - Service Level Indicator - An indicator used to measure the health of a service; used to determine the SLO and SLA.
    SLO - Service Level Objective - The objective that must be met by the service.
    SLA - Service Level Agreement - The agreement with the client with respect to the services rendered to them.

    Don't overachieve: Users build on the reality of what you offer, rather than what you say you'll supply, particularly for infrastructure services. If your service's actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google's Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn't faster under light loads.

    "If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
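
    To make the error-budget idea above concrete, here is a minimal sketch; the 99.9% target, the rolling window, and the request counts are illustrative assumptions, not figures from the book:

```python
# Minimal sketch of an error-budget check for a 99.9% availability SLO,
# measured over a rolling window of requests. All numbers are illustrative.

SLO = 0.999                      # availability target
ERROR_BUDGET = 1.0 - SLO         # fraction of requests allowed to fail (0.1%)

def budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failures = ERROR_BUDGET * total_requests
    return 1.0 - failed_requests / allowed_failures

# Example: 50 million requests in the window, 30,000 of them failed.
remaining = budget_remaining(50_000_000, 30_000)
if remaining <= 0:
    print("Error budget exhausted: freeze feature releases for the rest of the window.")
else:
    print(f"{remaining:.0%} of the error budget remains; releases may proceed.")
```
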
    Four Golden Signals of Monitoring: The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

    Latency: The time it takes to service a request. It's important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it's important to track error latency, as opposed to just filtering out errors.

    Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

    Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you're serving the wrong content.

    Saturation: How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") and that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation. Finally, saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."

    If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
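
    A minimal sketch of computing two of these signals (latency and errors) over a one-minute window; the data shape, thresholds, and function names are illustrative assumptions rather than anything prescribed by the book:

```python
# Sketch: compute a p99 latency signal and an error-rate signal over a window
# of (latency_seconds, http_status) samples. Thresholds and data are made up.
import math

def p99_latency(samples):
    """99th-percentile latency of successful requests (nearest-rank method).
    Error latency would be tracked separately rather than mixed in here."""
    ok = sorted(lat for lat, status in samples if status < 500)
    if not ok:
        return 0.0
    return ok[max(0, math.ceil(0.99 * len(ok)) - 1)]

def error_rate(samples):
    """Fraction of requests that failed explicitly (HTTP 5xx)."""
    return sum(1 for _, status in samples if status >= 500) / len(samples) if samples else 0.0

window = [(0.120, 200), (0.095, 200), (2.300, 500), (0.110, 200), (0.870, 200)]
if p99_latency(window) > 0.8 or error_rate(window) > 0.01:
    print("page a human")   # any one problematic golden signal is enough to page
```
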
    Why is it important to have control over the software one is using? Why and when does it make sense to roll out one's own framework and/or platform? Another argument in favor of automation, particularly in the case of Google, is our complicated yet surprisingly uniform production environment, described in "The Production Environment at Google, from the Viewpoint of an SRE." While other organizations might have an important piece of equipment without a readily accessible API, software for which no source code is available, or another impediment to complete control over production operations, Google generally avoids such scenarios. We have built APIs for systems when no API was available from the vendor. Even though purchasing software for a particular task would have been much cheaper in the short term, we chose to write our own solutions, because doing so produced APIs with the potential for much greater long-term benefits. We spent a lot of time overcoming obstacles to automatic system management, and then resolutely developed that automatic system management itself. Given how Google manages its source code, the availability of that code for more or less any system that SRE touches also means that our mission to "own the product in production" is much easier because we control the entirety of the stack. When developed in-house, the platform or framework can be designed to manage any failures automatically; no external observer is required to manage this. One of the negatives of automation is that humans forget how to do a task when required, which may not always be good.

    Google cherry-picks features for release. Should we do the same? "All code is checked into the main branch of the source code tree (mainline). However, most major projects don't release directly from the mainline. Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline. Bug fixes are submitted to the mainline and then cherry picked into the branch for inclusion in the release. This practice avoids inadvertently picking up unrelated changes submitted to the mainline since the original build occurred. Using this branch and cherry pick method, we know the exact contents of each release." Note that the cherry picking is into specific release branches, not of changes in a specific branch.

    Surprises vs. boring: "Unlike just about everything else in life, 'boring' is actually a positive attribute when it comes to software! We don't want our programs to be spontaneous and interesting; we want them to stick to the script and predictably accomplish their business goals. In the words of Google engineer Robert Muth, 'Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code.' Surprises in production are the nemeses of SRE."

    Commenting or flagging code: "Because engineers are human beings who often form an emotional attachment to their creations, confrontations over large-scale purges of the source tree are not uncommon. Some might protest, 'What if we need that code later?' 'Why don't we just comment the code out so we can easily add it again later?' or 'Why don't we gate the code with a flag instead of deleting it?' These are all terrible suggestions.
    Source control systems make it easy to reverse changes, whereas hundreds of lines of commented-out code create distractions and confusion (especially as the source files continue to evolve), and code that is never executed, gated by a flag that is always disabled, is a metaphorical time bomb waiting to explode, as painfully experienced by Knight Capital, for example (see "Order In the Matter of Knight Capital Americas LLC" [Sec13])."

    Writing a blameless RCA. Pointing fingers: "We need to rewrite the entire complicated backend system! It's been breaking weekly for the last three quarters and I'm sure we're all tired of fixing things onesy-twosy. Seriously, if I get paged one more time I'll rewrite it myself…" Blameless: "An action item to rewrite the entire backend system might actually prevent these annoying pages from continuing to happen, and the maintenance manual for this version is quite long and really difficult to be fully trained up on. I'm sure our future on-callers will thank us!"

    Establishing a strong testing culture: One way to establish a strong testing culture is to start documenting all reported bugs as test cases. If every bug is converted into a test, each test is supposed to initially fail because the bug hasn't yet been fixed. As engineers fix the bugs, the software passes testing and you're on the road to developing a comprehensive regression test suite.

    Project vs. support: Dedicated, uninterrupted project work time is essential to any software development effort. Dedicated project time is necessary to enable progress on a project, because it's nearly impossible to write code—much less to concentrate on larger, more impactful projects—when you're thrashing between several tasks in the course of an hour. Therefore, the ability to work on a software project without interrupts is often an attractive reason for engineers to begin working on a development project. Such time must be aggressively defended.

    Managing loads: Round robin vs. weighted round robin (round robin, but taking into consideration the number of tasks pending at each server). Overload of the system has to be avoided through load testing. If the system is overloaded despite this, any retries have to be well controlled: a retry at a higher level can cascade into retries at the lower levels. Use jittered retries (retry at random intervals) and exponential backoff (exponentially increase the time between retries), and fail quickly to avoid piling load onto an already overloaded system. If queuing is used to prevent overloading a server, FIFO may not always be a good option, because the user waiting for the task at the head of the queue may have left the system, no longer expecting a response. If a task is split into multiple pipelined stages, it is good to check at each stage whether there is sufficient time to perform the rest of the work, based on the expected time the remaining stages in the pipeline will take. Implement deadline propagation.
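
    A small sketch of that retry discipline (exponential backoff with jitter, bounded by a propagated deadline, failing fast once the budget is gone); the function names, base delay, and caps are illustrative assumptions, not values from the book:

```python
# Sketch: retries with exponential backoff, full jitter, and a propagated
# deadline. The operation, base delay, and caps are illustrative assumptions.
import random
import time

def call_with_retries(operation, deadline, base=0.1, cap=5.0, max_attempts=5):
    """Call operation() until it succeeds, the deadline passes, or attempts run out."""
    for attempt in range(max_attempts):
        if time.monotonic() >= deadline:
            raise TimeoutError("deadline exceeded; fail fast instead of piling on retries")
        try:
            return operation()
        except ConnectionError:
            # Exponential backoff with full jitter, never sleeping past the deadline.
            delay = min(cap, base * (2 ** attempt)) * random.random()
            time.sleep(max(0.0, min(delay, deadline - time.monotonic())))
    raise TimeoutError("retry budget exhausted")

# Usage sketch: give the whole call a 2-second budget and pass the same deadline
# downstream (deadline propagation) so lower layers don't retry past it.
# result = call_with_retries(fetch_from_backend, time.monotonic() + 2.0)  # fetch_from_backend is hypothetical
```
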
    Safeguarding the data. Three levels of guarding against data loss:
    1. Soft delete (visible to the user in the recycle bin)
    2. Backup (incremental and full) before actual deletion; test the ability to restore, and replicate both live and backed-up data
    3. Purge the data (it can now be recovered only from backup)
    Use out-of-band data validation to prevent surprising data loss. It is important to (1) continuously test the recovery process as part of your normal operations, and (2) set up alerts that fire when a recovery process fails to provide a heartbeat indication of its success.

    Launch Coordination Checklist. This is Google's original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:
    1. Architecture: architecture sketch, types of servers, types of requests from clients
    2. Programmatic client requests
    3. Machines and datacenters
    4. Machines and bandwidth, datacenters, N+2 redundancy, network QoS
    5. New domain names, DNS load balancing
    6. Volume estimates, capacity, and performance
    7. HTTP traffic and bandwidth estimates, launch "spike," traffic mix, 6 months out
    8. Load test, end-to-end test, capacity per datacenter at max latency
    9. Impact on other services we care most about
    10. Storage capacity
    11. System reliability and failover: what happens when a machine dies, a rack fails, or a cluster goes offline, or the network fails between two datacenters; for each type of server that talks to other servers (its backends): how to detect when backends die, and what to do when they die; how to terminate or restart without affecting clients or users; load balancing, rate-limiting, timeout, retry and error handling behavior; data backup/restore, disaster recovery
    12. Monitoring and server management: monitoring internal state, monitoring end-to-end behavior, managing alerts; monitoring the monitoring; financially important alerts and logs; tips for running servers within a cluster environment (don't crash mail servers by sending yourself email alerts in your own server code)
    13. Security: security design review, security code audit, spam risk, authentication, SSL; prelaunch visibility/access control, various types of blacklists
    14. Automation and manual tasks: methods and change control to update servers, data, and configs; release process, repeatable builds, canaries under live traffic, staged rollouts
    15. Growth issues: spare capacity, 10x growth, growth alerts; scalability bottlenecks, linear scaling, scaling with hardware, changes needed; caching, data sharding/resharding
    16. External dependencies: third-party systems, monitoring, networking, traffic volume, launch spikes; graceful degradation, how to avoid accidentally overrunning third-party services; playing nice with syndicated partners, mail systems, services within Google
    17. Schedule and rollout planning: hard deadlines, external events, Mondays or Fridays; standard operating procedures for this service, for other services

    As mentioned, you might encounter responses such as "Why me?" This response is especially likely when a team believes that the postmortem process is retaliatory. This attitude comes from subscribing to the Bad Apple Theory: the system is working fine, and if we get rid of all the bad apples and their mistakes, the system will continue to be fine. The Bad Apple Theory is demonstrably false, as shown by evidence [Dek14] from several disciplines, including airline safety. You should point out this falsity. The most effective phrasing for a postmortem is to say, "Mistakes are inevitable in any system with multiple subtle interactions. You were on-call, and I trust you to make the right decisions with the right information. I'd like you to write down what you were thinking at each point in time, so that we can find out where the system misled you, and where the cognitive demands were too high."

    "The best designs and the best implementations result from the joint concerns of production and the product being met in an atmosphere of mutual respect."
    Postmortem Culture: Corrective and preventative action (CAPA) is a well-known concept for improving reliability that focuses on the systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE's strong culture of blameless postmortems. When something goes wrong (and given the scale, complexity, and rapid rate of change at Google, something inevitably will go wrong), it's important to evaluate all of the following: what happened; the effectiveness of the response; what we would do differently next time; and what actions will be taken to make sure a particular incident doesn't happen again. This exercise is undertaken without pointing fingers at any individual. Instead of assigning blame, it is far more important to figure out what went wrong, and how, as an organization, we will rally to ensure it doesn't happen again. Dwelling on who might have caused the outage is counterproductive. Postmortems are conducted after incidents and published across SRE teams so that all can benefit from the lessons learned. Decisions should be informed rather than prescriptive, and are made without deference to personal opinions—even that of the most senior person in the room, whom Eric Schmidt and Jonathan Rosenberg dub the "HiPPO," for "Highest-Paid Person's Opinion."

  21. 5 out of 5

    Jeremy

    This is the kind of book that can be quite hard to digest in one go, cover to cover. It took me more than two years to (casually) read it! Of course, not everything can be applied everywhere. Not every organization is the size of Google, or has the same amount of resources to apply the principles. Still, there is good advice in the book which can come in handy in many situations.

  22. 5 out of 5

    Mark Hillick

    Having worked in tech for many years, at a fairly large scale but not Google-scale, I'm probably the ideal audience for this book. I've never had so many bookmarks or so much follow-up reading from a book, awesome knowledge-sharing from the Google SRE team. Although the book itself is not overly technical, its subject is very technical, and this book is undoubtedly well worth reading for all engineers, even if you don't operate at scale. You can learn what works and what doesn't, and then incorporate the various best practices, and possibly technologies, into your day job (in a controlled fashion with a clear strategy). There are so many good things worth calling out about this book; a short summary of highlights would include:
    - What makes a good SRE, and it ain't all technical
    - Everything about toil, what it is and why it is bad, particularly for the team's health, its success and the individual growth of team members
    - How to successfully build an SRE team (discipline), engage/embed with other teams, and bring in new team members
    - The links to further reading or external papers, especially when the book didn't have enough space to dive into things technically (e.g. the Maglev load balancer)
    - I love that the book called out burnout, ensuring that team members still do tedious but necessary work, while still having time to take a break and ensuring they can have dedicated blocks for more interesting or project work
    - The templates for on-call, triage, incident response, and postmortems are excellent (I love that they called out the "no-blame" approach and the usefulness of checklists)
    Some things I'd have liked to see:
    - Better flow in the earlier sections (2 & 3), particularly around alerting and monitoring. At times, reading was a drag here.
    - There's often repetition, probably caused by the changes in authors, with "first principles" suffering the most from it (reducing the repetition would have clearly shortened the book and made it easier to read)
    - At times, I felt there could have been more detail and meat to some of the internal tools and incidents, though the fact that Google has published this book in the first place and been honest that they've screwed up at times is amazing and quite unique in the tech industry.
    I want to recommend this book to colleagues, but I will probably recommend specific chapters as opposed to the whole book, due mainly to the repetition mentioned above. Lastly, I work in InfoSec and I sincerely hope those in InfoSec read this book in order to understand how the SRE team came into existence at Google and became such a success that they have to turn Product Development teams away when asked for 100% engagement support. Sadly, many InfoSec teams are in an echo chamber in their corner as their company scales.

  23. 4 out of 5

    Moses

    When I started working on software infrastructure at large companies, I was struck by how little of what I was working on had been covered in school, and how little I could find in academia. Talking to friends in industry, many of us were facing the same problems, but there didn't seem to be any literature on what we were doing. Everything we learned, we learned either through the school of hard knocks, or from more experienced folks. This book fills a much-needed gap. Furthermore, since many companies have evolved their processes in silos, even engineers who already have a pretty good idea of how to increase 9s will learn something new, since Google's history has probably led them down a different evolutionary path than what your company followed. Because of this, I hope that folks don't consider the matter of reliability open and shut now that this book has come out. In truth, this book is in many ways a history book about how Google handles reliability, and is not the end-all, be-all of reliability in distributed systems. This book is a good starting place, but not all of their practices or ideas are right for all systems, and we should remember that we're in a nascent field, and there's still work to be done. With that said, this book comes with the same problems that many books that are collections of essays have. There isn't a cohesive narrative, it often repeats itself, and the essays are uneven. Some of them are radiant, and some of them are not. Even considering the flaws of this book, I highly recommend it for anyone who is trying to make distributed systems reliable within a large engineering organization.

  24. 4 out of 5

    Daniël

    So, I can see how this book has been so influential in the industry. A lot of what's in here should be common sense for people who have been working in tech for over 5 years but sadly isn't. This book reminded me of much advice I've given over my career, that was then promptly discarded for multiple reasons. This book gives me some ammo in these kinds of discussions, so that's great. That's not to say that I didn't learn anything from the book: there are many things in here that made me think "Oh, that's a great idea" and other twists on things that I already knew that I will incorporate in my work. Furthermore, it's interesting to see how Google manages things, and the details about their infrastructure are good reads. This book was quite a dry read though, as tech books tend to be, which explains why I took quite a bit longer to finish it.

  25. 4 out of 5

    Sumit Gouthaman

    TL;DR: Great book, really difficult to read cover to cover. This is a great book. The collection of stories here gives a lot of insight into running production systems. However, having been written by multiple authors, it comes across as rather disorganized. A lot of repetition and uneven pacing in the chapters makes it very difficult to finish. Also, I feel like a lot of concepts in the book could work better if some of the Google-specific details were abstracted out. I can imagine this book being a great read for someone inside Google, but for a general reader, it comes across as convoluted in some sections. I think with better editing work, this could be a 5/5 book.

  26. 5 out of 5

    André Santos

    This is an overall good book for people new to the SRE way of doing things. It also provides an interesting look at Google's production environment and methodologies. Since each chapter is somewhat self-contained, the book is a bit repetitive in some respects.

  27. 4 out of 5

    Luke Amdor

    Some really great chapters especially towards the beginning and the end. However, I feel like it could have been edited better. It meanders a lot.

  28. 4 out of 5

    Rod

    I read this book because many (software engineers) see this book as an extension and enhancement of the agile software development process which is popular in the DevOps movement. The book is more a survey of how Google SREs (Site Reliability Engineers) maintain the extensive Google worldwide network than a discussion of how software should be developed. Some of the main philosophical components of the Google NetOps strategy are:
    1. NetOps people should do about 50% operational work and 50% (software) development. The development work should be on features or in-house tools which extend the productivity of SRE workers. The productivity of SRE (NetOps) workers should scale so that as services grow, the growth of the SRE staff is at a lower rate - this is because of a set of SRE strategies and SRE incident (outage) troubleshooting tools which make the SRE worker more productive.
    2. After each outage (i.e. incident) there should be a blame-free postmortem document which is discussed among SRE groups.
    3. The practice of continual monitoring and testing will prevent surprises in production software.
    4. Key differences between agile and SRE methodologies are the concepts of SLO (service level objective) and 'error budget', which are part of the reliability expectation for any software service. The SLO should be no more than 1 minus the error budget. As outages happen, they eat into the error budget and reduce the agility/rate at which new features/software changes are released into the production environment. Changes in the production environment (i.e. new software releases) are the cause of most service outages. The need to roll out fixes and features is bounded by the corporate requirement to deliver reliable services to customers.
    The book has much to say about the software development that SRE (NetOps) workers should do, but it does not claim to be a guide for development done by general enterprise application software engineers. The book seems to assume a group of enterprise software developers as being separate from the SRE/NetOps group. It is hard to see how this book could be applied as a corrective to agile software development in an enterprise environment. If you work in a general network operations group in a high-tech company, this book is recommended reading. If you work as an enterprise software developer, I don't think it would contribute much to your daily workflow. Chapter 19 begins a four-chapter technical discussion of how load balancers are implemented in the Google global network; this section seems out of place in the rest of the book. Each chapter is written and reviewed by different Google employees. Here's a YouTube video which sees SRE as an extension of agile/devops: https://www.youtube.com/watch?v=0UyrV...

  29. 5 out of 5

    Adil

    Google's Site Reliability Engineering book is a collection of essays about its titular Site Reliability Engineering (SRE) organization and its philosophy of applying software engineering principles to traditional sysadmin tasks like reliability, monitoring, and stability. The book is a collection of situational essays about different topics related to the SRE organization, and should be consulted one-off rather than read cover-to-cover - though you will learn quite a lot about Google-specific use cases if you take the latter approach. Some of the most personally useful sections in this book concern Google SRE's approach to building multiple levels of reliability and communication into each of their products and teams: "hope is not a strategy" remains my favorite take-away from this collection by far. While many ideas outlined in these sections were only practical at Google-like scale, I still found it insightful to read essays that had clearly been written by veterans from every sort of field under the sun: the epilogue lists contributors from the defense, medical, mechanical engineering, and even mountaineering industries. The authors took their time to tell us just how important clear communication is when facing a data-loss disaster that affects leagues of consumers: they included references to FEMA's disaster response mechanisms and emphasized how important it was to have a team member running comms while the rest of the group triaged the emergency effort. Site Reliability Engineering is a useful team reference from a company that has taken many risks and difficult paths to achieve the automation, reliability, and scale that it has today. I recommend it as a reference to anyone who is working in the software engineering field. Oh, and did I mention that it's available totally free online?

  30. 5 out of 5

    Matt

    This is a "soft" introduction of building and taking care of large systems through the lens of how Google does it. I use the term "soft" here because the book gives a high level technical overview instead of getting into the nitty-gritty details and it also has a strong focus on the people-side of things. As it turns out, building a system composed of thousands of machines isn't just about choosing the right version of a consensus algorithm, though these are touched upon as well. The people exten This is a "soft" introduction of building and taking care of large systems through the lens of how Google does it. I use the term "soft" here because the book gives a high level technical overview instead of getting into the nitty-gritty details and it also has a strong focus on the people-side of things. As it turns out, building a system composed of thousands of machines isn't just about choosing the right version of a consensus algorithm, though these are touched upon as well. The people extending and maintaining the system have to be organized in certain ways that will allow them to respond to sudden changes in the system (internal or external), introduce changes themselves, and do all of that while allowing the system only a few minutes of downtime a year. The book is divided into two primary parts - principles and practices. The principles are knowledge distilled into a handful of page and take on topics such as "embracing risk", "eliminating toil", and "monitoring". These chapters explain the "why" behind each idea. Practices focus on showcasing real-life implementations of these principles, which is a more meaty and interesting topic. This material describes tools and procedures that have allowed Google to build robust and reliable services. Again, the main focus here are the people responsible for building and running these services - we are treated to examples of managing on-call rotation, responding to incidents, setting up our monitoring solution, or building critical components that manage to survive disasters. And doing all of this while deploying production code hundreds of times a day. While a lot of what's presented in the book is targeted at engineers working on huge systems, I think that a lot of the people-oriented material can be applied to smaller shops to deliver more reliable products faster.
