Now that we've covered the key performance metrics, let's design for reliability. Avoid single points of failure by replicating data and creating multiple virtual machine instances. It is important to define your unit of deployment and understand its capabilities. To avoid single points of failure, you should deploy two extra instances, or N+2, to handle both failure and upgrades. These deployments should ideally be in different zones to mitigate against zonal failures.

Let me explain the upgrade consideration. Consider three VMs that are load balanced to achieve N+2. If one is being upgraded and another fails, 50 percent of the available capacity of the compute is removed, which potentially doubles the load on the remaining instance and increases the chances of it failing. This is where capacity planning and knowing the capability of your deployment unit are important. Also, for ease of scaling, it is a good practice to make the deployment units interchangeable, stateless clones.

It is also important to be aware of correlated failures. These occur when related items fail at the same time. At the simplest level, if a single machine fails, all requests served by that machine fail. At a hardware level, if a top-of-rack switch fails, the complete rack fails. At the cloud level, if a zone or region is lost, all of its resources are unavailable. Servers running the same software suffer from the same issue: if there is a fault in the software, the servers may fail at a similar time. Correlated failures can also apply to configuration data: if a global configuration system fails and multiple systems depend on it, they potentially fail too. When we have a group of related items that could fail together, we refer to it as a failure or fault domain.

Several techniques can be used to avoid correlated failures. It is useful to be aware of failure domains; then servers can be decoupled using microservices distributed among multiple failure domains.
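The N+2 capacity arithmetic above can be sketched in a few lines of Python. This is an illustrative helper, not part of any platform API; the 900 QPS figure in the comments is a made-up example load.

```python
def per_instance_load(total_qps: float, instances: int, unavailable: int) -> float:
    """QPS each remaining instance must absorb when some instances are unavailable."""
    remaining = instances - unavailable
    if remaining <= 0:
        raise ValueError("no serving capacity left")
    return total_qps / remaining

# Three N+2 instances serving a hypothetical 900 QPS total:
# during an upgrade, two instances share the load (450 QPS each);
# if one of them then fails, the survivor carries all 900 QPS --
# double the load it had a moment before.
upgrade_load = per_instance_load(900, 3, 1)   # 450.0
failure_load = per_instance_load(900, 3, 2)   # 900.0
```

Knowing the maximum QPS your deployment unit can sustain lets you check whether that doubled load is survivable before it happens in production.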
To divide along these failure domains, you can split business logic into services based on failure domains and deploy them to multiple zones and/or regions. At a finer level of granularity, it is good to split responsibilities into components and spread these over multiple processes. That way, a failure in one component will not affect other components. If all responsibilities are in one component, a failure in one responsibility has a high likelihood of causing all responsibilities to fail.

When you design microservices, your design should result in loosely coupled, independent but collaborating services. A failure in one service should not cause a failure in another service. It may cause a collaborating service to have reduced capacity or not be able to fully process its workflows, but the collaborating service remains in control and does not fail.

Cascading failures occur when one system fails, causing others to be overloaded and subsequently fail. For example, a message queue could be overloaded because a backend fails and messages placed on the queue can no longer be processed. The graphic on the left shows a Cloud Load Balancer distributing load across two backend servers. Each server can handle a maximum of 1,000 queries per second. The load balancer is currently sending 600 queries per second to each instance. If Server B now fails, all 1,200 queries per second have to be sent to just Server A, as shown on the right. This is much higher than the specified maximum and could lead to a cascading failure.

So how do we avoid cascading failures? Cascading failures can be handled with support from the deployment platform. For example, you can use health checks in Compute Engine, or readiness and liveness probes in GKE, to enable the detection and repair of unhealthy instances. You want to ensure that new instances start fast and ideally do not rely on other backends or systems to start up before they are ready.
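The overload check in the slide's example can be expressed as a small sketch. This is an illustrative capacity calculation only; the function name and signature are hypothetical.

```python
def can_absorb_failure(per_server_qps: float, max_qps: float,
                       servers: int, failed: int) -> bool:
    """Check whether the surviving servers stay under their rated maximum QPS."""
    total = per_server_qps * servers
    survivors = servers - failed
    if survivors <= 0:
        return False
    return total / survivors <= max_qps

# Slide example: two servers at 600 QPS each, rated for 1,000 QPS.
# Losing one pushes all 1,200 QPS onto the survivor -> overload.
print(can_absorb_failure(600, 1000, servers=2, failed=1))  # False

# Four servers at 600 QPS each can absorb one failure (800 QPS per survivor).
print(can_absorb_failure(600, 1000, servers=4, failed=1))  # True
```

Running this kind of check at design time is essentially the capacity-planning exercise described earlier: size the fleet so the load after a failure still fits within each unit's known capability.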
The graphic on this slide illustrates a deployment with four servers behind a load balancer. Based on the current traffic, a server failure can be absorbed by the remaining three servers, as shown on the right-hand side. If the system uses Compute Engine with instance groups and autohealing, the failed server would be replaced with a new instance. As I just mentioned, it's important for that new server to start up quickly so that full capacity is restored as soon as possible. Also, this setup only works for stateless services.

You also want to plan against the query of death, where a request made to a service causes a failure in the service. It is referred to as the query of death because the error manifests itself as overconsumption of resources, but in reality is due to an error in the business logic itself. This can be difficult to diagnose and requires good monitoring, observability, and logging to determine the root cause of the problem. When requests are made, latency, resource utilization, and error rates should be monitored to help identify the problem.

You should also plan against positive feedback cycle overload failure, where a problem is caused by trying to prevent problems. This happens when you try to make the system more reliable by adding retries in the event of a failure. Instead of fixing the failure, this creates the potential for overload: you may actually be adding more load to an already overloaded system. The solution is intelligent retries that make use of feedback from the failing service. Let me discuss two strategies to address this.

If a service fails, it is okay to try again; however, this must be done in a controlled manner. One way is to use the exponential backoff pattern. This performs a retry, but not immediately: you wait between retry attempts, waiting a little longer each time a request fails, thereby giving the failing service time to recover.
The number of retries should be limited to a maximum, and the length of time before giving up should also be limited. As an example, consider a failed request to a service. Using exponential backoff, we may wait one second plus a random number of milliseconds and try again. If the request fails again, we wait two seconds plus a random number of milliseconds and try again. If it fails again, we wait four seconds plus a random number of milliseconds before retrying, and continue until a maximum limit is reached.

The circuit breaker pattern can also protect a service from too many retries. The pattern implements a solution for when a service is in a degraded state of operation. It is important because if a service is down or overloaded and all its clients are retrying, the extra requests actually make matters worse. The circuit breaker design pattern protects the service behind a proxy that monitors the service's health. If the service is not deemed healthy by the circuit breaker, requests are not forwarded to the service. When the service becomes operational again, the circuit breaker begins feeding requests to it again in a controlled manner. If you are using GKE, the Istio service mesh automatically implements circuit breakers.

Lazy deletion is a method that builds in the ability to reliably recover data when a user deletes it by mistake. With lazy deletion, a deletion pipeline similar to the one shown in this graphic is initiated, and the deletion progresses in phases. In the first stage, the user deletes the data, but it can be restored within a predefined time period; in this example, it's 30 days. This protects against mistakes by the user. When the predefined period is over, the data is no longer visible to the user but moves to the soft deletion phase. Here, the data can be restored by user support or administrators. This phase protects against any mistakes in the application.
After the soft deletion period of 15, 30, 45, or even 50 days, the data is deleted and no longer available. The only way to restore it is from whatever backups or archives were made of the data.
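The exponential backoff pattern described earlier can be sketched in Python. This is a minimal illustration, not a production retry library; `request_fn` is a hypothetical zero-argument callable standing in for the request to the failing service, and the delays follow the one, two, four second progression from the example.

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, max_delay=32.0):
    """Retry request_fn with exponential backoff plus random jitter."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: give up
            # Wait 1s, 2s, 4s, ... capped at max_delay, plus a random
            # number of milliseconds so retries from many clients
            # do not all arrive at the same instant.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.random() / 1000)
```

Both the retry count (`max_retries`) and the per-attempt delay cap (`max_delay`) are bounded, matching the guidance above that retries and total waiting time must be limited.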
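The circuit breaker pattern discussed earlier can also be sketched. This is a simplified, hypothetical implementation for illustration (in practice you would use platform support such as Istio rather than hand-rolling one): after a threshold of consecutive failures the circuit "opens" and calls fail fast; after a timeout, one trial request is allowed through, and a success closes the circuit again.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (illustrative, not production code)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, request_fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of adding load to a sick service.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow a single trial request through.
        try:
            result = request_fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success: reset the failure count and close the circuit.
            self.failures = 0
            self.opened_at = None
            return result
```

The key property is that while the circuit is open, clients stop retrying against the degraded service entirely, which breaks the positive feedback cycle described above.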
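The lazy deletion phases can be summarized as a simple state lookup. The function and the two 30-day windows are hypothetical, chosen to match the slide's example; real systems would track per-object deletion timestamps in their data store.

```python
def deletion_phase(days_since_delete: float,
                   user_window: float = 30,
                   soft_window: float = 30) -> str:
    """Which lazy-deletion phase the data is in (windows are example values)."""
    if days_since_delete < user_window:
        return "user-recoverable"   # the user can still restore the data
    if days_since_delete < user_window + soft_window:
        return "soft-deleted"       # user support or administrators can restore it
    return "deleted"                # only backups or archives remain
```

Each phase guards against a different class of mistake: the first against user error, the second against application error, and the final backups against everything else.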