How Facebook troubleshoots large-scale systems engineering (Part 1)

Author: Ben Maurer is the tech lead of the Web Foundation team at Facebook, responsible for the overall performance and reliability of Facebook's user-facing products. Ben formally joined Facebook in 2010 as a member of the infrastructure team. Before joining Facebook, he co-founded reCAPTCHA with Luis von Ahn. Recently he has worked with the United States Digital Service to improve the federal government's use of technology.

Today we bring you Ben Maurer's share, "Facebook's troubleshooting practice for large-scale systems engineering." Because the content is long, we are publishing only the first half today; the rest will be released tomorrow!

Failures are part of any large-scale engineering system. One of Facebook's cultural values is embracing failure. This can be seen in the posters hanging on the walls of its Menlo Park headquarters: "What would you do if you weren't afraid?" and "Fortune favors the bold."

To keep Facebook's systems reliable in the face of rapid change, we study common failure modes and build abstractions to address them. These abstractions ensure that best practices are applied across the entire infrastructure. We also build tools to diagnose problems and cultivate a culture of incident review that drives improvements to prevent future failures.

Why do failures happen?

Although every failure has a unique story, most failures can be attributed to a small number of root causes.

Individual machine failure

A single machine usually encounters an isolated fault that does not affect the rest of the infrastructure. For example, a machine's hard drive may fail, or a service on the machine may hit a code bug, memory corruption, and so on.

The key to avoiding individual machine failures is automation. Automation works best by combining known failure modes (such as hard drive SMART errors) with searches for unknown problems (for example, swapping out servers with unusually slow response times). When automation finds an unknown problem, manual investigation can help develop better tools to detect and fix it.
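As a sketch of the "unknown problem" side of that automation, here is one way to flag servers with unusually slow response times using a simple fleet-wide outlier check. The function name, the z-score threshold, and the input format are illustrative assumptions, not Facebook's actual tooling:

```python
from statistics import mean, stdev

def find_slow_servers(latencies_by_host, z_threshold=3.0):
    """Flag hosts whose mean response time is an outlier vs. the fleet.

    latencies_by_host: dict mapping hostname -> mean latency in ms.
    Returns hostnames more than z_threshold standard deviations above
    the fleet-wide mean -- candidates for automated swap-out.
    """
    values = list(latencies_by_host.values())
    if len(values) < 3:
        return []  # not enough data to call anything an outlier
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # fleet is perfectly uniform; nothing to flag
    return [host for host, lat in latencies_by_host.items()
            if (lat - mu) / sigma > z_threshold]
```

Flagged hosts would then be handed to remediation automation (drain, reimage, or ticket for manual investigation) rather than acted on blindly.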

Legitimate workload changes

Sometimes real-world events change how people use Facebook in ways that challenge the infrastructure. During important global events, for example, unique workloads may stress the infrastructure in unusual ways. When Barack Obama won the 2008 US presidential election, activity on his Facebook page set records. Major sporting events such as the Super Bowl or the World Cup lead to dramatic spikes in the number of posts. Load testing, including "dark launches" in which a new feature is released but kept invisible to users, helps ensure that new features can handle the load.

The statistics collected during these events often provide a unique perspective on system design. Major events frequently lead to changes in user behavior (for example, focused activity around a specific object). Data about these changes often points to design decisions that allow smoother operation in subsequent events.

Human error

Given that Facebook encourages engineers to "move fast and break things" - as another poster decorating the office says - one might expect many failures to be caused by humans. According to our data, human error is indeed a factor in failures. Figure 1 covers data on events serious enough to be considered SLA (service-level agreement) violations. Because the goals are strict, most of these events are minor and not noticeable to users of the site. Figure 1a shows that events occur substantially less often on Saturdays and Sundays, even though site traffic does not drop. Figure 1b shows that over six months there were only two weeks with no events: the week of Christmas and the week when employees write peer reviews for one another.

These two data points seem to show that when Facebook employees are busy with other things (weekends, holidays, and performance reviews) and are not actively changing the infrastructure, the site's reliability is actually higher. This leads us to believe not that Facebook's employees are careless, but rather that the infrastructure is largely self-healing against non-human causes of failure, such as machine failures.

Three easy ways to cause an incident

Although incidents have many different causes, we have found three common factors that amplify a failure into a large-scale problem. For each factor, appropriate safeguards can mitigate large-scale incidents.

Rapidly deployed configuration changes

Configuration systems are often designed to replicate changes quickly on a global scale. Rapid configuration change is a powerful tool that lets engineers quickly manage the launch of new products or adjust settings. However, rapid configuration change also means rapid failure when a bad configuration is deployed. We take a number of steps to keep configuration changes from causing failures.

  • Make everyone use a common configuration system

Using a common configuration system ensures that procedures and tools apply to all types of configuration. At Facebook, we have found that teams are sometimes tempted to handle configuration in a one-off way. Avoiding such one-off approaches and managing configuration in a unified way makes the configuration system a lever for improving the site's reliability.

  • Statically validate configuration changes

Many configuration systems allow loosely typed configurations, such as JSON structures. These types of configurations make it easy for an engineer to make low-level mistakes, such as typing a string where the field requires an integer. The best way to catch this kind of error is static validation. A structured format (for example, Thrift, which is used at Facebook) can provide the most basic validation. It is also reasonable, however, to write validation programs that verify more detailed requirements.
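As an illustration of this kind of static validation - using a hypothetical schema and field names, not Facebook's actual system - a loosely typed JSON config can be checked against expected types before it is ever deployed:

```python
import json

# Hypothetical schema: field name -> required type, in the spirit of the
# Thrift-style structured validation described above.
CACHE_CONFIG_SCHEMA = {
    "max_entries": int,
    "ttl_seconds": int,
    "region": str,
}

def validate_config(raw_json, schema):
    """Reject configs with missing fields or wrong types before deployment."""
    cfg = json.loads(raw_json)
    errors = []
    for field, expected in schema.items():
        if field not in cfg:
            errors.append(f"missing field: {field}")
        elif not isinstance(cfg[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(cfg[field]).__name__}")
    return errors

# A typo'd config: ttl_seconds given as a string instead of an integer.
bad = '{"max_entries": 1000, "ttl_seconds": "60", "region": "us-east"}'
```

Running `validate_config(bad, CACHE_CONFIG_SCHEMA)` catches exactly the string-where-integer-expected mistake described above, before the config reaches production.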

  • Run a Canary

First, deploying a configuration to a small scope of the service can keep a catastrophic change from becoming one. A canary can take many forms. The most obvious is an A/B test, such as launching a new configuration to only 1 percent of users. Multiple A/B tests can run concurrently, and the resulting data can be used to track metrics over time.

For reliability purposes, however, A/B tests do not meet all of our needs. A change deployed to a small number of users that causes the servers involved to crash or run out of memory will clearly have an impact beyond the limited set of users in the test. A/B tests are also time consuming, and engineers often want to push out minor changes without one. For this reason, the Facebook infrastructure automatically tests new configurations on a small set of servers. For example, if we want to deploy a new A/B test to 1 percent of users, we first deploy the test to the 1 percent of users whose requests hit a small number of servers, and monitor those servers for a short time to ensure that they do not crash or exhibit other obvious problems. This mechanism provides a basic "sanity check" for all changes, ensuring that none of them causes widespread failure.
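The server-level canary check described above could be sketched as a simple pass/fail gate. The thresholds, minimum traffic requirement, and function name here are illustrative assumptions, not Facebook's actual values:

```python
def canary_ok(baseline_error_rate, canary_error_rate,
              canary_requests, min_requests=1000, max_ratio=2.0):
    """Basic sanity check for a small-scale canary deployment.

    Returns True only if the canary servers saw enough traffic to judge
    and their error rate is not dramatically worse than the baseline.
    """
    if canary_requests < min_requests:
        return False  # not enough signal yet; keep the canary running
    # Tolerate noise: fail only if the error rate at least doubles
    # (with a small absolute floor so a 0% baseline isn't impossible to pass).
    return canary_error_rate <= max(baseline_error_rate * max_ratio, 0.001)
```

A deployment system would call this after the short monitoring window; a `False` result aborts the rollout to the remaining servers.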

  • Hold on to good configurations

Facebook's configuration system is designed to retain a good configuration when an update fails. Developers naturally tend to build configuration systems that crash when they receive an invalid updated configuration. We prefer systems that keep the old configuration in these situations and alert the operators that the configuration could not be updated. Continuing to run with a stale configuration is usually better than returning errors to users.
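A minimal sketch of this "hold on to the good configuration" behavior - the class and interface are hypothetical, since the article does not show Facebook's actual system:

```python
import json
import logging

class ConfigHolder:
    """Keep serving the last known-good config when an update is invalid.

    An invalid update is logged for operators to act on, and readers of
    `current` keep seeing the previous configuration instead of crashing.
    """
    def __init__(self, initial):
        self._current = initial

    def update(self, raw_json):
        try:
            candidate = json.loads(raw_json)
        except ValueError as e:
            # Alert operators instead of crashing or serving a broken config.
            logging.error("config update rejected, keeping old config: %s", e)
            return False
        self._current = candidate
        return True

    @property
    def current(self):
        return self._current
```

In a real system the `except` branch would page an operator; the essential property is that consumers of `current` never observe the invalid update.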

  • Make it easy to revert

Sometimes, despite all best efforts, a problematic configuration still gets deployed. Quickly finding and reverting the change is the key to resolving this kind of issue. Our configuration system is backed by version control, which makes reverting easy.

Hard dependencies on core services

Developers tend to assume that core services such as configuration management, service discovery, and storage systems never fail. However, even a minor failure in one of these core services can turn into a massive incident.

  • Cache data from core services

Hard dependencies on these types of services are often unnecessary. The data they return can be cached so that a temporary outage of one of these systems does not stop other services from running.
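One way to sketch such caching, using service discovery as the example core service - the `lookup` callable is a hypothetical stand-in for whatever discovery client a service actually uses:

```python
import time

class CachedDiscovery:
    """Cache service-discovery results so a brief outage of the discovery
    service does not take down its callers.

    `lookup` is any callable that returns a list of hosts for a service
    name, or raises on failure.
    """
    def __init__(self, lookup, ttl_seconds=30):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._cache = {}  # service name -> (hosts, fetch time)

    def hosts_for(self, service):
        entry = self._cache.get(service)
        if entry and time.time() - entry[1] < self._ttl:
            return entry[0]  # fresh cached result
        try:
            hosts = self._lookup(service)
            self._cache[service] = (hosts, time.time())
            return hosts
        except Exception:
            if entry:
                return entry[0]  # serve stale data rather than failing
            raise  # no cached copy to fall back on
```

The key design choice is the `except` branch: a stale host list is usually far better than propagating the discovery outage to every caller.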

  • Provide hardened APIs for using core services

Core services are best complemented by public libraries that follow best practices when using them. For example, a library can provide good APIs for managing caches or handling failures.

  • Run a fire drill

You may believe you can survive an outage of a core service, but you never know until you try. For these types of outages we have had to develop fire-drill systems, ranging from fault injection applied to a single server to manually triggered outages of entire data centers.

Increased latency and resource exhaustion

Some failures cause a service's latency to its clients to increase. The increase might be small (for example, a human configuration error that increases CPU usage but leaves the service within its capacity), or it might be nearly infinite (a service whose response-serving threads have deadlocked). While Facebook's infrastructure handles small amounts of additional latency easily, large amounts of latency lead to cascading failures. Almost all services have a limit on the number of outstanding requests, whether because of a limited number of request-handling threads or because of limited memory in an event-based service. If a service experiences a large amount of extra latency, the services that call it will exhaust their own resources. This kind of failure can propagate through many layers of services, causing widespread failure.

Resource exhaustion is a particularly destructive failure mode because it allows the failure of a subset of requests to cause all requests to fail. For example, suppose a service calls a new experimental service that is launched to only 1 percent of users. Normally a request to this experimental service takes 1 millisecond, but if the new service fails, requests for that 1 percent of users might take 1 second instead. The requests for that 1 percent of users can consume so many threads that requests for the other 99 percent of users cannot run.

We have found a number of techniques that can avoid this type of buildup with a low false-positive rate.

  • Control delay

In analyzing past incidents involving latency, we found that many of the worst failures involved large queues of requests awaiting processing. The services in question had a resource limit (such as a number of active threads or an amount of memory) and would buffer requests to keep usage below the limit. Because the services could not keep up with the rate of incoming requests, the queue would grow larger and larger until it hit the application-defined limit. To address this situation, we wanted to limit queue size without hurting reliability during normal operation. We studied research on bufferbloat, which faces a very similar problem: keeping queues from causing excessive delay without sacrificing throughput. We tried a variant of the CoDel (controlled delay) algorithm:

onNewRequest(req, queue):

  if (queue.lastEmptyTime() < (now - N ms)) {
    timeout = M ms
  } else {
    timeout = N ms
  }
  queue.enqueue(req, timeout)

In this algorithm, if the queue has not been empty for the last N milliseconds, the time spent in the queue is limited to M milliseconds. If the service was able to empty the queue within the last N milliseconds, the time spent in the queue is limited to N milliseconds. This prevents a standing queue (because lastEmptyTime will be in the distant past, causing an M-ms queuing timeout) while allowing short bursts of queuing for reliability purposes. While it may seem paradoxical to give requests such short timeouts, this lets the service shed work quickly rather than build up a backlog it cannot keep up with. A short timeout ensures that the server always accepts just a little more work than it can actually handle, so it never goes idle.

An attractive property of this algorithm is that the values of M and N rarely need to be tuned. Other methods of solving the standing-queue problem, such as capping the number of items in the queue or setting a timeout for the queue, have required per-service tuning. We have found that a value of 5 milliseconds for M and 100 milliseconds for N works well across a wide range of use cases. Facebook's open-source Wangle library provides an implementation of this algorithm, which is used by our Thrift framework.
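The pseudocode above could be fleshed out as a runnable sketch like the following. Checking deadlines at dequeue time and injecting the clock are implementation choices of this sketch, not details from the article:

```python
import time
from collections import deque

M_SEC = 0.005   # 5 ms timeout once a standing queue has formed
N_SEC = 0.100   # 100 ms timeout under normal conditions

class CoDelQueue:
    """Sketch of the CoDel-style queue control described above.

    If the queue has not been empty within the last N ms, newly arriving
    requests get the short M-ms deadline; otherwise they get the normal
    N-ms deadline. Requests whose deadline has passed are shed at
    dequeue time. `now` is injectable so the behavior is testable.
    """
    def __init__(self, now=time.monotonic):
        self._now = now
        self._queue = deque()        # entries are (request, deadline)
        self._last_empty = now()

    def enqueue(self, req):
        t = self._now()
        if self._last_empty < t - N_SEC:
            timeout = M_SEC          # standing queue: keep waits short
        else:
            timeout = N_SEC
        self._queue.append((req, t + timeout))

    def dequeue(self):
        """Return the next request that has not timed out, or None."""
        while self._queue:
            req, deadline = self._queue.popleft()
            if self._now() <= deadline:
                if not self._queue:
                    self._last_empty = self._now()  # queue just drained
                return req
            # else: request expired in queue; shed it and keep looking
        self._last_empty = self._now()
        return None
```

The fake-clock test below shows the key behavior: once the queue has been standing for longer than N ms, new arrivals get only the short M-ms deadline, while older backlogged requests expire and are shed.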

  • Adaptive LIFO (last-in, first-out)

Most services process queues in FIFO (first-in, first-out) order. During periods of heavy queuing, however, the first-in request has often been waiting so long that the user may already have abandoned the action that generated it. Processing the first-in request first expends resources on a request that is less likely to benefit a user than one that has just arrived. Our services therefore process requests using adaptive LIFO. Under normal operating conditions, requests are processed in FIFO order, but when a queue starts to form, the server switches to LIFO mode. Adaptive LIFO and CoDel work well together, as shown in Figure 2: CoDel sets short timeouts, preventing long queues from building up, and adaptive LIFO places new requests at the front of the queue, maximizing the chance that they will meet the deadline set by CoDel. HHVM, Facebook's PHP runtime, includes an implementation of adaptive LIFO.
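A minimal sketch of the FIFO-to-LIFO switch follows. The fixed depth threshold is an assumption of this sketch; the article does not specify how HHVM decides that a queue is forming:

```python
from collections import deque

class AdaptiveLifoQueue:
    """Serve requests FIFO under normal load, but once queue depth crosses
    a threshold, switch to LIFO so that fresh requests (whose users are
    still waiting) are served first.
    """
    def __init__(self, lifo_threshold=10):
        self._queue = deque()
        self._threshold = lifo_threshold

    def enqueue(self, req):
        self._queue.append(req)

    def dequeue(self):
        if not self._queue:
            return None
        if len(self._queue) > self._threshold:
            return self._queue.pop()      # LIFO: newest first under pressure
        return self._queue.popleft()      # FIFO in normal conditions
```

Paired with a CoDel-style deadline, the old requests left at the back of the queue are exactly the ones most likely to be shed when their timeouts expire.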

  • Concurrency control

Both CoDel and adaptive LIFO run on the server side. The server is usually the best place to implement latency-avoiding measures: a server tends to serve a large number of clients and often has more information than its clients possess. Some failures are so severe, however, that server-side controls cannot kick in. For these we implement a stopgap in the client. Each client tracks the number of outstanding outbound requests on a per-server basis. When a new request is sent, if the number of outstanding requests to that server exceeds a configurable limit, the request is immediately marked as an error. This mechanism prevents a single service from monopolizing its clients' resources.

The content above is the first half of today's share, "Facebook's troubleshooting practice for large-scale systems engineering." It mainly covers the causes of failures and practices such as using a common configuration system. We hope it helps - tomorrow we will bring you the rest, so stay tuned!

Author: Ben Maurer

Original: Fail at Scale: Reliability in the face of rapid change

