Google's use of Borg for large-scale cluster management (Chapter 3, Chapter 4)
[Editor's Note] These two chapters cover Borg's architecture and availability, moving from the user's perspective to the system designer's perspective.
3. Borg architecture
A Borg cell consists of a set of machines, a logically centralized controller called the Borgmaster, and a Borglet agent process running on each machine (see Figure 1). All Borg components are written in C++.
3.1 Borgmaster

Each cell's Borgmaster consists of two processes: the main Borgmaster process and a separate scheduler (§3.2). The main Borgmaster process handles all client RPCs, both those that mutate state (e.g. creating a job) and those that provide read-only access to data (e.g. looking up a job). It also manages the state machines for all of the objects in the system (machines, tasks, allocs, etc.), communicates with the Borglets, and offers a web UI as a backup to Sigma.
The Borgmaster is logically a single process, but is actually replicated five times. Each replica maintains an in-memory copy of the cell's state, which is also recorded in a highly available, distributed, Paxos-based store on the replicas' local disks. A single elected master per cell serves both as the Paxos leader and as the state mutator, handling all operations that change the cell's state, such as submitting a job or terminating a task on a machine. A master is elected (using Paxos) when the cell is brought up and whenever the current master fails; it acquires a Chubby lock so that other systems can find it. Electing a master and failing over to a new one typically takes about 10s, but can take up to a minute in a big cell because some in-memory state has to be reconstructed. When a replica recovers from a network partition, it dynamically resynchronizes its state from the other, up-to-date Paxos replicas.
The Borgmaster's state at a point in time is called a checkpoint, and it is saved as a periodic snapshot plus a change log kept in the Paxos store. Checkpoints have many uses, including restoring the Borgmaster's state to an arbitrary point in the past (for example, just before processing the request that triggered a software defect, so the defect can be debugged); fixing state by hand in extreme cases; building a persistent log of events for future queries; and offline simulations.
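As a rough illustration of the snapshot-plus-change-log scheme, replaying only a prefix of the log restores the state at an earlier point in time. The following Python sketch is purely illustrative; the names and data shapes are assumptions, not Borg's actual representation:

```python
# Illustrative sketch of checkpoint recovery: state is a snapshot plus
# a change log; replaying a log prefix restores an earlier state.

def restore_state(snapshot: dict, change_log: list) -> dict:
    """Rebuild cell state by replaying logged changes onto a snapshot."""
    state = dict(snapshot)
    for op, key, value in change_log:
        if op == "set":        # e.g. a job was created or updated
            state[key] = value
        elif op == "delete":   # e.g. a task finished and was removed
            state.pop(key, None)
    return state

snapshot = {"job:web": "running"}
log = [("set", "job:batch", "pending"), ("delete", "job:web", None)]
full = restore_state(snapshot, log)         # state after the whole log
earlier = restore_state(snapshot, log[:1])  # state just before the delete
```

Stopping the replay just before a problematic request is what makes "restore the state to an arbitrary point in the past" possible.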
A high-fidelity Borgmaster simulator called Fauxmaster can be used to read checkpoint files; it contains a complete copy of the production Borgmaster code, with stubbed-out interfaces to the Borglets. It accepts RPCs to make state-machine changes and perform operations such as "schedule all pending tasks", and we use it to debug failures by interacting with it just as we would with a live Borgmaster, with simulated Borglets replaying the real interactions from the checkpoint file. A user can step through and observe the changes to the system state that actually occurred in the past. Fauxmaster is also useful for capacity planning ("how many more jobs of this type would fit?") and for sanity checks before changing a cell's configuration ("will this change evict any important jobs?").
3.2 Scheduling
When a job is submitted, the Borgmaster records it persistently in the Paxos store and adds the job's tasks to the pending queue. This queue is scanned asynchronously by the scheduler, which assigns tasks to machines with sufficient available resources. The scheduler primarily operates on tasks, not jobs. It scans from high to low priority, modulated by a round-robin scheme within each priority band, to ensure fairness across users and avoid head-of-line blocking behind a large job. The scheduling algorithm has two parts: feasibility checking, which finds machines on which the task could run, and scoring, which picks the most suitable of the feasible machines.
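The priority-ordered, round-robin scan described above might be sketched as follows. This is an illustrative Python model, not Borg's real code; the task fields are assumptions:

```python
from collections import defaultdict, deque

def scan_pending(tasks):
    """Yield tasks highest priority first; within a priority band,
    round-robin across jobs so one large job cannot block the rest."""
    bands = defaultdict(lambda: defaultdict(deque))
    for t in tasks:
        bands[t["prio"]][t["job"]].append(t)
    for prio in sorted(bands, reverse=True):       # high priority first
        queues = list(bands[prio].values())
        while queues:
            q = queues.pop(0)
            yield q.popleft()
            if q:
                queues.append(q)  # rotate: another job gets the next turn
```

With three tasks from job A and one from job B at the same priority, the scan interleaves them (A, B, A, A) rather than draining A first.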
In feasibility checking, the scheduler finds the set of machines that meet the task's constraints and have enough available resources — including resources assigned to lower-priority tasks that could be evicted. In scoring, the scheduler determines the "goodness" of each feasible machine. The score takes user-specified preferences into account, but is mostly driven by built-in criteria: for example, minimizing the number and priority of preempted tasks, preferring machines that already have a copy of the task's packages, spreading tasks across power and failure domains, and packing quality, which includes mixing high- and low-priority tasks on a single machine so the high-priority ones can expand during load spikes.
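The two-phase structure — filter for feasibility, then score — can be sketched in a few lines. All names and data shapes here are illustrative assumptions:

```python
def schedule_one(task, machines, score):
    """Two-phase placement: feasibility check, then scoring."""
    # Feasibility: machines satisfying the task's constraints with
    # enough free resources (preemptable resources would count too).
    feasible = [m for m in machines
                if m["free"] >= task["need"] and task["constraint"](m)]
    if not feasible:
        return None  # the task stays in the pending queue
    # Scoring: pick the machine the scoring function rates highest.
    return max(feasible, key=lambda m: score(task, m))
```

The real scoring function combines many criteria (preemption cost, package locality, failure-domain spreading); here it is a single pluggable callable.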
Borg originally used a variant of E-PVM for scoring, which generates a single cost value across heterogeneous resources and minimizes the change in cost when placing a task. In practice, however, E-PVM ends up spreading load across all the machines, leaving headroom for load spikes — but at the expense of increased fragmentation, especially for large tasks that need most of a machine; we sometimes call this "worst fit".
The opposite end of the allocation-policy spectrum is "best fit", which packs machines as tightly as possible. This leaves some machines empty of user jobs (they still run storage servers), so placing large tasks is straightforward, but tight packing penalizes any users who under-estimate their resource requirements. This policy hurts applications with bursty loads, and is particularly unfriendly to batch jobs that request little CPU and could otherwise easily be scheduled into unused resources: 20% of non-prod tasks request less than 0.1 CPU cores.
Our current scoring model is a hybrid that tries to reduce the amount of stranded resources — resources that cannot be used because another resource on the machine is fully allocated. It provides packing efficiency about 3-5% better than best fit for our workloads.
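A toy one-dimensional comparison makes the two ends of the spectrum concrete. This sketch is illustrative only (real scoring is multi-dimensional and includes many more criteria):

```python
def worst_fit(free, need):
    # E-PVM-style spreading: prefer the machine with the most room left,
    # keeping headroom for spikes but increasing fragmentation.
    return free - need

def best_fit(free, need):
    # Tight packing: prefer the machine that ends up fullest,
    # which punishes bursty and under-estimated tasks.
    return -(free - need)

def pick(machines, need, score):
    fits = [m for m in machines if m >= need]
    return max(fits, key=lambda free: score(free, need)) if fits else None

machines = [10, 4, 6]         # free capacity per machine
pick(machines, 3, worst_fit)  # -> 10: spread the load
pick(machines, 3, best_fit)   # -> 4: pack as tightly as possible
```

The hybrid model described above sits between these two, penalizing placements that would strand one resource while exhausting another.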
If the machine selected by the scoring phase doesn't have enough available resources to run the new task, Borg preempts (kills) lower-priority tasks, from lowest to highest priority, until the resources suffice. The preempted tasks are added back to the scheduler's pending queue, rather than being migrated or hibernated.
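The preemption rule — evict lowest priority first, and only tasks of strictly lower priority — might be sketched like this (an illustrative model, not Borg's implementation):

```python
def preempt_for(task, machine):
    """Pick lower-priority victims, lowest first, until the new task
    fits; evicted tasks would go back to the pending queue."""
    victims, free = [], machine["free"]
    for v in sorted(machine["tasks"], key=lambda t: t["prio"]):
        if free >= task["need"]:
            break                       # enough room already
        if v["prio"] < task["prio"]:    # never evict equal/higher priority
            victims.append(v)
            free += v["need"]
    return victims if free >= task["need"] else None
```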
Task startup latency (the time from job submission until a task is running) is an area that has received, and continues to receive, significant attention. It is highly variable; the median is typically about 25s. Package installation accounts for about 80% of this time: one known bottleneck is contention for the local disk. To reduce startup time, the scheduler prefers machines that already have most of the necessary packages (programs and data) installed: most packages are immutable, so they can be shared and cached. This is the only form of data locality supported by the Borg scheduler. In addition, Borg distributes packages to machines in parallel using tree- and torrent-like protocols.
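Package locality could enter the scoring phase as a simple bonus term — the fraction of the task's packages already cached on a machine. This is a hypothetical illustration, not Borg's actual formula:

```python
def package_locality_bonus(task_pkgs, machine_pkgs):
    """Fraction of the task's packages already cached on the machine;
    a scorer could add this term to favor low-startup-cost placements."""
    if not task_pkgs:
        return 1.0  # nothing to install
    return len(set(task_pkgs) & set(machine_pkgs)) / len(task_pkgs)
```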
The scheduler also uses several techniques to let it scale to cells with tens of thousands of machines (§3.4).
3.3 Borglet

The Borglet is a local Borg agent present on every machine in a cell. It starts and stops tasks; restarts them if they fail; manages local resources by manipulating OS kernel settings; rolls over debug logs; and reports the state of the machine to the Borgmaster and other monitoring systems.
The Borgmaster polls each Borglet every few seconds to retrieve the machine's current state and to send it any outstanding requests. This gives the Borgmaster control over the rate of communication, avoids the need for an explicit flow-control mechanism, and prevents recovery storms.
The elected master is responsible for preparing the messages sent to the Borglets and for updating the cell's state with their responses. For performance scalability, each Borgmaster replica runs a stateless link shard that handles communication with some of the Borglets; this partitioning is recalculated whenever a Borgmaster election occurs. For resiliency, the Borglet always reports its full state, but the link shards aggregate and compress this information by reporting only differences to the state machines, reducing the load on the elected master.
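The core of what a link shard does — turn a full state report into a small diff for the master — is simple to model. An illustrative sketch (names and shapes assumed):

```python
def diff_report(prev: dict, full: dict) -> dict:
    """Reduce a Borglet's full state report to just the differences the
    elected master needs to apply to its state machines."""
    changes = {k: v for k, v in full.items() if prev.get(k) != v}
    removed = {k: None for k in prev if k not in full}  # vanished entries
    return {**changes, **removed}
```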
If a Borglet does not respond to several poll requests, its machine is marked as down and any tasks it was running are rescheduled on other machines. If communication is restored, the Borgmaster tells the Borglet to kill the tasks that were rescheduled elsewhere, to avoid duplicates. A Borglet continues normal operation even if it loses contact with the Borgmaster, so currently-running tasks and services stay up even if all Borgmaster replicas fail.
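The "mark down after several missed polls" rule resembles a simple failure detector. A minimal sketch, with an assumed (illustrative) threshold:

```python
class BorgletHealth:
    """Mark a machine down after several consecutive missed polls;
    any reply resets the count. Threshold is illustrative."""
    def __init__(self, threshold=3):
        self.threshold, self.missed, self.down = threshold, 0, False
    def poll(self, responded):
        self.missed = 0 if responded else self.missed + 1
        self.down = self.missed >= self.threshold
        return self.down
```

Once `down` becomes true, the master would reschedule the machine's tasks; on reconnection it would instruct the Borglet to kill the now-duplicated ones.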
3.4 Scalability

We are not sure where the ultimate scalability limit of Borg's architecture will come from; so far, every time we have approached a limit, we have managed to eliminate it. A single Borgmaster can manage many thousands of machines in a cell, and several cells have arrival rates above 10,000 tasks per minute. A busy Borgmaster uses 10-14 CPU cores and up to 50GB of RAM. We used several techniques to achieve this scale.
Early versions of the Borgmaster had a simple, synchronous loop that accepted requests, scheduled tasks, and communicated with Borglets. To handle larger cells, we split the scheduler into a separate process, so it could operate in parallel with the other Borgmaster functions, which are replicated for fault tolerance. A scheduler replica operates on a cached copy of the cell state. It repeatedly: retrieves state changes from the elected master (including both assigned and pending work); updates its local copy; makes a scheduling pass to assign tasks; and informs the elected master of those assignments. The master accepts and applies these assignments unless they are inappropriate (e.g. based on out-of-date state), in which case they are reconsidered in the scheduler's next pass. This is quite similar in spirit to Omega's optimistic concurrency, and indeed we recently added the ability for Borg to use different schedulers for different workloads.
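The master-side acceptance check in this optimistic scheme can be sketched with a version counter. The versioning scheme here is an illustrative assumption, not Borg's actual protocol:

```python
def optimistic_apply(master_state, proposal):
    """Accept a scheduler's assignment unless it was computed against
    stale state; a rejected proposal is retried on the next pass."""
    if proposal["seen_version"] != master_state["version"]:
        return False  # out of date: reject, to be reconsidered later
    master_state["assignments"][proposal["task"]] = proposal["machine"]
    master_state["version"] += 1
    return True
```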
To improve response times, we added separate threads to talk to the Borglets and to respond to read-only RPCs. For greater performance, we sharded (partitioned) these functions across the five Borgmaster replicas (§3.3). Together, these keep the 99th-percentile UI response time below 1s and the 95th-percentile Borglet polling interval below 10s.
Several things make the Borg scheduler more scalable:
Score caching: evaluating feasibility and scoring a machine is expensive, so Borg caches a machine's score until its properties or its tasks change — for example, a task on the machine terminates, an attribute is altered, or a task's requirements change. Ignoring small changes in resource quantities makes cache entries live longer.
Equivalence classes: tasks in a Borg job usually have identical requirements and constraints, so rather than determining feasibility for every pending task on every machine and scoring all the feasible machines n times, Borg does feasibility checking and scoring only once per equivalence class — a group of tasks with identical requirements.
Relaxed randomization: it is wasteful to calculate feasibility and scores for every machine in a large cell, so the scheduler examines machines in random order until it has found enough feasible machines to score, then selects the best one from that set. This reduces the amount of scoring and cache invalidation needed as tasks enter and leave the system. Relaxed randomization is somewhat akin to the batch sampling of Sparrow, while additionally handling priorities, preemptions, heterogeneous machines, and package-installation costs.
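The three optimizations above compose naturally. The following Python sketch combines them under illustrative names and data shapes; the sample size and cache key are assumptions, not Borg's real parameters:

```python
import random

def place(task_class, machines, feasible, score, cache, sample_goal=10):
    """Combine the three optimizations: examine machines in random order
    only until enough feasible candidates are found, and cache scores
    per (machine, equivalence class) so identical tasks share lookups."""
    candidates = []
    for m in random.sample(machines, len(machines)):  # relaxed randomization
        if not feasible(task_class, m):
            continue
        key = (m["name"], task_class)       # equivalence-class cache key
        if key not in cache:                # score caching
            cache[key] = score(task_class, m)
        candidates.append((cache[key], m["name"]))
        if len(candidates) >= sample_goal:  # stop early
            break
    return max(candidates)[1] if candidates else None
```

Every task in the same equivalence class reuses the same cache entries, so scheduling a job with thousands of identical tasks costs little more than scheduling one.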
In our experiments (§5), scheduling a cell's entire workload from scratch typically took a few hundred seconds, but did not finish after more than three days when the techniques above were disabled. Normally, though, an online scheduling pass over the pending queue completes in less than half a second.
4. Availability

Failures are the norm in large distributed systems [10,11,12]. Figure 3 provides a breakdown of task-eviction causes in 15 sample cells. Applications that run on Borg are expected to handle such events, using techniques such as replication, storing persistent state in a distributed file system, and taking occasional snapshots. Even so, we try to mitigate the impact of these events as much as possible. For example, Borg:
- Automatically reschedules evicted tasks, on a new machine if necessary
- Reduces correlated failures by spreading the tasks of a job across failure domains such as machines, racks, and power domains
- Limits the allowed rate of task disruptions, and the number of tasks from a job that may be simultaneously down, during maintenance activities such as OS or machine upgrades
- Uses declarative desired-state representations and idempotent mutating operations, so that a failed client can harmlessly resubmit any forgotten requests
- Rate-limits the rescheduling of tasks from machines that become unreachable, because it cannot distinguish between large-scale machine failure and a network partition
- Avoids repeating task::machine pairings that cause task or machine crashes
- Recovers critical intermediate data written to local disk, even if the alloc it is attached to is terminated or moved to another machine, by repeatedly re-running a log-saver task. Users can set how long the system keeps trying; a few days is a common choice.
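The rate-limiting of rescheduling from unreachable machines can be modeled with a token bucket. A hypothetical sketch — the bucket parameters are illustrative, and Borg's actual mechanism is not described in this detail:

```python
import time

class RescheduleLimiter:
    """Token bucket gating how quickly tasks from unreachable machines
    are rescheduled, so a network partition cannot trigger a mass
    rescheduling storm. Rates are illustrative."""
    def __init__(self, rate_per_s=2.0, burst=5):
        self.rate, self.burst = rate_per_s, burst
        self.tokens, self.last = float(burst), time.monotonic()
    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # hold the task; retry once tokens refill
```

A denied task simply waits in the pending queue; if the "failure" was really a partition, the machines reappear before most tasks have moved.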
A key design feature of Borg is that already-running tasks continue to run even if the Borgmaster or a task's Borglet goes down. Keeping the master up is still important, however, because while it is down new jobs cannot be submitted, existing ones cannot be updated, and tasks from failed machines cannot be rescheduled.
The Borgmaster uses a combination of techniques that enable it to achieve 99.99% availability in practice: replication to handle machine failures; admission control to handle overload; and deploying instances using simple, low-level tools to minimize external dependencies (translator's note: presumably tools along the lines of rsync or scp). Each cell is independent of the other cells, which minimizes the chance of correlated operator errors and failure propagation. These goals are a primary argument against very large cells.
Original paper: Large-scale cluster management at Google with Borg (translated by: easy)