Container and persistent data

[Editor's Note] How to keep the data stored in the container, which is the date since the birth of Docker has been the problem. This article describes a number of related projects, including Raft algorithm, Etcd, CockroachDB, Flocker and high availability PostgresSQL cluster Governor and so on.

More and more users want to container all the infrastructure, including not only web servers, queues and proxy servers, but also database servers, file storage and caching. At CoreOS Fest 's application container specification ( appc ) expert group meeting, the audience's first question is: what about the state container, how to do? The so-called state, means that these containers need to save some of the data (that is, state), can not casually discard the data. Google's Tim Hockin replied, "It is best to ensure that all containers are stateless, and of course, to provide an internal mechanism for saving."

How to keep the data stored in the container, which is since the date of the birth of Dock has been the problem. In the initial design of Docker, data is shared with the container, and it is difficult to migrate the container from one machine to another. In Docker 1.0, the only state-related concept is volume : the container can access the external file system, completely out of Docker's control. It is generally accepted that the data should not be put in the container.

According to the application of the container specification expert group, in the latest appc specification , the entire container image and all the initialization files can not be changed, but the specified configuration directory or mount except. The idea behind this is that the data that the administrator is concerned should be stored in a specific, web-mounted directory, which is managed by the container layout framework.

It is no wonder some readers believe that Docker and CoreOS from the beginning to decide to ignore the container data persistence storage problem. Of course, many developers in the field of containers also have this feeling, they are determined to make up for this defect. At the ContainerCamp and CoreOS Fest congresses, a number of projects have been released that address the problem of persistent storage of container data. One solution is to keep the container data persistently in a reliable distributed storage, and administrators no longer have to consider the migration of container data.

Raft consensus algorithm

A reliable distributed storage must ensure that all nodes' data ends (if not immediately) are consistent. Single-node databases are easy to do with this. If it is across multiple nodes of the database, each node may be invalid, you need a set of complex logic to ensure that the consistency of the write operation. This is the "consensus": multiple nodes agree on the shared state. Diego Ongaro is the developer of Raft , LogCabin and RAMCloud , and he introduces the work of the consensus algorithm to the audience.

In 1989, Leslie Lamport proposed the Paxos consensus algorithm . In the next twenty years, this is the only consensus algorithm. But there are some problems with the Paxos algorithm, the most important of which is that it is very complex. Ongaro said: "Maybe only five people really understand all aspects of the Paxos algorithm." If you do not understand the Paxos protocol, people can not write a test to verify the correctness of the Paxos algorithm implementation. This means that there is a big difference in the different implementations of the Paxos algorithm, and you can not prove that an implementation is correct, he says.

To this end, Ongaro and Stanford University professor John Ousterhout together designed a simple and easy to test and explain the consensus algorithm, called the Raft algorithm, meaning is to "escape the island of Paxos." To learn more about the Raft algorithm, read Ongaro's doctoral thesis . Ongaro said: "When deciding on each step of the design choice, we will ask ourselves: which design is easier to explain?"

There are many projects to implement the Raft algorithm , the most important of which is CoreOS's ectd. Project, indicating that Raft algorithm is easy to understand. Ongaro spent half an hour introducing the core of the Raft algorithm and showing a completely Raft model written in JavaScript.

In the Raft cluster, each node has a consensus module and a state machine. The state machine stores the data that the user is interested in, including a serialized log that records the history of the data state change history. The role of the Consensus Module is to ensure that the logs of the current node are consistent with the logs of other nodes. In order to do this, all state changes (write operations) can only be carried out by a leader node: it first sends a message to other nodes, requiring confirmation of this write; only most nodes (called a quorum) Write operation, the leader node will really write data. This is a two-phase submission model, distributed database has been used for a long time.

Each node also has a countdown clock, the countdown time is random, relatively long. If the current node is not available, the node that first completes the countdown will send a message to another node requesting a re-election of the leader node. If most of the nodes have confirmed this message, then the node becomes a new leadership node. When a new leader node sends a log message, it sets a new value for the message's term field, indicating who is the leader when the write occurs. This can avoid log conflicts.

Of course, the real Raft algorithm is not as simple as the one described above, for example, it also takes into account the loss of the log or repeated multiple transmissions. Ongaro can explain Raft's core design in less than half an hour, which means that many developers can apply Raft to the development of the software. For example, the etcs, the CockroachDB, and the Consul project use Raft, and many programming languages ​​have Raft libraries, including Python, C, Go, Erlang, Java, and Scala.

Update: In the comments below, Josh Berkus provided Raft's latest information.

Etcd and Intel

IBM's SDI (Software Defined Infrastructure) Division, Nic Weaver, describes their company's improvements to etcd. Intel is very interested in Docker and CoreOS, and with them, each administrator can manage a large number of machines. With cloud hosting, businesses can easily expand their IT systems and have a lot of services. At this time, rely on configuration management is not enough, Intel that the container can further improve the level of enterprise expansion.

Therefore, Intel also requires the company's team to help improve the container infrastructure software. As mentioned in the first article, Intel and Supermicro together released the Tectonic cluster server suite. In addition, Intel has done some software work, it chose etcd as a breakthrough, trying to provide a solution to build a truly large ectd cluster. The container-based infrastructure management tool based on etcd is able to manage thousands of containers, which means that the number of ectd nodes grows and the number of write operations grows faster.

Intel's team found that the more the nodes contained in the etcd cluster, the slower they became. The reason is that the Raft algorithm requires that every successful write operation must have more than 50% of the nodes synchronize the data to disk and return a successful confirmation. In this way, even if only a few nodes encounter slow storage problems, it will slow down the entire cluster write speed. If you remove the need for disk synchronization, this will eliminate a factor that may slow down the cluster, but also brings the risk: if the data center power outages, the entire cluster state can not be restored.

Intel's solution leverages one of the features of the Xeon processor: asynchronous DRAM self-refresh (ADR). This is a small piece of special memory, the information when the downtime will not be lost, the system can continue to read the information after the restart. ADR is designed for dedicated storage devices, but Linux systems also provide an ADR API, so applications such as ectd can also use ADR.

Intel team to modify etcd, with ADR as a log write buffer, the effect is very significant. The percentage of write time for the entire cluster is reduced from 25% to 2% of the total time, and the number of writes per second is doubled to more than 10,000 times. The modified code is about to be merged into the ectd project.


The Raft consensus algorithm can be used not only in key-like storage like etcd, but also for building a fully functional database. Although ectd is well suited for saving configuration information, it is not suitable as an application's database system because it does not provide functions such as transactions, complex query handling. Cockroach Labs Spenser Kimball introduced their work, a new database called CockroachDB . Why is this name? Because the database is like "Xiaoqiang, you kill it, kill it here, it will come back from somewhere else," he explained.

CockroachDB imitates the design of Google Megastore . When Kimball is still working on Google, I am very familiar with the Megastore project. The main idea of ​​the design is to support the consistency and availability of the entire cluster, including support for the transaction. CockroachDB plans to add SQL support over the distributed key store, similar to Google's Spanner project, which provides the three most important database usage patterns for transactions, SQL, and key-values. "We want users to focus on the construction of business applications without having to worry about finding a temporary distributed database solution," Kimball said.

The deployed CockroachDB database consists of a set of containers that are distributed on the server cluster. The database's key-value address space is divided into multiple ranges, each of which is copied to a portion of the nodes, usually three or five, and these nodes form an Raft consensus group. The entire cluster contains multiple Raft groups, Kimball calls MultiRaft . In this way, the entire cluster can contain more data, is conducive to the expansion of the database.

When each node is running, the default transaction mode is "serializable", that is, all transactions must be able to reproduce in the order of the log. If a serialized transaction fails, the node can be restored to its previous state. In this way, without the need for unnecessary locking mechanism, you can achieve distributed transactions.

At this point, CockroachDB seems to have solved the data persistence storage problem for all of the distributed infrastructure. However, the main problem with CockroachDB is that it has not yet released a stable version, and many good features, such as SQL support, have not yet begun to write. Perhaps in the future CockroachDB can solve many of the problems of persistent storage, it is still necessary to look at other solutions.

High Available PostgresSQL

As the distributed database has not yet reached the extent of the production environment is really available, developers choose the popular database to enhance, so that it has fault tolerance, more convenient container use. Two projects that build highly available PostgresSQL are ClusterHQ's Flocker and Governor of

ClusterHQ CTO Luke Mardsen shows Flocker on ContainerCamp. Flocker is a data volume management tool that can move a database container from one physical host to another physical host. Previously, because of the state, redeploying the database container was a challenging issue. Now with Flocker, the container layout framework can re-deploy stateless services and database containers in almost the same way.

Flocker uses ZFS on Linux to migrate containers between different physical machines. Flocker creates a Docker volume in a dedicated ZFS directory, so that users can move and copy Docker volumes with ZFS snapshots. The user performs the above operation through a simple declarative command line interface.

Flocker is designed as a plugin for Docker. The problem is that Docker does not currently support plugins. As a result, the Flocker team created a plug-in infrastructure called Powerstrip for Docker. However, this tool has not yet been merged into the Docker main branch. Only after the merger, Flocker project can provide a unified management interface.

Flocker solves the problem of container migration. Compose created the Governor project – according to Chris Winslett's argument – to solve the container availability problem. Governor is a choreographed prototype that implements a self-managed, replicated PostgresSQL cluster. You can think of it as a simplified version of the Compose infrastructure.
Compose is a SaaS (Software as a Service) hosting company, which provides all the services must be fully automated. In order to provide PostgresSQL deployment services, you must support automated database copy deployment and failover. Because the user has full database access, Compose hopes that the solution does not need to modify the PosgresSQL code or the user's database.

Winslett pointed out that the main node of the PostgresSQL database cluster can not be used to store the state of the cluster, since it is necessary to ensure that the data of the master node and the replica node are exactly the same (if the primary node stores the cluster status and the replica node is not saved, The information is not exactly the same). Initially select the distributed high availability information service consul to save the state of the cluster. However, consul requires 40GB of virtual memory per data node, and a low-profile cloud server node is clearly not required. Thus, Winslett replaces consul with simpler etcd, which greatly simplifies the processing logic of failover.

Each container has a Governor daemon that controls PostgreSQL. At startup, the govern daemon queries the ectd, finds the current master node, and then copies the data from the master node to the current node. If there is no master node yet, the govern daemon will request etcd to give it a master node key, and ectd ensures that only one node can get the key. When a node gets the key to become the new master node, the other node starts copying data from that node. The master node key is time-to-live (TTL). If the current primary node fails, a new election will be made soon to ensure that the cluster soon has a new master node.

With Governor, Compose manages PostgreSQL in much the same way as the multi-master, non-relational database that manages MongoDB and Elasticsearch. In the Governor system, etcd to save the PostgreSQL node configuration, with the container layout system deployment PostgreSQL nodes, do not need to manually manage, each container does not require special treatment.

in conclusion

Of course, there are many related projects and speeches. For example, during CoreOS Fest, Sean McCord's lecture discussed how the Docker container uses the distributed system ceph as a block device, and how to use the container to run every node of ceph. This method is still relatively preliminary. However, if the container service requires large-scale file storage, using ceph is a viable candidate. Alex Crawford also introduced the new CoreOS boot configuration tool, Cloudconfig .

With the Linux container ecosystem from test instances, web servers to database and file storage, it is foreseeable that new approaches and new tools will continue to address persistent storage problems for container data. Participate in CoreOS Fest and ContainerCamp, you can clearly feel the Linux container technology is not very mature. Let us look forward to, in the next year there are more related projects and methods.

Original links: Containers and persistent data (translation: Liu Quanbo review: Wei Xiaohong)

    Heads up! This alert needs your attention, but it's not super important.