Docker Container Management Platform

A Brief Introduction to the Meituan-Dianping Container Platform

This article introduces the Docker container cluster management platform at Meituan-Dianping (hereinafter "the container platform"). Development began in 2015, and the platform is built on the existing infrastructure and components of the Meituan cloud. Today it provides container computing services to more than a dozen business units, including food delivery, hotels, in-store dining, and Maoyan Movies, carries hundreds of online services, and handles more than 4.5 billion requests per day. The workloads it hosts cover web services, databases, caches, message queues, and more.

Why develop a container management platform

As a large O2O Internet company in China, Meituan-Dianping has grown extremely fast, handling massive volumes of online searches, promotions, and transactions every day. Before the container platform was introduced, all of this business ran on virtual machines in Meituan's private cloud. As the business expanded, beyond providing high stability for online services, the private cloud also needed strong elasticity: the ability to create large numbers of virtual machines quickly during business peaks, and to reclaim resources during troughs and reallocate them to other businesses. Most of Meituan-Dianping's online services face consumers and merchants directly; business types are diverse, and the timing and frequency of scaling vary widely, all of which places very high demands on elasticity. At that point, virtual machines could no longer meet these needs, mainly in the following two respects.

First, virtual machine elasticity is weak. Scaling out a service deployed on virtual machines requires applying for the machines, creating and deploying them, configuring the runtime environment, and starting the service instances. The first few steps fall to the private cloud platform, the rest to business engineers. A single scale-out therefore needs coordination across multiple teams, takes hours to complete, and is hard to automate. If one-click rapid scaling could be automated, it would greatly improve scaling efficiency, free up engineering effort, and eliminate the accidents caused by manual operation.

Second, IT costs are high. Because virtual machines scale poorly, business units keep large fleets of machines and service instances to cope with peak and burst traffic: they deploy many virtual or physical machines up front and reserve resources against peak demand, typically at twice the peak requirement. This reservation approach drives IT costs very high, and during off-peak hours the reserved machines sit idle, an enormous waste.

For these reasons, Meituan began introducing Docker in early 2015 and building a container cluster management platform to give businesses high-performance elastic scaling. A common industry approach is to adopt open-source components from the Docker ecosystem, such as Kubernetes or Docker Swarm. Guided by our own business needs, we instead took the path of a self-developed Docker container management platform built on the existing architecture and components of the Meituan cloud. We chose self-development mainly for the following reasons.

Quickly meeting the diverse needs of Meituan-Dianping's businesses

Meituan-Dianping runs a wide range of business types, covering almost every kind of Internet business. Each has different needs and pain points. For example, stateless services (such as web services) demand low-latency elastic scaling, while the master node of a database requires high availability as well as online adjustment of CPU, memory, and disk configuration. Many services also need SSH access into containers for tuning or rapid fault diagnosis, which requires the platform to provide convenient debugging capabilities. Meeting the diverse needs of different business units takes a great deal of iterative development. By building on platforms and tools we already knew well, we could reach these development goals faster and better, and satisfy the businesses' varied requirements.

Platform stability demands strong control over Docker and the underlying technology

The container platform carries a large volume of Meituan-Dianping's online business, and online services have very high SLA requirements, generally 99.99% availability, so the platform's stability and reliability are its most important metrics. Had we adopted open-source components directly, we would have faced three problems: first, we would need to master each component's interfaces, evaluate its performance, and understand it at least down to the source level; second, building the platform would mean stitching these components together and continuously optimizing away performance bottlenecks and single points of failure at the system level; third, we would need to integrate with Meituan-Dianping's existing infrastructure for monitoring and service governance. All of this is a great deal of work, and more importantly, the stability and availability of a platform assembled this way would be hard to guarantee in a short time.

Avoiding duplicate construction of the private cloud

Meituan's private cloud carries all of Meituan-Dianping's online business and is among the largest private cloud platforms in China. After several years of operation, its reliability has been proven by massive business volume. We could not set aside a mature, stable private cloud and develop a brand-new container platform from scratch just to support containers. From the standpoint of both stability and cost, building the container management platform on top of the existing private cloud was the most economical plan for us.

Design of the Meituan-Dianping container management platform

We treat the container management platform as a form of cloud computing, so cloud computing architecture applies to containers as well. As mentioned above, the container platform's architecture relies on the existing architecture of the Meituan private cloud, and most private cloud components can be reused directly or with a small amount of additional development. The container platform architecture is shown below.
(Figure 1: overall architecture of the container platform)
As the figure shows, the platform divides, from top to bottom, into the business layer, the PaaS layer, the IaaS control layer, and the host resource layer, which is essentially consistent with the architecture of the Meituan cloud.

Business layer : the Meituan-Dianping lines of business that use containers; they are the end users of the container platform.

PaaS layer : uses the container platform's HTTP API to perform container orchestration, deployment, elastic scaling, monitoring, service governance, and other functions, and exposes these capabilities to the business layer above via HTTP APIs or the web.

IaaS control layer : provides the platform's API handling, scheduling, networking, user authentication, image storage, and other management functions, and exposes an HTTP API to the PaaS layer.

Host resource layer : the Docker host cluster, composed of hundreds of nodes across multiple data centers. Each node runs Host-SRV, Docker, a monitoring data collection module, a Volume management module, an OVS network management module, and a cgroup management module.

The vast majority of the container platform's components were developed from existing Meituan private cloud components, including the API, the image repository, the platform controller, Host-SRV, and the network management module. Each is described below.

API

The API is the interface through which the container platform serves external callers; the PaaS layer creates and deploys cloud hosts through it. We regard containers and virtual machines as two different virtualized computing models that can be managed by one unified API: a virtual machine corresponds to a set (described later), and a disk corresponds to a container volume. This brings two advantages: first, business users do not need to change how they manage cloud hosts, so existing virtual-machine-based management processes apply to containers as-is, and workloads can move from virtual machines to containers seamlessly; second, the container platform's API did not have to be developed from scratch, since the API handling flow of the Meituan private cloud could be reused.

Creating a virtual machine involves many stages, typically scheduling, disk preparation, deployment and configuration, and startup, with a lot of interaction between the platform controller and Host-SRV along the way, which adds latency. Containers are comparatively simple, needing only scheduling and deploy-and-start. We therefore simplified the container API, merging disk preparation, deployment configuration, and startup into a single step; after this simplification, creating and starting a container takes less than 3 seconds, essentially matching native Docker startup performance.

Host-SRV

Host-SRV is the container process manager on each host. It is responsible for container management tasks such as image pulling, container disk space management, and container creation and destruction.

Image pulling: after Host-SRV receives a creation request from the controller, it downloads the image from the image repository, caches it, and then loads it into Docker through the Docker Load interface.

Container runtime management: Host-SRV communicates with the Docker Daemon through the local Unix socket interface to control the container life cycle, and supports container logs, exec, and other functions.

Container disk space management: Host-SRV manages the disk space of both the container rootfs and its volumes, and reports disk usage to the controller so the scheduler can factor it into container scheduling decisions.

Host-SRV talks to the Docker Daemon over a Unix socket, and container processes are hosted by docker-containerd, so upgrading Host-SRV does not affect running containers.
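
To make this Unix-socket interaction concrete, here is a minimal sketch, using only the Go standard library, of how a host agent can query the Docker Engine API over /var/run/docker.sock. The socket path and the /containers/json endpoint are standard Docker Engine details; the agent shape itself is illustrative, not the actual Host-SRV code.

```go
// Minimal sketch: talk to the Docker daemon over its Unix socket.
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	// Dial the Docker daemon's Unix socket instead of a TCP address.
	tr := &http.Transport{
		DialContext: func(_ context.Context, _, _ string) (net.Conn, error) {
			return net.Dial("unix", "/var/run/docker.sock")
		},
	}
	client := &http.Client{Transport: tr}

	// List running containers via the Docker Engine HTTP API.
	// (The "docker" host name is a placeholder; routing happens via the socket.)
	resp, err := client.Get("http://docker/containers/json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```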

The image repository

The container platform has two image repositories:

  • Docker Registry: provides a Docker Hub mirror to accelerate image downloads and make it easy for business teams to build images quickly;
  • Glance: a Docker image repository, developed by extending the OpenStack Glance component, that hosts the business units' Docker images.

An image repository is a necessary component not only of a container platform but of a private cloud as well. The Meituan cloud already used Glance as its image repository, and before the container platform was built, Glance hosted only virtual machine images. Each Glance image has a UUID; with the Glance API and an image's UUID, images can be uploaded and downloaded. A Docker image, in turn, is composed of a chain of sub-images, each with its own ID and a Parent ID attribute pointing to its parent image. We made a small modification to Glance, adding a Parent ID attribute to every Glance image and adjusting the upload and download logic (a sketch of the resulting upload flow follows the list below). With this simple extension, Glance gained the ability to host Docker images, which brings the following advantages:

  • One repository can host both Docker and virtual machine images, reducing operational cost;
  • Glance is mature and stable, and using it avoids many of the pitfalls of image management;
  • Glance lets the Docker image repository dock seamlessly with the Meituan private cloud: one set of image APIs supports uploads and downloads of both virtual machine and Docker images, along with distributed storage backends, multi-tenant isolation, and other features;
  • Glance UUIDs and Docker image IDs are in one-to-one correspondence, which we used to guarantee the uniqueness of each Docker image in the repository and avoid redundant storage.
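
To illustrate the upload flow this enables, here is a hedged sketch of uploading a Docker sub-image chain parent-first by walking Parent IDs, deduplicating via the UUID-to-image-ID correspondence. The Layer type and the Store interface are hypothetical stand-ins for the real Glance client, not its actual API.

```go
// Hedged sketch: upload a Docker layer chain to an extended Glance,
// base layer first, skipping layers the repository already holds.
package main

import "fmt"

// Layer is one Docker sub-image: its ID, its parent's ID ("" for the
// base layer), and, in reality, its content (elided here).
type Layer struct {
	ID       string
	ParentID string
}

// Store models the two operations the flow needs from Glance (hypothetical).
type Store interface {
	Exists(id string) bool
	Upload(l Layer) error
}

// uploadChain collects the chain top-down, then replays it bottom-up so
// every layer's parent exists in the repository before the layer itself.
func uploadChain(s Store, layers map[string]Layer, topID string) error {
	var chain []Layer
	for id := topID; id != ""; {
		l, ok := layers[id]
		if !ok {
			return fmt.Errorf("missing layer %s", id)
		}
		chain = append(chain, l)
		id = l.ParentID
	}
	for i := len(chain) - 1; i >= 0; i-- {
		if s.Exists(chain[i].ID) {
			continue // deduplicated: already hosted in Glance
		}
		if err := s.Upload(chain[i]); err != nil {
			return err
		}
	}
	return nil
}

// memStore is a trivial in-memory Store used to exercise the sketch.
type memStore map[string]Layer

func (m memStore) Exists(id string) bool { _, ok := m[id]; return ok }
func (m memStore) Upload(l Layer) error  { m[l.ID] = l; fmt.Println("upload", l.ID); return nil }

func main() {
	layers := map[string]Layer{
		"base": {ID: "base"},
		"app":  {ID: "app", ParentID: "base"},
	}
	if err := uploadChain(memStore{}, layers, "app"); err != nil {
		panic(err)
	}
}
```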

One might ask whether using Glance as the image repository is reinventing the wheel. In fact, our modification to Glance amounted to only about 200 lines of code. Because Glance is simple and reliable, we completed development and launch of the image repository in a very short time. The repository currently hosts more than 16,000 Docker images for business teams, and the average upload and download latency is on the order of seconds.

A high-performance, highly elastic container network

The network is one of the most important, and most technically challenging, areas. A good network architecture needs high transmission performance, high elasticity, multi-tenant isolation, software-defined network configuration, and many other capabilities. Early Docker offered only a simple networking solution with four modes, None, Bridge, Container, and Host, and no interface for user extension. In 2015, Docker 1.9 integrated Libnetwork as its networking solution, allowing users to develop network drivers for their own needs and customize networking behavior, which greatly improved Docker's network extensibility.

For a container cluster, single-host network access is far from enough; the network must also span hosts, racks, and data centers. In this respect Docker and virtual machines are alike, with no essential difference, so in theory the same network architecture can serve both. Based on this idea, the container platform reuses the network infrastructure and components of the Meituan cloud.
(Figure 2: container network architecture)

Data plane : we use 10-Gigabit NICs together with OVS-DPDK, with further optimization of single-flow forwarding performance. A few CPU cores are dedicated to OVS-DPDK forwarding, so only a small amount of compute provides 10-Gigabit data forwarding capability. The cores used by OVS-DPDK are fully isolated from those used by containers and therefore do not affect users' computing resources.

Control plane : we use an OVS-based scheme in which a self-developed software controller is deployed on every host. It dynamically receives the network rules issued by the network service and pushes them into the OVS flow table, which then decides whether to admit a given network flow.

MosBridge

Before MosBridge, we configured container networking with None mode. None mode means fully custom networking, and configuring it takes the following steps (sketched in code after the list):

  1. Pass --net=none when creating the container, so it starts with no network;
  2. After the container starts, create a veth pair;
  3. Attach one end of the veth pair to the OVS bridge;
  4. Use nsenter, the namespace tool, to move the other end of the veth pair into the container's network namespace, then rename it and configure its IP address and routes.
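
As referenced above, here is a minimal sketch of those wiring steps driven from Go via os/exec. The bridge name br-int, the veth names, the addresses, and the container PID are all illustrative assumptions; only the ip, ovs-vsctl, and nsenter invocations themselves are standard tooling.

```go
// Minimal sketch of None-mode manual network wiring (steps 2-4 above).
package main

import (
	"fmt"
	"os/exec"
)

// run executes one command and aborts on failure.
func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("%s %v: %v\n%s", name, args, err, out))
	}
}

func main() {
	pid := "12345" // PID of a container started with --net=none (illustrative)

	// Step 2: create a veth pair on the host.
	run("ip", "link", "add", "veth-host", "type", "veth", "peer", "name", "veth-cont")

	// Step 3: attach the host end to the OVS bridge.
	run("ovs-vsctl", "add-port", "br-int", "veth-host")
	run("ip", "link", "set", "veth-host", "up")

	// Step 4: move the container end into the container's network namespace,
	// rename it, and configure its address and default route.
	run("ip", "link", "set", "veth-cont", "netns", pid)
	run("nsenter", "-t", pid, "-n", "ip", "link", "set", "veth-cont", "name", "eth0")
	run("nsenter", "-t", pid, "-n", "ip", "addr", "add", "192.168.0.10/24", "dev", "eth0")
	run("nsenter", "-t", pid, "-n", "ip", "link", "set", "eth0", "up")
	run("nsenter", "-t", pid, "-n", "ip", "route", "add", "default", "via", "192.168.0.1")
}
```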

In practice, however, we found that None mode has several shortcomings:

  • The container has no network at the moment it starts; some services check the network during startup, so they fail to start;
  • The network configuration is detached from Docker, so it is lost when the container restarts;
  • The network configuration is driven entirely by Host-SRV, with the configuration flow for every NIC implemented inside it; as network features grow, such as adding NICs to a container or supporting VPC, Host-SRV becomes ever harder to maintain.

To solve these problems, we turned to Docker Libnetwork. Libnetwork gives users the ability to develop Docker network drivers and thereby customize network configuration: a user-written driver lets Docker configure a container's IP, gateway, and routes according to supplied parameters. Building on Libnetwork, we developed MosBridge, a Docker network driver for the Meituan cloud network architecture. When creating a container, one passes --net=mosbridge along with the IP address, gateway, OVS bridge, and other parameters to Docker, and MosBridge completes the network configuration. With MosBridge, the network is available as soon as the container is created; the container's network configuration is persisted in MosBridge, so it is not lost on restart. Most importantly, MosBridge fully decouples Host-SRV from Docker, making future network feature upgrades much more convenient. A minimal sketch of the remote-driver plugin protocol follows.
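
Here is that sketch: a skeleton of a Libnetwork "remote" driver speaking Docker's plugin HTTP protocol over a Unix socket. The /Plugin.Activate handshake, the NetworkDriver endpoints, and the plugin socket directory follow the documented plugin protocol; the handler bodies are placeholders rather than MosBridge's actual implementation, and a real driver implements the full endpoint set (CreateEndpoint, Join, Leave, and so on).

```go
// Skeleton of a Libnetwork remote network driver plugin.
package main

import (
	"net"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Docker calls this first to discover what the plugin implements.
	mux.HandleFunc("/Plugin.Activate", func(w http.ResponseWriter, _ *http.Request) {
		w.Write([]byte(`{"Implements": ["NetworkDriver"]}`))
	})

	// "local" scope: this driver manages networks on its own host only.
	mux.HandleFunc("/NetworkDriver.GetCapabilities", func(w http.ResponseWriter, _ *http.Request) {
		w.Write([]byte(`{"Scope": "local"}`))
	})

	// Invoked when a network is created with this driver; a real MosBridge
	// would parse the IP/gateway/OVS-bridge options here and program OVS.
	mux.HandleFunc("/NetworkDriver.CreateNetwork", func(w http.ResponseWriter, _ *http.Request) {
		w.Write([]byte(`{}`))
	})

	// Docker discovers plugins through sockets under /run/docker/plugins/.
	l, err := net.Listen("unix", "/run/docker/plugins/mosbridge.sock")
	if err != nil {
		panic(err)
	}
	if err := http.Serve(l, mux); err != nil {
		panic(err)
	}
}
```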

Solving the Docker storage isolation problem

Many companies using Docker run into the storage isolation problem. Docker's built-in data storage mechanism is the volume: a local disk directory is bind-mounted into the container to serve as its "data disk". This kind of local-disk volume has no capacity limit; any container can keep writing data to its volume until the disk fills up.
(Figure 3)
To address this, we developed an LVM Volume scheme: create an LVM volume group (VG) on the host as the storage backend for volumes, and when creating a container, carve a logical volume (LV) out of the VG to serve as its data disk and mount it into the container, so that volume capacity is hard-limited by LVM (a sketch of the flow follows). Thanks to LVM's powerful management capabilities, volumes can be managed better and more efficiently: we can invoke LVM commands to inspect volume usage, implement pseudo-deletion and a recycle bin for volumes by tagging them, and grow a volume online with LVM resize commands. It is worth noting that LVM is built on the Linux kernel's device mapper, which has a long history, merged as early as kernel 2.6, so its reliability and IO performance can be trusted.
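
Here is the promised sketch of the volume flow, driven from Go via os/exec. The volume group name vg-docker, the LV name, the size, and the mount path are illustrative assumptions; the lvcreate, mkfs.ext4, and mount invocations are standard LVM and Linux tooling.

```go
// Minimal sketch: carve a capacity-capped LV out of a VG for one container.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func run(name string, args ...string) {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("%s %v: %v\n%s", name, args, err, out))
	}
}

func main() {
	// Create a hard-capped 10 GiB logical volume inside the volume group.
	run("lvcreate", "-n", "vol-c42", "-L", "10G", "vg-docker")

	// Format it and mount it where Docker will bind it into the container.
	run("mkfs.ext4", "/dev/vg-docker/vol-c42")
	if err := os.MkdirAll("/data/volumes/c42", 0o755); err != nil {
		panic(err)
	}
	run("mount", "/dev/vg-docker/vol-c42", "/data/volumes/c42")

	// Later: `lvextend -L +5G /dev/vg-docker/vol-c42` grows the volume online,
	// and LVM tags (lvchange --addtag) can back the pseudo-delete/recycle-bin feature.
}
```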

A container state collection module for multiple monitoring services

Monitoring is a very important part of a container management platform: it must capture not only each container's runtime state in real time but also the dynamically changing resources the container occupies. Before designing container monitoring, we noted that Meituan already operated several monitoring services internally, such as Zabbix, Falcon, and CAT, so we did not need to design a complete monitoring service from scratch. The question was rather how to collect container runtime information efficiently and report it to the appropriate monitoring service according to each environment's configuration. In short, we only needed an efficient agent that could gather all the container monitoring data on a host. Two things had to be considered:

  1. The volume of monitoring metrics and data is large, so the collection module must be efficient;
  2. Monitoring overhead must be low: a single host may run dozens or even hundreds of containers, and collecting, processing, and reporting that much data has to stay cheap.

(Figure 4)
To serve the monitoring needs of both business and operations teams, we developed the Mos-Docker-Agent monitoring module based on Libcontainer. The module gathers container data from host interfaces such as proc and cgroup, processes and converts it, and reports it through per-system monitoring drivers. It is written in Go, which is both efficient and convenient for working with Libcontainer. Moreover, collection and reporting bypass the Docker Daemon entirely, so monitoring adds no load to the daemon.

On the configuration side, because the reporting modules are plugins, the target monitoring systems and the reported metrics are highly customizable, so the agent adapts flexibly to different monitoring scenarios. A sketch of this shape follows.
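
The following sketch shows the agent's general shape under stated assumptions: cgroup v1 counters read from the host filesystem and fanned out to pluggable reporter drivers. The Reporter interface and the /sys/fs/cgroup/.../docker/<id> paths are illustrative stand-ins, not the actual Mos-Docker-Agent API.

```go
// Hedged sketch: read per-container cgroup v1 counters and fan them out
// to pluggable monitoring reporters, without touching the Docker Daemon.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Reporter is one pluggable monitoring backend (e.g. Zabbix, Falcon, CAT).
type Reporter interface {
	Report(containerID, metric string, value uint64) error
}

// stdoutReporter is a trivial driver used here in place of a real backend.
type stdoutReporter struct{}

func (stdoutReporter) Report(id, metric string, v uint64) error {
	fmt.Printf("%s %s=%d\n", id, metric, v)
	return nil
}

// readCounter reads a single numeric cgroup file.
func readCounter(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	id := "c42..." // full container ID (placeholder)
	reporters := []Reporter{stdoutReporter{}}

	// Memory usage and cumulative CPU time, straight from cgroup v1.
	mem, _ := readCounter("/sys/fs/cgroup/memory/docker/" + id + "/memory.usage_in_bytes")
	cpu, _ := readCounter("/sys/fs/cgroup/cpuacct/docker/" + id + "/cpuacct.usage")

	for _, r := range reporters {
		r.Report(id, "mem_bytes", mem)
		r.Report(id, "cpu_ns", cpu)
	}
}
```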

Designing for microservice architectures

In recent years, microservice architecture has been on the rise in Internet engineering. Microservices decompose a large service into multiple lightweight components that can be packaged and deployed independently, with the complex logic of the original large service realized through interactions between the services.

Much of Meituan-Dianping's online business follows a microservice architecture. For example, Meituan's service governance framework attaches a monitoring agent to every online service to collect its status information, and there are many similar companion microservices. For such architectures, there are two ways to package with Docker.

  1. Package all microservice processes into one container. But then updates and deployment become inflexible: any microservice update forces a rebuild of the whole container image. This amounts to using the Docker container as a virtual machine and forfeits Docker's advantages.
  2. Package each microservice into its own container. Docker's lightweight, environment-isolating nature makes it well suited to packaging microservices, but this can create additional problems. One is that container counts multiply for large-scale services, putting heavy pressure on distributed scheduling and deployment; the other is performance degradation, for example two closely coupled services with very heavy mutual traffic could be scheduled into different data centers, incurring considerable network overhead.

Kubernetes' answer to the microservice question is the Pod. Each Pod consists of multiple containers and is the smallest unit of service deployment, scheduling, and management. Containers in a Pod share resources, including the network, volumes, and IPC, so containers within the same Pod can communicate with one another efficiently.

We drew on the Pod idea and developed a microservice-oriented container group for the container platform, which we call a set. The logical structure of a set is shown in the figure below.
(Figure 5: logical structure of a set)
The set is the container platform's basic unit of scheduling and of elastic scale-out and scale-in. Each set consists of one BusyBox container plus several business containers; the BusyBox container carries no business logic and is only responsible for managing the set's network, volume, and IPC configuration.
(Figure 6: set configuration in JSON)
All containers in a set share the network, volumes, and IPC. A set is described by a JSON configuration (see Figure 6). Each set instance contains a Container List describing each container's runtime configuration. The important fields are listed below, followed by a sketch of the corresponding structure in Go.

  • Index: the container's number, which determines container startup order;
  • Image: the Docker image name, or its ID in Glance;
  • Options: the container's startup parameters. CPU and MEM are percentages expressing the container's share of the whole set's CPU and memory (for example, in a 4-core set, CPU: 80 means the container may use up to 3.2 physical cores).
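
Here is the sketch referenced above: the set description modeled as Go types, with an illustrative JSON instance. The field names Index, Image, CPU, and MEM follow the description above; everything else (the outer field names, the exact schema) is an assumption for illustration.

```go
// Hedged sketch: Go types mirroring the set's JSON description.
package main

import (
	"encoding/json"
	"fmt"
)

// Container is one entry in a set's Container List.
type Container struct {
	Index   int     `json:"Index"`   // boot order within the set
	Image   string  `json:"Image"`   // Docker image name or Glance ID
	Options Options `json:"Options"` // runtime configuration
}

// Options carries per-container runtime parameters. CPU and MEM are
// percentages of the whole set's allocation: on a 4-core set, CPU=80
// caps the container at 3.2 physical cores.
type Options struct {
	CPU int `json:"CPU"`
	MEM int `json:"MEM"`
}

// Set is the scheduling unit: one BusyBox container plus business containers.
type Set struct {
	Name       string      `json:"Name"`
	Containers []Container `json:"ContainerList"`
}

func main() {
	doc := `{
	  "Name": "demo-set",
	  "ContainerList": [
	    {"Index": 0, "Image": "busybox", "Options": {"CPU": 5, "MEM": 5}},
	    {"Index": 1, "Image": "web-app:1.2", "Options": {"CPU": 80, "MEM": 80}}
	  ]
	}`

	var s Set
	if err := json.Unmarshal([]byte(doc), &s); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s)
}
```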

Through the set, we standardized all of Meituan's containerized business: every online service is described as a set, and the container platform deals only in sets, with scheduling, deployment, starting, and stopping all operating at set granularity.

We applied some special handling to the set implementation:

  • The BusyBox container has Privileged permission and can customize certain sysctl kernel parameters to improve container performance.
  • For stability, users are not allowed to SSH into the BusyBox container, only into the business containers.
  • To simplify volume management, each set has exactly one volume, mounted under the BusyBox container and shared by all the containers in the set.

The containers in a set often come from different teams whose images update at different rates, so we built a grayscale update feature on top of the set. It allows a business to update only some of the container images in a set; through a grayscale update API, an online set can be upgraded in place. The biggest benefit of grayscale updates is the ability to update some containers online while keeping the service uninterrupted.

Solving Docker stability and feature needs: MosDocker

As is well known, the Docker community is very active and its release cadence fast, with a major version roughly every 2 to 4 months, each accompanied by heavy code refactoring. Docker has no long-term-support (LTS) version, and every release inevitably introduces new bugs. Because of this release cycle, a bug fix normally has to wait for the next version: a bug introduced in 1.11 is generally fixed in 1.12, but adopting 1.12 brings new bugs that must wait for 1.13, and so on. Docker's stability therefore struggles to meet production requirements, which makes it necessary to maintain a relatively stable version of our own: when a bug is found, it can be fixed on that version, either with our own patch or with a BugFix backported from the community.

Beyond stability, we also need to develop features driven by Meituan-Dianping's businesses. Some of these needs come from our own production environment rather than being general industry requirements, and the open-source community usually does not take such needs on. Many companies are in a similar position: the in-house infrastructure team has to meet these needs through its own development.

Based on these considerations, we forked from Docker 1.11 and have maintained our own branch since, which we call MosDocker. We chose to start from 1.11 because, beginning with that version, Docker made several major improvements:

  • The Docker Daemon was refactored into three binaries, Daemon, Containerd, and runC, solving the daemon's single-point-of-failure problem;
  • Support for the OCI standard, with containers defined by a unified rootfs and spec;
  • The Libnetwork framework, letting users customize container networking by developing drivers against its interface;
  • A refactored image storage backend, with image IDs changed from random strings to content-based digests, making Docker images more secure.

To date, MosDocker's self-developed features include:

  1. MosBridge, the network driver supporting the Meituan cloud network architecture, on which multi-IP containers, VPC, and other network features are built;
  2. Cgroup persistence, extending the Docker update interface so that more cgroup settings persist with the container and are not lost after a container restart;
  3. Sub-image support in Docker Save, which greatly speeds up Docker image uploads and downloads.

In short, maintaining MosDocker has gradually put Docker's stability under our own control, and lets us customize Docker to the needs of the company's business.

Promotion and application in real business

In the more than a year the container platform has been running, it has taken on workloads from many large business units of Meituan-Dianping, spanning a wide variety of business types. Introducing Docker has brought the business units many benefits, typical ones including the following.

  • Rapid deployment and rapid response to traffic bursts. With Docker, machine application, deployment, and service release are completed in one pass, and business scale-out time drops from hours to seconds, greatly improving business elasticity.
  • Savings on IT hardware and operations costs. Docker uses compute more efficiently, and its elasticity means business units no longer need to reserve large amounts of capacity, saving substantial hardware investment. One business, for example, previously kept 32 virtual machines (8-core, 8 GB) on standby for traffic fluctuations and bursts; after switching to an elastic container scheme of 3 containers plus elastic scaling, its average per-instance QPS rose by 85% and its average resource occupancy fell by 44-56% (see Figures 7 and 8).
  • Online scaling of container resources, keeping services uninterrupted. For stateful services such as databases and caches, adjusting CPU, memory, and disk at run time is a common requirement. When deployed in virtual machines, changing the configuration meant restarting the VM, and the inevitable interruption to availability was a constant pain point. Docker manages CPU, memory, and other resources through Linux cgroups, so reconfiguring a container only requires modifying its cgroup parameters, without restarting the container (see the sketch after this list).
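
As mentioned in the last item, here is a hedged sketch of an online resize by writing cgroup v1 parameters directly, which is essentially what `docker update` does under the hood. The paths assume the cgroup v1 "docker" hierarchy; the container ID and the limits are illustrative.

```go
// Hedged sketch: resize a running container's CPU/memory via cgroup v1.
package main

import "os"

func write(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
		panic(err)
	}
}

func main() {
	id := "c42..." // full container ID (placeholder)

	// Raise the memory limit to 8 GiB without restarting the container.
	write("/sys/fs/cgroup/memory/docker/"+id+"/memory.limit_in_bytes", "8589934592")

	// Allow 4 CPUs' worth of quota per 100 ms scheduling period.
	write("/sys/fs/cgroup/cpu/docker/"+id+"/cpu.cfs_period_us", "100000")
	write("/sys/fs/cgroup/cpu/docker/"+id+"/cpu.cfs_quota_us", "400000")
}
```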

(Figure 7)
(Figure 8)

Concluding remarks

This article has described Meituan-Dianping's Docker practice. After a year of work, the platform has gone from serving our own department to covering most of the company's business units and product lines, and from a single business type to dozens of business lines. The practice has proven that Docker-based container virtualization can improve operational efficiency, streamline release processes, and reduce IT costs.

The Docker platform is still being rolled out across Meituan-Dianping. In the process we have found that Docker (and container technology generally) still has many problems and shortcomings: Docker's IO isolation is weak and cannot limit buffered IO; the Docker Daemon occasionally hangs and stops responding; a container OOM causes the container to be destroyed, while enabling oom_kill_disable may in turn hang the host kernel. In our view, therefore, Docker and virtual machines are complementary: Docker cannot be expected to replace virtual machines in every scenario, and only the combination of the two can satisfy users' diverse cloud computing needs.
