Guangfa Bank operations practice sharing: adapting Docker to traditional operations and maintenance

This is the first installment of the talk transcripts from Shurenyun's Shanghai and Shenzhen "Containers: Mesos / K8S / Swarm Romance of the Three Kingdoms" events. Today's speaker is Shen Weikang, an operations veteran at the Guangfa Bank (Guangdong Development Bank) data center, who gives an all-round account of fitting containers into traditional operations and maintenance.

Shen Weikang, Guangfa Bank Data Center

A mid-career operations engineer who has lived through traditional operations, built automated operations, and is now trying cloud operations.

Hello everyone! I am Shen Weikang from Guangfa Bank. I come from a traditional-industry background and am still in that pit today. What I am sharing today is the work that everyone in traditional operations runs into, would rather not do, and has to do anyway.

CMDB: standardization, differentiation and customization

Whether operations is done the traditional way or a newer way, the CMDB is a core component. If Docker does not have its own CMDB, it becomes awkward to use in many places.

Let me start from the environment to explain what the CMDB does for Docker. If everything could be standardized, things would be simple and convenient. That is a fine ideal, but in reality, and especially in traditional industries, pushing standardization through is difficult.

The problem you then face is differentiation. Once there are many differences, Docker ends up with many kinds of images, because different applications need different images. Even for the same application, different monthly releases have different images: upgrade one library and the image is no longer the same. What should we do? The normal answer is customization, and a Docker image is customized with a Dockerfile.

Almost every vendor product now has a web UI where users pick some options and a Dockerfile is assembled for them. If the page simply exposes FROM, ADD and similar instructions as fields to choose from, users still need some time to adapt.

So we borrowed concepts from traditional operations and added a point of integration with the traditional CMDB. GF's practice is as follows:

First, the operating system. This corresponds to the Dockerfile FROM instruction: the user picks from a list which OS the application runs in, including its version.

Second, commonly used software. Each selection in the drop-down corresponds to an ADD. For example, if Java is selected, environment variables have to be set inside the container so that it can find the Java commands. Tomcat and other software need environment variables as well. Most of the common packages today are Tomcat or Java-related software, and some need specific environment variables, so once a package is selected on the page it is added with ADD and its environment variables are set with ENV at the same time.

Third, uploading a requirements package, which matters a lot for differentiation. For example, one project team's application depends on one Python version and another team's depends on a different one; or an application needs a different version of some .so library than the one shipped with the OS. We do not want to bake every such combination into a separate image for a service catalog, so we provide a requirements-package upload. The project team packs everything it needs beyond the basic OS packages and the common software into this package: libraries or software such as a Python installer or RPM packages, plus the application's own items, such as certificates that must be loaded at startup.

Fourth, running the installation. Once the package is built, an entry point is defined for the installation: an install script placed in the top-level directory of the archive. The package thus carries its own entry point, which lets the application team define what to install and how to install it.

Fifth, port mapping. This corresponds to the Dockerfile EXPOSE instruction: the ports that the application, service or container exposes after it starts.

Sixth, storage. The storage path selection corresponds to VOLUME. If I/O requirements are high, keep the data off AUFS inside the container; if persistence is needed, use an externally mounted path; if data must be shared between hosts, put it on distributed storage or shared storage such as a NAS.

Seventh, startup. This corresponds to CMD. After the project team sets all of this up on one page, it is stored in a Docker-specific CMDB that is connected to the traditional CMDB.
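The seven page fields above map onto Dockerfile instructions, so a record in the Docker CMDB can be rendered mechanically. The following is a minimal sketch under stated assumptions, not GF's actual tool: the `ImageRequest` class, its field names, the `/opt/` and `/opt/demand/` target paths, and the `install.sh` script name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRequest:
    """Hypothetical CMDB record filled in from the web page."""
    base_os: str                                   # 1. operating system -> FROM
    software: dict = field(default_factory=dict)   # 2. package file -> {env var: value}
    demand_package: str = ""                       # 3. uploaded tar with extra libraries
    install_script: str = "install.sh"             # 4. entry script in the archive root
    ports: list = field(default_factory=list)      # 5. -> EXPOSE
    volumes: list = field(default_factory=list)    # 6. -> VOLUME
    command: list = field(default_factory=list)    # 7. -> CMD


def render_dockerfile(req: ImageRequest) -> str:
    """Turn the page selections into Dockerfile text."""
    lines = [f"FROM {req.base_os}"]
    for package, env in req.software.items():
        lines.append(f"ADD {package} /opt/")       # assumed install location
        for name, value in env.items():
            lines.append(f"ENV {name}={value}")
    if req.demand_package:
        # ADD unpacks a local tar archive, so the script lands under /opt/demand/.
        lines.append(f"ADD {req.demand_package} /opt/demand/")
        lines.append(f"RUN sh /opt/demand/{req.install_script}")
    for port in req.ports:
        lines.append(f"EXPOSE {port}")
    for volume in req.volumes:
        lines.append(f'VOLUME ["{volume}"]')
    if req.command:
        quoted = ", ".join(f'"{part}"' for part in req.command)
        lines.append(f"CMD [{quoted}]")
    return "\n".join(lines) + "\n"
```

Keeping the record separate from the rendering means the same CMDB entry can be re-rendered for every monthly release without anyone retyping parameters.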

This CMDB holds three main parts: environment requirement configuration, configuration file management, and application run configuration. The application run configuration is what the project team fills in on that one page. After that, no parameters need to be entered at build time or run time, and each project sets them once here.

In terms of management dimension, a configuration is keyed by an application plus a project identifier. The project identifier can be a monthly release or follow whatever naming rule the team prefers, such as overseas and domestic versions. At GF the monthly identifier is used most; for example, the environments of an application ABC each correspond to a particular monthly version.

Continuous Integration

Image classification

Here images fall into three categories. The first is the base environment image, which contains only the OS plus the libraries the application depends on and the middleware it runs on. Its naming convention is "application name + project identifier", for example "ABC_" followed by "201701" for the January 2017 release, with the tag "base" marking it as a base environment image.

The second is the application version image: the base environment image plus the compiled build artifact, without the environment-specific configuration files. The name is again "application name + project identifier", and the tag becomes the timestamp of the build artifact. Across our whole continuous integration pipeline the unique identifier is the timestamp; other teams may of course choose other unique identifiers.

The third is the application runtime image: the application version image plus the environment configuration files. The development environment has its own database, and every other environment has its own database configuration, so the configurations differ. They could be abstracted into a configuration center, but we still manage them as configuration files. The naming convention is "application name + project identifier" plus the "build artifact timestamp" and the "environment", where the environment is DEV (development), TEST (test) or PROD (production). The last element is the "configuration file timestamp": a project team may define four configuration items at the start of a project and have five some time later, so the configuration file needs its own timestamp. Together these identify a complete runtime image.
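The three naming rules can be written down as small helper functions. This is only a sketch: the talk gives the components of each name but not the exact separators, so the underscore-joined layout, the function names, and the timestamp format below are assumptions. The repository part is lower-cased because Docker repository names must be lowercase.

```python
ENVIRONMENTS = ("DEV", "TEST", "PROD")


def repository(app: str, project: str) -> str:
    """Application name + project identifier, e.g. ABC and 201701."""
    return f"{app}_{project}".lower()


def base_image(app: str, project: str) -> str:
    """Category 1: base environment image, tagged 'base'."""
    return f"{repository(app, project)}:base"


def version_image(app: str, project: str, artifact_ts: str) -> str:
    """Category 2: base image plus build artifact, tagged with its timestamp."""
    return f"{repository(app, project)}:{artifact_ts}"


def runtime_image(app: str, project: str, artifact_ts: str,
                  env: str, config_ts: str) -> str:
    """Category 3: version image plus the configuration for one environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return f"{repository(app, project)}:{artifact_ts}_{env}_{config_ts}"
```

For example, `runtime_image("ABC", "201701", "20170105120000", "TEST", "20170106093000")` yields a tag that records which artifact, which environment, and which configuration revision went into the image.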

Compared with a traditional project flow, GF's process is to build the OS environment for an application, install the appropriate middleware, and then deploy the application's build artifact, using Jenkins to continuously integrate the application version image. The application version image plus the test environment configuration becomes the runtime image for the test environment; the same image plus the production configuration becomes the runtime image for production.

Configuration management

Why use configuration files instead of a configuration center? Adopting a configuration center requires many changes to the application. Traditional applications keep much of their configuration in configuration files. To read that configuration from a store instead, the development environment, for example, would need a matching plug-in that pulls the configuration out of the database and substitutes it for the original configuration file during development, perhaps something like an Eclipse plug-in, and there would still be many supporting pieces to build. Pushing this through just for the sake of Docker runs into very practical problems: first, it takes a long time; second, it meets strong resistance.

Another approach is environment variables. For something like a database configuration, environment variables are somewhat simpler, but the project team still has to pull the setting out of the configuration file, turn it into an environment variable, and change the code: where it used to read the DBURL setting with getProperty, it now has to call System.getenv() to get DBURL.
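The code change this approach demands can be illustrated with a lookup that prefers an environment variable and falls back to the properties file. The talk describes Java (getProperty versus System.getenv()); the sketch below is a Python analogue, and the helper names and the sample DBURL value are hypothetical. It shows the alternative that was considered, not the route GF took.

```python
import os


def load_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def get_setting(name: str, props: dict, environ=None) -> str:
    """Return the environment variable if set, else the file value."""
    environ = os.environ if environ is None else environ
    if name in environ:
        return environ[name]
    return props[name]
```

Every call site that used to read the file directly has to switch to a helper like `get_setting`, which is exactly the per-application rework the talk is wary of.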

So GF uses a configuration file package. The package is a tar archive. We do not restrict its file name, but its directory layout is constrained: the top-level directory is the final sub-URL of the application, that is, the directory seen under Tomcat's webapps. Below that, every configuration file must sit at exactly the relative path it has in the deployed application. If the application has three configuration files, for example, they are packed into the tar at those relative paths.

In other words, the configuration files are packed into a tar according to their relative paths inside the war package. The tar is then uploaded to a per-environment directory. With three stages (development, test and production) there are three directories, each edited by different operations staff. The development copy in principle needs no changes. When moving from development to test, the operations staff for the test environment change the DBURL, database user and similar settings to match the actual environment, and production is handled the same way.
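The layout rule for the configuration package (top-level directory equal to the application's directory under webapps, then each file at its relative path) can be sketched with the standard `tarfile` module. The function names, the `abc` context root, and the sample file paths are illustrative assumptions.

```python
import io
import tarfile


def build_config_tar(context_root: str, files: dict) -> bytes:
    """Pack {relative path: text content} under the context root directory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for relative_path, content in sorted(files.items()):
            data = content.encode("utf-8")
            info = tarfile.TarInfo(name=f"{context_root}/{relative_path}")
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def list_members(tar_bytes: bytes) -> list:
    """Return the member paths so the layout can be checked before upload."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        return archive.getnames()
```

Because the member paths mirror the war layout, extracting the archive into webapps overwrites exactly the configuration files and nothing else.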

Then comes the simplest possible Dockerfile: FROM the application version image, then ADD the configuration package to the target path, which for Tomcat is the webapps directory. Because ADD automatically extracts the tar and overwrites existing files, the result is the actual runtime image for the corresponding environment: one for test, one for production. The project team only needs one person to maintain these configuration files. This approach does not suit applications whose settings are hard-coded in the source, but we stopped accepting hard-coded settings long ago. Java code must not write an environment-specific database URL directly; all application configuration has to be extracted into one or more files.

Version process

This is GF's continuous integration framework. Code is in Git, and there is a build artifact repository and a configuration repository. Although we have moved to Docker, we have not abandoned the traditional environment. The WAR repository holds the war packages assembled by continuous integration, kept and made available for colleagues doing traditional deployments to download. The configuration repository is where the configuration file packages mentioned above are stored. The test image registry is independent, and synchronization between registries is done automatically by script: export the image with one pull and one push.

Developers write code and commit it; after the commit, Jenkins automatically checks it out and compiles it. A code review step using FindBugs runs as part of this. The build produces a war package which, at this stage, is in theory identical to the war package from a normal continuous integration process or from manual assembly. If a traditional deployment is needed, the war package can be downloaded over FTP and put straight into production.

If Docker is used, the war package is added to its first image, the application's base environment image, to produce the application version image. The application version image plus the test environment configuration package becomes the test environment image. Once that image is running it is the test environment, which is tested by testers or by automated tools.

Promotion from test to production follows the same process. GF also has a pre-production environment, handled similarly, which is shared with the test environment. The production image is likewise the version image plus the configuration files. The whole process from application version image to runtime image can be automated and iterated continuously, and the runtime image is started in the corresponding environment.

Everyday operations matters


In traditional operations, when you get a virtual machine it has a fixed IP or DNS name and you can do whatever you want on it. You can view data or performance and, in particular, when there is a performance problem, look at resource usage inside the OS and at application state, including the state of the OS. Carrying these habits into Docker meets a lot of resistance.

If a Docker container has a performance problem, how do you investigate? Following the traditional approach you would SSH into the container. Suppose an application's Tomcat reaches 90% utilization. Do you keep that container alive in production so that the developers can investigate? Or do you destroy it and start one or two new ones so that business volume is unaffected? Opinions differ.

Another approach is to split these simple interactions into categories. For view-type needs, use externally mounted directories as far as possible, for instance to look at generated javacore and heapdump files. In the past the practice was to use kill -3 to generate the heapdump. If the actions that generate a heapdump are instead treated as operations, an agent can be placed on the host: the user picks a container and the heapdump action on a web page, and the agent runs it inside the container through the exec command. The aim is to keep users from interacting with containers directly as far as possible. There are of course blunter options, such as WebSSH.

Application updates and grayscale releases

The point of an application update is that service is not interrupted. People often talk about rolling upgrades, which many products implement. The implementation typically looks like this: with five containers, the upgrade proceeds in batches. The first batch upgrades two; when they are done, the two old containers are destroyed and replaced by the two new ones; some time later the remaining three are upgraded. The thing to watch in a batch upgrade is the destruction of containers.

In common cloud platform scheduling, once the new container's state is OK the platform replaces the original container. But a container being OK does not mean the service is available. The platform reports the container as OK as soon as the Tomcat port is up, yet the service still has to load after Tomcat starts. That takes a few seconds if fast; for a very large application that has had no microservice refactoring, it can take a minute. If the new container has already replaced the old one, that minute is a disaster. So service load time cannot be ignored: the delay before the old container is destroyed must be greater than the time for the container to become OK plus the time for the application to start. When scheduling containers, each project team's service load time should at least be added, whether ten seconds or tens of seconds, and only after that wait should the batch rolling upgrade continue.
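The batch upgrade with a service-level readiness gate can be sketched as follows. This is a minimal illustration, not any scheduler's real API: `start_new`, `service_ready` and `destroy_old` are hypothetical callbacks supplied by the caller, and the five-second margin is an assumed value.

```python
def plan_batches(containers: list, batch_sizes: list) -> list:
    """Split containers into consecutive batches, e.g. sizes [2, 3] for five."""
    batches, start = [], 0
    for size in batch_sizes:
        batches.append(containers[start:start + size])
        start += size
    if start != len(containers):
        raise ValueError("batch sizes must cover every container exactly once")
    return batches


def destroy_delay(container_ok_seconds: float, service_load_seconds: float,
                  margin_seconds: float = 5.0) -> float:
    """Old containers must outlive container start plus service load time."""
    return container_ok_seconds + service_load_seconds + margin_seconds


def rolling_upgrade(containers, batch_sizes, start_new, service_ready, destroy_old):
    """Upgrade batch by batch; destroy old containers only once the new ones serve."""
    for batch in plan_batches(containers, batch_sizes):
        replacements = [start_new(old) for old in batch]
        if not all(service_ready(new) for new in replacements):
            # The container may be "OK" while the service is still loading.
            raise RuntimeError("new containers are not serving; old batch kept")
        for old in batch:
            destroy_old(old)
```

The important design choice is that `service_ready` checks the service, not the container state, so a Tomcat port that is open while the application is still loading does not trigger destruction of the old batch.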

In traditional operations an unavailable application is a serious matter, especially in a bank. So many GF applications expose some service interfaces for operations, and automated monitoring tools check availability through them. From this angle, the scheduler can integrate with traditional monitoring: it calls the traditional operations checks, and only when the service state is available does it carry out the steps of upgrading two of five and then the other three.

The second thing to watch is traffic transfer. After the new service starts, load balancer weights are set automatically so that new traffic goes to the new containers. Because there is a destruction delay, the old container is not destroyed yet; it simply stops receiving new requests. In traditional industries, the application architecture may not tolerate shutting the old container down the moment the new one is ready. In production, especially in a bank, a transaction such as a transfer may run steps 1 to 5 inside one container, rather than step 1 being done in one place and step 2 being callable by anyone else. If steps 1 to 5 are all tied to one container and the old container is stopped at step 3, with no external interface to hand the transfer over to another instance, the consequences are serious. So traffic transfer and destruction timing must be handled carefully.
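Draining an old container before destroying it can be sketched as a small loop: set its load balancer weight to zero, wait until in-flight transactions finish, and only then destroy it. The callbacks (`set_weight`, `in_flight_count`, `destroy`, `sleep`) and the timeout values are hypothetical; a real platform would supply its own load balancer and monitoring calls.

```python
def drain_and_destroy(container, set_weight, in_flight_count, destroy, sleep,
                      timeout_seconds: int = 300, poll_seconds: int = 1) -> bool:
    """Stop new traffic, wait for running transactions, then destroy.

    Returns False and leaves the container alive if it is still busy
    when the timeout expires.
    """
    set_weight(container, 0)            # no new requests reach the old container
    waited = 0
    while in_flight_count(container) > 0:
        if waited >= timeout_seconds:
            return False                # a multi-step transaction is still running
        sleep(poll_seconds)
        waited += poll_seconds
    destroy(container)
    return True
```

Returning False instead of forcing the shutdown leaves the decision to a human or to a higher-level policy, which matches the caution the talk asks for around transfers that run all their steps in one container.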

Many vendors talk about grayscale release, but most only mean that there is no interruption, and "no interruption" has to be truly graceful rather than nominal. GF also emphasizes A/B testing. With a load balancer such as F5 or LVS, it is no problem to select the new or old version by source IP. But the source IP can be spoofed: when Pokemon Go first came out, people who were not abroad could still play by getting a foreign IP. So where the application can accept it, grayscale release should be done by the application itself: for example, each account has a unique ID, and the ID determines whether that account uses the new or the old version. Try not to rely on load balancing for grayscale release.
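Application-level grayscale routing by account ID can be sketched by hashing the ID into a stable bucket. This is one possible scheme under stated assumptions, not GF's implementation: the function name, the SHA-256 hash, and the 100-bucket percentage rollout are all illustrative choices.

```python
import hashlib


def uses_new_version(account_id: str, rollout_percent: int) -> bool:
    """Decide from the account ID alone whether this account gets the new version."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket 0..99
    return bucket < rollout_percent
```

Because the bucket depends only on the account ID, the same account always lands on the same version regardless of which source IP it connects from, and raising the percentage only ever moves more accounts to the new version.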

Elastic scaling and availability

Elastic scaling currently has at least two triggers: one is a business time window, for example changing from ten containers to twenty between nine and ten o'clock; the other is automatic scaling driven by monitoring policies. Scaling out is simple, and going from ten to twenty is no problem. But after scaling out you have to scale back in. In the traditional world, to handle a peak such as "Double 11" you add OS instances at a specific point in time, and afterwards you need a time window to retire them, or you disable them on the F5 first and then reclaim them.

With containers, is fully automatic reclamation and scale-in really desirable? As with destruction timing above, the question is whether scale-in is integrated with traditional monitoring or a service-availability platform, so that containers are removed only when they really can be removed, rather than the page saying "shrink five to three" and two being removed on the spot. At least in our production environment we do not do it that way.

Resource balancing. Suppose the Docker hosts are monitored by the traditional operations monitoring platform, and the platform finds a host at 90% CPU usage. When the Docker scheduler is integrated with that platform and containers must be moved so that application services are not affected, do you move all of them, the ones with the highest CPU consumption, or the ones that have been running longest? No programmer would dare to promise that a program that has been running for a while is as stable as when it had just started. This strategy is critical in production and has to be worked out through trade-offs. Many vendors now support various kinds of elasticity, but elasticity that really works without business impact has to be graceful.
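The three candidate strategies named above (move everything, move the highest CPU consumers, move the longest-running containers) can be compared with a small selection function. The `ContainerStat` fields and the strategy names are hypothetical; the metrics would come from whatever monitoring platform is in use.

```python
from dataclasses import dataclass


@dataclass
class ContainerStat:
    name: str
    cpu_percent: float
    uptime_seconds: int


def pick_to_move(containers: list, strategy: str, count: int = 1) -> list:
    """Return the names of the containers to migrate off a hot host."""
    if strategy == "all":
        return [c.name for c in containers]
    if strategy == "highest_cpu":
        ranked = sorted(containers, key=lambda c: c.cpu_percent, reverse=True)
    elif strategy == "longest_running":
        ranked = sorted(containers, key=lambda c: c.uptime_seconds, reverse=True)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [c.name for c in ranked[:count]]
```

Making the strategy an explicit parameter keeps the policy decision visible and reviewable instead of burying it in the scheduler.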

So far no vendor has said, "Use my platform, including its integration with traditional operations, and you can scale in without users' accounts losing money and without business staff having to do rollback operations." In traditional operations, if application teams wrote their applications with all the graceful-shutdown features, there would be no problem. But business programs are written by people, and people are not controllable.


Once a traditional application runs in a container, archiving and viewing logs becomes a problem. Currently project teams change the application to log to standard output, and a collection platform picks the logs up automatically. GF Bank, for example, currently uses logman with a number of logagent instances shipping automatically to Logstash; the logs are stored in ES and displayed with Kibana. To connect to traditional operations, the project team can instead point the application log directory at an externally mounted path. Traditional monitoring has plug-ins that watch all the files under a path, so it is enough to place that path on mounted shared or distributed storage and use it as the entry point into the traditional log management platform.

The other approach comes from earlier work with CloudFoundry, which emphasizes writing all application logs to the platform as a stream, for example with Flume. If the application's logs are simply written in as they come, timestamps will be out of order. That can be solved, but when container instance A and container instance B belong to the same application, a line from one is followed by a line from the other. To pull out a coherent log segment, developers have to cooperate by putting identifiers in the log. For example, to query the log for one business transaction you search by business code; having found it, how do you extract the log lines from the same container? Clearly the container identifier also has to be written into the log.
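Writing the container identifier into every log line can be done with a filter in Python's standard `logging` module. This is a sketch under assumptions: the `CONTAINER_ID` environment variable is hypothetical (the platform would have to inject it), and the line format is illustrative.

```python
import logging
import os


class ContainerIdFilter(logging.Filter):
    """Attach a container identifier to every log record."""

    def __init__(self, container_id=None):
        super().__init__()
        # CONTAINER_ID is an assumed variable injected by the platform.
        self.container_id = container_id or os.environ.get("CONTAINER_ID", "unknown")

    def filter(self, record):
        record.container_id = self.container_id
        return True


def build_logger(name, stream, container_id=None):
    """Create a logger whose lines start with the container identifier."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(container_id)s %(levelname)s %(message)s"))
    handler.addFilter(ContainerIdFilter(container_id))
    logger.addHandler(handler)
    return logger


def lines_for(container_id, text):
    """Pull one container's lines out of an interleaved log."""
    return [line for line in text.splitlines() if line.startswith(container_id + " ")]
```

With the identifier in front of each line, a search by business code can be followed by a second filter on the container, which recovers a contiguous view of what one instance did.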

For real-time needs, syslog can be used. For log collection only, ELK is sufficient. To minimize application changes, an application with a single log redirects it to standard output. For applications with more than one log, the current plan is to put them in an externally mounted directory and have a dedicated container ship them, rather than other agents. The mounted directory can also feed the traditional application log monitoring platform: traditional operations can watch log updates for application keywords and, when a keyword appears, send a text message saying that the application has hit an exception.


In traditional industries, integrating with traditional monitoring is unavoidable. The integration can be split into several dimensions. One is the monitoring of Docker and the platform itself, integrated through an interface that sends data. Another is host monitoring: the host is a physical OS, and handling an OS is already a standard, natural activity for traditional monitoring.

For container monitoring, you can try cAdvisor or other tools widely used in the industry. Application monitoring is harder. Traditional operations watches the application's CPU and memory usage, data source connections, thread counts and so on, and raises an alarm when a value crosses a threshold. With containers these metrics have to be pulled out of the container. For example, the set of indicators that Tomcat publishes can be exposed through Tomcat's interface. Where they are exposed to needs customization so that they can be collected from the Tomcat inside the container, and each tool needs its own integration work.

On the container network side, apart from the exposed ports, I will not go into Docker itself. For application ports, take HAProxy as an example: the HAProxy port being up does not show that the service in the corresponding container is available. Instead of monitoring only the HAProxy port, have the application expose a service interface and monitor that interface directly, checking whether its return code is 200 or not.
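The "200 or not 200" check on an application-level interface can be sketched with the standard library. The URL passed in and the idea of a dedicated health path are assumptions; `fetch` is injectable so the decision logic can be exercised without a network.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of the application's health interface."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code               # 4xx and 5xx arrive as exceptions


def service_available(url: str, fetch=fetch_status) -> bool:
    """The service counts as available only when the interface returns 200."""
    try:
        return fetch(url) == 200
    except (OSError, http.client.HTTPException):
        return False                    # refused, timed out, or malformed reply
```

A check like this is what the scheduler would consult before destroying old containers or shrinking a service, instead of trusting that an open proxy port means the application is serving.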

Application transformation

Application transformation comes down to four points. That does not mean there are only four things to change, but we try to limit what the application team must care about to these four.

The first is node differentiation: in an environment with three application servers, one of them does things the other two do not. This is common at GF and in the financial industry. After moving to Docker, try not to do this.

The second is persistence. To give a figure, in GF's environment the external storage attached to an OS is rarely less than 500 GB. If an OS instance is to scale dynamically, the less storage it carries the better, since it has no need to do anything else. Ask whether a file really must be kept: does it need external storage because of I/O performance, or is it a log that must persist after the container is destroyed? Or does the container run on host A this time and host B next time and have to read and write the same data? In that case move it into NFSDATA, select the NAS data as a volume, and use that path in the configuration file.

For in-memory data, the recommendation is to strip it out, for example into Redis. If there is a unified application development framework, only the framework needs to be modified. If not, use components that natively support the switch: some frameworks can move such data into Redis through configuration alone, while others cannot.

The third is variables. In traditional operations, the upstream IP is fixed and handed to downstream systems. In a container the environment is variable, and the IP address should be obtained from an environment variable. Using the hostname is not recommended: we have seen hostnames written into logs as an identifier or used directly as the name of a shared directory, and after running in the cloud for a while nobody can make sense of them. They are even less readable than IPs, because nobody manages the container's hostname.

The fourth is being disposable: fast startup and graceful shutdown. With microservice refactoring the application can be disposable, meaning that it starts within a few seconds at any time and can be shut down at any time. An application that is to run well on Docker has to take this into account.

That is all I have to share. Thank you all.
