Kubernetes 1.3: current and upcoming features

Introduction

This article discusses some of the new features in Kubernetes 1.3, as well as functionality that is still in progress. Readers are assumed to be familiar with the basic architecture of Kubernetes.

Support for more types of applications
1, Init container
Init containers are an alpha feature in 1.3, intended to support applications that need some initialization to run inside the Pod before its "normal" containers start, for example initializing a database, or waiting for a database to come up, before starting the application itself. A container that performs such an initialization task is called an "init container". The following figure shows a Pod that contains init containers:

[Figure: a Pod whose init containers 1 and 2 run before the regular containers A and B]

For such a Pod, Kubernetes applies the following run strategy:
The init containers are executed in sequence, i.e. container 1 -> 2 in the figure. If any init container fails to run, the entire Pod fails.
Only when all init containers have completed successfully are the regular containers started, i.e. containers A and B.
In the alpha version, init containers are declared through an annotation. The following figure is an example taken from Kubernetes (slightly cropped):

[Figure: example Pod spec that uses the init-container annotation to fetch index.html before starting nginx]

You can see that before the nginx container starts, an init container is used to fetch index.html, so that requests to nginx then return that file directly. Once the init container feature stabilizes, Kubernetes will add an initContainers field directly to pod.spec, as follows:

[Figure: Pod spec with an initContainers field in pod.spec]
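As a rough sketch (not the exact example from the figure), such a spec might look like the following: an init container downloads index.html into a shared volume, which nginx then serves.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-init
spec:
  # Init containers run to completion, in order, before the regular containers start.
  initContainers:
  - name: fetch-index
    image: busybox
    # Download index.html into the shared volume; the URL here is a placeholder.
    command: ["wget", "-O", "/work-dir/index.html", "http://example.com"]
    volumeMounts:
    - name: workdir
      mountPath: /work-dir
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    volumeMounts:
    - name: workdir
      mountPath: /usr/share/nginx/html
  volumes:
  - name: workdir
    emptyDir: {}
```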

Init containers look like a small feature, but the implementation still has to consider many problems. A few of the more important points:
Resource question: how should the resources required by a Pod with init containers be computed during scheduling? Consider two extremes. If we request the sum of the init containers' and the regular containers' resources, then once the init containers have finished, the Pod no longer uses their share, yet the system still counts it as in use, which wastes resources. On the other hand, not counting the init containers' resources at all makes the system unstable, because the resources they actually use are not accounted for in scheduling. The current approach is a compromise: since init containers and regular containers never run at the same time, the Pod's resource request is the maximum of the two. For the init containers, because they run one after another, we take the maximum over them; for the regular containers, because they run concurrently, we take the sum of their requests. (A worked example is sketched after this list.)
Pod status: currently a Pod can be Pending, Running, Terminating and so on. For a Pod with init containers, if the Pending state is reused it is hard to tell whether the Pod is currently running its init containers or its regular containers. Ideally, we would add a state such as Initializing, but it has not been added in the alpha version.
Health and readiness checks: how should containers be checked once init containers exist? The alpha version disables both checks for init containers, but an init container is a container that actually runs on a node, so in theory it should be checked. For readiness checking, disabling it is a reasonable choice, because an init container is effectively "ready" only once it has run to completion. For health checking, the node needs to know whether a Pod is in its initialization phase; if it is, the node can then health-check the init containers. Kubernetes is therefore likely to add the Initializing Pod state and then enable health checks for init containers.
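Coming back to the resource question, here is a hypothetical illustration of the max/sum rule; the container names and numbers are invented for this example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: request-example
spec:
  initContainers:
  - name: init-a
    image: busybox
    command: ["sh", "-c", "echo init-a"]
    resources:
      requests:
        cpu: 200m
  - name: init-b
    image: busybox
    command: ["sh", "-c", "echo init-b"]
    resources:
      requests:
        cpu: 100m
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 150m
  - name: sidecar
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        cpu: 100m
# Init containers run one at a time, so their contribution is max(200m, 100m) = 200m.
# Regular containers run concurrently, so their contribution is 150m + 100m = 250m.
# The Pod's effective CPU request is therefore max(200m, 250m) = 250m.
```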
There are many more problems around init containers, such as QoS and Pod updates, many of which are still to be resolved; we won't go through them one by one here 🙂

2, PetSet

PetSet is probably a long-awaited feature in the community. It aims to support stateful and clustered applications and is in the alpha stage. Application scenarios for PetSet include leader-election quorum applications such as ZooKeeper and etcd, decentralized quorum systems such as Cassandra, and so on. In a PetSet, each Pod has a unique identity, consisting of its name, its network identity and its storage, all created and maintained by the new PetSet controller component. Let's look at how Kubernetes maintains each part of this unique identity.
The name is the easiest to understand. When we create an RC, Kubernetes creates the specified number of Pod replicas; when we fetch Pod information with kubectl, we get something like the following:

[Figure: kubectl get pods output for an RC, with randomly generated 5-character suffixes in the Pod names]

Here, the 5-character suffix is generated automatically by Kubernetes, and when a Pod restarts we get a different name. For PetSet, a restarted Pod must keep the same name. The PetSet controller therefore maintains an identityMap: every Pod in each PetSet has a unique name, and when a Pod restarts, the PetSet controller can detect which Pod it is and ask the API server to create a new Pod with the same name. The current detection method is simple: the identityMap maintained by the PetSet controller numbers the Pods starting from 0, and synchronization then works like a roll call: whichever number is missing gets restarted.

[Figure: the PetSet controller's identityMap numbering Pods from 0]

In addition, the index has another role: the PetSet controller uses it to guarantee Pod startup order; only after Pod 0 has started will Pod 1 be started.
Network identity is maintained mainly through a stable hostname and domain name, both specified in the PetSet configuration file. For example, the following figure shows a PetSet YAML file (cropped), where metadata.name determines the prefix of the Pod's hostname (the suffix is the index starting from 0 mentioned above) and spec.serviceName specifies the domain name.

[Figure: PetSet YAML file (cropped) showing metadata.name and spec.serviceName]
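A minimal sketch of such a PetSet, reconstructed from the surrounding description rather than copied from the figure, might look like this (the API group/version is what PetSet used in its alpha stage, as far as I recall):

```yaml
apiVersion: apps/v1alpha1      # PetSet's API group while in alpha; may differ from the original figure
kind: PetSet
metadata:
  name: web                    # hostname prefix: the Pods become web-0, web-1, ...
spec:
  serviceName: nginx           # domain part: <pod>.nginx.<namespace>.svc.cluster.local
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
```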

The YAML file above creates two Pods: web-0 and web-1. The full domain name of web-0 is web-0.nginx.default.svc.cluster.local, where web-0 is the hostname, nginx is the domain specified in the YAML, and the rest is the same as for an ordinary service. When the creation request reaches a node, the kubelet sets the hostname via the container runtime's UTS namespace, as shown in the following figure (some components such as the apiserver are omitted).

[Figure: the kubelet setting the Pod's hostname through the container runtime's UTS namespace (apiserver and other components omitted)]

At this point the hostname has been set at the container level. What remains is cluster-level name resolution for clients, i.e. adding DNS records, and this part of the work naturally falls to kube-dns. Readers familiar with Kubernetes will know that to add DNS resolution we need to create a service; likewise, a service must be created for the PetSet. The difference is that for an ordinary service the backend Pods are interchangeable, and a backend Pod is picked with round-robin, client IP, and similar policies. Here, because every Pod is a Pet, we need to be able to locate each individual Pod, so the service we create must satisfy that requirement. For PetSet, Kubernetes uses a headless service. A headless service does not allocate a cluster IP to load-balance the backend Pods; it only adds records to the cluster DNS server, and it is up to the creator to make use of those records. The following figure shows the headless service we need to create; note that clusterIP is set to None, indicating a headless service.

[Figure: headless service YAML with clusterIP set to None]
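A sketch of what that headless service might look like (reconstructed, not the exact content of the figure):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx                  # matches spec.serviceName in the PetSet sketch above
  labels:
    app: nginx
spec:
  clusterIP: None              # None marks this as a headless service: DNS records only, no virtual IP
  selector:
    app: nginx                 # selects the PetSet's Pods
  ports:
  - name: web
    port: 80
```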

After some processing, kube-dns generates records like the following:

[Figure: DNS records generated by kube-dns for the headless service]

You can see that resolving web-0.nginx.default.svc.cluster.local returns web-0's Pod IP, while resolving nginx.default.svc.cluster.local returns the Pod IPs of all the Pets. A common pattern is to discover all peers through the domain name and then communicate with each Pod in turn.
Storage identity is implemented with PV/PVC. When we create a PetSet, we need to specify the data volumes assigned to the Pets, as shown in the following figure:

[Figure: PetSet YAML with volumeClaimTemplates (cropped)]
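A sketch of the volumeClaimTemplates section (the field names follow the PetSet API; the claim name and size are illustrative):

```yaml
# Fragment of the PetSet spec from the sketch above.
spec:
  volumeClaimTemplates:
  - metadata:
      name: www                # each Pet gets its own claim, named www-web-0, www-web-1, ...
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi         # every Pet currently receives a volume of the same size and type
```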

Here volumeClaimTemplates specifies the storage resources each Pet needs. Note that for now all Pets get a data volume of the same size and type. When the PetSet controller receives the request, it creates a PVC for each Pet and then associates each Pet with its corresponding PVC:

[Figure: the PetSet controller associating each Pet with its PVC]

After that, the PetSet controller only needs to keep each Pet associated with its corresponding PVC; the remaining work, such as creating and mounting the data volume, is left to other components.
With name, network and storage identity, PetSet can cover most cases. There is still plenty of room for improvement, though; interested readers can refer to: https://github.com/kubernetes/ … 28718

3, Scheduled Job

Scheduled Job is essentially a cluster-level cron, similar to Mesos Chronos, and uses standard cron syntax. Unfortunately it did not meet the bar for release in 1.3. Scheduled Job was proposed very early, but at the time Kubernetes was focused on the API level, and despite strong demand it was planned to land only after Job (GA in 1.2). Once Scheduled Job is released, users will be able to run jobs on Kubernetes with a single command, for example: kubectl run cleanup --image=cleanup --runAt="0 1 0 0 *" -- /scripts/cleanup.sh. Updates on Scheduled Job can be found at: https://github.com/kubernetes/ … 25595
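For reference, a sketch of what a ScheduledJob manifest might look like; the API was still in flux at the time, so the group/version below is an assumption (the resource was later renamed CronJob), and the image and script are taken from the hypothetical kubectl example above:

```yaml
apiVersion: batch/v2alpha1     # assumed alpha API group; the resource later shipped as CronJob
kind: ScheduledJob
metadata:
  name: cleanup
spec:
  schedule: "0 1 * * *"        # standard cron syntax: run at 01:00 every day
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cleanup
            image: cleanup     # hypothetical image, mirroring the kubectl example above
            command: ["/scripts/cleanup.sh"]
          restartPolicy: OnFailure
```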

4, Disruption Budget

Disruption Budget is intended to give Pods a feedback mechanism, ensuring that applications are not affected by changes initiated by the cluster itself. For example, when a cluster needs to reschedule Pods, an application can use a Disruption Budget to indicate whether its Pods can be migrated. Disruption Budget only covers changes initiated by the cluster itself; it does not cover involuntary events such as a node suddenly dropping offline, or problems of the application itself such as constant restarts. Disruption Budget also did not ship in 1.3.
Like most Kubernetes resources, a PodDisruptionBudget is created from a YAML file. For example, the Disruption Budget shown below selects all Pods with the app: nginx label and requires that at least three of them are running at the same time.

[Figure: PodDisruptionBudget YAML selecting app: nginx Pods, requiring at least three available, with its status shown]
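A reconstruction of such a budget might look like the following; the alpha API group is an assumption and may differ from the original figure:

```yaml
apiVersion: policy/v1alpha1    # assumed API group for the alpha PodDisruptionBudget
kind: PodDisruptionBudget
metadata:
  name: nginx-budget
spec:
  selector:
    matchLabels:
      app: nginx               # applies to all Pods carrying the app: nginx label
  minAvailable: 3              # at least three selected Pods must remain running
```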

The controller manager gains a new component, the Disruption Budget Controller, which maintains the status of all budgets. For example, the status in the figure above says that there are currently four healthy Pods (currentHealthy), the application requires at least three (desiredHealthy), and there are five Pods in total (expectedPods). To keep this status up to date, the Disruption Budget Controller traverses all budgets and all Pods. With the budget status in place, components that want to change a Pod's state must query it first: if the operation would bring the number of available Pods below the application's requirement, the operation is rejected.
Disruption Budget interacts closely with QoS. For example, if an application with a very low QoS level has a very strict Disruption Budget, how should the system handle it? At present Kubernetes does not handle this problem rigorously. One viable approach is to prioritize Disruption Budgets, ensuring that high-priority applications get high-priority Disruption Budgets; in addition, Disruption Budget could be integrated into the quota system, so that high-priority applications can obtain more Disruption Budget quota. Discussions about Disruption Budget can be found at https://github.com/kubernetes/ … 12611

Support for better cluster management
1, Cascading Deletion

Before Kubernetes 1.2, deleting a controller object did not delete the underlying resources. For example, after an RC was removed through the API, the Pods it managed were not deleted (deleting with kubectl did remove them, but only because kubectl contains reaper logic that deletes all the underlying Pods, which is essentially client-side logic). As another example, when the Deployment in the following figure is deleted, its ReplicaSet is not automatically deleted, and consequently the Pods are not reclaimed either.

[Figure: deleting a Deployment leaves its ReplicaSet and Pods behind]

Cascading deletion means that when a controller object is deleted, the objects it manages are reclaimed as well. However, cascading deletion in Kubernetes 1.3 is not simply a matter of copying the kubectl logic to the server side; it is implemented at a more general level: garbage collection. In short, the garbage collector controller maintains a graph of almost all resources in the cluster and receives events when resources are modified. The controller updates the resource graph according to the event type and places the affected resources into a Dirty Queue or an Orphan Queue. For the specifics, refer to the official documentation and the garbage collector controller implementation: https://github.com/kubernetes/ … on.md
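The edges of that resource graph are recorded in each object's metadata as owner references. A sketch of what this looks like on a ReplicaSet owned by a Deployment (the names and UID below are illustrative):

```yaml
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: nginx-12345            # illustrative name
  ownerReferences:             # the edge the garbage collector uses to build its graph
  - apiVersion: extensions/v1beta1
    kind: Deployment
    name: nginx
    uid: 11111111-2222-3333-4444-555555555555   # placeholder UID
    controller: true
```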

2, Node eviction

Node (kubelet) eviction refers to proactively removing Pods before a node becomes overloaded, mainly with respect to memory and disk resources. Before Kubernetes 1.3, the kubelet did not "anticipate" the node's load; it only reacted to known problems: when memory was tight, Kubernetes relied on the kernel OOM killer, and on the disk side it periodically garbage-collected images and containers. This approach has limitations, however. The OOM killer itself consumes resources and its timing is unpredictable, and recycling containers and images cannot handle the problem of containers writing logs: if an application keeps writing logs, it will eventually consume the whole disk without ever being handled by the kubelet.
Node eviction addresses these problems through kubelet configuration. When starting the kubelet, we keep the node stable by specifying thresholds such as memory.available, nodefs.available and nodefs.inodesFree. For example, memory.available<200Mi means that when available memory drops below 200Mi, the kubelet should start evicting Pods (this can be configured as immediate or delayed eviction, i.e. hard vs. soft). In Kubernetes 1.3, node eviction is opt-in and off by default; it can be enabled by configuring the kubelet accordingly.
Although node eviction is a measure taken at the kubelet level, it must also interact correctly with the rest of the cluster. The most important question is how to feed this information back to the scheduler, otherwise evicted Pods are likely to be scheduled right back onto the same node. To this end, Kubernetes adds two new node conditions: MemoryPressure and DiskPressure. When a node's status contains either of them, the scheduler avoids scheduling new Pods onto that node. There is another problem: if a node's resource usage hovers near the threshold, its state may flap between Pressure and Not Pressure. There are many ways to prevent such flapping, for example smoothing filters that also weigh in historical data. Kubernetes currently uses a simpler approach: once a node is in the Pressure state, its resource usage must stay below the threshold for a period of time (5 minutes by default) before it switches back to Not Pressure. This method can produce false positives; for example, if an application periodically requests a chunk of memory and then quickly releases it, the node may stay in the Pressure state. But in most cases the method handles flapping well.
Speaking of evicting Pods, another question that has to be considered is which Pod to evict. Kubernetes defines a number of rules here, which boil down to two main points: 1. judge by QoS: applications with lower QoS levels are considered first; 2. judge by usage: Pods whose usage relative to their requests is higher are given lower priority and evicted first. Details can be found at https://github.com/kubernetes/ … on.md

3, Network Policy

The purpose of Network Policy is to provide isolation between Pods. Users can define communication rules between arbitrary Pods, down to the granularity of ports. For example, the rules in the following figure can be read as: Pods with the label "db" can only be accessed by Pods with the label "frontend", and only on TCP port 6379.

[Figure: NetworkPolicy YAML allowing only "frontend" Pods to reach "db" Pods on TCP port 6379]
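A reconstruction of such a policy might look like this; the beta API group of the time and the label key "role" are assumptions, not copied from the figure:

```yaml
apiVersion: extensions/v1beta1 # assumed beta API group of the time
kind: NetworkPolicy
metadata:
  name: db-allow-frontend
spec:
  podSelector:
    matchLabels:
      role: db                 # the policy protects Pods labelled "db"
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend       # only "frontend" Pods may connect
    ports:
    - protocol: TCP
      port: 6379               # and only on TCP port 6379
```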

Network Policy is currently in beta and API-only. That is, Kubernetes does not actually implement the network isolation: if we submit the above YAML file to Kubernetes, nothing happens beyond the policy being stored. Actually enforcing the policy requires other components; for example, Calico implements a controller that reads the policies created by the user and enforces the isolation, see https://github.com/projectcalico/k8s-policy/ . For more details on Network Policy, see https://github.com/kubernetes/ … cy.md

4, Federation

Federation, also known as a federated cluster, joins multiple Kubernetes clusters together into a single whole without changing how the individual Kubernetes clusters work. According to the official Kubernetes design documents, the purpose of federation is to satisfy needs such as high availability of services and hybrid cloud. Before version 1.3, Kubernetes implemented federation-lite, where the machines of one cluster could come from different zones of the same cloud; in version 1.3, federation-full support is in beta, i.e. each cluster can come from a different cloud (or the same one).
The core components of Federation are federation-apiserver and federation-controller-manager, which run as Pods in one of the clusters. As shown in the figure below, external requests talk directly to the Federation Control Plane, which analyzes them and forwards them to the underlying Kubernetes clusters.

[Figure: external requests going through the Federation Control Plane, which dispatches them to the underlying Kubernetes clusters]

At the application level, Federation currently supports federated services, i.e. a single application accessible across multiple clusters. For details, see http://blog.kubernetes.io/2016 … .html and http://kubernetes.io/docs/admin/federation/

Concluding remarks
Kubernetes 1.3 brings many features, and this article covers only some of them. On the security side, Kubernetes now supports RBAC for finer-grained permission control, and PodSecurityContext has also entered beta to support Pods that need to run with privileges, among other things. On the performance side, thanks to the introduction of protocol buffer serialization, performance has improved several times over, and the move to etcd3 now being prepared will improve it further. Later versions will surely bring us more surprises.
