DockOne Share (12: 2): Exploring Kubernetes Network Principles and Solutions

In 2016, the ClusterHQ container technology adoption survey showed that the use of container technology in production had grown by 96% over the previous year, and that Kubernetes usage had reached 40%, making it the most popular container orchestration tool. So what is Kubernetes? It is an open source platform for the automated deployment, scaling, and operation of container clusters. And what can you do with it? You can deploy applications quickly and predictably, scale them rapidly, roll out new application features seamlessly, and save resources by optimizing the use of hardware. As the Kubernetes era arrives, compute, networking, storage, and security are topics that Kubernetes users cannot avoid. This session shares Kubernetes network principles and solutions.


First, Kubernetes network model

There are two kinds of IP in a Kubernetes network: the Pod IP and the Service Cluster IP. A Pod IP address actually exists on a network card (which may be a virtual device). A Service Cluster IP is a virtual IP: kube-proxy uses iptables rules to redirect traffic for it to a local port and then load-balances that traffic to the backend Pods. The following describes the design model of the Kubernetes Pod network:

1, the basic principles

Each Pod has its own IP address (IP-per-Pod), and all Pods are assumed to be in a directly connected, flat network space.

2, design reasons

Users do not need to think about how to establish connections between Pods, nor about issues such as mapping container ports to host ports.

3, network requirements

All containers can communicate with all other containers without NAT; all nodes can communicate with all containers without NAT; the address a container sees for itself is the same address that others see.

Second, Docker network foundation

Linux networking terms

  1. Network namespace: Linux introduces network namespaces into the network stack. Independent network protocol stacks are isolated in different namespaces and cannot communicate with one another. Docker uses this feature to isolate the networks of different containers.
  2. Veth device pair: Veth device pairs were introduced to allow communication between different network namespaces.
  3. Iptables / Netfilter: Netfilter runs in kernel mode and executes the rules attached to its hooks (filtering, modifying, dropping packets, and so on). Iptables is a process that runs in user mode and helps maintain the Netfilter rule tables in the kernel. Together they provide the flexible packet processing mechanism of the Linux network stack.
  4. Bridge: a bridge is a Layer 2 network device. It connects the different ports that Linux supports and provides many-to-many communication, much like a switch.
  5. Routing: Linux includes complete routing functionality. When the IP layer sends or forwards data, it uses the routing table to decide where to send it.
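The routing decision in item 5 can be illustrated with a small longest-prefix-match lookup. This is only a sketch of the idea, not how the kernel implements it; the `lookup` helper, the `ROUTES` table, and all addresses in it are made up for illustration.

```python
import ipaddress

def lookup(routes, destination):
    """Return the next hop of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the longest matching prefix so far
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best[1] if best else None

# Hypothetical routing table: (destination prefix, next hop or device).
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1"),     # default route
    ("172.17.0.0/16", "docker0"),     # containers on the local bridge
    ("10.1.2.0/24", "192.168.1.12"),  # a subnet reachable through another host
]

print(lookup(ROUTES, "172.17.0.5"))
print(lookup(ROUTES, "10.1.2.7"))
```

The most specific matching prefix wins, and the default route only applies when nothing narrower matches.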

Docker ecosystem technology stack

The following figure shows where Docker networking sits in the overall Docker ecosystem technology stack:

Docker network implementation

  1. Single-host network modes: Bridge, Host, Container, and None. These are not covered in detail here.
  2. Multi-host network modes: one is the Libnetwork project that Docker introduced in version 1.9, which natively supports cross-node networking; the other is third-party implementations introduced as plugins, such as Flannel and Calico.

Third, Kubernetes network foundation

1, communication between containers

Containers in the same Pod share one network namespace, so they can reach one another using the localhost address plus the container port.

2, communication between Pods on the same Node

The default route of Pods on the same Node is the address of docker0. Because they are attached to the same docker0 bridge and share the same address segment, they can communicate with one another directly.


3, communication between Pods on different Nodes

Communication between Pods on different Nodes must satisfy two conditions: Pod IPs must not conflict, and each Pod IP must be associated with the IP of the Node it runs on, so that Pods can reach one another through that association.
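The first condition (no conflicting Pod IPs) is often met by giving each Node its own non-overlapping Pod subnet. The sketch below checks such a plan for overlaps; the `find_conflicts` helper and the node names and CIDRs are hypothetical.

```python
import ipaddress
from itertools import combinations

def find_conflicts(node_subnets):
    """Return pairs of nodes whose Pod subnets overlap."""
    nets = {node: ipaddress.ip_network(cidr) for node, cidr in node_subnets.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]

# Hypothetical plan: one /24 per node, plus one misconfigured node.
plan = {
    "node1": "10.1.1.0/24",
    "node2": "10.1.2.0/24",
    "node3": "10.1.1.128/25",  # overlaps with node1
}
print(find_conflicts(plan))
```

An empty result means that no two Nodes can hand out the same Pod IP.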


4, Service introduction

A Service is an abstraction over a group of Pods. It acts as a load balancer (LB) for that group and is responsible for distributing requests to the corresponding Pods. The Service provides an IP for this LB, commonly called the ClusterIP.

5, kube-proxy introduction

kube-proxy is a simple network proxy and load balancer. Its main role is to implement Services: specifically, access from Pods to Services inside the cluster, and access from NodePort to Services from outside the cluster.

Implementation modes:

  • Userspace: load balancing is done in user space by the kube-proxy process acting as the LB proxy for the Service. This is the original kube-proxy mode. It is more stable, but naturally not very efficient.
  • Iptables: load balancing is implemented purely with iptables. This is the default kube-proxy mode.

This is how kube-proxy works in iptables mode:

  • In this mode, kube-proxy watches the Kubernetes master for the addition and removal of Service and Endpoints objects. For each Service, it installs iptables rules that capture traffic to the Service's clusterIP (which is virtual) and port and redirect that traffic to one of the Service's backends. For each Endpoints object, it installs iptables rules that select a backend Pod.
  • By default, the backend is chosen at random. Client-IP-based session affinity can be selected by setting service.spec.sessionAffinity to "ClientIP" (the default is "None").
  • As with the userspace proxy, the end result is that any traffic bound for the Service's IP:port is proxied to an appropriate backend, without the client knowing anything about Kubernetes, Services, or Pods. This should be faster and more reliable than the userspace proxy. However, unlike the userspace proxy, the iptables proxy cannot automatically retry another Pod if the Pod it initially selects does not respond, so it depends on having working readiness probes.
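The selection behaviour described above (random by default, sticky per client when sessionAffinity is "ClientIP") can be modelled in a few lines. This is a conceptual sketch only: kube-proxy expresses this logic as iptables rules rather than Python, affinity timeouts are ignored, and the `ServiceBackends` class and the endpoint addresses are invented for illustration.

```python
import random

class ServiceBackends:
    """Toy model of backend selection for one Service."""

    def __init__(self, endpoints, session_affinity="None", rng=None):
        self.endpoints = list(endpoints)
        self.session_affinity = session_affinity  # "None" or "ClientIP"
        self.rng = rng or random.Random()
        self._sticky = {}  # client IP -> backend chosen earlier

    def pick(self, client_ip):
        # With ClientIP affinity, reuse the backend this client got before.
        if self.session_affinity == "ClientIP" and client_ip in self._sticky:
            return self._sticky[client_ip]
        backend = self.rng.choice(self.endpoints)  # random selection
        if self.session_affinity == "ClientIP":
            self._sticky[client_ip] = backend
        return backend

# Hypothetical Pod endpoints behind one Service.
svc = ServiceBackends(
    ["10.1.1.2:80", "10.1.2.3:80", "10.1.3.4:80"],
    session_affinity="ClientIP",
)
first = svc.pick("10.0.0.9")
print(first == svc.pick("10.0.0.9"))  # the same client keeps the same backend
```

Note that nothing in this model retries a different backend on failure, which mirrors the limitation of the iptables mode mentioned above.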

6, kube-dns introduction

kube-dns assigns a subdomain to each Kubernetes Service so that the Service can be reached by name inside the cluster. Normally kube-dns creates a record for a Service named "<service name>.<namespace>.svc.cluster.local" that resolves to the Service's ClusterIP.

kube-dns components:

  • Before Kubernetes v1.4, it consisted of four components: Kube2sky, Etcd, Skydns, and Exechealthz.
  • In Kubernetes v1.4 and later, it consists of three components: Kubedns, Dnsmasq, and Exechealthz.
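As a small illustration of the naming scheme, the helper below builds the name from its parts. The function name `service_dns_name` and the example Service and namespace are hypothetical; "cluster.local" is the cluster domain used in the text above.

```python
def service_dns_name(service, namespace="default", cluster_domain="cluster.local"):
    """Build the in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

# Hypothetical Service "redis" in namespace "prod".
print(service_dns_name("redis", "prod"))
```

Inside a cluster, a Pod would resolve such a name through kube-dns to obtain the ClusterIP of the Service.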


Kubedns

  • Integrates SkyDNS and provides the query service for dnsmasq.
  • Replaces the etcd container, keeping DNS records in memory in a tree structure.
  • Watches Service resources for changes through the Kubernetes API and updates the DNS records.
  • Serves on port 10053.


Dnsmasq

Dnsmasq is a lightweight tool for configuring DNS.

Its role in the kube-dns plugin is to:

  1. Obtain DNS rules from the kubedns container and provide DNS query service within the cluster
  2. Provide a DNS cache to improve query performance
  3. Reduce the load on the kubedns container and improve stability

The Dockerfile is in the contrib repository of the Kubernetes organization on GitHub, in the dnsmasq directory.

As the kube-dns plugin configuration file shows, dnsmasq designates kubedns as its upstream through the --server parameter (port 10053).


Exechealthz

  • Provides the health check in the kube-dns plugin.
  • Its source code is also in the contrib repository, in the exec-healthz directory.
  • The new version health-checks both of the other containers, which is more thorough.

Fourth, Kubernetes open source network components

1, technical terms

IPAM: IP address management. IP address management is not unique to containers; in traditional networks, DHCP, for example, is in effect a form of IPAM. In the container era there are two mainstream IPAM approaches: allocating IP address segments based on CIDR, or assigning an exact IP to each container. Either way, once container hosts form a cluster, every container on them must be given a globally unique IP address, which is where IPAM comes in.
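The two approaches can be combined in a toy allocator: carve a per-host subnet out of a cluster CIDR, then hand out exact addresses from that subnet. This is a minimal sketch under assumptions not stated in the text: the `SimpleIPAM` class, the one-subnet-per-host policy, and the example CIDR are all hypothetical, and released addresses are never reclaimed.

```python
import ipaddress

class SimpleIPAM:
    """Toy IPAM: one subnet per host, one unique IP per container."""

    def __init__(self, cluster_cidr, host_prefix):
        # Lazily yields non-overlapping subnets of the cluster CIDR.
        self._free_subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=host_prefix)
        self._host_subnet = {}
        self._host_ips = {}

    def subnet_for(self, host):
        """Return the host's subnet, allocating one on first use."""
        if host not in self._host_subnet:
            subnet = next(self._free_subnets)  # StopIteration when exhausted
            self._host_subnet[host] = subnet
            self._host_ips[host] = iter(subnet.hosts())
        return self._host_subnet[host]

    def allocate(self, host):
        """Return the next unused container IP on the given host."""
        self.subnet_for(host)
        return next(self._host_ips[host])

ipam = SimpleIPAM("10.1.0.0/16", 24)
print(ipam.subnet_for("node1"), ipam.allocate("node1"), ipam.allocate("node2"))
```

Because the per-host subnets never overlap, every address handed out is unique across the cluster without any coordination between hosts at allocation time.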

Overlay: an independent network built on top of an existing Layer 2 or Layer 3 network. It usually has its own independent IP address space and its own switching or routing implementation.

IPsec: a point-to-point encrypted communication protocol, generally used for the data channel of an Overlay network.

VXLAN: a solution proposed jointly by VMware, Cisco, Red Hat, and others. Its main purpose is to solve the problem that VLAN supports too few virtual networks (4096). Because every tenant in a public cloud has its own VPC, 4096 is clearly not enough. VXLAN can support about 16 million virtual networks, which is essentially enough for a public cloud.

Bridge: a network device that connects two peer networks. In today's context it refers to the Linux bridge, namely the well-known docker0 bridge.

BGP: the routing protocol used between autonomous networks. Today's Internet is made up of many small autonomous networks, and Layer 3 routing between them is implemented by BGP.

SDN, OpenFlow: terms from software-defined networking. Terms we often hear, such as flow table, control plane, and forwarding plane, come from OpenFlow.
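The two figures follow from the width of the identifier fields: a VLAN ID is 12 bits, while a VXLAN network identifier (VNI) is 24 bits. The arithmetic, as a quick check:

```python
VLAN_ID_BITS = 12     # width of the VLAN ID field
VXLAN_VNI_BITS = 24   # width of the VXLAN network identifier (VNI)

vlan_networks = 2 ** VLAN_ID_BITS
vxlan_networks = 2 ** VXLAN_VNI_BITS

print(vlan_networks)   # 4096
print(vxlan_networks)  # 16777216, roughly 16 million
```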

2, container network solutions

Tunnel solutions (Overlay Networking)

Tunneling is also widely used in IaaS-layer networks. The consensus is that complexity grows as the number of nodes grows, and that network problems become cumbersome to track down. This is a point to consider for large clusters.

  • Weave: UDP broadcast; creates a new bridge (BR) on the host and interconnects through PCAP
  • Open vSwitch (OVS): based on the VXLAN and GRE protocols, but with a fairly serious performance loss
  • Flannel: UDP broadcast, VXLAN
  • Rancher: IPsec

Routing solutions

Routing solutions generally implement isolation and cross-host container communication at Layer 3 or Layer 2, and problems are easy to troubleshoot.

  • Calico: a BGP-based routing solution that supports very fine-grained ACL control and has relatively high affinity with hybrid clouds.
  • Macvlan: the best solution in terms of logical and kernel-level isolation and performance. It is based on Layer 2 isolation and therefore needs support from Layer 2 routers. Most cloud providers do not support it, so it is hard to use in a hybrid cloud.

3, the CNM and CNI camps

Container networking has so far split into two camps: Docker's CNM, and CNI, led by Google, CoreOS, and Kubernetes. To be clear, CNM and CNI are not network implementations. They are network specifications and network systems; from a development perspective they are a set of interfaces. Whether you use Flannel or Calico underneath does not matter to them. What CNM and CNI care about is network management.

CNM (Docker Libnetwork Container Network Model)

The advantage of Docker Libnetwork is that it is native and tightly tied to the Docker container life cycle. Its disadvantage is also that it is native, which means being "kidnapped" by Docker.

  • Docker Swarm overlay
  • Macvlan & IP network drivers
  • Calico
  • Contiv
  • Weave

CNI (Container Network Interface)

The advantages of CNI are that it is compatible with other container technologies (for example rkt) and with upper-layer orchestration systems (Kubernetes and Mesos), and that its community is active, with Kubernetes and CoreOS as its main backers. Its disadvantage is that it is not native to Docker.

  • Kubernetes
  • Weave
  • Macvlan
  • Calico
  • Flannel
  • Contiv
  • Mesos CNI

4, Flannel container network

Flannel can provide the underlying network that Kubernetes depends on, because it achieves the following two things:

  • It assigns the Docker containers on each Node IP addresses that do not conflict with one another;
  • It builds an overlay network between these IP addresses, through which packets are delivered intact to the target container.

Introduction to Flannel

  • Flannel is a network planning service that the CoreOS team designed for Kubernetes. Put simply, its function is to give the Docker containers created on different Nodes in the cluster virtual IP addresses that are unique across the whole cluster.
  • In the default Docker configuration, the Docker service on each Node is responsible for allocating IPs to that Node's containers. One problem this causes is that containers on different Nodes may be given the same IP address, whereas these containers need to be able to find one another by IP address, that is, to ping one another.
  • Flannel is designed to re-plan the IP address usage rules for all Nodes in the cluster, so that containers on different Nodes get non-duplicate IP addresses that belong to the same intranet, and so that containers on different Nodes can communicate directly over their intranet IPs.
  • Flannel is essentially an overlay network: TCP data is wrapped inside another network packet for routing, forwarding, and communication. It currently supports UDP, VXLAN, host-gw, aws-vpc, gce, and alloc as data forwarding modes. By default, data between Nodes is forwarded over UDP.
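The overlay idea, wrapping a Pod-to-Pod packet inside a Node-to-Node packet, can be modelled as follows. This is a conceptual sketch only, not Flannel's wire format or code; the `Packet` type, the `encapsulate` and `decapsulate` helpers, and all addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    payload: object  # raw bytes, or an inner Packet when encapsulated

def encapsulate(inner, local_node_ip, subnet_to_node):
    """Wrap a Pod-to-Pod packet in a Node-to-Node packet."""
    dst = ipaddress.ip_address(inner.dst)
    for subnet, node_ip in subnet_to_node.items():
        if dst in ipaddress.ip_network(subnet):
            # The outer header uses Node addresses; the inner packet is untouched.
            return Packet(src=local_node_ip, dst=node_ip, payload=inner)
    raise LookupError(f"no node owns {inner.dst}")

def decapsulate(outer):
    """Recover the original Pod-to-Pod packet on the receiving Node."""
    return outer.payload

# Hypothetical mapping of per-Node Pod subnets to Node IPs.
SUBNETS = {"10.1.1.0/24": "192.168.1.11", "10.1.2.0/24": "192.168.1.12"}

inner = Packet("10.1.1.2", "10.1.2.3", b"hello")
outer = encapsulate(inner, "192.168.1.11", SUBNETS)
print(outer.dst, decapsulate(outer) == inner)
```

The underlying network only ever sees Node addresses, which is why the inner packet can arrive at the target container unchanged.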


5, Calico container network

Introduction to Calico

  • Calico is a pure Layer 3 data center network solution that integrates seamlessly with IaaS cloud architectures such as OpenStack and provides controlled IP communication between VMs, containers, and bare metal. Calico does not use an overlay network, unlike Flannel and the Libnetwork overlay network driver. It is a pure Layer 3 approach that uses virtual routing instead of virtual switching, and each virtual router propagates reachability information (routes) to the rest of the data center via BGP.
  • Calico uses the Linux kernel on every compute node to implement an efficient vRouter that handles data forwarding. Each vRouter advertises the routes of the workloads running on it to the entire Calico network via BGP. Small deployments can interconnect directly, while large deployments can use designated BGP route reflectors.
  • Calico node networking can use the data center's network fabric directly (whether L2 or L3), with no extra NAT, tunnels, or overlay network.
  • Calico also provides rich and flexible network policy based on iptables, using ACLs on each node to provide multi-tenant isolation, security groups, and other reachability restrictions for workloads.

Calico Architecture:

Fifth, performance comparison of the open source network components

Performance comparison analysis:
Performance comparison summary:

The Calico BGP solution performs best; where BGP cannot be used, the Calico ipip tunnel solution is worth considering. If UDP offload can be enabled on CoreOS, Flannel is a good choice. Docker's native overlay still needs improvement in many areas.

Q & A

Q: How does a Pod of A connect to a Pod of B? What does kube-dns do? Does kube-dns call kube-proxy?

A: The A and B here presumably refer to Services. Communication between a Pod of Service A and a Pod of Service B can be achieved by defining the Service IP or Service name in the container's environment variables. Because the Service IP is not known in advance, kube-dns is introduced for service discovery: it watches for Service changes and updates DNS, so a Pod can look up a Service in DNS by its name. kube-proxy is a simple network proxy and load balancer whose main role is to implement Services, specifically access from Pods to Services inside the cluster and from NodePort to Services from outside. In short, kube-dns and kube-proxy both exist to serve Services.

Q: A network question: Docker's default is bridge mode (NAT). If routing mode is used, will the Pod's gateway be the docker0 IP? Will routing between Pod 1 and Pod 2 then produce a very large routing table? And does a Flannel network effectively place all Nodes on one distributed switch?

A: Docker can implement cross-host communication by bridging or by routing. Bridging attaches the docker0 bridge to the host's network card, while routing forwards directly through the host's network interface. A Kubernetes network has Pods and Services, and there are many ways to implement the Pod network; see the CNI network model. Flannel is essentially an overlay network: TCP data is wrapped inside another network packet for routing, forwarding, and communication.

Q: How do you ensure the security of a large-scale container cluster? What are the main aspects to consider?

A: The security of a large-scale container cluster can be considered in several areas: 1, cluster security, including cluster high availability; 2, access security, including authentication, authorization, and access control; 3, an area that includes multi-tenancy; 4, network security, including network isolation and traffic control; 5, image security, including container vulnerabilities; 6, container security, including port exposure and privileged mode.

Q: How can a Service split traffic by client, so that network segment A reaches Pod1, network segment B reaches Pod2, and network segment C reaches Pod3, when all three Pods are in the Service's Endpoints?

A: Access from inside a Pod to a Service is implemented by kube-proxy (a simple network proxy and load balancer). kube-proxy uses round-robin by default, or client-IP-based session affinity if service.spec.sessionAffinity is set to "ClientIP" (the default is "None"). Routing by network segment cannot currently be specified.

Q: With Ingress + HAProxy, the Ingress controller polls the state of the Pods behind a Service, regenerates the HAProxy configuration file, and then restarts HAProxy to achieve service discovery. Doesn't this cause a brief interruption of the HAProxy service? Is there a good alternative? I have seen Træfik, implemented in Golang, which integrates seamlessly with Kubernetes and does not need an Ingress. How is that as a solution?

A: Microservice architecture, Docker, and orchestration tools such as Kubernetes only became popular in recent years, so reverse proxy servers such as Nginx / HAProxy did not support them at first; after all, they could not foresee the future. That is why the Ingress Controller exists: it bridges Kubernetes and front-end load balancers such as Nginx / HAProxy. It has to interact with Kubernetes, write the Nginx / HAProxy configuration, and reload it, which is a compromise. The more recent Traefik was built with Kubernetes support from the start: Traefik itself interacts with the Kubernetes API and detects back-end changes, so no Ingress Controller is needed when using Traefik. This solution is certainly feasible.

Q: 1, Are the multiple containers inside a Pod the same service, or do they make up different services? What is the allocation logic? 2, Does Flannel make the IPs of the many Services across multiple hosts, and of the containers inside Pods, unique? 3, Kubernetes has load balancing built in. Does that mean Nginx need not be considered?

A: A Pod is the basic operating unit of Kubernetes and contains one or more related containers. A Pod can be thought of as an extension of a container: a Pod is an isolation unit, and the group of containers inside it share namespaces (including PID, Network, IPC, and UTS). A Service is a routing proxy abstraction over Pods and solves the problem of service discovery between Pods. Flannel is designed to re-plan the IP address usage rules for all Nodes in the cluster, so that containers on different Nodes get non-duplicate IP addresses that belong to the same intranet and can communicate directly over their intranet IPs. kube-proxy in Kubernetes implements internal L4 round-robin load balancing. To support both L4 and L7 load balancing, Kubernetes also provides the Ingress component: a reverse proxy load balancer (Nginx / HAProxy) + Ingress Controller + Ingress can expose services externally. Using Traefik for Service load balancing is also a good choice.

Q: How does kube-proxy do load balancing? Where does the Service virtual IP live?

A: kube-proxy has two modes for load balancing. One is userspace, in which iptables redirects traffic to the corresponding kube-proxy port and kube-proxy then sends the data to one of the Pods. The other is iptables, in which load balancing is implemented purely with iptables. kube-proxy uses round-robin by default; you can also set service.spec.sessionAffinity to "ClientIP" (the default is "None") to select client-IP-based session affinity. The Service Cluster IP is a virtual IP: kube-proxy uses iptables rules to redirect it to a local port and then balances to the back-end Pods. Its range is set with the apiserver startup parameter --service-cluster-ip-range and is maintained internally by the Kubernetes cluster.

Q: The Kubernetes network is complex. How should remote debugging be done, and what hidden risks does port mapping bring?

A: Kubernetes networking follows the CNI specification and uses network plugins, which is very flexible; the debugging method differs from one network plugin to another. The biggest hidden risk of port mapping is that it can easily cause port conflicts.

Q: For RPC service registration, the local IP is registered with the registry. In a container this registers the internal virtual IP, which cannot be called from outside the cluster. Is there a good solution?

A: In Kubernetes, Service-to-Pod communication is distributed by kube-proxy, and containers inside a Pod communicate through ports. Communication between different services can go through DNS; there is no need to use the virtual IP.

Q: I currently use CoreOS as the base, so the network is Flannel, but the upper layer uses Calico for Network Policy. Recently a project called Canal appeared with a similar structure. Could you introduce it and, if possible, explain in detail the principles of CNI and Calico's policy implementation?

A: I am not very familiar with Canal. CNI is not a network implementation; it is a network specification and network system. From a development perspective it is a set of interfaces that is concerned with network management. A CNI implementation depends on two kinds of plugin: the CNI Plugin, which is responsible for connecting and disconnecting the container to and from the host's vbridge / vswitch, and the IPAM Plugin, which is responsible for configuring the network parameters in the container's namespace. Calico's policy is based on iptables and uses ACLs on each node to provide multi-tenant isolation, security groups, and other reachability restrictions for workloads.

Q: How does CNI manage the network? Or how does it cooperate with network solutions?

A: CNI is not a network implementation; it is a network specification and network system. From a development perspective it is a set of interfaces. Whether you use Flannel or Calico underneath does not matter to it; what it cares about is network management. A CNI implementation depends on two kinds of plugin: the CNI Plugin, which is responsible for connecting and disconnecting the container to and from the host's vbridge / vswitch, and the IPAM Plugin, which is responsible for configuring the network parameters in the container's namespace.

Q: Is a Service a physical component? What parts does a Service configuration file have, and which components implement it?

A: A Service is a basic operating unit of Kubernetes and an abstraction of a real application service. The Service IP range is specified with the --service-cluster-ip-range parameter when configuring the kube-apiserver service and is maintained by the Kubernetes cluster itself.

The above content is compiled from the WeChat group sharing session on the evening of May 18, 2017. The speaker was Yang Yunsheng, product manager at there Yun Yun, who works on the container platform (Rancher / Kubernetes) and its related storage, network, security, logging, and monitoring solutions, and whose background covers cloud computing technologies such as systems, storage, networking, virtualization, and containers. DockOne organizes a targeted technical sharing session every week and welcomes interested readers, who can contact liyoujiesz to join and to suggest topics they want to hear or share.
