How does Netflix manage microservices on tens of thousands of machines?

[Editor's Note] This article introduces Eureka, Hystrix, and Ribbon, three components contributed by Netflix OSS. Its purpose is to look at the problems Netflix encountered while scaling out and to analyze its solutions, so that readers have some ideas to draw on when they face similar problems. Ideas and perspective are sometimes more important than the tool itself.

Any ordinary service, if run without special treatment on a Netflix-scale cluster (tens of thousands of machines), will hit a variety of problems. Take a movie recommendation service as an example.

In the traditional approach, you use a fixed DNS name for service resolution and a fixed list of IPs in the load balancer. Service registration and discovery are written into configuration files. Once a service goes down, every service that depends on it is affected, and the only remedy is to bring up a new server, change the configuration files on the other machines, and restart the dependent services.

In a small cluster this approach may be tolerable, but in a cluster of tens of thousands of servers running more than 500 kinds of services, the situation becomes very complicated. Through years of practice, Netflix has contributed many open source projects, such as Eureka, Hystrix, Feign, and Ribbon, to solve the problem of managing services in large-scale clusters.

Use Eureka as a service discovery tool


What is Eureka?

Eureka is Netflix's open source tool for middle-tier load balancing and service discovery. It is implemented in Java, and in a Spring application it is easy to declare a Eureka Server and Client for service registration.

Problems Eureka solves

The Eureka server is a service registry that improves the fault tolerance and availability of service discovery in large-scale cluster environments. It can also handle service registration and discovery across data centers.

Netflix recommends building one Eureka cluster per region, with at least one Eureka Server in each availability zone. This ensures that registration information from any zone is replicated to every other zone, keeping service information highly available. A client can read service registration information from any zone. After a client contacts the server, it caches the service information locally and refreshes it periodically (every 30 seconds).
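
The client-side caching described above can be sketched in a few lines of plain Python. This is not the Eureka client API: the class name `CachingRegistryClient`, the `fetch_registry` callback, and the choice to keep serving the last known copy when a refresh fails are all assumptions made for illustration. Only the 30-second refresh interval comes from the text.

```python
import time
from typing import Callable, Dict, List, Optional


class CachingRegistryClient:
    """Hypothetical client that caches registry data and refreshes it periodically."""

    def __init__(self,
                 fetch_registry: Callable[[], Dict[str, List[str]]],
                 refresh_interval: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_registry          # assumed callback that asks the registry server
        self._interval = refresh_interval     # 30 seconds, as stated in the article
        self._clock = clock                   # injectable so the sketch can be tested
        self._cache: Dict[str, List[str]] = {}
        self._last_refresh: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._last_refresh is not None and now - self._last_refresh < self._interval:
            return  # cached copy is still fresh
        try:
            self._cache = self._fetch()
            self._last_refresh = now
        except Exception:
            # Assumption: if the registry is unreachable, keep the last known copy
            # and try again on the next lookup.
            pass

    def instances(self, service: str) -> List[str]:
        """Return the cached addresses for a service (empty list if unknown)."""
        self._refresh_if_stale()
        return list(self._cache.get(service, []))
```

A caller would construct the client once and call `instances("recommendation")` before each request; most lookups are then answered from local memory without touching the registry server.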

Eureka enters self-preservation mode if a large-scale network failure occurs in the cluster (for example, a switch failure that cuts off communication between subnets). Each Eureka node keeps serving (note: ZooKeeper does not): it accepts new service registrations and answers downstream discovery requests. As a result, within the same subnet (the same side of the partition), newly published services can still be discovered and accessed.
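
One way to picture self-preservation is a registry that stops expiring instances when too many heartbeats go missing at once, on the theory that the network is more likely at fault than the services. The following is a simplified sketch of that idea, not Eureka's actual implementation; the `Registry` class, the 90-second lease, and the 0.85 threshold are illustrative assumptions that do not come from the article.

```python
from typing import Dict, List


class Registry:
    """Hypothetical registry with a simplified self-preservation rule."""

    def __init__(self, lease_seconds: float = 90.0, renewal_threshold: float = 0.85):
        self.lease_seconds = lease_seconds          # assumed lease duration
        self.renewal_threshold = renewal_threshold  # assumed minimum share of live instances
        self.last_heartbeat: Dict[str, float] = {}

    def register(self, instance: str, now: float) -> None:
        """Record a new instance; registration is always accepted."""
        self.last_heartbeat[instance] = now

    # In this sketch a heartbeat simply renews the lease, same as registering.
    heartbeat = register

    def evict_expired(self, now: float) -> List[str]:
        """Evict instances with expired leases unless self-preservation applies."""
        expired = [name for name, seen in self.last_heartbeat.items()
                   if now - seen > self.lease_seconds]
        total = len(self.last_heartbeat)
        alive_ratio = (total - len(expired)) / total if total else 1.0
        if alive_ratio < self.renewal_threshold:
            # Too many instances went silent at once: suspect a network
            # partition and keep every entry instead of evicting.
            return []
        for name in expired:
            del self.last_heartbeat[name]
        return expired
```

With ten instances, losing one still leads to a normal eviction; losing more than half at once leaves the registry untouched while it continues to accept new registrations.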

In Eureka 1.0, data synchronization between Eureka servers is a full synchronization, and every client holds information about all services in the Eureka cluster. Version 2.0 supports synchronizing only the service information a client is interested in, and also improves Eureka's read/write separation and high availability.

With Eureka, how does Netflix do red/black deployment?

Netflix releases using red/black deployment. If monitoring detects a problem with a newly deployed service, rolling it back the traditional way takes 5 to 15 minutes. Netflix uses Eureka to take services out of service and bring them back dynamically.

Services come in two kinds: REST services and non-REST services. If the service being taken offline is a REST service, the situation is relatively simple: through Eureka it can be taken out of service and brought back in real time.

If the service is not a REST service, for example one that runs batch tasks or fast transactions, you cannot simply mark it as out of service. Using the EventListener mechanism provided by Spring, Eureka can deliver an EurekaStatusChangeEvent, and developers can do the corresponding shutdown work in the listener for that event.

When Netflix performs a red/black release, it takes part of the services out of service. If those services run batch tasks, the event listener is used to stop them.
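
The listener pattern can be illustrated without Spring. In this minimal sketch, `StatusPublisher`, `StatusChangeEvent`, and `BatchWorker` are hypothetical names standing in for the registry client, the status change event, and a non-REST service. The status strings `UP` and `OUT_OF_SERVICE` are used here only as labels.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StatusChangeEvent:
    """Hypothetical stand-in for a status change notification."""
    previous: str
    current: str


class StatusPublisher:
    """Notifies subscribed listeners whenever the instance status changes."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[StatusChangeEvent], None]] = []
        self._status = "UP"

    def subscribe(self, listener: Callable[[StatusChangeEvent], None]) -> None:
        self._listeners.append(listener)

    def set_status(self, new_status: str) -> None:
        if new_status == self._status:
            return  # no change, nothing to publish
        event = StatusChangeEvent(self._status, new_status)
        self._status = new_status
        for listener in self._listeners:
            listener(event)


class BatchWorker:
    """A non-REST service that must stop taking batch jobs when out of service."""

    def __init__(self) -> None:
        self.accepting = True
        self.processed: List[str] = []

    def on_status_change(self, event: StatusChangeEvent) -> None:
        # Only accept new batch jobs while the instance is marked UP.
        self.accepting = (event.current == "UP")

    def submit(self, job: str) -> bool:
        if not self.accepting:
            return False
        self.processed.append(job)
        return True
```

During a red/black release, the deployment tooling would flip the old instances to `OUT_OF_SERVICE`; the worker then refuses new jobs until the status returns to `UP`.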

Why does Netflix not use ZooKeeper for service discovery?

First, with a few thousand nodes, failures are infrequent enough that ZooKeeper may be able to cope. At tens of thousands of nodes, however, ZooKeeper does not perform as well as Eureka, because in a cluster of that size failures are always happening somewhere, and each one carries the cost of a re-election. Eureka instead adopts the AP strategy from the CAP theorem and settles for eventual consistency.

Second, Eureka provides REST endpoints for service registration, which solves the problem of registering non-Java services.

Hystrix for service degradation

Hystrix is a Netflix open source component that handles timeouts and errors in calls between services, preventing problems from spreading and avoiding avalanche (cascading) failures. It degrades the service without the user noticing.
For example, when you are producing movie recommendations for a user and the service call has not returned for some reason (perhaps a dependency on the User service), Hystrix can apply a variety of strategies to decide whether the service is healthy. Suppose Hystrix is configured with a timeout: if the service call takes longer than that to return, Hystrix trips the circuit breaker, suspends the service call, and returns a generic list of movies as the recommendation, rather than leaving the user waiting indefinitely. This improves the user experience.
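
The timeout-plus-fallback part of this behaviour can be sketched with Python's standard library. This is not Hystrix and it does not trip a breaker; `call_with_fallback`, `slow_recommendations`, and `generic_movies` are hypothetical names, and the timings are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, List

# Shared worker pool on which the remote calls are executed.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(primary: Callable[[], List[str]],
                       fallback: Callable[[], List[str]],
                       timeout_seconds: float) -> List[str]:
    """Run primary; if it is too slow or raises, return the fallback result."""
    future = _executor.submit(primary)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only prevents a call that has not started yet;
        # a call that is already running continues in the background.
        future.cancel()
        return fallback()
    except Exception:
        return fallback()


def slow_recommendations() -> List[str]:
    """Simulates a recommendation service that responds too slowly."""
    time.sleep(0.3)
    return ["personalized pick"]


def generic_movies() -> List[str]:
    """Generic list returned when personalization is unavailable."""
    return ["popular movie 1", "popular movie 2"]
```

Calling `call_with_fallback(slow_recommendations, generic_movies, 0.05)` returns the generic list because the simulated service needs longer than the allowed 50 milliseconds.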

Of course, a timeout is only one of the conditions Hystrix can use to decide whether to trip the breaker. You can configure several conditions for Hystrix to judge whether a service call is healthy; for example, the service's corePoolSize, maximumPoolSize, and keepAliveTime can all be part of Hystrix's circuit-breaking strategy.
Hystrix provides a Circuit Breaker to monitor the health of a service. The Circuit Breaker solves the following problems:

  1. Checks the status of the service.
  2. Supports isolation of threads and resource access. When concurrent access to a service is particularly heavy (hundreds of connections per second), the Circuit Breaker isolates threads or restricts access to resources to keep the service available.
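
Restricting access to a resource, as in the second point, can be approximated with a counting semaphore that rejects calls once a concurrency limit is reached. The `Bulkhead` class below is a hypothetical sketch of that idea, not Hystrix code, and the limit is chosen by the caller.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Hypothetical limiter: at most max_concurrent calls run at the same time."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Do not queue: if no slot is free, degrade immediately.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return func()
        finally:
            self._slots.release()
```

Because rejected calls fail fast instead of waiting, a slow dependency can exhaust only its own slots and not every thread in the calling service.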

The following is the circuit breaker's open/close decision process:

  1. The error rate of service calls rises above the preset error rate.
  2. The circuit breaker's state changes from CLOSED to OPEN.
  3. While the circuit breaker is OPEN, all incoming requests are blocked.
  4. After a while, a single request is let through (HALF-OPEN). If the service call still fails, the circuit breaker returns to the OPEN state. If the request succeeds, the circuit breaker becomes CLOSED and the process starts again from step 1.
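
The four steps above map onto a small state machine. The sketch below is a single-threaded illustration, not Hystrix's implementation; the parameter names `error_threshold`, `min_calls`, and `reset_timeout` and their default values are assumptions chosen for the example.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Hypothetical breaker following the CLOSED -> OPEN -> HALF_OPEN cycle."""

    def __init__(self,
                 error_threshold: float = 0.5,   # assumed error rate that trips the breaker
                 min_calls: int = 4,             # assumed minimum sample size
                 reset_timeout: float = 5.0,     # assumed seconds before a trial request
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.min_calls = min_calls
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "CLOSED"
        self._successes = 0
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # step 4: let one trial request in
                return True
            return False                   # step 3: block while OPEN
        return False                       # HALF_OPEN: trial already in flight

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            # Trial succeeded: close the breaker and start counting afresh.
            self.state = "CLOSED"
            self._successes = 0
            self._failures = 0
        else:
            self._successes += 1

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._trip()                   # trial failed: back to OPEN
            return
        self._failures += 1
        total = self._successes + self._failures
        # Steps 1 and 2: trip once the error rate exceeds the threshold.
        if total >= self.min_calls and self._failures / total > self.error_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
```

A caller checks `allow_request()` before each call, returns a fallback when it is `False`, and reports the outcome with `record_success()` or `record_failure()`.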

At present, Hystrix is being optimized for handling cross-service transactions.

Ribbon for load balancing

Ribbon is a Netflix OSS component that handles software load balancing for RPC calls. In addition to traditional load balancing, it can solve the following problems:

  1. When nine servers in the cluster provide the same service and three of them are responding abnormally, Ribbon can temporarily remove those three servers from load balancing until they respond normally again.
  2. Servers can be weighted by response time, so that more traffic goes to the nodes that respond fastest.
  3. Several load balancing strategies can be applied at the same time, so that load balancing can be tuned for the best result.
  4. The retry mechanism can be customized.
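
The first two points can be combined in a short sketch: servers marked as down are skipped, and the rest are chosen with a probability inversely proportional to their average response time. This is not Ribbon's API; `WeightedBalancer`, the initial 100 ms average, and the smoothing factor `alpha` are assumptions for illustration.

```python
import random
from typing import Dict, List, Optional, Set


class WeightedBalancer:
    """Hypothetical balancer: skips down servers, favours fast responders."""

    def __init__(self, servers: List[str], rng: Optional[random.Random] = None) -> None:
        # Every server starts with the same assumed average (100 ms).
        self._avg_ms: Dict[str, float] = {s: 100.0 for s in servers}
        self._out: Set[str] = set()
        self._rng = rng or random.Random()

    def record_response(self, server: str, millis: float, alpha: float = 0.3) -> None:
        """Update the exponentially weighted average response time."""
        self._avg_ms[server] = (1 - alpha) * self._avg_ms[server] + alpha * millis

    def mark_down(self, server: str) -> None:
        self._out.add(server)

    def mark_up(self, server: str) -> None:
        self._out.discard(server)

    def weights(self) -> Dict[str, float]:
        """Weight of each in-rotation server: the inverse of its average latency."""
        return {s: 1.0 / avg for s, avg in self._avg_ms.items() if s not in self._out}

    def choose(self) -> Optional[str]:
        """Pick a server at random according to the weights, or None if none is up."""
        w = self.weights()
        if not w:
            return None
        servers = list(w)
        return self._rng.choices(servers, weights=[w[s] for s in servers], k=1)[0]
```

A health check or a circuit breaker would call `mark_down` and `mark_up`, while each completed request feeds `record_response`, so traffic shifts gradually toward the faster nodes.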

Although the Ribbon project is in maintenance mode, the ideas behind its implementation are still worth learning from.

Summary

This article introduced Eureka, Hystrix, and Ribbon, which are contributed by Netflix OSS. Due to space limitations, other components will be introduced in a follow-up article. These open source components integrate well with Spring Boot / Spring Cloud and address common problems encountered in managing services at scale.

The purpose of this article is to look at the problems Netflix encountered in scaling out and to analyze its solutions, providing ideas for problems readers may encounter in the future. Ideas and perspective are sometimes more important than the tool itself.



This article is reproduced from the public account JFrog Roger DevOps. Original article: "How does Netflix manage microservices on tens of thousands of machines?" (Author: Wang Qing)

Author: Wang Qing, chief architect at JFrog China. He previously worked on R&D architecture at IBM, iQIYI, Sina, and VIPKID, and currently focuses on putting DevOps and microservices into practice.
