Practical experience deploying Prometheus in a CaaS environment

[Editor's Note] A monitoring system plays a vital role for the operations team of a cloud platform. The arrival of Docker has had a major impact on the whole ecosystem. How to monitor short-lived Docker containers is the main problem this article addresses.

This article follows my previous one, "Practical experience deploying the ELK stack in a CaaS environment", and continues my evaluation of the CaaS platform. Having finished with logging (the previous article was only a PoC, still a long way from a log management system that could go into production), I turned to the monitoring system.

I think everyone understands how important a monitoring system is to an operations team. In my view, an easy-to-use monitoring system has the following characteristics:

  1. A unified management interface with a clear dashboard – no one wants to open 1000 browser pages to monitor 1000 services
  2. Quickly locating the problem service – this includes integration with the alerting system and fast filtering down to the machines or services of interest
  3. Customizable views of the metrics you care about
  4. Automatically adapting as the set of monitoring targets grows, without manually adding each new machine or service

The monitoring system we use now (xymon) is server-oriented. It serves long-lived physical or virtual machines well, but it runs into many problems with short-lived containers, so we need to move to a service-oriented monitoring system. Given the requirements above, I decided to use Prometheus.

At present, most monitoring systems use a push model: a monitoring agent collects data and pushes it to a centralized monitoring server. In contrast, Prometheus uses a pull (polling) model: the Prometheus server periodically polls each monitoring target and fetches its data through an HTTP interface that the target exposes. Since I have not yet tested Prometheus in a real cluster, I cannot provide detailed data on the pros and cons of the two approaches.
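
The polling approach described above can be illustrated with a small sketch that uses only the Python standard library. It is not how Prometheus itself is implemented: the metric name `demo_requests_total`, the handler, and the `scrape` helper are all hypothetical, and a real Prometheus server would repeat the poll on a fixed interval.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """Target side: serve the current metric values as plain text over HTTP."""

    def do_GET(self):
        # One "name value" line per metric (hypothetical demo metric).
        body = b"demo_requests_total 42\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def start_target(port=0):
    """Start the target on a free port (port 0) and return the server object."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(url):
    """Server side: poll a target once and return {metric_name: value}."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples
```

The point of the design is that the target only has to answer HTTP requests; which targets exist and how often they are polled is decided entirely on the server side.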

Having chosen the tool, I started deploying Prometheus to my test environment. Some readers may be ready to close this page at this point, reasoning as follows: Prometheus is already containerized, a container deploys the same way on any platform, and that is exactly the advantage of an official image, so nothing below should be new. I thought the same before I started, so I gave myself half an hour for the deployment. But once I actually began, I ran into the following problems:

  1. Prometheus needs to communicate with the docker daemon over a unix domain socket to list all the docker containers that are currently running. This cannot be done on a public cloud CaaS platform, because the docker daemon on one host may serve different customers, and one customer's containers may be spread across multiple hosts.
  2. Prometheus needs to read the metrics of each container through cgroups. This cannot be done on a public cloud CaaS either, for the same reason as above.

So for a public cloud CaaS, I need a custom Prometheus exporter that obtains 1) all of the user's docker containers and 2) the metrics of each container. Alauda provides APIs to list all services of the current user, every instance of each service, and the metrics of each instance. Even more conveniently, it offers an open-source Python API. In this experiment I call the alaudacli and prometheus_client libraries directly to implement my own exporter. The full code is available on GitHub.

When deploying the service, you need to set the environment variables ALAUDA_USERNAME and ALAUDA_PASSWORD for authentication.

    def alauda_login(username, password, cloud='cn', endpoint='https://'):
        alaudacli.commands.login(username, password, cloud, endpoint)
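
As a minimal sketch of how the two environment variables might be read before logging in: the helper name `read_alauda_credentials` is hypothetical and not part of alaudacli or of the author's exporter.

```python
import os


def read_alauda_credentials(environ=os.environ):
    """Return (username, password) from the environment, or raise if unset."""
    missing = [name for name in ("ALAUDA_USERNAME", "ALAUDA_PASSWORD")
               if not environ.get(name)]
    if missing:
        # Fail at startup rather than on the first API call.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return environ["ALAUDA_USERNAME"], environ["ALAUDA_PASSWORD"]
```

The returned pair could then be passed to `alauda_login(username, password)`.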

After a successful login, you can fetch all services of the current user. Here, namespace is the user name.

    def alauda_service_list(namespace):
        service_list = alaudacli.service.Service.list(namespace, 1)
        return service_list

Each service can have multiple load-balanced instances, and metrics are collected per instance, so the next step is to get the instances of each service.

    import json

    def alauda_instance_list(namespace, name):
        service_inst = alaudacli.service.Service.fetch(name, namespace)
        instances = service_inst.list_instances()
        instance_list = []
        for data in instances:
            # Each instance carries its details as a JSON string.
            instance = json.loads(data.details)
            instance_list.append(instance)
        return instance_list

The last step is to get the statistics for each instance. This API is not implemented in alaudacli, so I call the Alauda open API directly.

    import json
    import requests

    def alauda_get_instance_metrics(namespace, name, instance_uuid, start_time, end_time, interval):
        service_inst = alaudacli.service.Service.fetch(name, namespace)
        # Reuse the endpoint and auth headers that alaudacli already holds.
        url = service_inst.api_endpoint + 'services/{0}/{1}/instances/{2}/metrics?start_time={3}&end_time={4}&point_per_period={5}'.format(
            service_inst.namespace, service_inst.name, instance_uuid, start_time, end_time, interval)
        r = requests.get(url, headers=service_inst.headers)
        if r.text:
            data = json.loads(r.text)
            return data
        return None
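
The metrics URL above is assembled by plain string formatting. A slightly more defensive variant, sketched here with the standard library only, escapes the path segments and lets `urlencode` build the query string. The helper name `build_metrics_url` is hypothetical, and the example host and parameter values in the usage note are invented; the path and parameter names are the ones used in the article's code.

```python
from urllib.parse import quote, urlencode


def build_metrics_url(api_endpoint, namespace, name, instance_uuid,
                      start_time, end_time, interval):
    """Build the instance-metrics URL with escaped path parts and query string."""
    path = "services/{0}/{1}/instances/{2}/metrics".format(
        quote(namespace, safe=""), quote(name, safe=""), quote(instance_uuid, safe=""))
    query = urlencode([
        ("start_time", start_time),
        ("end_time", end_time),
        ("point_per_period", interval),
    ])
    # Assumption: api_endpoint already ends with "/", as the article's
    # string concatenation implies.
    return api_endpoint + path + "?" + query
```

For example, `build_metrics_url("https://api.example.invalid/v1/", "alice", "kibana", "abc-123", 1000, 1600, "1m")` yields a URL ending in `metrics?start_time=1000&end_time=1600&point_per_period=1m`.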

Once you have the data of interest, you can expose it to the Prometheus server for querying (being lazy, I only export CPU and memory data here :)). Note that the attributes of each monitored object are set as labels, so that Prometheus can filter and group by them.

    from prometheus_client import Gauge

    g_cpu_usage = Gauge('cpu_cumulative_usage', 'CPU Cumulative Usage', ['service', 'instance'])
    g_cpu_utilization = Gauge('cpu_utilization', 'CPU utilization', ['service', 'instance'])
    g_memory_usage = Gauge('memory_usage', 'Memory Usage', ['service', 'instance'])
    g_memory_utilization = Gauge('memory_utilization', 'Memory Utilization', ['service', 'instance'])
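
As a rough sketch of how such a gauge might be filled in per service and instance: the helper `record_cpu_utilization`, the instance names, and the values are made up for illustration. The sketch uses its own `CollectorRegistry` so that it does not clash with the gauges defined above.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

# A private registry keeps this sketch separate from the process-wide default.
registry = CollectorRegistry()
g_cpu_utilization = Gauge("cpu_utilization", "CPU utilization",
                          ["service", "instance"], registry=registry)


def record_cpu_utilization(service_name, instance_uuid, value):
    """Set the gauge for one (service, instance) label combination."""
    g_cpu_utilization.labels(service=service_name, instance=instance_uuid).set(value)


# Hypothetical values for two instances of the same service.
record_cpu_utilization("kibana", "instance-1", 0.25)
record_cpu_utilization("kibana", "instance-2", 0.5)

# Text exposition that a Prometheus server would receive when polling.
exposition = generate_latest(registry).decode("utf-8")
```

In a long-running exporter, `prometheus_client.start_http_server(port)` would typically serve the default registry instead. One thing to be aware of: Prometheus attaches its own `instance` label to every scraped target, so a label of the same name coming from an exporter normally appears as `exported_instance` unless `honor_labels` is enabled for that job.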

Once the serviceexporter service is deployed, you can view the current statistics of all running services in a browser.
Finally, the Prometheus service itself needs to be deployed. Its image is a modified version of the official prom/prometheus image. Because the Prometheus server periodically polls the API of the monitored object, it needs to know that object's IP address and port. The modified image can obtain the IP address and port of the serviceexporter service through a service link. (Being lazy, I did not use the link here and hard-coded the IP address and port of the serviceexporter service instead.)
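
One way the link-based lookup might work is sketched below. It assumes the link injects Docker-style link environment variables (`<ALIAS>_PORT_<port>_TCP_ADDR` and `<ALIAS>_PORT_<port>_TCP_PORT`), that the exporter listens on port 8000, and that the job is called `serviceexporter`. None of this is confirmed by the article, and the config keys follow current Prometheus releases, which may differ from the version used in the experiment.

```python
import os

# Minimal scrape configuration with a single static target.
CONFIG_TEMPLATE = """\
scrape_configs:
  - job_name: serviceexporter
    static_configs:
      - targets: ['{addr}:{port}']
"""


def render_scrape_config(environ=os.environ, alias="SERVICEEXPORTER", port="8000"):
    """Render a scrape config from Docker-link style environment variables."""
    prefix = "{0}_PORT_{1}_TCP".format(alias, port)
    addr = environ[prefix + "_ADDR"]          # KeyError if the link is absent
    target_port = environ[prefix + "_PORT"]
    return CONFIG_TEMPLATE.format(addr=addr, port=target_port)
```

An entrypoint script in the image could write the returned text to the Prometheus configuration file before starting the server, which would avoid hard-coding the address.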

In the Prometheus server's UI you can view graphs of the statistics for each service.
You can set a filter to see only the statistics of the kibana service.
Multiple kibana instances can be aggregated with the sum function.
With this deployment, I can monitor all the services running on Alauda in one unified UI and set threshold alerts for them (not covered in this experiment). I can also customize the dashboards I am interested in and quickly locate a problem service.
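
The filtering and aggregation mentioned above could also be done outside the UI through the Prometheus HTTP API. The sketch below only builds the request URLs. The query expressions are plausible examples based on the metric and label names from the exporter code, not the exact queries used in the experiment, and the server address in the usage note is an assumption.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, expression):
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expression})


# Filter: only series whose "service" label equals kibana.
FILTER_QUERY = 'cpu_utilization{service="kibana"}'
# Aggregate: add up the filtered series across all kibana instances.
SUM_QUERY = 'sum(cpu_utilization{service="kibana"})'
```

For example, `instant_query_url("http://localhost:9090", SUM_QUERY)` gives a URL that can be fetched with any HTTP client, and the same expressions can be typed into the expression field of the UI.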

This experiment is only a PoC and is still some distance from a real production environment. Suggestions and alternative solutions are welcome.

About the author: Duhang, development manager of the cloud infrastructure group at Websense, focused on OpenStack and Docker.
