SRE Engineering Practice: Alerting Based on Time-Series Data

[Editor's note] Operations monitoring and fault alerting are two essential parts of building an intelligent operations platform. This share introduces an engineering practice of alerting based on time-series data storage, adopted after introducing the SRE philosophy.

Introduction to SRE alerting

The theme of today's share is SRE alerting practice based on time-series data.

First, let me briefly explain what time-series data is.

Time-series data is a series of ordered data points, usually sampled at equal intervals. The simplest definition of time-series storage is data in a format that contains a timestamp field. When querying time-series data, a time range is always supplied to filter the data, and the query results always contain the timestamp field.
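As a minimal illustration of that definition (plain Python; the class and function names are my own, not from any library), a time series can be modeled as timestamped samples, and every query carries a time-range filter:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float  # Unix time of the observation
    value: float      # the measured value

# A series sampled at equal 60-second intervals.
series = [Sample(1609459200 + 60 * i, float(i)) for i in range(10)]

def range_query(samples, start, end):
    """Time-series queries always filter on a time range;
    the results always carry the timestamp field."""
    return [s for s in samples if start <= s.timestamp <= end]

# Query a 2-minute window: three samples fall inside it.
window = range_query(series, 1609459200, 1609459200 + 120)
```

Real time-series stores add heavy compression and indexing on top of exactly this shape of data.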

Monitoring data largely exhibits the characteristics of time-series data: to cope with complex monitoring data formats, a time field is attached to every data point. Unlike traditional relational databases, time-series databases are specially optimized for storing, querying, and displaying such data, which yields very high data compression ratios and excellent query performance, particularly valuable for Internet-scale applications that must handle massive volumes of time-series data.

Google's monitoring systems, over ten years of development, evolved from the traditional probe model, through a graphical-trend model, to the current model based on time-series data. In this model, collecting time-series data is the primary task of the monitoring system. Google also developed a language for operating on time series; with it, the data is turned into charts and alerts, replacing the earlier probe scripts.

Monitoring and alerting are two inseparable parts. Our company's CTO, Xiao De, previously gave a share on monitoring practice based on time-series data; this share will not repeat the monitoring part, and interested readers can refer to his article.

Through the monitoring system, the operations team understands the runtime state of application services so as to ensure service availability and stability. The monitoring system also provides metrics data for dashboard displays. But although the various line charts are interesting to look at, the most valuable moment of a monitoring system is when a service behaves abnormally or a metric exceeds its threshold: the operations team receives an alert message, intervenes in time, and restores the service to its normal state.

The SRE team believes that a monitoring system should not rely on humans to analyze alert information; the analysis should be done automatically by the system, and the alerts it issues must be actionable, with the goal of solving a problem that has already occurred or avoiding one about to occur.

Monitoring and alerting

Monitoring and alerting let the system notify us when a failure occurs or is imminent. When the system cannot automatically repair a problem, a person needs to investigate the alert, determine whether there is a real fault, mitigate it in some way, analyze the symptoms, and ultimately find the root cause. The monitoring system should therefore provide fault information from both aspects: symptoms and causes.

Black-box monitoring and white-box monitoring

Black-box monitoring: monitoring by testing some externally visible, user-facing behavior of the system. It is symptom-oriented: it reports problems that are already happening and sends urgent alerts to staff. For problems that have not yet occurred but are imminent, black-box monitoring can do nothing.

White-box monitoring relies on performance indicators exposed by the system's internals, including log analysis, monitoring interfaces provided by the Java virtual machine, or HTTP endpoints that list internal statistics. By analyzing metric values drawn from inside the system, white-box monitoring can detect problems before they surface. White-box monitoring is sometimes symptom-oriented and sometimes cause-oriented, depending on the information it provides.

Google's SRE relies heavily on white-box monitoring.

Principles for setting up alerts

In general, we should not send an alert merely because "something looks a little off".

Handling an emergency alert takes up an employee's valuable time. If the employee is at work, handling the alert interrupts their normal workflow; if at home, it intrudes on their personal life. Frequent alerts produce a "cry wolf" effect: employees begin to doubt the validity of the alerts and ignore them, and may even miss a real failure.

Principles for setting alert rules:

  • An alert that is issued must be real, urgent, important, and actionable.
  • Alert rules should surface problems your service has, or is about to have.
  • Classify problems clearly: is basic functionality available; what is the response time; is the data correct; and so on.
  • Alert on symptoms, providing as much detail and context as possible; do not alert directly on causes.

Effective alerting based on time-series data

In traditional monitoring, scripts run on servers; their return values are stored for graphical display and checked to decide whether to alert. Google instead uses Borgmon as its monitoring and alerting platform.

Outside Google, we can use Prometheus as a monitoring and alerting tool based on time-series data, implementing the white-box monitoring concept advocated by SRE.

Architecture diagram of the monitoring and alerting platform:

Monitoring and alerting components

  • cAdvisor gives users insight into the resource usage and performance characteristics of running containers. It runs as a background daemon that collects, aggregates, processes, and exports information about running containers.
  • Prometheus is an open-source systems monitoring and alerting toolkit originally developed at SoundCloud. Prometheus collects container runtime information from cAdvisor's HTTP interface and stores it internally; PromQL is used to query and display the time-series data and to define alerts. Alert information is pushed to Alertmanager.
  • Alertmanager handles the alerts sent by the Prometheus server, performing deduplication, grouping, routing, silencing, and inhibition.
  • Alerta is a user-friendly alert visualization tool for displaying and managing the alert data pushed from Alertmanager.
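To make the Alertmanager → Alerta handoff concrete, a minimal `alertmanager.yml` might look like the sketch below. This is an illustrative fragment, not the configuration from the share; the host address is a placeholder, and the grouping labels are my own choice:

```yaml
# alertmanager.yml -- minimal sketch; the Alerta address is a placeholder.
route:
  group_by: ['alertname', 'container_label_dataman_group']  # batch related alerts
  group_wait: 30s        # wait briefly to group alerts before the first notification
  repeat_interval: 3h    # re-send an unresolved alert at most this often
  receiver: 'alerta'
receivers:
  - name: 'alerta'
    webhook_configs:
      # Alerta exposes a webhook endpoint for Prometheus-format alerts.
      - url: 'http://alerta-host:8181/api/webhooks/prometheus'
```

Deduplication, grouping, and routing all happen in the `route` tree before any notification is sent to the receiver.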

Setting up a test environment

To make testing easy, we run all of the components above in containers on a test server.

  1. Start two Nginx containers, assigning different labels to identify one application as belonging to the dev group and the other to the ops group.
  2. Start the cAdvisor container, mapping port 8080.
  3. Start the Alertmanager container, mapping port 9093, and specify the Alerta address as the webhook notification address in its configuration file.
  4. Start the Prometheus container, mapping port 9090, with the "-alertmanager.url" CMD flag set to the Alertmanager address.
  5. Start MongoDB as the database for Alerta.
  6. Start Alerta, mapping its port to 8181.
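The six steps above could equally be expressed as a Compose file. The following is a rough sketch under assumptions of my own: the image names, label keys, and environment variables are illustrative and were not given in the original share:

```yaml
# docker-compose.yml -- illustrative sketch of the test environment.
version: '2'
services:
  web-dev:                      # step 1: Nginx for the dev group
    image: nginx
    labels: {dataman_service: web, dataman_group: dev}
  web-ops:                      # step 1: Nginx for the ops group
    image: nginx
    labels: {dataman_service: web, dataman_group: ops}
  cadvisor:                     # step 2
    image: google/cadvisor
    ports: ["8080:8080"]
  alertmanager:                 # step 3
    image: prom/alertmanager
    ports: ["9093:9093"]
  prometheus:                   # step 4: Prometheus 1.x flag syntax
    image: prom/prometheus
    ports: ["9090:9090"]
    command: ["-alertmanager.url=http://alertmanager:9093"]
  mongodb:                      # step 5
    image: mongo
  alerta:                       # step 6
    image: alerta/alerta-web
    ports: ["8181:8080"]
    environment:
      - DATABASE_URL=mongodb://mongodb:27017/monitoring
```

cAdvisor turns container labels such as `dataman_group` into metric labels prefixed with `container_label_`, which is where the label names used in the alert rules below come from.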

Screenshot of the running containers:

Application metrics collection

cAdvisor natively provides an HTTP interface exposing the metrics that Prometheus needs to collect, which can be viewed directly in a browser.
Configure the cAdvisor address as a target address in the Prometheus configuration file; the status of the targets can then be checked on the Prometheus web page.
On the Prometheus Graph page, you can query the collected data and display it graphically.
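A minimal Prometheus 1.x scrape configuration for this setup might look like the following sketch; the hostname and rule-file name are placeholders of my own, and the 5s interval is the scrape frequency mentioned later in the Q&A:

```yaml
# prometheus.yml -- illustrative fragment, not the file from the share.
global:
  scrape_interval: 5s          # how often targets are scraped
  evaluation_interval: 5s      # how often alert rules are evaluated

rule_files:
  - 'alert.rules'              # file holding the alert rule definitions

scrape_configs:
  - job_name: 'cadvisor'       # collect container metrics from cAdvisor
    static_configs:
      - targets: ['cadvisor-host:8080']
```

Once this is loaded, the target's health is visible under Status → Targets in the Prometheus web UI.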

Alarm rule configuration

We configure alert rules on the CPU usage of the container applications. The rules are as follows:
In the figure, alert rules are set separately for the application containers of the dev group and the ops group.

  • "ALERT" is the name of the alert rule; the name cannot contain spaces, but underscores may be used to join words.
  • "IF" is the rule's expression: it selects the metric "container_cpu_usage_seconds_total" where the label "container_label_dataman_service" equals "web" and the label "container_label_dataman_group" equals "dev", and applies the function irate() to compute the per-second rate of change of CPU usage time over the last 5 minutes — in effect, the percentage of CPU time consumed. The expressions in the two alert rules differ slightly only to distinguish the two groups of applications.
  • "FOR" means that once the alert condition has held for more than 1 minute, the alert transitions from the "PENDING" state to "FIRING" and is handed over to Alertmanager for processing.
  • "LABELS" carries custom data; here we specify the severity of the alert and the value of the "IF" expression.
  • "ANNOTATIONS" also carries custom data; here we describe the symptom and the cause of the alert.
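Putting those pieces together, a rule matching this description in Prometheus 1.x syntax (the version current at the time of this share) would look roughly as follows; the rule name, 80% threshold, and annotation text are illustrative choices of mine, since the original rule is only shown in a figure:

```
# alert.rules -- sketch of one rule for the dev group; the ops-group rule
# differs only in the container_label_dataman_group matcher.
ALERT dev_web_cpu_usage_high
  IF irate(container_cpu_usage_seconds_total{
       container_label_dataman_service="web",
       container_label_dataman_group="dev"}[5m]) > 0.8
  FOR 1m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "High CPU usage on the dev web container",
    description = "CPU usage has exceeded 80% for more than 1 minute (current value: {{ $value }}).",
  }
```

Note that Prometheus 2.x later moved alert rules into YAML with `expr:`/`for:` keys, so this exact syntax applies only to the 1.x series used here.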

Triggering an alert

We use stress to put CPU pressure on the two containers, driving container CPU usage above the alert threshold. On the Prometheus page we can see the alerts being generated.
On the Alertmanager page we can see the alerts received from Prometheus.
We can also see that Alertmanager has pushed the alert messages on to Alerta.

Displaying alert messages

Alerta saves and displays the received alerts.
Selecting an alert message opens its details; on the details page you can acknowledge (Ack), close, and otherwise operate on the alert.
After an alert has ended, you can view its history in Alerta, i.e. alerts in the closed state.

Concluding remarks

Above we briefly described how to use cAdvisor, Prometheus, Alertmanager, and Alerta to implement the time-series-based alerting practice described by Google SRE. Alerting on performance metrics is only the most basic approach; in a follow-up we will cover how to configure and collect an application's internal metrics and set up monitoring and alerting on them. Monitoring an application system is a complex undertaking that requires constant adjustment to match the service's operation and quality requirements, and we too need to keep learning from the SRE operations philosophy and putting it into practice. SRE can be seen as the concrete implementation of DevOps on the operations side: it covers concepts and culture as well as specific operations and engineering practices such as monitoring and alerting. More and more companies in China are now paying attention to how SRE can provide continuous support across a project's entire life cycle. As for how to make the SRE philosophy take root locally and find one's own road to SRE, Dataman Cloud is continuously exploring and sharing its experience, and we hope everyone can absorb the lessons of SRE together and keep raising the level of enterprise operations and engineering practice. Thank you!

Q & A

Q: After an alert message is received, does the system have any ability to resolve the reported problem automatically, or does it still need to be solved manually? Thanks.

A: It works like this: a good mechanism is that an alert should only be raised for a new problem; then, through a feedback loop, the same problem either stops recurring or is resolved by the monitoring system itself.

Q: Have you considered the InfluxDB family of solutions? The latest version of Grafana also has a very good alerting mechanism — have you tried it?

A: We considered and implemented InfluxDB's TICK stack, which makes the full pipeline of data collection, storage, and processing very convenient. By comparison, we found Prometheus more in line with Google SRE's monitoring philosophy, and its community is very active, so we switched to Prometheus. Grafana's powerful visual configuration of alert rules is a great enhancement for what was originally just a display tool; it is a great inspiration to us and something we are also studying.

Q: What is the syntax for configuring alert rules, and can it be done visually?

A: Prometheus describes alert rules in configuration files. You can build your own visualization on top of that.

Q: How do you handle very large data volumes — for example, tens of thousands of machines with 500 metrics each at one data point per minute, i.e. 60 × 24 × 30 × 500 × 10,000 data points a month? How do you store the data, and how do you query it quickly? What kind of architecture and hardware are needed?

A: Briefly, Prometheus can be sharded to support large-scale clusters, but once a certain scale is reached, only testing can give a definitive answer.

Q: Have you considered or practiced intelligent early warning in monitoring and alerting — for example, using machine learning on historical monitoring data to predict problems in advance?

A: That is not the approach SRE recommends. Alerting should be simple; advanced features blur its real intent.

Q: At what scale of hosts and containers is this solution deployed, and at what frequency are metrics collected?

A: What was shared here is a test environment, so the scale is small. Prometheus collects data from cAdvisor periodically, with a scrape interval of 5s.

Q: How is the performance of cAdvisor's data collection? Does it consume many host resources?

A: Performance is excellent. If you are worried about resource consumption, you can apply resource limits when starting the container.

Q: An application's own business logic needs monitoring data such as Counters and Gauges, which traditional Zabbix can collect. My understanding is that cAdvisor collects data about the container itself — is it possible to combine the application's own monitoring with container monitoring?

A: We will cover monitoring and alerting for applications in a follow-up share. Prometheus's model is to periodically scrape data from exporters, store the time series, and query and analyze it with PromQL, so it is indeed possible to combine an application's own monitoring with container monitoring.

The above content is based on a WeChat group share given on the evening of February 21, 2017. The sharer is Dou Zhongqiang, an R&D engineer at Dataman Cloud, with years of operations experience and familiarity with configuration management, continuous integration, and related technologies and practices; he is currently responsible for developing the monitoring and alerting components of the Dataman Cloud platform. DockOne organizes focused technical shares every week; readers who want to hear about or present a topic are welcome to contact liyoujiesz.
