Tencent's Liang Dingan: The last leg of DevOps, building effective continuous feedback for operations at massive scale

On May 6, Shurenyun and Youwei Technology hosted the Shenzhen session of "DevOps & SRE: Beyond Traditional Operations" (the Beijing session follows in June, so stay tuned). This article is the talk given there by Liang Dingan, head of operations at Tencent SNG: "The last leg of DevOps: how to effectively build continuous feedback capability for operations at massive scale". Other guests at the same event shared how these practices are landing in traditional enterprises.

Image

Liang Dingan is the head of Tencent Zhiyun (weave cloud) and currently works in operations for Tencent's social network business. He is a member of the Open Operations Alliance, a Tencent Cloud evangelist, an operations lecturer at Tencent Classroom, an EXIN DevOps Master lecturer, a Phoenix Project simulation coach, and a guest lecturer at Fudan University.

The last leg of DevOps

Preface

Image

This diagram summarizes the whole DevOps system. Its last stage is operation and retirement. I think this stage should be understood in two dimensions: first, the quality and operation of an individual operations activity through to its end; second, the technical operation of a product through to the end of its life cycle.
Today I will talk about the technical operation phase that precedes the end of the product life cycle: how to build a quality system that achieves continuous feedback and optimization.

Monitoring, alerting and operations

What continuous feedback means for operations

Image

◆ Monitoring – coverage, status feedback, metrics

Monitoring must cover the business from every angle with no blind spots, so that any problem the business has can be detected. With monitoring feedback you can see the status in real time, and changes in metrics that need attention are fed back as well.

◆ Alerting – timeliness, accuracy, reach

As the business grows more complex and more layered, every monitoring point produces metric data, and whenever a state is abnormal, more and more alerts arrive. Anyone who misses an alert, or sees it and does not handle it, has to take responsibility, because what was received was not a false alarm.

◆ Operations – RCA, event management, reporting / assessment

Recurrence of a problem must be prevented at its root cause. An event management mechanism makes sure the RCA actually lands, and reports and assessments give operations the authority to drive the related optimization work, including architecture and code optimization.

Comprehensive monitoring points

Image

Tencent's services are managed in layers. From the bottom up there are the server layer, the database and the logic layer. Beyond these there are the access layer, load balancing, the data center, the DNS service, the client, the user side, and so on. To leave no blind spots, we have laid out a great many monitoring points.

Once public-opinion monitoring is in place, monitoring-point coverage reaches 100%. That is still no reason to sit back and relax: as monitoring becomes more multi-layered, with 360-degree coverage and a metric on every finest-grained point, the resulting data explosion is likely to become another potential risk for monitoring.

Problems to solve in the operation phase

Image

〓 Cumbersome – simple

Real production generates operations events and failures, crashes are frequent, and every layer of monitoring raises alerts. How do we simplify all these cumbersome alerts and failures?

〓 Broad – precise

For example, suppose 1000 machines spanning the data layer, logic layer, access layer and so on sit under one core switch. When the switch fails and becomes unavailable, multi-layer monitoring means every monitoring point generates a large number of alerts. How do we work out that these alerts were caused by the failure of the core switch?

〓 Chaos – order

Because metrics differ in how they are collected and in data volume, the monitoring stream processes them at different speeds, so alerts do not arrive in the order in which things happened. How do you sort them and assign priorities?

So when monitoring is lacking, build it actively; but when alerts flood in, learn to filter.

Monitored objects and metrics

Image

Tencent's monitored objects are shown in the figure above, arranged by business logic from bottom to top. The common monitoring layers are network, server, virtualization and application; the application layer includes monitoring of some components.

Here is a case: applying for a QQ number. Suppose a user on the PC side initiates a request to apply for a QQ number. The request goes to the web front end and then to the registration service. Registering a QQ number involves three kinds of information: personal information, personalized settings and value-added services. Whether the user is a QQ member, and whether to activate member-like services, is business logic.

With multi-layer monitoring, and assuming component monitoring is used, some common metrics can be measured whether the product is QQ, Qzone or QQ Music. How much memory is allocated? How many long-lived connections are there? What are the user processes, throughput, traffic, CPU and business-level return codes? How is the success rate distributed across provinces and cities? None of this depends on the specific business logic.

For monitoring purposes, metrics fall into two categories:

Low-level metrics

Common and infrastructure metrics that sit below the business logic, such as network, hardware and virtualization, are low-level metrics.

The lower-level the metric, the more noise its alerts cause in the monitoring system. When planning or optimizing monitoring strategy, handle and converge low-level metrics automatically as far as possible, and alert on high-level metrics as far as possible, because those alerts reflect business availability most directly and most need attention. If a company uses low-level metrics in place of high-level ones, quality becomes very hard to manage.

High-level metrics

High-level metrics reflect business availability more directly, for example success rate, latency and request rate.

High-level metrics must reflect the real state of the business in real time. In operations at massive scale, people cannot look at individual machines; they have to look at the cluster level.

The module is the unified operations object: a module is a cluster that provides a single business function. Why manage at the cluster level? Put simply, it abstracts the operations objects and reduces their number. Take Tencent's SNG: its 100,000+ servers abstract into just over 10,000 modules. Compared with facing alerts on N metrics for each of 100,000 operations objects, facing alerts for some 10,000 modules is far easier, and once low-level alerts are optimized there may be only about 5,000 alerts to face.
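The abstraction from machines to modules can be sketched roughly as follows. This is a minimal illustration, not Tencent's implementation: the host addresses, module names and the `alerts_by_module` helper are all hypothetical.

```python
from collections import defaultdict


def alerts_by_module(host_alerts, host_to_module):
    """Collapse host-level alerts into one entry per module (cluster)."""
    grouped = defaultdict(list)
    for host, metric in host_alerts:
        grouped[host_to_module[host]].append((host, metric))
    return dict(grouped)


# Hypothetical mapping of machines to the module they serve.
host_to_module = {
    "10.0.0.1": "qzone-upload",
    "10.0.0.2": "qzone-upload",
    "10.0.0.3": "qq-register",
}
host_alerts = [("10.0.0.1", "cpu"), ("10.0.0.2", "cpu"), ("10.0.0.3", "disk")]

module_alerts = alerts_by_module(host_alerts, host_to_module)
print(len(host_alerts), "host alerts ->", len(module_alerts), "module alerts")
```

The operator then reasons about two modules rather than three machines; at SNG scale the same idea reduces 100,000+ objects to roughly 10,000.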

Within high-level metrics, it is also worth distinguishing the high-level metrics of a single service from those of a business function. That requires clarifying two concepts: reliability and availability.

Reliability refers to the number of failures of a single service. A failure of a single service does not necessarily reduce the availability of the whole QQ number application service, because the microservices themselves have retry-on-failure logic. In Tencent's operations experience, we make a trade-off between reliability and availability.
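A small calculation shows why a single-service failure rate and user-facing availability can differ once retries exist. This is only a sketch under a strong assumption that is not in the talk: attempts fail independently. The function name and the numbers are illustrative.

```python
def end_to_end_failure(single_call_failure, retries):
    """Probability that a request still fails after `retries` extra attempts,
    assuming every attempt fails independently (an idealization)."""
    return single_call_failure ** (retries + 1)


# Reliability view: 1% of individual calls to the service fail.
reliability_view = 0.01
# Availability view: the caller retries twice before giving up.
availability_view = end_to_end_failure(reliability_view, retries=2)
print(f"single call failure: {reliability_view:.2%}")
print(f"user-facing failure: {availability_view:.6%}")
```

In practice failures are often correlated (for example when the whole callee cluster is down), so the real gap is smaller than this idealized figure.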

Low-level metrics are more basic and can often be handled automatically, but they are frequently the root cause behind problems in several high-level metrics. Making good use of low-level metrics helps locate high-level metric failures faster.

The essence of monitoring

Image

Monitoring is nothing more than monitoring a lot of values and rates. Values and rates are considered separately: a value is reported as it is, while a rate is computed from reported values. Both are flattened information that is then packaged into high-level metrics.

The ultimate goal of monitoring is to analyze state and find anomalies: from charts, tables or data, analyze how the business is doing and whether the service is currently abnormal.

Solving false alerts

Image

Multi-layer monitoring brings an explosion of monitoring metrics, and with it the risk that alert volume gets out of control. Handled badly, alert notifications turn into "crying wolf" and lose their effect. Dealing effectively with too many alerts and with false alerts involves several points:

◇ Correlation analysis

Pick out the events, activities and metrics that are truly important and need to be passed on. Do not alert on everything, which wears out the credibility of alerts.

◇ No false alerts

Use convergence strategies and suppression strategies to the fullest, and combine the two where necessary for a stronger effect.

◇ Continuous operation

Doing continuous operation well means following up: making sure important matters have an owner, that measures exist to prevent recurrence, and that a mechanism guarantees the process.

This requires a quality system for closed-loop management. When monitoring reveals problems such as an unreasonable business architecture or unreasonable code, the quality system drives business, development and operations to land the optimization measures. This serves the final business value, which is the DevOps point of view.

Case: how to analyze massive data

Image

This is a multi-dimensional monitoring case from mobile Qzone. When the client first connects to the server it sends a heartbeat packet, which is a command word. Measuring its success rate is really measuring how reliably the long-lived connection is kept alive. (If the long-lived connection drops, the mobile client has to re-establish it with the server through the base station, which costs at least 3 to 4 seconds, during which messages from friends cannot be received.) As the figure shows, we require three 9s of quality for general functions. But do not be deceived by the average; drill down to see the real situation.

Image

Many Tencent services are deployed active in multiple locations: some in relatively small AC points, others in larger DC points. The service endpoints that users across the country access are what Tencent calls SETs. Drilling the average down by this dimension raises a question: why is the success rate for "no SET" only two 9s? So we drill down again.

Image

Drilling down by APN (access method: WIFI, 4G, 3G and so on), service quality looks worse still, with only two entries green. 4G is at 100%, so why is the WIFI environment at only two 9s?

Image

Drilling down by carrier, even more of the quality data is red (poor). That matches expectations, but the underlying problem has still not been found.

Continuing to drill down by region, everything is red, and there is still no clue.

Image

Drilling down once more within a region, into Zhejiang, we found that all the errors came from Android versions, while the iOS version had a 100% success rate. The common factor was about to emerge.

Image

Looking back over the troubleshooting path, we may have opened it the wrong way. Had we drilled down by version directly at the third step, the truth would have come out then: a few Android versions had a latent defect that caused the heartbeat logic problem.
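The drill-down in this case amounts to regrouping the same request counts by one dimension at a time. The sketch below uses made-up records and a hypothetical `success_rate_by` helper; the numbers are not the ones from the talk, they only reproduce the pattern in which the platform dimension separates cleanly while the access-method dimension does not.

```python
from collections import defaultdict


def success_rate_by(records, dimension):
    """Aggregate success rate per value of one dimension."""
    totals = defaultdict(lambda: [0, 0])  # value -> [successes, requests]
    for rec in records:
        bucket = totals[rec[dimension]]
        bucket[0] += rec["ok"]
        bucket[1] += rec["total"]
    return {value: ok / total for value, (ok, total) in totals.items()}


# Hypothetical heartbeat counts; only Android requests fail.
records = [
    {"apn": "WIFI", "platform": "Android", "ok": 950, "total": 1000},
    {"apn": "WIFI", "platform": "iOS", "ok": 1000, "total": 1000},
    {"apn": "4G", "platform": "Android", "ok": 960, "total": 1000},
    {"apn": "4G", "platform": "iOS", "ok": 1000, "total": 1000},
]

overall = sum(r["ok"] for r in records) / sum(r["total"] for r in records)
by_apn = success_rate_by(records, "apn")
by_platform = success_rate_by(records, "platform")

print("overall:", overall)
print("by apn:", by_apn)            # both values look mediocre
print("by platform:", by_platform)  # one value is perfect, one is bad
```

A dimension is a good suspect when it splits the data into values that are clearly healthy and values that are clearly not, which is what drilling down by version did here.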

This illustrates a point: when dealing with massive multi-dimensional data, the analysis approach matters a great deal and should be considered when planning and building a monitoring system. Today I bring three techniques that I hope will help everyone with monitoring data analysis.

Three techniques for analyzing massive data

Image

The techniques for analyzing massive monitoring data are: tracing to the source, root cause analysis, and preferred metrics.

To speed up processing of large volumes of alert data, the monitoring protocol is often normalized, and the normalized data is then processed and normalized again. Many clues in the original data are lost along the way, leaving no way to find the real problem. Because each normalization distorts the monitoring data and hurts troubleshooting efficiency, the reporting protocol should retain as many fields as possible.

Tracing to the source

Image

〓 High dimension and dimension reduction

This is a combined approach. A value or rate metric can be drilled down along many different dimensions, but alerting on the abnormal state of every combination of dimensions is unrealistic, because there would be far too many alerts to handle. Metrics across many dimensions can instead be rolled up into a daily summary report where anomalies become visible, and assessments can then keep pushing the abnormal metrics to be sorted out and optimized. That is what combining high dimension with dimension reduction means.

〓 Cascade analysis

Networking has a term, microburst: the network suddenly becomes congested and triggers a large wave of low-level and high-level alerts. For example, a switch fault causes an explosion of alerts from the servers beneath it. When this happens, the unified alerting platform ignores them all and applies global convergence, so that the effectiveness of alerting is not affected.
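Cascade convergence can be sketched as suppressing any alert whose upstream device is itself alerting. This is a minimal illustration that assumes the topology is available as a child-to-parent map without cycles; the device names and the `converge_cascade` helper are hypothetical.

```python
def converge_cascade(alerts, parent_of):
    """Drop alerts whose upstream device (any ancestor) is itself alerting.

    `parent_of` maps an object to its upstream device and is assumed acyclic.
    """
    alerting = {alert["object"] for alert in alerts}
    kept = []
    for alert in alerts:
        node = parent_of.get(alert["object"])
        suppressed = False
        while node is not None:
            if node in alerting:
                suppressed = True
                break
            node = parent_of.get(node)
        if not suppressed:
            kept.append(alert)
    return kept


# Hypothetical topology: two servers under one core switch.
parent_of = {"srv-1": "sw-core", "srv-2": "sw-core"}
alerts = [
    {"object": "sw-core", "metric": "port_down"},
    {"object": "srv-1", "metric": "ping"},
    {"object": "srv-2", "metric": "success_rate"},
]

kept = converge_cascade(alerts, parent_of)
print([alert["object"] for alert in kept])
```

With the switch alert present, only the switch is reported; if the switch were healthy, both server alerts would pass through unchanged.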

〓 Reverse thinking

Do not look only at result data; go back to the raw data. For reverse thinking to work, the stream-processing cluster should do the most basic analysis before it stores result data and back that part of the logs up to the big data platform for offline computation. The result data then follows the normal flow, whether for alerting or for detecting abnormal fluctuations. Many anomalies can only be seen in the raw data. By analyzing the transaction logs of photo uploads to albums in depth, we found a large number of abnormal user photos and saved substantial operating costs, something result data alone could not have achieved.

Root cause analysis

Image

High-level alerts converge low-level alerts:
When the same cluster raises a low-level alert and also a high-level alert, the low-level alert is not sent.

Caller alerts converge callee alerts:
Module A calls module B, and B goes down. Is A affected? From the standpoint of protecting business availability, if A raises no alert, the situation is only a reliability alert for B, and that alert goes to development rather than operations. If B goes down and A also raises an alert, operations should receive A's alert, while B's alert still goes to development. Promoting alert classification (by level, by category, by channel) gradually hands some of the work operations used to do over to development, so that operations watches only the core. Software reliability is a development problem; availability is an operations quality problem.

Cause alerts converge symptom alerts:
Within a chain of business-logic calls, the alert for the cause converges the alerts for the symptoms.

Activity events suppress alerts on the affected objects:
Some alerts are caused by change activities and should be converged. If an upgrade causes an alert, the operations system must be able to link the event to the alert. Put high-level alerts, low-level alerts and operations-activity events together, and let a weighting algorithm rank them and decide which alert on the chain should be raised, rather than having every object alert repeatedly.
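Two of the rules above are simple enough to sketch directly. The code is an illustration of the rules as stated, not the unified alerting platform's logic; the field names (`cluster`, `level`) and both helper functions are assumptions.

```python
def converge_by_level(alerts):
    """Within each cluster, keep only high-level alerts when any exist."""
    clusters_with_high = {a["cluster"] for a in alerts if a["level"] == "high"}
    return [
        a for a in alerts
        if a["level"] == "high" or a["cluster"] not in clusters_with_high
    ]


def route_callee_failure(caller_alerting):
    """Who is notified when callee B fails while being called by A."""
    recipients = {"B": "development"}   # B's failure is a reliability issue
    if caller_alerting:
        recipients["A"] = "operations"  # A alerting means availability is hit
    return recipients


alerts = [
    {"cluster": "c1", "level": "low", "metric": "cpu"},
    {"cluster": "c1", "level": "high", "metric": "success_rate"},
    {"cluster": "c2", "level": "low", "metric": "disk"},
]
print(converge_by_level(alerts))
print(route_callee_failure(caller_alerting=False))
print(route_callee_failure(caller_alerting=True))
```

The weighting algorithm that ranks high-level alerts, low-level alerts and change events together is not described in the talk, so it is not attempted here.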

Preferred metrics

Image

Core metrics

This is probably the first time preferred metrics have been shared externally. Inside Tencent the system is code-named DLP. It is a method of selecting core metrics manually, which in today's big-data era is somewhat inelegant. A module may have 300 to 400 metrics, both low-level and high-level, and when the module has a problem all of them may raise alerts. How should they converge? If the module's core metric has been selected manually in advance, that one metric can represent the module's real state.

Correlation in monitoring

Monitoring metrics are correlated. For example, when 300 metrics alert, the core metric will alert as well; and if the core metric alerts, the other 300 need not, because looking at the core one is enough. Why select the core metric manually? Because for now there is no way to identify it automatically.

Tiered alert management

Alerts are tiered on the basis of core metrics: development takes care of the non-core ones itself, and operations maintains the core ones, which gives them a high standard of protection.
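One way to read the core-metric rule and the tiering rule together is sketched below. This is an interpretation under stated assumptions rather than the DLP system itself: the metric names and `route_module_alerts` are hypothetical, and dropping non-core alerts entirely when the core metric fires is a simplification.

```python
def route_module_alerts(alerting_metrics, core_metric):
    """Converge a module's alerts onto its manually chosen core metric.

    If the core metric is alerting, operations gets that single alert and the
    rest are converged away; otherwise the non-core alerts go to development.
    """
    if core_metric in alerting_metrics:
        return {"operations": [core_metric], "development": []}
    return {"operations": [], "development": sorted(alerting_metrics)}


# Hypothetical module whose core (DLP) metric is the login success rate.
core = "login_success_rate"
storm = {"cpu", "mem", "conn_count", "login_success_rate"}
minor = {"cpu", "mem"}

print(route_module_alerts(storm, core))
print(route_module_alerts(minor, core))
```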

Reducing stream-processing load

The more monitoring points there are, the greater the data flow and the larger the stream-processing cluster: for 100,000 machines, the stream-processing cluster alone is close to 1,500 machines. Under operating-cost pressure, DLP metrics can be given priority for computing resources so that their capacity is guaranteed first.

SNG Zhiyun (weave cloud) monitoring system

User public-opinion analysis and monitoring

Image

One very core metric comes from Zhiyun's user public-opinion monitoring system. In brief, user public-opinion monitoring, as the name suggests, monitors users' voices and feedback. The feedback comes from several sources: the App Store, the feedback entry embedded in the app, and Tencent's user feedback forum. All of it is collected into the Zhiyun public-opinion monitoring platform and classified automatically by machine. The system runs semantic analysis on phrases such as "Qzone will not open" or "Qzone is broken", classifies them, and raises a unified alert such as "Qzone exception". The time granularity is 15 minutes; the technical details are not the focus here.

With user public-opinion monitoring in place, we can basically say that business monitoring covers 360 degrees with no blind spots (assuming users do give feedback, and leaving the time factor aside). This kind of monitoring has an inherent threshold, though: it depends on users actively giving feedback, and it needs a fairly large volume of feedback data. Tencent has a large user base, so the volume of user feedback is large as well. Public-opinion monitoring can be used to monitor technical quality problems and also product experience or interaction issues.
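The shape of this pipeline (classify each piece of feedback, count per category in 15-minute buckets, alert above a threshold) can be sketched as follows. Plain keyword matching stands in for the semantic analysis, which the talk does not describe; the keyword table, the `min_count` threshold and both function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword table standing in for real semantic classification.
CATEGORY_KEYWORDS = {
    "Qzone exception": ("will not open", "can not open", "broken"),
}


def classify(text):
    """Return the first category whose keyword appears in the text, else None."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None


def opinion_alerts(feedback, min_count, bucket_seconds=15 * 60):
    """Return (bucket index, category) pairs with at least `min_count` reports."""
    counts = Counter()
    for timestamp, text in feedback:
        category = classify(text)
        if category is not None:
            counts[(timestamp // bucket_seconds, category)] += 1
    return sorted(key for key, n in counts.items() if n >= min_count)


# Timestamps are seconds since an arbitrary start.
feedback = [
    (0, "Qzone will not open"),
    (60, "qzone is broken"),
    (120, "Qzone will not open again"),
    (2000, "love the new theme"),
]
print(opinion_alerts(feedback, min_count=3))
```

The threshold is where the "inherent threshold" mentioned above shows up: with too little feedback per bucket, nothing crosses `min_count` and no alert is raised.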

With strategies in place, more can be automated

Image

Alert automation presupposes a standardized operations system. In the SNG Zhiyun monitoring system, every alert first goes through a preprocessing strategy and then through the strategies and algorithms of the unified alerting platform, which finally decides whether it is sent.

Applying algorithms and strategies precisely

Image

When defining the abnormal state of a metric, our experience is to avoid fixed thresholds where possible and use dynamic thresholds instead; otherwise threshold management costs a great deal of manual effort. Other recommendations are shown in the figure.
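As an illustration of what a dynamic threshold can look like, the sketch below derives bounds from recent history as mean plus or minus k standard deviations. This particular rule is an assumption chosen for simplicity; the talk does not say which algorithm Zhiyun uses, and the sample values are made up.

```python
from statistics import mean, stdev


def dynamic_bounds(history, k=3.0):
    """Lower and upper bounds derived from recent samples (mean +/- k * stdev)."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """True when the value falls outside the bounds computed from history."""
    low, high = dynamic_bounds(history, k)
    return value < low or value > high


# Hypothetical request counts at the same minute on previous days.
history = [100, 102, 98, 101, 99]
print(dynamic_bounds(history))
print(is_anomalous(103, history), is_anomalous(140, history))
```

Because the bounds follow the data, nobody has to maintain a hand-set threshold per metric, which is the manual cost the paragraph above warns about.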

Common business monitoring curves and strategies

Image

In daily operations work we face monitoring curves like those shown above: small fluctuations, spikes, irregular shapes. We recommend applying a targeted monitoring strategy to each kind so that alerts are more accurate.

Case: self-healing through monitoring

Image

Here is a case of process self-healing in Zhiyun monitoring. When a module is deployed, the automated operations workflow registers its process and port information in the CMDB. The monitoring service then reads the process and port information that needs monitoring for the module and sends the monitoring configuration to the monitoring agent on each machine. The local agent periodically checks the processes and ports with ps. If it finds an anomaly, it automatically looks up the start command through standardized management and runs it. If the start succeeds, the process has healed itself.

If the start fails, the agent reports to the unified alerting platform, which decides whether an alert is needed. When the cause is an infrastructure change in progress, no alert is sent. This whole series of self-healing schemes is built on Zhiyun's automated operations system.
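One check cycle of such an agent can be sketched as below. To keep the example self-contained, the process check, the start command and the alert report are passed in as plain callables and simulated in memory; in a real agent they would wrap ps, the registered start command and the alerting platform's API, none of which are specified in the talk. All names are hypothetical.

```python
def heal_process(name, is_running, start, report_alert):
    """Run one check cycle for one registered process.

    Returns "healthy", "healed" or "reported".
    """
    if is_running(name):
        return "healthy"
    start(name)  # run the start command found through standardized management
    if is_running(name):
        return "healed"
    report_alert(name)  # hand over to the unified alerting platform
    return "reported"


# In-memory stand-ins for the real process table and start commands.
running = set()
startable = {"qzone_upload"}
reported = []


def is_running(name):
    return name in running


def start(name):
    if name in startable:
        running.add(name)


result_ok = heal_process("qzone_upload", is_running, start, reported.append)
result_bad = heal_process("broken_daemon", is_running, start, reported.append)
print(result_ok, result_bad, reported)
```

Whether a reported failure actually becomes an alert is still decided downstream, for example by checking for an infrastructure change in progress.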

Common convergence algorithms

Image

◇ Spike convergence

In Zhiyun monitoring, to keep spikes from triggering alerts, an alert strategy is defined with a pattern such as "3 times within 10 minutes".
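A "3 times within 10 minutes" rule can be sketched with a sliding window of abnormal timestamps. The class below is a minimal illustration with hypothetical names; it assumes check results arrive with non-decreasing timestamps in seconds.

```python
from collections import deque


class SpikeFilter:
    """Fire only when `threshold` abnormal checks fall inside the window."""

    def __init__(self, threshold=3, window_seconds=600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of recent abnormal checks

    def observe(self, timestamp, abnormal):
        """Record one check result; return True if an alert should fire."""
        if abnormal:
            self._hits.append(timestamp)
        # Forget abnormal checks that have slid out of the window.
        while self._hits and timestamp - self._hits[0] >= self.window_seconds:
            self._hits.popleft()
        return abnormal and len(self._hits) >= self.threshold


spike_filter = SpikeFilter()
results = [spike_filter.observe(t, True) for t in (0, 60, 120)]
print(results)
```

A single spike, or two spikes more than ten minutes apart, never reaches the threshold, so the alert is suppressed.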

◇ Same-kind convergence

A module has 300 monitored instances that produce 300 alerts. It is enough for one of them to reach operations, so alerts to the same operator are converged.

◇ Time convergence

Production has many scheduled tasks; a regular run, for example, causes a steep I/O increase and other anomalies. These can be converged in a targeted way.

◇ Day/night convergence

For distributed services with a highly available architecture, some alerts need not be sent at night and can wait until daytime, which is a more humane way to manage.

◇ Change convergence

If an operations activity is under way at the time of the alert, the alert should be converged. How is that done? It relies on all operations activities being confined to the standard operations platform. When the platform changes the production environment, it writes a record to the change log center, and the unified alerting system then correlates alerts with change records to decide whether to converge or to alert.
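Change convergence boils down to a lookup against the change log. In the sketch below the record fields (`module`, `time`), the 10-minute window and the `should_alert` helper are all assumptions; the real change log center's schema is not described in the talk.

```python
def should_alert(alert, change_log, window_seconds=600):
    """Return False when a recorded change on the same module precedes the
    alert by at most `window_seconds`."""
    for change in change_log:
        same_module = change["module"] == alert["module"]
        recent = 0 <= alert["time"] - change["time"] <= window_seconds
        if same_module and recent:
            return False
    return True


# Hypothetical change record written by the operations platform.
change_log = [{"module": "qzone-upload", "time": 1000, "action": "upgrade"}]

during_change = {"module": "qzone-upload", "time": 1200, "metric": "success_rate"}
long_after = {"module": "qzone-upload", "time": 5000, "metric": "success_rate"}
other_module = {"module": "qq-register", "time": 1200, "metric": "latency"}

print(should_alert(during_change, change_log))
print(should_alert(long_after, change_log))
print(should_alert(other_module, change_log))
```

This only works if every change goes through the platform that writes the log, which is why the text stresses confining operations activities to the standard platform.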

Zhiyun monitoring: building the quality system

Image

The quality system built by Zhiyun monitoring is divided into the client, the user side, the server side and the infrastructure side. It defines the core DLP metrics and uses tiered alerts and per-channel alerts, combining SMS, QQ, WeChat and telephone to deliver notifications. The whole quality system is built around four capabilities: early warning, self-healing, analysis and troubleshooting.

Zhiyun monitoring: the quality system

Image

The Zhiyun monitoring quality system aims to form a closed loop of continuous feedback, measurement and optimization, so that teams can work together more efficiently and effectively.

Monitoring capability

Look at the overall picture: what monitoring capabilities and monitoring points exist? At the same time, clarify how metrics are layered and which metrics are important, and ultimately turn them into high-level metrics that the business can understand.

Business availability

Should operations look at reliability or at availability? At small scale, looking at reliability is fine. At massive scale reliability is too fine-grained, so measure with core metrics and measure availability. Reliability can still be measured and managed through the system and, with the backing of QA and management, used to push the development team to invest in optimization.

User experience

People in technical operations have a blind spot: they often obsess over whether availability is four 9s or five 9s. That does not fully represent service quality. If users cannot even connect to our servers, any number of 9s means little. This is a very real problem, so user experience monitoring must be done: high internal availability does not mean users have a good experience.

Technical solutions

These include technical solutions and automation tools, tools that help users troubleshoot, algorithm platforms for analysis, and so on.

Statistical analysis

The end result should be measurable metrics that can be assessed and displayed, ideally with DIY display. The statistics and reporting capabilities of monitoring data should let more roles use the data, rather than being limited to the operations role.

Continuous improvement

Whether the final improvement concerns an architecture problem, a code problem, or a standardization or non-functional requirement that could not be landed, it needs numbers to measure it and drive it. Ideally those numbers should indirectly reflect business value, which is the idea DevOps advocates.

Finally, the quality system necessarily reflects monitoring capability. To form such a closed loop, how do you communicate with development? With product? With QA and with customer service? Bring them in, point them to what they care about, and ultimately land it in operations so that the goal is reached. That is DevOps. It also means leveraging management's thinking to win top-down support, so that the quality system can be built well.

The last leg of DevOps

Conclusion

We often say DevOps is hard to land. Why? Because we always want to influence the boss first, change the culture, and only then change the way of working, which is easier said than done. If operations and development can join forces, start from a focus on the business, and use data to show the business value DevOps ultimately brings, the effect may be multiplied.