NetEase Video Cloud expert sharing: an introduction to Apache Kylin

"Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets".
Apache Kylin (Kylin) is an OLAP system on Hadoop, open-sourced by eBay's China team, that supports a SQL interface and can handle very large datasets. Today's popular SQL-on-Hadoop solutions have to scan part or all of the data to answer a query, so query latency is high. Kylin builds on SQL-on-Hadoop and trades space for time by precomputing cubes, which greatly reduces query latency and makes up for that shortcoming. The project started in September 2013 and became an Apache Incubator project at the end of 2014. It is widely used inside eBay, and its prospects look good.
Kylin Key Features
• Ultra-fast OLAP engine for massive data, designed to reduce query latency on datasets with billions of rows.
• ANSI SQL query interface that supports most ANSI SQL query functions.
• Interactive query capability. Query latency is kept at the sub-second level, which brings interactive queries to Hadoop.
• MOLAP. Users define the cube model in advance, and Kylin builds the cube offline from the original dataset.
• Seamless integration with BI tools. Kylin can currently integrate with Tableau (via ODBC).
As shown below, Kylin is built on top of Hadoop Hive and HBase and implements query routing: whenever possible, a query is answered from the precomputed OLAP cube in HBase, and queries that HBase cannot satisfy are sent to Hive. The OLAP cube in HBase is computed offline from the star-schema data in Hive, trading space for time to speed up queries. Kylin's query acceleration is transparent to users: from the user's point of view, querying Kylin is not much different from querying Hive.
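The routing decision described above can be sketched roughly as follows. This is an illustrative sketch only, not Kylin's actual implementation: `CubeDesc`, `Query`, and `route` are hypothetical names, and the rule used here (a cube can answer a query when it covers all of the query's dimensions and measures) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CubeDesc:
    """Hypothetical description of a precomputed cube (not Kylin's metadata model)."""
    name: str
    dimensions: frozenset
    measures: frozenset


@dataclass(frozen=True)
class Query:
    """Hypothetical summary of a SQL query after parsing."""
    dimensions: frozenset  # columns used in GROUP BY / WHERE
    measures: frozenset    # requested aggregations, e.g. "SUM(price)"


def route(query, cubes):
    """Return the name of a cube that covers the query, or 'hive' as the fallback."""
    for cube in cubes:
        covers_dims = query.dimensions <= cube.dimensions
        covers_measures = query.measures <= cube.measures
        if covers_dims and covers_measures:
            return cube.name
    return "hive"


cubes = [
    CubeDesc(
        name="sales_cube",
        dimensions=frozenset({"time", "item", "location", "supplier"}),
        measures=frozenset({"SUM(price)"}),
    )
]

covered = Query(frozenset({"time", "item"}), frozenset({"SUM(price)"}))
uncovered = Query(frozenset({"buyer"}), frozenset({"SUM(price)"}))
```

With these definitions, `route(covered, cubes)` selects the precomputed cube, while `route(uncovered, cubes)` falls back to Hive because the `buyer` column is not a cube dimension.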

[Figure: Kylin architecture on top of Hive and HBase (44444.JPG)]

The goal of precomputing the OLAP cube is to aggregate the measures for each combination of dimensions in advance, so that complex SQL queries turn into simple key-value lookups. This avoids scanning large amounts of data and improves query efficiency. For example, a cube with four dimensions (time, item, location, supplier) yields 16 cuboids in Kylin, and each cuboid corresponds to one combination of dimensions. An N-dimensional cube has 2^N cuboids, so the space cost is considerable, and once N exceeds a certain size it becomes unacceptable. Kylin uses partial cubes to reduce the number of dimension combinations and balance storage space against query performance. The basic idea is to split the dimensions into several aggregation groups. Cuboids are computed only within each group, so queries inside a group are efficient while queries that span groups are slow. Aggregation groups therefore need to be defined according to the business scenario. In addition, Kylin supports excluding high-cardinality dimensions from the cube to reduce storage overhead.
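The cuboid counts above can be checked with a small sketch. It is a simplified model for illustration, not Kylin's cube planner: the helper names `cuboids` and `partial_cuboids` are made up here, and the split into two aggregation groups is an assumed example.

```python
from itertools import combinations


def cuboids(dimensions):
    """Enumerate every dimension combination (cuboid), including the empty one."""
    dims = list(dimensions)
    return [
        frozenset(combo)
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    ]


def partial_cuboids(groups):
    """Cuboids kept when combinations are built only inside each aggregation group."""
    kept = set()
    for group in groups:
        kept.update(cuboids(group))
    return kept


# Full cube over four dimensions: 2^4 = 16 cuboids.
full = cuboids(["time", "item", "location", "supplier"])

# Assumed example split into two aggregation groups of two dimensions each.
# Each group contributes 2^2 = 4 cuboids; the empty cuboid is shared, so 7 remain.
partial = partial_cuboids([["time", "item"], ["location", "supplier"]])

print(len(full), len(partial))
```

In this model a cross-group combination such as (time, location) is not precomputed, which is why cross-group queries are slower: they have to be answered from something other than a ready-made cuboid.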

[Figure: 2.png]

Computing a cube is very time-consuming, and rebuilding the whole cube every time new data enters the system is expensive, so Kylin provides incremental cube building to make offline cube construction more efficient. The idea is to keep one base cube plus several incremental cubes, each covering the new data for one period of time. New data goes into a new cube, which avoids rebuilding the whole cube as far as possible. A query has to read several cubes and aggregate their results, and the more cubes there are, the worse the query performance. The system therefore merges small cubes into larger ones according to a merge policy to lower query cost.
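The incremental build and merge idea can be illustrated with a toy model. This is a sketch under stated assumptions, not Kylin's build engine: `SegmentedCube` is a hypothetical class, each cube is reduced to a single SUM measure keyed by a dimension tuple, and the merge policy (merge everything once the number of cubes exceeds a threshold) is invented for the example.

```python
from collections import Counter


class SegmentedCube:
    """Toy model: one aggregated Counter per build period, merged when there are too many."""

    def __init__(self, merge_threshold=3):
        self.segments = []  # each segment maps a dimension tuple to a summed measure
        self.merge_threshold = merge_threshold

    def build_increment(self, rows):
        """Aggregate only the new rows into a new segment; older segments are untouched."""
        segment = Counter()
        for key, value in rows:
            segment[key] += value
        self.segments.append(segment)
        if len(self.segments) > self.merge_threshold:
            self.merge()

    def merge(self):
        """Combine all segments into one so later queries touch fewer segments."""
        merged = Counter()
        for segment in self.segments:
            merged.update(segment)  # Counter.update adds values for matching keys
        self.segments = [merged]

    def query(self, key):
        """Aggregate the measure for one key across every segment."""
        return sum(segment.get(key, 0) for segment in self.segments)


cube = SegmentedCube(merge_threshold=2)
cube.build_increment([(("2014-01", "book"), 10)])
cube.build_increment([(("2014-01", "book"), 5), (("2014-02", "book"), 1)])
segments_before_merge = len(cube.segments)
cube.build_increment([(("2014-02", "book"), 4)])  # third build exceeds the threshold
segments_after_merge = len(cube.segments)
```

Query cost in this model grows with the number of segments, since `query` visits each one. Merging leaves the answers unchanged and reduces the number of segments a query has to read.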
Summary
Apache Kylin is an open source OLAP system built on top of Hadoop Hive and HBase. It is used extensively at eBay, where the largest use case has more than 120 billion records and more than 14 TB of cube data. At that scale, 90% of queries return within 5 seconds. Overall, Kylin's prospects look good, but the project is still young: compared with Google Mesa, it still has a lot of room for optimization in data update capability, data partitioning, online metadata changes, and query performance.
Reference
• Jiang Xu, Kylin: Hadoop OLAP Engine – Tech Deep Dive
• Luke Han, Apache Kylin Introduction
• Kylin official website, http://www.kylin.io/
• Kylin officially released: the ultimate OLAP engine for big data, http://www.csdn.net/article/2014-10-25/2822286
