NetEase Video Cloud: Impala notes
Impala is an interactive MPP SQL engine on Hadoop and one of the best-performing open-source SQL-on-Hadoop systems. As the comparison below shows, Impala outperforms SparkSQL, Presto, and Hive.
Impala integrates closely with the Hadoop ecosystem: (1) HDFS is Impala's most important data source; Impala also supports HBase and even S3 storage. (2) Impala table definitions are stored in the Hive metastore, so Impala can read tables defined by Hive. (3) Impala supports Parquet, RCFile, SequenceFile, text, and other common file formats; of these, the columnar Parquet format performs best. (4) Impala integrates with YARN.
Impala supports most of the SQL-92 SELECT syntax plus the SQL:2003 analytic (window) functions. It does not support DELETE or UPDATE, but it does support bulk loading (INSERT INTO ... SELECT, LOAD DATA) and bulk deletion (dropping a partition). Users can also manipulate HDFS files directly to load or clean data.
The architecture is shown above.
• impalad is the core component. It receives user query requests (via the ODBC protocol), generates query plans, coordinates the other impalad instances that execute the plan, and assembles the results to return to the user. An impalad runs on every DataNode; in general an impalad reads only local data, avoiding remote reads of HDFS files wherever possible.
• statestore is responsible for notifying the cluster of metadata changes and distributing them. The metadata includes the catalog, cluster membership, and so on. SQL queries depend on this metadata, so it is cached on every impalad node. Through its publish/subscribe mechanism, the statestore ensures that each impalad receives the latest metadata promptly, but it does not need to understand the metadata itself, nor persist it. If the statestore goes down, the metadata can be reconstructed from the authoritative sources or from the impalad nodes, so although the statestore is a global single point, it does not affect availability.
• catalogd manages catalog information such as databases and tables and implements DDL; catalog updates are distributed to the impalad nodes by the statestore.
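The statestore's role described above can be sketched as a tiny publish/subscribe loop. This is a minimal illustration in plain Python with assumed class and method names (`Statestore`, `ImpaladCache`, `update`), not Impala's actual API: subscribers cache versioned metadata topics, the statestore pushes updates, and it keeps no durable state of its own.

```python
class Statestore:
    def __init__(self):
        self.subscribers = []  # registered impalad caches
        self.topics = {}       # topic -> (version, payload); never persisted

    def register(self, subscriber):
        self.subscribers.append(subscriber)
        # a new subscriber receives the current state of every topic
        for topic, (version, payload) in self.topics.items():
            subscriber.update(topic, version, payload)

    def publish(self, topic, payload):
        version = self.topics.get(topic, (0, None))[0] + 1
        self.topics[topic] = (version, payload)
        for s in self.subscribers:  # push the update to every cached copy
            s.update(topic, version, payload)

class ImpaladCache:
    def __init__(self):
        self.cache = {}  # local metadata cache on each impalad node

    def update(self, topic, version, payload):
        # keep only the newest version; stale pushes are ignored
        if version > self.cache.get(topic, (0, None))[0]:
            self.cache[topic] = (version, payload)

store = Statestore()
node = ImpaladCache()
store.register(node)
store.publish("catalog", {"tables": ["t1", "t2"]})
```

Because the statestore holds only the latest in-memory copy, restarting it and replaying the topics from their sources reproduces the same caches, which is why it is not an availability bottleneck.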
An impalad is split into two layers: a frontend, implemented in Java (embedded in impalad via JNI), which generates query plans, and a backend, implemented in C++, which executes them.
The frontend generates a query plan in two phases: (1) generate a single-node query plan, which resembles a relational database's execution plan and uses similar query-optimization techniques; (2) generate a distributed query plan, i.e. convert the single-node plan into a genuinely executable distributed plan that minimizes data movement and tries to co-locate data and computation.
The figure above shows an example SQL query: a join of three tables followed by aggregation, taking the topN rows ordered by the aggregate column. Impala's query optimizer supports a cost model: using table and partition cardinalities, the number of distinct values per column, and similar statistics, Impala can estimate a plan's cost and choose a better one. The left side of the figure shows the single-node plan produced by the frontend optimizer. Unlike in a traditional relational database, this single-node plan cannot be executed directly; it must be converted into the distributed plan shown on the right, which is divided into six segments (the colored borderless rectangles in the figure), each a plan subtree that a single server can execute independently.
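To make the cost model concrete, here is a toy illustration (a hypothetical helper, not Impala code) of the kind of statistic-driven estimate such an optimizer computes: join output cardinality from table row counts and per-column number-of-distinct-values (NDV), using the textbook formula |R ⋈ S| ≈ |R| · |S| / max(NDV(R.k), NDV(S.k)).

```python
def estimate_join_rows(rows_r, ndv_r, rows_s, ndv_s):
    # classic equi-join cardinality estimate from base-table statistics
    return rows_r * rows_s // max(ndv_r, ndv_s)

# with such estimates the planner can compare join orders and prefer the
# one that produces the smaller intermediate result (illustrative numbers)
big   = estimate_join_rows(1_000_000, 50_000, 200_000, 50_000)
small = estimate_join_rows(1_000_000, 50_000, 1_000, 1_000)
print(big, small)  # 4000000 20000
```

Comparing `big` and `small` is exactly the kind of decision that determines which table gets joined first and which distributed join mode (broadcast vs. redistribution) is cheaper.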
Impala supports two distributed join modes, broadcast and hash redistribution. In broadcast mode, one table's data stays in place while the other table is broadcast to all relevant nodes (t3 in the figure); in hash redistribution, the rows of both tables are redistributed according to the hash of the join key (t1 and t2 in the figure). Aggregate functions in a distributed plan are split into two phases: first, each node aggregates its local data (Pre-AGG) to reduce the data volume; second, the partial results are merged (MergeAGG) to compute the final answer. Similarly, topN executes in two stages: (1) each node takes a local topN to reduce the data volume; (2) a merge sort produces the final topN result.
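The redistribution and two-phase execution above can be sketched in a few lines of plain Python. This is an illustrative model under assumed names (`hash_redistribute`, `pre_agg`, `merge_topn`), not Impala's implementation: rows are routed by join-key hash, each node pre-aggregates locally, and a coordinator merges the partials and takes the final topN.

```python
import heapq
from collections import Counter

def hash_redistribute(rows, key, num_nodes):
    # hash redistribution: route each row to the node owning
    # hash(join key) % num_nodes, so matching rows of both tables meet
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

def pre_agg(values):
    # phase 1 (Pre-AGG): count per key locally to shrink the data
    return Counter(values)

def merge_topn(partials, n):
    # phase 2 (MergeAGG): sum the partial counts, then take the final topN
    total = Counter()
    for p in partials:
        total.update(p)
    return heapq.nlargest(n, total.items(), key=lambda kv: kv[1])

# integer keys keep the routing deterministic in this sketch
parts = hash_redistribute([{"id": 1}, {"id": 2}, {"id": 3}], "id", 2)
node1 = pre_agg(["a", "b", "a"])   # local rows on node 1
node2 = pre_agg(["b", "b", "c"])   # local rows on node 2
print(merge_topn([node1, node2], 2))  # [('b', 3), ('a', 2)]
```

The point of the two phases is bandwidth: the Pre-AGG step means only small partial counts, not raw rows, cross the network before the merge.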
The backend receives plan segments from the frontend and executes them; execution performance is critical, and Impala applies several query-performance optimizations:
• Vectorized execution: each getNext call processes a batch of records, and multiple operators can be pipelined.
• LLVM code generation: compiling queries to native code improves CPU-bound query efficiency by more than 5x.
• I/O localization: HDFS short-circuit local reads let impalad read local files directly.
• Parquet columnar storage: up to 5x the performance of other formats.
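The batch-at-a-time getNext pipeline mentioned above can be sketched as follows. This is a toy operator model in Python with an assumed interface (`get_next` returning a list, an empty list signalling end of stream), not Impala's C++ backend: each call moves a whole batch through the pipeline, amortizing per-call overhead, and no operator materializes its full output.

```python
BATCH_SIZE = 4  # illustrative batch size

class Scan:
    # leaf operator: yields the input rows a batch at a time
    def __init__(self, rows):
        self.rows, self.pos = rows, 0
    def get_next(self):
        batch = self.rows[self.pos:self.pos + BATCH_SIZE]
        self.pos += len(batch)
        return batch  # empty list signals end of stream

class Filter:
    # pipelined operator: filters each child batch as it arrives
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def get_next(self):
        while True:
            batch = self.child.get_next()
            if not batch:
                return []  # child exhausted
            out = [r for r in batch if self.pred(r)]
            if out:        # skip batches the predicate emptied entirely
                return out

def run(op):
    # drive the pipeline: pull batches until the stream ends
    result = []
    while (batch := op.get_next()):
        result.extend(batch)
    return result

plan = Filter(Scan(list(range(10))), lambda r: r % 2 == 0)
print(run(plan))  # [0, 2, 4, 6, 8]
```

Because `Filter` consumes its child's batches as they are produced, scan and filter overlap in time, which is what "multiple operators can be pipelined" means.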
Impala clusters usually also run MapReduce and other batch workloads, with resources managed centrally through YARN, and satisfying both interactive and offline queries is challenging. YARN schedules resources through a single global ResourceManager. The advantage is that the RM has global cluster information and can make better scheduling decisions; the disadvantage is the latency of resource allocation. Impala must allocate resources for every query, and when concurrency reaches thousands of queries, YARN's resource-allocation response time becomes very long and hurts query performance. Two measures currently address this: (1) a fast, decentralized admission-control mechanism that limits query concurrency; (2) Llama (low-latency application master), which reduces resource-allocation latency through resource caching, batched allocation, and incremental allocation.
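The decentralized admission-control idea can be sketched as a per-coordinator slot counter. This is a minimal single-threaded illustration with invented names (`AdmissionController`, `submit`, `complete`), not Impala's actual admission controller: a query is admitted immediately if a local slot is free and queued otherwise, so the hot path never waits on a central resource manager.

```python
from collections import deque

class AdmissionController:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent  # local concurrency limit
        self.running = 0
        self.queue = deque()                  # queries waiting for a slot

    def submit(self, query):
        if self.running < self.max_concurrent:
            self.running += 1
            return "admitted"   # fast path: purely local decision
        self.queue.append(query)
        return "queued"

    def complete(self):
        # a finishing query hands its slot to the next queued query, if any
        if self.queue:
            return ("admitted", self.queue.popleft())
        self.running -= 1
        return ("idle", None)

ac = AdmissionController(max_concurrent=2)
print([ac.submit(q) for q in ["q1", "q2", "q3"]])  # ['admitted', 'admitted', 'queued']
print(ac.complete())                               # ('admitted', 'q3')
```

The trade-off versus a central RM is the one the paragraph describes: admission takes microseconds, but each coordinator only sees its own slots rather than global cluster state.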
Related system comparison