Performance Evaluation of Distributed SQL Query Engines and Query Time Predictors
Abstract
With the decreasing cost of storage and computation in public clouds, even small and medium enterprises (SMEs) can process large amounts of data. As a result, businesses collect ever more data, at sizes that traditional database management systems struggle to handle. Distributed SQL Query Engines (DSQEs), which handle such data sizes with ease, are therefore increasingly used in a variety of domains. Users in small companies with little expertise, in particular, may face the challenge of selecting an appropriate engine for their specific applications. A second problem is the variable performance of DSQEs. While all state-of-the-art DSQEs claim very fast response times, none of them offers performance guarantees. This is a serious problem, because companies that use these systems as part of their business must provide such guarantees to their customers, as stated in their Service Level Agreements (SLAs). Although both industry and academia are attempting to come up with high-level benchmarks, the performance of DSQEs has never been explored or compared in depth.

We propose an empirical method for evaluating the performance of DSQEs with representative metrics, datasets, and system configurations. We implement a micro-benchmarking suite of three classes of SQL queries for both a synthetic and a real-world dataset, and we report response time, resource utilization, and scalability. We use our micro-benchmarking suite to analyze and compare three state-of-the-art engines, viz. Shark, Impala, and Hive. We gain valuable insights for each engine and present a comprehensive comparison of these DSQEs. We find that the query engines have widely varying performance: Hive is consistently outperformed by the other engines, but whether Impala or Shark performs best depends strongly on the query type.

In addition to the performance evaluation of DSQEs, we evaluate three query time predictors, two of which use machine learning, viz. multiple linear regression and support vector regression. These query time predictors can serve as input for scheduling policies in DSQEs, which can then reorder query execution based on the predictions (e.g., give precedence to queries predicted to complete sooner). We find that both machine learning based predictors achieve acceptable accuracy, whereas a naive baseline predictor is on average more than twice as inaccurate.
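To make the predictor setup concrete, the sketch below shows how the two machine-learning predictors (multiple linear regression and support vector regression) could be trained and compared against a naive baseline using scikit-learn. The feature set, synthetic data, and hyperparameters are illustrative assumptions, not the actual experimental setup of this work.

```python
# Minimal sketch, assuming hypothetical query features: input size (GB),
# number of joins, and number of selected columns. The data generator
# below stands in for observed (query, response time) pairs.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# Synthetic training data: 200 queries, 3 features, noisy linear target.
X = rng.uniform(low=[1, 0, 1], high=[100, 5, 20], size=(200, 3))
y = 0.5 * X[:, 0] + 10.0 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0, 5, 200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

predictors = {
    "naive (mean of past runs)": None,  # baseline: predict the mean
    "multiple linear regression": LinearRegression(),
    "support vector regression": SVR(kernel="rbf", C=100.0),
}

for name, model in predictors.items():
    if model is None:
        pred = np.full_like(y_test, y_train.mean())
    else:
        pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: MAE = {mean_absolute_error(y_test, pred):.1f} s")
```

Likewise, a scheduling policy that gives precedence to queries predicted to complete sooner can be sketched as a shortest-predicted-job-first ordering. The `Query` class and the predictor interface below are hypothetical, introduced only for illustration.

```python
# Minimal sketch of prediction-driven scheduling: pending queries are
# reordered so that those predicted to finish sooner run first.
from dataclasses import dataclass, field
from typing import Callable, List
import heapq

@dataclass(order=True)
class Query:
    predicted_time: float           # heap ordering key
    sql: str = field(compare=False)

def schedule(queries: List[str],
             predict: Callable[[str], float]) -> List[Query]:
    """Order pending queries by ascending predicted response time."""
    heap = [Query(predict(q), q) for q in queries]
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

# Example with a stub predictor that scores queries by text length.
if __name__ == "__main__":
    plan = schedule(
        ["SELECT count(*) FROM sales JOIN customers ON id = cid",
         "SELECT name FROM customers"],
        predict=lambda sql: float(len(sql)),
    )
    for q in plan:
        print(f"{q.predicted_time:6.1f}s  {q.sql}")
```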
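In practice the `predict` callable in the scheduler would be backed by one of the trained regression models above rather than the stub shown here; the two sketches are kept separate only for readability.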