Distributed Query Optimization

Many companies work with vast quantities of data and find that the most cost effective way to store data is by using a large number of commodity machines in a distributed setting. Some companies often choose to use custom made file systems and distributed file structures, such as BigTable in the case of Google, instead of adopting a relational approach. We believe that the complexity of the distributed query optimization problem is the main factor that causes companies to shy away from using distributed relational database systems.

The query optimizer is widely considered to be the most important component of a database management system. It is responsible for taking a user query, searching through the entire space of equivalent execution plans and returning the plan with the lowest cost. Equivalent query execution plans can vary significantly in cost therefore it is important for the optimizer to find a plan that will not inflict much unnecessary work on the executor. When optimizing queries in a distributed database system the optimization process must consider the additional dimension of choosing which execution site to perform each execution operation. This added dimension causes the search space to increase vastly and introduces new opportunities for parallelism, which contribute to the increasing complexity of the optimization problem. However, with the addition of sites comes an addition of processing power meaning that all sites can be used during the optimization of a single user query.

This project investigates optimization techniques for large queries in a distributed context. The research directions taken so far are as follows.

We present a cost model that allows inter-operator parallelism opportunities to be identified within query execution plans. This involves converting a plan into a schedule, which details when and where different parts of the plan should be executed. As a result, the response time of a query plan can be estimated more accurately.
We merge two existing centralized optimization algorithms DPccp and IDP1 to create a practically more efficient algorithm IDP1ccp.
We propose the novel Multilevel (ML) optimization framework that combines heuristics with existing centralized optimization algorithms. ML algorithms recognise that centralized optimization algorithms have limitations correlated to the number of relations and number of execution sites involved in a query. Using this observation ML algorithms break the query problem into manageable optimization tasks, which can be distributed throughout the system and optimized individually and in parallel by all optimization sites in the system. The results can then be combined to obtain the overall execution plan for the given user query in a customizable time.

See Robert Taylor's thesis for details as well as an extensive experimental evaluation of our optimization framework.

System prototype

Our optimization algorithms are implemented in Java. The latest version is available here.

Thesis

Robert Taylor: Query Optimization for Distributed Database Systems [PDF]
MSc in CS, Oxford 2010.

Current team

Dan Olteanu
Robert Taylor