Towards a Learning-based Query Optimizer

Dr. Zoi Kaoudi

Query optimization is at the core of any data management/analytics system. It is the process of determining the best way to execute an input query or task (i.e., its execution plan). Query optimization is composed of three sub-processes: (i) the enumeration of the different execution plans, (ii) the cost estimation of each subplan, required to determine which plan is the best, and (iii) the cardinality estimation of subplans (i.e., how many tuples a subplan will output), which is crucial because it affects the cost of the plan. Recent research in the field of data management has begun to leverage the power of machine learning (ML) to solve these tasks more effectively and efficiently. In this blog post, we focus on using ML to estimate the cost of subplans.
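To make these three sub-processes concrete, here is a deliberately simplified toy in Python. It is not Blossom's code, and all numbers (table sizes, the 1% join selectivity) are made up for illustration:

```python
# Toy illustration of the three sub-processes of query optimization.
from itertools import permutations

TABLES = {"R": 1_000, "S": 10_000, "T": 100}        # assumed base-table sizes

def enumerate_plans(tables):                        # (i) enumeration: all join orders
    return permutations(tables)

def estimate_cardinality(prefix):                   # (iii) toy cardinality estimation
    card = 1.0
    for t in prefix:
        card *= TABLES[t] * 0.01                    # assumed 1% join selectivity
    return card

def estimate_cost(plan):                            # (ii) cost: total intermediate size
    return sum(estimate_cardinality(plan[:i]) for i in range(2, len(plan) + 1))

best = min(enumerate_plans(TABLES), key=estimate_cost)
print(best, estimate_cost(best))
```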

Traditional optimizers come with a cost model: mathematical formulas that encode the cost of each operator and aggregate these costs to estimate the cost of a query plan (a toy sketch of such a model follows the plot below). However, coming up with a cost model in a federated setting, such as the one Blossom is built for, is not only very challenging but may also lead to suboptimal performance. There are several reasons for this: (i) cost-based optimizers assume linear cost functions, which do not reflect real system behaviour; (ii) they require access to statistics stored on the various platforms, which may not be possible; and (iii) they need fine-tuning to really model the system behaviour, which can be very time-consuming, yet is very important. The plot below shows that a well-tuned cost model can yield up to an order of magnitude better performance.


Figure: Cost model comparison based on compute time

A well-tuned cost-based optimizer in a federated setting can lead to an order of magnitude better performance. Yet, tuning it is very tedious and time-consuming.
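To see what such a cost model looks like, here is a minimal sketch with one linear formula per operator. Every constant below is a made-up tuning knob, exactly the kind of parameter that needs the fine-tuning discussed above:

```python
# A hand-crafted cost model in miniature: linear per-operator formulas,
# aggregated over the plan. All constants are hypothetical tuning knobs.
COST_PER_TUPLE = {"scan": 1.0, "filter": 0.2, "join": 3.5}

def operator_cost(op, input_cardinality):
    # Linear assumption: cost grows proportionally with the input size.
    return COST_PER_TUPLE[op] * input_cardinality

def plan_cost(plan):
    # Aggregate (here: sum) the per-operator costs, as traditional optimizers do.
    return sum(operator_cost(op, card) for op, card in plan)

# A plan given as (operator, input cardinality) pairs:
print(plan_cost([("scan", 10_000), ("filter", 10_000), ("join", 2_000)]))
```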

Tuning a cost model can be very tedious and time-consuming. For this reason, we look at replacing the cost model's mathematical formulas with an ML model that predicts the runtime of a (sub)plan. Although this sounds trivial, it comes with two main challenges, which we analyze in the following.

First, the enumeration process constructs thousands or even millions of query plans, and we need to know how costly each of them is. To do so, we have to feed each one of them to the ML model. However, there is an “impedance mismatch”: query plans are normally objects or structures, while the input of an ML model is a numerical vector (features). Thus, we have to transform each enumerated plan into a feature vector. These plan transformations can easily be in the order of millions due to the exponential size of the enumeration search space, leading to large overheads in query optimization time. Note that query optimization happens at query runtime and must take only a small fraction (often milliseconds) of the total query runtime.
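As an illustration of this transformation, here is one simple featurization: encode each plan as per-operator counts plus log-scaled output cardinalities. The operator vocabulary and the encoding are assumptions made for this sketch, not the exact features our optimizer uses:

```python
import numpy as np

OPERATORS = ["scan", "filter", "join", "groupby"]   # assumed operator vocabulary

def plan_to_vector(plan):
    """Encode a plan as operator counts followed by log-scaled cardinalities."""
    counts = np.zeros(len(OPERATORS))
    cards = np.zeros(len(OPERATORS))
    for op, card in plan:
        i = OPERATORS.index(op)
        counts[i] += 1
        cards[i] += np.log1p(card)                  # log scale tames large counts
    return np.concatenate([counts, cards])

vec = plan_to_vector([("scan", 10_000), ("filter", 4_000), ("join", 800)])
print(vec)  # a fixed-size numerical vector that an ML model can consume
```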

The second challenge of using an ML model for estimating plan costs is the need for training data, i.e., query plans labeled with their runtimes, to be able to build such a model. We already discussed in our previous post how, by using DataFarm, we can efficiently generate training data. With DataFarm, we were able to get high-quality training data in only 4 hours, while the naive approach of executing all query plans to obtain their labels required 40 hours!
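Once labeled plans are available, building the model itself is standard supervised regression. Below is a hedged sketch on synthetic data; the feature dimension, the random-forest choice, and the data are assumptions for illustration, not our production setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 8))                            # 500 plan vectors, 8 features each
y = X @ rng.random(8) + rng.normal(0.0, 0.1, 500)   # synthetic "runtimes" as labels

model = RandomForestRegressor(n_estimators=100).fit(X, y)
print(model.predict(X[:3]))                         # predicted runtimes for 3 plans
```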

To solve the first challenge, we have proposed to completely redesign our enumeration algorithms to operate directly on vectors [1]. For this reason, we have devised a vector-based plan enumeration, thereby killing two birds with one stone: we avoid the continuous plan transformations, and we can leverage primitive operations and SIMD (Single Instruction, Multiple Data) to perform multiple vector operations with a single CPU instruction (vectorized execution), further speeding up the query optimization process. The figure below shows the improvement factor in query optimization time when using vectors in an ML-based optimizer; a toy illustration of the vectorized-scoring idea follows it.

Figure: Vector-based execution plan

A vector-based plan enumeration approach can lead to significant improvements in query optimization time for an ML-based optimizer.
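The core idea fits in a few lines: keep all enumerated plan vectors as rows of a single matrix, so that scoring and pruning become one bulk operation, which numerical libraries can map down to SIMD instructions, instead of millions of per-object calls. In this sketch, the linear scoring model is merely a stand-in for the actual ML model:

```python
import numpy as np

rng = np.random.default_rng(1)
plans = rng.random((1_000_000, 8))         # one row per enumerated plan vector
weights = rng.random(8)                    # stand-in for a learned linear model

costs = plans @ weights                    # a single matrix op scores every plan
keep = costs <= np.quantile(costs, 0.01)   # prune all but the cheapest 1%
print(plans[keep].shape, costs.min())
```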

Given a set of vector-based operations, we can define an efficient vector-based plan enumeration. The figure below shows the general architecture of an ML-based optimizer with a vector-based plan enumeration. The logical plan is given as input to the optimizer, which first transforms the plan into a vector. Then, vector-based enumeration and pruning take place, also consulting the ML model for pruning inefficient plans. The cheapest (i.e., most efficient) plan vector is output and then transformed into an execution plan that the system can actually execute.


Figure: How vector-based plans work


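Conceptually, the pipeline in the figure boils down to the following compressed sketch, in which every function is a hypothetical stand-in rather than Blossom's actual API:

```python
import numpy as np

def optimize(plan_vector, enumerate_candidates, predict_runtime):
    candidates = enumerate_candidates(plan_vector)   # vector-based enumeration
    costs = predict_runtime(candidates)              # ML model scores all candidates
    return candidates[np.argmin(costs)]              # cheapest plan vector wins

# Toy instantiation: 64 candidate vectors; the "model" just sums the features.
rng = np.random.default_rng(2)
best = optimize(
    np.ones(8),
    lambda v: v + rng.random((64, 8)),
    lambda c: c.sum(axis=1),
)
print(best)  # would be translated back into an executable plan
```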
A learning-based optimizer can bring significant gains; see our preliminary results below. For k-means, an ML-based optimizer chooses better plans and achieves 7x better runtime performance than a highly-tuned cost-based optimizer!


Figure: Benchmarking vectorized cost plans after learning



We are currently working on incorporating this architecture into Blossom’s optimizer. This will not only speed up the query optimization time but will also lead to better execution plans, meaning faster execution runtimes. Stay tuned!

References

[1] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, Sanjay Chawla: ML-based Cross-Platform Query Optimization. ICDE 2020: 1489-1500.
