Revisiting Reuse in Main Memory Database Management System

Categories: Technology

Essay, Pages 7 (1557 words)

Views

Abstract

The primary objective of the two papers discussed in this report is to enhance the query processing efficiency of a database management system. This improvement is achieved through the utilization of complex joins, various clauses, and techniques such as data mining and OLAP. The results are computed using SQL, PostgreSQL, and hash tables. These methods offer significant advantages, including increased system efficiency and performance enhancement, as well as the removal of garbage and cache memory.

Introduction

Query processing in database management systems often involves reusing intermediate results to expedite query execution time, which can be further utilized by materialized operators.

Don't use plagiarized sources. Get your custom paper on

“ Revisiting Reuse in Main Memory Database Management System ”

Get high-quality paper

NEW! smart matching with writer

However, this approach is less suitable for modern databases that are highly optimized. Reusable techniques are advantageous as they leverage existing payloads without incurring additional costs. The key components discussed in these papers include hash tables, query optimization, and tuples. SQL and PostgreSQL baselines are essential for executing queries in a commercial database system. Nevertheless, it is observed that iceberg queries can be computationally expensive to implement, leading to their optimization through GROUP BY and HAVING clauses.

K-Skyband Query

One of the prominent query types discussed is the k-skyband query, which aims to retrieve objects that are not dominated by k other objects.

Sweet V

Verified writer

Proficient in: Technology

4.9 (984)

“ Ok, let me say I’m extremely satisfy with the result while it was a last minute thing. I really enjoy the effort put in. ”

+84 relevant experts are online

Hire writer

This query can involve complex inequality joins. For example, a k-skyband query may involve a table called "Object" with dimensions x and y, representing attributes such as price, rating, and availability. An example query is as follows:

SELECT L.id, COUNT(*) FROM Object L, Object R WHERE L.x<=R.x AND L.y<=R.y AND (L.x

HashStash Database System

The HashStash database system employs internal data structures, specifically hash tables, and hash joins and hash aggregation operators for query processing.

Get to Know The Price Estimate For Your Paper

Topic

Deadline: 10 days left

Number of pages

Email Invalid email

By clicking “Check Writers’ Offers”, you agree to our terms of service and privacy policy. We’ll occasionally send you promo and account related email

"You must agree to out terms of services and privacy policy"

Write my paper

You won’t be charged yet!

It supports two models: single query reuse and multiple query reuse.

Single Query Reuse

In the single query reuse model, a user submits a single query to the HashStash DBMS. Unlike traditional DBMS, it identifies reuse-aware plans and divides them into three components:

A cache table containing specific information.
A new operator that utilizes a new cost model to determine which hash table to use in each operator, minimizing query processing time.
A garbage collector that terminates unused cache entries.

Multiple Query Reuse

In the multiple query reuse model, multiple queries are reused simultaneously to explore different aspects of the datasets. A shared batch plan is employed to minimize query optimizer time. Payloads are independent of reuse capability, and when the reuse capability is low, it can negatively impact performance. Cache pages may have less storage than the base table, potentially causing slower processing.

Components in the HashStash Model

The HashStash model comprises several key components, including:

Reuse query optimizer: This component checks the optimization of queries and employs dynamic programming to reduce query size, enhancing optimization without incurring additional costs.
Hashtable manager: Responsible for managing cache information, nodes operating operators, and data flow. It also determines if any hash table values are unused and removes them, evicting them from the garbage collector.
Garbage collector: This component removes queries that are no longer being reused from the hash table. It initiates the release process when the memory of the hash table exceeds the peak value, working on page granularity.

The HashStash Support Five Different Cases for Reuse Operators

The HashStash database system supports five different cases for reuse operators, which are as follows:

Exact: This case includes all tuple information reused for query processing, allowing hash join and hash aggregation operators to be reused. However, there may be instances where the sub-plan is evicted.
Subsuming: Subsuming cases may produce false results due to an increase in tuple requirements. It employs filtering to mitigate the impact of false results and uses aggregates instead of false ones for query optimization.
Partial: In partial cases, some tuples are missing, and HashStash automatically adds tuples for query execution, utilizing different reuse cases for different results.
Overlapping Reuse: Overlapping and partial cases are similar, as they require tuple reuse to replace missing tuples.

The Cost Estimation

Cost estimation involves assessing the actual and component costs of the optimizer. It groups costs to enable cache reuse of values and minimize overall costs.

HashStash Table Component

Three orthogonal techniques are employed within the HashStash table component, namely:

Generalized A-Priori: Inspired by Apriori, this technique is applied to HAVING constraints. It does not restrict original iceberg queries from running on smaller inputs while reducing costs using complex joins. The HAVING clause is applied only when applicable.
Cache-Based Pruning: This technique utilizes properties of k-skyband query processing. It combines computation with previous computations, introducing an operator known as NLJP (Nested loop join with pruning) that employs nested loops and provides pruning predicates for each new outer input tuple.
Memoization: Memoization enables caching within the NLJP operator, avoiding unnecessary computations. It is implemented alongside database queries and constraints, allowing the direct retrieval of pruning predicates.

Reuse Aware Hash Joins

Reuse-aware hash joins are designed to build a hash table from one of their inputs and take the hash table for each tuple. These joins differ from traditional hash joins in two significant ways: they may add missing tuples during the building phase and filter false positive tuples during the probe phase.

The Cost Models That Can Resize

The cost estimation for reuse-aware hash joins includes the following components:

cRHJ = cresize(HT) + cbuild(HT) + cprobe(HT) where RHJ represents reuse-aware hash joins.
cinsert(HT) = |NewKeys| · (1 - contr(HT)) | #tuples to insert · ci(htSize, tWidth) where ci(htSize, tWidth) is the size cost and insert/delete cost for a single tuple value.

Execution and Optimization

Execution and optimization of iceberg queries in database management systems are discussed in this section. The NLJP (Nested Loop Join with Pruning) operator is employed for implementing memoization and pruning. The NLJP operator is specified by three types of queries: Binding query, Inner query, and Pruning query.

Benefit-Oriented Optimization

HashStash implements several benefit-oriented optimizations, favoring one plan over another based on the benefits it offers for future reuse. These optimizations include Additional Attributes, which allows post-filtering of false positives, and Aggregate Rewrite, which supports partial- and overlapping-reuse by creating slightly larger hash tables when required.

Reuse Aware Shared Plans

In this approach, multiple queries are compiled into a single shared plan rather than being compiled individually. To enable the reuse operator, hash plans are extended to share and reuse hash tables. Each operator in the shared plan launches logic for multiple queries, and each tuple is tagged with a query ID for identification.

Garbage Collector

The garbage collector plays a crucial role in removing queries that are no longer reused from the hash table. It initiates the release process when the memory of the hash table exceeds the peak value and operates on the granularity of pages. It uses the least recently used policy to remove entire hash tables rather than individual entities in the hash table.

Cost Models

The cost model for optimizer assumes the cost of runtime for reused queries. It includes three components: Resize cost, the cost to insert the first tuple, and the update cost for each tuple.

Efficiency of Single Query

The efficiency of single query reuse depends on the payload's size, with higher potential for non-reused strategies. Materialized strategies may introduce a penalty due to larger payloads and additional materialized costs. For materialized strategies, metrics such as the footprint of temporary tables and hit ratios per table can be used for evaluation. In contrast, HashStash tables are evaluated based on the footprint of each table and hit ratios per table.

Multi-Query Efficiency

In multi-query systems, various queries are executed with different modes:

First Mode: Queries are executed sequentially without using cache in hash tables.
Second Mode: This mode involves executing each query individually.
Third Mode: Multiple queries are bundled into a shared plan, allowing them to execute as a batch. This mode offers the highest speed and efficiency.

The efficiency of the third mode surpasses that of single-query and individual-query execution because it involves a shared scan that saves time. Efficiency is significantly higher in this mode.

Conclusion

In modern database systems, optimizing query performance is crucial for obtaining fast and accurate results. HashStash, a main-memory database management system, offers a promising solution to reduce unnecessary materialization costs. It significantly enhances performance and efficiency, making it a profitable choice for query optimization. The discussed papers explore various techniques, including complex joins and iceberg queries, which are executed using the NLJP operator. Different types of clauses and conditions are applied to address specific problems efficiently.

Recommendations

Future research in the field of database query processing could delve deeper into exploring additional factors that affect query performance, such as the impact of varying database sizes and complexities. Furthermore, investigating the applicability of HashStash and similar systems in real-world scenarios across different industries, including finance, healthcare, and e-commerce, could provide valuable insights into their practical use and scalability.

References

C. Binnig et al. SQLScript: Efficiently Analyzing Big Enterprise Data in SAP HANA. In BTW, 2013.
C. Binnig et al. Sqlscript: Efficiently analyzing big enterprise data in SAP HANA. In BTW, 2013.
B. Cuissart and J.-J. Hґebrard. A direct algorithm to find a largest common connected induced subgraph of two graphs. In GbRPR, pages 162"171, 2005.
G. C. Das and J. R. Haritsa. Robust heuristics for scalable optimization of complex sql queries. In ICDE, pages 1281"1283, 2007.