2 minute read

Joins

The goal of a good database design is to avoid unnecessary repetition of information. This is why tables are composed based on normalization theory. Joins are therefore needed to reconstruct the original tables without any information loss.

This post will cover inner equijoin algorithms for combining two-tables. An equijoin algorithm joins tables where keys are equal. These algorithms can be tweaked to support other joins. Multi-way joins exist primarily in research literature.

Operator Output

For a tuple $r \in R$ and a tuple $s \in S$ that match on join attributes, the join operator concatenates $r$ and $s$ together into a new output tuple.

In reality, contents of output tuples generated by a join operator varies. It depends on the DBMS’s query processing model, storage model, and the query itself. There are multiple approaches to the contents of the join operator output.

  • Data: This approach copies the values for the attributes in the outer and inner tables into tuples put into an intermediate result table just for that operator. The advantage of this approach is that future operators in the query plan never need to go back to the base tables to get more data. The disadvantage is that this requires more memory to materialize the entire tuple. This is called early materialization. The DBMS can also do additional computation and omit attributes which will not be needed later in the query to further optimize this approach.
  • Record Ids: In this approach, the DBMS only copies the join keys along with the record ids of the matching tuples. This approach is ideal for column stores because the DBMS does not copy data that is not needed for the query. This is called late materialization.

Cost Analysis

The cost metric used here to analyze the different join algorithms will be the number of disk I/Os used to compute the join. This includes I/Os incurred by reading data from disk as well as writing any intermediate data out to disk.

Note that only I/Os from computing the join are considered, while I/O incurred when outputting the result is not. This is because the output cost depends on the data, moreover, the output for any join algorithm will be the same and therefore the cost will not change among the different algorithms.

Variables used in this post:

  • $M$ pages in table $R$ (Outer Table), $m$ tuples total
  • $N$ pages in table $S$ (Inner Table), $n$ tuples total

In general, there will be many algorithms/optimizations which can reduce join costs in some cases, but no single algorithm which works well in every scenario.

Leave a comment