BTree data structure is generalization of self balanced tree where in, node can contain more than one child node. This generalization is useful in storing it on to the disk files. In this case node can be mapped to disk block and can easily be read and written to the disk.
Btree+ is further generalization of Btree where in indexes and records are stored in leaf nodes. Branch nodes only contain pointers to other leaf and branch nodes. This optimization is useful in further performance improvements in increasing the throughput since branch nodes don’t need to be locked except during rebalancing. Data is usually inserted or deleted in leaf nodes hence only leaf nodes need to be locked during CRUD operations.
Further in BTree+ leaf nodes can contain pointers to previous and next leaf node for index range traversals.
Node Types
Every node contains ordered list of items in ascending or descending order depending on how it is created.
Branch
Branch nodes contain next level ordered list of pointers to other nodes. Order list is based on max key of every node. For default block size of 2048 bytes, it stores pointers to 150 next level branch or leaf nodes. It also contains copy of maximum key value, (last key of the ultimate leaf node it is pointing to). This key is used to tree traversal to get to proper leaf node during query processing.
For example, say branch node B contains pointers to 2 leaf nodes X and Y. Max key of X node is say 100 and max key of Y is say 200 then node B will store these 2 nodes in order X and Y. Also, in this case max value of node B will be 200.
Root
Root node is special case of branch or leaf node. This is the entry point into the data structure. If all keys can be stored in one node, then it will be of leaf node. Else, usually root is a branch node.
Leaf
Leaf nodes contain ordered list of index or record key data. It also contains pointer to next and previous leaf nodes. Current implementation needs pointers to previous nodes but we are working on enhancement to remove that dependence.
Based on the size of the key, it contains as many keys that can fit in leaf block size. Default block size is 2048 but can be changed during the construction of the index by providing STORAGE keyword.
Max key value of leaf is usually last key in the node.
For Example, say leaf node can contain keys 100, 200 and 300 in the order. Also, its max key value in this case will be 300.
Operations
Insertion
During insertion it starts at the root node. And it traverses up to the leaf node (position in the ordered list of keys) where the item can be inserted.
Since every item in all nodes (branch and leaf) are ordered, it uses binary search to locate the item in the node where new item can be inserted. It follows this step recursively from root up to the leaf and inserts that item into the leaf.
Insertion balancing
If newly added item increases the size of leaf node more than the block size (default 2048 bytes), then it splits the node in to 2 leaf nodes and traverses upward to insert pointer of newly added leaf into the parent branch node. If after inserting this new pointer, if branch node needs to be split because of its, size then it recursively continues upward until root node to rebalance. Since rebalancing is changing whole structure of the BTree+, whole BTree+ needs to be locked during this operation.
Deletion
Similar to insertion, deletion also starts at the root node and traverses up to the leaf where item to be deleted is located. And it removes that item from the leaf node.
Delete rebalancing
If leaf node becomes empty after the deletion of the item then again, it needs to rebalancing to remove its pointer from the parent node and recursively up to the root node if required. During this time, whole BTree+ needs to be locked.
Update
Fortunately update is not required. Update can turn out to be delete and insert operation which will be two separate operations on the BTree+
Tree locking scenarios
Read
Following steps are performed
- Read lock on the tree.
- Traverse up to the leaf node
- Read lock leaf node.
- Unlock read lock on tree started on #1.
- Return iterator so that range (or PK lookup) query can walk through the range. Iterator properly unlocks and locks leaf nodes while traversing from node to node.
- At the end it is query processor (callers) responsibility to unlock the read lock on the leaf node.
Insert
Following steps are performed
- Read lock on the tree.
- Traverse up to the leaf node.
- Write lock leaf node.
- Try inserting an item. If it needs to split because of of size increase, then
- Release write lock on the leaf. (Step #3)
- unlock read lock (Step #1).
- acquire write lock on the tree, since rebalancing operation might be performed.
- Again Traverse up to the leaf node.
- Write lock leaf node. In theory this is not required since there is a lock on the tree itself. There are no other threads in this tree at this time.
- If rebalance is required. Rebalance the tree
- Unlock write lock on the tree
- Insert the item.
- Unlock write lock on leaf
- Unlock read lock on the tree.
Delete
- Read lock on the tree.
- Traverse to the leaf node.
- Write lock leaf node.
- Remove the item. If it needs to reblance
- Release write lock on the item
- Release read lock on the tree.
- Acquire write lock on the tree
- Traverse to the leaf
- Remove the item.
- Rebalance of required.
- Unlock tree
- Unlock leaf.
- Unlock tree.
Node levels in Wonderdb
Branch nodes store disk pointer to next node (about 10 bytes). So for default 2048 bytes, it stores about 200 items in branch node.
Leaf node on the other hand stores actual key value and the pointer to the actual record contents. So for key size of 100 bytes, it needs 110 bytes (100 for the key and 10 for the pointer to the record). So it stores about 100 keys in the leaf block.
Based on above assumptions,
2 level tree will store 100*200 = 20000 items, 200+1 = 201 blocks = 201*2048 ~ 400KB disk space
3 level tree will store 100*200*200 = 4000000 = 4M items, size of tree will be (200*200)*2048 ~ 80MB
4 level tree will store 100*200*200*200 = 1600000000 = 1.6B items, disk space = 200*200*200*2048 = 1.6 GB
From above calculations you can easily see why whole btree can be present in physical memory if we assume we have 50+ GB physical memory. Machines with 50+GB is considered commodity hardware nowadays.
This calculation is very important in choosing BTree+ vs Hash index if range query is not required. If configured properly, hash index can perform 2-3 times faster than BTree+ which will be huge improvement.