Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 101 additions & 0 deletions docs/adr/003-btree-multi-level-growth.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# ADR 003: B+ Tree Multi-Level Growth

## Status
Accepted

## Date
2026-05-05

## Context

The cloudSQL storage engine needed a durable on-disk B+ tree index capable of multi-level growth. Early phases implemented slot array format (Phase 1) and find_leaf() traversal (Phase 2), but inserts into a full leaf would fail silently or corrupt tree structure.

The problem: a B+ tree must handle arbitrary depth growth through a cascade of splits — leaf splits propagate to parent internal nodes, which may themselves split, recursively up to a new root.

## Decision

Implement a five-phase approach to multi-level B+ tree growth:

### Phase 1: Slot Array Format
- **Entries grow backward** from PAGE_SIZE end
- **Slots grow forward** from after NodeHeader
- Slot array: `SlotEntry { uint16_t offset, uint16_t length }` — 4 bytes each
- Binary entry format enables O(1) slot access without deserializing all entries

### Phase 2: find_leaf() with Binary Search
- Traverse from root to leaf by binary-searching internal node slots
- `compare_separator()` compares key against separator at slot position
- Returns leaf page number directly; no iteration needed

### Phase 3: Leaf Split (split_leaf)
- Split at midpoint: upper half entries copied to new right leaf
- Right leaf's `next_leaf` pointer chain maintained for range scans
- `pending_separator_` stores the separator key for parent insertion
- Returns new right page number so caller can wire up parent link

### Phase 4: Parent Propagation (insert_into_parent / split_internal)
- **Separator promotion**: entry at split_point is **promoted** to parent, not copied to children
- Left node: slots [0, split_point), children [0, split_point+1)
- Right node: slots [split_point+1, num_keys), children [split_point+1, num_keys+1)
- Child at split_point+1 becomes leftmost child of right node after split
- `update_child_parent()` updates parent_page pointers on all affected children
- Split cascade: if parent is also full, recurse with promoted separator

### Phase 5: Root Split Handling
- Root split detected when `parent_page == 0` (root has no parent)
- `create_new_root()` allocates new root as internal node with 1 separator
- Both split children updated to point to new root
- `root_page_` updated to new root page number

### Entry Format
- **Leaf entry**: `type(1) + key_len(4) + key_data(N) + page_num(4) + slot_num(2)` = 11+N bytes
- **Internal entry**: `type(1) + key_len(4) + key_data(N) + child_page_num(4)` = 9+N bytes
- `NodeHeader`: 12 bytes — type + num_keys + parent_page + next_leaf

### Slot Access
- `get_slot(buffer, slot_idx, out)`: returns SlotEntry at slot_idx
- `put_slot(buffer, slot_idx, entry)`: writes SlotEntry at slot_idx
- `get_data_start_offset(num_keys)`: returns start of entry data area (grows backward)
- `compute_entry_size(key)`: computes serialized entry size for a key

## Consequences

### Positive
- Multi-level tree growth handled correctly through split cascade
- Root split case properly distinguished from non-root splits
- Range scans remain correct via next_leaf chain maintained on split
- Slot array format enables binary search without full entry deserialization

### Negative
- Split cascade may cause multiple page writes per insert in worst case
- Internal node entries do not store slot_num (unlike leaf entries which store page_num + slot_num for RIDs)
- No balancing/redistribution between siblings — always splits at midpoint

### Neutral
- Depth grows only when root (and only root) splits — tree depth increments slowly
- All children of split internal nodes get correct parent pointers via update_child_parent()

## Alternatives Considered

### Alternative 1: Always split at first available slot, redistribute later
**Why rejected:** Redistribution adds complexity and requires additional writes. Midpoint split is deterministic and provides good balance.

### Alternative 2: Store full entries in internal nodes (not just separators)
**Why rejected:** Internal nodes store separator keys only — actual data lives in leaf nodes. This keeps internal nodes lean and maximizes branching factor.

### Alternative 3: Top-down splitting (split during descent)
**Why rejected:** Top-down splitting requires holding locks on multiple pages during traversal. Bottom-up (split on insert) defers splits and only touches affected pages.

## Implementation Phases

| Phase | Feature | Status |
|-------|---------|--------|
| 1 | Slot array format | Done |
| 2 | find_leaf() traversal | Done |
| 3 | split_leaf() | Done |
| 4 | insert_into_parent() / split_internal() | Done |
| 5 | Root split handling | Done |

## Test Results
- 29/29 BTreeIndexTests pass
- 1 pre-existing failure: BTreeIndexNextLeafTests.ScanIterator_NextLeaf (page format mismatch — raw test predates slot array)
57 changes: 54 additions & 3 deletions include/storage/btree_index.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -34,9 +34,23 @@ class BTreeIndex {
NodeType type;
uint16_t num_keys;
uint32_t parent_page;
uint32_t next_leaf; // For leaf nodes
uint32_t next_leaf; // For leaf nodes: next leaf page. For internal: rightmost child.
};

/**
* @brief Slot entry — points to an entry in the data area of a page.
* Slot array grows forward from after NodeHeader.
* Entry data grows backward from end of page.
*/
struct SlotEntry {
uint16_t offset; // Byte offset from start of page to entry data
uint16_t length; // Entry size in bytes
};

static constexpr uint16_t kSlotSize = sizeof(SlotEntry); // 4 bytes per slot
static constexpr uint16_t kMaxSlots =
(Page::PAGE_SIZE - sizeof(NodeHeader)) / sizeof(SlotEntry); // ~1014 slots max

/**
* @brief Index entry (Key + TupleId)
*/
Expand Down Expand Up @@ -71,6 +85,7 @@ class BTreeIndex {
BufferPoolManager& bpm_;
common::ValueType key_type_;
uint32_t root_page_ = 0;
common::Value pending_separator_;

public:
BTreeIndex(std::string index_name, BufferPoolManager& bpm, common::ValueType key_type);
Expand All @@ -87,6 +102,7 @@ class BTreeIndex {

[[nodiscard]] const std::string& index_name() const { return index_name_; }
[[nodiscard]] common::ValueType key_type() const { return key_type_; }
[[nodiscard]] uint32_t root_page() const { return root_page_; }

bool create();
bool open();
Expand All @@ -103,12 +119,47 @@ class BTreeIndex {
private:
/* Internal B-tree logic */
[[nodiscard]] uint32_t find_leaf(const common::Value& key) const;
void split_leaf(uint32_t page_num, char* buffer);
// void split_internal(...) // TODO phase 2
[[nodiscard]] uint32_t split_leaf(uint32_t page_num, char* buffer);
bool split_internal(uint32_t page_num, char* buffer, uint16_t insert_pos,
uint32_t left_child, uint32_t right_child,
uint32_t& out_right_page);

bool read_page(uint32_t page_num, char* buffer) const;
bool write_page(uint32_t page_num, const char* buffer);
[[nodiscard]] uint32_t allocate_page();

/* Slot array helpers */
[[nodiscard]] uint16_t get_data_start_offset(uint16_t num_keys) const;
[[nodiscard]] uint16_t compute_entry_size(const common::Value& key) const;
[[nodiscard]] bool get_slot(const char* buffer, uint16_t slot_idx, SlotEntry& out) const;
bool put_slot(char* buffer, uint16_t slot_idx, const SlotEntry& entry);
bool append_entry_at(char* buffer, uint16_t slot_idx, const SlotEntry& entry,
const common::Value& key, HeapTable::TupleId tuple_id);

/* Entry serialization */
[[nodiscard]] bool serialize_entry(const common::Value& key, HeapTable::TupleId tuple_id,
char* out_buf, uint16_t buf_size,
uint16_t& bytes_written) const;
[[nodiscard]] bool deserialize_entry(const char* buf, uint16_t buf_size,
common::Value& out_key,
HeapTable::TupleId& out_tuple_id) const;

/* Key comparison */
[[nodiscard]] int compare_keys(const common::Value& a, const common::Value& b) const;

/* Internal node navigation */
[[nodiscard]] uint32_t find_child_for_key(const char* buffer, const common::Value& key, uint16_t num_keys) const;
[[nodiscard]] uint32_t get_child_page(const char* buffer, uint16_t slot_idx) const;
[[nodiscard]] int compare_separator(const char* buffer, uint16_t sep_idx, const common::Value& key) const;

/* Internal node insertion (Phase 4/5) */
[[nodiscard]] common::Value extract_key_from_entry(const char* entry_ptr, uint16_t entry_length) const;
[[nodiscard]] bool serialize_internal_entry(const common::Value& key, uint32_t child_page_num,
char* out_buf, uint16_t buf_size,
uint16_t& bytes_written) const;
bool insert_into_parent(const common::Value& sep_key, uint32_t left_page, uint32_t right_page);
bool create_new_root(const common::Value& sep_key, uint32_t left_child, uint32_t right_child);
bool update_child_parent(uint32_t child_page, uint32_t parent_page);
};

} // namespace cloudsql::storage
Expand Down
Loading