It’s all just (idealized) bits.
Intro Caveats
O vs Θ (vs Ω vs …)
For now, I write O everywhere.01
Data Structures vs Abstract Data Types
Writing about stacks and queues, I knew there were multiple ways to implement them. It turns out there’s a formal-ish dichotomy between abstract data types (ADTs) and the data structures that implement them. For example, Map is abstract (an ADT), and Hash Map is concrete (a data structure).
This is generally clear and not that interesting, so I omit most discussion of it. The one real gotcha I had was in implementing a Set, which I instinctively did with a hash table, but it turns out you can use other data structures to optimize for different operations.02
Simplifying Assumptions
We generally consider two related aspects when we store bits somewhere: location and speed.
Location. There’s stack/heap, and the physical memory hierarchy.
Speed. The memory hierarchy has varying sizes and access speeds, from registers to cache to RAM to disk. Plus caching effects.
Much more on both in the hardware-software interface.
In data structures, we typically ignore these:
- We assume heap
- We don’t worry about the memory hierarchy
- Assume data structures fit in memory
- Ignore different speeds for cache/RAM
- Ignore caching effects (i.e., repeated or nearby reads)
Instead, unless otherwise noted, we just assume there’s instant O(1) access to anything we have an address for. We’ll sometimes note anticipated effects where relevant, like pointer jumping being likely bad for caches.
Static Array
Let’s consider a C array. Just a contiguous range of memory.
For an array to be useful, we must know or store how much memory was allocated—i.e., the array’s capacity—else we’ll overrun.
// hardcoded length
int a[10]; // stack
int *b = malloc(10 * sizeof(int)); // heap
// variable-len arrays (C99+)
int capacity = 10;
int a[capacity]; // stack
int *b = malloc(capacity * sizeof(int)); // heap
We might want to use the raw array to build a higher-level data structure. E.g., use it to make a stack or dynamic array. To move towards those cases, we’ll also track the length, i.e., number of explicitly assigned elements.
Click to show a stack-only fixed-size array in C
To my surprise, a fixed-size array can live on the stack and be passed around to sub-functions. You just have to remember never to use it after the function that defined it returns. When you pass it, the functions you call simply get a pointer to where it lives on your stack frame.
I added some basic push() and pop() functions to see what mutating it and passing around metadata feels like. This gives it lite stack (data structure) semantics.
#include <stdio.h>
/**
* Returns new len.
* prereq: 0 <= len <= capacity
*/
int push(int *a, int capacity, int len, int value) {
if (len == capacity) {
printf("Warning: Can't push %d: array is at capacity (%d).\n", value, capacity);
return len;
}
a[len] = value;
return len + 1;
}
/**
* Returns new len.
* prereq: 0 <= len <= capacity
*/
int pop(int *a, int len) {
if (len == 0) {
printf("Warning, can't pop: array is empty.\n");
return len;
}
// just leave the value there! :-)
return len - 1;
}
/**
* Prints `a`'s values in [0, len).
* prereq: 0 <= len <= capacity
*/
void display(int *a, int capacity, int len) {
printf("int* array at %p, capacity = %d, len = %d\n", (void*)a, capacity, len);
for (int i = 0; i < len; i++) {
printf("a[%d] = %d\n", i, a[i]);
}
printf("\n");
}
/**
* Fill all of `a` with 0.
*/
void zero(int *a, int capacity) {
for (int i = 0; i < capacity; i++) {
a[i] = 0;
}
}
int main() {
// dynamic stack array
int capacity = 10;
int len = 0;
int a[capacity]; // can't init (= {0}) here as dynamic
// zero(a, capacity); // instead. (optional.)
display(a, capacity, len); // nothing
for (int i = 0; i < capacity + 1; i++) {
len = push(a, capacity, len, 100 + i);
}
display(a, capacity, len); // 100 -- 109
for (int j = 0; j < 3; j++) {
len = pop(a, len);
}
display(a, capacity, len); // 100 - 106
return 0;
}
Doing it as a raw array on the stack is fun because you immediately see the pain points:
- Not being able to expand the array. (Allocating heap memory so it’s expandable makes sense.)
- Having to pass all metadata as function args is cumbersome. (Structs make sense.)
Raw (static, fixed-length) arrays are great as densely-packed, specialized data storage. But they’re just a block of contiguous memory, not really a general-purpose data structure.
Dynamic Array
Let’s analyze a basic heap-allocated array that grows by doubling.
Normal operations
Operation | Time | Notes |
---|---|---|
Get by index | O(1) | |
Get by value | O(n) | search |
Insert at end | O(1) amortized | if grows, copy all, but amortized03 |
Insert at index | O(n) | need to shift everything after |
Delete at end | O(1) | |
Delete at index | O(n) | need to shift everything after |
Update any index | O(1) |
Special operations
Operation | Time | Notes |
---|---|---|
Merge | O(m) | to copy source (m) into a new bigger array 04 |
Split, dangerous hack | O(1) | just point to index k and partition capacities.05 |
Split | O(m) | copy over m elements (say, [k, n-1]) into new array |
All traversals (e.g., computing min/max) are trivially O(n) to visit all elements.
Space complexity
Type | Complexity | Notes |
---|---|---|
Overhead | O(1) | pointer, capacity, length |
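To make the doubling / amortized O(1) append concrete, here’s a minimal sketch of my own (Python lists already do exactly this internally, so it’s purely illustrative): a fixed-capacity backing list that doubles when full.

```python
# Illustrative dynamic array: fixed-capacity backing list, doubling on overflow.
class DynamicArray:
    def __init__(self):
        self.capacity = 1
        self.length = 0
        self.backing = [None] * self.capacity  # fixed-size backing store

    def append(self, value):
        """O(1) amortized: the occasional O(n) copy is paid for by doubling."""
        if self.length == self.capacity:
            self.capacity *= 2  # geometric growth
            new_backing = [None] * self.capacity
            new_backing[: self.length] = self.backing  # O(n) copy
            self.backing = new_backing
        self.backing[self.length] = value
        self.length += 1

    def get(self, i):
        """O(1) get by index."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.backing[i]

a = DynamicArray()
for i in range(10):
    a.append(i * i)
print(a.length, a.capacity, a.get(9))  # 10 16 81
```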
Hash Map
The core elements are:
- hash function (hashable object → big hash)
- compression function (big hash → index in table size)
- collision strategy
- separate chaining: array, linked list
- open addressing: linear & quadratic probing, fancier variants
- track load factor, dynamically resize when it gets high
Keys must be stable, because if the hash changes, how you gonna find it?
Normal operations
Operation | Time | Notes |
---|---|---|
Get by key | O(1) | all O(1) operations can degrade to O(n)06 |
Get by value | O(n) | defeats the purpose. add a reverse map. |
Insert | O(1) amortized | amortized growth if geometric resize |
Delete | O(1) | |
Update | O(1) |
Special operations
Operation | Time | Notes |
---|---|---|
Insertion order iteration | O(n) | requires sidekick data structure (array, linked list) |
Merge | O(m) | copy source (m) in |
All traversals (e.g., computing min/max) are trivially O(n) to visit all items.
Space complexity
Type | Complexity | Notes |
---|---|---|
Overhead | O(1) | naive |
Total | O(n/a) | if using a as target load factor (e.g., 0.75) |
Overhead | O(n) | if caching hashes and a sidekick data structure |
Click for my verbose notes thinking through hash tables from first principles
First, we’ll assume the existence of a good hash function and compression function:
- a hash function h maps any item I we’d like to store into, let’s say, an int H within a fixed range of nonnegative integers [0, M] in Z. (E.g., 32-bit or 64-bit integers.)
- a good hash function makes it unlikely that two distinct items I and I’ produce the same hash H while there are < M items; i.e., hashes for input objects are distributed uniformly across [0, M]
- a compression function (AKA “index mapping”) maps [0, M] down to an index into our actual backing key array of length K.
- Hmm, trivially this could be done with a simple mod, giving our array index H’ = H % K
- However, there exist good hash functions that become bad in practice because they collide after this mod
- Studied strategies (from looking up; would need to read more)
- division: H mod K, but choose K so it’s unlikely to expose patterns in hash values (e.g., prime K away from a power of 2). However, this seems bad because your backing array now won’t be near a power of two, which is good for memory alignment?
- multiplication: choose some constant e in (0, 1), compute H * e, take the fractional part f, then use floor(f * K)
- MAD (multiply-add-divide): (aH + b) mod p mod K (with p a prime > K); division and MAD are sketched just below this list
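A quick sketch of what the division and MAD compression functions above might look like; the specific constants (a Mersenne prime for p, random a and b) are my own illustrative choices, not from any particular implementation.

```python
import random

def division_compress(H: int, K: int) -> int:
    """Division method: H mod K (ideally with K prime, away from powers of 2)."""
    return H % K

def make_mad_compress(K: int, p: int = (1 << 61) - 1):
    """MAD method: (a*H + b) mod p mod K, with p a prime > K and random a, b."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda H: ((a * H + b) % p) % K

compress = make_mad_compress(K=128)
print(division_compress(982_451_653, K=128), compress(982_451_653))  # both in [0, 128)
```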
I’m thinking this is backed by an array because an array gives us O(1) access once we’ve gotten the compressed hash.
I understand why hashes need to be stable. If an item’s hash changes, you wouldn’t be able to locate it in the map anymore.
Now, we can take any item we’d like to store and produce an index in [0, K) of a backing array. Each entry in the array could store:
- the item itself (open addressing; need probing)
- a primitive if fixed length (e.g., <= 64 bits)
- the full item (will need space + copy)
- primitive or pointer (e.g., still always <= 64 bits, but potential optimization for primitive items)
- always pointers (simplest, natural for many langs)
- a (pointer to a) data structure (separate chaining; for collisions)
- dynamic array
- linked list
- another hash map (len K’, hopefully to avoid collisions with K)
- The sub-hash-map strategy seems sus. Feels like you should just grow the original hash map once there’s enough pressure. Though I guess could help if you have a pathologically bad initial H->K, then a great H->K’?
- between dynamic array and linked list, I like linked list. You’ll have O(n) retrieval and O(1) insertions at the end regardless, so might as well have O(1) deletion rather than O(n)
So far I’ve been assuming we want to store and retrieve an item I. This seems suitable to implement a set: O(1) lookup and retrieval. I think we also need an equality check for items, because after a collision we need to know whether the candidate and existing entry match.
However, we may want to hash user-provided key-value pairs (K, V), to implement a mapping. In this case, I think we simply hash only the key, and store the key alongside the value in the data structure so that we can still check key equality after a collision.
Growing. If we track the number of elements in the data structure, we can decide to grow the data structure at some fullness threshold. (“Pressure?”) Naively, we’d
- choose a new K’ > K
- iterate over all items, and find their new index in K’
- if we also stored each item’s hash (which should be OK b/c stable), we wouldn’t have to re-hash. This feels like a minor optimization but could be good if hashing is expensive, like checksumming a lot of data
- regardless, I think we’d need to re-run the compression function
- we could skip this if we have an equation to transform a compressed value from an input of K to K’
- however, this is probably as fast as running the compression function again, because they seem simple
- then, we’d need to re-insert freshly into the new map
- if we have strong guarantees about the ordering of items under some K -> K’ assumptions (i.e., haven’t undergone a mod), AND we don’t have any collisions (no open addressing or chaining happened), we could potentially directly copy chunks of items
- … but this seems like an over-optimization that would be useful rarely. zero collisions before growing seems unlikely. and it seems like the chained data structures might need some new bookkeeping.
- in fact, the whole reason we’re growing the data structure in the first place implies we likely have collisions, which means naively copying is probably a bad idea
- this is another O(n) operation that I think amortizes well, because we tie resizes to inserts and use a multiplicative growth strategy.
Iteration.
- If we have only a backing array of len K, and we have n stored items with n << K, then O(K) dominates the iteration time. (We iterate over all buckets.) Once n >= K, then O(n) dominates. (I suppose O(K + n) captures both regimes.)
- We lack the ability to provide, e.g., iteration in insertion order
- To provide insertion order and strict O(n) iteration, we could keep another data structure. (Gets (value access) always O(1) from hash map.)
- We could use a dynamic array that we push new insertions to at the end.
- O(n) deletion
- shift and copy, same as array
- this is bad, because it ruins our O(1) hash map deletion
- O(1) get by index (what’s the 4th item?)
- interestingly: O(n) by value’s index (when was this value inserted?)
- Unless we kept the value’s index in the hash table as well, but this would require O(n) updates on every deletion (which is already true…), and if we did this, we might as well skip the array entirely
- We could use a linked list
- O(n) deletion because it’s O(n) retrieval (where is node on linked list?)
- The slow list-position retrieval could be circumvented by also storing the position in the node, but that would again require O(n) updates on deletion because of bookkeeping
- O(n) get by index (could be non-goal)
- What if we used a doubly linked list within the hash map entries?
- Keep pointers to first and last items
- O(1) deletion: find and patch up neighbors’ links
- still O(1) insertion: added to end
- O(n) by index (4th item?) or value index (which item?)
- We could use a dynamic array that we push new insertions to at the end.
Python’s dict uses tombstones in the backing dynamic array (which also holds the data!).
- This is interesting because it, and the backing doubly-linked list, try to match the time complexity of a hash map O(1) get / insert / delete, and support easy forward / backward iteration, but lack O(1) lookup by insertion index
Handling open addressing. Open addressing, I believe, means that upon collision you walk forward until you find an opening. I’ve intuitively never liked this because it makes the top-level data structure messier.
- Naively, retrievals become “hit then walk”
- But the issue is deletions can leave holes
- To overcome the holes, we need some bookkeeping strategy. We could track the maximum we’ve ever walked, or a per-bucket max walk. Then always walk that far. Per-bucket seems better because you can actually decrement it later upon deletions (though maybe over-optimization to do so).
- Actually, it seems like leaving holes isn’t commonly done. The following strategies assume no holes.
- Tombstones (lazy deletion): keep deleted entries marked as “occupied” so searching forward keeps looking until it hits a true hole. This is cool because insertions can still reuse tombstone slots.
- Shifting (more typical?): when you’d delete, shift things back to avoid any holes.
- This seems… really complicated? Couldn’t you have to shift n entries, hopping over entries and clusters from other hashes? I think yes. But I guess you do have the nice invariant of “no holes within a cluster,” so as soon as you reach a hole you’re done.
Probing strategies
- linear probing (walk 1, clustering, great cache perf)
- double hashing (walk a deterministic step size computed from a second hash of the key; no clustering, bad cache perf)
- quadratic probing (walk increments 1, 2, 3, … so offsets grow quadratically; moderate clustering & caching)
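To tie these notes together, here’s a minimal separate-chaining sketch of my own (hash, compress by mod, per-bucket lists of (key, value), resize past a load factor); it’s illustrative, not how any particular language implements it.

```python
class ChainedHashMap:
    """Minimal separate-chaining hash map: each bucket is a list of (key, value)."""

    def __init__(self, capacity: int = 8, max_load: float = 0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)  # hash, then compress by mod

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:  # equality check needed because of collisions
                return v
        raise KeyError(key)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update in place
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._resize(2 * len(self.buckets))  # geometric growth -> amortized O(1)

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                self.size -= 1
                return
        raise KeyError(key)

    def _resize(self, new_capacity: int):
        old = self.buckets
        self.buckets = [[] for _ in range(new_capacity)]
        for bucket in old:
            for k, v in bucket:  # re-compress every key into the new table
                self.buckets[self._index(k)].append((k, v))

m = ChainedHashMap()
for i in range(20):
    m.put(f"key{i}", i)
m.delete("key3")
print(m.get("key7"), m.size)  # 7 19
```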
Set
To me, a set is a container optimized for O(1) get, insert, and delete by value. It does not provide an ordering or any methods by index.
Built with a hash table, we might call this a hash set. That implementation is very similar to a hash map: just use k = v.
Note that using a hash table to implement a Set optimizes for get/insert/delete. Specialized data structures seem to exist that optimize for other operations like union, which is O(m+n) with a hash table. I do not plan on researching these.
Normal operations
Operation | Time | Notes |
---|---|---|
Get by value | O(1) | hash the values; as with hash map, all O(1) can degrade to O(n) |
Insert | O(1) amortized | amortized growth if geometric resize |
Delete | O(1) |
Note that there is no “update” in a set, because there’s no meaningful metadata that indexes a value (like a numerical index (a[i]) or key (h[k])) other than the value itself.
Special operations
Operation | Time | Notes |
---|---|---|
Insertion order iteration | O(n) | requires sidekick data structure (array, linked list) |
Intersection (m, n) | O(min(m,n)) | |
Union (m, n) | O(m + n) | |
Difference (m, n) | O(m) | i.e., which of the m elements are not among the n |
Subset, superset (m, n) | O(min(m,n)) |
Space complexity
Type | Complexity | Notes |
---|---|---|
Overhead | O(1) | naive |
Total | O(n/a) | if using a as target load factor (e.g., 0.75) |
Overhead | O(n) | if caching hashes and a sidekick data structure |
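A quick sketch of how the special set operations hit those bounds on top of a hash-backed set; I’m using Python’s built-in set as the stand-in container here.

```python
def intersection(a: set, b: set) -> set:
    """O(min(m, n)): iterate the smaller set, probe the larger in O(1) each."""
    small, big = (a, b) if len(a) <= len(b) else (b, a)
    return {x for x in small if x in big}

def union(a: set, b: set) -> set:
    """O(m + n): every element of both sets gets inserted once."""
    return {*a, *b}

def difference(a: set, b: set) -> set:
    """O(m): check each of a's m elements against b in O(1)."""
    return {x for x in a if x not in b}

def is_subset(a: set, b: set) -> bool:
    """O(min(m, n)): if |a| > |b| it can't be a subset; else check each of a's elements."""
    return len(a) <= len(b) and all(x in b for x in a)

print(intersection({1, 2, 3}, {2, 3, 4}))  # {2, 3}
print(union({1, 2}, {2, 3}), difference({1, 2, 3}, {3}))  # {1, 2, 3} {1, 2}
print(is_subset({1, 2}, {1, 2, 3}))  # True
```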
Linked List
Hope you like dereferencing pointers.
A linked list consists of
- nodes, each with (pointers to)
- value
- next node
- previous node (if double)
- start
- end (if double)
I’m going to assume a doubly linked list for the analysis because it enables so many operations.
Normal operations
Operation | Time | Notes |
---|---|---|
Anything by index | O(n) | must traverse first. (e.g., get, insert, update, delete) |
Anything by value | O(n) | search |
Anything at start, end, or a known node | O(1) | get, insert, update, delete |
Special operations
Operation | Time | Notes |
---|---|---|
Concatenation | O(1) | |
Merge two sorted lists (m, n) | O(m+n) | |
Split by node | O(1) | |
Split by index k | O(k) | Must traverse to k, but no copying |
Cycle detection | O(n) | Floyd’s algorithm |
All other traversals (e.g., reverse, computing min/max, finding middle element w/ two pointers) are trivially O(n) to visit all elements.
Space complexity
Type | Complexity | Notes |
---|---|---|
Overhead | O(n) | 2 pointers per node |
General characteristics.
Linked lists excel at inserting/deleting by start/end/known nodes, and are worse at:
- cache performance (b/c dereferencing lots of pointers)
- memory fragmentation (b/c values aren’t stored densely together).
- random access (must traverse)
Applications.
Linked lists are a natural fit for stacks and queues, which insert and delete from the start/end of the list. Example applications:
- undo/redo (stack)
- playlist (queue) (esp w/ reordering)
- browser history
For node-based operations, having direct external references to nodes is key. (E.g., a sidekick hash map to nodes.) Then, something like an LRU cache or task scheduler can insert or move a node at the front of the linked list in O(1).
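Here’s a minimal doubly linked list sketch of my own, showing the O(1) node-based operations that make things like an LRU cache work once you hold a direct reference to a node.

```python
class Node:
    __slots__ = ("value", "prev", "next")

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

class DoublyLinkedList:
    """Uses a circular sentinel node so start/end inserts and deletes have no edge cases."""

    def __init__(self):
        self.sentinel = Node(None)
        self.sentinel.prev = self.sentinel.next = self.sentinel

    def insert_front(self, value) -> Node:
        """O(1); returns the node so callers (e.g., an LRU's hash map) can keep a reference."""
        node = Node(value)
        node.prev, node.next = self.sentinel, self.sentinel.next
        self.sentinel.next.prev = node
        self.sentinel.next = node
        return node

    def delete(self, node: Node) -> None:
        """O(1) given the node: just patch up its neighbors' links."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def to_list(self) -> list:
        """O(n) traversal from front to back."""
        out, cur = [], self.sentinel.next
        while cur is not self.sentinel:
            out.append(cur.value)
            cur = cur.next
        return out

dll = DoublyLinkedList()
nodes = [dll.insert_front(v) for v in "abc"]  # list is now c, b, a
dll.delete(nodes[1])  # delete "b" in O(1) via its node reference
print(dll.to_list())  # ['c', 'a']
```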
Stack
Stacks are LIFO — last-in, first-out.
A stack is an abstract data type that can be implemented with an underlying concrete data structure. For example, a linked list would work well, but an array would work fine too. We want the common operations to be O(1):
- push
- pop
other common operations, trivially also O(1):
- peek
- empty?
- size
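For completeness, a minimal stack sketch on top of a Python list (itself a dynamic array, so pushing/popping at the end is O(1) amortized):

```python
class Stack:
    def __init__(self):
        self._items = []  # dynamic array; the end is the top of the stack

    def push(self, value):    # O(1) amortized
        self._items.append(value)

    def pop(self):            # O(1)
        return self._items.pop()

    def peek(self):           # O(1)
        return self._items[-1]

    def empty(self) -> bool:  # O(1)
        return not self._items

    def size(self) -> int:    # O(1)
        return len(self._items)

s = Stack()
s.push(1); s.push(2); s.push(3)
print(s.pop(), s.peek(), s.size())  # 3 2 2
```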
Queue
Queues are FIFO — first-in, first-out.
Like stacks, queues are abstract, and naturally implemented with a linked list. Unlike stacks, a naive array implementation of a queue forces either enqueue or dequeue to be O(n) instead of O(1). A circular (ring) buffer—with head and tail pointers—can make an array O(1) queue-compatible.
Main operations to be O(1):
- enqueue
- dequeue
other common operations, trivially also O(1):
- peek
- empty?
- size
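Here’s a sketch of the circular (ring) buffer idea from above: a fixed-capacity array with a head index that wraps around, so both enqueue and dequeue are O(1). Growing when full is left out to keep it short; a real one would resize and re-linearize the contents.

```python
class RingBufferQueue:
    def __init__(self, capacity: int):
        self.buf = [None] * capacity
        self.head = 0   # index of the next item to dequeue
        self.count = 0  # number of stored items

    def enqueue(self, value):
        """O(1): write at (head + count) mod capacity."""
        if self.count == len(self.buf):
            raise OverflowError("queue full (a real one would resize here)")
        self.buf[(self.head + self.count) % len(self.buf)] = value
        self.count += 1

    def dequeue(self):
        """O(1): read at head, then advance head with wraparound."""
        if self.count == 0:
            raise IndexError("queue empty")
        value = self.buf[self.head]
        self.buf[self.head] = None  # optional; avoids keeping a stale reference
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return value

q = RingBufferQueue(4)
for v in [1, 2, 3]:
    q.enqueue(v)
print(q.dequeue(), q.dequeue())  # 1 2
q.enqueue(4); q.enqueue(5); q.enqueue(6)  # wraps around the backing array
print([q.dequeue() for _ in range(q.count)])  # [3, 4, 5, 6]
```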
Applications. Conceptually, in addition to the queue applications mentioned above under linked lists (playlist, LRU cache, task scheduling), queues make sense for ordered streams like:
- buffering (inputs, network packets)
- producer / consumer pipelines
Deque
AKA double-ended queues. Delightfully pronounced “deck,” which seems fitting because a deck of cards is a natural analogy.
Deques support pushing and popping from either end. Wikipedia says they’re a generalization of a queue; to me, they feel like the powers of a stack and a queue combined.
front back
↓ ↓
| | | | | | |
push() ---> | | | | | | | ---> eject()
| | | | | | |
pop() <--- | | | | | | | <--- enqueue()
dequeue()
Deque nomenclature that makes sense to me.
Reading from and writing to both ends with the above operations should be implemented in O(1) time. This makes a doubly linked list a natural fit. A circular buffer would work too.
Python implements it in collections.deque. Wikipedia says Python implements the deque using a doubly linked list of fixed-length subarrays. This is a cool choice, because it seems like it’d avoid copying when growing, and give O(1) access while near the ends (i.e., within a single fixed-length subarray).07 Python’s deque uses the following operation names:
front back
↓ ↓
| | | | | | |
popleft() ---> | | | | | | | ---> pop()
| | | | | | |
appendleft() <--- | | | | | | | <--- append()
↑ ↑
[0] [-1]
Python’s deque operations.
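A quick usage example of collections.deque with those operation names; all four end operations are O(1), and the optional maxlen argument is handy for fixed-size buffers.

```python
from collections import deque

d = deque([2, 3, 4])
d.append(5)         # back:  deque([2, 3, 4, 5])
d.appendleft(1)     # front: deque([1, 2, 3, 4, 5])
print(d.pop())      # 5 (from the back)
print(d.popleft())  # 1 (from the front)
print(d)            # deque([2, 3, 4])

# maxlen makes it a bounded buffer that silently drops from the opposite end
recent = deque(maxlen=3)
for i in range(5):
    recent.append(i)
print(recent)  # deque([2, 3, 4], maxlen=3)
```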
Heap
A heap implements a priority queue.
We’ll focus on the binary heap. At least a dozen heap variants exist, basically to turn more O(log n) operations into O(1) ones.08
Binary heaps are cool because they can be implemented implicitly on top of an array with no extra space required. This is because they are complete trees.
Binary heaps are binary trees with two constraints:
- shape property: the tree is complete (filled out, level by level, left to right)
- heap property: each node is higher in the ordering than its children
Heaps can be min-heaps or max-heaps; the trivial differences lead to a proliferation of language like “bigger (smaller) than the max (min) of its…” etc. I will avoid this throughout and treat everything as a max-heap.09
The key idea of a binary heap is to keep the max node as the root, so it’s always available in O(1). Then, patching up the tree after an operation on it takes O(log n). Iterative operations like build / meld / find take O(n). There’s no ordering between siblings or cousins; the only meaningful relationship is parent/child. This is how we easily get O(log n) operations.
Click to show a binary max-heap implementation in Python
Because a binary heap implementation isn’t completely trivial, it’s worth sketching out.
The core logic of the operations is quite compact. Unfortunately, basic comments and invalid-arg checking easily double the size.
"""
Binary (max, WLOG) heap, implements a priority queue.
throughout:
v = value
i = index
Storing `int`s as values directly for simplicity.
If this were tracking an order on actual objects, I think there'd be three main
approaches:
1. store objects themselves in the backing array. access `.priority` to compare,
swap whole objects
2. store `{priority, pointer}` in the backing array. access `.priority` to compare,
swap just mini structs.
3. store indices into an array `a` of objects in the backing array. `a[i].priority`
to compare, swap just ints.
(I think 3. is probably the most elegant.)
"""
import random
# index operations for backing the binary heap with an array.
# existence of indices not guaranteed!
def parent(i: int) -> int:
return (i - 1) // 2 # floor
def children(i: int) -> tuple[int, int]:
return (2 * i + 1, 2 * i + 2)
# normally not a big class guy, but it beats passing the data array everywhere
class BinaryMaxHeap:
def __init__(self):
# dynamic array, handles resizing. thanks python
self.values: list[int] = []
def _bubble_up(self, i: int):
"""Recursively swap i with its parent if i is bigger.
AKA up-heap, percolate-up, sift-up, trickle-up, swim-up, heapify-up,
cascade-up, fix-up
O(log n)
"""
if i == 0:
return # at root
p = parent(i)
if self.values[i] > self.values[p]:
self.values[i], self.values[p] = self.values[p], self.values[i]
self._bubble_up(p)
def _bubble_down(self, i: int):
"""Recursively swap i with its largest child if either is bigger than it.
AKA down-heap, percolate-down, sift-down, sink-down, trickle down,
heapify-down, cascade-down, fix-down, extract-min or extract-max, or
just heapify
O(log n)
"""
c1, c2 = children(i)
candidates = [c for c in [c1, c2] if c < len(self.values)] # validate
if len(candidates) == 0:
return # at leaf
max_val = max(self.values[c] for c in candidates)
if max_val > self.values[i]:
max_c = [c for c in candidates if self.values[c] == max_val][0]
self.values[i], self.values[max_c] = (
self.values[max_c],
self.values[i],
)
self._bubble_down(max_c)
# operations
#
# going with ValueError because I usually have invalid operations return
# None, so trying exceptions instead as something new.
def find_max(self):
"""O(1)"""
if len(self.values) == 0:
raise ValueError()
return self.values[0] # it's me, Max. I'm always right here.
def insert(self, v: int):
"""O(log n)"""
self.values.append(v)
self._bubble_up(len(self.values) - 1) # the inserted index
def size(self) -> int:
"""Only seemed polite since invalid requests raise ValueError.
O(1)
"""
return len(self.values)
def extract(self) -> int:
"""AKA "delete max;" remove and return the max val, O(log n)"""
if len(self.values) == 0:
raise ValueError()
if len(self.values) == 1:
return self.values.pop()
ret = self.values[0] # should I call find_max()?
self.values[0] = self.values.pop() # end --> root
self._bubble_down(0)
return ret
def insert_then_extract(self, v: int) -> int:
"""insert a new value and extract the new largest. O(log n)"""
# if what we're inserting is the new max, we can just return it
if len(self.values) == 0 or v > self.values[0]:
return v
# else, we return the root, and bubble down
ret = self.values[0]
self.values[0] = v
self._bubble_down(0)
return ret
    def delete(self, i: int) -> int:
        """deletes value at index i and returns it. O(log n)"""
        if i >= len(self.values):
            raise ValueError()
        deleted = self.values[i]
        last = self.values.pop()
        if i == len(self.values):
            # we deleted the last element itself; nothing to patch up
            return deleted
        self.values[i] = last  # last element -> here
        if last > deleted:
            # the last element is bigger than what was here, so we may
            # need to move it up
            self._bubble_up(i)
        else:
            # smaller (or equal), so may need to move it down
            self._bubble_down(i)
        return deleted
def search(self, v: int) -> int:
"""Finds the first index i that has the value v, or ValueError if doesn't exist.
O(n)
Felt right to add this since so many functions are passed an index i,
but practically it seems like you'd want to find the index i based on
a value v."""
return self.values.index(v)
def decrease_key(self, i: int, v: int):
"""Decreases the value at index i to be v."""
if i >= len(self.values):
raise ValueError()
self.values[i] = v
self._bubble_down(i) # as smaller, maybe move down
def increase_key(self, i: int, v: int):
"""Increases the value at index i to be v."""
if i >= len(self.values):
raise ValueError()
self.values[i] = v
self._bubble_up(i) # as larger, maybe move up
def modify_key(self, i: int, v: int):
"""Modifies the value at index i to be v.
Seems like you'd just want this as an API right?
"""
if i >= len(self.values):
raise ValueError()
if v < self.values[i]:
self.decrease_key(i, v)
else:
self.increase_key(i, v)
def meld(self, other: "BinaryMaxHeap"):
"""AKA 'merge', combines `other` (m) into this (n) heap.
Does not mutate `other`.
Binary heaps have no special meld runtime. The two main options are:
1. insert each of m - O(m log(n + m))
2. heapify both from scratch - O(n + m)
Here we simply do 1.
"""
# copy values first in case we're doing `h.meld(h)`
for v in other.values[:]:
self.insert(v)
@staticmethod
def build_heap(vs: list[int]) -> "BinaryMaxHeap":
"""Constructs a heap from provided values vs.
Rather than inserting n elements (O(n log n), Williams), we
immediately construct a binary tree (shape property) then order it
(heap property) (O(n), Floyd).
Running time is O(n) due to a complicated analysis. Naively we're
skipping the leaves (~n/2) and doing h (log n) operations per node,
but this ends up being good enough to reduce us from O(n log n) to
O(n).
"""
h = BinaryMaxHeap()
h.values = vs[:] # copy over. complete binary tree achieved!
if len(h.values) == 0:
return h # avoid no values edge case range would hit
# now, make sure every node is bigger than its children, or
# moves down if not. trivially true for leaf nodes, so start at
# penultimate level and work back up to root node, inclusive.
for i in range(len(h.values) // 2, -1, -1):
h._bubble_down(i)
return h
def extract_all(h: BinaryMaxHeap) -> list[int]:
"""Calls extract() on h until it's empty. Returns list of results."""
ordered = []
while h.size() > 0:
ordered.append(h.extract())
return ordered
def main():
# build from shuffled [1, 10]
vs = list(range(1, 11))
random.shuffle(vs)
h = BinaryMaxHeap.build_heap(vs)
ordered = extract_all(h)
print("original: ", vs)
print("max heap'd: ", ordered)
# build again, this time from inserts, and merge
for v in vs:
h.insert(v)
h.meld(h)
print("self-melded: ", extract_all(h))
# try more operations
h = BinaryMaxHeap.build_heap(vs)
h.delete(0) # delete 10
h.decrease_key(0, 2) # 9 -> 2
h.modify_key(0, 2) # 8 -> 2
h.increase_key(h.search(1), 2) # 1 -> 2
print("7-3, 2 2 2 2:", extract_all(h))
if __name__ == "__main__":
main()
From my source repo.
Top-k extensions
If we want to track the top-k elements and are OK without having an order, there’s a fun trick: use a min-heap with fixed size k to track the maximum k elements. The key idea is to have the smallest of the current largest k ready to throw out when inserting something new.
In detail, when inserting a new value v:
- if size < k, insert it - O(log k)
- otherwise, if v < current min, ignore it - O(1)
- otherwise, remove the current min, put v there, and bubble down - O(log k)
The top-k is simply the value array at any given point. This can be sorted in O(k log k) either with a sorting algorithm, or (though low to high) by removing them all in order.
What if we want the in-order top-k available at any time? We could just store a sorted top-k array. For reference: maintaining a sorted array can be done by binary search to find the insertion/deletion spot (O(log k)), then shifting (O(k)), which is O(k) overall as it’s dominated by the shifting. Inserting a new element thus looks like:
- if size < k, insert - search then shift O(k)
- otherwise, if v < current min, ignore it - O(1)
- otherwise, remove the current min (shift, O(k)) then insert (search + shift, O(k))
Since most operations here take O(k) instead of O(log k), it’d only really make sense if we need the ordered top-k a lot. We can make it available any time in O(1) (assuming we don’t need to copy it). If instead we have tons of updates, and just want to be able to retrieve it sometimes, keeping a heap updated in O(log k) per operation, then an O(k log k) retrieval to get the sorted order isn’t bad.
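Here’s a sketch of the size-k min-heap trick using Python’s heapq module (a min-heap over a plain list); heapreplace pops the current min and pushes the new value in one O(log k) step.

```python
import heapq
import random

def top_k(stream, k: int) -> list[int]:
    """Track the k largest values seen so far with a size-k min-heap."""
    heap: list[int] = []
    for v in stream:
        if len(heap) < k:
            heapq.heappush(heap, v)     # O(log k)
        elif v > heap[0]:               # heap[0] is the smallest of the current top-k
            heapq.heapreplace(heap, v)  # pop min + push v, O(log k)
        # else: v can't be in the top-k; ignore it in O(1)
    return heap  # unordered; sort in O(k log k) if needed

values = list(range(100))
random.shuffle(values)
print(sorted(top_k(values, 5), reverse=True))  # [99, 98, 97, 96, 95]
```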
Graph
Abstract. A set of vertices V (nodes) and edges E. Edges may be directed.
Here are some ways of representing graphs:
(1) Adjacency list — store edges in vertices
class Vertex:
    def __init__(self, adjacent: list["Vertex"]):
self.edges = adjacent
(2) Adjacency matrix – store vertices in rows (src) and columns (dst) of a square 2D matrix
src \ dst | v1 | v2 | v3 |
---|---|---|---|
v1 | 0 | 1 | 1 |
v2 | 0 | 0 | 1 |
v3 | 0 | 1 | 0 |
Entries seem traditionally binary (edge / no edge), or store a cost/weight.
(3) Incidence matrix — in a 2D matrix, store vertices in rows and edges in columns. Each column should sum to two, as each edge connects two vertices.
vertex \ edge | e1 | e2 |
---|---|---|
v1 | 1 | 1 |
v2 | 1 | 0 |
v3 | 0 | 1 |
For a directed graph, you can use -1 where the edge leaves a vertex and 1 where it enters.
Some important graph concepts:
- cycles occur when you can follow edges from some vertex back to itself
- a directed acyclic graph (DAG) is a common variant. Note that it only needs to lack directed cycles.
- applications in dependency graphs: scheduling, spreadsheets, makefiles, Bayesian networks.
- Also, family “trees,” version control (Git’s history).
- should be possible for citation graphs, but I bet with modern papers directed cycles happen (a quick cycle-check sketch follows below)
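Since “does this directed graph have a cycle?” comes up constantly with DAGs, here’s a small sketch of my own: a depth-first search over an adjacency-list graph with three-state coloring (unvisited / on the current DFS path / done).

```python
def has_directed_cycle(graph: dict[str, list[str]]) -> bool:
    """graph maps each vertex to the list of vertices its out-edges point to."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {v: UNVISITED for v in graph}

    def dfs(v: str) -> bool:
        state[v] = IN_PROGRESS
        for w in graph.get(v, []):
            if state.get(w, UNVISITED) == IN_PROGRESS:
                return True  # back edge into the current path: a directed cycle
            if state.get(w, UNVISITED) == UNVISITED and dfs(w):
                return True
        state[v] = DONE
        return False

    return any(state[v] == UNVISITED and dfs(v) for v in graph)

dag = {"a": ["b", "c"], "b": ["c"], "c": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(has_directed_cycle(dag), has_directed_cycle(cyclic))  # False True
```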
Tree
Abstract.
From a graph perspective, a tree is a connected, acyclic graph.
Fascinatingly, a tree can be constructed by satisfying any of the following interchangeable constraints on a graph:
- it’s got n vertices and n - 1 edges, and fulfills either one of (will then fulfill both):
- connected (can’t make cycles)
- acyclic (won’t be disconnected)
- it’s minimally connected; removing any edge disconnects it
- it’s maximally acyclic; adding any edge creates a cycle
- there’s a unique path between any two nodes
Graph-theoretic trees are (apparently) undirected, so we don’t talk about parents and children.
Once we pick a root and orient all edges away from it, we end up with a rooted tree = tree data structure, and can talk about stuff like:
- each non-root node has exactly one parent
- the root has no parents
- each node can have zero or more children (its degree = number of children)
- the parent/child relationship forms a hierarchy
Picking any node of a rooted tree, the subtree beneath it is itself a valid tree.
from typing import Optional

class TreeNode:
    def __init__(self, parent: Optional["TreeNode"], children: list["TreeNode"]):
self.parent = parent
self.children = children
root = TreeNode(None, [])
c1 = TreeNode(root, [])
root.children.append(c1)
Minimal sketch of a Tree’s data in Python, with adjacency list representation.
Binary Tree
Binary trees are a common variant which restricts nodes to 0, 1, or 2 children. Some terms for binary trees:
- perfect = maximally filled to its level (i.e., all non-leaves have 2 children, all leaves at the same level)
- perfect → full, complete, and balanced
- full = every node has 0 or 2 children
- complete = filled in-order (i.e., non-final levels completely filled, final level filled left-to-right)
- particularly amenable to array representation. see binary heap above.
- complete → balanced
- balanced = height difference of at most 1 for left and right subtrees of any node
from typing import Optional

class BinaryTreeNode:
    def __init__(
        self,
        parent: Optional["BinaryTreeNode"],
        left: Optional["BinaryTreeNode"],
        right: Optional["BinaryTreeNode"],
    ):
self.parent = parent
self.left = left
self.right = right
root = BinaryTreeNode(None, None, None)
c1 = BinaryTreeNode(root, None, None)
c2 = BinaryTreeNode(root, None, None)
root.right = c2
root.left = c1
Minimal sketch of a Binary Tree’s data in Python, with adjacency list representation. Backing with an array instead is cool if it will be relatively balanced, like a binary heap.
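To make the “balanced” definition above concrete, here’s a small checker I’m adding on top of the BinaryTreeNode sketch just above (it assumes that class and the root/c1/c2 example from that snippet):

```python
from typing import Optional

def height(node: Optional[BinaryTreeNode]) -> int:
    """Height in edges; an empty subtree is -1, a single node is 0."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))

def is_balanced(node: Optional[BinaryTreeNode]) -> bool:
    """Every node's left and right subtree heights differ by at most 1."""
    if node is None:
        return True
    return (
        abs(height(node.left) - height(node.right)) <= 1
        and is_balanced(node.left)
        and is_balanced(node.right)
    )

print(height(root), is_balanced(root))  # 1 True for the three-node example above
```

This recomputes heights as it recurses, so it’s not the fastest way; a single post-order pass can check balance in O(n), but this version reads more like the definition.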
Binary Search Tree
A Binary Search Tree (BST) is a binary tree where, for each node:
- left contains everything less
- right contains everything greater
Searching a BST simply follows left/right comparisons down from the root, taking at most height operations, which is O(log n) when the tree is balanced.
BSTs can be constructed degenerately and degrade to a linked list. So in practice, we look for self-balancing BSTs, which try to keep the height in check, like O(log n) height for AVL and red-black trees.
Self-balancing variants of BSTs exist, like AVL Trees and Red-Black Trees. They function mostly like vanilla BSTs, but with tree rebalancing (through “rotations”) to preserve a balance property.10 However, the balancing operations for both are too complex to be worth memorizing.
Click to show a (no-balancing) BST implementation in Python
"""Super basic BST, no balancing, just to get the BST flavor.
Had basically all functional, migrated more to being instance methods, but a few
functional-ish ones remain (_traverse(), _search(), maybe _replace() counts too).
"""
import random
from typing import Optional
class BinaryTreeNode:
def __init__(
self,
value: int,
parent: Optional["BinaryTreeNode"],
left: Optional["BinaryTreeNode"],
right: Optional["BinaryTreeNode"],
):
self.value = value
self.parent = parent
self.left = left
self.right = right
def __repr__(self) -> str:
return f"({self.value})"
def insert(self, new: "BinaryTreeNode"):
"""inserts new into self.
i preferred the non-class more recursive version from before, with the pattern:
n.left = insert(n.left, new)
n.left.parent = n
return n
... but this convention only makes sense if you recurse down to calling on None
and just returning new, which you can't do as an instance method. so making
this an instance method meant changing this to the more branching way.
"""
        # equal; shouldn't happen
        if self.value == new.value:
            print("Error: assuming unique values. Not inserting.")
            return
if new.value < self.value:
if self.left is not None:
self.left.insert(new)
else:
self.left = new
new.parent = self
else:
if self.right is not None:
self.right.insert(new)
else:
self.right = new
new.parent = self
def minimum(self: "BinaryTreeNode") -> "BinaryTreeNode":
"""Traverse to leftmost leaf to find minimum."""
        # I kept the `self is None` case in the BST's call, because we don't
# actually want to recurse until None, we want to stop at a leaf.
if self.left is None:
return self
return self.left.minimum()
def maximum(self: "BinaryTreeNode") -> "BinaryTreeNode":
"""Traverse to rightmost leaf to find maximum node."""
if self.right is None:
return self
return self.right.maximum()
def successor(self: "BinaryTreeNode") -> Optional["BinaryTreeNode"]:
"""returns the node with the smallest value > the self's value, None if max."""
# the logic here was quite tricky to convince myself of. x's successor is either:
# 1. _minimum(x.right) if x.right exists, else
# 2. move upwards until you go up-right (↗) once, then return that
if self.right is not None:
return self.right.minimum()
cur, parent = self, self.parent
while parent is not None and parent.right is cur:
cur = parent
parent = parent.parent
return parent
def predecessor(self: "BinaryTreeNode") -> Optional["BinaryTreeNode"]:
"""returns the node with the greatest value < the self's value, None if min."""
# x's predecessor is either:
# 1. _maximum(x.left) if x.left exists, else
# 2. move upwards until you go up-left (↖) once, then return that
if self.left is not None:
return self.left.maximum()
cur, parent = self, self.parent
while parent is not None and parent.left is cur:
cur = parent
parent = parent.parent
return parent
def delete(self: "BinaryTreeNode", node: "BinaryTreeNode"):
"""
Returns this subtree
Assumes:
- NO MORE node is not self (already checked in BST not root)
- --> NO MORE node has a parent
- node is in self's tree
"""
        n_children = len([c for c in [node.left, node.right] if c is not None])
# three cases:
# (1) if node is a leaf, it's simply replaced by None
if n_children == 0:
_replace(node, None)
return self if node is not self else None
# (2) node has only one child. it's simply replaced by its child.
# (makes sense because any values that would have been correctly
# routed to it are fine being routed to its only child, because its
# only child obeys the order node had with its parent.)
if n_children == 1:
child = node.left if node.left is not None else node.right
_replace(node, child)
return self if node is not self else child
# (3) node has two children. we'll use its successor S to replace it.
# (predecessor works too apparently.) two subcases:
# (a) node.right == S. this implies S.left is None (else S.left would be the
# successor), so S can replace node, adopting node's left child and
# keeping S's right child unchanged
#
# (b) S is deeper in node.right's subtree.
#
# we know this because the successor is either node.right.minimum() if
# node.right exists, or a traversal upwards, but node.right *must*
# exist because we have 2 children in this case. so S must be in
# node.right's subtree.
#
# again we know S.left is None. we:
# - replace S with S.right
# - S replaces node
s = node.successor()
assert s is not None # else node would be a leaf
        if node.right is s:
            # (a) case
            s.left = node.left
            if s.left is not None:
                s.left.parent = s  # keep parent pointers consistent
            _replace(node, s)
        else:
            # (b) case
            _replace(s, s.right)
            _replace(node, s)
            s.left = node.left
            s.right = node.right
            # keep parent pointers consistent after adopting node's children
            if s.left is not None:
                s.left.parent = s
            if s.right is not None:
                s.right.parent = s
        return self if node is not self else s
def _replace(node: BinaryTreeNode, replacement: Optional[BinaryTreeNode]):
    """replace node by pointing node.parent's child link at replacement instead of node,
    and pointing replacement's parent back at node.parent.
    note that this doesn't alter any children; that full operation is in delete().
    this just handles the parent <-> child rewiring, since node.parent's child isn't a
    single attribute (it's either .left or .right).
    """
    if replacement is not None:
        replacement.parent = node.parent
    if node.parent is None:
        return  # no parent link to wire up (node was the root)
    if node.parent.left is node:
        node.parent.left = replacement
    else:
        node.parent.right = replacement
def _traverse(subtree: Optional[BinaryTreeNode]):
"""Returns list of sorted values at and below subtree"""
if subtree is None:
return []
return [
*_traverse(subtree.left),
subtree.value,
*_traverse(subtree.right),
]
def _search(
subtree: Optional[BinaryTreeNode], want: int
) -> Optional[BinaryTreeNode]:
"""Returns want's node if it's in subtree, else None"""
if subtree is None:
return None
if subtree.value == want:
return subtree
if want < subtree.value:
return _search(subtree.left, want)
else:
return _search(subtree.right, want)
class BST:
"""Somewhat awkward, likely unnecessary helper class to handle hanging onto the root
reference and checking if it's None everywhere."""
def __init__(self):
self.root = None
def insert(self, value: int):
if self.root is None:
self.root = BinaryTreeNode(value, None, None, None)
else:
self.root.insert(BinaryTreeNode(value, None, None, None))
def traverse(self):
return _traverse(self.root)
def search(self, want: int):
return _search(self.root, want)
def minimum(self) -> Optional[BinaryTreeNode]:
if self.root is None:
return None
return self.root.minimum()
def maximum(self) -> Optional[BinaryTreeNode]:
if self.root is None:
return None
return self.root.maximum()
def delete(self, node: Optional[BinaryTreeNode]):
if self.root is None:
print("Can't delete node when tree root is None")
return
if node is None:
return # allow passing None just for type checking simplicity
# NOTE: Could verify that node is in tree.
self.root = self.root.delete(node)
def main() -> None:
# shuffled [1, 10]
vs = list(range(1, 11))
random.shuffle(vs)
print("original: ", vs)
# build into BST
tree = BST()
[tree.insert(v) for v in vs]
# traverse
print("BST traversal:", tree.traverse())
print()
# search
print("BST search")
print("1 in tree: ", tree.search(1))
print("7 in tree: ", tree.search(7))
print("10 in tree: ", tree.search(10))
print("0 in tree: ", tree.search(0))
print("-5 in tree: ", tree.search(-5))
print("53 in tree: ", tree.search(53))
print()
print("BST Min/max")
print("minimum:", tree.minimum())
print("maximum:", tree.maximum())
print()
print("BST pred")
print("5's predecessor:", tree.search(5).predecessor()) # type: ignore
print("9's predecessor:", tree.search(9).predecessor()) # type: ignore
print("1's predecessor:", tree.search(1).predecessor()) # type: ignore
print("10's predecessor:", tree.search(10).predecessor()) # type: ignore
print()
print("BST suc")
print("5's successor:", tree.search(5).successor()) # type: ignore
print("9's successor:", tree.search(9).successor()) # type: ignore
print("1's successor:", tree.search(1).successor()) # type: ignore
print("10's successor:", tree.search(10).successor()) # type: ignore
print()
print("BST delete")
print("root:", tree.root)
tree.delete(tree.root)
print("new root:", tree.root)
print("traversal:", tree.traverse())
if __name__ == "__main__":
main()
From my source repo.
Trie
Prefix trees.
- Classic applications: autocomplete, IP routing
- Compare vs hash tables
- pros: tries require no hash function, there are no collisions, and ordered traversal is possible
- cons: way more pointer chasing, I should think
Fuzzy search—and now vector search—make the autocomplete angle less relevant imo.
Tries can be implemented with
- labeled pointers (as you’d expect)
- a character vocabulary-sized array of pointers
- some terrifying tricks like a “left-child right-sibling binary tree”
Click to show a trie implementation in Python
"""Trie - a basic prefix tree.
Stores a value for each string (default just 1). This conveniently doubles as
indicating that we've reached a terminus (i.e., we've hit a stored word),
necessary for non-leaf termination (e.g., if storing all of 'i', 'in', and
'inn')
"""
# import code
from typing import Optional
class TrieNode:
__slots__ = ("children", "value")
"""nice trick: for type checker to verify attrs"""
def __init__(self):
self.children: dict[str, TrieNode] = {}
self.value: Optional[int] = None
def insert(self, s: str, v: int = 1):
"""We store v (int) as payload data associated with s. For only storing
a set of strings, it might as well be a single bit to indicate possible
non-leaf termination (e.g., if storing all of 'i', 'in', and 'inn')."""
# terminal
if len(s) == 0:
self.value = v
return
# any continuations
c, rest = s[0], s[1:]
if c not in self.children:
self.children[c] = TrieNode()
self.children[c].insert(rest, v)
def traverse(self, accum: list[str] = []) -> list[str]:
"""Returns a list of all stored words."""
res = []
# terminal
if self.value is not None:
res.append("".join(accum) + f" ({self.value})")
# any continuations
for c, node in self.children.items():
res.extend(node.traverse(accum + [c]))
return res
def _depth_sets(self, level=0):
"""helper for fun viz"""
res = [(level, set(c for c in self.children.keys()))]
for node in self.children.values():
res.extend(node._depth_sets(level + 1))
return res
def print_depths(self):
"""fun viz (accumulate characters at each tree depth and print).
In lieu of figuring out how to actually print trees "graphically" in text.
"""
ss = self._depth_sets()
for i in range(max(d for d, _cs in ss)):
d_cs = set()
for cs in [cs for d, cs in ss if d == i]:
d_cs.update(cs)
print(i, d_cs)
def __repr__(self):
return "\n".join(self.traverse())
def search(self, s: str) -> Optional["TrieNode"]:
"""Returns node if it's found in the trie, else None."""
if len(s) == 0:
if self.value is not None:
return self # found
return None # not here (no terminal)
c, rest = s[0], s[1:]
if c not in self.children:
return None # no path
return self.children[c].search(rest) # continue
def delete(self, s: str) -> bool:
"""returns whether this node can be fully deleted"""
if len(s) == 0:
# end of the line. mark no longer terminal.
self.value = None
# now, check whether there's children. if so, we leave everything
# because continuations below us depend on us and our parents. but
# if not, we can start propagating deletes upwards.
return len(self.children) == 0
# send down to children to delete
c, rest = s[0], s[1:]
if c not in self.children:
raise ValueError(f"continuation {c} not found where expected")
child_can_be_deleted = self.children[c].delete(rest)
if child_can_be_deleted:
del self.children[c]
# continue propagating deletes up if we've no children and nonterminal
return len(self.children) == 0 and self.value is None
return False
def main() -> None:
t = TrieNode()
t.insert("hello", 47)
t.insert("goodbye", 19)
t.insert("i")
print(t)
t.print_depths()
print()
print("Search:")
print("hello:", t.search("hello"))
print("hell:", t.search("hell"))
print("goodbye:", t.search("goodbye"))
print("goodbyee:", t.search("goodbyee"))
print("", t.search(""))
print()
print("Adding hell, good")
t.insert("hell")
t.insert("good")
print(t)
t.print_depths()
print()
print("Delete")
print("delete hello")
t.delete("hello")
print(t)
t.print_depths()
print("delete good")
t.delete("good")
print(t)
t.print_depths()
print("delete goodbye")
t.delete("goodbye")
print(t)
t.print_depths()
# code.interact(local=dict(globals(), **locals()))
if __name__ == "__main__":
main()
From my source repo.
Compressed Trie
AKA: compact prefix tree, Patricia tree/trie, radix tree/trie (ish).
I think there’s two things going on:
- only children are merged with their parent
- there’s a limited branching factor (the radix)
I believe there’s some nuance here with explicitly modeling the radix,11 so I’m going to only focus on the first property: only children are merged with their parents.
I had to spend some time understanding the intuition behind why finding continuation edges is still O(1) time, when edges can now have long substrings. I (incorrectly) thought that you might have O(nk) worst case to compare all O(n) substrings below an edge to find the longest common prefix (O(k), k = key length) to follow when searching (a prerequisite for inserting and deleting). You can in fact do:
- O(rk), where r is the radix (branching factor). For strings, this is the character set C. So if we’re storing lowercase English words, C = r = 26. The reason is that as soon as we have more than 26 words (apple, banana, cherry, …), by the pigeonhole principle, we’ll have another word with a common letter (apple, …, zebra, aardvark). At this point we’ll split into another node (a) which points to both continuations (pple, ardvark). Then, comparing each substring is O(k).
- O(k), by using the fact that no children share the same first character: we can use a hash table (or an array if the radix is known in advance) to look up the single candidate continuation in O(1). Say we’ve stored apple, banana, cherry. If we are going to search for a word starting with a, like artery, there’s actually only one candidate we could ever look at to find a common substring, which is the one that starts with a (apple). And by the same reasoning as discussed in the O(rk) implementation, there’ll be at most one word with this prefix, else we would have branched. So we can store a mapping like {“a”: (“apple”, node), “b”: (“banana”, node), …}. Then, we take the first letter from our search term (a from artery), use it to see whether there’s a candidate node, and compare the two substrings from there in O(k). Note that if we’d already branched on just “a”, the mapping would store {“a”: (“a”, node), …}.
Click to show a compressed trie (radix tree) implementation in Python
I made this a little more complex because I wanted to implement both the O(rk) and O(k) approaches as subclasses. It’s a minor difference. The main algorithms are in search(), insert(), and delete() of the base class.
In implementing this, I didn’t reference any written material, only drew pictures and considered the cases. I couldn’t find a good reference description or implementation. So there certainly may be bugs. But my simple tests have worked correctly.
Before the full implementation, I’d like to pull out the central algorithms of search(), insert(), and delete(), and present them without comments or asserts or correctness checks so that we may gaze upon the core idea of the algorithm.
# Important: this is all pseudocode, because it's taken out of context.
# Imagine functions in a node class that holds:
# - .terminal (i.e., does a word end here?)
# - .children (some way of storing edges to children)
def _find_insertion_node(self, word: str) -> tuple[CTrieNode, str]:
"""returns node where word would be inserted, and how much is left."""
if len(word) == 0:
return (self, word)
match = self._matches_edge(word) # word substr exactly matching edge
if match is None:
return (self, word)
substr, node = match
remainder = word[len(substr) :]
return node._find_insertion_node(remainder)
def search(self, word: str) -> bool:
"""Returns whether word is found in this subtree"""
node, remainder = self._find_insertion_node(word)
return len(remainder) == 0 and node.terminal
def insert(self, word: str):
"""inserts word into this subtree"""
node, remainder = self._find_insertion_node(word)
if len(remainder) == 0:
node.terminal = True # already have node at terminal
return
subnode, edge, common = node._overlaps_edge(remainder)
if subnode is None:
node._add_child(remainder, True) # no match; fresh child
else:
# overlap, so split: (a) end at intermediate, or (b) new past
intermediate_is_terminal = common == len(remainder)
node._remove_edge(edge)
intermediate = node._add_child(
remainder[:common],
intermediate_is_terminal, # (a)
)
intermediate._add_child_node(edge[common:], subnode)
if not intermediate_is_terminal: # (b)
intermediate._add_child(remainder[common:], True)
def delete(self, word: str):
"""deletes word"""
parent, edge, node = self._find_parent(word)
node.terminal = False
if node._n_children() >= 2:
pass # keep node
elif node._n_children() == 1:
# link parent --> child (remove node)
child_edge, child = node._get_only_child()
parent._replace_edge(edge, edge + child_edge, child)
elif node._n_children() == 0:
parent._remove_edge(edge) # remove node
if parent._n_children() == 1:
# link grandparent --> sibling (remove parent)
grandparent, parent_edge, _ = self._find_parent(
word[: -len(edge)]
)
sibling_edge, sibling = parent._get_only_child()
grandparent._replace_edge(
parent_edge, parent_edge + sibling_edge, sibling
)
Core algorithms search(), insert(), and delete() for a compressed trie. See below for the full implementation.
Now, here’s the full implementation:
"""Compressed trie, AKA: compact prefix tree, Patricia tree/trie, radix tree/trie (ish).
This is a simple variant of a vanilla (character-per-node) trie where each node that is
an only child is merged with its parent.
"""
from abc import ABC, abstractmethod
from typing import Optional
def longest_common_start(s1: str, s2: str) -> int:
"""returns number of characters of longest common starting substring of s1 and s2.
for example:
0 1 2 3 4 5 6 7
a r t e r i e s
a r t
-> returns 3
0 1 2 3 4 5 6 7
a r t e r i e s
a r t i f a c t
-> returns 3
0 1 2 3 4 5
a r t
b a n a n a
-> returns 0
"""
min_len = min(len(s1), len(s2))
for i in range(min_len):
if s1[i] != s2[i]:
return i
return min_len
class CTrieNode(ABC):
__slots__ = ("terminal",)
def __init__(self, terminal: bool):
self.terminal: bool = terminal
# The following abstract methods allow a concrete node subclass to store
    # its edges and children in different data structures. Done this way to
# demonstrate the O(rk) and O(k) lookup approaches.
@abstractmethod
def _matches_edge(self, word: str) -> Optional[tuple[str, "CTrieNode"]]:
"""returns (substring, CTrieNode) if there's a substring of word that
exactly matches an edge, else None.
"""
pass
@abstractmethod
def _overlaps_edge(
self, word: str
) -> tuple[Optional["CTrieNode"], str, int]:
"""Returns any overlapping edge's (node, edge, overlap len),
or (None, "", 0) if there's none.
At most 1 edge will have any common start overlap with word b/c no edges
have any prefix overlap because that's how a prefix data structure works.
"""
pass
@abstractmethod
def _add_child(self, edge: str, terminal: bool) -> "CTrieNode":
"""adds edge pointing to a new node w/ specified terminal set. returns child.
may validate no shared prefixes, so if modifying, delete before calling this.
"""
pass
@abstractmethod
def _add_child_node(self, edge: str, child: "CTrieNode") -> "CTrieNode":
"""adds new edge pointing to child. returns child.
may validate no shared prefixes, so if modifying, delete before calling this.
"""
pass
@abstractmethod
def _remove_edge(self, edge: str) -> None:
"""removes edge"""
pass
@abstractmethod
def all_children(self) -> list[tuple[str, "CTrieNode"]]:
pass
@abstractmethod
def _depth_sets(self, level=0) -> list[tuple[int, set[str]]]:
"""helper for fun viz. returns list of [depth, set(strs at depth) ]"""
pass
@abstractmethod
def _n_children(self) -> int:
"""conceptually, self.terminal counts as another child, so its included in the
total."""
pass
@abstractmethod
def _get_only_child(self) -> tuple[str, "CTrieNode"]:
"""gets the only child's (edge, child). should only be called when
self._n_children() == 1, and not self.terminal. only used before deeply
modifying."""
pass
# the following happen to be the same implementation, but could be broken
# out into subclasses if needed.
def _replace_edge(
self,
old_edge: str,
new_edge: str,
node: "CTrieNode",
) -> None:
"""replaces old_edge with new_edge pointing to node"""
self._remove_edge(old_edge)
self._add_child_node(new_edge, node)
# the following are the main concrete operations implementing the data structure.
def _find_insertion_node(self, word: str) -> tuple["CTrieNode", str]:
"""returns node where word would be inserted, and how much is left
returns (node, remainder) where remainder is any trailing substr of word. Note
that if the word was found, node.terminal() == True and remainder == ''.
"""
# terminal case
if len(word) == 0:
return (self, word)
# search edges case
match = self._matches_edge(word)
if match is None:
return (self, word)
substr, node = match
remainder = word[len(substr) :]
return node._find_insertion_node(remainder)
def search(self, word: str) -> bool:
"""Returns whether word is found in this subtree"""
node, remainder = self._find_insertion_node(word)
return len(remainder) == 0 and node.terminal
def insert(self, word: str):
"""inserts word into this subtree"""
node, remainder = self._find_insertion_node(word)
if len(remainder) == 0:
# already have node at terminal! ensure terminal True (already is if exists)
node.terminal = True
return
# otherwise, there's some remainder left. two main cases: (1) there's
# no matching substr below, so we make a fresh child. (2) we have a
# partial match with an edge, so we'll have to split.
subnode, edge, lcs = node._overlaps_edge(remainder)
if subnode is None:
# (1). means lcs == 0, so we add all remainder
assert len(edge) == 0 and lcs == 0
node._add_child(remainder, True)
else:
# (2) split. two suboptions:
# (a) our intermediate node is terminal if lcs == len(remainder).
# (like []--"artifice"-->[] + "art" = []--"art"-->[]--"ifice"-->[])
#
# (b) our intermediate node isn't terminal and we need another for rest.
# add intermediate node
intermediate_is_terminal = lcs == len(remainder)
node._remove_edge(edge) # do now for sanity checks
intermediate = node._add_child(
remainder[:lcs],
intermediate_is_terminal, # for (a)
)
# always add back the existing subnode with a shortened edge
intermediate._add_child_node(edge[lcs:], subnode)
# for (b), add a new node for this word
if not intermediate_is_terminal:
intermediate._add_child(remainder[lcs:], True)
def _find_parent(self, word: str) -> tuple["CTrieNode", str, "CTrieNode"]:
"""returns (parent, edge, node), where node specifies word. helper for delete().
very similar to _find_insertion_node(), but we don't have parent
references. so if we're to delete a node, we need to be looking down
towards it.
"""
# search edges case
match = self._matches_edge(word)
if match is None:
raise ValueError(f"word (remainder) '{word}' not found in ctrie")
substr, node = match
remainder = word[len(substr) :]
if len(remainder) == 0:
return self, substr, node # found case
return node._find_parent(remainder) # recurse case
def delete(self, word: str):
print(f"deleting '{word}'")
parent, edge, node = self._find_parent(word)
if node.terminal is False:
raise ValueError(f"Can't delete '{word}' as it wasn't stored")
node.terminal = False
if node._n_children() >= 2:
# easy case: node still used by children. done.
print(" delete: 2 children case (easy)")
pass
elif node._n_children() == 1:
# node was a terminal with a single option below (conceptually, 2
            # -> 1 children). now that it's no longer a terminal, it should
# be compressed with its parent. example:
#
# deleted "art", no longer Terminal
# v
# []--"art"-->[]--"ifice"-->[] (-->)
# parent node child
#
# should now become:
#
# []--"artifice"-->[]
# parent child
#
# we don't need to worry about parent, because parent must have started with
# >= 2 children (including 1 if it's terminal). node was 1 of its children,
# and now child will *still* be 1 of its children.
print(" delete: 1 child case (medium)")
child_edge, child = node._get_only_child()
parent._replace_edge(edge, edge + child_edge, child)
elif node._n_children() == 0:
# node had no children, so it can be removed. this means parent's
# number of children decreases, which means it might need to be
# merged with its parent if it has 1 left (node's sibling). this will
# make grandparent point directly to sibling.
print(" delete: 0 children case (hard)")
parent._remove_edge(edge)
            # (note: when parent is the node delete() was called on, i.e. the
            # root, there's no grandparent to merge into, and a root with a
            # single child is fine, so skip the merge in that case.)
            if parent._n_children() == 1 and parent is not self:
# example: deleting "artifice":
#
# +----"ful"----> [] sibling
# |
# [] ---"art"---> [] --"ifice"--> [] node
# grandparent parent
#
# +----"ful"----> [] sibling
# |
# [] ---"art"---> []
# grandparent parent <-- ! only 1 child! merge w/ grandparent
#
# [] ---"artful"---> []
# grandparent sibling
# NOTE: this is inefficient: finding grandparent! should put
# in _find_parent() but it's annoying to track grandparents.
grandparent, parent_edge, parent_ = self._find_parent(
word[: -len(edge)]
)
assert parent_ is parent
sibling_edge, sibling = parent._get_only_child()
grandparent._replace_edge(
parent_edge, parent_edge + sibling_edge, sibling
)
else:
assert False # should never have < 0 children
def traverse(self, prefix: list[str] = []) -> list[str]:
"""O(n), return a list of all stored strings"""
res = []
if self.terminal:
res.append("".join(prefix))
for edge, child in self.all_children():
res.extend(child.traverse(prefix + [edge]))
return res
def __repr__(self) -> str:
"""O(n), returns all stored strings newline-separated"""
return "\n".join(self.traverse())
def print_depths(self) -> None:
"""fun viz (accumulate characters at each tree depth and print).
In lieu of figuring out how to actually print trees "graphically" in text.
"""
ss = self._depth_sets()
for i in range(max(d for d, _cs in ss)):
d_cs = set()
for cs in [cs for d, cs in ss if d == i]:
d_cs.update(cs)
print(i, d_cs)
class CTrieNodeEdgeIterate(CTrieNode):
"""This is a data structure for a Compressed Trie Node that looks up edges by
iterating over them, which is O(rk), where r is the radix, AKA the set of unique
characters, and k is the length of a key (stored word). (For example, if storing
lowercase English letters only, r = 26, and when storing 'apple', k = 5)
This class contains the methods for getting/storing edges, and doing substring
lookups. The core data structure methods search(), insert(), and delete() are
implemented on the superclass CTrieNode.
"""
__slots__ = ("_children",)
def __init__(self, terminal: bool):
super().__init__(terminal)
self._children: dict[str, CTrieNode] = {}
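        """dict from {edge: node}, like {'apple': node}"""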
def _matches_edge(self, word: str) -> Optional[tuple[str, "CTrieNode"]]:
"""
This is one of the core O(rk) operations we can speed up to O(k)
"""
for substr, node in self._children.items(): # O(r)
if word.startswith(substr): # O(k)
return (substr, node)
return None
def _overlaps_edge(
self, word: str
) -> tuple[Optional["CTrieNode"], str, int]:
"""
This is one of the core O(rk) operations we can speed up to O(k)
"""
# due to that this is still a prefix trie, only one child will have a
# common start > 0. because if multiple children did, that shared
# prefix should already have been split down into its own node.
for edge, node in self._children.items(): # O(r)
lcs = longest_common_start(edge, word) # O(k)
if lcs > 0:
return node, edge, lcs
return None, "", 0
def _add_child(self, edge: str, terminal: bool) -> "CTrieNode":
return self._add_child_node(edge, CTrieNodeEdgeIterate(terminal))
def _add_child_node(self, edge: str, child: "CTrieNode") -> "CTrieNode":
self._children[edge] = child
assert sum(edge[0] == w[0] for w in self._children.keys()) == 1
return child
def _remove_edge(self, edge: str) -> None:
del self._children[edge]
def all_children(self) -> list[tuple[str, "CTrieNode"]]:
return list(self._children.items())
def _depth_sets(self, level=0) -> list[tuple[int, set[str]]]:
res = [
(
level,
set(f"{c} ({n.terminal})" for c, n in self._children.items()),
)
]
for node in self._children.values():
res.extend(node._depth_sets(level + 1))
return res
def _n_children(self) -> int:
return len(self._children) + (1 if self.terminal else 0)
def _get_only_child(self) -> tuple[str, "CTrieNode"]:
assert not self.terminal and self._n_children() == 1
for edge, node in self._children.items():
return edge, node
assert False
class CTrieNodeCharLookup(CTrieNode):
"""This is a data structure for a Compressed Trie Node that looks up edges by
indexing by the first letter, which is O(1), then comparing strings, which is O(k).
This class contains the methods for getting/storing edges, and doing substring
lookups. The core data structure methods search(), insert(), and delete() are
implemented on the superclass CTrieNode.
"""
__slots__ = ("_children",)
def __init__(self, terminal: bool):
super().__init__(terminal)
self._children: dict[str, tuple[str, CTrieNode]] = {}
"""dict from {edge's first char: (edge, node)}, like {'a': ('apple', node)}"""
def _matches_edge(self, word: str) -> Optional[tuple[str, "CTrieNode"]]:
"""
This is one of the core O(k) operations we sped up from O(rk)
"""
if len(word) > 0 and word[0] in self._children: # O(1)
edge, node = self._children[word[0]]
if word.startswith(edge): # O(k)
return (edge, node)
return None
def _overlaps_edge(
self, word: str
) -> tuple[Optional["CTrieNode"], str, int]:
"""
This is one of the core O(k) operations we sped up from O(rk)
"""
if len(word) > 0 and word[0] in self._children: # O(1)
edge, node = self._children[word[0]]
lcs = longest_common_start(edge, word) # O(k)
if lcs > 0:
return node, edge, lcs
return None, "", 0
def _add_child(self, edge: str, terminal: bool) -> "CTrieNode":
return self._add_child_node(edge, CTrieNodeCharLookup(terminal))
def _add_child_node(self, edge: str, child: "CTrieNode") -> "CTrieNode":
first = edge[0]
if first in self._children:
raise ValueError(
f"cannot add edge {edge} when {self._children[first][0]} already stored"
)
self._children[first] = (edge, child)
return child
def _remove_edge(self, edge: str) -> None:
del self._children[edge[0]]
def all_children(self) -> list[tuple[str, "CTrieNode"]]:
return list(self._children.values())
def _depth_sets(self, level=0) -> list[tuple[int, set[str]]]:
res = [
(
level,
set(
f"{c} ({n.terminal})" for c, n in self._children.values()
),
)
]
for _, node in self._children.values():
res.extend(node._depth_sets(level + 1))
return res
def _n_children(self) -> int:
return len(self._children) + (1 if self.terminal else 0)
def _get_only_child(self) -> tuple[str, "CTrieNode"]:
assert not self.terminal and self._n_children() == 1
for edge, node in self._children.values():
return edge, node
assert False
def main() -> None:
# test helpers
assert longest_common_start("art", "arteries") == 3
assert longest_common_start("artifact", "arteries") == 3
assert longest_common_start("art", "banana") == 0
# to switch between node implementations
# Node = CTrieNodeEdgeIterate
Node = CTrieNodeCharLookup
t1 = Node(False)
t1.insert("artifact")
t1.insert("artifice")
t1.insert("art")
t1.insert("artificial")
print(t1)
t1.print_depths()
# ctrie.delete("art") # 1 child (medium)
# ctrie.delete("artificial") # 0 children (hard)
# 2 children (following two lines): easy
t1.insert("artful")
t1.delete("art")
print(t1)
t1.print_depths()
t2 = Node(False)
t2.insert("julie")
t2.insert("july")
# t.delete("jul") # value error
t2.insert("julius")
t2.delete("julie")
# t.insert("abcdefghijklmnopqrstuvwxyz")
# t.insert("abra")
t2.print_depths()
if __name__ == "__main__":
main()
From my source repo.
Amortization
Amortization averages rare expensive operations with common cheap operations.
For example, growing an array of capacity n happens only after n elements have filled it, and costs n (copying everything over). The “obvious” thing to do is double the array’s size each time. This means the expensive operation keeps happening rarely. Hand-waving the math: for n elements we get n O(1) costs, then one O(n) cost, i.e., a total of about 2n = O(n) operations, or an average of O(n)/n = O(1) per operation.
But critical to an actual analysis, I think, is the growing policy. Imagine we only grow the array by 1 each time, rather than doubling it. Then every new addition takes O(n), and O(n) quickly dominates as the actual append cost every time. This is why some equations are necessary to actually measure how often the expensive operation happens. I believe what you do is, rather than looking at a particular starting size (e.g., 10), you say: assume some starting size and an expansion policy, and that we've done n pushes. How many times will we have grown to support n pushes, and how much will that have cost? For doubling, the copies form a geometric series, roughly 1 + 2 + 4 + ... + n/2 + n < 2n, so the total is O(n) and the amortized cost per push is O(1). For growing by 1, they form an arithmetic series, roughly 1 + 2 + 3 + ... + n ≈ n²/2, so the amortized cost per push is O(n).
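To make the growing-policy point concrete, here's a minimal sketch (mine, not part of the trie code above) that simulates n pushes into a growable array and counts element copies under a given policy; total_copy_cost and the grow callback are names invented for this illustration.
def total_copy_cost(n_pushes: int, grow) -> int:
    """Simulates n_pushes appends into an array that starts at capacity 1.
    grow(capacity) returns the new capacity. Returns how many element copies
    the resizes cost in total."""
    capacity, length, copies = 1, 0, 0
    for _ in range(n_pushes):
        if length == capacity:
            copies += length  # copy everything into the new allocation
            capacity = grow(capacity)
        length += 1  # the push itself is O(1)
    return copies

for n in (1_000, 10_000, 100_000):
    doubling = total_copy_cost(n, lambda c: c * 2)  # geometric series
    plus_one = total_copy_cost(n, lambda c: c + 1)  # arithmetic series
    print(f"n={n}: doubling={doubling} copies (< {2 * n}), grow-by-1={plus_one} copies")
The doubling totals stay under 2n no matter how large n gets, while the grow-by-1 totals scale like n²/2, matching the O(1) vs. O(n) amortized costs above.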
(Wikipedia says there are three formal methods of analysis you can choose from to get the correct answer: “aggregate,” “accounting,” and “potential.” I don’t want to dig into amortized analysis further right now because I don’t yet find it fun or interesting.)
One interesting distinction is amortized analysis vs. average-case analysis. They sound similar, and amortized analysis does involve averaging. The key difference is that average-case analysis considers (a distribution over) actual inputs, whereas amortized analysis is deterministic and just counts operations over a sequence.
Footnotes
I believe it’s correct to say that most of these could be written as Θ because the bounds are tight. However, I don’t currently enjoy doing true, rigorous, math-y big-O analysis, so I’m not going to. ↩︎
The dichotomy is strange. A priority queue is an ADT, and the data structure is called a heap. But there are no other serious contenders for implementing a priority queue besides a heap. And in contrast, there are over a dozen heaps (binary heap, skew heap, etc.), which seems to make “heap” itself abstract, and those implementations the concrete data structures. ↩︎
Inserts that grow the array require copying old elements over (O(n)). The amortized time depends on how often this happens, i.e., the growing policy. E.g., growing by 1 is bad and leads to O(n) amortized, but doubling leads to O(1) amortized. One idea for growing without copying would be extending to a more complex data structure that stitches multiple contiguous memory regions together. However, this is probably more trouble than it's worth, and could affect cache performance. ↩︎
Merge: stitching two arrays together would require higher-level bookkeeping for traversal and for the underlying memory fragments. ↩︎
Split hack: This is a cool thing you could do in C. The problem in practice, however, is that freeing the first one (index 0) frees both, and freeing the second (index k) likely freaks malloc out: malloc only knows about the start addresses it originally handed out. For this to work, you'd need some higher-level system managing these arrays to ensure you don't screw up memory. This makes most sense as a distinct concept: views of an array, rather than an actual split. A split implies a new, fully-functioning, first-class array. ↩︎
With bad design or bad luck, all O(1) hash table operations can become O(n). Imagine all keys have the same hash and the table is near a full load factor. Then finding any entry (a prerequisite for access, update, or delete) or an empty slot (for insertion) can take O(n), traversing the whole table or the separate-chaining structure. ↩︎
Python’s deque docs say: “Indexed access is O(1) at both ends but slows to O(n) in the middle.” This ‘slows’ makes sense given their implementation. ↩︎
And meld (AKA merge) from O(n) into O(log n) and eventually O(1). ↩︎
Because I am Max. ↩︎
For example: the heights of the left and right subtrees at any node must differ by at most 1. ↩︎
I really don't understand this, so there might not be. My evidence is: while scouring Wikipedia's links to implementations to try to understand a compressed trie's properties (it didn't help), I found (a) some implementation that said something like "this isn't a true radix tree, so I renamed it to a Patricia tree," and (b) the fact that radix trees are used in fast lookups like IP routing tables, whereas my conceptualization is a little slow, with pointer jumping and arbitrary-length substring parsing and comparison. Also, (c) while arguing with some LLM, it told me these properties might be distinct. Due to all this, I'm going to just claim that I'm doing the merged thing. But I think it's possible that if you assume a limited character set, you end up with the radix tree naturally. Perhaps really modeling a very limited radix (like r=2 or r=8) lets you implement pieces of this more efficiently with better data structures (OK, an array) and faster comparison (OK, bit tricks). ↩︎