It’s all just (idealized) bits.
Simplifying Assumptions
We generally consider two related aspects when we store bits somewhere: location and speed.
Location: There’s stack/heap, and the physical memory hierarchy.
Speed: The memory hierarchy has varying sizes and access speeds, from registers to cache to RAM to disk. Plus caching effects.
Much more on both in the hardware-software interface.
In data structures, we typically ignore these:
- We assume heap
- We don’t worry about the memory hierarchy
- Assume data structures fit in memory
- Ignore different speeds for cache/RAM
- Ignore caching effects (i.e., repeated or nearby reads)
Instead, unless otherwise noted, we just assume there’s instant O(1) access to anything we have an address for.
We’ll take all of these assumptions below.
(Static) Array
Let’s consider a C array. Just a contiguous range of memory.
For an array to be useful, we must know or store how much memory was allocated—i.e., the array’s capacity—else we’ll overrun.
```c
// hardcoded length
int a[10];                         // stack
int *b = malloc(10 * sizeof(int)); // heap
```

```c
// variable-length arrays (C99+)
int capacity = 10;
int a[capacity];                         // stack
int *b = malloc(capacity * sizeof(int)); // heap
```
We might want to use the raw array to build a higher-level data structure. E.g., use it to make a stack or dynamic array. To move towards those cases, we’ll also track the length, i.e., number of explicitly assigned elements.
Click to show a stack-only fixed-size array in C
To my surprise, a fixed-size array can live on the stack and be passed into sub-functions. You just have to remember never to use it after the defining function returns. When you pass it, callees just get a pointer to where it lives in your stack frame.
I added some basic push() and pop() functions to see what mutating it and passing around metadata feels like. This gives it lite stack (data structure) semantics.
```c
#include <stdio.h>

/**
 * Returns new len.
 * prereq: 0 <= len <= capacity
 */
int push(int *a, int capacity, int len, int value) {
    if (len == capacity) {
        printf("Warning: Can't push %d: array is at capacity (%d).\n", value, capacity);
        return len;
    }
    a[len] = value;
    return len + 1;
}

/**
 * Returns new len.
 * prereq: 0 <= len <= capacity
 */
int pop(int *a, int len) {
    if (len == 0) {
        printf("Warning, can't pop: array is empty.\n");
        return len;
    }
    // just leave the value there! :-)
    return len - 1;
}

/**
 * Prints `a`'s values in [0, len).
 * prereq: 0 <= len <= capacity
 */
void display(int *a, int capacity, int len) {
    printf("int* array at %p, capacity = %d, len = %d\n", (void*)a, capacity, len);
    for (int i = 0; i < len; i++) {
        printf("a[%d] = %d\n", i, a[i]);
    }
    printf("\n");
}

/**
 * Fill all of `a` with 0.
 */
void zero(int *a, int capacity) {
    for (int i = 0; i < capacity; i++) {
        a[i] = 0;
    }
}

int main() {
    // dynamic stack array
    int capacity = 10;
    int len = 0;
    int a[capacity]; // can't init (= {0}) here as dynamic
    // zero(a, capacity); // instead. (optional.)

    display(a, capacity, len); // nothing

    for (int i = 0; i < capacity + 1; i++) {
        len = push(a, capacity, len, 100 + i);
    }
    display(a, capacity, len); // 100 -- 109

    for (int j = 0; j < 3; j++) {
        len = pop(a, len);
    }
    display(a, capacity, len); // 100 -- 106

    return 0;
}
```
Doing it as a raw array on the stack is fun to immediately see pain points:
- Not being able to expand the array. (Allocating heap memory so it’s expandable makes sense.)
- Having to pass all metadata as function args is cumbersome. (Structs make sense.)
Raw (static, fixed-length) arrays are great as densely-packed, specialized data storage. But they’re just a block of contiguous memory, not really a general-purpose data structure.
Dynamic Array
Let’s analyze a basic heap-allocated array that grows by doubling:
Time complexity
Operation | Complexity | Notes |
---|---|---|
Index access | O(1) | |
Value access | O(n) | search |
Insert at end | O(1) amortized | if it grows, copy all, but amortized (see below) |
Insert at index | O(n) | need to shift everything after |
Delete at end | O(1) | |
Delete at index | O(n) | need to shift everything after |
Update | O(1) | |
Insert at end amortized: Inserts that grow the array require copying the old elements over (O(n)). The amortized time depends on how often this happens, i.e., on the growing policy. E.g., growing by 1 is bad and probably leads to O(n) amortized, but doubling leads to O(1) amortized. Longer note below.
One idea for growing without copying would be extending to a more complex data structure that stitches multiple contiguous memory regions together. However, this is probably more trouble than it’s worth, and could affect cache performance.
Space complexity
Type | Complexity | Notes |
---|---|---|
Overhead | O(1) | pointer, capacity, length |
Special operations
(Also time complexity)
Operation | Complexity | Notes |
---|---|---|
Traversal | O(n) | visit all elements |
Merge | O(n) | to copy everything over. (see below) |
Split, dangerous hack | O(1) | just point to index k and partition capacities. (see below) |
Split | O(n) | copy over everything from k to n-1 into new array |
Min/max | O(n) | visit all elements |
Merge: stitching two arrays together would require higher-level bookkeeping for traversal and memory fragments.
Split hack: This is a cool thing you could do in C. The problem in practice, however, is that now freeing the first one (index 0) frees both, and freeing the second (index k) likely freaks malloc out: malloc only knows about the start addresses it handed out. For this to work, you’d need some higher-level system managing these arrays to ensure you don’t screw up memory. This makes most sense as a distinct concept: views of an array, rather than an actual split. A split implies a new, fully-functioning, first-class array.
Hash Map
Let’s consider a hash map.
First, we’ll assume the existence of a good hash function and compression function:
- a hash function h maps any item I we’d like to store into, let’s say, an int H within a fixed range of nonnegative integers [0, M] in Z. (E.g., 32-bit or 64-bit integers.)
- a good hash function makes it unlikely that two distinct items I and I’ produce the same hash H while there are < M items; i.e., hashes for input objects are distributed uniformly across [0, M]
- a compression function (AKA “index mapping”) maps [0, M] to [0, K), the index range of our actual backing key array of length K.
- Hmm, trivially this could be done with a simple mod, giving our array key H’ = H % K
- However, there exist good hash functions that become bad in practice because they collide after this mod
- Studied strategies (from looking up; would need to read more)
- division: H mod K, but choose K so it’s unlikely to expose patterns in hash values (e.g., a prime K away from a power of 2). However, this seems bad because your backing array now won’t be near a power of two, which is good for memory alignment?
- multiplication: choose some constant e in (0, 1), do H * e, take the fraction bits f, then do f * K
- MAD (multiply-add-divide): (aH + b) mod p mod K (with p a prime > K)
I’m thinking this is backed by an array because an array gives us O(1) access once we’ve gotten the compressed hash.
I understand why hashes need to be stable. If an item’s hash changes, you wouldn’t be able to locate it in the map anymore.
Now, we can take any item we’d like to store and produce an index in [0, K)
of a backing array. Each entry in the array could store:
- the item itself (open addressing; need probing)
- a primitive if fixed length (e.g., <= 64 bits)
- the full item (will need space + copy)
- primitive or pointer (e.g., still always <= 64 bits, but potential optimization for primitive items)
- always pointers (simplest, natural for many langs)
- a (pointer to a) data structure (separate chaining; for collisions)
- dynamic array
- linked list
- another hash map (len K’, hopefully to avoid collisions with K)
- The sub-hash-map strategy seems sus. Feels like you should just grow the original hash map once there’s enough pressure. Though I guess could help if you have a pathologically bad initial H->K, then a great H->K’?
- between a dynamic array and a linked list, I like the linked list. You’ll have O(n) retrieval and O(1) insertion at the end regardless, so might as well have O(1) deletion rather than O(n)
So far I’ve been assuming we want to store and retrieve an item I. This seems suitable for implementing a set: O(1) insertion and lookup. I think we also need an equality check for items, because after a collision we need to know whether the candidate and existing entry match.
However, we may want to hash user-provided key-value pairs (K, V) to implement a mapping. In this case, I think we simply hash only the key, and explicitly store the key alongside the value in the data structure so that we can still use the key (e.g., for equality checks after a collision).
Growing. If we track the number of elements in the data structure, we can decide to grow the data structure at some fullness threshold. (“Pressure?”) Naively, we’d
- choose a new K’ > K
- iterate over all items, and find their new index in K’
- if we also stored each item’s hash (which should be OK b/c stable), we wouldn’t have to re-hash. This feels like a minor optimization but could be good if hashing is expensive, like checksumming a lot of data
- regardless, I think we’d need to re-run the compression function
- we could skip this if we have an equation to transform a compressed value from an input of K to K’
- however, this is probably as fast as running the compression function again, because they seem simple
- then, we’d need to re-insert freshly into the new map
- if we have strong guarantees about the ordering of items under some K -> K’ assumptions (i.e., haven’t undergone a mod), AND we don’t have any collisions (no open addressing or chaining happened), we could potentially directly copy chunks of items
- … but this seems like an over-optimization that would be useful rarely. zero collisions before growing seems unlikely. and it seems like the chained data structures might need some new bookkeeping.
- in fact, the whole reason we’re growing the data structure in the first place implies we likely have collisions, which means naively copying is probably a bad idea
- this is another O(n) operation that I think amortizes well, because we tie resizes to inserts and use a multiplicative growth strategy.
Iteration.
- If we have only a backing array of len K, and we have n stored items with n << K, then O(K) dominates the iteration time. (We iterate over all buckets.) Once n >= K, O(n) dominates. (Together that’s O(K + n).)
- We lack the ability to provide, e.g., iteration order by insertion
- To provide insertion order and strict O(n) iteration, we could keep another data structure. (Gets (value access) always O(1) from hash map.)
- We could use a dynamic array that we push new insertions to at the end.
- O(n) deletion
- shift and copy, same as array
- this is bad, because it ruins our O(1) hash map deletion
- O(1) get by index (what’s the 4th item?)
- interestingly: O(n) by value’s index (when was this value inserted?)
- Unless we kept the value’s index in the hash table as well, but this would require O(n) index updates on every deletion (which the shift already costs anyway), and if we did this we might as well skip the array entirely
- O(n) deletion
- We could use a linked list
- O(n) deletion, because it’s O(n) retrieval (where is the node in the linked list?)
- Slow position retrieval could be circumvented by storing the position in the node, but that would again require O(n) updates on deletion for the bookkeeping
- O(n) get by index (could be a non-goal)
- What if we used a doubly linked list within the hash map entries?
- Keep pointers to first and last items
- O(1) deletion: find and patch up neighbors’ links
- still O(1) insertion: added to end
- O(n) by index (4th item?) or value index (which item?)
Handling open addressing
Open addressing, I believe, means that upon collision you walk forward until you find an opening. I’ve intuitively never liked this because it makes the top-level data structure messier.
- Naively, retrievals become “hit then walk”
- But the issue is deletions can leave holes
- To overcome the holes, we need some bookkeeping strategy. We could track the maximum we’ve ever walked, or a per-bucket max walk. Then always walk that far. Per-bucket seems better because you can actually decrement it later upon deletions (though maybe over-optimization to do so).
- Actually, it seems like leaving holes isn’t commonly done. The following strategies assume no holes.
- Tombstones (lazy deletion): keep entries as “occupied” so searching forwards keeps looking until a hole. This is cool because insertions can still use tombstones.
- Shifting (more typical?): when you’d delete, shift things back to avoid any holes.
- This seems… really complicated? Couldn’t you have to shift n entries, hopping over entries and clusters from other hashes? I think yes. But I guess you do have the nice invariant of “no holes within a cluster,” so as soon as you reach a hole you’re done.
Probing strategies
- linear probing (walk 1, clustering, great cache perf)
- double hashing (walk deterministic n computed from bucket (e.g., index), no clustering, bad cache perf)
- quadratic probing (walk 1, then 2 more, then 3 more, …; moderate clustering & caching)
Amortization
Amortization averages rare expensive operations with common cheap operations.
For example, growing an array of length n happens after n elements (filling it), and has a cost of n (copying everything over). The “obvious” thing to do is double the array’s size each time. This means that the expensive operation keeps happening rarely. Hand-waving math, for n elements we get n O(1) costs, then 1 O(n) cost. This means we have a total of 2n = O(n) operations, or an average of O(n)/n = O(1).
But critical to an actual analysis, I think, is the growing policy. Because imagine we only grow the array by 1 each time, rather than doubling it. Then, every new addition would take O(n), and O(n) would quickly dominate as the actual append cost every time. This is why some equations are necessary to actually measure the expensive operation frequency. I believe what you can do here is, rather than looking at a starting size (e.g., 10), you say: assume we have any starting size, an expansion policy, and we’ve done n pushes. How many times will we have doubled to support n pushes? Now, how much will that have cost? (For this case, you use a geometric series.)
(Wikipedia says there are three formal methods of analysis you can choose from to get the correct answer: “aggregate,” “accounting,” and “potential.” I don’t want to dig into amortized analysis further right now because I don’t yet find it fun or interesting.)
One interesting distinction is amortized analysis vs. average-case analysis. They sound similar, and amortized analysis does involve averaging. The key distinction is that average-case analysis considers (a distribution over) actual inputs, whereas amortized analysis is deterministic and involves counting operations.