General CS Notes

  1. Data Structures
  2. Python Data Structures
  3. Python Memory Management
  4. The Hardware-Software Interface

Data Structures

It’s all just (idealized) bits.

Simplifying Assumptions

We generally consider two related aspects when we store bits somewhere: location and speed.

Location: there’s the stack vs. the heap, plus the physical memory hierarchy.

Speed: the memory hierarchy has varying sizes and access speeds, from registers to cache to RAM to disk. Plus caching effects.

Much more on both in the hardware-software interface.

In data structures, we typically ignore these details. Instead, unless otherwise noted, we just assume there’s instant O(1) access to anything we have an address for.

We’ll carry these assumptions through everything below.

(Static) Array

Let’s consider a C array. Just a contiguous range of memory.

For an array to be useful, we must know or store how much memory was allocated—i.e., the array’s capacity—else we’ll overrun.

// hardcoded length
int a[10]; // stack
int *b = malloc(10 * sizeof(int)); // heap

// variable-len arrays (C99+)
int capacity = 10;
int a[capacity]; // stack
int *b = malloc(capacity * sizeof(int)); // heap

We might want to use the raw array to build a higher-level data structure. E.g., use it to make a stack or dynamic array. To move towards those cases, we’ll also track the length, i.e., number of explicitly assigned elements.

A stack-only fixed-size array in C:

To my surprise, a fixed-size array can live on the stack and be passed into sub-functions. You just have to remember never to use it after the defining function returns. When you pass it, callees simply get a pointer to where it lives on your stack frame.

I added some basic push() and pop() functions to see what mutating it and passing around metadata feels like. This gives it lightweight stack (the data structure) semantics.

#include <stdio.h>

/**
 * Returns new len.
 * prereq: 0 <= len <= capacity
 */
int push(int *a, int capacity, int len, int value) {
    if (len == capacity) {
        printf("Warning: Can't push %d: array is at capacity (%d).\n", value, capacity);
        return len;
    }
    a[len] = value;
    return len + 1;
}

/**
 * Returns new len.
 * prereq: 0 <= len <= capacity
 */
int pop(int *a, int len) {
    if (len == 0) {
        printf("Warning, can't pop: array is empty.\n");
        return len;
    }
    // just leave the value there! :-)
    return len - 1;
}

/**
 * Prints `a`'s values in [0, len).
 * prereq: 0 <= len <= capacity
 */
void display(int *a, int capacity, int len) {
    printf("int* array at %p, capacity = %d, len = %d\n", (void*)a, capacity, len);
    for (int i = 0; i < len; i++) {
        printf("a[%d] = %d\n", i, a[i]);
    }
    printf("\n");
}

/**
 * Fill all of `a` with 0.
 */
void zero(int *a, int capacity) {
    for (int i = 0; i < capacity; i++) {
        a[i] = 0;
    }
}

int main() {
    // dynamic stack array
    int capacity = 10;
    int len = 0;
    int a[capacity]; // can't init (= {0}) here as dynamic
    // zero(a, capacity); // instead. (optional.)

    display(a, capacity, len); // nothing

    for (int i = 0; i < capacity + 1; i++) {
        len = push(a, capacity, len, 100 + i);
    }

    display(a, capacity, len); // 100 -- 109

    for (int j = 0; j < 3; j++) {
        len = pop(a, len);
    }

    display(a, capacity, len); // 100 -- 106

    return 0;
}

Doing it as a raw array on the stack makes the pain points immediately obvious:

  • Not being able to expand the array. (Allocating heap memory so it’s expandable makes sense.)
  • Having to pass all metadata as function args is cumbersome. (Structs make sense.)

Raw (static, fixed-length) arrays are great as a densely-packed specialized data storage. But they’re just a block of contiguous memory, not really a general purpose data structure.

Dynamic Array

Let’s analyze a basic heap-allocated array that grows by doubling:

Time complexity

| Operation | Complexity | Notes |
| --- | --- | --- |
| Index access | O(1) | |
| Value access | O(n) | search |
| Insert at end | O(1) amortized | if grows, copy all, but amortized (see below) |
| Insert at index | O(n) | need to shift everything after |
| Delete at end | O(1) | |
| Delete at index | O(n) | need to shift everything after |
| Update | O(1) | |

Insert at end, amortized: inserts that grow the array require copying all the old elements over (O(n)). The amortized time depends on how often that happens, i.e., on the growing policy. E.g., growing by 1 is bad and leads to O(n) amortized, but doubling leads to O(1) amortized. Longer note below.

One idea for growing without copying would be extending to a more complex data structure that stitches multiple contiguous memory regions together. However, this is probably more trouble than it’s worth, and could affect cache performance.
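As a minimal sketch of such a doubling dynamic array in C (the `DynArray` name and API are mine, not any standard library; error handling for failed allocations omitted):

```c
#include <stdlib.h>

// Sketch of a heap-allocated dynamic array of ints: a buffer plus the
// metadata (capacity, length) discussed above.
typedef struct {
    int *data;
    int len;       // number of elements in use
    int capacity;  // number of elements allocated
} DynArray;

DynArray da_new(void) {
    DynArray a = { malloc(sizeof(int)), 0, 1 };
    return a;
}

// Insert at end: O(1) normally; O(n) when full, since doubling copies
// everything over (via realloc). Doubling makes this O(1) amortized.
void da_push(DynArray *a, int value) {
    if (a->len == a->capacity) {
        a->capacity *= 2;
        a->data = realloc(a->data, a->capacity * sizeof(int));
    }
    a->data[a->len] = value;
    a->len++;
}

void da_free(DynArray *a) {
    free(a->data);
    a->data = NULL;
    a->len = a->capacity = 0;
}
```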

Space complexity

| Type | Complexity | Notes |
| --- | --- | --- |
| Overhead | O(1) | pointer, capacity, length |

Special operations

(Also time complexity)

| Operation | Complexity | Notes |
| --- | --- | --- |
| Traversal | O(n) | visit all elements |
| Merge | O(n) | to copy everything over (see below) |
| Split, dangerous hack | O(1) | just point to index k and partition capacities (see below) |
| Split | O(n) | copy everything from k to n-1 into a new array |
| Min/max | O(n) | visit all elements |

Merge: stitching two arrays together without copying would require higher-level bookkeeping for traversal and the separate memory fragments.

Split hack: This is a cool thing you could do in C. The problem in practice, however, is that freeing the first piece (index 0) frees both, and freeing the second (index k) will likely upset malloc: malloc only knows about the start addresses it handed out. For this to work, you’d need some higher-level system managing these pieces to ensure you don’t corrupt memory. This makes most sense as a distinct concept: views of an array, rather than an actual split. A split implies a new, fully-functioning, first-class array.
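A sketch of that “view” idea in C (the names are mine): a view is just a non-owning pointer plus a length, so “splitting” is O(1) and there’s nothing new to free.

```c
// A non-owning view into someone else's array: a pointer plus a length.
// The owner still frees the original allocation; views never do.
typedef struct {
    int *data;
    int len;
} IntView;

// O(1) "split": two views that partition a[0..len) at index k.
// No copying, no new allocation.
void split_view(int *a, int len, int k, IntView *left, IntView *right) {
    left->data = a;
    left->len = k;
    right->data = a + k;
    right->len = len - k;
}
```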

Hash Map

Let’s consider a hash map.

First, we’ll assume the existence of a good hash function and compression function: the hash function maps an item to a large, well-scrambled integer, and the compression function maps that integer down into a valid index of the backing array.

I’m thinking this is backed by an array because an array gives us O(1) access once we’ve gotten the compressed hash.
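As a concrete sketch of that two-step pipeline in C: djb2 (a real, widely known string hash) followed by modulo compression. K here is just whatever the backing array’s size happens to be.

```c
#include <stddef.h>

// Hash: map a key to a large, well-scrambled integer (djb2 here).
unsigned long hash_str(const char *s) {
    unsigned long h = 5381;
    for (; *s != '\0'; s++) {
        h = h * 33 + (unsigned char)*s;
    }
    return h;
}

// Compression: squeeze the hash into a valid backing-array index [0, K).
size_t compress(unsigned long h, size_t K) {
    return h % K;
}
```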

I understand why hashes need to be stable. If an item’s hash changes, you wouldn’t be able to locate it in the map anymore.

Now, we can take any item we’d like to store and produce an index in [0, K) of a backing array. Each entry in the array could store the item itself, or a marker for “empty.”

So far I’ve been assuming we want to store and retrieve an item I. This seems suitable to implement a set: O(1) lookup and retrieval. I think we also need an equality check for items, because after a collision we need to know whether the candidate and existing entry match.

However, we may want to hash user-provided key-value pairs (K, V) to implement a mapping. In this case, I think we simply hash only the key, and explicitly store the key alongside the value in the data structure, so that we can check the key for equality on lookup.

Growing. If we track the number of elements in the data structure, we can decide to grow it at some fullness threshold. (“Pressure?” I believe the usual term is load factor.) Naively, we’d allocate a larger backing array and re-insert (rehash) every existing entry into it.

Iteration.

Handling open addressing. Open addressing, I believe, means that upon a collision you walk forward until you find an opening. I’ve intuitively never liked this because it makes the top-level data structure messier.
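A minimal open-addressing (linear probing) sketch in C, for a fixed-size set of strings (no growing or deletion; the names and table size are mine). Note the equality check (strcmp) mentioned earlier: after a collision it’s how we tell “already present” from “keep walking.”

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define K 8  // fixed backing-array size, just for this sketch

static const char *slots[K];  // NULL marks an empty slot

// Toy hash (djb2) plus modulo compression into [0, K).
static size_t slot_for(const char *s) {
    unsigned long h = 5381;
    for (; *s != '\0'; s++) h = h * 33 + (unsigned char)*s;
    return h % K;
}

// Linear probing: on collision, walk forward (wrapping around) until we
// find the item or an empty slot. Returns false if the table is full.
bool set_insert(const char *s) {
    size_t i = slot_for(s);
    for (size_t probes = 0; probes < K; probes++) {
        if (slots[i] == NULL) { slots[i] = s; return true; }
        if (strcmp(slots[i], s) == 0) return true;  // already present
        i = (i + 1) % K;
    }
    return false;  // full
}

bool set_contains(const char *s) {
    size_t i = slot_for(s);
    for (size_t probes = 0; probes < K; probes++) {
        if (slots[i] == NULL) return false;
        if (strcmp(slots[i], s) == 0) return true;
        i = (i + 1) % K;
    }
    return false;
}
```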

Probing strategies

ok, i’m working through reinventing a hash map from scratch (i have a CS background just refreshing). keep answers short, and don’t give away information, i’m trying to reinvent and think through it myself, i just want small nudges.

Amortization

Amortization averages rare expensive operations with common cheap operations.

For example, growing an array of length n happens after n elements (filling it), and has a cost of n (copying everything over). The “obvious” thing to do is double the array’s size each time, which keeps the expensive operation rare. Hand-waving the math: for n elements we get n O(1) costs, then one O(n) cost, for a total of 2n = O(n) operations, i.e., an average of O(n)/n = O(1) per operation.

But critical to an actual analysis, I think, is the growing policy. Because imagine we only grow the array by 1 each time, rather than doubling it. Then, every new addition would take O(n), and O(n) would quickly dominate as the actual append cost every time. This is why some equations are necessary to actually measure the expensive operation frequency. I believe what you can do here is, rather than looking at a starting size (e.g., 10), you say: assume we have any starting size, an expansion policy, and we’ve done n pushes. How many times will we have doubled to support n pushes? Now, how much will that have cost? (For this case, you use a geometric series.)
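To make the policy comparison concrete, here’s a toy simulation (my own sketch; it only counts, nothing is really allocated) of the total element copies caused by growth while pushing n elements:

```c
// Doubling policy: copies are 1 + 2 + 4 + ... (a geometric series),
// which stays below 2n in total, i.e., O(1) amortized per push.
long copies_doubling(long n) {
    long cap = 1, len = 0, copies = 0;
    for (long i = 0; i < n; i++) {
        if (len == cap) { copies += len; cap *= 2; }
        len++;
    }
    return copies;
}

// Grow-by-1 policy: copies are 1 + 2 + 3 + ... + (n-1) = n(n-1)/2,
// i.e., O(n) amortized per push.
long copies_grow_by_one(long n) {
    long cap = 1, len = 0, copies = 0;
    for (long i = 0; i < n; i++) {
        if (len == cap) { copies += len; cap += 1; }
        len++;
    }
    return copies;
}
```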

(Wikipedia says there are three formal methods of analysis you can choose from to get the correct answer: “aggregate,” “accounting,” and “potential.” I don’t want to dig into amortized analysis further right now because I don’t yet find it fun or interesting.)

One interesting distinction is amortized analysis vs. average-case analysis. They sound similar, and amortized analysis does involve averaging. The key distinction is that average-case analysis considers (a distribution over) actual inputs, whereas amortized analysis is deterministic and involves counting operations.

post info


Published Jul 31, 2025
Disclaimer This is an entry in the garage. It may change or disappear at any point.