General CS Notes

  1. Data Structures
  2. Python Data Structures
  3. Python Memory Management
  4. The Hardware-Software Interface

Python Data Structures

It’s all just bits in memory.

I’ll be assuming CPython throughout.

TODO: investigate

list

Python’s list is a plain old dynamic array.

Some interesting notes:

Deep-Dive: List Source Code

Source: https://github.com/python/cpython/blob/main/Objects/listobject.c

Quite readable. Observations from the code overall:

Here’s an interesting comment on allocations, especially since I was just looking up amortized time for growing dynamically-sized arrays depending on growth strategy:

This over-allocates proportional to the list size, making room for additional growth. The over-allocation is mild, but is enough to give linear-time amortized behavior over a long sequence of appends() in the presence of a poorly-performing system realloc(). Add padding to make the allocated size multiple of 4. The growth pattern is: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, …

source
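The growth pattern quoted above can be reproduced with the formula from `list_resize` in `listobject.c` (over-allocate to roughly `newsize + newsize/8 + 6`, rounded down to a multiple of 4). A minimal sketch, assuming the current formula; it has changed slightly across CPython versions:

```python
# Sketch of CPython's list growth formula (from list_resize in listobject.c):
# over-allocate to newsize + newsize/8 + 6, then mask off the low two bits
# so the capacity is a multiple of 4.
def new_allocated(newsize: int) -> int:
    return (newsize + (newsize >> 3) + 6) & ~3

# Simulate repeated appends, recording each capacity the list grows through.
caps = [0]
for n in range(1, 70):
    if n > caps[-1]:              # list is full: grow
        caps.append(new_allocated(n))

print(caps)  # [0, 4, 8, 16, 24, 32, 40, 52, 64, 76]
```

This matches the comment's sequence exactly: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76.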

Linear-time amortized behavior! From the phrasing, it sounds like linear-time here is a good result. I would have expected it to be bad, in contrast with constant time (from doubling). However, I’m sure it’s a tradeoff with memory inefficiency, and likely there’s better performance with a good system realloc(). (I’d guess best-case realloc is that it has great performance, even O(1), if it can either directly expand the memory region, or copy (map?) the old contents over very quickly.)

Update: the phrasing here confused me, but I think it’s saying that it’s amortized linear time O(n) for n appends, i.e., it’s amortized O(1) per append. This checks out from an (arguing with LLMs) analysis, which is that as long as we’re growing by a rate that doesn’t depend on n, we’ll amortize out to constant time per append.
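You can watch this from Python directly: `sys.getsizeof` on a list reflects the allocated capacity, so it only jumps on the occasional resize while most appends are free. A quick empirical check (the exact jump points vary by CPython version):

```python
import sys

# Empirically observe over-allocation: getsizeof jumps only occasionally,
# so most appends do no resizing work at all (amortized O(1) per append).
jumps = []
lst = []
prev = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    cur = sys.getsizeof(lst)
    if cur != prev:               # a resize happened on this append
        jumps.append(len(lst))
        prev = cur

print(jumps)  # lengths at which a resize occurred (version-dependent)
```

For 64 appends there are only a handful of resizes, which is the amortized-constant behavior in action.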

This also makes me wonder: is resizing an array actually often cheaper than O(n), because reallocating might have O(1) performance? I could imagine this straying from amortized analysis to average-case analysis, because I’d guess whether reallocation can trivially expand the same region depends on the sizes of the old and new regions. E.g., I bet smaller reallocations are more likely to succeed without copying than larger ones due to fragmentation.

Update: Surprisingly, despite realloc seeming like a really efficient choice, it seems unlikely that it's actually used. Python’s newer memory manager, mimalloc, implements reallocation by allocating and copying. The core lines are:

```c
// from _mi_heap_realloc_zero(...)
void* newp = mi_heap_malloc(heap, newsize);
const size_t copysize = (newsize > size ? size : newsize);
_mi_memcpy(newp, p, copysize);
mi_free(p);
```
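For intuition, here’s a Python sketch of that same allocate-copy-free strategy, a hypothetical model of the C above using bytearrays in place of raw heap blocks:

```python
# Hypothetical model of mimalloc's realloc-by-copy: allocate a fresh block,
# copy the smaller of the two sizes, and let the old block be freed.
def realloc_by_copy(p: bytearray, newsize: int) -> bytearray:
    newp = bytearray(newsize)           # mi_heap_malloc: fresh block
    copysize = min(len(p), newsize)     # copy whichever size is smaller
    newp[:copysize] = p[:copysize]      # _mi_memcpy
    return newp                         # caller drops p (mi_free)
```

Shrinking truncates; growing zero-fills the tail (bytearray zero-fills by default, which loosely matches the `_zero` variant quoted above). Either way, the cost is proportional to the bytes copied.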

I think this may be because custom memory managers (mimalloc replaces malloc/realloc/etc.) use mmap/munmap as the underlying memory primitive, and while mremap exists as an mmap-flavored alternative to realloc, it’s Linux-only, so it might not be relied on for a general-purpose cross-platform library.

This all means that resizing really is going to be O(n) rather than O(1).

I would have expected, with all the layers of allocators, that something like a list would draw from a pool with ready-to-expand memory regions. But it seems like a list goes straight to the “raw” memory API.

The code references “system realloc()”, but interestingly calls PyMem_Realloc. This makes sense because Python is multi-OS, so they’d need to abstract over system calls. I’m curious how much of their own memory management they do. Like whether they ask for large chunks of memory from the system and then manage it themselves with various pools, or if they just allocate on demand.

Update: The former. (But also, they do allocate on demand!) Deeper dive in Python Memory Management

One other question is: where is the length (size) kept? The list C data structure looks to simply be:

```c
// listobject.c
typedef struct {
    Py_ssize_t allocated;
    PyObject *ob_item[];
} _PyListArray;
```

allocated is the capacity, and then there’s the backing array of pointers ob_item. But where’s the length (i.e., actually filled elements)?

… well, that definition and its usage are always within #ifdef Py_GIL_DISABLED guards. So I think that’s the “no GIL” version. Let’s instead trace the standard version, which is found in the header file:

```c
// listobject.h
typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements. list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements. The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     *     list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;
```

What!? Same problem! We have allocated and then the backing array of pointers ob_item. The magic is in PyObject_VAR_HEAD: the macro expands to a member of type PyVarObject named ob_base, and a field in the PyVarObject struct is ob_size, which holds the length.

```c
// object.h

/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects. ... */
#define PyObject_VAR_HEAD      PyVarObject ob_base;

struct PyVarObject {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
};
typedef struct PyVarObject PyVarObject;
```
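You can actually peek at ob_size from Python with ctypes. This is a hypothetical mirror of the header layout above, assuming a standard 64-bit, GIL-enabled CPython (the field layout differs in free-threaded builds), purely for illustration:

```python
import ctypes

# ctypes mirror of PyObject + PyVarObject (assumption: standard GIL build,
# where PyObject is ob_refcnt followed by an ob_type pointer).
class PyVarObject(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),  # from PyObject
        ("ob_type", ctypes.c_void_p),     # from PyObject
        ("ob_size", ctypes.c_ssize_t),    # number of items in variable part
    ]

lst = ["a", "b", "c"]
# id() is the object's address in CPython, so we can overlay the struct on it.
header = PyVarObject.from_address(id(lst))
assert header.ob_size == len(lst) == 3
```

So len() on a list is just a field read off the object header, which is why it’s O(1).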

id()

Small aside, but id() actually gives the object’s C pointer. This is cool. Some notes on this:
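One quick way to see this, using ctypes to round-trip the address back to the object (CPython-specific, and not something to do in real code):

```python
import ctypes

# ctypes.cast treats the integer id() as a PyObject*, recovering the
# exact same object — evidence that id() really is the C pointer.
x = [1, 2, 3]
same = ctypes.cast(id(x), ctypes.py_object).value
assert same is x
```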

Published Jul 31, 2025
Disclaimer This is an entry in the garage. It may change or disappear at any point.