It’s all just bits in memory.
I’ll be assuming CPython throughout.
TODO: investigate
- dict
list
Python’s list is a plain old dynamic array.
Some interesting notes:
- Each element is a Python object pointer (not the actual data)
- The growth strategy is 9/8 (112.5%) + 6, then rounded down to a multiple of 4. This is a more conservative growth strategy than doubling (200%).
- I think this still gives amortized constant-time per append.
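That growth strategy is easy to sketch. The expression below is copied from `list_resize()` in `listobject.c` (minus the C casts); the little simulation around it is my own, and it reproduces the growth pattern documented in the source comments:

```python
def new_allocated(newsize):
    # CPython's formula from list_resize() in listobject.c:
    # grow by ~1/8 plus a small constant, then round down to a multiple of 4.
    return (newsize + (newsize >> 3) + 6) & ~3

# Simulate appends to an initially empty list, recording each new capacity.
caps, cap = [], 0
for n in range(1, 70):
    if n > cap:  # out of room: resize
        cap = new_allocated(n)
        caps.append(cap)
print(caps)  # [4, 8, 16, 24, 32, 40, 52, 64, 76]
```

That matches the 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, … pattern quoted below from the source.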
Deep-Dive: List Source Code
Source https://github.com/python/cpython/blob/main/Objects/listobject.c
Quite readable. Observations from the code overall:
- wow, there’s a lot of code protected by `#ifdef Py_GIL_DISABLED`, I bet a lot of maintainers hated this
- garbage collection, allocation, and freeing take up a bunch of the other code implementing operations
- the rest is pretty straightforward, readable code implementing operations as you’d expect
- they use `goto` for errors
- it’s striking in that if you were to implement a language in C with garbage collection and objects, this is probably what it’d look like. I didn’t expect that. I guess I thought there’d be much more “opinionated” code or something. But instead, it feels like, if you had very similar goals to Python, you’d end up implementing something that looks a lot like Python’s source code.
- there’s a bunch of `Py_BEGIN_CRITICAL_SECTION(...)`/`Py_END_CRITICAL_SECTION()`, I’m curious what these do.
- the list uses “allocated” for capacity and “size” for length.
- I forgot about the joy of C, where you can’t tell from the `#include`s where stuff comes from. E.g., `PyMem_Malloc`, where are you from? (Editor’s note: see Python Memory Management if curious.)
- checking the blame of the source code is wild
- nearly every line has been changed
- a huge amount have been most recently changed by the GIL removal effort
- whole blocks of design comments are from Guido’s “Initial revision” commit 35 years ago
Here’s an interesting comment on allocations, especially since I was just looking up amortized time for growing dynamically-sized arrays depending on growth strategy:
> This over-allocates proportional to the list size, making room for additional growth. The over-allocation is mild, but is enough to give linear-time amortized behavior over a long sequence of appends() in the presence of a poorly-performing system realloc(). Add padding to make the allocated size multiple of 4. The growth pattern is: 0, 4, 8, 16, 24, 32, 40, 52, 64, 76, …

– source
Linear-time amortized behavior! From the phrasing, it sounds like linear-time here is a good result. I would have expected it to be bad, in contrast with constant time (from doubling). However, I’m sure it’s a tradeoff with memory inefficiency, and likely there’s better performance with a good system `realloc()`. (I’d guess best-case `realloc` is that it has great performance, even O(1), if it can either directly expand the memory region, or copy (map?) the old contents over very quickly.)
Update: the phrasing here confused me, but I think it’s saying that it’s amortized linear time O(n) for n appends, i.e., it’s amortized O(1) per append. This checks out from an (arguing with LLMs) analysis, which is that as long as we’re growing by a rate that doesn’t depend on n, we’ll amortize out to constant time per append.
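To sanity-check that, here’s a small simulation (my own, not CPython code) that counts elements copied across resizes, pessimistically assuming every resize copies the whole list (i.e., `realloc` never extends in place). Total copy work stays linear in the number of appends, so the per-append cost is amortized O(1) even with this mild growth rate:

```python
def total_copies(n_appends):
    # Count elements copied during resizes over n_appends appends,
    # worst case: every resize copies all current elements.
    cap, length, copies = 0, 0, 0
    for _ in range(n_appends):
        if length == cap:
            # CPython's growth formula, applied to the new size (length + 1)
            cap = ((length + 1) + ((length + 1) >> 3) + 6) & ~3
            copies += length
        length += 1
    return copies

for n in (1_000, 100_000):
    print(n, total_copies(n) / n)  # ratio stays bounded (single digits)
```

The ratio hovers around 8–9 (roughly 1/(1 − 1/1.125)), versus ~1 for doubling: the tradeoff the source comment alludes to.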
This also makes me wonder: is resizing an array actually often cheaper than O(n), because reallocating might have O(1) performance? I could imagine this straying from amortized analysis to average-case analysis, because I’d guess whether reallocation can trivially expand the same region depends on the sizes of the old and new regions. E.g., I bet smaller reallocations are more likely to succeed without copying than larger ones due to fragmentation.
Update: Surprisingly, despite `realloc` seeming like a really efficient choice, it seems unlikely it’s used. Python’s newer memory manager mimalloc implements reallocation by allocating and copying. The core lines are:

```c
// from _mi_heap_realloc_zero(...)
void* newp = mi_heap_malloc(heap, newsize);
const size_t copysize = (newsize > size ? size : newsize);
_mi_memcpy(newp, p, copysize);
mi_free(p);
```

I think this may be because custom memory managers (mimalloc replaces malloc/realloc/etc.) use `mmap`/`munmap` as the underlying memory primitive, and while `mremap` exists as an `mmap`-flavored alternative to `realloc`, it’s Linux-only, so might not be relied on for a general-purpose cross-system library. This all means that resizing really is going to be O(n) rather than O(1).
I would have expected, with all the layers of allocators, that something like a list would draw from a pool with ready-to-expand memory regions. But it seems like a list goes straight to the “raw” memory API.
The code references “system realloc()”, but interestingly calls `PyMem_Realloc`.
. This makes sense because Python is multi-OS, so they’d need to abstract over system calls. I’m curious how much of their own memory management they do. Like whether they ask for large chunks of memory from the system and then manage it themselves with various pools, or if they just allocate on demand.
Update: The former. (But also, they do allocate on demand!) Deeper dive in Python Memory Management
One other question is: where is the length (size) kept? The list C data structure looks to simply be:

```c
// listobject.c
typedef struct {
    Py_ssize_t allocated;
    PyObject *ob_item[];
} _PyListArray;
```
`allocated` is the capacity, and then there’s the backing array of pointers `ob_item`. But where’s the length (i.e., actually filled elements)?
… well, that definition and its usage are always within `#ifdef Py_GIL_DISABLED` guards. So I think that’s the “no GIL” version. Let’s instead trace the standard version, which is found in the header file:
```c
// listobject.h
typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements. list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements. The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;
```
What!? Same problem! We have `allocated` and then the backing array of pointers `ob_item`. The magic is in `PyObject_VAR_HEAD`. I think this gets defined to always be a variable of type `struct PyVarObject` named `ob_base`. A field in the `PyVarObject` struct is `ob_size`, which contains the length.
```c
// object.h
/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects. ... */
#define PyObject_VAR_HEAD      PyVarObject ob_base;

struct PyVarObject {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
};
typedef struct PyVarObject PyVarObject;
```
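As a (fragile, for-fun) sanity check, you can mirror this layout from Python itself with `ctypes` and watch `ob_size` and `allocated` diverge. This is my own sketch, and it assumes a standard GIL build of CPython, where the object header is just a refcount plus a type pointer:

```python
import ctypes

class RawPyList(ctypes.Structure):
    # Hand-mirrored PyListObject layout (standard GIL build assumed:
    # PyObject_VAR_HEAD expands to ob_refcnt + ob_type + ob_size).
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),                # length: len(lst)
        ("ob_item", ctypes.POINTER(ctypes.c_void_p)), # array of object pointers
        ("allocated", ctypes.c_ssize_t),              # capacity
    ]

lst = []
raw = RawPyList.from_address(id(lst))  # id() is the object's address in CPython
for i in range(10):
    lst.append(i)
print(len(lst), raw.ob_size, raw.allocated)  # length 10, capacity >= 10
```

On my reading of the growth pattern, 10 appends should leave `allocated` at 16: capacity steps 4, 8, 16.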
id()
Small aside, but `id()` actually gives the object’s C pointer. This is cool. Some notes on this:
- object pointers are stable for their lifetime. They don’t move around in memory. (Cf. a comment from Guido that lives on, unchanged, 35 years later in the codebase, from the initial commit!)
- small ints and common strings are immortal fixed constant objects
- garbage collection, obviously
- virtual mem addr, obviously
- CPython implementation detail
- aggressive alignment (e.g., to 8 bytes) may make things look bigger than they are (least sure about details on this one)
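A couple of those bullets are easy to poke at directly (all CPython implementation details, of course):

```python
# In CPython, id() is the object's address, and objects never move,
# so an object's id is stable for its whole lifetime:
x = []
addr = id(x)
x.extend(range(1000))  # the backing pointer array may be reallocated...
print(id(x) == addr)   # ...but the object header stays put: True

# Small ints (-5 through 256) are cached singletons in CPython.
# int("...") forces creation at runtime, sidestepping constant folding:
print(int("256") is int("256"))  # True: both are the cached 256 object
print(int("257") is int("257"))  # False: 257 is outside the cache
```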