Fast substring search, tree serialization order, playing cartpole by hand
Week 5, day 4 at Recurse F2’25
substring search is naively O(mn), but can be done in O(m + n) with O(1) space
Say:
- len(haystack) = n
- len(needle) = m
Naively, O(mn):
- for each position in the haystack as a start
- check whether each character of the needle matches
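A minimal sketch of the naive version (function name is mine, not Python's actual implementation):

```python
def naive_find(haystack: str, needle: str) -> int:
    """Return the index of the first occurrence of needle, or -1. O(mn) time."""
    n, m = len(haystack), len(needle)
    for start in range(n - m + 1):               # each possible start in the haystack
        if haystack[start:start + m] == needle:  # compare up to m characters
            return start
    return -1
```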
Fancier techniques precompute information into tables, jump around cleverly, and/or use hashing.
Exhaustive resource:
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Wikipedia has a whole page, here’s the table: https://en.wikipedia.org/wiki/String-searching_algorithm#Single-pattern_algorithms
python’s implementation has a short note about their approach: https://github.com/python/cpython/blob/main/Objects/stringlib/fastsearch.h#L5-L8
… and a much longer writeup with lots of examples: https://github.com/python/cpython/blob/main/Objects/stringlib/stringlib_find_two_way_notes.txt
This is a cool instance where Python’s built-in approach:
- looks at the size of your inputs
- guesses the fastest algorithm to use for your case
- implements sophisticated algorithm(s) to do so
- and provides a writeup of how they did it
If I ever wanted to implement one, it seems like Rabin-Karp might be the simplest. It uses a rolling hash:
- O(m) - hash needle
- O(m) - hash first m of haystack
- O(n) - slide a window of size m over the haystack (really O(n - m + 1) window positions)
  - O(1) per slide: remove the oldest character from the hash and add the new one
- O(m) on a hash match, check all characters to verify (ideally just once)
So worst case O(nm) time with a bad hash and lots of collisions, though it ought to be O(n + m) in the expected case.
O(1) space, as each hash is a single number.
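If I did implement it, a sketch might look like this (the base and modulus are arbitrary picks, not from any particular library):

```python
def rabin_karp(haystack: str, needle: str) -> int:
    """Return the index of the first occurrence of needle, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    base, mod = 256, (1 << 61) - 1       # arbitrary rolling-hash parameters
    top = pow(base, m - 1, mod)          # weight of the character leaving the window

    h_needle = h_window = 0
    for i in range(m):                   # O(m): hash the needle and the first window
        h_needle = (h_needle * base + ord(needle[i])) % mod
        h_window = (h_window * base + ord(haystack[i])) % mod

    for start in range(n - m + 1):       # O(n - m + 1) window positions
        if h_window == h_needle and haystack[start:start + m] == needle:
            return start                 # O(m) verification on a hash match
        if start + m < n:                # O(1): roll the hash forward one character
            h_window = ((h_window - ord(haystack[start]) * top) * base
                        + ord(haystack[start + m])) % mod
    return -1
```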
tree serialization order matters
Assume you are serializing a binary tree, and you render the nulls. You can do pre-order, in-order, or post-order traversals. Which one(s) will give you an unambiguous representation of the tree?
| order | unambiguous serialization with nulls? |
|---|---|
| pre | yes |
| in | no |
| post | yes |
This was surprising to me. And I still have a rough intuition of parts of it, but not of the whole thing.
It makes sense that pre- and post-order traversals make unambiguous serializations, because they can be used to unambiguously represent pre/post-fix math expressions.
- In an expression, operators are internal nodes, and numbers are leaves, which tells you when to pop in/out.
- In a tree, nulls tell you where leaves are, which also tells you the structure.
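To make that concrete, here's a sketch of pre-order serialization with nulls and its inverse (my own helper names): the round trip is unambiguous because each recursive call consumes exactly the tokens for its own subtree.

```python
from typing import Optional

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def serialize_preorder(node: Optional[Node], out: list) -> list:
    """Pre-order traversal, rendering nulls as None."""
    if node is None:
        out.append(None)
    else:
        out.append(node.val)
        serialize_preorder(node.left, out)
        serialize_preorder(node.right, out)
    return out

def deserialize_preorder(tokens: list) -> Optional[Node]:
    """Rebuild the tree by consuming tokens left to right."""
    it = iter(tokens)
    def build():
        val = next(it)
        if val is None:
            return None
        return Node(val, build(), build())
    return build()

# e.g. B with left child A serializes to ['B', 'A', None, None, None]
```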
With in-order traversal, a node’s position depends on its subtrees, which makes serialization ambiguous. Both of the following two trees have an in-order traversal of [null, A, null, B, null]:
Tree 1:    B      Tree 2:  A
          /                 \
         A                   B
So while I can explain why pre and post should work, I still only have a vague intuition for why exactly they do, and likewise for why in-order does not.
(The difference between saying true words which correctly explain something, and having the actual intuition click, remains interesting for me.)
What if you don’t have nulls? Pre- and post-order traversals no longer unambiguously serialize a tree. But if you have pre + in, or post + in, you can do it. There are even leetcode problems for this:
w/o nulls: pre + in → tree
https://leetcode.com/problems/construct-binary-tree-from-preorder-and-inorder-traversal/description/
w/o nulls: in + post → tree
https://leetcode.com/problems/construct-binary-tree-from-inorder-and-postorder-traversal/description/
(I may try these tomorrow if I still have the enthusiasm.)
What about pre + post (still no nulls)? It’s not enough. When a node has one child, you can’t tell where it is.
Tree 1:    1      Tree 2:  1
          /                 \
         2                   2
For both: pre: [1, 2], post: [2, 1].
cart pole
Me attempting to play by hand, with modified environment.
summary of tweaks / learnings:
- increased angle before failure
- drew x guides
- drew score
- track turn
- save best score
- subclassed env & registered custom
- learned about registering, rendering, randomness, the basic API, and the bogus reward_threshold
- made humans play. it is hard.
human performance
| who | score |
|---|---|
| Chris | 277 |
| Dan | 93 |
| Alex | 83 |
| me | 188 |
| Ben | pretty good, wasn’t tracking yet |
(this was actually with render() accidentally being called twice: once explicitly, plus the call I didn't realize step() makes automatically. it is WAY harder when you only render once.)
registering custom env
subclassing an env (cart pole) to tweak it. (previously I was just editing the package contents directly.)
gymnasium envs are registered in:
# .venv/lib/python3.12/site-packages/gymnasium/
envs/__init__.py
like:
register(
id="CartPole-v0",
entry_point="gymnasium.envs.classic_control.cartpole:CartPoleEnv",
vector_entry_point="gymnasium.envs.classic_control.cartpole:CartPoleVectorEnv",
max_episode_steps=200,
reward_threshold=195.0,
)
register(
id="CartPole-v1",
entry_point="gymnasium.envs.classic_control.cartpole:CartPoleEnv",
vector_entry_point="gymnasium.envs.classic_control.cartpole:CartPoleVectorEnv",
max_episode_steps=500,
reward_threshold=475.0,
)
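A sketch of how subclassing and registering might look (the tweak to theta_threshold_radians is my example of a looser failure angle; check that attribute name against your gymnasium version):

```python
import math
import gymnasium as gym
from gymnasium.envs.classic_control.cartpole import CartPoleEnv

class EasierCartPoleEnv(CartPoleEnv):
    """CartPole with a more forgiving failure angle."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # stock CartPole fails at 12 degrees; allow 24 (assumes the attribute name hasn't changed)
        self.theta_threshold_radians = 24 * 2 * math.pi / 360

gym.register(
    id="EasierCartPole-v0",
    entry_point=EasierCartPoleEnv,  # register also accepts a class, not just a "module:Class" string
    max_episode_steps=500,
)

env = gym.make("EasierCartPole-v0", render_mode="human")
```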
reward_threshold is strange.
reward_threshold
reward_threshold doesn’t appear to actually do anything! It’s just metadata that’s saved into the spec (the EnvSpec), and is available for querying.
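For example, reading it back off the spec (a quick sketch):

```python
import gymnasium as gym

print(gym.spec("CartPole-v1").reward_threshold)  # 475.0

env = gym.make("CartPole-v1")
print(env.spec.reward_threshold)                 # same value, via the env's own spec
```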
Claude was very emphatic after its little grepping:
- Does NOT stop training
- Does NOT cap rewards
- Does NOT send signals anywhere
- Does NOT trigger any behavior
I guess it makes sense as metadata an environment creator would provide to agent builders to communicate what counts as success. For cart pole it seems trivial: the reward is +1 per step and episodes are capped at 500 steps, so the reward threshold is closely tied to the step cap (and really ought to be about 500). But for other environments it's not necessarily obvious.
Actually, there's a bug. This reward_threshold is not correct, because they've added a sutton_barto_reward option (passed when you make the env), which gives 0 per step and -1 on failure. Under that reward, an agent of any skill ends the episode with a total reward of -1. (It's interesting that the total reward under Sutton & Barto's formulation isn't a sign of whether the agent has learned!) In other words, an option that changes the reward scheme isn't compatible with a fixed, upfront, static reward_threshold. (Good thing it… doesn't do anything.)
Other arg notes:
- it's nice that `order_enforce` defaults to `True`, to make sure you aren't calling things wrong
randomness
how is randomness controlled? the base Env’s reset() takes a seed, and then sets two attrs
# in Env (base environment class)
self._np_random, self._np_random_seed = seeding.np_random(seed)
this is stateful; their docs say:
> Therefore, :meth:`reset` should (in the typical use case) be called with a seed right after initialization and then never again. For Custom environments, the first line of :meth:`reset` should be `super().reset(seed=seed)` which implements the seeding correctly.
the original cartpole accesses this with
# in CartPoleEnv (original)
self.state = self.np_random.uniform(low=low, high=high, size=(4,))
… which is actually implemented as a property
# in Env (base class)
@property
def np_random(self) -> np.random.Generator:
"""Returns the environment's internal :attr:`_np_random` that
if not set will initialise with a random seed.
Returns:
Instances of `np.random.Generator`
"""
if self._np_random is None:
self._np_random, self._np_random_seed = seeding.np_random()
return self._np_random
What I’m trying to figure out is: by just always calling reset like this:
for episode in range(episodes):
env.reset()
# ...
… am I doing it wrong? Both (a) calling it multiple times, (b) with no explicit seed ever?
It seems like this is fine:
- it's normal to repeatedly call `reset()`
- when they say "never again," they only meant passing the seed
- not passing a seed just means a seed will be organically chosen, so the run won't be repeatable
if you call it with the same seed each time, you get the exact same starting observation
# episodes 1, 2, 3
Starting observation: [ 0.01823519 -0.0446179 -0.02796401 -0.03156282]
Starting observation: [ 0.01823519 -0.0446179 -0.02796401 -0.03156282]
Starting observation: [ 0.01823519 -0.0446179 -0.02796401 -0.03156282]
...
… if you instead pass the seed only on the first episode, then None on subsequent ones, you get the same sequence of starting observations across runs:
# episodes 1, 2, 3
Starting observation: [ 0.01823519 -0.0446179 -0.02796401 -0.03156282]
Starting observation: [-0.03240941 0.03120945 0.0423345 -0.02234256]
Starting observation: [ 0.03197546 0.03898927 0.00129705 -0.02550354]
...
# next run, episodes 1, 2, 3
Starting observation: [ 0.01823519 -0.0446179 -0.02796401 -0.03156282]
Starting observation: [-0.03240941 0.03120945 0.0423345 -0.02234256]
Starting observation: [ 0.03197546 0.03898927 0.00129705 -0.02550354]
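For reference, roughly the loops that produce those two patterns (seed value and episode count are arbitrary):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

# same seed every episode -> identical starting observation each time
for episode in range(3):
    obs, info = env.reset(seed=42)
    print("Starting observation:", obs)

# seed only the first episode, None afterwards -> same *sequence* of starts across runs
for episode in range(3):
    obs, info = env.reset(seed=42 if episode == 0 else None)
    print("Starting observation:", obs)
```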
I got into all this because there’s a nondeterministic flag when registering the environment, which made me wonder whether CartPole actually implemented deterministic seeded “randomness.” And it does.
rendering
Rendering is basically:
- either don't render, or
- draw with pygame, then either:
  - render to the window, or
  - render to pixels & return them
def render(self):
    if self.render_mode is None:
        return
    # (omitted: draw with pygame)
    if self.render_mode == "human":
        # update pygame window
        return None
    else:
        np_pixels = ...  # pygame surface -> numpy pixel array
        return np_pixels
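Usage-wise, the mode is chosen at make() time; a quick sketch of the two rendering paths:

```python
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode="rgb_array")
obs, info = env.reset()
frame = env.render()      # numpy array of pixels, shape (height, width, 3)

env = gym.make("CartPole-v1", render_mode="human")
obs, info = env.reset()   # "human" mode draws to a pygame window; render() returns None
```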
all methods?
custom envs have a pretty minimal API!
note that:
- the first return of `step()` is "observations", but for now that's just the next state
- `step()` calls `render()`
class CustomEnv(Env):
    def __init__(self): ...
    def reset(self, seed=None) -> (obs, info): ...
    def step(self, action) -> (obs, reward, terminated, truncated, info): ...
    def render(self) -> Optional[pixels]: ...
    def close(self): ...
vector env
TBD, but seems important for speed. there’s CartPoleVectorEnv(VectorEnv)