Verbosity, Time Flies, Job Talks, Pokémon RL Rewards
Week 6, day 4 at Recurse F2’25.
Verbosity
I’ve been really enjoying writing these daily blogs, but my verbosity has taken over. I think I spent 1.5 hours writing yesterday’s (Wed) post this morning (Thu).01
The answer here, as with most problems once you write them out, is obvious: timebox.
Time Flies
Today was one of those days where time just flies, and you’re left wondering how the day is over already.
| From | To | What |
|---|---|---|
| 10am | 11am | Daily programming puzzle (tries w/ wildcards) |
| 11am | 12:30pm | Write yesterday’s daily blogpost |
| 12:30pm | 1pm | Head into Recurse |
| 1pm | 2pm | Lunch @ the hub |
| 2pm | 2:30pm | Catching up with Zulip messages |
| 2:30pm | 4pm | **High-level job talk planning** |
| 4pm | 5pm | Presentations |
| 5pm | 9:30pm | End of (others’) batch celebration + Halloween party |
Only the bolded 1.5 hours of high-level job talk planning was high-priority work.
That’s OK! It’s fine to have different balances on different days. Buuuut, even though I know that, it’s hard for me to feel satisfied with a workday that I didn’t spend productively.
Job Talks
An interesting thing you (sometimes?) do when applying for research-y jobs as a PhD graduate, at least in CS, is give a “job talk,” which is a ~1 hour talk on your research.
Now, it’s “on your research,” but I think you also need to tie the individual contributions together with a bit of a story, and provide some forward-looking vision for upcoming research you’d like to do.
I defended in December 2021, and a lot has changed in NLP in the four years since. It’s actually hard to think of a crazier four-year period for any field in history.02 So it’s going to be interesting to figure out how best to situate my past research, and to rev up the engines of looking forward in a low-level way again.
The extra challenge will be making a job talk outside the context of a lab, with no labmates or advisor to workshop it with! But such is the life I chose by taking four years outside of the research world.
Pokémon RL Rewards
I finally checked out the reward functions for the Pokémon Red RL agent(s), and they’re fascinating.
Here are the rewards that get an agent a few cities in (and the code behind Pete’s amazing video):
```python
state_scores = {
    "event": self.reward_scale * self.update_max_event_rew() * 4,
    # "level": self.reward_scale * self.get_levels_reward(),
    "heal": self.reward_scale * self.total_healing_rew * 10,
    # "op_lvl": self.reward_scale * self.update_max_op_level() * 0.2,
    # "dead": self.reward_scale * self.died_count * -0.1,
    "badge": self.reward_scale * self.get_badges() * 10,
    "explore": self.reward_scale * self.explore_weight * len(self.seen_coords) * 0.1,
    "stuck": self.reward_scale * self.get_current_coord_count_reward() * -0.05
}
```
From PWhiddy/PokemonRedExperiments. Also, the agent sees a few frames and a bunch of state encoded visually, which is a beautiful choice for ease of debugging.
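To make “state encoded visually” concrete, here’s a minimal sketch of the idea: draw a few scalar quantities (badge count, party HP, explored-tile count) as bar strips onto the bottom rows of the observation image, so a human watching a rollout sees exactly the same channels the policy network does. The function name, arguments, and layout below are my own illustration, not the repo’s actual code.

```python
import numpy as np

def encode_state_visually(frame: np.ndarray, badges: int, hp_fraction: float,
                          explored: int, max_explored: int = 5000) -> np.ndarray:
    """Hypothetical sketch: stamp scalar game state onto the observation image.

    frame is an (H, W) uint8 grayscale screen; the bottom three rows are
    overwritten with bar-style encodings that are visible to both the policy
    network and anyone eyeballing the rollout video.
    """
    obs = frame.copy()
    h, w = obs.shape
    # Row -3: badges as a bar, one segment per badge (8 max in Pokémon Red).
    obs[h - 3, :] = 0
    obs[h - 3, : int(w * badges / 8)] = 255
    # Row -2: current party HP fraction.
    obs[h - 2, :] = 0
    obs[h - 2, : int(w * np.clip(hp_fraction, 0.0, 1.0))] = 255
    # Row -1: how much of the map has been explored so far.
    obs[h - 1, :] = 0
    obs[h - 1, : int(w * min(explored, max_explored) / max_explored)] = 255
    return obs
```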
Here are the rewards that beat the game:
```python
return {
    "event": 4 * self.update_max_event_rew(),
    "explore_npcs": sum(self.seen_npcs.values()) * 0.02,
    # "seen_pokemon": sum(self.seen_pokemon) * 0.0000010,
    # "caught_pokemon": sum(self.caught_pokemon) * 0.0000010,
    "obtained_move_ids": sum(self.obtained_move_ids) * 0.00010,
    "explore_hidden_objs": sum(self.seen_hidden_objs.values()) * 0.02,
    # "level": self.get_levels_reward(),
    # "opponent_level": self.max_opponent_level,
    # "death_reward": self.died_count,
    "badge": self.get_badges() * 5,
    # "heal": self.total_healing_rew,
    "explore": sum(sum(tileset.values()) for tileset in self.seen_coords.values()) * 0.012,
    # "explore_maps": np.sum(self.seen_map_ids) * 0.0001,
    "taught_cut": 4 * int(self.check_if_party_has_hm(0xF)),
    "cut_coords": sum(self.cut_coords.values()) * 1.0,
    "cut_tiles": sum(self.cut_tiles.values()) * 1.0,
    "met_bill": 5 * int(self.events.get_event("EVENT_MET_BILL")),
    "used_cell_separator_on_bill": 5
    * int(self.events.get_event("EVENT_USED_CELL_SEPARATOR_ON_BILL")),
    "ss_ticket": 5 * int(self.events.get_event("EVENT_GOT_SS_TICKET")),
    "met_bill_2": 5 * int(self.events.get_event("EVENT_MET_BILL_2")),
    "bill_said_use_cell_separator": 5
    * int(self.events.get_event("EVENT_BILL_SAID_USE_CELL_SEPARATOR")),
    "left_bills_house_after_helping": 5
    * int(self.events.get_event("EVENT_LEFT_BILLS_HOUSE_AFTER_HELPING")),
    "got_hm01": 5 * int(self.events.get_event("EVENT_GOT_HM01")),
    "rubbed_captains_back": 5 * int(self.events.get_event("EVENT_RUBBED_CAPTAINS_BACK")),
    "start_menu": self.seen_start_menu * 0.01,
    "pokemon_menu": self.seen_pokemon_menu * 0.1,
    "stats_menu": self.seen_stats_menu * 0.1,
    "bag_menu": self.seen_bag_menu * 0.1,
    "action_bag_menu": self.seen_action_bag_menu * 0.1,
    # "blackout_check": self.blackout_check * 0.001,
    "rival3": self.reward_config["event"] * int(self.read_m("wSSAnne2FCurScript") == 4),
}
```
From drubinstein/pokemonred_puffer.
It makes me somehow both sad and relieved to see such an immense amount of hardcoded human knowledge and engineering in these reward functions. Sad because there isn’t magic. But relieved because my vague intuitions about RL seem more accurate than I would have guessed.
I wonder what Sutton/Barto would say. They write something like, “the reward should tell the agent what to do, not how to do it.” But clearly you can’t just say “1 point when you beat the game, −1 point for every other frame you’re alive” and have an agent learn to play Pokémon.
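For context on how a dict like this typically becomes a training signal, here’s a minimal sketch of the usual pattern in these Gym-style environments (my paraphrase of the common approach, not either repo’s exact code): sum the components into a running total each step, and hand the agent the change in that total as its step reward, so each piece of hardcoded knowledge pays out once when it’s first achieved rather than every frame.

```python
class ShapedRewardMixin:
    """Hypothetical sketch of the delta-of-total shaped-reward pattern."""

    def __init__(self):
        self.last_total = 0.0

    def get_reward_components(self) -> dict[str, float]:
        # In the real environments this is the big hand-tuned dict above.
        raise NotImplementedError

    def step_reward(self) -> float:
        # Sum all shaped components, then reward only the increase since the
        # previous step, so milestones pay out once instead of continuously.
        total = sum(self.get_reward_components().values())
        reward = total - self.last_total
        self.last_total = total
        return reward
```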
Footnotes