Verbosity, Time Flies, Job Talks, Pokémon RL Rewards
Week 6, day 4 at Recurse F2’25.
Verbosity
I’ve been really enjoying writing these daily blogs, but my verbosity has taken over. I think I spent 1.5 hours writing yesterday’s (Wed) post this morning (Thu).01
The answer here, as with most problems once you write them out, is obvious: timebox.
Time Flies
Today was one of those days where time just flies, and you’re left wondering how the day is over already.
| From | To | What |
|---|---|---|
| 10am | 11am | Daily programming puzzle (tries w/ wildcards) |
| 11am | 12:30pm | Write yesterday’s daily blogpost |
| 12:30pm | 1pm | Head into Recurse |
| 1pm | 2pm | Lunch @ the hub |
| 2pm | 2:30pm | Catching up with Zulip messages |
| 2:30pm | 4pm | **High-level job talk planning** |
| 4pm | 5pm | Presentations |
| 5pm | 9:30pm | End of (others’) batch celebration + Halloween party |
Only the bolded 1.5 hours of high-level job talk planning was high-priority work.
That’s OK! It’s fine to have different balances on different days. Buuuut, even though I know that, it’s hard for me to feel satisfied with a workday that I didn’t spend productively.
Job Talks
An interesting thing you (sometimes?) do when applying for research-y jobs as a PhD graduate, at least in CS, is give a “job talk,” which is a ~1 hour talk on your research.
Now, it’s “on your research,” but I think you also need to tie the individual contributions together with a bit of a story, and provide some forward-looking vision for upcoming research you’d like to do.
I defended in December 2021, and a lot has changed in NLP in the four years since. It’s actually hard to think of a crazier four-year period for any field in history.02 So it’s going to be interesting to figure out how best to situate my past research, and to rev up the engines of looking forward in a low-level way again.
The extra challenge will be making a job talk outside the context of a lab, with no labmates or advisor to workshop it with! But such is the life I chose by taking four years outside of the research world.
Pokémon RL Rewards
I finally checked out the reward functions for the Pokémon Red RL agent(s), and they’re fascinating.
Here are the rewards that get an agent a few cities in (and the code behind Pete’s amazing video):
```python
state_scores = {
    "event": self.reward_scale * self.update_max_event_rew() * 4,
    # "level": self.reward_scale * self.get_levels_reward(),
    "heal": self.reward_scale * self.total_healing_rew * 10,
    # "op_lvl": self.reward_scale * self.update_max_op_level() * 0.2,
    # "dead": self.reward_scale * self.died_count * -0.1,
    "badge": self.reward_scale * self.get_badges() * 10,
    "explore": self.reward_scale * self.explore_weight * len(self.seen_coords) * 0.1,
    "stuck": self.reward_scale * self.get_current_coord_count_reward() * -0.05
}
```
From PWhiddy/PokemonRedExperiments. Also, the agent sees a few frames and a bunch of state encoded visually, which is a beautiful choice for ease of debugging.
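To make “state encoded visually” concrete, here’s a minimal sketch of the idea: draw a few scalar quantities (badge count, party HP, explored-tile count) as bar strips onto the bottom rows of the observation image, so a human watching a rollout sees exactly the same channels the policy network does. The function name, arguments, and layout below are my own illustration, not the repo’s actual code.

```python
import numpy as np

def encode_state_visually(frame: np.ndarray, badges: int, hp_fraction: float,
                          explored: int, max_explored: int = 5000) -> np.ndarray:
    """Hypothetical sketch: stamp scalar game state onto the observation image.

    frame is an (H, W) uint8 grayscale screen; the bottom three rows are
    overwritten with bar-style encodings that are visible to both the policy
    network and anyone eyeballing the rollout video.
    """
    obs = frame.copy()
    h, w = obs.shape
    # Row -3: badges as a bar, one segment per badge (8 max in Pokémon Red).
    obs[h - 3, :] = 0
    obs[h - 3, : int(w * badges / 8)] = 255
    # Row -2: current party HP fraction.
    obs[h - 2, :] = 0
    obs[h - 2, : int(w * np.clip(hp_fraction, 0.0, 1.0))] = 255
    # Row -1: how much of the map has been explored so far.
    obs[h - 1, :] = 0
    obs[h - 1, : int(w * min(explored, max_explored) / max_explored)] = 255
    return obs
```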
Here are the rewards that beat the game:
```python
return {
    "event": 4 * self.update_max_event_rew(),
    "explore_npcs": sum(self.seen_npcs.values()) * 0.02,
    # "seen_pokemon": sum(self.seen_pokemon) * 0.0000010,
    # "caught_pokemon": sum(self.caught_pokemon) * 0.0000010,
    "obtained_move_ids": sum(self.obtained_move_ids) * 0.00010,
    "explore_hidden_objs": sum(self.seen_hidden_objs.values()) * 0.02,
    # "level": self.get_levels_reward(),
    # "opponent_level": self.max_opponent_level,
    # "death_reward": self.died_count,
    "badge": self.get_badges() * 5,
    # "heal": self.total_healing_rew,
    "explore": sum(sum(tileset.values()) for tileset in self.seen_coords.values()) * 0.012,
    # "explore_maps": np.sum(self.seen_map_ids) * 0.0001,
    "taught_cut": 4 * int(self.check_if_party_has_hm(0xF)),
    "cut_coords": sum(self.cut_coords.values()) * 1.0,
    "cut_tiles": sum(self.cut_tiles.values()) * 1.0,
    "met_bill": 5 * int(self.events.get_event("EVENT_MET_BILL")),
    "used_cell_separator_on_bill": 5
    * int(self.events.get_event("EVENT_USED_CELL_SEPARATOR_ON_BILL")),
    "ss_ticket": 5 * int(self.events.get_event("EVENT_GOT_SS_TICKET")),
    "met_bill_2": 5 * int(self.events.get_event("EVENT_MET_BILL_2")),
    "bill_said_use_cell_separator": 5
    * int(self.events.get_event("EVENT_BILL_SAID_USE_CELL_SEPARATOR")),
    "left_bills_house_after_helping": 5
    * int(self.events.get_event("EVENT_LEFT_BILLS_HOUSE_AFTER_HELPING")),
    "got_hm01": 5 * int(self.events.get_event("EVENT_GOT_HM01")),
    "rubbed_captains_back": 5 * int(self.events.get_event("EVENT_RUBBED_CAPTAINS_BACK")),
    "start_menu": self.seen_start_menu * 0.01,
    "pokemon_menu": self.seen_pokemon_menu * 0.1,
    "stats_menu": self.seen_stats_menu * 0.1,
    "bag_menu": self.seen_bag_menu * 0.1,
    "action_bag_menu": self.seen_action_bag_menu * 0.1,
    # "blackout_check": self.blackout_check * 0.001,
    "rival3": self.reward_config["event"] * int(self.read_m("wSSAnne2FCurScript") == 4),
}
```
From drubinstein/pokemonred_puffer.
It makes me somehow both sad and relieved to see such an immense amount of hardcoded human knowledge and engineering in these reward functions. Sad because there isn’t magic. But relieved because my vague intuitions about RL seem more accurate than I would have guessed.
I wonder what Sutton/Barto would say. They write something like, “the reward should tell the agent what to do, not how to do it.” But clearly you can’t just say “1 point when you beat the game, −1 point for every other frame you’re alive” and have an agent learn to play Pokémon.
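For context on how a dict like this typically becomes a training signal, here’s a minimal sketch of the usual pattern in these Gym-style environments (my paraphrase of the common approach, not either repo’s exact code): sum the components into a running total each step, and hand the agent the change in that total as its step reward, so each piece of hardcoded knowledge pays out once when it’s first achieved rather than every frame.

```python
class ShapedRewardMixin:
    """Hypothetical sketch of the delta-of-total shaped-reward pattern."""

    def __init__(self):
        self.last_total = 0.0

    def get_reward_components(self) -> dict[str, float]:
        # In the real environments this is the big hand-tuned dict above.
        raise NotImplementedError

    def step_reward(self) -> float:
        # Sum all shaped components, then reward only the increase since the
        # previous step, so milestones pay out once instead of continuously.
        total = sum(self.get_reward_components().values())
        reward = total - self.last_total
        self.last_total = total
        return reward
```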
Footnotes