Research
This is my academic webpage
Bio
I received my PhD in computer science from the University of Washington. I was advised by Yejin Choi. My primary field of study is natural language processing (NLP).
I’m interested in NLP where the rubber meets the road: language and stuff like robotics, vision, physical common sense, and social norms. Rather than “language and X,” I think of it as “X involves communication, which involves language.” In other words, language as a messy artifact of a messy world.
I was a 2016 NSF Graduate Research Fellow. In 2018–2019, I was a student researcher at Google AI, working with Christine Kaeser-Chen and Serge Belongie. I was regularly a research intern at the Allen Institute for AI.
Contact me at:
Notes
- Don't Try to Reform Science
- Your Paper Is an Ad
- Figure Creation Tutorial: Making a Figure 1
- "Every PhD Is Different"
- A Modest Proposal: Let’s Stop Lying To Each Other in Our Research Papers
- Procedural City Layout Generation with GANs
- Ordinary least squares, ℓ² (ridge), and ℓ¹ (lasso) linear regressions
- Downloading books from Project Gutenberg
- Making Plots Pretty
Publications
Investigating Machine Moral Judgement Through the Delphi Experiment
Liwei Jiang, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny T. Liang, Sydney Levine, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jack Hessel, Jon Borchardt, Taylor Sorensen, Saadia Gabriel, Yulia Tsvetkov, Oren Etzioni, Maarten Sap, Regina Rini, Yejin Choi
Nature Machine Intelligence 2025
Scarecrow: A Framework for Scrutinizing Machine Text
Yao Dou*, Maxwell Forbes*, Rik Koncel-Kedziorski, Noah A. Smith, Yejin Choi
Association for Computational Linguistics (ACL) 2022
[project] [bib]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, Yejin Choi
Empirical Methods in Natural Language Processing (EMNLP) 2021
[bib]
Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences
Denis Emelin, Ronan Le Bras, Jena D. Hwang, Maxwell Forbes, Yejin Choi
Empirical Methods in Natural Language Processing (EMNLP) 2021
[data & code]
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi
Association for Computational Linguistics (ACL) 2021
[project] [bib]
MultiTalk: A Highly-Branching Dialog Testbed for Diverse Conversations
Yao Dou, Maxwell Forbes, Ari Holtzman, Yejin Choi
Association for the Advancement of Artificial Intelligence (AAAI) 2021
[data] [bib]
Paragraph-Level Commonsense Transformers with Recurrent Memory
Saadia Gabriel, Chandra Bhagavatula, Vered Shwartz, Ronan Le Bras, Maxwell Forbes, Yejin Choi
Association for the Advancement of Artificial Intelligence (AAAI) 2021
[data & code] [bib]
Social Chemistry 101: Learning to Reason about Social and Moral Norms
Maxwell Forbes, Jena D. Hwang, Vered Shwartz, Maarten Sap, Yejin Choi
Empirical Methods in Natural Language Processing (EMNLP) 2020
[project page: demo, data browser] [data] [code] [video] [bib]
Thinking Like a Skeptic: Defeasible Inference in Natural Language
Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, Yejin Choi
Findings of Empirical Methods in Natural Language Processing (Findings of EMNLP) 2020
[bib]
The Curious Case of Neural Text Degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi
International Conference on Learning Representations (ICLR) 2020
[demo] [bib]
Neural Naturalist: Generating Fine-Grained Image Comparisons
Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, Serge Belongie
Empirical Methods in Natural Language Processing (EMNLP) 2019
[project] [data] [video] [bib]
Do Neural Language Representations Learn Physical Commonsense?
Maxwell Forbes, Ari Holtzman, Yejin Choi
Conference of the Cognitive Science Society (CogSci) 2019
[project] [code] [data] [poster]
Learning to Write with Cooperative Discriminators
Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, Yejin Choi
Association for Computational Linguistics (ACL) 2018
[code] [demo]
Balancing shared autonomy with human-robot communication
Rosario Scalise, Yonatan Bisk, Maxwell Forbes, Daqing Yi, Yejin Choi, Siddhartha Srinivasa
arXiv, 2018
Verb Physics: Relative Physical Knowledge of Actions and Objects
Maxwell Forbes, Yejin Choi
Association for Computational Linguistics (ACL) 2017
[project] [code] [data] [video] [slides] [bib]
Robot Programming by Demonstration with Situated Spatial Language Understanding
Maxwell Forbes, Rajesh P. N. Rao, Luke Zettlemoyer, Maya Cakmak
IEEE International Conference on Robotics and Automation (ICRA) 2015
[video] [code]
Robot Programming by Demonstration with Crowdsourced Action Fixes
Maxwell Forbes, Michael Jae-Yoon Chung, Maya Cakmak, Rajesh P. N. Rao
AAAI Conference on Human Computation and Crowdsourcing (HCOMP) 2014
[slides]
Accelerating Imitation Learning through Crowdsourcing
Michael Jae-Yoon Chung, Maxwell Forbes, Maya Cakmak, Rajesh P. N. Rao
IEEE International Conference on Robotics and Automation (ICRA) 2014
[press: Popular Science] [press: IEEE Spectrum]
Workshop papers
Programming by Demonstration with Situated Semantic Parsing
Yoav Artzi*, Maxwell Forbes*, Kenton Lee*, Maya Cakmak
AAAI Fall Symposium Series on Human-Robot Interaction 2014
[slides]
Grounding Antonym Adjective Pairs through Interaction
Maxwell Forbes, Michael Chung, Maya Cakmak, Luke Zettlemoyer, Rajesh P. N. Rao
ACM/IEEE International Conference on Human-Robot Interaction (HRI) Workshop on Asymmetric Interactions 2014
[slides]
Software
Create interactive factor graph visualizations for the web.
This creates interactive web visualizations of factor graphs (e.g., output from py-factorgraph) using the d3-force component of D3.js. I use it in the Verb Physics demo.
A simple command parser for hands-free robotics programming by demonstration (HFPbD).
This is the NLP backend for our ICRA 2015 paper, Robot Programming by Demonstration with Situated Spatial Language Understanding.
I think it’s a great experience to write a rule-based parser. You quickly learn how exhausting it is to support a wide variety of language, and how brittle the resulting system is. But it’s also fun to use, because you kind of make your own formal, constrained language (i.e., stuff that works for your parser) that you know how to use.
The cool part of this is that, unlike all of my “actual” NLP research projects after, this one actually worked in the real world. I could type in commands, or talk to it with voice recognition, then it would parse my commands, send the commands to the robot, and the robot would plan and execute them. You could walk up to the robot and write a simple program for it using your voice, and the robot would actually manipulate objects as a result.
Manage community-curated collections of NLP corpora.
Our broader NLP group had struggled with collectively sharing corpora: who has what, where do we store it, how do we make sure it’s unmodified, and how do we share any processed versions? I built this tool to help us solve these problems. It scans over a directory on a shared filesystem, performing checks on the structure and permissions of corpora, and uploads the results to a browsable index. I have it run daily as a cron job. (This is also why it looks like I commit every day on GitHub.)
Accompanying the CogSci 2019 paper.
This is the model code, data, and documentation for the CogSci 2019 paper, Do Neural Language Representations Learn Physical Commonsense?.
My code for the PR2 robot, focusing on programming by demonstration (PbD) systems, spanning several projects.
The readme has an index of the branches containing the directions I took this code base. One research project lives in the hands-free branch, which is the robotics portion of the code for the ICRA 2015 paper, Robot Programming by Demonstration with Situated Spatial Language Understanding. A large collection of branches (da, generate-objects, interface-video, mock-objects, mock-objects-http, mock-objects-pr2, success-testing, success-testing-pr2) were involved in the HCOMP 2014 paper, Robot Programming by Demonstration with Crowdsourced Action Fixes.
Looking back, it’s wild to remember how challenging it is to build high-level robotics software. Of course there’s the robot itself, which can fail in myriad hilarious ways. But there’s also the massive, ever-shifting, ever-slightly-failing robotics software stack you must build on top of, including partially working components, quirks, edge cases, and general sluggishness.
Build factor graphs and run the loopy belief propagation algorithm in Python.
This is a tiny Python library for building factor graphs and running the (loopy) belief propagation algorithm. At the time, I could not find an easy-to-use library for building factor graphs, only huge packages specializing in other graphical models. Those are probably still better choices, but I learned a lot implementing this one. Made for the Verb Physics paper.
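For a sense of what the algorithm does, here is a minimal sketch of sum-product (loopy) belief propagation in plain NumPy. It is not this library's API: the function names (`loopy_bp`, `brute_force_marginals`) and the toy chain A-B-C are made up for illustration. The chain has no cycles, so the result should match exact marginals; on a graph with cycles the same updates are simply iterated and the result is only an approximation.

```python
import itertools

import numpy as np


def loopy_bp(variables, factors, iterations=20):
    """Sum-product message passing (illustrative sketch, not a library API).

    variables: dict mapping variable name -> number of states.
    factors: list of (scope, table) pairs, where scope is a tuple of
        variable names and table is an ndarray with one axis per name.
    Returns a dict mapping variable name -> estimated marginal.
    """
    # One message in each direction along every variable-factor edge.
    var_to_fac = {}
    fac_to_var = {}
    for i, (scope, _) in enumerate(factors):
        for v in scope:
            var_to_fac[(v, i)] = np.ones(variables[v])
            fac_to_var[(i, v)] = np.ones(variables[v])

    for _ in range(iterations):
        # Variable -> factor: product of messages from the other factors.
        for (v, i) in list(var_to_fac):
            msg = np.ones(variables[v])
            for (j, u), m in fac_to_var.items():
                if u == v and j != i:
                    msg = msg * m
            var_to_fac[(v, i)] = msg / msg.sum()

        # Factor -> variable: weight the table by the other variables'
        # messages, then sum those variables out.
        for i, (scope, table) in enumerate(factors):
            for axis, v in enumerate(scope):
                t = table.astype(float)
                for other_axis, u in enumerate(scope):
                    if u != v:
                        shape = [1] * len(scope)
                        shape[other_axis] = variables[u]
                        t = t * var_to_fac[(u, i)].reshape(shape)
                other_axes = tuple(a for a in range(len(scope)) if a != axis)
                msg = t.sum(axis=other_axes)
                fac_to_var[(i, v)] = msg / msg.sum()

    # Belief at each variable: normalized product of incoming messages.
    marginals = {}
    for v, n in variables.items():
        belief = np.ones(n)
        for (j, u), m in fac_to_var.items():
            if u == v:
                belief = belief * m
        marginals[v] = belief / belief.sum()
    return marginals


def brute_force_marginals(variables, factors):
    """Exact marginals by enumerating every joint assignment."""
    names = list(variables)
    totals = {v: np.zeros(variables[v]) for v in names}
    for assignment in itertools.product(*(range(variables[v]) for v in names)):
        state = dict(zip(names, assignment))
        weight = 1.0
        for scope, table in factors:
            weight *= table[tuple(state[v] for v in scope)]
        for v in names:
            totals[v][state[v]] += weight
    return {v: t / t.sum() for v, t in totals.items()}


# Hypothetical toy problem: a chain A - B - C of binary variables.
variables = {"A": 2, "B": 2, "C": 2}
factors = [
    (("A",), np.array([0.7, 0.3])),
    (("A", "B"), np.array([[0.9, 0.1], [0.2, 0.8]])),
    (("B", "C"), np.array([[0.6, 0.4], [0.3, 0.7]])),
]

bp_marginals = loopy_bp(variables, factors)
exact_marginals = brute_force_marginals(variables, factors)
for name in variables:
    print(name, bp_marginals[name], exact_marginals[name])
```

The brute-force check is only feasible for tiny graphs, which is the point of message passing in the first place.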
Parse POMDP environments and policies into Python objects.
A tiny Python library for loading a POMDP setup (environment) and the output (policy) of a POMDP solver run on it.
Accompanying the EMNLP 2020 paper.
This is the model code, data schema, and documentation for the EMNLP 2020 paper, Social Chemistry 101: Learning to Reason about Social and Moral Norms.
Accompanying the ACL 2017 paper.
This is the model code, data, and documentation for the ACL 2017 paper, Verb Physics: Relative Physical Knowledge of Actions and Objects.
Run the original implementations of BLEU, ROUGE, and METEOR from within Python.
If you try to run any of the standard machine text evaluation metrics, you discover a surprising world of pain. BLEU is a Perl script, METEOR is a Java program, ROUGE is a problematic Perl script, and they all have inconsistent interfaces and expectations. If you instead went with a Python reimplementation, you’d learn there were inconsistencies in preprocessing that change the resulting scores.
textmetrics was my stab at providing a consistent Python interface that still ran the original programs (e.g., BLEU in Perl, METEOR in Java) under the hood. I also included some additional metrics by computing various ngram statistics. When we were studying neural text (de)generation, we realized that the standard metrics only told us so much, and we drew from auxiliary sources to get a more holistic picture of what models were doing.
Today, I no longer think it’s important to use the exact reference implementations (unless you’re doing a shared task with a standard script, of course). There is enough statistical noise from other sources that I don’t think it’s vital to exactly replicate the reference implementations’ decisions. Instead, I would simply use the most widespread reimplementation in Python, perhaps the ones in huggingface/datasets.
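As an illustration of the kind of ngram statistics mentioned above, here is a small sketch that measures how repetitive a piece of generated text is. It is not taken from textmetrics: the function names (`ngrams`, `distinct_n`, `repeated_ngrams`) and the choice of statistics are assumptions made for this example, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(tokens, n):
    """Fraction of n-grams that are unique; lower means more repetition."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)


def repeated_ngrams(tokens, n):
    """Map each n-gram that occurs more than once to its count."""
    counts = Counter(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c > 1}


# Hypothetical model output, tokenized by whitespace.
tokens = "the cat sat on the mat the cat sat".split()

print(distinct_n(tokens, 1))
print(distinct_n(tokens, 2))
print(repeated_ngrams(tokens, 2))
```

Statistics like these need no reference text, so they can complement reference-based scores such as BLEU rather than replace them.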
Mentoring
I am lucky to have collaborated with some amazing undergraduate students during my PhD: Yao Dou, Jeff Da, and Pooja Sethi.
I am eternally grateful to the wonderful graduate students who mentored me when I was an undergraduate (and beyond): Mike Chung, Kenton Lee, and Yoav Artzi.