1. From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
- Author
Shaw, Peter; Joshi, Mandar; Cohan, James; Berant, Jonathan; Pasupat, Panupong; Hu, Hexiang; Khandelwal, Urvashi; Lee, Kenton; Toutanova, Kristina
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Computation and Language (cs.CL), Machine Learning (cs.LG), Human-Computer Interaction (cs.HC)
- Abstract
Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use: pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
- Published
- 2023