GazeGPT | 2024

Robert Konrad, Nitish Padmanaban, J. Gabriel Buckmaster, Kevin C. Boyle, Gordon Wetzstein

Augmenting Human Capabilities using Gaze-contingent Contextual AI for Smart Eyewear

7 MIN TECH TALK

ABSTRACT

Multimodal large language models (LMMs) excel in world knowledge and problem-solving abilities. Through the use of a world-facing camera and contextual AI, emerging smart accessories aim to provide a seamless interface between humans and LMMs. Yet, these wearable computing systems lack an understanding of the user’s attention. We introduce GazeGPT as a new user interaction paradigm for contextual AI. GazeGPT uses eye tracking to help the LMM understand which object in the world-facing camera view a user is paying attention to. Using extensive user evaluations, we show that this gaze-contingent mechanism is a faster and more accurate pointing mechanism than alternatives; that it augments human capabilities by significantly improving their accuracy in a dog-breed classification task; and that it is consistently ranked as more natural than head- or body-driven selection mechanisms for contextual AI. Moreover, we prototype a variety of application scenarios that suggest GazeGPT could be of significant value to users as part of future AI-driven personal assistants.

OVERVIEW

GazeGPT is a human-centric interface to generative AI models. Current AI models are exceptional at ingesting multimodal data and providing reasonable responses, but they often lack the information needed to identify which object the user is interested in. GazeGPT combines a gaze tracker with a world-facing camera to provide that context for user queries. The query, along with a multiscale crop around the object of interest, is uploaded to a multimodal large language model, such as GPT-4V, which can provide better responses given the included context. This new interface to AI has the potential to fundamentally change how humans access information.
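The multiscale crop around the gaze point can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `multiscale_crop_boxes`, the three crop sizes, and the clamping behavior at the frame edges are all assumptions, since this page does not specify them.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels of the world-facing camera frame
Box = Tuple[int, int, int, int]


def multiscale_crop_boxes(
    gaze_xy: Tuple[float, float],
    image_size: Tuple[int, int],
    crop_sizes: Tuple[int, ...] = (256, 512, 1024),  # assumed scales
) -> List[Box]:
    """Return one square crop box per scale, centered on the gaze point.

    Boxes that would extend past the frame are shifted back inside it,
    so every crop keeps its full size whenever the frame is large enough.
    """
    gx, gy = gaze_xy
    width, height = image_size
    boxes: List[Box] = []
    for size in crop_sizes:
        # Never request a crop larger than the frame itself.
        w = min(size, width)
        h = min(size, height)
        # Center on the gaze point, then clamp to the frame bounds.
        left = int(round(gx - w / 2))
        top = int(round(gy - h / 2))
        left = max(0, min(left, width - w))
        top = max(0, min(top, height - h))
        boxes.append((left, top, left + w, top + h))
    return boxes


if __name__ == "__main__":
    # Gaze at the center of a 1920x1080 frame.
    for box in multiscale_crop_boxes((960, 540), (1920, 1080)):
        print(box)
```

Each box uses the (left, top, right, bottom) order, so it can be passed directly to an image library's crop routine (for example, Pillow's `Image.crop`). The smallest crop gives the model a close view of the attended object, while the larger crops preserve the surrounding scene.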

PROTOTYPE

The Zinn Labs DK1 Evaluation Kit. The major components used in the GazeGPT system (microphone, speaker, eye tracking cameras, and world-facing camera) are labeled.

APPLICATIONS

Applications of GazeGPT. The GazeGPT system excels at a wide variety of tasks, including general knowledge, contextual recommendations, and translation, and it can operate in multiple languages.

INSIGHTS & RESULTS

Results of the selection evaluation for accuracy (left) and speed (right) for the selection target shown (top). Both the phone- and gaze-based selection modes achieved high accuracy (just under 2°), while the gaze-based selection mode was the fastest of all the modes. Significance is indicated at the **𝑝 = 0.01 and ***𝑝 = 0.001 levels. Error bars indicate SE.

Example images used for the classification evaluation (left) and the results of the evaluation (right). Each trial displayed a 9×9 grid of dog images on a white background to emulate a natural environment that may have many competing objects of interest. Gaze-based selection consistently outperforms the other selection modes and is the only one to outperform the users themselves. Significance is indicated at the **𝑝 = 0.01 and ***𝑝 = 0.001 levels. Error bars indicate SE.

Preference study results. Gaze-based selection is consistently rated as more natural and useful than the other modes. While the latency of the three modes should be identical, it appears that user frustration with body-based selection affected their perception of latency as well. Significance is indicated at the *𝑝 = 0.05 and **𝑝 = 0.01 levels. Each dot represents a user ranking.

FILES

Technical Paper (link)
Demo Video (YouTube Video)