
As Extended Reality (XR) headsets like the Apple Vision Pro become mainstream, interaction paradigms heavily rely on multimodal inputs, most notably “Gaze+Pinch”. While highly effective, this method requires continuous arm and hand movements. This can cause physical fatigue, feel socially awkward in public spaces, and completely exclude users with severe motor impairments. Existing hands-free alternatives, such as gaze-and-dwell, are notoriously slow. To solve this, we developed Gaze+Blink, a novel, fully hands-free spatial interaction technique designed for both discrete selections and continuous UI manipulations.

The Gaze+Blink system leverages a headset’s built-in eye and head tracking to replace hand gestures. For discrete actions, such as clicking a button or typing on a virtual keyboard, the user simply looks at a target and intentionally blinks with both eyes. For continuous actions, such as scrolling a menu or dragging and dropping a file, the user closes one eye and rotates their head to move the object, releasing the “hold” by opening the eye again. To test its viability, we conducted an initial user study comparing Gaze+Blink against the industry-standard Gaze+Pinch in a realistic, visionOS-inspired environment. Participants completed tasks such as navigating menus, entering passwords, and dragging images. The results were promising: Gaze+Blink matched Gaze+Pinch in task completion speed, perceived workload (NASA-TLX), and overall usability (SUS). One flaw emerged, however: Gaze+Blink suffered from a significantly higher error rate because natural, involuntary blinks could trigger accidental selections. Notably, this did not significantly impact overall performance, suggesting that most involuntary blinks go unregistered because users rarely dwell on interactive targets while blinking accidentally.
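The interaction logic above can be pictured as a small state machine. The sketch below is a minimal, hypothetical illustration, not the authors' implementation: the `GazeBlinkController` class, the event names, and the assumption that any both-eye blink over a target triggers a selection are ours.

```python
from enum import Enum, auto

class Mode(Enum):
    IDLE = auto()
    DRAGGING = auto()  # one eye closed: head rotation moves the held object

class GazeBlinkController:
    """Hypothetical sketch of the Gaze+Blink interaction states."""

    def __init__(self):
        self.mode = Mode.IDLE
        self.events = []  # recorded UI events, e.g. ("select", target)

    def on_both_eyes_blink(self, gaze_target):
        # Discrete action: look at a target and blink with both eyes.
        # In this base technique every blink over a target counts, which
        # is exactly why involuntary blinks can cause accidental selections.
        if self.mode is Mode.IDLE and gaze_target is not None:
            self.events.append(("select", gaze_target))

    def on_one_eye_closed(self, gaze_target):
        # Continuous action: closing one eye "grabs" the gazed-at object.
        if self.mode is Mode.IDLE and gaze_target is not None:
            self.mode = Mode.DRAGGING
            self.events.append(("grab", gaze_target))

    def on_head_rotation(self, delta_yaw, delta_pitch):
        # While one eye stays closed, head rotation moves the held object.
        if self.mode is Mode.DRAGGING:
            self.events.append(("move", delta_yaw, delta_pitch))

    def on_eye_reopened(self):
        # Reopening the eye releases the hold.
        if self.mode is Mode.DRAGGING:
            self.mode = Mode.IDLE
            self.events.append(("release",))
```

A selection is thus a single gaze-and-blink event, while a drag is a grab–move–release sequence bracketed by closing and reopening one eye.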

To mitigate accidental inputs, we developed an enhanced version called Gaze+BlinkPlus. We collected over 6 million data points to train a deep-learning (ResNet-based) classifier that distinguishes voluntary (intentional) from involuntary (accidental) blinks. Crucially, to preserve user privacy, the model never processes raw video from the eye cameras; it relies strictly on eye-tracking telemetry, such as pupil size, eye openness, and gaze direction. Tested on uncalibrated users, the model filtered out accidental blinks with 75% accuracy.
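As a rough illustration of what such a privacy-preserving pipeline might look like, the sketch below assembles a fixed-size feature window from telemetry alone. The field names, the window length, and the simple threshold standing in for the trained ResNet are assumptions for illustration, not details from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EyeSample:
    """One frame of eye-tracking telemetry (no camera images involved)."""
    pupil_diameter_mm: float
    eye_openness: float          # 0.0 = fully closed, 1.0 = fully open
    gaze_yaw_deg: float
    gaze_pitch_deg: float

def to_feature_window(samples: List[EyeSample], window: int = 30):
    """Flatten the last `window` samples into a fixed-size feature vector
    (zero-padded), suitable as input to a 1-D ResNet-style classifier."""
    feats = []
    for s in samples[-window:]:
        feats.extend([s.pupil_diameter_mm, s.eye_openness,
                      s.gaze_yaw_deg, s.gaze_pitch_deg])
    feats += [0.0] * (window * 4 - len(feats))  # pad short windows
    return feats

def is_voluntary_blink(features) -> bool:
    """Stand-in for the trained model: a real system would run the ResNet
    here. This placeholder merely flags long, deep eyelid closures."""
    openness = features[1::4]            # every 4th value is eye openness
    closed_frames = [o for o in openness if o < 0.2]
    return len(closed_frames) >= 5       # closed for at least 5 frames
```

In the real system the feature window would be fed to the trained network; the placeholder here only shows where the model call would sit in the input pipeline.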

A second user study evaluating Gaze+Pinch, Gaze+Blink, and Gaze+BlinkPlus revealed that the machine learning model successfully reduced error rates during high-frequency tasks like virtual keyboard typing. While the blink-based methods still exhibited slightly higher error rates during continuous scrolling tasks compared to pinching, they maintained equally fast overall task completion times.

User feedback was polarized but insightful. Some participants praised the blink techniques as natural, fast, and physically relaxing, while others found them prone to accidental inputs or mentally taxing. We also noted a physiological limitation: a small percentage of people cannot close one eye independently (unilateral apraxia of eyelid closing), which limits the drag-and-drop feature.

Ultimately, the paper demonstrates that AI-enhanced blink-based interaction is a viable, high-speed alternative to hand gestures in XR. It represents a significant step forward for spatial computing, providing a discreet input method for confined environments (such as airplanes) and offering crucial accessibility to users with limited upper-limb mobility.

Find the preprint here: https://arxiv.org/pdf/2501.11540

Copyright © 2026 Lauren T. Zerbin