Gaze as Context: a foray into intuitive interaction

As we here at Mirametrix have officially detached from the mother ship that is Tandemlaunch (though we’re still stuck in their office, shhhh), it seemed like a perfectly appropriate time to talk about our views on human-computer interaction and how we see it evolving. While outwardly a standard eye tracking research company, we’re committed to aiding the push towards a future where device interaction is both simpler and more intuitive.

Minority Report’s Futuristic Human-Computer Interface

Now, if one were to base this ideal future on recent sci-fi films, body-pose based gesture interaction would be the name of the game. Although the interface demonstrated in Minority Report certainly had the strongest effect on our general perceptions of futuristic interfaces ([1] is a great read on the subject), these types of interfaces have appeared in literature and film for decades. And with good reason: for a large number of applications, gestures can directly mimic how we would interact with things in the physical world. They look darn cool, to boot. However, there are some things gesture is inherently unsuited to handle: abstract interactions, which lack direct physical counterparts, are particularly difficult to model via gesture. The state of current gesture interfaces brings to light another major hurdle: while fairly good at understanding the user’s pose and the gesture used, these devices have difficulty determining what the user is attempting to interact with in the virtual realm. This is most apparent in the interaction schemes for menus or dashboards, such as the one shown below:

XBox One Dashboard Interface

The intuitive method of interacting with such an interface would be to point at or grab the item you wish to select. However, this is not feasible due to the constraints of gesture technology. The most common interaction scheme mimics the standard pointer: you perform a gesture to bring up a cursor, move it to the correct object, and perform a second gesture to select. This is a step away from the direct interaction people are trying to achieve. By adding this layer of abstraction, we often remove the “naturalness” of the interaction and are reminded that we’re still dealing with a mouse-click style user interface.

Similarly, despite major strides, current voice systems are, for many use cases, over-developed. The vast majority of these systems expect any phrase as input, which introduces some problems. First, it adds complexity to the necessary language processing, as the system must be able to reliably and efficiently extract the syntax of the given phrase, understand the meaning behind the potential command, and determine whether it applies to the system it’s running on. Keep in mind, this assumes the system has been able to accurately convert a given audio feed to words, a difficult problem in and of itself due to the wide range of accents within any given language. Second, such voice recognition systems must by their very nature restrict the user’s interaction to purely objective commands, removing the deictic phrases we tend to use in normal conversation. Even with the voice component of the new Kinect, which has smartly been trained on a set command list [2] (reducing the scope of the problem and increasing its accuracy), this lack of deictic commands is non-ideal. In the sea of available items on the dashboard, sometimes the most natural command we want understood is “Select that”.

Now, this problem of context is not exclusive to voice, but inherent in the vast majority of the new interaction devices we’re seeing on the market. From Kinect to Siri to Leap Motion, great products are being released, but they’re being held back by the isolationist nature of their ecosystems. And the solution to this problem is by no means new: since the advent of these new modalities, there has been interest in roping them together into multi-modal systems. Think about the last interaction you had with a friend: how often did you refer to contextual things (this, that), point at objects, or indicate what you were talking about by looking at it? We interact innately via multiple modalities (talking, looking, pointing, gesturing all at once), so it only seems natural to want our computers to be able to decipher these multifaceted commands.

To slide back to how Mirametrix fits in here: we’re in the gaze tracking business because we think it’s an underlying key to gluing these modalities together. When referring to things in visual space, we naturally fixate on what we are referring to. This simple key, context, can immensely simplify the recognition problem in a number of scenarios. In the case of gesture mentioned earlier, this context can simplify the gesture tracking problem from one of perfect 3D pose tracking to a simple set of different gestures. Similarly, the voice recognition problem becomes one of differentiating between sets of synonyms in the case described above. This idea of context as a unifying/simplifying component is quite powerful: it not only determines what we’re referring to during interaction (connecting separate modality actions), but also allows the connected modalities to be simpler and more lightweight, by virtue of this contextual association. It is via this simple tenet that we hope to make interchanges between us and computers truly natural.
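To make the “Select that” case a little more concrete, here is a minimal sketch in Python of how gaze context could shrink the voice problem down to spotting a handful of synonyms. Everything in it is an assumption made for illustration: the `Tile` class, the `SELECT_SYNONYMS` set, and the `resolve_command` function are hypothetical names, not part of any Mirametrix, Kinect, or Xbox API, and the speech and gaze inputs are assumed to arrive already converted to text and screen coordinates.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tile:
    """A hypothetical dashboard item, described by its on-screen rectangle."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, gx: float, gy: float) -> bool:
        # True when the gaze point falls inside this tile's rectangle.
        return (self.x <= gx < self.x + self.width
                and self.y <= gy < self.y + self.height)


# Assumed vocabulary: with gaze supplying the target, the voice side only
# needs to tell a small set of synonyms apart from everything else.
SELECT_SYNONYMS = {"select", "open", "choose", "launch"}


def tile_under_gaze(tiles: Sequence[Tile], gx: float, gy: float) -> Optional[Tile]:
    """Return the first tile containing the gaze point, or None."""
    for tile in tiles:
        if tile.contains(gx, gy):
            return tile
    return None


def resolve_command(spoken: str,
                    tiles: Sequence[Tile],
                    gaze: Tuple[float, float]) -> Optional[str]:
    """Resolve a deictic phrase such as "Select that" to a tile name.

    Returns None when no select-style word was spoken or when the user
    is not looking at any tile.
    """
    words = set(spoken.lower().split())
    if not words & SELECT_SYNONYMS:
        return None
    target = tile_under_gaze(tiles, *gaze)
    return target.name if target is not None else None


dashboard = [
    Tile("Games", 0, 0, 100, 100),
    Tile("Music", 100, 0, 100, 100),
]
selected = resolve_command("Select that", dashboard, (150.0, 40.0))
```

With the gaze point resting on the second tile, `selected` ends up as `"Music"`: the word “that” never has to be understood at all, because the fixation already says what it refers to. A real system would also need to deal with gaze noise and with the timing between the fixation and the spoken word, both of which this sketch ignores.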


