VR with Multimodal Interactions

MEinVR: Multimodal Interaction Paradigms in Immersive Exploration. Yuan, ZY; Liu, Y; Yu, LY. 21st IEEE International Symposium on Mixed and Augmented Reality (ISMAR) Adjunct, 2022, pp. 85–90. [PDF]

  • multimodal: using multiple modalities of user input, in this case, hand controller + voice; others include gestures (w/o controllers), head or foot postures, gaze

  • combining the imprecise inputs of multiple modalities can yield a precise result (see the sketch after this list)

  • imprecise spatial and verbal inputs are closer to natural human interaction than highly precise 3D pointing or speech constrained to a rigid grammar and a limited ontology
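
The sketch promised above (all names, categories, and coordinates are invented, not from the paper): an imprecise spoken category plus an imprecise controller point jointly resolve a single precise target, though neither input alone could.

    import numpy as np

    # Invented scene inventory: (label, category, position).
    SCENE = [
        ("chain A", "protein", np.array([0.0, 0.0, 0.0])),
        ("chain B", "protein", np.array([5.0, 1.0, 0.0])),
        ("HEM 201", "ligand",  np.array([4.8, 0.9, 0.2])),
    ]

    def resolve(category, pointed_at):
        # The verbal input narrows the candidates; the spatial input
        # picks among them.
        candidates = [(label, pos) for label, cat, pos in SCENE
                      if cat == category]
        if not candidates:
            return None
        return min(candidates,
                   key=lambda c: np.linalg.norm(c[1] - pointed_at))[0]

    # The point is actually nearest the ligand, but saying "protein"
    # disambiguates the target to chain B.
    print(resolve("protein", np.array([4.5, 1.0, 0.0])))  # chain B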

Some Limitations of VR Hand Controllers

  • switching modes (rotation, translation, selection, etc.) is unintuitive and has a steep learning curve:
    • if the different modes are accessed with different controller buttons, the user has to learn/remember which button does what

    • “accidents” happen when one forgets which mode is active or uses the wrong button

    • 2D interfaces for mode switching (e.g., icon bar or menu) are not generally recommended in immersive environments; issues include size of the display and occlusion of the scene

  • in the immersive environment, precise specification/selection in 3D is more difficult than one might think: judging depth, dealing with occlusion, etc., particularly if the structure is not yet well understood

Natural User Interfaces, Natural Language Interfaces


Existing natural language interfaces (NLIs) for visualization were developed for desktop environments. These include:

  • Articulate – machine learning-assisted, parses queries into commands
  • DataTone – keyboard or voice input, keeps track of user corrections
  • Eviza – users can iteratively modify queries

MEinVR builds on these methods in a VR environment, with design goals to:

  • minimize user learning costs and cognitive load
  • reduce the accuracy requirements of each type of input
  • play to the strengths and avoid the weaknesses of each type of input
    • voice commands can refer to hidden or occluded parts
    • pointing naturally conveys positions that are hard to describe in words alone
    • controller movements naturally convey approximate distances and angles (see the sketch below)
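
The promised sketch (freestanding, not from the paper's implementation): an approximate rotation read straight off a controller drag, with no number ever spoken.

    import numpy as np

    def drag_to_rotation(start, end, pivot):
        # Vectors from the pivot to the drag's start and end points
        # define a rotation axis and angle.
        v1 = (start - pivot) / np.linalg.norm(start - pivot)
        v2 = (end - pivot) / np.linalg.norm(end - pivot)
        axis = np.cross(v1, v2)
        angle = np.degrees(np.arctan2(np.linalg.norm(axis), np.dot(v1, v2)))
        return axis / np.linalg.norm(axis), angle

    # A quarter-circle drag in the xy-plane around the origin:
    axis, angle = drag_to_rotation(np.array([1.0, 0.0, 0.0]),
                                   np.array([0.0, 1.0, 0.0]),
                                   np.zeros(3))
    print(axis, angle)  # [0. 0. 1.] 90.0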

Questions:

  1. How best to combine controller and speech input?
  2. Can the combination be used for complex tasks?
  3. Is the combination more effective than either type of input alone?

Implementation

Oculus Quest 2 running ChimeraX VR plus speech recognition:

  1. headset microphone captures audio
  2. Google speech-to-text (STT) converts the audio to text
  3. the system tries to recognize compound queries and split them accordingly
  4. Word2vec converts the text to a word vector for comparison with a library of existing commands
  5. spaCy calculates cosine similarity and returns the most similar command (sketched below)
  6. in parallel, the VR controller in “query mode” tracks position in real time
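
A minimal sketch of steps 3–5, with a hypothetical command library and a deliberately naive split heuristic (the paper uses Word2vec embeddings; here spaCy's en_core_web_md vectors stand in for both the embedding and cosine-similarity steps):

    import re
    import spacy

    nlp = spacy.load("en_core_web_md")  # the medium model ships with word vectors

    # Hypothetical command library; the paper's actual vocabulary is larger.
    COMMANDS = {cmd: nlp(cmd)
                for cmd in ("show", "hide", "color", "zoom", "rotate", "select")}

    def split_compound(transcript):
        # Step 3, naively: treat "and"/"then" as clause boundaries.
        return [c.strip() for c in re.split(r"\band\b|\bthen\b", transcript)
                if c.strip()]

    def match(clause):
        # Steps 4-5: embed the clause, return the most cosine-similar command.
        doc = nlp(clause)
        return max(((cmd, doc.similarity(ref)) for cmd, ref in COMMANDS.items()),
                   key=lambda pair: pair[1])

    for clause in split_compound("display the protein and paint it red"):
        print(clause, "->", match(clause))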

Interface and Simple Task Examples

  • would it work to say “this chain” or “this residue”?
  • how do atomspec strings relate to their library of commands? (the left-hand figure only shows a generic category, “protein”)
  • does controller position-tracking automatically get the atomspec? (like our desktop mouse atomspec balloons; a sketch of one possible approach follows)
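
On that last point, a rough sketch of how it could work with ChimeraX's Python API (whether MEinVR does anything similar isn't stated): map the controller's scene-space point to the nearest residue and emit its command-style spec.

    import numpy as np
    from chimerax.atomic import AtomicStructure

    def atomspec_at(session, point, radius=3.0):
        # Command-style spec for the residue nearest 'point' (scene
        # coordinates), or None if no atom lies within 'radius'.
        best_atom, best_dist = None, radius
        for struct in session.models.list(type=AtomicStructure):
            atoms = struct.atoms
            dists = np.linalg.norm(atoms.scene_coords - point, axis=1)
            i = int(dists.argmin())
            if dists[i] < best_dist:
                best_atom, best_dist = atoms[i], dists[i]
        if best_atom is None:
            return None
        return best_atom.residue.string(style="command")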

Complex Task Examples

  • if the target (e.g. protein) is not in the circumscribed region, nothing will happen
  • does “this protein” expand to a single chain, all protein in that model, or all protein in all models? (if the latter, the “this” is useless; see the spec readings below)
  • the first two might be better without selection, e.g., “Show this protein as stick”
    • ...however, their implementation probably requires a separate “select” action (it apparently cannot generate an atomspec directly for the “show” action)
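
To make the scope ambiguity concrete in ChimeraX atomspec terms (my mapping, not the paper's; assume the controller hit resolved to model #1, chain A):

    # Three possible expansions of "this protein":
    READINGS = {
        "single chain":           "#1/A & protein",
        "all protein in model":   "#1 & protein",
        "all protein everywhere": "protein",   # the "this" adds nothing here
    }
    # e.g. "show this protein as stick" could then become:
    #     style #1/A & protein stick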

Questions Redux


  1. How best to combine controller and speech input?
    • the jury is out; the authors refer to the current approach as a prototype, but seem pleased with their initial progress
  2. Can the combination be used for complex tasks?
    • some compound tasks are given as examples, but complexity may still be limited
  3. Is the combination more effective than either type of input alone?
    • probably yes for novice users on the queries their system can handle; however, many possible queries likely fall outside that set

The authors basically punt by saying:

“Further studies are required to explore the prominent advantages of each interaction input in different exploration tasks for various data. Moreover, a comprehensive user study needs to be conducted to evaluate the usability and effectiveness of our method in data exploration.”

I also didn't see anything about the availability of their code, but the corresponding author's email address is given.