Multimodal Interfaces as an Alternative to Unimodal Speech Inputs

By David Hruska

With the exponential increase in computer processing power, speech recognition always seems to be the promising technology that is about to become faster, more accurate, and ubiquitous. Entirely speech-controlled computers may eventually become practical, with speech serving as the sole interface technology, but many limitations currently prevent its adoption. A possible solution is to incorporate several different interface methods, including speech, into one coherent multimodal interface.

Problems with Speech Recognition

Speech recognition-based computing seems to be the long-sought Holy Grail of interface design. Newer, more powerful processors make it seem within reach, and people already have a vivid sense of how speech-based computing might work from shows such as Star Trek. The idea of speaking to a computer to have it carry out a set of commands sounds appealing, but the technology still has real drawbacks.

Speech recognition does offer appealing benefits, such as hands-free text entry, increased data entry rates, improved spelling accuracy, and increased accessibility, and it delivers them today to users who work under ideal conditions. For speech recognition to work well, users must be in a noise-free environment, wear a high-quality headset with a built-in microphone, and spend time training the software to recognize their voice.

Currently, speech recognition is used in government and private industry, aviation maintenance, medical settings where fast access to information is important, and hands-free banking. In these fields, its application is limited to simple, predefined commands that the user must memorize.

Speech recognition currently does not work well for people who have heavy accents, who want to use it in noisy environments, or who do not want to spend time training the software. Privacy concerns are another reason people may avoid speech recognition applications. Additionally, people's first encounter with speech recognition, whether over the phone with automated banking or with speech-to-text dictation software on a computer, is commonly a negative one that spoils the experience for them in the future. Steven Simon suggests that as speech recognition becomes easier to use and free of errors, society's view of the technology will improve and become more accepting (Simon).

Multimodality

Multimodal interfaces allow people to use multiple inputs at once to carry out a task on a mobile device. The advantage of multimodality is that the weakness of one input is overcome by the strength of another. People also prefer to interact with devices multimodally, which, as an added benefit, improves task performance (Oviatt, Advances, 63).

With enough input methods, the device could theoretically choose the combination of inputs that best fits the user's context. For example, when driving, the best input is mostly speech with very few touch commands, so the driver can keep watching the road. Common combinations of inputs are speech and pen, speech and lip movement, and speech and gestures.
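
As a rough sketch of how such a choice might be made, the following Python snippet maps a detected context to a ranked set of input modes. The contexts, mode names, and weights are purely illustrative and are not taken from any of the systems cited here.

# Hypothetical sketch: choosing input modes based on the user's context.
# The contexts and mode weights are illustrative, not taken from a real system.

CONTEXT_MODES = {
    "driving":      {"speech": 0.9, "touch": 0.1, "gesture": 0.0},
    "noisy_street": {"speech": 0.2, "touch": 0.6, "gesture": 0.2},
    "quiet_office": {"speech": 0.5, "touch": 0.4, "gesture": 0.1},
}

def preferred_modes(context: str) -> list:
    """Return input modes ordered from most to least appropriate."""
    weights = CONTEXT_MODES.get(context, CONTEXT_MODES["quiet_office"])
    return sorted(weights, key=weights.get, reverse=True)

if __name__ == "__main__":
    print(preferred_modes("driving"))  # ['speech', 'touch', 'gesture']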

Speech and Pen

This combination pairs speech with stylus input. It is useful in noisy environments where speech is the primary input: if the computer cannot distinguish between similar-sounding words, it can display a simple selection box for confirming the correct word. This combination suppresses errors by 19 to 41 percent compared to unimodal speech input (Oviatt, Advances, 64).
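
A minimal sketch of this kind of error handling, assuming a recognizer that returns hypotheses with confidence scores, might look like the following; the function names and the 0.1 confidence margin are hypothetical.

# Hypothetical sketch of speech-plus-pen error correction: when the speech
# recognizer's top hypotheses are too close to call, fall back to a pen/stylus
# selection instead of guessing. Names and thresholds are illustrative.

def resolve_word(hypotheses, pen_pick, margin=0.1):
    """hypotheses: list of (word, confidence) from the recognizer;
    pen_pick: callable that shows choices and returns the user's selection."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best[1] - runner_up[1] >= margin:
        return best[0]                       # confident: accept speech alone
    choices = [word for word, _ in ranked[:3]]
    return pen_pick(choices)                 # ambiguous: ask for a stylus tap

if __name__ == "__main__":
    hypotheses = [("there", 0.46), ("their", 0.44), ("they're", 0.10)]
    print(resolve_word(hypotheses, pen_pick=lambda choices: choices[1]))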

Speech and Lip Movement

Combining speech with lip movement is valuable in noisy environments, but does not offer many additional benefits in quiet conditions (Oviatt, Advances, 65).

Speech and Gestures

Examples of gestural interfaces are those that can detect moving hands, head nods, and pointing fingers. Gestures can be paired with speech to allow the user to point at something on a screen and say, “move that there.” The computer would know what you wanted to move, and then where to move it based on your pointing gesture.
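
A simple way to picture this kind of fusion is to bind each deictic word (“that,” “there”) to the pointing event closest to it in time. The sketch below assumes timestamped speech and pointing streams; the event formats and one-second window are illustrative, not drawn from a specific system.

# Hypothetical sketch of late fusion for "move that there": each deictic word
# in the speech stream is bound to the pointing event closest to it in time.
# Event formats and the one-second window are illustrative assumptions.

SPEECH = [("move", 0.2), ("that", 0.6), ("there", 1.4)]      # (word, seconds)
POINTS = [((120, 340), 0.7), ((610, 200), 1.5)]              # ((x, y), seconds)

def bind_deictics(speech, points, window=1.0):
    bindings = {}
    for word, t_word in speech:
        if word not in ("that", "there"):
            continue
        target, t_point = min(points, key=lambda p: abs(p[1] - t_word))
        if abs(t_point - t_word) <= window:
            bindings[word] = target
    return bindings

if __name__ == "__main__":
    print(bind_deictics(SPEECH, POINTS))
    # {'that': (120, 340), 'there': (610, 200)} -> move object at (120,340) to (610,200)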

This type of speech and gesture interface is very common, and it is something you probably use more than you realize. For example, when speaking to a waiter at a restaurant you may point to an item on the menu and ask, “What is in this?” You and the waiter both know exactly what you are talking about.

Other Interface Methods

Aside from the input combinations above, several other interface methods are interesting and worth noting: haptic interfaces, tilt-based systems, and auditory interfaces (Jones). Haptic interfaces let the user receive feedback from a device through touch or movement. The human sense of touch is very acute, and exploiting it may provide interesting feedback mechanisms in the future. Tilt-based systems let users control devices by tilting the device, or part of it, to steer a pointer. Jones states that tilt-based systems can be 25 percent faster than multi-tap text input. Auditory interfaces present the user with “earcons,” which could provide another way for the interface to convey information without requiring the user to look at a screen.
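
As an illustration of how a tilt-based pointer might work, the sketch below maps tilt angles to pointer velocity with a small dead zone so the pointer stays still when the device is held roughly level. All constants here are hypothetical.

# Hypothetical sketch of a tilt-based pointer: device tilt (in degrees) is
# mapped to pointer velocity, with a small dead zone so the pointer holds
# still when the device is roughly level. All constants are illustrative.

DEAD_ZONE_DEG = 5.0      # ignore tilts smaller than this
GAIN = 4.0               # pixels per second per degree of tilt

def pointer_velocity(tilt_x_deg, tilt_y_deg):
    def axis(t):
        if abs(t) < DEAD_ZONE_DEG:
            return 0.0
        return GAIN * (t - DEAD_ZONE_DEG * (1 if t > 0 else -1))
    return axis(tilt_x_deg), axis(tilt_y_deg)

if __name__ == "__main__":
    print(pointer_velocity(12.0, -3.0))   # (28.0, 0.0): drift right, no vertical motion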

Multimodal Interface Benefits

Interfaces that take advantage of multiple input methods offer more than increased performance: they also open devices to a wider audience that may not have been able to use them before. For example, people with disabilities or temporary illness may not have full use of their motor control and can benefit from alternative input methods. The very young and the very old may also face challenges using a device that was not designed for them, as may people who speak other languages (Oviatt, Multimodal Interfaces).

Multimodal Interface Challenges

One of the current challenges of multimodal interfaces is tying everything together. Multimodal interfaces must be designed so that the various input types work together seamlessly. This can mean that the device must gather two or more input streams simultaneously and know when to favor certain inputs over others. The interface must also correctly interpret the user's intention based on all of the inputs. In addition to processing multiple input streams, the recognition engine needs to handle misinterpreted inputs (Oviatt, Multimodal Interfaces).
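
One way to imagine this “tie it together” problem is as a fusion loop that merges timestamped events from every input stream into a single ordered stream, flagging low-confidence events for confirmation instead of acting on them. The sketch below is only a schematic of that idea; the streams, field names, and threshold are invented for illustration.

# Hypothetical sketch of merging several input streams: events are combined
# into one time-ordered stream, and low-confidence events are flagged for
# confirmation rather than acted on directly. All values are illustrative.

import heapq

speech_events  = [(0.4, "speech", "open calendar", 0.91)]
gesture_events = [(0.5, "gesture", "point:(88,120)", 0.55)]
touch_events   = [(1.1, "touch", "tap:(88,118)", 0.99)]

def fuse(*streams, min_confidence=0.6):
    """Yield (timestamp, modality, payload, needs_confirmation) in time order."""
    for ts, modality, payload, conf in heapq.merge(*streams):
        yield ts, modality, payload, conf < min_confidence

if __name__ == "__main__":
    for event in fuse(speech_events, gesture_events, touch_events):
        print(event)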

Implementing Multimodality

T. V. Raman, a Research Staff Member at IBM, presents a list of guidelines for creating multimodal interfaces. His article states that multiple modalities need to be synchronized, should degrade gracefully, should share a common interaction state, and need to adapt to the user's environment.

Synchronization between input methods is important because one input should be able to pick up information where another is weak. Different inputs should also work together even when they capture different kinds of information about the same thing; for example, speech is temporal while visual information is spatial (Raman). Working in concert, they let a user point at something and speak a command, much like the waiter example above. Likewise, the device should be able to respond through multiple outputs so that the conversation carries as much information as possible.

Multimodal interfaces should be able to “degrade gracefully,” allowing a device to switch input modes to adapt to changing contexts. A user could walk from a quiet building out onto a noisy street, and the inputs should shift focus accordingly to keep input accurate.
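
A minimal sketch of such a fallback, assuming the device can measure ambient noise in decibels, might look like this; the 65 dB threshold is an arbitrary, illustrative value.

# Hypothetical sketch of graceful degradation: the active input mode follows
# the measured ambient noise level, so speech is used indoors and the
# interface falls back to touch on a noisy street. The threshold is illustrative.

NOISE_SPEECH_MAX_DB = 65   # above this, speech recognition is assumed unreliable

def active_mode(noise_db):
    if noise_db > NOISE_SPEECH_MAX_DB:
        return "touch"     # too loud for reliable speech input
    return "speech"

if __name__ == "__main__":
    for reading in (42, 58, 78, 60):          # walking from an office to the street and back
        print(reading, "dB ->", active_mode(reading))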

Multiple inputs that do not seem directly related, such as visual and auditory, must still share a common data set so they can work together. Usage history is one set of information that all input methods can contribute to, and it can be used to determine usage habits. Usage habits matter because they let the device infer the intended input when it is uncertain.
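
The sketch below shows one hypothetical shape such a shared state could take: every modality records what it observed into the same structure, and the accumulated usage history is consulted to break ties when an input is ambiguous. The field names and tie-breaking rule are assumptions for illustration.

# Hypothetical sketch of a shared interaction state: every modality writes
# into one structure, and the usage history is consulted to break ties when
# an input is ambiguous. Field names and rules are illustrative.

from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InteractionState:
    focused_object: Optional[str] = None                 # what the user last selected
    history: Counter = field(default_factory=Counter)    # command usage counts

    def record(self, command):
        self.history[command] += 1

    def disambiguate(self, candidates):
        # prefer the candidate the user issues most often
        return max(candidates, key=lambda c: self.history[c])

if __name__ == "__main__":
    state = InteractionState()
    for cmd in ["call home", "call home", "play music"]:
        state.record(cmd)
    print(state.disambiguate(["call home", "call hope"]))   # 'call home'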

Finally, the device should automatically adapt to the user's environment and determine the best way to accomplish a task, which can be judged by the following factors (Raman); a sketch of how a device might weigh them appears after the list:

  • The user's needs and abilities
  • The abilities of the connecting device
  • Available bandwidth between device and network
  • Available bandwidth between device and user
  • Constraints placed by the user's environment, e.g., need for hands-free, eyes-free operation.
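
The sketch below shows one hypothetical way a device could score candidate interaction methods against these factors; the candidate methods, their requirements, and the weights are all invented for illustration.

# Hypothetical sketch of scoring candidate interaction methods against the
# factors Raman lists: user abilities, device abilities, available bandwidth,
# and environmental constraints. All weights and candidates are illustrative.

CANDIDATES = {
    # method: requirements the environment/device/user must satisfy
    "speech":  {"hands_free": True,  "needs_quiet": True,  "min_kbps": 32},
    "touch":   {"hands_free": False, "needs_quiet": False, "min_kbps": 1},
    "earcons": {"hands_free": True,  "needs_quiet": False, "min_kbps": 8},
}

def best_method(hands_busy, noisy, kbps):
    def score(req):
        s = 0
        if hands_busy and req["hands_free"]:
            s += 2                      # hands-free is essential while driving, etc.
        if noisy and req["needs_quiet"]:
            s -= 3                      # penalize speech on a loud street
        if kbps >= req["min_kbps"]:
            s += 1                      # enough bandwidth for this modality
        return s
    return max(CANDIDATES, key=lambda m: score(CANDIDATES[m]))

if __name__ == "__main__":
    print(best_method(hands_busy=True, noisy=True, kbps=64))   # 'earcons'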

Multimodal Myths

The following section is adapted from Sharon Oviatt's article “Ten Myths of Multimodal Interaction,” published in Communications of the ACM. In it, Oviatt addresses ten issues in multimodal design and the myths surrounding them, based on research from multimodal human-computer interaction experiments. Her intent is to lay a better foundation for building multimodal interfaces in the future. Her myths fall roughly into four categories, which are discussed below.

Primary Input Methods

Oviatt argues that speech is not necessarily the dominant input mode for multimodal systems, nor is the combination of speech and gestures, such as pointing while issuing a command. Speech does not convey certain important information, such as spatial relationships and actions, which are better expressed through other input methods, and the combination of speech and pointing lacks symbolic information “that is more richly expressive than simple object selection.”

Commands

Her research shows that different users carry out multimodal commands in different ways; people do not all interact multimodally in the same manner. Because of this, multimodal systems should adapt to a specific user's habits by detecting how that user interacts with the system. The research also shows that when people use speech within a multimodal interface, their commands are simpler and more direct, whereas unimodal speech commands require longer, more complex phrasing to perform the same task.

Multiple Input Methods

Contrary to common belief, not all inputs have to be processed at the same time. While some inputs, like speech and gestures, are tightly coupled, others can function at different times. Likewise, multiple inputs do not have to capture redundant information; in many cases, multimodal interfaces work best when they capture multiple forms of unique information, such as ambient noise levels and lip movement.

Efficiency

Just because a device is designed for multimodal input does not mean that people will always interact with it multimodally. People tend to interact multimodally when their context allows for it, which may not be all of the time. Additionally, when multiple inputs produce erroneous data, the errors do not necessarily compound into one large error; a multimodal device should be able to produce better results when all of the inputs are combined. People also know which input methods suit their context best and will use those inputs accordingly.
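
The sketch below illustrates why combining inputs can outperform either one alone: two recognizers each return ranked hypotheses, and multiplying their confidences lets one recognizer's strength resolve the other's ambiguity. The scores and vocabulary are made up for illustration.

# Hypothetical sketch of combining two recognizers (e.g., speech and lip
# movement): multiplying their per-hypothesis confidences lets one input's
# strength correct the other's ambiguity. Scores and labels are illustrative.

def combine(speech_nbest, lip_nbest):
    """Each argument maps hypothesis -> confidence in [0, 1]."""
    joint = {
        hyp: speech_nbest.get(hyp, 0.01) * lip_nbest.get(hyp, 0.01)
        for hyp in set(speech_nbest) | set(lip_nbest)
    }
    return max(joint, key=joint.get)

if __name__ == "__main__":
    speech = {"mail": 0.48, "nail": 0.47, "tail": 0.05}   # speech alone is nearly a coin flip
    lips   = {"mail": 0.55, "nail": 0.15, "bail": 0.30}   # lip shapes rule out "nail"
    print(combine(speech, lips))                          # 'mail'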

Conclusion

In the future, interacting with computers and mobile devices will become more ubiquitous. That ubiquity will create a shift in computing toward more natural usage behavior and easier-to-use interfaces. To create the best user experience, multimodal interfaces must synchronize multiple inputs and interact consistently with the user. Successful multimodal interfaces will also depend on multiple disciplines, such as speech and hearing sciences, perception and vision, linguistics, psychology, and statistics, working together to make the multimodal experience seamless.

Bibliography

Jones, Matt (2006). Mobile Interaction Design. Hoboken, NJ: John Wiley & Sons, 3-37.

Baillie, Lynne, et al. (2005). Designing Mona: User Interactions with Multimodal Mobile Applications. HCI International 2005, 11th International Conference on Human-Computer Interaction, Las Vegas, NV, July 22-27, 2005.

Oviatt, Sharon, et al. When Do We Interact Multimodally? Cognitive Load and Multimodal Communication Patterns. 129-136.

Oviatt, Sharon (1999). Ten Myths of Multimodal Interaction. Communications of the ACM, 42(11), 74-81.

Oviatt, Sharon (Sept.-Oct. 2003). Advances in Robust Multimodal Interface Design. IEEE Computer Graphics and Applications, 23, 62-68.

Oviatt, Sharon (2000). Multimodal Interfaces that Process What Comes Naturally. Communications of the ACM, 43(3), 45-53.

Pieraccini, Roberto (2004). Multimodal Conversational Systems for Automobiles. Communications of the ACM, 47(1), 47-49.

Raman, T. V. (2003). User Interface Principles for Multimodal Interaction. Retrieved February 7, 2007, from http://www.almaden.ibm.com/cs/people/tvraman/chi-2003/mmi-position.html

Simon, Steven J. (2007). User Acceptance of Voice Recognition Technology: An Empirical Extension of the Technology Acceptance Model. Journal of Organizational and End User Computing, 19(1), 24-50.