Multimodal Issues - Article Review
By Yaro Brock
Introduction
This is a review of several articles written about multimodal interfaces. Multimodal interfaces are interfaces that allow multiple modes of input (e.g., voice, text, gesture) and/or output. This review explores the benefits and opportunities that multimodal interfaces offer, as well as the challenges that come with this functionality.
Opportunities
There are many benefits and opportunities that multimodal mobile computing offers. Adding a voice mode for either input or output is a common practice with handsets. It is self-evident that humans are comfortable using speech to communicate, which would seem to indicate that people would prefer to speak when interacting with a computing device (Hemsen, 2002, p. 1). In addition to input, speech can also be beneficial when coupled with other types of output, such as text. “Since spoken output is presented in parallel, the speech output disambiguates, clarifies, and enhances textual output.” (Hemsen, 2002, p. 2) Hemsen argues convincingly that by combining multiple modes of output, communication is more likely to be successful.
Multimodal systems are not only more natural and robust, but also make it possible for communication to occur in changing and challenging environmental conditions. Multimodal input is especially important in places where the environment prevents one mode of input or output from being used (Kondratova et al., 2006). In Kondratova’s study, the benefits of a multimodal system that was both eyes- and hands-free were seen clearly when participants were put in a simulation of a real-life worksite. Loud noise interfered with voice input, so text input was used instead. When voice input was made possible through the use of a headset, subjects were much more aware of their surroundings, making their work much less hazardous. In addition to actively entered information, the devices could also capture passive data. For instance, the temperature could be recorded and delivered without the user having to input or even consider this data.
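To make the passive capture idea concrete, here is a minimal sketch of environmental readings being logged on a timer with no input or attention required from the user. The read_temperature() call is a hypothetical placeholder for whatever sensor API the device exposes; it is not something described in the reviewed papers.

```python
import time
from typing import List, Tuple


def read_temperature() -> float:
    """Hypothetical placeholder for a real on-device sensor API."""
    return 21.5


def log_passively(duration_s: int = 60, interval_s: int = 10) -> List[Tuple[float, float]]:
    """Record (timestamp, temperature) pairs in the background, with no user action."""
    readings = []
    start = time.time()
    while time.time() - start < duration_s:
        readings.append((time.time(), read_temperature()))
        time.sleep(interval_s)
    return readings
```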
Multimodal devices can also be used to track environmental conditions that might affect the use of the device. If a device is to know exactly what type of information is appropriate, it needs to be aware of the user’s attention: are they available, or are they preoccupied with other important information? There is a big difference between distracting someone while they are driving a car versus while they are using a PDA to help navigate through a new city! There are several possible ways that a device could determine a user’s attention state. Voice analysis could tell the handset that the user is being distracted by the environment. Other behaviors that might be monitored are motor skills, eye movement, and use of the device itself, such as slow scrolling or abrupt movement (Jameson, 2002, p. 9).
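A rough sketch of how those monitored behaviors might be combined into an attention estimate is shown below. The signal names, weights, and threshold are illustrative assumptions, not values taken from Jameson or the other reviewed papers.

```python
from dataclasses import dataclass


@dataclass
class BehaviorSample:
    speech_hesitation: float   # 0.0-1.0, e.g. pause/filler rate from voice analysis
    scroll_speed_drop: float   # 0.0-1.0, how much slower than usual the user scrolls
    abrupt_motion: float       # 0.0-1.0, e.g. from accelerometer jerk
    gaze_off_screen: float     # 0.0-1.0, fraction of time eyes are off the display


def estimate_distraction(sample: BehaviorSample) -> float:
    """Combine the monitored behaviors into a single distraction score (0.0-1.0)."""
    weights = {
        "speech_hesitation": 0.3,
        "scroll_speed_drop": 0.2,
        "abrupt_motion": 0.2,
        "gaze_off_screen": 0.3,
    }
    return sum(weights[name] * getattr(sample, name) for name in weights)


def choose_output_mode(sample: BehaviorSample) -> str:
    """Fall back to a less demanding output mode when the user appears preoccupied."""
    return "speech_only" if estimate_distraction(sample) > 0.6 else "speech_plus_text"
```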
New methods for analyzing multimodal usage are also becoming very helpful. One example is a camera that can be attached to headgear or glasses to collect data. These cameras are now capable of tracking eye movement as well as capturing what is occurring on the screen, although in a less robust way than their lab-bound counterparts (Jameson, 2002, p. 6). This is beneficial in that it does not take any resources from the device itself, although it is a bit cumbersome. In addition, lab evaluations of multimodal devices have also been successfully completed using low-fidelity materials such as paper, overhead projectors, and index cards (Chandler et al., 2002). Multimodality does not prevent creative developers from evaluating products, even at the very low fidelities at which traditional interfaces have long been tested.
Describing a multimodal interface can be much more complex than describing a typical interface. Fortunately, researchers are developing methods of doing this. One group of researchers described four possible ways that a user could be utilizing different modalities: 1) they always use one modality; 2) they have no specific modality preferences; 3) they use modalities that are not available as well as modalities that are available; 4) they use different modalities to enter different types of information simultaneously (Coutaz et al., 1995). Having these four basic types of user interaction helps developers understand how to best meet the needs of users. In addition to information about the user, the state of the software has to be known as well. Information such as the current state, the goal, the modality or modalities currently in use, and the temporal relationship between modalities must all be gathered in order to analyze use (Coutaz et al., 1995). By naming these elements, people analyzing multimodal use are able to more clearly understand and contextualize how these devices are being used.
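The sketch below shows one way the usage information Coutaz et al. call for could be recorded during an evaluation session. The category and field names are paraphrases chosen for illustration; this is not an implementation of the paper's own notation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class ModalityPreference(Enum):
    ALWAYS_ONE_MODALITY = auto()   # 1) always uses a single modality
    NO_PREFERENCE = auto()         # 2) no specific modality preference
    USES_UNAVAILABLE = auto()      # 3) tries modalities that are not offered
    SIMULTANEOUS_USE = auto()      # 4) enters different data in parallel modalities


@dataclass
class InteractionSnapshot:
    """The state information that has to be gathered to analyze multimodal use."""
    current_state: str                                            # e.g. "entering destination"
    goal: str                                                     # e.g. "plan route"
    active_modalities: List[str] = field(default_factory=list)    # e.g. ["speech", "pen"]
    temporal_relationship: str = "sequential"                     # or "concurrent"


snapshot = InteractionSnapshot(
    current_state="entering destination",
    goal="plan route",
    active_modalities=["speech", "pen"],
    temporal_relationship="concurrent",
)
```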
Challenges
Although the benefits of multimodal systems are great, they do present their own set of challenges. Jameson states that “there exists largely continual competition between the system and environment for various perceptual, motor, and cognitive resources” (2002). Because of this struggle between the environment and the device, many complications can occur. As noted earlier, if devices become capable of interpreting their context, they will be able to know exactly when to initiate different modalities. Unfortunately, at this point, there is no surefire way for a system to interpret the world or to guarantee that the decisions it makes on the user’s behalf will be correct. In certain situations, this imperfection doesn’t matter. But when safety is a concern, the device cannot necessarily be trusted to make decisions for the user (Jameson, 2002, p. 8).
Multimodal devices do have the capacity to offer superior interactions to unimodal devices because they can pick the optimum interaction on a case-by-case basis. Unfortunately, this is often negated by the fact that giving the user too many choices can increase cognitive overhead (Jameson, 2002, p. 8). In short, the benefits are offset by the drawbacks associated with choice. People who have to decide how to interact are less likely to be focused on the problem at hand than people who simply work through one accustomed way of doing things, efficient or not.
Another challenge to multimodal systems appears when developers need to gather data about customer use. Data collection on multimodal devices is difficult for three reasons. The first is that input arrives in multiple streams, some of which are hard to record (e.g., gesture). The second is that use is not typically stationary. Last, mobile devices have computational limitations that make it even more difficult to capture information such as screen data (Jameson, 2002, p. 5). Although a workaround like a camera mounted on a headset or glasses is feasible, it does not come without obstacles of its own. The mount may interfere with normal behavior, it requires power that could run short, it is much less reliable than lab equipment, and its eye tracking is not nearly as effective as that of stationary systems.
Another big problem is the limitations of the device itself. It already has “severe computational resource constraints” (Jameson, 2002, p. 1) and so cannot be of much help in the data-gathering effort. A multimodal device is typically stressed with trying to receive and distribute various types of information; it simply does not have the capacity to capture screenshots and report on its interaction with the user. This complexity is made clearer by the following quote: “Multimodal user interfaces support interaction techniques which may be used sequentially or concurrently, and independently or combined synergistically” (Nigay & Coutaz, 1993). The nature of multimodality makes it both a reliever and a producer of cognitive load.
The last challenge to gathering data is related to testing low-fidelity versions of a multimodal device. Testing multimodal interfaces requires more creativity, and possibly more people, to properly simulate an interface (Chandler et al., 2002). This is because one person cannot keep in mind all of the systems that need to be at the user’s beck and call. For instance, extra wizards (people playing the part of the device) may be needed if both audio and visual cues are necessary. Making the different modalities work in conjunction with one another might also require additional effort and practice.
Future Research Areas
Although current work shows promise, much additional research must be done. The primary area for this work is identifying context more accurately. If multimodal devices can accurately understand context, then programming them to interact with humans becomes a much easier task. One way of approaching this might be to take multiple measures in parallel so that their combined data produce a highly reliable prediction. This would also have the added benefit of not requiring the user to choose which mode he or she should be in, reducing cognitive load without any negative consequences.
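A minimal sketch of that parallel-measures idea follows: several independent context estimates are fused so that no single unreliable sensor decides the mode, and the user never has to choose one manually. The sensor readings and the simple confidence-weighted average are assumptions made for illustration, not a method from the reviewed papers.

```python
from typing import List, Tuple


def fuse_context_estimates(estimates: List[Tuple[float, float]]) -> float:
    """Each estimate is (probability the user is busy, confidence in that reading)."""
    total_weight = sum(conf for _, conf in estimates)
    if total_weight == 0:
        return 0.5  # no evidence either way
    return sum(p * conf for p, conf in estimates) / total_weight


# e.g. microphone noise level, accelerometer activity, calendar status (hypothetical)
readings = [(0.9, 0.8), (0.7, 0.5), (0.2, 0.3)]
if fuse_context_estimates(readings) > 0.6:
    print("switch to hands-free, speech-only interaction")
else:
    print("keep the full visual interface available")
```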
Works Cited
Chandler, C., Lo, G., & Sinha, A. (2002). Multimodal Theater: Extending Low Fidelity Paper Prototyping to Multimodal Applications. Conference on Human Factors in Computing Systems (CHI), Student Posters (pp. 874-875). ACM Press.
Coutaz, J., Nigay, L., Salber, D., Blandford, A., May, J., & Young, R. (1995). Four Easy Pieces for Assessing the Usability of Multimodal Interaction: The CARE Properties. IFIP International Conference on Human-Computer Interaction (pp. 115-120). London: Chapman & Hall.
Hemsen, H. (2002). A Testbed for Evaluating Multimodal Dialogue Systems for Small Devices. Proceedings of the ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany.
Jameson, A. (2002). Usability Issues and Methods for Mobile Multimodal Systems. ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany.
Kondratova, I., Lumsden, J., & Langton, N. (2006). Multimodal Field Data Entry: Performance and Usability Issues. Proceedings of the Joint International Conference on Computing and Decision Making in Civil and Building Engineering. Montreal, Canada.
Nigay, L., & Coutaz, J. (1993). A Design Space for Multimodal Systems: Concurrent Processing and Data Fusion. Proceedings of INTERCHI '93 (pp. 172-178). New York: ACM Press.