Navigating in a virtual audio environment

A. Blum, M. Denis, B. Katz & A. Afonso

Introduction

The research reported here results from a collaboration between researchers in psychology and in acoustics on the issue of spatial cognition. It was conducted in the framework of an EU-supported project, the STREP "Wayfinding". Our objective is to account for the auditory component of wayfinding tasks by addressing the following general question: Which acoustic cues are relevant for navigation? In this research, we focus on pedestrian navigation. Our approach starts from existing knowledge of auditory localization in acoustics and examines how these findings extend to the context of wayfinding.

Our study focuses on the role of the head movements a listener uses to localize a sound source in the surrounding space. The acoustics literature attests to the significant role of head movements (HM) in the localization of sound sources (Wenzel, 1996; Wightman & Kistler, 1999; Minnaar et al., 2001). HM provide people with dynamic cues that allow for accurate auditory localization and play an important role in helping them resolve front/back confusions. This is not surprising from a neuroanatomical point of view, since the vestibular system is intimately connected to the auditory system in the inner ear. However, all previous studies have been carried out in static situations (participants seated on a chair, with no displacement). The question we address in the experiment reported here is therefore whether HM are also important when the participants are allowed to navigate within the sound scene.

The strongest acoustic cues the human auditory system uses for sound localization are the interaural differences (Blauert, 1996). Interaural Time Differences (ITD) result from the difference between the acoustic paths from the sound source to the left and right ears when sources are lateralized (cf. figure 1). Around the interaural axis, one can define iso-ITD surfaces, commonly called cones of confusion (cf. figure 2), which lead to possible ambiguities in the interpretation of the acoustic cues for localization. A classical error of interpretation is a front/back confusion: a source in front of the listener is perceived behind, at the position symmetrical with respect to the interaural axis, which presents the same ITD. The studies cited above have shown that the human binaural system uses dynamic variations of interaural cues correlated with head motion to resolve these ambiguities. When the head turns, the ITD varies in opposite directions depending on whether the sound source is in front of the listener or behind. This allows a very clear distinction between the different candidates on the cone of confusion.

Figure 1: ITD (Interaural Time Difference) resulting from the longer acoustic path when sound sources are lateralized
Figure 2: Iso-ITD surface (cone of confusion) and possible ambiguity in the acoustic cues for localization
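The opposite variation of the ITD for front and back sources can be illustrated with a minimal sketch. It assumes a simplified far-field sine-law model of the ITD (no head shadowing); the interaural distance, the speed of sound, the sign conventions and the function name `itd_seconds` are illustrative assumptions, not values or methods taken from the experiment.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air
EAR_DISTANCE_M = 0.18       # assumed interaural distance (hypothetical value)


def itd_seconds(source_azimuth_deg, head_yaw_deg=0.0):
    """ITD from a simplified sine-law model (far field, no head shadowing).

    Assumed convention: azimuth and head yaw are measured clockwise from
    straight ahead; a positive ITD means the right ear is reached first.
    """
    relative = math.radians(source_azimuth_deg - head_yaw_deg)
    return (EAR_DISTANCE_M / SPEED_OF_SOUND_M_S) * math.sin(relative)


# Two candidates on the same cone of confusion, mirrored about the interaural axis.
front_deg, back_deg = 30.0, 150.0
static_front = itd_seconds(front_deg)
static_back = itd_seconds(back_deg)

# Turn the head 10 degrees to the right and look at how the ITD changes.
delta_front = itd_seconds(front_deg, 10.0) - static_front
delta_back = itd_seconds(back_deg, 10.0) - static_back

print(f"static ITD front/back: {static_front:.6f} s / {static_back:.6f} s")
print(f"ITD change front: {delta_front:+.6f} s, back: {delta_back:+.6f} s")
```

In this model the two static ITDs are identical, while the head turn decreases the ITD for the front source and increases it for the back source, which is the dynamic cue described above.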

In a navigational context, one may propose the following hypothesis: When moving through space, the auditory system is also provided with dynamic cues that may be sufficient to resolve localization ambiguities. These alternative dynamic cues would come from translation rather than from head rotations. Informal tests seemed to confirm the hypothesis. In these tests, the objective was to localize randomly positioned loudspeakers in a room while blindfolded and with only translational movements allowed. The loudspeaker positions were correctly identified when walking through the room without making HM. However, it is difficult to control the HM of someone who is actually walking, and we could not ensure that no acoustic information was taken from small unconscious rotations of the head. The original idea of the present study is to use a virtual audio environment in which the experimenter can fully control the interactions between HM and dynamic variations of binaural cues.

Binaural synthesis is now a classical tool for providing 3D audio in a virtual environment (Begault, 1994). This technique is rendered via headphones and is based on the natural acoustic cues the human auditory system uses to localize sound. A well-known issue is that under normal headphone listening conditions, the sound scene follows HM: the listener can no longer benefit from dynamic binaural cues. However, with a head tracker mounted on the headphones, one can update the sound scene relative to head orientation in real time, correcting this artifact. In the present experiment, we contrasted two conditions: with and without head tracking. Participants with head tracking can obtain relevant acoustic information from HM, as in natural spatial hearing, whereas participants without head tracking have no possibility of getting such information to resolve localization ambiguities.
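The difference between the two conditions can be sketched as follows. This is only an illustration of the principle, not the rendering code used in the experiment; the function names, the angle conventions and the example values are assumptions.

```python
def wrap_degrees(angle_deg):
    """Wrap an angle to the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def rendered_azimuth(source_bearing_deg, body_yaw_deg, head_yaw_deg, head_tracking):
    """Azimuth at which a source is rendered relative to the listener's head.

    source_bearing_deg: direction of the source in the virtual world
    body_yaw_deg:       orientation of the avatar's body (set with the joystick)
    head_yaw_deg:       real head rotation relative to the body
    head_tracking:      if False, head rotation is ignored and the scene
                        follows the head, as in ordinary headphone listening
    """
    listener_yaw = body_yaw_deg + (head_yaw_deg if head_tracking else 0.0)
    return wrap_degrees(source_bearing_deg - listener_yaw)


# Hypothetical situation: source 30 degrees to the right, head turned 20 degrees right.
with_tracking = rendered_azimuth(30.0, 0.0, 20.0, head_tracking=True)
without_tracking = rendered_azimuth(30.0, 0.0, 20.0, head_tracking=False)
print(with_tracking, without_tracking)
```

With head tracking the rendered azimuth changes when the head turns, so the listener receives dynamic binaural cues; without it the rendered azimuth stays the same whatever the head does.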

Experimental setup

Half of the participants are equipped with a head tracker (Flock of Birds) mounted on the headphones. Position is updated every 50 ms (20 Hz).

Participants are blindfolded. Sound is rendered via headphones (Sennheiser HD 570) and spatialized with binaural techniques (IRCAM’s Spat) and an individualization procedure. Individualization was based on a measurement of the participant’s head circumference and on a pre-test in which he/she could choose among seven statistically representative HRTFs (the filters used for binaural synthesis), as described in Afonso et al. (2005). Participants move in the environment using a joystick (Saitek Cyborg). This avoids the memorization of sensorimotor (locomotor) contingencies.

Artificial reverberation is applied and matches a real room with variable acoustic characteristics (Espro at IRCAM) in its most absorbent configuration (reverberation time of 0.4 s). The sound of the avatar’s footsteps is rendered to represent displacement and varies with the avatar’s velocity. In the virtual environment, the relation between distances, velocity and the corresponding acoustic properties is designed to match real-world contingencies. We will therefore express these dimensions in the usual units (meters for distance and km/h for speed). Pushing the joystick up or down moves the avatar forward or backward, respectively, in the virtual world. The maximum speed, corresponding to the extreme position, is 5 km/h, which is about the natural human walking speed. With left/right movements, the participant controls the rotation of his/her avatar’s body. Translation and rotation can be combined with diagonal manipulations. No literature could be found to guide the adjustment of the navigational parameters of the joystick. Experts in Virtual Reality from our laboratory (VENISE project) confirmed that such adjustments are generally made empirically. A single formula corresponding to pedestrian navigation was found in the documentation of Open Scene Graph, an open source library for 3D graphics ([1]). The rotation angle $\alpha$ in degrees resulting from the current value $\delta x$ of the joystick (from a lateral movement) is expressed as: $\alpha = (\delta x / \mathrm{max}) \cdot 50^{\circ} \cdot \delta t$, where $\mathrm{max}$ is the value corresponding to the maximum lateral position of the joystick, and $\delta t$ is the time step between two updates of $\delta x$. We obtained a linear relation between $\alpha$ and $\delta x$ with a coefficient of 0.0009765 (which turned out to be very similar to the one chosen empirically, 0.001).
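The rotation formula above can be written as a small function. This is a minimal sketch: the function name, the joystick range and the time step in the usage lines are hypothetical and are not the values used in the experiment.

```python
ROTATION_FACTOR_DEG = 50.0  # the constant of 50 degrees from the formula in the text


def rotation_step_deg(delta_x, max_value, delta_t):
    """Rotation angle for one update: alpha = (delta_x / max) * 50 deg * delta_t.

    delta_x:   current lateral joystick value (its sign gives the turn direction)
    max_value: joystick value at the maximum lateral position
    delta_t:   time step between two updates of delta_x
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    return (delta_x / max_value) * ROTATION_FACTOR_DEG * delta_t


# Hypothetical joystick range (1000) and time step (0.05), for illustration only.
full_deflection = rotation_step_deg(1000, 1000, 0.05)
half_deflection = rotation_step_deg(500, 1000, 0.05)
print(full_deflection, half_deflection)
```

For a fixed joystick range and time step, the angle is proportional to $\delta x$, which is the linear relation mentioned above.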

The experiment is designed as a “game-like” bomb disposal scenario. Bombs (sources with a ticking countdown sound) are hidden in an open space and the participant has to reach their positions to defuse them.

The participant navigates from target to target repeating the following sequence:

1. The participant is at A and searches for the sound coming from position B.
2. When he/she gets close (within a radius of 2 m) to B, the source C becomes active.
3. When B is hit (defused), its sound stops, and the participant now looks for the next target, C.

The next source always starts before the participant reaches the active one. This may prompt the participant to make HM to localize the new sound while continuing straight toward his/her current target. As two sources can be active at the same time, two different countdown sounds were used alternately, normalized to the same level.

The distance between two sources is always 5 m. There is a limited time (60 s) to reach the target (after which the bomb explodes). If the time runs out, the participant is placed at the position of the missed target and has to resume the task in the direction of the next source.
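The rules governing one segment (activation of the next source, hit, explosion) can be summarized in a short sketch. It is not the software used in the experiment: the names `SegmentState` and `update_segment` are hypothetical, the hit radius is left as a parameter, the boundary comparisons are assumptions, and activating the next source after an explosion is our reading of the rule that the participant resumes toward the next source.

```python
import math
from dataclasses import dataclass

CLOSE_RADIUS_M = 2.0  # from the text: the next source starts within 2 m of the target
TIME_LIMIT_S = 60.0   # from the text: the bomb explodes after 60 s


@dataclass
class SegmentState:
    next_source_active: bool = False
    outcome: str = "searching"  # "searching", "hit" or "exploded"


def update_segment(state, participant_xy, target_xy, elapsed_s, hit_radius_m):
    """Update the state of the current segment from the participant's position."""
    if state.outcome != "searching":
        return state  # the segment is already finished
    distance = math.dist(participant_xy, target_xy)
    if distance <= CLOSE_RADIUS_M:
        state.next_source_active = True
    if distance <= hit_radius_m:
        state.outcome = "hit"
    elif elapsed_s >= TIME_LIMIT_S:
        state.outcome = "exploded"
        # Assumption: the next source is started so the participant can resume.
        state.next_source_active = True
    return state


state = SegmentState()
update_segment(state, (3.5, 0.0), (5.0, 0.0), elapsed_s=10.0, hit_radius_m=0.6)
print(state)
```

In this usage example the participant is 1.5 m from the target, so the next source has started while the current one is still being searched for.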

A source is hit when the participant approaches its position within a radius of 0.6 m. This “hit detection radius” of 0.6 m corresponds to an angle of ±6.8° at a distance of 5 m from the source (cf. figure 3), which is the mean human localization blur in the horizontal plane (Blauert, 1996). As a consequence, if the participant oriented him/herself within this precision when starting to look for a target, the target could be reached by going straight ahead.
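The angle quoted above can be checked with a few lines. The sketch assumes that the hit radius is taken perpendicular to the line joining the participant and the source, so that the half-angle is the arctangent of radius over distance; the function name is hypothetical.

```python
import math


def hit_half_angle_deg(hit_radius_m, distance_m):
    """Half-angle subtended by the hit radius, seen from the given distance."""
    return math.degrees(math.atan2(hit_radius_m, distance_m))


angle = hit_half_angle_deg(0.6, 5.0)
print(f"{angle:.1f}")  # prints 6.8
```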

Figure 3

The experiment is composed of 6 identical trials involving displacement along a succession of 8 segments (8 sources to hit in each trial).

For the analysis, we ignored trial #1 as well as segments #1 and #8 of each trial:

• The first trial was used as a training session.
• For segment #1, there is only one sound active, as there is no bomb already active when searching for bomb #1.
• For segment #8, there is only one sound active, as there is no bomb following.

In total, 5 × 6 = 30 segments per participant are analyzed. The angles made by the 6 retained segments of one trial are balanced between right/left and back/front (-135°, -90°, -45°, 45°, 90°, 135°). Finally, to control for a possible sequence effect, two different orderings of presentation are randomly assigned to the participants.
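The selection of analyzed segments can be written out explicitly. This is only a bookkeeping sketch of the rules stated above; the variable names are illustrative.

```python
N_TRIALS = 6    # trial 1 is a training session and is ignored
N_SEGMENTS = 8  # segments 1 and 8 have only one active sound and are ignored

analyzed = [
    (trial, segment)
    for trial in range(2, N_TRIALS + 1)   # trials 2 to 6
    for segment in range(2, N_SEGMENTS)   # segments 2 to 7
]
print(len(analyzed))  # prints 30

# Angles of the 6 retained segments of one trial, in degrees.
SEGMENT_ANGLES_DEG = [-135, -90, -45, 45, 90, 135]
n_negative = sum(1 for a in SEGMENT_ANGLES_DEG if a < 0)
n_positive = sum(1 for a in SEGMENT_ANGLES_DEG if a > 0)
n_front = sum(1 for a in SEGMENT_ANGLES_DEG if abs(a) < 90)
n_back = sum(1 for a in SEGMENT_ANGLES_DEG if abs(a) > 90)
print(n_negative, n_positive, n_front, n_back)
```

The counts confirm the balance: three angles on each side, and two in front, two behind, with the two remaining angles purely lateral.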

Recorded data

Twenty participants were selected for this study. None had hearing deficiencies. Each one was allocated to one of the two head tracking conditions (with or without). An equal distribution was achieved between the participants of the two groups according to gender, age, and educational and socio-cultural background. Each group comprised five women and five men with a mean age of 34 years (from 22 to 55 years, sd = 10).

For the analysis, we defined the following measurements of performance:

• Hit time (time to reach target for each segment)
• Close time (time to get within 2 m from target - next sound starts)
• Percentage of hits (bombs defused)

Additionally, at the end of the experiment, participants were asked to draw the sources' positions on a sheet of paper on which their starting point and the first target were already marked, in order to normalize the adopted scale and drawing orientation.

Finally, the position and orientation of the participant over time were recorded. This allowed us to reconstruct the participants’ trajectories in the virtual environment. Figure 4 shows the trajectory of a participant for trial #2 toward the 8 sources he had to find. This participant belongs to the “no head tracking” condition. The orange arrows represent the orientation of the avatar controlled by left/right movements of the joystick. We can observe moments of reorientation with a lot of rotation, mainly when the participant has just found a target and starts to search for the next source. In figure 5, the trajectory of a participant in the “with head tracking” condition is presented for the same trial. This time, green arrows are added to represent HM relative to body orientation (the avatar’s body orientation in the virtual environment, set by movements of the joystick). At a given point, if the orange and green arrows point in the same direction, we know the participant did not turn his head at that moment. As in figure 5, this was the general case for all participants.

Figure 4: Trajectory of a participant without head tracking (trial #2). X and Y axes in meters

Figure 5: Trajectory of a participant with head tracking (trial #2). The green arrows represent head orientation relative to the avatar’s body displacement. X and Y axes in meters

Results

The results reported below are essentially based on the statistical analysis of hit times and of the percentage of hits. The figures present these two measures with a bar for the mean hit time (with whiskers for the standard deviation) and, under the bar, the numeric value of the corresponding percentage of hits.

We first checked for a possible influence of the source position sequence (two different orderings were randomly assigned) and of the type of source (two different sounds were used): no effect was found for these two control variables.

In contrast, large individual differences in hit time performance were observed (p < $10^{-5}$). Figure 6 shows individual performances sorted by percentage of hits. Some participants show a mean hit time more than twice that of the fastest ones. The percentage of hits varies from 13% to 100%. In fact, some participants did not really manage to execute the task while others had no problem at all. Performances expressed as mean times or as percentages are globally correlated; the linear correlation coefficient between hit times and percentage of hits is -0.67 (p = 0.0013).

Figure 6: Individual performances of the 20 participants

The effect of age was investigated. Four age groups were defined with four participants between 20 and 25 years, six between 25 and 30, six between 30 and 40 and four between 40 and 60. Figure 7 shows the performances for each age group. Young participants had shorter hit times and higher percentage of hits than older ones. A significant effect of age and a significant gender x age interaction were found (cf. figure 7 bis). Older women had more difficulty in executing the task.

Figure 7 and 7 bis

In a questionnaire filled in before the experiment, the participants were asked to report whether they had previous experience with video games. Eleven participants reported such experience; the remaining nine did not. Figure 8 shows that the experienced group performed better. There was a significant effect of this factor on hit times (p = 0.004), and the group with experience had 94% of hits versus only 69% for the other group. Not surprisingly, individuals familiar with video games may be more comfortable with immersion in a virtual environment and with joystick manipulation. This can be related to the age group observation, since no participant from the [40-60] group reported any experience with video games.

A significant effect of learning was also found (p = 0.0047), as shown in figure 9. This effect is more likely due to the learning of how to navigate in the virtual audio environment than to a memorization of the position of the sources since participants did not perceive any sequence repetition and reported they treated each target individually. This is also confirmed by the fact that they could not draw the path of a trial on a sheet of paper when they were asked to do so at the end of the experiment. In a previous experiment (Afonso et al., 2005), participants also had to reach virtual sound sources but with a real displacement. After navigation in the virtual environment, they were able to reconstruct the sound scene whereas in the present experiment they could not. This is an argument in favor of the importance of the memorization of sensorimotor (locomotor) contingencies for the representation of space.

Figure 8
Figure 9

Figure 10

An ANOVA on the hit times with the head tracker condition as a grouping factor did not reveal any significant effect. Mean hit times of the two groups are very similar (19.8 s versus 20.4 s). Figure 10 shows that participants who benefited from relevant acoustic cues related to HM did not perform better than the others. Moreover, front/back confusions are observable for participants without as well as with head tracking.

In figures 11 and 11 bis, two trajectories with such an inversion are presented for a participant in the “with head tracking” condition. In fact, the participants did not appear to use HM to resolve localization ambiguities; they focused on the joystick, keeping their head straight and concentrating.

Figure 11: Front/back confusion. The participant reaches source #1; source #2 is in front of him, but he turns around and goes in the opposite direction. After covering a certain distance, he realizes the localization inversion and rotates again to go back in the correct direction.
Figure 11 bis: Back/front confusion. Same as in figure 11, but with back and front inverted and with sources #3 and #4

Discussion and perspectives

The inclusion of head tracking was not found to be necessary in our experiment, where movements made with the joystick are sufficient for the participants to succeed in the task. This finding is of interest for the design of audio environments, since head-tracking equipment is expensive. However, the use of a joystick raises some questions:

• experience with video games affects performance;
• very few HM are made since the participants tend to focus on movements via the joystick.

Participants seem to have transferred the vestibular modality to the use of the joystick. This is supported by the typical navigation strategy observable in the participants’ trajectories: rotations are made with the joystick to decide on the correct direction (cf. figure 12 as an example).

Figure 12: Typical navigation strategy. Rotations are made with the joystick to decide on the right direction

In a new study, which involves A. Afonso, B. Katz, L. Picinali, and M. Denis, we are trying to identify the acoustic cues used by blind people to comprehend the configuration of a closed space (in a building). Blind people are invited to navigate along the corridors of our lab. The participants wear ear-microphones which record online all the auditory information they hear while navigating. Other recordings are made along the same paths with 3D microphones. New participants listen to the recordings and have to reconstruct the virtual audio environment with Lego blocks. Their reconstructions will be compared to those of participants who have learned the same environment during actual locomotion.

In this new experiment, the head tracker is not Flock of Birds, but XSens. The engine for binaural rendering is based on a B-format virtual loudspeaker system with binaural conversion through the LSE (Limsi Spatialization Engine). Lastly, the HRTFs are chosen from the LISTEN database, and the ITD is customized for each participant depending on his/her head circumference.

References

Afonso, A., Katz, B. F. G., Blum, A., Jacquemin, C., & Denis, M. (2005). A study of spatial cognition in an immersive virtual audio environment: Comparing blind and blindfolded individuals. Proceedings of ICAD (Limerick, Ireland), 228-235.

Blum, A., Denis, M., & Katz, B. F. G. (2006). Navigation in the absence of vision: How to find one's way in a 3D audio virtual environment? Cognitive Processing, 7, Supplement 1, S151. (Abstract). International Conference on Spatial Cognition, Roma/Perugia, Italy, 12-15.

Begault, D. R. (1994). 3-D sound for virtual reality and multimedia. Cambridge, MA: Academic Press.

Blauert, J. (1996). Spatial hearing: The psychophysics of human sound localization. Cambridge, MA: MIT Press.

Minnaar, P., Olesen, S. K., Christensen, F., & Møller, H. (2001). The importance of head movements for binaural room synthesis. Proceedings of ICAD (Espoo, Finland), 21-25.

Wightman, F. L., & Kistler, D. J. (1999). Resolution of front-back ambiguity in spatial hearing by listener and source movement. Journal of the Acoustical Society of America, 105, 2841–2853.

Wenzel, E. M. (1998). The impact of system latency on dynamic performance in virtual acoustic environments. Proceedings of the 16th ICA and 135th ASA (Seattle, WA), 2405-2406.