The use of 3D-audio in a multi-modal Teleoperation platform for remote driving/supervision
Brian FG Katz & Antoine Tarault
The use of spatial audio in virtual reality applications is becoming increasingly common. However, the use of 3D audio in telepresence applications outside of virtual conferences is less common, and its use in a teleoperation context is rarer still. This paper presents a recent evolution of the SACARI project (Supervision of an Autonomous Car by an Augmented viRtuality Interface), which combines audio and video telepresence technology for the application of remote vehicle driving and supervision. For such an application, the real-time rendering of remote events is crucial, making system latencies an important part of the system design and optimization. Two different audio rendering systems are presented. A subjective evaluation study was carried out using a simplified version of the SACARI platform. This page is a summary of the conference paper and presentation given at the AES 30th International Conference, Saariselkä, Finland, 15-17 March 2007 [AES30].
One of the most important rules in a teleoperation system is to reconstitute the remote location’s environment as accurately as possible, with as short a delay as possible. These two demands are of course contradictory, and the system must be designed as a compromise between quality/detail of the information presented and the temporal pertinence of the information. The criterion of accuracy is of course inherently related to the task at hand. A level of detail which exceeds that required for succeeding at the given task is not necessarily beneficial to the perceived quality of the system.
The Car Simulator
The system has been designed around two main ways of driving: remote driving and supervision. In remote driving mode, the user only specifies direction, speed, and braking (as if directly driving a vehicle). In supervision mode, the user should only need to give the immediate path to follow, say for the next ten seconds. Given the current state of technology and the incalculable number of unexpected events, a totally autonomous vehicle cannot avoid every danger related to the road. For this reason, the user should be able to correct, in real time, the possible errors of the autonomy algorithm. The telepresence of the user at the remote site (that of the vehicle) is paramount for the detection of, and response to, unforeseen events, which could be errors of the autonomy algorithm, dangers posed by other vehicles, etc. For this reason, the true precision of the spatial auditory rendering is less important to the task than the general directional indication cues (along with the actual auditory event, of course). It is therefore possible to consider the design of the audio stream as being concerned with reducing front/back or up/down confusions, and being less concerned with the true angular precision of the perceived auditory event.
Global System Architecture
To fulfill the teleoperation constraints, the overall system was conceived to have minimal latencies, distributed operations, and modular components. For the tasks and evaluations of the system it was necessary to allow easy data exchange between multiple applications and to have the possibility to record/replay data from different tests. In response to these constraints, SACARI was developed using the RTMaps platform and OpenSceneGraph on a distributed cluster architecture.
Remote stereo video
The system is equipped with a dual camera whose purpose is to transmit a stereo video stream to a distant VR user. The stereo camera is mounted on a motorized pan-tilt head so that the distant user can actively “turn his head” virtually and peruse the remote environment (a “periscope” metaphor). The angular range of the pan-tilt head is ±158.76° in azimuth and -46.39° to +30.86° in elevation.
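The angular limits above imply that any tracked head orientation has to be mapped onto an orientation the pan-tilt head can reach. The sketch below shows one simple way to do this, by wrapping the azimuth and clamping both angles. The function names are hypothetical, and the actual SACARI control code is not described in this summary.

```python
# Mechanical limits quoted in the text, in degrees.
AZIMUTH_RANGE = (-158.76, 158.76)
ELEVATION_RANGE = (-46.39, 30.86)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def pan_tilt_command(head_azimuth_deg, head_elevation_deg):
    """Map a tracked head orientation to a reachable pan-tilt command.

    Hypothetical helper, not the SACARI implementation. The azimuth is
    first wrapped to [-180, 180) so that, for example, 190 degrees is
    treated as -170 degrees before clamping.
    """
    wrapped = (head_azimuth_deg + 180.0) % 360.0 - 180.0
    azimuth = clamp(wrapped, *AZIMUTH_RANGE)
    elevation = clamp(head_elevation_deg, *ELEVATION_RANGE)
    return azimuth, elevation
```

With this mapping, a user looking further up, down, or behind than the hardware allows simply holds the camera at its nearest limit.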
Remote spatial audio
Due to the limited field of view of the video stream, the transmission of the spatial auditory environment to the distant user via an audio stream provides an important perceptual link to the global environment around the user, indicating events which are happening outside the video image. If such an event is detected, the user is capable of orientating the camera towards it, to better define the urgency and the reaction necessary.
Spatialized sound evaluation
Two audio schemes are considered for the capture and rendering of the remote acoustic scene. The purpose of this work is to compare methods for the real-time interactive streaming of a real-life spatialized audio scene. The interaction and influence of head orientation information is paramount, and different methods treat this issue differently. A related study by Endo et al. compared dynamic localization precision between two 2-channel methods. A localization task was performed in the horizontal plane using a motorized dummy head for dynamic cues. Comparisons were made between different sized heads (full and reduced scales) and also a pair of spaced microphones. The present study removes the head component of the dummy head and adds the pinnae effect, as well as sources off the horizontal plane.
As an audio parallel to the stereoscopic camera, two microphones are placed on the sides of the stereo camera head. To assist in the resolution of directional confusions, the microphones have been placed at the blocked ear-canal entrance position of two rubber ear replicas (corresponding to ears from HRTF subject 1034 in the LISTEN database).
Due to the design and functional limitations of the camera and pan-tilt head, it was not possible to include a head or torso model. Although the pinnae cues are present, it is understood that the head-shadow and torso effects are missing, creating an incomplete HRTF for the user. The exact nature of the ITD and ILD is also expected to differ from that of a typical head model. As the microphones are fixed to the pan-tilt head, the dynamic cues associated with the head movement of the remote user are available for the resolution of localization confusion errors. The binaural audio stream (2 channels, left and right ear) is transmitted, with each frame containing the corresponding timestamp, to the distant user site and rendered directly over headphones. In order to reduce the bandwidth used by the audio stream, it is possible to apply audio compression techniques. The effect of such techniques has been previously examined for several methods, but a low-latency algorithm has not yet been implemented. Therefore, no audio compression has been used.
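To illustrate why the ITD of a headless microphone pair differs from that of a head model, the sketch below compares two textbook approximations: the free-field path difference between two spaced microphones, and Woodworth's rigid-sphere formula. The speed of sound, the head radius, and the microphone spacing are illustrative assumptions; the spacing on the SACARI pan-tilt head is not given in this summary.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def itd_spaced_pair(azimuth_deg, spacing_m):
    """ITD in seconds for two free-field microphones and a distant source."""
    return spacing_m * math.sin(math.radians(azimuth_deg)) / SPEED_OF_SOUND_M_S


def itd_spherical_head(azimuth_deg, head_radius_m):
    """Woodworth's rigid-sphere ITD in seconds, for azimuths of 0 to 90 degrees."""
    theta = math.radians(azimuth_deg)
    return head_radius_m * (theta + math.sin(theta)) / SPEED_OF_SOUND_M_S


# Illustrative dimensions only: the SACARI microphone spacing is not given.
HEAD_RADIUS_M = 0.0875
for azimuth in (0, 30, 60, 90):
    no_head = itd_spaced_pair(azimuth, 2 * HEAD_RADIUS_M)
    with_head = itd_spherical_head(azimuth, HEAD_RADIUS_M)
    print(f"{azimuth:3d} deg: {no_head * 1e6:6.1f} us vs {with_head * 1e6:6.1f} us")
```

For the same ear spacing, the sphere model yields a larger ITD away from the median plane because the sound must travel around the head. This is one way of seeing why cues from the headless capture would not match those of a measured HRTF.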
In contrast to the stereoscopic camera/binaural audio approach (two eyes, two cameras; two ears, two microphones), a less user-centric approach was also tested. A fixed-position first order Ambisonic microphone was placed 20 cm directly behind and slightly above the pan-tilt camera head, out of the possible field of view. The four audio channels, representing the zero and first order spherical harmonics of the B-format output of the Ambisonic microphone, are streamed to the user site. This spatially coded version of the sound field is then decoded at the user site, where the specific user head rotation is taken into account. This approach requires additional bandwidth as compared to the binaural approach (4 channels versus 2 channels). The use of audio compression on an Ambisonic stream is possible, but the effects on the rendered scene have not yet been studied. Therefore, no audio compression has been used.
Due to the inherent conflict between loudspeaker positions and video projection optic pathways, loudspeaker layouts are somewhat limited. For this reason, a virtual Ambisonic rendering approach was employed. Using this method, the virtual loudspeakers remain at fixed positions relative to the user, and the Ambisonic stream is “rotated” according to the user’s head orientation. This removes the need for inter-position HRTF interpolations. The binaural decoding of the Ambisonic signals, using the virtual loudspeaker approach, equates to the binaural synthesis of an idealized Ambisonic system.
For streaming the audio channels, the freely available netsend~ was chosen (as it is flexible with regard to platform, and is compatible with the Max/MSP and PureData programming environments) and was implemented as an RTMaps object. The bin_ambi rendering package for PureData by the Institut für Elektronische Musik und Akustik (IEM) was used as a basis for the binaural Ambisonic rendering engine. Modifications of the default package were made to provide a simple first order Ambisonic decode.
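The rotate-then-decode idea behind the virtual loudspeaker approach can be sketched as follows. This is a minimal illustration and not the bin_ambi implementation. It assumes the traditional B-format convention (W scaled by 1/sqrt(2), azimuth counter-clockwise from straight ahead), handles yaw only, and uses a basic projection decode with a simple 1/N normalization. All function names and the square loudspeaker layout are hypothetical.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode one sample of a plane wave into B-format (W, X, Y, Z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        sample / math.sqrt(2.0),
        sample * math.cos(az) * math.cos(el),
        sample * math.sin(az) * math.cos(el),
        sample * math.sin(el),
    )


def rotate_for_head_yaw(bformat, head_yaw_deg):
    """Counter-rotate the sound field by the listener's head yaw.

    W and Z are unaffected by a rotation about the vertical axis.
    """
    w, x, y, z = bformat
    yaw = math.radians(head_yaw_deg)
    x_rot = x * math.cos(yaw) + y * math.sin(yaw)
    y_rot = -x * math.sin(yaw) + y * math.cos(yaw)
    return (w, x_rot, y_rot, z)


def decode_to_virtual_speakers(bformat, speaker_directions_deg):
    """Basic projection decode to loudspeakers fixed relative to the head.

    Each (azimuth, elevation) pair yields one feed. The 1/N scaling is one
    simple normalization choice, not necessarily the one used in bin_ambi.
    """
    w, x, y, z = bformat
    n = len(speaker_directions_deg)
    feeds = []
    for azimuth_deg, elevation_deg in speaker_directions_deg:
        az = math.radians(azimuth_deg)
        el = math.radians(elevation_deg)
        feed = (
            math.sqrt(2.0) * w
            + x * math.cos(az) * math.cos(el)
            + y * math.sin(az) * math.cos(el)
            + z * math.sin(el)
        ) / n
        feeds.append(feed)
    return feeds


# Source on the listener's left; the listener then turns 90 degrees left.
SQUARE_LAYOUT = [(0, 0), (90, 0), (180, 0), (270, 0)]
rotated = rotate_for_head_yaw(encode_first_order(1.0, 90, 0), 90)
feeds = decode_to_virtual_speakers(rotated, SQUARE_LAYOUT)
```

After the rotation, the front virtual loudspeaker carries the strongest feed, as expected once the listener faces the source. In a binaural renderer, each feed would then be filtered with the fixed HRIR pair for its loudspeaker direction and summed per ear, which is why no HRTF interpolation between positions is needed.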
In addition, an alternate HRTF database was created. Taking advantage of the fact that the ears used in the binaural microphone correspond to the ears of one of the subjects in the LISTEN HRTF database, the raw HRTF measurements for the given subject were implemented in bin_ambi. In this way, any individual HRTF mismatch is “constant” between the binaural and Ambisonic configurations (although the head size and head shadow are not the same, due to the absence of the head and torso in the direct Binaural capture scheme, resulting in somewhat different ITD and ILD cues).
Aim of the evaluation
The main goal of this study was to compare dynamic cue sound spatialization between a binaural stereo capture system and an Ambisonic capture system. The evaluation focuses mainly on the improvement in the speed of detection and general localization of objects when combining the visual modality with spatialized audio.
The audio and video capture devices were placed in a remote room, rather than in the remote vehicle, for this study. Therefore, only rotational control information was used. This room was acoustically treated to be rather dry, reducing any room effect. An HMD (Head Mounted Display) was used for the video display. The HMD, equipped with a head-tracker, allows for the full exploitation of the potential visual field of view of the stereo camera mounted on the pan-tilt head.
Task performance test
The general application of the audio modality in the SACARI project is to direct the attention of the user to a specific event, which is then brought into the visual field of view through a change of head orientation. As it is generally accepted that binaural audio provides a higher degree of angular precision when compared to first order Ambisonics, it was necessary to develop a task which was not totally biased by this effect. An audio-visual localization task was chosen. A number of sound sources (loudspeakers) were distributed in the remote, darkened room. Each loudspeaker was equipped with a small light (LED). For each audio event, a sound was played from one of the loudspeakers and the corresponding light was illuminated. A target symbol was overlaid on the user’s video stream. The task was to orient the point of view such that the active light was within the target circle. A simple image analysis routine was used to detect the presence of the light in the target area, concluding the localization trial. A minimum time threshold was used to avoid spurious triggers and overshooting. Each trial began with the user aligning their orientation towards the reference light (directly in front at head elevation), at which point the reference light turned off and the sound event was played. No loudspeakers were located within the field of view from the reference position. The sound source used was an impulsive sound. An ecological sound was preferred over the use of an artificial noise burst. In order to increase the urgency of the participants’ reaction, the sound of breaking glass was chosen. The sound provides both an abrupt attack and a broad spectral content, advantageous for sound source localization. A video example showing the experiment for the binaural case can be seen/heard in the following video. The audio content is the audio heard by the test subject, and therefore should be listened to over headphones.
Click the link to the right to view SACARI-Audio Experiment Video (29MB).
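The minimum time threshold used to end a localization trial can be illustrated with a small dwell-time detector. This is a sketch under assumptions: the class name is hypothetical, the per-frame boolean stands in for the output of the image analysis routine, and the threshold value used in the experiment is not stated in this summary.

```python
class DwellDetector:
    """Declare success only after the light stays in the target long enough."""

    def __init__(self, min_dwell_s):
        self.min_dwell_s = min_dwell_s
        self._entered_at = None

    def update(self, timestamp_s, light_in_target):
        """Feed one video-frame observation; return True when the trial ends."""
        if not light_in_target:
            self._entered_at = None  # left the target: restart the timer
            return False
        if self._entered_at is None:
            self._entered_at = timestamp_s
        return timestamp_s - self._entered_at >= self.min_dwell_s
```

A brief pass of the light through the target circle (an overshoot) resets the timer, so only a sustained alignment concludes the trial.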
Eight source positions were used, equidistant from the fixed pan-tilt head at a distance of 2 m. Positions were chosen to highlight front/back symmetry as well as up/down symmetry conditions. Trial source event positions were chosen at random, with each position being repeated 3 times, for a total of 24 trials. The first eight trials for each sound presentation method were considered as a learning phase (not counted in the statistical analysis, as some learning effect was observed). This resulted in 16 trials per subject. Each subject completed the test for both the Binaural and BinauralAmbisonic conditions. The order of presentation was alternated between subjects, such that one half received the Binaural audio condition first, while the other half received the BinauralAmbisonic condition first. The test lasted about 10 min for each condition. A third and final NoSound condition without any audio output was also included, where the participants had to search for the active LED using only the visual feedback. This condition was not always completed, as some users found it to induce motion sickness. A total of 16 participants took part in the test, aged 23-46. Participants had a varied range of previous experience in spatialized audio and virtual reality, from none to expert. The localization time-to-target response times, as well as the orientation trajectory paths, were recorded.
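The trial structure described above (eight positions, three repetitions each, random order, first eight trials treated as learning, alternating condition order) can be sketched as follows. The function names and the position labels in the test are hypothetical, and the randomization procedure actually used is not detailed in this summary.

```python
import random


def build_trial_schedule(positions, repeats=3, learning_trials=8, seed=None):
    """Return (learning, scored) lists of source positions in random order."""
    rng = random.Random(seed)
    trials = [position for position in positions for _ in range(repeats)]
    rng.shuffle(trials)
    return trials[:learning_trials], trials[learning_trials:]


def condition_order(participant_index):
    """Alternate which audio condition comes first; NoSound is always last."""
    audio = ["Binaural", "BinauralAmbisonic"]
    if participant_index % 2 == 1:
        audio.reverse()
    return audio + ["NoSound"]
```

For eight positions this yields 24 trials per condition, of which the last 16 are retained for analysis.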
Results - Objective
An analysis of the overall results indicates a significant difference between conditions with and without audio feedback. The Binaural method also appears to slightly outperform the BinauralAmbisonic method. Because the time-to-target is not uniform for all target positions, and it is not evident that all target positions would be perceived equally, it is interesting to examine the results in more detail. A statistical analysis by source/target position is shown in the following Figure.
From these results, the differences in the reaction times between the presentation methods are clearer. The response times are equal if not better for all sources for the Binaural method as compared to the BinauralAmbisonic method. The distribution of response times is also narrower for the Binaural method. The difference between the distributions of responses is greatest for rear sources, indicating more difficulty for rear sources via BinauralAmbisonic. Response times for the NoSound condition represent what is possible using a “deaf” search pattern. A further analysis of the results consists of examining each participant’s trajectory, or search strategy. An example of the trajectories for a given participant, near the end of the session (after they have established their personal strategy and any learning effect has been included), is shown in the following Figure for target source A90E30 (azimuth 90°, elevation 30°, corresponding to the right side, elevated) and target source A-120E-30 (rear, below). This Figure also presents the learning effect, which will be discussed below.
The Binaural trials show the most direct trajectory towards the target. For A90E30, it is clear that the azimuth is first established, followed by the elevation. There is more uncertainty with the BinauralAmbisonic method, leading to an overshoot and slight confusion. The NoSound stimulus readily shows the participant using a basic search strategy across the entire space. Considering the example of source A-120E-30, it is clear from the trajectory that the participant was correctly interpreting both azimuth and elevation cues, although the Binaural trajectory seems to be smoother and more direct than the BinauralAmbisonic trajectory. The NoSound trajectory appears at first glance to be somewhat guided, but when compared to the previous source position trajectory, it is evident that this trajectory is simply the search pattern chosen by the participant, and that there are no cues to the target position. For more insight into the learning phase, an example of similar source positions, presented early in the test during the learning phase, for the same participant is shown in the previous Figure. The ease with which the task is accomplished is lower for the Binaural method at this stage (i.e. the auditory cues have not yet been learned for the HRTF and method in question). The difficulty with the BinauralAmbisonic method is clearly evident. The NoSound condition is relatively similar, as would be expected, since this method was presented after the two audio methods. Further analysis of the trajectory data is required to determine if there are any significant differences in front/back or left/right confusion rates. Further study regarding a reduction in the audio latency (possible with the BinauralAmbisonic method) could be performed. In this application, this would create a situation where the audio leads the video.
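One possible way to quantify how “direct” a recorded orientation trajectory is would be to compare the angle between its start and end orientations with the total angle travelled. The sketch below is only an illustration of such a metric; it is not the analysis performed in the study, and the function names are hypothetical. Trajectories are assumed to be lists of (azimuth, elevation) pairs in degrees.

```python
import math


def angular_distance_deg(a, b):
    """Great-circle angle between two (azimuth, elevation) pairs in degrees."""
    az1, el1 = (math.radians(v) for v in a)
    az2, el2 = (math.radians(v) for v in b)
    cos_angle = (
        math.sin(el1) * math.sin(el2)
        + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2)
    )
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def directness(trajectory):
    """Ratio of the start-to-end angle to the total angle travelled (0 to 1)."""
    travelled = sum(
        angular_distance_deg(p, q) for p, q in zip(trajectory, trajectory[1:])
    )
    if travelled == 0.0:
        return 1.0
    return angular_distance_deg(trajectory[0], trajectory[-1]) / travelled
```

A value near 1 corresponds to a trajectory that heads straight for the target, while a sweeping search pattern or an overshoot followed by a correction gives a lower value.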
Results - Subjective
In addition to the time-to-target and trajectory data analysis, participants in the study were also asked about their perceptions and preferences between the methods. The vast majority of participants (80%) preferred the Binaural method to the BinauralAmbisonic method. No participant preferred the NoSound method. In general, the Binaural method was judged more precise in azimuth and elevation, though some participants disagreed. Moreover, with the Binaural method, some participants perceived the lower elevation sources to be higher (at positive elevations). This may be due to the lack of a torso and the torso shadowing effect. Many responses discussed the evidence of a learning effect, primarily for elevation cues. For the BinauralAmbisonic method, both front/back and left/right confusions were mentioned as a negative. Several subjects noted that they believed they were faster in the NoSound condition, although the data did not support this.
This study presents the addition of spatialized audio streaming in a teleoperation environment with head-orientation tracking. Two methods for the capture and restitution of the spatial audio scene are used: Binaural (pinnae cues without head/torso shadowing) and BinauralAmbisonic (virtual loudspeaker rendering using measured HRTF data). The same pinnae were used for both methods. These two methods, in addition to a baseline NoSound case, are compared in an active target search task, where a visual target alignment task within a narrow field of vision is used to eliminate differences in fine precision between the audio methods. Differences in task success speed were found between the methods, and these differences were larger for source locations behind the participant. Preliminary trajectory analysis was also performed, showing that participants were able to adapt to the auditory positional cues present in both methods, although adaptation to elevation cues was more prevalent in the Binaural condition.
[AES30] Brian FG Katz, Antoine Tarault, Patrick Bourdot, and Jean-Marc Vézien, “The Use of 3d-Audio in a Multi-Modal Teleoperation Platform for Remote Driving/Supervision.” Proceedings of the AES 30th International Conference, Saariselkä, Finland, 15-17 March 2007.
 A. Tarault, P. Bourdot, and J-M. Vézien, “An Immersive Remote Driving Interface for Autonomous Vehicles”, In proceedings of Computer Graphics and Geometry Modeling workshop of the 5th International Conference on Computational Science, Berlin, Springer (2005).
 B. Steux, P. Coulombeau, and C. Laurgeau, “RtMaps: a Framework for Prototyping Automotive Multi-Sensor Applications”, IEEE IV (2000).
 R. Osfield, Introduction to the OpenSceneGraph; http://openscenegraph.sourceforge.net/introduction/index.html, 2003-09-14 (2003).
 R.M. Taylor II et al., VRPN: A Device-Independent, Network-Transparent VR Peripheral System. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology. New York, ACM Press (2001).
 K.T. Simsarian, K-P. Åkesson, “Windows on the World: An Example of Augmented Virtuality”, Interfaces 97: Man-Machine Interaction, Montpellier, (1997).
 W. Endo, M. Kashino, and T. Hirahara, “The effect of head movement on sound localization using stereo microphones: Comparison with dummy heads.” Fourth Joint Meeting: ASA and ASJ, Honolulu, November 2006, J. Acoust. Soc. Am., Vol. 120(5), Pt. 2, November (2006).
 F. Prezat and B.F.G. Katz, “The effect of audio compression techniques on binaural audio rendering.” Proceedings of the 120th AES Convention, 20-23 May (2006).
 netsend~ for Max/MSP and Pure Data; http://www.nullmedium.de/dev/netsend~/.
 Max/MSP: A graphical environment for music, audio, and multimedia; http://www.cycling74.com/products/maxmsp.
 PureData; http://puredata.info/.
 bin_ambi :: Real-Time Rendering Engine for Virtual (Binaural) Sound Reproduction; http://iem.at/Members/noisternig/bin_ambi.
 “RBH Glass_Break 02.wav”, The Freesound Project, http://freesound.iua.upf.edu/.