Chapter 7: Conclusions and Recommendations

            Popular opinion on consumer audio systems holds that low frequencies are “hard to localize,” and thus can be reproduced by a mono subwoofer in some non-critical location.  However, a paper discussing DTS’ technique for encoding surround sound (Smyth, 1999) stated that “experimental evidence suggests that it is difficult to localize mid-to-high frequency signals above about 2.5 kHz” (p. 18).  Smyth continues, commenting that when a listener is presented with concurrent low and high frequency information, the high frequencies are relatively unimportant for proper image localization.  His statement seemed to oppose what is popularly said about localization, and prompted the preliminary research for this thesis.

            What does the statement “hard to localize” encompass?  There seem to be several ways to interpret it.  One might say that it describes sounds that are least accurately localized (i.e., those with a larger minimum audible angle).  The easier a sound is to localize, the more accurately one should be able to locate it.  However, others might argue that it describes sounds that are “confusing” to localize (especially front from back).  For instance, sounds without monaural pinna cues (i.e., not containing the 5-12 kHz range) are often difficult to discern front from back.  Further still, it might describe a listener’s confidence in the location of the event.  Most listeners are not confident in the location of narrow band continuous sounds, even though they may be able to determine the correct location.  The claim is therefore open to interpretation, and most likely represents a combination of all of these meanings.

During this investigation, several interesting discoveries were made.  First, it seems logical that subwoofers are also fairly difficult to localize.  However, this is not necessarily because they reproduce low frequencies.  In fact, in an absolute sense, the middle frequency range (1-3 kHz) exhibits the largest minimum audible angle (MAA).  This was determined specifically by Stevens and Newman (1936) and later by Mills (1958), among others.  They concluded that low frequency sounds exhibit definite interaural time differences (ITDs), whereas high frequencies have strong interaural level differences (ILDs).  Both cues exist in the middle frequency range, yet neither seems to dominate, leading to larger MAAs for middle frequencies than for either lower or higher frequencies.

Subwoofers are difficult to localize because they typically reproduce continuous, narrow band (20-100 Hz) sounds.  Much of the localization research (see Blauert, 1999) has shown that sounds get easier to localize with increasing bandwidth.  This is because the number and type of localization cues increase with bandwidth, thus providing more cues for the brain to compare.  For instance, to localize a low frequency tone, a listener must rely only on interaural time differences; no level differences or pinnae cues are present.  On the other hand, white noise exhibits low frequency ITDs, pinnae cues and high frequency ILDs.  These cues agree with one another and clearly indicate the location of the sound event.  This results in a more confident sense of a source’s location.  Also, impulsive sounds produce transients in the localization cues, which allow the brain to better interpret the location.

Although they have a smaller MAA than middle frequencies, low frequencies are probably the “most confusing” to localize.  Frequencies under about 5 kHz are devoid of monaural pinnae cues, which help avoid front/back confusion.  In addition, Hartmann (1983) has suggested that reverberation due to room acoustics has the most impact on the localization of low frequency signals.  Yet listeners should still be able to discern left from right, because of dominant low frequency ITD cues.

            In an attempt to further investigate this topic of localization versus frequency, this thesis studied the effect that spatially relocating portions of the audible spectrum has on the localization of a stereo image.  This “spatial relocation” commonly occurs in consumer electronics, where tweeters (high frequencies) are often physically separated from a woofer (low frequencies), sometimes by a significant distance (see Figure 1).

The experiments for this thesis compared the relative shift of a stereo image caused by horizontally relocating various low and high frequency bands.  Specifically, in a stereo speaker setup (± 40º), frequency bands were relocated from the left speaker to an offset speaker 15º closer to the median plane (see Figure 14).  The subjects were asked to comment on the relative shift of a centrally located image created by white noise bursts and music. 

Each back-to-back comparison included two of seven conditions: stereo (no relocation) or one of six relocated frequency bands, namely A (80-800 Hz), AB (80-1,600 Hz), ABC (80-5,000 Hz), E (12,000-20,000 Hz), DE (5,000-20,000 Hz), or CDE (1,600-20,000 Hz).  The frequency points that define these bands were chosen because they allow low and high frequencies to be compared and are known to bound localization cues of differing dominance.  ITDs are known to dominate the 20-800 Hz range, while their effect diminishes from 800-1,600 Hz and vanishes above this range.  Pinnae cues occur in the 5-12 kHz range (see Blauert, 1999).
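
The seven conditions can be written down as a small lookup table.  The following Python sketch is illustrative only: the names `CONDITIONS` and `speaker_for` are hypothetical, and the hard band edges stand in for whatever crossover filters were actually used, which this chapter does not specify.

```python
# Band edges in Hz for each test condition, taken from the list above.
# "stereo" relocates nothing. All names here are illustrative assumptions.
CONDITIONS = {
    "stereo": None,
    "A": (80, 800),
    "AB": (80, 1600),
    "ABC": (80, 5000),
    "E": (12000, 20000),
    "DE": (5000, 20000),
    "CDE": (1600, 20000),
}


def speaker_for(condition, freq_hz):
    """Return which speaker reproduces a left-channel component at freq_hz.

    Assumes ideal brick-wall band edges (lower edge inclusive, upper edge
    exclusive); real crossover filters would overlap.
    """
    band = CONDITIONS[condition]
    if band is not None and band[0] <= freq_hz < band[1]:
        return "SR"    # offset speaker, 15 degrees closer to the median plane
    return "left"      # original left speaker


# Show where a few probe frequencies end up under each condition.
for name in CONDITIONS:
    print(name, [speaker_for(name, f) for f in (100, 1000, 3000, 8000, 15000)])
```

Only the left channel is routed here; the right channel is unchanged in every condition.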

Moving these frequency bands from the left channel to the spatially relocated (SR) channel (see Figure 20) would essentially alter the localization cues of the overall auditory event.  The theory behind the experiments was that relocating the most dominant localization cues would create the most noticeable shift towards the right.  This is because the several localization cues differ in relative salience across the audible spectrum.  The most important of these is the low frequency (< 800 Hz) interaural phase difference (IPD).  This is followed by the high frequency (2-20 kHz) interaural level difference (ILD) and lastly, the monaural spectral cues of the pinnae (5-12 kHz).

            The listening tests performed for this thesis showed that relocating the lower frequency bands (A, AB, ABC) caused more noticeable horizontal shifts of the stereo image than relocating the high frequency bands (E, DE, CDE).  While music was used for portions of the experiment, the most reliable test signal was a set of white noise bursts.  Not only does white noise represent an even distribution of spectral energy, but it is also known to be one of the easiest types of sounds to localize (Stevens & Newman, 1936).  Music has a time-varying amount of spectral energy, which makes it more difficult for the listener to notice the spatial relocation.

            Results of the noise track showed that relocating band E typically produced no noticeable shift as compared to stereo (see Figure 29 and Table 3).  However, moving bands A, AB, or ABC produced significant shifts towards the right.  Moving bands DE and CDE also shifted the image to the right of stereo, but not as far as the lower frequency bands did.  Essentially, the results suggest that relocating the low frequency bands shifted the stereo image further to the right than relocating the high frequency bands.

            The specific reason for this apparent low frequency dominance is difficult to determine.  Loudness could be a possible factor, because left/right panning is typically associated with the balance of the stereo channels.  For instance, if band A were much louder than band E, this could explain why it was more influential.  Therefore, loudness was both calculated (see Table 8) and experimentally determined (Table 9 - Table 15) for the SR bands.  The results showed the high frequency bands to be louder.  In fact, band E is almost one-third louder than band A.  Therefore, loudness is probably not the cause of the low frequency localization dominance seen here.

            It seems more likely that the 15º change in azimuth creates different localization cues for the SR band, and that changes in low frequency ITD cues produce a more noticeable image shift.  Thus, moving band E mainly changes ILD cues, whereas moving band A causes changes in ITD.  It is well established that ITDs tend to dominate overall perception, which this research supports.
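
For a rough sense of scale, the ITD change produced by a 15º move can be estimated with Woodworth's spherical-head approximation, ITD = (a/c)(θ + sin θ).  This model, the 8.75 cm head radius, and the 343 m/s speed of sound are assumptions introduced here for illustration; they were not part of the thesis experiments, and the function name is hypothetical.

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0   # assumed, air at about 20 degrees C


def woodworth_itd_us(azimuth_deg):
    """ITD in microseconds for a distant source, spherical-head model."""
    theta = math.radians(azimuth_deg)
    return 1e6 * HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))


itd_left = woodworth_itd_us(40.0)   # original left speaker position
itd_sr = woodworth_itd_us(25.0)     # SR speaker, 15 degrees closer to center
print(f"40 deg: {itd_left:.0f} us, 25 deg: {itd_sr:.0f} us, "
      f"change: {itd_left - itd_sr:.0f} us")
```

Under these assumptions the change is on the order of 100 microseconds.  The sketch treats each loudspeaker as an isolated source; it does not model the summing of two loudspeakers that forms the phantom stereo image.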

Also, a large concentration of high frequency energy does not seem to commonly occur in music.  An analysis of fifty-two mixed-genre music tracks produced only seven with more than 3% of their average energy above 10 kHz.  Therefore, it seems reasonable that in most music tracks, high frequency energy is a fairly insignificant portion of the overall energy being reproduced.
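
A measurement of this kind can be sketched as follows.  The thesis does not state how the energy analysis was carried out, so this is only one plausible approach: a plain DFT of a short frame, followed by the fraction of energy at or above a cutoff.  The function name, the synthetic two-tone signal, and the 48 kHz sample rate are illustrative assumptions.

```python
import cmath
import math


def hf_energy_fraction(samples, sample_rate, cutoff_hz=10000.0):
    """Fraction of spectral energy at or above cutoff_hz (0.0 to 1.0).

    Uses a direct O(N^2) DFT over the positive-frequency bins, which is
    fine for a short demo frame; a real analysis would use an FFT library.
    """
    n = len(samples)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):           # skip the DC bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


# Synthetic frame: strong 1 kHz tone plus a weak 12 kHz tone (10% amplitude).
FS = 48000
N = 480   # 100 Hz bin spacing, so both tones fall exactly on a bin
frame = [math.sin(2 * math.pi * 1000 * i / FS)
         + 0.1 * math.sin(2 * math.pi * 12000 * i / FS) for i in range(N)]
print(f"{100 * hf_energy_fraction(frame, FS):.2f}% of energy at or above 10 kHz")
```

For a full music track, the same ratio would be averaged over many frames.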

An additional factor could be that listeners may not pay much attention to the stereo sound stage, especially for the upper audible spectrum.  This was supported during ABX testing, where listeners were not able to differentiate between regular stereo and a setup that shifted the left speaker’s high frequency (>10 kHz) signals by 15º in azimuth towards the midline.  These short music clips contained a slightly greater average (~3%) of high frequency energy than the “typical” music track, as determined from the above-mentioned sampling of mixed-genre tracks.
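
ABX results are commonly judged against chance with a one-sided binomial test.  The sketch below shows that calculation; the trial counts are hypothetical, since this chapter does not report them, and the function name is illustrative.

```python
from math import comb


def abx_p_value(correct, trials):
    """Probability of at least `correct` right answers by guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


# Hypothetical example: 9 correct identifications out of 16 trials.
print(f"p = {abx_p_value(9, 16):.3f}")
```

A p-value well above a chosen threshold (0.05 is a common choice) is consistent with listeners guessing, which is the outcome described above.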

            Having shown that low frequency energy dominates the localization of a stereo image for this particular test setup and set of variables, there are several directions future research could take.  The most obvious has practical applications, where one could develop the “mono-ized” tweeter system discussed in the introduction.  However, simply moving the high frequency information to a central tweeter with no pre-processing will create a system with an odd sound stage.  Recall that this setup creates localization shifts towards the tweeter for any instrument/image that has a large concentration of high frequency energy relative to the chosen crossover frequency (e.g., a cymbal).

The most noticeable differences for the mono tweeter system will be the change in high frequency image position and spectral balance.  The new image position is formed because without a tweeter on both sides of the listener, the system no longer reproduces the intended interaural level differences.  Instead, the level differences are dictated purely by the spatial position of the mono tweeter and the amount of high frequency energy it is reproducing.  This is difficult, if not impossible, to compensate for.

The spectral balance will also be different, because only one tweeter (instead of two) is reproducing high frequencies.  This will potentially reduce the high frequency loudness.  Also, the sound is no longer coming from an off-center azimuth (e.g., 30º), but instead from a location near center (e.g., 0º).  Each spatial position has a different path to the ears, which exhibits a characteristic “filtering” effect described by the Head Related Transfer Function (HRTF).  For instance, if the mono tweeter is located at 0º azimuth, and that position is known to attenuate 5 kHz signals compared to a typical position of 30º, this should be compensated for by boosting the 5 kHz range during preprocessing.  HRTFs have been extensively researched and applied to audio systems offering “simulated” surround sound with two speakers or “3D” headphone listening systems (see Gardner, 1998).

This preprocessing relies on the ability to predict the intended spatial origin of the recorded high frequency information.  Because high frequency images are localized using interaural level differences, the cross correlation of high frequency left/right amplitude information should indicate the intended spatial position of the recording.  Equal energy suggests the image is towards the middle, while unbalanced energy suggests an image on one side.  Knowing the intended spatial position, along with the actual location of the mono tweeter, one could process the signal using HRTFs to better camouflage the missing tweeter.
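
As a rough illustration of the energy-balance idea, the sketch below compares the energy of left and right frames that are assumed to be high-pass filtered already.  It uses a simple energy ratio rather than a cross correlation, and the function name, the balance index, and the sample values are hypothetical; the thesis does not specify an algorithm.

```python
import math


def hf_balance(left_hf, right_hf):
    """Return (balance, level_difference_db) for high-passed L/R frames.

    balance runs from -1.0 (all energy left) through 0.0 (equal energy,
    image near the middle) to +1.0 (all energy right).
    level_difference_db is right minus left, or None if either side is silent.
    """
    e_left = sum(x * x for x in left_hf)
    e_right = sum(x * x for x in right_hf)
    total = e_left + e_right
    if total == 0.0:
        return 0.0, None
    balance = (e_right - e_left) / total
    if e_left == 0.0 or e_right == 0.0:
        return balance, None
    return balance, 10.0 * math.log10(e_right / e_left)


# Hypothetical frames: same waveform, right channel at half the amplitude.
left = [0.5, -0.5, 0.5, -0.5]
right = [0.25, -0.25, 0.25, -0.25]
print(hf_balance(left, right))
```

A negative balance suggests an image intended for the left side, which is the information the HRTF preprocessing step would need.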

However, the image’s location will still be incorrect due to the inability to control high frequency interaural level differences (having only one tweeter).  One idea to explore would be to introduce a time-delayed version of the mono signal, while encoding the original and time-delayed versions with different temporal envelope modulation.  At high frequencies, only envelope time differences (and level differences, which are fixed in this case) will have an effect on the resulting stereo image position.

            As for simpler investigations, one might change certain parameters used in these experiments in an attempt to support or disprove these findings.  Comparing results between a horizontal and a vertical SR channel might be interesting.  It is expected that a vertical relocation would be even more difficult to notice.  This is because a change in vertical position only alters the monaural spectral cues, while level and time differences stay the same.  Strybel & Fujimoto (2000) showed the vertical minimum audible angle to be 4-5º larger than the horizontal MAA at 0º azimuth.  Others might investigate the effect of relocating “equally loud” frequency bands, different loudspeakers, a reverberant vs. “typical” acoustic space, or varied speaker locations and configurations.

The incorporation of in-ear recordings could also lead to a more analytical approach to understanding the listening test results.  These recordings are obtained by playing the test signals while monitoring the ear canal microphones of a dummy head or by placing probe microphones at the ear canals of actual listening subjects (see Blauert, 1999, p. 31).  The recordings provide spectral and temporal representations of the ear canal signals, which could be used to support the listening subjects’ responses.

            In summary, this research suggests that the localization of a stereo image is most affected by the spatial origin of low-to-mid frequencies as opposed to higher frequencies.  This is not due to any absolute localization abilities, considering that the middle frequency range (1-3 kHz) is known to have the largest minimum audible angle (i.e., it is the least accurately localized).  Instead, it is probably due to the perceptual dominance of low frequency interaural time differences.

Regardless, a large amount of high frequency information is not typically present in music.  This was additionally supported when a group of average listeners did not notice that the high frequency sound stage had been altered.  Therefore, modifying high frequency localization cues seems to have a minimal impact on the perceived performance of the sound system.  This finding could thus be applied in perceptual coders, as is done in DTS’ technique (Smyth, 1999), or by audio system designers via the mono tweeter system.  The mono tweeter system should reduce the cost of the system while having a minimal effect on fidelity, especially if some preprocessing is performed.

Created February 2003 by Rob Hartman

Copyright (C) 2003