Chapter 7: Conclusions and Recommendations

Popular opinion on the topic of
consumer audio systems is that low frequencies are “hard to localize,” and thus
can be reproduced by a mono subwoofer in some non-critical location. However, a paper discussing DTS’ technique
for encoding surround sound (Smyth, 1999) stated that “experimental evidence
suggests that it is difficult to localize mid-to-high frequency signals above
about 2.5 kHz” (p. 18). Smyth
continues, commenting that when a listener is presented with concurrent low and
high frequency information, the high frequencies are relatively unimportant
for proper image localization. His
statement seemed to oppose what is popularly said about localization, and
prompted the preliminary research for this thesis.
What does the statement “hard to
localize” encompass? There seem to be
several ways to interpret it. One
might say that it describes the sounds that are least accurately localized (i.e.,
those with a larger minimum audible angle). The
easier a sound is to localize, the more accurately one should be able to locate
it. However, others might argue this
describes those sounds that are “confusing” to localize (especially front from
back). For instance, sounds without
monaural pinna cues (i.e. not containing the 5-12 kHz range) are often
difficult to discern front from back.
Even further, this might describe a listener’s confidence in the
location of the event. Most listeners
are not confident in the location of narrow band continuous sounds, even though
they may be able to determine the correct location. Realize that this claim is open to interpretation, and most
likely is a collective representation of all of these items.
During this investigation, several interesting discoveries
were made. First, it seems logical that
subwoofers are also fairly difficult to localize. However, this is not
necessarily because of their low frequency content. Subwoofers are difficult to localize because they typically
reproduce continuous, narrow band (20-100 Hz) sounds. Much of the localization research (see Blauert, 1999) has shown
that sounds get easier to localize with increasing bandwidth. This is because the number and type of
localization cues increase with bandwidth, thus providing more cues for the
brain to compare. For instance, to
localize a low frequency tone, a listener must rely only on interaural time
differences; no level differences or pinna cues are present. On the other hand, white noise exhibits low
frequency ITDs, pinnae cues and high frequency ILDs. Each of these cues will be in agreement to clearly indicate the
location of the sound event. This
results in a more confident sense of a source’s location. Also, impulsive sounds produce transients in
the localization cues, which allow the brain to better interpret the location.
Although they have a smaller minimum audible angle (MAA) than middle frequencies, low
frequencies are probably the “most confusing” to localize. Frequencies under about 5 kHz are devoid of
monaural pinnae cues, which help avoid front/back confusion. In addition, Hartmann (1983) has suggested
that reverberation due to room acoustics has the most impact on the
localization of low frequency signals.
Yet, listeners should be able to discern left-from-right, because of
dominant low frequency ITD cues.
In an attempt to further investigate
this topic of localization versus frequency, this thesis studied the effect
that spatially relocating portions of the audible spectrum has on the
localization of a stereo image. This
“spatial relocation” commonly occurs in consumer electronics, where tweeters
(high frequencies) are often physically separated from a woofer (low
frequencies), sometimes by a significant distance (see Figure 1).
The experiments for this thesis compared the relative shift
of a stereo image caused by horizontally relocating various low and high
frequency bands. Specifically, in a
stereo speaker setup (±40º), frequency bands were relocated from the left
speaker to an offset speaker 15º closer to the median plane (see Figure 14). The
subjects were asked to comment on the relative shift of a centrally located
image created by white noise bursts and music.
Each back-to-back comparison included two of seven
conditions, namely stereo (no relocation) or one of six relocated
frequency bands: A (80-800 Hz), AB (80-1,600 Hz), ABC (80-5,000 Hz), E
(12,000-20,000 Hz), DE (5,000-20,000 Hz), or CDE (1,600-20,000 Hz). The frequency points which define these
bands were chosen because they compare low and high frequencies and are known
to contain localization cues of differing dominance. ITDs are known to dominate the 20-800 Hz range; their
effect diminishes from 800-1,600 Hz, and they have no effect above this range. Pinnae cues occur in the 5-12 kHz range (see
Blauert, 1999).
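The band definitions above can be written down as a small lookup table. The following is only an illustrative sketch: the names BANDS, CUE_RANGES, overlap_hz, and cues_in_band are hypothetical, and the cue ranges are simply the figures quoted in this chapter (ITDs dominant from 20-800 Hz and diminishing up to 1,600 Hz, pinnae cues at 5-12 kHz). It reports which of those cue ranges each relocated band overlaps.

```python
# Relocated frequency bands from the listening tests, in Hz (low edge, high edge).
BANDS = {
    "A": (80, 800),
    "AB": (80, 1600),
    "ABC": (80, 5000),
    "E": (12000, 20000),
    "DE": (5000, 20000),
    "CDE": (1600, 20000),
}

# Cue ranges as quoted in this chapter (labels are illustrative only).
CUE_RANGES = {
    "ITD (dominant)": (20, 800),
    "ITD (diminishing)": (800, 1600),
    "pinna cues": (5000, 12000),
}


def overlap_hz(band, cue):
    """Width in Hz of the overlap between two (low, high) ranges, 0 if none."""
    low = max(band[0], cue[0])
    high = min(band[1], cue[1])
    return max(0, high - low)


def cues_in_band(name):
    """Names of the cue ranges that a relocated band overlaps."""
    return [cue for cue, rng in CUE_RANGES.items()
            if overlap_hz(BANDS[name], rng) > 0]


if __name__ == "__main__":
    for name in BANDS:
        print(name, cues_in_band(name))
```

Under these assumed ranges, band A overlaps only the dominant ITD range, while band E overlaps none of the three listed ranges.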
Moving these frequency bands from the left channel to the SR
channel (see Figure 20) would essentially alter the localization cues of the
overall auditory event. The theory
behind the experiments was that the most dominant localization cues would
create the most noticeable shift towards the right. This is because the various localization cues
differ in their relative salience across the audible spectrum. The most important are the low
frequency (<800 Hz) interaural phase differences (IPDs), followed by the high frequency (2-20 kHz)
interaural level differences (ILDs) and, lastly, the monaural spectral cues of the
pinnae (5-12 kHz).
The
listening tests performed for this thesis showed that
relocating the lower frequency bands (A, AB, ABC) caused more noticeable
horizontal shifts of the stereo image than relocating the high
frequency bands (E, DE, CDE). While
music was used for portions of the experiment, the most reliable test signal
was a set of white noise bursts. Not
only does white noise represent an even distribution of spectral energy, but it is
also known to be one of the easiest types of sounds to localize (Stevens &
Newman, 1936). Music has a time-varying
amount of spectral energy, which makes it more difficult for the listener to
notice the spatial relocation.
Results of the noise track showed
that relocating band E typically produced no noticeable shift as compared to
stereo (see Figure 29 and Table 3). However,
moving bands A, AB, or ABC produced significant shifts towards the right. By comparison, moving bands DE and CDE also
shifted the image to the right of stereo, but not as far as
the lower frequency bands did. In short,
the results suggest that relocating the low
frequency bands generally shifted the stereo image further to the right than relocating
the high frequency bands.
The specific reason for this
apparent low frequency dominance is difficult to determine. Loudness could be a possible factor, because
left/right panning is typically associated with the balance of the stereo
channels. For instance, if band A were
much louder than band E, this could explain why it was more influential. Therefore, loudness was both calculated (see
Table 8) and experimentally determined (Table 9 - Table 15) for the SR bands.
The results found the high frequency bands to be louder. In fact, band E is almost one-third louder
than band A. Therefore, loudness is
probably not the cause of the low frequency localization dominance seen here.
It seems more likely that the 15º
change in azimuth creates different localization cues for the SR band, and that
changes in low frequency ITD cues produce a more noticeable image shift. Thus, moving band E mainly changes ILD cues,
whereas moving band A causes changes in ITD.
It is well established that ITDs tend to dominate the overall localization percept,
a finding this research supports.
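To give a sense of scale for the ITD change involved, the following sketch uses Woodworth's spherical-head approximation, ITD ≈ (r/c)(θ + sin θ). This formula was not part of the thesis experiments; it is a common frequency-independent approximation, and the head radius (8.75 cm) and speed of sound (343 m/s) are assumed values. The function name is hypothetical. The sketch compares the original left speaker position (40º) with the offset position 15º closer to the median plane (25º).

```python
import math

HEAD_RADIUS_M = 0.0875       # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0   # assumed, air at roughly room temperature


def woodworth_itd_s(azimuth_deg):
    """Approximate ITD in seconds for a distant source at the given azimuth.

    0 degrees is straight ahead; the approximation is intended for 0..90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


if __name__ == "__main__":
    itd_40 = woodworth_itd_s(40.0)
    itd_25 = woodworth_itd_s(25.0)
    print(f"ITD at 40 deg: {itd_40 * 1e6:.0f} us")
    print(f"ITD at 25 deg: {itd_25 * 1e6:.0f} us")
    print(f"Change:        {(itd_40 - itd_25) * 1e6:.0f} us")
```

Under these assumptions the 15º relocation changes the ITD by roughly 120 microseconds; the exact figure depends on the head model chosen.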
Also, a large concentration of high frequency energy does
not seem to commonly occur in music. An
analysis of fifty-two mixed-genre music tracks produced only seven with more
than 3% of their average energy above 10 kHz.
Therefore, it seems reasonable that in most music tracks, high frequency
energy is a fairly insignificant portion of the overall energy being
reproduced.
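A measurement of this kind can be sketched as follows. This is not the analysis code used for the thesis; it is a minimal illustration, using a naive discrete Fourier transform from the Python standard library, of how the fraction of a signal's energy above a cutoff (here 10 kHz) might be estimated. The function name and the synthetic test signal are hypothetical.

```python
import cmath
import math


def energy_fraction_above(samples, sample_rate_hz, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz.

    Uses a naive O(N^2) DFT over the non-negative frequency bins, so it is
    only suitable for short excerpts.
    """
    n = len(samples)
    total = 0.0
    above = 0.0
    for k in range(n // 2 + 1):
        # DFT coefficient for bin k (frequency k * sample_rate_hz / n).
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        # Interior bins stand in for their negative-frequency mirror as well.
        weight = 1.0 if k == 0 or 2 * k == n else 2.0
        energy = weight * abs(coeff) ** 2
        total += energy
        if k * sample_rate_hz / n >= cutoff_hz:
            above += energy
    return above / total if total > 0 else 0.0


if __name__ == "__main__":
    fs = 48000
    n = 480  # 10 ms excerpt, 100 Hz bin spacing
    # Synthetic signal: a strong 1 kHz tone plus a weak 12 kHz tone.
    signal = [math.sin(2 * math.pi * 1000 * i / fs)
              + 0.1 * math.sin(2 * math.pi * 12000 * i / fs)
              for i in range(n)]
    fraction = energy_fraction_above(signal, fs, 10000)
    print(f"{100 * fraction:.2f}% of energy above 10 kHz")
```

For real music tracks one would average this figure over many short windows, and a fast Fourier transform would be the practical choice; the naive transform is used here only to keep the sketch self-contained.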
An additional factor could be that listeners may not pay
much attention to the stereo sound stage, especially for the upper audible
spectrum. This was supported during ABX
testing, where listeners were not able to differentiate between regular stereo
and a setup that shifted the left speaker’s high frequency (>10 kHz) signals
by 15º in azimuth towards the midline. These
short music clips contained a slightly greater average proportion (~3%) of high frequency
energy than the “typical” music track, as determined from the above-mentioned
sampling of mixed-genre tracks.
Given that low frequency energy
dominates the localization of a stereo image for this particular test setup and
these variables, there are several directions future research could take. The most obvious has practical applications,
where one could develop the “mono-ized” tweeter system discussed in the
introduction. However, simply moving
the high frequency information to a central tweeter with no pre-processing will
create a system with an odd sound stage.
Recall that this setup creates localization shifts towards the tweeter
for any instrument/image that has a large concentration of high frequency
energy relative to the chosen crossover frequency (e.g., a cymbal).
The most noticeable differences for the mono tweeter system
will be the change in high frequency image position and spectral balance. The new image position is formed because
without a tweeter on both sides of the listener, the system no longer
reproduces the intended interaural level differences. Instead, the level differences are dictated purely by the spatial
position of the mono tweeter and the amount of high frequency energy it is
reproducing. This is difficult, if not
impossible, to compensate for.
The spectral balance will also be different, because only
one tweeter (instead of two) is reproducing high frequencies. This will potentially reduce the high
frequency loudness. Also, the sound is
no longer coming from an off-center azimuth (e.g., 30º), but instead from a
location near center (e.g., 0º). Each
spatial position has a different path to the ears, which exhibits a characteristic
“filtering” effect described by the Head Related Transfer Function (HRTF). For instance, if the mono tweeter is located
at 0º azimuth, and that position is known to attenuate 5 kHz signals compared to
a typical position of 30º, this should be compensated for by boosting the 5 kHz
range during preprocessing. HRTFs have
been extensively researched and applied to audio systems offering “simulated”
surround sound with two speakers or “3D” headphone listening systems (see
Gardner, 1998).
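Two simple calculations illustrate the scale of these effects. The sketch below is illustrative only: the function names are hypothetical, and the HRTF magnitudes passed to hrtf_compensation_db are placeholder numbers, not measured data. The first function gives the level change from reducing the number of equal sources, assuming their outputs add either on a power basis (uncorrelated signals) or on an amplitude basis (identical, in-phase signals). The second gives the boost needed at a single frequency so that the response at the actual position matches that of the intended position.

```python
import math


def source_count_level_change_db(n_before, n_after, coherent=False):
    """Level change in dB when going from n_before to n_after equal sources.

    coherent=False: powers add (10*log10 of the ratio).
    coherent=True: amplitudes add (20*log10 of the ratio), as for identical
    in-phase signals at the listening position.
    """
    factor = 20.0 if coherent else 10.0
    return factor * math.log10(n_after / n_before)


def hrtf_compensation_db(target_magnitude_db, actual_magnitude_db):
    """Gain in dB that makes the actual position's HRTF magnitude match the
    target position's magnitude at one frequency."""
    return target_magnitude_db - actual_magnitude_db


if __name__ == "__main__":
    print(f"{source_count_level_change_db(2, 1):.1f} dB (power sum)")
    print(f"{source_count_level_change_db(2, 1, coherent=True):.1f} dB (amplitude sum)")
    # Placeholder values: suppose the 30 degree position is +1 dB at 5 kHz
    # and the 0 degree position is -2 dB, both relative to some reference.
    print(f"{hrtf_compensation_db(1.0, -2.0):+.1f} dB boost at 5 kHz")
```

In other words, going from two tweeters to one costs about 3 dB if the two outputs were uncorrelated and about 6 dB if they were identical, before any HRTF-related correction is considered.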
This preprocessing relies on the ability to predict the
intended spatial origin of the recorded high frequency information. Because high frequency images are localized
using interaural level differences, the cross correlation of high frequency
left/right amplitude information should indicate the intended spatial position
of the recording. Equal energy suggests
the image is towards the middle, while unbalanced energy suggests an image on
one side. Knowing the intended spatial
position, along with the actual location of the mono tweeter, one could process
the signal using HRTFs to better camouflage the missing tweeter.
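The idea of inferring the intended position from the left/right balance can be sketched as below. This is a hypothetical illustration, not a method implemented in the thesis: it assumes the two channels have already been high-pass filtered, and it uses a simple normalized energy balance in place of a full cross-correlation. The names channel_energy and balance_index are invented for this sketch.

```python
def channel_energy(samples):
    """Sum of squared sample values."""
    return sum(x * x for x in samples)


def balance_index(left_hf, right_hf):
    """Normalized energy balance of two (already high-passed) channels.

    Returns a value from -1.0 (all energy in the left channel) through 0.0
    (equal energy, image towards the middle) to +1.0 (all energy in the right).
    """
    e_left = channel_energy(left_hf)
    e_right = channel_energy(right_hf)
    total = e_left + e_right
    if total == 0:
        return 0.0  # silence: no basis for a position estimate
    return (e_right - e_left) / total


if __name__ == "__main__":
    left = [0.5, -0.5, 0.5, -0.5]
    right = [0.5, -0.5, 0.5, -0.5]
    print(balance_index(left, right))                     # equal energy
    print(balance_index(left, [0.0] * 4))                 # left channel only
    print(balance_index([0.1 * x for x in left], right))  # mostly right
```

A practical system would compute this index over short time windows, so that the estimated position can follow the recording as it changes.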
However, the image’s location will still be incorrect due to
the inability to control high frequency interaural level differences (having
only one tweeter). One idea to explore
would be to introduce a time-delayed version of the mono signal, while encoding
the original and time delayed version with different temporal envelope
modulation. At high frequencies, only
envelope time differences (and level differences, which are fixed in this case)
will have an effect on the resulting stereo image position.
As for simpler investigations,
one might change certain parameters used in these experiments in an attempt to
support or disprove these findings. For example,
comparing results between a horizontal and a vertical SR channel would be
interesting. It is expected that relocation to the
vertical channel would be even more difficult to notice. This is because a change in vertical
position only alters the monaural spectral cues, while level and time
differences will stay the same. Strybel
& Fujimoto (2000) showed the vertical minimum audible angle to be 4-5º
larger than the horizontal MAA at 0º azimuth.
Others might investigate the effect of relocating “equally loud” frequency
bands, different loudspeakers, a reverberant vs. “typical” acoustic space, or
varied speaker locations and configurations.
The incorporation of in-ear recordings could also lead to a
more analytical approach to understanding the listening test results. These recordings are obtained by playing the
test signals while monitoring the ear canal microphones of a dummy head or by
placing probe microphones at the ear canals of actual listening subjects (see
Blauert, 1999, p. 31). The recordings
provide spectral and temporal representations of the ear canal signals, which
could be used to support the listening subjects’ responses.
In summary, this research suggests
that the localization of a stereo image is most affected by the spatial origin
of low-to-mid frequencies as opposed to higher frequencies. This is not due to any absolute localization
ability, considering that the middle frequency range (1-3 kHz) is known
to have the largest minimum audible angle (i.e., it is the least accurately
localized). Instead, it is probably due
to the perceptual dominance of low frequency interaural time differences.
Regardless, a large amount of high frequency information is not typically present in music. The limited importance of high frequencies was further supported when a group of average listeners did not notice that the high frequency sound stage had been altered. Therefore, modifying high frequency localization cues seems to have a minimal impact on the perceived performance of the sound system. This finding could thus be exploited in perceptual coders, as in DTS’ technique (Smyth, 1999), or by audio system designers via the mono tweeter system. The mono tweeter system would reduce the cost of the system while having a minimal effect on fidelity, especially if some preprocessing is performed.

Created February 2003 by Rob Hartman
Copyright (C) 2003