
5. Results and Discussion

A fair statistical analysis is essential for a useful interpretation of the acquired listening test data. This chapter first describes the statistical method, then presents the results of the listening tests, and finally discusses the implications of those results in light of previous research.

5.1 Results

When listening test data are analyzed statistically at the conventional significance level (α = 0.05), a small number of trials or listeners may produce a high risk of concluding that audible differences are inaudible (type 2 error) [24]. This risk can be large both in absolute terms and relative to the risk of concluding that inaudible differences are audible (type 1 error). Care must be taken that the type 2 error risk does not dwarf the type 1 error risk.

A significance test evaluates a scientific proposition by setting up two incompatible statistical hypotheses, H0 and H1, and determining from the research data whether it is plausible to reject H0 in favor of H1. H0, the null hypothesis, states that the proportion of correct identifications p in the conceptual population of trials is 0.5. The implication is that the differences between the two components under test are not audible and that the listener will perform at chance. Therefore

H0: p = 0.5.

H1, the alternative hypothesis, states that p is greater than 0.5. The implication is that differences are audible and that the listener will perform above chance. Therefore

H1: p > 0.5.

H1 is said to be directional in this case. The task, then, is to decide whether to reject H0 in favor of H1.
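As an illustration of the decision between H0 and H1, the following minimal sketch computes the one-sided probability of obtaining at least a given number of correct identifications when the listener is merely guessing (H0: p = 0.5). The function names and the example of 13 correct responses in 15 trials are hypothetical and are not taken from the test data.

```python
from math import comb


def binomial_upper_tail(n: int, r: int, p: float) -> float:
    """Return P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, n + 1))


def one_sided_p_value(correct: int, trials: int) -> float:
    """Probability of at least `correct` right answers under H0 (p = 0.5)."""
    return binomial_upper_tail(trials, correct, 0.5)


if __name__ == "__main__":
    # Hypothetical listener: 13 correct identifications in 15 trials.
    p_value = one_sided_p_value(13, 15)
    print(f"one-sided p-value = {p_value:.4f}")
    # H0 is rejected in favor of H1 when the p-value does not exceed alpha.
    print("reject H0 at alpha = 0.05:", p_value <= 0.05)
```

A small p-value means that such a score would rarely arise from guessing, which supports rejecting H0 in favor of the directional H1.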

The experimenter’s decision regarding the rejection of H0 in favor of H1 can be correct in two ways and incorrect in two ways, as shown in Fig. 5.1. Cells a and c show the two possibilities when H0 is true. Cell a represents the correct decision of not rejecting H0, which has probability 1 - α. Cell c represents the incorrect decision of rejecting H0 and is designated as type 1 error, with probability α. Cells b and d illustrate the two possible decisions when H0 is false. Cell b represents the incorrect decision of not rejecting H0 and is designated as type 2 error, with probability β. Cell d represents the correct decision of rejecting H0, which has probability 1 - β.

Figure 5.1

Fig. 5.1. 2 x 2 table illustrating the two correct and two incorrect decisions possible when deciding whether to reject H0 in favor of H1. Indicated in parentheses are scientific interpretations frequently made of the column and row headings regarding the audibility of differences. [L. Leventhal, "Type 1 and Type 2 Errors in Statistical Analysis of Listening Tests," J. Audio Eng. Soc., vol. 34, pp. 437-453 (1986 Jun.), p. 440, Fig. 1.]

The power of a statistical test is the probability of rejecting H0 when H0 is false, as indicated in cell d; it equals 1 - β. Anything that reduces the type 2 error probability therefore increases statistical power. Thus, the power of the statistical analysis of a listening test is the probability that the analysis will reveal an ability to hear differences when that ability exists. Fig. 5.2 shows that the probability of committing a type 2 error (β) decreases as the sample size (N) increases, so that power increases with N. The samples are drawn from normal populations with variance σ², and the mean under the null hypothesis is denoted μ0 in the figure.
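The dependence of power on N can also be sketched for the binomial setting of a listening test, rather than the normal populations of Fig. 5.2. The sketch below assumes a one-sided test with a nominal α of 0.05 and a hypothetical true proportion correct of p = 0.7; the function names are illustrative only.

```python
from math import comb


def upper_tail(n, r, p):
    """Return P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, n + 1))


def critical_r(n, alpha=0.05):
    """Smallest r whose type 1 error P(X >= r | p = 0.5) does not exceed alpha."""
    for r in range(n + 2):
        if upper_tail(n, r, 0.5) <= alpha:
            return r


def power(n, p, alpha=0.05):
    """Probability of rejecting H0 when the true proportion correct is p."""
    return upper_tail(n, critical_r(n, alpha), p)


if __name__ == "__main__":
    for n in (10, 20, 40):
        r = critical_r(n)
        print(f"N = {n:2d}  r = {r:2d}  power at p = 0.7: {power(n, 0.7):.3f}")
```

Because r can take only integer values, the actual type 1 error lies below the nominal 0.05, and power does not necessarily rise at every single step of N, although it rises clearly across the values shown.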

Figure 5.2

Fig. 5.2. Power curves of the two-tailed test at α = 0.05 with varying sample sizes. [S. Siegel, Nonparametric Statistics. (McGraw-Hill, New York, 1956), p. 10, Fig. 1.]

In studies with small N (number of trials), the α = 0.05 significance level usually produces a type 2 error larger than the type 1 error. Since it is desirable to keep both errors as small as possible, equalizing them usually means reducing the type 2 error. There are three ways of reducing type 2 error in a listening test.

The first way is to increase N. This requires more trials or listeners and is the preferred method to decrease type 2 error. However, one must increase N with care since more trials can be conducive to fatigue or boredom and jeopardize accuracy of results.

The second way is to increase p. While p is usually unknown, careful design of the listening test can help increase it. Examples include using carefully selected audio signals and a suitable method of sound presentation (i.e., headphones or loudspeakers), giving the listener warm-up trials, preventing fatigue, boredom, and distractions, and using a familiar room and equipment. However, these procedures cannot be relied upon by themselves to reduce type 2 error, since their effect is speculative and difficult to measure.

The third way is to increase type 1 error. Raising type 1 error in order to lower type 2 error should be a last resort, used only after attempts to increase N and p. It may nevertheless be necessary, particularly when it is important to avoid type 2 errors for p values that are only slightly above chance.

A listening test that is fair to both sides of the study should incorporate equal probabilities of type 1 and type 2 errors for the p of interest. For such a study, the fairness coefficient FCp, a measure of the degree to which the two error risks have been equalized for a given p, is a convenient figure of merit. For a study with type 1 error probability α and type 2 error probability βp for a given p, the fairness coefficient is the ratio of the smaller error probability to the larger:

$$ FC_p = \frac{\min(\alpha, \beta_p)}{\max(\alpha, \beta_p)} \qquad (5.1) $$

An FCp of 1.0 represents a perfectly fair study for a given p, although this is usually impossible to achieve, since α and β change in abrupt, discrete steps between adjacent values of r, the minimum number of correct responses required to reject H0.

The average numbers of correct responses for the headphone listening test are given in Table 5.1. The AA and BB test signal pairs were discarded, since they provide no information on phase distortion detection; only the AB and BA pairs were considered for this calculation.

Test Signal                                     Average Correct Responses (N = 15)
70 Hz Sawtooth Wave, τmax = 4 msec              9
70 Hz Sawtooth Wave, τmax = 8 msec              9.5
3.5 kHz Sawtooth Wave, τmax = 4 msec            5.5
3.5 kHz Sawtooth Wave, τmax = 8 msec            4
10 kHz Sawtooth Wave, τmax = 4 msec             1.5
10 kHz Sawtooth Wave, τmax = 8 msec             4.5
Impulse, τmax = 4 msec                          13
Impulse, τmax = 8 msec                          14
Jazz-Vocal, τmax = 4 msec                       6.5
Jazz-Vocal, τmax = 8 msec                       4
Percussion Instruments, τmax = 4 msec           4
Percussion Instruments, τmax = 8 msec           4

Table 5.1. Average correct responses for 15 test subjects performing the headphone test.

The average numbers of correct responses for the loudspeaker listening test are given in Table 5.2. Again, the AA and BB test signal pairs were discarded and only the AB and BA pairs were considered for this calculation.

Test Signal                                     Average Correct Responses (N = 15)
70 Hz Sawtooth Wave, τmax = 4 msec              7
70 Hz Sawtooth Wave, τmax = 8 msec              7
3.5 kHz Sawtooth Wave, τmax = 4 msec            5
3.5 kHz Sawtooth Wave, τmax = 8 msec            3.5
10 kHz Sawtooth Wave, τmax = 4 msec             2
10 kHz Sawtooth Wave, τmax = 8 msec             2
Impulse, τmax = 4 msec                          8.5
Impulse, τmax = 8 msec                          6.5
Jazz-Vocal, τmax = 4 msec                       4.5
Jazz-Vocal, τmax = 8 msec                       3.5
Percussion Instruments, τmax = 4 msec           4
Percussion Instruments, τmax = 8 msec           2.5

Table 5.2. Average correct responses for 15 test subjects performing the loudspeaker test.

5.2 Discussion

Table 5.3 lists the minimum number of correct responses r required to conclude that performance is better than chance, together with the resulting type 1 and type 2 error probabilities for p values from 0.6 to 0.8, in listening tests with N = 15.

N    r    Type 1 Error (α),    Type 2 Error (β)
          actual value         p = 0.6   p = 0.7   p = 0.75   p = 0.8
15   14   0.0005               0.9948    0.9647    0.9198     0.8329
     13   0.0037               0.9729    0.8732    0.7639     0.6020
     12   0.0176               0.9095    0.7031    0.5387     0.3518
     11   0.0592               0.7827    0.4845    0.3135     0.1642
     10   0.1509               0.5968    0.2784    0.1484     0.0611
      9   0.3036               0.3902    0.1311    0.0566     0.0181
      8   0.5000               0.2131    0.0500    0.0173     0.0042
      7   0.6964               0.0950    0.0152    0.0042     0.0008
      6   0.8491               0.0338    0.0037    0.0008     0.0001

Table 5.3. Minimum number of correct responses r for concluding that performance is better than chance, and resulting type 1 and type 2 error probabilities for p values from 0.6 to 0.8 in listening tests with N = 15. [L. Leventhal, "Type 1 and Type 2 Errors in Statistical Analysis of Listening Tests," J. Audio Eng. Soc., vol. 34, pp. 437-453 (1986 Jun.), p. 445, Table 3.]
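The entries of Table 5.3 follow from the binomial distribution. In the minimal sketch below, the type 1 error is taken as the probability of at least r correct responses under p = 0.5, and the type 2 error as the probability of fewer than r correct responses under the given p. The function names are illustrative and the layout of the printed output is only an approximation of the table.

```python
from math import comb


def pmf(n, k, p):
    """Binomial probability of exactly k correct responses in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def type1_error(n, r):
    """alpha: P(at least r correct | p = 0.5)."""
    return sum(pmf(n, k, 0.5) for k in range(r, n + 1))


def type2_error(n, r, p):
    """beta_p: P(fewer than r correct | true proportion correct p)."""
    return sum(pmf(n, k, p) for k in range(r))


if __name__ == "__main__":
    n = 15
    ps = (0.6, 0.7, 0.75, 0.8)
    print(" r  alpha   " + "  ".join(f"p={p:<5}" for p in ps))
    for r in range(14, 5, -1):
        betas = "  ".join(f"{type2_error(n, r, p):.4f} " for p in ps)
        print(f"{r:2d}  {type1_error(n, r):.4f}  {betas}")
```

For r = 9 this gives α ≈ 0.3036 and β ≈ 0.3902 at p = 0.6, in agreement with the corresponding row of the table.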

A few points regarding the analysis of listening tests with statistical tests of significance can be illustrated with Table 5.3. In listening tests that are brief enough to be practical and to avoid fatigue or boredom (i.e., N ≤ 20), significance tests conducted at the 0.05 level result, for most values of p, in a type 2 error risk that is much larger than the actual type 1 error risk. In addition, type 2 error and power change as a function of p, although p is never really known.

If r = 9 is selected from Table 5.3, the fairness coefficient FCp for p = 0.6 can be calculated as

$$ FC_{0.6} = \frac{0.3036}{0.3902} \approx 0.778 \qquad (5.2) $$

This is a desirable result, since the ideal value is 1 (equal errors). The actual type 1 error for this criterion is 0.3036 and the type 2 error is 0.3902, which are fairly similar. Although p = 0.6 may seem a low criterion, it was chosen so that subtle effects of the audibility of phase distortion would be uncovered in the analysis. Therefore, for this study, 9 or more correct responses (r) out of 15 will be considered statistically significant for p = 0.6.
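The choice of r can be checked with a short sketch that evaluates the fairness coefficient for every candidate r and returns the fairest one. It assumes that the fairness coefficient is the ratio of the smaller to the larger of the two error probabilities; the function names are hypothetical.

```python
from math import comb


def tail_ge(n, r, p):
    """Return P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, n + 1))


def fairness(n, r, p):
    """Ratio of the smaller to the larger error probability (assumed definition)."""
    alpha = tail_ge(n, r, 0.5)   # type 1 error: reject H0 although p = 0.5
    beta = 1.0 - tail_ge(n, r, p)  # type 2 error: fewer than r correct at true p
    return min(alpha, beta) / max(alpha, beta)


def fairest_r(n, p):
    """Criterion r in 1..n that brings the two error risks closest together."""
    return max(range(1, n + 1), key=lambda r: fairness(n, r, p))


if __name__ == "__main__":
    n, p = 15, 0.6
    r = fairest_r(n, p)
    print(f"fairest r = {r}, FC = {fairness(n, r, p):.3f}")
```

For N = 15 and p = 0.6 the neighboring criteria r = 8 and r = 10 give noticeably lower coefficients (about 0.43 and 0.25 from the values in Table 5.3), which illustrates how abruptly the two error risks change between adjacent values of r.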

Of the randomized test signal sequences AA, AB, BA, and BB presented in the listening test, the AA and BB data were discarded for analysis. Test subject responses to the sequences AA and BB give no information regarding the detection of phase distortion; they were included only to balance the randomized presentation of the test signal sequences. The results of presentation sequences AB and BA were grouped together for calculation of the average correct listener responses. Thus, order effects were not considered in the analysis.

In a broad sense, the average correct responses for the loudspeaker-based listening test were significantly lower than for the headphone-based test. A two-sample t-test assuming equal variances at α = 0.05 indicated that the two sets of listening test responses differ. The results of the loudspeaker-based test also seemed to be independent of distortion level (τmax = 4 or 8 msec), even for test signals that showed statistically significant phase distortion audibility with headphones, such as the 70 Hz sawtooth wave and the impulse. Any effects of the Q of the all-pass filter, which were not investigated in this research, may explain why, for some signals, more correct responses were obtained with the 4 msec delay than with the 8 msec delay.

The audibility of phase distortion for steady-state signals such as the sawtooth wave depended on the frequency of the sawtooth wave. The impulse test signal displayed phase distortion audibility for a mid-range all-pass filter center frequency. Simple test signals, such as the sawtooth wave, seemed to reveal the presence of phase distortion more readily, whereas complex test signals such as the jazz vocal made detection more difficult.

The audibility of phase distortion in audio signals was also highly dependent upon individual ability, although individual data were not considered in the statistical analysis. For example, while most test subjects were very good at recognizing phase distortion in signals where it was generally perceivable, such as the impulse and the 70 Hz sawtooth wave, a few had greater difficulty. Conversely, a few subjects seemed to hear clearly the presence of phase distortion in the jazz-vocal test signal in the headphone listening test, and a few seemed to perceive phase distortion better than others in the loudspeaker listening test.

Table 5.1 indicates that, even in the headphone listening test, the audibility of phase distortion was very subtle. This is surprising, since gross phase distortion is present in the all-pass filtered test signals. In this test, human ears seem to be tolerant of even large phase distortions in audio signals. Under the criterion described earlier (r = 9 for p = 0.6 and N = 15), the 70 Hz sawtooth wave and the impulse test signals were significant for both the 4 and 8 msec maximum group delays (τmax). Given these results, the criterion used to select the all-pass filter center frequency f0 = 3.5 kHz seemed to be valid for the impulse test signal. Phase distortion detection became progressively more difficult for the 3.5 and 10 kHz sawtooth waves. Although it has been shown that relative phase has subtle effects on timbre and that phase-locking of the auditory fibers of the ear exists below 5 kHz, the introduction of phase distortion in the 3.5 kHz sawtooth wave did not produce a statistically significant result. The assumption that equal-loudness contours indicate regions of maximal phase distortion sensitivity did not seem to hold true for the sawtooth wave. Other mechanisms in the human auditory system or in the test equipment may also be responsible for this. Because phase-locking of the auditory fibers is lost above 5 kHz, it was hypothesized that phase distortion would be essentially inaudible for the 10 kHz sawtooth wave. The results of the headphone listening test confirm this, as they are all far below chance occurrence (r < 7.5). The average responses for all other test signals were also below chance occurrence. Shifting the peak group delay of the fundamental of a spectrally rich signal such as a sawtooth wave in relation to its upper harmonics is more audible at lower frequencies.

Table 5.2 indicates that, in the loudspeaker listening test, phase distortion was overall more difficult to detect than in the headphone listening test. None of the average responses met the statistical criterion used (r = 9 for p = 0.6 and N = 15), although the impulse test signal at 4 msec maximum group delay came very close, with an average of 8.5 correct responses. Under reverberant listening conditions, the audibility of phase distortion seems to be heavily masked by the reverberation. The method of sound presentation clearly plays a part in the audibility of phase distortion. Although headphone listening can be considered in isolation, loudspeaker listening also involves room effects, so the loudspeaker and the room should be considered together. Strategic acoustic room treatments and excess-phase equalization could minimize audible room effects such as reverberation and excess-phase response [26], respectively. These corrections should in turn raise the sensitivity to phase distortion over loudspeakers to a level comparable to that of the headphone data.
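The comparison of the averages in Tables 5.1 and 5.2 against the criterion can be written out as a short sketch. The data are copied from the two tables; the dictionary and function names are hypothetical, and an average of exactly 9 is treated as meeting the criterion, in line with the treatment of the 70 Hz sawtooth wave above.

```python
# Average correct responses keyed by (test signal, maximum group delay in msec).
HEADPHONE = {
    ("70 Hz sawtooth", 4): 9.0, ("70 Hz sawtooth", 8): 9.5,
    ("3.5 kHz sawtooth", 4): 5.5, ("3.5 kHz sawtooth", 8): 4.0,
    ("10 kHz sawtooth", 4): 1.5, ("10 kHz sawtooth", 8): 4.5,
    ("Impulse", 4): 13.0, ("Impulse", 8): 14.0,
    ("Jazz-vocal", 4): 6.5, ("Jazz-vocal", 8): 4.0,
    ("Percussion", 4): 4.0, ("Percussion", 8): 4.0,
}
LOUDSPEAKER = {
    ("70 Hz sawtooth", 4): 7.0, ("70 Hz sawtooth", 8): 7.0,
    ("3.5 kHz sawtooth", 4): 5.0, ("3.5 kHz sawtooth", 8): 3.5,
    ("10 kHz sawtooth", 4): 2.0, ("10 kHz sawtooth", 8): 2.0,
    ("Impulse", 4): 8.5, ("Impulse", 8): 6.5,
    ("Jazz-vocal", 4): 4.5, ("Jazz-vocal", 8): 3.5,
    ("Percussion", 4): 4.0, ("Percussion", 8): 2.5,
}


def significant(table, r_min=9.0):
    """Return the conditions whose average reaches the criterion r_min."""
    return [condition for condition, avg in table.items() if avg >= r_min]


if __name__ == "__main__":
    print("headphones:  ", significant(HEADPHONE))
    print("loudspeakers:", significant(LOUDSPEAKER))
```

Only the 70 Hz sawtooth wave and the impulse reach the criterion with headphones, and no condition reaches it with loudspeakers.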

It is of interest to note that music, even music with transient content such as the percussion instruments, did not reveal the phase distortion present in the signal. This could be due in part to reverberation present in the original recording. The rich spatial content of the jazz vocals may have obscured the presence of any phase distortion. Although phase-locking of the auditory nerve fibers is present even at low intensities, the lack of phase distortion audibility in music test signals implies that another masking mechanism may be present. Phase distortion in music test signals is even harder to hear over loudspeakers than over headphones, due to the non-anechoic listening environment. Clearly, for non-anechoic listening environments, there are other issues to be resolved before the audibility of phase distortion comes into play.

Fig. 5.3 shows the permissible phase distortion level for the various test signals used in the headphone listening test. This graph has its limitations, as it is a construct based on the constraints of the experimental design (i.e., the type of test signals used and the limited levels of distortion introduced in the test signals), and it is by no means a complete representation of the audibility of phase distortion. Fig. 5.4 shows the permissible phase distortion level for the various test signals used in the loudspeaker listening test. Figs. 5.3 and 5.4 show the minimum allowable level of phase distortion before audibility; actual levels are very likely to be higher.

Figure 5.3

Fig. 5.3. Permissible phase distortion in milliseconds of test signals used for the headphone listening test. This graph has its limitation, as it is a construct based on constraints of the experimental design. Shown is the minimum allowable level of phase distortion before audibility and actual levels are very likely to be higher.

Figure 5.4

Fig. 5.4. Permissible phase distortion in milliseconds of test signals used for the loudspeaker listening test. This graph has its limitation, as it is a construct based on constraints of the experimental design. Shown is the minimum allowable level of phase distortion before audibility and actual levels are very likely to be higher.

Covering a broad range of test signals while keeping the experimental design practical limited the resolution of the results. More refined research, using a wider variety of signals (e.g., sawtooth waves at more frequencies across the spectrum) and more phase distortion levels, is necessary to ascertain more accurate permissible levels of phase distortion for these signals.

The content of Figs. 5.3 and 5.4 is presented in another form in Table 5.4, which lists the test signals used for both the headphone and loudspeaker tests, the center frequencies of the all-pass filter, and the minimum permissible phase distortion levels.

(a)
Test Signal               Center Frequency (f0) of All-Pass Filter    Permissible Phase Distortion Level (msec)
70 Hz sawtooth wave       70 Hz                                       0
3.5 kHz sawtooth wave     3.5 kHz                                     8
10 kHz sawtooth wave      10 kHz                                      8
Impulse                   3.5 kHz                                     0
Jazz-vocal group          160 Hz                                      8
Percussion instruments    150 Hz                                      8

(b)
Test Signal               Center Frequency (f0) of All-Pass Filter    Permissible Phase Distortion Level (msec)
70 Hz sawtooth wave       70 Hz                                       8
3.5 kHz sawtooth wave     3.5 kHz                                     8
10 kHz sawtooth wave      10 kHz                                      8
Impulse                   3.5 kHz                                     8
Jazz-vocal group          160 Hz                                      8
Percussion instruments    150 Hz                                      8

Table 5.4 (a) Permissible phase distortion levels (msec) for headphone listening for test signals and center frequencies of all-pass filters used in research. (b) Permissible phase distortion levels (msec) for loudspeaker listening for test signals and center frequencies of all-pass filters used in research.

Table 5.4 may be used to identify the permissible level of phase distortion as a function of test signal type and frequency. It can be seen that phase distortion audibility depends on both the type of test signal and the phase distortion level introduced by the all-pass filter, the former being dominant.
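The entries of Table 5.4 appear consistent with a simple rule, sketched below as an assumption rather than as the procedure actually used: the permissible level is the largest tested distortion level below the first level found to be audible, and 0 msec if even the smallest tested level was audible. The function name and the input format are hypothetical.

```python
def permissible_level(audible_by_level):
    """Largest tested level (msec) below the first audible one, or 0 if none.

    `audible_by_level` maps each tested maximum group delay in msec to True
    when the distortion was statistically significant at that level.
    """
    permissible = 0
    for level in sorted(audible_by_level):
        if audible_by_level[level]:
            break  # first audible level reached; stop raising the limit
        permissible = level
    return permissible


if __name__ == "__main__":
    # Headphone test: impulse significant at both tested levels.
    print(permissible_level({4: True, 8: True}))
    # Headphone test: jazz vocal not significant at either tested level.
    print(permissible_level({4: False, 8: False}))
```

Under this rule a value of 8 msec only states that no audibility was found up to the largest level tested, which is why the actual permissible levels are likely to be higher.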

Limitations of this research include the fact that the phase distortion inherent in the transducers used for the listening tests, such as the microphones, headphones, and loudspeakers, was not investigated. The amplifiers driving the headphones and loudspeakers were assumed to be phase linear. Test subjects were assumed to have normal hearing but were not tested.

Figs. 5.3 and 5.4 indicate that, in the design of acoustic transducers and loudspeaker systems, correction of phase distortion is required only for very critical listening conditions, and only after extraneous sources of masking (i.e., reverberant listening conditions) have been properly addressed. For example, the peak group delay incurred in the transition of a fourth-order Butterworth low-pass to high-pass crossover is 0.2 msec, well below the 4 and 8 msec levels tested here. For accurate perception of audio signals, other primary design requirements for acoustic transducers and loudspeaker systems should take priority over secondary measures such as phase correction.

Another implication concerns the phase distortion introduced by studio equalizers. These equalizers are usually analog and essentially behave as band-pass/band-reject filters, so their use introduces minimum-phase phase distortion. Fig. 5.5 shows the group-delay characteristic of a first-order band-pass filter (f0 = 3.5 kHz) with Q = 1, 10, 50, and 100. These graphs are analogous to altering the Q of a parametric equalizer. Inspection of Fig. 5.5 (b) reveals approximately 4 msec of group-delay peaking for Q = 50 and approximately 8 msec for Q = 100. The permissible levels established in this research imply that phase distortion introduced by analog equalization may not be of concern up to a certain level (i.e., Q = 100 in a first-order band-pass filter), especially for most mid-range musical content.
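The growth of group-delay peaking with Q can be estimated with a rough sketch. It assumes the standard analog band-pass section H(s) = (ω0/Q)s / (s² + (ω0/Q)s + ω0²), which may differ from the filter used to generate Fig. 5.5; for this section the group delay at the center frequency is 2Q/ω0. The function name is hypothetical.

```python
import math


def bandpass_group_delay(f, f0, q):
    """Group delay in seconds at frequency f (Hz) for the assumed section
    H(s) = (w0/Q) s / (s^2 + (w0/Q) s + w0^2) with center frequency f0 (Hz)."""
    w, w0 = 2 * math.pi * f, 2 * math.pi * f0
    # Derivative with respect to w of the denominator phase
    # atan2(w0 * w / Q, w0^2 - w^2); the numerator phase is constant.
    num = (w0 / q) * (w0**2 + w**2)
    den = (w0**2 - w**2) ** 2 + (w0 * w / q) ** 2
    return num / den


if __name__ == "__main__":
    f0 = 3500.0
    for q in (1, 10, 50, 100):
        delay_ms = 1e3 * bandpass_group_delay(f0, f0, q)
        print(f"Q = {q:3d}: group delay at f0 = {delay_ms:.2f} msec")
```

Under this assumption the delay at f0 is about 4.5 msec for Q = 50 and about 9.1 msec for Q = 100, the same order as the approximate values read from Fig. 5.5 (b), and it scales linearly with Q.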

Figure 5.5 (a)

Figure 5.5 (b)

Fig. 5.5 (a). Group delay response for first-order band-pass filter (f0 = 3.5 kHz) with Q = 1 (solid line) and Q = 10 (dotted line). (b). Group delay response for first-order band-pass filter (f0 = 3.5 kHz) with Q = 50 (solid line) and Q = 100 (dotted line).

This chapter presented the results and provided discussion. Average correct responses of all test subjects for both the headphone and loudspeaker listening tests were presented. The statistical analysis method of equalizing type 1 and type 2 errors, which is important for useful interpretation of the acquired data, was presented and applied. Finally, the results were discussed and their implications were brought forward. The results of this thesis research imply that phase distortion is of secondary importance compared to frequency response irregularities, which is in agreement with previous research results. The next chapter will conclude this thesis research by providing a summary.

