Chapter 2 Digital Audio
Introduction
Analog signals are continuous both in time and in amplitude. A discrete signal, however, is sampled at uniform time intervals. A digital signal is a discretely sampled signal where each sample assumes a value from a discrete range.
The frequency at half of the sampling rate is commonly called the Nyquist frequency [2]. Sampling rates for audio signals are measured in samples per second, or hertz (Hz). Some common sampling rates for audio signals are 8kHz, 11.025kHz, 16kHz, 20kHz, 32kHz, 44.1kHz, 48kHz, and 96kHz. They can represent frequencies up to their Nyquist frequencies: 4kHz, 5.5125kHz, 8kHz, 10kHz, 16kHz, 22.05kHz, 24kHz, and 48kHz, respectively.
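The relationship between the sampling rates and their Nyquist frequencies can be checked with a few lines of Python. This is only an illustrative sketch; the names `SAMPLING_RATES_HZ` and `nyquist_frequency` are chosen here and do not come from the text.

```python
# Common audio sampling rates in Hz, as listed in the text.
SAMPLING_RATES_HZ = [8000, 11025, 16000, 20000, 32000, 44100, 48000, 96000]


def nyquist_frequency(sampling_rate_hz):
    """Return the Nyquist frequency (half the sampling rate) in Hz."""
    return sampling_rate_hz / 2.0


for rate in SAMPLING_RATES_HZ:
    # Print both values in kHz for easy comparison with the list above.
    print(f"{rate / 1000:g} kHz -> Nyquist {nyquist_frequency(rate) / 1000:g} kHz")
```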
Figure 2.1 Continuous signal, f(t), with its discrete equivalent, f(nT), beneath
The sampling theorem states that no information is lost between a continuous signal and the discrete signal, provided the continuous signal is band-limited to below the Nyquist frequency; however, if the signal is digital instead of discrete, some information is lost due to quantization. Quantization is the process of assigning each sample of a digital signal an amplitude chosen from a finite set of possible values. A larger set of values reduces the quantization error, but it also requires more data to represent each sample. Some common word lengths for digital audio are 8-bit, 16-bit, 20-bit, and 24-bit, which provide $2^8$, $2^{16}$, $2^{20}$, and $2^{24}$ possible values, respectively. There are ways to reduce the audibility of quantization error, such as dithering [3], but these will not be discussed here.
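The trade-off between word length and quantization error can be illustrated with a minimal sketch. It assumes a uniform quantizer over the range [-1.0, 1.0) that rounds to the nearest level; the text does not specify a quantizer, so the functions `quantize` and `max_quantization_error` and the sine test signal are hypothetical choices made for this example.

```python
import math


def quantize(sample, bits):
    """Round a sample in [-1.0, 1.0) to the nearest of 2**bits uniform levels."""
    levels = 2 ** bits
    step = 2.0 / levels                      # spacing between adjacent levels
    index = math.floor(sample / step + 0.5)  # index of the nearest level
    index = max(-levels // 2, min(levels // 2 - 1, index))  # clip to the valid range
    return index * step


def max_quantization_error(bits, num_points=1000):
    """Largest rounding error over one cycle of a sine wave of amplitude 0.9."""
    worst = 0.0
    for n in range(num_points):
        sample = 0.9 * math.sin(2.0 * math.pi * n / num_points)
        worst = max(worst, abs(sample - quantize(sample, bits)))
    return worst


for bits in (8, 16, 20, 24):
    # Word length, number of possible values, and the worst-case error observed.
    print(bits, 2 ** bits, max_quantization_error(bits))
```

With rounding to the nearest level, the error never exceeds half a quantization step, so each additional bit halves the worst-case error while adding one bit of data per sample.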
The frequency domain is a representation of a signal in terms of complex exponentials of radian frequencies [2], and it is commonly called the spectrum of the signal. The complex exponential, $e^{-j\omega t}$, is the kernel of the transform that converts a time signal, $f(t)$, into its frequency representation, $F(j\omega)$ [4]. The transform is referred to as the Fourier transform and is written as
$$F(j\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt \tag{2.1}$$
The Fourier transform measures the amount of energy at each radian frequency.
The frequency representation is complex and thus composed of real and imaginary parts. It can be written in rectangular form as
$$F(j\omega) = F_R(j\omega) + jF_I(j\omega) \tag{2.2}$$
The magnitude, $|F(j\omega)|$, and phase, $\angle F(j\omega)$, of the frequency representation can be combined to express the polar form as
$$F(j\omega) = |F(j\omega)|\, e^{j\angle F(j\omega)} \tag{2.3}$$
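The equivalence of the rectangular and polar forms can be checked numerically for a single value of the spectrum. The sketch below uses Python's standard `cmath` module; the value 3 - 4j is an arbitrary example, not taken from the text.

```python
import cmath

# Arbitrary example value of the spectrum at one frequency: F_R = 3, F_I = -4.
value = complex(3.0, -4.0)

magnitude = abs(value)                        # |F|, from the rectangular parts
phase = cmath.phase(value)                    # angle of F, in radians
rebuilt = magnitude * cmath.exp(1j * phase)   # polar form: |F| e^{j angle(F)}

print(magnitude, phase, rebuilt)
```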
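The Fourier transform can be written in discrete-time form by replacing $f(t)$ with $f(nT)$. Let $f_n = f(nT)$, and let $\omega$ denote the radian frequency normalized by the sampling rate, so that the integral becomes a sum over the samples.
$$F(j\omega) = \sum_{n=-\infty}^{\infty} f_n\, e^{-j\omega n} \tag{2.4}$$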
The discrete-time Fourier transform (DTFT) can be described by using the complex z-plane, in which a complex number is plotted against real and imaginary axes. The z-plane is displayed in Figure 2.2. The unit circle is plotted to show where $|z| = 1$. The radian frequency, $\omega$, is the angle between the sweeping vector and the positive real axis. The Nyquist frequency occurs halfway around the circle, at $\omega = \pi$. The sampling frequency corresponds to a full revolution of the vector around the unit circle, $\omega = 2\pi$. The DTFT is periodic every $2\pi$ radians, and thus $F(j\omega) = F(j(\omega + 2\pi n))$ for any integer $n$. For a real signal, the magnitude of the DTFT is symmetric about the Nyquist frequency, $\omega = \pi$, and the phase is antisymmetric (equal in size but opposite in sign) about the Nyquist frequency.
Figure 2.2 The complex z-plane
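The periodicity and symmetry of the DTFT described above can be verified numerically for a short real sequence. This is a sketch under stated assumptions: the function `dtft` evaluates the DTFT sum over a finite sequence only, and the sample values, `omega`, and `offset` are arbitrary choices made for illustration.

```python
import cmath
import math


def dtft(samples, omega):
    """Evaluate the DTFT sum of a finite real sequence at radian frequency omega."""
    return sum(value * cmath.exp(-1j * omega * n) for n, value in enumerate(samples))


samples = [1.0, 0.5, -0.25, 0.75]   # arbitrary real test sequence
omega = 0.7                          # arbitrary radian frequency
offset = 0.3                         # distance from the Nyquist frequency (omega = pi)

# Periodicity: the spectrum repeats every 2*pi radians.
periodic_gap = abs(dtft(samples, omega) - dtft(samples, omega + 2.0 * math.pi))

# Symmetry about Nyquist: equal magnitudes, opposite phases (complex conjugates).
below = dtft(samples, math.pi - offset)
above = dtft(samples, math.pi + offset)
magnitude_gap = abs(abs(below) - abs(above))
conjugate_gap = abs(above - below.conjugate())

print(periodic_gap, magnitude_gap, conjugate_gap)
```

All three printed differences should be zero to within floating-point rounding.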
The DTFT cannot be realized on a digital machine because the frequency variable, $\omega$, is continuous and the summation requires an infinite number of samples [5]. A discrete set of frequencies is chosen within the period:
$$\{\omega_p\}, \quad 0 \le \omega_p < 2\pi, \quad p = 0, 1, \ldots, M-1 \tag{2.5}$$
For computational reasons the frequencies are uniformly spaced.
$$\omega_p = \frac{2\pi p}{M}, \quad p = 0, 1, \ldots, M-1 \tag{2.6}$$
The discrete set of frequencies solves the first problem of the DTFT. Examining the signal within a finite duration solves the second problem.
$$f_n = f(nT), \quad n = 0, 1, \ldots, N-1 \tag{2.7}$$
Again, for computational reasons this finite duration is chosen to be equal to the number of discrete frequencies, $M = N$. By combining equations (2.4) and (2.6) with the finite-duration time sequence (2.7), the discrete Fourier transform (DFT) is obtained.
$$F(p) = \sum_{n=0}^{N-1} f_n\, e^{-j2\pi pn/N}, \quad p = 0, 1, \ldots, N-1 \tag{2.8}$$
Each index of the discrete frequency spectrum is referred to as a frequency bin. The magnitude of each bin is the amount of energy at the equivalent discrete frequency. The magnitude of the zero bin, p = 0, is the dc content of the signal. All other frequencies are harmonically related to the fundamental frequency, p = 1, of the transform since they are equally spaced.
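The idea of frequency bins can be made concrete by evaluating the DFT sum directly. The sketch below is a plain, unoptimized implementation written for this example (the function name `dft` and the test signal are assumptions): a dc offset of 0.5 plus a cosine that completes two cycles in eight samples.

```python
import cmath
import math


def dft(samples):
    """Direct evaluation of the DFT sum: N bins from N samples, O(N^2) operations."""
    n_samples = len(samples)
    spectrum = []
    for p in range(n_samples):               # p is the frequency-bin index
        total = 0j
        for n, value in enumerate(samples):  # n is the time index
            total += value * cmath.exp(-2j * math.pi * p * n / n_samples)
        spectrum.append(total)
    return spectrum


# Assumed test signal: dc offset of 0.5 plus a cosine with 2 cycles in 8 samples.
N = 8
signal = [0.5 + math.cos(2.0 * math.pi * 2 * n / N) for n in range(N)]
bins = dft(signal)
magnitudes = [round(abs(b), 6) for b in bins]
print(magnitudes)
```

The dc content appears in bin 0 and the cosine appears in bin 2, with a mirror image in bin 6 on the other side of the Nyquist bin (bin 4); all other bins are zero to within rounding.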
The inverse Fourier transform uses the inverse kernel, $(1/2\pi)e^{j\omega t}$. The inverse kernel is the basis function used to linearly approximate the original time signal [4]. The inverse Fourier transform is written
$$f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(j\omega)\, e^{j\omega t}\, d\omega \tag{2.9}$$
Equivalently, the inverse discrete Fourier transform (IDFT) is written
$$f_n = \frac{1}{N}\sum_{p=0}^{N-1} F(p)\, e^{j2\pi pn/N}, \quad n = 0, 1, \ldots, N-1 \tag{2.10}$$
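A quick way to see that the IDFT undoes the DFT is a round trip on a short sequence. The following sketch assumes the common convention in which the forward transform is unscaled and the inverse carries the 1/N factor; the function names and the test sequence are chosen for this example.

```python
import cmath
import math


def dft(samples):
    """Forward DFT with kernel exp(-j*2*pi*p*n/N) and no scaling."""
    n_samples = len(samples)
    return [
        sum(samples[n] * cmath.exp(-2j * math.pi * p * n / n_samples)
            for n in range(n_samples))
        for p in range(n_samples)
    ]


def idft(spectrum):
    """Inverse DFT with kernel exp(+j*2*pi*p*n/N) and a 1/N scale factor."""
    n_bins = len(spectrum)
    return [
        sum(spectrum[p] * cmath.exp(2j * math.pi * p * n / n_bins)
            for p in range(n_bins)) / n_bins
        for n in range(n_bins)
    ]


original = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.0, 0.75]   # arbitrary test sequence
recovered = idft(dft(original))

# The recovered samples are complex with negligible imaginary parts.
round_trip_error = max(abs(a - b) for a, b in zip(original, recovered))
print(round_trip_error)
```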
If the discrete set of frequencies is equally spaced and has the same length as the time sequence, a fast Fourier transform (FFT) can be used [2]. The FFT is most efficient when the length of the time sequence is a power of two [5]. The FFT uses decimation to reduce computation from on the order of $N^2$ complex multiplications to on the order of $N \log_2 N$ complex multiplications [2].
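The decimation idea can be sketched with a recursive radix-2 decimation-in-time FFT, which splits the sequence into its even-indexed and odd-indexed samples at each stage. This is one textbook variant, shown for illustration only; it is not necessarily the algorithm described in [2], and the direct `dft` is included just to check the result.

```python
import cmath
import math


def dft(samples):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(samples)
    return [
        sum(samples[k] * cmath.exp(-2j * math.pi * p * k / n) for k in range(n))
        for p in range(n)
    ]


def fft(samples):
    """Recursive radix-2 decimation-in-time FFT; the length must be a power of two."""
    n = len(samples)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])   # transform of the even-indexed samples
    odd = fft(samples[1::2])    # transform of the odd-indexed samples
    spectrum = [0j] * n
    for p in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * p / n) * odd[p]
        spectrum[p] = even[p] + twiddle            # lower half of the spectrum
        spectrum[p + n // 2] = even[p] - twiddle   # upper half of the spectrum
    return spectrum


signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(16)]
largest_difference = max(abs(a - b) for a, b in zip(fft(signal), dft(signal)))

# Rough multiplication counts for N = 1024: N^2 versus N * log2(N).
N = 1024
print(largest_difference, N ** 2, N * int(math.log2(N)))
```

For N = 1024 the two counts are 1,048,576 and 10,240, which shows why the FFT is preferred whenever the sequence length allows it.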
Several properties of a signal can be determined by examining the autocorrelation of the signal. The autocorrelation is the convolution of the signal, $f(t)$, with its reverse, $f(-t)$.
$$\phi(k) = \int_{-\infty}^{\infty} f(t)\, f(t+k)\, dt \tag{2.11}$$
The autocorrelation is an even function; i.e. $\phi(k) = \phi(-k)$. The value at zero lag, $\phi(0)$, is the energy of the signal, and dividing it by the signal duration gives the average power. The autocorrelation can be expressed in discrete form as
$$\phi(k) = \sum_{n=0}^{N-1} f_n\, f_{(n+k) \bmod N} \tag{2.12}$$
Because the index $n+k$ wraps around modulo $N$, the discrete autocorrelation is a periodic, rather than aperiodic, discrete convolution of the two finite-length sequences $f(n)$ and $f(-n)$. It is referred to as a circular convolution or a circular autocorrelation [2].
One of the unique relationships between the time and frequency domains is that convolution in one domain is equivalent to multiplication in the other domain. Therefore, if the Fourier transform is used to compute $F(p)$ from $f(n)$, and $F(p)$ is multiplied by its complex conjugate, $F^*(p)$, then the Fourier transform of the autocorrelation results.
$$\Phi(p) = F(p)\, F^*(p) = |F(p)|^2 \tag{2.13}$$
This result is known as the power density spectrum of the signal, or, simply, the power spectrum. The inverse Fourier transform of the power density spectrum is the autocorrelation of the signal. If the signal is multiplied by a rectangular window, i.e. a time-limited segment, then the power density spectrum is called a periodogram [2].
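The relationship between the autocorrelation and the power spectrum can be checked numerically. The sketch below assumes an unnormalized circular autocorrelation, in which the lag index wraps around modulo N, and a real input sequence; the function names and the four-sample test signal are hypothetical.

```python
import cmath
import math


def dft(samples, inverse=False):
    """Direct DFT; with inverse=True the kernel sign flips and a 1/N scale is applied."""
    n_samples = len(samples)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n_samples if inverse else 1.0
    return [
        scale * sum(samples[n] * cmath.exp(sign * 2j * math.pi * p * n / n_samples)
                    for n in range(n_samples))
        for p in range(n_samples)
    ]


def circular_autocorrelation(samples):
    """Time-domain definition: phi(k) = sum over n of f(n) * f((n + k) mod N)."""
    n = len(samples)
    return [sum(samples[i] * samples[(i + k) % n] for i in range(n)) for k in range(n)]


def autocorrelation_from_power_spectrum(samples):
    """Frequency-domain route: inverse DFT of F(p) times its complex conjugate."""
    spectrum = dft(samples)
    power = [(value * value.conjugate()).real for value in spectrum]
    return [value.real for value in dft(power, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]   # arbitrary real test sequence
direct = circular_autocorrelation(signal)
via_spectrum = autocorrelation_from_power_spectrum(signal)
print(direct)
print([round(value, 6) for value in via_spectrum])
```

Both routes give the same sequence, and the result is even in the circular sense: the value at lag 1 equals the value at lag 3.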
When data is corrupted or missing, forms of estimation can be used to recover the data. Assuming valid data is known both before and after the missing or corrupted data block, interpolation can be used to estimate the block. The most basic form of interpolation is zero-order interpolation, often called sample-and-hold, which consists of replacing the damaged data block with the last valid data sample [3].
Figure 2.3 Zero-order interpolation
First-order interpolation, often called linear interpolation or straight-line interpolation, replaces the missing data with a moving average, i.e. the mean of the surrounding valid data. This method, like zero-order interpolation, can only be used when a few samples are missing.
|
( 2.14) |
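Zero-order and first-order interpolation can be sketched in a few lines. The functions below are illustrative assumptions rather than the exact formulation of equation (2.14): `linear_interpolation` fills the gap along the straight line joining the valid samples on either side, which for a single missing sample reduces to the mean of its two neighbours. The gap is given as the half-open index range `start` to `stop`, and valid samples are assumed to exist on both sides of it.

```python
def zero_order_hold(samples, start, stop):
    """Fill samples[start:stop] with the last valid sample before the gap."""
    filled = list(samples)
    for n in range(start, stop):
        filled[n] = samples[start - 1]
    return filled


def linear_interpolation(samples, start, stop):
    """Fill samples[start:stop] along the straight line joining the gap's neighbours."""
    filled = list(samples)
    left, right = samples[start - 1], samples[stop]   # valid samples around the gap
    span = stop - (start - 1)                         # distance between those samples
    for n in range(start, stop):
        weight = (n - (start - 1)) / span
        filled[n] = (1.0 - weight) * left + weight * right
    return filled


# Samples 2 and 3 are treated as missing; the zeros there are placeholders.
gap_signal = [0.0, 1.0, 0.0, 0.0, 4.0, 5.0]
held = zero_order_hold(gap_signal, 2, 4)
lined = linear_interpolation(gap_signal, 2, 4)
print(held)
print(lined)
```

Sample-and-hold repeats the value 1.0 across the gap, while the first-order version restores the ramp through 2.0 and 3.0 (to within rounding).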
Estimation of unknown data is based on the probability of each value that the data may assume. Since audio data can assume any value, yet is not discontinuous, it can be treated as pseudo-random. Pseudo-random data is characterized by a probability density function (PDF) that is described by the surrounding data. This PDF determines the likely values of the unknown data.
Higher order interpolators can be constructed using various methods. Higher order interpolators use probability theory and methods such as Minimum Mean Square Error (MMSE), Maximum Likelihood (ML), and Maximum a posteriori (MAP) [5]. The MMSE estimator minimizes the mean square error to determine the unknown sequence. The ML estimator assumes the data is a set of unknown constants and tries to determine the maximum likelihood of the data based on an assumed PDF of the data. A posterior distribution is a probability based on the known data, the PDF of the data, and any other prior known information. The MAP estimator maximizes the posterior distribution by minimizing the error.
Figure 2.5 Higher order interpolation
A data sequence to be interpolated can be grouped into three sections: the unknown section $x_{(i)}$, the known section preceding the gap $x_{-(i)a}$, and the known section following the gap $x_{-(i)b}$.
$$x = [\,x_{-(i)a} \;\; x_{(i)} \;\; x_{-(i)b}\,] \tag{2.15}$$
Let $U$ and $K$ be rearrangement matrices that reassemble $x$ from the unknown, $x_{(i)}$, and known, $x_{-(i)}$, data samples.
$$x = U x_{(i)} + K x_{-(i)} \tag{2.16}$$
The PDF of the unknown data samples conditional upon the known data samples is defined as
$$p(x_{(i)} \mid x_{-(i)}) = \frac{p(x)}{p(x_{-(i)})} \tag{2.17}$$
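If it is assumed that the data sequence is a zero-mean Gaussian random vector, which is a suitable model for audio signals with Gaussian statistics [5], then $p(x)$ is defined as
$$p(x) = \frac{1}{(2\pi)^{N/2}\, |\phi_x|^{1/2}} \exp\!\left(-\tfrac{1}{2}\, x^T \phi_x^{-1} x\right) \tag{2.18}$$
where $N$ is the length of $x$ and
$\phi_x = E[x x^T]$ = the autocorrelation matrix of $x$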
The data sequence, $x$, in the exponent of the Gaussian PDF is then rewritten using equation (2.16).
$$x^T \phi_x^{-1} x = (U x_{(i)} + K x_{-(i)})^T \phi_x^{-1} (U x_{(i)} + K x_{-(i)}) \tag{2.19}$$
The unknown section $x_{(i)}$ which minimizes equation (2.19) is the MAP interpolation. Since the PDF of the sequence is Gaussian, the MAP interpolation corresponds exactly to the MMSE interpolation. The sequence that minimizes equation (2.19) is given by
$$x_{(i)}^{MAP} = -M_{(i)(i)}^{-1}\, M_{(i)-(i)}\, x_{-(i)} \tag{2.20}$$
where
$M_{(i)(i)} = U^T \phi_x^{-1} U$
$M_{(i)-(i)} = U^T \phi_x^{-1} K$
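The Gaussian interpolator can be sketched numerically under stated assumptions. Because $U$ and $K$ only select the unknown and known positions, $U^T \phi_x^{-1} U$ and $U^T \phi_x^{-1} K$ are simply sub-blocks of the inverse autocorrelation matrix. The autocorrelation model `ar1_autocorrelation` (values $0.9^{|k|}$ at lag $k$) is a hypothetical choice made only for this example, and `solve` is a plain Gauss-Jordan routine written here to avoid external libraries.

```python
def solve(matrix, rhs):
    """Solve matrix * X = rhs (both lists of rows) by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(matrix[r]) + list(rhs[r]) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))  # partial pivoting
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ar1_autocorrelation(lag, rho=0.9):
    """Hypothetical autocorrelation model, assumed for illustration only."""
    return rho ** abs(lag)


def map_interpolate(samples, unknown, autocorr=ar1_autocorrelation):
    """Estimate samples at the `unknown` indices from the remaining known samples."""
    n = len(samples)
    known = [i for i in range(n) if i not in unknown]
    phi = [[autocorr(r - c) for c in range(n)] for r in range(n)]
    identity = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    precision = solve(phi, identity)                       # inverse of phi_x
    # U^T phi_x^-1 U: rows and columns of the inverse at the unknown positions.
    m_uu = [[precision[r][c] for c in unknown] for r in unknown]
    # -(U^T phi_x^-1 K) x_known: unknown rows, known columns, times the known data.
    rhs = [[-sum(precision[r][c] * samples[c] for c in known)] for r in unknown]
    estimate = solve(m_uu, rhs)
    filled = list(samples)
    for index, row in zip(unknown, estimate):
        filled[index] = row[0]
    return filled


# Sample 2 is treated as missing; the 0.0 there is only a placeholder.
filled = map_interpolate([1.0, 2.0, 0.0, 4.0, 5.0], unknown=[2])
print(filled)
```

For this particular autocorrelation model the estimate of a single interior sample depends only on its two neighbours, and it equals $\rho/(1+\rho^2)$ times their sum, slightly less than their plain average.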
There are various other methods of interpolation, such as autoregressive (AR) models and the McAulay-Quatieri (MQ) method. The AR models [5] are based on an all-pole filter excited by white noise. The filter coefficients are determined by the PDF of the data. The MQ method [6] assumes that the data sequence is composed of sinusoids. The method involves transforming the signal into the frequency domain and interpolating between the known frequency tracks. The method also includes decision methods for determining which frequency tracks before the gap correspond to which frequency tracks after the gap and which tracks die or originate within the unknown data segment.
Gestalt psychology is based on the perception of whole units. In audio, these units are auditory streams. Elements make up an auditory stream based on their organizational properties such as similarity, proximity, continuity, common fate, symmetry and closure. The more alike the elements are to each other, the greater the probability that the mind will perceive them as one stream. Unlike elements are perceived as different streams.
If an auditory stream is interrupted by an audible silent interval, a listener will hear a distinct gap in the stream. The gap is distinct to the listener due to the discontinuity of the auditory stream. The listener perceives that one auditory stream ends and a new one begins. If the silent interval is replaced with noise, the listener may "hear through" the noise and perceive a signal that did not necessarily occur [7]. This phenomenon is termed auditory restoration. When this occurs, the listener perceives the original stream continuing through the noise stream, even though the original stream was not present during the noise stream. Since the ear has previous experience of certain streams psychoacoustically masking other streams, it perceives that the noise stream is simply masking the original stream. The degree of auditory restoration depends on the size of the gap and the probability that the noise could have masked the signal. Noise could mask a signal with equal or less energy and with an equal or smaller frequency range.
This perception of sound continuity is dependent on the sound before and after the noise burst. The sound heard after the noise burst induces the perceptual continuity of the sound through the noise burst [7]. If the sound after the noise burst has a plausible resolution based on the sound before the noise burst, auditory restoration will occur.
Figure 2.6 displays examples where the listener hears a distinct gap in the audio stream. Figure 2.6(a) demonstrates a tone interrupted by a silent interval. Figure 2.6(b) shows an unexpected resolution of the tone after the noise burst. Figure 2.6(c) shows a noise burst that does not contain similar frequency components to possibly mask the sound.
Figure 2.7 displays examples where the listener perceives the sound continuing through the gap due to auditory restoration. Figure 2.7(a) and (b) show a noise burst filling in the gap. The noise burst is within the same frequency range of the tone, and the tone has a plausible resolution following the noise burst based on its tonal direction before the noise burst.
Figure 2.6 Examples of perceived non-continuity of sound [7]: (a) a tone interrupted by a silent interval, (b) a tonal resolution after a noise burst inconsistent with the tonal direction before the noise burst, (c) a tone interrupted by a noise burst without similar frequency components.
Figure 2.7 Examples of perceived continuity of sound [7]: (a) and (b) a tone interrupted by a noise burst of similar frequency components and having a plausible resolution after the noise burst.
Natural auditory streams show particular characteristics when terminating. Without these cues, the listener has the tendency to believe the stream is continuous. The human ear builds libraries of expectations based on previous auditory experiences. When an unexpected event occurs, the ear becomes more acute. An unexpected silent interval is noticed even if it spans a relatively small time interval. The ear also notices unexpected changes in amplitude and sound direction. These unexpected changes lead the mind to perceive a new or different auditory stream.
An unexpected silent gap should be filled with information that can perceptually continue the auditory stream. The information should be consistent in amplitude and frequency with the original auditory stream. This phenomenon can benefit the output of an extrapolation system. As long as the extrapolation output helps produce an auditory restoration of the sound, inconsistency with the actual missing data can be neglected.