A way is needed to reduce the amount of data in an audio file without degrading its quality. This is done with file compression. Compression schemes can be either lossless or lossy. Lossless schemes retain all of a file's contents: the decompressed file is an exact copy of the original. This is the method that must be used when compressing computer text files, programs, and binary data files such as word processing documents or spreadsheets. If any part of one of these file types is lost then the file becomes corrupt. At best the file will contain erroneous data.
The methods used for lossless compression work well for the types of files mentioned but do poorly on audio and graphics files. An ASCII text file's size is typically reduced up to 60% or more with lossless compression. This is because the distribution of data in typical text and data files is uneven. The letter "e", for example, typically occurs more often than any other letter in a text file. By using fewer bits for letters with a high probability of occurrence (e.g. "e") and more bits for infrequent letters (e.g. "z") the overall file size can be reduced. This is the method used by Huffman compression. Some other popular schemes are zip, arj, lha, and zoo compression. These are variations on the LZ77 compression scheme. LZ77 retains a buffer of n bytes (a sliding window) to compare to the current data. When a data sequence is found that is already in the buffer, the encoder simply encodes the position and length of that sequence within the window instead of sending the sequence again [8].
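The Huffman idea described above (fewer bits for frequent letters, more bits for rare ones) can be illustrated with a minimal sketch. It only computes code lengths rather than a full bitstream, and the sample sentence and the function name `huffman_code_lengths` are assumptions made for illustration.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code built from text."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate case: one distinct symbol
        return {sym: 1 for sym in freq}
    # Heap items: (weight, tie_breaker, {symbol: depth_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)      # two least frequent subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in a.items()}
        merged.update({s: d + 1 for s, d in b.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = "the letter e appears here more often than the letter z"
lengths = huffman_code_lengths(sample)
coded_bits = sum(lengths[ch] for ch in sample)   # size with variable-length codes
fixed_bits = 8 * len(sample)                     # size with 8 bits per character
```

In this sample "e" receives a code no longer than the code for "z", and the coded size is smaller than the fixed 8-bit size.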
Table 2 demonstrates the large difference in the amount of compression achieved between ten data files (68.2%) and ten audio files (8.3%) using WinZip 6.2, a popular Windows compression program.
| text filename | uncompressed size (kB) | compressed size (kB) | compression % | .wav filename | uncompressed size (kB) | compressed size (kB) | compression % |
| word.doc | 266 | 47 | 82 | classical | 2,039 | 1,963 | 4 |
| write.wri | 123 | 36 | 71 | connick | 288 | 270 | 7 |
| spreadsht.xls | 137 | 52 | 62 | floyd | 2,139 | 2,048 | 4 |
| webdoc.html | 56 | 20 | 64 | hardrock | 1,257 | 1,212 | 4 |
| text1.txt | 109 | 34 | 69 | live | 968 | 835 | 14 |
| helpfile.hlp | 519 | 300 | 42 | petty | 808 | 776 | 4 |
| databas.mdb | 338 | 58 | 83 | queen | 622 | 587 | 6 |
| program.exe | 519 | 300 | 56 | sting | 853 | 766 | 10 |
| text2.txt | 43 | 10 | 76 | voice guitar | 1,137 | 973 | 14 |
| data.dat | 501 | 117 | 77 | voice | 356 | 299 | 16 |
| average | | | 68.2 | average | | | 8.3 |
As mentioned, neither Huffman nor LZ77 can compress audio or video files very well. This is because audio samples do not repeat with nearly the frequency that data (e.g. the letter "e") repeats. Data typically has a very limited range of values (e.g. 2^8 = 256 alphanumeric characters) in its set while 16-bit audio data has 2^16 = 65,536 values to choose from. This greatly reduces the probability that data will be repeated. Without repeating data, LZ77 encoders have no way to shorten a file's size. Other lossless encoding techniques have been developed for compressing audio data [9][10]. These schemes, however, only achieve compression ratios of up to 3:1. This is much less than the 25:1 ratio required to transmit CD quality mono audio through a 28.8 kbps modem.
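The contrast between text and audio can be reproduced roughly with Python's `zlib` module, which implements DEFLATE (an LZ77 variant combined with Huffman coding). This is only a sketch under stated assumptions: the "text" is one sentence repeated many times, which is far more compressible than real documents, and the "audio" is a synthetic tone with added noise rather than a real recording.

```python
import math
import random
import struct
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Repetitive English-like text, 8 bits per character.
text = ("the quick brown fox jumps over the lazy dog " * 500).encode("ascii")

# Synthetic 16-bit mono "audio": a 440 Hz tone plus low-level noise,
# sampled at 44.1 kHz. A stand-in for real recordings, not a substitute.
samples = [
    int(12000 * math.sin(2 * math.pi * 440 * n / 44100) + random.gauss(0, 300))
    for n in range(len(text) // 2)
]
audio = struct.pack("<%dh" % len(samples), *samples)

def saved_percent(data):
    """Percentage by which zlib (DEFLATE: LZ77 + Huffman) shrinks data."""
    return 100.0 * (1 - len(zlib.compress(data, 9)) / len(data))

text_saving = saved_percent(text)
audio_saving = saved_percent(audio)
```

The noisy 16-bit samples rarely repeat exactly, so the sliding window finds few matches and `audio_saving` stays well below `text_saving`.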
Due to the nature of human perception, however, parts of audio or graphics files can be selectively removed without the loss being noticed. Since some of the original material is lost, this type of compression is known as lossy compression. Some lossy compression schemes can reduce the size of an audio file dramatically while still maintaining CD audio quality.
To reduce the size of an audio file the sampling rate can be reduced, but this limits the bandwidth of the input signal and lowers its quality. For example, an audio file could be reduced to half its original size by reducing the sampling rate from 44.1 kHz to 22 kHz. The input signal, however, would be limited to a range of 0 Hz - 11 kHz. Voice applications would sound fine but for music there would be a noticeable loss of high frequencies. The alternative is to use fewer bits to quantize each sample, but at very low bitrates this greatly increases the quantization noise in the signal. Other, more complex methods must be employed to achieve data reduction without introducing as much noise.
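The trade-offs above come down to simple arithmetic, sketched here. The 6.02 dB-per-bit rule is the textbook figure for an ideal uniform quantizer driven by a full-scale sine wave, and the exact halved rate of 22.05 kHz is an assumption (the text rounds it to 22 kHz).

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def nyquist_hz(sample_rate_hz):
    """Highest input frequency that a given sampling rate can represent."""
    return sample_rate_hz / 2

def ideal_snr_db(bits_per_sample):
    """Textbook SNR of an ideal uniform quantizer with a full-scale sine input."""
    return 6.02 * bits_per_sample + 1.76

cd_mono = pcm_bitrate(44100, 16)      # 705,600 bps for CD quality mono
ratio_for_modem = cd_mono / 28800     # about 24.5, i.e. roughly 25:1
half_rate_band = nyquist_hz(22050)    # 11,025 Hz after halving the sampling rate
snr_16 = ideal_snr_db(16)             # about 98 dB
snr_8 = ideal_snr_db(8)               # about 50 dB: each bit removed costs about 6 dB
```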
Audio compression coder/decoders (codecs) attempt to preserve all of the frequencies in the range of human hearing (20 Hz - 20 kHz). In the case of human speech, however, only frequencies in the range of 20 Hz - 3.4 kHz are necessary, so speech codecs only concern themselves with this range. Codecs are implemented as either a software program or as dedicated hardware that can be added to a computer or audio component. Typically the user loads a 16 bit, 44.1 kHz, PCM audio file into the encoder and then saves a new compressed copy of the file. The output file is not in the same format as the original file so a decoder is required to reconstruct an audio file in the original format.
There are two basic types of audio compression codecs: time domain and frequency domain. Time domain codecs include μ-law and A-law companders, ADPCM, vocoders and linear predictive coders. Frequency domain codecs include two more types, transform and subband. Transform codecs map the time-based PCM samples into the frequency domain. The result is a set of frequency components which can be coded separately. MPEG-1 Layer III is an example of a transform codec. Subband codecs are actually hybrids of time domain and frequency domain coding methods. The signal is broken into smaller equal bandwidth subbands of different frequency ranges while remaining in the time domain. Concurrently the same signal is mapped to the frequency domain and the spectral lines fed to a psychoacoustic model. The model then dictates the quantization level of each subband in the time domain. The most popular example of a subband codec is MPEG-1 Layer I or II.
A large subset of time domain codecs has been developed for speech applications. Therefore time domain codecs can be broken into two categories: speech and all other remaining applications. Since the most common "other application" is music, the two categories will be referred to here as speech and music applications.
While speech codecs are designed primarily to encode speech, some of the algorithms are also useful for music. To develop a new encoding approach it is necessary to review all of the available encoding techniques, thus speech codecs are discussed here.
The most popular type of vocoder (voice coder) uses a Linear Predictive Coder (LPC) model [11]. The original speech input is discarded and the parameters of the best-fit model are transmitted. The decoder uses those parameters to generate synthetic speech that is intelligible but sounds like a talking machine [12]. Specifically, vocoders operate as follows:
The vocal tract is represented as a time-varying filter and is excited with either a white noise source, for unvoiced speech segments, or a train of pulses separated by the pitch period for voiced speech. Therefore the information which must be sent to the decoder is the filter specification, a voiced/unvoiced flag, the necessary variance of the excitation signal, and the pitch period for voiced speech. This is updated every 10-20 ms to follow the non-stationary nature of speech. [13]
Vocoders usually operate at or below 2.4 kbps. Due to the simplified model of speech that is used, however, increasing the bitrate further does not improve the robotic voice quality.
A Code Excited Linear Prediction voice coder (CELP) uses the same technique as LPC but also computes errors between the original speech and the synthetic model. It transmits the model parameters and a very compressed representation of the errors. This compressed representation is an index to a "code book" shared between the coder and decoder which is why it is called "Code Excited". CELP can produce the same speech quality as a 32 kbps ADPCM coder while only using 4.8 kbps [12].
In 1982 the European Groupe Spécial Mobile (GSM) was formed to design a mobile cellular telephone system for all of Europe. Phase I of the specification was published in 1990 and by 1993 forty-seven countries operated or were considering GSM networks. Currently every continent contains GSM systems and the acronym now stands for the Global System for Mobile communications. GSM uses a Regular Pulse Excited - Linear Predictive Coder (RPE - LPC) with a Long Term Predictor loop [14]. In other words, previous samples are used to predict the current sample. The codec samples the signal at 8000 Hz and every 20 ms compresses 160 13-bit samples into 260 bits. This offers a 4 kHz frequency range at a transmission rate of 13,000 bps [12]. This also allows for real-time transmission of speech over 14.4 kbps modems. While GSM is intended for speech, at very low bitrates it occasionally performs well with music (as shown in section 3.1.3).
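The frame figures quoted for GSM can be checked with a few lines of arithmetic. The constant names are chosen here for illustration; the values are the ones given in the text.

```python
SAMPLE_RATE_HZ = 8000      # input sampling rate
BITS_PER_SAMPLE = 13       # input resolution
FRAME_MS = 20              # frame duration in milliseconds
FRAME_BITS_OUT = 260       # coded bits per frame

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000    # 160 samples
frame_bits_in = samples_per_frame * BITS_PER_SAMPLE      # 2080 bits before coding
output_bps = FRAME_BITS_OUT * 1000 // FRAME_MS           # 13000 bps after coding
compression_ratio = frame_bits_in / FRAME_BITS_OUT       # 8.0, i.e. 8:1
```

Since 260 bits every 20 ms works out to 13,000 bps, the coded stream fits within a 14.4 kbps modem link.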
High amplitude quantization error resembles analog noise and for the most part is masked. At lower amplitudes the error resembles distortion and is more audible. μ-law and A-law companding use non-uniform quantization step sizes to improve the S/E ratio at low amplitudes. By using a logarithmic scale, more quantization steps are placed at low amplitudes. By doing this the quantization error is reduced and fewer bits can be used to represent the signal. An 8-bit implementation can achieve a small-signal S/N ratio and a dynamic range equivalent to that of a 12-bit uniform PCM system [6].
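The effect of a logarithmic scale on low-amplitude samples can be sketched with the continuous μ-law curve. The value μ = 255 and the simple uniform quantizer below are assumptions for illustration; deployed telephony codecs use a segmented approximation of this curve rather than the exact formula.

```python
import math

MU = 255  # common telephony value, assumed here

def mulaw_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] on a logarithmic scale."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize(v, bits):
    """Uniform quantizer on [-1, 1] with 2**bits - 1 levels (zero included)."""
    levels = 2 ** (bits - 1) - 1
    return round(v * levels) / levels

def codec_8bit_mulaw(x):
    """Compress, quantize to 8 bits, then expand."""
    return mulaw_expand(quantize(mulaw_compress(x), 8))

def codec_8bit_linear(x):
    """Plain 8-bit uniform quantization for comparison."""
    return quantize(x, 8)

quiet = 0.003                                        # a low-amplitude sample
err_mulaw = abs(codec_8bit_mulaw(quiet) - quiet)
err_linear = abs(codec_8bit_linear(quiet) - quiet)   # sample is rounded to zero
```

For this quiet sample the uniform 8-bit quantizer rounds the value to zero, while the companded path keeps it with a much smaller error.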
ADPCM combines the properties of differential pulse-code modulation (DPCM) and adaptive delta modulation (ADM). Like DPCM it uses a linear predictor to predict the next sample. The difference between the prediction and the actual value is then quantized and encoded with pulse-code modulation. The difference in amplitude is usually small compared to the 65,536 values used for 16-bit audio and so fewer bits can be used. As with ADM the step size used to represent the difference is allowed to adapt to the changing signal. That is, it can become larger to follow fast-changing transients or smaller to follow slow-changing ones.
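A toy version of this scheme is sketched below. It is not the DVI/IMA or Microsoft ADPCM algorithm: the predictor is simply the previous reconstructed sample, and the step-size rule (grow by 1.5 on large codes, shrink by 0.75 on small ones) and all names are assumptions chosen to show prediction, difference coding, and step adaptation working together.

```python
import math

def _update(predicted, step, code, max_code):
    """Shared state update so encoder and decoder stay in lockstep."""
    predicted += code * step
    if abs(code) >= max_code - 1:      # large codes: signal moving fast, grow step
        step = min(step * 1.5, 4096.0)
    elif abs(code) <= 1:               # small codes: signal moving slowly, shrink step
        step = max(step * 0.75, 1.0)
    return predicted, step

def adpcm_encode(samples, bits=4, step=16.0):
    """Predict each sample as the previous reconstructed value, quantize the
    difference to `bits` bits, and adapt the step size."""
    max_code = 2 ** (bits - 1) - 1     # codes -7..7 for 4 bits
    predicted = 0.0
    codes = []
    for s in samples:
        code = int(round((s - predicted) / step))
        code = max(-max_code, min(max_code, code))
        codes.append(code)
        predicted, step = _update(predicted, step, code, max_code)
    return codes

def adpcm_decode(codes, bits=4, step=16.0):
    """Mirror of adpcm_encode: rebuilds samples from the codes alone."""
    max_code = 2 ** (bits - 1) - 1
    predicted = 0.0
    out = []
    for code in codes:
        predicted, step = _update(predicted, step, code, max_code)
        out.append(predicted)
    return out

# A 16-bit-range sine wave coded with 4 bits per sample instead of 16.
signal = [int(8000 * math.sin(2 * math.pi * n / 64)) for n in range(256)]
codes = adpcm_encode(signal)
decoded = adpcm_decode(codes)
```

Because the decoder repeats the encoder's state updates, only the 4-bit codes need to be stored or transmitted.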
MPEG codecs use perceptual coding techniques to exploit masking properties allowing them to greatly compress a file while maintaining good quality. Whereas 16 bits are normally used to achieve a -98 dB noise floor, MPEG uses an average of 3 bits per sample resulting in only a -20 dB noise floor. While the noise floor is actually higher it is not perceived because it is masked by louder sounds. MPEG codecs are discussed in depth in chapter 3.2.
Table 3 shows a qualitative comparison of the different compression schemes. All of these schemes are available either in Cool Edit or in the Sound Recorder program that comes with Windows 95. This table is based on tests of three different types of source material recorded from compact discs using Cool Edit. Both the hard rock and classical selections were recorded in stereo while the speech selection was recorded in mono. The same compression ratio was attempted for all of the schemes but it was not possible in all cases because some of the encoding schemes offered only fixed bitrates. For each type of source material a subjective measurement of quality is noted as bad, poor, fair, good, or excellent along with the compression ratio. The quality measurements are this author's assessment.
| Encoding Scheme | Hard Rock | Classical | Speech |
| A-law Companding | fair (22:1) | fair (22:1) | good (11:1) |
| μ-law Companding | fair (22:1) | fair (22:1) | good (11:1) |
| DVI / IMA ADPCM (4 bits/sample) | fair (42:1) | fair (43:1) | poor (21:1) |
| Microsoft ADPCM (Multiple Pass) | fair (42:1) | fair (43:1) | poor (21:1) |
| GSM 6.10 | good (77:1) | poor (77:1) | good (37:1) |
| MPEG (Layer 1, Model 1, 3.71 b/samp) | fair (9:1) | fair (9:1) | good (4.5:1) |
| MPEG (Layer 2, Model 2, 3.71 b/samp) | excel (9:1) | excel (9:1) | excel (4.5:1) |
| MPEG (Layer 2, Model 2, 0.74 b/samp) | bad (43:1) | bad (43:1) | good (21:1) |
While LPC and CELP encoders were not available for these tests, audio demonstrations of the two, along with original source material, were found at [15]. As expected, the LPC file (2.4 kbps) sounded very synthetic. CELP (4.8 kbps) was much better but slight synthetic overtones were noticeable. Another type of CELP scheme known as low-delay code excited linear prediction (LD-CELP) (16 kbps) performed well. The synthetic overtones disappeared but more noise was introduced.
A surprise of these tests was the quality of GSM 6.10's encoding of music, especially since GSM is intended for speech applications. Given the fact that GSM compressed the files almost twice as much as MPEG one might expect the quality to be very poor. However, GSM performed consistently well regardless of source content whereas with source material that contained a continuous wide range of frequency content (e.g. classical horns, rock cymbals) the MPEG encoder's bit pool was quickly depleted. In these cases the noise introduced by GSM was not as disrupting as the artifacts produced by MPEG. With less demanding source material, however, MPEG's perceptual coding had a better chance to perform and subjectively it did a much better job than GSM.
It should also be noted that not all MPEG encoders perform equally. Other MPEG-1 Layer II codecs may be superior. While it was unavailable for testing, reviews of the Fraunhofer Institute's MPEG-1 Layer III codec [16] indicate that it would outperform GSM 6.10 at low bitrates.
MPEG is a standard compression method used in the audio and video industry. Because of this any new approaches to low bitrate encoding should at least consider its perceptual coding techniques. Any approaches that use MPEG as their basic building block will be more readily adaptable to the industry. MPEG encoding was chosen for this project for these reasons and for the overall quality of its low bitrate encoding. MPEG-1 Layer II (psychoacoustic model 2) performs as well as or better than GSM in all but extreme cases and Layer III performs better than GSM.
The need for low bitrate audio and video led the ISO/IEC standardization body to establish the Moving Pictures Experts Group (MPEG).
This group had the task to compare and assess several digital audio low-bit-rate coding techniques in order to develop an international standard for the coded representation of moving pictures, associated audio, and their combination when used for storage and retrieval on digital storage media (DSM) [17].
Currently, two of the four originally proposed standards are finished and the third has been merged with the second. Today, three "phases" are defined [18]:
MPEG-1: "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbps." Status: International Standard IS-11172, completed in 10/92.
MPEG-2: "Generic Coding of Moving Pictures and Associated Audio." Status: International Standard IS-13818, completed in 11/94.
MPEG-3: No longer exists (has been merged into MPEG-2).
MPEG-4: "Very Low Bitrate Audio-Visual Coding." Status: a draft specification is scheduled to be completed in 1997.
The MPEG-1 standard "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbits/s" was finished in 1992 and consists of four parts: system, video, audio, and conformance testing.
MPEG-1 audio consists of three layers. For each layer the standard specifies the bitstream format and the decoder. The encoder is not specified in great detail to allow for future improvements. Each subsequent layer improves the sound quality per bitrate with increasingly complex encoding schemes. The bitrates and compression schemes of each layer are shown in Table 4. The numbers shown in parentheses reflect the bitrate each layer can achieve while still maintaining audio quality comparable to a compact disc.
| Layer | bitrate (kbps) | compression scheme | applications |
| I | 32 - 448 (192) | simplified MUSICAM | digital home recordings |
| II | 32 - 384 (128) | MUSICAM+frame header | broadcasting, TV, multimedia |
| III | 32 - 320 (64) | MUSICAM & ASPEC | pro audio, telecommunication |
Each layer is backward compatible. A Layer III decoder, for example, can decode files encoded by either Layer I or Layer II encoders. Layer I is a simplified version of the MUSICAM (Masking-pattern Adapted Subband Coding And Multiplexing) coding scheme. MUSICAM was chosen for Layer I because it offers good sound quality with little processing overhead, which translates to small codec delay times. Layer II is nearly identical to the MUSICAM coding scheme except for an added frame header. Layer III uses a combination of the MUSICAM and ASPEC (Adaptive Spectral Perceptual Entropy Coding) encoding schemes.
As noted, one method used by lossy data compression schemes is reducing the number of bits used per sample. This, however, raises the quantization noise in the signal. MPEG works on the principle that if the bits are used intelligently compression can take advantage of masking properties and critical bands to reduce the number of bits without the loss being noticed. By looking at a PCM signal it is hard to imagine how to do this. Figure 6 and Figure 7 show the frequency distribution of the signal along with the masking level of the signal. This reveals how bit reduction can be performed.


In all three MPEG audio layers an input PCM signal is both converted into the frequency domain and split into 32 frequency subbands. As shown in Figure 7 an average signal level will produce an average masking level. Any sounds close in frequency and below the masking level will not be heard. As the number of bits used to quantize the signal is reduced, the quantization noise rises and the signal to error (S/E) ratio (or SNR) goes down. As long as the noise level stays below the masking level the noise will not be heard. One of two psychoacoustic models is used on each subband to determine its signal to mask ratio (SMR). This is used to determine what SNR is needed (i.e. how many bits are needed) to avoid audible quantization noise. Bands which contain loud signal content will mask quantization noise so that it is not heard. These bands do not need a large S/E ratio and fewer bits can be used for them. The overall effect is a reduction in file size while retaining enough bits to keep any noise in the signal masked.
The following sections discuss the step-by-step process used by MPEG-1 encoders and decoders to compress and reconstruct a PCM audio signal. Each layer is discussed along with the two psychoacoustical models.
The MPEG-1 audio algorithm is a psychoacoustic algorithm, the primary parts of which are shown in Figure 8.

The encoding process begins by breaking the signal into 32 equally spaced subbands with a filter bank. As shown in Figure 9, it is not possible to construct filters with a perfectly flat response in the passband and zero output everywhere else. The shaded overlapping regions create aliasing noise which must be eliminated. Quadrature Mirror Filter (QMF) banks can reconstruct the original signal from overlapping subbands without aliasing as long as equal-width subbands are created. Polyphase filter banks can provide equal bandwidth subbands with high stop-band attenuation, allowing good control over aliasing with less delay than a QMF. Layers I and II use a Polyphase filter. Layer III uses both a polyphase filterbank and a Modified Discrete Cosine Transform (MDCT) followed by a sequence of alias reduction butterfly filters to achieve better results at the cost of more processing overhead.

As the input PCM audio signal is being broken apart by the filter bank it is also being converted from the time domain into the frequency domain. We know from the Fourier theorem that all audio signals are composed of a number of sine waves at various amplitudes and frequencies. By using a Fourier transform we can take a signal that is in the time domain (amplitude versus time) and convert it to the frequency domain (amplitude versus frequency) (Figure 10). This allows the frequency components to be analyzed by a psychoacoustic model.
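The time-to-frequency conversion can be illustrated with a naive discrete Fourier transform. This is a sketch only: a real encoder would use an FFT, and the 8 kHz sampling rate, the 64-point length, and the two test tones are assumptions chosen so that each tone lands exactly on a transform bin.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitude of DFT bins 0..N/2 for a real sequence x (naive O(N^2) DFT)."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

SAMPLE_RATE = 8000          # Hz, assumed for the example
N = 64                      # transform length, so each bin is 125 Hz wide

# Time-domain signal: a 1 kHz sine plus a quieter 2.5 kHz sine.
signal = [
    1.0 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 2500 * t / SAMPLE_RATE)
    for t in range(N)
]

mags = dft_magnitudes(signal)
bin_hz = SAMPLE_RATE / N
strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
peak_freqs = sorted(k * bin_hz for k in strongest)   # the two tone frequencies
```

The two largest spectral lines fall at 1000 Hz and 2500 Hz, with the second at half the magnitude of the first, which is exactly the amplitude-versus-frequency view the psychoacoustic model needs.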

As noted the signal is broken into 32 subbands by the filter bank and the frequency content of each is known from the FFT. These subbands can now be quantized and coded under the control of the psychoacoustic model. The psychoacoustic model analyzes the amplitudes of a select number of spectral lines to determine individual masking thresholds for each. These lines are used in conjunction with the "absolute threshold in quiet" curve to determine a global masking threshold. The masking level for each subband is then taken from this overall threshold. The psychoacoustic model uses these masking levels to calculate and return a Signal to Mask Ratio (SMR) for each subband. The SMR is fed to a bit allocation algorithm to determine how many bits are to be allocated for each subband. Bits are allocated from a bit pool where bits are given to the noisiest subbands until the pool is depleted. If enough bits are allocated the quantization noise can be completely masked.
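The bit pool idea can be sketched as a greedy loop. This is a simplification under stated assumptions: the SNR gained per bit is approximated as 6.02 dB instead of being read from the standard's tables, each subband is assumed to carry 12 samples per frame, and the SMR values are hypothetical.

```python
def allocate_bits(smr_db, pool_bits, samples_per_band=12, max_bits=15):
    """Greedy allocation: repeatedly give one more bit per sample to the
    subband whose quantization noise is most audible (lowest MNR)."""
    bits = [0] * len(smr_db)
    remaining = pool_bits
    while remaining >= samples_per_band:
        # MNR = SNR - SMR, with SNR approximated as 6.02 dB per bit.
        candidates = [
            (6.02 * bits[i] - smr_db[i], i)
            for i in range(len(smr_db))
            if bits[i] < max_bits
        ]
        if not candidates:
            break
        _, worst = min(candidates)
        bits[worst] += 1
        remaining -= samples_per_band
    return bits

smr = [20.0, 12.0, 3.0, -5.0]                  # hypothetical SMR values in dB
allocation = allocate_bits(smr, pool_bits=96)  # room for 8 one-bit increments
```

With these values the allocation is `[4, 3, 1, 0]`: the subband whose signal stands furthest above its mask receives the most bits, and the subband whose signal is already below the mask receives none.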
There are two psychoacoustic models described in the standard. The first is designed to have low computational overhead and provide accuracy at high bitrates. The second is designed for low bitrates and is more complex.
Psychoacoustic model 1 is intended to be used with Layer I and Layer II, but either model can be used with any of the layers. The standard [19] describes nine steps used to calculate the SMR:
A 512-point FFT is used for Layer I and a 1024-point FFT is used for Layer II. Increasing the number of points (or spectral lines) increases the frequency resolution of the transform. The FFT is calculated directly from the input PCM signal and windowed by a Hann window. Window functions smooth the signal at both edges of the sampling window, making the signal more continuous. This yields a more band limited spectrum, reducing spectral leakage [20].
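The benefit of windowing can be seen in a small experiment. The sketch below uses a 64-point transform instead of 512 or 1024 points to keep the naive DFT fast, and the test tone and the choice of "far" bins are assumptions for illustration.

```python
import cmath
import math

def hann(n_points):
    """Hann window in its periodic form: w[n] = 0.5 - 0.5*cos(2*pi*n/N)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]

def dft_mag(x, k):
    """Magnitude of DFT bin k of sequence x."""
    n = len(x)
    return abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))

N = 64
# A tone halfway between bins 10 and 11, so it does not fit the block evenly.
tone = [math.sin(2 * math.pi * 10.5 * t / N) for t in range(N)]
window = hann(N)
windowed = [s * w for s, w in zip(tone, window)]

far_bins = range(20, 33)    # bins well away from the tone
leak_rect = sum(dft_mag(tone, k) ** 2 for k in far_bins)      # no window
leak_hann = sum(dft_mag(windowed, k) ** 2 for k in far_bins)  # Hann window
```

The energy that leaks into distant bins is much lower with the Hann window, at the cost of a somewhat wider main peak around the tone.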
The sound pressure level is calculated for each of the 32 subbands with the following formula:
where:
Lsb(n) is used to calculate the SMR in step 9.
The threshold in quiet curve is stored in six different tables. The exact table that is used is determined by the layer and sampling rate. Each layer has three tables with sampling rates of 32 kHz, 44.1 kHz, and 48 kHz. Each table contains index numbers which correspond to a frequency (Hz), a critical band rate (z), and an absolute threshold (dB). The first three indices of the 32.0 kHz, Layer I table are shown in Table 5 [19].
| Index Number | Frequency (Hz) | Critical Band Rate (z) | Absolute Threshold (dB) |
| 1 | 62.50 | 0.617 | 33.44 |
| 2 | 125.00 | 1.232 | 19.20 |
| 3 | 187.50 | 1.842 | 13.87 |
Psychoacoustics has shown that masking properties differ when a pure tone is masked by noise and when it is masked by another pure tone. A broad-band noise masker (noise across the entire spectrum) yields a constant masking threshold for frequencies below 500 Hz and increases 10 dB per decade above that.
Narrow-band noise is noise that is equal to or smaller than the critical bandwidth. The critical bandwidth is about 100 Hz for frequencies below 500 Hz and 20% of the center frequency for those above 500 Hz [21]. Narrow-band noise and pure tone maskers exhibit different masking properties than broad-band noise maskers. The masking threshold is centered around either the critical band (for narrow-band masking) or the frequency of the pure tone masker. The range of the masking effect decreases as the masked tone moves further away from the masker. Low masker levels result in an equal drop in masking threshold as the masked tone's frequency is increased or decreased. As the masker is increased, its masking threshold begins to carry further into the high frequency region while on the low frequency side the threshold's slope stays the same.
It is necessary to find the tonal and non-tonal components of each band to calculate the global masking threshold (see step 7).
Decimation is a procedure that reduces the number of maskers that are considered when calculating the global masking threshold. Tonal and non-tonal components that fall below the threshold in quiet are not used. Also, components within a distance of 0.5 Bark of each other are decimated. The unit "Bark" is a measure of critical-band rate and 0.5 Bark is equal to one half of a critical band. The component with the highest power is kept and the others are not used.
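The decimation rule can be sketched as follows. The conversion from frequency to critical-band rate uses Zwicker's well-known approximation, which reproduces the z values in Table 5 to within rounding; the standard itself reads these values from its tables. The neighbour test is a simplified stand-in for the standard's procedure, and the masker list and the flat threshold in quiet are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker's approximation of critical-band rate z (Bark) at frequency f."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def decimate(maskers, threshold_in_quiet_db, min_distance_bark=0.5):
    """maskers: list of (frequency_hz, level_db).
    Drops maskers below the threshold in quiet, then, among neighbours
    closer than min_distance_bark, keeps only the stronger one."""
    audible = sorted(
        (f, lvl) for f, lvl in maskers if lvl >= threshold_in_quiet_db(f)
    )
    kept = []
    for f, lvl in audible:
        if kept and hz_to_bark(f) - hz_to_bark(kept[-1][0]) < min_distance_bark:
            if lvl > kept[-1][1]:
                kept[-1] = (f, lvl)     # replace the weaker neighbour
        else:
            kept.append((f, lvl))
    return kept

# Hypothetical maskers and a flat 10 dB threshold in quiet, for illustration only.
maskers = [(1000.0, 60.0), (1040.0, 48.0), (3000.0, 5.0), (5000.0, 40.0)]
survivors = decimate(maskers, lambda f: 10.0)
```

Here the 3000 Hz component is dropped because it lies below the threshold in quiet, and the 1040 Hz component is dropped because it sits within 0.5 Bark of the stronger 1000 Hz component.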
The individual masking thresholds are also used to calculate the global masking threshold. Only a small subset of the N/2 spectral components are used; their frequencies are the same as those in the threshold in quiet tables (Table 5). The number of samples in each table depends on the sampling rate and the layer being used (Table 6).
| Sampling Rate | Layer I | Layer II |
| 32 kHz | 108 | 132 |
| 44.1 kHz | 106 | 130 |
| 48 kHz | 102 | 126 |
The individual masking thresholds of each spectral line are calculated for both tonal and non-tonal components. They are a function of:
The upper and lower slopes of the individual masking threshold of each of the tonal and non-tonal maskers along with the threshold in quiet curve determine the global masking threshold LTg(i) at each i'th frequency. This threshold is calculated by adding all of the thresholds in quiet and the powers corresponding to the individual masking thresholds:
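A minimal sketch of that power-domain sum is given below, assuming all levels are expressed in dB and are converted to power before being added. The function names and the numeric values are hypothetical.

```python
import math

def db_to_power(level_db):
    """Convert a level in dB to a linear power value."""
    return 10.0 ** (level_db / 10.0)

def global_threshold_db(threshold_in_quiet_db, individual_thresholds_db):
    """Combine the threshold in quiet with the individual masking thresholds
    that reach one frequency index by summing them in the power domain."""
    total = db_to_power(threshold_in_quiet_db)
    total += sum(db_to_power(t) for t in individual_thresholds_db)
    return 10.0 * math.log10(total)

# Hypothetical values at one frequency index i, in dB.
lt_q = 10.0                       # threshold in quiet
lt_maskers = [30.0, 30.0, 12.0]   # individual masking thresholds from three maskers
lt_g = global_threshold_db(lt_q, lt_maskers)
```

Two equal 30 dB contributions add to roughly 33 dB, while the much weaker terms barely change the result, so the global threshold is dominated by the strongest maskers.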

A minimum masking level, LTmin(n), is computed for each subband. This is calculated from the minimum global masking threshold of all of the frequencies contributing to that subband.
The signal-to-mask ratio (SMR) is computed for every subband n with the following formula:
$$\mathrm{SMR}_{sb}(n) = L_{sb}(n) - LT_{min}(n) \quad [\mathrm{dB}]$$
where Lsb(n) is the sound pressure level for subband n from step 2 and LTmin(n) is the minimum masking threshold for subband n from step 8. As discussed in section 3.3.1 the SMR can then be used to allocate the least number of bits needed to mask the noise in each subband.
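Steps 8 and 9 can be sketched together, taking the SMR as the difference in dB between the two levels just named. The rough bit estimate at the end assumes about 6.02 dB of SNR per bit, which is an approximation and not the standard's table-driven allocation; the numeric values are hypothetical.

```python
import math

def min_masking_level(global_thresholds_db):
    """LTmin(n): the lowest global masking threshold among the frequency
    indices that fall inside subband n."""
    return min(global_thresholds_db)

def signal_to_mask_ratio(l_sb_db, lt_min_db):
    """SMR(n) = Lsb(n) - LTmin(n), all quantities in dB."""
    return l_sb_db - lt_min_db

# Hypothetical subband: its sound pressure level and the thresholds inside it.
l_sb = 62.0
thresholds_in_band = [41.5, 38.0, 44.2]
smr = signal_to_mask_ratio(l_sb, min_masking_level(thresholds_in_band))

# Rough number of bits needed to push quantization noise below the mask.
approx_bits = max(0, math.ceil(smr / 6.02))
```

A subband whose SMR is zero or negative would need no bits under this estimate, since its signal is already at or below the masking threshold.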
Layer III encoding is used for very low bitrates and involves a more complex encoding scheme. The major improvements over the other two layers include:
Also, Layer III generally uses psychoacoustic model 2.
While psychoacoustic model 2 can be used with any layer, it is usually used with Layer III. The standard [19] describes fourteen steps used to calculate the SMR:
In the first step the threshold generator stores and concatenates samples to accurately reconstruct 1024 consecutive samples of the input signal. In the second step the input signal is windowed by a 1024-point Hann window and converted into the frequency domain with a 1024-point FFT. The polar representation of the transform is calculated, giving the magnitude and phase of each spectral line. The magnitude and phase representations are used later to calculate the tonality for every frequency component. A predicted magnitude and a predicted phase are calculated in the third step and used with the magnitude and phase from the second step to calculate an unpredictability measure in the fourth step. In the fifth step the magnitude from step two and the unpredictability measure are used to calculate the energy and unpredictability in the threshold calculation partitions. Each partition provides a resolution of either one FFT line or one-third of a critical band, whichever is wider.
In the sixth step the partitioned energy and unpredictability are convolved with a cochlea spreading function (given in [19] and graphically shown in [7]). The result is used to calculate the tonality index in step seven. This is used in step eight to calculate the SNR for each partition. In steps 9-12 the absolute threshold, thrw, is calculated which is then used in the final step to yield the noise level, npartn, in each partition. The energy level, epartn, in each partition is also calculated using the magnitude from step two. The energy and noise levels are then used to calculate the SMR in the final step:
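In sketch form, assuming the SMR is the ratio of the energy to the noise level expressed in decibels and that both quantities are linear power values, the final step can be written as follows. The function name and the example values are hypothetical.

```python
import math

def smr_db(epart, npart):
    """SMR from an energy level and an allowed noise level (linear power units)."""
    return 10.0 * math.log10(epart / npart)

# Hypothetical (energy, noise) pairs in linear power units.
partitions = [(4000.0, 40.0), (250.0, 25.0), (9.0, 9.0)]
smrs = [smr_db(e, n) for e, n in partitions]   # 20 dB, 10 dB, 0 dB
```

An SMR of 0 dB means the signal energy only just reaches the allowed noise level, so quantization noise at that level would remain masked.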

As can be seen from this brief overview, psychoacoustic model 2 is much more complex and processor intensive than psychoacoustic model 1.
The MPEG standard was written with backward compatibility in mind. Encoding tasks such as estimating the masking thresholds, quantization, and scaling can continuously improve as long as the output stream can be decoded by standard MPEG decoders. In this way future encoding improvements will not affect existing MPEG decoders.
The decoding process is not as computationally intensive as the encoding process. Once the encoded bit stream is input into the decoder the bit allocation and scalefactors are decoded and the samples are requantized. The output is passed through a synthesis subband filter which recreates the PCM samples (Figure 11).
