Chapter 6 - Internet Audio Streaming
The earliest method of receiving audio over the Internet consisted of downloading an audio file and playing it only after the download completed. While this method still exists, a second method, called Internet audio streaming, has since been developed.
Internet audio streaming is a recent but increasingly popular application of digital audio [17]. It is the process of playing an audio stream as it downloads, so the user hears the audio as it arrives rather than waiting for the full download to complete.
The Internet is based on a packet-switched network. This means that it uses a "best-effort" method, or protocol, to send and receive data. The data is grouped into datagrams known as packets. The packets contain header information describing the sender, the receiver, and other relevant information. A host sends packets, and a client receives packets. Both the host and client interface with the Internet through routing devices called gateways, or IP routers. The host to client architecture of the Internet is shown in Figure 6.1.
Figure 6.1 Host to Client Architecture
All encoding and packetizing of data occurs on the host's gateway, whereas decoding occurs on the client's gateway. Data buffers are included in the host computer, host gateway, client computer, and client gateway. These data buffers collect and order packets.
Internet audio streaming is the transfer of audio-encoded packets that are decoded and sent to the client's sound card upon reception. The host side is responsible for encoding and packetizing the audio stream. The client side is responsible for decoding the packets and sending the decoded audio to the sound card. This process is diagrammed in Figure 6.2.
There are delays inherent in the overall system. Contributors include the encode/decode delay, transfer delay, buffer delay, modem delay, sound card delay, and other delays. As long as the total delay is kept constant, the audio will be delivered uninterrupted.
Figure 6.2 Internet Audio Streaming Data Flow [18]
The Internet communication system uses Transmission Control Protocol/Internet Protocol (TCP/IP). This communication system is built in layers as diagrammed in Figure 6.3. The fundamental layer is the bottom physical layer. All other layers are software-based layers built on top of lower layers.
There are two main protocols for real-time audio streaming. RTP (Real-time Transport Protocol) was the first of the two to be issued. RTP uses Internet Protocol (IP) and User Datagram Protocol (UDP). UDP runs on top of IP and sends packets as fast as possible to the client. This method provides high bandwidth and transfer rates, but it is very unreliable and can easily overload the network [17].
Figure 6.3 TCP/IP Layer Structure [17][19]
RTSP (Real-Time Streaming Protocol) is the more recent of the two streaming protocols. For RTSP the average packet size is 50ms of audio data. RTSP minimizes the overhead of multimedia delivery and runs on top of RTP [17].
Both protocols number their packets to make sure that the audio is sequentially correct. Unfortunately, neither protocol provides the two basic requirements of an audio streaming protocol: constant delay and constant bandwidth. These two requirements do not exist in the Internet communication system that uses TCP/IP. In order for audio streaming protocols to exist within the Internet, they must run within the TCP/IP architecture.
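The role of packet numbering can be illustrated with a minimal sketch. The `AudioPacket` layout and the helper names below are hypothetical and are not taken from either protocol specification; the sketch also ignores sequence-number wraparound, which a real implementation would have to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioPacket:
    """Hypothetical packet layout: a sequence number plus encoded audio bytes."""
    seq: int
    payload: bytes


def order_packets(received):
    """Return packets sorted by sequence number, dropping duplicates."""
    unique = {}
    for pkt in received:
        unique.setdefault(pkt.seq, pkt)  # keep the first copy of each number
    return [unique[seq] for seq in sorted(unique)]


def missing_sequence_numbers(ordered):
    """List the sequence numbers absent between the first and last packet."""
    if not ordered:
        return []
    present = {pkt.seq for pkt in ordered}
    first, last = ordered[0].seq, ordered[-1].seq
    return [s for s in range(first, last + 1) if s not in present]


# Packets arrive out of order, with one duplicate and two gaps.
arrived = [AudioPacket(3, b"c"), AudioPacket(1, b"a"),
           AudioPacket(5, b"e"), AudioPacket(3, b"c")]
ordered = order_packets(arrived)
gaps = missing_sequence_numbers(ordered)
```

Numbering restores the playback order and reveals which packets are missing, but it does nothing to make the delay or the bandwidth constant.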
Unfortunately, packets are not guaranteed to reach the client. In normal circumstances any packet that is lost or delayed is re-requested. Since audio streaming works on a continuous basis, lost or delayed packets cannot be re-requested, and, instead, the missing audio data needs to be dealt with in other ways.
Packet loss occurs when overloaded buffers in gateways congest the network [17]. Packet loss and delay are an inevitable fact of the Internet. Audio streaming systems set up a small buffer to collect and organize packets before they are decoded. Packets that do not arrive can be re-requested as long as there is still time left in the delay buffer. Packets that arrive after this buffer time are considered lost. It has been found that packet loss is highly correlated, i.e., the loss of packet n increases the probability of losing packet n+1 [18].
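The buffer-time rule can be sketched as a deadline check. This is only an illustration: the 100ms buffer, the function name `classify_arrivals`, and the arrival times are assumptions chosen for the example, not values from the cited systems.

```python
def classify_arrivals(arrival_ms, packet_ms=50, buffer_ms=100):
    """Label each packet 'on_time' or 'lost' against its playout deadline.

    arrival_ms maps sequence number -> arrival time in ms, or None if the
    packet never arrived. Packet n (counting from 0) is due for playback at
    buffer_ms + n * packet_ms; buffer_ms is an assumed buffer length.
    """
    status = {}
    for seq, arrived in sorted(arrival_ms.items()):
        deadline = buffer_ms + seq * packet_ms
        if arrived is not None and arrived <= deadline:
            status[seq] = "on_time"
        else:
            # Never arrived, or arrived after its playout time.
            status[seq] = "lost"
    return status


# Hypothetical arrival times: packet 2 never arrives, packet 3 arrives late.
arrivals = {0: 20, 1: 90, 2: None, 3: 260, 4: 280}
status = classify_arrivals(arrivals)
```

A longer buffer rescues more late packets at the cost of a larger constant delay before playback starts.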
Lost packets can be dealt with in different ways. The system can mute the output or repeat the last packet. If missing packets are small and surrounded by valid packets, then the missing audio can be interpolated. Forward Error Correction (FEC) could also be used. This method sends redundant audio data in other packets so that a lost packet can be reconstructed, at the cost of increased bandwidth. Other methods of dealing with packet loss include replacing the missing samples with noise to emulate "bad reception," or interleaving packets during the encoding process and then interpolating missing samples after de-interleaving. The interleaving method reduces the size of each missing audio segment by spreading audio data over several packets.
The proposed algorithm would be implemented in the decoding portion of the real-time streaming system. It would run on the client side. Every incoming packet would be relayed to the extrapolation system after the decoding process. The extrapolation system would extrapolate future packets. If a given packet were missing, the streaming system would switch from the real packet data to the output of the extrapolation system. Once valid packets resume, a cross-fade would be used to transition smoothly from the estimated audio output to the real audio output.
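The effect of interleaving can be shown with a small sketch. The sample-level interleaving scheme, the depth of 4, and the neighbor-averaging interpolation are assumptions made for illustration; actual streaming systems may interleave larger units.

```python
def interleave(samples, depth):
    """Split samples into `depth` packets; packet k carries samples k, k+depth, ..."""
    return [samples[k::depth] for k in range(depth)]


def deinterleave(packets, depth, total):
    """Rebuild the sample sequence; samples from a lost packet (None) stay None."""
    out = [None] * total
    for k, pkt in enumerate(packets):
        if pkt is not None:
            out[k::depth] = pkt
    return out


def interpolate_isolated(samples):
    """Fill each isolated missing sample with the mean of its two neighbors."""
    filled = list(samples)
    for i, value in enumerate(samples):
        if value is None and 0 < i < len(samples) - 1:
            left, right = samples[i - 1], samples[i + 1]
            if left is not None and right is not None:
                filled[i] = (left + right) / 2
    return filled


samples = list(range(0, 24, 2))           # a simple ramp: 0, 2, 4, ..., 22
packets = interleave(samples, depth=4)
packets[1] = None                          # one whole packet is lost
received = deinterleave(packets, depth=4, total=len(samples))
restored = interpolate_isolated(received)
```

Without interleaving, the lost packet would remove three consecutive samples. With interleaving, the same loss leaves three isolated single-sample gaps, each surrounded by valid data and therefore easy to interpolate.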
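A cross-fade of this kind might look like the sketch below. The linear ramp is an assumption; the text does not specify the shape of the fade, and the function name `crossfade` is hypothetical.

```python
def crossfade(estimated, real):
    """Linearly fade from the estimated packet to the real packet.

    Both inputs are equal-length sequences of samples covering the same
    time span. The output starts as pure estimated audio and ends as pure
    real audio.
    """
    if len(estimated) != len(real):
        raise ValueError("packets must have the same length")
    n = len(real)
    if n == 1:
        return [float(real[0])]
    out = []
    for i in range(n):
        w = i / (n - 1)  # weight of the real audio, rising from 0.0 to 1.0
        out.append((1.0 - w) * estimated[i] + w * real[i])
    return out


# Fading from a constant 1.0 (estimated) to a constant 0.0 (real).
faded = crossfade([1.0] * 5, [0.0] * 5)
```

The fade avoids an abrupt discontinuity at the point where the extrapolated waveform and the received waveform disagree.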
A block diagram of an audio streaming system using the proposed extrapolation system is shown in Figure 6.4. Incoming packets are decoded, buffered, and sent to the extrapolation system. The decision process then outputs incoming packets if received or extrapolated packets if not received.
Figure 6.4 Block diagram of extrapolation system in audio streaming application
An example of an input packet buffer vs. an output packet buffer is displayed in Figure 6.5. The second and third packets in the input buffer are missing in this example. These packets are replaced by the output of the extrapolation system. Although not shown, all input packets are relayed to the extrapolation system. The extrapolation system produces extrapolated packets based on the input packets. Since the fourth input packet is valid, it is cross-faded with the next extrapolated packet. The fifth input packet is also valid and thus sent to the output buffer.
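The decision process in this example can be sketched as a small state machine. The function name `select_output` and the string labels are hypothetical; the sketch only decides where each output packet comes from and does not perform the extrapolation itself.

```python
def select_output(received):
    """Choose the source of each output packet.

    `received` holds one boolean per packet slot, True when the real
    packet arrived. Returns one label per slot:
      "input"        - play the real packet
      "extrapolated" - play the extrapolation system's output
      "crossfade"    - first valid packet after a gap, blended with the
                       extrapolated packet for the same slot
    """
    decisions = []
    in_gap = False
    for ok in received:
        if not ok:
            decisions.append("extrapolated")
            in_gap = True
        elif in_gap:
            decisions.append("crossfade")
            in_gap = False
        else:
            decisions.append("input")
    return decisions


# The case described above: packets 2 and 3 are missing.
example = [True, False, False, True, True]
decisions = select_output(example)
```

For the five packets described above, the sketch yields one real packet, two extrapolated packets, one cross-faded packet, and one real packet, in that order.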
Figure 6.5 Example of Input vs. Output
Since RTSP packets carry 50ms of audio data, a packet can contain from about 551 samples of 11.025kHz-sampled music to 2205 samples of 44.1kHz-sampled music. If an input size of 8192 samples were processed through the extrapolation system, where the first 4096 samples were known valid data, then 4096 samples could be extrapolated. This corresponds to about 93ms of 44.1kHz-sampled music (just under 2 packets) or about 372ms of 11.025kHz-sampled music (7 whole packets, or 350ms). An additional time delay would be introduced by the extrapolation system, but as long as that delay was constant, it would simply add to the delays already inherent in the Internet streaming process and still deliver an uninterrupted audio stream.
This algorithm is well suited for this application because the digital audio data is pre-segmented. The extrapolation system could be configured to extrapolate one segment (packet) or multiple segments (packets). Since there are already multiple delays inherent in Internet audio streaming, adding another delay would not compromise the overall system. The expected quality of audio streaming is presently very low; therefore, poor estimation in extrapolated audio is less likely to be noticed than it would be in a high-resolution transmission of audio data. On the other hand, the extrapolated data is likely to sound better to the listener than other forms of replacement for missing audio packets. Since large audio gaps cannot be corrected by interpolation, especially if only preceding data is known, the gaps would currently be filled with either silence or noise.
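The packet-size arithmetic can be checked with a short calculation. The function names are hypothetical; the 50ms packet duration and the 4096-sample extrapolation length are the figures given in the text.

```python
PACKET_MS = 50        # packet duration stated in the text
EXTRAPOLATED = 4096   # samples extrapolated from 4096 known samples


def packet_samples(rate_hz, packet_ms=PACKET_MS):
    """Number of samples in one packet at the given sampling rate."""
    return rate_hz * packet_ms / 1000


def extrapolation_span(rate_hz, n_samples=EXTRAPOLATED):
    """Return (duration in ms, number of packets) covered by n_samples."""
    duration_ms = 1000 * n_samples / rate_hz
    packets = n_samples / packet_samples(rate_hz)
    return duration_ms, packets


for rate in (11025, 44100):
    duration_ms, packets = extrapolation_span(rate)
    print(f"{rate} Hz: {packet_samples(rate):.2f} samples/packet, "
          f"{duration_ms:.1f} ms extrapolated, {packets:.2f} packets")
```

At 44.1kHz the 4096 extrapolated samples cover slightly less than two packets, and at 11.025kHz they cover a little more than seven.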
Since silence and noise are the most plausible solutions to fill large gaps during Internet audio streaming, these are compared to extrapolation results in a listening test. Currently, the average sampling rate of audio signals streamed over the Internet is 11.025kHz. Therefore, three musical signals sampled at 11.025kHz are used in a listening test. Each signal contains about 5 seconds of audio data. The first musical signal, Woods, is from the classical genre. It is an audio clip from George Duke's Muir Woods Suite Phase 4. The second signal, Woodjazz, is from the jazz genre. It is an audio clip from George Duke's Muir Woods Suite Phase 5. The third musical signal, Knife, is from the rock genre. It is an audio clip from Emerson, Lake & Palmer's Knife-Edge.
A set of 21 test signals is created for each music example. Each signal has a gap of varying length. The following gap lengths are used in the listening test: 10ms, 20ms, 40ms, 80ms, 160ms, 320ms, and 640ms. Each set of signals is comprised of three signals of each gap length. The gap in the first of each three signals is filled with silence. The gap in the second of each three signals is filled with white noise of amplitude equal to the average amplitude of the signal. The gap in the third of each three signals is filled with the output of the extrapolation system. The input to the extrapolation system is the data vector of equal length preceding the gap. The overlap percentage is set at 6.25%, and the block length is ¼ the number of samples in the gap. This process is repeated for all the gap sizes and all the music examples.
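The construction of the silence-filled and noise-filled test signals can be sketched as follows. Several details are assumptions: "average amplitude" is taken to be the mean absolute sample value, the noise is uniform with that value as its peak, and gap and block lengths are rounded to whole samples. The function names are hypothetical, and the extrapolation fill is not shown.

```python
import random

GAP_LENGTHS_MS = [10, 20, 40, 80, 160, 320, 640]


def gap_samples(gap_ms, rate_hz=11025):
    """Gap length in whole samples (rounding is an assumption)."""
    return round(rate_hz * gap_ms / 1000)


def block_length(gap_ms, rate_hz=11025):
    """Block length of 1/4 the gap, rounded down (an assumption)."""
    return gap_samples(gap_ms, rate_hz) // 4


def fill_gap(signal, start, length, method, rng=None):
    """Return a copy of `signal` with [start, start+length) replaced."""
    if method == "silence":
        fill = [0.0] * length
    elif method == "noise":
        rng = rng or random.Random(0)
        # Assumed reading of "average amplitude": mean absolute value.
        level = sum(abs(x) for x in signal) / len(signal)
        fill = [rng.uniform(-level, level) for _ in range(length)]
    else:
        raise ValueError("method must be 'silence' or 'noise'")
    out = list(signal)
    out[start:start + length] = fill
    return out


# A stand-in signal; the real test used 5-second music clips.
signal = [0.5, -0.5] * 500
n = gap_samples(10)
silent = fill_gap(signal, 200, n, "silence")
noisy = fill_gap(signal, 200, n, "noise")
```

Seven gap lengths with three fill methods each give the 21 test signals per music example.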
The 40ms extrapolation of each musical example is shown in the following plots as a representative example. Figure 6.6 displays the 40ms extrapolation of the classical example, Woods. Figure 6.7 displays the 40ms extrapolation of the jazz example, Woodjazz. Figure 6.8 displays the 40ms extrapolation of the rock example, Knife.
Listening tests were performed using 66 audio examples: the 63 altered signals with filled gaps and the three original examples. Twelve college students volunteered as test subjects. The subjects listened to the 66 examples through headphones in random order. After listening to each example, the test subject evaluated how the filled gap was perceived. This evaluation was based on the Mean Opinion Score (MOS) rating system. The MOS system is displayed in Table 6.1.

Figure 6.6 Forty millisecond extrapolation of classical (Woods) example

Figure 6.7 Forty millisecond extrapolation of jazz (Woodjazz) example

Figure 6.8 Forty millisecond extrapolation of rock (Knife) example
Rating | Quality of the gap | Level of Distortion
5 | Excellent | Imperceptible
4 | Good | Just perceptible but not annoying
3 | Fair | Perceptible and slightly annoying
2 | Poor | Annoying but not objectionable
1 | Unsatisfactory | Very annoying and objectionable
Table 6.1 Mean Opinion Score Five-Point Scale
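A mean opinion score is the average of the individual ratings given on this scale. The sketch below shows the calculation; the twelve ratings are made-up illustrative values, not data from the listening test, and the function names are hypothetical.

```python
from statistics import mean

MOS_SCALE = {5: "Excellent", 4: "Good", 3: "Fair",
             2: "Poor", 1: "Unsatisfactory"}


def mean_opinion_score(ratings):
    """Average a list of integer ratings on the five-point scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in MOS_SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


def nearest_label(score):
    """Quality label of the scale point closest to the score."""
    return MOS_SCALE[min(MOS_SCALE, key=lambda k: abs(k - score))]


# Made-up ratings from twelve hypothetical listeners for one test signal.
ratings = [5, 4, 4, 5, 4, 3, 4, 5, 4, 4, 5, 4]
score = mean_opinion_score(ratings)
label = nearest_label(score)
```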
The results of the listening tests are plotted in the following figures. Figure 6.9 displays the classical example, Woods. Figure 6.10 displays the jazz example, Woodjazz. Figure 6.11 displays the rock example, Knife. In all three examples, the extrapolation-filled signals were rated far higher than the silence-filled and noise-filled signals. The extrapolation-filled signals were evaluated with a quality between Good and Excellent for all gap sizes except the 640ms gap size. The extrapolation-filled signals were evaluated with a quality of Fair for the 640ms gap length, whereas both the silence-filled and noise-filled signals were evaluated with a quality of Unsatisfactory for the 640ms gap length. The noise-filled signals received the lowest ratings for all gap lengths and musical signals, except for the smaller gap lengths in the Knife example, where they were rated higher than silence but still lower than extrapolation.

Figure 6.9 MOS Results for Classical Example

Figure 6.10 MOS Results for Jazz Example

Figure 6.11 MOS Results for Rock Example
As the gap lengths increase, the quality of each filled gap decreases. The extrapolation-filled signals have a shallower slope than both the silence-filled and noise-filled signals. The silence-filled and noise-filled signals generally have a quality rating of Good for 10ms gaps and a quality rating of Unsatisfactory for 640ms gaps, whereas extrapolation-filled signals generally have a quality rating of Excellent for 10ms gaps and a quality rating of Fair/Good for 640ms gaps. The extrapolation-filled signals therefore have only about half the degradation slope of the noise-filled and silence-filled signals.
This listening test has demonstrated the auditory benefit of using the proposed extrapolation system over two common forms of gap replacement: silence and noise. The extrapolation system has more computational requirements than a silence or noise gap-filling system, but the auditory benefit is substantial. Since frequency-domain blocking reduces computational requirements, this system can be realized on the average client computer.
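The "about half" figure can be reproduced from the scale values alone. The sketch below assumes nominal scores taken from the rating labels (Good = 4, Unsatisfactory = 1, Excellent = 5, Fair/Good = 3.5) rather than the measured MOS values, and it measures slope per doubling of gap length because the tested gaps double from 10ms to 640ms. The function name is hypothetical.

```python
import math


def slope_per_doubling(score_short, score_long,
                       gap_short_ms=10, gap_long_ms=640):
    """Change in MOS per doubling of gap length between two gap sizes."""
    doublings = math.log2(gap_long_ms / gap_short_ms)
    return (score_long - score_short) / doublings


# Nominal scores read off the rating labels (assumed, not measured).
baseline = slope_per_doubling(4.0, 1.0)        # silence or noise fill
extrapolation = slope_per_doubling(5.0, 3.5)   # extrapolation fill
ratio = extrapolation / baseline
```

Under these assumptions the silence-filled and noise-filled signals lose half a rating point per doubling of gap length, while the extrapolation-filled signals lose a quarter of a point.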