Chapter 5

COMPUTATIONAL COMPLEXITY

The goal of this chapter is to examine the computational complexity of the convolution methods from Chapter Four in greater detail. Specifically, it will show that the well-known block convolution methods and the running convolution from Vetterli (1988) are the most efficient. Furthermore, the multirate method that introduces distortion in the transition band is significantly less computationally demanding than direct time-domain convolution and compares favorably with the biorthonormal method when non-trivial filters are used. Since this multirate variation produces distortion in the audible frequency range, the methodology and results of qualitative listening tests will be reported at the end of the chapter.

Computational Complexity for the Convolution Methods

Before each of the methods is examined in detail, it should be pointed out that all the implementations will have some common processing that must take place prior to a running or streaming convolution. Specifically, the impulse response must be loaded in to the architecture in a form most directly associated with the algorithm. For example, a frequency domain overlap-save implementation (such as the one shown in Figure 6) might first zero-pad the impulse response to a pre-determined length and compute the FFT of the resulting response. The frequency-domain coefficients are then used in each

block convolution. Also, most implementations will scale the impulse coefficients and the input to avoid clipping or overflow. Alternatively, floating point implementations may not scale the impulse and input, but will probably scale the output so that its numerical range aligns with that of the original signals. While these computations may affect the initial startup input-to-output latency, they are common to all the methods and do not affect the convolution computation itself.

Most of the algorithms utilize the FFT and IFFT. A typical bit-reversed radix-2 FFT of a complex N-point signal requires (N/2)× log2(N) complex multiplications and N× log2(N) complex additions (Lindquist 1989, p. 112). Although higher radix and mixed radix methods have been proven more efficient (Manolakis and Proakis 1992, p. 714-732), the bit-reversed radix-2 method provides a practical and familiar baseline. Also, when the input is real, the N-point FFT can actually be calculated with a complex (N/2)-point FFT plus N/2 extra complex multiplies and N/2 complex additions (Manolakis and Proakis 1992, p. 715). Hence, the computation of a real N-point FFT or IFFT requires

(79)

complex multiplications and

(80)

complex additions per sample. Noting that a complex multiplication requires four real multiplications and two real additions and that a complex addition requires two real additions, Equations 79 and 80 equivalently represent

(81)

(82)

real multiplications and additions per sample, respectively.

Time-Domain Convolution

Direct form convolution performed on the time-domain signals is performed with the FIR architecture shown in Figure 1. Preparatory calculations might include scaling the impulse response and loading it into the delay line. For an impulse response of length Nh, each output sample requires Nh real multiplications and Nh - 1 real additions. A practical implementation will likely include input scaling to match impulse scaling , adding one more real multiplication per sample. Clearly, an architecture with a single multiply/accumulate structure would be advantageous. In the event that the impulse is symmetric or antisymmetric (i.e. has linear phase) the number of multiplications can be reduced by 50%, but this is generally not the case. So the direct form requires Nh real multiplications and Nh - 1 real additions for each output sample.

Frequency-Domain Convolution

For the frequency domain convolution of an impulse of length Nh and a known input of length Nx preparatory computation involves zero-padding and scaling for both signals. The convolution computation itself requires an FFT with resolution Nx + Nh - 1 plus Nx + Nh - 1 complex multiplications plus an IFFT with resolution Nx + Nh - 1. This produces Nx + Nh -1 output samples. Hence, the frequency domain convolution requires

(83)

(84)

complex multiplications and additions per sample or

(85)

(86)

real multiplications and additions, per sample, respectively.

Block Convolution

Although the overlap-add and overlap-save block convolution methods summarized in Chapter 2 are both very efficient, recall that the overlap-add method requires the linear convolution of input blocks and additions of the overlap components at the output. The overlap-save method lends itself to more efficient use of the FFT since the input blocks are overlapped and circularly convolved. Also, no extra additions are required at the output since the circular artifacts are simply discarded. So, in a manner of speaking, the overlap-save method is more efficient in terms of the rote number of additions and multiplications. Preparatory calculations for both algorithms include scaling and zero-padding the impulse and then computing its FFT.

Given an impulse response of length N, the overlap-save algorithm shown in Figure 5 may require that every N samples received are scaled. Then a 2× N-point FFT is computed. 2× N complex multiplications take care of the convolution. Then the 2× N-point IFFT produces the N desired output points concatenated with N circular artifacts. So the overlap-save method shown in Figure 5 requires

(87)

(88)

complex multiplications and additions or

(89)

(90)

real multiplications and additions, respectively, per sample.

Multirate Convolution with Biorthonormal Filter Banks

Clearly, any multirate approach to convolution involves greater implementation complexity and preparatory computation due to the presence of the additional filters, decimators, and expanders. Although the multirate filter operations are minor compared to the convolution operations, they will increase the computational complexity and inherent delay of the algorithm. Therefore, the savings provided by the multirate convolution itself must overcome the computational overhead required by the multirate implementation in order for a given algorithm to be more efficient. For the case of biorthonormal convolution, a possible architecture is illustrated in Figure 30.

 

Figure 30. Block diagram of a biorthonormal convolver.

 

Disregarding the processing required on the impulse, and realizing that any scaling operation can be incorporated in the analysis filters, this scheme still requires four block convolutions with impulse responses of half the length plus four filtering operations for each block of input. Assuming that the analysis filters are very simple, as shown in Equation 64 and Table 1, their contributions are negligible. Similarly, the decimation and expansion operations and the multiplexed output do not significantly increase the computational complexity. Qualitatively speaking, the biorthonormal scheme replaces a single convolution block with four blocks of half the length running at half the speed. So complexity of the biorthonormal method should be about the same as that of the single channel method. For an input block and impulse of length N, the principle computations are the four block convolutions half the length of the input, or

(91)

(92)

complex multiplications and additions per sample. This appears to be nearly twice the computation required for a single block convolution. However, these convolutions run at half the sampling rate of the single channel convolution (due to the decimation by two). In other words, the operations per sample per unit time are reduced by a factor of two. So the comparable computation involves

(93)

(94)

complex multiplications and additions or

(95)

(96)

real multiplications and additions per sample, respectively. Although delayed by one sample, the biorthonormal convolver incorporating the very simple filters of Equation 64 exhibits a complexity nearly identical to that of the single channel block convolution.

Multirate Running Convolution

The running convolution method proposed by Vetterli (1988) and shown in Figure 23 is similar to the biorthonormal case in that its complexity is predominantly governed by the three block convolutions of half the length of the input running at half the sampling rate. Also, from Figure 23, the input filter bank requires one addition per two input samples (due to the subsequent decimation) and the output filter bank requires three additions per two output samples (due to the preceding expansion) (Vetterli 1988). So the running convolution requires

(97)

(98)

complex multiplications and additions or

(99)

(100)

real multiplications and additions per sample. This method exhibits a 25% improvement in computational complexity over the single rate block convolution.

Multirate Convolution with Cross-Convolved Channels

The cross-channel method illustrated in Figure 24 and Table 3 is again similar to the biorthonormal case in that simple analysis and synthesis filters such as those in Equation 64 add negligible complexity to the algorithm. Except for the extra additions at the output, the complexity is the same as for the biorthonormal case, requiring

(101)

(102)

complex multiplications and additions or

(103)

(104)

real multiplications and additions. In this case, the output is delayed by two.

Multirate Convolution without Cross-Convolved Channels

For the multirate method without cross-convolved channels, the contribution of the analysis and synthesis banks is no longer negligible since their order and selectivity directly relates to the amount of aliasing introduced in the transition band around p/2. The polyphase representation and the noble identities discussed in Chapter Three help to optimize the implementation. If the optimized IIR filters developed by Ekanayake and Premaratne (1995) are used, then Equations 76 and 77 define the relationship between the filters in the filter bank. From Equations 26 and 32, the polyphase components, E0 and E1, are related to the analysis and synthesis filters A0, A1, S0, and S1 by

(105)

(106)

(107)

(108)

Similar polyphase components can be devised for the multirate bank used with the impulse response. In this case, the same filters are used on the impulse response. With the help of the noble identities (Figure 10), the multirate convolution implementation can be realized as shown in Figure 31. The synthesis bank remains unchanged because the cascade of the two synthesis filters superfluously complicates the polyphase realization.

 

Figure 31. Block diagram of the implementation for multirate convolution without cross convolved channels.

 

The advantage to the polyphase structure is that (a) each analysis filter order is halved and (b) the analysis filters operate at half the original sampling rate.

For an input block of length N, the two block convolutions (of length N/2 running at half speed) require

(109)

(110)

complex multiplications and additions or

(111)

(112)

real multiplications and additions. In order to retain any computational reduction and outperform the single channel method, all the filtering processes must contribute no more than this number of operations. For non-trivial filters, this means that they must be implemented in a comparably efficient architecture. For example, FIR analysis and synthesis filters can be implemented with the overlap-save method. Given an analysis filter of order M, filtering the N point signal requires

(113)

(114)

complex multiplications and additions. Realizing that the polyphase analysis bank in Figure 30 operates at half the sampling rate, that each analysis bank has order M/2, and that each (decimated) sample further requires two (real) additions and one (real) multiplication, the entire multirate filtering structure introduces

(115)

(116)

complex multiplications and additions per sample or

(117)

(118)

real multiplications and additions per sample due to the two analysis filters, additional synthesis operations, and the four synthesis filters. In order to outperform the single channel method, these operations must add up to less than those defined in Equations 111 and 112. In other words,

(119)

Even for M = 8, N ³ 3.4× 1016 ! So the computation imposed by the multirate filter banks overwhelms the savings gained in the decimated convolutions. Momentarily disregarding the polyphase analysis structure, if the M-order filters are FIR and relatively small, they can be implemented directly (M real multiplies and M - 1 real additions per sample). Then efficiency is maintained when

(120)

In this case, lower order filters are somewhat more effective. If M = 8, then N > 7.5× 104. Still, this would require a 1.7 second impulse sampled at 44.1 kHz just to break even with the single channel method. The overlap-save filter implementation catches up to the direct form around M = 33.

On the other hand, if the subband convolutions are performed using a direct time-domain method, then even relatively large filter contributions are insignificant. Now each subband convolution runs at approximately (N/4) multiplications and (N/4) additions per sample (half the computations required for single-channel direct form convolution). So, even when implemented in direct canonical form, (M multiplies and M - 1 additions) the condition

(121)

is easily met. If M = 128, for example, N > 3060, which would be the length of a 70 ms impulse response sampled at 44.1 kHz. Indeed, whenever the convolution calculation itself is proportional to N operations per sample, the multirate approach can provide greater efficiency.

Comparison of Convolution Methods

Figure 32 illustrates the results of the above derivations for computational complexity for the various filters. From Figure 32a, it is evident that the overhead incurred by the multirate structure is negligible compared to overall savings in the time domain case. Figure 32b demonstrates that the multirate variation results in no computational savings over the single-channel overlap-save method while Vetterli’s algorithm shows the expected 25% reduction. Also evident is the crossover in efficiency between the FIR and overlap-save implementation of the multirate filters themselves. That is, somewhere between M = 16 and M = 64, the overlap-save method becomes more efficient than the FIR method (around M = 33).

(a)

(b)

Figure 32. Graphical comparison of computational complexity for the various convolution methods, (a) time-domain convolutions, (b) frequency domain convolutions (M is multirate filter order, FIR means direct implementation, O-S means overlap-save implementation).

Next...Previous...TOC...Home