This chapter describes the mathematical process of discrete linear convolution and well-known techniques used to reduce either the number of calculations or the significant input to output latency inherent to the process.
Definition of Convolution
Convolution is the mathematical process that relates the output, y(t), of a linear, time-invariant system to its input, x(t), and impulse response, h(t). That is,
where "*" represents the commutative convolution operation. The continuous time relationship is described by the familiar convolution integral.
For infinite discrete sequences, designated by x(n), y(n), and h(n), this relationship reduces to another familiar expression, the convolution sum.
For two finite discrete sequences of length Nx and Nh, the linear or aperiodic convolution sum takes on a slightly different form.
where h(k) and x(n-k) are zero outside their appropriately defined intervals. For Nx > Nh, each summation need only be calculated for the 0 ≤ k ≤ Nh-1 terms. The output, y(n), will have length Nx+Nh-1.
Convolution processing is notoriously computationally intensive. Direct application of Equation 4 in a tapped delay line architecture requires on the order of 2×Nh² operations (Nh multiplications and additions for each output sample) to produce an output Nh samples long. For typical audio signals, a three second impulse response sampled at 44.1 kHz requires around 2×(3×44,100)² or 35 billion computations to convolve with the same length input.
Processing in the frequency domain introduces significant input to output latency, since the input must first be buffered and transformed into the frequency domain. Likewise, the output must be inversely transformed back into the time domain. The buffering imposes a minimum 2×Nx sample latency. The frequency transformation itself and the convolution process introduce further overhead.
Other convolution methods aim to reduce the number of required computations and reduce the amount of input to output latency without altering the expected output. Some of these techniques include block convolution and hybrid time/frequency processing.
Direct convolution, as mentioned above, can be implemented with the tapped delay line architecture common to finite impulse response (FIR) filtering. Essentially, convolution is a filtering process. Or, more appropriately, filtering is an application of convolution. Expanding Equation 4, the expression becomes
This difference equation lends itself to the tapped delay line architecture shown in Figure 1.
Figure 1. Tapped delay line architecture implementing direct time domain convolution
As mentioned earlier, this solution requires Nh multiplications and Nh additions for each output sample. Although current signal processing and computing technology makes this a feasible approach for short impulse responses, its sheer lack of efficiency makes this method undesirable.
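The tapped delay line computation can be sketched in a few lines. This is an illustrative implementation (Python with NumPy is assumed; the function name is not from the source), directly realizing the convolution sum of Equation 4:

```python
import numpy as np

def direct_convolution(x, h):
    """Direct time-domain convolution (tapped delay line / FIR form).

    Each output sample is y[n] = sum_k h[k] * x[n-k], costing Nh
    multiplications and Nh additions per output sample.
    """
    Nx, Nh = len(x), len(h)
    y = np.zeros(Nx + Nh - 1)          # output length is Nx + Nh - 1
    for n in range(len(y)):
        for k in range(Nh):
            if 0 <= n - k < Nx:        # x(n-k) is zero outside its interval
                y[n] += h[k] * x[n - k]
    return y
```

The nested loop makes the O(Nh) cost per output sample explicit; in practice this inner product would be vectorized or delegated to a library routine.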
For the discrete case, multiplication in the frequency domain translates to circular convolution in the time domain (Rabiner & Gold, 1975). Figure 2 demonstrates the difference between linear and circular convolution. The length of a circular convolution is equal to the length of the longest sequence. Note how trailing samples that were left alone in linear convolution are now wrapped around, resulting in the shorter, different result. Manolakis and Proakis (1992, pp. 689-695) present a more rigorous discussion on the Discrete Fourier Transform (DFT) and circular convolution.
Figure 2. An example of linear vs. circular convolution for two short sequences
In light of this difference,
where m is the discrete frequency variable, yields the desired linear convolution result only if x(n) and h(n) are padded with zeros prior to the DFT such that their respective lengths are Nx+Nh-1, essentially zeroing out all circular artifacts. Therefore, both the input sequence and the impulse response must be finite and defined. This has been termed "fast" convolution because the DFT is performed with the Fast Fourier Transform (FFT) and its inverse (IFFT) (Rabiner & Gold, 1975). Unfortunately, this method precludes the implementation of a dynamic or real-time input without special consideration for the lengths of the signals. The block diagram in Figure 3 outlines a direct frequency domain method.
Figure 3. Block diagram of frequency domain convolution ("fast" convolution)
This method is much more efficient than direct time-domain convolution. For a given input, the number of operations required is around 2(Nx+Nh-1)×log2(Nx+Nh-1) operations for the FFTs, 2×(Nx+Nh-1) operations for the complex multiplication of X(m) and H(m), and another (Nx+Nh-1)×log2(Nx+Nh-1) operations for the IFFT. For a three second input this results in about 17 million calculations, less than one thousandth of the operations required for direct time-domain convolution.
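A minimal sketch of this "fast" convolution, assuming Python with NumPy (the function name is illustrative): both sequences are zero-padded to length Nx+Nh-1 so that the circular convolution implied by the DFT equals the linear result.

```python
import numpy as np

def fast_convolution(x, h):
    """'Fast' (frequency domain) convolution via the FFT.

    Zero-padding both sequences to Nx + Nh - 1 removes the circular
    wrap-around, so the inverse transform yields the linear convolution.
    """
    N = len(x) + len(h) - 1
    X = np.fft.rfft(x, N)      # the second argument zero-pads to length N
    H = np.fft.rfft(h, N)
    return np.fft.irfft(X * H, N)
```

Note that both x(n) and h(n) must be fully known before the transforms are taken, which is exactly the limitation discussed next.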
However, "fast" convolution is limited to cases where x(n) is finite and completely defined. Due to the required length adjustments, it is also impractical when x(n) is much longer than h(n). Alternative methods allow the convolution to be performed in consecutive sections, allowing very long or indeterminate signals to be convolved with the desired impulse response without prohibitive latency. These methods are called sectioned or block convolutions.
Overlap-Add Block Convolution
Consider the example shown in Figure 4. The linear convolution of x(n) (Figure 4a) and h(n) (Figure 4b) is shown in Figure 4c. Figures 4d-f illustrate consecutive linear convolutions of blocks of x(n) with h(n). Each block convolution has the expected length Nxi+Nh-1. In this case, the three block convolutions have a combined length of 3×(Nxi+Nh-1) = 15, whereas the initial convolution has length Nx+Nh-1 = 11. By inspection, it is reasonable to assume that the overlap between the sectioned convolutions should be (15-11)/2 = 2 samples, as shown in Figure 4g. This corresponds to the value of Nh-1. When the overlapping samples are summed, the resulting sequence (Figure 4h) equals the initial convolution result shown in Figure 4c. This method is called overlap-add block convolution.
Figure 4. Block convolution using the overlap-add method. (a) input x(n), (b) impulse response h(n), (c) expected output y(n), (d) output y1(n) for block convolution of x1(n) and h(n), (e) output y2(n), (f) output y3(n), (g) shifted block outputs, overlap is Nh-1=2, and (h) the sum of overlapped block outputs equivalent to the direct convolution result.
The input blocks need not be precisely Nh samples long, but it is generally a good idea to keep Nxi on the order of Nh to avoid unnecessarily long block convolutions. Regardless, the overlap between block outputs must remain Nh-1. Mathematically, x(n) and y(n) can be represented as
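The overlap-add procedure can be sketched as follows (an illustrative Python/NumPy implementation; `np.convolve` stands in for the per-block linear convolution, which could itself use the FFT method above):

```python
import numpy as np

def overlap_add(x, h, block_len):
    """Overlap-add block convolution.

    Each input block is linearly convolved with h, producing
    Nxi + Nh - 1 samples; consecutive block outputs overlap by
    Nh - 1 samples and are summed into the output.
    """
    Nh = len(h)
    y = np.zeros(len(x) + Nh - 1)
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        yi = np.convolve(block, h)        # length len(block) + Nh - 1
        y[start:start + len(yi)] += yi    # overlapping tails add together
    return y
```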
Overlap-Save Block Convolution
Unlike the overlap-add method, the overlap-save method requires that the input blocks overlap. Then the input blocks are circularly convolved with the impulse response. Because of the overlap redundancy at the input, the circular artifacts in the output (the first Nh-1 samples) can simply be discarded. Figure 5 illustrates the overlap-save method.
Figure 5. Block convolution using the overlap-save method. (a) input signal x(n) divided into overlapping sections, overlap is Nh-1=2, (b) impulse response h(n), (c) output y(n) using direct convolution, (d) output y1(n) for block circular convolution of x1(n) and h(n), (e) output y2(n), (f) output y3(n), (g) output y4(n), and (h) sequential concatenation of block outputs after discarding the first two samples of each block, which is equivalent to the direct convolution result. "|" represents concatenation.
Since this method uses circular convolution, it lends itself to use of the FFT when calculating each block. All that is required is zero-padding the impulse response (to the right) to length Nblock so that the FFT of h(n), H(m), will have the proper length for multiplication with Xi(m). Since h(n) and Nblock are known in advance, H(m) can simply be calculated and stored in memory before processing begins.
Again there is no size constraint on the input blocks as long as they overlap by at least Nh-1 samples. However, noting that each block requires an approximately Nblock×log2(Nblock) operation FFT, Nblock complex multiplies, and an Nblock×log2(Nblock) operation IFFT, it is a good idea to keep Nblock on the order of Nh. Mathematically, symbolic representations of x(n) and y(n) are rather cumbersome, but can be expressed as
where xi(n) = x(n) for [i×Nblock - (i+1)×(Nh-1) ≤ n ≤ (i+1)×Nblock - (i+1)×(Nh-1) - 1] and xi(n) = 0 otherwise.
Also, xi-1(n) = xi(n) for [i×Nblock - (i+1)×(Nh-1) ≤ n ≤ i×Nblock - i×(Nh-1) - 1], the Nh-1 samples shared by adjacent blocks.
where (*) denotes circular convolution, and finally,
where " | " denotes concatenation and m indexes the last Nblock - (Nh-1) samples of each block.
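In code, the FFT-based overlap-save procedure described above reads more simply than the symbolic form (an illustrative Python/NumPy sketch; the zero-prepended input follows the indexing convention used in the equations):

```python
import numpy as np

def overlap_save(x, h, block_len):
    """Overlap-save block convolution via the FFT.

    Input blocks of length block_len overlap by Nh - 1 samples; each
    block is circularly convolved with h (by multiplication with the
    precomputed H(m)), and the first Nh - 1 samples of each block
    output, the circular artifacts, are discarded.
    """
    Nh = len(h)
    step = block_len - (Nh - 1)           # new output samples per block
    H = np.fft.rfft(h, block_len)         # precomputed, zero-padded to Nblock
    # Prepend Nh-1 zeros (first block overlap) and append zeros to fill
    # the final block.
    xp = np.concatenate([np.zeros(Nh - 1), x, np.zeros(block_len)])
    Ny = len(x) + Nh - 1
    out, pos = [], 0
    while pos < Ny:
        block = xp[pos:pos + block_len]
        yi = np.fft.irfft(np.fft.rfft(block) * H, block_len)
        out.append(yi[Nh - 1:])           # discard circular artifacts
        pos += step
    return np.concatenate(out)[:Ny]       # trim to the linear output length
```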
Alternatively, the overlap-save block convolution can be illustrated as shown by the block diagram in Figure 6. In this case, the impulse response is zero-padded to the left to length 2×Nh and Nblock is set to precisely 2×Nh. Now the circular convolution produces the desired block output followed by its unwanted artifacts. Hence, the last Nh samples, the circular artifacts, can be discarded.
Figure 6. Block diagram of an efficient implementation of the overlap-save method using zero-padding to the left of the impulse response, h(n). Inputs xi-1 and xi are adjacent input blocks.
It should be noted that this form is not the most efficient implementation, per se. An Ni point input buffer (including the minimum overlap) requires two Ni point FFTs, Ni complex multiplications, and an Ni point IFFT to produce Ni-Nh output samples. This can be expressed as an estimated "operations per sample (OPS)" ratio.
Although this function is not bounded as Ni approaches infinity, it does have a minimum that can be found for a given Nh. For Ni = a× Nh, where a is an integer, the function reduces to
which increases monotonically with Nh. When Ni is not an integer multiple of Nh, a discrete minimum can be found numerically. Of course, as Ni increases, the input to output latency increases because the operation can only be performed every (Ni-Nh) samples. In any event, the form shown in Figure 6 will be exploited in the minimum delay convolution algorithm described in the next section and in subsequent chapters.
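The discrete minimum mentioned above can be located by a direct numerical search. The sketch below (illustrative Python; the formula is assembled from the operation counts quoted in the text: two Ni-point FFTs, Ni complex multiplies, and an Ni-point IFFT, producing Ni-Nh output samples) evaluates the estimated OPS ratio over candidate buffer sizes:

```python
import math

def ops_per_sample(Ni, Nh):
    """Estimated operations per output sample for one overlap-save
    buffer: 2 FFTs + Ni complex multiplies + 1 IFFT, yielding
    Ni - Nh new output samples."""
    fft_ops = Ni * math.log2(Ni)
    return (2 * fft_ops + Ni + fft_ops) / (Ni - Nh)

def best_buffer(Nh, max_mult=64):
    """Numerically search Ni = a*Nh (integer a >= 2) for the minimum
    OPS ratio; returns (ops, Ni)."""
    return min((ops_per_sample(a * Nh, Nh), a * Nh)
               for a in range(2, max_mult))
```

Larger Ni lowers the per-sample cost only up to a point, and, as noted above, at the price of latency, since the operation runs once every Ni-Nh samples.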
At this point, it will be useful to introduce less cumbersome notation describing the operation of the overlap-save method.
This represents the N-point convolution of input, x, and impulse response, hi, starting at sample n of the input signal. The output can be produced again by concatenating the results of many block convolution operations.
where " | " represents concatenation.
Zero Delay Convolution
Gardner (1995) has proposed a hybrid method which combines the direct FIR and block convolution techniques. The resulting algorithm can exhibit zero input to output latency, which makes it a suitable choice for high-performance real-time convolution applications. Figure 7 illustrates the initial zero-delay algorithm.
Figure 7. Initial "minimum-cost" zero-delay algorithm (adapted from Gardner 1995).
It is important to recognize that, although the linearity of convolution allows for this decomposition of the impulse response, progression through the algorithm involves successively greater computation. The algorithm proceeds as follows: the first N samples of y0 are computed using direct-form FIR convolution. The calculation of this first block takes just enough time to collect enough samples to begin the first N-point block convolution of y1. The calculation of the following y0 and y1 blocks takes enough time to collect 2N samples for the first block of y2. Again, the calculation of the following y0, y1, and y2 blocks takes enough time to collect 4N samples for the first block of y3. Gardner calls this a minimum-cost solution because there is no way to increase any block size without violating the zero-delay constraint. This solution is rather impractical because, first, each scheduled block convolution must be completed within a single sampling period, and, second, processor demand is not uniform. Most of the processor time will be spent waiting for input samples.
Gardner offers a more practical solution at the expense of some minimal latency. He proposes that processor demand will be uniform when blocks of size N start at least 2N samples into the filter response. Hence, the new algorithm resembles Figure 8.
Figure 8. A more practical decomposition of the impulse response with the corresponding block convolution schedule (adapted from Gardner 1995).
In this case, the processor is allotted the same amount of time to do the block computation as it takes to collect the required number of input samples. For instance, the convolution of the input with h1 begins at time N with the computation of x * h1(0,N). This computation must be completed by time 2N so that the computation of x * h1(N,N) can begin, and so on. Although the even-numbered partitions need not be completed at the same time as their odd counterparts (e.g. x * h2(0,N) need not be completed until time 3N), imposing the earlier deadline ensures that the processor demand remains uniform (Gardner 1995).
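One plausible reading of this practical decomposition can be sketched as a partitioning routine (illustrative Python; the function name and the choice of a direct-form head of length 2N are assumptions consistent with the constraint that a block of size B starts at least 2B samples into the response):

```python
def partition_impulse_response(Nh, N):
    """Gardner-style partition sketch: a direct-form FIR head of length
    2N, then frequency-domain blocks whose size doubles every two
    blocks, so each block of size B starts at offset >= 2B.

    Returns a list of (kind, offset, length) tuples covering h(n).
    """
    parts = [("direct", 0, min(2 * N, Nh))]
    offset, size = 2 * N, N
    while offset < Nh:
        for _ in range(2):                 # two blocks of each size
            if offset >= Nh:
                break
            B = min(size, Nh - offset)     # clip the final block
            parts.append(("fft", offset, B))
            offset += B
        size *= 2                          # sizes: N, N, 2N, 2N, 4N, ...
    return parts
```

For example, with N = 64 and Nh = 1024 this yields a 128-sample direct head followed by FFT blocks of sizes 64, 64, 128, 128, 256, 256, each starting no earlier than twice its own length into the response, which is what keeps the processor demand uniform.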
Although significant computation and process scheduling are necessary, this algorithm provides a means for zero-latency convolution without having to implement a very large direct-form FIR filter. The estimated cost of this algorithm is approximately 34×log2(N) - 151 multiplies per output sample, where N is the impulse response length (Gardner 1995). For a three second response at 44.1 kHz (132,300 points), each output sample requires about 428 multiplies, which measures favorably against the 132,300 multiplies required by a direct-form FIR structure.