**Chapter 2**

**CONVOLUTION TECHNIQUES**

This chapter describes the mathematical process of discrete linear convolution and well-known techniques used to reduce the number of calculations and the significant input-to-output latency inherent in the process.

**Definition of Convolution**

Convolution is the mathematical process that relates the output, y(t), of a linear, time-invariant system to its input, x(t), and impulse response, h(t). That is,

$$y(t) = x(t) * h(t) \qquad (1)$$

where "*" represents the commutative convolution operation. The continuous time relationship is described by the familiar convolution integral.

$$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(\tau)\,h(t-\tau)\,d\tau \qquad (2)$$

For infinite discrete sequences, designated by x(n), y(n), and h(n), this integral relation reduces to another familiar expression, the convolution sum.

$$y(n) = x(n) * h(n) = \sum_{k=-\infty}^{\infty} x(k)\,h(n-k) \qquad (3)$$

For two finite discrete sequences
of length N_{x} and N_{h}, the linear or
aperiodic convolution sum takes on a slightly different form.

$$y(n) = \sum_{k=0}^{n} h(k)\,x(n-k), \qquad n = 0, 1, \ldots, N_x + N_h - 2 \qquad (4)$$

where h(k) and x(n-k) are zero outside their appropriately defined intervals. For N_{x} > N_{h}, each summation need only be calculated for the 0 ≤ k ≤ N_{h}-1 terms. The output, y(n), will have length N_{x}+N_{h}-1.
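As a brief illustration (in Python with NumPy, which is an assumption of this example, not part of the original text), the following sketch convolves two short sequences and confirms the output length predicted above:

```python
import numpy as np

# Linear (aperiodic) convolution of two finite sequences, per Equation 4.
x = np.array([1.0, 2.0, 3.0, 4.0])   # Nx = 4
h = np.array([1.0, 0.5, 0.25])       # Nh = 3

y = np.convolve(x, h)                # direct linear convolution
print(y)                             # [1.  2.5  4.25  6.  2.75  1. ]
print(len(y))                        # Nx + Nh - 1 = 6
```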

Convolution processing is notoriously computationally intensive. Direct application of Equation 4 in a tapped delay line architecture requires on the order of 2× N_{h}^{2} operations (N_{h} multiplications and N_{h} additions for each output sample) to produce N_{h} output samples. For typical audio signals, a three-second impulse response sampled at 44.1 kHz requires around 2× (3× 44,100)^{2}, or 35 billion, computations to convolve with an input of the same length.

Processing in the frequency domain introduces significant input-to-output latency, since the input must first be buffered and transformed into the frequency domain; likewise, the output must be inverse transformed back into the time domain. The buffering alone imposes a minimum latency of 2× N_{x} samples. The frequency transformation itself and the convolution process introduce further overhead.

Other convolution methods aim to reduce both the number of required computations and the input-to-output latency without altering the expected output. Such techniques include block convolution and hybrid time/frequency processing.

**Time-Domain Convolution**

Direct convolution, as mentioned above, can be implemented with the tapped delay line architecture common to finite impulse response (FIR) filtering. Essentially, convolution is a filtering process. Or, more appropriately, filtering is an application of convolution. Expanding Equation 4, the expression becomes

$$y(n) = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + \cdots + h(N_h-1)\,x(n-N_h+1) \qquad (5)$$

This difference equation lends itself to the tapped delay line architecture shown in Figure 1.

*Figure 1. Tapped delay line architecture
implementing direct time domain convolution*

As mentioned earlier, this solution requires N_{h} multiplications and N_{h} additions for each output sample. Although current signal processing and computing technology makes this a feasible approach for short impulse responses, its sheer lack of efficiency makes it undesirable for longer ones.
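A minimal sketch of the tapped delay line in Python (NumPy assumed; the function name is illustrative) makes the per-sample cost explicit:

```python
import numpy as np

def fir_convolve(x, h):
    """Direct-form FIR (tapped delay line) convolution, per Equation 5.

    Each output sample costs Nh multiplications and Nh - 1 additions,
    so the full output takes on the order of Nx * Nh operations.
    """
    Nh = len(h)
    taps = np.zeros(Nh)                    # delay line: taps[k] holds x(n - k)
    y = np.zeros(len(x) + Nh - 1)
    for n in range(len(y)):
        taps = np.roll(taps, 1)            # shift samples one tap down the line
        taps[0] = x[n] if n < len(x) else 0.0
        y[n] = np.dot(h, taps)             # y(n) = sum_k h(k) x(n - k)
    return y
```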

**Frequency-Domain Convolution**

For the discrete case,
multiplication in the frequency domain translates to *circular*
convolution in the time domain (Rabiner & Gold, 1975). Figure
2 demonstrates the difference between linear and circular
convolution. The length of a circular convolution is equal to the
length of the longest sequence. Note how trailing samples that
were left alone in linear convolution are now wrapped around,
resulting in a shorter, different result. Manolakis and Proakis
(1992, pp. 689-695) present a more rigorous discussion on the
Discrete Fourier Transform (DFT) and circular convolution.

*Figure 2. An example of linear vs. circular
convolution for two short sequences*
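The distinction is easy to reproduce numerically. The sketch below (NumPy assumed; a hypothetical example, not taken from the figure) multiplies unpadded DFTs and exposes the wrap-around:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 0.5, 0.25])

linear = np.convolve(x, h)     # length Nx + Nh - 1 = 6

# Multiplying DFTs taken at the length of the longer sequence (4 points)
# yields a circular convolution: the trailing samples wrap around.
N = max(len(x), len(h))
circular = np.fft.ifft(np.fft.fft(x, N) * np.fft.fft(h, N)).real

print(linear)     # [1.    2.5   4.25  6.    2.75  1.  ]
print(circular)   # [3.75  3.5   4.25  6.  ]  i.e. circular[k] = linear[k] + linear[k + 4]
```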

In light of this difference,

$$y(n) = \mathrm{IDFT}\{\,\mathrm{DFT}\{x(n)\} \cdot \mathrm{DFT}\{h(n)\}\,\} = \mathrm{IDFT}\{X(m)\,H(m)\} \qquad (6)$$

where m is the discrete frequency variable, yields the desired linear convolution result only if x(n) and h(n) are padded with zeros prior to the DFT such that their respective lengths are N_{x}+N_{h}-1, preventing any circular wrap-around artifacts. Therefore, both the input sequence and the impulse response must be finite and fully defined. This has been termed "fast" convolution because the DFT is performed with the Fast Fourier Transform (FFT) and its inverse (IFFT) (Rabiner & Gold, 1975). Unfortunately, this method precludes the implementation of a dynamic or real-time input without special consideration for the lengths of the signals. The block diagram in Figure 3 outlines a direct frequency domain method.

*Figure 3. Block diagram of frequency domain
convolution ("fast" convolution)*

This method is much more efficient than direct time-domain convolution. For a given input, the computation requires around 2(N_{x}+N_{h}-1)× log_{2}(N_{x}+N_{h}-1) operations for the two forward FFTs, 2× (N_{x}+N_{h}-1) operations for the complex multiplication of X(m) and H(m), and another (N_{x}+N_{h}-1)× log_{2}(N_{x}+N_{h}-1) operations for the IFFT. For a three-second input this results in about 17 million calculations, less than one thousandth of the operations required for direct time-domain convolution.
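A sketch of this "fast" convolution in Python (NumPy assumed; real-input FFTs are used here for brevity) follows the block diagram of Figure 3 directly:

```python
import numpy as np

def fast_convolve(x, h):
    """'Fast' convolution: zero-pad both sequences to Nx + Nh - 1,
    multiply their FFTs, and inverse transform (Figure 3).
    Valid only when x is finite and fully known in advance."""
    N = len(x) + len(h) - 1
    X = np.fft.rfft(x, N)          # rfft zero-pads x to length N
    H = np.fft.rfft(h, N)
    return np.fft.irfft(X * H, N)  # matches np.convolve(x, h) up to rounding
```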

However, "fast" convolution is limited to cases where x(n) is finite and completely defined. Due to the required length adjustments, it is also impractical when x(n) is much longer than h(n). Alternative methods allow the convolution to be performed in consecutive sections, allowing very long or indeterminate signals to be convolved with the desired impulse response without prohibitive latency. These methods are called sectioned or block convolutions.

**Overlap-Add Block Convolution**

Consider the example shown in
Figure 4. The linear convolution of x(n) (Figure 4a) and h(n)
(Figure 4b) is shown in Figure 4c. Figures 4d-f illustrate
consecutive linear convolutions of blocks of x(n) with h(n). Each
block convolution has the expected length N_{xi}+N_{h}-1.
In this case, the three block convolutions have a combined length
of 3× (N_{xi}+N_{h}-1) = 15, whereas
the initial convolution has length N_{x}+N_{h}-1
= 11. By inspection, it is reasonable to assume that the overlap
between the sectioned convolutions should be (15-11)/2 = 2
samples, as shown in Figure 4g. This corresponds to the value of
N_{h}-1. When the overlapping samples are summed, the
resulting sequence (Figure 4h) equals the initial convolution
result shown in Figure 4c. This method is called overlap-add
block convolution.

*Figure 4. Block convolution using the overlap-add method. (a) input x(n), (b) impulse response h(n), (c) expected output y(n), (d) output y_{1}(n) for block convolution of x_{1}(n) and h(n), (e) output y_{2}(n), (f) output y_{3}(n), (g) shifted block outputs, overlap is N_{h}-1=2, and (h) the sum of overlapped block outputs, equivalent to the direct convolution result.*

The input blocks need not be precisely N_{h} samples long, but it is generally a good idea to keep N_{xi} on the order of N_{h} to avoid unnecessarily long block convolutions. The overlap between block outputs must remain N_{h}-1, regardless. Mathematically, x(n) and y(n) can be represented as

$$x(n) = \sum_{i} x_i(n), \qquad x_i(n) = \begin{cases} x(n), & i\,N_{xi} \le n \le (i+1)\,N_{xi} - 1 \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

$$y(n) = \sum_{i} y_i(n) = \sum_{i} x_i(n) * h(n) \qquad (8)$$
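A compact sketch of the method (NumPy assumed; the block length parameter is illustrative) shows how the overlapping tails described by Equations 7 and 8 are summed:

```python
import numpy as np

def overlap_add(x, h, block_len):
    """Overlap-add block convolution (Equations 7 and 8).

    Each length-block_len input block is convolved with h; the resulting
    blocks, each block_len + Nh - 1 samples long, overlap by Nh - 1
    samples and are summed into the output.
    """
    Nh = len(h)
    y = np.zeros(len(x) + Nh - 1)
    for start in range(0, len(x), block_len):
        yi = np.convolve(x[start:start + block_len], h)
        y[start:start + len(yi)] += yi    # overlapping samples are summed
    return y
```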

**Overlap-Save Block Convolution**

Unlike the overlap-add method,
the overlap-save method requires that the *input* blocks
overlap. Then the input blocks are *circularly* convolved
with the impulse response. Because of the overlap redundancy at
the input, the circular artifacts in the output (the first N_{h}-1
samples) can simply be discarded. Figure 5 illustrates the
overlap-save method.

*Figure 5. Block convolution using the overlap-save method. (a) input signal x(n) divided into overlapping sections, overlap is N_{h}-1=2, (b) impulse response h(n), (c) output y(n) using direct convolution, (d) output y_{1}(n) for block circular convolution of x_{1}(n) and h(n), (e) output y_{2}(n), (f) output y_{3}(n), (g) output y_{4}(n), and (h) sequential concatenation of block outputs after discarding the first two samples of each block, which is equivalent to the direct convolution result. "|" represents concatenation.*

Since this method uses circular
convolution, it lends itself to use of the FFT when calculating
each block. All that is required is zero-padding the impulse
response (to the right) to length N_{xi} so that the FFT
of h(n), H(m), will have the proper length for multiplication
with X_{i}(m). Since h(n) and N_{xi} are known in
advance, H(m) can simply be calculated and stored in memory
before processing begins.
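The following sketch (NumPy assumed; function and parameter names are illustrative) implements the method with a precomputed H(m), discarding the first N_{h}-1 samples of each circular result:

```python
import numpy as np

def overlap_save(x, h, n_block):
    """Overlap-save block convolution via the FFT (Figure 5).

    Input blocks of n_block samples overlap by Nh - 1 samples; the first
    Nh - 1 samples of each circular result are the circular artifacts
    and are discarded.
    """
    Nh = len(h)
    assert n_block >= Nh, "blocks must be at least as long as h"
    step = n_block - (Nh - 1)              # new output samples per block
    H = np.fft.rfft(h, n_block)            # h zero-padded once, in advance

    # Prepend Nh - 1 zeros so the first block is handled like the rest.
    xp = np.concatenate([np.zeros(Nh - 1), x, np.zeros(n_block)])
    y = np.zeros(len(x) + Nh - 1)
    for start in range(0, len(y), step):
        xi = xp[start:start + n_block]     # overlapping input block
        yi = np.fft.irfft(np.fft.rfft(xi) * H, n_block)
        keep = yi[Nh - 1:]                 # drop the circular artifacts
        end = min(start + step, len(y))
        y[start:end] = keep[:end - start]
    return y
```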

Again there is no size constraint on the input blocks as long as they overlap by at least N_{h}-1 samples. However, noting that each block requires an approximately N_{block}× log_{2}(N_{block}) operation FFT, N_{block} complex multiplies, and an N_{block}× log_{2}(N_{block}) operation IFFT, it is a good idea to keep N_{block} on the order of N_{h}. Mathematically, symbolic representations of x(n) and y(n) are rather cumbersome, but can be expressed as

$$x(n) \rightarrow \{\,x_0(n),\ x_1(n),\ x_2(n),\ \ldots\,\} \qquad (9)$$

where x_{i}(n) = x(n) for [i× N_{block} - (i+1)× (N_{h}-1) ≤ n ≤ (i+1)× N_{block} - (i+1)× (N_{h}-1) - 1] and x_{i}(n) = 0 otherwise.

Also, x_{i}(m) = x_{i-1}(m) for [i× N_{block} - (i+1)× (N_{h}-1) ≤ m ≤ i× N_{block} - i× (N_{h}-1) - 1]; that is, adjacent blocks share the N_{h}-1 overlapping samples.

$$y_i(n) = x_i(n) \circledast h(n) \qquad (10)$$

where ⊛ denotes circular convolution, and finally,

$$y(n) = y_0(m)\;|\;y_1(m)\;|\;y_2(m)\;|\;\cdots \qquad (11)$$

where " | " denotes
concatenation and m indexes the last N_{block} - (N_{h}-1)
samples of each block.

Alternatively, the overlap-save
block convolution can be illustrated as shown by the block
diagram in Figure 6. In this case, the impulse response is
zero-padded to the left to length 2× N_{h} and N_{block}
is set to precisely 2× N_{h}. Now the circular
convolution produces the desired block output followed by its
unwanted artifacts. Hence, the last N_{h} samples, the
circular artifacts, can be discarded.

*Figure 6. Block diagram of an efficient implementation of the overlap-save method using zero-padding to the left of the impulse response, h(n). Inputs x_{i-1} and x_{i} are adjacent input blocks.*

It should be noted that this form
is not the most efficient implementation, per se. An N_{i}
point input buffer (including the minimum overlap) requires two N_{i}
point FFTs, N_{i} complex multiplications, and an N_{i}
point IFFT to produce N_{i}-N_{h} output samples.
This can be expressed as an estimated "operations per sample
(OPS)" ratio.

$$\mathrm{OPS} = \frac{3\,N_i \log_2(N_i) + N_i}{N_i - N_h} \qquad (12)$$

Although this function is not
bounded as N_{i} approaches infinity, it does have a
minimum that can be found for a given N_{h}. For N_{i}
= a× N_{h}, where a is an integer, the
function reduces to

$$\mathrm{OPS} = \frac{a\left(3\log_2(a\,N_h) + 1\right)}{a - 1} \qquad (13)$$

which increases monotonically
with N_{h}. When N_{i} is not an integer multiple
of N_{h}, a discrete minimum can be found numerically. Of
course, as N_{i} increases, the input to output latency
increases because the operation can only be performed every (N_{i}-N_{h})
samples. In any event, the form shown in Figure 6 will be
exploited in the minimum delay convolution algorithm described in
the next section and in subsequent chapters.
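Using the operation counts above, the minimum is easy to locate numerically. The short sketch below (NumPy assumed; the chosen N_{h} is only an example) scans candidate buffer sizes for a given impulse response length:

```python
import numpy as np

# Scan buffer sizes N_i for the minimum operations-per-sample ratio
# (Equation 12) given an impulse response of length Nh.
Nh = 4096
Ni = np.arange(Nh + 1, 64 * Nh)
ops = (3 * Ni * np.log2(Ni) + Ni) / (Ni - Nh)

best = Ni[np.argmin(ops)]
print(best, best / Nh, ops.min())   # optimal N_i, its multiple of Nh, its OPS
```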

At this point, it will be useful to introduce less cumbersome notation describing the operation of the overlap-save method.

$$x * h_i(n, N) \qquad (14)$$

This represents the N-point
convolution of input, x, and impulse response, h_{i},
starting at sample n of the input signal. The output can be
produced again by concatenating the results of many block
convolution operations.

$$y = x * h(0, N)\;|\;x * h(N, N)\;|\;x * h(2N, N)\;|\;\cdots \qquad (15)$$

where " | " represents concatenation.

**Zero Delay Convolution**

Gardner (1995) has proposed a hybrid method which combines the direct FIR and block convolution techniques. The resulting algorithm can exhibit zero input-to-output latency, which makes it a suitable choice for high-performance real-time convolution applications. Figure 7 illustrates the initial zero-delay algorithm.

*Figure 7. Initial "minimum-cost"
zero-delay algorithm (adapted from Gardner 1995).*

It is important to recognize
that, although the linearity of convolution allows for this
decomposition of the impulse response, progression through the
algorithm involves successively greater computation. The
algorithm proceeds as follows: the first N samples of y_{0}
are computed using direct-form FIR convolution. The calculation
of this first block takes just enough time to collect enough
samples to begin the first N-point block convolution of y_{1}.
The calculation of the following y_{0} and y_{1}
blocks takes enough time to collect 2N samples for the first
block of y_{2}. Again, the calculation of the following y_{0},
y_{1}, and y_{2} blocks takes enough time to
collect 4N samples for the first block of y_{3}. Gardner
calls this a minimum-cost solution because there is no way to
increase any block size without violating the zero-delay
constraint. This solution is rather impractical because, first, all the block convolutions due at a given output sample must be completed within a single sampling period, and, second, processor demand is not uniform. Most of the processor time will be spent waiting for input samples.

Gardner offers a more practical solution at the expense of some minimal latency. He proposes that processor demand will be uniform when blocks of size N start at least 2N samples into the filter response. Hence, the new algorithm resembles Figure 8.

*Figure 8. A more practical decomposition of the impulse
response with the corresponding block convolution schedule
(adapted from Gardner 1995).*

In this case, the processor is
allotted the same amount of time to do the block computation as
it takes to collect the required number of input samples. For
instance, the convolution of the input with h_{1} begins
at time N with the computation of x * h_{1}(0,N). This
computation must be completed by time 2N so that the computation
of x * h_{1}(N,N) can begin, and so on. Although the
even-numbered partitions need not be completed at the same time
as their odd counterparts (e.g. x * h_{2}(0,N) need not
be completed until time 3N), imposing the earlier deadline
ensures that the processor demand remains uniform (Gardner 1995).
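The decomposition behind Figure 8 can be sketched as follows (a hypothetical helper, assuming a direct-form FIR head covering the first 2N samples followed by pairs of blocks of doubling size, so that every block of size K begins at least 2K samples into the response):

```python
def practical_partition(n_h, n):
    """Sketch of the impulse-response decomposition of Figure 8.

    Returns (kind, offset, size) tuples: a direct-form FIR head covering
    the first 2N samples, then two blocks of each size N, 2N, 4N, ...,
    so that a block of size K never starts before sample 2K.
    """
    parts = [("fir", 0, min(2 * n, n_h))]   # head, computed sample by sample
    offset, size = 2 * n, n
    while offset < n_h:
        for _ in range(2):                  # two blocks of each size
            if offset >= n_h:
                break
            parts.append(("block", offset, min(size, n_h - offset)))
            offset += size
        size *= 2
    return parts

# practical_partition(128, 8) -> FIR head of 16 samples, then blocks of
# sizes 8, 8, 16, 16, 32, 32 at offsets 16, 24, 32, 48, 64, 96.
```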

Although significant computation and process scheduling are necessary, this algorithm provides a means for zero-latency convolution without having to implement a very large direct-form FIR filter. The estimated cost of this algorithm is approximately 34× log_{2}(n) - 151 multiplies per output sample, where n is the length of the impulse response (Gardner 1995). For a three-second response at 44.1 kHz (132,300 points), each output sample requires about 428 multiplies, which measures favorably against the 132,300 multiplies required by a direct-form FIR structure.