6
Proposing Average Mutual Information
(AMI) for Beat Detection
6.3
Average Mutual Information Calculation
In
this chapter a simple beat detection system based on AMI is proposed. Beats are considered the fundamental
component of rhythm and they correspond to the perception of equally spaced
temporal units that define tempo in music.
In other words, tempo refers to the rate at which beats occur. On the other hand, meter imposes an accent
structure on beats and refers to the most basic level of rhythmic organization,
e.g. “ one, two, three, one two, three...”
This study tries only to detect equally spaced beats. Music with variable tempo is not considered.
The
method attempts to prove that when AMI function is applied to audio signals the
yielded result is the musical tempo (rate at which beats occur). As described in Chapter 5, the proposed
method assumes that an excerpt of music is the non-lineal system and all the
instruments and voices are the variables, but they sound coherently during
performance because they are following the same tempo marked for the beat. If AMI is applied to some music excerpt, then
the obtained value must be the common link or periodicity in the system, the
musical tempo.
The
system was implemented with the software TSTOOLS for Matlab available on the
Internet [43]. It is a software package
for signal processing with emphasis on nonlinear time-series analysis. The TSTOOL package is written partly in
Matlab and partly in C++ for computationally demanding algorithms. All the information regarding this software
is placed in the Appendix.
In
Figure 6.1 the proposed beat detection algorithm is presented. The process starts with an audio file in WAV
format as input that is read and normalized.
Then the amplitude envelope is detected using the Hilbert transform
(section 6.1). This is done to make the
onset/offsets of the signal prominent. These onset and offset are perceived as
rhythm (Section 3.2.3 and suggested by [13] [31][33][37]). Thus, to optimize the data set and
consequently diminish the time in the AMI calculations, the signal is decimated
by a factor of N as is explained in Section 6.2. Then, AMI is calculated using the function provided by TSTOOLS
which is further explained in section 6.3.
From the AMI result, the maximum peak in the range between 60 bpm and
240 bmp is selected as the beat rate.
This range was selected based on the typical music beat rates, and also
because it is used in other similar studies [31].
|
|
As
was discussed in section 3.2.3, our auditory systems are sensitive to the
signal’s onsets and offsets, perceiving rhythm as an AM envelope. The Hilbert transform is a common and
efficient technique for envelope detection.
Figure 6.2 presents the envelope of one audio file using this technique.
|
Figure 6.2
Implemented envelope detection based on Hilbert transform |
The
diagram in 6.3 gives an idea of how this envelope approximation process works.
|
|
a)
Original
signal with envelope. b)
Original
signal overlayed with Hilbert transform. c)
Absolute
values of signals shown in b). d)
Peak hold
and sum of previous signals. |
The
Hilbert Transform
(t)
of a signal x(t) is defined by the equation
|
|
where
the integral is the Cauchy principal value integral. The reconstruction formula
(equation 6-2) defines the Hilbert inverse transform.
|
Equation 6‑2 |
A
Hilbert transformer produces a -90 degree phase shift for the positive
frequency components of the input x(t), then the Hilbert
transform of
(t)
is -x(t) and x(t) and
(t)
are orthogonal. But on the other hand,
the signal x(t) and its Hilbert transform
(t)
have the same amplitude spectrum and the same autocorrelation function.Then if
Hilbert is applied the beat detection is not affected.
Matlab
has already implemented the Hilbert transform. The syntax to use this function
to obtain the envelope of a signal s1 is:
Envelope_S1=abs(hilbert(s1));
hilbert(x) returns
a complex sequence called the analytic signal from a real data sequence. The analytic signal has a real part, which
is the original data, and an imaginary part, which contains the Hilbert
transform. The imaginary part is a
version of the original real sequence with a 90° phase shift. Sines are therefore transformed to cosines
and vice versa. The Hilbert transformed
series has the same amplitude and frequency content as the original real data
and includes phase information that depends on the phase of the original
data. The analytic signal for a
sequence x
has a one-sided Fourier transform, that is, negative frequencies are 0. To
approximate the analytic signal, hilbert
calculates the FFT of the input sequence, replaces those FFT coefficients that
correspond to negative frequencies with zeros, and calculates the inverse FFT
of the result. This algorithm is fully
detailed in the Matlab reference toolbox [21].
Decimation
facilitates the reduction of data information without losses when it is applied
adequately. Decimation by N is defined
as taking every N sample starting with sample 0. For example,
{[0,1,2,3,4,5,6,7,8,9]}=[0,2,4,8]. In Figure 6.5 a) sin(x) is
presented. In Figure 6.5 b) the sin(x)
is decimated by factor of 2. The shape
of the signal is still the same but it is only represented with half of the
points used in a). This is equivalent
to reducing the sample rate by a factor of 2.

Figure 6.4 a) Signal sin(x) (b) Signal sin(x)
decimated by 2
When
decimation is applied, two factors must be considered: the scaling and the
decimation factors. If decimation is
applied, the change in the frequency scale must be considered when the final
result is obtained. On the other hand,
if the decimation factor is too high the degradation of the result is
notorious. In practice the decimation
factor must be one that simplifies the calculations minimizing the amount of
data and not degrade the final result.
When the decimation factor N increases, the AMI result becomes unclear
and the peak disappears. Figures from
6.5 to 6.9 present the results obtained using AMI over the same audio file with
decimation factor between N=1 and N=20.
|
|
|
|
|
|
|
|
Figure
6.7 AMI using decimation
N=5 |
|
|
Figure
6.8 AMI using decimation
N=10 |
|
|
Figure
6.9 AMI using decimation
N=20 |
As
was explained in section 5.3, average mutual information (AMI) is a
generalization of the correlation function.
It measures the state predictability or the hysteresis of a system and
establishes a nonlinear relationship between measurements
drawn from a
distributed set of variables
={
} and measurement
drawn from another distributed set of variables
={
}. So AMI is the
amount learned by the measurement of
about the measurement
of
. The algorithm
evaluates the normalized histogram of variables
,
, the normalized histogram of variables
,
and the joint
normalized histogram of variables
,
,
.
The
time delay T must be large enough so that independent information about the
system is in each component of the vector.
However, T must not be so large that the components of the vectors x(t)
are independent with respect to each other.
Conversely, if the time delay is too short, the vector components will
be independent and will not contain any new information. AMI is applied to the audio signal as Auto
Average Mutual Information, the signal is compared to itself delayed a time T.
AMI
is calculated by the function provided by TSTOOLS, which requires the following
syntax:
a=amutual(ts,maxtau,#partitions)
Where
ts is the vector holding time series data, in this case the WAV file.
maxtau
or maximal delay that should be much smaller than the length of the
signal. In all cases the value selected
was 2, equivalent to the half of the set data.
#partitions
is the number of bins used for histogram calculation, 64 and 128 were used
depending of the complexity of the signal.
The
output a is the vector of length maxtau-1, holding AMI.
The
AMI algorithm (TSTOOLS) uses equidistant histogram boxes, so results are poor
in a mathematical sense. However, it is
a fast algorithm based on ternary search trees to store only nonempty
boxes. More information about TSTOOLS
is available in Appendix and at the TSTOOLS website [43].
After
AMI is calculated the program re-scales the time axis according to the
decimation parameter used previously.
From the results obtained, only the data corresponding from 1 Hz to 4 Hz
(0.25 sec –1 sec in time) are considered to be in the range of response. The maximum value or peak in this range is
taken as the beat rate in seconds. This
range is equivalent to 60 bpm to 240 bpm; the common beat rates used in popular
music. In addition, the Matlab codes of
some of the programs used in this study are compiled in the Appendix.