4 Computer Models To Emulate Rhythm Perception
4.1 Tempo and Beat Analysis of Acoustic Musical Signals
4.2 Real-time Beat Tracking for Drumless Audio Signals
Two papers were selected for review in this chapter as relevant material for study. Both describe real-time beat detectors that attempt to emulate human rhythm perception, and both were designed for high-level detection in music sampled from compact discs.
Scheirer [31] presented a computational algorithm capable of producing behavior similar to that of human listeners when detecting the beat, or pulse, in a variety of musical situations. This model has certain similarities to existing theories of sound perception that make it attractive as a psychoacoustic model of tempo perception. In this study, beat is considered the fundamental perceptual attribute of rhythm: the sequence of equally spaced phenomenal impulses that defines a tempo for the music. The grouping and strong/weak relationships that define rhythm and meter were not considered.
His method rests on the observation that certain kinds of signal manipulations and simplifications can be performed without affecting the perceived tempo and beat of a musical signal.
Figure 4.1 shows a signal flow network in which an amplitude-modulated noise signal is constructed by vocoding white noise with the sub-band envelopes of a musical signal. This is accomplished by performing a sub-band analysis of the music and of a white-noise signal; each band of the noise is then modulated with the amplitude envelope of the corresponding band of the musical filterbank output. The resulting noise signals are summed together to form an output signal.
Figure 4.1 Psychoacoustic simplification of rhythm perception. [31]
The psychoacoustic simplification lies in the fact that the only information preserved is the amplitude envelopes of the filterbank outputs, because only this information is necessary to extract pulse and meter from a musical signal. This suggests that musical notes are not necessary components for rhythm perception, and it represents a vast reduction of input data size compared with the original signal.
Certain other kinds of simplification are not possible: it seems that separating the signal into sub-bands and maintaining the sub-band envelopes separately is necessary for accurate rhythmic processing. No psychoacoustic experiments were conducted to determine exactly which filterbank or envelope manipulations leave rhythm perception undisturbed. Nevertheless, the results suggested that a rhythmic processing algorithm should treat frequency bands separately, combining results at the end, rather than attempting to perform beat tracking on the sum of filterbank outputs.
Figure 4.2 shows an overall
view of Scheirer’s tempo analysis algorithm as a signal flow network.
As the signal comes in, a filterbank is used to divide it into six bands. For each of these sub-bands, the amplitude envelope is calculated and its derivative taken. Each envelope derivative is passed on to another filterbank of tuned resonators. In each resonator filterbank, one of the resonators will phase-lock: the one whose resonant frequency matches the rate of periodic modulation of the envelope derivative. The outputs of the resonators are examined to see which ones exhibit phase-locked behavior, and this information is tabulated for each of the band-pass channels. These tabulations are summed across the frequency filterbank to arrive at the frequency (tempo) estimate for the signal, and the peak phase points in the phase-locked resonators are then consulted to determine the phase of the signal.
The filterbank
implementation in the algorithm has six bands; each band has sharp cutoffs and
covers roughly a one-octave range. The lowest band is a low-pass filter with
cutoff at 200 Hz; the next four bands are band-pass, with cutoffs at 200 and
400 Hz, 400 and 800 Hz, 800 and 1600 Hz, and 1600 and 3200 Hz. The highest band is high-pass, with cutoff
frequency at 3200 Hz. Each filter was
implemented using a sixth-order elliptic filter, with 3 dB of ripple in the
pass band and 40 dB of rejection in the stop band. Figure 4.3 shows the magnitude responses of these filters.
Figure 4.3 Magnitude response of the frequency filterbank used in the system [31]
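The band layout described above can be written down as a small table. The following is a minimal sketch that uses idealized band edges only; it ignores the transition bands and ripple of the actual sixth-order elliptic filters, and the names `BAND_EDGES_HZ` and `band_index` are illustrative, not taken from [31].

```python
import math

# Idealized edges (low, high) in Hz of the six bands described in the text.
# The real filters have finite transition bands; this table ignores them.
BAND_EDGES_HZ = [
    (0.0, 200.0),        # low-pass band
    (200.0, 400.0),      # band-pass
    (400.0, 800.0),      # band-pass
    (800.0, 1600.0),     # band-pass
    (1600.0, 3200.0),    # band-pass
    (3200.0, math.inf),  # high-pass band
]


def band_index(frequency_hz):
    """Return the index of the idealized band that contains frequency_hz."""
    if frequency_hz < 0:
        raise ValueError("frequency must be non-negative")
    for index, (low, high) in enumerate(BAND_EDGES_HZ):
        if low <= frequency_hz < high:
            return index
    raise ValueError("frequency out of range")
```

Each of the four band-pass entries spans exactly one octave (the upper edge is twice the lower edge), which matches the "roughly one-octave" description.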
The envelope is extracted from each band of the filtered signal through a rectify-and-smooth method. The first-order difference of the envelope is then calculated and half-wave rectified; this rectified difference signal is what is examined for periodic modulation.
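The envelope processing just described can be sketched in a few lines. This is an illustration under stated assumptions rather than Scheirer's implementation: the text only says "rectify-and-smooth", so the one-pole low-pass smoother and its `smooth` coefficient used here are assumptions, as is the function name `envelope_difference`.

```python
def envelope_difference(band_signal, smooth=0.9):
    """Rectify-and-smooth envelope, then half-wave rectified first difference.

    The one-pole smoother is an assumed stand-in for the smoothing step;
    `smooth` (0..1) controls how slowly the envelope follows the signal.
    """
    envelope = []
    level = 0.0
    for sample in band_signal:
        # Full-wave rectify (abs) and smooth with a one-pole low-pass filter.
        level = smooth * level + (1.0 - smooth) * abs(sample)
        envelope.append(level)

    # First-order difference of the envelope (first sample has no predecessor).
    difference = [0.0] + [b - a for a, b in zip(envelope, envelope[1:])]

    # Half-wave rectify: keep only increases in the envelope (onsets).
    rectified = [max(0.0, d) for d in difference]
    return envelope, rectified
```

For a burst of sound surrounded by silence, the rectified difference is positive while the envelope rises during the burst and zero while it decays afterwards, so only onsets contribute to the later periodicity analysis.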
Figure 4.4 shows the envelope extraction process for one frequency band in each of two signals.
The top panels show the audio waveforms: a 2 Hz click track (left) and a polyphonic music example (right). The middle panels show the envelopes, and the bottom panels show the half-wave rectified difference of the envelopes. The lowest filterbank band is shown for the click track, and the second-highest band for the music.
The beat tracking algorithm uses a network of resonators to phase-lock with the beat of the signal and determine the frequency of the pulse. Comb filters, which are often used in reverberators and other sorts of audio signal processing, have properties that make them suitable for acting as resonators in this phase-locking pulse extraction process. In particular, a comb filter with delay T will respond more strongly to a signal with period T than to any other, since the response peaks of the filter line up with the frequency distribution of energy in the signal.
Thus, after the envelope has been extracted and processed for each channel, a filterbank of comb filter resonators is applied in which the delays vary from resonator to resonator and cover the range of possible pulse frequencies to track. The outputs of these resonator filterbanks are summed across the frequency sub-bands. By examining the energy output from each resonance channel of the summed resonator filterbanks, the strongest periodic component of the signal may be determined. The frequency of the resonator with the maximum energy output is selected as the tempo of the signal.
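The resonator stage described above can be illustrated with a small sketch. It assumes a common feedback comb filter form, y[t] = alpha * y[t - T] + (1 - alpha) * x[t], with a fixed gain `alpha`; the text does not give the filter equation or gain, so both are assumptions, and the function names are illustrative.

```python
def comb_filter(signal, delay, alpha=0.9):
    """Feedback comb filter: y[t] = alpha * y[t - delay] + (1 - alpha) * x[t]."""
    output = []
    for t, x in enumerate(signal):
        feedback = output[t - delay] if t >= delay else 0.0
        output.append(alpha * feedback + (1.0 - alpha) * x)
    return output


def estimate_period(band_envelopes, candidate_delays, alpha=0.9):
    """Pick the delay (in samples) whose summed resonator output has most energy.

    band_envelopes: one processed envelope (list of floats) per frequency band.
    """
    energies = {}
    for delay in candidate_delays:
        # Run the same-delay resonator on every band, then sum across bands.
        outputs = [comb_filter(env, delay, alpha) for env in band_envelopes]
        summed = [sum(samples) for samples in zip(*outputs)]
        energies[delay] = sum(value * value for value in summed)
    best_delay = max(energies, key=energies.get)
    return best_delay, energies
```

A pulse train with a period of 10 samples, fed to resonators with delays from 6 to 14 samples, yields the largest energy at delay 10. Resonators at integer multiples or fractions of the true period can also respond strongly, which is one reason the candidate range matters in practice.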
Figure 4.5 shows the summed filterbank output for a 2 Hz pulse train (top) and for a polyphonic music example (bottom). The horizontal axes are labeled with the "metronome marking" in beats per minute (for example, 120 MM = 2 Hz); this is a direct mapping of the delay of the corresponding comb filter. The polyphonic music shows more overall energy, but the tempo is still clearly visible as peaks in the curve.
Figure 4.5 Tempo estimation [31]
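The mapping between metronome marking and comb filter delay mentioned above is simple arithmetic: a tempo of B beats per minute has a period of 60/B seconds. The sketch below makes this concrete; the parameter `envelope_rate_hz` (the sampling rate at which the envelope signals are processed) is an assumption introduced for illustration and is not specified in the text.

```python
def bpm_to_delay_samples(bpm, envelope_rate_hz):
    """Comb filter delay (in samples) for a tempo given in beats per minute."""
    period_seconds = 60.0 / bpm
    return round(period_seconds * envelope_rate_hz)


def delay_samples_to_bpm(delay_samples, envelope_rate_hz):
    """Metronome marking (beats per minute) for a comb filter delay."""
    return 60.0 * envelope_rate_hz / delay_samples
```

With an assumed envelope rate of 200 Hz, 120 MM (2 Hz) corresponds to a delay of 100 samples, and a 100-sample delay maps back to 120 MM.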
Once the tempo is known, the phase of the signal is determined by examining the output of the resonators directly or, even better, by examining the internal state of the delays of these filters. The vector w of delays can be interpreted at a particular point in time as the "predicted output" of that resonator. That is, the w vector contains the next n samples of envelope output that the filter would generate in response to zero input, where n is the period of the filter. The sum of the delay vectors over all frequency channels, for those resonators corresponding to the tempo determined in the previous step, is examined.
The peak of this prediction vector is taken as the estimate of when the next beat will arrive in the input. The ratio $\omega = 2\pi (t_n - t)/T$, where $t_n$ is the time of the next predicted beat, $t$ the current time, and $T$ the period of the resonator, is the phase $\omega$ of the tempo being tracked. The phase and period may be used to predict beat times as far into the future as desired. Figure 4.6 shows the phase estimates after tracking 5 s of a 2 Hz click track (top) and of a polyphonic music example (bottom).
Figure 4.6 Phase estimation. [31]
The x-axis in each case
covers the next full period of the resonator tracking the tempo and the peak of
the curve shows where the next beat is predicted to occur. The implementation of the model performs the
phase analysis every 25 ms and integrates evidence between frames in order to
predict beats.
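The phase estimation step can be sketched as follows. This is a simplified illustration, not the implementation in [31]: it takes a single prediction vector (the summed delay-line contents), locates its peak, and applies the phase ratio 2π(t_n − t)/T. The function names and the `sample_period` parameter (seconds per envelope sample) are assumptions, and the integration of evidence between frames is omitted.

```python
import math


def next_beat_and_phase(prediction, now, sample_period):
    """Estimate the next beat time, the period, and the phase.

    prediction: the next n envelope samples predicted by the resonator,
                where n is the resonator period in samples.
    now: current time in seconds; sample_period: seconds per sample.
    """
    n = len(prediction)
    peak_index = max(range(n), key=lambda i: prediction[i])
    period = n * sample_period                    # T
    t_next = now + peak_index * sample_period     # t_n
    phase = 2.0 * math.pi * (t_next - now) / period
    return t_next, period, phase


def future_beats(t_next, period, count):
    """Extrapolate `count` beat times starting at the next predicted beat."""
    return [t_next + k * period for k in range(count)]
```

For example, a four-sample prediction vector whose peak lies at index 2, with 0.125 s per sample, gives a period of 0.5 s, a next beat 0.25 s ahead, and a phase of π; later beats follow at multiples of the period.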
The performance of the algorithm was evaluated both qualitatively and quantitatively. For the qualitative evaluation, 60 ecological music excerpts were tested with the implemented algorithm using a short application that reads a sound sample from disk, causally beat-tracks it, and writes a new sound file with clicks (short noise bursts) added to the signal where beats are predicted to occur. A selection of these sound files is available at: http://sound.media.mit.edu/people/eds/beat/results.html
Forty-one of the 60 samples (68%) were qualitatively classified as being tracked accurately, and another 11 (18%) as being tracked somewhat accurately. Based on these results, the algorithm seems quite successful at tracking musical beats.
For the quantitative evaluation, a short validation experiment was conducted to test whether the beat-tracking algorithm performed generally like a human listener. Five adult listeners, experienced musicians with normal hearing, all graduate students and staff members at the MIT Media Laboratory, participated in the experiment. Subjects listened through headphones to seven musical examples drawn from different musical genres. They indicated their understanding of the beat by tapping along with the music on a computer keyboard.
All seven trials were run in the same sequence for each listener, in a single block. The experiment was not counterbalanced, on the assumption that there is little training effect in this task. The entire experiment took approximately 5 min per subject. The results indicate that the algorithm was as regular as a human listener for five of the seven trials.
Goto and Muraoka [12] presented a real-time beat tracking system that recognizes a hierarchical beat structure in musical audio signals without drum sounds. The system detects a beat structure of three rhythmic levels (the quarter note level, the half note level, and the measure level) in music sampled from popular compact discs (see Figure 4.7). They proposed a method of detecting chord changes in order to make musical decisions about the audio signals using heuristic musical knowledge. The purpose of this study was to build a beat-tracking system useful in applications such as music-synchronized CG animation, video/audio editing, and human-computer improvisation in live ensembles.
Figure 4.7 Beat-Tracking Problem [12]
They defined beat times as the temporal positions of almost regularly spaced beats corresponding to quarter notes; the sequence of beat times is called the quarter note level. The system then finds the beginnings of half notes and measures. The sequence of half note times is obtained by determining whether each beat is strong or weak.
The beat-tracking system for musical audio signals without drum sounds provides a real-time output called beat information (BI), which consists of the beat time, the beat types, and the current tempo. Figure 4.8 shows the system.
Figure 4.8 Overview of Goto and Muraoka's beat-tracking system [12]
The system first digitizes an input audio signal in the A/D conversion stage. Then, in the frequency analysis stage, multiple onset-time finders detect onset times in different ranges of the frequency spectrum, and those results are transformed into a vectorial representation (called onset-time vectors) by onset-time vectorizers. In the beat prediction stage, the system manages multiple agents that, according to different strategies, make parallel hypotheses based on those onset-time vectors. Each agent first calculates the inter-beat interval and predicts the next beat time. By communicating with a chord change checker, it then determines the beat types and evaluates the reliability of its own hypothesis. A hypotheses manager gathers all hypotheses and then determines the final output on the basis of the most reliable one. Finally, in the beat information (BI) transmission stage, the system transmits BI to application programs via a computer network.
The method detects chord changes by analyzing the frequency spectrum sliced at provisional beat times. The results show that the beat detection rates obtained with real-world audio signals were more than 87.5%, and that detecting chord changes and basing musical decisions on them was effective enough to contribute to determining the hierarchical beat structure comprising the three rhythmic levels. They also developed an application that displays real-time computer graphics dancers whose motions change in time to musical beats (Figure 4.9).
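The role of the hypotheses manager in the beat prediction stage described above can be sketched as a simple selection over agent hypotheses. This is only an illustration of the "most reliable hypothesis wins" idea: the `Hypothesis` fields, the numeric reliability score, and the function names are hypothetical and are not taken from [12], which does not specify them in the passage reviewed here.

```python
from typing import NamedTuple


class Hypothesis(NamedTuple):
    """One agent's hypothesis (all fields are illustrative assumptions)."""
    agent_name: str
    inter_beat_interval: float  # seconds between quarter-note beats
    next_beat_time: float       # predicted time of the next beat, seconds
    beat_type: str              # e.g. "strong" or "weak"
    reliability: float          # agent's own reliability estimate


def select_hypothesis(hypotheses):
    """Return the hypothesis with the highest self-assessed reliability."""
    return max(hypotheses, key=lambda h: h.reliability)


def to_beat_information(hypothesis):
    """Build a beat information (BI) record: beat time, beat type, tempo."""
    return {
        "beat_time": hypothesis.next_beat_time,
        "beat_type": hypothesis.beat_type,
        "tempo_bpm": 60.0 / hypothesis.inter_beat_interval,
    }
```

Given two agents with reliabilities 0.4 and 0.9, the manager would output the BI derived from the second agent; an inter-beat interval of 0.5 s corresponds to a tempo of 120 beats per minute.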
Figure 4.9 Goto and Muraoka's virtual dancers synchronized with musical beats [12]
This application shows that the system is useful in multimedia applications in which human-like hearing ability is desirable. They plan to upgrade the system by generalizing it to other musical genres and enabling it to follow tempo changes. They also intend to make use of other higher-level musical structure and to apply the system to various multimedia systems for which beat tracking is useful, such as systems for video/audio editing, controlling stage lighting, and synchronizing computer graphics with music.