4 Computer Models To Emulate Rhythm Perception

*  4.1 Tempo and Beat Analysis of Acoustic Musical Signals

*  4.2 Real-time Beat Tracking for Drumless Audio Signals

 

4       Computer Models To Emulate Rhythm Perception

Two papers were selected for review in this chapter.  Both describe real-time beat detectors that attempt to emulate human rhythm perception, and both were designed for high-level detection in music such as that sampled from compact discs.

4.1    Tempo and Beat Analysis of Acoustic Musical Signals

Scheirer [31] presented a computational algorithm whose behavior resembles that of human listeners when tracking the beat, or pulse, in a variety of musical situations.  The model shares certain features with existing theories of sound perception, which makes it attractive as a psychoacoustic model of tempo perception.  In this study, the beat is treated as the fundamental perceptual attribute of rhythm: the sequence of equally spaced phenomenal impulses that defines a tempo for the music.  The grouping and strong/weak relationships that define rhythm and meter were not considered.

His method rests on the observation that certain kinds of signal manipulation and simplification can be performed without affecting the perceived tempo and beat of a musical signal.  Consider the signal flow network in Figure 4.1, where an amplitude-modulated noise is constructed by vocoding a white-noise signal with the sub-band envelopes of a musical signal.  A sub-band analysis is performed on the music, and each band of a white-noise signal is modulated with the amplitude envelope of the corresponding band of the musical filterbank output.  The resulting noise signals are summed together to form an output signal.

Figure 4.1 Psychoacoustic simplification of rhythm perception. [31]

The psychoacoustic simplification is that only the amplitude envelopes of the filterbank are preserved, because only this information is needed to extract pulse and meter from a musical signal.  This suggests that musical notes are not necessary components for rhythm perception, and it represents a vast reduction in input data size compared with the original signal.  Certain other kinds of simplification are not possible, however.  It seems that separating the signal into sub-bands and maintaining the sub-band envelopes separately is necessary for accurate rhythmic processing.  No psychoacoustic experiments were conducted to examine exactly which filterbank or envelope manipulations leave rhythm perception undisturbed.  The results nevertheless suggested that a rhythmic processing algorithm should treat frequency bands separately and combine the results at the end, rather than attempt to perform beat tracking on the sum of the filterbank outputs.

Figure 4.2 shows an overall view of Scheirer’s tempo analysis algorithm as a signal flow network. 

Figure 4.2 Schematic view of the processing algorithm. [31]

As the signal comes in, a filterbank divides it into six bands.  For each of these sub-bands, the amplitude envelope is calculated and its derivative taken.  Each envelope derivative is passed on to another filterbank of tuned resonators.  In each resonator filterbank, one of the resonators will phase-lock: the one whose resonant frequency matches the rate of periodic modulation of the envelope derivative.  The outputs of the resonators are examined to see which ones exhibit phase-locked behavior, and this information is tabulated for each of the band-pass channels.  These tabulations are summed across the frequency filterbank to arrive at the frequency (tempo) estimate for the signal, and the peak phase points in the phase-locked resonators are then consulted to determine the phase of the signal.

The filterbank implementation in the algorithm has six bands; each band has sharp cutoffs and covers roughly a one-octave range. The lowest band is a low-pass filter with cutoff at 200 Hz; the next four bands are band-pass, with cutoffs at 200 and 400 Hz, 400 and 800 Hz, 800 and 1600 Hz, and 1600 and 3200 Hz.  The highest band is high pass, with cutoff frequency at 3200 Hz.  Each filter was implemented using a sixth-order elliptic filter, with 3 dB of ripple in the pass band and 40 dB of rejection in the stop band.  Figure 4.3 shows the magnitude responses of these filters.

Figure 4.3 Magnitude response of the frequency filterbank used in the system [31]

The envelope is extracted from each band of the filtered signal through a rectify-and-smooth method.  After this the first-order difference function is calculated and half-wave rectified; this rectified difference signal will be examined for periodic modulation.
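The envelope stage described above can be sketched in a few lines.  This is a minimal illustration rather than the implementation in [31]: the function name `onset_envelope`, the use of full-wave rectification, the one-pole low-pass smoother, and its `smooth_hz` cutoff are all assumptions, since the text only specifies a "rectify-and-smooth" method.

```python
import math


def onset_envelope(band_signal, sample_rate, smooth_hz=10.0):
    """Return (envelope, rectified_difference) for one filterbank band.

    Assumptions of this sketch: full-wave rectification and a one-pole
    low-pass smoother with cutoff smooth_hz (neither is fixed by the text).
    """
    # Coefficient of the assumed one-pole low-pass smoother.
    alpha = 1.0 - math.exp(-2.0 * math.pi * smooth_hz / sample_rate)

    envelope = []
    state = 0.0
    for x in band_signal:
        rectified = abs(x)                    # rectify
        state += alpha * (rectified - state)  # smooth
        envelope.append(state)

    # First-order difference, half-wave rectified: only increases survive.
    difference = [0.0]
    for previous, current in zip(envelope, envelope[1:]):
        difference.append(max(0.0, current - previous))
    return envelope, difference
```

Applied to a click track, the rectified difference is non-zero only where the envelope rises, that is, at the clicks themselves; this is the signal that is later examined for periodic modulation.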

Figure 4.4 shows the envelope extraction process for one frequency band of each of two signals.

Figure 4.4 Envelope extraction process. [31]

The top panels show the audio waveforms: a 2 Hz click track (left) and a polyphonic music example (right).  The middle panels show the envelopes, and the bottom panels the half-wave rectified difference of the envelopes.  The lowest filterbank band is shown for the click track, the second-highest for the music.

Comb filters are often used in reverberators and other sorts of audio signal processing.  They have properties that make them suitable for acting as resonators in the phase-locking pulse extraction process.  A comb filter with delay T responds more strongly to a signal with period T than to any other, since the response peaks of the filter line up with the frequency distribution of energy in the signal.  The beat tracking algorithm therefore uses a network of such resonators to phase-lock with the beat of the signal and determine the frequency of the pulse.

Thus after the envelope has been extracted and processed for each channel, a filterbank of comb filter resonators is implemented in which the delays vary by channel and cover the range of possible pulse frequencies to track.  The output of these resonator filterbanks is summed across frequency subbands.  By examining the energy output from each resonance channel of the summed resonator filterbanks, the strongest periodic component of the signal may be determined.  The frequency of the resonator with the maximum energy output is selected as the tempo of the signal.
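The resonator stage can be illustrated with a small sketch.  It is not the implementation in [31]: the names `comb_energy` and `estimate_tempo`, the fixed feedback gain `alpha`, the candidate tempo range, and the choice to sum energies (rather than outputs) across bands are assumptions made for illustration.

```python
import math


def comb_energy(signal, delay, alpha=0.9):
    """Energy of the comb filter y[n] = alpha*y[n-delay] + (1-alpha)*x[n].

    alpha is an assumed fixed feedback gain.
    """
    y = [0.0] * len(signal)
    energy = 0.0
    for n, x in enumerate(signal):
        feedback = y[n - delay] if n >= delay else 0.0
        y[n] = alpha * feedback + (1.0 - alpha) * x
        energy += y[n] * y[n]
    return energy


def estimate_tempo(band_envelopes, envelope_rate, min_bpm=80, max_bpm=180):
    """Return (bpm, delay) of the resonator with the largest summed energy.

    band_envelopes: one rectified envelope-difference signal per band.
    envelope_rate:  sampling rate of those signals in Hz.
    """
    shortest = math.ceil(60.0 * envelope_rate / max_bpm)
    longest = math.floor(60.0 * envelope_rate / min_bpm)
    best_delay, best_energy = None, -1.0
    for delay in range(shortest, longest + 1):
        # Sum the resonator energy for this delay across frequency bands.
        total = sum(comb_energy(band, delay) for band in band_envelopes)
        if total > best_energy:
            best_delay, best_energy = delay, total
    return 60.0 * envelope_rate / best_delay, best_delay
```

A comb whose delay is an integer multiple of the true period is also reinforced by the same pulse train, so the candidate range in this sketch is deliberately kept narrower than one octave on either side of the expected tempo.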

Figure 4.5 shows the summed filterbank output for a 2 Hz pulse train (top) and for a polyphonic music example (bottom).  The horizontal axes are labeled with "metronome marking" (MM) in beats per minute; that is, 120 MM = 2 Hz.  This is a direct mapping of the delay of the corresponding comb filter.  The polyphonic music shows more overall energy, but the tempo is still seen clearly as peaks in the curve.

Figure 4.5 Tempo estimation [31]
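The mapping between metronome marking, pulse frequency, and comb-filter delay is simple arithmetic, shown here as a short sketch.  The function names and the `rate` parameter (the sampling rate at which the comb filters run) are hypothetical.

```python
def bpm_to_hz(bpm):
    """Pulse frequency in Hz for a metronome marking in beats per minute."""
    return bpm / 60.0


def bpm_to_delay_seconds(bpm):
    """Comb-filter delay (one beat period) in seconds."""
    return 60.0 / bpm


def bpm_to_delay_samples(bpm, rate):
    """Comb-filter delay rounded to whole samples at the given rate."""
    return round(60.0 * rate / bpm)
```

For example, 120 MM corresponds to a 2 Hz pulse and a delay of 0.5 s.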

Once the tempo is known, the phase is determined by examining the output of the resonators directly or, even better, by examining the internal state of the delays of these filters.  The vector w of delays can be interpreted at a particular point in time as the "predicted output" of that resonator.  That is, the w vector contains the next n samples of envelope output that the filter would generate in response to zero input, where n is the period of the filter.  The delay vectors of the resonators corresponding to the tempo determined in the previous step are summed over all the frequency channels, and this sum is examined.

The peak of this prediction vector is taken as the estimate of when the next beat will arrive in the input.  The ratio $\omega = 2\pi (t_n - t)/T$, where $t_n$ is the time of the next predicted beat, $t$ the current time, and $T$ the period of the resonator, is the phase $\omega$ of the tempo being tracked.  The phase and period may be used to predict beat times as far into the future as desired.  Figure 4.6 shows the phase estimates after tracking 5 s of a 2 Hz click track (top) and of a polyphonic music example (bottom).

Figure 4.6 Phase estimation. [31]

The x-axis in each case covers the next full period of the resonator tracking the tempo and the peak of the curve shows where the next beat is predicted to occur.  The implementation of the model performs the phase analysis every 25 ms and integrates evidence between frames in order to predict beats.
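The phase step can be sketched as follows: the peak of the prediction vector gives the next beat, the phase is the ratio 2π(t_n − t)/T, and later beats follow by adding whole periods.  The function names are hypothetical, and treating index 0 of the vector as the current time is an assumption of this sketch.

```python
import math


def beat_phase(next_beat_time, current_time, period):
    """Phase 2*pi*(t_n - t)/T of the tracked tempo, in radians."""
    return 2.0 * math.pi * (next_beat_time - current_time) / period


def predict_beats(prediction_vector, current_time, sample_rate, count):
    """Predict `count` beat times from a resonator's prediction vector.

    The vector holds one period of predicted envelope output; its peak
    is taken as the next beat, and later beats are one period apart.
    """
    period = len(prediction_vector) / sample_rate
    peak_index = max(range(len(prediction_vector)),
                     key=prediction_vector.__getitem__)
    next_beat = current_time + peak_index / sample_rate
    return [next_beat + k * period for k in range(count)]
```

With a 50-sample vector at 100 Hz (a 0.5 s period) whose peak sits at index 20, the next beat falls 0.2 s after the current time and the following beats every 0.5 s thereafter.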

The performance of the algorithm was evaluated both qualitatively and quantitatively.  For the qualitative evaluation, 60 ecological music excerpts were run through the implemented algorithm using a short application that reads a sound sample from disk, causally beat-tracks it, and writes a new sound file with clicks (short noise bursts) added to the signal where beats are predicted to occur.  A selection of these sound files is available at: http://sound.media.mit.edu/people/eds/beat/results.html

Forty-one of 60 samples (68%) were qualitatively classified as being tracked accurately, and another 11 (18%) as being tracked somewhat accurately.  Based on these results the algorithm seems quite successful at tracking the musical beats.

For the quantitative evaluation, a short validation experiment was conducted to test whether the beat-tracking algorithm performed generally like a human listener.  Five adult listeners, experienced musicians with normal hearing, all graduate students and staff members at the MIT Media Laboratory, participated in the experiment.  Subjects listened through headphones to seven musical examples drawn from different musical genres.  They indicated their understanding of the beat by tapping along with the music on a computer keyboard.  All seven trials were run in the same sequence for each listener, in a single block.  The experiment was not counter-balanced, on the assumption that there is little training effect in this task.  The entire experiment took approximately 5 min per subject.  The results indicate that the algorithm was as regular as a human listener for five of the seven trials.

4.2    Real-time Beat Tracking for Drumless Audio Signals

Goto and Muraoka [12] presented a real-time beat tracking system that recognizes a hierarchical beat structure in musical audio signals without drum-sounds.  The system detects a beat structure of three rhythmic levels (the quarter note level, the half note level, and the measure level) in music sampled from popular compact discs (see Figure 4.7).  They proposed a method of detecting chord changes in order to make musical decisions about the audio signals using heuristic musical knowledge.  The purpose of the study was to build a beat-tracking system useful in applications such as music-synchronized CG animation, video/audio editing, and human-computer improvisation in live ensemble.

Figure 4.7 Beat-Tracking Problem [12]

They defined beat times as the temporal positions of almost regularly spaced beats corresponding to quarter notes; the sequence of beat times is called the quarter note level.  The system then finds the beginnings of half notes and measures.  The sequence of half note times is obtained by determining whether each beat is strong or weak.

The beat-tracking system for musical audio signals without drum-sounds provides a real-time output called beat information (BI), which consists of the beat time, its beat type, and the current tempo.  Figure 4.8 shows the system.

Figure 4.8 Overview of Goto and Muraoka’s beat-tracking system [12]

The system first digitizes an input audio signal in the A/D conversion stage.  Then, in the frequency analysis stage, multiple onset-time finders detect onset times in different ranges of the frequency spectrum, and onset-time vectorizers transform those results into a vectorial representation (called onset-time vectors).  In the beat prediction stage, the system manages multiple agents that, according to different strategies, make parallel hypotheses based on those onset-time vectors.  Each agent first calculates the inter-beat interval and predicts the next beat time.  By communicating with a chord change checker, it then determines the beat types and evaluates the reliability of its own hypothesis.  A hypotheses manager gathers all hypotheses and determines the final output on the basis of the most reliable one.  Finally, in the beat information (BI) transmission stage, the system transmits BI to application programs via a computer network.

The method detects chord changes by analyzing the frequency spectrum sliced at provisional beat times.  The results show that the beat detection rates obtained with real-world audio signals were more than 87.5%, and that both the chord-change detection and the musical decisions based on chord changes were effective enough to contribute to determining the hierarchical beat structure comprising the three rhythmic levels.

They also developed an application that displays real-time computer graphics dancers whose motions change in time to musical beats (Figure 4.9).

Figure 4.9 Goto and Muraoka’s virtual dancers synchronized with musical beats [12]

This application shows that the system is useful in multimedia applications in which human-like hearing ability is desirable.  The authors plan to upgrade the system by generalizing it to other musical genres and enabling it to follow tempo changes.  They also intend to exploit other higher-level musical structure and to apply the system to various multimedia systems for which beat tracking is useful, such as systems for video/audio editing, controlling stage lighting, and synchronizing various computer graphics with music.
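The agent and hypotheses-manager arrangement of the beat prediction stage can be pictured with a small sketch.  It is only an illustration of the control flow, not the method in [12]: the names `Hypothesis`, `agent_hypothesis`, and `select_output` are hypothetical, the inter-beat interval is taken as the median spacing of a plain list of onset times instead of being derived from onset-time vectors, and the reliability score is supplied by the caller instead of coming from a chord change checker.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Hypothesis:
    next_beat_time: float       # predicted time of the next beat (s)
    inter_beat_interval: float  # estimated beat period (s)
    reliability: float          # placeholder score, higher is better


def agent_hypothesis(onset_times, reliability):
    """One agent: estimate the inter-beat interval and predict the next beat.

    Assumption: the interval is the median spacing between onsets.
    """
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    interval = median(intervals)
    return Hypothesis(onset_times[-1] + interval, interval, reliability)


def select_output(hypotheses):
    """Hypotheses manager: output the most reliable hypothesis."""
    return max(hypotheses, key=lambda h: h.reliability)
```

In this sketch the current tempo reported in the beat information would be 60 divided by the selected hypothesis's inter-beat interval.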