Every Beats song starts out life as a YAML file, and if lucky, undergoes a metamorphosis into a beautiful Wave file. Let’s look at what happens during this process.
When you install Beats, it adds bin/beats to your path. This is the entry point. It collects the command line arguments, and then calls Beats.run() (located in lib/beats.rb), which is the real driver of the program. When Beats.run() finishes, bin/beats displays an exit message, or alternately lists any errors that occurred.
Beats.run() shepherds the transformation of the YAML file into a Wave file, by calling the appropriate code to parse the YAML file, normalize the song into a standard format, convert the song into an equivalent song that can be generated faster, generate the song’s audio data, and save it to disk. For more info on each of these steps, read on below.
The SongParser object parses a raw YAML song file and converts it into domain objects. It contains a single public method, parse(). The input is a raw YAML string, and the output is a Song object and a Kit object.

A Song object is a container for Pattern objects (which are in turn containers for Track objects). It also stores the song flow (i.e. the order that patterns should be played). The flow is internally represented as an array of symbols. For example, when this song is parsed:
Song:
  Flow:
    - Verse:  x2
    - Chorus: x4
    - Bridge: x1
    - Chorus: x1
the resulting Song object will have this flow:
[:verse, :verse, :chorus, :chorus, :chorus, :chorus, :bridge, :chorus]
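As a rough illustration, the expansion above could be implemented like this. The method name expand_flow and the input layout are assumptions for this sketch, not Beats’ actual implementation:

```ruby
# Expand a parsed flow (a list of {"PatternName" => "xN"} entries, as
# YAML would produce them) into the flat array of symbols shown above.
def expand_flow(flow_entries)
  flow_entries.flat_map do |entry|
    entry.flat_map do |pattern_name, count|
      # "Verse" / "x2" becomes :verse repeated twice
      repeat = count.to_s.delete("x").to_i
      [pattern_name.downcase.to_sym] * repeat
    end
  end
end

flow = [{"Verse" => "x2"}, {"Chorus" => "x4"}, {"Bridge" => "x1"}, {"Chorus" => "x1"}]
expand_flow(flow)
# => [:verse, :verse, :chorus, :chorus, :chorus, :chorus, :bridge, :chorus]
```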
A Kit object provides access to the raw sample data for each sound used in the song.
The nice thing about YAML is that it’s easy for humans to read and write, and support for parsing it is built into Ruby. For example, reading a YAML file from disk and converting it into a Ruby hash can be accomplished with one line of code:
hash = YAML.load(File.read("my_yaml_file.txt"))
Despite this, SongParser is still 200 or so lines long. This code is responsible for validating the parsed YAML file, and converting the raw hash into the Song and Kit domain objects.
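A minimal sketch of that parse-then-validate flow might look like the following. The method name parse_song, the validation rules, and returning a plain hash (rather than Song and Kit objects) are all simplifications for illustration:

```ruby
require "yaml"

# Load the YAML into a hash, then check for the sections the rest of
# the program depends on before handing the data back.
def parse_song(yaml_string)
  hash = YAML.load(yaml_string)
  raise ArgumentError, "No 'Song' section found" unless hash.is_a?(Hash) && hash.key?("Song")
  raise ArgumentError, "Song has no flow"        unless hash["Song"].key?("Flow")
  hash
end

song = parse_song(<<~YAML)
  Song:
    Flow:
      - Verse: x2
YAML
# song["Song"]["Flow"] is now [{"Verse" => "x2"}]
```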
After the YAML file is parsed and converted into a Song object, the Song object is normalized to a standard format. This is done to allow the audio engine to be simpler.
As far as the audio engine knows, there is only one type of song in the universe: one in which all patterns in the flow are played, and all tracks are mixed together and written to a single Wave file. If that’s the case though, then how do we deal with the -p option, which only writes a single pattern to the Wave file? Or the -s option, which saves each track to a separate Wave file? You guessed it: normalization.
For example, when the -p option is used, the Song returned from SongParser is modified so that the flow only contains a single performance of the specified pattern. All other patterns are removed from the flow, and in fact from the song itself. Consider this song:
Song:
  Flow:
    - Verse:  x2
    - Chorus: x4
    - Verse:  x2
    - Chorus: x4
Verse:
  - bass.wav:  X...X...X...X...
  - snare.wav: ....X.......X...
Chorus:
  - bass.wav:  X.............X.
  - snare.wav: ....X.........X.
After normalization with the -p Verse option, it becomes:
Song:
  Flow:
    - Verse: x1
Verse:
  - bass.wav:  X...X...X...X...
  - snare.wav: ....X.......X...
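The -p transformation above can be sketched in a few lines. The hash-based song layout and the method name normalize_for_pattern are illustrative assumptions; the real code operates on Song objects:

```ruby
# Keep a single play of the requested pattern, and drop every other
# pattern from both the flow and the song itself.
def normalize_for_pattern(song, pattern_name)
  {
    flow:     [pattern_name],   # play it once
    patterns: song[:patterns].select { |name, _| name == pattern_name }
  }
end

song = {
  flow:     [:verse, :verse, :chorus, :chorus],
  patterns: {
    verse:  {"bass.wav" => "X...X..."},
    chorus: {"bass.wav" => "X......."}
  }
}
normalize_for_pattern(song, :verse)
# the flow is now just [:verse], and only the verse pattern remains
```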
When the -s option is used, the Song is split into multiple Songs that each contain a single track. If the Song has a total of 5 tracks spread out over a few patterns, it will be split into 5 different Song objects that each contain a single track.
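A sketch of that split, again using an illustrative hash layout rather than Beats’ actual Song API:

```ruby
# Split one song's patterns into several single-track songs: one new
# song per distinct track name, keeping only that track in every
# pattern that contains it.
def split_by_track(patterns)
  track_names = patterns.values.flat_map(&:keys).uniq
  track_names.map do |track_name|
    patterns.each_with_object({}) do |(pattern_name, tracks), new_patterns|
      next unless tracks.key?(track_name)
      new_patterns[pattern_name] = {track_name => tracks[track_name]}
    end
  end
end

patterns = {
  verse:  {"bass.wav" => "X...X...", "snare.wav" => "....X..."},
  chorus: {"bass.wav" => "X.....X."}
}
split_by_track(patterns)   # two songs: one for bass.wav, one for snare.wav
```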
The benefit of song normalization is to move complexity out of the audio domain and into the Ruby domain, where it is easier to deal with. For example, the output of the audio engine is arrays of integers thousands or even millions of elements long. If a test fails, it can be hard to tell why one long list of integers doesn’t match the expected long list of integers. Song normalization reduces the number of tests of this type that need to be written. Normalization also allows the audio engine to be optimized more easily, by making the implementation simpler. Since the audio engine is where almost all of the run time is located, this is a win.
In contrast, normalizing
Song objects is generally straightforward, easy to understand, and easy to test. For example, it’s usually simple to build a test that verifies hash A is transformed into hash B.
After the Song object is normalized, the normalized Song(s) are further transformed into equivalent Song objects whose audio data can be generated more quickly by the audio engine.
As of version 2.1.0, optimization consists of two steps: breaking patterns into smaller pieces, and consolidating identical patterns.
Performance tests show that (for example) generating audio for four 4-step patterns is faster than generating a single 16-step pattern. Generally, dealing with shorter arrays of sample data appears to be faster than dealing with really long arrays.
Replacing two patterns that have the same tracks with a single pattern allows for better caching by the audio engine. The audio engine will only ever generate the audio data for a pattern once, and will rely on a cached version each subsequent time it is played. If you have two patterns that are identical, the audio engine can end up generating audio data from scratch more often than necessary.
Humans are probably not too likely to define identical patterns in a song. However, breaking patterns into smaller pieces can often allow pattern consolidation to “detect” duplicated rhythms inside (or across) patterns. So, these two optimizations actually work in concert. The nice thing about this is that the caching algorithm is really dumb and simple to implement, but is still effective.
The SongOptimizer class is used to perform optimization. It contains a single public method, optimize(), which takes a Song object and returns an optimized Song object.
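The pattern-consolidation idea can be sketched like this. The method name consolidate and the hash layout are hypothetical; the point is that two patterns with identical tracks can share one name in the flow, so the cache only generates them once:

```ruby
# Map each unique track layout to the first pattern name that used it,
# then rewrite the flow and drop the duplicate pattern definitions.
def consolidate(flow, patterns)
  canonical    = {}   # track layout => first pattern name with that layout
  replacements = {}   # pattern name => canonical name
  patterns.each do |name, tracks|
    canonical[tracks] ||= name
    replacements[name] = canonical[tracks]
  end
  new_flow     = flow.map { |name| replacements[name] }
  new_patterns = patterns.select { |name, _| replacements[name] == name }
  [new_flow, new_patterns]
end

flow     = [:intro, :verse, :outro]
patterns = {
  intro: {"bass.wav" => "X..."},
  verse: {"bass.wav" => "X.X."},
  outro: {"bass.wav" => "X..."}   # identical to intro
}
new_flow, new_patterns = consolidate(flow, patterns)
# new_flow is [:intro, :verse, :intro]; the :outro definition is gone
```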
Before reading this section, it might be helpful to read up on the basics of digital audio, if you aren’t already familiar with them.
All right! We’ve now parsed our YAML file into Song and Kit domain objects, converted the Song into a canonical format, and optimized it. Now we’re ready to actually generate some audio data.
At a high level, generating the song consists of iterating through the flow, generating the sample data for each pattern (or pulling it from cache), and then writing it to disk. The two main classes involved in this are AudioEngine and AudioUtils. AudioEngine is the main driver for generating audio data for each pattern, and writing it to disk. AudioUtils, as the name suggests, contains some general utility methods for working with audio data.
Audio generation begins at the track level. First, an array is created with enough samples for each step in the track at the specified tempo. Each sample is initialized to 0. Then, the sample data for the track’s sound is “painted” onto the array at the appropriate places. The method that does all this is AudioEngine.generate_track_sample_data().
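A rough sketch of the “painting” step, with a made-up method name and toy numbers. The slice assignment is the kind of single Ruby statement that keeps most of the work inside the Ruby VM’s C code (a point the caching discussion below comes back to):

```ruby
# Paint a sound's samples onto a zeroed track array. rhythm is a string
# like "X.X."; step_length is the number of samples per step at the
# current tempo.
def paint_track(rhythm, sound_samples, step_length)
  track = [0] * (rhythm.length * step_length)
  rhythm.each_char.with_index do |char, step|
    next unless char == "X"
    start = step * step_length
    # One slice assignment copies the whole sound into place
    track[start, sound_samples.length] = sound_samples
  end
  track
end

paint_track("X.X.", [9, 9], 2)
# => [9, 9, 0, 0, 9, 9, 0, 0]
```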
Generating the sample data for a pattern consists of generating the sample data for each of its tracks, and then mixing them into a single sample array. This is done using AudioUtils.composite(), which sums the corresponding samples from each array together. Each sample in the resulting array is then divided by a certain amount to prevent clipping.
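The compositing step might look something like this. The real AudioUtils.composite() may differ in details; in particular, dividing each sum by the number of tracks is just one plausible scaling rule:

```ruby
# Sum the corresponding samples from each track's array, then scale
# the result down to reduce the chance of clipping.
def composite(sample_arrays)
  return [] if sample_arrays.empty?
  length     = sample_arrays.map(&:length).max
  num_tracks = sample_arrays.length
  (0...length).map do |i|
    sum = sample_arrays.sum { |samples| samples[i] || 0 }
    sum / num_tracks   # assumed scaling rule, for illustration
  end
end

composite([[100, 200, 300], [100, 0, -100]])
# => [100, 100, 100]
```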
Once each pattern is generated, it can be written to disk. The WaveFile classes handle the details of creating the output Wave file and writing lists of Fixnum samples to the file in the correct format.
One complication that arises is that the last sound triggered in a track can extend past the track’s end (and therefore also its parent pattern’s end). If this is not accounted for, sounds will suddenly cut off once the track or the parent pattern ends. This can especially be a problem after song optimization, since optimization introduces additional sub-patterns; during playback, sounds will continually cut off at seemingly random times.
To facilitate dealing with overflow,
AudioEngine.generate_track_sample_data() actually returns two sample arrays: one containing the samples that occur during the normal track playback, and another containing samples that overflow into the next pattern.
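The two-array split can be sketched as follows (split_overflow is a made-up name for illustration):

```ruby
# Split a track's painted samples into the part that fits within the
# track's length and the part that spills past the end into the next
# pattern.
def split_overflow(samples, track_length)
  primary  = samples[0...track_length]
  overflow = samples[track_length..-1] || []   # nil when nothing spills over
  [primary, overflow]
end

primary, overflow = split_overflow([1, 2, 3, 4, 5], 4)
# primary is [1, 2, 3, 4]; overflow is [5]
```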
When pattern audio is generated, overflow needs to be accounted for at the beginning and end of the pattern.
AudioEngine.generate_pattern_sample_data() requires a hash of the overflow from each track in the preceding pattern in the flow, so that it can be inserted at the beginning of each track in the current pattern. This prevents sounds from cutting off each time a new pattern starts. The method must also return a hash of the pattern’s outgoing overflow, in addition to the composited primary sample data, so that the next pattern in the flow can use it.
Patterns are often played more than once in a song. After a pattern is generated the first time, it is cached so it doesn’t have to be generated again.
There are actually two levels of pattern caching. The first level caches the result of compositing a pattern’s tracks together. The second level caches the composited sample data after it has been converted into native Wave file format.
The reason for these two different caches has to do with overflow. When caching composited sample data, you have to store it in a format that allows arbitrary incoming overflow to be applied at the beginning. Once sample data is converted into Wave file format, you can’t do this; cached data in Wave file format is tied to specific incoming overflow. So if a pattern occurs in a song 5 times with different incoming overflow each time, there will be a single copy in the 1st cache (with no overflow applied), and 5 copies in the 2nd cache (each with different overflow applied).
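The two cache levels could be structured like this. The class and method names are hypothetical, as are the cache keys: the composited cache keyed by pattern name alone, the Wave-format cache keyed by pattern name plus the incoming overflow applied to it:

```ruby
# Two-level pattern cache: composited samples are reusable for any
# incoming overflow, while wave-format data is tied to the specific
# overflow that was applied before conversion.
class PatternCache
  def initialize
    @composited = {}   # name             => composited sample array
    @wave_data  = {}   # [name, overflow] => wave-format data
  end

  def composited(name, &generate)
    @composited[name] ||= generate.call   # generated at most once per pattern
  end

  def wave_data(name, incoming_overflow, &convert)
    @wave_data[[name, incoming_overflow]] ||= convert.call
  end
end

cache = PatternCache.new
cache.composited(:verse) { [1, 2, 3] }       # generated
cache.composited(:verse) { raise "never" }   # served from cache, block not run
```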
Track sample data is not cached, since performance tests show this only gives a very small performance improvement. Generating an individual track is relatively fast; it is compositing tracks together which is slow. This makes sense because painting sample data onto an array can be done with a single Ruby statement (and thus the bulk of the work and iteration is done at the C level inside the Ruby VM), whereas compositing sample data must be done at the Ruby level one sample at a time.
© 2010-17 Joel Strait