How It Works

Every Beats song starts out life as a YAML file, and if lucky, undergoes a metamorphosis into a beautiful Wave file. Let’s look at what happens during this process.

Getting Started

When you install Beats, it adds bin/beats to your path. This is the entry point. It collects the command-line arguments and then calls Beats.run() (located in lib/beats.rb), which is the real driver of the program. When Beats.run() returns, bin/beats displays an exit message, or alternatively lists any errors that occurred.

Beats.run() shepherds the transformation of the YAML file into a Wave file by calling the appropriate code to parse the YAML file, normalize the song into a standard format, convert the song into an equivalent song that can be generated faster, generate the song’s audio data, and save it to disk. For more info on each of these steps, read on below.
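
At a high level, the pipeline looks roughly like this. This is only a sketch, not the actual Beats.run() implementation; the helper names are hypothetical and the real signatures differ.

# Sketch of the pipeline only; real method signatures differ.
song, kit = SongParser.new.parse(File.read(input_file_name))
song = normalize_song(song, options)            # hypothetical helper; applies e.g. -p / -s handling
song = SongOptimizer.new.optimize(song)
AudioEngine.new(song, kit).write_to_file(output_file_name)   # hypothetical method name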

Song Parsing

The SongParser object parses a raw YAML song file and converts it into domain objects. It contains a single public method, parse(). The input is a raw YAML string, and the output is a Song object and a Kit object.

A Song object is a container for Pattern objects (which are in turn containers for Track objects). It also stores the song flow (i.e. the order that patterns should be played). The flow is internally represented as an array of symbols. For example, when this song is parsed:

Song:
  Flow:
    - Verse:   x2
    - Chorus:  x4
    - Bridge:  x1
    - Chorus:  x1

the resulting Song object will have this flow:

[:verse, :verse, :chorus, :chorus, :chorus, :chorus, :bridge, :chorus]
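
Here is a minimal sketch (not the actual SongParser code) of how flow entries like "Verse: x2" can be expanded into that array of symbols:

raw_flow = [["Verse", "x2"], ["Chorus", "x4"], ["Bridge", "x1"], ["Chorus", "x1"]]
flow = []
raw_flow.each do |pattern_name, repeat_count|
  repeat_count.delete("x").to_i.times { flow << pattern_name.downcase.to_sym }
end
flow  # => [:verse, :verse, :chorus, :chorus, :chorus, :chorus, :bridge, :chorus]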

A Kit object provides access to the raw sample data for each sound used in the song.

The nice thing about YAML is that it’s easy for humans to read and write, and support for parsing it is built into Ruby. For example, reading a YAML file from disk and converting it into a Ruby hash can be accomplished with one line of code:

hash = YAML.load(File.read("my_yaml_file.txt"))

Despite this, SongParser is still 200 or so lines long. This code is responsible for validating the parsed YAML file, and converting the raw hash into the Song and Kit domain objects.

Song Normalization

After the YAML file is parsed and converted into a Song and Kit, the Song object is normalized to a standard format. This is done to allow the audio engine to be simpler.

As far as the audio engine knows, there is only one type of song in the universe: one in which every pattern in the flow is played, and all tracks are mixed together and written to a single Wave file. If that’s the case, though, then how do we deal with the -p option, which only writes a single pattern to the Wave file? Or the -s option, which saves each track to a separate Wave file? You guessed it: normalization.

For example, when the -p option is used, the Song returned from SongParser is modified so that the flow only contains a single performance of the specified pattern. All other patterns are removed from the flow, and in fact from the song itself.

Original Song:

Song:
  Flow:
    - Verse: x2
    - Chorus: x4
    - Verse: x2
    - Chorus: x4

Verse:
  - bass.wav:   X...X...X...X...
  - snare.wav:  ....X.......X...

Chorus:
  - bass.wav:   X.............X.
  - snare.wav:  ....X.........X.

After Normalization For -p Verse Option:

Song:
  Flow:
    - Verse: x1

Verse:
  - bass.wav:   X...X...X...X...
  - snare.wav:  ....X.......X...
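
In code, this boils down to trimming the flow and the pattern list. Here is a conceptual sketch; the flow and patterns accessors (and the helper itself) are hypothetical names, not the actual Song API:

# Conceptual sketch of -p normalization (accessor names are assumed)
def normalize_for_single_pattern(song, pattern_name)
  song.flow = [pattern_name]                          # play the pattern once
  (song.patterns.keys - [pattern_name]).each do |unused_pattern|
    song.patterns.delete(unused_pattern)              # drop everything else
  end
  song
end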
 

When the -s option is used, the Song is split into multiple Songs that contain a single track. If the Song has a total of 5 tracks spread out over a few patterns, it will be split into 5 different Song objects that each contain a single Track.
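
A sketch of that split, again with hypothetical accessor and method names (the real Song and Pattern classes may expose this differently):

# Conceptual sketch of -s normalization: build one Song per track name,
# keeping only that track in each pattern.
def split_into_single_track_songs(song)
  track_names = song.patterns.values.flat_map { |pattern| pattern.tracks.keys }.uniq
  track_names.map do |track_name|
    single = Marshal.load(Marshal.dump(song))         # crude deep copy, for the sketch only
    single.patterns.each_value do |pattern|
      pattern.tracks.keep_if { |name, _track| name == track_name }
    end
    single
  end
end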

The benefit of song normalization is to move complexity out of the audio domain and into the Ruby domain, where it is easier to deal with. For example, the output of the audio engine is arrays of integers thousands or even millions of elements long. If a test fails, it can be hard to tell why one long list of integers doesn’t match the expected long list of integers. Song normalization reduces the number of tests of this type that need to be written. Normalization also allows the audio engine to be optimized more easily, by making the implementation simpler. Since the audio engine is where almost all of the run time is located, this is a win.

In contrast, normalizing Song objects is generally straightforward, easy to understand, and easy to test. For example, it’s usually simple to build a test that verifies hash A is transformed into hash B.
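
Such a test might look something like this, assuming a Test::Unit-style setup and the hypothetical normalize_for_single_pattern() helper sketched earlier:

def test_flow_is_collapsed_for_p_option
  normalized_song = normalize_for_single_pattern(original_song, :verse)
  assert_equal([:verse], normalized_song.flow)
  assert_equal([:verse], normalized_song.patterns.keys)
end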

Song Optimization

After the Song object is normalized, the normalized Song(s) are further transformed into equivalent Song objects whose audio data can be generated more quickly by the audio engine.

As of version 2.1.0, optimization consists of two steps:

  1. Breaking patterns into smaller pieces
  2. Replacing two different patterns that are equivalent with a single canonical pattern

Performance tests show that (for example) generating audio for four patterns that are 4 steps long is faster than generating audio for a single 16-step pattern. In general, dealing with shorter arrays of sample data appears to be faster than dealing with really long arrays.

Replacing two patterns that have the same tracks with a single pattern allows for better caching by the audio engine. The audio engine only ever generates the audio data for a pattern once, and relies on a cached version each subsequent time it is played. If you have two patterns that are identical, the audio engine can end up generating audio data from scratch more often than is necessary.

Humans are probably not too likely to define identical patterns in a song. However, breaking patterns into smaller pieces can often allow pattern consolidation to “detect” duplicated rhythms inside (or across) patterns. So, these two optimizations actually work in concert. The nice thing about this is that the caching algorithm is really dumb and simple to implement, but still effective.
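
As a contrived illustration (not actual optimizer output, and the real sub-pattern length may differ), consider these two patterns:

Verse:
  - bass.wav:   X...X...X...X...
  - snare.wav:  ........X...X...

Chorus:
  - bass.wav:   X.X.X.X.X...X...
  - snare.wav:  X.X.X.X.X...X...

Neither pattern duplicates the other as a whole. But if each is broken into two 8-step halves, the second half of Verse and the second half of Chorus turn out to be identical, so they can be replaced with a single shared sub-pattern whose audio only has to be generated (and cached) once.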

The SongOptimizer class is used to perform optimization. It contains a single public method, optimize(), which takes a Song object and returns an optimized Song object.

Audio Generation

Before reading this section, it might be helpful to read up on the basics of digital audio, if you aren’t already familiar with them.

All right! We’ve now parsed our YAML file into Song and Kit domain objects, converted the Song into a canonical format, and optimized it. Now we’re ready to actually generate some audio data.

At a high level, generating the song consists of iterating through the flow, generating the sample data for each pattern (or pulling it from cache), and then writing it to disk. The two main classes involved in this are AudioEngine and AudioUtils. AudioEngine is the main driver for generating audio data for each pattern, and writing it to disk. AudioUtils, as the name suggests, contains some general utility methods for working with audio data.

Audio generation begins at the track level. First, an array is created with enough samples for each step in the track at the specified tempo. Each sample is initialized to 0. Then, the sample data for the track’s sound is “painted” onto the array at the appropriate places. The method that does all this is AudioEngine.generate_track_sample_data().
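
A simplified sketch of that painting step (the variable names here are illustrative, not the actual generate_track_sample_data() internals):

# samples_per_step is derived from the tempo and sample rate; rhythm is a
# string such as "X...X...X...X...", and sound_samples holds the sound's
# raw sample data from the Kit.
track_samples = Array.new(samples_per_step * rhythm.length, 0)
rhythm.each_char.with_index do |step, index|
  if step == "X"
    start = index * samples_per_step
    track_samples[start, sound_samples.length] = sound_samples
  end
end

(This sketch lets a sound spill past the end of the array; how that spillover is actually dealt with is described in the “Handling Overflow” section below.)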

Generating the sample data for a pattern consists of generating the sample data for each of its tracks, and then mixing them into a single sample array. This is done using AudioUtils.composite(), which sums the corresponding samples from each array together. Each sample in the resulting array is then divided by a certain amount to prevent clipping.
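
A rough sketch of compositing (the real AudioUtils.composite() may handle details such as differing track lengths and the scaling factor differently):

# Sum the corresponding samples from each track, then scale the result down
# to reduce the chance of clipping.
def composite(track_sample_arrays)
  longest_length = track_sample_arrays.map(&:length).max
  (0...longest_length).map do |i|
    sum = track_sample_arrays.reduce(0) { |total, samples| total + (samples[i] || 0) }
    sum / track_sample_arrays.length
  end
end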

Once each pattern is generated, it can be written to disk. The BeatsWaveFile and WaveFile classes handle the details of creating the output Wave file and writing lists of Fixnum samples to the file in the correct format.

Handling Overflow

One complication that arises is that the last sound triggered in a track can extend past the track’s end (and therefore also past the end of its parent pattern). If this is not accounted for, sounds will suddenly cut off when the track or parent pattern ends. This can be an especially noticeable problem after song optimization, since optimization introduces additional sub-pattern boundaries; without overflow handling, sounds would continually cut off at seemingly random times during playback.

To facilitate dealing with overflow, AudioEngine.generate_track_sample_data() actually returns two sample arrays: one containing the samples that occur during the normal track playback, and another containing samples that overflow into the next pattern.
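
Conceptually, the split looks like this (names are illustrative):

# primary_samples covers the track's own steps; overflow_samples is whatever
# spills past the end of the last step.
track_length     = samples_per_step * rhythm.length
primary_samples  = full_samples[0...track_length]
overflow_samples = full_samples[track_length..-1] || []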

When pattern audio is generated, overflow needs to be accounted for at the beginning and end of the pattern. AudioEngine.generate_pattern_sample_data() requires a hash of the overflow from each track in the preceding pattern in the flow, so that it can be inserted at the beginning of each track in the current pattern. This prevents sounds from cutting off each time a new pattern starts. The method also returns a hash of the pattern’s outgoing overflow, in addition to the composited primary sample data, so that the next pattern in the flow can use it.
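
The shape of that hand-off, sketched with illustrative names (the actual AudioEngine signatures differ):

# Overflow is keyed per track, so each track's spillover lands at the start
# of the same track in the next pattern in the flow.
incoming_overflow = {}
song.flow.each do |pattern_name|
  pattern_samples, incoming_overflow =
    generate_pattern_sample_data(patterns[pattern_name], incoming_overflow)
  # ...write or buffer pattern_samples...
end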

Performance Improvements Through Caching

Patterns are often played more than once in a song. After a pattern is generated the first time, it is cached so it doesn’t have to be generated again.

There are actually two levels of pattern caching. The first level caches the result of compositing a pattern’s tracks together. The second level caches the composited sample data after it has been converted into native Wave file format.

The reason for these two different caches has to do with overflow. The problem is that when caching composited sample data, you have to store it in a format that allows arbitrary incoming overflow to be applied at the beginning. Once sample data is converted into Wave file format, you can’t do this; cached data in Wave file format is tied to specific incoming overflow. So if a pattern occurs in a song 5 times with different incoming overflow each time, there will be a single copy in the first cache (with no overflow applied), and 5 copies in the second cache (each with different overflow applied).
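
A conceptual sketch of the two cache levels (the names are illustrative, not the actual AudioEngine internals):

# The composited cache depends only on the pattern itself; the Wave-format
# cache also depends on the incoming overflow that was mixed into the start
# of the pattern.
composited_cache[pattern_name] ||= composite_pattern_tracks(pattern)
wave_format_cache[[pattern_name, incoming_overflow]] ||=
  convert_to_wave_format(composited_cache[pattern_name], incoming_overflow)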

Track sample data is not cached, since performance tests show this only gives a very small performance improvement. Generating an individual track is relatively fast; it is compositing tracks together which is slow. This makes sense because painting sample data onto an array can be done with a single Ruby statement (and thus the bulk of the work and iteration is done at the C level inside the Ruby VM), whereas compositing sample data must be done at the Ruby level one sample at a time.

© 2010-17 Joel Strait