Assignment 6: Let It Bee (30 Points)

Chris Tralie

Due Thursday 5/13/2021

Click here to listen to musical statements!

Overview / Logistics

We saw in class that it is possible to use Nonnegative Matrix Factorization to decompose an audio clip into a set of K sonic source templates stored in a win_length x K matrix W, as well as a matrix of activations over time for each of these sources stored in a K x nwin matrix H, so that the matrix multiplication WH approximates an absolute magnitude spectrogram V (Click here to review the code that does this). The main application we focused on was "unmixing," or separating an audio track into its different instrument components. This is also sometimes referred to as the "cocktail party problem," since we're trying to filter out one sound from the superposition of many, just like one might try to focus on the speech coming from the person in front of them in the midst of a cacophony of sound at a cocktail party.

In addition to audio unmixing and learning musical instrument templates from audio, the mathematics that were developed for NMF can also be used to create a novel instrument for musical expression. In addition to being given a spectrogram V, we are also given the matrix W of templates, which remains fixed, and our job is only to learn H. In this way, we can think of the problem not as one of unmixing, but of learning how to activate a set of templates we're given to best match a target V. This is referred to as "musaicing" in this paper, and you will be following that paper in this assignment. The musaicing technique in that paper is referred to as the "Let It Bee" technique, and it earned its name from a demonstration that uses the spectrogram of a clip of "Let It Be" by The Beatles as V and the spectrogram of a bunch of bees buzzing as W.

V: Let It Be Spectrogram

W: Bees Buzzing Spectrogram

Learning H and inverting W*H: Let It Bee

In this assignment, you will implement the "Let It Bee" pipeline step by step and then use it to create musical statements. As Ben Cantil mentioned in his video for our class, you will be some of the first to compose music in this style.

Learning Objectives

  • Practice numpy arrays, methods, and for loops in the service of musical applications
  • Modify algorithms for nonnegative matrix factorization
  • Programmatically promote temporal continuity in audio reconstructions
  • Compose music in a cutting edge style

What To Submit

When you are finished, submit your python file NMF.py to Canvas, as well as an audio file for your musical statement and all of the audio files that are needed to run the code you used to create that statement. Finally, indicate a title for your musical statement, the name/pseudonym you'd like to use in the music gallery on our class web site, and the names of any buddies you worked with.


Programming Tasks

Click here to download the starter code for this assignment. In all of the tasks below, you will be editing the create_musaic method in the NMF.py file.

Below are the imports you will need in Jupyter:
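The original import list isn't reproduced here; the following is a minimal guess based on the libraries used throughout this assignment (numpy, librosa, matplotlib, and IPython for audio playback), plus the provided NMF.py:

```python
import numpy as np
import matplotlib.pyplot as plt
import librosa
import IPython.display as ipd
from NMF import *  # the starter code you will be editing
```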

You will be editing the method create_musaic in NMF.py. The provided code simply performs a number of iterations of the KL-based nonnegative matrix factorization, updating the H matrix only and keeping W fixed, and then takes the inverse STFT of the complex version of W multiplied by the learned H. You can run it by using librosa's stft method to get a complex short-time Fourier transform of both the bees source and the "Let It Be" target, as sketched below.
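Here is a rough sketch of that setup. The file names, sample rate, hop length, and the exact signature of create_musaic are assumptions for illustration only; check the starter code for the actual arguments it expects.

```python
import numpy as np
import librosa
from NMF import create_musaic  # signature assumed below; see the starter code

win_length = 2048
hop_length = 1024
sr = 22050

# File names are placeholders for whatever source/target clips you use
y_source, sr = librosa.load("Bees_Buzzing.mp3", sr=sr)  # the bees (source templates)
y_target, sr = librosa.load("LetItBe.mp3", sr=sr)        # "Let It Be" (target)

# WComplex holds the complex STFT of the source; V is the target's magnitude spectrogram
WComplex = librosa.stft(y_source, n_fft=win_length, hop_length=hop_length)
V = np.abs(librosa.stft(y_target, n_fft=win_length, hop_length=hop_length))

# Learn H against V with W = |WComplex| fixed, then invert WComplex*H
# (the keyword for the number of iterations is an assumption)
y = create_musaic(V, WComplex, win_length, hop_length, L=50)
```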

The result is as follows.

This leaves a lot to be desired, so we will be improving it step by step in the assignment.

Interestingly, as explained in the paper, a lot of what we do will end up taking us further away from a minimum of the objective function as a fit to the target, but this is good from a "musaicing" standpoint, since we want the results to sound both like the target and like the source. Also of interest is the fact that many of the steps that help us get a better sound overlap with those that we used to enhance matches between different versions in assignment 5. So code-wise, this will largely be a review, but for a really neat and different application.

Step 0: Griffin-Lim Phase Improvement (5 Points)

One problem with a simple nonnegative matrix factorization for musaicing is that each window is treated independently in the objective function. Furthermore, the activations are only learned against the magnitudes W, but we're reconstructing a sound with the complex STFT WComplex, whose phases were completely ignored when learning H. To improve phase continuity from one window to the next, we can perform several iterations of the Griffin-Lim algorithm on the spectrogram S = WComplex*H before returning the audio, rather than just doing a straight inverse STFT.

Look back at what you did on assignment 3, and use similar code to perform 10 iterations of Griffin-Lim before doing the final inverse STFT. Here, we'll simply use librosa's stft and istft to save code, using the default window; a rough sketch is shown below. Once you've done this, the result will improve slightly.
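A minimal sketch of Griffin-Lim with librosa, assuming the complex spectrogram S = WComplex.dot(H) has already been computed (the helper name and the hop length are assumptions, not part of the starter code):

```python
import numpy as np
import librosa

def griffin_lim(S, n_iters=10, win_length=2048, hop_length=1024):
    # Iteratively refine the phases of the complex spectrogram S while keeping
    # its magnitudes fixed, then return the inverse STFT of the result
    A = np.abs(S)
    for _ in range(n_iters):
        y = librosa.istft(S, hop_length=hop_length, win_length=win_length)
        Snew = librosa.stft(y, n_fft=win_length, hop_length=hop_length)
        S = A * np.exp(1j * np.angle(Snew))  # keep magnitudes, adopt the new phases
    return librosa.istft(S, hop_length=hop_length, win_length=win_length)
```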


Step 1: Avoiding Repeated Activations (6 Points)

One of the issues with the above sound is that we hear a "jitter" or "echo" that occurs when the same source window is activated at multiple time instants in a row. To show an isolated example of this, here's what we get when we do nonnegative matrix factorization on "When Doves Cry" (as shown in class) and activate only the first component for 20 frames.

The window by itself sounds like this

And the repeated activations of it from the code above sound like this

To stop this from happening, we can make sure that there aren't any chunks of similar values anywhere in a particular row of H during each iteration. To do this, we preserve all values that are the maximum over a horizontal window of half-length r around them, and we "shrink" the values that aren't local maxes by some factor. In particular:

\[ H[i, j] = \left\{ \begin{array}{cc} H[i, j] & H[i, j] == \max(H[i, j-r:j+r]) \\ \left(1 - \frac{l}{L} \right) H[i, j] & \text{otherwise} \end{array} \right. \]

where l is the iteration number and L is the total number of iterations, and r is half of the length of a horizontal window in which to look for a max for every element of H. In other words, as time goes on, make the horizontal local maxes stand out more and more, and by the end, the surrounding elements should be 0. If this works properly, you should hear the following for r=3, invoked by the code below

And here's an example with r = 7

Hint

You might want to make use of the method scipy.ndimage.filters.maximum_filter, which will return a matrix where every pixel is replaced by its max in some horizontal window. This is the same thing we did in class to obtain a fast implementation of the Shazam technique.
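Here's one possible sketch of this shrinkage step, written as a standalone helper applied once per NMF iteration (the helper name and the choice of a (1, 2r+1) filter window are my own, not taken from the starter code):

```python
import numpy as np
import scipy.ndimage

def shrink_repeated_activations(H, r, l, L):
    # Keep entries of H that are the max over a horizontal window of half-length r;
    # shrink everything else by (1 - l/L), where l is the current iteration out of L
    Hmax = scipy.ndimage.maximum_filter(H, size=(1, 2*r + 1))
    return np.where(H == Hmax, H, (1 - l/L)*H)
```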


Step 2: Restricting Simultaneous Activations (7 Points)

In addition to constraints that we put in the rows of H, we can also put constraints on the columns. This is because we want to limit the number of possible sound grains that are taken from the source at any point in time. If we take too many sounds at once, then they may mix together to form a new timbre that is different from the original timbre of the sources (perhaps too many bees together really do sound like a piano).

To follow conventions in the paper, let's say that we want at most p simultaneous activations at any point in time. This amounts to ensuring that each element is among the p largest elements in its column of H. If it is smaller than that, then we shrink it by a factor of (1 - l/L), just as with the repeated activations. Add code that does this directly after the repeated activations code. You may find this is very similar code-wise to the binary CSM step in the version identification assignment.
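Below is a rough sketch of one way to do this, again as a standalone helper applied once per iteration (the helper name is an assumption, and it presumes p is no larger than the number of templates K):

```python
import numpy as np

def restrict_simultaneous_activations(H, p, l, L):
    # In each column of H, keep the p largest entries and shrink the rest by (1 - l/L)
    thresh = np.sort(H, axis=0)[-p, :]  # p-th largest value in each column
    return np.where(H >= thresh[np.newaxis, :], H, (1 - l/L)*H)
```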

Here's an example where we allow 10 simultaneous activations

Here's an example where we allow only 3 simultaneous activations


Step 3: Diagonal Enhancement (7 Points)

One last observation we make is that the window lengths are quite short relative to the length of natural sounds that can be found in the sources. For example, at a sample rate of 22050 and for a window length of 2048, each window only captures about 100 milliseconds of audio. As we saw in the digital instruments assignment, the attack/decay/sustain/release can take longer than that to fully evolve a timbre, so we might like to encourage the algorithm to choose longer chunks from the source.

Since the sound templates in W happen to be obtained from the windowed spectrogram of source audio, adjacent columns of W store spectrogram magnitudes of adjacent windows from the source audio. This means that adjacent rows in H store activations of time-adjacent source elements. Therefore, we can encourage the algorithm to pick contiguous sequences of windows, and hence longer sounds from the source, by enhancing diagonal lines in the H matrix according to the following equation, where c refers to half of the length of the window in which diagonal elements are summed:

\[ H[i, j] = \sum_{k = -c}^{c} H[i+k, j+k] \]

Add this final step to your code. The implementation may be quite similar to the diagonal enhancement step in the version identification assignment, except that we don't divide by the window length, and H remains the same size (so we have to be careful not to go out of bounds at the boundaries of the diagonals).
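One possible sketch, shifting H along the diagonal and clipping at the matrix boundaries (the helper name and the slicing strategy are mine, not from the starter code):

```python
import numpy as np

def diagonally_enhance(H, c):
    # Replace each entry of H with the sum of entries along its diagonal within a
    # window of half-length c, skipping any indices that fall outside the matrix
    K, N = H.shape
    Hnew = np.zeros_like(H)
    for k in range(-c, c + 1):
        # Valid row/column ranges so that both (i, j) and (i+k, j+k) stay in bounds
        i0, i1 = max(0, -k), min(K, K - k)
        j0, j1 = max(0, -k), min(N, N - k)
        Hnew[i0:i1, j0:j1] += H[i0 + k:i1 + k, j0 + k:j1 + k]
    return Hnew
```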

Below are a few examples


Musical Statement (5 Points)

Now that you've created the musaicing system, use it to create your own novel compositions! Come up with some sound sources and a target, and go to town. Check out Ben Cantil's video again if you need some inspiration. Be sure to tweak the parameters as necessary to get the best quality sounds. I can't wait to hear what you come up with!