Nonnegative Matrix Factorization

(Special Case of Gradient Descent)

Euclidean Version

In [1]:
%load_ext autoreload
%autoreload 2
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
import pandas as pd
from scipy import signal
from spectrogramtools import *
from collections import OrderedDict
import warnings
warnings.filterwarnings("ignore")

def make_html_audio(ys, sr, width=100):
    # Build a small inline HTML audio player for each clip in ys, for display in a table
    clips = []
    for y in ys:
        audio = ipd.Audio(y, rate=sr)
        audio_html = audio._repr_html_().replace('\n', '').strip()
        audio_html = audio_html.replace('<audio ', '<audio style="width: {}px; "'.format(width))
        clips.append(audio_html)
    return clips
In [2]:
y, sr = librosa.load("doves.wav", sr=22050)
win_length = 2048*4
hop_length = 512

S = STFT(y, win_length, hop_length)
V = np.abs(S)  # Nonnegative magnitude spectrogram; this is the V that NMF will factorize
plt.title("V")
librosa.display.specshow(librosa.amplitude_to_db(V, ref=np.max), x_axis='time')
ipd.Audio(y, rate=sr)
Out[2]:
[Spectrogram plot of V and an audio player for the original doves.wav clip]

Euclidean Version

We seek a $W$ and an $H$ that minimize

$\sum_{i, j} (WH[i, j] - V[i, j])^2 $

First, randomly initialize $W$ and $H$ with nonnegative numbers

If $V$ is an $M \times N$ matrix and we choose a number $K$ that estimates how many sources (instruments) we have, then

  • $W$ is an $M \times K$ matrix
  • $H$ is a $K \times N$ matrix
  • $W^T$ is called the "transpose" of $W$; it simply swaps the rows and the columns, so it is a $K \times M$ matrix
  • $H^T$ is an $N \times K$ matrix (a quick check of these shapes follows the transpose example below)
In [3]:
A = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print("A")
print(A)
print("$A^T$")
print(A.T)
A
[[1 2]
 [3 4]
 [5 6]
 [7 8]]
$A^T$
[[1 3 5 7]
 [2 4 6 8]]
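
As a quick sanity check (a sketch of our own, not part of the original pipeline), we can randomly initialize $W$ and $H$ with the shapes above and evaluate the Euclidean objective; the exact value will vary from run to run.

In [ ]:
# Sketch: verify the shapes above and evaluate the Euclidean objective
# for a random nonnegative initialization (values differ on every run)
M, N = V.shape
K = 3  # rough guess at the number of sources
W0 = np.random.rand(M, K)      # M x K
H0 = np.random.rand(K, N)      # K x N
print(W0.T.shape, H0.T.shape)  # (K, M) and (N, K), matching the bullets above
print("Objective at random init:", np.sum((W0.dot(H0) - V)**2))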

Update Rules:

$ H[i, j] = H[i, j] \frac{ (W^TV)[i, j] }{(W^TWH)[i, j]} $

$ W[i, j] = W[i, j] \frac{ (VH^T)[i, j] }{(WHH^T)[i, j]} $

(Note the fixed point: if $WH = V$ exactly, both fractions above are 1 everywhere, so the updates leave $W$ and $H$ unchanged.)
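
Here is a small sketch (our own check, on arbitrary small matrices) of that fixed-point property:

In [ ]:
# Sketch: when V equals W*H exactly, one round of multiplicative updates does nothing
Wf = np.random.rand(10, 3)
Hf = np.random.rand(3, 20)
Vf = Wf.dot(Hf)  # exact factorization, so both update ratios are 1
Hf2 = Hf * (Wf.T.dot(Vf))/(Wf.T.dot(Wf.dot(Hf)))
Wf2 = Wf * (Vf.dot(Hf.T))/(Wf.dot(Hf.dot(Hf.T)))
print(np.allclose(Hf2, Hf), np.allclose(Wf2, Wf))  # True True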

In [4]:
M = S.shape[0]  # Number of frequency bins
N = S.shape[1]  # Number of time frames
K = 3           # Number of components (rough estimate of how many sources are present)
# Randomly initialize my W and my H matrices
W = np.random.rand(M, K)
H = np.random.rand(K, N)
n_iters = 100

# Apply the multiplicative update rules n_iters times in a loop
for it in range(n_iters):
    H = H * (W.T.dot(V))/(W.T.dot(W.dot(H)))
    W = W * (V.dot(H.T))/(W.dot(H.dot(H.T)))
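
As a quick check (our own addition, not in the original), the same updates can be re-run while recording the objective each iteration; under these multiplicative rules it should be non-increasing.

In [ ]:
# Sketch: track the Euclidean objective over the iterations to see it decrease
Wc = np.random.rand(M, K)
Hc = np.random.rand(K, N)
losses = []
for it in range(n_iters):
    Hc = Hc * (Wc.T.dot(V))/(Wc.T.dot(Wc.dot(Hc)))
    Wc = Wc * (V.dot(Hc.T))/(Wc.dot(Hc.dot(Hc.T)))
    losses.append(np.sum((Wc.dot(Hc) - V)**2))
plt.figure()
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("Euclidean objective")
plt.title("NMF objective over iterations")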

Below we look at two things:

  1. A Griffin-Lim inversion of each column of $W$
  2. The result of zeroing out all rows of $H$ except for one, then using that component's share of $WH$ as a soft mask to extract the filtered version of the spectrogram corresponding to each column of $W$
In [5]:
ws = []  # Griffin-Lim audio for each individual component (column of W)
ys = []  # Soft-mask-filtered audio for each component
WH = W.dot(H)
K = H.shape[0]
for i in range(K):
    # Keep only row i of H, so W.dot(Hi) is component i's contribution to WH
    Hi = np.zeros_like(H)
    Hi[i, :] = H[i, :]
    WHi = np.dot(W, Hi)
    # Tile column i of W across win_length/hop_length frames so Griffin-Lim has a short spectrogram to invert
    Wi = W[:, i]
    Wi = np.reshape(Wi, (Wi.size, 1))
    Wi = Wi*np.ones((1, int(win_length/hop_length)))
    wi = griffinLimInverse(Wi, win_length, hop_length)
    ws.append(wi)
    # Soft mask: scale the original complex STFT by component i's share of WH
    Si = S*WHi/WH
    yi = iSTFT(Si, win_length, hop_length)
    ys.append(yi)
ys = make_html_audio(ys, sr, width=200)
ws = make_html_audio(ws, sr, width=100)
pd.set_option('display.max_colwidth', None) 
df = pd.DataFrame(OrderedDict([("Components", ws), ("Filtered", ys)]))
ipd.HTML(df.to_html(escape=False, float_format='%.2f'))
Out[5]:
[Table with one row per component (0, 1, 2): a "Components" column with an audio player for the Griffin-Lim inversion of each column of W, and a "Filtered" column with an audio player for each soft-mask-filtered reconstruction]
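
One nice property of this soft-mask filtering (a check of our own, not in the original notebook): the per-component masks $WH_i/WH$ sum to 1 over the components, so the filtered spectrograms add back up exactly to the original $S$.

In [ ]:
# Sketch: the soft masks sum to 1 across components, so the filtered
# spectrograms S*WHi/WH add back up to the original complex STFT S
Ssum = np.zeros_like(S)
for i in range(K):
    Hi = np.zeros_like(H)
    Hi[i, :] = H[i, :]
    Ssum = Ssum + S*np.dot(W, Hi)/WH
print(np.allclose(Ssum, S))  # True (up to floating point error)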

We also listen to a Griffin-Lim inversion of $WH$ to see how well it approximates the original spectrogram. We shouldn't expect this to be perfect, since we're only using 3 components.

In [6]:
y = griffinLimInverse(WH, win_length, hop_length)
ipd.Audio(y, rate=sr)
Out[6]:
[Audio player for the Griffin-Lim reconstruction of WH]
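
To put a rough number on "not perfect", here's a small sketch (our own addition; the value varies with the random initialization) that measures the relative Euclidean error between the rank-$K$ approximation $WH$ and $V$:

In [ ]:
# Sketch: relative Euclidean error of the rank-K approximation WH versus V
err = np.sqrt(np.sum((WH - V)**2)/np.sum(V**2))
print("Relative error of WH vs. V: {:.3f}".format(err))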