Harmonic-Percussive Source Separation (HPSS)

Chris Tralie

Here are the notes from class

In [1]:
%load_ext autoreload
%autoreload 2
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipd
from scipy import signal
from spectrogramtools import *
import warnings
warnings.filterwarnings("ignore")

y, sr = librosa.load("violincastanets.mp3", sr=22050)
win_length = 2048
hop_length = 512

S = STFT(y, win_length, hop_length, useLibrosa=False)
SAbs = np.abs(S)

print(S.shape)
librosa.display.specshow(librosa.amplitude_to_db(SAbs, ref=np.max), y_axis='log', x_axis='time')
plt.gca().invert_yaxis()
ipd.Audio(y, rate=sr)
(1025, 284)
Out[1]:

If we apply a median filter horizontally (along time) around each element, we promote harmonic components, which appear as horizontal ridges in the spectrogram. If we apply the median vertically (along frequency), we promote percussive components, which appear as broadband vertical columns.
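To see why, here is a quick check on a toy "spectrogram" (the matrix and window sizes below are made-up illustrative values, not the real data): a horizontal ridge stands in for a sustained harmonic, and a vertical spike stands in for a percussive onset.

```python
import numpy as np
from scipy import signal

# Toy "spectrogram": a horizontal ridge (harmonic) plus a vertical spike (percussive)
X = np.zeros((11, 11))
X[5, :] = 1.0   # harmonic: constant frequency across all time frames
X[:, 5] += 1.0  # percussive: broadband energy at one time frame

# Horizontal median (along time) keeps the ridge, suppresses the spike
H = signal.medfilt(X, [1, 5])
# Vertical median (along frequency) keeps the spike, suppresses the ridge
P = signal.medfilt(X, [5, 1])
```

In `H` only the horizontal ridge survives, and in `P` only the vertical spike does, which is exactly the behavior we exploit below on the real spectrogram.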

In [2]:
## Slow Version
"""
w = 20  # Half-width of the horizontal window
SFilt = np.zeros_like(SAbs)
for i in range(SAbs.shape[0]):
    for j in range(SAbs.shape[1]):
        # Take out a horizontal slice centered at [i, j], clipping at the left edge
        # (without the max, a negative start index would silently wrap around)
        s = SAbs[i, max(0, j-w):j+w+1]
        SFilt[i, j] = np.median(s)
"""

# Fast version using scipy's signal library
# (medfilt kernel sizes must be odd, which 2*win+1 guarantees)
winh = 20  # Half-width of the horizontal (time) window
SH = signal.medfilt(SAbs, [1, 2*winh+1])  # Horizontal median: harmonic-enhanced
winv = 40  # Half-width of the vertical (frequency) window
SP = signal.medfilt(SAbs, [2*winv+1, 1])  # Vertical median: percussive-enhanced

When we plot the filtered spectrograms, we certainly see the result: the horizontal median preserves the sustained harmonic ridges, while the vertical median preserves the broadband percussive onsets. The question is how we can hear the changes.

In [3]:
plt.figure(figsize=(18, 6))
plt.subplot(131)
librosa.display.specshow(librosa.amplitude_to_db(SAbs, ref=np.max), y_axis='log', x_axis='time')
plt.title("Original")
plt.gca().invert_yaxis()
plt.subplot(132)
librosa.display.specshow(librosa.amplitude_to_db(SH, ref=np.max), y_axis='log', x_axis='time')
plt.title("Harmonic")
plt.gca().invert_yaxis()
plt.subplot(133)
librosa.display.specshow(librosa.amplitude_to_db(SP, ref=np.max), y_axis='log', x_axis='time')
plt.title("Percussive")
plt.gca().invert_yaxis()

Let's first simply run Griffin-Lim on the harmonic and percussive magnitude spectrograms to reconstruct a phase for each.

In [4]:
y_harm = griffinLimInverse(SH, win_length, hop_length)
ipd.Audio(y_harm, rate=sr)
Iteration 1 of 10
Iteration 2 of 10
Iteration 3 of 10
Iteration 4 of 10
Iteration 5 of 10
Iteration 6 of 10
Iteration 7 of 10
Iteration 8 of 10
Iteration 9 of 10
Iteration 10 of 10
Out[4]:
In [5]:
y_perc = griffinLimInverse(SP, win_length, hop_length)
ipd.Audio(y_perc, rate=sr)
Iteration 1 of 10
Iteration 2 of 10
Iteration 3 of 10
Iteration 4 of 10
Iteration 5 of 10
Iteration 6 of 10
Iteration 7 of 10
Iteration 8 of 10
Iteration 9 of 10
Iteration 10 of 10
Out[5]:

It seems we've done a nice job of separating the components, but we've also introduced a lot of phase distortion.

Soft Mask (Wiener Filter)

We can do better if we use the original phases but change the amplitude. The phases still won't be perfect since they correspond to both instruments, but they will be better than what Griffin-Lim gives us. Let $S_H$ and $S_P$ be the magnitude spectrograms for the harmonic and percussive component, respectively, and let $S$ be the original complex STFT for the piece. Then we can obtain the complex STFT for the harmonic component as the element-wise amplitude scaling

$S_1 = \frac{S_H}{S_H + S_P} S$

and the percussive component as $S - S_1$, or

$S_2 = \frac{S_P}{S_H + S_P} S$
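Since the two masks sum to 1 at every bin, $S_1 + S_2 = S$ exactly, so nothing is lost in the split. A quick numpy check with toy stand-in values (the matrices below are made up for illustration, not the real spectrograms):

```python
import numpy as np

# Toy stand-ins: a random complex "STFT" and two strictly positive magnitude fields
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))
SH = 1e-3 + rng.random((4, 6))  # harmonic-enhanced magnitudes
SP = 1e-3 + rng.random((4, 6))  # percussive-enhanced magnitudes

# The two soft masks sum to 1 at every bin...
S1 = S * (SH / (SH + SP))
S2 = S * (SP / (SH + SP))
# ...so the two separated STFTs add back to the original exactly
```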

Now we can do a straight inverse STFT, and we hear better results

In [10]:
S1 = S*(SH/(SH + SP + 1e-12))  # Small constant guards against division by zero in silent bins
y_filt = iSTFT(S1, win_length, hop_length)
ipd.Audio(y_filt, rate=sr)
Out[10]:
In [11]:
S2 = S*(SP/(SH + SP + 1e-12))  # Small constant guards against division by zero in silent bins
y_filt = iSTFT(S2, win_length, hop_length)
ipd.Audio(y_filt, rate=sr)
Out[11]: