Audio-based Musical Version Identification

Publicly Available Datasets

Below are a number of publicly available datasets for audio version identification:

The Covers80 Dataset	A dataset of low-quality audio for 160 songs, split into two disjoint subsets A and B. Each subset holds exactly one version from each pair of songs, for a total of 80 pairs. The music is mostly '80s and early '90s pop. This was one of the first publicly available datasets used by researchers.
da-tacos	A dataset with pre-extracted features and metadata, comprising 15,000 songs in a "benchmarking subset" and 10,000 songs in a "cover analysis subset."
CoversBR	A database with pre-extracted features (similar to da-tacos) for 102,298 songs, distributed into 26,366 groups of covers, of mostly Brazilian music.
Covers1000	A dataset of pre-extracted features of 395 groups of songs, along with a live demo of some alignment algorithms.
Kara1k Karaoke Songs Dataset	A dataset with features for 2000 songs: 1000 originals and 1000 corresponding karaoke versions. Also a great dataset for singing voice analysis.
The Second Hand Songs Dataset	Another dataset based on annotations from secondhandsongs.com. It is a subset of the Million Song Dataset consisting of about 20,000 tracks with Echo Nest features.
The Youtube Covers Dataset	A collection of chroma, CRP, and CENS features for 350 songs of various genres.
https://secondhandsongs.com/	A community project of annotations of cover songs.
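
A paired layout such as the one described for Covers80 (subsets A and B, one version of each pair in each) lends itself to a simple retrieval check. The sketch below is only an illustration under stated assumptions: the function name `top1_accuracy`, the toy distance matrix, and the convention that entry i of subset A pairs with entry i of subset B are all hypothetical and are not taken from any of the datasets above. It counts how often the nearest song in B to a query from A is its true counterpart.

```python
def top1_accuracy(dist):
    """Fraction of queries whose nearest candidate is the true pair.

    dist[i][j] is a hypothetical distance between song i of subset A
    and song j of subset B. Assumption: A[i] and B[i] are the two
    versions of the same song.
    """
    hits = 0
    for i, row in enumerate(dist):
        # Index of the closest candidate in subset B for query i.
        best_j = min(range(len(row)), key=row.__getitem__)
        if best_j == i:
            hits += 1
    return hits / len(dist)


# Toy 3-pair example with made-up distances (not real data).
toy = [
    [0.1, 0.9, 0.8],  # query 0: closest is candidate 0 (correct)
    [0.7, 0.2, 0.6],  # query 1: closest is candidate 1 (correct)
    [0.3, 0.5, 0.4],  # query 2: closest is candidate 0 (wrong)
]
accuracy = top1_accuracy(toy)
print(f"top-1 accuracy: {accuracy:.3f}")
```

In practice the distances would come from whatever similarity measure is being evaluated on the pre-extracted features; the scoring loop stays the same.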