Assignment 5b: Shazam (40 Points)

Chris Tralie

Due Monday 12/21/2020 at 12:00PM

Overview / Logistics

We have now reached the climax of the course! This assignment is a direct followup to the last assignment on basketball hashing. Now that we have a hash table implementation, we're going to use it for the even more exciting task of quickly identifying a song from a short (and possibly degraded) audio clip of that song. We will be implementing the so-called "Shazam algorithm," which you can read more about in this paper, though I will break down parts of the algorithm one by one in this assignment, and you will implement them yourself. In a nutshell, we will be building something called an audio fingerprint, which can be used to identify songs from short clips of audio.

Learning Objectives

  • Use makefiles in C++ to build a medium scale project with many files
  • Use multiple classes together in concert to accomplish a task
  • Use hash tables to organize and find data quickly
  • Manipulate 2D arrays in C++
  • Manipulate STL vectors in C++
  • Manage dynamic memory and object references in C++
  • Implement audio features for content-based music information retrieval

Starter Code

Click here to download the starter code for this assignment. You will be filling in code in the fingerprint.cpp file. You'll notice that it shares a lot of code with the previous assignment, and it has a makefile to manage building all of the little files. The very first thing you should do before you type make is to copy in your working hashtable.cpp from the last assignment. If this worked, your code should build, and you should be able to generate the example spectrogram in the next section. Please do not proceed until that is working!

What to submit

When you are finished, please submit fingerprint.cpp to Canvas. Please also answer the following questions:
  1. Approximately how many hours it took you to finish this assignment (I will not judge you for this at all...I am simply using it to gauge if the assignments are too easy or hard)
  2. Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One word answers are fine, but if you have any suggestions for the future let me know.
  3. Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more you articulate the problem the more partial credit you will receive (fine to leave this blank)

Background: Audio Spectrograms / The Shazam Challenge

As we discussed in class, it is possible to turn audio into an image called a spectrogram, in which the x-axis represents time and the y-axis represents "frequency content"; that is, pixels higher up on the y-axis correspond to higher pitches, and movement to the right indicates progression in time. A bright pixel indicates that there is a lot of energy at that time and frequency, and a darker pixel indicates that there is less energy. Below is an example audio clip and its corresponding spectrogram, from 0 to 4000 Hz, sampled at one pixel every 46 milliseconds. Once you build the skeleton code, you can generate that image yourself with the following command line invocation:

Lots of interesting features are present in this image. For instance, it is possible to pick out beats, as well as harmonics of musical instruments and vocals. Our challenge in the Shazam task will be to summarize these images for quick queries in a way that is robust to different transformations of the audio, such as

  • Changes in loudness
  • Background noise
  • Low quality audio / distorted audio
To make things even more challenging, we want to do this with short audio clips, since people should be able to jump in during the middle of a song, and we don't want them to wait too long to find out what song it is. We also want the lookup to run quickly enough to be done in real time.

In this assignment, you will slowly work through a series of steps to address all of these issues. As has been the case many times before, a hash table will be the central data structure that allows us to tackle this seemingly impossible task.

Programming Task 1: Fingerprint Anchors (10 Points)

The first insight of the Shazam technique is that bright spots in the spectrogram tell us more than dark spots. In particular, we want to find local maxima of brightness in the image, i.e. pixels that are brighter than all of their surrounding pixels in some region. These are referred to as anchors. Even if the audio gets quieter, these maxima will still be the greatest relative to their neighbors. Furthermore, they are some of the most important pixels in the image, and they are the least likely to disappear in the presence of noise or other distortions.

More specifically, a pixel in the spectrogram is said to be an anchor if it is greater than all other pixels in a rectangle surrounding it. The rectangle is specified by the parameters timeWin and freqWin, which are half of the rectangle's length along the time and frequency axes, respectively, centered at the pixel in question. The image below shows an example of a max.

Your task in this section will be to compute all of the anchors and add them to a vector. In code, each anchor is an object of type Anchor, and you call its constructor with the frequency and time at which it occurred.

To find anchors, you'll be looping through the 2D spectrogram array S, which has maxFreq rows and nwin columns. First, you should fill in the isMax() method in fingerprint.cpp. Then, you should fill in the findAnchors() method, using isMax() as a helper to check each pixel to see if it's an anchor. You should only include an anchor if it is both a max and its value exceeds the threshold thresh passed to findAnchors().
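
To make the neighborhood check concrete, here is a rough sketch of the idea, assuming (as a guess) that the spectrogram is a 2D array of doubles indexed as S[freq][win]; the actual types, signatures, and boundary handling should follow what's in the starter code.

    #include <algorithm> // for std::max / std::min

    // Sketch only: returns true if S[freq][win] is strictly greater than every
    // other pixel in the rectangle of half-width timeWin (time) and half-height
    // freqWin (frequency) centered on it, clamped to the edges of the image.
    bool isMaxSketch(double** S, int maxFreq, int nwin,
                     int freq, int win, int freqWin, int timeWin) {
        for (int f = std::max(0, freq - freqWin); f <= std::min(maxFreq - 1, freq + freqWin); f++) {
            for (int w = std::max(0, win - timeWin); w <= std::min(nwin - 1, win + timeWin); w++) {
                if (!(f == freq && w == win) && S[f][w] >= S[freq][win]) {
                    return false; // Some neighbor is at least as bright
                }
            }
        }
        return true;
    }

findAnchors() would then loop over every (frequency, window) pixel of S and, whenever a pixel both exceeds thresh and passes the isMax() test, construct an Anchor with that frequency and time and add it to the vector of anchors.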

Once your code is working properly, you should see the image below when you run this in the command line

Programming Task 2: Pairing Anchors in Target Zones (10 Points)

One way to make anchors even more descriptive is to associate them with other anchors slightly ahead of them in time. Then, we actually capture some of the temporal evolution of the song, rather than just isolated time-frequency points. This is the basis of the Shazam fingerprint. In particular, for each anchor, we search for other anchors in a rectangle slightly shifted to the right (forward in time), as shown in the image below for the dark blue anchor

dCenter refers to the time difference between the anchor and the center of the box, width (< dCenter) refers to half of the width of the rectangle around its center, and height refers to half of its height in frequency above and below its center.
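
Put another way, a candidate anchor lies in this anchor's target zone when its time window is within width of the box's center (this anchor's window plus dCenter) and its frequency is within height of the center's frequency, which (as an assumption here) is taken to be the anchor's own frequency. A rough sketch of that test follows; the field names freq and win are placeholders for however the starter code's Anchor class actually exposes its frequency and time.

    // Sketch only: freq and win are placeholder names for an anchor's frequency
    // index and time window index; adapt to the starter code's Anchor class.
    struct AnchorSketch { int freq; int win; };

    bool inTargetZoneSketch(const AnchorSketch& a, const AnchorSketch& other,
                            int dCenter, int width, int height) {
        int dt = other.win - (a.win + dCenter); // time offset from the box's center
        int df = other.freq - a.freq;           // frequency offset from the anchor
        return dt >= -width && dt <= width && df >= -height && df <= height;
    }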

To accomplish this in code, you should first fill in the searchTargetZone() method of the Anchor class, which should dynamically allocate and fill in a vector of the other anchors that lie in the specified target zone for this anchor. Then, you should fill in the getFingerprints() method, which should loop through all of the anchors and call searchTargetZone() on each of them. For each anchor and each paired anchor found in its target zone, you should create an object of type Fingerprint, which takes the two anchors as arguments (the original anchor first, then the pair in the target zone), and add that fingerprint to the vector of fingerprints. If you've done this correctly, then you should see the following image of (many!) fingerprints after running the following command

Programming Task 3: Hashing Fingerprints (5 Points)

Soon we are going to create a database of many, many fingerprints across a collection of music, so we need a way to look them up quickly. We can create a hash code based on three pieces of information:

  1. The frequency index of the first anchor, f1
  2. The frequency index of the second anchor, f2
  3. The time lag between the two anchors, dw = win2 - win1
To wrap all of these up into one number to be used as a hash code, you should use the following formula

\[ f1 + f2*256 + dw*256*256 \]

To implement this, you should fill in the getHash() method of the Fingerprint class, which inherits from Hashable so that fingerprints can be used as keys in a HashTable. If you have done this properly, then when you run the code

Then the program should output a file fingerprints.txt. You should verify that your file looks exactly like this one before proceeding.
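
For reference, the hash computation itself is only a couple of lines. The sketch below uses the placeholder names f1, f2, win1, and win2 for the fingerprint's two anchor frequencies and time windows, and returns a plain integer; match whatever fields and return type the Hashable interface in the starter code actually uses.

    // Sketch only: placeholder names for the fingerprint's anchor data.
    long hashSketch(int f1, int f2, int win1, int win2) {
        int dw = win2 - win1;                         // time lag between the two anchors
        return f1 + f2 * 256 + (long)dw * 256 * 256;  // f1 + f2*256 + dw*256*256
    }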

Programming Task 4: Audio Queries (15 Points)

Now that we know how to compute fingerprints, we are ready to try to match the fingerprints in a short "query" audio clip to the database to figure out which song in the database it might be (NOTE: In CS, a "query" generally refers to something that you're looking up in a database; e.g. text that you type into Google is a "search query"). It is possible that multiple songs share the same fingerprint, so it will take some work to narrow down which song the query might be. To see how this might work, let's consider the fingerprints in the audio clip below, which is a snippet from Wang Chung's "Everybody Have Fun Tonight" with lots of static added.

Once we compute its fingerprints, we can match each one to all of the songs in the database. If there is a match, we can plot the time where the query fingerprint starts against the time of that same fingerprint in the matching song in the database. When we do this on the first three songs in the database for the above query, we see the following plots:

First of all, we notice that there are more matching fingerprints in the correct song than in the others, but we can go further. If we zoom in on a smaller time interval in the database song for "Everybody Have Fun Tonight," we see the following pattern:

The feature that pops out here, which does not exist in the other plots, is that many of the dots concentrate along a diagonal line with a slope of one. This indicates that a bunch of fingerprints match in sequence with the same relative offset, which is extremely unlikely to happen unless this is a true match of the query. So somehow, we need to look for these kinds of diagonal lines and count how many dots lie along them. One very simple way to do this is to create a histogram of the differences between the window index of each matching fingerprint in the database song and the window index of that same fingerprint in the query. Each bin in the histogram is a time difference (in units of spectrogram windows), and its count is the number of matching fingerprints that had that difference. The plot below shows the histograms for the first three songs in the database, zoomed in around the peak of each histogram.

What we see here is a very clear winner, where "Everybody Have Fun Tonight" has over 300 scatterpoints in sequence at an offset of about 2585, while the false matches don't have more than 6 points maximum in sequence.


Just to see another example, consider the following audio clip captured with a phone microphone, as emitted by a low quality laptop speaker

The scatterplots with the first three songs look like this:

Again, we see that the matching song has a lot more matching fingerprints, and when we zoom in, we see a diagonal line, indicating that our query clip matches an interval in the song "Bad Day"

And the histograms of difference in fingerprint times are as follows

So again, it's not even close: the correct song has a far higher count in its maximum histogram bin, and we get the correct answer out of these three by picking the song with the largest histogram count, even though the query audio is highly degraded.

Code to write

To implement the above technique in code, you'll be filling in the method queryDatabase in fingerprint.cpp, which is an instance method of a SongData object containing the fingerprints of a query song (accessible via the fingerprints instance variable). The method takes in a list of songs in the database of type songInfo, which is simply a wrapper around a string for the title and artist of a song in the database. It also takes in a hash table whose keys are Fingerprint objects and whose values are references of type FingerprintList. A FingerprintList object wraps around a vector of fingerprintInfo objects, one for each copy of that exact fingerprint that was seen across all songs in the database. A fingerprintInfo object has a number of fields, and you will need to focus on these two:

  • offset: The time window index of the first anchor in the fingerprint
  • index: The index of the song to which it belongs, in the array of songs in the database

To find a candidate match, you should loop through all of the fingerprints in the query song. For each query fingerprint, you should look at all matching copies of that fingerprint in the database, which you can get as the value of the fingerprint key in the HashTable db. For each of these matches, you should add one to the corresponding song's histogram at the bin which is the difference between the database fingerprint's offset and the window of the query fingerprint's first anchor. It is possible to do this with an array, but an easier way to implement a histogram is as a HashMap whose keys and values are both HashableInt objects, where the key is the bin index and the value is the count. Since you'll need a histogram for every song in the database, the code comes set up with an array of such hash maps. You should fill in those hash maps and keep track of the song corresponding to the maximum count as you go along, since you will eventually return this song as the candidate match. A sketch of this counting logic is shown below.
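
To make the bookkeeping concrete, here is a self-contained sketch of the counting logic, with std::map standing in for the HashTable-of-HashableInt histograms and a small struct standing in for fingerprintInfo; in the real method you would instead iterate over this song's fingerprints, look each one up in db, and fill in the provided array of hash maps.

    #include <map>
    #include <vector>

    // Sketch only: MatchSketch stands in for fingerprintInfo.
    struct MatchSketch {
        int index;  // index of the database song this copy of the fingerprint came from
        int offset; // time window of the fingerprint's first anchor in that song
    };

    // queryWindows[i] is the window of the query's i-th fingerprint's first anchor;
    // matches[i] holds every database copy of that same fingerprint.
    int bestSongSketch(const std::vector<int>& queryWindows,
                       const std::vector<std::vector<MatchSketch> >& matches,
                       int numSongs) {
        std::vector<std::map<int, int> > hists(numSongs); // one histogram per song
        int bestSong = -1;
        int bestCount = 0;
        for (size_t i = 0; i < queryWindows.size(); i++) {
            for (size_t j = 0; j < matches[i].size(); j++) {
                const MatchSketch& m = matches[i][j];
                int bin = m.offset - queryWindows[i]; // relative time offset (in windows)
                int count = ++hists[m.index][bin];    // bump this song's histogram bin
                if (count > bestCount) {              // track the best candidate as we go
                    bestCount = count;
                    bestSong = m.index;
                }
            }
        }
        return bestSong; // index of the song with the largest histogram bin
    }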

Testing

To see if this works correctly, you can run the file query

which will then output something like the following

So it has correctly found that the song is "Wang Chung: Everybody Have Fun Tonight." Your program should also print out that song for the following two inputs:

ExampleQueries/wangchungLaptop.wav

ExampleQueries/wangchungNoise.wav

For the three queries below, the program should print out "Daniel Powter: Bad Day"

ExampleQueries/baddayClean.wav

ExampleQueries/baddayLaptop.wav

ExampleQueries/baddayNoise.wav

For the three queries below, the program should print out "Kanye West: Touch The Sky"

ExampleQueries/skyClean.wav

ExampleQueries/skyLaptop.wav

ExampleQueries/skyNoise.wav

To help debug your program, you should print out the maximum number of counts that you find in a bin for the song you've chosen.

Stress Testing

Once the above is working, you can try loading a larger database to see if it still works. Click here to download a .zip file with 266 songs, including the first 6 in this list, and update the query file to load more than the first 6 songs. The problem technically gets harder, since there are more potential songs in the database to confuse the query with. It will also take much longer to load the database and to set up the hash table. But if this works, you'll know you have a robust system.

Going Further!

At this point, you have a fully functional basic implementation of audio fingerprinting for content-based music retrieval. This is not required, but you should try it on your own music! See if you can add a song to the database and query it with a clip. Note that the parameters are set up so that the database is expecting single-channel audio sampled at 22050 Hz in the .wav file format. You can use Audacity to do the conversion. Questions to ask yourself as you go include:

  • How much audio does it seem like you need to do a robust identification?
  • How noisy or distorted can the audio be before the identification fails?
  • Does this work on different versions of the song (e.g. cover songs)? Why do you think that might be?