Assignment 5a: Basketball Hashing (36 pts)
Chris Tralie
Due Tuesday 10/24/2020
Overview / Logistics
The purpose of this assignment is to give you practice designing collections in C++, and to explore how the right implementation of an abstract data type (a hash table implementation instead of a linked list implementation of a map) can lead to much faster queries in an unordered collection. Click here to download the starter code. When you are finished, please upload a .zip file of this entire folder with your completed code to Canvas.
Learning Objectives
- Use makefiles in C++ to build a medium scale project with many files
- Use multiple classes together in concert to accomplish a task
- Use object references to perform efficient lookups yielding constant time operations in a collection
- Leverage polymorphism to quickly swap out different implementations of abstract data types (LinkedMap and HashTable)
- Create tests to verify the correctness of a code implementation and to empirically explore its efficiency.
Background: Hash Tables
In class, we showed how it is possible to implement a map data structure by using a linked list (the so-called LinkedMap), but that this leads to slow (linear time) average and worst case times for the get()
, containsKey()
, and remove()
operations. Motivated by this, we showed how it is possible to use the concept of a hash table to reduce the time of queries in a map data structure. The idea is to have a deterministic function called a hash function that gives a particular number, or a hash code for each object. One can then jump right to a bin corresponding to this hash code, which contains a linked list of all objects in the collection that have this hash code (objects that have the same hash code are said to be in "collision"). For instance, if we do hashing on a select bunch of Harry Potter characters, and their hash code is the month during which they were born, then we have the following hash table:
Then, if we are looking for Neville Longbottom (i.e. Matthew Lewis), for example, we compute his birth month of 6 and jump straight to bin 6 and follow two pointers to get to him.
Number of Bins
We can choose a fixed number of bins, regardless of the range of our hash function, by simply taking the modulus of our hash code by the number of bins, giving us the bin indices
[0, 1, 2, ..., NBins-1]
For example, if we want to use only 5 bins, then Percy Weasley in bin 11 above would get bin index
11 % 5 = 1
There is a classic space vs time efficiency trade-off here, however. Assuming we have a hash function which evenly distributes objects over the available bins, then the more bins we use, the faster queries and additions/removals will be, but the more memory is needed to store the bins. Conversely, if we use fewer bins, we need less memory, but these operations take longer. The most extreme case is a single bin, in which case a HashTable reduces to a LinkedMap.
Programming Tasks
NOTE: This assignment is setup so that if you simply type make
in your terminal, it will build everything for the assignment. So you should never have to issue individual compile commands.
In this assignment, you will implement your own HashTable
(hashtable.cpp, hashtable.h) data structure and compare it to a LinkedMap
(linkedmap.cpp, linkedmap.h) that has been provided for you. They both inherit from an abstract class known as Map
(map.h), which has a number of methods you will need to implement for your hash table. The keys of this map are specified as pointers to Hashable
(hashable.h, hashable.cpp) objects, for which a method getHash()
has been implemented to compute a hash code on that object. The values of this map are specified as pointers to Cloneable
(cloneable.h) objects, which are objects for which a clone()
method is provided to make deep copies. Since Hashable objects are also Cloneable, deep copies can be made of both the key and the value, and runtime polymorphism is leveraged so that calling the clone()
method on a Cloneable*
or Hashable*
pointer makes a clone of the correct type.
To test the LinkedMap and HashTable, you will load in information about NBA players born between 1913-1997 (data obtained at this link), including their name, the college/school they attended, their height in centimeters, their weight in kilograms, and their birth year. You have been provided with a class Player
(player.cpp, player.h) that encapsulates all this information in one place. You will need to load in the players in from a file and put them into a LinkedMap. Then, once you have implemented your HashTable structure, you should also load the players into a HashTable and compare the efficiency of get()
queries between the two data structures. The tasks below break this process down further.
Player Loading / Database Queries (8 pts)
Your first task should be to load in all of the players in the text file into a map. In particular, you will fill in the loadPlayers(Map* m)
method in the player.cpp
file. You should look at Person.cpp
for an example of how to use the map. In particular, it accepts object references for both the key and value.
Once you have loaded in the players, you should fill in the file playerlookup.cpp
to lookup a player in the map specified on the command line. For instance,
If a player is not in the database, you should output "not found"
HINT: The names of the players are stored as "Firstname Lastname", but you will need two command line arguments for the first name and last name. So you will have to use string concatenation to add them together with a space in order to match what's in the database.
String Hashing (5 pts)
The next step you need to do to prepare to use a HashTable on this data is to define how to hash a string, since the keys are strings. For this, you should fill in the getHash()
method of the HashableString
class in hashable.cpp
. You will be implementing the same hash function that Java uses for it's String
type. In particular, here is pseudocode for the algorithm
assuming that each character c
has been cast to a size_t
. Once you are finished this, you should create a file stringtest.cpp
to test out the hash codes on a few examples. You should add an entry to your makefile to automatically compile this based on its dependencies. Then, you should use it to output and verify the following hash codes by initializing objects of type HashableString
:
HashTable Implementation (15 Pts)
You are now ready to implement the HashTable functions. You can add private variables, but you should not change the interface in hashtable.h
. The bulk of your code should go in hashtable.cpp
. As a hint, the primary private variable in your object should be an array of HashNode*
references, each representing the head of a linked list for a different bin. All references should start off as NULL
. Then, you should implement all of the methods below. You can borrow heavily from the code in linkedlist.cpp
, but you will have to have an additional step of computing a hash to look up which bin's head you should start with. Below are the methods you need to implement
-
(3 pts)
void put(Hashable* key, Cloneable* value)
: Put an item into the hash table, using the key's hash function. To keep this simple, you do not need to check if the key already exists, so it is okay to end up with duplicates of the same key. -
(3 pts)
Cloneable* get(Hashable* key)
: Return a reference to the value associated to this key, orNULL
if the key is not in the collection. -
(3 pts)
void remove(Hashable* key)
: Remove a key/value pair from the hash table, if the key exists in the table. -
(3 pts)
bool containsKey(Hashable* key)
: Return true if the key is in the hash table or false if it is not. - (3 pts)
Hashable** getKeyList(size_t* N)
: Dynamically create an array to hold references to all of the keys in this map in an arbitrary order, and return the length of this array by reference.
Be sure also to implement a destructor that frees all memory that was allocated during the lifetime of a hash table object. You will lose points if there are any memory leaks!
HashTable / LinkedMap Comparison (8 pts)
You will now compare your HashTable implementation to the LinkedMap implementation both to check the former's correctness, and also to compare their efficiency. First, you should increment the numOps
variable in each of your methods in HashTable
every time a hash function is computed and every time a node is visited. This will allow you to keep track of how many operations are happening under different queries (refer to how this was done in LinkedMap
if you are lost).
Next, you should fill in the compareMaps()
method in the mapcheck.cpp
file to perform the following steps:
-
Create a map
m1
of typeLinkedMap
and a mapm2
of typeHashTable
, and fill them each in with the NBA players. - Reset the operation counts of each map.
-
Loop through all of the keys in
m1
and make sure they're inm2
by calling thecontainsKey()
method. If they're not, print out the ones that are missing to help you debug. -
Loop through all of the keys in
m2
and make sure they're inm1
by calling thecontainsKey()
method. If they're not, print out the ones that are missing to help you debug. - Report the number of operations in steps 3 and 4, and the average number of operations per player.
Assuming everything is working correctly, if you use 100 bins, for example, you should see the following stats: Try to see what happens when you increase or decrease the number of bins. Does the average number of operations per player change in a way that makes sense?