Call/WhatsApp: +1 914 416 5343

Data Structure and Algorithm Analysis

You are not allowed to use any data structure implemented or imported from java.util.* or anywhere else.
Introduction:
Consider you have been hired by XYZ company to design a new database for Internal Revenue Service
(IRS). You are required to use hash table using appropriate key to implement the database. You will be
given set of data, which contains employees’ details such as name, state, SSN and salaries from up to three
employments.
You will be using closed hashing (aka open addressing) hashing implementation. To find the unused or
open location in the hash table, you would use two probing techniques:
• Quadratic probing
• Double hashing
If the user selects Quadratic probing, then the hash function would be:
HFquadratic( h1(k), i ) = ( h1(k) + C1*i + C2*i
2 ) mod HSize
where, i is the ith probing
h1(k) is the hash function 1 (described later) on given key ‘k’
C1 and C2 are the two constants, need to be decided beforehand
HSize is the size of the hash table
If the user selects double-hashing, then the hash function would be:
HFdouble( h1(k), i ) = ( h1(k) + i * h2(k) ) mod HSize
where, i is the ith probing
h1(k) is the hash function 1 (described later) on given key ‘k’
h2(k) is the hash function 2 (described later) on given key ‘k’
HSize is the size of the hash table
It is very important to deal with the problem of collisions, as this directly affects the performance of the
hash table. So, choosing an appropriate hash function plays a very important role and for this you will use
following two hash functions for this assignment, depending upon the probing technique selected:
Hash function 1:
For the given key ‘k’, the hash function is given by:
h1(k) = floor( HSize * ( (k * C) mod 1) )
where, HSize is the size of the hash table
k is the given key
C is a constant such that 0 < C < 1. Choose C to be (√5 -1)/2
floor(x) is the mathematical floor operation on x
For example:
If k = 123456 and HSize = 10000, the hash function is given as:
h1(k) = floor( 10000 * ( (123456 * ( (√5 -1)/2 ) mod 1 ) )
 h1(k) = floor( 100000 * ( (123456 * 0.6180339887) mod 1 ))
 h1(k) = floor( 100000 * (76300.00412) mod 1 )
 h1(k) = floor( 100000 * 0.00412 )
 h1(k) = floor(41.2)
 h1(k) = 41
Hash function 2:
For the given key ‘k’, the hash function is given by:
h2(k) = h’(k, r) mod HSize
where, HSize is the size of the hash table
k is the given key
r is the number of digits to be selected from the result, need to be selected beforehand
h’(k, r) is the function that squares the given key ‘k’ and then extract the middle ‘r’ number of
digits.
For example:
If k = 123456, HSize = 10000 and r = 6, the hash function is given as:
h’(k) = 123456 * 123456 = 015241383936
On extracting middle 6 digits, h’(k) = 241383
This gives,
h2(k) = h’(k, r) mod HSize
 h2 (k) = 241383 mod 10000 = 1383
Implementation Requirement:
Your hash table implementation needs to support following operations:
• insert(k , x, T) inserts object x, with the given key k, into a table T provided that the key k is not
currently in the table. On success, it should return the number of open location(s) visited to insert
that object in the hash table.
If the key k is already used in the hash table, then it should return the negative of the number of
open location(s) visited to find that the key k is already in the hash table.
In the case of double hashing, the returned value represents the number of times the new hash
is calculated.
• delete(k, x, T) deletes an object x, with the given key k, from the table T. The deletion is performed
by placing a marker in the place to indicate that the object x has been deleted. During insertion,
the location which is marked by this marker is considered as empty and you can place an object
there. However, while searching an object, the location which is marked by this marker is
considered occupied.
On success (i.e. if the key is found and is deleted), it needs to return the number of locations
visited to find that key k.
If the key k is not found, then the deletion operation is considered fail and in that case, it needs
to return the negative of the number of location visited to determine that the key k is not found
on the table.
A very important part in this deletion operation is that the object deleted from the hash table T
should be placed in another hash table, call it T_dup. Remember, this is IRS and you do not want
to completely throw away the information. Instead keep them in the separate hash table, T_dup,
for any future reference purpose. This new hash table T_dup is exactly same as T, so any
discussion on T is applicable to T_dup as well.
• find(k, T) finds the object for the given key k from the hash table T. On success, it should return
the number of location(s) visited to find that object in the hash table and all the information about
the object (first name, last name, state, SSN, salaries of all employment and the sum of all the
salaries).
If the key k is not found in the hash table, then it should return the negative of the number of
location(s) visited to find that the key k is not in the hash table.
• update(k, y, T) finds the object associated with the key k in the hash table T and updates that
information with object y.
On success (i.e. if the key is found and is updated successfully), then it needs to return the number
of locations visited to find that key k.
If the key k is not found, then the update operation is considered fail and in that case, it needs to
return the negative of the number of location visited to determine that the key k is not found on
the table.
• rehash(T) rehashes the hash table T under two conditions:
o the hash table T reaches the specified load factor, or
o the hash table becomes half full
In both of these cases, it creates the new hash table with the size of the hash table increased by
10% of previous hash table. Also, this new rehashed table should contain all the items of the old
hash table but not the locations with markers.
This operation is only called by the insert(..) operation
The assignment should take the hash table size T, the collision resolution technique (quadratic for
quadratic probing and double for double hashing) and their associated values, load factor, an input file
and an output file. If all your java files are in the folder named assignment4, then the sample invocation
would look something like this:
cd assignment4
javac *.java
java 5001 4001 quadratic 13 19 0.75 input.txt output.txt
Interpretation of the command line arguments are as follows:
• 5001 is the HSize of hash table T,
• 4001 is the HSize of hash table T_dup,
• quadratic is the probing technique,
• 13 is the constant C1 used in quadratic probing,
• 19 is the constant C2 used in quadratic probing,
• 0.75 is the load factor,
• input.txt is the text file where the input data for the program is available,
• output.txt is the text file where the detailed output is written from the program.
Or, the sample invocation could also look something like this:
java 5001 4001 double 4 0.75 input.txt output.txt
Interpretation of the command line arguments are as follows:
• 5001 is the HSize of hash table T,
• 4001 is the HSize of hash table T_dup,
• double is the probing technique,
• 4 is the constant ‘r’ used in the double hashing,
• 0.75 is the load factor,
• input.txt is the text file where the input data for the program is available,
• output.txt is the text file where the detailed output is written from the program.
Note: You can make HSize for T_dup arbitrarily big so that rehashing is not necessary.
The input file contains a new operation in each line followed by the set of data appropriate for that
operation. For example, the sample input file (which is a tab delimited text file) looks something like this:
insert Reggie Howard Oklahoma 442-15-7719 52768
insert Antwoine Womack Kansas 514-44-4266 10151 37666 66712
insert Orlando Pace Oklahoma 445-12-2767 28764 49060
insert Orlando Pace Oklahoma 445-12-2767 28764 49060
delete 579-72-1565
delete 514-44-4266
update Orlando Pace Oklahoma 445-12-2767 28764 49065
find 445-12-2767
The data on the input file should be interpreted as follows:
• the first column always represents the operation (insert/update/delete/find) that is to be
performed.
• If the first column operation is insert then the next columns data should be interpreted as:
o (First name, Last name, State, SSN, Employer1 salary, Employer2 salary, Employer3
salary)
• If the first column operation is find then the next column data should be interpreted as:
o SSN
• If the first column operation is delete then the next column data should be interpreted as:
o SSN
• If the first column operation is update then the next column data should be interpreted as:
o (First name, Last name, State, SSN, Employer1 salary, Employer2 salary, Employer3
salary)
Note: Not all employees have three employments. The maximum number of employments is 3 while the
minimum number of employments is 1.
The program should display the detailed result on the file output.txt (which should be a tab delimited text
file, just like input txt file). The result of each query is to be written on a separate line. The sample output
for the above sample input could be:
1 Insertion successful
2 Insertion successful
3 Insertion successful
-1 Insertion failed: Duplicate employee record
-1 Deletion Failed: No record found
1 Deletion successful
1 Update successful
1 Find result: OrlandoPace Oklahoma 445-12-2767 28764 49060 77824
The data on the output file should be interpreted as follows:
• the first column represents the count of the number of open location visited to perform that
operation. Positive number represents success while negative number represents unsuccessful
operation.
• the second column represents the operation result.
o successful insertion operation is specified by the message: “Insertion successful”
o unsuccessful insertion operation due to duplicate employee record is specified by the
message: “Insertion failed: Duplicate key”.
o successful deletion operation is specified by the message: “Deletion successful”
o unsuccessful deletion operation is specified by the message: “Deletion failed: Key not
found”
o successful update operation is specified by the message: “Update successful”
o unsuccessful update operation is specified by the message: “Update failed: Key not
found”.
o successful find operation is specified by the message: “Find result” followed by the result.
If the employee has more than one employer, then the sum of the salary is also displayed.
 For example, Orlando Pace has 2 employers (salary as 28764 and 49060). The
result of these two salaries is 77824, which is also displayed.
o unsuccessful find operation is specified by the message: “Find failed: Key not found”.
• the third column represents if any rehashing is required or not after the insertion operation. If the
rehashing is required then display the message as “Rehashing performed: ” followed by the
reason (either “the specific load factor is reached” or “the table became half full”)
The program should also display the summary result on the console.
1. Hash Table size: i.e. the size of the hash table
2. Data size: i.e. the number of input data
3. Number of collisions: i.e. the number of times the two different keys are mapped to the same
table address.
4. Probing mechanism: i.e. the collision resolution technique used (linear for linear probing,
quadratic for quadratic probing and double for double hashing)
5. Number of open locations visited: this needs to be specified for each of the operation separately
a. insertion
b. deletion
c. update
d. find
6. Number of rehashing performed: this needs to be specified for each reason separately
a. the specific load factor is reached
b. the table became half full
7. Number of items in the hash table T
8. Number of items in the hash table T_dup
9. Number of deletion operation: i.e. the sum of all successful and unsuccessful deletion operation
For Bonus Point: [25 Marks]
Before you proceed with the bonus part, please go through the grading policy for the bonus part under
Grading rubrics.
You need to write a report by providing the worst case analysis for insertion, find and deletion operation.
You also need to run the program and make following four plots:
1. Cost (i.e. number of open location visited) of insertion vs input size for various table size
2. Cost (i.e. number of open location visited) of deletion vs input size for various table size
3. Cost (i.e. number of open location visited) of find vs input size for various table size
4. Number of rehashing required vs input size for various table size
Use at least 3 hash table size, of your choice, on each of these curves.
Grading Rubric:
Your assignment will be graded based on the following grading Rubrics:
• Correct implementation and use of hash function 1 [3 Marks]
• Correct implementation and use of hash function 2 [3 Marks]
• Correct implementation of the insert operation and performing addition of all
salaries
[15 Marks]
• Correct implementation of the delete operation with new hash table creation [15 + 10 Marks]
• Correct implementation of the find operation [10 Marks]
• Correct implementation of the update operation [15 Marks]
• Correct implementation of the rehash operation [12 Marks]
• Correct implementation of outputs on the console (#3, #5, #6 #7, #8, #9) [12 Marks]
• Correct implementation of outputs on the file [5 Marks]
Grading Policy for bonus part:
The marks you obtained from completing the bonus point will only be added to the assignment portion in
the final grading. For example, if a student scores 500 points (out of maximum 500 points possible for
assignment), then any additional point obtained from the bonus will not be added to the grade book. The
purpose of having bonus part in the assignment is not to substitute the grade of exams or discussion.
However, if a student has not scored good grade in one of the 5 assignments, then completing the bonus
part will help to boost the grade.
The bonus will be graded based on the following grading Rubrics:
• Calculation of worst-case complexity for insertion, find and deletion [5 + 4 + 4 Marks]
• Generation of plots [4 * 3 Marks]
Submission Guidelines:
• In addition to .java file, all the codes should be submitted in .pdf format as well.
o Name the .pdf file with the same name as that of your source code.
o Please copy the codes in the text form. Do not screenshot or include any image of the
code in the pdf. This is to check for plagiarism.
o If you perform the bonus part, make sure to include the input data file that contains
events information.
• Run time analysis should be performed in a separate file and name the file as
algorithmAnalysis.pdf
o Also, this file need to have all the 4 plots.
• You need to include readMe.txt file that shows how to run your program. Make sure the program
runs in csx machine.
• Take a screen shot of all your outputs and save it as .pdf file.
o Name this file as screenshot.pdf
• The main driver file needs to have the following header information:
o Author Name:
o Email:
o Date:
o Program Description:
• Use comments throughout your program.
o Each class and the method should be properly commented:
 description of arguments and return data
 description of method
• All the files should be submitted through Canvas. Assignment submitted through email will not
be accepted.
Additional tips to avoid any point deduction:
• Failure to follow standard programming practices will lead to points deduction, even if the
program is running correctly, like
o No comments on code
o Not writing meaningful identifiers
• If the program is not compiling successfully, then 20 points will be automatically deducted.
• If you use HashTable class from Collections framework, then you will end up getting zero for the
assignment.