## Data Structure and Algorithm Analysis

You are not allowed to use any data structure implemented or imported from java.util.* or anywhere else.

Introduction:

Consider you have been hired by XYZ company to design a new database for Internal Revenue Service

(IRS). You are required to use hash table using appropriate key to implement the database. You will be

given set of data, which contains employees’ details such as name, state, SSN and salaries from up to three

employments.

You will be using closed hashing (aka open addressing) hashing implementation. To find the unused or

open location in the hash table, you would use two probing techniques:

• Quadratic probing

• Double hashing

If the user selects Quadratic probing, then the hash function would be:

HFquadratic( h1(k), i ) = ( h1(k) + C1*i + C2*i

2 ) mod HSize

where, i is the ith probing

h1(k) is the hash function 1 (described later) on given key ‘k’

C1 and C2 are the two constants, need to be decided beforehand

HSize is the size of the hash table

If the user selects double-hashing, then the hash function would be:

HFdouble( h1(k), i ) = ( h1(k) + i * h2(k) ) mod HSize

where, i is the ith probing

h1(k) is the hash function 1 (described later) on given key ‘k’

h2(k) is the hash function 2 (described later) on given key ‘k’

HSize is the size of the hash table

It is very important to deal with the problem of collisions, as this directly affects the performance of the

hash table. So, choosing an appropriate hash function plays a very important role and for this you will use

following two hash functions for this assignment, depending upon the probing technique selected:

Hash function 1:

For the given key ‘k’, the hash function is given by:

h1(k) = floor( HSize * ( (k * C) mod 1) )

where, HSize is the size of the hash table

k is the given key

C is a constant such that 0 < C < 1. Choose C to be (√5 -1)/2

floor(x) is the mathematical floor operation on x

For example:

If k = 123456 and HSize = 10000, the hash function is given as:

h1(k) = floor( 10000 * ( (123456 * ( (√5 -1)/2 ) mod 1 ) )

h1(k) = floor( 100000 * ( (123456 * 0.6180339887) mod 1 ))

h1(k) = floor( 100000 * (76300.00412) mod 1 )

h1(k) = floor( 100000 * 0.00412 )

h1(k) = floor(41.2)

h1(k) = 41

Hash function 2:

For the given key ‘k’, the hash function is given by:

h2(k) = h’(k, r) mod HSize

where, HSize is the size of the hash table

k is the given key

r is the number of digits to be selected from the result, need to be selected beforehand

h’(k, r) is the function that squares the given key ‘k’ and then extract the middle ‘r’ number of

digits.

For example:

If k = 123456, HSize = 10000 and r = 6, the hash function is given as:

h’(k) = 123456 * 123456 = 015241383936

On extracting middle 6 digits, h’(k) = 241383

This gives,

h2(k) = h’(k, r) mod HSize

h2 (k) = 241383 mod 10000 = 1383

Implementation Requirement:

Your hash table implementation needs to support following operations:

• insert(k , x, T) inserts object x, with the given key k, into a table T provided that the key k is not

currently in the table. On success, it should return the number of open location(s) visited to insert

that object in the hash table.

If the key k is already used in the hash table, then it should return the negative of the number of

open location(s) visited to find that the key k is already in the hash table.

In the case of double hashing, the returned value represents the number of times the new hash

is calculated.

• delete(k, x, T) deletes an object x, with the given key k, from the table T. The deletion is performed

by placing a marker in the place to indicate that the object x has been deleted. During insertion,

the location which is marked by this marker is considered as empty and you can place an object

there. However, while searching an object, the location which is marked by this marker is

considered occupied.

On success (i.e. if the key is found and is deleted), it needs to return the number of locations

visited to find that key k.

If the key k is not found, then the deletion operation is considered fail and in that case, it needs

to return the negative of the number of location visited to determine that the key k is not found

on the table.

A very important part in this deletion operation is that the object deleted from the hash table T

should be placed in another hash table, call it T_dup. Remember, this is IRS and you do not want

to completely throw away the information. Instead keep them in the separate hash table, T_dup,

for any future reference purpose. This new hash table T_dup is exactly same as T, so any

discussion on T is applicable to T_dup as well.

• find(k, T) finds the object for the given key k from the hash table T. On success, it should return

the number of location(s) visited to find that object in the hash table and all the information about

the object (first name, last name, state, SSN, salaries of all employment and the sum of all the

salaries).

If the key k is not found in the hash table, then it should return the negative of the number of

location(s) visited to find that the key k is not in the hash table.

• update(k, y, T) finds the object associated with the key k in the hash table T and updates that

information with object y.

On success (i.e. if the key is found and is updated successfully), then it needs to return the number

of locations visited to find that key k.

If the key k is not found, then the update operation is considered fail and in that case, it needs to

return the negative of the number of location visited to determine that the key k is not found on

the table.

• rehash(T) rehashes the hash table T under two conditions:

o the hash table T reaches the specified load factor, or

o the hash table becomes half full

In both of these cases, it creates the new hash table with the size of the hash table increased by

10% of previous hash table. Also, this new rehashed table should contain all the items of the old

hash table but not the locations with markers.

This operation is only called by the insert(..) operation

The assignment should take the hash table size T, the collision resolution technique (quadratic for

quadratic probing and double for double hashing) and their associated values, load factor, an input file

and an output file. If all your java files are in the folder named assignment4, then the sample invocation

would look something like this:

cd assignment4

javac *.java

java

Interpretation of the command line arguments are as follows:

• 5001 is the HSize of hash table T,

• 4001 is the HSize of hash table T_dup,

• quadratic is the probing technique,

• 13 is the constant C1 used in quadratic probing,

• 19 is the constant C2 used in quadratic probing,

• 0.75 is the load factor,

• input.txt is the text file where the input data for the program is available,

• output.txt is the text file where the detailed output is written from the program.

Or, the sample invocation could also look something like this:

java

Interpretation of the command line arguments are as follows:

• 5001 is the HSize of hash table T,

• 4001 is the HSize of hash table T_dup,

• double is the probing technique,

• 4 is the constant ‘r’ used in the double hashing,

• 0.75 is the load factor,

• input.txt is the text file where the input data for the program is available,

• output.txt is the text file where the detailed output is written from the program.

Note: You can make HSize for T_dup arbitrarily big so that rehashing is not necessary.

The input file contains a new operation in each line followed by the set of data appropriate for that

operation. For example, the sample input file (which is a tab delimited text file) looks something like this:

insert Reggie Howard Oklahoma 442-15-7719 52768

insert Antwoine Womack Kansas 514-44-4266 10151 37666 66712

insert Orlando Pace Oklahoma 445-12-2767 28764 49060

insert Orlando Pace Oklahoma 445-12-2767 28764 49060

delete 579-72-1565

delete 514-44-4266

update Orlando Pace Oklahoma 445-12-2767 28764 49065

find 445-12-2767

The data on the input file should be interpreted as follows:

• the first column always represents the operation (insert/update/delete/find) that is to be

performed.

• If the first column operation is insert then the next columns data should be interpreted as:

o (First name, Last name, State, SSN, Employer1 salary, Employer2 salary, Employer3

salary)

• If the first column operation is find then the next column data should be interpreted as:

o SSN

• If the first column operation is delete then the next column data should be interpreted as:

o SSN

• If the first column operation is update then the next column data should be interpreted as:

o (First name, Last name, State, SSN, Employer1 salary, Employer2 salary, Employer3

salary)

Note: Not all employees have three employments. The maximum number of employments is 3 while the

minimum number of employments is 1.

The program should display the detailed result on the file output.txt (which should be a tab delimited text

file, just like input txt file). The result of each query is to be written on a separate line. The sample output

for the above sample input could be:

1 Insertion successful

2 Insertion successful

3 Insertion successful

-1 Insertion failed: Duplicate employee record

-1 Deletion Failed: No record found

1 Deletion successful

1 Update successful

1 Find result: OrlandoPace Oklahoma 445-12-2767 28764 49060 77824

The data on the output file should be interpreted as follows:

• the first column represents the count of the number of open location visited to perform that

operation. Positive number represents success while negative number represents unsuccessful

operation.

• the second column represents the operation result.

o successful insertion operation is specified by the message: “Insertion successful”

o unsuccessful insertion operation due to duplicate employee record is specified by the

message: “Insertion failed: Duplicate key”.

o successful deletion operation is specified by the message: “Deletion successful”

o unsuccessful deletion operation is specified by the message: “Deletion failed: Key not

found”

o successful update operation is specified by the message: “Update successful”

o unsuccessful update operation is specified by the message: “Update failed: Key not

found”.

o successful find operation is specified by the message: “Find result” followed by the result.

If the employee has more than one employer, then the sum of the salary is also displayed.

For example, Orlando Pace has 2 employers (salary as 28764 and 49060). The

result of these two salaries is 77824, which is also displayed.

o unsuccessful find operation is specified by the message: “Find failed: Key not found”.

• the third column represents if any rehashing is required or not after the insertion operation. If the

rehashing is required then display the message as “Rehashing performed: ” followed by the

reason (either “the specific load factor is reached” or “the table became half full”)

The program should also display the summary result on the console.

1. Hash Table size: i.e. the size of the hash table

2. Data size: i.e. the number of input data

3. Number of collisions: i.e. the number of times the two different keys are mapped to the same

table address.

4. Probing mechanism: i.e. the collision resolution technique used (linear for linear probing,

quadratic for quadratic probing and double for double hashing)

5. Number of open locations visited: this needs to be specified for each of the operation separately

a. insertion

b. deletion

c. update

d. find

6. Number of rehashing performed: this needs to be specified for each reason separately

a. the specific load factor is reached

b. the table became half full

7. Number of items in the hash table T

8. Number of items in the hash table T_dup

9. Number of deletion operation: i.e. the sum of all successful and unsuccessful deletion operation

For Bonus Point: [25 Marks]

Before you proceed with the bonus part, please go through the grading policy for the bonus part under

Grading rubrics.

You need to write a report by providing the worst case analysis for insertion, find and deletion operation.

You also need to run the program and make following four plots:

1. Cost (i.e. number of open location visited) of insertion vs input size for various table size

2. Cost (i.e. number of open location visited) of deletion vs input size for various table size

3. Cost (i.e. number of open location visited) of find vs input size for various table size

4. Number of rehashing required vs input size for various table size

Use at least 3 hash table size, of your choice, on each of these curves.

Grading Rubric:

Your assignment will be graded based on the following grading Rubrics:

• Correct implementation and use of hash function 1 [3 Marks]

• Correct implementation and use of hash function 2 [3 Marks]

• Correct implementation of the insert operation and performing addition of all

salaries

[15 Marks]

• Correct implementation of the delete operation with new hash table creation [15 + 10 Marks]

• Correct implementation of the find operation [10 Marks]

• Correct implementation of the update operation [15 Marks]

• Correct implementation of the rehash operation [12 Marks]

• Correct implementation of outputs on the console (#3, #5, #6 #7, #8, #9) [12 Marks]

• Correct implementation of outputs on the file [5 Marks]

Grading Policy for bonus part:

The marks you obtained from completing the bonus point will only be added to the assignment portion in

the final grading. For example, if a student scores 500 points (out of maximum 500 points possible for

assignment), then any additional point obtained from the bonus will not be added to the grade book. The

purpose of having bonus part in the assignment is not to substitute the grade of exams or discussion.

However, if a student has not scored good grade in one of the 5 assignments, then completing the bonus

part will help to boost the grade.

The bonus will be graded based on the following grading Rubrics:

• Calculation of worst-case complexity for insertion, find and deletion [5 + 4 + 4 Marks]

• Generation of plots [4 * 3 Marks]

Submission Guidelines:

• In addition to .java file, all the codes should be submitted in .pdf format as well.

o Name the .pdf file with the same name as that of your source code.

o Please copy the codes in the text form. Do not screenshot or include any image of the

code in the pdf. This is to check for plagiarism.

o If you perform the bonus part, make sure to include the input data file that contains

events information.

• Run time analysis should be performed in a separate file and name the file as

algorithmAnalysis.pdf

o Also, this file need to have all the 4 plots.

• You need to include readMe.txt file that shows how to run your program. Make sure the program

runs in csx machine.

• Take a screen shot of all your outputs and save it as .pdf file.

o Name this file as screenshot.pdf

• The main driver file needs to have the following header information:

o Author Name:

o Email: