Files attached
Assignment 7 The Great Firewall of Santa Cruz: Bloom Filters, Linked Lists and Hash Tables Prof. Darrell Long CSE 13S – Spring 2021 First DESIGN.pdf draft due: May 27th at 11:59 pm PST Assignment (formally) due: June 4th at 11:59 pm PST 1 Introduction War is peace. Freedom is slavery. Ignorance is strength. —George Orwell, 1984 You have been selected through thoroughly democratic processes (and the machinations of your friend and hero Ernst Blofeld) to be the Dear and Beloved Leader of the Glorious People’s Republic of Santa Cruz following the fail- ure of the short-lived anarcho-syndicalist commune, where each person in turn acted as a form of executive officer for the week but all the decisions of that officer have to be ratified at a special bi-weekly meeting by a simple ma- jority in the case of purely internal affairs, but by a two-thirds majority in the case of more major decisions. Where it was understood that strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses, not from some farcical aquatic ceremony. It was a very silly place, and so with Herr Blofeld’s assistance you are now the leader of your people. In order to promote virtue and prevent vice, and to preserve social cohesion and discourage unrest, you have decided that the Internet content must be filtered so that your beloved children are not corrupted through the use of unfortunate, hurtful, offensive, and far too descriptive language. 2 Bloom Filters The Ministry of Peace concerns itself with war, the Ministry of Truth with lies, the Ministry of Love with torture and the Ministry of Plenty with starvation. These contradictions are not accidental, nor do they result from from ordinary hypocrisy: they are deliberate exercises in doublethink. —George Orwell, 1984 The Internet is very large, very fast, and full of badthink. The masses spend their days sending each other cat videos and communicating through their beloved social media platforms: Twitter and Discord. To your dismay, you find that a portion of the masses frequently use words you deem improper—oldspeak. You decide, as the newly elected Dear and Beloved Leader of the Glorious People’s Republic of Santa Cruz (GPRSC), that a more neutral newspeak is required to keep your citizens of the GPRSC content, pure, and from thinking too much. But how do you process and store so many words as they flow in and out of the GPRSC at 10Gbits/second? The solution comes to your brilliant and pure mind—a Bloom filter. © 2021 Darrell Long 1 A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton H. Bloom in 1970, and is used to test whether an element is a member of a set. False-positive matches are possible, but false negatives are not—in other words, a query for set membership returns either “possibly in the set” or “definitely not in the set.” Elements can be added to the set but not removed from it; the more elements added, the higher the probability of false positives. A Bloom filter can be represented as an array of m bits, or a bit vector. A Bloom filter should utilize k different hash functions. Using these hash functions, a set element added to the Bloom filter is mapped to at most k of the m bit indices, generating a uniform pseudo-random distribu- tion. Typically, k is a small constant which depends on the desired false error rate ², while m is proportional to k and the number of elements to be added. Assume you are adding a word w to your Bloom filter and are using k = 3 hash functions, f (x), g (x), and h(x). To add w to the Bloom filter, you simply set the bits at indices f (w), g (w), and h(w). To check if some word w ′ has been added to the same Bloom filter, you check if the bits at indices f (w ′), g (w ′), and h(w ′) are set. If they are all set, then w ′ has most likely been added to the Bloom filter. If any one of those bits was cleared, then w ′ has definitely not been added to the Bloom filter. The fact that the Bloom filter can only tell if some word has most likely been added to the Bloom filter means that false positives can occur. The larger the Bloom filter, the lower the chances of getting false positives. So what do Bloom filters mean for you as the Dear and Beloved Leader? It means you can take a list of pro- scribed words, oldspeak and add each word into your Bloom filter. If any of the words that your citizens use seem to be added to the Bloom filter, then this is very ungood and further action must be taken to discern whether or not the citizen did transgress. You decide to implement a Bloom filter with three salts for three different hash functions. Why? To reduce the chance of a false positive. You can think of a “salt” as an initialization vector or a key. Using different salts with the same hash function results in a different, unique hash. Since you are equipping your Bloom filter with three different salts, you are effectively getting three different hash functions: f (x), g (x), and h(x). Hashing a word w , with extremely high probability, should result in f (w) 6= g (w) 6= h(w). These salts are to be used for the SPECK cipher, which requires a 128-bit key, so we have used MD5 1 “message-digest” to reduce three books down to 128 bits each. You will use the SPECK cipher as a hash function, which will be discussed in §4. 2.1 BloomFilter The following struct defines the BloomFilter ADT. The three salts will be stored in the primary, secondary, and tertiary fields. Each salt is 128 bits in size. To hold these 128 bits, we use an array of two uint64_ts. This struct definition must go in bf.c. 1 struct BloomFilter { 2 uint64_t primary [2]; // Primary hash function salt. 3 uint64_t secondary [2]; // Secondary hash function salt. 4 uint64_t tertiary [2]; // Tertiary hash function salt. 5 BitVector *filter; 6 }; 2.2 BloomFilter *bf_create(uint32_t size) The constructor for a Bloom filter. Working code for the constructor, complete with primary, secondary, and ter- tiary salts that you will use, is shown below. Note that you will also have to implement the bit vector ADT for your Bloom filter, as it will serve as the array of bits necessary for a proper Bloom filter. 1Rivest, R.. “The MD5 Message-Digest Algorithm.” RFC 1321 (1992): 1-21. © 2021 Darrell Long 2 Preethika Rangamgari Preethika Rangamgari 1 BloomFilter *bf_create(uint32_t size) { 2 BloomFilter *bf = (BloomFilter *) malloc(sizeof(BloomFilter)); 3 if (bf) { 4 // Grimm 's Fairy Tales 5 bf ->primary [0] = 0x5adf08ae86d36f21; 6 bf ->primary [1] = 0xa267bbd3116f3957; 7 // The Adventures of Sherlock Holmes 8 bf ->secondary [0] = 0x419d292ea2ffd49e; 9 bf ->secondary [1] = 0x09601433057d5786; 10 // The Strange Case of Dr. Jekyll and Mr. Hyde 11 bf ->tertiary [0] = 0x50d8bb08de3818df; 12 bf ->tertiary [1] = 0x4deaae187c16ae1d; 13 bf ->filter = bv_create(size); 14 if (!bf->filter) { 15 free(bf); 16 bf = NULL; 17 } 18 } 19 return bf; 20 } 2.3 void bf_delete(BloomFilter **bf) The destructor for a Bloom filter. As with all other destructors, it should free any memory allocated by the con- structor and null out the pointer that was passed in. 2.4 uint32_t bf_size(BloomFilter *bf) Returns the size of the Bloom filter. In other words, the number of bits that the Bloom filter can access. Hint: this is the length of the underlying bit vector. 2.5 void bf_insert(BloomFilter *bf, char *oldspeak) Takes oldspeak and inserts it into the Bloom filter. This entails hashing oldspeak with each of the three salts for three indices, and setting the bits at those indices in the underlying bit vector. 2.6 bool bf_probe(BloomFilter *bf, char *oldspeak) Probes the Bloom filter for oldspeak. Like with bf_insert(), oldspeak is hashed with each of the three salts for three indices. If all the bits at those indices are set, return true to signify that oldspeak was most likely added to the Bloom filter. Else, return false. 2.7 uint32_t bf_count(BloomFilter *bf) Returns the number of set bits in the Bloom filter. 2.8 void bf_print(BloomFilter *bf) A debug function to print out a Bloom filter. © 2021 Darrell Long 3 Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari Preethika Rangamgari 3 Hashing with the SPECK Cipher You will need a good hash function to use in your Bloom filter and hash table (discussed in §5). We will have discussed hash functions in lecture, and rather than risk having a poor one implemented, we will simply provide you one. The SPECK2 block cipher is provided for use as a hash function. SPECK is a family of lightweight block ciphers publicly released by the National Security Agency (NSA) in June 2013. SPECK has been optimized for performance in software implementations, while its sister algorithm, SIMON, has been optimized for hardware implementations. SPECK is an add-rotate-xor (ARX) cipher. The reason a cipher is used for this is because encryption generates random output given some input; exactly what we want for a hash. Encryption is the process of taking some file you wish to protect, usually called plaintext, and transforming its data such that only authorized parties can access it. This transformed data is referred to as ciphertext. Decryption is the inverse operation of encryption, taking the ciphertext and transforming the encrypted data back to its origi- nal state as found in the original plaintext. Encryption algorithms that utilize the same key for both encryption and decryption, like SPECK, are symmetric-key algorithms, and algorithms that don’t, such as RSA, are asymmetric-key algorithms. You will be given two files, speck.h and speck.c. The former will provide the interface to using the SPECK hash function which has been named hash(), and the latter contains the implementation. The hash function hash() takes two parameters: a 128-bit salt passed in the form of an array of two uint64_ts, and a key to hash. The function will return a uint32_t which is exactly the index the key is mapped to. 1 uint32_t hash(uint64_t salt[], char *key); 4 Bit Vectors Symmetrical equations are good in their place, but ‘vector’ is a useless survival, or offshoot from quaternions, and has never been of the slightest use to any creature. —Lord Kelvin You will reuse much of the bit vector implementation from assignment 5 for this assignment. If there were any problems with your implementation,