CS 441 Semester project – Spring 2020 For this project you’ll be carrying out some simple text analysis, and comparing single- and multi- threaded performance. Though the original intention was to...

1 answer below »
I have uploaded the program question which needed to submit in python


CS 441 Semester project – Spring 2020 For this project you’ll be carrying out some simple text analysis, and comparing single- and multi- threaded performance. Though the original intention was to require this project to be done in Clojure, you may optionally do the project in Python provided you use a functional-programming style; in other words, this analysis should be done using Map/Reduce, rather than the usual imperative approach. Python programs that do not use Map/Reduce will be heavily penalized. (The point of the assignment is functional programming; solving this particular problem is mildly interesting but not really the point of all of it. Clojure programs will be functional by default, because of the language.) The amount of information in a data stream is a measure of the uncertainty of the data, of how unexpected the next data item is, given what’s come before. (That’s why the news doesn’t issue an alert when the sun rises in the East—if something is certain to happen, there’s literally no information in it.) More formally, the information in a character-based data stream S is given by H (S )=∑(nc)(−pc)lg( pc ) where pc is the probability of each character (item) in a data stream, nc is the number of times each character c occurs, and the summation is over all characters in the stream. If the logarithm is in base 2, then the information is measured in bits. (The reason this matters is that the information in a stream provides a hard limit to any compression if we want to recover the original data exactly. We cannot hope to compress it below this size without loss.) The simplest measure is first-order entropy, in which we treat each character as independent of everything else. In this case, pc is just the number of occurrences of c divided by the total number of characters. Thus we can just count characters and quickly find the total information. But we know that in text, characters aren’t really independent. In English text, if the last letter was q then the next letter will be u. If the last two letters were Th, the next letter is much more likely to be e than t, even though e and t are about equally likely when we examine single characters. We can capture this by examining groups of characters rather than single characters. In this case, our ‘alphabet’ consists of pairs of letters, rather than single letters. This gives us a larger alphabet but it also captures some of the dependencies in the language. Likewise, we can extend this idea to character triples just as easily. As each of these will account for more of the dependencies between characters, we should see the total information in the stream go down as we process larger groups. Your program will be exploring this, and will also be exploring how concurrency can affect performance. You are given a moderately large text file. Begin by counting the occurrences of each character (again, using Map/Reduce if you are working in Python), the probability distribution that results, and the total information in the file. Then find the same information using character pairs; then using triples. You should find that the total information in the file decreases under each of these measures. Finding these quantities, should, we hope, be enough to keep your computer busy for a little bit. We’ll now explore what happens as we add concurrency. Add code that measures the time needed for your program to compute these statistics (excluding file I/O). Run the program 3 times and note the average time needed. Now modify the program so it uses multiple threads. (In Clojure, use pmap rather than map; in Python, use a worker pool.) Run the program using 2, 4, 8, 16, 32, and 64 threads, running the program 3 times under each condition, and finding the average time. (In Python, you may find it useful to also use a ProcessPool—the syntax to using a worker pool is almost identical—and see how the results compare.) Prepare a short report including a graph showing execution time as a function of the number of threads, summarizing your findings and explaining any patterns you observe. Submit this report along with your source code. Due Sunday night May 10. The Project Gutenberg EBook of War and Peace, by Leo Tolstoy This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: War and Peace Author: Leo Tolstoy Translators: Louise and Aylmer Maude Posting Date: January 10, 2009 [EBook #2600] Last Updated: January 21, 2019 Language: English Character set encoding: UTF-8 *** START OF THIS PROJECT GUTENBERG EBOOK WAR AND PEACE *** An Anonymous Volunteer, and David Widger WAR AND PEACE By Leo Tolstoy/Tolstoi CONTENTS BOOK ONE: 1805 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII CHAPTER XIV CHAPTER XV CHAPTER XVI CHAPTER XVII CHAPTER XVIII CHAPTER XIX CHAPTER XX CHAPTER XXI CHAPTER XXII CHAPTER XXIII CHAPTER XXIV CHAPTER XXV CHAPTER XXVI CHAPTER XXVII CHAPTER XXVIII BOOK TWO: 1805 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII CHAPTER XIV CHAPTER XV CHAPTER XVI CHAPTER XVII CHAPTER XVIII CHAPTER XIX CHAPTER XX CHAPTER XXI BOOK THREE: 1805 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII CHAPTER XIV CHAPTER XV CHAPTER XVI CHAPTER XVII CHAPTER XVIII CHAPTER XIX BOOK FOUR: 1806 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII CHAPTER XIV CHAPTER XV CHAPTER XVI BOOK FIVE: 1806 - 07 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII CHAPTER XIV CHAPTER XV CHAPTER XVI CHAPTER XVII CHAPTER XVIII CHAPTER XIX CHAPTER XX CHAPTER XXI CHAPTER XXII BOOK SIX: 1808 - 10 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII CHAPTER XIV CHAPTER XV CHAPTER XVI CHAPTER XVII CHAPTER XVIII CHAPTER XIX CHAPTER XX CHAPTER XXI CHAPTER XXII CHAPTER XXIII CHAPTER XXIV CHAPTER XXV CHAPTER XXVI BOOK SEVEN: 1810 - 11 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII BOOK EIGHT: 1811 - 12 CHAPTER I CHAPTER II CHAPTER III CHAPTER IV CHAPTER V CHAPTER VI CHAPTER VII CHAPTER VIII CHAPTER IX CHAPTER X CHAPTER XI CHAPTER XII CHAPTER XIII CHAPTER XIV CHAPTER XV CHAPTER XVI CHAPTER XVII
Answered Same DayMay 05, 2021

Answer To: CS 441 Semester project – Spring 2020 For this project you’ll be carrying out some simple text...

Akriti answered on May 09 2021
147 Votes
When we see the Graph of the average running time for each thread we see that it showed a downfall at 4 thread, and after that it is increasing, this is a random effect usually, when i runned the program again , it was giving increase in time required for running the thread. But as we increase the...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here