The goal of this project is to use the concepts taught in this course to develop an efficient way of working with Big Data.
You should have 2 files in your Linux system: hugefile1.txt and hugefile2.txt, each containing one billion lines. If you do not, please go back to the Module 7 Portfolio Reminder and complete the steps there.
Create a program, using a programming language of your choice, to produce a new file, totalfile.txt, by adding the numbers on corresponding lines of the two files. That is, each line of totalfile.txt is the sum of the corresponding lines in hugefile1.txt and hugefile2.txt.
For example, if the first 5 lines of your files look as follows:
$ head -5 hugefile*.txt
==> hugefile1.txt <==
4131
29929
6483
7659
25003
==> hugefile2.txt <==
8866
19171
11029
4889
27069
then the first 5 lines of totalfile.txt look like this:
$ head -5 totalfile.txt
12997
49100
17512
12548
52072
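As a starting point, here is a minimal sequential sketch of this step, assuming Python, the file names shown above, and one integer per line; any language you prefer is fine. The timing call is included so you can record the baseline run time for later comparison:

import time

start = time.perf_counter()
with open("hugefile1.txt") as f1, open("hugefile2.txt") as f2, \
        open("totalfile.txt", "w") as out:
    # zip() pairs corresponding lines from the two input files
    for a, b in zip(f1, f2):
        out.write(f"{int(a) + int(b)}\n")
print(f"Line-by-line version: {time.perf_counter() - start:.2f} s")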
Because files of this size cannot be read into memory in their entirety at the same time, you need to use concurrency. Reading the files one line at a time will take a long time, so use what you have learned in this course to optimize the process. Be sure to record how long each version of your program takes to complete the task.
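One way to reduce the overhead of line-at-a-time reading is to process the files in large chunks. The sketch below is one possible approach, again assuming Python; CHUNK_LINES is an illustrative parameter you can tune and time, not a value prescribed by the assignment:

import time
from itertools import islice

CHUNK_LINES = 1_000_000  # number of lines held in memory per chunk (tunable)

start = time.perf_counter()
with open("hugefile1.txt") as f1, open("hugefile2.txt") as f2, \
        open("totalfile.txt", "w") as out:
    while True:
        chunk1 = list(islice(f1, CHUNK_LINES))
        chunk2 = list(islice(f2, CHUNK_LINES))
        if not chunk1:
            break
        # sum corresponding lines of the chunk and write them in one call
        out.writelines(f"{int(a) + int(b)}\n" for a, b in zip(chunk1, chunk2))
print(f"Chunked version: {time.perf_counter() - start:.2f} s")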
Create two programs: one that reads the first half of each file, and another that reads the second half. Use the OS to launch both programs simultaneously.
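One way to structure this is a single worker script that is told which range of lines to process and writes its own partial output; the script name, argument convention, and partial-file names below are illustrative, not required by the assignment:

# half_worker.py -- process lines [start, end) of both input files
import sys
from itertools import islice

start, end = int(sys.argv[1]), int(sys.argv[2])

with open("hugefile1.txt") as f1, open("hugefile2.txt") as f2, \
        open(f"partial_{start}_{end}.txt", "w") as out:
    # islice skips ahead to the requested line range in each file
    for a, b in zip(islice(f1, start, end), islice(f2, start, end)):
        out.write(f"{int(a) + int(b)}\n")

You could then launch both halves at the same time from the shell, for example by running python3 half_worker.py 0 500000000 and python3 half_worker.py 500000000 1000000000 in the background with &, waiting for both to finish, and concatenating the two partial files, in order, into totalfile.txt. Note that islice still reads and discards the skipped lines, so the second worker still scans the first half of each file; a more elaborate version could seek to a byte offset instead.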
Now, break up hugefile1.txt and hugefile2.txt into 10 files each, and run your process on all 10 sets in parallel. How do the run times compare to the original process?
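One way to run the 10 pieces in parallel from a single program is Python's multiprocessing module. The sketch below assumes the big files have already been split into pieces named hugefile1_part00 through hugefile1_part09 (and likewise for hugefile2), for example with the split utility using a fixed number of lines per piece so that the piece boundaries of the two files line up; those piece names and the pool size are illustrative:

import time
from multiprocessing import Pool

def sum_piece(i):
    """Sum corresponding lines of one pair of pieces into a partial output."""
    with open(f"hugefile1_part{i:02d}") as f1, \
         open(f"hugefile2_part{i:02d}") as f2, \
         open(f"total_part{i:02d}", "w") as out:
        for a, b in zip(f1, f2):
            out.write(f"{int(a) + int(b)}\n")

if __name__ == "__main__":
    start = time.perf_counter()
    with Pool(processes=10) as pool:
        pool.map(sum_piece, range(10))  # process the 10 pairs in parallel
    print(f"10-way parallel version: {time.perf_counter() - start:.2f} s")

Afterwards, concatenate total_part00 through total_part09, in order, into totalfile.txt, and compare the recorded run time against the earlier versions.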