CS288 Intensive Programming in Linux, Fall 2022, Prof. Xiaoning Ding
DO NOT DISTRIBUTE WITHOUT WRITTEN PERMISSION FROM PROF. XIAONING DING

A big-data processing task

Write a bash script to find the 5 most frequently used words on a set of Wikipedia pages. The script prints a list of these words and the number of occurrences of each word on the pages. The list should be sorted in descending order of the number of occurrences. The following is a sample of the output generated for 4 Wikipedia pages.

    1080 the
    677 of
    480 in
    473 and
    443 a

Since Wikipedia has a huge number of pages, it is not realistic to analyze all of them in a short time on one machine. In this problem, your script only needs to analyze the pages for the Wikipedia entries whose names consist of two capital letters. For example, the Wikipedia page for entry "AC" is https://en.wikipedia.org/wiki/AC . Thus, the pages we need to analyze are

    https://en.wikipedia.org/wiki/AA
    https://en.wikipedia.org/wiki/AB
    https://en.wikipedia.org/wiki/AC
    ...
    https://en.wikipedia.org/wiki/ZY
    https://en.wikipedia.org/wiki/ZZ

Your script combines a few Linux tools to finish the above big-data processing task.

You can use wget to download and save a page. For example, the following command downloads the AC wiki page and saves it into file AC.html:

    wget https://en.wikipedia.org/wiki/AC -O AC.html

An HTML page contains HTML tags, which should be removed before the analysis. (Open a .html file in vi and in a web browser, and you will see the difference.) You can use html2text to extract the text content into a text file. For example, the following command extracts the content for entry "AC" into AC.txt:

    html2text AC.html > AC.txt

After the contents of all the required entries have been extracted, you need to find all the words using grep. You need to use a regular expression to guide grep in the search.
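The steps described so far can be sketched as follows. This is only an illustration under stated assumptions, not a reference solution: the names entries, url_for, fetch_entry and extract_words are made up for this sketch, and a "word" is assumed to be a run of ASCII letters (other definitions are possible). The -o option makes grep print only the matched parts, one per line. fetch_entry is defined but not called here, because it needs network access and the html2text tool.

```shell
#!/bin/bash
# Illustrative sketch; all function and variable names are hypothetical.

# Brace expansion generates every two-capital-letter entry name: AA AB ... ZZ.
entries=( {A..Z}{A..Z} )

# Build the page URL for one entry, following the pattern shown in the handout.
url_for() {
    printf 'https://en.wikipedia.org/wiki/%s\n' "$1"
}

# Download one entry and strip its HTML tags (not called in this sketch).
fetch_entry() {
    wget "$(url_for "$1")" -O "$1.html"
    html2text "$1.html" > "$1.txt"
}

# Read text on standard input and print each word on its own line.
# Assumption: a word is one or more ASCII letters.
extract_words() {
    grep -o '[A-Za-z]\{1,\}'
}

echo "number of entries: ${#entries[@]}"
url_for "${entries[2]}"
printf '%s\n' 'AC may refer to: alternating current, or air conditioning.' | extract_words
```

A loop such as for e in "${entries[@]}"; do fetch_entry "$e"; done would then process every page; it is left out here so that the sketch runs without downloading anything.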
All the words found by grep should be saved into the same file, which is then used to find the most frequently used words. Note that you need to find the distinct words and count the number of times that each distinct word appears in the file. Using the -o option (i.e., grep -o) will simplify the processing, since you only need the matching parts. You may need sort, cut and uniq in this step. Read the man pages of sort, cut and uniq to understand how this can be achieved.

Hint: You don't need to write code to count the number of occurrences of each distinct word. Use sort and uniq smartly: sort groups the occurrences, and uniq counts the number of occurrences.

Note: We will not look at the exact number of occurrences. Your solution is considered correct as long as it correctly finds the most frequently used words (the, a, of, ...) and the numbers of occurrences are roughly correct (10,000+).

It may take time to download all the required pages (AA, ..., ZZ). To save time while developing, you can download the pages once (saving them as files) and then comment out the code that downloads the pages, so that you can reuse the files. Before you submit your solution, un-comment the code that downloads the pages.


Search the files that satisfy certain requirements

Write a bash script that searches in a directory (including its subdirectories) for the files that meet the following criteria:
1) owned by a user (specified on the command line), AND
2) having read permission for all users (i.e., the owner, other users in the same group, and users outside of the group).

The script takes two arguments. The first argument is the pathname of the directory and the second argument is a user ID (username).
Note that your script needs to traverse the directory and check the files under it, its subdirectories, sub-subdirectories, etc. For the purposes of practice, DO NOT use the find command, and DO NOT use the built-in -R option of the ls command (you can use the ls command with options other than -R).

During the traversal, for each file (assuming its pathname is saved in variable pathname), your script needs to
1) use the command ls -l ${pathname} to get the information of the file,
2) analyze the permission field and the owner field of the line generated by the above ls command using grep or expr, and determine whether the file satisfies the requirements,
3) if the file satisfies the requirements, extract the time of creation or last modification from the output of the command ls -l ${pathname}, and
4) print out the following information about the file:
   • pathname
   • permissions (a group of 9 characters consisting of r, w, x, or -; do not include the file-type character at the beginning of the line)
   • time of creation or last modification extracted from the output of the ls command.

For the format of the information printed out by ls -l, refer to these pages: https://cr.yp.to/ftp/list/binls.html, https://linuxize.com/post/how-to-list-files-in-linux-using-the-ls-command/ .

Check the owner field and the read permissions to determine whether a file satisfies the requirements. For a file with read permission for all users, ls -l ${pathname} shows three “r” in the permission field.

When you extract the time of creation or last modification, your code should be flexible enough to handle two time formats: month+day+hour+minute for files modified/created within the last six months, and month+day+year for other files.

To use grep to process the line printed out by ls -l ${pathname}, you can use a pipe to connect the ls command and the grep command. For example, the following command extracts the first field (the file-type character and the permissions):
    $ ls -l /bin/bash | grep -o '^.\{10\}'
    -rwxr-xr-x

This exercise is for you to practice the use of regular expressions. Use regular expressions to analyze the output of ls -l. DO NOT use the cut command in your script.

Note:
1. The following method can simplify the analysis with regular expressions: cut off the substring before the field that you want to analyze/extract, and use the ^ anchor to match from the beginning of the remaining part. That is, to extract some information from the middle of a string, first use substring expansion to get the part of the string that starts at that information, and then use grep -o with an appropriate regular expression. This is particularly useful when extracting the modification/creation time field from the ls output. For example, to extract the second field from the ls output, you can first extract the first field using a regex, remove it using substring expansion, and then use another grep -o with a regex that describes the pattern of the second field.

    $ ls_output=`ls -l /bin/bash`
    $ first_field=`echo "${ls_output}" | grep -o '^.\{10\}'`
    $ len=${#first_field}
    $ remaining=${ls_output:len+1}    # cut the 1st field (and the space after it) off
    $ second_field=`echo "$remaining" | grep -o '^[0-9]\{1,\}'`
    $ echo "$second_field"
    1

With this method, you only need to design a regular expression that matches the field that you want to extract. You don’t need to worry about interference from the text that you have cut off. If you did not cut off the beginning part, you might have to refine the regular expression to exclude unwanted matches in the first part. When analyzing the output of ls -l ${pathname}, you also want to use the spaces between fields as “landmarks”.
2. Use double quotes to enclose $var in an echo command.
Otherwise, the output of echo may not exactly match the contents of var. For example, if var contains two consecutive spaces and is not enclosed in double quotes, its value is split at the spaces, and the parts are printed out with a single space in between.
3. If you use BRE (basic regular expressions, the default in grep), parentheses and braces act as grouping and repetition operators only when they are escaped with a backslash ("\"), as in \( \) and \{ \}.

Testing: to test the script, run it with “/usr/share/docutils/writers/” and “root” as arguments. You should see the information for all 36 files. Randomly select a few of these files, and manually check whether the information printed out by your script matches the corresponding information printed out by ls -l ${pathname}. Then run your script with “/usr/share/docutils/writers/” and your own username as arguments. You should not see any information printed out, because you do not own any files in that directory.
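The cut-off technique from Note 1 and the traversal requirement can be sketched as follows. This is a minimal sketch under several assumptions, not the expected solution: the function names (parse_ls_line, readable_by_all, walk) and the sample lines are hypothetical; the ls output is assumed to use English month names and the usual field order (permissions, link count, owner, group, size, time, name); the size is assumed to be a plain number (regular files, not devices); and the glob used in walk skips hidden files. Only BRE features that need no GNU extensions are used.

```shell
#!/bin/bash
# Illustrative sketch; function names and sample lines are hypothetical.

# Print "permissions|owner|time" for one line of ls -l output.
parse_ls_line() {
    local rest="$1" lead perms owner when

    # Field 1: one file-type character followed by nine permission characters.
    lead=$(echo "$rest" | grep -o '^.[rwxsStT-]\{9\}')
    perms=${lead:1}                      # drop the file-type character

    # Cut off field 1 and the link count, using the spaces as landmarks.
    lead=$(echo "$rest" | grep -o '^[^ ]\{1,\} \{1,\}[0-9]\{1,\} \{1,\}')
    rest=${rest:${#lead}}

    # The owner field now starts the remaining part.
    owner=$(echo "$rest" | grep -o '^[^ ]\{1,\}')

    # Cut off owner, group and size.
    lead=$(echo "$rest" | grep -o '^[^ ]\{1,\} \{1,\}[^ ]\{1,\} \{1,\}[0-9]\{1,\} \{1,\}')
    rest=${rest:${#lead}}

    # Time: month, day, then either hh:mm (5 characters) or a year (4 digits).
    when=$(echo "$rest" | grep -o '^[A-Z][a-z]\{2\} \{1,\}[0-9]\{1,2\} \{1,\}[0-9:]\{4,5\}')

    echo "$perms|$owner|$when"
}

# Succeed if owner, group and others all have read permission.
readable_by_all() {
    echo "$1" | grep -q '^r..r..r..$'
}

# Print every non-directory under "$1", recursing without find or ls -R.
walk() {
    local dir="$1" entry
    for entry in "$dir"/*; do
        [ -e "$entry" ] || continue      # glob did not expand: empty directory
        if [ -d "$entry" ] && [ ! -L "$entry" ]; then
            walk "$entry"
        else
            echo "$entry"
        fi
    done
}

# Try the parser on two made-up lines, one per time format.
parse_ls_line '-rw-r--r-- 1 root root  1234 Mar  3 10:32 /tmp/a.txt'
parse_ls_line '-rwxr-x--- 1 alice staff 52 Nov 12  2021 /tmp/b.sh'
```

Every echo in the sketch quotes its variable, as Note 2 advises, so the consecutive spaces that ls uses to align the day and the year survive in the extracted time field. In a full script, each pathname printed by walk would be passed to ls -l, and the resulting line to parse_ls_line.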