1 Problem regarding regular expressions (regex) in linux. Lecture slides attached.
Microsoft Word - regular expression.docx Problem: A big-data processing task: Write a bash script to find out 5 most frequently used words on a set of Wikipedia pages. The script prints out a list of these words and the number of occurrences of each word on the Wikipedia pages. The list should be sorted in descending order based on the number of occurrences. The following is a sample of output generated for 4 Wikipedia pages. 1080 the 677 of 480 in 473 and 443 a Since there are a huge number of pages in Wikipedia, it is not realistic to analyze all of them in short time on one machine. In this problem, your script only needs to analyze all the pages for the Wikipedia entries with two capital letters. For example, the Wikipedia page for entry "AC" is https://en.wikipedia.org/wiki/AC . Thus, the pages we need to analyze are https://en.wikipedia.org/wiki/AA https://en.wikipedia.org/wiki/AB https://en.wikipedia.org/wiki/AC ... https://en.wikipedia.org/wiki/ZY https://en.wikipedia.org/wiki/ZZ Your script combines a few tools in Linux to finish the above big-data processing task. You can use wget to download and save a page. For example, the following command downloads and save the AC wiki page into file AC.html: wget https://en.wikipedia.org/wiki/AC -O AC.html A HTML page has HTML tags, which should be removed before the analysis. (Open a .html file using vi and a web browser, and you will find the differences.) You can use lynx to extract the text content into a text file. For example, the following command extract the content for entry "AC" into AC.txt lynx -dump -nolist AC.html > AC.txt After the contents for all the required entries have been extracted, you need to find all the words using grep. You need to use a regular expression to guide grep to do the search. All the words found by grep should be saved into the same file, which is then used to find the most frequently used words. Note that you need to find distinct words and count the number of times that each distinct word appears in file. Using the -o option (i.e., grep -o) will simplify the processing, since you only need the matching parts. You may need sort,cut and uniq in this step. Read the man pages of sort, cut and uniq to understand how this can be achieved. Hint: You don't need to write code to count the number of occurrences for each distinct word. Use sort and uniq smartly --- sort groups the occurrences, and uniq counts the number of occurrences. CS 288 Intensive Programming CS 288 Intensive Programming in Linux Professor Ding, Xiaoning This content may NOT be uploaded, shared, or distributed, as it is protected. What is a regular expression? A regular expression (regex) describes a set of possible input strings. Regular expressions descend from a fundamental concept in computer science called finite automata theory Regular expressions are used by many tools in Unix, many languages, and many library functions. expr, [[, test, vi, grep, sed, emacs, … C, awk, tcl, perl and Python Compilers Scanf (e.g., scanf("%[^\n]", str) to read a whole line)… Prof. Ding, Xiaoning. Summer 2021. Protected content. 2 6/8/2021 Prof. Ding, Xiaoning. Summer 2021. Protected content. Prof. Ding, Xiaoning. Summer 2021. Protected content. expr: length of matching substring at beginning of a string expr match STRING REGEX_STR expr STRING : REGEX_STR REGEX_STR is a string describing the pattern (regular expression). If included in single quotes, special characters must be escaped. expr prints out the number of characters matched Exit code is 0 if the string matches the pattern, and 1 otherwise. $ stringZ=abcABC123ABCabc # |------| # 12345678 $ expr match "$stringZ" 'abc[A-Z]*.2' # 8 $ expr "$stringZ" : 'abc[A-Z]*.2' # 8 Prof. Ding, Xiaoning. Summer 2021. Protected content. expr : extracts substring at beginning of string expr match STRING '\(REGEX_STR\)' expr STRING : '\(REGEX_STR\)' REGEX_STR is a string describing the pattern (regular expression). If included in single quotes, special characters must be escaped. $ stringZ=abcABC123ABCabc # ======= $ expr match "$stringZ" '\(.[b-c]*[A-Z]..[0-9]\)' abcABC1 $ expr "$stringZ" : '\(.[b-c]*[A-Z]..[0-9]\)' abcABC1 $ expr "$stringZ" : '\(.......\)'` abcABC1 Prof. Ding, Xiaoning. Summer 2021. Protected content. expr : extracts substring at end of string expr match STRING '.*\(REGEX_STR\)' expr STRING : '.*\(REGEX_STR\)' REGEX_STR is a string describing the pattern (regular expression). If included in single quotes, special characters must be escaped. $ stringZ=abcABC123ABCabc # ====== $ expr match "$stringZ" '.*\([A-C][A-C][A-C][a-c]*\)' ABCabc $ expr "$stringZ" : '.*\(......\)' ABCabc Prof. Ding, Xiaoning. Summer 2021. Protected content. Using expr in if construct $ if expr 1 "<" 2; then echo true; else echo false; fi 1 true $ if expr filename : file; then echo true; else echo false; fi 4 true $ if expr filename : file; then echo true; else echo false; fi 0 false prof. ding, xiaoning. summer 2021. protected content. string matching using [[ and =~ [[ string =~ pattern ]] pattern: extended regular expression (ere) string the return value is 0 if the string matches the pattern, and 1 otherwise. prof. ding, xiaoning. summer 2021. protected content. $ if [[ 'test' =~ 'es' ]]; then echo true; else echo false; fi true $ if [[ 'test' =~ '^es' ]]; then echo true; else echo false; fi false regular expressions the simplest regular expressions are a string of literal characters to match. the string matches the regular expression if it contains the substring. prof. ding, xiaoning. summer 2021. protected content. 8 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. regular expressions a regular expression can match a string in more than one place. scrapple from the apple. match 1 match 2 regular expression a p p l e prof. ding, xiaoning. summer 2021. protected content. 9 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. unix tools rocks. match unix tools sucks. match unix tools is okay. no match regular expression c k s prof. ding, xiaoning. summer 2021. protected content. 10 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. regular expressions the . regular expression can be used to match any character. for him to loop a rope. match 1 match 2 regular expression o . prof. ding, xiaoning. summer 2021. protected content. 11 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. character classes character classes [] can be used to match any specific set of characters. beat a brat on a boat match 1 match 2 regular expression b [eor] a t match 3 prof. ding, xiaoning. summer 2021. protected content. 12 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. negated character classes character classes can be negated with the [^] syntax. beat a brat on a boat match regular expression b [^eo] a t prof. ding, xiaoning. summer 2021. protected content. 13 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. more about character classes [aeiou] will match any of the characters a, e, i, o, or u [kk]orn will match korn or korn ranges can also be specified in character classes [1-9] is the same as [123456789] [abcde] is equivalent to [a-e] you can also combine multiple ranges [abcde123456789] is equivalent to [a-e1-9] note that the - character has a special meaning in a character class but only if it is used within a range, [-123] would match the characters -, 1, 2, or 3 prof. ding, xiaoning. summer 2021. protected content. 14 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. named character classes commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl) syntax [:name:] [a-za-z] [[:alpha:]] [a-za-z0-9] [[:alnum:]] [45a-z] [45[:lower:]] important for portability across languages prof. ding, xiaoning. summer 2021. protected content. 15 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. anchors match at the beginning or end of a line (or both). ^ means beginning of the line, $ means end of the line beat a brat on a boat match regular expression ^ b [eor] a t regular expression b [eor] a t $ beat a brat on a boat match ^$ ^word$ prof. ding, xiaoning. summer 2021. protected content. 16 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. repetition the * defines zero or more occurrences of the single regular expression preceding it. (* is a normal character if no preceding expression) i got mail, yaaaaaaaaaay! match regular expression y a * y this is a loop. match regular expression o a * o .* prof. ding, xiaoning. summer 2021. protected content. 17 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. 2;="" then="" echo="" true;="" else="" echo="" false;="" fi="" 1="" true="" $="" if="" expr="" filename="" :="" file;="" then="" echo="" true;="" else="" echo="" false;="" fi="" 4="" true="" $="" if="" expr="" filename="" :="" file;="" then="" echo="" true;="" else="" echo="" false;="" fi="" 0="" false="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" string="" matching="" using="" [[="" and="~" [[="" string="~" pattern="" ]]="" pattern:="" extended="" regular="" expression="" (ere)="" string="" the="" return="" value="" is="" 0="" if="" the="" string="" matches="" the="" pattern,="" and="" 1="" otherwise. ="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" $="" if="" [[="" 'test'="~" 'es'="" ]];="" then="" echo="" true;="" else="" echo="" false;="" fi="" true="" $="" if="" [[="" 'test'="~" '^es'="" ]];="" then="" echo="" true;="" else="" echo="" false;="" fi="" false="" regular="" expressions="" the="" simplest="" regular="" expressions="" are="" a="" string="" of="" literal="" characters="" to="" match.="" the="" string="" matches="" the="" regular="" expression="" if="" it="" contains="" the="" substring.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 8="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" regular="" expressions="" a="" regular="" expression="" can="" match="" a="" string="" in="" more="" than="" one="" place.="" scrapple="" from="" the="" apple.="" match="" 1="" match="" 2="" regular="" expression="" a="" p="" p="" l="" e="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 9="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" unix="" tools="" rocks.="" match="" unix="" tools="" sucks.="" match="" unix="" tools="" is="" okay.="" no="" match="" regular="" expression="" c="" k="" s="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 10="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" regular="" expressions="" the="" .="" regular="" expression="" can="" be="" used="" to="" match="" any="" character.="" for="" him="" to="" loop="" a="" rope.="" match="" 1="" match="" 2="" regular="" expression="" o="" .="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 11="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" character="" classes="" character="" classes="" []="" can="" be="" used="" to="" match="" any="" specific="" set="" of="" characters.="" beat="" a="" brat="" on="" a="" boat="" match="" 1="" match="" 2="" regular="" expression="" b="" [eor]="" a="" t="" match="" 3="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 12="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" negated="" character="" classes="" character="" classes="" can="" be="" negated="" with="" the="" [^]="" syntax.="" beat="" a="" brat="" on="" a="" boat="" match="" regular="" expression="" b="" [^eo]="" a="" t="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 13="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" more="" about="" character="" classes="" [aeiou]="" will="" match="" any="" of="" the="" characters="" a,="" e,="" i,="" o,="" or="" u="" [kk]orn="" will="" match="" korn="" or="" korn="" ranges="" can="" also="" be="" specified="" in="" character="" classes="" [1-9]="" is="" the="" same="" as="" [123456789]="" [abcde]="" is="" equivalent="" to="" [a-e]="" you="" can="" also="" combine="" multiple="" ranges="" [abcde123456789]="" is="" equivalent="" to="" [a-e1-9]="" note="" that="" the="" -="" character="" has="" a="" special="" meaning="" in="" a="" character="" class="" but="" only="" if="" it="" is="" used="" within="" a="" range,="" [-123]="" would="" match="" the="" characters="" -,="" 1,="" 2,="" or="" 3="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 14="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" named="" character="" classes="" commonly="" used="" character="" classes="" can="" be="" referred="" to="" by="" name="" (alpha,="" lower,="" upper,="" alnum,="" digit,="" punct,="" cntrl)="" syntax="" [:name:]="" [a-za-z]="" [[:alpha:]]="" [a-za-z0-9]="" [[:alnum:]]="" [45a-z]="" [45[:lower:]]="" important="" for="" portability="" across="" languages="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 15="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" anchors="" match="" at="" the="" beginning="" or="" end="" of="" a="" line="" (or="" both).="" ^="" means="" beginning="" of="" the="" line,="" $="" means="" end="" of="" the="" line="" beat="" a="" brat="" on="" a="" boat="" match="" regular="" expression="" ^="" b="" [eor]="" a="" t="" regular="" expression="" b="" [eor]="" a="" t="" $="" beat="" a="" brat="" on="" a="" boat="" match="" ^$="" ^word$="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 16="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" repetition="" the="" *="" defines="" zero="" or="" more="" occurrences="" of="" the="" single="" regular="" expression="" preceding="" it.="" (*="" is="" a="" normal="" character="" if="" no="" preceding="" expression)="" i="" got="" mail,="" yaaaaaaaaaay!="" match="" regular="" expression="" y="" a="" *="" y="" this="" is="" a="" loop.="" match="" regular="" expression="" o="" a="" *="" o="" .*="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" 17="" 6/8/2021="" prof.="" ding,="" xiaoning.="" summer="" 2021.="" protected="" content.="" prof.="" ding,="">" 2; then echo true; else echo false; fi 1 true $ if expr filename : file; then echo true; else echo false; fi 4 true $ if expr filename : file; then echo true; else echo false; fi 0 false prof. ding, xiaoning. summer 2021. protected content. string matching using [[ and =~ [[ string =~ pattern ]] pattern: extended regular expression (ere) string the return value is 0 if the string matches the pattern, and 1 otherwise. prof. ding, xiaoning. summer 2021. protected content. $ if [[ 'test' =~ 'es' ]]; then echo true; else echo false; fi true $ if [[ 'test' =~ '^es' ]]; then echo true; else echo false; fi false regular expressions the simplest regular expressions are a string of literal characters to match. the string matches the regular expression if it contains the substring. prof. ding, xiaoning. summer 2021. protected content. 8 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. regular expressions a regular expression can match a string in more than one place. scrapple from the apple. match 1 match 2 regular expression a p p l e prof. ding, xiaoning. summer 2021. protected content. 9 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. unix tools rocks. match unix tools sucks. match unix tools is okay. no match regular expression c k s prof. ding, xiaoning. summer 2021. protected content. 10 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. regular expressions the . regular expression can be used to match any character. for him to loop a rope. match 1 match 2 regular expression o . prof. ding, xiaoning. summer 2021. protected content. 11 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. character classes character classes [] can be used to match any specific set of characters. beat a brat on a boat match 1 match 2 regular expression b [eor] a t match 3 prof. ding, xiaoning. summer 2021. protected content. 12 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. negated character classes character classes can be negated with the [^] syntax. beat a brat on a boat match regular expression b [^eo] a t prof. ding, xiaoning. summer 2021. protected content. 13 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. more about character classes [aeiou] will match any of the characters a, e, i, o, or u [kk]orn will match korn or korn ranges can also be specified in character classes [1-9] is the same as [123456789] [abcde] is equivalent to [a-e] you can also combine multiple ranges [abcde123456789] is equivalent to [a-e1-9] note that the - character has a special meaning in a character class but only if it is used within a range, [-123] would match the characters -, 1, 2, or 3 prof. ding, xiaoning. summer 2021. protected content. 14 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. named character classes commonly used character classes can be referred to by name (alpha, lower, upper, alnum, digit, punct, cntrl) syntax [:name:] [a-za-z] [[:alpha:]] [a-za-z0-9] [[:alnum:]] [45a-z] [45[:lower:]] important for portability across languages prof. ding, xiaoning. summer 2021. protected content. 15 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. anchors match at the beginning or end of a line (or both). ^ means beginning of the line, $ means end of the line beat a brat on a boat match regular expression ^ b [eor] a t regular expression b [eor] a t $ beat a brat on a boat match ^$ ^word$ prof. ding, xiaoning. summer 2021. protected content. 16 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning. summer 2021. protected content. repetition the * defines zero or more occurrences of the single regular expression preceding it. (* is a normal character if no preceding expression) i got mail, yaaaaaaaaaay! match regular expression y a * y this is a loop. match regular expression o a * o .* prof. ding, xiaoning. summer 2021. protected content. 17 6/8/2021 prof. ding, xiaoning. summer 2021. protected content. prof. ding, xiaoning.>