Those are the HW
STAT 4410/8416 Homework 3
lastName firstName
Due on Oct 20, 2019

1. Visualizing Relationships in Data: Use the MLB_teams dataset in the mdsr package to create an informative data graphic that illustrates the relationship between winning percentage and payroll in context.

2. Text Data Analysis: Download “lincoln-last-speech.txt” from Canvas, which contains Lincoln’s last public address. Now answer the following questions and include your code.
a) Read the text and store it in lAddress. Show the first 70 characters from the first element of the text.
b) Now we are interested in the words used in his speech. Extract all the words from lAddress, convert all of them to lower case and store the result in vWord. Display the first few words.
c) Words like am, is, my or through are not of much interest to us; these types of words are called stop-words. The package tm has a function called stopwords(). Get all the English stop words and store them in sWord. Display a few stop words in your report.
d) Remove all the sWord from vWord and store the result in cleanWord. Display the first few clean words.
e) cleanWord contains all the cleaned words used in Lincoln’s address. We would like to see which words are used more frequently. Find the 15 most frequently used clean words and store the result in fWord. Display the first 5 words from fWord along with their frequencies.
f) Construct a bar chart showing the count of each word for the 15 most frequently used words. Add a layer +coord_flip() to your plot.
g) What is the reason for adding the layer +coord_flip() to the plot in question (2f)? Explain what would happen if we had not done that.
h) The plot in question (2f) uses a bar plot to display the data. Can you think of another plot that delivers the same information but looks much simpler? Demonstrate your answer by generating such a plot.

3. Answering Questions from Data: Install package nycflights13.
The package provides a data frame called flights. Answer the following questions using this data.
a) What month had the highest proportion of cancelled flights? What month had the lowest? Interpret any seasonal patterns. Please produce a plot that illustrates the proportion of cancelled flights for each month.
b) What plane (specified by the tailnum variable) traveled the most times from New York City airports in 2013? Plot the number of trips per week over the year.
c) Use the flights and planes tables to answer the following questions: What is the oldest plane (specified by the tailnum variable) that flew from New York City airports in 2013? How many airplanes that flew from New York City are included in the planes table?
d) Use the flights and planes tables to answer the following questions: How many planes have a missing date of manufacture? What are the five most common manufacturers (Note: pay close attention to the same manufacturer being represented multiple times)? Has the distribution of manufacturer changed over time as reflected by the airplanes flying from NYC in 2013? Produce a plot that backs up your claim. (Hint: you may need to recode the manufacturer name and collapse rare vendors into a category called Other.)
e) Use the weather table to answer the following questions specifically for July, 2013: What is the distribution of temperature in terms of windspeed? What is the relationship between dewp and humid? What is the relationship between precip and visib? Please provide plots for each question.

4. Regular Expressions: Write a regular expression to match patterns in the following strings. Demonstrate that your regular expression indeed matched that pattern by including codes and results. Carefully review how the first problem is solved for you.
a) We have a vector vText as follows. Write a regular expression that matches g, og, go or ogo in vText and replace the matches with ‘.’.
    vText <- c('google', 'logo', 'dig', 'blog', 'boogie')

Answer:

    pattern <- 'o?go?'
    gsub(pattern, '.', vText)
    ## [1] "..le"  "l."    "di."   "bl."   "bo.ie"

b) Replace only the 5 or 6 digit numbers with the word “found” in the following vector. Please make sure that 3, 4, or 7 digit numbers do not get changed.

    vphone <- c('874', '6783', '345345', '32120', '468349', '8149674')

c) Replace all the characters that are not among the 26 English characters or a space. Please replace with an empty string.

    mytext <- "#y%o$u @g!o*t t9h(e) so#lu!tio$n c%or_r+e%ct"

d) In the following text, replace all the words that are exactly 3 or 4 characters long with triple dots ‘...’.

    mytext <- "each of the three and four character words will be gone now"

e) Extract all the three numbers embedded in the following text.

    bigtext <- 'there are four 20@14 numbers hid989den in the 500 texts'

f) Extract all the words between parentheses from the following string and count the number of words.

    mytext <- 'the salries are reported (in millions) for every company.'

g) Extract the texts in between _ and dot (.) in the following vector. Your output should be ‘bill’, ‘pay’, ‘fine-book’.

    mytext <- c("h_bill.xls", "big_h_pay.xls", "use_case_fine-book.pdf")

h) Extract the numbers (return only integers) that are followed by the units ‘ml’ or ‘lb’ in the following text.

    mytext <- 'received 10 apples with 200ml water at 8pm with 15 lb meat and 2lb salt'

i) Extract only the words in between pairs of the symbol $. Count the number of words you have found between pairs of dollar signs $.

    mytext <- 'math symbols are $written$ in $between$ dollar $signs$'

j) Extract all the valid equations in the following text.

    mytext <- 'equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6'

k) Extract all the letters of the following sentence and check if it contains all 26 letters in the alphabet. If not, produce code that will return the total number of unique letters that are included and list the letters that are not present as unique elements in a single vector.

    mytext <- 'there are five wizard boxing matches to be judged'

5. Extracting Data from the Web: Our plan is to extract data from web sources. This includes email addresses, phone numbers or other useful data. The function readLines() is very useful for this purpose.
a) Read all the text in http://mamajumder.github.io/index.html and store your texts in mytext. Show the first few rows of mytext and examine the structure of the data.
b) Write a regular expression that would extract all the http web link addresses from mytext. Include your code and display the results that show only the http web link addresses and nothing else.
c) Now write a regular expression that would extract all the emails from mytext. Include your code and display the results that show only the email addresses and nothing else.
d) Now we want to extract all the phone/fax numbers in mytext. Write a regular expression that would do this.
Demonstrate your code by showing the results.
e) The link to the ggplot2 documentation is http://docs.ggplot2.org/current/ and we would like to get the list of ggplot2 geoms from there. Write a regular expression that would extract all the geom names (geom_bar is one of them) from this link and display the unique geoms. How many unique geoms does it have?

6. Big Data Problem: Download the sample of big data from Canvas. Note that the data is in CSV format and compressed for easy handling. Now answer the following questions.
a) Read the data and select only the columns that contain the word ‘human’. Store the data in an object dat. Report the first few rows of your data.
b) The data frame dat should have 5 columns. Rename the columns, keeping only the last character of each column name, so that each column name has only one character. Report the first few rows of your data now.
c) Compute and report the mean of each column, grouped by column b, in a nice table.
d) Change the data into long form using id=‘b’ and store the data in mdat. Report the first few rows of data.
e) The data frame mdat is now ready for plotting. Generate density plots of value, color and fill by variable and facet by b.
f) The data set bigdatasample.csv is a sample of a much bigger data set. Here we read the data set and then selected the desired columns. Do you think it would be wise to do the same thing with the actual larger data set? Explain how you would solve this problem of selecting a few columns (as we did in question 6a) without reading the whole data set first. Demonstrate that by showing your code.

7. Optional Bonus Question (5 points extra): Download the Excel file “clean-dat-before.xls” from Canvas. It contains time series data for many variables. Of the two columns of the data, the first column represents time and the second column represents the measurement.
The challenge is that variable names are also included in the time column. Our goal is to clean and reshape the data. The first few rows and columns of the desired output are shown below. Notice that each time point is converted into an integer time index to make a uniform elapsed time for all the variables.

elapse_time    area  bulk.rotation.      ecg  endo.ma.circ..strain  endo.ma.radial.strain
          1  10.924           0.000  0.32157                 0.000                  0.000
          2  10.648           0.070  0.58824                -1.495                  0.762
          3  10.574          -0.128  0.81176                -1.423                  2.619
          4  10.487           0.097  0.88627                -0.620                  3.591
          5  10.342           0.181  0.87451                -1.142                  3.472
          6   9.995           0.235  0.85882                -3.269                  5.812
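
The cleaning and reshaping described in question 7 could be approached roughly as follows. This is only a sketch in base R on a small made-up data frame, not a solution for the actual file. The column names time and value, the time stamps, and the assumption that every variable starts with a name row whose time entry is non-numeric are all hypothetical. Only the area and ecg measurements are borrowed from the desired output shown above.

```r
# Toy stand-in for the spreadsheet. The two-column layout, the column names
# and the time stamps are assumptions made for illustration only.
raw <- data.frame(
  time  = c("area", "0.00", "0.05", "0.10", "ecg", "0.00", "0.04", "0.09"),
  value = c(NA, 10.924, 10.648, 10.574, NA, 0.32157, 0.58824, 0.81176),
  stringsAsFactors = FALSE
)

# Rows whose time entry cannot be parsed as a number are treated as name rows.
is_name <- is.na(suppressWarnings(as.numeric(raw$time)))

# Carry each variable name down over the measurement rows that follow it.
# This relies on the first row being a name row.
block <- cumsum(is_name)
raw$variable <- raw$time[is_name][block]

# Keep measurement rows only and number them 1, 2, 3, ... within each variable.
long <- raw[!is_name, c("variable", "value")]
long$elapse_time <- ave(seq_along(long$variable), long$variable, FUN = seq_along)

# Pivot to wide form: one row per time index, one column per variable.
wide <- reshape(long, idvar = "elapse_time", timevar = "variable",
                direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
rownames(wide) <- NULL
print(wide)
```

Replacing the original time stamps with a within-variable row counter is what makes the elapsed time uniform across variables, so it only makes sense if all variables were sampled the same number of times. Reading the Excel file itself is left out here because it requires an additional package.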