Final project using RStudio on data exploration and analysis. It is imperative that you follow the guidelines and requirements. Must be done in R Markdown.


FES 205 S&DS 230e Final Project Guidelines

S&DS 230e Data Analysis Final Project Guidelines
Due Friday, 8.13.21, 11:59pm, uploaded to CANVAS as PDF or DOC AND RMD

Overview
Analyze a dataset of your choice and write a 10-20 page report of your findings. This report must be created in RMarkdown and you’ll submit both a knitted PDF/doc file and the raw RMarkdown code. Your goal is to demonstrate your ability to code in R, to clean data, to use appropriate graphical and statistical techniques in R, and to interpret your results.

Groups
You are encouraged but certainly not required to work in groups. Groups can be up to 4 students. Everyone in the group gets the same grade.

Data
You should choose a dataset that is interesting to you, OR you may use one of three datasets provided by me. The dataset should have at least 10 variables and at least 50 observations. You must have at least two continuous variables and at least two categorical variables. Some datasets will have hundreds of variables and more than 100,000 observations. Getting and cleaning the data may be the most difficult part of your project. YOU ABSOLUTELY SHOULD DISCUSS YOUR DATA WITH ME OR A TA BEFORE TURNING IN YOUR PROJECT.

There are many online sources for data – you can just go to Google and search for a subject and then add ‘data’. You can also scrape data off a website. Here are some good sites:
• ICPSR https://www.icpsr.umich.edu/icpsrweb/landing.jsp. More than 10,000 datasets here
• Kaggle https://www.kaggle.com/datasets
• The Census Bureau (http://www.census.gov/)
• NOAA (http://www.nodc.noaa.gov/)
• The US Environmental Protection Agency (http://www.epa.gov/epahome/Data.html)

Other ideas:
• Use your web scraping tools to get data on all roll call votes in the 116th Senate (2nd session, 2020): https://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_116_2.htm

You should NOT choose a dataset that has already been extensively cleaned and analyzed (i.e. from a textbook or ‘nice example’ website). However, if there is minimal cleaning to do, then put more effort into something else. You do NOT need to use all the variables in your dataset; indeed, you may end up cleaning/analyzing only 6 to 10 variables. Your goal is not to be comprehensive, but to demonstrate what you’ve learned.

If you decide not to find your own data, you can use one of the following three datasets, all available on CANVAS under Files > Final Project Information. Information on variables and collection methods is also provided.
• World Bank Data from 2016
• Environmental Attitudes from the General Social Survey of 2000
• Food Choices (we looked briefly at a few variables in class): https://www.kaggle.com/borapajo/food-choices

Format
Your project should be presented as a report; it should have appropriate RMarkdown formatting and discussions should be in complete sentences. There is no minimum length (brevity and clarity are admired), and your knitted report should not be more than 15 pages long, including graphs and relevant output (just suppress irrelevant output). You should NOT have pages of output that you don’t discuss. You also don’t need to have RMarkdown show every last bit of output your code creates. It should feel more formal than a homework assignment, but you should be extremely concise in your discussion.

Sections of the Report
• Introduction (Background, motivation) – not more than a short paragraph.
• DATA: Make a LIST of all variables you actually use – describe units, anything I should know. Ignore variables you don’t discuss.
• Data cleaning process – describe the cleaning process you used on your data. Talk about what issues you encountered.
• Descriptive Plots, summary information. Plots should be clearly labeled, well formatted, and display an aesthetic sense.
• Analysis – see below
• Conclusions and Summary – a short paragraph.

Content Requirements
Your report should include evidence of your ability in each of the following areas:
1) Data Cleaning – demonstrate use of find/replace, data cleaning, dealing with missing values, text character replacement, matching. It’s ok if your data didn’t require much of this.
2) Graphics – show appropriate use of at least ONE of each of the following – boxplot, scatterplot (can be matrix plot), normal quantile plot (can be related to regression), residual plots, histogram.
3) Basic tests – t-test, correlation, AND ability to create a bootstrap confidence interval for either a t-test or a correlation.
4) Permutation Test – include at least one.
5) Multiple Regression – use either backwards stepwise regression or some form of best subsets regression. Should include residual plots. A GLM with a mix of continuous and categorical predictors is fine here.
6) AT LEAST ONE OF THE FOLLOWING TECHNIQUES – ANOVA, ANCOVA, Logistic Regression, Multinomial Regression, OR data scraping off a website.

Additional Comments
Please do NOT have appendices – unlike a journal article, include relevant plots and output in the section where you discuss the results (more of a narrative). This said, you should ONLY include output that is relevant to your discussion. I can always look at your RMarkdown code if I have questions. It is fine to suppress both long output and parts of your R code. As you work on this project, I expect you will regularly pester me and the TAs.

Submission – Please read this carefully
1) ONLY ONE person in a group should upload a copy of the final project (i.e. if there are three people in a group, only one person needs to upload the files).
2) BE SURE to put all members’ names on your project documents.
Answered 8 days after Aug 12, 2022


Mansi answered on Aug 16 2022
Final Project Report
Data used: Credit Card Data
1. Introduction to the problem and Objective
A company collected data from 5000 customers. The objective is to understand what is driving the total credit card spend (primary card + secondary card) and to prioritize the drivers by their importance. The data contain information on credit card usage and demographics in the form of 132 variables on 5000 customers. We will work with only the few of these variables that are relevant to our objective.
2. Dataset Used
Since there are a lot of variables in the dataset, we list and explain only those variables that are of interest to us and that we use in our study.
1. “jobsat”-Job satisfaction- 1:Highly dissatisfied, 2:Somewhat dissatisfied, 3:Neutral, 4:Somewhat satisfied,
5:Highly satisfied
2. “lninc”-Log-income
3. “edcat”-Level of education- 1:Did not complete high school, 2:High school degree, 3:Some college,
4:College degree, 5: Post-undergraduate degree
4. “carditems”-Number of items on primary card last month
5. “card2items”-Number of items on secondary card last month
6. “spousedcat”-Spouse level of education- 1:Did not complete high school, 2:High school degree, 3: Some
college, 4: College degree, 5:Post-undergraduate degree
7. “gender”-Gender- 0: Male, 1: Female
8. “cars”-Number of cars owned/leased
9. “polcontrib”-Political contributions- 0: No, 1: Yes
10. “pets”- Number of pets owned
11. “cardspent”- Amount spent on primary card last month
12. “card2spent”- Amount spent on secondary card last month
Data Importing in R
We import the data into R under the name reg, then check the dimensions and the structure of the data.
reg<-read.csv("/Users/mansikhurana/Documents/Grey Nodes/R Markdown 2/Data.csv")
dim(reg)
## [1] 5000 132
str(reg)
dim() gives the number of rows and columns in the data. str() shows the structure of the data, that is, the type of each variable (numeric, integer, character, factor, etc.).
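Because only a handful of the 132 columns are used, it can help to subset right after import. The following is a minimal sketch under stated assumptions: `toy` is a small made-up data frame standing in for `reg` (the real file path is specific to the author's machine), and `unused_var` is a hypothetical column that is not analysed.

```r
# Toy stand-in for the imported data; the real `reg` has 5000 rows and 132 columns.
toy <- data.frame(
  jobsat     = c(1, 3, 5),
  lninc      = c(3.2, 4.1, 5.0),
  cardspent  = c(120.5, 80.0, 310.2),
  card2spent = c(40.0, 15.5, 99.9),
  unused_var = c("a", "b", "c")   # hypothetical column we do not analyse
)

# Names of the variables of interest (a subset of the list in section 2).
keep <- c("jobsat", "lninc", "cardspent", "card2spent")

# Keep only those columns; drop = FALSE guards against collapsing to a vector.
toy_small <- toy[, keep, drop = FALSE]

dim(toy_small)   # number of rows and columns
str(toy_small)   # type of each remaining column
```

Working on the smaller data frame keeps the output of str() and summary() short, which fits the page limit in the guidelines.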
3. Data Cleaning Process
First, let us calculate summary (descriptive) statistics for the data. summary() is a function available in R for this purpose. It is used as follows:
summary(reg)
We see that the summary function gives summary statistics such as the minimum, maximum, mean, and the 1st, 2nd (median) and 3rd quartiles for numeric variables. For variables stored as factors, it gives the count of each category. It also reports the number of missing observations (NA) in a variable whenever there are any. Here, we see that there are no missing observations.
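With 132 columns, the summary() output is long, so the absence of missing values can also be confirmed directly. This is a small sketch on a made-up data frame (`toy`, with NA values planted on purpose), not the author's code; the same calls could be applied to `reg`.

```r
# Toy data with deliberately planted missing values.
toy <- data.frame(
  lninc  = c(3.2, NA, 5.0, 4.4),
  cars   = c(1, 2, NA, NA),
  gender = c(0, 1, 1, 0)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column

# Total number of missing cells, and number of rows with no missing value.
total_na      <- sum(na_per_column)
complete_rows <- sum(complete.cases(toy))
```

If `total_na` is zero for the real data, no imputation or row deletion is needed.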
To check for outliers, we first create boxplots for the variables of interest.
boxplot(reg$jobcat)
boxplot(reg$lninc)
boxplot(reg$edcat)
boxplot(reg$card2items)
boxplot(reg$carditems)
boxplot(reg$spousedcat)
boxplot(reg$cars)
boxplot(reg$pets)
Observing the boxplots of all the above variables, we see that “lninc”, “card2items”, “carditems”, “cars” and “pets” contain outliers, as many of their values lie beyond the whiskers of the boxplot.
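The points drawn beyond the whiskers are the ones that boxplot.stats() returns in its `out` component. With the default `coef = 1.5`, these are the values lying more than 1.5 times the hinge spread outside the hinges. The sketch below reproduces that rule by hand on a made-up vector `x` (illustrative values, not taken from the data):

```r
# Made-up values with one obvious high value.
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 20)

# Outliers as flagged by boxplot.stats() (default coef = 1.5).
stats <- boxplot.stats(x)
stats$out

# The same rule by hand: hinges come from fivenum(), fences sit 1.5 * spread away.
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)
lower  <- hinges[1] - 1.5 * spread
upper  <- hinges[2] + 1.5 * spread
manual_out <- x[x < lower | x > upper]
manual_out
```

Note that the rule is two-sided, so a variable such as log-income can have flagged values at the low end as well as the high end.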
Treating Outliers
## Treating OUTLIERS for "lninc"
out1 <- boxplot.stats(reg$lninc)$out
out_lninc <- which(reg$lninc %in% c(out1))
out_lninc
## [1] 18 84 272 350 467 648 755 990 1103 1238 1771 1958 1964 2062 2080
## [16] 2193 2199 2278 2347 2490 2970 3069 3213 3624 4010 4272 4287 4792 4917 4950
# Capping outlier by percentile 99
reg$lninc[c(out_lninc)]<-quantile(reg$lninc,0.99)
## Treating OUTLIERS for "carditems"
out3 <- boxplot.stats(reg$carditems)$out
out_card <- which(reg$carditems %in% c(out3))
out_card
## [1] 7 20 114 152 160 186 313 316 325 374 406 518 519 623 748
## [16] 750 766 877 884 1091 1126 1216 1299 1417 1434 1573 1595 1658 1660 1663
## [31] 1708 1717 1761 1856 1929 1985 2032 2112 2475 2547 2658 2711 2713 2800 2801
## [46] 2879 2971 3016 3108 3132 3298 3305 3428 3463 3550 3564 3593 3655 3675 3696
## [61] 3703 3751 3982 4026 4100 4131 4172 4263 4308 4312 4332 4411 4426 4498 4522
## [76] 4561 4665 4715 4720 4793 4891 4933
# Capping outlier by percentile 99
reg$carditems[c(out_card)]<-quantile(reg$carditems,0.99)
## Treating OUTLIERS for "card2items"
out2 <- boxplot.stats(reg$card2items)$out
out_card2 <- which(reg$card2items %in% c(out2))
out_card2
## [1] 58 170 174 227 242 266 334 544 729 748 885 906 1155 1176 1186
## [16] 1213 1340 1398 1423 1512 1517 1632 1713 2007 2063 2069 2096 2160 2402 2562
## [31] 2581 2681 2719 2769 2771 2862 2910 2980 3014 3117 3129 3167 3229 3289 3314
## [46] 3362 3435 3463 3526 3541 3638 3666 3695 3758 3879 3974 4016 4045 4083 4291
## [61] 4330 4429 4558 4599 4602 4625 4645 4659 4709 4776 4953
# Capping outlier by percentile 99
reg$card2items[c(out_card2)]<-quantile(reg$card2items,0.99)
## Treating OUTLIERS for "cars"
out4 <- boxplot.stats(reg$cars)$out
out_car <- which(reg$cars %in% c(out4))
out_car
## [1] 82 351 1334 1349 1379 1381 1744 2231 2301 3245 4646 4727 4912 4954
# Capping outlier by percentile 99
reg$cars[c(out_car)]<-quantile(reg$cars,0.99)
## Treating OUTLIERS for "pets"
out5 <- boxplot.stats(reg$pets)$out
out_pet <- which(reg$pets %in% c(out5))
out_pet
## [1] 39 231 336 407 614 927 954 962 1018 1024 1098 1219 1277 1282 1434
## [16] 1511 1680 1754 1864 1886 1916 2007 2112 2125 2155 2192 2425 2475 2478 2511
## [31] 2525 2540 2682 2996 3021 3270 3363 3413 3620 3644 3686 3733 3743 3791 3875
## [46] 3947 3990 4107 4184 4203 4320 4362 4414 4422 4496 4584 4609 4683 4693 4754
## [61] 4814 4830 4873
# Capping outlier by percentile 99
reg$pets[c(out_pet)]<-quantile(reg$pets,0.99)
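The five blocks above repeat the same steps, and each one assigns the 99th percentile to every flagged index, which includes any unusually low values and any flagged values that already sit below the 99th percentile. A possible alternative, shown here as a sketch and not as the author's method, is plain upper-tail capping: values above the 99th percentile are pulled down to it and nothing is ever raised. The helper name `cap_upper` and the data frame `toy` are hypothetical.

```r
# Hypothetical helper: cap values above the p-th quantile at that quantile.
# Unlike the code above, it does not use boxplot.stats() and never raises a value.
cap_upper <- function(x, p = 0.99) {
  cap <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}

# Made-up data: mostly small counts plus one extreme value per column.
toy <- data.frame(
  cars = c(rep(1:3, 33), 40),
  pets = c(rep(0:2, 33), 60)
)

# Apply the same treatment to several columns in one loop.
for (v in c("cars", "pets")) {
  toy[[v]] <- cap_upper(toy[[v]])
}

summary(toy)
```

Whether to cap boxplot-flagged points or a fixed upper tail is a judgment call; either way the choice should be stated in the cleaning section of the report.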
# Creating factors for categorical variables #
reg$jobsat<-factor(reg$jobsat)
levels(reg$jobsat)<-c("1","2","3","4","5")
reg$spousedcat<-factor(reg$spousedcat)
levels(reg$spousedcat)<-c("-1","1","2","3","4","5")
# Creating dependent variable #
reg$tot_spend<-reg$cardspent+reg$card2spent
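When converting coded variables to factors, descriptive labels can make later tables and regression output easier to read than the codes "1" to "5". The sketch below is one way to do it, using shortened versions of the “edcat” descriptions from section 2 on a made-up vector of codes (`edcat_codes` is illustrative, not taken from the data):

```r
# Made-up education codes in the 1-5 coding described in section 2.
edcat_codes <- c(1, 4, 2, 5, 3, 2)

# Declaring levels explicitly keeps all five categories, even ones absent from the data.
edcat <- factor(
  edcat_codes,
  levels = 1:5,
  labels = c("No high school", "High school", "Some college",
             "College degree", "Post-undergraduate")
)

table(edcat)
```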
4. Plots/Graphics
# Histogram of reg$tot_spend
hist(reg$tot_spend)
[Figure: Histogram of reg$tot_spend; x-axis: reg$tot_spend, y-axis: Frequency]
# Histogram of log(reg$tot_spend)
hist(log(reg$tot_spend))
[Figure: Histogram of log(reg$tot_spend); x-axis: log(reg$tot_spend), y-axis: Frequency]
# Storing log(reg$tot_spend) as a new variable, 'ln_totalspend', to use in place of 'reg$tot_spend'
reg$ln_totalspend<-log(reg$tot_spend)
We are going to build a regression model with ‘total spend’ as the dependent variable. Linear regression assumes normally distributed errors, which is commonly checked through the distribution of the dependent variable or of the residuals. To check this assumption, we create...
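One common way to compare the raw and log-transformed variable is a pair of normal quantile plots together with a skewness measure. The sketch below is an assumption about how such a check could look, not necessarily what the original report did. It uses simulated right-skewed data (`spend`, drawn with rlnorm() and an arbitrary seed) in place of reg$tot_spend, and `skewness` is a small helper defined here, not a base R function.

```r
set.seed(230)  # arbitrary seed, only for reproducibility

# Simulated right-skewed stand-in for total spend.
spend <- rlnorm(500, meanlog = 6, sdlog = 0.8)

# Moment-based sample skewness: near 0 for a symmetric distribution.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(spend)
skew_log <- skewness(log(spend))

# Normal quantile plots: points close to the line suggest approximate normality.
par(mfrow = c(1, 2))
qqnorm(spend, main = "Raw spend")
qqline(spend)
qqnorm(log(spend), main = "Log spend")
qqline(log(spend))
```

A skewness much closer to zero after the log transform, together with a straighter quantile plot, would support modelling the log of total spend.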