Document format in .Rmd output: html_document.
Description: This project requires you to understand which mode of transport employees prefer for commuting to their office. The dataset "Cars-dataset" includes employee information about their mode of transport as well as personal and professional details such as age, salary and work experience. We need to predict whether or not an employee will use a Car as a mode of transport, and to identify which variables are significant predictors of this decision.


Following is expected out of the candidate in this assessment.
1. EDA (15 Marks)
1.1. Perform an EDA on the data - (7 marks)

1.2. Illustrate the insights based on EDA (5 marks)

1.3. What is the most challenging aspect of this problem? What method will you use to deal with this? Comment (3 marks)
2. Data Preparation (10 marks)
2.1. Prepare the data for analysis
3. Modeling (30 Marks)
3.1. Create multiple models and explore how each model performs using appropriate model performance metrics (15 marks)
3.1.1. KNN
3.1.2. Naive Bayes (is it applicable here? comment and if it is not applicable, how can you build an NB model in this case?)
3.1.3. Logistic Regression

3.2. Apply both bagging and boosting modeling procedures to create 2 models and compare their accuracy with the best model of the above step. (15 marks)
4. Actionable Insights & Recommendations (5 Marks)
4.1. Summarize your findings from the exercise in a concise yet actionable note.

Abr Writing answered on May 05 2021
analysis.html
Data Analysis
Cars - Mode of Transport
05/05/2020
In this project we will examine which mode of transport employees prefer for commuting to their office. The dataset “Cars-dataset” includes employee information about their mode of transport as well as personal and professional details such as age, salary and work experience.
We will first identify which variables are significant predictors of the mode of transport, and then build models to predict whether or not an employee will use a Car.
1. EDA
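Loading the required packages and reading the data into the R workspace
library(dplyr)     # %>%, mutate(), rename(), group_by()
library(ggplot2)   # histograms and bar charts
library(ggpubr)    # ggarrange()
library(reshape2)  # melt()
library(caret)     # train(), trainControl(), confusionMatrix()
# stringsAsFactors = TRUE keeps Gender and Transport as factors on R >= 4.0,
# matching the factor output shown further down
cars <- read.csv("Cars-dataset.csv", stringsAsFactors = TRUE)
Correcting the variable types
cars <- cars %>%
  mutate(
    Engineer = as.factor(Engineer),
    MBA = as.factor(MBA),
    license = as.factor(license)
  )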
1.1. Implementation
Performing an Exploratory Data Analysis (EDA) on the cars data
Displaying the names of all the variables of the dataset.
colnames(cars)
[1] "Age" "Gender" "Engineer" "MBA" "Work.Exp" "Salary"
[7] "Distance" "license" "Transport"
Checking the unique values in the Transport variable.
unique(cars$Transport)
[1] 2Wheeler Car Public Transport
Levels: 2Wheeler Car Public Transport
From the output above, we can see that the dataset contains three modes of transport. Since the aim of this analysis is to predict whether or not a person will use a Car, we convert the variable into a binary variable where 1 means that the person uses a Car as their mode of transportation and 0 means any other mode.
cars <- cars %>%
mutate(
Transport = as.factor( ifelse( Transport == "Car", 1 ,0 ) )
)
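Because this recoded response drives every model that follows, it may be worth checking how many observations fall into each class straight after the conversion. The sketch below applies the same recoding to a small hypothetical vector (`transport_raw` is invented for illustration and is not taken from Cars-dataset.csv) and tabulates the class shares with base R.

```r
# Hypothetical transport modes; illustrative values only
transport_raw <- c("Car", "2Wheeler", "Public Transport", "Public Transport",
                   "Car", "Public Transport", "2Wheeler", "Public Transport")

# Same recoding as above: 1 = Car, 0 = any other mode
transport_bin <- factor(ifelse(transport_raw == "Car", 1, 0), levels = c(0, 1))

# Counts and percentage share of each class
class_counts <- table(transport_bin)
class_share  <- round(100 * prop.table(class_counts), 1)
print(class_counts)
print(class_share)
```

On the real data, the same two calls on `cars$Transport` would show how rare the Car class is, which helps when reading accuracy figures later on.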
Plotting the distribution of Age, Work.Exp, Salary and Distance based on mode of transportation.
p1 <- cars %>%
ggplot(aes(x=Age, fill=Transport)) +
geom_histogram(
color="#e9ecef",
alpha=0.6,
position = 'identity',
bins=10) +
scale_fill_manual(values=c("#69b3a2",
"#404080")) +
labs(
fill="",
x = "Age",
y = "Frequency"
)
p2 <- cars %>%
ggplot(aes(x=Work.Exp, fill=Transport)) +
geom_histogram(
color="#e9ecef",
alpha=0.6,
position = 'identity',
bins=10) +
scale_fill_manual(values=c("#69b3a2",
"#404080")) +
labs(
fill="",
x = "Work.Exp",
y = "Frequency"
)
p3 <- cars %>%
ggplot(aes(x=Salary, fill=Transport)) +
geom_histogram(
color="#e9ecef",
alpha=0.6,
position = 'identity',
bins=10) +
scale_fill_manual(values=c("#69b3a2",
"#404080")) +
labs(
fill="",
x = "Salary",
y = "Frequency"
)
p4 <- cars %>%
ggplot(aes(x=Distance, fill=Transport)) +
geom_histogram(
color="#e9ecef",
alpha=0.6,
position = 'identity',
bins=10) +
scale_fill_manual(values=c("#69b3a2",
"#404080")) +
labs(
fill="",
x = "Distance",
y = "Frequency"
)
ggarrange(p1, p2, p3, p4,
labels = c("Age", "Work Experience", "Salary", "Distance"),
ncol = 2, nrow = 2)
Now, plotting the distribution of the mode of transport against Gender, Engineer, MBA and license.
p1 <- melt(table(cars$Gender, cars$Transport)) %>%
rename(Gender = Var1, Transport = Var2) %>%
mutate(Transport = as.factor(Transport)) %>%
group_by(Gender) %>%
mutate(value = value/sum(value)*100) %>%
ggplot(aes(Gender, value)) +
geom_bar(aes(fill = Transport), position = "dodge", stat="identity") +
labs(y="Percentage")
p2 <- melt(table(cars$Engineer, cars$Transport)) %>%
rename(Engineer = Var1, Transport = Var2) %>%
mutate(
Transport = as.factor(Transport),
Engineer = as.factor(Engineer)
) %>%
group_by(Engineer) %>%
mutate(value = value/sum(value)*100) %>%
ggplot(aes(Engineer, value)) +
geom_bar(aes(fill = Transport), position = "dodge", stat="identity") +
labs(y="Percentage")
p3 <- melt(table(cars$MBA, cars$Transport)) %>%
rename(MBA = Var1, Transport = Var2) %>%
mutate(
Transport = as.factor(Transport),
MBA = as.factor(MBA)
) %>%
group_by(MBA) %>%
mutate(value = value/sum(value)*100) %>%
ggplot(aes(MBA, value)) +
geom_bar(aes(fill = Transport), position = "dodge", stat="identity") +
labs(y="Percentage")
p4 <- melt(table(cars$license, cars$Transport)) %>%
rename(license = Var1, Transport = Var2) %>%
mutate(
Transport = as.factor(Transport),
license = as.factor(license)
) %>%
group_by(license) %>%
mutate(value = value/sum(value)*100) %>%
ggplot(aes(license, value)) +
geom_bar(aes(fill = Transport), position = "dodge", stat="identity") +
labs(y="Percentage")
ggarrange(p1, p2, p3, p4,
labels = c("Gender", "Engineer?", "MBA?", "license?"),
ncol = 2, nrow = 2)
1.2. Insights
Based on the EDA, the insights obtained on the dataset for the analysis are:
        From the distributions of Age, Work.Exp, Salary and Distance by mode of transportation, we found that the higher the values of these variables, the more likely an employee is to use a Car. For very high values of these variables, every employee in the dataset used a Car.
        Gender has only a small effect on car usage, with males using cars slightly more often than females. Whether an employee is an engineer or holds an MBA appears to have no effect on car usage. The license variable shows a larger difference: employees who hold a license are more likely to use a Car.
1.3. Challenge
The most challenging aspect of this problem was interpreting the bar charts above. When they were originally plotted with frequencies, it appeared, for example, that engineers do not use a Car as their mode of transportation, but this was only because there are more engineers in the dataset. We therefore changed the Y-axis from frequency (count) to percentage. In the Engineer? panel, it now shows what percentage of all engineers in the dataset use a Car, and likewise for non-engineers and for the other categorical characteristics of an employee.
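The count-versus-percentage effect described above can be reproduced on a toy table. The counts below are invented for illustration and are not the real figures; `prop.table(..., margin = 1)` is base R and normalises each row so that groups of different sizes become comparable.

```r
# Hypothetical counts: rows = Engineer (0/1), columns = Transport (0/1)
counts <- matrix(c(20,  5,
                   60, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Engineer  = c("0", "1"),
                                 Transport = c("0", "1")))

# Raw counts: many more engineers than non-engineers in the "no car" column
print(counts)

# Row-wise percentages: share of car users within each group
row_pct <- 100 * prop.table(counts, margin = 1)
print(row_pct)
```

In this made-up table the raw counts differ by a factor of three between the groups, yet the within-group share of car users is identical, which is the pattern the percentage axis is meant to expose.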
We will divide the data into training and testing datasets so that over-fitting can be detected: each model is fitted on the training dataset only, and its performance is evaluated on its predictions for the testing dataset.
2. Data Preparation
2.1. Data Preparation
Preparing the data for analysis by dividing the dataset into training (80%) and testing (20%) datasets.
set.seed(123)
train.ind <- sample(seq_len(nrow(cars)),
size = floor(0.80 * nrow(cars)))
train <- cars[train.ind, ]
test <- cars[-train.ind, ]
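A plain random split can leave very few car users in the test set when that class is rare. One possible alternative, sketched here in base R on hypothetical data, draws 80% of the rows within each class so that both partitions keep roughly the same class share. The data frame `toy` and the helper `stratified_index` are illustrative names and are not part of the original analysis.

```r
set.seed(123)

# Hypothetical data: 100 rows, 10 of them car users (Transport = 1)
toy <- data.frame(
  id        = 1:100,
  Transport = factor(rep(c(0, 1), times = c(90, 10)))
)

# Draw a proportion `prop` of the row indices separately within each class
stratified_index <- function(y, prop = 0.80) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # index via sample.int() so a class with a single row is handled safely
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(toy$Transport)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class shares in the two partitions
print(prop.table(table(toy_train$Transport)))
print(prop.table(table(toy_test$Transport)))
```

The caret package, which is used for the models below, provides `createDataPartition()` for the same purpose; the hand-written helper is shown only to make the idea explicit.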
3. Modeling
3.1. Model Fitting
Creating multiple models and exploring how each model performs using different model performance metrics.
3.1.1. KNN
set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3)
knn.fit <- train(Transport ~ Age + Work.Exp + Salary + Distance + license,
data = train,
method = "knn",
trControl = ctrl,
preProcess = c("center","scale"),
tuneLength = 20)
plot(knn.fit)
Prediction
pred <- predict(knn.fit,
newdata = test)
confusionMatrix(pred, test$Transport)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 77  2
         1  0  5

               Accuracy : 0.9762
                 95% CI : (0.9166, 0.9971)
    No Information Rate : 0.9167
    P-Value [Acc > NIR] : 0.02507

                  Kappa : 0.8209

 Mcnemar's Test P-Value : 0.47950
...
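To make the reported statistics easier to interpret, the sketch below recomputes them by hand from the four cell counts in the confusion matrix above. It uses only base R; the variable names (`cm`, `recall_car`, `precision_car`) are chosen here for illustration. Note that `confusionMatrix()` treats the first factor level ("0") as the positive class unless its `positive` argument is set, so the recall for the Car class is computed explicitly.

```r
# Cell counts copied from the KNN confusion matrix above
# rows = predicted class, columns = reference (true) class
cm <- matrix(c(77, 2,
                0, 5),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                     # (77 + 5) / 84

# Treating "1" (Car) as the class of interest
recall_car    <- cm["1", "1"] / sum(cm[, "1"])    # car users correctly found
precision_car <- cm["1", "1"] / sum(cm["1", ])    # car predictions that are right

# No Information Rate: share of the largest reference class
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what chance alone would give
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, nir = nir, kappa = kappa,
              recall_car = recall_car, precision_car = precision_car), 4))
```

The accuracy, No Information Rate and kappa obtained this way agree with the values printed by `confusionMatrix()`. Because about 92% of the test employees do not use a Car, accuracy alone is a weak yardstick here; the recall for the Car class (5 of the 7 car users in the test set) is the more informative number.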