Answer To: Assignment-1 MIS771 Descriptive Analytics and Visualisations Page 1 of 10 MIS771 Descriptive...
Biswajit answered on Sep 09 2021
Assignment Solutions
Introduction-TassPaperMill (TPM), a subsidiary of Pinnon Paper Industries (PPI), is an Australian company with a long history of manufacturing paper rolls. To address numerous concerns raised in their recent employee survey,TPM is currently reviewing how they calculate salary increments for their employees.We will extract a random sample of 1470 employee records from their HR database.The ultimate goal is to adopt a more holistic rewarding system factoring the key relations between remuneration indicators and demographic characteristics, employment history and various other potential contributors to boost performance. In addition, human resource manager at TPM reported in her recent presentation to the company executive management team that the staff turnover rate at TPM is higher compared to their competitors.Thus,TMP wants to identify key contributing factors before they lose more talented, motivated and focused employees who contribute to the organisation's overall success.
Main Body-
Variable Name
Description of variables
Age
Measured to the nearest year
Attrition
Employee leaving the company (0=No, 1=Yes)
Education
1 Below College; 2 College; 3 Bachelor; 4 Master
EnvironmentSatisfaction
SATISFACTION WITH THE ENVIROMENT (working conditions; development opportunities; company culture; management styles; hierarchies and human resources policies etc)-
Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and 4 indicating very high
Gender
(1=FEMALE, 2=MALE)
JobInvolvement
Degree to which a job is central to a person's identity - Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and 4 indicating very high
Job level
Responsibility level and expectations of roles - 1 'Low' 2 'Medium'3 'High' 4 'Very High
JobSatisfaction
SATISFACTION WITH THE JOB - Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and 4 indicating very high
MaritalStatus
(1=DIVORCED, 2=MARRIED, 3=SINGLE)
NumCompaniesWorked
NO. OF COMPANIES WORKED AT
OverTime
(1=NO, 2=YES)
PercentSalaryHike
PERCENTAGE INCREASE IN SALARY
PerformanceRating
PERFORMANCE RATING - Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and 4 indicating very high
WorkLifeBalance
TIME SPENT BEWTWEEN WORK AND OUTSIDE 1 'Bad' 2 'Good' 3 'Better' 4 'Best'
YearsAtCompany
TOTAL NUMBER OF YEARS AT THE COMPNAY
YearsInCurrentRole
YEARS IN CURRENT ROLE
YearsSinceLastPromotion
YEARS SINCE LAST PROMOTION
Job satisfaction rating of 11 was replaced with imputed value of 2 considering its gender & other variables score similar to this row.
Q1.Description variables :
There are in total 17 variables in the dataset.Out of these 17 variables,6 are categorical variables & rest 11 are continuous/discrete variables.
We analysed the two variables Attrition & PercentSalaryHike.Below are the details.
PercentSalaryHike
Mean
15.21
Standard Error
0.10
Median
14.00
Mode
11.00
Standard Deviation
3.66
Sample Variance
13.40
Kurtosis
-0.30
Skewness
0.82
Range
14.00
Minimum
11.00
Maximum
25.00
Sum
22358.00
Count
1470.00
Table 1
Assuming that the sample was representative of the entire employee population,the average salary hike for all employees was 15.21 & standard deviation was 3.66.The data is right skewed with skewness of 0.82 & also negative kurtosis.It does not follow the standard Bell curve used in HR.While the maximum salary increase is 25%,the minimum is 11%.
As evident from the histogram,it is a long tail distribution.Around 55% of employees have salary hike between 11 -14.3 %. Around 27 % of employees salary hike is within 11-12.1 %.
Above Box plot comapres the percent salary hike for people who left & who stayed back.Better would be a hypothesis test to check the statistical significance of difference.
Row Labels
Average of PercentSalaryHike
No
15.23
Yes
15.10
Grand Total
15.21
Row Labels
StdDev of PercentSalaryHike
No
3.64
Yes
3.77
Grand Total
3.66
Row Labels
Max of PercentSalaryHike
No
25.00
Yes
25.00
Grand Total
25.00
Row Labels
Min of PercentSalaryHike
No
11.00
Yes
11.00
Grand Total
11.00
Row Labels
Count of Employees
No
1233
Yes
237
Grand Total
1470
Table 2
In Table 2,Yes stands for employees who left the organisation & No stands for employees who stayed back.
Conclusion :From 1470 employees,237 employees left which is around 16.12%.From table 2,we see that Minimum salary hike for both categories of employees i.e who left & who stayed back are 11 percent.Similarly,maximum salary hike for both categories of employees are also 25%.Average salary hike of employees who left is 15.10 % while those who stayed back are 15.23%.The standard deviation of salary hike for employees who left is 3.77 while who stayed back are 3.64.The coefficient of variation of salary hike for employees who left is 24.97% in comparison to 23.89% for employees who stayed back.So basically,there is higher variability of salary hike for employees who left.
Q2.1
Introduction : In this case,we have 17 variables out of which one PercentSalaryHike is dependent variable & rest 16 are independent variables.Out of the 16 independent variables,6 are categorical variables & rest 10 are continuous variables.Before building the regression model,we need to do feature/variable selection by using correlation matrix where we will check correlation coefficient between independent.We will pick only the variables which have significant correlation with dependent variables.This will help us identify the continuous variables which are significant.For categorical variables,we will be directly take those in the model & if p value is less than the assumed 0.05,we will accept or else drop those.
Description :
Correlation Matrix :
Age
Education
EnvironmentSatisfaction
JobInvolvement
JobLevel
JobSatisfaction
NumCompaniesWorked
PercentSalaryHike
PerformanceRating
WorkLifeBalance
YearsAtCompany
YearsInCurrentRole
YearsSinceLastPromotion
Age
1
Education
0.04752484
1
EnvironmentSatisfaction
0.01014643
-0.00754
1
JobInvolvement
-0.01183258
0.002385
-0.0095
1
JobLevel
0.50960423
0.031794
0.001212
-0.056062233
1
JobSatisfaction
0.01479432
-0.05039
-0.03417
0.398666134
-0.018348
1
NumCompaniesWorked
0.07866392
0.052225
0.013452
-0.351162338
0.036036
-0.388621527
1
PercentSalaryHike
0.00363358
-0.04893
-0.0317
0.550392895
-0.03473
0.689970561
-0.510569943
1
PerformanceRating
-0.00686535
-0.04805
-0.03536
0.335945013
-0.010407
0.431331828
-0.245052431
0.764765885
1
WorkLifeBalance
-0.02149003
-0.02017
0.027627
-0.019344333
0.037818
0.023805154
0.014317645
-0.003279636
0.002898784
1
YearsAtCompany
0.31130877
-0.00607
0.001458
-0.068457051
0.534739
-0.028841981
-0.001827975
-0.035991262
0.003300928
0.012089185
1
YearsInCurrentRole
0.21290106
0.048399
0.018007
-0.030374295
0.389447
0.000304992
-0.002212759
-0.001520027
0.033066195
0.049856498
0.758753737
1
YearsSinceLastPromotion
0.21651337
0.008263
0.016194
-0.065300624
0.353885
-0.006542608
0.009712666
-0.022154313
0.01199545
0.008941249
0.618408865
0.548056248
1
Q2.2
First we create multiple linear regression with all variables
SUMMARY OUTPUT
Regression Statistics
Multiple R
0.909365
R Square
0.826945
Adjusted R Square
0.825039
Standard Error
1.530891
Observations
1470
ANOVA
df
SS
MS
F
Significance F
Regression
16
16272.18
1017.011
433.9475618
0
Residual
1453
3405.29
2.343627
Total
1469
19677.47
Coefficients
Standard Error
t Stat
P-value
Lower 95%
Upper 95%
Lower 95.0%
Upper 95.0%
Intercept
-2.45267
0.435049
-5.63769
2.0675E-08
-3.306066038
-1.59928
-3.30607
-1.59928
Age
0.012813
0.0052
2.464003
0.013854579
0.002612631
0.023014
0.002613
0.023014
Education2
-0.01757
0.113468
-0.15486
0.876952406
-0.24015004
0.205007
-0.24015
0.205007
Education3
0.040894
0.110801
0.369079
0.712122274
-0.176453037
0.258242
-0.17645
0.258242
Education4
-0.18423
0.300032
-0.61403
0.539293303
-0.772769475
0.404314
-0.77277
0.404314
Gender
-0.8442
0.092629
-9.11376
2.56733E-19
-1.02589697
-0.6625
-1.0259
-0.6625
JobInvolvement
0.633671
0.049803
12.72344
3.07961E-35
0.535976447
0.731365
0.535976
0.731365
Joblevel2
-0.09591
0.096155
-0.9975
0.318686742
-0.28453263
0.092703
-0.28453
0.092703
Joblevel3
-0.08817
0.128929
-0.68388
0.494158156
-0.341078021
0.164734
-0.34108
0.164734
Joblevel4
-0.15145
0.180404
-0.83953
0.401310951
-0.505333689
0.202426
-0.50533
0.202426
Joblevel5
-0.41499
0.211314
-1.96385
0.049737758
-0.829503976
-0.00048
-0.8295
-0.00048
JobSatisfaction
0.934493
0.042254
22.1161
1.18132E-93
0.85160792
1.017379
0.851608
1.017379
Married
-0.07691
0.093049
-0.82659
0.408605586
-0.25943844
0.105612
-0.25944
0.105612
Divorced
-0.16533
0.110929
-1.49041
0.136334201
-0.382927229
0.052269
-0.38293
0.052269
NumCompaniesWorked
-0.27864
0.018575
-15.001
2.11031E-47
-0.315076903
-0.2422
-0.31508
-0.2422
OverTime
-0.0642
0.089028
-0.72111
0.47096053
-0.238834902
0.110438
-0.23883
0.110438
PerformanceRating
4.799953
0.12132
39.56434
6.2769E-233
4.561971183
5.037934
4.561971
5.037934
From the above table,we see that adjusted R square is around 82.5% & standard error is 1.53 & model is statistically significant as shown in lower p value <0.05 in Anova test.Those are variables which are significant are marked green.
But we will build a more refined model by removing the insignificant variables where p value >0.05.Below model is the refined model with all significant variables.Here in the final model we find Female gender,Job involvement,Job satisfaction,Num companies worked,performance rating are significant.
SUMMARY OUTPUT
Regression Statistics
Multiple R
0.908644
R Square
0.825634
Adjusted R Square
0.825038
Standard Error
1.530895
Observations
1470
ANOVA
df
SS
MS
F
Significance F
Regression
5
16246.38
3249.276
1386.424
0
Residual
1464
3431.087
2.343639
Total
1469
19677.47
Coefficients
Standard Error
t Stat
P-value
Lower 95%
Upper 95%
Lower 95.0%
Upper 95.0%
Intercept
-2.17115
0.392657
-5.52937
3.8E-08
-2.94138
-1.40092
-2.94138
-1.40092
Female
-0.84082
0.092404
-9.0994
2.89E-19
-1.02207
-0.65956
-1.02207
-0.65956
JobInvolvement
0.6423
0.049547
12.96332
1.84E-36
0.545108
0.739491
0.545108
0.739491
JobSatisfaction
0.939155
0.042092
22.31182
3.68E-95
0.856587
1.021722
0.856587
1.021722
NumCompaniesWorked
-0.27491
0.018407
-14.9352
4.71E-47
-0.31102
-0.2388
-0.31102
-0.2388
PerformanceRating
4.795484
0.121134
39.58816
1.2E-233
4.557869
5.033099
4.557869
5.033099
Q2.3
Residual analysis shows that the dependent variable percentsalary hike is not exactly normally distributed & the residuals of independent variables Job involvement,performance rating & job satisfaction are not homoscedastic.The details of the residual analysis graphs are given in the Excel sheet.
Conclusion :As seen from above correlation matrix,we can see the variables such as Performance rating,Job involvement,Job satisfaction,NumCompanies worked are significantly correlated with Percentsalary hike as correlation coefficient is higher.Also,the individual scatter plots between the independent & dependent variables shows that the relationship is linear.Also,the correlation among independent variables seem to be low hence eliminating the possibility of multicollinearity.The significant variables are marked green.As most independent variables have less than 0.5 correlation coefficient among themselves,there is less chance of multicolleanirity.But still we can check for multicolleanirity if any by finding the VIF (variance inflation factor) in regression.Eventhough the model violates few of the assumptions of multiple linear regression,still we can accept it for descriptive purpose.
As per the final model,the famales have on an average 0.8402 % less PercentsalaryHike than that of male. 1 unit increase in job involvement results in 0.6423 % increase in PercentSalaryHike.1 unit increase in job satisfaction results in 0.9391 % increase in PercentSalaryhike on average.Every additional no of companies worked results in -.2741 % decrease in PercentSalaryHike.1 unit increase in performance rating results in 4.79% increase in PercentSalaryHike.
The model is statistically significant as P value in ANOVA test less than 0.05 & adjusted R square is 82.5%
Q3.
SUMMARY OUTPUT
Regression Statistics
Multiple R
0.91014884
R Square
0.828370911
Adjusted R Square
0.827667032
Standard Error
1.519350169
Observations
1470
ANOVA
df
SS
MS
F
Significance F
Regression
6
16300.24098
2716.707
1176.866
0
Residual
1463
3377.225683
2.308425
Total
1469
19677.46667
Coefficients
Standard Error
t Stat
P-value
Lower 95%
Upper 95%
Lower 95.0%
Upper 95.0%
Intercept
3.627857474
1.262195506
2.874244
0.004109
1.151951411
6.103763536
1.151951411
6.103763536
Female
-0.825286881
0.091763043
-8.99367
7.22E-19
-1.005288056
-0.645285707
-1.005288056
-0.645285707
JobInvolvement
0.633468834
0.049207801
12.87334
5.28E-36
0.53694346
0.729994208
0.53694346
0.729994208
JobSatisfaction
-0.826316075
0.367873788
-2.24619
0.02484
-1.547932447
-0.104699703
-1.547932447
-0.104699703
NumCompaniesWorked
-0.276258441
0.018270092
-15.1208
4.18E-48
-0.312096813
-0.240420069
-0.312096813
-0.240420069
PerformanceRating
2.877047033
0.414958503
6.933337
6.15E-12
2.063069905
3.691024161
2.063069905
3.691024161
Performance rating xJob satisfaction
0.57910428
0.119888274
4.830366
1.51E-06
0.343933023
0.814275538
0.343933023
0.814275538
The final model with interaction term between performance rating & job satisfaction is
PercentSalaryHike = 3.62 -0.82 Female +0.63 JobInvolvement -0.82 Jobsatisfaction-0.27 NumCompaniesworked +2.87 performancerating+0.57 performancerating x job satisfaction.
As the p value associated with interaction term is less than 0.05,so this term is statistically significant.So increase in performance rating affects increase in percent salary in different way depending upon job satisfaction.
Conclusion :
So final model shows that Female gender,Job involvement,performance rating,job satisfaction,number of companies worked & interaction between performance rating & job satisfaction are important for salary hike.
Females have on average 0.82 % salary hike in comparison to male.1 unit increase in job involvement increases % salary hike by 0.63%.1 unit increase in job satisfaction results in (0.57 performance rating -0.82) % salary hike.so basically job satisfaction results in higher % salary hike in the presence of higher performance rating.similarly higher performance rating results in higher % salary hike in presence of high job satisfaction.
Q4.The employee attrition problem :
We did a logistic regression for the dependent variable Attrition (1=yes,0=No) with all other variables as independent variables.Among these Education,Gender,Job level,Marital status,overtime are categorical variables for which dummy codes were used.Education has 4 levels so 3 dummy codes were created with Education1 being reference level.Gender 1 was female with male being reference level.similarly,job level 1 was reference category with others being dummy coded.marrital status is dummy coded with single being reference & so also overtime where “No” is reference category.
Variable name
Coeff value
odds
intercept
1.706330944
Age
-0.030831391
0.970
Education2
0.096060844
1.101
Education3
0.307956911
1.361
Education4
0.227199341
1.255
EnvironmentSatisfaction
-0.367395095
0.693
Female
-0.319340636
0.727
JobInvolvement
-0.175083026
0.839
Joblevel2
-0.961435136
0.382
Joblevel3
-0.319954909
0.726
Joblevel4
-1.682080948
0.186
Joblevel5
-1.08606482
0.338
JobSatisfaction
-0.1699398
0.844
Married
-0.857916847
0.424
Divorced
-1.194120663
0.303
NumCompaniesWorked
0.045318512
1.046
OverTime
1.631282149
5.110
PercentSalaryHike
0.036695796
1.037
PerformanceRating
0.049571141
1.051
WorkLifeBalance
-0.227926049
0.796
YearsAtCompany
-0.00855152
0.991
YearsInCurrentRole
-0.153279842
0.858
YearsSinceLastPromotion
0.142827598
1.154
In the table above,we have attached coefficient value with odds of each.where the odds are greater than 1,it means higher value of the variable results in greater probability of exit & where odds are lower,the probability of exit is lower.
Different cases :
1. Medium satisfaction with job & work environment,5 years since last promotion,the person has higher probability of leaving as odds is higher (around 2.17 if we plug in the values of medium as 3 & 5 for years since last promotion)
2. Higher number of years in current role & overtime results in higher odds of exit
3. A 45 years old married employee with a very-high level job classification and maintaining a good work-life balance has lower odds of leaving.
The final equation is log(p/1-p) = 1.7 +0.3 Age +0.09 Education2 +0.307 Education3+0.227 Education4 +0.367 Environmentsatisfaction +.319 female + 0.175 Job involvement +0.961 Job level2 +0.313 Job level3+ 1.68 Job level4 -1.08 job level5 -.169 Job satisfaction+.857 married +1.19 divorce +0.045 Numcompaniesworked +1.63 Overtime +0.036 Percent salary hike +0.049 performance rating+.227 worklife balance-.008 yearsAtCompany+0.153YearsInCurrentRole+0.142 YearsSinceLastPromotion where p is the probability of attrition.
If we plug the individual values of an employee in the above equation,it will...