Assignment-1 MIS771 Descriptive Analytics and Visualisations Page 1 of 10 MIS771 Descriptive Analytics and Visualisation DEPARTMENT OF INFORMATION SYSTEMS AND BUSINESS ANALYTICS DEAKIN BUSINESS SCHOOL...

1 answer below »

View more »
Answered Same DaySep 05, 2021MIS771Deakin University

Answer To: Assignment-1 MIS771 Descriptive Analytics and Visualisations Page 1 of 10 MIS771 Descriptive...

Biswajit answered on Sep 09 2021
146 Votes
Assignment Solutions
Introduction-TassPaperMill (TPM), a subsidiary of Pinnon Paper Industries (PPI), is an Australian company with a long history of manufacturing paper rolls. To address numerous concerns raised in their recent employee survey,TPM is currently reviewing how they calculate salary increments for their employees.We will extract a random sample of 1470 employee records from their HR database.The ultimate goal is to adopt a more holistic rewarding system factoring the key relations between remuneration indicators and demographic characteristics, employment history and various other potential contributors to boost performance. In addition, human resource manager at TPM reported in her recent presentation to the company executive management team that the staff turnover rate at TPM is higher compared to their competitors.Thus,TMP wants to identify key contributing factors before they lose more talented, motivated and focused employees who contribute to the organisation's overall success.
Main Body-
    Variable Name
    Description of variables
    Age
    Measured to the nearest year
    Attrition
    Employee leaving the company (0=No, 1=Yes)
    Education
    1 Below College; 2 College; 3 Bachelor; 4 Master
    EnvironmentSatisfaction
    SATISFACTION WITH THE ENVIROMENT (working conditions; development opportunities; company culture; management styles; hierarchies and human resources policies etc)-
Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and 4 indicating very high
    Gender
    (1=FEMALE, 2=MALE)
    JobInvolvement
    Degree to which a job is central to a person's identity - Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and 4 indicating very high
    Job level
    Responsibility level and expectations of roles - 1 'Low' 2 'Medium'3 'High' 4 'Very High
    JobSatisfaction
    SATISFACTION WITH THE JOB - Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and
 4 indicating very high
    MaritalStatus
    (1=DIVORCED, 2=MARRIED, 3=SINGLE)
    NumCompaniesWorked
    NO. OF COMPANIES WORKED AT
    OverTime
    (1=NO, 2=YES)
    PercentSalaryHike
    PERCENTAGE INCREASE IN SALARY
    PerformanceRating
    PERFORMANCE RATING - Responses were obtained on a 4 point scale ranging from 1 to 4 where 1 indicating low and 4 indicating very high
    WorkLifeBalance
    TIME SPENT BEWTWEEN WORK AND OUTSIDE 1 'Bad' 2 'Good' 3 'Better' 4 'Best'
    YearsAtCompany
    TOTAL NUMBER OF YEARS AT THE COMPNAY
    YearsInCurrentRole
    YEARS IN CURRENT ROLE
    YearsSinceLastPromotion
    YEARS SINCE LAST PROMOTION
Job satisfaction rating of 11 was replaced with imputed value of 2 considering its gender & other variables score similar to this row.
Q1.Description variables :
There are in total 17 variables in the dataset.Out of these 17 variables,6 are categorical variables & rest 11 are continuous/discrete variables.
We analysed the two variables Attrition & PercentSalaryHike.Below are the details.
    PercentSalaryHike
    Mean
    15.21
    Standard Error
    0.10
    Median
    14.00
    Mode
    11.00
    Standard Deviation
    3.66
    Sample Variance
    13.40
    Kurtosis
    -0.30
    Skewness
    0.82
    Range
    14.00
    Minimum
    11.00
    Maximum
    25.00
    Sum
    22358.00
    Count
    1470.00
Table 1
Assuming that the sample was representative of the entire employee population,the average salary hike for all employees was 15.21 & standard deviation was 3.66.The data is right skewed with skewness of 0.82 & also negative kurtosis.It does not follow the standard Bell curve used in HR.While the maximum salary increase is 25%,the minimum is 11%.
As evident from the histogram,it is a long tail distribution.Around 55% of employees have salary hike between 11 -14.3 %. Around 27 % of employees salary hike is within 11-12.1 %.

Above Box plot comapres the percent salary hike for people who left & who stayed back.Better would be a hypothesis test to check the statistical significance of difference.
    Row Labels
    Average of PercentSalaryHike
    No
    15.23
    Yes
    15.10
    Grand Total
    15.21
    Row Labels
    StdDev of PercentSalaryHike
    No
    3.64
    Yes
    3.77
    Grand Total
    3.66
    Row Labels
    Max of PercentSalaryHike
    No
    25.00
    Yes
    25.00
    Grand Total
    25.00
    Row Labels
    Min of PercentSalaryHike
    No
    11.00
    Yes
    11.00
    Grand Total
    11.00
    Row Labels
    Count of Employees
    No
    1233
    Yes
    237
    Grand Total
    1470
Table 2
In Table 2,Yes stands for employees who left the organisation & No stands for employees who stayed back.
Conclusion :From 1470 employees,237 employees left which is around 16.12%.From table 2,we see that Minimum salary hike for both categories of employees i.e who left & who stayed back are 11 percent.Similarly,maximum salary hike for both categories of employees are also 25%.Average salary hike of employees who left is 15.10 % while those who stayed back are 15.23%.The standard deviation of salary hike for employees who left is 3.77 while who stayed back are 3.64.The coefficient of variation of salary hike for employees who left is 24.97% in comparison to 23.89% for employees who stayed back.So basically,there is higher variability of salary hike for employees who left.
Q2.1
Introduction : In this case,we have 17 variables out of which one PercentSalaryHike is dependent variable & rest 16 are independent variables.Out of the 16 independent variables,6 are categorical variables & rest 10 are continuous variables.Before building the regression model,we need to do feature/variable selection by using correlation matrix where we will check correlation coefficient between independent.We will pick only the variables which have significant correlation with dependent variables.This will help us identify the continuous variables which are significant.For categorical variables,we will be directly take those in the model & if p value is less than the assumed 0.05,we will accept or else drop those.
Description :
Correlation Matrix :
     
    Age
    Education
    EnvironmentSatisfaction
    JobInvolvement
    JobLevel
    JobSatisfaction
    NumCompaniesWorked
    PercentSalaryHike
    PerformanceRating
    WorkLifeBalance
    YearsAtCompany
    YearsInCurrentRole
    YearsSinceLastPromotion
    Age
    1
     
     
     
     
     
     
     
     
     
     
     
     
    Education
    0.04752484
    1
     
     
     
     
     
     
     
     
     
     
     
    EnvironmentSatisfaction
    0.01014643
    -0.00754
    1
     
     
     
     
     
     
     
     
     
     
    JobInvolvement
    -0.01183258
    0.002385
    -0.0095
    1
     
     
     
     
     
     
     
     
     
    JobLevel
    0.50960423
    0.031794
    0.001212
    -0.056062233
    1
     
     
     
     
     
     
     
     
    JobSatisfaction
    0.01479432
    -0.05039
    -0.03417
    0.398666134
    -0.018348
    1
     
     
     
     
     
     
     
    NumCompaniesWorked
    0.07866392
    0.052225
    0.013452
    -0.351162338
    0.036036
    -0.388621527
    1
     
     
     
     
     
     
    PercentSalaryHike
    0.00363358
    -0.04893
    -0.0317
    0.550392895
    -0.03473
    0.689970561
    -0.510569943
    1
     
     
     
     
     
    PerformanceRating
    -0.00686535
    -0.04805
    -0.03536
    0.335945013
    -0.010407
    0.431331828
    -0.245052431
    0.764765885
    1
     
     
     
     
    WorkLifeBalance
    -0.02149003
    -0.02017
    0.027627
    -0.019344333
    0.037818
    0.023805154
    0.014317645
    -0.003279636
    0.002898784
    1
     
     
     
    YearsAtCompany
    0.31130877
    -0.00607
    0.001458
    -0.068457051
    0.534739
    -0.028841981
    -0.001827975
    -0.035991262
    0.003300928
    0.012089185
    1
     
     
    YearsInCurrentRole
    0.21290106
    0.048399
    0.018007
    -0.030374295
    0.389447
    0.000304992
    -0.002212759
    -0.001520027
    0.033066195
    0.049856498
    0.758753737
    1
     
    YearsSinceLastPromotion
    0.21651337
    0.008263
    0.016194
    -0.065300624
    0.353885
    -0.006542608
    0.009712666
    -0.022154313
    0.01199545
    0.008941249
    0.618408865
    0.548056248
    1
Q2.2
First we create multiple linear regression with all variables
    SUMMARY OUTPUT
     
     
     
     
     
     
     
     
    Regression Statistics
     
     
     
     
     
     
     
     
    Multiple R
    0.909365
     
     
     
     
     
     
     
    R Square
    0.826945
     
     
     
     
     
     
     
    Adjusted R Square
    0.825039
     
     
     
     
     
     
     
    Standard Error
    1.530891
     
     
     
     
     
     
     
    Observations
    1470
     
     
     
     
     
     
     
    ANOVA
     
     
     
     
     
     
     
     
     
    df
    SS
    MS
    F
    Significance F
     
     
     
    Regression
    16
    16272.18
    1017.011
    433.9475618
    0
     
     
     
    Residual
    1453
    3405.29
    2.343627
     
     
     
     
     
    Total
    1469
    19677.47
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Coefficients
    Standard Error
    t Stat
    P-value
    Lower 95%
    Upper 95%
    Lower 95.0%
    Upper 95.0%
    Intercept
    -2.45267
    0.435049
    -5.63769
    2.0675E-08
    -3.306066038
    -1.59928
    -3.30607
    -1.59928
    Age
    0.012813
    0.0052
    2.464003
    0.013854579
    0.002612631
    0.023014
    0.002613
    0.023014
    Education2
    -0.01757
    0.113468
    -0.15486
    0.876952406
    -0.24015004
    0.205007
    -0.24015
    0.205007
    Education3
    0.040894
    0.110801
    0.369079
    0.712122274
    -0.176453037
    0.258242
    -0.17645
    0.258242
    Education4
    -0.18423
    0.300032
    -0.61403
    0.539293303
    -0.772769475
    0.404314
    -0.77277
    0.404314
    Gender
    -0.8442
    0.092629
    -9.11376
    2.56733E-19
    -1.02589697
    -0.6625
    -1.0259
    -0.6625
    JobInvolvement
    0.633671
    0.049803
    12.72344
    3.07961E-35
    0.535976447
    0.731365
    0.535976
    0.731365
    Joblevel2
    -0.09591
    0.096155
    -0.9975
    0.318686742
    -0.28453263
    0.092703
    -0.28453
    0.092703
    Joblevel3
    -0.08817
    0.128929
    -0.68388
    0.494158156
    -0.341078021
    0.164734
    -0.34108
    0.164734
    Joblevel4
    -0.15145
    0.180404
    -0.83953
    0.401310951
    -0.505333689
    0.202426
    -0.50533
    0.202426
    Joblevel5
    -0.41499
    0.211314
    -1.96385
    0.049737758
    -0.829503976
    -0.00048
    -0.8295
    -0.00048
    JobSatisfaction
    0.934493
    0.042254
    22.1161
    1.18132E-93
    0.85160792
    1.017379
    0.851608
    1.017379
    Married
    -0.07691
    0.093049
    -0.82659
    0.408605586
    -0.25943844
    0.105612
    -0.25944
    0.105612
    Divorced
    -0.16533
    0.110929
    -1.49041
    0.136334201
    -0.382927229
    0.052269
    -0.38293
    0.052269
    NumCompaniesWorked
    -0.27864
    0.018575
    -15.001
    2.11031E-47
    -0.315076903
    -0.2422
    -0.31508
    -0.2422
    OverTime
    -0.0642
    0.089028
    -0.72111
    0.47096053
    -0.238834902
    0.110438
    -0.23883
    0.110438
    PerformanceRating
    4.799953
    0.12132
    39.56434
    6.2769E-233
    4.561971183
    5.037934
    4.561971
    5.037934
From the above table,we see that adjusted R square is around 82.5% & standard error is 1.53 & model is statistically significant as shown in lower p value <0.05 in Anova test.Those are variables which are significant are marked green.
But we will build a more refined model by removing the insignificant variables where p value >0.05.Below model is the refined model with all significant variables.Here in the final model we find Female gender,Job involvement,Job satisfaction,Num companies worked,performance rating are significant.
    SUMMARY OUTPUT
     
     
     
     
     
     
     
     
    Regression Statistics
     
     
     
     
     
     
     
     
    Multiple R
    0.908644
     
     
     
     
     
     
     
    R Square
    0.825634
     
     
     
     
     
     
     
    Adjusted R Square
    0.825038
     
     
     
     
     
     
     
    Standard Error
    1.530895
     
     
     
     
     
     
     
    Observations
    1470
     
     
     
     
     
     
     
    ANOVA
     
     
     
     
     
     
     
     
     
    df
    SS
    MS
    F
    Significance F
     
     
     
    Regression
    5
    16246.38
    3249.276
    1386.424
    0
     
     
     
    Residual
    1464
    3431.087
    2.343639
     
     
     
     
     
    Total
    1469
    19677.47
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Coefficients
    Standard Error
    t Stat
    P-value
    Lower 95%
    Upper 95%
    Lower 95.0%
    Upper 95.0%
    Intercept
    -2.17115
    0.392657
    -5.52937
    3.8E-08
    -2.94138
    -1.40092
    -2.94138
    -1.40092
    Female
    -0.84082
    0.092404
    -9.0994
    2.89E-19
    -1.02207
    -0.65956
    -1.02207
    -0.65956
    JobInvolvement
    0.6423
    0.049547
    12.96332
    1.84E-36
    0.545108
    0.739491
    0.545108
    0.739491
    JobSatisfaction
    0.939155
    0.042092
    22.31182
    3.68E-95
    0.856587
    1.021722
    0.856587
    1.021722
    NumCompaniesWorked
    -0.27491
    0.018407
    -14.9352
    4.71E-47
    -0.31102
    -0.2388
    -0.31102
    -0.2388
    PerformanceRating
    4.795484
    0.121134
    39.58816
    1.2E-233
    4.557869
    5.033099
    4.557869
    5.033099
Q2.3
Residual analysis shows that the dependent variable percentsalary hike is not exactly normally distributed & the residuals of independent variables Job involvement,performance rating & job satisfaction are not homoscedastic.The details of the residual analysis graphs are given in the Excel sheet.
Conclusion :As seen from above correlation matrix,we can see the variables such as Performance rating,Job involvement,Job satisfaction,NumCompanies worked are significantly correlated with Percentsalary hike as correlation coefficient is higher.Also,the individual scatter plots between the independent & dependent variables shows that the relationship is linear.Also,the correlation among independent variables seem to be low hence eliminating the possibility of multicollinearity.The significant variables are marked green.As most independent variables have less than 0.5 correlation coefficient among themselves,there is less chance of multicolleanirity.But still we can check for multicolleanirity if any by finding the VIF (variance inflation factor) in regression.Eventhough the model violates few of the assumptions of multiple linear regression,still we can accept it for descriptive purpose.
As per the final model,the famales have on an average 0.8402 % less PercentsalaryHike than that of male. 1 unit increase in job involvement results in 0.6423 % increase in PercentSalaryHike.1 unit increase in job satisfaction results in 0.9391 % increase in PercentSalaryhike on average.Every additional no of companies worked results in -.2741 % decrease in PercentSalaryHike.1 unit increase in performance rating results in 4.79% increase in PercentSalaryHike.
The model is statistically significant as P value in ANOVA test less than 0.05 & adjusted R square is 82.5%
Q3.
    SUMMARY OUTPUT
     
     
     
     
     
     
     
     
    Regression Statistics
     
     
     
     
     
     
     
     
    Multiple R
    0.91014884
     
     
     
     
     
     
     
    R Square
    0.828370911
     
     
     
     
     
     
     
    Adjusted R Square
    0.827667032
     
     
     
     
     
     
     
    Standard Error
    1.519350169
     
     
     
     
     
     
     
    Observations
    1470
     
     
     
     
     
     
     
    ANOVA
     
     
     
     
     
     
     
     
     
    df
    SS
    MS
    F
    Significance F
     
     
     
    Regression
    6
    16300.24098
    2716.707
    1176.866
    0
     
     
     
    Residual
    1463
    3377.225683
    2.308425
     
     
     
     
     
    Total
    1469
    19677.46667
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
     
    Coefficients
    Standard Error
    t Stat
    P-value
    Lower 95%
    Upper 95%
    Lower 95.0%
    Upper 95.0%
    Intercept
    3.627857474
    1.262195506
    2.874244
    0.004109
    1.151951411
    6.103763536
    1.151951411
    6.103763536
    Female
    -0.825286881
    0.091763043
    -8.99367
    7.22E-19
    -1.005288056
    -0.645285707
    -1.005288056
    -0.645285707
    JobInvolvement
    0.633468834
    0.049207801
    12.87334
    5.28E-36
    0.53694346
    0.729994208
    0.53694346
    0.729994208
    JobSatisfaction
    -0.826316075
    0.367873788
    -2.24619
    0.02484
    -1.547932447
    -0.104699703
    -1.547932447
    -0.104699703
    NumCompaniesWorked
    -0.276258441
    0.018270092
    -15.1208
    4.18E-48
    -0.312096813
    -0.240420069
    -0.312096813
    -0.240420069
    PerformanceRating
    2.877047033
    0.414958503
    6.933337
    6.15E-12
    2.063069905
    3.691024161
    2.063069905
    3.691024161
    Performance rating xJob satisfaction
    0.57910428
    0.119888274
    4.830366
    1.51E-06
    0.343933023
    0.814275538
    0.343933023
    0.814275538
The final model with interaction term between performance rating & job satisfaction is
PercentSalaryHike = 3.62 -0.82 Female +0.63 JobInvolvement -0.82 Jobsatisfaction-0.27 NumCompaniesworked +2.87 performancerating+0.57 performancerating x job satisfaction.
As the p value associated with interaction term is less than 0.05,so this term is statistically significant.So increase in performance rating affects increase in percent salary in different way depending upon job satisfaction.
Conclusion :
So final model shows that Female gender,Job involvement,performance rating,job satisfaction,number of companies worked & interaction between performance rating & job satisfaction are important for salary hike.
Females have on average 0.82 % salary hike in comparison to male.1 unit increase in job involvement increases % salary hike by 0.63%.1 unit increase in job satisfaction results in (0.57 performance rating -0.82) % salary hike.so basically job satisfaction results in higher % salary hike in the presence of higher performance rating.similarly higher performance rating results in higher % salary hike in presence of high job satisfaction.
Q4.The employee attrition problem :
We did a logistic regression for the dependent variable Attrition (1=yes,0=No) with all other variables as independent variables.Among these Education,Gender,Job level,Marital status,overtime are categorical variables for which dummy codes were used.Education has 4 levels so 3 dummy codes were created with Education1 being reference level.Gender 1 was female with male being reference level.similarly,job level 1 was reference category with others being dummy coded.marrital status is dummy coded with single being reference & so also overtime where “No” is reference category.
    Variable name
    Coeff value
    odds
     intercept
    1.706330944
     
    Age
    -0.030831391
    0.970
    Education2
    0.096060844
    1.101
    Education3
    0.307956911
    1.361
    Education4
    0.227199341
    1.255
    EnvironmentSatisfaction
    -0.367395095
    0.693
    Female
    -0.319340636
    0.727
    JobInvolvement
    -0.175083026
    0.839
    Joblevel2
    -0.961435136
    0.382
    Joblevel3
    -0.319954909
    0.726
    Joblevel4
    -1.682080948
    0.186
    Joblevel5
    -1.08606482
    0.338
    JobSatisfaction
    -0.1699398
    0.844
    Married
    -0.857916847
    0.424
    Divorced
    -1.194120663
    0.303
    NumCompaniesWorked
    0.045318512
    1.046
    OverTime
    1.631282149
    5.110
    PercentSalaryHike
    0.036695796
    1.037
    PerformanceRating
    0.049571141
    1.051
    WorkLifeBalance
    -0.227926049
    0.796
    YearsAtCompany
    -0.00855152
    0.991
    YearsInCurrentRole
    -0.153279842
    0.858
    YearsSinceLastPromotion
    0.142827598
    1.154
In the table above,we have attached coefficient value with odds of each.where the odds are greater than 1,it means higher value of the variable results in greater probability of exit & where odds are lower,the probability of exit is lower.
Different cases :
1. Medium satisfaction with job & work environment,5 years since last promotion,the person has higher probability of leaving as odds is higher (around 2.17 if we plug in the values of medium as 3 & 5 for years since last promotion)
2. Higher number of years in current role & overtime results in higher odds of exit
3. A 45 years old married employee with a very-high level job classification and maintaining a good work-life balance has lower odds of leaving.
The final equation is log(p/1-p) = 1.7 +0.3 Age +0.09 Education2 +0.307 Education3+0.227 Education4 +0.367 Environmentsatisfaction +.319 female + 0.175 Job involvement +0.961 Job level2 +0.313 Job level3+ 1.68 Job level4 -1.08 job level5 -.169 Job satisfaction+.857 married +1.19 divorce +0.045 Numcompaniesworked +1.63 Overtime +0.036 Percent salary hike +0.049 performance rating+.227 worklife balance-.008 yearsAtCompany+0.153YearsInCurrentRole+0.142 YearsSinceLastPromotion where p is the probability of attrition.
If we plug the individual values of an employee in the above equation,it will...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here
March
January
February
March
April
May
June
July
August
September
October
November
December
2025
2025
2026
2027
SunMonTueWedThuFriSat
23
24
25
26
27
28
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
1
2
3
4
5
00:00
00:30
01:00
01:30
02:00
02:30
03:00
03:30
04:00
04:30
05:00
05:30
06:00
06:30
07:00
07:30
08:00
08:30
09:00
09:30
10:00
10:30
11:00
11:30
12:00
12:30
13:00
13:30
14:00
14:30
15:00
15:30
16:00
16:30
17:00
17:30
18:00
18:30
19:00
19:30
20:00
20:30
21:00
21:30
22:00
22:30
23:00
23:30