Answer To: Business Analytics and Big Data (ACC73002) Assignment 3 – Report (50%) You are hired as a...
Pritam answered on Sep 27 2021
INTRODUCTION
The purpose of this analysis is to build a model that determines the price of a property from its attributes. For a consultant in the real estate market, it is important that this model be as accurate as possible. Six samples, drawn from different states and regions, are used to determine the model that predicts property prices best.
Data and Empirical Strategy
The data cover two states, State A and State B, and each state has three regions: a regional city, a coastal city and a coastal town. The variables are Price, Internal Area, number of bedrooms, number of bathrooms, number of garages and Type of property. All are numerical except Type, which is categorical. The Excel sheet for State A contains two sets of data for each region, so only one set per region has been taken for further analysis.
Visualization is the building block of the analysis, since any analysis rests on a pre-analysis stage of data manipulation and visualization. The data do not require any manipulation here, so the analysis starts with visualization. The statistical method used is multiple linear regression. Before fitting any multiple linear regression, the assumptions of the regression have to be checked; backward elimination is then applied to arrive at the final model.
Results and Discussions:
The first step in the analysis is to produce some visualizations to get a feel for the data, since understanding the data is essential before any modelling. The graphs below cover different states and regions.
Selected visualizations:
State A: Regional city:
State B: Regional city:
The visualizations make one thing clear: Internal Area is positively related to the other variables in both State A and State B. This is a warning sign of multicollinearity. The price also appears to be higher in State A than in State B.
State A: Coastal city:
State B: Coastal city:
Comparison based on regional city of State A and State B:
State A: Regional City:
| Type | Average of Price $000 | Average of Internal Area m^2 | Average of Bedrooms | Average of Bathrooms | Average of Garages |
|---|---|---|---|---|---|
| House | 368.34 | 162.31 | 3.54 | 1.52 | 1.85 |
| Unit | 223.44 | 95.55 | 2.22 | 1.07 | 1.22 |
| Grand Total | 320.8144 | 140.4104 | 3.104 | 1.376 | 1.64 |
State B: Regional City:
| Type | Average of Price $000 | Average of Internal Area m^2 | Average of Land/Total Area m^2 | Average of Bedrooms | Average of Bathrooms | Average of Garages |
|---|---|---|---|---|---|---|
| House | 418.51 | 139.94 | 763.46 | 3.46 | 1.54 | 1.95 |
| Unit | 335.47 | 104.16 | 210.91 | 2.71 | 1.76 | 1.35 |
| Grand Total | 400.86 | 132.34 | 646.04 | 3.30 | 1.59 | 1.83 |
In terms of average property price in the regional city, State B is clearly more expensive than State A (400.86 versus 320.81, in $000). The other attributes, namely average internal area and average number of bedrooms, bathrooms and garages, are almost the same, with no notable difference between the two states.
Multiple Linear Regression model:
The first model uses all available predictors. The VIF of each predictor is then checked, and variables with high multicollinearity are removed (VIF > 3 is used as the threshold). The regression is refitted on the remaining predictors. Once all VIFs are acceptably low, predictors with insignificant p-values are removed as well. The predictors that remain, all with significant p-values, make up the final model.
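The VIF screening step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the report's own figures appear to come from Excel, the function names (`solve`, `r_squared`, `vif`, `drop_high_vif`) are hypothetical, the toy data are invented, and dropping the single worst predictor per round is one common variant of the procedure rather than necessarily the one used here. The p-value stage of backward elimination is not shown.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(xs, y):
    """R^2 of an ordinary least squares fit of y on the columns xs plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + xs
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(beta[k] * cols[k][i] for k in range(len(cols))) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(predictors):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others."""
    scores = {}
    for name, col in predictors.items():
        others = [c for k, c in predictors.items() if k != name]
        scores[name] = 1.0 / (1.0 - r_squared(others, col))
    return scores


def drop_high_vif(predictors, threshold=3.0):
    """Repeatedly drop the predictor with the largest VIF while it exceeds the threshold."""
    kept = dict(predictors)
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return kept


# ASSUMPTION: invented toy data, not the assignment's data set.
toy = {
    "internal_area": [90.0, 120.0, 150.0, 160.0, 200.0, 210.0, 95.0, 130.0],
    "bedrooms": [2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 2.0, 3.0],
    "garages": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 2.0, 1.0],
}
vifs = vif(toy)
kept = drop_high_vif(toy)
print({k: round(v, 2) for k, v in vifs.items()})
print("kept after screening:", sorted(kept))
```

In the toy data, internal area and bedrooms move together, so both get a large VIF and one of them is dropped, while garages stays below the threshold.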
State A final model:
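Regression Statistics
| Statistic | Value |
|---|---|
| Multiple R | 0.82 |
| R Square | 0.68 |
| Adjusted R Square | 0.67 |
| Standard Error | 63.58 |
| Observations | 125 |
ANOVA
| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 4 | 1031712.22 | 257928.06 | 63.81 | 0.00 |
| Residual | 120 | 485039.21 | 4041.99 | | |
| Total | 124 | 1516751.434 | | | |
Coefficients
| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Sx | VIF |
|---|---|---|---|---|---|---|---|---|
| Intercept | 119.67 | 26.44 | 4.53 | 0.00 | 67.31 | 172.02 | | |
| Bedrooms | 23.38 | 7.54 | 3.10 | 0.00 | 8.45 | 38.32 | 1.07 | 1.99 |
| Bathrooms | 71.77 | 13.18 | 5.45 | 0.00 | 45.68 | 97.86 | 0.55 | 1.60 |
| Garages | 30.69 | 7.44 | 4.13 | 0.00 | 15.96 | 45.42 | 0.85 | 1.22 |
| Type | -62.58 | 15.19 | -4.12 | 0.00 | -92.65 | -32.51 | 0.47 | 1.57 |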
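The coefficients of the State A regional city model define a price equation, which the short sketch below evaluates for an example property. The coefficient values are copied from the output above. The coding of Type is an assumption: the report does not state it, and the sketch treats Type as a 0/1 dummy with 1 = Unit and 0 = House, which would be consistent with the negative coefficient and the lower average unit price. The function name `predict_price` is hypothetical.

```python
# Coefficients of the State A (regional city) final model; prices are in $000.
COEF = {
    "Intercept": 119.67,
    "Bedrooms": 23.38,
    "Bathrooms": 71.77,
    "Garages": 30.69,
    "Type": -62.58,
}


def predict_price(bedrooms, bathrooms, garages, is_unit):
    """Predicted price in $000.

    ASSUMPTION: Type is a 0/1 dummy with 1 = Unit and 0 = House.
    The source does not state the coding.
    """
    return (
        COEF["Intercept"]
        + COEF["Bedrooms"] * bedrooms
        + COEF["Bathrooms"] * bathrooms
        + COEF["Garages"] * garages
        + COEF["Type"] * (1 if is_unit else 0)
    )


house = predict_price(bedrooms=3, bathrooms=2, garages=2, is_unit=False)
unit = predict_price(bedrooms=3, bathrooms=2, garages=2, is_unit=True)
print(f"house: {house:.2f}  unit: {unit:.2f}")
```

Under this assumed coding, two otherwise identical properties differ by the Type coefficient alone, about 62.58 (in $000).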
State B final model:
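SUMMARY OUTPUT
Regression Statistics
| Statistic | Value |
|---|---|
| Multiple R | 0.78 |
| R Square | 0.61 |
| Adjusted R Square | 0.60 |
| Standard Error | 80.37 |
| Observations | 80 |
ANOVA
| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2 | 791788.3 | 395894 | 61.28 | 0.00 |
| Residual | 77 | 497412.8 | 6459.91 | | |
| Total | 79 | 1289201 | | | |
Coefficients
| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 70.35 | 32.72 | 2.15 | 0.03 | 5.20 | 135.50 |
| Internal Area m^2 | 2.17 | 0.29 | 7.57 | 0.00 | 1.60 | 2.74 |
| Garages | 23.61 | 9.98 | 2.37 | 0.02 | 3.73 | 43.49 |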
Comparing the two models, the State A model fits its data better. The adjusted R-squared is 0.67 for the State A model and 0.60 for the State B model. In State A, about 67% of the variance in price is explained by bedrooms, bathrooms, garages and type, whereas in State B only about 60% of the variance in price is explained by the predictors internal area and garages.
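The R-squared and adjusted R-squared values can be recomputed from the ANOVA sums of squares as a consistency check. The sketch below uses the standard formulas R² = SS_regression / SS_total and adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The inputs are copied from the two regional city outputs, and the function name `fit_summary` is hypothetical.

```python
def fit_summary(ss_regression, ss_total, n, k):
    """Return (R^2, adjusted R^2) for a model with k predictors and n observations."""
    r2 = ss_regression / ss_total
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2


# State A regional city: 4 predictors, 125 observations.
r2_a, adj_a = fit_summary(1031712.22, 1516751.434, n=125, k=4)
# State B regional city: 2 predictors, 80 observations.
r2_b, adj_b = fit_summary(791788.3, 1289201.0, n=80, k=2)

print(f"State A: R2 = {r2_a:.2f}, adjusted R2 = {adj_a:.2f}")
print(f"State B: R2 = {r2_b:.2f}, adjusted R2 = {adj_b:.2f}")
```

Both results round to the values reported in the regression output (0.68 and 0.67 for State A, 0.61 and 0.60 for State B).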
Comparison based on Coastal city of State A and State B:
State A: Coastal City:
| Type | Average of Price $000 | Average of Internal Area m^2 | Average of Bedrooms | Average of Bathrooms | Average of Garages |
|---|---|---|---|---|---|
| House | 610.88 | 170.50 | 3.97 | 2.20 | 1.98 |
| Unit | 383.73 | 92.95 | 2.17 | 1.42 | 1.08 |
| Grand Total | 494.58 | 130.80 | 3.05 | 1.80 | 1.52 |
State B: Coastal City:
| Type | Average of Price $000 | Average of Internal Area m^2 | Average of Bedrooms | Average of Bathrooms | Average of Garages |
|---|---|---|---|---|---|
| House | 538.10 | 180.55 | 3.84 | 2.05 | 2.51 |
| Unit | 411.53 | 88.02 | 1.97 | 1.57 | 1.59 |
| Grand Total | 500.63 | 153.16 | 3.29 | 1.90 | 2.24 |
The tables show that for houses the average price is higher in State A (610.88 versus 538.10, in $000), whereas for units it is higher in State B (411.53 versus 383.73), with the other attributes broadly similar within each property type. Overall, the average price in State B is slightly higher than in State A (500.63 versus 494.58), and State B properties also offer more on average: larger internal area and more bedrooms, bathrooms and garages.
State A final model:
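SUMMARY OUTPUT
Regression Statistics
| Statistic | Value |
|---|---|
| Multiple R | 0.68 |
| R Square | 0.46 |
| Adjusted R Square | 0.45 |
| Standard Error | 164.55 |
| Observations | 125 |
ANOVA
| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 2824663.29 | 941554.43 | 34.77 | 0.00 |
| Residual | 121 | 3276394.86 | 27077.64 | | |
| Total | 124 | 6101058.15 | | | |
Coefficients
| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Sx | VIF |
|---|---|---|---|---|---|---|---|---|
| Intercept | 241.19 | 59.18 | 4.08 | 0.00 | 124.02 | 358.35 | | |
| Bathrooms | 127.90 | 25.45 | 5.03 | 0.00 | 77.52 | 178.29 | 0.74 | 1.63 |
| Garages | 44.73 | 21.11 | 2.12 | 0.04 | 2.94 | 86.52 | 0.89 | 1.60 |
| Type | -87.55 | 36.44 | -2.40 | 0.02 | -159.70 | -15.40 | 0.50 | 1.53 |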
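The t statistics and 95% confidence limits in the State A coastal city output follow from each coefficient and its standard error: t = coefficient / standard error, and the interval is coefficient ± t_crit × standard error. The sketch below reproduces them as a check. The critical value 1.98 is an assumption, namely the rounded two-sided 95% value of the t distribution with 121 residual degrees of freedom, so the limits agree with the reported ones only to within rounding. The names `ROWS`, `t_stat` and `interval` are hypothetical.

```python
# Rows copied from the State A (coastal city) coefficient table.
# Each tuple: (coefficient, standard error, reported t, reported lower 95%, reported upper 95%)
ROWS = {
    "Intercept": (241.19, 59.18, 4.08, 124.02, 358.35),
    "Bathrooms": (127.90, 25.45, 5.03, 77.52, 178.29),
    "Garages": (44.73, 21.11, 2.12, 2.94, 86.52),
    "Type": (-87.55, 36.44, -2.40, -159.70, -15.40),
}

# ASSUMPTION: rounded two-sided 95% critical value of t with 121 degrees of freedom.
T_CRIT = 1.98


def t_stat(coef, se):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coef / se


def interval(coef, se, t_crit=T_CRIT):
    """Approximate 95% confidence interval for a coefficient."""
    half_width = t_crit * se
    return coef - half_width, coef + half_width


for name, (coef, se, _, _, _) in ROWS.items():
    low, high = interval(coef, se)
    print(f"{name:10s} t = {t_stat(coef, se):6.2f}  95% CI = ({low:8.2f}, {high:8.2f})")
```

Every interval excludes zero, which matches the p-values below 0.05 in the table.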
State B final model:
SUMMARY...