The dataset ModelingPVCapacity.xls has data on the amount of installed photovoltaic cells (solar panels) in New Zealand. The amount installed is called the Capacity, and is measured in MW. We want to model this as a function of Time (measured in hundreds of days since 31 July 2013).
- Using Analyze> Curve Estimation, fit a cubic regression model for Capacity vs Time. Is the cube power term significant? Show the graph.
- It is implausible that the amount of installed PV capacity will continue to increase steeply in future years. Eventually it must flatten out. Therefore use the data to fit a nonlinear regression model Capacity = C*exp(b0 + b1*Time)/(1 + exp(b0 + b1*Time)).
Guess at initial parameter estimates. (If the model doesn’t converge, try other guesses until you get convergence.) Save the predicted values.
- Show the nonlinear regression output. Also use Graph > Legacy Dialogs > Scatter/Dot > Overlay Scatterplot > Define to plot the Capacity vs Time and Predicted values vs Time overlaid on the same graph. (Change the plotting symbol for the predicted values to +).
- What is the predicted maximum capacity based on this model?
- Does this nonlinear regression model fit better or worse than the cubic model? Quote evidence.
- From other considerations, the NZ Electricity Authority believes that in the long run (i.e. asymptotically) about half of New Zealand households will find it economic to install PV systems. In view of the current population, that equates to about C = 2800 MW of Capacity. Calculate a new variable
logitC = ln( Capacity / (2800 - Capacity) )
and fit a regression model logitC = b0 + b1*Time + b2*Change.
Save the predicted values. (The variable Change is in the data file.)
- Show the regression output, and use Graphs > Legacy Dialogs etc. to plot an overlay scatterplot of logitC vs Time and the predicted values vs Time.
- The second variable Change was chosen because in November 2014 Contact Energy slashed the amount it would pay when customers sold PV-generated power back to the national grid, a move that was followed by other companies. Is there evidence that there was a change in the slope of the line for logitC vs time? Quote evidence.
- Using the estimated regression coefficients, predict the value of logitC at 30 April 2017 (Time= 13.69, Change= 8.82). Convert that back to estimate the PV Capacity in MW on 30 April 2017.
(Hint: if Y is the predicted value of logitC, then Capacity = 2800*e^Y / (1 + e^Y).)
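The logitC transform and the back-conversion in the hint can be sketched in Python as below. This is only an illustration, not part of the SPSS workflow: the function names are made up for this sketch, the value 50.0 is an arbitrary capacity used to check the round trip, and the coefficients b0, b1, b2 must come from your own regression output.

```python
import math

C = 2800.0  # assumed long-run capacity in MW, as stated in the question


def logit_c(capacity, c=C):
    """logitC = ln(capacity / (c - capacity)); needs 0 < capacity < c."""
    if not 0 < capacity < c:
        raise ValueError("capacity must lie strictly between 0 and c")
    return math.log(capacity / (c - capacity))


def capacity_from_logit(y, c=C):
    """Inverse transform from the hint: c * e^y / (1 + e^y)."""
    return c * math.exp(y) / (1.0 + math.exp(y))


def predict_logit(b0, b1, b2, time, change):
    """Linear predictor logitC = b0 + b1*Time + b2*Change."""
    return b0 + b1 * time + b2 * change


# Round trip on an arbitrary, made-up capacity value (not from the dataset).
example = 50.0
y = logit_c(example)
back = capacity_from_logit(y)
```

With the estimated coefficients in hand, capacity_from_logit(predict_logit(b0, b1, b2, 13.69, 8.82)) would give the predicted capacity in MW for 30 April 2017.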
Q2. The dataset Dengue.xls reports a study of 196 people living in a Mexican city, of whom 57 were found to have dengue fever, a nasty mosquito-borne disease. The response variable is Dengue (=1 if diseased and 0 if not). Explanatory variables include the person’s Age (in years), whether or not they used a mosquito net (MosNet=1 if yes, 0 if no), and which Sector of the city the person lived in (sector 1, 2, 3, 4 or 5).
- Fit a binary logistic regression of Dengue vs Age. Save the predicted probabilities.
- Show the regression output. Also plot the predicted probabilities against Age.
- At what age is it 50% likely that the person will have dengue fever?
- Remove Age, and add the indicator variables for Sector (Sector1, Sector2, Sector3, Sector4) to the regression.
- Quote an overall statistic and sig value for whether the probability of dengue differs between the sectors.
- Looking at the coefficients, which sectors have significantly higher rates of dengue fever than other sectors?
- Re-fit the binary logistic regression with Age, MosNet and the sector indicator variables.
- Does the use of a mosquito net make a significant difference to the probability of having dengue fever?
- What proportion of individuals are correctly classified by the binary logistic model?
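For the Age-only model, the fitted probability is p = 1 / (1 + exp(-(b0 + b1*Age))), so p = 0.5 exactly when b0 + b1*Age = 0. The Python sketch below illustrates that calculation and the idea of a classification rate. The coefficients b0_demo and b1_demo are hypothetical placeholders, not estimates from Dengue.xls, and the 0.5 cutoff is an assumption about the classification rule.

```python
import math


def predicted_probability(b0, b1, age):
    """Logistic model: P(Dengue = 1) for a given age."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))


def age_at_probability(b0, b1, p=0.5):
    """Solve b0 + b1*age = ln(p / (1 - p)) for age."""
    if b1 == 0:
        raise ValueError("b1 must be non-zero")
    return (math.log(p / (1.0 - p)) - b0) / b1


def proportion_correct(observed, probabilities, cutoff=0.5):
    """Share of cases whose predicted class (assumed cutoff) matches the observed 0/1 value."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    hits = sum(1 for o, q in zip(observed, predicted) if o == q)
    return hits / len(observed)


# Hypothetical coefficients, for illustration only.
b0_demo, b1_demo = -2.0, 0.05
age50 = age_at_probability(b0_demo, b1_demo)
```

With p = 0.5 the log-odds term is zero, so the answer reduces to Age = -b0/b1 using the coefficients from your own output.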
Q3. The dataset NSWHospitals.sav contains information about the costs and treatment of various hospitals in different areas (“SLA”s) of New South Wales, Australia.
The response variable we will focus on is called StdCostRatio, which is a standardised measure of how expensive various hospitals are, per patient (average= 100). The remaining variables are described in an appendix. You don’t need to know the definitions.
- Fit a linear regression of StdCostRatio on all the columns from SupplyBeds10000 to Nocar (i.e. exclude StdSeparationRatio). Include Collinearity statistics. Show output.
Comment on what evidence there is of multicollinearity in the regression.
- Fit a stepwise linear regression of StdCostRatio on the columns from SupplyBeds10000 to Nocar. Show output.
- Use backwards elimination to choose a model for StdCostRatio.
- Show the Model Summary and Coefficients results.
- Which model is best in terms of adjusted R^2?
- Which model is best in terms of Std Error of the Estimate?
- Is the final backwards elimination model better or worse than the model chosen by Stepwise? (State your reasons for your answer.)
- Fit the models chosen in part (c) iii and iv. (Use Enter now, not stepwise etc.) and save the studentized dffits. Show output. Compare the standard deviation of deleted residuals. Which model is best by this criterion? What is the idea behind this criterion?
- Using the regression coefficients for the model you have chosen in part (d), write a sentence or two describing the type of SLA which is predicted to have a high StdCostRatio.
- Re-fit that last regression, showing the partial regression plots (added variable plots) and saving the standardised DFFITS. Which graph seems to be most dependent on a single point for its slope? Identify the point, e.g. by row number.
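As background to the deleted-residual comparison above, a deleted residual is the error made when a case is predicted from a model fitted without that case. The Python sketch below shows the idea for a regression with a single predictor, which is a simplification of the multiple regressions in this question. The data x_demo and y_demo are made up for illustration, and the function names are not SPSS terms.

```python
import statistics


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def deleted_residuals(x, y):
    """Deleted residuals via the shortcut e_i / (1 - h_ii), with h_ii the leverage."""
    n = len(x)
    xbar = statistics.fmean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    intercept, slope = fit_line(x, y)
    out = []
    for xi, yi in zip(x, y):
        leverage = 1.0 / n + (xi - xbar) ** 2 / sxx
        out.append((yi - (intercept + slope * xi)) / (1.0 - leverage))
    return out


# Made-up data with one high-leverage point at x = 10.
x_demo = [1.0, 2.0, 3.0, 4.0, 10.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 25.0]
deleted = deleted_residuals(x_demo, y_demo)
spread = statistics.stdev(deleted)  # smaller spread suggests better out-of-sample prediction
```

The shortcut avoids refitting the model once per case; each value equals the residual that would be obtained by actually leaving that case out.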