analysis.Rmd
---
title: "Assignment"
subtitle: "Statistics"
date: "14/10/2019"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      error = FALSE,
                      comment = "")
library(ggplot2)
library(reshape2)
library(fitdistrplus)
library(boot)
```
# Question 1
## Part 1
Writing a function for the estimate of $\mu$ produced by the hard-thresholding estimator.
```{r}
mu.threshold <- function(y, lambda) {
  # Hard thresholding: keep y if |y| >= lambda, otherwise return 0
  if (abs(y) >= lambda) {
    return(y)
  } else {
    return(0)
  }
}
```
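As a quick sanity check (the test values here are arbitrary), observations inside $(-\lambda, \lambda)$ are zeroed out and the rest are kept unchanged:
```{r}
# Hard thresholding at lambda = 1: -3 and 3 survive, the rest become 0
sapply(c(-3, -0.5, 0, 0.5, 3), mu.threshold, lambda = 1)
```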
Writing a function that takes $\mu$, $\lambda$ and the number of points to simulate, and numerically approximates the squared-error risk of the hard-thresholding estimator.
```{r}
hard.threshold.risk <- function(mu, lambda, n) {
  # Use sorted draws from N(mu, 1) as an integration grid and
  # approximate the risk integral by a Riemann sum
  ys <- sort(rnorm(n, mean = mu, sd = 1))
  r <- 0
  for (index in 2:n) {
    dy <- ys[index] - ys[index-1]
    r <- r +
      dnorm(ys[index], mean = mu, sd = 1) *  # density (not CDF) weights the sum
      (mu - mu.threshold(ys[index], lambda))^2 *
      dy
  }
  return(r)
}
```
## Part 2
Writing the sign function
```{r}
sgn <- function(x) {
  if (x < 0) {
    return(-1)
  }
  if (x == 0) {
    return(0)
  }
  return(1)
}
```
Writing a function for the estimate of $\mu$ produced by the lasso (soft-thresholding) estimator.
```{r}
mu.lasso <- function(y, lambda) {
  # Soft thresholding: shrink y towards zero by lambda
  return(sgn(y) * max(c(abs(y) - lambda, 0)))
}
```
Writing a function that takes $\mu$, $\lambda$ and the number of points to simulate, and numerically approximates the squared-error risk of the lasso shrinkage estimator.
```{r}
lasso.estimator.risk <- function(mu, lambda, n) {
  # Same Riemann-sum approximation as above, with the lasso estimate
  ys <- sort(rnorm(n, mean = mu, sd = 1))
  r <- 0
  for (index in 2:n) {
    dy <- ys[index] - ys[index-1]
    r <- r +
      dnorm(ys[index], mean = mu, sd = 1) *  # density (not CDF) weights the sum
      (mu - mu.lasso(ys[index], lambda))^2 *
      dy
  }
  return(r)
}
```
## Part 3
```{r}
mus <- seq(-10, 10, 0.1)
lambdas <- c(1, 2)
n <- 100000
for (lambda in lambdas) {
  threshold.risks <- c()
  lasso.risks <- c()
  for (mu in mus) {
    threshold.risks <- c(threshold.risks,
                         hard.threshold.risk(mu, lambda, n))
    lasso.risks <- c(lasso.risks,
                     lasso.estimator.risk(mu, lambda, n))
  }
  data <- data.frame(
    mus = mus,
    threshold = threshold.risks,
    lasso = lasso.risks,
    least.square = 1
  )
  data <- melt(data, id = "mus")
  print(
    ggplot(data = data,
           aes(x = mus, y = value, colour = variable)) +
      geom_line() +
      labs(
        title = paste("Risk for lambda = ", lambda, sep = ""),
        y = "risk",
        x = "mu"
      )
  )
}
```
## Part 4
From the plots above we can see that, for both values of $\lambda$ (1 and 2), the lasso estimator has lower risk than the hard-thresholding estimator when $|\mu|$ is small, while the hard-thresholding estimator does better once $|\mu|$ is large. The least-squares estimator has constant risk 1; both shrinkage estimators beat it near $\mu = 0$ but exceed it for larger $|\mu|$. The risk of the hard-thresholding estimator approaches the least-squares risk as $|\mu|$ grows, whereas the lasso risk plateaus above it because of the persistent shrinkage bias of $\lambda$.
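This matches the known limiting behaviour of the three estimators (a standard result, stated here for reference):
$$R\left(\hat{\mu}_{\text{hard}},\mu\right)\rightarrow1,\qquad R\left(\hat{\mu}_{\text{lasso}},\mu\right)\rightarrow1+\lambda^2\quad\text{as }|\mu|\rightarrow\infty,\qquad R\left(\hat{\mu}_{\text{LS}},\mu\right)=1\;\text{for all }\mu,$$
since for large $|\mu|$ hard thresholding returns $y$ itself, while the lasso estimate $y-\lambda\,\mathrm{sgn}(y)$ retains a bias of $\lambda$.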
# Question 2
Loading the dataset into the R workspace and displaying a summary.
```{r}
fuel <- read.csv("fuel2017-20.csv")
summary(fuel)
```
## Part 1
```{r}
lm.model <- lm(
  Comb.FE ~ .,
  data = fuel
)
summary(lm.model)
```
## Part 2
```{r}
source("bayes.lasso.R")
bl.model <- bayes.lasso(
  Comb.FE ~ .,
  data = fuel
)
```
```{r}
bl.model.results <- data.frame(
  Estimate = bl.model$beta.mean,
  "Std. Error" = bl.model$beta.sd,
  "t value" = bl.model$t.stat
)
bl.model.results
```
## Part 3
The results from the least-squares model and the Bayesian lasso model are very similar. Based on the Bayesian lasso estimates, the unimportant predictors (those with small `t value`s) are `No.Cylinders`, `AspirationOT`, `Drive.SysA`, `Drive.SysR`, `Fuel.TypeGM` and `Fuel.TypeGPR`.
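One way to make this explicit is to flag coefficients with small `t value`s (a sketch; it assumes the coefficient names carry over as row names of `bl.model.results`, that `data.frame` has mangled the column name `"t value"` to `t.value`, and uses $|t| < 1$ as an illustrative cut-off rather than a formal test):
```{r}
# List predictors whose posterior t-statistic is small in magnitude
rownames(bl.model.results)[abs(bl.model.results$t.value) < 1]
```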
## Part 4
We remove only `No.Cylinders`, since the other unimportant predictors are individual levels of factor variables, some (if not most) of whose other levels significantly affect fuel efficiency.
```{r}
bl.model <- bayes.lasso(
  Comb.FE ~ Model.Year +
    Eng.Displacement +
    Aspiration +
    No.Gears +
    Lockup.Torque.Converter +
    Drive.Sys +
    Max.Ethanol +
    Fuel.Type,
  data = fuel
)
```
```{r}
bl.model.results <- data.frame(
  Estimate = bl.model$beta.mean,
  "Std. Error" = bl.model$beta.sd,
  "t value" = bl.model$t.stat
)
bl.model.results
```
Based on these results, for higher fuel efficiency we would advise the latest model year, a lower engine displacement, natural engine aspiration, a lower number of gears, no lockup torque converter, a front-wheel drive system, a lower maximum % ethanol allowed, and mid-grade unleaded fuel.
## Part 5
```{r}
new.data <- data.frame(
  Model.Year = 2018,
  Eng.Displacement = 3.5,
  No.Cylinders = 6,
  Aspiration = "N",
  No.Gears = 8,
  Lockup.Torque.Converter = "N",
  Drive.Sys = "R",
  Max.Ethanol = 15,
  Fuel.Type = "GPR"
)
pred <- predict(lm.model, new.data, interval = "confidence")
pred
```
## Part 6
```{r}
bl.model <- bayes.lasso(
  Comb.FE ~ Model.Year +
    Eng.Displacement * No.Cylinders +  # main effects plus interaction
    Aspiration +
    No.Gears +
    Lockup.Torque.Converter +
    Drive.Sys +
    Max.Ethanol +
    Fuel.Type,
  data = fuel
)
bl.model.results <- data.frame(
  Estimate = bl.model$beta.mean,
  "Std. Error" = bl.model$beta.sd,
  "t value" = bl.model$t.stat
)
bl.model.results
```
From the Bayesian lasso results above, we can see that the engineer's hypothesis about an interaction effect between engine displacement and the number of cylinders appears to be right, as indicated by the `t value` of the interaction term. The model suggests that a higher number of cylinders, in interaction with a larger engine displacement, increases fuel efficiency.
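To inspect the interaction term directly (assuming its row name follows R's usual `a:b` convention for interaction coefficients):
```{r}
bl.model.results["Eng.Displacement:No.Cylinders", ]
```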
# Question 3
## Part 1
The bias of the estimator can be written as (using $E[\bar{y}]=\lambda$ for Poisson data):
$$bias\left(\hat{\lambda}_{a,b}\right)=E\left[\frac{a+n\bar{y}}{n+b}\right]-\lambda=\frac{a+n\lambda}{n+b}-\lambda=\frac{a-b\lambda}{n+b}$$
The variance of the estimator can be written as (using $V[\bar{y}]=\lambda/n$):
$$V_\lambda\left[\hat{\lambda}_{a,b}\right]=E\left[\left(\hat{\lambda}_{a,b}-E\left[\hat{\lambda}_{a,b}\right]\right)^2\right]=E\left[\left(\frac{n(\bar{y}-\lambda)}{n+b}\right)^2\right]=\frac{n^2\,V[\bar{y}]}{(n+b)^2}=\frac{n\lambda}{(n+b)^2}$$
## Part 2
From the bias found above, we can see that as `a` increases, so does the bias, while the bias decreases with increasing $\lambda$ (for $b > 0$). The variance does not depend on `a`; it decreases with increasing `b` and increases with increasing $\lambda$.
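These statements can be visualised by plotting the analytic bias and variance from Part 1 directly (a short sketch; the values `a = 1`, `b = 1`, `n = 5` are illustrative):
```{r}
lambda.grid <- seq(0, 10, 0.1)
a0 <- 1; b0 <- 1; n0 <- 5
analytic <- data.frame(
  lambdas = lambda.grid,
  bias = (a0 - b0 * lambda.grid) / (n0 + b0),  # bias from Part 1
  variance = n0 * lambda.grid / (n0 + b0)^2    # variance from Part 1
)
ggplot(data = melt(analytic, id = "lambdas"),
       aes(x = lambdas, y = value, colour = variable)) +
  geom_line() +
  labs(
    title = "Analytic bias and variance (a = 1, b = 1, n = 5)",
    x = "lambda",
    y = "value"
  )
```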
## Part 3
```{r}
mse <- function(a, b, n, lambdas, n.sims = 1000) {
  # Monte Carlo estimate of the squared-error risk of the shrinkage
  # estimator (a + n*ybar)/(n + b): average the squared error over
  # n.sims replicate samples at each lambda
  sapply(lambdas, function(lambda) {
    mean(replicate(n.sims,
                   ((a + n * mean(rpois(n, lambda))) / (n + b) - lambda)^2))
  })
}
mse.ml <- function(n, lambdas, n.sims = 1000) {
  # Monte Carlo estimate of the risk of the maximum-likelihood
  # estimator ybar at each lambda
  sapply(lambdas, function(lambda) {
    mean(replicate(n.sims, (mean(rpois(n, lambda)) - lambda)^2))
  })
}
n <- 5
lambdas <- seq(0, 10, 0.1)
data <- data.frame(
  lambdas = lambdas,
  ab.11 = mse(1, 1, n, lambdas),
  ab.51 = mse(5, 1, n, lambdas),
  ab.55 = mse(5, 5, n, lambdas),
  ml = mse.ml(n, lambdas)
)
data.narrow <- melt(data, "lambdas")
ggplot(data = data.narrow,
       aes(x = lambdas, y = value, colour = variable)) +
  geom_line() +
  labs(
    title = "Squared Error Risk",
    y = "MSE",
    x = "lambda"
  )
```
Where,
- `ab.11` - `a = 1`, `b = 1`
- `ab.51` - `a = 5`, `b = 1`
- `ab.55` - `a = 5`, `b = 5`
- `ml` - Maximum Likelihood
## Part 4
```{r}
summary(data)
```
For `n = 5`, the squared-error risk is lowest for `ab.51`, i.e. `a = 5`, `b = 1`.
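Averaging the Monte Carlo risk over the whole $\lambda$ grid makes the ranking explicit:
```{r}
# Mean risk across the lambda grid for each estimator
colMeans(data[, -1])
```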
## Part 5
Let $\lambda=1$, `a = 5` and `b = 1`; plotting the mean squared risk for different values of `n`.
```{r}
a <- 5
b <- 1
lambda <- 1
N <- seq(100, 100000, 100)
msr <- c()
for (n in N) {
  y <- rpois(n, lambda)
  msr <- c(msr, ((a + n * mean(y)) / (n + b) - lambda)^2)
}
df <- data.frame(
  N = N,
  MSR = msr
)
ggplot(data = df, aes(x = N, y = MSR)) +
  geom_line() +
  labs(
    title = "Mean Squared Risk",
    x = "n",
    y = "MSR"
  )
```
We can see from the plot above that estimator (2) is asymptotically efficient as $n\rightarrow\infty$.
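This is consistent with the analytic risk from Part 1, which vanishes as $n$ grows:
$$MSE\left(\hat{\lambda}_{a,b}\right)=\frac{n\lambda}{(n+b)^2}+\left(\frac{a-b\lambda}{n+b}\right)^2\longrightarrow0\quad\text{as }n\rightarrow\infty$$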
## Part 6
After the log-rate parameterisation $v=\log\lambda$ (substituting $\lambda=e^v$ into $f_{x|\lambda}=\lambda^xe^{-\lambda}/x!$), the probability mass function can be written as:
$$f_{x|v} = \frac{1}{x!}e^{xv-e^v}$$
## Part 7
We know that the Fisher information for the Poisson distribution in the rate parameterisation is $\frac{1}{\lambda}$ per observation. The Fisher information for the log-rate parameterisation can be derived directly:
$$f_{x|v}=\frac{1}{x!}e^{xv-e^v}$$
Taking the log of both sides:
$$\log{f_{x|v}}=xv-e^v-\log(x!)$$
Differentiating twice with respect to $v$:
$$\frac{\partial}{\partial{v}}\log{f_{x|v}}=x-e^v,\qquad-\frac{\partial^2}{\partial{v}^2}\log{f_{x|v}}=e^v$$
Therefore, the Fisher information can be written as:
$$I(v)=E\left[-\frac{\partial^2}{\partial{v}^2}\log{f_{x|v}}\right]=e^v$$
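A quick simulation check of this result (a sketch; the choice $v=\log 4$ is arbitrary), using the equivalent score-based form $I(v)=E\left[(x-e^v)^2\right]$:
```{r}
v0 <- log(4)
x <- rpois(100000, exp(v0))
# Empirical mean of the squared score; should be close to e^v = 4
mean((x - exp(v0))^2)
```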
# Question 4
```{r}
y <- c(0, 3, 4, 3, 6, 3, 9, 6, 3, 5, 8, 4, 1)
summary(y)
```
```{r}
df <- data.frame(y = y)
ggplot(df, aes(x = y)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666")
```
## Part 1
```{r}
summary(bootdist(fitdist(y, "pois"), niter = 1001))
```
## Part 2
Creating a sample, using the estimate above (taking $\lambda=4$), for 1000 full-moon nights.
```{r}
y <- rpois(1000, 4)
```
Now testing whether the sample mean is greater than $1$. Since no information is provided about dog-bite events on non-full-moon nights, we assume their rate to be equal to 1.
```{r}
df <- data.frame(y = y)
ggplot(df, aes(x = y)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  geom_vline(xintercept = 1, linetype = "dashed", color = "red")
```
```{r}
t.test(y, mu = 1, alternative = "greater")
```
From the above Student's t-test, we can see that the mean number of dog bites is greater than 1, with a p-value less than $2.2\times10^{-16}$.
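As a cross-check, since the data are counts, an exact Poisson test of the same hypothesis (using the simulated sample `y` from above) should lead to the same conclusion:
```{r}
# Exact test of H0: rate = 1 vs H1: rate > 1, based on the total count
poisson.test(sum(y), T = length(y), r = 1, alternative = "greater")
```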
## Part 3
```{r}
# Bootstrap lambda estimates (note: y is the simulated sample from Part 2)
lambdas <- bootdist(fitdist(y, "pois"), niter = 1001)
fitdist(lambdas$estim$lambda, "gamma")
```
## Part 4
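For reference, the Gamma family is the conjugate prior for the Poisson rate: with a $\text{Gamma}(\alpha_0,\beta_0)$ prior and Poisson observations $y_1,\dots,y_n$, the exact posterior is
$$p(\lambda\mid y)\propto\lambda^{\alpha_0-1}e^{-\beta_0\lambda}\prod_{i=1}^{n}\frac{\lambda^{y_i}e^{-\lambda}}{y_i!}\propto\lambda^{\alpha_0+\sum_i y_i-1}e^{-(\beta_0+n)\lambda},$$
i.e. $\text{Gamma}\left(\alpha_0+\sum_i y_i,\;\beta_0+n\right)$. Here the Gamma shape and rate are instead obtained by fitting to the bootstrap draws from Part 3, but the conjugate result motivates the Gamma form.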
The posterior distribution for $\lambda$ can be written as $Gamma(\alpha, \beta)=Gamma(3927.1557, 990.9755)$. Plotting the posterior distribution.
```{r}
lambda <- rgamma(1000, shape = 3927.1557, rate = 990.9755)
df <- data.frame(lambda = lambda)
ggplot(df, aes(x = lambda)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666")
```
## Part 5
```{r}
data <- data.frame(
  lambda = c(lambdas$estim$lambda, lambda),
  legend = c(rep("Prior", length(lambdas$estim$lambda)),
             rep("Posterior", length(lambda)))
)
ggplot(data, aes(lambda, fill = legend)) +
  geom_density(alpha = 0.2)
```
## Part 6
**Prior**
```{r}
summary(bootdist(fitdist(y, "pois"), niter = 1001))
```
**Posterior**
```{r}
error <- qt(0.975, df = length(lambda) - 1) * sd(lambda) / sqrt(length(lambda))
cat("[", mean(lambda) - error, ", ", mean(lambda) + error, "]", sep = "")
```
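The interval above is a normal-approximation interval for the posterior mean of $\lambda$; an equal-tailed 95% credible interval for $\lambda$ itself can be read off directly from the posterior draws:
```{r}
# Empirical 2.5% and 97.5% quantiles of the posterior sample
quantile(lambda, probs = c(0.025, 0.975))
```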
## Part 7
Assuming that there is 1 dog bite per non-full-moon night, the Bayesian analysis also...