Answer To: I attached the instruction file as a Word file and the raw data as an Excel file...
Pratibha answered on Dec 04 2023
Predictive Analysis Project Instructions
Project: Investigate the effect of climate on food supply – 2050 and 2080 on the basis of 2000 and 2020
Questions:
-How have historical climate changes impacted food production? (Do: time series analysis)
-How are projected climate changes likely to impact food production in the future? (Do: regression)
-What are the most vulnerable regions and populations? (Do: spatial analysis)
-What adaptation and mitigation strategies can be implemented to reduce the risks to food security? (Do: logistic regression)
1. Data preparation: How many rows, columns, and variables are there? How many must be removed to balance the data? Give a histogram.
· The data has 157 variables and 167 observations. The dataset contains the growth rates of wheat (WH), rice (RI), and maize (MZ) for the years 2000, 2020, 2050, and 2080. It also includes variables such as WHA1F2050, WH_2000, RI_2000, MZ_2000, WHAIF2020, WHAIF2080, RIA1F2020, RIA1F2050, RIA1F2080, MZA1F2020, MZA1F2050, MZA1F2080, BLS_2_Countries_(SRES)_ABBREVNAME, WH%GR, RI%GR, MZ%GR, WH_growth_2020_2080, WHpercentGR, WHB1a2050, WHB1a2080, WHA1F2020, WHA1F2080, WHA1F2050, ActChWHA1F2050, ActChWHA1F2020, ActChWHA1F2080, RI_growth_2020_2080, RIB1a2050, RIB1a2080, RIA1F2020, RIA1F2080, RIA1F2050, ActChRIA1F2050, ActChRIA1F2020, ActChRIA1F2080, MZ_growth_2020_2080, MZpercentGR, MZB1a2050, MZB1a2080, MZA1F2020, MZA1F2080, MZA1F2050, ActChMZA1F2050, ActChMZA1F2020, ActChMZA1F2080.
· The data quality is poor, so preprocessing is required. The issues are:
· Missing values
· Incorrect data types
· Invalid variable names (e.g., in MZ%GR the % character caused errors during modelling)
Data preparation:
· Renamed variables, e.g., RI%GR to RIpercentGR.
· Converted data types to suit the analysis: character variables were changed to numeric or factor.
· Imputed missing values with the mean of each variable.
· Created growth variables for the periods between 2000 and 2080 from the available variables, using the formulas below:
# Growth rate = (end value - base value) / base value
# Note: a base value of 0 produces Inf or NaN
# Calculate growth rates for WH, RI, and MZ from 2000 to 2050
df$WH_growth_2000_2050 <- (df$WHA1F2050 - df$WH_2000) / df$WH_2000
df$RI_growth_2000_2050 <- (df$RIA1F2050 - df$RI_2000) / df$RI_2000
df$MZ_growth_2000_2050 <- (df$MZA1F2050 - df$MZ_2000) / df$MZ_2000
# Calculate growth rates for WH, RI, and MZ from 2000 to 2080
df$WH_growth_2000_2080 <- (df$WHA1F2080 - df$WH_2000) / df$WH_2000
df$RI_growth_2000_2080 <- (df$RIA1F2080 - df$RI_2000) / df$RI_2000
df$MZ_growth_2000_2080 <- (df$MZA1F2080 - df$MZ_2000) / df$MZ_2000
# Calculate growth rates for WH, RI, and MZ from 2020 to 2050
df$WH_growth_2020_2050 <- (df$WHA1F2050 - df$WHA1F2020) / df$WHA1F2020
df$RI_growth_2020_2050 <- (df$RIA1F2050 - df$RIA1F2020) / df$RIA1F2020
df$MZ_growth_2020_2050 <- (df$MZA1F2050 - df$MZA1F2020) / df$MZA1F2020
# Calculate growth rates for WH, RI, and MZ from 2020 to 2080
df$WH_growth_2020_2080 <- (df$WHA1F2080 - df$WHA1F2020) / df$WHA1F2020
df$RI_growth_2020_2080 <- (df$RIA1F2080 - df$RIA1F2020) / df$RIA1F2020
df$MZ_growth_2020_2080 <- (df$MZA1F2080 - df$MZA1F2020) / df$MZA1F2020
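The renaming, type conversion, and mean imputation steps listed under data preparation are not shown in the report. A minimal sketch of how they might look in base R follows. The data frame `raw`, its values, and the helper `impute_mean` are hypothetical and only illustrate the idea; this is not the author's code.

```r
# Toy data frame standing in for the real dataset (values are made up)
raw <- data.frame(
  `RI%GR` = c("10.5", NA, "30.0"),
  WH_2000 = c(100, 200, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Replace "%" in column names so they are safe to use in model formulas
names(raw) <- gsub("%", "percent", names(raw), fixed = TRUE)

# 2. Convert a character column that holds numbers to numeric
raw$RIpercentGR <- as.numeric(raw$RIpercentGR)

# 3. Impute missing values with the column mean (numeric columns only)
impute_mean <- function(x) {
  if (is.numeric(x)) x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
clean <- as.data.frame(lapply(raw, impute_mean))

print(clean)
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, so it is worth recording how many values were filled in for each variable.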
Histogram:
Histograms of all variables were plotted:
A subset of about 15 variables was then selected and their histograms were plotted:
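The plotting code for the histograms is not included in the report. The sketch below shows one way to draw a histogram per numeric column in base R. The data frame `toy` is simulated and its column names are borrowed from the report only for illustration; the output file name is also an assumption.

```r
set.seed(1)
# Simulated stand-in for a few numeric columns of the dataset (not real data)
toy <- data.frame(
  WHpercentGR = rnorm(167, mean = 30, sd = 10),
  RIpercentGR = rnorm(167, mean = 20, sd = 10),
  MZpercentGR = rnorm(167, mean = 25, sd = 10)
)

# Keep only the numeric columns
num_cols <- names(toy)[sapply(toy, is.numeric)]

# Draw to a PDF file so the sketch also runs in a non-interactive session
pdf(file.path(tempdir(), "histograms.pdf"))
op <- par(mfrow = c(1, length(num_cols)))
hists <- lapply(num_cols, function(v) {
  hist(toy[[v]], main = v, xlab = v, col = "grey")
})
par(op)
dev.off()
```

Each element of `hists` holds the bin breaks and counts, which can be inspected when a plot looks skewed or shows outliers.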
2. Modeling:
Model implementation: three crops (wheat, rice, maize)
A. Time series
# Extract the growth percentage columns for each crop
wheat_column <- df_selected$WHpercentGR
rice_column <- df_selected$RIpercentGR
maize_column <- df_selected$MZpercentGR
# Combine the columns into a data frame
combined_data <- data.frame(wheat_column, rice_column, maize_column)
# Convert the combined data into a time series
# (with start = 2000 and end = 2080, ts() keeps only the first 81 rows)
data_ts <- ts(combined_data, start = c(2000), end = c(2080), frequency = 1)
# Plot each series separately
for (i in 1:ncol(data_ts)) {
  plot(data_ts[, i], type = 'l', main = colnames(data_ts)[i],
       xlab = "Year", ylab = "Value")
}
# Plot all series together
plot(data_ts, main = "Wheat, Rice, and Maize Growth Over Time",
     ylab = "Growth", xlab = "Year", col = 1:ncol(combined_data))
# Set hypotheses for each crop
# Hypothesis for Wheat: The wheat growth is expected to increase over time.
# Hypothesis for Rice: The rice growth follows a stable pattern without significant fluctuations.
# Hypothesis for Maize: The maize growth exhibits seasonal variations and an overall increasing trend.
# Fit ARIMA models for each crop; auto.arima() selects the model order automatically
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# Wheat ARIMA model
wheat_arima <- auto.arima(wheat_column)
summary(wheat_arima)
## Series: wheat_column
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
##          ar1     ma1     mean
##       0.0217  0.0817  34.9401
## s.e.  0.4892  0.4843   3.3443
##
## sigma^2 = 1549:  log likelihood = -843.67
## AIC=1695.35   AICc=1695.6   BIC=1707.8
##
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE         ACF1
## Training set -0.0241684 38.99381 35.02552 -Inf  Inf 0.9005648 4.855697e-05
# Rice ARIMA model
rice_arima <- auto.arima(rice_column)
summary(rice_arima)
## Series: rice_column
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
##          mean
##       24.4027
## s.e.   2.6777
##
## sigma^2 = 1197:  log likelihood = -823.34
## AIC=1650.69   AICc=1650.76   BIC=1656.91
##
## Training set error measures:
##                         ME     RMSE      MAE  MPE MAPE      MASE       ACF1
## Training set -1.210429e-15 34.50007 28.80044 -Inf  Inf 0.8783524 0.03113128
# Maize ARIMA model
maize_arima <- auto.arima(maize_column)
summary(maize_arima)
## Series: maize_column
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
##          mean
##       35.3607
## s.e.   2.6758
##
## sigma^2 = 1196:  log likelihood = -823.23
## AIC=1650.45   AICc=1650.52   BIC=1656.67
##
## Training set error measures:
##                        ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 5.966523e-12 34.47548 30.41014 -Inf  Inf 0.7650439 -0.05437934
# Forecast the next 12 periods (adjust h as needed)
wheat_forecast <- forecast(wheat_arima, h = 12)
rice_forecast <- forecast(rice_arima, h = 12)
maize_forecast <- forecast(maize_arima, h = 12)
# Plot the forecasts
plot(wheat_forecast, main = "Wheat Growth Forecast")
plot(rice_forecast, main = "Rice Growth Forecast")
plot(maize_forecast, main = "Maize Growth Forecast")
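For rice and maize, auto.arima() selected ARIMA(0,0,0) with a non-zero mean. That is a constant-mean model, so every point forecast equals the estimated mean and the forecast plot is a flat line. The sketch below illustrates this with `arima()` from base R's stats package on a simulated series. The series `x` is made up and stands in for a crop column; it is not the report's data.

```r
set.seed(42)
# Simulated series standing in for one crop's growth column (not real data)
x <- rnorm(100, mean = 25, sd = 5)

# ARIMA(0,0,0) with a non-zero mean: the only estimated coefficient is the mean
fit <- arima(x, order = c(0, 0, 0), include.mean = TRUE)

# Point forecasts for the next 12 periods
fc <- predict(fit, n.ahead = 12)$pred

print(mean(x))
print(as.numeric(fc))
```

All twelve forecasts coincide with the sample mean, which is why such a model says little about how growth changes over time.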
B. Regression
Wheat Analysis:
Wheat Growth Prediction:
Null Hypothesis (H₀): There is no significant linear relationship between various factors and wheat growth.
Alternative Hypothesis (H₁): At least one of the predictor variables has a significant linear relationship with wheat growth.
# Load necessary libraries
library(caret)
# Select the wheat variables used for modelling
df1 <- subset(df, select = c(WH_growth_2020_2080, WHpercentGR, WHB1a2050, WHB1a2080, WHA1F2020, WHA1F2080, WHA1F2050, ActChWHA1F2050, ActChWHA1F2020, ActChWHA1F2080))
missing_values <- colSums(is.na(df1))
print(missing_values)
## WH_growth_2020_2080 WHpercentGR WHB1a2050 WHB1a2080
## 0 0 0 0
## WHA1F2020 WHA1F2080 WHA1F2050 ActChWHA1F2050
## 0 0 0 0
## ActChWHA1F2020 ActChWHA1F2080
## 0 0
# Remove rows with missing values
df1 <- na.omit(df1)
# Check for infinite values
infinite_values <- apply(df1, 2, function(x) any(is.infinite(x)))
print(infinite_values)
## WH_growth_2020_2080 WHpercentGR WHB1a2050 WHB1a2080
## FALSE FALSE FALSE FALSE
## WHA1F2020 WHA1F2080 WHA1F2050 ActChWHA1F2050
## FALSE FALSE FALSE FALSE
## ActChWHA1F2020 ActChWHA1F2080
## FALSE FALSE
# Remove rows with infinite values
df1 <- df1[is.finite(rowSums(df1)), ]
set.seed(123) # For reproducibility
train_indices <- sample(1:nrow(df1), 0.75 * nrow(df)) # 75% train, 25% test (size is computed from nrow(df))
train_data <- df1[train_indices, ]
test_data <- df1[-train_indices, ]
dim(train_data)
## [1] 124 10
dim(test_data)
## [1] 42 10
# Train the linear regression model
set.seed(123)
wheat_model <- lm(WHpercentGR ~ ., data = train_data)
print(wheat_model)
##
## Call:
## lm(formula = WHpercentGR ~ ., data = train_data)
##
## Coefficients:
##         (Intercept)  WH_growth_2020_2080            WHB1a2050
##           6.150e+01            2.045e-01            8.334e+00
##           WHB1a2080            WHA1F2020            WHA1F2080
##           1.153e+00            1.980e+00            2.299e+00
##           WHA1F2050       ActChWHA1F2050       ActChWHA1F2020
##          -8.933e+00            2.045e-06           -7.717e-06
##      ActChWHA1F2080
##           6.133e-06
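print() on an lm object shows only the coefficients, so it does not test the hypotheses stated above. The overall F-test from summary() addresses H₀ (all slope coefficients are zero), and the held-out rows give an out-of-sample error. The sketch below shows both on simulated data. The data frame `toy` and its columns `x1`, `x2`, and `y` are hypothetical and are not taken from the report.

```r
set.seed(123)
# Simulated data standing in for one crop's modelling table (not real data)
n   <- 160
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 5 + 2 * toy$x1 - 1 * toy$x2 + rnorm(n)

# 75% / 25% split, with the size taken from the data frame being sampled
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- lm(y ~ ., data = train)

# Overall F-test: H0 says all slope coefficients are zero
fstat   <- summary(fit)$fstatistic
p_value <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)

# Out-of-sample error on the held-out rows
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))

print(unname(p_value))
print(rmse)
```

A small p-value supports rejecting H₀, while the test RMSE indicates how well the fitted model generalises beyond the training rows.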
Rice Growth Prediction:
Null Hypothesis (H₀): The combined effect of all predictor variables does not significantly impact rice growth.
Alternative Hypothesis (H₁): The combination of at least some of the predictor variables has a significant linear relationship with rice growth.
# Select the rice variables for the multiple linear regression model
dfRI <- subset(df, select = c(RI_growth_2000_2050, RI_growth_2020_2080, RI_growth_2000_2080, RI_growth_2020_2050, RIpercentGR, RIB1a2050, RIB1a2080, RIA1F2020, RIA1F2080, RIA1F2050, ActChRIAIF2050, ActChRIAIF2020, ActChRIAIF2080))
missing_values <- colSums(is.na(dfRI))
print(missing_values)
## RI_growth_2000_2050 RI_growth_2020_2080 RI_growth_2000_2080 RI_growth_2020_2050
## 0 0 0 0
## RIpercentGR RIB1a2050 RIB1a2080 RIA1F2020
## 0 0 0 0
## RIA1F2080 RIA1F2050 ActChRIAIF2050 ActChRIAIF2020
## 0 0 0 0
## ActChRIAIF2080
## 0
# Remove rows with missing values
dfRI <- na.omit(dfRI)
# Check for infinite values
infinite_values <- apply(dfRI, 2, function(x) any(is.infinite(x)))
print(infinite_values)
## RI_growth_2000_2050 RI_growth_2020_2080 RI_growth_2000_2080 RI_growth_2020_2050
## TRUE FALSE TRUE FALSE
## RIpercentGR RIB1a2050 RIB1a2080 RIA1F2020
## FALSE FALSE FALSE FALSE
## RIA1F2080 RIA1F2050 ActChRIAIF2050 ActChRIAIF2020
## FALSE FALSE FALSE FALSE
## ActChRIAIF2080
## FALSE
# Remove rows with infinite values
dfRI <- dfRI[is.finite(rowSums(dfRI)), ]
set.seed(123) # For reproducibility
train_indices <- sample(1:nrow(dfRI), 0.75 * nrow(df)) # about 75% train, 25% test; the size is computed from nrow(df), not nrow(dfRI)
train_data <- dfRI[train_indices, ]
test_data <- dfRI[-train_indices, ]
dim(train_data)
## [1] 124 13
dim(test_data)
## [1] 40 13
lm_modelRI <- lm(RIpercentGR ~ ., data = train_data)
print(lm_modelRI)
##
## Call:
## lm(formula = RIpercentGR ~ ., data = train_data)
##
## Coefficients:
##         (Intercept)  RI_growth_2000_2050  RI_growth_2020_2080
##           3.410e+03            3.808e+03           -4.032e-01
## RI_growth_2000_2080  RI_growth_2020_2050            RIB1a2050
##          -4.342e+02           -4.026e+00           -8.465e+00
##           RIB1a2080            RIA1F2020            RIA1F2080
##           2.625e+00            5.521e+00           -6.881e-01
##           RIA1F2050       ActChRIAIF2050       ActChRIAIF2020
##           1.964e+00            7.148e-05           -2.050e-05
##      ActChRIAIF2080
##          -3.912e-05
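The infinite-value check for rice flags RI_growth_2000_2050 and RI_growth_2000_2080, which most likely means some rows have a 2000 base value of zero, since dividing by zero gives Inf in R. Dropping those rows is one option. Another is to compute the growth rate so that a zero base gives NA, as sketched below. The function `safe_growth` and the input vectors are hypothetical and use made-up values.

```r
# Growth rate that returns NA instead of Inf/NaN when the base value is zero
safe_growth <- function(end, base) {
  ifelse(base == 0, NA_real_, (end - base) / base)
}

base_2000 <- c(100, 0, 50)   # made-up values; the second base is zero
proj_2050 <- c(120, 30, 40)

naive <- (proj_2050 - base_2000) / base_2000
safe  <- safe_growth(proj_2050, base_2000)

print(naive)  # contains Inf for the zero base
print(safe)   # contains NA for the zero base
```

With NA in place of Inf, the existing na.omit() step removes the affected rows, and the number of dropped rows can be reported explicitly.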
Maize Growth Prediction:
Null Hypothesis (H₀): There is no significant linear association between factors such as temperature variation, water availability, and maize growth.
Alternative Hypothesis (H₁): At least one of the predictor variables shows a significant linear association with maize growth.
dfMZ <- subset(df, select = c(MZ_growth_2000_2050,MZ_growth_2020_2080,MZ_growth_2000_2080,MZ_growth_2020_2050, MZpercentGR, MZB1a2050, MZB1a2080, MZA1F2020, MZA1F2080, MZA1F2050, ActChMZA1F2050, ActChMZA1F2020, ActChMZA1F2080))
missing_values <- colSums(is.na(dfMZ))
print(missing_values)
## MZ_growth_2000_2050 MZ_growth_2020_2080 MZ_growth_2000_2080 MZ_growth_2020_2050
## 0 0 0 0
## MZpercentGR MZB1a2050 MZB1a2080 MZA1F2020
## 0 0 0 0
## MZA1F2080 MZA1F2050 ActChMZA1F2050 ActChMZA1F2020
## 0 0 0 0
## ActChMZA1F2080
## 0
# Remove rows with missing values
dfMZ <- na.omit(dfMZ)
# Check for infinite values
infinite_values <- apply(dfMZ, 2, function(x) any(is.infinite(x)))
print(infinite_values)
## MZ_growth_2000_2050 MZ_growth_2020_2080 MZ_growth_2000_2080 MZ_growth_2020_2050
## FALSE FALSE FALSE FALSE
## MZpercentGR MZB1a2050 MZB1a2080 MZA1F2020
## FALSE FALSE FALSE FALSE
## MZA1F2080 MZA1F2050 ActChMZA1F2050 ActChMZA1F2020
## FALSE FALSE FALSE FALSE
## ActChMZA1F2080
## FALSE
# Remove rows with infinite values
dfMZ <- dfMZ[is.finite(rowSums(dfMZ)), ]
set.seed(123) # For reproducibility
train_indices <- sample(1:nrow(dfMZ), 0.75 * nrow(df)) # 75% train, 25% test (size is computed from nrow(df))
train_data <- dfMZ[train_indices, ]
test_data <- dfMZ[-train_indices, ]
dim(train_data)
## [1] 124 13
dim(test_data)
## [1] 42 13
lm_modelMZ <- lm(MZpercentGR ~ ., data = train_data)
print(lm_modelMZ)
##
## Call:
## lm(formula = MZpercentGR ~ ., data = train_data)
##
## Coefficients:
##         (Intercept)  MZ_growth_2000_2050  MZ_growth_2020_2080
##           1.004e+02            1.900e+02            2.327e-01
## MZ_growth_2000_2080  MZ_growth_2020_2050            MZB1a2050
##          -9.374e+01           -4.026e-01           -2.689e+00
##           MZB1a2080            MZA1F2020            MZA1F2080
##          -6.267e+00            4.601e+00           -1.394e+00
##           MZA1F2050       ActChMZA1F2050       ActChMZA1F2020
##           5.583e+00            4.362e-07           -3.708e-06
##      ActChMZA1F2080
##          -4.672e-07
C. Spatial Analysis
# Load necessary libraries
library(sf)
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(ggplot2)
# Filter columns in crop_data
crop_data <- subset(df, select = c("BLS_2_Countries_(SRES)_ABBREVNAME", "Fips_code", "ISO3v10", "MZ_growth_2000_2050", "MZ_growth_2020_2080", "MZ_growth_2000_2080", "MZ_growth_2020_2050", "MZpercentGR"))
# Rename the column in crop_data to match the merging column in world_map
names(crop_data)[names(crop_data) == "BLS_2_Countries_(SRES)_ABBREVNAME"] <- "region"
# Simulated world map (for illustration)
world_map <-...