The file used in this assignment is the one combining clinical data and genetic data provided in the previous module (BRCAMerged.csv).

A number of functions have been provided in cells #1-15 of the attached notebook (AssignmentWeek4.pdf). The corresponding script is provided as a Jupyter notebook and an R script, both called AssignmentWeek4.

The question asked in this assignment is to compare classification models and is similar to the worksheet.

1) Download the attached files and place them in the same folder:

BRCAMerged.csv

AssignmentWeek4.ipynb

AssignmentWeek4.R

2) Run the script either as AssignmentWeek4.ipynb (Jupyter notebook installation) or as AssignmentWeek4.R (RStudio installation).

3) At the end of the script, write additional code that adds some clinical variables to the genetic variables, which are the only variables used in AssignmentWeek4 as provided. Namely, the goal is to add numeric variables we have not analyzed yet: stage (column #5), Diagnosis.Age (column #6), Birth.from.Initial.Pathologic.Diagnosis.Date (column #12), Death.from.Initial.Pathologic.Diagnosis.Date (column #14), Last.Alive.Less.Initial.Pathologic.Diagnosis.Date.Calculated.Day.Value (column #15), Days.to.Last.Followup (column #16), Disease.Free.Months (column #17), Fraction.Genome.Altered (column #22), HER2.ihc.score (column #25), Overall.Survival..Months. (column #32). To do this, create a modified version of Cell #6 that adds these variables to the genetic variables. You should now have a dataset for analysis containing 20541 variables (the 20531 gene columns plus these 10 clinical columns).

4) Run again the analyses in cells [9] to [15], which you can duplicate below the previous code, to see whether there are differences in the classification results. Enter the R code in the next cells.

5) Turn in the assignment as a plain R script file (not Jupyter notebook file), attached to your submission.

Note: the file BRCAMerged.csv can also be downloaded from Google Drive: https://drive.google.com/file/d/1I8yySge8gTfKR2WlpQ_Q1SSAR-O8dtwn/view?usp=sharing
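
Step 3 could be approached along the following lines. This is only a minimal sketch under stated assumptions, not the assignment's reference solution: a small made-up data frame `toy` stands in for BRCAMerged.csv, the helper name `addClinical` and its `nMeta` parameter are hypothetical, and the median imputation of missing values is an assumption added here because `knn()` does not accept NA values.

```r
# Column positions of the clinical variables listed in step 3.
clinCols <- c(5, 6, 12, 14, 15, 16, 17, 22, 25, 32)

# Hypothetical helper: bind the selected clinical columns to the gene
# columns (everything after column 40, as in Cell #6).
addClinical <- function(df, clinCols, nMeta = 40) {
  clin <- df[, clinCols, drop = FALSE]
  # Coerce to numeric; entries that are not numbers become NA.
  clin[] <- lapply(clin, function(v) suppressWarnings(as.numeric(as.character(v))))
  # Assumption: replace NA by the column median so that knn() can run.
  clin[] <- lapply(clin, function(v) {
    v[is.na(v)] <- median(v, na.rm = TRUE)
    v
  })
  cbind(clin, df[, -seq_len(nMeta), drop = FALSE])
}

# Toy stand-in for mrnaNorm: 6 patients, 40 metadata columns, 3 "genes".
set.seed(1)
toy <- as.data.frame(matrix(rnorm(6 * 43), nrow = 6))
toy[2, 6] <- NA                      # simulate a missing clinical value
mrnaDataClin <- addClinical(toy, clinCols)
dim(mrnaDataClin)                    # 6 rows, 10 clinical + 3 gene columns
```

Applied to the real `mrnaNorm`, the same call would be expected to give 10 + 20531 = 20541 columns, matching the count stated in step 3. The analyses in cells [9] to [15] could then be rerun with this data frame in place of `mrnaData`.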

assignmentweek4bhi557-ffompm1n.pdf

Answered 4 days after Jul 31, 2021

Mohd answered on Aug 03 2021

8/2/2021
cell #1
#install.packages("randomForest")
#install.packages("class")
cell #2
memory.limit(size=3500)
## Warning in memory.limit(size = 3500): cannot decrease memory limit: ignored
## [1] 8036
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(class)
cell #3 load the dataset, which has patients as rows and variables as columns
mrnaNorm <- read.table("BRCAMerged.csv", header = TRUE, sep = ",")
class(mrnaNorm)
## [1] "data.frame"
cell #4
sampClass <- lapply(mrnaNorm[,"type"], function(t) (if (t == "MN") return("0") else return("1")))
mrnaClass <- as.data.frame(sampClass)
dim(mrnaClass)
## [1] 1 1212
table(unlist(sampClass))
##
## 0 1
## 112 1100
sampClassNum <- lapply(mrnaNorm[,"type"], function(t) (if (t == "MN") return(0) else return(1)))
mrnaClassNum <- as.data.frame(sampClassNum)
table(unlist(mrnaClassNum))
##
## 0 1
## 112 1100
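
The labelling in cell #4 can also be written without `lapply()`. The sketch below is an illustrative alternative, not the notebook's code: the vector `type` is a made-up stand-in for `mrnaNorm[, "type"]`, the value "PT" is invented for the example, and `classNum` is a hypothetical name.

```r
# Toy stand-in for mrnaNorm[, "type"]; the "MN" code comes from cell #4,
# the other value is made up for illustration.
type <- c("MN", "PT", "PT", "MN", "PT")

# Vectorized equivalent of the lapply() call: 0 for "MN", 1 otherwise.
classNum <- ifelse(type == "MN", 0, 1)
table(classNum)
```

This yields a plain numeric vector rather than a 1 x 1212 data frame, so a later `unlist(...[1, ])` step would not be needed with it.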
cell #5
geneNames <- as.data.frame(colnames(mrnaNorm[,-c(1:40)])) # extract the gene names from mrnaNorm as column names after column 40
dim(geneNames)
## [1] 20531 1
20531 genes
cell #6
mrnaData <- mrnaNorm[,-c(1:40)]
dim(mrnaData)
## [1] 1212 20531
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 515722 27.6 9913783 529.5 8906039 475.7
## Vcells 30096525 229.7 86061692 656.6 86060821 656.6
1212 patients and 20531 gene expression values
cell #7
trainSet <- mrnaData
testSet <- mrnaData
trainClasses <- unlist(mrnaClassNum[1,], use.names=FALSE)
testClasses <- unlist(mrnaClassNum[1,], use.names=FALSE)
# knn() takes (train, test, cl, k); the extra testClasses argument was being matched to `l`
knn.predic <- knn(trainSet, testSet, trainClasses, k = 1)
cbr.predic = as.vector(knn.predic)
table(cbr.predic, testClasses)
## testClasses
## cbr.predic 0 1
## 0 112 0
## 1 0 1100
tab <- table(cbr.predic, t(testClasses))
error <- sum(tab) - sum(diag(tab))
accuracy <- round(100- (error * 100 / length(testClasses)))
print(paste("accuracy= ", as.character(accuracy), "%"), quote=FALSE)
## [1] accuracy= 100 %
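
Because cell #7 uses the same rows for training and testing, each sample is its own nearest neighbour at k = 1, so 100% accuracy is expected rather than informative. A hold-out split gives a more honest estimate. The following is a minimal sketch on simulated data, not part of the provided notebook; `x`, `y`, `nTest`, and `testIdx` are hypothetical names, and the 30% hold-out size is an arbitrary choice.

```r
library(class)

set.seed(42)
# Simulated stand-in for mrnaData: 100 samples, 5 features, two classes
# whose feature means differ by 2.
n <- 100
y <- rep(c(0, 1), each = n / 2)
x <- matrix(rnorm(n * 5, mean = rep(y * 2, times = 5)), nrow = n)

# Hold out 30 samples for testing and train on the rest.
nTest <- 30
testIdx <- sample(n, size = nTest)
pred <- knn(train = x[-testIdx, ], test = x[testIdx, ],
            cl = factor(y[-testIdx]), k = 1)

# Confusion table and accuracy on the held-out samples only.
actual <- factor(y[testIdx], levels = c(0, 1))
tab <- table(predicted = pred, actual = actual)
accuracy <- 100 * mean(as.character(pred) == as.character(actual))
print(tab)
print(paste0("hold-out accuracy = ", round(accuracy), "%"), quote = FALSE)
```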
cell #8
bssWssFast <- function (X, givenClassArr, numClass=2)
# between squares / within square feature selection
{
  # indicator matrix: row k marks the samples belonging to class k - 1
  classVec <- matrix(0, numClass, length(givenClassArr))
  for (k in 1:numClass) {
    temp <- rep(0, length(givenClassArr))
    temp[givenClassArr == (k - 1)] <- 1
    classVec[k, ] <- temp
  }
  classMeanArr <- rep(0, numClass)
  ratio <- rep(0, ncol(X))
  for (j in 1:ncol(X)) {
    overallMean <- sum(X[, j]) / length(X[, j])
    for (k in 1:numClass) {
      classMeanArr[k] <-
        sum(classVec[k, ] * X[, j]) / sum(classVec[k, ])
    }
    classMeanVec <- classMeanArr[givenClassArr + 1]
    bss <- sum((classMeanVec - overallMean)^2)
    wss <- sum((X[, j] - classMeanVec)^2)
    ratio[j] <- bss/wss
  }
  # returns a list with the sorted ratios ($x) and their column indices ($ix)
  sort(ratio, decreasing = TRUE, index.return = TRUE)
}
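
Read off from the code in cell #8, the score computed for each column (gene) j is the ratio of between-class to within-class sums of squares. The notation below is introduced here for illustration and does not appear in the notebook: x_{ij} is the value of gene j for sample i, y_i is the class of sample i, \bar{x}_{kj} is the mean of gene j within class k, and \bar{x}_{\cdot j} is the overall mean of gene j.

```latex
\frac{\mathrm{BSS}(j)}{\mathrm{WSS}(j)}
  = \frac{\sum_{i}\sum_{k} I(y_i = k)\,\bigl(\bar{x}_{kj} - \bar{x}_{\cdot j}\bigr)^{2}}
         {\sum_{i}\sum_{k} I(y_i = k)\,\bigl(x_{ij} - \bar{x}_{kj}\bigr)^{2}}
```

Genes with a large ratio have class means that are far apart relative to the spread within each class, which is why the function returns the ratios sorted in decreasing order.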
cell #9
# select features
dim(mrnaData)
## [1] 1212 20531
# 1212 20531 matrix
dim(mrnaClass)
## [1] 1 1212
# 1 1212
dim(mrnaClassNum)
## [1] 1 1212
# 1 1212
dim(geneNames)
## [1] 20531 ...
