Answer To: The file used in this assignment is the one combining clinical data and genetic data provided in the...
Mohd answered on Aug 03 2021
Untitled
8/2/2021
cell #1
#install.packages("randomForest")
#install.packages("class")
cell #2
memory.limit(size=3500)  # Windows-only; a no-op stub in R >= 4.2
## Warning in memory.limit(size = 3500): cannot decrease memory limit: ignored
## [1] 8036
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(class)
cell #3 load the dataset, which has patients as rows and variables as columns
mrnaNorm <- read.table("BRCAMerged.csv", header = TRUE, sep = ",")
class(mrnaNorm)
## [1] "data.frame"
At the end, create additional code that adds some clinical variables to the genetic variables, which were the only variables used in AssignmentWeek4 as provided. Namely, the goal is to add numeric variables we have not analyzed yet, such as stage (column #5), Diagnosis.Age (column #6), Birth.from.Initial.Pathologic.Diagnosis.Date (column #12), Death.from.Initial.Pathologic.Diagnosis.Date (column #14), Last.Alive.Less.Initial.Pathologic.Diagnosis.Date.Calculated.Day.Value (column #15), Days.to.Last.Followup (column #16), Disease.Free.Months (column #17), Fraction.Genome.Altered (column #22), HER2.ihc.score (column #25), and Overall.Survival..Months. (column #32). To do this, create a modified version of Cell #6 that appends these 10 variables to the 20531 genetic variables. You should then have a dataset for analysis containing 20541 variables.
1. Run again the analyses in cells [9] to [15], which you can duplicate below the previous code, to see whether there are differences in the classification results. Enter the R code in the next cells.
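A minimal sketch of the modified Cell #6 described above, run on a small stand-in data frame so that it is self-contained. The object `toyNorm`, its sizes, and the names `clinCols` and `toyData` are assumptions made for illustration; on the real data the same indexing would be applied to `mrnaNorm`, and the clinical columns may first need conversion to numeric and handling of missing values before they can be passed to `knn`.

```r
# Toy stand-in for mrnaNorm: 40 clinical columns followed by 5 "gene" columns.
set.seed(1)
toyNorm <- as.data.frame(matrix(rnorm(6 * 45), nrow = 6))
colnames(toyNorm) <- c(paste0("clin", 1:40), paste0("gene", 1:5))

# Column positions of the clinical variables listed in the assignment text.
clinCols <- c(5, 6, 12, 14, 15, 16, 17, 22, 25, 32)

# Modified cell #6: keep the gene columns and append the selected clinical ones.
toyData <- cbind(toyNorm[, -c(1:40)], toyNorm[, clinCols])
dim(toyData)  # 6 rows, 5 + 10 = 15 columns
```

With the real data the same step would give 20531 + 10 = 20541 columns, which matches the count stated in the assignment.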
cell #4
sampClass <- lapply(mrnaNorm[,"type"], function(t) (if (t == "MN") return("0") else return("1")))
mrnaClass <- as.data.frame(sampClass)
dim(mrnaClass)
## [1] 1 1212
table(unlist(sampClass))
##
## 0 1
## 112 1100
sampClassNum <- lapply(mrnaNorm[,"type"], function(t) (if (t == "MN") return(0) else return(1)))
mrnaClassNum <- as.data.frame(sampClassNum)
table(unlist(mrnaClassNum))
##
## 0 1
## 112 1100
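The class labels can also be built as a plain vector without `lapply`. The following is a sketch of that alternative, not the code used in this answer; `toyType` is a made-up stand-in for `mrnaNorm[, "type"]`.

```r
# Hypothetical stand-in for the "type" column of mrnaNorm.
toyType <- c("MN", "T", "T", "MN", "T")

# ifelse() is vectorised: "MN" maps to 0, every other value maps to 1.
toyClassNum <- ifelse(toyType == "MN", 0, 1)
table(toyClassNum)
```

A vector built this way could be passed directly as the `cl` argument of `knn`, which would avoid the later `unlist(mrnaClassNum[1,])` step.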
cell #5
geneNames <- as.data.frame(colnames(mrnaNorm[,-c(1:40)])) # extract the gene names from mrnaNorm as column names after column 40
dim(geneNames)
## [1] 20531 1
20531 genes
cell #6
mrnaData <- mrnaNorm[,-c(1:40)]
dim(mrnaData)
## [1] 1212 20531
gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells   515722  27.6    9913783 529.5  8906039 475.7
## Vcells 30096525 229.7   86061692 656.6 86060821 656.6
1212 patients and 20531 gene expression values
cell #7
trainSet <- mrnaData
testSet <- mrnaData
trainClasses <- unlist(mrnaClassNum[1,], use.names=FALSE)
testClasses <- unlist(mrnaClassNum[1,], use.names=FALSE)
# knn() takes (train, test, cl, k); the test-set labels are not an argument
knn.predic <- knn(trainSet, testSet, cl = trainClasses, k = 1)
cbr.predic <- as.vector(knn.predic)
table(cbr.predic, testClasses)
##           testClasses
## cbr.predic    0    1
##          0  112    0
##          1    0 1100
tab <- table(cbr.predic, t(testClasses))
error <- sum(tab) - sum(diag(tab))
accuracy <- round(100- (error * 100 / length(testClasses)))
print(paste("accuracy= ", as.character(accuracy), "%"), quote=FALSE)
## [1] accuracy= 100 %
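Because the training and test sets above are the same data and k = 1, each sample's nearest neighbour is itself, so an accuracy of 100% is expected and says little about performance on unseen patients. The sketch below shows one way a hold-out estimate could be computed. It uses the built-in `iris` data as a stand-in, and the 30% split, the seed, and the name `testIdx` are assumptions for illustration, not part of the original analysis.

```r
library(class)

set.seed(42)
# Built-in iris data as a stand-in: 4 numeric features, 3 classes.
x <- iris[, 1:4]
y <- iris$Species

# Hold out roughly 30% of the rows for testing.
testIdx <- sample(nrow(x), size = round(0.3 * nrow(x)))
pred <- knn(train = x[-testIdx, ], test = x[testIdx, ],
            cl = y[-testIdx], k = 1)

# Same accuracy computation as above, but on unseen rows only.
tab <- table(pred, y[testIdx])
accuracy <- round(100 * sum(diag(tab)) / length(testIdx))
print(paste("accuracy= ", as.character(accuracy), "%"), quote = FALSE)
```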
cell #8
bssWssFast <- function(X, givenClassArr, numClass = 2)
# between-group / within-group sum of squares (BSS/WSS) feature selection;
# givenClassArr must hold class labels coded 0, 1, ..., numClass - 1
{
  # indicator matrix: classVec[k, i] is 1 when sample i belongs to class k - 1
  classVec <- matrix(0, numClass, length(givenClassArr))
  for (k in 1:numClass) {
    temp <- rep(0, length(givenClassArr))
    temp[givenClassArr == (k - 1)] <- 1
    classVec[k, ] <- temp
  }
  classMeanArr <- rep(0, numClass)
  ratio <- rep(0, ncol(X))
  for (j in 1:ncol(X)) {
    overallMean <- sum(X[, j]) / length(X[, j])
    for (k in 1:numClass) {
      classMeanArr[k] <-
        sum(classVec[k, ] * X[, j]) / sum(classVec[k, ])
    }
    # class mean of feature j for each sample's own class
    classMeanVec <- classMeanArr[givenClassArr + 1]
    bss <- sum((classMeanVec - overallMean)^2)
    wss <- sum((X[, j] - classMeanVec)^2)
    ratio[j] <- bss / wss
  }
  # returns a list: $x holds the sorted ratios, $ix the feature indices
  sort(ratio, decreasing = TRUE, index.return = TRUE)
}
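To make the BSS/WSS ratio concrete, here is a small worked sketch on made-up data. The helper `bssWssOne` and the toy matrix are assumptions for illustration; the helper computes the same quantity as the inner loop of `bssWssFast` for a single feature, using `ave()` to obtain each sample's class mean.

```r
# Toy data: 6 samples, 2 features; classes coded 0/1 as in bssWssFast().
cls <- c(0, 0, 0, 1, 1, 1)
X <- cbind(separating = c(1, 2, 3, 11, 12, 13),
           noisy      = c(1, 12, 3, 11, 2, 13))

# BSS/WSS for one feature: spread of the class means around the overall mean,
# divided by the spread of the samples around their own class mean.
bssWssOne <- function(x, cls) {
  classMean <- ave(x, cls)  # each sample's class mean
  bss <- sum((classMean - mean(x))^2)
  wss <- sum((x - classMean)^2)
  bss / wss
}

ratio <- apply(X, 2, bssWssOne, cls = cls)
ratio
order(ratio, decreasing = TRUE)  # feature indices, best-separating first
```

For the `separating` column the class means are 2 and 12 and the overall mean is 7, so BSS = 6 * 25 = 150 and WSS = 4, giving a ratio of 37.5. The `noisy` column has overlapping classes and gets a ratio well below 1, so it ranks second.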
cell #9
# select features
dim(mrnaData)
## [1] 1212 20531
# 1212 20531 matrix
dim(mrnaClass)
## [1] 1 1212
# 1 1212
dim(mrnaClassNum)
## [1] 1 1212
# 1 1212
dim(geneNames)
## [1] 20531 ...