The file used in this assignment is the one from the worksheet, containing copy number scores for genes (BRCA.snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.seg.txt).
The attached notebook extracts the group category from the dataset and adds it as a column (AssignmentWeek5.pdf). The corresponding script is provided as a Jupyter notebook and an R script, both called AssignmentWeek5.
There are three major sample groups: primary tumor, metastasis, and normal (normal samples taken from the same patients).
1) Download the attached files and place them in the same folder:
- BRCA.snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.seg.txt
- AssignmentWeek5.ipynb
- AssignmentWeek5.R
2) Run the script either as AssignmentWeek5.ipynb (Jupyter notebook installation) or as AssignmentWeek5.R (RStudio installation).
3) At the end, add code to calculate copy number variations for the normal group and for the metastasis group. The cells have been prepared but are empty (cells #10-17).
4) Conclude whether you see a pattern by comparing copy number variations between the three groups: primary tumor, normal, and metastasis.
5) Turn in the assignment as a plain R script file (not Jupyter notebook file), attached to your submission.
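Before running anything, it can help to confirm that step 1 worked. The snippet below is only a sketch: `expected_files` and `missing_files` are hypothetical names introduced here for illustration, and the check assumes the files are meant to sit in R's current working directory.

```r
# Hypothetical helper (not part of the assignment files): list which of the
# expected assignment files are NOT present in a given folder.
expected_files <- c(
  "BRCA.snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.seg.txt",
  "AssignmentWeek5.ipynb",
  "AssignmentWeek5.R"
)

missing_files <- function(folder = getwd(), files = expected_files) {
  # file.exists() is vectorised, so this keeps only the names that were not found
  files[!file.exists(file.path(folder, files))]
}

# An empty result means all three files were found in the working directory.
missing_files()
```

If the result is not empty, either move the files or point R at the right folder with `setwd()` before running the script.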
Module 5 R Notebook

Gene Alterations

Assignment goals

In this notebook, you are going to practice analyzing files containing copy number variations. Since cBioPortal can already be used for that purpose at the level of a known population, we are going to get results at the patient level.

Here we are continuing to work on a breast cancer dataset downloaded from Firehose (https://gdac.broadinstitute.org/). The dataset used is "BRCA.snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.seg.txt". This is a complete dataset of copy number alterations from the whole genome compared with a normal genome.

This dataset has 6 columns:

  Sample                       Chromosome Start    End      Num_Probes Segment_Mean
1 TCGA-3C-AAAU-10A-01D-A41E-01 1          3218610  95674710 53225       0.0055
2 TCGA-3C-AAAU-10A-01D-A41E-01 1          95676511 95676518 2          -1.6636

'Sample' represents the patient ID (there are some normals too as we saw previously).
'Chromosome' represents the chromosome number.
'Start' represents the start position of the segmented window.
'End' represents the end position of the segmented window.
'Num_Probes' represents the number of probes in the segmented window.
'Segment_Mean' represents the mean copy number estimate of this particular segment.

When 'Segment_Mean' is greater than 0, there is amplification, and when it is less than 0, there is deletion. However, slight variations do not count for much, as we saw. Cut-off values of +0.2 and -0.2 are often chosen to determine whether there is amplification or deletion.

In this assignment, you will find whether a particular patient has copy number alterations, and how many. So let us get started!

Preparing the environment

Good news - we are not going to use any libraries for this assignment!

In [ ]:
# no library to load

Loading the data

We load the dataset downloaded from Firehose here. We check how many rows and columns it has and display some of the data to see what they look like.
We are using the header row as variable names for the columns of our 'cnvLogs' dataframe, which is accomplished through 'header = T'. We see that we have 284,458 rows and 6 columns. Because we have only 6 columns, we can look at the first 6 rows using 'head'. We notice right away that we have many rows for the same patient because each row corresponds to a particular segment, or region of the genome.

In [ ]:
# cell #1
cnvLogs <- read.table("BRCA.snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.seg.txt", header = T, fill = T)
dim(cnvLogs)
# 284458 6
head(cnvLogs)
#   Sample                       Chromosome Start   End      Num_Probes Segment_Mean
# 1 TCGA-3C-AAAU-10A-01D-A41E-01 1          3218610 95674710 53225      0.0055

Data understanding

To get a sense of whether our data overall is more amplified or deleted, we can run summary statistics. We find that overall our data are more deleted since the mean is negative. However, this should be used with caution because we may have a mix of normals and patients in this dataset.

In [ ]:
# cell #2
summary(cnvLogs)
# Mean = -0.1132, Median = 0

Copy number alterations of a group of patients

In this dataset, there are several types of samples. Based on the sample type, there may be in this dataset:
1) primary tumor samples
2) metastasis samples
3) normal samples

TCGA has data both from tumor samples and from normal samples taken from the same patient. This information can be found in the sample ID. For example, in ID TCGA.3C.AAAU.01A.11R.A41B.07 the type of sample is indicated in the 4th group: 01A. Primary tumor types range from 01 - 05 and 08 - 09, metastasis types from 06 - 07, normal types from 10 - 19, and control samples from 20 - 29.

Therefore, in the cell below we extract this code, which indicates whether the sample is from primary tumor, metastasis, or normal. This group is the fourth in the ID character string, obtained with 'strsplit', and we get the two characters of interest by taking the first two characters with 'substr'. So first, we create a function to extract the group type from the TCGA bar code. We do this for all the IDs by applying the same process with 'lapply' to all the IDs in 'cnvLogs'.

In [ ]:
# cell #3
# The end of the next line is cut off in the source; the arguments to strsplit and substr
# are assumed from the description above (4th "-"-separated field, first two characters).
samp <- lapply(as.list(t(cnvLogs['Sample'])), function(t) substr(unlist(strsplit(t, "-"))[4], 1, 2))
# extract the sample type (tumor / normal)
sampletype <- as.data.frame(samp)
dim(sampletype)
# 1 284458
head(sampletype[1:10])

We count how many samples and rows we have of each type.

In [ ]:
# cell #4
unique(t(sampletype)) # extracts how many unique objects there are
tab <- table(unlist(t(sampletype)))
tab # count how many of each group
#     01     06     10     11
# 212756   1866  61376   8460

We see that we have mostly primary tumor rows (212,756), followed by normal rows (61,376 + 8,460), and some metastasis rows (1,866). We add a 'Group' column at the end of the dataframe.

In [ ]:
# cell #5
cnvData <- cbind(cnvLogs, t(sampletype))
dim(cnvData)
colnames(cnvData)[7] <- "Group"
head(cnvData, 2) # display first two rows

We are going to compare the copy number variations between these groups. For each group:
1) find how many potential copy number variations the group has (just counting how many rows the group has in this dataset).
2) find how many amplifications ('Segment_Mean' > 0) the group has.
3) find how many deletions ('Segment_Mean' < 0) the group has.
4) find the average 'Segment_Mean' in the group to get a global picture.

Primary tumor group

We count how many rows the group has in this dataset. This can be accomplished using the 'subset' function in R to select all the rows for a group, then we count how many we have using 'nrow'.

In [ ]:
# cell #6
nrow(subset(cnvData, Group == "01"))

Next, we calculate how many of these rows have 'Segment_Mean' > 0. We find 113823.

In [ ]:
# cell #7
nrow(subset(cnvData, Group == "01" & Segment_Mean > 0))

Similarly, we calculate how many of these rows have 'Segment_Mean' < 0. We find 98877.

In [ ]:
# cell #8
nrow(subset(cnvData, Group == "01" & Segment_Mean < 0))

Finally, we calculate the mean of 'Segment_Mean' for this group. We find a negative value of -0.04, which is far from both the +0.2 amplification cut-off and the -0.2 deletion cut-off.

In [ ]:
# cell #9
mean(subset(cnvData, Group == "01")[["Segment_Mean"]])

We can say that this group has 212,756 potential copy number alterations, with 113,823 amplifications and 98,877 deletions. Overall the group has more amplifications than deletions; however, the overall score, based on the mean, is very slightly negative.

Normal group

We count how many rows the group has in this dataset. This can be accomplished using the 'subset' function in R to select all the rows for a group, then we count how many we have using 'nrow'.

In [ ]:
# cell #10

Next, we calculate how many of these rows have 'Segment_Mean' > 0.

In [ ]:
# cell #11

Similarly, we calculate how many of these rows have 'Segment_Mean' < 0.

In [ ]:
# cell #12

Finally, we calculate the mean of 'Segment_Mean' for this group.

In [ ]:
# cell #13

Compare this overall amplification score for the normal group with the score for the cancer group.

Metastasis group

We count how many rows the group has in this dataset. This can be accomplished using the 'subset' function in R to select all the rows for a group, then we count how many we have using 'nrow'.

In [ ]:
# cell #14

Next, we calculate how many of these rows have 'Segment_Mean' > 0.

In [ ]:
# cell #15

Similarly, we calculate how many of these rows have 'Segment_Mean' < 0.

In [ ]:
# cell #16

Finally, we calculate the mean of 'Segment_Mean' for this group.

In [ ]:
# cell #17

Compare this overall amplification score for the metastasis group with the normal group and the cancer group.

In [ ]:
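The four per-group steps (row count, amplifications, deletions, mean) can also be wrapped into one reusable function. What follows is a minimal sketch on made-up data, not the notebook's own code: `toy`, `sample_code`, `code_to_group`, and `summarise_group` are hypothetical names, the sample IDs and 'Segment_Mean' values are invented, and the code ranges are simply the ones quoted in the text above. Unlike the notebook, which stores the raw two-digit code in 'Group', this sketch stores a readable label so that codes 10 and 11 fall into a single "normal" group.

```r
# Toy stand-in for cnvData; IDs and values are invented for illustration only.
toy <- data.frame(
  Sample = c("TCGA-XX-0001-01A-01D-0000-01",
             "TCGA-XX-0001-01A-01D-0000-01",
             "TCGA-XX-0001-10A-01D-0000-01",
             "TCGA-XX-0002-06A-01D-0000-01",
             "TCGA-XX-0003-11A-01D-0000-01"),
  Segment_Mean = c(0.85, -0.05, 0.01, -0.60, 0.00),
  stringsAsFactors = FALSE
)

# Two-digit sample-type code: 4th "-"-separated field, first two characters.
sample_code <- function(barcode) {
  vapply(strsplit(barcode, "-", fixed = TRUE),
         function(parts) substr(parts[4], 1, 2),
         character(1))
}

# Map the code to a group label, using the ranges quoted in the notebook text.
code_to_group <- function(code) {
  n <- as.integer(code)
  ifelse(n %in% c(1:5, 8:9), "primary tumor",
  ifelse(n %in% 6:7, "metastasis",
  ifelse(n %in% 10:19, "normal",
  ifelse(n %in% 20:29, "control", NA_character_))))
}

# Row count, amplifications, deletions, and mean for one group.
# cutoff = 0 reproduces the simple > 0 / < 0 rule; cutoff = 0.2 applies
# the stricter +0.2 / -0.2 cut-offs mentioned in the assignment goals.
summarise_group <- function(data, group, cutoff = 0) {
  seg <- data$Segment_Mean[data$Group == group]
  c(rows = length(seg),
    amplified = sum(seg > cutoff),
    deleted = sum(seg < -cutoff),
    mean = mean(seg))
}

toy$Group <- code_to_group(sample_code(toy$Sample))

# One column per group, so the three groups can be compared side by side.
sapply(c("primary tumor", "metastasis", "normal"),
       function(g) summarise_group(toy, g, cutoff = 0.2))
```

Running the same function with `cutoff = 0` and `cutoff = 0.2` shows how many of the apparent amplifications and deletions are only slight variations around zero, which is worth keeping in mind when comparing the groups.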