applications of machine learning models to get the outputs. Apply the methods and materials in the article to the dataset and write a report on the results obtained.
A DEEP LEARNING APPROACH FOR CANCER DETECTION AND RELEVANT GENE IDENTIFICATION

PADIDEH DANAEE∗, REZA GHAEINI
School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97330, USA
∗E-mail: [email protected] and [email protected]

DAVID A. HENDRIX
School of Electrical Engineering and Computer Science, Department of Biochemistry and Biophysics, Oregon State University, Corvallis, OR 97330, USA
E-mail: [email protected]

Cancer detection from gene expression data continues to pose a challenge due to the high dimensionality and complexity of these data. After decades of research there is still uncertainty in the clinical diagnosis of cancer and the identification of tumor-specific markers. Here we present a deep learning approach to cancer detection, and to the identification of genes critical for the diagnosis of breast cancer. First, we used a Stacked Denoising Autoencoder (SDAE) to deeply extract functional features from high dimensional gene expression profiles. Next, we evaluated the performance of the extracted representation through supervised classification models to verify the usefulness of the new features in cancer detection. Lastly, we identified a set of highly interactive genes by analyzing the SDAE connectivity matrices. Our results and analysis illustrate that these highly interactive genes could be useful biomarkers for the detection of breast cancer and deserve further study.

Keywords: Cancer Detection; RNA-seq Expression; Deep Learning; Dimensionality Reduction; Stacked Denoising Autoencoder; Classification.

1. Introduction

The analysis of gene expression data has the potential to lead to significant biological discoveries. Much of the work on the identification of differentially expressed genes has focused on the most significant changes, and may not allow recognition of more subtle patterns in the data.1–6 Tremendous potential exists for computational methods to analyze these data for the discovery of gene regulatory targets, disease diagnosis and drug development.7–9 However, the high dimension and noise associated with these data present a challenge for these tasks. Moreover, the mismatch between the large number of genes and the typically small number of samples presents the challenge of a “dimensionality curse”.
Multiple algorithms have been used to distinguish normal cells from abnormal cells using gene expression.10–13 Although there has been a lot of research into cancer detection from gene expression data, there remains a critical need to improve accuracy, and to identify genes that play important roles in cancer.

Machine learning methods for dimensionality reduction and classification of gene expression data have achieved some success, but there are limitations in the interpretation of the most significant signals for classification purposes.14,15 Recently, there have been efforts to use single-layer, nonlinear dimensionality reduction techniques to classify samples based on gene expression data.16 In similar studies in computer vision, unsupervised deep learning methods have been successfully applied to extract information from high dimensional image data.17 Similarly, one can extract the meaningful part of the expression data by applying such techniques, thereby enabling identification of specific subsets of genes that are useful for biologists and physicians, with the potential to inform therapeutic strategies.

Pacific Symposium on Biocomputing 2017

In this work, we used stacked denoising autoencoders (SDAE) to transform high-dimensional, noisy gene expression data to a lower dimensional, meaningful representation.18 We then used the new representations to distinguish breast cancer samples from healthy control samples. We used several machine learning (ML) architectures to assess how effective the new compact features are for a classification task and to compare the performance of different models.
Finally, we analyzed the lower-dimensional representations by mapping back to the original data to discover highly relevant genes that could play critical roles and serve as clinical biomarkers for cancer diagnosis.

The performance of these methods affirms that SDAEs could be applied to cancer detection in order to improve the classification performance, extract both linear and nonlinear relationships in the data, and, perhaps more importantly, extract a subset of relevant genes from deep models as a set of potential cancer biomarkers. The identification of these relevant genes deserves further analysis, as it could improve methods for cancer diagnosis and treatment.

2. Background

Classification and clustering of gene expression in the form of microarray or RNA-seq data are well studied. There are various approaches for the classification of cancer cells and healthy cells using gene expression profiles and supervised learning models. The self-organizing map (SOM) was used to analyze leukemia cancer cells.19 A support vector machine (SVM) with a dot product kernel has been applied to the diagnosis of ovarian, leukemia, and colon cancers.11 SVMs with nonlinear kernels (polynomial and Gaussian) were also used for classification of breast cancer tissues from microarray data.10

Unsupervised learning techniques are capable of finding global patterns in gene expression data. Gene clustering groups genes based on similar expression patterns. Hierarchical clustering and maximal margin linear programming are examples of such techniques, and they have been used to classify colon cancer cells.20,21 K-nearest neighbors (KNN) has also been applied to breast cancer data.12

Due to the large number of genes, the high amount of noise in gene expression data, and the complexity of biological networks, there is a need to deeply analyze the raw data and exploit the important subsets of genes.
Regarding this matter, other techniques such as principal component analysis (PCA) have been proposed for dimensionality reduction of expression profiles to aid clustering of the relevant genes in a context of expression profiles.22 PCA uses an orthogonal transformation to map high dimensional data to linearly uncorrelated components.23 However, PCA reduces the dimensionality of the data linearly and it may not extract some nonlinear relationships of the data.24 In contrast, other approaches such as kernel PCA (KPCA) may be capable of uncovering these nonlinear relationships.25 Similarly, researchers have applied PCA to a set of combined genes of 13 data sets to obtain the linear representation of the gene expression and then applied an autoencoder to capture nonlinear relationships.26 Recently, a denoising autoencoder has been applied to extract a feature set from breast cancer data.16 Using a single autoencoder may not extract all the useful representations from the noisy, complex, and high-dimensional expression data. However, by reducing the dimensionality incrementally, the multi-layered architecture of an SDAE may extract meaningful patterns in these data with reduced loss of information.27

3. Materials and Methods

We have applied a deep learning approach that extracts the important gene expression relationships using SDAE. After training the SDAE, we selected a layer that has both low dimension and low validation error compared to other encoder stacks, using a validation data set independent of both our training and test sets.28 As a result, we selected an SDAE with four layers of dimensions of 15,000, 10,000, 2,000, and 500.
Consequently, we used the selected layer as input features to the classification algorithms. The goal of our model is to extract a mapping that decodes the original data as closely as possible without losing significant gene patterns.

We evaluated our approach for feature selection by feeding the SDAE-encoded features to a shallow artificial neural network (ANN)29 and an SVM model.30 Furthermore, we applied a similar approach with PCA and KPCA as a comparison.

Lastly, we used the SDAE weights from each layer to extract genes with strongly propagated influence on the reduced-dimension SDAE-encoding. These selected “deeply connected genes” (DCGs) are further tested and analyzed for pathway and Gene Ontology (GO) enrichment. The results from our analysis showed that our approach can reveal a set of biomarkers for the purpose of cancer diagnosis. The details of our method are discussed in the following subsections, and the workflow of our approach is shown in Fig. 1.

3.1. Gene Expression Data

We analyzed RNA-seq expression data from The Cancer Genome Atlas (TCGA) database for both tumor and healthy breast samples.31 These data consist of 1097 breast cancer samples and 113 healthy samples. To overcome the class imbalance of the data, we used the synthetic minority over-sampling technique (SMOTE) to transform the data into a more balanced representation for pre-training.32 We used the imbalanced-learn package for this transformation of the training data.33 Furthermore, we removed all genes that had zero expression across all samples.

3.2. Dimensionality Reduction Using Stacked Denoising Autoencoder

An autoencoder (AE) is a feedforward neural network that produces an output layer as close as possible to its input layer using a lower dimensional representation (hidden layer). The autoencoder consists of an encoder and a decoder.
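To make the encoder and decoder roles concrete, the following is a minimal sketch of one forward pass through a single autoencoder layer in plain Python. It is not the authors' implementation: the sigmoid nonlinearity on both halves, the separate (untied) decoder weights, the mean squared reconstruction error, the toy dimensions, and all function names (`affine_sigmoid`, `random_matrix`, `reconstruction_error`) are assumptions made only for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, used here as the assumed nonlinearity."""
    return 1.0 / (1.0 + math.exp(-z))


def affine_sigmoid(weights, bias, vector):
    """Return sigmoid(W v + b) for a row-major weight matrix W."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def reconstruction_error(original, reconstructed):
    """Mean squared error between an input and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


rng = random.Random(0)
d, d_hidden = 8, 3  # toy sizes (assumed); real profiles have thousands of genes

# Encoder maps d -> d_hidden; decoder maps d_hidden -> d.
W_enc, b_enc = random_matrix(d_hidden, d, rng), [0.0] * d_hidden
W_dec, b_dec = random_matrix(d, d_hidden, rng), [0.0] * d

x = [rng.random() for _ in range(d)]          # one toy expression profile
hidden = affine_sigmoid(W_enc, b_enc, x)      # lower dimensional code
x_hat = affine_sigmoid(W_dec, b_dec, hidden)  # attempted reconstruction

print(len(hidden), len(x_hat))
print(reconstruction_error(x, x_hat))
```

The weights here are untrained, so the printed error only illustrates the data flow and the shapes involved; training would adjust the encoder and decoder parameters to shrink that error.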
The encoder is a nonlinear function, like a sigmoid, applied to an affine mapping of the input layer, which can be expressed as fθ(x) = σ(Wx + b) with parameters θ = {W, b}. The matrix W is of dimensions d′ × d, to go from the larger dimension d of the gene expression data to a lower dimensional encoding of dimension d′. The bias vector b is of dimension d′. This input layer encodes the data to generate a
[Figure residue: workflow diagram; recoverable labels: TCGA Data, Resampling, Test, Cross Validation (folds 1 to 5)]