UNIVERSITY of HOUSTONDepartment of BiotechnologyBTEC4300 Principles of BioinformaticsHomework 3De novo Transcriptome AssemblyFor the following questions, snip the result and the command you...

Please see attached. Every question will need a screen shot of the steps taken, only questions 1-7


UNIVERSITY of HOUSTON Department of Biotechnology BTEC4300 Principles of Bioinformatics Homework 3 De novo Transcriptome Assembly For the following questions, snip the result and the command you used, unless the question states otherwise. Remember that quality controlling RNA-seq reads with FastQC and trimming with trimmomatic are nearly identical to that of QCing/trimming DNA-seq reads in the previous homework. Make sure the following tools are installed in a conda environment: sra-tools fastqc spades trimmomatic conda install -c bioconda sra-tools fastqc spades trimmomatic 1. Retrieve the RNA-Seq dataset with prefetch (Accession: SRR633727). The dataset is from an expression study on the plant Arabidopsis thaliana, commonly known as the Mouse-ear Cress. 2. Extract and split the downloaded SRR633727.sra file with fastq-dump. 3. Assess the quality of the RNA-Seq dataset with FastQC. Snip the per-base quality histogram. 4. Trim the paired end reads with trimmomatic and the provided trimming parameters: trimmomatic PE LEADING:10 TRAILING:10 SLIDINGWINDOW:5:15 MINLEN:30 5. Assess the post-trimming quality of the RNA-Seq dataset with FastQC. Snip the per-base quality histogram. 6. Assemble the transcriptome with rnaSPAdes and the trimmed reads. 7. How many transcripts did rnaSPAdes assemble? Lecture Problems 8. Briefly explain the difference between rooted and unrooted phylogenetic trees. 9. What is the significance of biological and technical replicates in RNA-Seq experiments? 1 BTEC 4100 Principles of Bioinformatics Lab Tutorial 3 - Bash 3 Learning Objectives: · Retrieve DNA short reads from NCBI’s SRA database with SRA-tools · Quality control and pre-process (trim) short reads · Assemble the short reads with SPAdes and MEGAHIT · Compare the quality of assemblies using QUAST · Use RAST to annotate the genome of the bacterium Deinococcus radiodurans Downloading Software After following the Anaconda Installation instruction, you should have anaconda or miniconda installed, as well as an environment named genome_assembly_env created. Moreover, you did installed all tools that are required for this tutorial including fastqc, trimmomatic, megahit, etc. Now, let begin to practice genome assembly by using those tools. Practice Data FASTQ files. One of the most popular formats for sequencer data is FASTQ. Below is a short overview of the format. @ Sequence ( A, T, C, and G's) + In this tutorial, the practice data is from the NCBI SRA database. We will assemble the genome of Campylobacter jejuni (taxid 197; SRA: SRR9620862). We will be dealing with paired reads (2x250), one for forward reads (R1) and another for reverse reads (R2). Organization First. Before downloading the read files, I create a folder/directory called bash3_practice, then create a directory called genome_assembly using the mkdir command. There will be a lot of output from the software we use. Being organized will help keep track of files and directories being made. If you copy and paste the commands below, you will get an error if you do not have a bash3_practice directory made. For Windows Users. You can open up your PowerShell or Ubuntu. Activate the genome_assembly_env, so that you can use tools that you have downloaded in this environment in the previous step. In your bash3_practice, create your genome_assembly folder to store all files of this practice. . conda activate genome_assembly_env mkdir bash3_practice/genome_assembly && cd bash3_practice/genome_assembly For MacOSX users. Ensure your bash3_practice folder is created in your home (~) directory. You can use the command below: conda activate genome_assembly_env mkdir ~/bash3_practice/genome_assembly && cd ~/bash3_practice/genome_assembly Getting the data. You should have SRA tools already installed. If not, go back to Installing with conda (above) and download it. We will use SRA tools to download the genome assembly data. After downloading and preparing the data, you should have two files in your genome_assembly folder. The command below will download the reads in your current working directory (so ensure you are in your genome_assembly folder). prefetch -O . SRR9620862 Next, we need to split the reads into forward and reverse reads: cd SRR9620862 fasterq-dump --split-files SRR9620862.sra We can rename the files and move them: cp SRR9620862_1.fastq C_jejuni_R1.fastq cp SRR9620862_2.fastq C_jejuni_R2.fastq mv C_jejuni_R* ../ && cd ../ NOTE: Genome assembly deals with a lot of data. You can keep the FASTQ files compressed with gzip. Doing this will save you space. gzip C_jejuni_R* Quality Assessment In this section, we want to look at the quality of our data to help us determine what type of trimming/cleanup we can do to improve quality. Improving the quality of our data will help with better assemblies. FastQC (assessment tool). FastQC is a quality control tool for sequence data. The software outputs various information about the data given. Such output includes: · Read length – important for setting maximum k-mer size for assembly · Total number of reads – Helps calculate coverage · Quality encoding type – Important in quality trimming software · %GC – Genomes with high GC content don't usually assemble well · Quality throughout the reads – Helps determine trimming/cleanup methods · K-mer occurrences – Can signify contamination of reads (such as barcodes are adapters) · N's occurrences – Can indicate poor sequencing run (trim these reads) Usage: fastqc C_jejuni_R1.fastq.gz C_jejuni_R2.fastq.gz Output is a few files. Below I have snips of some HTML files for the forward read (R1) and reverse reads (R2). I didn't include snips of all modules, but I suggest exploring them. You can also get more detail about the modules here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/ There's a summary section on the left of the HTML page with green, yellow, or red icons next to each module. Use the link provided to the modules to see why you might have gotten a warning or failure icon. Although it is hard to know what to expect when first looking at the modules, without much experience, the developers provide some examples of "good" and "bad" data. Example of ‘good’ data: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html Example of ‘bad’ data: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html Basic statistics module. From the basic statistics, the number of reads is 439,576, sequence length is 35-251, and %GC 33. It appears that sequence lengths could be causing a warning in the sequence length distribution module. During the trimming/cleanup process, we can attempt to fix this by only having reads of length 250bp. Coverage is roughly estimated using the following formula: C = LN/G · G is genome size (of the reference genome) · L is the read length (it's between 33-250; we can see that most are in the 250, so going with that) · N is the number of reads (paired-end reads, so take 439,576 and times by 2) C = (879,152 * 500)/1,640,000 Coverage is ~ 268X. Suggested coverage is around 50X-100X. ((182,722 * 2) * 250) / 1,640,000 ~ 55.7X Base sequence quality. The per-base sequence quality graph is a Box&Whisker plot, in which the X-axis is the position in the read and the Y-axis is the quality. The red line is the median value, the yellow box represents where %50 (interquartile range) sits for that position, upper and lower whiskers are the 10% and 90% points, and the blue line represents the mean quality. Nothing here appears to be out of the ordinary. The quality seems to be good throughout the entire length of the reads. However, there appears to be a slight dip in quality at the end and beginning, but not much. Forward reads: Reverse Reads: Quality trimming and cleanup Trimmomatic (filtering tool). Since we have an idea about our data quality, we can start using tools to trim/clean up the data for quality improvement. Trimmomatic is a pair-aware trimmer. For usage and more details, go here: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf The order of the parameters supplied matters. So be mindful of that. For this first trimming, I used LEADING to cut bases off the start of the read that is below quality score 10. Next, TRAILING was to cut off bases below ten but for the end. Next, SLIDINGWINDOW was used to look at five bases at a time, starting at 5' end, and clip the read once the average quality (within those five bases) falls below 20. Finally, MINLEN was used to remove any reads below the length of 250. trimmomatic PE C_jejuni_R1.fastq.gz C_jejuni_R2.fastq.gz \ Cjejuni_R1_paired.fastq.gz Cjejuni_R1_unpaired.fastq.gz \ Cjejuni_R2_paired.fastq.gz Cjejuni_R2_unpaired.fastq.gz \ LEADING:10 TRAILING:10 SLIDINGWINDOW:5:20 MINLEN:250 The output tells us 43.29% of the read pairs survived. About 380,584 reads left. So, 380,584 * 500bp is 190Mbs, equating to 116X coverage. Usually, the best is ~50X-100X. Thus, we can filter more. After first filtering Forward read: Adding CROP. We can add the crop parameter to cut sequences to be 250. MINLEN will remove other sequences not long enough. So we should only have sequences with a length of 250. trimmomatic PE C_jejuni_R1.fastq.gz C_jejuni_R2.fastq.gz \ Cjejuni_R1_paired.fastq.gz Cjejuni_R1_unpaired.fastq.gz \ Cjejuni_R2_paired.fastq.gz Cjejuni_R2_unpaired.fastq.gz \ CROP:250 LEADING:10 TRAILING:10 SLIDINGWINDOW:5:20 MINLEN:250 The output shows that 41.57% of the read pairs survived, which is ~111X coverage. You are welcome to try and increase quality. If you manage to do so, please share it with the class. From here, we will move on with our trimmed data. Genome Assembly After moving on from quality assessment and filtering, we can take the cleaned-up data and assemble the genome. The basic idea here is to use different genome assemblers and compare their output. Doing this will give us an idea of what assembly will be best to move on with for further analysis, such as genome annotation (there's a lot you can do after assembly). Genome assemblers used here are SPAdes and MEGAHIT. We will run the assembler with different parameters and compare those, but first, we will include an error correction step. Doing this can help improve the assemblies. (Links to genome assembly software used are provided at the end of this tutorial.) SPAdes error correction only. SPAdes has an error correction pipeline option that we can use. This tends to be the most computationally expensive task. It took about 2 hours on my laptop, and it took about 1 hour on my Mac desktop. The output is a directory that contains files and other directories. Before we run error correction, we need to downgrade our version of spades. This is because you will most likely get a segmentation fault if we run the code below without downgrading SPAdes. Segmentation faults are just one of the bugs you might face when trying new tools/software. Luckily with conda, it is pretty easy to downgrade: conda install -c bioconda spades=3.11.1 spades.py -1 Cjejuni_R1_paired.fastq.gz -2 Cjejuni_R2_paired.fastq.gz \ -o
Oct 13, 2022
SOLUTION.PDF

Get Answer To This Question

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here