Data structures homeworkHW2 - DNA_Spring2020_cs3310 1 Assignment 2 Genetic Palindromes, Linked...

Question

Data structures homeworkHW2 - DNA_Spring2020_cs3310 1   Assignment 2  Genetic Palindromes, Linked Lists        Release Date Due Date  January 30, 2020 February 13, 2020    Objectives     •       Understanding the mechanisms of linked lists in depth  •       Sorting linked lists  •       Explicitly manipulating pointers and references   Taken from Wikimedia / Wikipedia    Problem Specification    In this assignment you will check to see if a given DNA (or RNA) double-strand is palindromic and  search for palindromic strands in a genomic sequence. A palindrome in languages is a word or string  that can be read the same way forward or backwards (e.g. racecar, wow, noon, madam, 10101).  Palindromic sequences in DNA refer to a short run of bases (typically 3 to 5 in length), followed by  their complementary bases in reverse order, we will call these genetic palindromes.    DNA strands can be represented using strings of letters referring to the bases of the strand. There are  four common types of bases: adenine (A), cytosine (C), guanine (G), and thymine (T), and there is  also uracil (U) in RNA. A and T are complements, A and U are complements, and C and G are  complements.     The meaning of palindrome in the context of genetics is slightly different from the definition used for  words and sentences. Since a double helix (see the above figure) is formed by two paired antiparallel  strands of nucleotides that run in opposite directions, and the nucleotides always pair in the same way  (adenine (A) with thymine (T) in DNA or uracil (U) in RNA; cytosine (C) with guanine (G)), a (single- stranded) nucleotide sequence is said to be a palindrome if it is equal to its reverse complement. For  example, the DNA sequence ACCTAGGT is palindromic because its nucleotide-by-nucleotide  complement is TGGATCCA, and reversing the order of the nucleotides in the complement gives the  original sequence. The RNA sequence “UAGCTA” is genetic palindrome because, read backwards, it  2   reads “ATCGAU”, the complement to the original sequence.    Palindromic sequences are important for genetic research, specifically in researching restrictive  enzymes. They are also linked with diseases including several cancers, mental retardation, X-linked  recessive diseases and many physical abnormalities. Hence, identifying these in a genome sequence  is an important task.    Note: In this assignment you are NOT allowed to use Java (or other languages) Collections API such  as LinkedList, ArrayList, List, Stack, Queue, etc.    Write an application in Java, Python or C++ to determine whether genomic sequences contain  genetic palindromes using your implementation of linked lists as follows:     1) Storing a genome sequence in a string variable, say myGseq, and then searching for genetic  palindromes by iterating through the characters of myGseq, may be a good solution that uses  array-like random-access manipulations! However…  2)  Since sample sizes and genomic sequences can be arbitrarily large, storing them in a string  variable may not be feasible. Therefore, we will store them one character (i.e., base) per node in  a linked list and manipulate the linked list to identify or search genetic palindromes. In order to get  extensive experience with linked lists and pointers/references, you will use your implementation  of linked lists following the ideas1 given after the assignment submission section. Any other  methods used will receive zero credit.  3) Your linked list node’s data portion will have at least one attribute, called nucleotide of type char.  4) Read the DNA / RNA sequences from a text file. One genomic sequence per line. Note that there  might be only one strand per line to simplify test cases.  5) Identify all genetic palindromes of length four to seventeen in a given genomic sequence. Store  all the identified genetic palindromes in a new linked list using the Append() method of linked-list.  Separate each genetic palindrome by a node containing the non-base character ‘-‘. Print the data  of this linked list.  6) For each genetic palindrome that is found in the file, find how many other similar genetic  palindromes exist in the file. Note that a pair of strands are similar when they have the same  length and the ith nucleotide of a strand in the pair is either a base or its complement.  7) Test and measure time complexities of your implementations of isGeneticPalindrome(),  findGeneticPalindromes() and countSimilarGeneticPalindromes() and output the results. See  sample inputs and outputs.   8) All your functions, including isGeneticPalindrome(), should have input validation.  a.     Example: myLL.storeStrand(“TAT”) would store the characters {T,A,T} in the linked-list.  b.     Example: myLL.storeStrand( “hi_mom” ) would report an error and gracefully exit or  continue processing next input. Your message to mom would never be seen :(  9) To test InsertSort() (and sortedInsert()), sort the genomic sequence after storing it in the linked  list and print the first and the last genomic sequences in the file and their corresponding sorted  sequences.    1 We have liberally borrowed ideas of the linked lists from Nick Parlante@Stanford and palindromes in genetics from Wikipedia.  3     Sample Inputs and Outputs    Please enter the file name containing DNA/RNA Sequences: mySequences.txt     .. read mySequences.txt  time to create linked lists from sequences: 20 milliseconds    DNA sequence 1: TATA    “TATA” is a genetic palindrome.   This sequence does not contain any genetic palindromes.    time to test whether a genomic subsequence in Sequence 1 is palindromic: 2 milliseconds  time to find all palindromic subsequences of length [4, 17] in Sequence 1: 5 milliseconds    DNA sequence 2: Lorem Ipsom  “Lorem Ipsom” contains characters that do not denote DNA molecules. Skipped.    DNA sequence 3: TTTACCT    “TTTACCT” is not a palindromic sequence. However, it has the following genetic palindromes  of length > 3: TTACC    time to test whether a genomic subsequence in Sequence 3 is palindromic: 2.5 milliseconds  time to find all palindromic subsequences of length [4, 17] in Sequence 3: 6 milliseconds      “TATA” has 20 similar sequences in the file, and  it took 10 milliseconds to determine that.     “TTACC” has 5 similar sequences in the file, and  it took 11 milliseconds to determine that.     Design Requirements      Code Documentation    For this assignment, you must include documentation for your code. This include how to compile and  run your program. Also, you must give outputs of your program with the following double-stranded  DNA: "TAGUCTA”, “CTAGUCTAA”, “TTAAATTCCTTTAGG”, “AGTCCGATCCGT”, and  “AAAAAATTTTTTGGGGGGCCCCCC” at the minimum. If you don’t provide outputs for these five  inputs, it means your program does not execute properly.     You must also include documentation for your code as generated by JavaDoc (and similar tools if  using some other language). You should have JavaDoc comments for every class, constructor, and  method. By default, JavaDoc should output html documentation to a subfolder within your project  (/dist/javadoc). Make sure this folder is included when you zip your files for submission. You do not  need to submit a hard copy of this documentation.    Hint: http://stackoverflow.com/questions/4468669/how-to-generate-javadoc-html-in-eclipse      4     Coding Conventions and Programming Standards    You must adhere to all conventions in the CS 3310 Java coding standard. This includes the use of  white spaces for readability and the use of comments to explain the meaning of various methods and  attributes. Be sure to follow the conventions for naming files, classes, variables, method parameters  and methods. Read the material linked from our class web-pages (in case you can’t recall  programming styles and conventions from your CS1 and CS2 courses).      Testing    Make sure you test your application with several different values capturing different cases, to make  sure it works. That is test your program not just on the inputs we specified but other representatiove  test data that covers your program’s logic as well.      Assignment Submission    • Generate a .zip file that contains all your files, including:  ▪ Signed Plagiarism Declaration  ▪ Source code files  ▪ Include any input or output files  ▪ Documentation of your code – e.g. using Javadoc if using Java  ▪ A brief report (in a pdf file) on your observations of comparing theoretical vs empirically  observed time complexities. Note this report will include (a) a brief description of  problem statement(s), (b) algorithms descriptions (if these are standard, commonly  known algorithms, then just mention their names along with customization to your  specific solution(s), otherwise give the pseudo-code of your algorithms, (c) theoretically  derived complexities of the algorithms used in your code, (d) table(s) of the observed  time complexities, and (e) plots comparing theoretical vs. empirical along with your  observations (e.g. do theoretical agree with your implementation, why? Why not?).  • Don’t forget to follow the naming convention specified for submitting assignments       5   Linked List Specifications using Java-like syntax  Linked List Ground Rules    All of the linked list code in this document uses the "classic" doubly linked list  structure: A single head pointer points to the first node in the list. Each node contains  a single next pointer to the next node and a single prev pointer to the previous  node. The next pointer of the last node is NULL and prev pointer of the first  node is NULL. The empty list is represented by a NULL head pointer. All of the  nodes are allocated in the heap.   You will practice list manipulations using an object-oriented (OO) style of coding,  for some functionality we may use non-OO style of coding. Essentially, in OO style  – list manipulation functions are implemented as methods of the class and in non-OO  style (aka conventional) they are implemented as independent units. So there might  be two versions of the functions. We suggest, you first implement the functions using  whatever style you are most comfortable with and then copy-paste-modify-test those  methods’ implementations using

HW2 - DNA_Spring2020_cs3310 1 Assignment 2 Genetic Palindromes, Linked Lists Release Date Due Date January 30, 2020 February 13, 2020 Objectives • Understanding the mechanisms of linked lists in depth...

Get Answer To This Question

Related Questions & Answers

Submit New Assignment