Data structures homework
HW2 - DNA_Spring2020_cs3310 1 Assignment 2 Genetic Palindromes, Linked Lists Release Date Due Date January 30, 2020 February 13, 2020 Objectives • Understanding the mechanisms of linked lists in depth • Sorting linked lists • Explicitly manipulating pointers and references Taken from Wikimedia / Wikipedia Problem Specification In this assignment you will check to see if a given DNA (or RNA) double-strand is palindromic and search for palindromic strands in a genomic sequence. A palindrome in languages is a word or string that can be read the same way forward or backwards (e.g. racecar, wow, noon, madam, 10101). Palindromic sequences in DNA refer to a short run of bases (typically 3 to 5 in length), followed by their complementary bases in reverse order, we will call these genetic palindromes. DNA strands can be represented using strings of letters referring to the bases of the strand. There are four common types of bases: adenine (A), cytosine (C), guanine (G), and thymine (T), and there is also uracil (U) in RNA. A and T are complements, A and U are complements, and C and G are complements. The meaning of palindrome in the context of genetics is slightly different from the definition used for words and sentences. Since a double helix (see the above figure) is formed by two paired antiparallel strands of nucleotides that run in opposite directions, and the nucleotides always pair in the same way (adenine (A) with thymine (T) in DNA or uracil (U) in RNA; cytosine (C) with guanine (G)), a (single- stranded) nucleotide sequence is said to be a palindrome if it is equal to its reverse complement. For example, the DNA sequence ACCTAGGT is palindromic because its nucleotide-by-nucleotide complement is TGGATCCA, and reversing the order of the nucleotides in the complement gives the original sequence. The RNA sequence “UAGCTA” is genetic palindrome because, read backwards, it 2 reads “ATCGAU”, the complement to the original sequence. Palindromic sequences are important for genetic research, specifically in researching restrictive enzymes. They are also linked with diseases including several cancers, mental retardation, X-linked recessive diseases and many physical abnormalities. Hence, identifying these in a genome sequence is an important task. Note: In this assignment you are NOT allowed to use Java (or other languages) Collections API such as LinkedList, ArrayList, List, Stack, Queue, etc. Write an application in Java, Python or C++ to determine whether genomic sequences contain genetic palindromes using your implementation of linked lists as follows: 1) Storing a genome sequence in a string variable, say myGseq, and then searching for genetic palindromes by iterating through the characters of myGseq, may be a good solution that uses array-like random-access manipulations! However… 2) Since sample sizes and genomic sequences can be arbitrarily large, storing them in a string variable may not be feasible. Therefore, we will store them one character (i.e., base) per node in a linked list and manipulate the linked list to identify or search genetic palindromes. In order to get extensive experience with linked lists and pointers/references, you will use your implementation of linked lists following the ideas1 given after the assignment submission section. Any other methods used will receive zero credit. 3) Your linked list node’s data portion will have at least one attribute, called nucleotide of type char. 4) Read the DNA / RNA sequences from a text file. One genomic sequence per line. Note that there might be only one strand per line to simplify test cases. 5) Identify all genetic palindromes of length four to seventeen in a given genomic sequence. Store all the identified genetic palindromes in a new linked list using the Append() method of linked-list. Separate each genetic palindrome by a node containing the non-base character ‘-‘. Print the data of this linked list. 6) For each genetic palindrome that is found in the file, find how many other similar genetic palindromes exist in the file. Note that a pair of strands are similar when they have the same length and the ith nucleotide of a strand in the pair is either a base or its complement. 7) Test and measure time complexities of your implementations of isGeneticPalindrome(), findGeneticPalindromes() and countSimilarGeneticPalindromes() and output the results. See sample inputs and outputs. 8) All your functions, including isGeneticPalindrome(), should have input validation. a. Example: myLL.storeStrand(“TAT”) would store the characters {T,A,T} in the linked-list. b. Example: myLL.storeStrand( “hi_mom” ) would report an error and gracefully exit or continue processing next input. Your message to mom would never be seen :( 9) To test InsertSort() (and sortedInsert()), sort the genomic sequence after storing it in the linked list and print the first and the last genomic sequences in the file and their corresponding sorted sequences. 1 We have liberally borrowed ideas of the linked lists from Nick Parlante@Stanford and palindromes in genetics from Wikipedia. 3 Sample Inputs and Outputs Please enter the file name containing DNA/RNA Sequences: mySequences.txt .. read mySequences.txt time to create linked lists from sequences: 20 milliseconds DNA sequence 1: TATA “TATA” is a genetic palindrome. This sequence does not contain any genetic palindromes. time to test whether a genomic subsequence in Sequence 1 is palindromic: 2 milliseconds time to find all palindromic subsequences of length [4, 17] in Sequence 1: 5 milliseconds DNA sequence 2: Lorem Ipsom “Lorem Ipsom” contains characters that do not denote DNA molecules. Skipped. DNA sequence 3: TTTACCT “TTTACCT” is not a palindromic sequence. However, it has the following genetic palindromes of length > 3: TTACC time to test whether a genomic subsequence in Sequence 3 is palindromic: 2.5 milliseconds time to find all palindromic subsequences of length [4, 17] in Sequence 3: 6 milliseconds “TATA” has 20 similar sequences in the file, and it took 10 milliseconds to determine that. “TTACC” has 5 similar sequences in the file, and it took 11 milliseconds to determine that. Design Requirements Code Documentation For this assignment, you must include documentation for your code. This include how to compile and run your program. Also, you must give outputs of your program with the following double-stranded DNA: "TAGUCTA”, “CTAGUCTAA”, “TTAAATTCCTTTAGG”, “AGTCCGATCCGT”, and “AAAAAATTTTTTGGGGGGCCCCCC” at the minimum. If you don’t provide outputs for these five inputs, it means your program does not execute properly. You must also include documentation for your code as generated by JavaDoc (and similar tools if using some other language). You should have JavaDoc comments for every class, constructor, and method. By default, JavaDoc should output html documentation to a subfolder within your project (/dist/javadoc). Make sure this folder is included when you zip your files for submission. You do not need to submit a hard copy of this documentation. Hint: http://stackoverflow.com/questions/4468669/how-to-generate-javadoc-html-in-eclipse 4 Coding Conventions and Programming Standards You must adhere to all conventions in the CS 3310 Java coding standard. This includes the use of white spaces for readability and the use of comments to explain the meaning of various methods and attributes. Be sure to follow the conventions for naming files, classes, variables, method parameters and methods. Read the material linked from our class web-pages (in case you can’t recall programming styles and conventions from your CS1 and CS2 courses). Testing Make sure you test your application with several different values capturing different cases, to make sure it works. That is test your program not just on the inputs we specified but other representatiove test data that covers your program’s logic as well. Assignment Submission • Generate a .zip file that contains all your files, including: ▪ Signed Plagiarism Declaration ▪ Source code files ▪ Include any input or output files ▪ Documentation of your code – e.g. using Javadoc if using Java ▪ A brief report (in a pdf file) on your observations of comparing theoretical vs empirically observed time complexities. Note this report will include (a) a brief description of problem statement(s), (b) algorithms descriptions (if these are standard, commonly known algorithms, then just mention their names along with customization to your specific solution(s), otherwise give the pseudo-code of your algorithms, (c) theoretically derived complexities of the algorithms used in your code, (d) table(s) of the observed time complexities, and (e) plots comparing theoretical vs. empirical along with your observations (e.g. do theoretical agree with your implementation, why? Why not?). • Don’t forget to follow the naming convention specified for submitting assignments 5 Linked List Specifications using Java-like syntax Linked List Ground Rules All of the linked list code in this document uses the "classic" doubly linked list structure: A single head pointer points to the first node in the list. Each node contains a single next pointer to the next node and a single prev pointer to the previous node. The next pointer of the last node is NULL and prev pointer of the first node is NULL. The empty list is represented by a NULL head pointer. All of the nodes are allocated in the heap. You will practice list manipulations using an object-oriented (OO) style of coding, for some functionality we may use non-OO style of coding. Essentially, in OO style – list manipulation functions are implemented as methods of the class and in non-OO style (aka conventional) they are implemented as independent units. So there might be two versions of the functions. We suggest, you first implement the functions using whatever style you are most comfortable with and then copy-paste-modify-test those methods’ implementations using