Problem A. (10 points)
DNA: CG Pairs [1]
DNA (Deoxyribonucleic Acid) is a molecule that encodes genetic blueprints for living organisms. Understanding how it works is one of the fundamental pursuits of modern biology and genetics. Each strand of DNA is a chain of nucleotides. Each nucleotide is made up of sugar, a phosphate group, and one of the four nitrogenous bases: adenine (A), thymine (T), cytosine (C) and guanine (G). For this reason, a common way to represent a strand of DNA is by simply writing out the letters corresponding to the base that each nucleotide contains. For example, a small strand of DNA could be represented as:
dna_str = 'atcgttcg'
(This is a very short DNA sequence; actual sequences are much longer. If you're interested in what a real DNA sequence looks like, check out the National Center for Biotechnology Information at the National Institutes of Health. Here is the page that contains the first segment of the genome for a Burmese python.)
Vertebrates have a much lower density of CG pairs than would occur by chance, but the concentration of CG pairs is relatively high near genes. Because of this, looking for regions with a higher concentration of CG pairs is a good way to find possible gene sites.
Build a function cg_ratio(dna) that takes a string containing a dna strand as its argument and returns the fraction of dinucleotides (i.e. pairs of consecutive nucleotides) that are “cg” in that strand, without using the built-in count method. For example, the following dna strand has a cg_ratio of 0.25:
cg_ratio('accgttcgc')
because it contains 8 dinucleotides (‘ac’, ‘cc’, ‘cg’, ‘gt’, ‘tt’, ‘tc’, ‘cg’, ‘gc’), and 2 of them are ‘cg’, so we get 2/8 = 0.25.
Note that the total number of dinucleotides is always one less than the length of the string in this scenario.
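To make the counting concrete, here is a minimal sketch that lists the dinucleotides of the example strand. The helper name dinucleotides is hypothetical and is not part of the assignment; it only illustrates why a strand of length n has n - 1 pairs.

```python
# Hypothetical helper for illustration only; the assignment does not ask for it.
def dinucleotides(dna):
    """Return every pair of consecutive characters in dna, in order."""
    pairs = []
    # The last index has no right-hand neighbor, so stop one short of it.
    for i in range(len(dna) - 1):
        pairs.append(dna[i:i + 2])
    return pairs


pairs = dinucleotides('accgttcgc')
print(pairs)       # the 8 pairs listed in the example above
print(len(pairs))  # one less than len('accgttcgc'), which is 9
```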
If there are characters that are not legal nucleotide letters (‘a’, ‘c’, ‘g’, ‘t’, or their capitalized versions), print 'Invalid DNA strand' and return 0.0. Your function should be able to handle both upper and lower case letters. You can assume that the string passed into your function will contain at least two characters.
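One way the validity check could look is sketched below. The function name is_valid_strand is an assumption made for this illustration; the assignment only requires that cg_ratio itself print the message and return 0.0.

```python
# Hypothetical helper for illustration; only cg_ratio is required by the problem.
def is_valid_strand(dna):
    """Return True if every character of dna is a, c, g, or t in either case."""
    for letter in dna.lower():
        if letter not in 'acgt':
            return False
    return True


print(is_valid_strand('gccGtT'))    # every letter is a legal nucleotide
print(is_valid_strand('gccGtTfa'))  # 'f' is not a legal nucleotide letter
```

Checking the whole strand before counting means that an invalid character anywhere in the string, even at the very end, is caught before any ratio is computed.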
Hints:
● Start by converting the string to lowercase so you don’t have to deal with capitalization
● You may want to think about string indexing as your function examines the dna strand. You will need to step through the string, examining pairs of letters as you go. This means that you will likely want to loop through the indexes of the string, not the elements.
● If var is a string and i is any index in that string other than the last index, then var[i:i+2] is a two-character slice that includes the character at index i and the one at index i+1.
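Putting the hints together, one possible shape for the function is sketched below. It is not the only valid approach. It assumes Python 3, where / always produces a float, and it avoids the built-in count method by comparing each two-character slice directly.

```python
def cg_ratio(dna):
    """Return the fraction of consecutive letter pairs in dna that are 'cg'."""
    # Hint 1: lowercase first so 'CG', 'cG', and 'Cg' all match 'cg'.
    dna = dna.lower()

    # Reject the strand if any character is not a legal nucleotide letter.
    for letter in dna:
        if letter not in 'acgt':
            print('Invalid DNA strand')
            return 0.0

    # Hints 2 and 3: loop over indexes and examine the slice dna[i:i+2].
    cg_pairs = 0
    for i in range(len(dna) - 1):
        if dna[i:i + 2] == 'cg':
            cg_pairs += 1

    # The problem guarantees at least two characters, so this never divides by zero.
    return cg_pairs / (len(dna) - 1)


print(cg_ratio('accgttcgc'))  # 2 of the 8 pairs are 'cg'
```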
Constraints:
● Don't use the built-in count method
Examples (values in bold are returned, values in italics are printed)
>>> cg_ratio('atcgttcg')
0.2857142857142857
>>> cg_ratio('ggGCg')
0.25
>>> cg_ratio('cGa')
0.5
>>> cg_ratio('testing')
Invalid DNA strand
0.0
>>> cg_ratio('gccGtTfa')
Invalid DNA strand
0.0
[1] DNA problems adapted from Discovering Computer Science, Jessen Havill, 2016