Assignment Brief & Python Code attached as a file
BIO828 Assignment

The following tasks should be completed and submitted before midnight on Friday in Week 11. Your answers/report to the task questions should be in Word or PDF format, your Python code should be a separate *.py file, and your Python script output file(s) should be as specified in Task 5. Please pack all your assignment work files into a single zip file and submit it using the submission dropbox provided on BBL. Please get in touch if you have any queries.

Task 1 – PubMed search (10%)

Using the key information from the statement below, perform a PubMed search and then answer the questions that follow:

    Although traditional sweeteners such as sugar are carbohydrates, most current research instead is focusing on proteins that have an intrinsically sweet taste. Because these sweet-tasting proteins are much sweeter than their carbohydrate counterparts, they are, in essence, calorie free, because so little is used to achieve a sweet taste in food. The most successful example of such a protein is aspartame; however, aspartame is synthetic and does not occur in nature. Alternate natural protein sources are being investigated, including a sweet-tasting protein called monellin.

a. According to Ogata et al., how much sweeter than ordinary sugar is monellin on both a molar and a weight basis?
b. Based on the UniProt entry for monellin chain B from serendipity berry (P02882), which residue (amino acid position), when blocked, eliminates monellin's sweetness? Please also name the amino acid if possible.

Task 2 – RNA prediction (10%)

Using the Mfold method provided in Week 3, predict the secondary structure of each of the following four homologous sequences and determine the consensus secondary structure for the four sequences.

Sequence A – UUAAGGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCACCUGCGUUCUGGUCAGUACUGGAGUGAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCCCU

Sequence B – GUUACGGCGGUCAAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCAAUGAUACUACCCUUCCGGGUGGAAAAGUAGGACACCGCCGAACAU

Sequence C – AUCUGCGGCCAUACCGCGCUGAACGUUCCGCGUCUCGUCCGAUCCGCGCAGACAAGCAUCGCAGGGGCCAGAGAGUAUUGACGUGGGUGACCAGUCGAGAACACUGUGCUGCCGCAGGU

Sequence D – AUGUGCGACCAUACCAAGCUGAAAAUACUGCAUCCCGUCUGAUCUGCACAGUCAAGCAGCUUAGGGCCCAGUCAGUAGUGCGGUGGGGGACCAUGCGCGAACAUUGUGGUGUUGCACUU

Task 3 – Functional prediction of variants (10%)

Use the variants and programs listed in the table below (websites for the programs were provided in the Week 6 lecture) to complete the table. All these variants have been associated with Primary Congenital Glaucoma. You will need to perform the PolyPhen analysis first, because it provides a protein accession number that can be clicked to link to further information on COL1A1; this information will be needed for some of the other programs. Please briefly comment on the similarities and differences between the results for each mutation.

Gene/protein    Variant         PolyPhen    SIFT    FATHMM (provide score)
COL1A1          p.Met264Leu
COL1A1          p.Ala1083Thr
COL1A1          p.Gly767Ser
COL1A1          p.Gly154Val

Task 4 – Protein sequence alignment (10%)

Search the Pfam database (accessible at https://pfam.xfam.org/) with the query sequence given below. Describe the matches found, their significance, and the corresponding alignment positions. Discuss the likely function(s) of this query protein as predicted by this Pfam search.

>example.seq
MADLEAVLADVSYLMAMEKSKATPAARASKKILLPEPSIRSVMQKYLEDRGEVTFEKIFSQKLGYLLFRDFCLNHLEEARPLVEFYEEIKKYEKLET
EEERVARSREIFDSYIMKELLACSHPFSKSATEHVQGHLGKKQVPPDLFQPYIEEICQNLRGDVFQKFIESDKFTRFCQWKNVELNIHLTMNDFSV
HRIIGRGGFGEVYGCRKRDTGKMYAMKCLDKKRIKMKQGETLALNERIMLSLVSTGDCPFIVCMSYAFHTPDKLSFILDLMNGGDLHYHLSQHGV
FSEADMRFYAAEIILGLEHMHNRFVVYRDLKPANILLDEHGHVRISDLGLACDFSKKKPHASVGTHGYMAPEVLQKGVAYDSSADWFSLGCMLFK
LLRGHSPFRQHKTKDKHEIDRMTLTMAVELPDSFSPELHSLLEGLLQRDVNRRLGCLGRGAQEVKESPFFRSLDWQMVFLQRYPPPLIPPRGEV
NAADAFDIGSFDEEDTKGIKLLDSDQELYRNFPLTISERWQQEVAETVFDTINAETDRLEARKKAKNKQLGHEEDYALGKDCIMHGYMSKMGNPF
LTQWQRRYFYLFPNRLEWRGEGEAPQSLLTMEEIQSVEETQIKERKCLLLKIRGGKQFILQCDSDPELVQWKKELRDAYREAQQLVQRVPKMKN
KPRSPVVELSKVPLVQRGSANGL

Task 5 – Python coding for data analysis (10%)

The file "DNA-Seqs.fasta.tab" is a tab-delimited text file that contains DNA sequences in FASTA format. Here are a few lines from the file:

SeqName      seqstr
seqsA0001    GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCC
seqsA0002    ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGA
seqsB0003    GCCTCTCTGGGTTGTGGTGGGGGTACAGGCAGCCTGCCCTGGTGGGCACCCTGGAGCCCCATGTGTAGGGAGAGG

Write a Python program that reads the data from this input file. For each sequence in the file, count the numbers of nucleotides A, C, G and T, then calculate the fraction (proportion) of each nucleotide and the GC content (see the definition at https://en.wikipedia.org/wiki/GC-content). Finally, output your summary of the analysis in the format shown below and save the result to disk as a tab-delimited text file, a CSV (comma-separated values) file, or an Excel workbook. Please attach your Python code (the .py file) so that it can be tested.

seqName      n    nA    nC    nG    nT    fracA    fracC    fracG    fracT    GCcontent
seqsA0001    45   8     ?     ?     ?     0.18     ?        ?        ?        ?
seqsA0002    45   10    ?     ?     ?     0.22     ?        ?        ?        ?
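
The steps described in Task 5 can be sketched as follows. This is a minimal illustration of one possible approach, not a model answer. It assumes that the input file has a single header line followed by one name/sequence pair per line, that fractions are taken over the full sequence length, and that GC content is reported as a fraction (G + C) / (A + C + G + T) rather than as a percentage. The function names `summarise_sequence` and `write_summary` and the output file name in the usage note are hypothetical.

```python
import csv

NUCLEOTIDES = "ACGT"


def summarise_sequence(seq):
    """Return (n, counts, fractions, gc) for a single DNA sequence string."""
    seq = seq.strip().upper()
    n = len(seq)
    counts = {base: seq.count(base) for base in NUCLEOTIDES}
    # Assumption: fractions use the full length n as the denominator.
    fracs = {base: (counts[base] / n if n else 0.0) for base in NUCLEOTIDES}
    # Assumption: GC content is a fraction over the counted A, C, G, T bases.
    total = sum(counts.values())
    gc = (counts["G"] + counts["C"]) / total if total else 0.0
    return n, counts, fracs, gc


def write_summary(in_path, out_path):
    """Read a tab-delimited name/sequence file and write a tab-delimited summary."""
    header = ["seqName", "n", "nA", "nC", "nG", "nT",
              "fracA", "fracC", "fracG", "fracT", "GCcontent"]
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        next(reader, None)  # skip the header line (assumed to be present)
        for row in reader:
            if len(row) < 2 or not row[1].strip():
                continue  # ignore blank or malformed lines
            name = row[0].strip()
            n, counts, fracs, gc = summarise_sequence(row[1])
            writer.writerow(
                [name, n]
                + [counts[b] for b in NUCLEOTIDES]
                + [f"{fracs[b]:.2f}" for b in NUCLEOTIDES]
                + [f"{gc:.2f}"]
            )
```

With the input file in the working directory, calling `write_summary("DNA-Seqs.fasta.tab", "DNA-Seqs.summary.tsv")` would produce the summary table; the two-decimal rounding follows the `0.18` and `0.22` values shown in the example output. The standard-library `csv` module is used here so that switching to CSV output only requires changing the writer's delimiter.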