CSE 5370: Bioinformatics Homework 2: Due Thursday, February 24th, 2022 at 4:59PM CST
Homework 2
In this homework you will write basic genome assemblers using two methods:
greedy SCS and De Bruijn Graphs.
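As a rough illustration of the second method named above, the sketch below builds a De Bruijn graph from a set of reads: each k-mer contributes one edge from its left (k-1)-mer to its right (k-1)-mer. The function name `de_bruijn_graph` and the adjacency-list representation are assumptions made for this example, not part of the assignment, and the sketch stops at graph construction (it does not walk the graph to produce an assembly).

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Hypothetical helper: one edge is added per k-mer occurrence, so
    repeated k-mers produce repeated (parallel) edges.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


toy_graph = de_bruijn_graph(["ACGT"], 3)
```

Here the single read ACGT yields the 3-mers ACG and CGT, hence the edges AC to CG and CG to GT.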
Group Work
You are allowed to work together in groups of up to 4, but the assignment is
structured so that every person must generate their own synthetic read pools
with a random number generator (such that every submission will be unique).
You may collaborate on discussing logic in the code. The automated grading
script renames all functions (function1, function2, function3, etc.) and all
variables (var1, var2, var3, etc.), then standardizes all whitespace. Code
from different groups should not be identical after this process, and every person
should have a different random set of reads and assemblies per the assignment
specification.
1 Simulated Reads (25 points)
1. Write Python code to generate a synthetic genome (a string) consisting of
a random nucleotide sequence of length 10,000.
2. Randomly generate a pool of synthetic "reads" (Pool A) of length 250
such that no reads overlap on the genome from 1.
3. Randomly generate a pool of synthetic "reads" (Pool B) such that every
nucleotide in the genome from 1 is in a minimum of 4 reads and each read
is guaranteed to overlap with another read by at least 7 nucleotides.
4. Randomly generate a pool of synthetic "reads" (Pool C) such that every
nucleotide in the genome from 1 is in a minimum of 30 reads and each
read is guaranteed to overlap with another read by at least 14 nucleotides.
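One possible way to meet the four requirements above is sketched below. It is not the required approach, and it rests on several assumptions that the assignment does not state: Pools B and C reuse the read length of 250 given for Pool A, Pool A contains 20 reads, and all function names are hypothetical. Pools B and C are built from independent "layers", where each layer tiles the whole genome with random steps no larger than the read length minus the minimum overlap, so neighbouring reads in a layer always share at least that many nucleotides and every nucleotide is covered once per layer.

```python
import random


def random_genome(length, rng):
    """Return a uniformly random string over A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def disjoint_starts(genome_len, read_len, n_reads, rng):
    """Start positions for n_reads reads that never overlap on the genome."""
    slack = genome_len - n_reads * read_len  # bases left outside every read
    if slack < 0:
        raise ValueError("reads do not fit without overlapping")
    # The i-th sorted offset is shifted by i whole reads, so consecutive
    # starts are always at least read_len apart.
    offsets = sorted(rng.randint(0, slack) for _ in range(n_reads))
    return [off + i * read_len for i, off in enumerate(offsets)]


def tiled_starts(genome_len, read_len, min_overlap, coverage, rng):
    """Start positions for `coverage` layers that each tile the genome."""
    if not 0 <= min_overlap < read_len < genome_len:
        raise ValueError("need 0 <= min_overlap < read_len < genome_len")
    last = genome_len - read_len       # final legal start position
    max_step = read_len - min_overlap  # larger steps would break the overlap
    starts = []
    for _ in range(coverage):
        pos = 0
        starts.append(pos)
        while pos < last:
            pos = min(pos + rng.randint(1, max_step), last)
            starts.append(pos)
    rng.shuffle(starts)  # do not leak genome order through pool order
    return starts


def min_coverage(genome_len, starts, read_len):
    """Smallest number of reads covering any single genome position."""
    depth = [0] * genome_len
    for s in starts:
        for i in range(s, s + read_len):
            depth[i] += 1
    return min(depth)


READ_LEN = 250         # given for Pool A; assumed here for Pools B and C
rng = random.Random()  # pass an integer seed to make a run reproducible
genome = random_genome(10_000, rng)

starts_a = disjoint_starts(len(genome), READ_LEN, 20, rng)  # 20 is assumed
starts_b = tiled_starts(len(genome), READ_LEN, 7, 4, rng)
starts_c = tiled_starts(len(genome), READ_LEN, 14, 30, rng)

pool_a = [genome[s:s + READ_LEN] for s in starts_a]
pool_b = [genome[s:s + READ_LEN] for s in starts_b]
pool_c = [genome[s:s + READ_LEN] for s in starts_c]
```

Calling `min_coverage(len(genome), starts_b, READ_LEN)` is a quick way to confirm that a generated pool really reaches the required depth before submitting it.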
Use a random number generator to produce these read pools and submit them
as text files with the assignment. Also include in your submission a text file
containing your synthetic genome. Include the code to generate these in your
single code file.
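A minimal sketch of saving a genome and a read pool as text files follows. The one-sequence-per-line layout, the file names `genome.txt` and `pool_a.txt`, and the helper names are all assumptions, since the assignment does not prescribe a file format. The demonstration uses toy data and a temporary directory so that it leaves no files behind.

```python
import tempfile
from pathlib import Path


def write_lines(path, lines):
    """Write one sequence per line (assumed layout)."""
    Path(path).write_text("".join(line + "\n" for line in lines))


def read_lines(path):
    """Read the sequences back, one per line."""
    return Path(path).read_text().splitlines()


# Toy data standing in for the real genome and read pool.
genome = "ACGTACGTAC"
pool = ["ACGTA", "CGTAC"]

with tempfile.TemporaryDirectory() as tmp:
    genome_path = Path(tmp) / "genome.txt"  # hypothetical file name
    pool_path = Path(tmp) / "pool_a.txt"    # hypothetical file name
    write_lines(genome_path, [genome])
    write_lines(pool_path, pool)
    genome_back = read_lines(genome_path)[0]
    pool_back = read_lines(pool_path)
```

Reading the files back and comparing them with the in-memory data is a cheap check that nothing was truncated or reordered on the way to disk.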
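import itertools

def shortestCommonSuperstring(string_set):
    # Brute force: try every ordering of the strings and keep the shortest
    # superstring. Assumes an overlap(a, b, min_length) helper is defined
    # elsewhere in the file.
    shortest_sup = None
    for perm in itertools.permutations(string_set):
        sup = perm[0]  # start from the first string in this ordering
        for i in range(len(string_set) - 1):
            # longest suffix of perm[i] that matches a prefix of perm[i+1]
            olen = overlap(perm[i], perm[i+1], min_length=1)
            # append only the part of the next string not already present
            sup += perm[i+1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup
    return shortest_sup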
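The brute-force listing calls an `overlap` helper that is not shown in this section, and it examines every permutation of the input, which is only feasible for a handful of strings. The sketch below gives one plausible `overlap` (assumed to return the length of the longest suffix of `a` that equals a prefix of `b`, or 0 if that length is below `min_length`) together with a greedy alternative that repeatedly merges the pair of reads with the largest overlap. The names `pick_max_overlap`, `greedy_scs`, and the `min_overlap` parameter are hypothetical, and the greedy result is not guaranteed to be the shortest superstring.

```python
import itertools


def overlap(a, b, min_length=1):
    """Length of the longest suffix of a matching a prefix of b, else 0."""
    start = 0
    while True:
        # Look for the first min_length characters of b inside a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Accept only if the whole suffix a[start:] is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def pick_max_overlap(reads, min_overlap):
    """Return the ordered pair of reads with the largest overlap."""
    best_a, best_b, best_len = None, None, 0
    for a, b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length=min_overlap)
        if olen > best_len:
            best_a, best_b, best_len = a, b, olen
    return best_a, best_b, best_len


def greedy_scs(reads, min_overlap):
    """Greedily merge the best-overlapping pair until none remain."""
    reads = list(reads)  # work on a copy
    a, b, olen = pick_max_overlap(reads, min_overlap)
    while olen > 0:
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[olen:])
        a, b, olen = pick_max_overlap(reads, min_overlap)
    # Any reads that could not be merged are simply concatenated.
    return "".join(reads)


toy_assembly = greedy_scs(["ABC", "BCA", "CAB"], 2)
```

Each merge rescans all ordered pairs, so the running time grows quickly with the number of reads; that is acceptable for small pools but slow for deep coverage.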
k = 1, 9, 18
for each run. Include a text file with the assembly output in your submission as
well as your code.