Team:FAU Erlangen/Software
The struggle to be unique
To carry out our project, we first needed sequences that are rare or absent in the yeast strains we use. Specifically, we required 10 consecutive base pairs that do not occur in the genome. Our first brute-force attempt, which counted each of the 1048576 (4^10) possible 10-base combinations across all chromosomes, turned out to need approximately 40 days. As that was not an option, we needed a faster program that could tell us which sequences are least common. We pursued two approaches:
- Brute force optimization
- Cutting down to solve the specific problem at hand
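The initial brute-force idea can be illustrated with a minimal sketch. It is not our original script (which targets Python 2.7); it is a Python 3 illustration, and the function names and toy data are made up for this example. It generates all 4**k candidate sequences over A, C, G and T and counts each one, including overlapping hits, across a list of chromosome strings.

```python
from itertools import product


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def brute_force_counts(chromosomes, k):
    """Map every possible k-mer over A, C, G, T to its total count."""
    counts = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        counts[kmer] = sum(count_overlapping(c, kmer) for c in chromosomes)
    return counts


# Toy data (hypothetical); real input would be the chromosome files.
toy_chromosomes = ["ACGTACGTAA", "TTTTACGA"]
counts = brute_force_counts(toy_chromosomes, 2)
absent = sorted(kmer for kmer, n in counts.items() if n == 0)
print(len(counts), "candidates,", len(absent), "never occur")
```

Every candidate triggers a full pass over every chromosome, so the work grows with the number of candidates times the genome size. That is why the run time becomes impractical for k = 10.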
Brute force optimization
As the first brute-force attempt was not feasible, we tried to make better use of the computing power at hand. The first version used only one CPU core, which was very slow because the problem is mainly CPU-bound. We therefore redesigned the code to run in as many processes as the system has cores, which ideally divides the time needed by that number. The system we used has 64 cores, and the change greatly increased the completion speed. It is now possible to search the complete yeast genome and count the occurrences of every single 10-character-long sequence within a few days (other people were using the same computing cluster, so the calculation was slowed down at times). That is still a long time, but it is an improvement.
But why stop there? We did not want to wait that long only to get a few good sequences we could use. The next step was to search for shorter sequences, starting from 7 base pairs, and to check whether any of them were unique. A search for 8 base pairs turned up a few good sequences, which could then be checked with common matching programs to see whether they meet our requirements. A further way of cutting down computing time was to stop counting a sequence as soon as it had been found more often than a certain threshold and to move on to the next sequence.
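The two speed-ups described above, one process per core and abandoning a sequence once it exceeds a threshold, could be combined roughly as follows. This is a hedged Python 3 sketch, not the uploaded Python 2.7 script. The names `count_up_to` and `rare_kmers`, the `limit` and `processes` parameters and the chunk size are assumptions chosen for illustration.

```python
from functools import partial
from multiprocessing import Pool


def count_up_to(kmer, chromosomes, limit):
    """Count overlapping hits of kmer, giving up once limit is exceeded.

    Returns (kmer, count); a count of limit + 1 means "more than limit".
    """
    count = 0
    for chromosome in chromosomes:
        start = chromosome.find(kmer)
        while start != -1:
            count += 1
            if count > limit:
                return kmer, count  # too common, skip the rest of the genome
            start = chromosome.find(kmer, start + 1)
    return kmer, count


def rare_kmers(candidates, chromosomes, limit, processes=None):
    """Return {kmer: count} for candidates found at most limit times."""
    worker = partial(count_up_to, chromosomes=chromosomes, limit=limit)
    if processes == 1:
        results = map(worker, candidates)  # serial fallback, no worker pool
    else:
        # processes=None lets Pool start one worker per available core.
        with Pool(processes) as pool:
            results = pool.map(worker, candidates, chunksize=256)
    return {kmer: n for kmer, n in results if n <= limit}
```

With the toy chromosomes `["ACGTACGTAA", "TTTTACGA"]`, calling `rare_kmers(["AA", "AC", "GG", "TT"], genome, limit=1)` keeps only the candidates seen at most once. On platforms where `multiprocessing` starts workers by spawning a fresh interpreter, that call should sit under an `if __name__ == "__main__":` guard. Each candidate is independent of the others, which is what makes splitting the work across processes straightforward.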
Checking existing sequence set
As our project was designed to target specific places within a part that we would insert ourselves, we decided to check whether the HIS3 and CAN promoter regions already contained any useful targeting sites. For that purpose, we wrote a second script, which reads the promoter files and builds 10-character-long sequences from them. Only those sequences were matched against the genome, reducing both computing time and cloning steps in the lab. Depending on the promoter length, these scripts take approximately 10 minutes to 1 hour to count everything.
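The promoter check can be sketched as follows, assuming the 10-character sequences are taken as overlapping windows that slide along the promoter one base at a time (our reading of the description, not a confirmed detail of the script). The function names and toy data are hypothetical, and the sketch is written for Python 3. A promoter of length n yields at most n - k + 1 candidates instead of 4**k, which is where the time saving comes from.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def promoter_windows(promoter, k=10):
    """Return the distinct k-character windows of promoter, in order."""
    windows = (promoter[i:i + k] for i in range(len(promoter) - k + 1))
    return list(dict.fromkeys(windows))  # de-duplicate, keep first-seen order


def promoter_site_counts(promoter, chromosomes, k=10):
    """Map each promoter window to its total count across all chromosomes."""
    return {
        window: sum(count_overlapping(c, window) for c in chromosomes)
        for window in promoter_windows(promoter, k)
    }


# Toy data (hypothetical) with k=4 to keep the example readable.
site_counts = promoter_site_counts("ACGTAC", ["TTACGTACGG", "CGTA"], k=4)
for window, n in site_counts.items():
    print(window, n)
```

If the promoter is itself contained in the chromosome files, a window that occurs nowhere else will show a count of 1 rather than 0.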
"This sounds great, I want that!"
We have uploaded a final form of these files here. Usage is simple, though the handling is still a bit rudimentary. You will need a working Python 2.7 distribution as well as the folder organization included in the download. The brute force script has a variable in the 9th line of the file that sets the length of the sequences to be checked. It takes every file placed in the "input/chromosomes/" folder, interprets it as a set of characters, and matches the generated sequences along the file content. The promotercheck file takes every input in the "input/promoter/" folder and interprets it as a promoter. The sequences derived from these files are then matched against the files in the "input/chromosomes/" folder as before. If called from a shell, the program gives an approximate time to finish, though this may vary depending on the limit setting. The upload includes the files we used for our matching process, so anyone who is interested can try this out immediately.
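For readers who want to see how such a folder-based input might be read, here is a minimal Python 3 sketch. It is not taken from the uploaded scripts. The function name `load_folder` is hypothetical, and stripping whitespace is an assumption made so that matches can span line breaks. Header lines, if a file has any, are not treated specially here.

```python
import os


def load_folder(folder):
    """Read every regular file in folder and return {filename: contents}."""
    sequences = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):  # skip sub-folders
            with open(path) as handle:
                # Assumption: drop all whitespace so a match can cross lines.
                sequences[name] = "".join(handle.read().split())
    return sequences
```

With the folder layout from the download, `load_folder("input/chromosomes/")` would return one string per chromosome file, and `load_folder("input/promoter/")` one string per promoter file.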