Team:FAU Erlangen/Tour25

The struggle to be unique

To carry out our project, we first needed unique sequences: sequences that are rare or do not appear at all in the yeast strains we use. We were looking for sequences of 10 base pairs that are not present in the genome. The first brute-force attempt would have taken approximately 40 days, counting every one of the 4^10 = 1,048,576 possible combinations throughout the chromosomes. As that was not an option, we needed a faster program that could tell us which sequences are the least common. There were 2 different approaches.

  • Brute force optimization
  • Cutting down to solve the specific problem at hand
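
To see why the first brute-force attempt was so slow, the following minimal sketch enumerates every candidate of length k and scans the whole sequence once per candidate. It is not the team's script: it is written for Python 3, and the names `search_space_size`, `count_overlapping` and `naive_counts` are hypothetical.

```python
from itertools import product

def search_space_size(k):
    """Number of possible sequences of length k over the alphabet A, C, G, T."""
    return 4 ** k

def count_overlapping(needle, haystack):
    """Count occurrences of needle in haystack, including overlapping ones."""
    count = 0
    pos = haystack.find(needle)
    while pos != -1:
        count += 1
        pos = haystack.find(needle, pos + 1)
    return count

def naive_counts(genome, k):
    """Brute force: one full pass over the genome for every candidate."""
    counts = {}
    for letters in product("ACGT", repeat=k):
        candidate = "".join(letters)
        counts[candidate] = count_overlapping(candidate, genome)
    return counts
```

The work grows with the number of candidates times the length of the genome, which is why every extra base pair makes the search roughly four times more expensive.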

Brute force optimization

As the first brute-force attempt was not feasible, we tried to make better use of the computing power at hand. The first version used only one CPU core, which was very slow, since the problem is mainly CPU-bound. We therefore redesigned the code to run in as many processes as there are cores in the current system, which ideally divides the time needed by that number. The system we used has 64 cores, and the change greatly increased the completion speed. It is now possible to search through the complete yeast genome and count the occurrences of every single 10-character sequence within a few days (other people were using the same computing cluster, so the calculation was slowed down at times).

That is still a long time, but it is an improvement. But why stop there? We did not want to wait that long only to get a few good sequences we could use. The next step was to search for shorter sequences, starting from 7 base pairs, and to check whether any of them are unique. It turned out that there were a few good sequences at a length of 8 base pairs, which could then be run through common matching programs to check whether they meet our requirements. A further way to cut down computing time was to stop counting any sequence as soon as it was found more often than a certain threshold, so that the next sequence could be checked right away.
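
The two ideas described here, spreading the candidates over all cores and abandoning a candidate once it exceeds a threshold, can be sketched as follows. This is an illustration under assumptions, not the uploaded script: it uses Python 3 and the standard `multiprocessing` module, and the function names and the `limit` parameter are hypothetical.

```python
import itertools
from functools import partial
from multiprocessing import Pool, cpu_count

def count_with_limit(candidate, genome, limit):
    """Count overlapping occurrences of candidate in genome.

    Returns None as soon as the count exceeds limit, so that common
    sequences are dropped without scanning the rest of the genome.
    """
    count = 0
    pos = genome.find(candidate)
    while pos != -1:
        count += 1
        if count > limit:
            return None  # too common, give up on this candidate
        pos = genome.find(candidate, pos + 1)
    return count

def rare_candidates(genome, k, limit, processes=None):
    """Check all 4**k candidates in parallel and keep those within limit."""
    candidates = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    worker = partial(count_with_limit, genome=genome, limit=limit)
    # One worker process per core unless stated otherwise.
    with Pool(processes or cpu_count()) as pool:
        counts = pool.map(worker, candidates, chunksize=1024)
    return {c: n for c, n in zip(candidates, counts) if n is not None}
```

A call such as `rare_candidates(genome, k=8, limit=0)` would return only the sequences that never occur. It should be placed under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.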

Checking existing sequence set

As our project was designed to target specific places within a part that we would insert ourselves, we decided to check whether the promoter regions of the HIS3 and CAN promoters already contained some good targeting sites. For that purpose, we wrote a second script, which reads the promoter files and cuts them into 10-character sequences. Only those sequences are matched against the chromosomes, which reduces both computing time and cloning steps in the lab. Depending on the promoter length, these scripts take approximately 10 minutes to 1 hour to count everything.
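
A minimal sketch of this promoter-based approach is shown below. It assumes Python 3, overlapping windows, and a position-wise (not reversed) complement, as suggested by the sample output further down; the function names are hypothetical and the real script may differ.

```python
# Position-wise complement: A<->T, C<->G (assumed from the sample output).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def windows(sequence, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def count_overlapping(needle, haystack):
    """Count occurrences of needle in haystack, including overlapping ones."""
    count = 0
    pos = haystack.find(needle)
    while pos != -1:
        count += 1
        pos = haystack.find(needle, pos + 1)
    return count

def promoter_check(promoter, chromosomes, k):
    """Count every promoter window and its complement in all chromosomes."""
    results = []
    for window in windows(promoter, k):
        complement = window.translate(COMPLEMENT)
        hits = sum(count_overlapping(window, c) for c in chromosomes)
        comp_hits = sum(count_overlapping(complement, c) for c in chromosomes)
        results.append((window, hits, complement, comp_hits))
    return results
```

Because only the windows of the promoter are checked instead of all 4^k candidates, the number of genome scans depends on the promoter length rather than on the size of the search space.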

"This sounds great, i want that!"

We have uploaded a final form of these files here. Usage is simple, though the handling is still a bit rudimentary. You will need a working Python 2.7 distribution as well as the folder organization included in the download. Both scripts have a variable in the 9th line of the file that sets the length of the sequences to be checked. The brute-force script takes every file that has been put into the "input/chromosomes/" folder, interprets it as a string of characters, and matches the generated sequences along the file content. The promotercheck script takes every file in the "input/promoter/" folder and interprets it as a promoter; the sequences derived from it are then matched against the files in the "input/chromosomes/" folder as before. If called from a shell, the program prints an approximate time to finish, though this may vary depending on the limit setting. The upload includes the files we used for our matching process, so anyone who is interested can try this out immediately. BEWARE: After starting the execution, you can only wait for the program to finish or force it to stop by pressing Ctrl+Z in the command console. Manually killing every process is possible as well, but depending on your system, the number of processes may be very large!
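
Reading every file in an input folder as one sequence could look roughly like the sketch below. It is written for Python 3, and the helper name `load_sequences` is hypothetical. Skipping FASTA-style header lines and removing line breaks are assumptions; the uploaded scripts may treat the file content differently.

```python
import os

def load_sequences(folder):
    """Read every file in folder and return {filename: sequence}."""
    sequences = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # ignore subfolders
        with open(path) as handle:
            # Assumption: lines starting with ">" are FASTA headers, not sequence.
            lines = [line.strip() for line in handle if not line.startswith(">")]
        sequences[name] = "".join(lines).upper()
    return sequences
```

With the folder layout from the download, `load_sequences("input/chromosomes/")` would then return one entry per chromosome file.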

After the script is done, you will find a file called Complete_result for the script that you executed in the output folder. During execution, several temporary files are generated in the output/temp/ folder, which is emptied when a script finishes and again at the start of each run. This avoids mistakes after an unexpected program shutdown.
The result file of the promotercheck script has the following layout: the name of the current promoter, the length of the sequences that were checked, and then each sequence and its complement, each followed by its number of occurrences in the chromosome set:

Promotercheck File
Promoter: HIS3_promoter.txt
Chosen sequence length: 12


Sequences occurences

CACATAATGAAT 2
GTGTATTACTTA 3

ACATAATGAATT 6
TGTATTACTTAA 7
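
The layout above can be read back into Python with a few lines. This is a sketch for Python 3 that assumes the file always follows the sample shown here, with each sequence line directly followed by its complement line; `parse_promotercheck` is a hypothetical helper and not part of the upload.

```python
SAMPLE = """Promotercheck File
Promoter: HIS3_promoter.txt
Chosen sequence length: 12


Sequences occurences

CACATAATGAAT 2
GTGTATTACTTA 3

ACATAATGAATT 6
TGTATTACTTAA 7
"""

def parse_promotercheck(text):
    """Return (promoter, length, pairs) from a promotercheck result file.

    Each pair is ((sequence, count), (complement, count)).
    """
    promoter, length, rows = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Promoter:"):
            promoter = line.split(":", 1)[1].strip()
        elif line.startswith("Chosen sequence length:"):
            length = int(line.split(":", 1)[1])
        else:
            parts = line.split()
            # Data lines look like "<ACGT string> <integer>".
            if len(parts) == 2 and parts[1].isdigit() and set(parts[0]) <= set("ACGT"):
                rows.append((parts[0], int(parts[1])))
    # Group consecutive lines into (sequence, complement) pairs.
    pairs = [(rows[i], rows[i + 1]) for i in range(0, len(rows) - 1, 2)]
    return promoter, length, pairs
```

Sorting the resulting pairs by the sum of both counts is one simple way to put the rarest targeting sites first.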