Team:UCSC/Software

Software at UCSC


Our software tools can be found on the iGEM GitHub.


F.O.C.U.S


Codon optimization is a technique often used by molecular biologists to increase protein expression yields by increasing the rate of translation for a specific gene of interest. Each organism has a codon usage bias: it prefers certain codons over others to encode a given amino acid. Current methods of codon optimization work by replacing the less frequent codons for a particular amino acid with a more commonly used codon.
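The conventional approach described above can be sketched in a few lines. This is a minimal illustration, not code from FOCUS: the USAGE table holds made-up usage fractions for a handful of codons, and most_frequent_optimize is a hypothetical helper name.

```python
# Hypothetical, partial codon usage fractions (illustrative numbers only).
USAGE = {
    "L": {"CTG": 0.50, "TTA": 0.13, "CTT": 0.10},
    "K": {"AAA": 0.76, "AAG": 0.24},
}
# Reverse lookup: codon -> single-letter amino acid.
CODON_TO_AA = {codon: aa for aa, codons in USAGE.items() for codon in codons}


def most_frequent_optimize(dna):
    """Replace every codon with the most used synonymous codon."""
    optimized = []
    # Walk the sequence one complete codon at a time.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TO_AA[dna[i:i + 3]]
        synonymous = USAGE[amino_acid]
        optimized.append(max(synonymous, key=synonymous.get))
    return "".join(optimized)


print(most_frequent_optimize("TTAAAGCTT"))  # CTGAAACTG
```

Every rare codon is lost in the output, which is exactly the behavior that FOCUS tries to avoid.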

While this method of codon optimization results in increased protein yields, recent evidence suggests that it can alter protein function. We hypothesize that codon frequencies in ORFs model translational speed, matter for protein folding, and should therefore play a major role in codon optimization. For this reason, we created our Frequency Optimized Codon Usage Strategy (FOCUS).

FOCUS provides the end user with a foundation to visualize the rate of protein synthesis as the nascent polypeptide exits the ribosome, granting potential insight into the underlying mechanics of protein folding. We plan to apply FOCUS to better understand the nature of misfolded proteins and the associated diseases, such as cancer and Alzheimer's disease. FOCUS has the potential to identify detrimental single nucleotide polymorphisms that may be pivotal to the encoded protein's fold, and it could serve as a useful tool for protein engineering.


Figure 2: Secondary structure diagram of the Zymomonas mobilis pyruvate decarboxylase with rare codons indicated as gold stars. The three domains are labelled with Roman numerals starting from the N-terminus. Helices are indicated in red and strands in blue.

We wanted our FOCUS program to codon optimize a protein sequence by matching the translational speed as much as possible. We wanted the genetic sequence to induce 'stalls' during translation by using a rare codon from the target organism's codon bias at specific locations. To place these stalls at the correct locations, we looked at the nucleotide sequence for the protein of interest from the native organism.
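One way to read "matching the translational speed" is sketched below. For each codon in the native gene, the sketch picks the synonymous codon whose relative frequency in the target organism is closest to the native codon's relative frequency in the native organism, so rare codons stay rare. The NATIVE and TARGET tables are made-up numbers and harmonize is a hypothetical name; the scoring that FOCUS actually uses may differ.

```python
# Hypothetical relative codon frequencies per amino acid (illustrative only).
NATIVE = {"L": {"CTG": 0.50, "TTA": 0.05}, "K": {"AAA": 0.70, "AAG": 0.30}}
TARGET = {"L": {"CTG": 0.08, "TTA": 0.55}, "K": {"AAA": 0.40, "AAG": 0.60}}
CODON_TO_AA = {codon: aa for aa, codons in NATIVE.items() for codon in codons}


def harmonize(dna):
    """Choose target codons whose frequency mirrors the native codon's."""
    result = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TO_AA[codon]
        native_freq = NATIVE[amino_acid][codon]
        candidates = TARGET[amino_acid]
        # Closest relative frequency in the target organism wins.
        best = min(candidates, key=lambda c: abs(candidates[c] - native_freq))
        result.append(best)
    return "".join(result)


print(harmonize("TTAAAACTG"))  # CTGAAGTTA
```

In this toy example the rare native codon TTA becomes CTG, which is the rare leucine codon in the target table, so the intended stall is preserved.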

  • This program has been tested on Unix-based operating systems (Mac OS, Linux).
  • When using IDLE, the script needs to be altered and the input file names need to be explicitly written. Variables that need to be hardcoded are commented in the FOCUS script.
  • The following program files need to be in the same directory for FOCUS to work:
    • commandLine.py
    • fastaReader.py
    • FOCUS.py
    • GNUplot.py
    • parseTasser.py
  • The input files need to be in the same directory as the FOCUS script.
  • The codon bias table is generated using CodonBiasGenerator. However, you may use another tool to create the table, as long as it produces a GCG codon bias table.
  • Secondary structure prediction should be in I-TASSER format. Our team used the tool PSSpred to generate secondary structure predictions.

Class F.O.C.U.S

Attributes

  • rnaCodonTable: Dictionary with RNA codons as keys and the single letter amino acid as the value.
  • DnaCodonTable: Dictionary with DNA codons as keys and the single letter amino acid as the value.
  • GCG: File name for the codon bias table given to the FOCUS object.
  • CodonMap: Dictionary with the amino acid as the key and a list of lists as the value. The entries in the inner list are codon, relative frequency, and percent per thousand.

Functions

  • OpenFile(seqFile): Opens a file and raises an error if it does not exist. seqFile is the name of the text file.

  • CodonToAmino(codon): Converts a DNA codon to the amino acid it encodes and returns the single amino acid letter.
  • MakeDict(): Fills CodonMap dictionary using the codon bias table.
  • SortDict(dictionary): For each amino acid in the CodonMap dictionary, the list of lists is sorted by the relative codon frequency, with the highest frequency being first.
  • getScore(codon): Returns the relative codon score for a particular codon.
  • printer(protein, values): Returns a string that has been formatted for printing to a file or terminal, where protein is a protein sequence string and values is a FOCUS score string for the same protein.
  • SSprinter(protein, values, SSlist): Returns a string that has been formatted for printing to a file or terminal, much like printer() but with the addition of secondary structure prediction.
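The CodonMap structure and the roles of MakeDict(), SortDict() and getScore() can be sketched as follows. This is not the FOCUS source. The five-column layout (amino acid, codon, count, per thousand, fraction) is an assumption about the GCG table, the sample numbers are invented, and the function names are hypothetical stand-ins. The sketch also keys the map by the three-letter code found in the file rather than the single letter.

```python
from collections import defaultdict

# Assumed GCG-style layout: AmAcid  Codon  Number  /1000  Fraction
SAMPLE_TABLE = """\
AmAcid  Codon  Number  /1000  Fraction
Gly     GGG    100.00  10.00  0.10
Gly     GGC    600.00  60.00  0.60
Gly     GGT    300.00  30.00  0.30
Lys     AAG    250.00  25.00  0.25
Lys     AAA    750.00  75.00  0.75
"""


def make_codon_map(table_text):
    """Build {amino acid: [[codon, fraction, per_thousand], ...]}."""
    codon_map = defaultdict(list)
    for line in table_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header, and anything that is not a data row.
        if len(fields) != 5 or fields[0] == "AmAcid":
            continue
        amino, codon, _count, per_thousand, fraction = fields
        codon_map[amino].append([codon, float(fraction), float(per_thousand)])
    return dict(codon_map)


def sort_codon_map(codon_map):
    """Sort each inner list by relative frequency, highest first."""
    for entries in codon_map.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return codon_map


def get_score(codon_map, codon):
    """Return the relative frequency stored for one codon."""
    for entries in codon_map.values():
        for name, fraction, _per_thousand in entries:
            if name == codon:
                return fraction
    raise KeyError(codon)


codon_map = sort_codon_map(make_codon_map(SAMPLE_TABLE))
print(codon_map["Gly"][0])  # most frequent glycine codon comes first
```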

Class parseTasser

Attributes

  • TASSER: filename for the secondary structure prediction in I-TASSER format.
  • trans: Dictionary to translate the number for a secondary structure to the corresponding character, i.e., '1' corresponds to a coil, '2' corresponds to a helix, and '4' corresponds to a sheet.

Functions

  • OpenFile(file): Opens the file and prints an error message if the file is not found.
  • returnList(): Returns a list with the first entry being the secondary structure string and the second entry being the confidence for the secondary structure prediction.
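A rough sketch of what returnList() produces is shown below. The column layout of the prediction file (residue index, residue name, structure code, confidence) and the characters C, H and E are assumptions made for illustration; only the codes 1, 2 and 4 come from the description above.

```python
# Structure codes from the description above; the letters are assumed.
TRANS = {"1": "C", "2": "H", "4": "E"}

# Made-up prediction rows: index, residue, structure code, confidence.
SAMPLE_PREDICTION = """\
    1   MET    1   9
    2   LYS    2   7
    3   ALA    2   8
    4   VAL    4   6
"""


def return_list(text, trans=TRANS):
    """Return [structure string, confidence string] for a prediction."""
    structure, confidence = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip headers or blank lines
        structure.append(trans[fields[2]])
        confidence.append(fields[3])
    return ["".join(structure), "".join(confidence)]


print(return_list(SAMPLE_PREDICTION))  # ['CHHE', '9786']
```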

Class GNUplot

Attributes

  • proteinSeq: Protein sequence string
  • freqSeq: FOCUS score string
  • SSpred: List with a string for the secondary structure prediction as the first entry.
  • Thousand: List with entries corresponding to the percent per thousand. The first character in proteinSeq corresponds to the first entry in the Thousand list.
  • reverseTrans: Dictionary to convert a secondary structure character to a number that can be used in GNUplot.

Functions

  • makeTable(): Prints a table that can be used in GNUplot.
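As a hedged illustration of makeTable(), the sketch below lines up the attributes listed above into whitespace-separated columns, which is a layout gnuplot can read. The column order, the header comment and the letters in REVERSE_TRANS are assumptions, and make_table is a hypothetical name.

```python
# Assumed inverse of the parseTasser translation dictionary.
REVERSE_TRANS = {"C": 1, "H": 2, "E": 4}


def make_table(protein_seq, freq_seq, ss_pred, thousand,
               reverse_trans=REVERSE_TRANS):
    """Return one row per residue: position, residue, score, /1000, structure."""
    # gnuplot ignores lines that start with '#'.
    rows = ["# position residue score per_thousand structure"]
    columns = zip(protein_seq, freq_seq, ss_pred[0], thousand)
    for position, (residue, score, ss, per_k) in enumerate(columns, start=1):
        rows.append("{} {} {} {} {}".format(
            position, residue, score, per_k, reverse_trans[ss]))
    return "\n".join(rows)


print(make_table("MK", "93", ["HC"], [25.0, 75.0]))
```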

sequenceAnalysis


1.1 What is it for?

The sequenceAnalysis program is a tool for efficient, large scale analysis of amino acid sequences. It works very much like the ProtParam tool available on ExPASy. However, unlike ProtParam, this program can read multiple proteins and compute their physical and chemical properties all at once.

1.2 Program Specifications

This program was written in Python 3.4.3, the most recent version of the Python programming language at the time of writing. It is designed to work as a module, which can be imported into any program to access all of its functions or only the specific ones that the user requires. An added benefit of the modular design is that it can be easily edited to provide further functionality. At this point, the program is made up of two classes: ProteinParam and FastAreader.

The ProteinParam class was written by UCSC iGEM team members Cristian Camacho, Jairo Navarro, and Raymond Bryan. It was developed in the course BME 160: Research Programming in the Life Sciences, and serves as the backbone for two other programs that are being submitted: CodonBiasGenerator and FOCUS.

2.1 Attributes

  • aa2mw : A dictionary of the molecular weights of all 20 amino acids
  • mwH20 : A float value corresponding to the molecular weight of water
  • aa2abs280 : Dictionary of the absorbance values of Tyrosine, Tryptophan and Cysteine at a wavelength of 280 nm.
  • aa2chargePos : Dictionary of the positive charge values of Lysine, Arginine and Histidine
  • aa2chargeNeg : Dictionary of the negative charge value of Aspartic Acid, Glutamic Acid, Cysteine and Tyrosine.
  • aaNterm : Float value used for the N-terminus in the charge calculation
  • aaCterm : Float value used for the C-terminus in the charge calculation
  • validAA : An empty dictionary which will contain the counts of valid amino acids in a specific protein sequence.

2.2 Methods

  • aaCount( ) : Iterates through the amino acid sequence and returns a single integer count of valid amino acid characters found.
  • aaComposition ( ) : Returns the validAA dictionary with the valid amino acids and their counts for a specific protein.
  • pI ( ) : Estimates the theoretical isoelectric point of a protein by iterating over candidate pH values until it finds the one that results in a net charge closest to zero.
  • charge ( ) : Calculates the net charge at a particular pH, using the pKa of each charged amino acid and of the N-terminus and C-terminus.
  • molarExtinction ( ) : Estimates the molar extinction coefficient based on the number and extinction coefficients of tyrosines, tryptophans, and cysteines.
  • massExtinction ( ) : Calculates the mass extinction by dividing the molar extinction value by the molecular weight of the corresponding protein.
  • molecularWeight ( ) : Calculates a protein's molecular weight by summing the weights of the individual amino acids and excluding the waters that are released with peptide bond formation.
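The relationship between charge() and pI() can be sketched as below. This is an illustration under stated assumptions, not the module's code. The pKa values are typical textbook-style numbers chosen for the example, the 0.01 pH step is arbitrary, and the functions take the sequence as an argument instead of living on a ProteinParam object.

```python
# Assumed pKa values for illustration; the module's own tables may differ.
POS_PKA = {"K": 10.5, "R": 12.4, "H": 6.0}
NEG_PKA = {"D": 3.86, "E": 4.25, "C": 8.33, "Y": 10.0}
N_TERM_PKA = 9.69
C_TERM_PKA = 2.34


def charge(sequence, pH):
    """Net charge at a given pH from the fraction of each group ionized."""
    positive = 10 ** N_TERM_PKA / (10 ** N_TERM_PKA + 10 ** pH)
    for aa, pka in POS_PKA.items():
        positive += sequence.count(aa) * 10 ** pka / (10 ** pka + 10 ** pH)
    negative = 10 ** pH / (10 ** C_TERM_PKA + 10 ** pH)
    for aa, pka in NEG_PKA.items():
        negative += sequence.count(aa) * 10 ** pH / (10 ** pka + 10 ** pH)
    return positive - negative


def pI(sequence):
    """Scan pH 0.00 to 14.00 and keep the pH with net charge nearest zero."""
    candidates = [step / 100 for step in range(0, 1401)]
    return min(candidates, key=lambda pH: abs(charge(sequence, pH)))


for seq in ("DDDD", "KKKK"):
    print(seq, pI(seq), round(charge(seq, pI(seq)), 3))
```

Because the net charge falls steadily as pH rises, a binary search over the same range would find the same point with far fewer charge evaluations; the linear scan is kept here because it mirrors the description above.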

Class FastAreader

This class was developed by Professor David Bernick of UC Santa Cruz for the upper-division course BME 160: Research Programming in the Life Sciences. It is what allows the sequenceAnalysis module to read and calculate the characteristics of multiple protein sequences at the same time, as long as they are in the FASTA format.

3.1 Attributes

  • fname : The initial file name to be read by FastAreader.

3.2 Methods

  • doOpen ( ) : Checks if a file name was given to FastAreader and, if not, reads from standard input (sys.stdin) instead. This function provides command line usability.
  • readFasta ( ) : Using the file name given in init, returns each included FASTA record as two strings: header and sequence. If a file name is not provided, standard input (sys.stdin) is used. Whitespace is removed, and no other adjustment is made to the sequence contents. The initial '>' is removed from the header.
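The record-by-record behavior described for readFasta() can be sketched with a small generator. This is not Professor Bernick's implementation. read_fasta is a hypothetical stand-alone function that accepts any open text handle, so sys.stdin can be passed when no file name is available.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading '>'
        elif header is not None:
            # Remove any internal whitespace but leave the residues untouched.
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


demo = io.StringIO(">seq1 demo protein\nMK V\nLL\n>seq2\nAA\n")
for head, seq in read_fasta(demo):
    print(head, seq)
```

To read from the command line instead, the same function could be called as read_fasta(sys.stdin) after importing sys.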








Using the sequenceAnalysis module is fairly simple. Follow these steps to use it in your programs:

  1. Download the file named sequenceAnalysis from either the UCSC iGEM 2015 wiki, or the 2015 iGEM GitHub.
  2. Important: save the file in the same directory as the script that you are writing.
  3. Include the line “import sequenceAnalysis” at the top of your new program, before any other code.
  4. To use a function from either the ProteinParam class or the FastAreader class, you must first create an object of that class.
  5. You can then use that object to access any of the functions available in that class.

Below is an example program which illustrates the above steps and explains how to access one or more of the functions from ProteinParam.

4.1 Ex) pIFinder.py

Figure 1: This program is known as pIFinder, which specifically utilizes the pI ( ) method of the class ProteinParam and the FastAreader class to calculate and print the isoelectric point of given protein sequences with their respective headers.

Notice that in the red box there is the line “import sequenceAnalysis”, which signifies that the capabilities of sequenceAnalysis are now available to your program. Also notice that the lines the two arrows point to are responsible for creating a FastAreader object and a ProteinParam object.
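Because the figure is an image, the same pattern is sketched here in text. It is a reconstruction under assumptions rather than the original pIFinder.py. It assumes that FastAreader takes the file name as its constructor argument, that ProteinParam takes a protein sequence string, and that pI() returns a number. The helper name format_pI_report is hypothetical.

```python
import importlib.util


def format_pI_report(records, param_factory):
    """Return one 'header<TAB>pI' line per (header, sequence) record."""
    lines = []
    for header, sequence in records:
        protein = param_factory(sequence)  # e.g. a ProteinParam object
        lines.append("{}\t{:.2f}".format(header, protein.pI()))
    return lines


def main(fasta_name="sequenceAnalysisTest.txt"):
    # sequenceAnalysis.py must sit in the same directory as this script.
    import sequenceAnalysis
    reader = sequenceAnalysis.FastAreader(fasta_name)  # assumed signature
    records = reader.readFasta()
    for line in format_pI_report(records, sequenceAnalysis.ProteinParam):
        print(line)


if __name__ == "__main__":
    if importlib.util.find_spec("sequenceAnalysis") is None:
        print("sequenceAnalysis.py was not found next to this script")
    else:
        main()
```

Keeping the formatting in its own function means it can be checked with any object that offers a pI() method, without needing the real module.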

 

To test whether your program is working correctly, we have provided a test file that includes eleven FASTA-formatted protein sequences. This test file can be found on the 2015 UCSC iGEM wiki under the name “sequenceAnalysisTest.txt”. Make sure that this file is also saved in the same directory as sequenceAnalysis and the program that you are writing. Whether you hardcode the file name into your code or submit it via the command line, make sure that it matches the name of the file exactly.


Figure 2: The results for the test file.

Codon Bias Generator


1.1 What is it for?

The CodonBiasGenerator program is a tool for producing an easily parseable codon bias table that can be read, analyzed, and manipulated according to the user's needs. It works very much like the codon usage frequency tables found on GenScript and Kazusa. However, unlike those tools, CodonBiasGenerator accepts any gene sequence as input and returns its output as a dictionary file, which is easier to parse. This file can then be used as a standard text file that supports string and data manipulation.

1.2 Program Specifications

This program was written in Python 3.4.3, the most recent version of the Python programming language at the time of writing. It is designed to work with sequenceAnalysis.py as an imported module, using the FastAreader class as a sequence reader. The program's class is CodonFreq. It has three separate methods, each of which produces its own calculated variable.

This class was written by UCSC iGEM team members Raymond Bryan, Jairo Navarro, and Cristian Camacho. It was developed in the course BME 188A/B: Synthetic Biology Lab, and works in conjunction with sequenceAnalysis.py.

2.1 Attributes

  • rnaCodonTable: Dictionary that maps each RNA codon to its amino acid (AA).
  • dnaCodonTable: The same table as rnaCodonTable, with the RNA codons converted to DNA codons.
  • CodonCount: Nested dictionary organized as AA → codon → value, where the value is the number of times the codon occurs in the input genome.
  • CodonFrequen: Nested dictionary organized as AA → codon → value, where the value is the usage frequency of the codon as a float, calculated as codonCount / total number of times the AA occurs.
  • CodonPerThou: Nested dictionary organized as AA → codon → value, where the value is the codon frequency per thousand, calculated as (codonCount / total number of codons) * 1000.

2.2 Methods

  • readSeq(): Reads each sequence, matches every codon against the codons listed in the dictionary, and adds to that codon's count. The result is a dictionary holding the total number of times each codon occurs.
  • TableMaker(): Uses the counts gathered by readSeq() to determine how many times each AA is encoded in the sequence by any of its possible codons, and stores these totals in a separate dictionary. This dictionary is then used to calculate the codon usage frequency, which is also stored as a dictionary.
  • PercentPerThou(): Uses the counts gathered by readSeq() to determine the total number of codons in the genome. This total, together with the individual codon counts, is then used to calculate the codon percent per thousand, which is also stored as a dictionary.
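The three calculations above (count, usage frequency, and per thousand) can be sketched together. This is a simplified illustration rather than the CodonFreq class. The codon table is deliberately partial, codon_bias is a hypothetical function name, and the three values are stored side by side instead of in three separate dictionaries.

```python
from collections import Counter

# Partial DNA codon table, just enough for the demo sequence below.
DNA_CODON_TABLE = {"AAA": "K", "AAG": "K", "GGC": "G", "GGT": "G", "GGG": "G"}


def codon_bias(sequences, table=DNA_CODON_TABLE):
    """Return {AA: {codon: {count, frequency, per_thousand}}}."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in table:  # ignore codons that are not in the table
                counts[codon] += 1

    total_codons = sum(counts.values())
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[table[codon]] += n

    bias = {}
    for codon, n in counts.items():
        aa = table[codon]
        bias.setdefault(aa, {})[codon] = {
            "count": n,
            "frequency": n / aa_totals[aa],          # share among synonyms
            "per_thousand": n / total_codons * 1000,  # share of all codons
        }
    return bias


print(codon_bias(["AAAAAGAAAGGC"])["K"]["AAA"])
```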

CodonBiasGenerator can be used in either IDLE or CMD. Follow these steps to use it on your system:

  1. Download both files named sequenceAnalysis.py and CodonBiasGenerator.py from either the UCSC iGEM 2015 wiki, or the 2015 iGEM GitHub.
  2. Important: save both files in the directory you currently work in with Python 3.4.
  3. Open your favorite command line and change to your working Python directory.
  4. Run CodonBiasGenerator.py and enter your gene FASTA file.
  5. Witness python sorcery magic.

Below is an example program which illustrates the above steps and explains how to move further.