Difference between revisions of "Team:Vanderbilt/Software/Development"
Line 272: | Line 272: | ||
<div class="figure"> | <div class="figure"> | ||
− | <p><img src="https://static.igem.org/mediawiki/2015/c/c5/Vanderbilt_gui.png" alt="Example of Sequence GUI" height="216" width="384"/> | + | <p><a href="https://2015.igem.org/File:Vanderbilt_gui.png"><img src="https://static.igem.org/mediawiki/2015/c/c5/Vanderbilt_gui.png" alt="Example of Sequence GUI" height="216" width="384"/></a> |
</p> | </p> | ||
</div> | </div> |
Revision as of 01:45, 19 September 2015
https://github.com/cosmicexplorer/mutation-optimizer
Abstract
The wetware team devised multiple methods to identify and remove what they called "mutation hotspots," or specific sequences of DNA known to be susceptible to certain types of randomized damage. Through a set of experiments, the team demonstrated that these methods of removing hotspots reduced the frequency of mutation in the parts produced. We denoted the process of removing these mutation hotspots to produce a more stable part "mutation optimization." Jarrod Shilts wrote a Python script to automatically perform this operation, which was used in the wetware team's experimentation. The software portion of this project entailed porting the optimization script from Python to JavaScript, and producing a library, console application, and publicly accessible web application.
Web Application
Like the wetware team, the software portion of this project is organized into both sequence and circuit optimization methods. The sequence optimization section presents a simple interface with smart defaults. An example is shown below.
The user pastes or drag-and-drops the original sequence into the box on the left, and the optimized sequence appears on the right. The optimized sequence is heavily annotated as shown with all of the changes the optimizer made while it worked. A variety of parameters are available to the user, detailed in the "Documentation" link at the top.
The application takes a "plug-and-play" approach, where knowledge of the underlying principles isn't required to effectively operate it. Because of the genericity of the mutation hotspots identified by the software, it operates well on a large variety of inputs without any user interaction. The capabilities of the software are more thoroughly discussed in the wetware team's documentation.
The circuit optimization kit takes a similar approach, applied to higher-level parts instead of individual sequences. The user inputs a list of part numbers, and the software will suggest replacements for parts which conflict to cause a higher probability of mutation. More documentation about the capabilities of the circuit optimization interface is available on the wetware team's page. An example of the circuit optimization interface is shown below.
Registry Mining
While the wetware team demonstrated that optimizing a sequence reduced its frequency of mutation, the question was left outstanding whether many parts actually could be optimized. To determine this, we mined the Parts Registry and ran our optimization algorithm on a large number of parts. Our results are available on the Stats page. As discussed above, this required creating a wrapper over the Parts Registry, which we released as a separate project to be used by other teams.
Supplementary Projects
We chose to translate the code to JavaScript so that we could create an interactive web application for scientists, while keeping an extensible command-line application and software library open for other developers, using the same codebase. JavaScript is also much faster than Python in many sequence operations, which reduces the significant time required to optimize a part. To support this effort, we created and released a generic library of optimization tools for JavaScript applications (see DISCOTANGO).
In addition to creating user-facing applications, we also collected statistics on the efficacy of the mutation optimization scheme on publicly available biobricks. This required mining the registry for part information, which was a challenge in itself. We wrote a wrapper to query the Parts Registry for part information and used it to collect bulk statistics on the Parts Registry's content (see biobrick-api-wrapper).
DISCOTANGO
https://github.com/cosmicexplorer/DISCOTANGO
While translating our code to JavaScript had many advantages discussed above, it also came with challenges, since JavaScript's canon of open scientific libraries are less well-developed than Python's. While developing the mutation optimization software, we realized we could turn the codon assignment problem into a generic discrete search problem with a discontinuous cost function. While we initially employed a naïve bounded local search scheme to optimize sequences, we delved into multiple types of generic search schemes for optimal codon assignments. We decided it would be appropriate to separate this project from our other projects and release it as its own generic search library, which we labelled DISCOTANGO (the DISCOntinuous Toolkit for Aquisition of Neighboring Global Optimizations). We are still working on finding the most optimal search scheme for our codon assignment problem. Instructions and examples are found on this project's github page.
biobrick-api-wrapper
https://github.com/cosmicexplorer/biobrick-api-wrapper
The Parts Registry's public REST API is only partially functional, and the data it does offer often needs to be cleaned up before use. While mining the Parts Registry (see Registry Mining), we developed a variety of methods to get and clean data from the Parts Registry. We realized this could be quite useful for a variety of tasks outside of just mutation optimization, and decided to release it as a separate library. Instructions for use are found on this project's github page.
Ongoing Research
Wetware Research
Now that we've demonstrated the soundness of the sequence optimization procedure, there naturally arises the question of whether we can improve it. Work is being done on the wetware side to further understand the relationship between mutation hotspot density and mutation rate. In addition, we are also investigating the viability of allowing conservative amino substitutions in our optimized sequences.
Hotspot Modelling
We are currently modelling the rate of mutation as a linear transformation of the counts of each type of mutation hotspot, assigning a "weight" to each count and summing the weighted counts to produce an overall mutation expectancy score, which is what our algorithm optimizes for. The weights assigned to each count are somewhat arbitrary, reflecting our intuition of their relative importance. While experimentation shows this was an effective approach, there is a strong chance that a more empirical method could be developed to better model the effect of each mutation hotspot count on actual rates of mutation. Linear functions may not be the best approach to model hotspot effects, for example, and we have not performed any analysis of whether the presence of multiple types of hotspots in the same region of a sequence affects mutation frequency differently than single types of hotspots. Unfortunately, this is the most time-intensive method of improvement, and we expect a a large number of experiments will be required to form a more nuanced theory of how mutation hotspots affect real mutation frequency.
Conservative Substitutions
As discussed in the project background, the mutation optimization scheme relies solely on homologous transformations to perform mutation optimization, to ensure sequence stability. The first method we thought of to improve the results was to allow conservative amino acid substitutions, and we are currently working to quantify the expected instability introduced by non-homologous transformations, to determine whether these substitutions would actually reduce the overall stability of the sequence. Like our ongoing research into appropriate weighting of hotspot types, this will also require intensive laboratory work, and possibly advanced chemical modelling.
Software Optimization Schemes
As mentioned in our discussion of DISCOTANGO, the mutation optimization problem can be easily modelled as a search problem in a discrete state space, where the objective function is the weighted sum of mutation hotspot counts as discussed in Wetware Research, and the variable assignments are possible codons (homologous transformations) translating to a given amino acid. The current algorithm is a greedy bounded local search scheme, which works effectively, we believe, due to the relatively short length of sequences that denote a mutation hotspot. However, we are currently studying more effective optimization methods in two distinct ways: traditional search algorithms, and purpose-built substring recognition algorithms.
Search Algorithms
Because we already have a good initial assignment of variables (the canonical sequence) to start from, the model of the mutation optimization problem discussed above lends itself to attack by a variety of well-known local search algorithms. DISCOTANGO provides implementations of simulated annealing, random-restart hill climbing, k-beam search, and genetic programming. We are also investigating whether we can model the problem as a Stackelberg game as described here. The state space (about 3-5 possible codon assignments per amino, so about 41000 for a 1Kb sequence) is large, but since the cost function is highly discontinuous and relies upon stringing together sequences of amino acids, we believe optima can be found with a relatively small number of substitutions. We are currently developing and tuning these algorithms to compare their relative efficacy.
Substring Recognition
While we began approaching mutation optimization as a satisfiability or optimization problem and using stochastic tools to solve it, we have recently realized that since all hotspot counts are essentially just instances of substring recognition, a possibility exists for a polynomial algorithm to generate the globally optimal codon assignment for a given sequence of amino acids. Even the homologous repeat count, the most complex count to calculate, relies upon substring matching at its heart. Discovery of such an algorithm would allow us to reliably optimize sequences of almost arbitrary length in a reasonable amount of time. We will develop or prove the nonexistence of such an algorithm alongside our other optimization research.