Vanderbilt iGEM 2015

Sequences

To estimate the expected change in a DNA sequence, we used the statistical method of Markov chains. Put simply, a Markov chain is a stochastic process that describes how a system changes over time, with each change governed by fixed probabilities. The model of DNA mutation used here considers only single-point substitutions, because these are the easiest to model and are much more likely under UV radiation.

A Markov chain starts from a sample space, the set of all possible states that the system can be in. The sample space need not be finite, although in our case it is. An example of a Markovian sample space is the set of all dates in September: at any time it is exactly one date in September. Next, a Markov chain uses what are called transition probabilities. The transition probability p_ij is the probability that the system moves from state i to state j in one step. Every transition probability is non-negative, and the probabilities of leaving any given state sum to 1. In the September example, the probability of moving from Sept 23rd to Sept 24th is 100%, but the probability of moving from Sept 13th to Sept 28th is 0% (as it is for every pair of non-consecutive dates). These transition probabilities are usually stored in a matrix called the transition matrix, whose ij-th entry is the transition probability from state i to state j. With this transition matrix, we can take full advantage of the power of Markov chains.
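As a minimal sketch, the September example can be written as a 30 x 30 transition matrix. The function names and the treatment of Sept 30 (it is made to stay put, so that its row still sums to 1) are assumptions for illustration and are not part of the original model.

```python
# The "days of September" chain as a 30 x 30 transition matrix.
N_DAYS = 30

def september_matrix():
    """Return P where P[i][j] is the probability of moving from day i+1 to day j+1."""
    P = [[0.0] * N_DAYS for _ in range(N_DAYS)]
    for i in range(N_DAYS - 1):
        P[i][i + 1] = 1.0  # each day is followed by the next day with certainty
    # Assumption: Sept 30 maps to itself so that every row sums to 1.
    P[N_DAYS - 1][N_DAYS - 1] = 1.0
    return P

def step(dist, P):
    """Advance a probability distribution over states by one step (dist times P)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

P = september_matrix()

start = [0.0] * N_DAYS
start[23 - 1] = 1.0           # we are certain that it is Sept 23
after_one = step(start, P)    # all probability is now on Sept 24

print(P[23 - 1][24 - 1])  # Sept 23 -> Sept 24: 1.0
print(P[13 - 1][28 - 1])  # Sept 13 -> Sept 28: 0.0
```

Multiplying a distribution by the matrix repeatedly gives the distribution after any number of steps, which is the property the DNA models rely on.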

We used two well-known models of single-point DNA substitution: the K80 model and the HKY85/TN93 model. Each model is characterized by the assumptions it makes. The K80 model accounts for the fact that transitions and transversions happen at different rates, but it does not distinguish between the individual transitions or between the individual transversions. (Here a transition is a substitution between the two purines or between the two pyrimidines, and a transversion is a substitution between a purine and a pyrimidine.) The K80 model therefore has three parameters: time, a transition rate, and a transversion rate. It is more accurate than a model with a single substitution rate, both because it takes more factors into account and because it is truer to real-life phenomena. The major assumption of the K80 model is that all transitions occur at one rate and all transversions occur at another.
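The K80 transition probabilities have a well-known closed form, sketched below. The parameterization is an assumption: `alpha` is taken to be the transition rate and `beta` the rate of each of the two possible transversions, and the numeric values are purely illustrative rather than fitted to any mutation source.

```python
import math

BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(x, y):
    """True for A<->G or C<->T (both purines or both pyrimidines)."""
    return x != y and ((x in PURINES) == (y in PURINES))

def k80_matrix(t, alpha, beta):
    """K80 substitution probabilities after time t, as P[from_base][to_base]."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2   # base is unchanged
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2     # the single possible transition
    p_tv = 0.25 - 0.25 * e1                # each of the two transversions
    P = {}
    for x in BASES:
        P[x] = {}
        for y in BASES:
            if x == y:
                P[x][y] = p_same
            elif is_transition(x, y):
                P[x][y] = p_ts
            else:
                P[x][y] = p_tv
    return P

# Illustrative (hypothetical) values: transitions four times faster than transversions.
P_k80 = k80_matrix(t=1.0, alpha=0.2, beta=0.05)
for base in BASES:
    print(base, [round(P_k80[base][other], 4) for other in BASES])
```

At `t = 0` the matrix is the identity, and as `t` grows every entry approaches 0.25, which reflects the K80 assumption of equal base frequencies.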

The second model used for single-point DNA substitutions is the HKY85/TN93 model. It combines the simplest model with the K80 model to give an even more realistic description of DNA mutation. The HKY85 model keeps the K80 assumption that transitions and transversions occur at different rates, while the individual substitutions within each class share a common rate parameter. What makes this model more accurate is the assumption that the four bases do not occur at equal frequencies, so that the different substitutions of a particular base do not occur at the same rate. For instance, the rate at which a Thymine is replaced by an Adenine depends in part on how frequent Adenine is in the sequence. This makes the model more robust, and also more accurate in the sense that it agrees more closely with empirical evidence.
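A minimal sketch of the HKY85 rate matrix follows, assuming the common parameterization in which the rate of a substitution is proportional to the frequency of the target base, multiplied by a ratio `kappa` when the substitution is a transition. The base frequencies, `kappa`, and the simple numerical approximation of the matrix exponential are illustrative choices, not values or methods taken from the original model.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def hky85_rate_matrix(pi, kappa):
    """Build the 4 x 4 HKY85 rate matrix Q (rows and columns ordered as BASES)."""
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i == j:
                continue
            transition = (x in PURINES) == (y in PURINES)
            Q[i][j] = (kappa if transition else 1.0) * pi[y]
        Q[i][i] = -sum(Q[i])  # each row of a rate matrix sums to zero
    return Q

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probabilities(Q, t, squarings=20):
    """Approximate P(t) = exp(Q t) as (I + Q t / 2^k) raised to the power 2^k."""
    n = len(Q)
    scale = t / (2 ** squarings)
    P = [[(1.0 if i == j else 0.0) + Q[i][j] * scale for j in range(n)]
         for i in range(n)]
    for _ in range(squarings):
        P = matmul(P, P)  # repeated squaring
    return P

# Hypothetical base frequencies and transition/transversion ratio.
pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = hky85_rate_matrix(pi, kappa=2.0)
P_short = transition_probabilities(Q, t=0.1)
P_long = transition_probabilities(Q, t=50.0)  # rows approach the base frequencies
```

Over long times each row of the probability matrix tends to the base frequencies in `pi`, which is how the unequal-frequency assumption shows up in the mutated sequence.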

DNA mutation is a stochastic process, meaning that it is more or less random. To simulate a change, the transition probability matrix is applied to the sequence of interest, and the resulting base frequencies are examined. For example, if there is a 37% chance that an Adenine is replaced by a Thymine, then 37% of the times that the transition probability matrix is applied to an Adenine, it will be replaced by a Thymine. The matrix is not applied to the sequence as a whole, but to individual bases of the sequence chosen in a stochastic fashion, which adds a second layer of randomness to the model. At every time point, a random set of base locations is chosen, and the Markov chain is applied to the bases at those locations, producing a continually mutating DNA sequence. The parameters of the transition probability matrix can be tuned to different mutation rates depending on the source of the mutation (UV radiation, free radical oxidation, etc.). The mutated sequence is then compared with the optimized sequence to measure how much the sequence has changed.
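The procedure described above can be sketched as follows. The transition probabilities, the example sequence, the number of sites per time point, and the helper names are all hypothetical placeholders. In practice the matrix would come from a substitution model such as K80 or HKY85.

```python
import random

BASES = "ACGT"

# Hypothetical one-step transition probabilities, P[from_base][to_base].
P = {
    "A": {"A": 0.90, "C": 0.02, "G": 0.06, "T": 0.02},
    "C": {"A": 0.02, "C": 0.90, "G": 0.02, "T": 0.06},
    "G": {"A": 0.06, "C": 0.02, "G": 0.90, "T": 0.02},
    "T": {"A": 0.02, "C": 0.06, "G": 0.02, "T": 0.90},
}

def mutate_step(seq, P, n_sites, rng):
    """Pick n_sites random positions and resample each base from its row of P."""
    seq = list(seq)
    for pos in rng.sample(range(len(seq)), n_sites):
        row = P[seq[pos]]
        weights = [row[b] for b in BASES]
        seq[pos] = rng.choices(list(BASES), weights=weights)[0]
    return "".join(seq)

def percent_changed(original, mutated):
    """Percentage of positions at which the two sequences differ."""
    diffs = sum(a != b for a, b in zip(original, mutated))
    return 100.0 * diffs / len(original)

rng = random.Random(0)                                # fixed seed for repeatability
original = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # hypothetical sequence
current = original
history = []                                          # percent changed at each time point
for _ in range(50):
    current = mutate_step(current, P, n_sites=3, rng=rng)
    history.append(percent_changed(original, current))

print(history[-1])
```

Because a chosen base may be resampled to itself, or mutate back to its original value later, the percentage changed is not guaranteed to rise at every single step even though it tends to increase over time.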

The Markov chain method is also used to examine functional changes in a DNA sequence. The Markov chain is applied to the sequence in the same way as when measuring homology, but the mutated sequence is then translated into its amino acid sequence. This amino acid sequence is compared with the amino acid sequence of the original DNA. A matrix of amino acid similarities is used to distinguish conservative from non-conservative changes. The mutated amino acid sequence is scored against the original in a way similar to the homology measurement, and the resulting number measures the functional difference between a mutated strand and the non-mutated strand.
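The functional comparison might look like the sketch below. The codon table is the standard genetic code, but the similarity classes are only a stand-in: the text does not say which amino acid similarity matrix was used, so a rough grouping by side-chain chemistry is assumed here, and all function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
CODON_BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(CODON_BASES)
    for j, b in enumerate(CODON_BASES)
    for k, c in enumerate(CODON_BASES)
}

# Assumed similarity classes (nonpolar, polar, basic, acidic, stop).
GROUPS = ["AVLIMFWPG", "STCYNQ", "KRH", "DE", "*"]
GROUP_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}

def translate(dna):
    """Translate DNA codon by codon, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def classify(a, b):
    """Label a residue pair as identical, conservative, or non-conservative."""
    if a == b:
        return "identical"
    if GROUP_OF[a] == GROUP_OF[b]:
        return "conservative"
    return "non-conservative"

def percent_nonconservative(original_dna, mutated_dna):
    """Percentage of codons whose amino acid changed to a different class."""
    p, q = translate(original_dna), translate(mutated_dna)
    changes = sum(classify(a, b) == "non-conservative" for a, b in zip(p, q))
    return 100.0 * changes / len(p)

original = "ATGGAAGTT"                               # Met-Glu-Val
print(translate(original))                           # MEV
print(percent_nonconservative(original, "ATGGAGGTT"))  # silent change: 0.0
print(percent_nonconservative(original, "ATGGATGTT"))  # Glu -> Asp, conservative: 0.0
```

Replacing the class lookup with a scored similarity matrix would change only `classify`, and the rest of the comparison would stay the same.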


The figures for the Markov chains show the number of substitutions as a function of time. Figure CODON shows the percent mutation of the codons of a sequence as a function of time, whereas Figure Sequence shows the percent mutation of the individual bases as a function of time. Figure CODON discounts conservative codon changes, so its graph shows only non-conservative changes. Both figures show that mutations increase over time, as one would expect.