Difference between revisions of "Template:Heidelberg/software/maws"

(Undo revision 313375 by MAJendrusch (talk))
Line 23: Line 23:
 
                                     <div class="col-lg-12">
 
                                     <div class="col-lg-12">
 
                                         <p class="basicheader">Abstract</p>
 
                                         <p class="basicheader">Abstract</p>
<p class="basictext">Aptazymes, i.e. fusions of a catalytic ribozyme or DNAzyme with one or several aptamers, make construction of nucleic acids-based circuits responding to external stimuli possible, as their catalytic activity can be activated in either the presence or absence of the cognate ligand. The design of aptazymes has been hampered by the need for communication modules, which translate the binding of a ligand to the aptamer into a change of function in the catalytic nucleic acid. Standard procedure herefore is selection on the one hand and rational design on the other, resulting in tedious work which is not certain to succeed. To this day, only few attempts have been made to automatize the design process, which all target a specifically designed chassis ribozyme. Here we describe a software "JAWS" (Joining Aptamers Without Selection) using a criterion for computational selection of communication modules based upon the partition function, as well as energetic and entropic data. This is then used in a random search-like algorithm to extract optimal communication modules without the need for selection.</p>
+
<p class="basictext">
 +
Aptamers, originally conceived as a better alternative to antibodies, are now a popular object of research. This research however, is limited by the use of a selection process (SELEX) to generate aptamers. This process needs careful planning and is time consuming with a usual timeframe of multiple weeks. Furthermore, success is heavily dependent on the quality of the nucleic acid library employed in the process. Herein, we construct an entropy-like metric on the space of nowhere zero N-dimensional probability distribution. By using this in an iterative process, we select nucleotides with a probability distribution of maximum distance to the uniform distribution at each step. This yields optimal ligand binding aptamers, without the need of selection.  
 +
</p>
 
<p class="basicheader">Introduction</p>
 
<p class="basicheader">Introduction</p>
<p class="basictext">Designing communication modules for controlling functional nucleic acids usually requires either the use of known modules (insert cit here), or a selection process, whereby nucleic acids containing randomized modules are folded in the presence of the controlling ligand and assayed for their activity (insert cit here). This constitutes a tedious and lengthy process, with uncertain results, whereafter nucleic acids have to be folded in the presence of their ligand to display activity (insert Penchovsky here). This renders impractical all kinds of assays requiring more complicated switching behaviour, as increasingly complex arrays of communication modules would require increasingly complex selection schemes. Those selection schemes can prove restrictive to smaller labs, as well as iGEM teams, whose time and equipment-constraints do not allow for a selection process of multiple weeks. To alleviate the need for selection, a variety of methods have been developed for the case of the hammerhead ribozyme. These range from long, predefined communication modules (insert cit here) to computational methods involving an energetic criterion, as well as random search algorithms to construct dynamically switching modules (insert penchovsky, breaker ...). Although those approaches address the specific problem of the hammerhead ribozyme very well, they lack generality, as the hammerhead ribozymes used there are heaviliy modified to allow for an easy design process (insert cit here). Here we developed a completely general approach to the computational design of communication modules, relying on a variable set of constraints. We validated several computationally designed aptazymes with different combinations of aptamer and functional nucleic acid <i>in vitro</i> and found that all of them showed the desired switching behavior.</p>
+
<p class="basictext">
 +
In the field of nucleic acid research, functional nucleic acids have been, and continue to be one of the topics most worked on. Even outside of that field of research, aptamers, ribozymes and DNAzymes are being used to construct very sensitive and specific sensors <x-ref>Rivas2015</x-ref><x-ref>Li2013</x-ref><x-ref>Zhao2015</x-ref> for many kinds of molecules, using many kinds of techniques. Yet in all of those approaches, the aptamer constitutes the core of the sensor. Thus, to engineer new sensors for differend kinds of molecules with current experimental methods, one would have to select a nucleic acid library agains those molecules<x-ref>Stoltenburg2007</x-ref>. That process, known as SELEX (systematic evolution of ligands by exponential enrichment) involves many steps of selection and a well-planned assay, needing time and thought. With time frames of multiple weeks, SELEX is in no way suitable for a fast creation of tests against novel diseases or drugs. Due to this need for a major speedup in the SELEX process, multiple computational approaches were designed. Algorithms have been devised to tackle the problem of nucleic acid library design<x-ref>Luo2010</x-ref> as well as to simulate the process of SELEX<x-ref>Hu2015</x-ref><x-ref>Chushak2009</x-ref>. While the former do not necessarily improve on a good random SELEX pool, the latter needs a significant amount of processing power, requiring four months to set up the selection library on 300 cores<x-ref>Chushak2009</x-ref>. This obviously makes SELEX the method of choice. In 2011, another approach to the selection of aptamers was demonstrated against thrombin and phosphatidylserine as targets<x-ref>Tseng2011</x-ref>. This method was based on minimizing for each nucleotide in the aptamer its entropy of binding, where entropy was calculated by sampling a grid space of conformations of the aptamer and computing probabilities at this step. Here we expand upon the concept of using entropy to select aptamers by using the maximum entropy expression for probability as an entropy-like distance function to achieve significant improvement in performance. The generated aptamers were experimentally validated and found to be binding their targets.
 +
</p>
 
<p class="basicheader">Algorithm</p>
 
<p class="basicheader">Algorithm</p>
 
<p class="basictext">
 
<p class="basictext">
To generalize the design process of aptazymes we consider an aptamer inserted into a stem of the ribozyme of interest, adding a region of randomized nucleotides functioning as communication module. We constrain the active state of such an aptazyme to be such, that the aptamer is not bound by the rest of the structure, and all randomized nucleotides form a symmetric stem consistent with the consensus secondary structure of the native ribozyme. We constrain an inactive state to be such, that a part of the aptamer with a length of $N$
+
We first replace the relative entropy $S[P|Q]=-\int{P\cdot{log{\frac{P}{Q}}}d\mathcal{X}}$ by an entropy-like distance using the maximum entropy expression for probability, which is the canonical ensemble $P(\mathcal{X})=\frac{e^{-\beta\cdot{\Delta{G}}}}{Z}$, with $\Delta{G}$ the gibbs free energy of a single conformation of the aptamer ligand complex and $Z=\int{e^{-\beta{\Delta{G}(\mathcal{X})}}d\mathcal{X}}$. That distance we construct in the following:
nucleotides forms a stem with the randomized nucleotides, and a portion of randomized nucleotides with length $N$. Hereafter, we shall call $N$ the shift of the communication module. By defining our communication module by a consensus structure, a length and a shift, we specify a constraint, checking for the existence of the active state consensus structure as well as the inactive structure, with a tolerance of $M$
+
wrong base pairs. We do so by inspection of the matrix of all base pair probabilities $P_{ij}$, yielded from the partition function $Z$ calculated for our sequence. We then discard all sequences not satisfying this constraint. Of the remaining sequences, only those that satisfy the constraint of the ensemble free energy difference between active and inactive state being less then the aptamer's interaction energy $\Delta{G_{Apt}}$ are retained. Using this approach, we arrive at functional communication modules for nucleic acids completely unrelated to the hammerhead ribozyme. In detail, the algorithm proceeds as follows:
+
 
</p>
 
</p>
 
<p class="basictext">
 
<p class="basictext">
Secondary structure prediction is performed using a partition function based approach using ViennaRNA (citation), which then allows for the extraction of the nucleic acid's partition function $Z=\Sigma_{I}e^{-\beta{G_{I}}}$, with $I$ an index enumerating all potential base pairs, loops and stacks, and $G_{I}$ the corresponding Gibbs free energy. This in turn allows for the construction of the probability matrix $P_{I}=\frac{e^{-\beta{G_{I}}}}{Z}$ consisting of the base pair probabilities $P_{ij}=\frac{e^{-\beta{G_{ij}}}}{Z}: i,j\in{{Bases}}$. The matrix $P_{ij}$ will be the basis of our calculations. To construct a communication module, we use the following steps:
+
Given $S[P|Q]=-\int{P\cdot{log{\frac{P}{Q}}}d\mathcal{X}}$, we have $\bar{S}[P|Q]=-\int{P\cdot{ln{\frac{P}{Q}}}d\mathcal{X}}$. From this follows trivially that $S[P|Q]>S[P'|Q]\Rightarrow\bar{S[P|Q]}>\bar{S[P'|Q]}$. Furthermore, we can assume for further considerations that $Q$ is the uniform distribution, and define $S[P]:=S[P|Q]$, where $Q$ is uniform. We then have: $\bar{S[P]}:=-\int{-P\cdot{\beta\Delta{G(\mathcal{X})}}+ln(\frac{Q}{Z})}$
 +
Here, as everywhere hereafter, $\mathcal{X}$ designates a set of variables comprising all the aptamers coordinates. Now, every nucleotide should be scored by $S[P]$ and the distance in quality between two aptamers is given by the metric $g[P,P']=|\bar{S[P']}-\bar{S[P]}|$. Taking this into account, entropy-like fitness functions $\bar{S[P]}$ provide a partial order on the space of all nucleotide sequences $\mathcal{N}$, and $g[P,P']$ provides a metric on $\mathcal{N}$, or more specifically a metric on space of equivalence classes of all nowhere zero probability distributions, with equivalence being defined as: $A=_{S}B iff S[A]=S[B]: A,B\in{\mathcal{N}}$. We use $g[P,P']$ to choose nucleotide sequences with a certain distance to the minimum $min(S[P])$ of entropies, which is rather intuitively the most specific nucleotide to the target. To use these principles to calculate aptamers to a certain target, we use the following algorithm:
 
</p>
 
</p>
<p>
+
<ol class="basictext">
<ol>
+
 
<li>
 
<li>
Given the position of a region $\mathcal{R}\subset{Sequence}$ comprising $2\cdot{N}+A$ bases, which should surround the aptamer $\mathcal{A}$ with length $A$, construct an active conformation. Henceforth, we shall refer to $N$ as the <i>stem length</i> and to the active conformation as <i>conformation $\mathcal{I}$</i>. This conformation is such, that the aptamer does not participate in base-pairing interactions with the rest of the functional nucleic acid. Furthermore, all $2\cdot{N}$ nucleotides of the region $\mathcal{R}\diagdown{\mathcal{A}}$ should participate in base pairing interactions, forming a stem of length $N$.
+
A bounding box is generated around the target. The conformational and positional space inside the bounding box is sampled from a uniform distribution for the four possible nucleotides $C = {dAMP, dGMP, dCMP, dTMP}$. At each sample, positions and energies of the system are saved.
 
</li>
 
</li>
 
<li>
 
<li>
Given $R$, $N$ and a <i>shift</i> $S$, construct an inactive conformation, consisting of the first $N$ bases in $R$ pairing with the bases $N+A-S$ to $2\cdot{N}+A-S$. This conformation has a stem displaced by $S$ nucleotides from the active conformation, disturbing ribozyme activity. The inactive conformation shall be refered to as <i>conformation $\mathcal{II}$</i>.
+
Optionally, the bounding box is split along one principal direction. The variance $Var(\Delta{G}_{i})$ is calculated for each half $i$ of the bounding box.
 +
The space inside the bounding box is sampled again, with probability $P_{i}=\frac{Var(\Delta{G}_{i})}{\Sigma_{j}Var(\Delta{G}_{j})}$ for a sample to come from half $i$ of the bounding box. This can be iterated until sufficient variance reduction is reached. At the last step, save positions and energies at each sample.
 
</li>
 
</li>
 
<li>
 
<li>
Generate a sequence $\mathcal{S}$, with all nucleotides in $\mathcal{R}\diagdown\mathcal{A}$ randomised.
+
<p>
 +
For each nucleotide $n \in C$, where $n$ ist the nucleotide at the 3' end of the aptamer, the partition function $Z$ is calculated as $Z=V\cdot\Sigma_{I}p_{I}\cdot{\Sigma_{J}\frac{e^{-\beta{\Delta{G_{J}}}}}{N_{J}}} : J\in{Samples},I\in{Strata}$, where $V$ is the volume of the sample space
 +
</p>
 +
<p>
 +
For each nucleotide $n \in C$, the probability distribution is computed as $P_{J}=\frac{e^{-\beta{\Delta{G_{J}}}}}{Z}$
 +
</p>
 +
<p>
 +
For each nucleotide $n \in C$, the entropy-like fitness function is computed as $S[P]=-V\cdot\Sigma_{I}p{I}\cdot\Sigma_{J}(-P_{J}\cdot{\beta\Delta{G_{J}}}+ln(\frac{Q}{Z}))$
 +
</p>
 
</li>
 
</li>
 
<li>
 
<li>
Compute the partition function $Z$ of $\mathcal{S}$ and the base pair probability matrix $P_{ij}$ at $\theta{ = 37°C}$. Check, if base pairs conforming to conformation $\mathcal{I}$ exist. This is true iff $P_{ij}\gt{P_{threshold}} \forall{(i,j)\in{\mathcal{I}}}$. If it is true, accept the sequence and move on to the next step. Else, return to step 3.
+
The nucleotide $n$ with the lowest entropy, and all other satisfying $g[P,P_{best}]\lt{g_{threshold}}$ is accepted for the next step.
</li>
+
</li>  
 
<li>
 
<li>
Using the base pair probability matrix $P_{ij}$, check if base pairs conforming to $\mathcal{II}$ exits. This is true iff $P_{ij}\gt{P_{threshold}} \forall{(i,j)\in{\mathcal{II}}}$. Requiring simultaneous conformity to $\mathcal{I}$ and $\mathcal{II}$ ensures that the structure is bistable, with local minima of free energy coinciding with the active and inactive structures, respectively. If this is true, accept this sequence, otherwise discard this sequence and return to step 3.
+
The next steps are iterated $N$ times, where $N$ is the number of nucleotides the aptamer is to be extended by. For each sequence chosen in the previous step, for each nucleotide $n \in C$:
 
</li>
 
</li>
 
<li>
 
<li>
Compute the ensemble free energies $\Delta{G_{\mathcal{I}}}, \Delta{G_{\mathcal{II}}}$ for the conformations $\mathcal{I}$ and $\mathcal{II}$ respectively. Check if $\Delta{G_{\mathcal{I}}}\lt{E_{threshold}} \wedge \Delta{G_{\mathcal{II}}}\lt{E_{threshold}}$. This ensures that the secondary structures are stable and do not unravel easily at 37°C.
+
The previous conformation of the aptamer is kept fixed, and $n$ is appended to the sequence. Torsion angles inside that nucleotide are sampled from a uniform distribution.
 
</li>
 
</li>
 
<li>
 
<li>
Compute the difference in ensemble free energies $\Delta{G} = \Delta{G_{\mathcal{I}}} - \Delta{G_{\mathcal{II}}}$ and constrain it to be smaller in its absolute value than a threshold $\Delta{E_{threshold}}$. This ensures that the bistability of the structure is given, and the structure switches conformation upon ligand binding.
+
<p>
 +
For each nucleotide $n \in C$, where $n$ is the nucleotide appended in the previous step, the partition function $Z$ is calculated as $Z=V\cdot{\Sigma_{I}\frac{e^{-\beta{\Delta{G_{I}}}}}{N_{I}}} : I\in{Samples}$, where $V$ is the volume of the sample space.
 +
</p>
 +
<p>
 +
For each nucleotide $n \in C$, the probability distribution is computed as $P_{I}=\frac{e^{-\beta{\Delta{G_{I}}}}}{Z}$
 +
</p>
 +
<p>
 +
For each nucleotide $n \in C$, the entropy-like fitness function is computed as $S[P]=-V\cdot{Sigma_{I}(-P_{I}\cdot{\beta\Delta{G_{I}}}+ln(\frac{Q}{Z}))}$
 +
</p>
 
</li>
 
</li>
 
<li>
 
<li>
If all of the above apply, accept the sequence as a candidate sequence for the aptazyme. Else return to step 3.
+
The nucleotide with the lowest entropy, and all other satisfying $g[P,P_{best}]\lt{g_{threshold}}$ are accepted for the next step.
 
</li>
 
</li>
 
</ol>
 
</ol>
In terms of the optimisation of $N$ and $S$ it is sufficient to run one simulation with relaxed constraints and evaluate the two dimensional histogram of points $(S,N)$ to find the region in $S,N$ space with the highest amount of candidates generated. This should correspond to the optimal stem length and shift for the aptamer ribozyme combination to be considered.
 
 
==Results==
 
Our implementation of the algorithm, termed JAWS (Joining Aptamers Without SELEX), was done in Python using the ViennaRNA library and its Python bindings. We used JAWS to create functional DNA and RNA constructs exhibiting switching behavior. Thus, we constructed variants of the SpinachII fluorescent aptamer, which were shown to be switchable in real time by the presence or absence of ATP to a fluorescent and non-fluorescent state, respectively, using an aptamer from <x-ref>Szostak0000</x-ref>. Both variants outperformed a previouly published combination (BBa_K1614012)<x-ref>Xxxx0000</xref>, as they exhibited a wider dynamic range and stronger contrast between their active and inactive state. Furthermore, experimental evidence corresponded well to computational predictions, as candidate 2 (BBa_K1614015), predicted to have a more stable secondary structure, also exhibited slightly less background, better switching and a clearer spectrum than candidate 1 (BBa_K1614014). In addition, we created ligand switchable versions of the F8 DNAzyme <x-ref>SomeChineseGuy0000</x-ref>, which were designed to cleave in the presence of ampicillin, an oligonucleotide, ketamine, or p53. Their activity was confirmed in optimal buffer as well as in a more realistic setting for on-site detection of small molecule contamination in beverages (specifically in energy drinks spiked with carbenicillin). Finally, JAWS successfully computed switchable HRP mimicking DNAzymes for rapid characterisation of aptamers generated by MAWS, our aptamer designing tool. The switchable DNAzymes indeed showed increased activity in the presence of their respective ligands. There we observed, that different stem strength, as predicted by the software, using both free energy $\Delta{G}$ and an entropy like distance<x-ref>Kullback0000</x-ref> $S[P_{ij},\frac{1}{L}]=-\Sigma_{ij}P_{ij}\cdot{log(LP_{ij})}$, where lower energy and entropy indicates greater stem strength possessed less background activity as well as a higher dynamic range, whereas weak stems possessed higher background and a lower dynamic range.
 
  
==Discussion==
+
<p class="basicheader">Results</p>
Following the results from the switchable HRP assays, we propose the introduction of another constraint regulating  the stem strength of our candidates. Namely, at each step enforce that $S[P_{ij}]=\Sigma_{ij}P_{ij}\cdot{log({P_{ij}}\cdot{L})}$ be inside an interval of entropies$[S_{0},S_{1}]$. This ensures at least <i>in silico</i>, that the DNAzymes generated possess a certain stem strength, and thus a specified dynamic range and kinetic behaviour. Furthermore, the process of generating ligand gated, bistable structures is easily expandable to switching structures which are not simple stems. This is already implemented in our software, as the constraint regarding the existence of basepairs conforming to a stem can actually process any given secondary structure. Even so, we are currently limited to simulating DNAs and RNAs without any consideration of pseudoknots. This would be crucial for even more efficient aptazyme design, as many ribozymes feature extensive tertiary interactions to promote their activity. Again, this could be bypassed by adding additional terms to the partition function $Z$, corresponding to inter-loop interactions, which would allow for additional constraints stabilizing or destabilizing tertiary interactions in the presence or absence of ligands. That said, we expanded upon existing computational concepts, generalizing them to all functional nucleic acids and developed an extensible, modular, and universal constraint based system, allowing for the design of tailor-made ligand gated nucleic acids, fit to any specific problem.
+
<p class="basictext">Our implementation of the algorithm, termed MAWS (Making Aptamers Without SELEX), was done in Python using the openMM library and its Python bindings for molecular dynamics simulations as well as the mpmath Python library for arbitrary-precision floating point arithmetics. Using MAWS, we have generated aptamers for several protein targets, including actin, xylanase, p53, and saCas9, as well as small molecule targets, including kanamycin and ketamine. Each of those aptamers was computed within a day at most (saCas9) and two hours at the least, on a laptop computer with only four cores. The aptamers were then experimentally validated by linking them to an HRP-mimicking DNAzyme, yielding a so-called aptabody. Aptamers against protein targets were tested by a modified Western blot procedure, where the aptabody was used in place of a conventional antibody. Aptamers binding small molecules were assayed in solution. Aptamers designed by MAWS were found to be functional, binding their targets in chemoluminescence based assays (link to stefan and stefan). <i>In silico</i> validation of aptamers showed a tendency for them tightly bind their target in molecular dynamics simulations. Furthermore, MAWS predicted a very tight fit of the aptamer to its ligand.</p>
  
 +
<p class="basicheader">Discussion</p>
 +
<p class="basictext">As the principle of minimal relative entropy should be universal, one could envision aptamers comprised of biomolecules other than nucleic acids to be generated by the software. In terms of PNA, XNA or other nucleic acid derivatives it should be possible to incorporate them into the software with minor changes to the parameters. Furthermore, the second and the following steps could be changed to a markov chain monte carlo algorithm, which would forego sampling in regions where the free energy $\Delta{G}$ is very large. One could then switch to sampling interaction energy, as conformations prevented from occurring in nature by the laws of physics could not arise if the markov chain transition probability were to be chosen to be energy gradient dependent, therefore avoiding artifacts. As the algorithm can be executed in a very short time on consumer hardware, it could be beneficial to implement master-worker-style parallelization for this software to be able to harvest more aptamers in a shorter time.
  
 
                                         </p>
 
                                         </p>

Revision as of 20:49, 17 September 2015

MAWS

Abstract

Aptamers, originally conceived as a better alternative to antibodies, are now a popular object of research. This research however, is limited by the use of a selection process (SELEX) to generate aptamers. This process needs careful planning and is time consuming with a usual timeframe of multiple weeks. Furthermore, success is heavily dependent on the quality of the nucleic acid library employed in the process. Herein, we construct an entropy-like metric on the space of nowhere zero N-dimensional probability distribution. By using this in an iterative process, we select nucleotides with a probability distribution of maximum distance to the uniform distribution at each step. This yields optimal ligand binding aptamers, without the need of selection.

Introduction

In the field of nucleic acid research, functional nucleic acids have been, and continue to be one of the topics most worked on. Even outside of that field of research, aptamers, ribozymes and DNAzymes are being used to construct very sensitive and specific sensors Rivas2015Li2013Zhao2015 for many kinds of molecules, using many kinds of techniques. Yet in all of those approaches, the aptamer constitutes the core of the sensor. Thus, to engineer new sensors for differend kinds of molecules with current experimental methods, one would have to select a nucleic acid library agains those moleculesStoltenburg2007. That process, known as SELEX (systematic evolution of ligands by exponential enrichment) involves many steps of selection and a well-planned assay, needing time and thought. With time frames of multiple weeks, SELEX is in no way suitable for a fast creation of tests against novel diseases or drugs. Due to this need for a major speedup in the SELEX process, multiple computational approaches were designed. Algorithms have been devised to tackle the problem of nucleic acid library designLuo2010 as well as to simulate the process of SELEXHu2015Chushak2009. While the former do not necessarily improve on a good random SELEX pool, the latter needs a significant amount of processing power, requiring four months to set up the selection library on 300 coresChushak2009. This obviously makes SELEX the method of choice. In 2011, another approach to the selection of aptamers was demonstrated against thrombin and phosphatidylserine as targetsTseng2011. This method was based on minimizing for each nucleotide in the aptamer its entropy of binding, where entropy was calculated by sampling a grid space of conformations of the aptamer and computing probabilities at this step. Here we expand upon the concept of using entropy to select aptamers by using the maximum entropy expression for probability as an entropy-like distance function to achieve significant improvement in performance. The generated aptamers were experimentally validated and found to be binding their targets.

Algorithm

We first replace the relative entropy $S[P|Q]=-\int{P\cdot{log{\frac{P}{Q}}}d\mathcal{X}}$ by an entropy-like distance using the maximum entropy expression for probability, which is the canonical ensemble $P(\mathcal{X})=\frac{e^{-\beta\cdot{\Delta{G}}}}{Z}$, with $\Delta{G}$ the gibbs free energy of a single conformation of the aptamer ligand complex and $Z=\int{e^{-\beta{\Delta{G}(\mathcal{X})}}d\mathcal{X}}$. That distance we construct in the following:

Given $S[P|Q]=-\int{P\cdot{log{\frac{P}{Q}}}d\mathcal{X}}$, we have $\bar{S}[P|Q]=-\int{P\cdot{ln{\frac{P}{Q}}}d\mathcal{X}}$. From this follows trivially that $S[P|Q]>S[P'|Q]\Rightarrow\bar{S[P|Q]}>\bar{S[P'|Q]}$. Furthermore, we can assume for further considerations that $Q$ is the uniform distribution, and define $S[P]:=S[P|Q]$, where $Q$ is uniform. We then have: $\bar{S[P]}:=-\int{-P\cdot{\beta\Delta{G(\mathcal{X})}}+ln(\frac{Q}{Z})}$ Here, as everywhere hereafter, $\mathcal{X}$ designates a set of variables comprising all the aptamers coordinates. Now, every nucleotide should be scored by $S[P]$ and the distance in quality between two aptamers is given by the metric $g[P,P']=|\bar{S[P']}-\bar{S[P]}|$. Taking this into account, entropy-like fitness functions $\bar{S[P]}$ provide a partial order on the space of all nucleotide sequences $\mathcal{N}$, and $g[P,P']$ provides a metric on $\mathcal{N}$, or more specifically a metric on space of equivalence classes of all nowhere zero probability distributions, with equivalence being defined as: $A=_{S}B iff S[A]=S[B]: A,B\in{\mathcal{N}}$. We use $g[P,P']$ to choose nucleotide sequences with a certain distance to the minimum $min(S[P])$ of entropies, which is rather intuitively the most specific nucleotide to the target. To use these principles to calculate aptamers to a certain target, we use the following algorithm:

  1. A bounding box is generated around the target. The conformational and positional space inside the bounding box is sampled from a uniform distribution for the four possible nucleotides $C = {dAMP, dGMP, dCMP, dTMP}$. At each sample, positions and energies of the system are saved.
  2. Optionally, the bounding box is split along one principal direction. The variance $Var(\Delta{G}_{i})$ is calculated for each half $i$ of the bounding box. The space inside the bounding box is sampled again, with probability $P_{i}=\frac{Var(\Delta{G}_{i})}{\Sigma_{j}Var(\Delta{G}_{j})}$ for a sample to come from half $i$ of the bounding box. This can be iterated until sufficient variance reduction is reached. At the last step, save positions and energies at each sample.
  3. For each nucleotide $n \in C$, where $n$ ist the nucleotide at the 3' end of the aptamer, the partition function $Z$ is calculated as $Z=V\cdot\Sigma_{I}p_{I}\cdot{\Sigma_{J}\frac{e^{-\beta{\Delta{G_{J}}}}}{N_{J}}} : J\in{Samples},I\in{Strata}$, where $V$ is the volume of the sample space

    For each nucleotide $n \in C$, the probability distribution is computed as $P_{J}=\frac{e^{-\beta{\Delta{G_{J}}}}}{Z}$

    For each nucleotide $n \in C$, the entropy-like fitness function is computed as $S[P]=-V\cdot\Sigma_{I}p{I}\cdot\Sigma_{J}(-P_{J}\cdot{\beta\Delta{G_{J}}}+ln(\frac{Q}{Z}))$

  4. The nucleotide $n$ with the lowest entropy, and all other satisfying $g[P,P_{best}]\lt{g_{threshold}}$ is accepted for the next step.
  5. The next steps are iterated $N$ times, where $N$ is the number of nucleotides the aptamer is to be extended by. For each sequence chosen in the previous step, for each nucleotide $n \in C$:
  6. The previous conformation of the aptamer is kept fixed, and $n$ is appended to the sequence. Torsion angles inside that nucleotide are sampled from a uniform distribution.
  7. For each nucleotide $n \in C$, where $n$ is the nucleotide appended in the previous step, the partition function $Z$ is calculated as $Z=V\cdot{\Sigma_{I}\frac{e^{-\beta{\Delta{G_{I}}}}}{N_{I}}} : I\in{Samples}$, where $V$ is the volume of the sample space.

    For each nucleotide $n \in C$, the probability distribution is computed as $P_{I}=\frac{e^{-\beta{\Delta{G_{I}}}}}{Z}$

    For each nucleotide $n \in C$, the entropy-like fitness function is computed as $S[P]=-V\cdot{Sigma_{I}(-P_{I}\cdot{\beta\Delta{G_{I}}}+ln(\frac{Q}{Z}))}$

  8. The nucleotide with the lowest entropy, and all other satisfying $g[P,P_{best}]\lt{g_{threshold}}$ are accepted for the next step.

Results

Our implementation of the algorithm, termed MAWS (Making Aptamers Without SELEX), was done in Python using the openMM library and its Python bindings for molecular dynamics simulations as well as the mpmath Python library for arbitrary-precision floating point arithmetics. Using MAWS, we have generated aptamers for several protein targets, including actin, xylanase, p53, and saCas9, as well as small molecule targets, including kanamycin and ketamine. Each of those aptamers was computed within a day at most (saCas9) and two hours at the least, on a laptop computer with only four cores. The aptamers were then experimentally validated by linking them to an HRP-mimicking DNAzyme, yielding a so-called aptabody. Aptamers against protein targets were tested by a modified Western blot procedure, where the aptabody was used in place of a conventional antibody. Aptamers binding small molecules were assayed in solution. Aptamers designed by MAWS were found to be functional, binding their targets in chemoluminescence based assays (link to stefan and stefan). In silico validation of aptamers showed a tendency for them tightly bind their target in molecular dynamics simulations. Furthermore, MAWS predicted a very tight fit of the aptamer to its ligand.

Discussion

As the principle of minimal relative entropy should be universal, one could envision aptamers comprised of biomolecules other than nucleic acids to be generated by the software. In terms of PNA, XNA or other nucleic acid derivatives it should be possible to incorporate them into the software with minor changes to the parameters. Furthermore, the second and the following steps could be changed to a markov chain monte carlo algorithm, which would forego sampling in regions where the free energy $\Delta{G}$ is very large. One could then switch to sampling interaction energy, as conformations prevented from occurring in nature by the laws of physics could not arise if the markov chain transition probability were to be chosen to be energy gradient dependent, therefore avoiding artifacts. As the algorithm can be executed in a very short time on consumer hardware, it could be beneficial to implement master-worker-style parallelization for this software to be able to harvest more aptamers in a shorter time.