Team:Heidelberg/software/maws
Abstract
Aptamers, originally conceived as a better alternative to antibodies, are now a popular object of research. This research however, is limited by the use of a selection process (SELEX) to generate aptamers. This process needs careful planning and is time consuming with a usual timeframe of multiple weeks. Furthermore, success is heavily dependent on the quality of the nucleic acid library employed in the process. Herein, we construct an entropylike metric on the space of nowhere zero Ndimensional probability distribution. By using this in an iterative process, we select nucleotides with a probability distribution of maximum distance to the uniform distribution at each step. This yields optimal ligand binding aptamers, without the need of selection. In the field of nucleic acid research, functional nucleic acids have been, and continue to be one of the topics most worked on. Even outside of that field of research, aptamers, ribozymes and DNAzymes are being used to construct very sensitive and specific sensors
Introduction
In the field of nucleic acid research, functional nucleic acids have been, and continue to be one of the topics most worked on. Even outside of that field of research, aptamers, ribozymes and DNAzymes are being used to construct very sensitive and specific sensors
Due to this need for a major speedup in the SELEX process, multiple computational approaches were designed. Algorithms have been devised to tackle the problem of nucleic acid library design
Algorithm
We first replace the relative entropy $S[PQ]=\int{P\cdot{log{\frac{P}{Q}}}d\mathcal{X}}$ by an entropylike distance using the maximum entropy expression for probability, which is the canonical ensemble $P(\mathcal{X})=\frac{e^{\beta\cdot{\Delta{G}}}}{Z}$, with $\Delta{G}$ the gibbs free energy of a single conformation of the aptamer ligand complex and $Z=\int{e^{\beta{\Delta{G}(\mathcal{X})}}d\mathcal{X}}$. That distance we construct in the following:
Given $S[PQ]=\int{P\cdot{log{\frac{P}{Q}}}d\mathcal{X}}$, we have $\bar{S}[PQ]=\int{P\cdot{ln{\frac{P}{Q}}}d\mathcal{X}}$. From this follows trivially that $S[PQ]>S[P'Q]\Rightarrow\bar{S[PQ]}>\bar{S[P'Q]}$. Furthermore, we can assume for further considerations that $Q$ is the uniform distribution, and define $S[P]:=S[PQ]$, where $Q$ is uniform. We then have: $\bar{S[P]}:=\int{P\cdot{\beta\Delta{G(\mathcal{X})}}+ln(\frac{Q}{Z})}$ Here, as everywhere hereafter, $\mathcal{X}$ designates a set of variables comprising all the aptamers coordinates. Now, every nucleotide should be scored by $S[P]$ and the distance in quality between two aptamers is given by the metric $g[P,P']=\bar{S[P']}\bar{S[P]}$. Taking this into account, entropylike fitness functions $\bar{S[P]}$ provide a partial order on the space of all nucleotide sequences $\mathcal{N}$, and $g[P,P']$ provides a metric on $\mathcal{N}$, or more specifically a metric on space of equivalence classes of all nowhere zero probability distributions, with equivalence being defined as: $A=_{S}B \iff S[A]=S[B]: A,B\in{\mathcal{N}}$. We use $g[P,P']$ to choose nucleotide sequences with a certain distance to the minimum $min(S[P])$ of entropies, which is rather intuitively the most specific nucleotide to the target. To use these principles to calculate aptamers to a certain target, we use the following algorithm:
 A bounding box is generated around the target (Fig. 3A). The conformational and positional space inside the bounding box is sampled from a uniform distribution for the four possible nucleotides $C = {dAMP, dGMP, dCMP, dTMP}$ (Fig. 3B). At each sample, positions and energies of the system are saved.
 Optionally, the bounding box is split along one principal direction. The variance $Var(\Delta{G}_{i})$ is calculated for each half $i$ of the bounding box. The space inside the bounding box is sampled again, with probability $P_{i}=\frac{Var(\Delta{G}_{i})}{\Sigma_{j}Var(\Delta{G}_{j})}$ for a sample to come from half $i$ of the bounding box. This can be iterated until sufficient variance reduction is reached. At the last step, save positions and energies at each sample.

For each nucleotide $n \in C$, where $n$ is the nucleotide at the 3' end of the aptamer, the partition function $Z$ is calculated as $Z=V\cdot\Sigma_{I}p_{I}\cdot{\Sigma_{J}\frac{e^{\beta{\Delta{G_{J}}}}}{N_{J}}} : J\in{Samples},I\in{Strata}$, where $V$ is the volume of the sample space
For each nucleotide $n \in C$, the probability distribution is computed as $P_{J}=\frac{e^{\beta{\Delta{G_{J}}}}}{Z}$
For each nucleotide $n \in C$, the entropylike fitness function is computed as $S[P]=V\cdot\Sigma_{I}p_{I}\cdot\Sigma_{J}(P_{J}\cdot{\beta\Delta{G_{J}}}+ln(\frac{Q}{Z}))$
 The nucleotide $n$ with the lowest entropy, and all other satisfying $g[P,P_{best}]\lt{g_{threshold}}$ is accepted for the next step.
 The next steps are iterated $N$ times, where $N$ is the number of nucleotides the aptamer is to be extended by. For each sequence chosen in the previous step, for each nucleotide $n \in C$:
 The previous conformation of the aptamer is kept fixed, and $n$ is appended to the sequence. Torsion angles inside that nucleotide are sampled from a uniform distribution (Fig. 3C).

For each nucleotide $n \in C$, where $n$ is the nucleotide appended in the previous step, the partition function $Z$ is calculated as $Z=V\cdot{\Sigma_{I}\frac{e^{\beta{\Delta{G_{I}}}}}{N_{I}}} : I\in{Samples}$, where $V$ is the volume of the sample space.
For each nucleotide $n \in C$, the probability distribution is computed as $P_{I}=\frac{e^{\beta{\Delta{G_{I}}}}}{Z}$
For each nucleotide $n \in C$, the entropylike fitness function is computed as $S[P]=V\cdot{\Sigma_{I}(P_{I}\cdot{\beta\Delta{G_{I}}}+ln(\frac{Q}{Z}))}$
 The nucleotide with the lowest entropy, and all other satisfying $g[P,P_{best}]\lt{g_{threshold}}$ are accepted for the next step.
Results
Our implementation of the algorithm, termed MAWS (Making Aptamers Without SELEX), was done in Python using the openMM library and its Python bindings for molecular dynamics simulations as well as the mpmath Python library for arbitraryprecision floating point arithmetics. MAWS, as well as its documentation, is freely available as Open Source on our GitHub repository. Using MAWS, we have generated aptamers for several protein targets, including actin, xylanase, p53, and saCas9, as well as small molecule targets, including kanamycin and ketamine. Each of those aptamers was computed within a day at most (saCas9) and two hours at the least, on a laptop computer with only four cores. The aptamers were then experimentally validated by linking them to an HRPmimicking DNAzyme, yielding a socalled aptabody. Aptamers against protein targets were tested by a modified Western blot procedure, where the aptabody was used in place of a conventional antibody. Aptamers binding small molecules were assayed in solution. Aptamers designed by MAWS were found to be functional, binding their targets in chemoluminescence based assays (link to stefan and stefan). In silico validation of aptamers showed a tendency for them tightly bind their target in molecular dynamics simulations. Furthermore, MAWS predicted a very tight fit of the aptamer to its ligand.
Discussion
As the principle of minimal relative entropy should be universal, one could envision aptamers comprised of biomolecules other than nucleic acids to be generated by the software. In terms of PNA, XNA or other nucleic acid derivatives it should be possible to incorporate them into the software with minor changes to the parameters. Furthermore, the second and the following steps could be changed to a markov chain monte carlo algorithm, which would forego sampling in regions where the free energy $\Delta{G}$ is very large. One could then switch to sampling interaction energy, as conformations prevented from occurring in nature by the laws of physics could not arise if the markov chain transition probability were to be chosen to be energy gradient dependent, therefore avoiding artifacts. As the algorithm can be executed in a very short time on consumer hardware, it could be beneficial to implement masterworkerstyle parallelization for this software to be able to harvest more aptamers in a shorter time.