
Structure Prediction based on Machine Learning


Abstract

Biomolecular structure prediction is an essential part of research in many biological and chemical areas. In our application scenario of computational riboswitch development, we are especially interested in DNA and RNA structure prediction techniques that we can efficiently use to design new riboswitches. Due to the high conservation of base-pairings in RNA and DNA sequences, most computational methods for this problem direct their efforts at predicting the secondary structure of nucleic acid sequences. We developed a machine learning approach to secondary structure prediction in the form of an artificial neural network model that is able to learn from results produced by different state-of-the-art algorithms and even from structures actually observed in biological experiments. The resulting model is used in our riboswitch design program as part of the riboswitch evaluation routine.


Introduction & Theory

This section introduces some formal definitions needed to describe the methods and representations we developed for our model. Furthermore, two basic folding algorithms, one based on combinatorics and one on thermodynamics, are briefly described, as they are used to generate training data for the machine learning model.

Problem Formalization

In order to be able to apply mathematical and computational methods, we have to formalize and abstract from biological entities. We abstract from the concrete biological composition of RNA or DNA sequences by working with a character representation for the individual bases.

  • $S$ denotes an RNA or DNA sequence of some length $n$. The sequence is composed of characters from $\{A, C, G, T\}$ (for DNA) or $\{A,C,G,U\}$ (for RNA), which represent the nucleobases Adenine ($A$), Cytosine ($C$), Guanine ($G$), Thymine ($T$) and Uracil ($U$).
  • $S[i]\in\{A,C,G,T,U\}$ denotes the base at position $i\in\{1,2,\dots, n\}$ within a sequence $S$ of length $n$.

The secondary structure of a strand $S$ of length $n$ will be represented by a set $\mathcal{S}\subseteq\mathbb{N}\times\mathbb{N}$, where an element $(i,j)\in\mathcal{S}$ indicates that the bases at positions $i$ and $j$ are paired. In contrast to other common representations, we explicitly denote non-pairing bases in the set: an element $(i,i)\in\mathcal{S}$ denotes that position $i$ is not paired at all. In order to obtain a representation consistent with the technical details later on, we define a valid secondary structure to obey the following properties:

  1. $\forall k\in\{1,2,\dots,n\}\,\exists (i,j)\in\mathcal{S}: k\in\{i,j\}$, i.e. each position appears at least once in the secondary structure.
  2. $(i,j),(k,l)\in\mathcal{S}\wedge (i=k\vee j=l)\Longrightarrow (i,j)=(k,l)$, i.e. each position appears at most once in the secondary structure.
  3. $(i,j)\in\mathcal{S}\Longrightarrow 1 \leq i \leq j \leq n$, i.e. the pairing positions are ordered. From this follows that $(j,i)\not\in\mathcal{S}$ if $i\neq j$.

Note that 1. and 2. together imply that each position of the sequence appears exactly once in the secondary structure. Hence, $|\mathcal{S}|\leq n$, since every base pair $(i,j)$ with $i\neq j$ covers two positions, with equality exactly when no bases pair at all.
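
To make the representation concrete, consider a small illustrative example: for the sequence $S=\texttt{GGAAACC}$ of length $n=7$, a hairpin pairing position $1$ with $7$ and $2$ with $6$ is written as \[\mathcal{S}=\{(1,7),(2,6),(3,3),(4,4),(5,5)\}.\] Every position appears exactly once, and $|\mathcal{S}|=5=n-2$ because each of the two base pairs covers two positions.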

Nussinov's Algorithm

The RNA structures produced by Nussinov's algorithm [1] follow the rationale that more base-pairs increase the stability of the secondary structure. Hence, the goal is to maximize the number of base-pairs in the secondary structure. In case of pseudo-knot free structures, this can be achieved by Nussinov's recursive formula $$ M(i, j)=\max\begin{cases} \begin{cases} M(i+1, j-1)+1 & \text{, if } S[i] \text{ and } S[j] \text{ can pair} \\ 0 & \text{, otherwise} \end{cases} \\ \max_{i\leq k < j}\{M(i, k) + M(k+1, j)\} \end{cases} $$ where $S$ is an RNA sequence of length $n$ and $M\in\mathbb{N}^{n\times n}$ is a matrix containing at $M(i, j)$ the maximum number of possible base-pairs for the substrand of $S$ between positions $i$ and $j$. The matrix $M$ can be efficiently computed using dynamic programming, and possible secondary structures can finally be traced back from $M(1, n)$.
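
To illustrate the recursion, the following minimal Python sketch fills the matrix $M$ bottom-up via dynamic programming. The function and variable names are our own, and the sketch omits refinements such as a minimum hairpin loop size that practical implementations enforce.

```python
# Watson-Crick and wobble pairs, in both orders.
CAN_PAIR = {("A", "U"), ("U", "A"), ("A", "T"), ("T", "A"),
            ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def nussinov(S: str) -> list[list[int]]:
    """Fill M so that M[i][j] holds the maximum number of base pairs
    for the substrand S[i..j] (0-based, inclusive)."""
    n = len(S)
    M = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):           # substrand length
        for i in range(n - length + 1):
            j = i + length - 1
            pair = M[i + 1][j - 1] + 1 if (S[i], S[j]) in CAN_PAIR else 0
            split = max(M[i][k] + M[k + 1][j] for k in range(i, j))
            M[i][j] = max(pair, split)
    return M
```

For example, `nussinov("GGGAAACCC")[0][-1]` evaluates to $3$, corresponding to the three G-C pairs of the hairpin stem.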

Zuker's Algorithm

The RNA structures produced by Zuker's algorithm [2] are based on thermodynamic parameters as measured in biological experiments in order to minimize the free energy of the secondary structure. The algorithm decomposes the prediction problem into finding the optimal configuration of loop motifs (i.e. stacks, hairpin, bulge, interior and multi-loops) that minimizes the Gibbs free energy of the complete structure.

Figure 1 Basic loop motifs defined by a nucleobase pair. Stacks contribute to the stability of the structure, while the other loop types usually destabilize a secondary structure.

In case of pseudo-knot free structures for a sequence $S$ of length $n$, we can tackle the thermodynamic minimum-free-energy (MFE) optimization using Zuker's recursions $$ W(i,j)= \min \begin{cases} W(i+1, j) \\ W(i, j-1) \\ V(i, j) \\ \displaystyle\min_{i < k < j}\{W(i, k) + W(k+1,j)\} \end{cases} $$ where $W\in\mathbb{R}^{n\times n}$ is a matrix containing at $W(i,j)$ the minimum free energy for the substrand of $S$ between positions $i$ and $j$. This equation uses a second auxiliary matrix $V\in(\mathbb{R}\cup\{\infty\})^{n\times n}$ whose elements $V(i,j)$ also contain the minimum free energy for the substrand of $S$ between positions $i$ and $j$, but subject to the constraint that positions $i$ and $j$ form a base-pair; $V(i,j)=\infty$ if the bases cannot pair: $$ V(i,j)= \min \begin{cases} \Delta G_{Hairpin}(i, j) \\ \Delta G_{Stack}(i, j) + V(i+1, j-1) \\ \displaystyle\min_{i+1 < j' < j}\{\Delta G_{Bulge}(i, j, i+1, j') + V(i+1, j')\} \\ \displaystyle\min_{i < i' < j-1}\{\Delta G_{Bulge}(i, j, i', j-1) + V(i', j-1)\} \\ \displaystyle\min_{i+1 < i' < j' < j-1}\{\Delta G_{Interior}(i, j, i', j') + V(i',j')\} \\ \displaystyle\min_{i < k < j-1}\{W(i+1, k) + W(k+1,j-1)\}+\Delta G_{Multi}(i, j) \end{cases}. $$ Here, $\Delta G_{Hairpin}$, $\Delta G_{Stack}$, $\Delta G_{Bulge}$, $\Delta G_{Interior}$ and $\Delta G_{Multi}$ are the Gibbs free energy functions of the corresponding loop motifs induced by a base-pair. These energy functions are based on parameter files compiled from experimental measurements. Again, both matrices $W$ and $V$ can be efficiently computed using dynamic programming, and possible secondary structures can finally be traced back from $W(1, n)$.
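
The following heavily simplified Python sketch mirrors the structure of these recursions. All $\Delta G$ functions are crude placeholders with made-up constant values; a real implementation would evaluate them using experimentally measured loop parameters and would fill the matrices iteratively rather than through memoized recursion.

```python
import math
from functools import lru_cache

INF = math.inf

# Placeholder loop energies (made-up constants); real implementations look
# these up in experimentally compiled parameter tables.
def dG_hairpin(S, i, j):          return 4.0
def dG_stack(S, i, j):            return -2.0
def dG_bulge(S, i, j, ip, jp):    return 3.0
def dG_interior(S, i, j, ip, jp): return 2.0
def dG_multi(S, i, j):            return 5.0

def can_pair(a, b):
    return {a, b} in ({"A", "U"}, {"A", "T"}, {"C", "G"}, {"G", "U"})

def zuker_mfe(S):
    n = len(S)

    @lru_cache(maxsize=None)
    def V(i, j):  # MFE of S[i..j] given that positions i and j pair
        if j <= i or not can_pair(S[i], S[j]):
            return INF
        best = dG_hairpin(S, i, j)
        if i + 1 < j - 1:                    # stacked pair
            best = min(best, dG_stack(S, i, j) + V(i + 1, j - 1))
        for jp in range(i + 2, j):           # bulge, 5' side unchanged
            best = min(best, dG_bulge(S, i, j, i + 1, jp) + V(i + 1, jp))
        for ip in range(i + 1, j - 1):       # bulge, 3' side unchanged
            best = min(best, dG_bulge(S, i, j, ip, j - 1) + V(ip, j - 1))
        for ip in range(i + 2, j - 1):       # interior loops
            for jp in range(ip + 1, j - 1):
                best = min(best, dG_interior(S, i, j, ip, jp) + V(ip, jp))
        for k in range(i + 1, j - 1):        # multi-loop decomposition
            best = min(best, W(i + 1, k) + W(k + 1, j - 1) + dG_multi(S, i, j))
        return best

    @lru_cache(maxsize=None)
    def W(i, j):  # MFE of the substrand S[i..j]
        if j <= i:
            return 0.0
        best = min(W(i + 1, j), W(i, j - 1), V(i, j))
        for k in range(i + 1, j):
            best = min(best, W(i, k) + W(k + 1, j))
        return best

    return W(0, n - 1)
```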


Goal

We want to develop a new secondary structure prediction model based on machine learning techniques. This model is used in our next project to aid the process of computationally designing riboswitches that can lock and unlock specific regions in RNA sequences. In contrast to most folding algorithms so far, the focus of this project lies in evaluating secondary structures instead of producing them. Therefore, we will implement and train an artificial neural network to learn base-pair scorings for given sequences from results produced by different folding algorithms. However, we will also derive an algorithm to actually generate secondary structures from the base-pair scorings produced by the neural network model.


Results

We attempted to combine the results of secondary structure predictions produced by Nussinov's and Zuker's folding algorithms into a single learning model. Because of the high dimensionality of the folding problem and the natural support for online training and readjustment, we chose to pursue an artificial neural network model in the form of a multilayer perceptron.

The following sections describe the resulting software system, which is able to evaluate and generate secondary structures based on a trained multilayer perceptron, and explain how the network is trained.

General System Architecture

The whole model for secondary structure prediction of nucleic acid strands implemented in this project can be decomposed into four main components that are shown in Figure 2.

Figure 2 The core component of the system is the scoring model, implemented as an artificial neural network that was trained to predict base-pair scorings for given RNA or DNA sequences. The subsequent components modify and adapt the scores to finally generate actual structures.

The components have the following functionality:
  • Scoring Model: Takes an RNA or DNA sequence, encodes it into a feature vector and produces a matrix containing scores for each base-pair in the sequence, indicating how likely the respective bases will form a bond in the structure. This basic component is realized in this project by an artificial neural network trained to score the base-pair contacts.
  • Tuning Engine: Takes the matrix containing the base-pair scores and fine-tunes them in order to remove base-pair violations (i.e. pairs that are neither Watson-Crick nor wobble pairs), heuristically adjust to a folding temperature, apply constraints (e.g. explicitly forbidden base-pairs), take co-folding split positions into account, etc.
  • Probability Distributor: Takes the matrix containing the final base-pair scores and converts it into a multidimensional discrete probability distribution of each base to pair with another base. The output of the distributor is a matrix with the same dimensionality and semantics as the score matrix, but the scores have now become proper contact probabilities.
  • Structure Generator: Takes the contact probability matrix and uses it to gather base-pairs based on their probability and generate actual secondary structures. For example, one way to find a single structure (the most probable one) is a maximum-likelihood traceback. Other ways could be to randomly sample multiple structures according to the probability distribution or to use the matrix as parameters in a simulation.

In the following, assume $S$ is an RNA sequence of length $n$ and $\mathcal{S}$ denotes a corresponding secondary structure (as defined in the introduction).

The Scoring Model

This component forms the basis for determining which secondary structure is finally generated. Its task is to score each possible base pair within the given sequence and also to assign a score to the case that a base will not pair at all. A higher score for a base pair indicates a higher probability that the two bases will actually pair. In this case, the implemented scoring model is a learning system in form of an artificial neural network (ANN), in particular a multilayer perceptron trained on ?? million RNA sequences. The ANN takes a character sequence $S$ describing the RNA or DNA strand of length $n$, encodes it into a numerical representation vector $\mathbf{x}$ and scores each base pair with a value in $[0, 1]$. The final output of this scorer is a symmetric square matrix $M\in\mathbb{R}^{n\times n}$ where entry $M_{ij}$ corresponds to the score for the base pair $S[i]$ and $S[j]$.

Binary Feature Encoding

The character sequence $S$ has to be encoded into a feature vector of fixed length that can be fed into the ANN. A simple way is a binary encoding of the sequence as follows: Note that $S$ is composed of the characters $\{A,C,G,T/U\}$ representing the bases, hence, two bits per base would suffice, yielding a feature vector of length $2n$, but this restricts the model to sequences of exactly length $n$. We increase the flexibility of the model to accept sequences of at most length $n$ by padding "empty slots" at the end of shorter sequences until they are of length $n$; encoding this fifth symbol $\emptyset$ requires a third bit per position. Assuming we support a maximum sequence length of $n$ bases, we can then encode any sequence $S$ of length $m \leq n$ into a fixed-size vector of the form \[\mathbf{x} = f_{enc}(S) := (f_{enc}(S[1]), \dots, f_{enc}(S[m]),\underbrace{f_{enc}(\emptyset),\dots,f_{enc}(\emptyset)}_{\times (n-m)})\in\mathbb{B}^{3n}\] where $$ f_{enc}(b) = \begin{cases} (0, 0, 1) & \text{, if } b=A \\ (0, 1, 0) & \text{, if } b=C \\ (0, 1, 1) & \text{, if } b=G \\ (1, 0, 0) & \text{, if } b=T \vee b=U \\ (0, 0, 0) & \text{, if } b=\emptyset \\ \end{cases} $$
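
As an illustration, the following Python sketch (with names of our own choosing) implements $f_{enc}$ including the padding:

```python
import numpy as np

# Three-bit codes for the four bases plus the all-zero "empty slot" code.
BASE_CODE = {"A": (0, 0, 1), "C": (0, 1, 0), "G": (0, 1, 1),
             "T": (1, 0, 0), "U": (1, 0, 0)}
EMPTY_CODE = (0, 0, 0)

def encode_sequence(S: str, n: int) -> np.ndarray:
    """Encode a sequence of length m <= n into a binary vector of length 3n."""
    assert len(S) <= n, "sequence exceeds the supported maximum length"
    codes = [BASE_CODE[b] for b in S] + [EMPTY_CODE] * (n - len(S))
    return np.array(codes, dtype=np.uint8).reshape(-1)  # shape (3n,)
```

For $n=5$, `encode_sequence("ACG", 5)` yields the $15$-dimensional vector $(0,0,1,\,0,1,0,\,0,1,1,\,0,0,0,\,0,0,0)$.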

Binary Structure Encoding

One way to numerically encode DNA or RNA secondary structures that are shown to the ANN is by generating a fixed-size contact map from the structure. Assume our ANN supports sequences of maximum length $n$. Given a secondary structure $\mathcal{S}$ for a sequence $S$ of length $m\leq n$, its contact map is a matrix $C\in\mathbb{B}^{n\times n}$ where $$ C_{ij} = \begin{cases} 1 & \text{, if } (i, j)\in\mathcal{S} \vee (j, i)\in\mathcal{S}\\ 0 & \text{, otherwise} \end{cases}. $$ Hence, the contact map $C$ is square, symmetric and contains a $1$ only where the base positions pair (i.e. at most one $1$ per row or column). Note that by our definition of $\mathcal{S}$, a $1$ at entry $C_{ii}$ indicates that the base at position $i$ does not pair at all. A row or column without any $1$ indicates a base position that exceeds the actual sequence length and corresponds to the "empty slots" (there are exactly $n-m$ empty slots and therefore as many zero rows and columns).
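
A short sketch of this encoding in Python (again with illustrative names), using the $1$-based pair set defined in the introduction:

```python
import numpy as np

def contact_map(structure: set[tuple[int, int]], n: int) -> np.ndarray:
    """Build the n x n binary contact map C for a secondary structure of a
    sequence of length m <= n; rows/columns beyond m stay all-zero."""
    C = np.zeros((n, n), dtype=np.uint8)
    for i, j in structure:
        C[i - 1, j - 1] = 1  # convert 1-based positions to 0-based indices
        C[j - 1, i - 1] = 1  # keep the map symmetric; i == j marks "unpaired"
    return C
```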

Multilayer Perceptron Architecture

Using the binary encoding of sequences and secondary structures presented above results in an ANN architecture with $d=3n$ input neurons and $k=\frac{n(n+1)}{2}$ output neurons (as contact maps are symmetric, we can discard the upper or lower triangular part, keeping the diagonal, and reduce the output dimensionality), where $n$ is the maximum length of supported sequences. As the feature vector encoding the sequence is binary, we choose the $d$ input neurons to be binary units as well. The $k$ output neurons correspond to the entries of the structure's contact map, i.e. they are shown a $1$ for pairing base positions and a $0$ otherwise. From this, the ANN is able to learn base-pair scores in $[0, 1]$ by using logistic sigmoid activations on the output neurons. This activation function is defined by $$\phi_{out}(s) = \frac{1}{1+\exp(-s)}\in[0, 1].$$ In order to increase the modeling capabilities of the ANN, a single layer of $l$ non-linear hidden neurons is added to serve as feature detectors. They implement the commonly used hyperbolic tangent (tanh) sigmoid activation $$\phi_{hidden}(s) = \tanh(s)\in [-1, 1].$$ The multilayer perceptron architecture is visualized in Figure 3.

Figure 3 A multilayer perceptron with feed-forward architecture, representing a differentiable high-dimensional function and enabling efficient layer-wise vector computations on data points. Data points are applied to the input layer, which feeds them into the hidden layer. The hidden layer performs a non-linear transformation of the inputs (often interpreted as feature extraction) and is connected to the output layer, which determines the final score.
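
The following sketch shows the corresponding forward pass in Python with NumPy; the weight matrices `W1` (shape $l\times d$), `W2` (shape $k\times l$) and bias vectors `b1`, `b2` are assumed to be initialized elsewhere, and all names are our own.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass: d = 3n binary inputs -> l tanh hidden units ->
    k = n(n+1)/2 logistic outputs (upper triangular of the contact map)."""
    h = np.tanh(W1 @ x + b1)                    # phi_hidden
    p = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))    # phi_out, scores in [0, 1]
    return h, p
```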

Multilayer Perceptron Training

In order to train the ANN to predict reasonable base-pair scorings, we presented it with sequences and their possible secondary structures. For each training sequence $S$, two secondary structures were generated and shown to the ANN:
  • A combinatorics-based structure using Nussinov's algorithm (maximizing the number of base-pairs). Each training epoch generates a random structure with maximum base-pairs to show the ANN general base-pair formation and valid pairs without further constraints.
  • A thermodynamics-based structure using Zuker's algorithm. This shall motivate the ANN to learn the more energetically stable structures produced by state-of-the-art folding algorithms.
The structures were encoded as described in the corresponding section above, and the network was trained with the objective of minimizing the cross-entropy error function. The point-wise version of this objective (the error that the ANN makes on a single base-pair) is given by \[E_{CE}(i, j) = -C_{ij}\ln P_{ij} - (1-C_{ij})\ln(1-P_{ij}),\] where $C_{ij}\in\{0, 1\}$ is the contact map entry of the structure presented to the network and $P_{ij}\in[0,1]$ is the score prediction of the ANN for the base-pair $S[i]$ and $S[j]$. Hence, the whole objective for the ANN on a single sequence $S$ reads \[\min \sum_{i=1}^n \sum_{j=i}^n E_{CE}(i, j).\] In order to minimize this error function, our network uses stochastic gradient descent with adaptive learning rates (based on a leaky gradient average) [3]. The gradients are obtained using the backpropagation algorithm [3]. Due to the ambiguity of secondary structures produced by the folding algorithms, the ANN may see different structures for the same sequence over multiple training epochs.
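
A sketch of a single training step, reusing the `forward` function from above: for logistic output units under the cross-entropy error, the output error term simplifies to $P_{ij} - C_{ij}$, which the backpropagation below exploits. For simplicity, the sketch uses plain SGD with a fixed learning rate instead of the adaptive per-weight rates of our actual implementation.

```python
def sgd_step(x, c, W1, b1, W2, b2, lr=0.01):
    """One stochastic gradient step on a single example: x is the encoded
    sequence, c the target upper-triangular contact-map entries."""
    h, p = forward(x, W1, b1, W2, b2)
    delta_out = p - c                              # dE_CE/ds for logistic + CE
    delta_hid = (W2.T @ delta_out) * (1.0 - h**2)  # backprop through tanh
    W2 -= lr * np.outer(delta_out, h)
    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x)
    b1 -= lr * delta_hid
```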

The Tuning Engine

The tuning engine modifies the basic score matrix $M\in[0,1]^{n\times n}$ of a sequence $S$ of length $n$ obtained from the scoring model. The goal is to clean the scores or adapt them for specific parameters and constraints.

Base Pair Violations

One basic task of this component is to clean the scores from base pair violations by setting the score of those pairs to zero. A base pair at positions $i$ and $j$ of $S$ is considered valid if either of the following holds:
  • $\{S[i], S[j]\}$ forms a Watson-Crick base pair, i.e. $\{S[i], S[j]\}\in\{\{A,T\}, \{C,G\}, \{A,U\}\}$.
  • $\{S[i], S[j]\}$ forms a Wobble base pair, i.e. $\{S[i], S[j]\}=\{G,U\}$.
Hence, we set $M_{ij}:= 0$ for all entries for which $S[i]$ and $S[j]$ do not form a valid base pair; the diagonal entries, which score the unpaired case, are left untouched.
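
In Python, this cleaning step might look as follows (a sketch with illustrative names):

```python
# Allowed Watson-Crick and wobble pairs, as unordered sets of bases.
VALID_PAIRS = ({"A", "T"}, {"A", "U"}, {"C", "G"}, {"G", "U"})

def clean_violations(M, S):
    """Zero the scores of invalid base pairs; diagonal entries (the
    "unpaired" scores) are left untouched."""
    n = len(S)
    for i in range(n):
        for j in range(n):
            if i != j and {S[i], S[j]} not in VALID_PAIRS:
                M[i, j] = 0.0
    return M
```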

Adjusting the Folding Temperature

The temperature at which we consider RNA or DNA folding has a direct impact on the secondary structure that is formed. However, the ANN was not trained on structures folded at different temperatures. We therefore use the following heuristic, motivated by behavior actually observed in biological structures: Increasing the folding temperature induces higher thermodynamic energy on the base pairs and weakens their hydrogen bonds. Weakening the base pairs (i.e. increasing the temperature) means increasing the score that a base will not pair at all; likewise, strengthening the base pairs (i.e. decreasing the temperature) means decreasing that score. In terms of the score matrix, this corresponds to scaling the diagonal elements of $M$. A general formula for temperature scaling is then given by \[\forall i=1,\dots,n : M_{ii} := \alpha \frac{t}{37} \cdot M_{ii},\] where $t\in\mathbb{R}_+$ is the folding temperature and $\alpha\in\mathbb{R}_+$ is a scaling factor that indicates how much we scale the score down or up. Note that we assume the scorer is based on pairings at $37\,^\circ C$. Furthermore, we still have to determine a scaling factor $\alpha$ that yields reasonable structures at different temperatures.
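
As a sketch (the calibration of $\alpha$ is left open, as discussed above):

```python
def adjust_temperature(M, t, alpha=1.0):
    """Scale the diagonal (non-pairing) scores of M by alpha * t / 37,
    assuming the scorer was trained on structures folded at 37 deg C."""
    n = M.shape[0]
    for i in range(n):
        M[i, i] *= alpha * t / 37.0
    return M
```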

Probability Distributor

The task of this component lies in generating a probability distribution from the base pair scores of a sequence. In particular, we want a multidimensional discrete distribution of the probability for each base to pair with another base given a sequence $S$ of length $n$ \[\forall i,j\in\{1,2,\dots, n\}: (i,j) \sim p((x, y) \mid S),\] where $(i, j)$ denotes a base pair at positions $i$ and $j$ in $S$. In fact, the final score matrix $M\in\mathbb{R}_+^{n\times n}$ for $S$ already induces such a probability distribution: Element $M_{ij}$ corresponds to the score of the base pair at position $(i, j)$, and the $i$-th row of $M$ therefore contains the (unnormalized) scores for position $i$ pairing with every other position. Hence, normalizing the rows of $M$ to sum to $1$ results in a matrix $P\in[0,1]^{n \times n}$ containing the contact probabilities for each base pair $$ P_{ij}:= \frac{M_{ij}}{\sum_{k=1}^n M_{ik}}. $$
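
This normalization is a one-liner in NumPy; note that an all-zero row (which could occur after aggressive violation cleaning) would need special handling in practice:

```python
def to_probabilities(M):
    """Normalize each row of the score matrix to sum to 1, yielding the
    contact probability matrix P."""
    return M / M.sum(axis=1, keepdims=True)
```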

Structure Generation: Maximum Likelihood Traceback

Based on the contact probabilities of the individual base-pairs, we show how to generate individual secondary structures. This is an application of the results created by the trained ANN, and a variety of approaches may be used to generate structures on this foundation. For example, other applications may simply visualize the contacts of the base-pairs. We derived a simple maximum-likelihood traceback by adapting Nussinov's recursive equations to maximize the base pair probabilities of the final structure.

Let $P\in[0,1]^{n\times n}$ be the contact probabilities of each base-pair within a strand $S$ of length $n$, predicted by our ANN. In particular, we assume

  • if $i < j$, entry $P_{ij}$ denotes the probability that $S[i]$ and $S[j]$ form a base-pair.
  • a diagonal entry $P_{ii}$ denotes the probability that $S[i]$ does not form a base-pair at all.
We want to generate the most likely pseudo-knot free secondary structure $\mathcal{S}\subseteq\mathbb{N}\times\mathbb{N}$ (see the definition of secondary structures in the introduction) according to the pairing probabilities in $P$. This formalizes to the maximum likelihood objective $$\max_{\mathcal{S}} \prod_{(i,j)\in\mathcal{S}}P_{ij} \Longleftrightarrow \max_{\mathcal{S}}\sum_{(i,j)\in\mathcal{S}}\log P_{ij}.$$

We can generate a structure with maximum log probability using a dynamic programming matrix $L$ as follows: Assume the entry $L(i, j)$ contains the maximum log probability for the substrand $S[i],\dots,S[j]$. A recursive decomposition to compute entry $L(i, j)$ reads $$L(i, j)=\max\begin{cases} L(i+1, j-1)+\log P_{ij} \\ \max_{i\leq k < j}\{L(i, k) + L(k+1, j)\} \end{cases}$$ The recursive base cases consist of the diagonal entries, which are initialized for $i=1,\dots,n$ as $$L(i, i)=\log P_{ii}.$$ Notice that only the upper triangular part of $L$ is relevant, and it can be efficiently computed using dynamic programming. The maximum possible log probability of a secondary structure for strand $S$ can afterwards be read from $L(1, n)$ and traced back to generate a most likely secondary structure.
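
A Python sketch of the complete procedure, dynamic program plus traceback (names are illustrative; zero probabilities are mapped to $-\infty$ so that forbidden pairs are never chosen, and a production version would store explicit backpointers instead of re-comparing floating-point values):

```python
import math

def log_p(x):
    return math.log(x) if x > 0.0 else -math.inf

def ml_traceback(P):
    """Fill L bottom-up and trace back one most likely pseudo-knot free
    structure as a set of 1-based position pairs."""
    n = len(P)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = log_p(P[i][i])                 # base case: position unpaired
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            inner = L[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
            L[i][j] = inner + log_p(P[i][j])     # case: i pairs with j
            for k in range(i, j):                # case: bifurcation
                L[i][j] = max(L[i][j], L[i][k] + L[k + 1][j])

    structure = set()
    def trace(i, j):
        if i > j:
            return
        if i == j:
            structure.add((i + 1, i + 1))        # unpaired position, 1-based
            return
        inner = L[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
        if L[i][j] == inner + log_p(P[i][j]):
            structure.add((i + 1, j + 1))        # pair (i, j), 1-based
            trace(i + 1, j - 1)
            return
        for k in range(i, j):
            if L[i][j] == L[i][k] + L[k + 1][j]:
                trace(i, k)
                trace(k + 1, j)
                return

    trace(0, n - 1)
    return structure
```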

References

  1. Nussinov R, Pieczenik G, Griggs JR and Kleitman DJ. Algorithms for loop matchings. SIAM Journal on Applied Mathematics (1978)
  2. Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9:133-148 (1981)
  3. YA LeCun, L Bottou, GB Orr, KR Mueller. Efficient backprop. Neural networks: Tricks of the trade, 9-48 (1998)