Team:BostonU/Modeling

Modeling

A central part of our project was to develop and refine a model that predicts the best places to split a protein, so that conditional dimerization can be implemented as efficiently as possible. A protein's primary sequence is the fundamental structure of the polypeptide chain: a string of amino acids covalently linked by peptide bonds. In theory, a protein that is n amino acids long has n-1 possible split sites, because the primary sequence can be split between any two adjacent amino acids (i.e. any peptide bond can be cleaved to yield two halves of the protein).
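To make the n-1 count concrete, here is a minimal Python sketch that enumerates every possible split of a short sequence. The function name `split_sites` and the example peptide are hypothetical and are not taken from our Matlab model.

```python
def split_sites(sequence):
    """Return every (N-terminal, C-terminal) fragment pair obtained by
    cutting exactly one peptide bond in the sequence."""
    # A cut after residue i (1-based) leaves i residues in the first half.
    return [(sequence[:i], sequence[i:]) for i in range(1, len(sequence))]


peptide = "MKTAYIAK"  # hypothetical 8-residue sequence
halves = split_sites(peptide)
print(len(halves))    # 7, i.e. n - 1 for n = 8
print(halves[0])      # ('M', 'KTAYIAK')
```

Even for a protein of a few hundred residues this produces hundreds of candidate constructs, which is why testing each one experimentally is impractical.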

Testing all of these configurations individually would be infeasible given our time, cost, and effort constraints. Researchers interested in splitting proteins typically rely on primary sequences annotated with functional domains, which are generally identified from crystal structures (generally the quaternary structure). However, this can be a laborious trial-and-error process, and the crystal structures of many proteins are not known. Furthermore, this approach does not account for other fundamental structural aspects of protein folding, including secondary and tertiary structure.

A model was previously developed in Matlab by Billy Law, a graduate student in Wilson Wong’s lab; we built on Billy’s preliminary model in our project. The overall goal of the model was to narrow the window of split site choices by focusing on feasible regions: regions least likely to interfere with secondary, tertiary, and quaternary structure elements. We hypothesized that such a model could become an important predictive tool for scientists if it could ultimately find the few optimal places to split a protein, producing split halves that are inert on their own but yield robust activity when dimerization occurs.

We focused on three major criteria to narrow down split site choices.

Our first criterion was to avoid the secondary structures in the protein: the alpha helices and beta sheets. (This was part of Billy’s original model.) We hypothesized that splitting through these structures could disrupt protein folding, activity, and function.

We used an online tool called JPRED to predict where alpha helices and beta sheets would occur in the protein. It also indicated a few likely catalytic residues. The tool takes the primary structure (the linear amino acid sequence) of the protein as input and outputs a secondary structure prediction based on that sequence. The model aligned the predicted secondary structure regions against the primary sequence, and we did not pick split sites that fell within these regions.
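The alignment step can be sketched in Python as follows. This is an illustration only, not the Matlab code itself: it assumes the prediction is available as a string of the same length as the sequence, using `H` for helix, `E` for strand, and `-` for everything else, and the function name and example strings are hypothetical.

```python
def candidate_sites(sequence, ss_prediction):
    """Return cut positions i (a cut between residues i and i+1, 1-based)
    whose two flanking residues are both predicted to be outside
    helices ('H') and strands ('E')."""
    if len(sequence) != len(ss_prediction):
        raise ValueError("sequence and prediction must have equal length")
    sites = []
    for i in range(1, len(sequence)):
        # ss_prediction[i - 1] is residue i, ss_prediction[i] is residue i + 1
        if ss_prediction[i - 1] == "-" and ss_prediction[i] == "-":
            sites.append(i)
    return sites


seq = "MKTAYIAKQRGS"   # hypothetical sequence
pred = "--HHHH--EE--"  # hypothetical per-residue prediction
print(candidate_sites(seq, pred))  # [1, 7, 11]
```

Requiring both flanking residues to be unstructured is a conservative choice made for this sketch; a looser rule would also accept cuts at the boundary of a helix or strand.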

Our second criterion was to avoid the secondary structures in the protein: the alpha helices and beta sheets. We hypothesized that splitting through these structures could also disrupt protein folding, activity, and function.

Our third criterion was to avoid splitting the catalytic domain of each protein. We hypothesized that splitting this functional domain could interfere with the protein’s activity, and we wanted to avoid any such disruption so that our system would still show robust activity when the protein halves dimerize.
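A minimal Python sketch of the catalytic domain filter is shown below. It assumes the domain is known as an inclusive range of 1-based residue numbers; the function name, the candidate positions, and the domain boundaries are hypothetical values chosen for illustration.

```python
def outside_domain(sites, domain_start, domain_end):
    """Keep only cut positions that do not fall inside the domain.

    A cut at position s separates residue s from residue s + 1, so it lies
    inside the domain when both of those residues belong to it, that is
    when domain_start <= s < domain_end.
    """
    return [s for s in sites if not (domain_start <= s < domain_end)]


candidates = [1, 7, 11]  # hypothetical cut positions from an earlier filter
print(outside_domain(candidates, domain_start=5, domain_end=9))  # [1, 11]
```

Under this definition a cut exactly at the edge of the domain (between its last residue and the next one) is kept, since it leaves the domain intact.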

Using these criteria, we built and adjusted a model in Matlab that helped us predict the best places to split our proteins. Below are the hydrophobicity curves produced by our model for each of the proteins that we split, with the split sites that we chose shown in black.
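For readers unfamiliar with hydrophobicity curves, here is a rough Python sketch of how one can be computed. It assumes the Kyte-Doolittle hydropathy scale and a sliding window of 9 residues; the scale and window size used in the Matlab model are not stated here, so both are assumptions, as are the function name and the example sequence.

```python
# Kyte-Doolittle hydropathy values (higher means more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


profile = hydropathy_profile("MKTAYIAKQRGS", window=9)  # hypothetical sequence
print(len(profile))  # 4 windows for a 12-residue sequence
print(profile)
```

Plotting the returned values against window position gives a curve in which peaks mark hydrophobic stretches and troughs mark hydrophilic ones.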

We realized that these criteria do not account for a protein’s 3-D structure. As a result, our model ignores the loops and turns between alpha helices and beta sheets. Loops and turns are often among the parts of a protein that contribute most to its function, for example as binding sites, and they can be identified from the protein’s 3-D structure.

However, not all proteins have known 3-D structures. We therefore conclude that our model can be used when the 3-D structure is unknown, but that when a 3-D structure is available, examining it is likely to identify the most viable split sites more reliably.