Difference between revisions of "Team:BostonU/Modeling"

Revision as of 15:26, 18 September 2015

Modeling

One of the main aspects of our project was to develop and refine a model that would help us predict the best places to split a protein, in order to implement conditional dimerization as efficiently as possible. Protein primary sequence corresponds to the fundamental structure of the polypeptide chain; it is composed of a string of amino acids that are covalently linked by peptide bonds. Theoretically, given a protein that is n amino acids long, there are n-1 places to split the protein, since one can split the primary sequence between any two adjacent amino acids (i.e. any peptide bond can be cleaved to yield two halves of the protein).
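To make the n-1 count concrete, here is a minimal Python sketch that enumerates every possible pair of halves for a short sequence. The function name `split_sites` and the 8-residue toy sequence are hypothetical and are not taken from our MATLAB model.

```python
def split_sites(sequence):
    """Return every (N-terminal half, C-terminal half) pair.

    Each pair corresponds to cutting one peptide bond, so a sequence of
    n residues yields n - 1 pairs.
    """
    return [(sequence[:i], sequence[i:]) for i in range(1, len(sequence))]


# Hypothetical toy sequence, for illustration only.
toy_sequence = "MKTAYIAK"
halves = split_sites(toy_sequence)
print(len(toy_sequence), "residues ->", len(halves), "possible split sites")
```

Even for this toy case the list grows linearly with protein length, which is why a protein of several hundred residues cannot realistically be tested site by site.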

Testing all of these configurations individually would be infeasible given our time, cost, and effort constraints. Researchers who are interested in splitting proteins typically rely on annotated primary sequences corresponding to functional domains that are generally understood through crystal structure (generally the quaternary structure). However, this can be a laborious trial-and-error process, and the crystal structures of many proteins are not known. Furthermore, this approach does not account for other fundamental structural aspects related to protein folding, including secondary and tertiary structures.

A model was previously developed in MATLAB by a graduate student in Wilson Wong’s lab, Billy Law; we built on Billy’s preliminary model in our project. The overall goal of the model was to narrow down the window of split site choices by focusing on feasible regions: regions that would be least likely to interfere with secondary, tertiary, and quaternary structure elements. We hypothesized that such a model could become an important predictive tool for scientists if it could ultimately find the few optimal places to split a protein that would create inert split halves but yield robust activity when dimerization occurs.

We focused on three major criteria to narrow down split site choices.

Our first criterion was to avoid the secondary structures in the protein: the alpha helices and beta sheets. (This was part of Billy’s original model.) We hypothesized that splitting through these helices and sheets could disrupt folding and function.

We used an online tool called JPRED to predict where there would be alpha helices and beta sheets in the protein. It also predicted where there were likely a few catalytic residues. This tool took the primary structure (linear amino acid sequence) of the protein as input and returned a secondary structure prediction based on that sequence. The model aligned the predicted secondary structure regions against the primary sequence, and we did not pick split sites that fell in these regions.
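The alignment step can be sketched in Python as follows. This is an illustration under stated assumptions, not our MATLAB code: it assumes the prediction is available as a string with one character per residue, using `H` for helix, `E` for strand, and `-` for everything else, and it assumes a bond is acceptable only when both flanking residues are unstructured. The function names are hypothetical.

```python
def structured_positions(prediction):
    """Return one boolean per residue: True for predicted helix (H) or strand (E)."""
    return [state in ("H", "E") for state in prediction]


def bonds_outside_structure(prediction):
    """Return 1-based positions k such that cutting after residue k
    does not touch a predicted helix or strand on either side."""
    mask = structured_positions(prediction)
    return [
        i + 1
        for i in range(len(mask) - 1)
        if not mask[i] and not mask[i + 1]
    ]


# Hypothetical prediction string, for illustration only.
toy_prediction = "--HHHH---EEE--"
print(bonds_outside_structure(toy_prediction))  # [1, 7, 8, 13]
```

Requiring both neighbours to be unstructured is a conservative choice; a looser rule could also allow cuts at the very edge of a helix or strand.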

Our second criterion was to avoid largely hydrophobic regions, likely corresponding to the protein’s interior regions. (This was part of Billy’s original model.) Proteins often tend to adopt globular structures that bury hydrophobic residues within a core, such that they do not unfavorably interact with hydrophilic solvents. The exterior surface residues, in contrast, are generally hydrophilic. We hypothesized that splitting a protein through its hydrophobic core could potentially greatly interfere with its folding ability and overall function. Therefore, we focused on avoiding hydrophobic regions in the protein and targeting hydrophilic regions.

We used the Janin hydrophobicity scale¹, which assigns each amino acid in the primary sequence an index based on how hydrophobic it is (the higher the number, the “more hydrophobic” the amino acid). In the model, we took a running average of the hydrophobicity over windows of 11 consecutive amino acids to create a hydrophobicity profile of the entire protein. The model aligned the hydrophobicity profile against the primary sequence, and we did not pick split sites that fell in hydrophobic windows.
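A running average of this kind can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration rather than our MATLAB implementation: the scale is passed in as a dictionary so that no scale values are hard-coded here, only full windows are averaged (how the original model treated the sequence ends is not specified), and the two-letter toy scale in the usage example is hypothetical and is not the Janin scale.

```python
def hydrophobicity_profile(sequence, scale, window=11):
    """Return the mean hydrophobicity of every full window of `window` residues.

    `scale` maps one-letter amino acid codes to hydrophobicity indices.
    The result has len(sequence) - window + 1 entries (empty if the
    sequence is shorter than the window).
    """
    values = [scale[residue] for residue in sequence]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Hypothetical toy scale and sequence, for illustration only.
toy_scale = {"A": 1.0, "K": -1.0}
print(hydrophobicity_profile("AAAK", toy_scale, window=2))  # [1.0, 1.0, 0.0]
```

With the default window of 11, entry j of the profile summarizes residues j+1 through j+11, so a stretch of high values marks a hydrophobic window to avoid.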

Our third criterion was to avoid splitting within a known catalytic domain of each of the proteins. (This was our main contribution to the model itself.) We hypothesized that splitting within this functional domain could substantially interfere with the protein’s overall activity. We wanted to avoid such disruptions so that our system would still have robust activity when the protein halves dimerize.

We looked into the literature and found any relevant annotations of our proteins of interest, especially noting where catalytic domains may have been located. Additionally, Billy’s model had included a few hypothesized catalytic residues for TP901-1 and PhiC31 integrases. The model aligned the “annotated catalytic domain” against the primary sequence, and we did not pick split sites that fell in this window.

Using these three criteria, we used our MATLAB tool to predict optimal regions in which to split our proteins. We chose 4-10 initial split sites for each protein, which is a much more realistic number to test compared to testing all possible split locations!
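As a rough illustration of how the three criteria can be combined, the Python sketch below marks each residue as excluded if it is in a predicted helix or strand, in a hydrophobic window, or in an annotated catalytic domain, and then reports the bonds whose two flanking residues are all clear. Everything here is an assumption made for illustration: the function name, the `H`/`E`/`-` encoding of the prediction, the per-residue hydrophobic flags, the 1-based inclusive domain ranges, and the toy inputs are hypothetical and do not reproduce our MATLAB tool.

```python
def candidate_split_sites(structure, hydrophobic, catalytic_ranges):
    """Return 1-based positions k where a cut after residue k passes all criteria.

    structure        -- string, one character per residue (H, E, or -)
    hydrophobic      -- list of booleans, True if the residue lies in a
                        hydrophobic window
    catalytic_ranges -- list of (start, end) pairs, 1-based and inclusive
    """
    n = len(structure)
    if len(hydrophobic) != n:
        raise ValueError("structure and hydrophobic must have the same length")

    # Criteria 1 and 2: secondary structure and hydrophobic windows.
    excluded = [
        structure[i] in ("H", "E") or hydrophobic[i] for i in range(n)
    ]
    # Criterion 3: annotated catalytic domains.
    for start, end in catalytic_ranges:
        for i in range(max(start, 1) - 1, min(end, n)):
            excluded[i] = True

    # A bond after residue k is kept only if residues k and k+1 are both clear.
    return [k for k in range(1, n) if not excluded[k - 1] and not excluded[k]]


# Hypothetical 10-residue inputs, for illustration only.
toy_structure = "--HHH-----"
toy_hydrophobic = [True] + [False] * 9
toy_catalytic = [(9, 10)]
print(candidate_split_sites(toy_structure, toy_hydrophobic, toy_catalytic))  # [6, 7]
```

The output of a filter like this is a set of permitted regions rather than a ranking, so choosing a handful of sites to test within those regions remains a judgment call.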

Below are the visual images that include all criteria for each of the proteins that we tested. Our chosen split sites to test are indicated in black. We did not find catalytic domain/residue annotations for all of the proteins, and thus these images may not have these elements.

Our results for our protein splits can be found in the individual results sections of our application pages. Integrase/RDF Results (link) saCas9 Results (link). (This was our main contribution to model validation.)

Our experimental results ultimately showed that our model had promise, but that its predictive capacity could still be refined. Read our considerations below to learn more.

However, not all proteins have known 3-D structures. We therefore conclude that our model can be used when 3-D structures are unknown, but in order to best identify the most viable split sites, it can be more beneficial to examine the 3-D structure of a protein.

@@ Line 95: / Line 95: @@
 </p>
 <p>
-Using these criteria, we were able to build and manipulate a model in Matlab that would help us predict the best places to split our proteins. Below are the hydrophobicity curves for each of the proteins that we split produced by our model, along with the split sites that we chose shown in black.
+Our third criterion was to avoid splitting within a known catalytic domain of each of the proteins. (This was our main contribution to the model itself.) We hypothesized that splitting within this functional domain could substantially interfere with the protein’s overall activity. We wanted to avoid such disruptions so that our system would still have robust activity when the protein halves dimerize.
 </p>
 <p>
-We realized that these criteria do not account for a protein’s 3-D structure. As a result, our model ignores the loops and turns in between alpha helices and beta sheets. Loops and turns are structures in the protein that can contribute the most to protein function, such as binding sites, and can be identified in a protein’s 3-D structure.
+We looked into the literature and found any relevant annotations of our proteins of interest, especially noting where catalytic domains may have been located. Additionally, Billy’s model had included a few hypothesized catalytic residues for TP901-1 and PhiC31 integrases. The model aligned the “annotated catalytic domain” against the primary sequence, and we did not pick split sites that fell in this window.
+</p>
+<p>
+Using these three criteria, we used our MATLAB tool to predict optimal regions in which to split our proteins. We chose 4-10 initial split sites for each protein, which is a much more realistic number to test compared to testing all possible split locations!
+</p>
+<p>
+Below are the visual images that include all criteria for each of the proteins that we tested. Our chosen split sites to test are indicated in black. We did not find catalytic domain/residue annotations for all of the proteins, and thus these images may not have these elements.
+</p>
+<p>
+Our results for our protein splits can be found in the individual results sections of our application pages. Integrase/RDF Results (link) saCas9 Results (link). (This was our main contribution to model validation.)
+</p>
+<p>
+Our experimental results ultimately showed that our model had promise, but that its predictive capacity could still be refined. Read our considerations below to learn more.
 </p>
 <p style="padding-bottom:60px;">