Difference between revisions of "Team:Peking/Modeling/Analysis"

Line 36: Line 36:
 
         .formula-line{max-height:60px;margin-top:0px; margin-bottom:0px;}
 
         .formula-line{max-height:60px;margin-top:0px; margin-bottom:0px;}
 
         .big-formula-line{max-height:80px;margin-top:0px; margin-bottom:0px;}
 
         .big-formula-line{max-height:80px;margin-top:0px; margin-bottom:0px;}
        .huge-formula-line{max-height:100px;margin-top:0px; margin-bottom:0px;}
 
 
         ul.disc {list-style-type:disc;}
 
         ul.disc {list-style-type:disc;}
 
         ul.circle {list-style-type:circle;}
 
         ul.circle {list-style-type:circle;}
Line 80: Line 79:
 
                                     <li><a href="https://2015.igem.org/Team:Peking/Design">Overview</a>
 
                                     <li><a href="https://2015.igem.org/Team:Peking/Design">Overview</a>
 
                                     </li>
 
                                     </li>
                                     <li><a href="https://2015.igem.org/Team:Peking/Design/PC_Reporter">Paired dCas9 Reporter</a>
+
                                     <li><a href="http://2015.igem.rg/Team:Peking/Design/Sequence-specific visualization">Sequence-specific Visualization</a>
 
                                     </li>
 
                                     </li>
 
                                     <li><a href="https://2015.igem.org/Team:Peking/Design/Isothermal">Iso-thermal Amplification</a>
 
                                     <li><a href="https://2015.igem.org/Team:Peking/Design/Isothermal">Iso-thermal Amplification</a>
Line 95: Line 94:
 
                                 <a class="active" href="https://2015.igem.org/Team:Peking/Modeling">Modeling</a>
 
                                 <a class="active" href="https://2015.igem.org/Team:Peking/Modeling">Modeling</a>
 
                                 <ul class="dropdown">
 
                                 <ul class="dropdown">
                                     <li><a class="active" href="https://2015.igem.org/Team:Peking/Modeling">Array Design</a>
+
                                     <li><a href="https://2015.igem.org/Team:Peking/Modeling">Array Design</a>
 
                                     </li>
 
                                     </li>
                                     <li><a href="https://2015.igem.org/Team:Peking/Modeling/Analysis">Analysis algorithm</a>
+
                                     <li><a class="active" href="https://2015.igem.org/Team:Peking/Modeling/Analysis">Analysis algorithm</a>
 
                                     </li>
 
                                     </li>
 
                                 </ul>
 
                                 </ul>
Line 143: Line 142:
 
                         <div class="col-md-6">
 
                         <div class="col-md-6">
 
                             <h2><b>Modeling</b></h2>
 
                             <h2><b>Modeling</b></h2>
                             <p>Specificity!!!!</p>
+
                             <p>Reliability!!!!</p>
 
                         </div>
 
                         </div>
 
                         <div class="col-md-6">
 
                         <div class="col-md-6">
Line 165: Line 164:
 
                                 <h4 style="font-size:18px">Marker Finder<span class="head-line"></span></h4>
 
                                 <h4 style="font-size:18px">Marker Finder<span class="head-line"></span></h4>
 
                                 <ul>
 
                                 <ul>
                                     <li><a href="#Modeling-Overview">Overview</a></li>
+
                                     <li><a href="#Analysis-Background">Background</a></li>
                                     <li><a href="#Modeling-SSPD">SSPD Method</a></li>
+
                                     <li><a href="#Analysis-Model">Model</a></li>
                                     <li><a href="#Modeling-OligoGenerator">Oligo Generator</a></li>
+
                                     <li><a href="#Analysis-Result">Result</a></li>
 
                                 </ul>
 
                                 </ul>
 
                             </div>
 
                             </div>
 
                             <!-- Sidebar for windows < 1024px -->
 
                             <!-- Sidebar for windows < 1024px -->
 
                             <div id="sidebar2"class="widget widget-categories">
 
                             <div id="sidebar2"class="widget widget-categories">
                                 <h4 style="font-size:18px">Marker Finder<span class="head-line"></span></h4>
+
                                 <h4 style="font-size:18px">Team <span class="head-line"></span></h4>
 
                                 <ul>
 
                                 <ul>
                                     <li><a href="#Modeling-Overview">Overview</a></li>
+
                                     <li><a href="#Analysis-Background">Background</a></li>
                                     <li><a href="#Modeling-SSPD">SSPD Method</a></li>
+
                                     <li><a href="#Analysis-Model">Model</a></li>
                                     <li><a href="#Modeling-OligoGenerator">Oligo Generator</a></li>
+
                                     <li><a href="#Analysis-Result">Result</a></li>
 
                                 </ul>
 
                                 </ul>
 
                             </div>
 
                             </div>
Line 183: Line 182:
 
                         <!-- Page Content -->
 
                         <!-- Page Content -->
 
                         <div class="col-md-9 page-content">
 
                         <div class="col-md-9 page-content">
                             <div id="Modeling-Marker Finder">
+
                             <div id="Analysis-Signal Analysis">
 
                                 <!-- Classic Heading -->
 
                                 <!-- Classic Heading -->
                             <div id="Modeling-Overview" class="col-md-12" style="padding:0">
+
                             <div id="Analysis-Background" class="col-md-12" style="padding:0">
                                 <h3 class="classic-title" style='margin-top:50px'><span>Overview</span></h3>
+
                                 <h3 class="classic-title" style="margin-top:50px"><span>Background</span></h3>
 
                                 <!-- Some Text -->
 
                                 <!-- Some Text -->
                                 <p>To increase the accuracy and specificity of detection, we developed an assay based on our Paired dCas9 Reporter (PC Reporter) System to obtain more sequence information from the target genome. The core and first step of the array design is to screen the entire genome for paired sequences (CRISPR target sites) with high specificity to serve as markers. <br>
+
                                 To increase the accuracy and specificity of detection, we developed an assay based on our Paired dCas9 Reporter (PC Reporter) System that extracts more sequence information from the target genome and thereby gives a more reliable result. We designed m pairs of gRNA-specific target sites as m markers in the MTB genome. To check whether this idea actually works, we tested the gRNAs against both the target gene and a mismatched gene. In the experimental group, the gRNAs were used to detect the target gene; in the control group, the gRNAs were used to detect the mismatched gene. To reduce random error, both the experimental and the control measurements were repeated n times. The results are optical power signals generated by our Paired dCas9 Reporter System, so the difference can be seen directly by comparing the signal intensity for the target gene with that for the mismatched gene.
                                    To achieve this, we developed a method named SSPD, which is composed of 4 steps: <br>
+
                                <h4><em>Assumptions</em></h4>
                                    <ul class='disc'>
+
                                <ul class='disc'>
                                        <li>Search for guide sequences of gRNA candidates</li>
+
                                    <li>There is no recognition site for gRNA in the mismatched gene.</li>
                                        <li>Specificity test for each candidate</li>
+
                                    <li>The measured values <img alt='Peking-Analysis-X_iY_i.gif' src="https://static.igem.org/mediawiki/2015/c/cf/Peking-Analysis-X_iY_i.gif" class='formula-inline'>
                                        <li>Pair left and right target sites with optimal spacer length</li>
+
                                    of the control and experimental groups are independent of each other.</li>
                                        <li>Design PCR fragments</li>
+
                                    <li>The measured values from the n repeated tests form the <img alt='Peking-Analysis-X_iY_i.gif' src="https://static.igem.org/mediawiki/2015/c/cf/Peking-Analysis-X_iY_i.gif" class='formula-inline'> sample sets, respectively, and both sample sets are small.</li>
                                    </ul>
+
                                 </ul>
                                    We introduce each step in detail, using the analysis of the Mycobacterium tuberculosis (MTB) genome as an example. Once the target sites are chosen, our Oligo Generator turns them into oligonucleotide sequences for subsequent sgRNA construction, in combination with our gRNA generator (Part).
+
                                 </p>
+
 
                             </div>
 
                             </div>
                             <!--Modeling-SSPD-->
+
                             <!--Analysis-Model-->
 
                             <!-- Classic Heading -->
 
                             <!-- Classic Heading -->
                             <div id="Modeling-SSPD" class="col-md-12" style="padding:0">
+
                             <div id="Analysis-Model" class="col-md-12" style="padding:0">
                                 <h3 class="classic-title" style='margin-top:50px'><span>SSPD Methods</span></h3>
+
                                 <h3 class="classic-title" style="margin-top:50px"><span>Model</span></h3>
                                 <!--Modeling-SSPD-S1-->
+
                                 <h4><em>Wilcoxon Rank Sum Test of Block Design</em></h4>
                                <div id="SSPD-S1" class='col-md-12' style='margin:0; padding:0'>
+
                                <p>Because the signals produced by our Paired dCas9 Reporter System have unknown distributions and unequal variances, we chose a non-parametric method, the Wilcoxon Rank Sum Test of Block Design, which works on the ranks of the data, instead of ANOVA.</p>
                                    <h4><em>Search for guide sequences of gRNA candidates</em></h4>
+
                                <p>
                                    <p>Recall the structure of the Paired dCas9 Reporter (PC Reporter) System: a protospacer adjacent motif (PAM) sequence of the form 5’-NGG-3’ lies at the 3’ end of the guide sequence, usually 20 bp, on the non-complementary strand (Figure 1). Our experimental results showed that the PAM-out orientation (5’-CCNN20-…-N20NGG-3’) is highly efficient for the PC Reporter system, so our model focuses on this orientation. (However, our program can easily be adjusted to design guide sequences for other orientations. See more in <a href=""><b>Supplementary Information 1 </b></a>)
+
                                In the Block Design, we regarded the detection by the same gRNA under the two treatments, i.e. target and mismatched DNA, as a block. To test for a difference between the two treatments, we test the null hypothesis that the two treatments do not differ. The Wilcoxon Rank Sum statistic <img alt='Peking-Analysis-W_j.gif' src="https://static.igem.org/mediawiki/2015/1/15/Peking-Analysis-W_j.gif" class='formula-inline'> of each block is calculated first by
                                    </p>
+
                                <div align="center" class='row'>
                                    <div class='col-md-12' id="Modeling_Fig1">
+
                                <img class='formula-line' alt="Peking-Analysis-Wj%3DsumR_i" src="https://static.igem.org/mediawiki/2015/0/0e/Peking-Analysis-Wj%3DsumR_i.gif">
                                        <p class='col-md-2'></p>
+
                                <img class='formula-line' alt="Peking-Analysis-W_j_range" src="https://static.igem.org/mediawiki/2015/5/5a/Peking-Analysis-W_j_range.gif">
                                        <img class='col-md-8' alt="Modeling_Fig1" src="https://static.igem.org/mediawiki/2015/9/94/Peking_Modeling_Figure1.png">
+
                                        <p class='col-md-12'>Figure 1. Schematic illustration of guide design in PAM-out orientation. Note the 20nt guide sequence is identical to target non-complementary strand.</p>
+
                                    </div>
+
                                    <p>We used the built-in regular expression module of Python 3.4.3 to search separately for left guide sequences of gRNA (‘(?<=cc).(?=.{20})’) and right guide sequences of gRNA (‘(?<=.{20}).(?=gg)’), which are paired later so that the PC reporter system can function.
+
                                    </p>
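                                    <p>The search described above can be sketched as follows. This is a minimal illustration, not the team's actual program: the function names are hypothetical, the genome is assumed to be a single-line string, only the given strand is scanned, and taking the 20 nt next to the matched PAM base as the guide is an assumption about how the match positions were used.</p>

```python
import re

# The two patterns quoted in the text (lower-case sequence assumed).
LEFT_PATTERN = re.compile(r"(?<=cc).(?=.{20})")   # CCN followed by 20 nt
RIGHT_PATTERN = re.compile(r"(?<=.{20}).(?=gg)")  # 20 nt followed by NGG


def find_left_guides(genome):
    """Return (start, guide) for every CCN-N20 site (hypothetical helper)."""
    seq = genome.lower()
    # The single matched character is the N of CCN; the guide is assumed
    # to be the 20 nt immediately after it.
    return [(m.end(), seq[m.end():m.end() + 20])
            for m in LEFT_PATTERN.finditer(seq)]


def find_right_guides(genome):
    """Return (start, guide) for every N20-NGG site (hypothetical helper)."""
    seq = genome.lower()
    # The single matched character is the N of NGG; the guide is assumed
    # to be the 20 nt immediately before it.
    return [(m.start() - 20, seq[m.start() - 20:m.start()])
            for m in RIGHT_PATTERN.finditer(seq)]


left_sites = find_left_guides("CCA" + "ACGT" * 5)
right_sites = find_right_guides("ACGT" * 5 + "TGG")
```

                                    <p>Because each match consumes only one character and the flanking context sits in lookarounds, overlapping candidate sites are all reported.</p>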
+
                                    <div id="Modeling_Table1">
+
                                        <p class='col-md-1'></p>
+
                                        <table border='1' style='margin:10px;padding:10px' class='col-md-6'>
+
                                            <tr>
+
                                                <td>Left gRNA candidates (CCNN20)</td>
+
                                                <td>414962</td>
+
                                            </tr>
+
                                            <tr>
+
                                                <td>Right gRNA candidates (N20NGG)</td>
+
                                                <td>407371</td>
+
                                            </tr>
+
                                        </table>
+
                                        <p class='col-md-12'>Table 1. The number of gRNA candidates in the Mycobacterium tuberculosis genome</p>
+
                                    </div>
+
 
                                 </div>
 
                                 </div>
                                 <div id="SSPD-S2" class='col-md-12' style='margin:0; padding:0'>
+
                                 where <img alt='Peking-Analysis-R_i.gif' src="https://static.igem.org/mediawiki/2015/1/16/Peking-Analysis-R_i.gif" class='formula-inline'> indicates the rank of <img alt='Peking-Analysis-X_j.gif' src="https://static.igem.org/mediawiki/2015/6/60/Peking-Analysis-X_j.gif" class='formula-inline'> in the pooled sample of both <img alt='Peking-Analysis-X_j.gif' src="https://static.igem.org/mediawiki/2015/6/60/Peking-Analysis-X_j.gif" class='formula-inline'> and <img alt='Peking-Analysis-Y_j.gif' src="https://static.igem.org/mediawiki/2015/7/79/Peking-Analysis-Y_j.gif" class='formula-inline'>. Note that the Wilcoxon Rank Sum statistic <img alt='Peking-Analysis-W_j.gif' src="https://static.igem.org/mediawiki/2015/1/15/Peking-Analysis-W_j.gif" class='formula-inline'> is distribution-free: its null distribution is known as soon as the sample size is known.
                                    <h4><em>Specificity test for each candidate</em></h4>
+
                                </p>
                                     <p>Specificity of a gRNA guide sequence is defined here as the probability that the gRNA binds to its corresponding target site rather than to other, similar non-target sites. It is measured by taking into account both the number of potential off-target sites and the similarity between each off-target site and the unique target site. Since sputum samples are commonly used in MTB detection, we compared our guide sequence candidates with the Human Oral Meta-Genome (HOMG) and reserved the specific sequences orthogonal to the oral meta-genome, to avoid false positive signals in MTB detection. We adopted a BLAST-based two-step filter approach. The two steps are: a) filter out guides with off-targets whose 12 bp PAM-proximal sequence is identical to that of the corresponding target; b) score the reserved gRNAs on specificity. The principle of the scoring rule is that higher specificity gets a higher score (see the details below). Thus we can easily filter out guides with a high off-target probability, which is indicated by a low score.
+
                                <div>
 +
                                    <p>For example, if <img alt='Peking-Analysis-n%3D3.gif' src="https://static.igem.org/mediawiki/2015/4/48/Peking-Analysis-n%3D3.gif" class='small-formula-inline'>,
 +
                                     <img alt='Peking-Analysis-x_sample.gif' src="https://static.igem.org/mediawiki/2015/5/50/Peking-Analysis-x_sample.gif" class='formula-inline'>,
 +
                                    <img alt='Peking-Analysis-y_sample.gif' src="https://static.igem.org/mediawiki/2015/1/15/Peking-Analysis-y_sample.gif" class='formula-inline'>, <br>
 +
                                    so <img alt='Peking-Analysis-xy_sample.gif' src="https://static.igem.org/mediawiki/2015/e/ef/Peking-Analysis-xy_sample.gif" class='formula-inline'>, which implies that <img alt='Peking-Analysis-R_sample.gif' src="https://static.igem.org/mediawiki/2015/8/81/Peking-Analysis-R_sample.gif" class='formula-inline'><br>
 +
                                    Under the null hypothesis, enumerating all possible orderings of the two sample sets gives the distribution of the statistic <img alt='Peking-Analysis-W_j.gif' src="https://static.igem.org/mediawiki/2015/1/15/Peking-Analysis-W_j.gif" class='formula-inline'> shown below:
 +
                                    <table border='1' style='margin-top:0px;margin-bottom:10px;padding:10px' class='col-md-12'>
 +
                                        <tr>
 +
                                            <th>W<sub>j</sub></th>
 +
                                            <td>6</td><td>7</td><td>8</td><td>9</td><td>10</td><td>11</td><td>12</td><td>13</td><td>14</td><td>15</td>
 +
                                        </tr>
 +
                                        <tr>
 +
                                            <th>f(W<sub>j</sub>)</th>
 +
                                            <td>0.05</td><td>0.05</td><td>0.10</td><td>0.15</td><td>0.15</td><td>0.15</td><td>0.15</td><td>0.10</td><td>0.05</td><td>0.05</td>
 +
                                        </tr>
 +
                                    </table>
 
                                     </p>
 
                                     </p>
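                                     <p>The null distribution tabulated above can be reproduced by brute-force enumeration. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, measurements are assumed to contain no ties, and the sample values in the usage lines are made up.</p>

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def rank_sum(x, y):
    """Sum of the ranks of the x values in the pooled sample (no ties assumed)."""
    pooled = sorted(list(x) + list(y))
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(rank[value] for value in x)


def null_distribution(n):
    """Exact distribution of the rank sum when both groups have n values.

    Under the null hypothesis every choice of n ranks out of 2n is equally
    likely for the first group, so we enumerate all of them.
    """
    counts = Counter(sum(ranks)
                     for ranks in combinations(range(1, 2 * n + 1), n))
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in sorted(counts.items())}


# Made-up signals: every experimental value exceeds every control value.
w_j = rank_sum([4.1, 5.2, 6.3], [1.0, 2.0, 3.0])
dist = null_distribution(3)
```

                                     <p>For n = 3 the enumeration covers the 20 equally likely rank assignments, which is why every probability in the table is a multiple of 0.05.</p>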
                                    <div id="SSPD-filter1" class='col-md-12' style='margin:0; padding:0'>
 
                                    <h5><em>a) PAM-proximal 12bp filtration by BLAST</em></h5>
 
                                    <p>Previous research has demonstrated that SpCas9 tolerates mismatches to a greater extent in the PAM-distal region than in the PAM-proximal region, and that the PAM-proximal 8-12 nt of the target largely determine its specificity. A guide sequence whose target shares an identical PAM-proximal 12 nt with an off-target site is highly likely to hit that off-target, so it is eliminated. The remaining guides are submitted to BLAST again to find potential off-target sites: target-similar sequences found by BLAST are recorded as off-target sites for the second filtration, and sequences not reported are considered not to contribute to off-target binding.
 
                                    </p>
 
                                        <div id="Modeling_Table2">
 
                                            <p class='col-md-1'></p>
 
                                            <table border='1' style='margin:10px;padding:10px' class='col-md-6'>
 
                                                <tr>
 
                                                    <td>Left gRNA candidates (CCNN20)</td>
 
                                                    <td>55156</td>
 
                                                </tr>
 
                                                <tr>
 
                                                    <td>Right gRNA candidates (N20NGG)</td>
 
                                                    <td>54938</td>
 
                                                </tr>
 
                                            </table>
 
                                            <p class='col-md-12'>Table 2. The number of gRNA candidates left after the first filtration in the Mycobacterium tuberculosis genome</p>
 
                                        </div>
 
                                    </div>
 
                                    <div id="SSPD-filter2" class='col-md-12' style='margin:0; padding:0'>
 
                                        <h5><em>b) Score filtration </em></h5>
 
                                        <p>Our initial filtration significantly reduces computation by limiting guide sequence scoring to the most likely off-target sites rather than the entire subject. We then use the Zhang Lab scoring method to measure the off-target effects of the guide sequences. For each guide candidate, the scoring process has two steps: first, each individual off-target site is assigned an “individual score”; then all individual scores are aggregated into an overall score for the given guide sequence. The algorithm used to score a single off-target site is given below.
 
                                        </p>
 
                                        <div align='center'>
 
                                        <img class='huge-formula-line' alt="Peking-Modeling-single_score.gif" src="https://static.igem.org/mediawiki/2015/e/e6/Peking-Modeling-single_score.gif">
 
                                        </div>
 
                                        <div align='center'>
 
                                        <img class='formula-inline' alt="Peking-Modeling-single_site_mismatch_list.gif" src="https://static.igem.org/mediawiki/2015/7/7e/Peking-Modeling-single_site_mismatch_list.gif">
 
                                        </div>
 
                                        <p>
 
                                        In the first term, <i>e</i> runs over the set of mismatch positions <img class='small-formula-inline' alt="Peking-Modeling-set_M.gif" src="https://static.igem.org/mediawiki/2015/6/6d/Peking-Modeling-set_M.gif"> between the guide and the off-target site, with <img class='small-formula-inline' alt="Peking-Analysis-W.gif" src="https://static.igem.org/mediawiki/2015/6/6e/Peking-Analysis-W.gif"> representing the experimentally determined effect of mismatch position on targeting. Term two accounts for the effect of the mean pairwise distance between mismatches (<img class='formula-inline' alt="Peking-Modeling-d_mean.gif" src="https://static.igem.org/mediawiki/2015/3/32/Peking-Modeling-d_mean.gif">). Term three represents a dampening penalty for highly mismatched off-targets, where <img class='formula-inline' alt="Peking-Modeling-n_m.gif" src="https://static.igem.org/mediawiki/2015/e/ea/Peking-Modeling-n_m.gif"> refers to the number of mismatches.
 
                                        </p>
 
                                        <div id="Modeling_Fig2">
 
                                            <img alt="Peking_Modeling_Figure2.png" src="https://static.igem.org/mediawiki/2015/4/48/Peking_Modeling_Figure2.png">
 
                                            <p>Figure 2. Schematic illustration of an off-target site. The 20 bps are numbered sequentially from the one most distant to PAM, and the red 7, 12, 19 represents the mismatches in the off-target sites.</p>
 
                                        </div>
 
                                        <p>
 
                                        Here we show an example (Figure 2) to help readers understand this formula better. Nucleotides are numbered sequentially from the one most distant from the PAM, and mismatches in the off-target site are highlighted in red. In this case, W(7)= 0.317, W(12)=0.508, W(19)=0.583; d<sub>mean</sub>=(d<sub>(7,12)</sub>+d<sub>(7,19)</sub>+d<sub>(12,19)</sub>)/3=7; n<sub>m</sub>=3. Therefore, the off-target site shown above has an individual score of 0.004415278985074628. A higher individual score for an off-target site indicates a higher similarity to the target, and thus a higher likelihood of the CRISPR/Cas9 complex binding to the off-target site. Once the individual off-target sites have been scored, each guide is assigned an overall score:
 
                                        </p>
 
                                        <div>
 
                                        <div align='center'>
 
                                        <img class='big-formula-line' alt="Modeling_Fm3" src="https://static.igem.org/mediawiki/2015/d/d6/Peking-Modeling-All_score.gif">
 
                                        </div>
 
                                        <p> <i>h</i> runs over the potential off-target set <img class='small-formula-inline' alt="Peking-Modeling-set_T.gif" src="https://static.igem.org/mediawiki/2015/a/a0/Peking-Modeling-set_T.gif"> of guide sequence </p>
 
                                        </div>
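                                        <p>The individual score quoted for the Figure 2 example can be checked numerically. The sketch below assumes the three-term form commonly published for the Zhang Lab scoring scheme (product of (1 - W) over mismatches, a distance term 1/((19 - d<sub>mean</sub>)/19 × 4 + 1), and 1/n<sub>m</sub><sup>2</sup>); the function name is hypothetical, the weights, d<sub>mean</sub> and n<sub>m</sub> are passed in exactly as quoted in the text, and treating the distance term as 1 when d<sub>mean</sub> is not supplied is an assumption.</p>

```python
def individual_score(mismatch_weights, d_mean=None, n_mismatch=None):
    """Score one off-target site from its mismatch weights (hypothetical helper).

    mismatch_weights: W(e) for each mismatch position e.
    d_mean: mean pairwise distance between mismatches; if None, the
            distance term is taken as 1 (an assumption of this sketch).
    n_mismatch: number of mismatches; defaults to len(mismatch_weights).
    """
    if n_mismatch is None:
        n_mismatch = len(mismatch_weights)
    if n_mismatch == 0:
        return 1.0  # identical site; assumed convention

    # Term one: position-dependent mismatch effect.
    position_term = 1.0
    for weight in mismatch_weights:
        position_term *= 1.0 - weight

    # Term two: effect of the mean pairwise distance between mismatches.
    if d_mean is None:
        distance_term = 1.0
    else:
        distance_term = 1.0 / ((19.0 - d_mean) / 19.0 * 4.0 + 1.0)

    # Term three: dampening penalty for highly mismatched sites.
    count_term = 1.0 / n_mismatch ** 2

    return position_term * distance_term * count_term


# Values quoted in the text for the Figure 2 example.
score = individual_score([0.317, 0.508, 0.583], d_mean=7, n_mismatch=3)
```

                                        <p>With these inputs the result agrees with the value of about 0.0044 given in the example.</p>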
 
                                        <p>A higher overall score indicates a better guide with few or weak potential off-targets. Guides with any individual off-target site scoring 2% or above are eliminated; the remaining guides are ranked by overall score from 100% to 0%, each with a list of off-targets in descending order of individual score. Guides with an overall score greater than 45% are finally reserved, while the others are filtered out for lack of specificity.
 
                                        </p>
 
                                        <div id="Modeling_Table3">
 
                                            <p class='col-md-1'></p>
 
                                            <table border='1' style='margin:10px;padding:10px' class='col-md-6'>
 
                                                <tr>
 
                                                    <td>Left gRNA candidates (CCNN20)</td>
 
                                                    <td>54417</td>
 
                                                </tr>
 
                                                <tr>
 
                                                    <td>Right gRNA candidates (N20NGG)</td>
 
                                                    <td>54288</td>
 
                                                </tr>
 
                                            </table>
 
                                            <p class='col-md-12'>Table 3. The number of gRNA candidates left after the second filtration in the Mycobacterium tuberculosis genome</p>
 
                                        </div>
 
                                        <div id="Modeling_Fig3">
 
                                        <img class='col-md-6' alt="Peking_Modeling_CCN_score.png" src="https://static.igem.org/mediawiki/2015/7/72/Peking_Modeling_CCN_score.png">
 
                                        <img class='col-md-6' alt="Peking_Modeling_NGG_score.png" src="https://static.igem.org/mediawiki/2015/3/37/Peking_Modeling_NGG_score.png">
 
                                        <p class='col-md-1'></p>
 
                                        <img class='col-md-9' alt="Peking_Modeling_score_filtration.png" src="https://static.igem.org/mediawiki/2015/5/5b/Peking_Modeling_score_filtration.png">
 
                                        <p class='col-md-12'>Figure 3. Analysis of the guide sequence scores after score filtration. (a) Overall score distribution of guide sequences after the second filtration step. Most sequences have scores above 97. (b) The ratio of reserved guides (i.e. the number of guide sequences reserved after score filtration / the number of guide sequences after the first filtration) is approximately 1 for higher-scoring gRNAs and 0 for lower-scoring gRNAs.</p>
 
                                        </div>
 
                                    </div>
 
 
                                 </div>
 
                                 </div>
                                 <div id="SSPD-P" class='col-md-12' style='margin:0; padding:0'>
+
                                 <div>
                                    <h4><em>Pair left and right target sites with optimal spacer length</em></h4>
+
                                    <p>Because the sample size is small, the smallest attainable significance level is 0.05, which means that only <img alt='Peking-Analysis-Wj%3D15.gif' src="https://static.igem.org/mediawiki/2015/e/e0/Peking-Analysis-Wj%3D15.gif" class='formula-inline'> leads to a rejection of the null hypothesis. In other words, the two sets of data are significantly different only when the minimum value of <img alt='Peking-Analysis-X_j.gif' src="https://static.igem.org/mediawiki/2015/6/60/Peking-Analysis-X_j.gif" class='formula-inline'> is greater than the maximum value of <img alt='Peking-Analysis-Y_j.gif' src="https://static.igem.org/mediawiki/2015/7/79/Peking-Analysis-Y_j.gif" class='formula-inline'>. The Wilcoxon Rank Sum Test may therefore struggle in a single-block test when the experimental and control groups differ only slightly. However, by using the Block Design, we can integrate data from m blocks, similar to the idea of ANOVA. We calculated the sum of <img alt='Peking-Analysis-W_j.gif' src="https://static.igem.org/mediawiki/2015/1/15/Peking-Analysis-W_j.gif" class='formula-inline'> as the statistic.
                                    <p>All left and right target sites reserved after the two-step filtration are considered qualified for pairing. In this step, a left site and a right site separated by an appropriate spacer length are paired. According to our experimental data, the best spacer length for the split-luciferase dCas9 fusion system is 19-23 bp (Figure 4, Link to CRISPR). Single sites that cannot pair with any other site within the given spacer length range are eliminated.
+
                                    <div align='center'>
                                    </p>
+
                                    <img class='formula-line' alt="Peking-Analysis-W_sumWj.gif" src="https://static.igem.org/mediawiki/2015/2/29/Peking-Analysis-W_sumWj.gif" style='margin-top:0px;padding:0px'>
                                    <div id="Modeling_Fig4">
+
                                        <p class='col-md-1'></p>
+
                                        <img class='col-md-9' alt="Modeling_Fig4" src=" https://static.igem.org/mediawiki/2015/0/0b/Peking-Modeling-Figure.png">
+
                                        <p class='col-md-12'>Figure 4. The effect of spacer length variation on the performance of PC reporter system. Spacer is defined as the sequence between the sgRNA pairs. The spacer length varies from 5 bp to 107 bp. </p>
+
                                        <div id="Modeling_Table4">
+
                                            <p class='col-md-1'></p>
+
                                            <table border='1' style='margin:10px;padding:10px' class='col-md-6'>
+
                                                <tr>
+
                                                    <td>Spacer Length</td>
+
                                                    <td>19</td><td>20</td><td>21</td>
+
                                                    <td>22</td><td>23</td><td>19-23</td>
+
                                                </tr>
+
                                                <tr>
+
                                                    <td>Marker Number</td>
+
                                                    <td>690</td><td>836</td><td>746</td>
+
                                                    <td>720</td><td>759</td><td>3751</td>
+
                                                </tr>
+
                                            </table>
+
                                            <p class='col-md-12'>The number of gRNA pairs under the condition of different spacer length between left and right pair from 19 to 23 bps in Mycobacterium Tuberculosis genome</p>
+
                                        </div>
+
 
                                     </div>
 
                                     </div>
                                </div>
+
                                    The Wilcoxon Rank Sum statistics <img alt='Peking-Analysis-W_j.gif' src="https://static.igem.org/mediawiki/2015/1/15/Peking-Analysis-W_j.gif" class='formula-inline'> from the m blocks are independent and identically distributed (i.i.d.). According to the central limit theorem (CLT), as m approaches infinity, the random variable <img class='big-formula-inline' alt="Peking-Analysis-W_BD_statistics.gif" src="https://static.igem.org/mediawiki/2015/6/68/Peking-Analysis-W_BD_statistics.gif"> converges in distribution to the standard normal distribution <i>N</i>(0,1):
                                <div id="SSPD-D" class='col-md-12' style='margin:0; padding:0'>
+
                                     <div align="center">
                                    <h4><em>Design PCR fragments</em></h4>
+
                                    <img class='big-formula-line' alt="Peking-Analysis-W_BD_statistics_CLT.gif" src="https://static.igem.org/mediawiki/2015/7/73/Peking-Analysis-W_BD_statistics_CLT.gif" style='margin:10px'>
                                    <p>We provide two methods for determining PCR fragments. For the first, fix pair number k per fragment, search the adjacent but non-overlapped k pairs. Sort the results with fragment length. For another, fix maximal PCR fragment length, sort the results with pair numbers per fragment. Sorted results will be presented on user interface, enabling users to select fragments by themselves as needed. Selecting overlapping fragment is not allowable for array design.
+
                                    </p>
+
                                    <div id="Peking_Modeling_Figure5.png">
+
                                        <p class='col-md-2'></p>
+
                                        <img class='col-md-8' alt="Peking_Modeling_Figure5.png" src="https://static.igem.org/mediawiki/2015/f/f9/Peking_Modeling_Figure5.png">
+
                                        <p class='col-md-12'>Figure 5. Schematic illustration of PCR fragment determination method, taking 2 pairs per fragment as an example. The top xxx shows all left and right targets on given segment of pathogen genome, and the chart aside lists all pairs. Only adjacent but non-overlapped pairs can be deposited on one fragment. Users can choose one through the four optional PCR fragment listed below, since they are overlapped.</p>
+
                                     </div>
+
                                    <div id="Modeling_Fig6">
+
                                        <img alt="Modeling_Fig6" src="https://static.igem.org/mediawiki/2015/a/a6/Peking-CRISPR-Figure12.png" style='maxwidth:200px'>
+
                                        <p>Figure 6. 72 target sites (MTB-specific markers) out of 9 fragments on MTB genome, screened out using SSPD.)</p>
+
 
                                     </div>
 
                                     </div>
 +
                                    We therefore use the statistic <img class='big-formula-inline' alt="Peking-Analysis-W_BD_statistics.gif" src="https://static.igem.org/mediawiki/2015/6/68/Peking-Analysis-W_BD_statistics.gif"> and calculate the p-value <img class='formula-inline' alt="Peking-Analysis-W_BD_p_value.gif" src="https://static.igem.org/mediawiki/2015/d/da/Peking-Analysis-W_BD_p_value.gif">, where <img class='formula-inline' alt="Peking-Analysis-Phi%28x%29.gif" src="https://static.igem.org/mediawiki/2015/1/19/Peking-Analysis-Phi%28x%29.gif"> is the distribution function of the standard normal distribution. If the p-value is less than 0.01, or equivalently <img class='formula-inline' alt="Peking-Analysis-W_BD_gt_2.33.gif" src="https://static.igem.org/mediawiki/2015/2/20/Peking-Analysis-W_BD_gt_2.33.gif">, we accept the alternative hypothesis that the two treatments, i.e. target and mismatch DNA, differ with high statistical significance.</p>
 
                                 </div>
 
                                 </div>
 
                             </div>
 
                             </div>
                             <!--End of Modeling-SSPD-->
+
                             <!--End of Analysis-Model-->
                             <!--Start Oligo Generator-->
+
                             <!--Start Result-->
                             <div id="Modeling-OligoGenerator" class="col-md-12" style="padding:0">
+
                             <div id="Analysis-Result" class="col-md-12" style="padding:0">
                                 <h3 class="classic-title" style='margin-top:50px'><span>Oligo Generator</span></h3>
+
                                 <h3 class="classic-title" style="margin-top:50px"><span>Result</span></h3>
                                 <p>Using SSPD method mentioned above, we can easily find the reliable target sites on genome. However, designing multiple target sites into oligonucleotides sequences for following sgRNA construction manually can be laborious. Thus here we developed a supplementary program to facilitate oligo sequence generation, which is combined with our sgRNA generator (Link to Part xxx). Specifically, we used Golden Gate Cloning to make it more convenient to substitute guide sequences for different target sites. Detailed operation is explained on flow chart below. See more details in <a href=""><b>Supplementary Information 2</b></a>.
+
                                 <div>
                                 </p>
+
                                 <img class='col-md-12' alt="Peking-CRISPR-Figure13.png" src="https://static.igem.org/mediawiki/2015/4/4e/Peking-CRISPR-Figure13.png">
                                <div id="Modeling_Fig7">
+
                                <p> <b>Figure 1</b>. Results of high-throughput assay for MTB and control strain. F denotes fragments obtained from MTB genome (a) or control strain (b); P denotes markers from each fragment.</p>
                                    <p class='col-md-1'></p>
+
                                    <img class='col-md-10' alt="Modeling_Fig7" src="https://static.igem.org/mediawiki/2015/b/bc/Peking_Modeling_Figure7.png">
+
                                    <p class='col-md-12'>Figure 7. Schematic illustration of a flow chart explaining the protocol of guide sequence substitution.</p>
+
 
                                 </div>
 
                                 </div>
                             </div><!-- End Oligo Generator-->
+
                                <div>
                             <div id="Modeling-Ref" class="col-md-12" style="padding:0">
+
                                <p>In our experiment, <img alt='Peking-Analysis-n%3D3.gif' src="https://static.igem.org/mediawiki/2015/4/48/Peking-Analysis-n%3D3.gif" class='small-formula-inline'>, so
 +
                                <img alt='Peking-Analysis-E%28W_j%29.gif' src="https://static.igem.org/mediawiki/2015/a/a5/Peking-Analysis-E%28W_j%29.gif" class='formula-inline'>
 +
                                <img alt='Peking-Analysis-Var%28W_j%29.gif' src="https://static.igem.org/mediawiki/2015/a/ae/Peking-Analysis-Var%28W_j%29.gif" class='formula-inline'>
 +
                                <div align='center'>
 +
                                <img class='big-formula-line' alt="Peking-Analysis-W_BD_statistics_p_value.gif" src="https://static.igem.org/mediawiki/2015/1/1f/Peking-Analysis-W_BD_statistics_p_value.gif"></div>
 +
                                so the target and mismatch DNA differ highly significantly in signal.</p>
 +
                                </div>
 +
                             </div><!-- End Result-->
 +
                             <div id="Analysis-Ref" class="col-md-12" style="padding:0">
 
                                 <h3 class="classic-title"><span>Reference</span></h3>
 
                                 <h3 class="classic-title"><span>Reference</span></h3>
                                <p>
 
                                [1] Hsu P D, Scott D A, Weinstein J A, et al. DNA targeting specificity of RNA-guided Cas9 nucleases[J]. Nature biotechnology, 2013, 31(9): 827-832.<br>
 
                                [2] http://crispr.mit.edu/about <br>
 
                                [3] Naito Y, Hino K, Bono H, et al. CRISPRdirect: software for designing CRISPR/Cas guide RNA with reduced off-target sites[J]. Bioinformatics, 2014: btu743.<br>
 
                                </p>
 
 
                             </div><!-- End Reference-->
 
                             </div><!-- End Reference-->
 
                             </div>
 
                             </div>

Revision as of 18:25, 18 September 2015

Modeling

Reliability

Background

To increase the accuracy and specificity of detection, we built an assay on our Paired dCas9 Reporter (PC Reporter) System that draws more sequence information from the target genome and so gives a more reliable result. We designed m pairs of gRNA-specific target sites as m markers in the MTB genome. To check whether this idea actually works, we tested both the target gene and a mismatched gene. In the experimental group, the gRNAs were used to detect the target gene; in the control group, the gRNAs were used to detect the mismatched gene. To reduce random error, both the experimental and the control groups were repeated n times. The results are optical power signals generated by our Paired dCas9 Reporter System. Comparing the signal intensities for the target gene and the mismatched gene then shows the difference directly.

Assumptions

  • There is no recognition site for gRNA in the mismatched gene.
  • The measured values Peking-Analysis-X_iY_i.gif of the control and experimental groups are independent of each other.
  • The measured values from the n repeated tests form the sample sets Peking-Analysis-X_iY_i.gif, respectively, and both sample sets are small.

Model

Wilcoxon Rank Sum Test of Block Design

In view of the unknown distributions and different variances of the signals from our Paired dCas9 Reporter System, we chose a non-parametric method, the Wilcoxon Rank Sum Test of Block Design, which works on the ranks of the data, instead of ANOVA.

In the Block Design, we regard the detection of the two treatments (target and mismatch DNA) by the same gRNA as one block. To test the difference between the two treatments, we test the null hypothesis that the two treatments do not differ. The Wilcoxon Rank Sum statistic Peking-Analysis-W_j.gif of each block is calculated first by

Peking-Analysis-Wj%3DsumR_i Peking-Analysis-W_j_range
where Peking-Analysis-R_i.gif indicates the rank of Peking-Analysis-X_j.gif within the pooled sample of both Peking-Analysis-X_j.gif and Peking-Analysis-Y_j.gif. Note that the Wilcoxon Rank Sum statistic Peking-Analysis-W_j.gif is distribution-free: its null distribution is known as long as the sample sizes are known.
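
The per-block rank sum described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiment: the function name `rank_sum` and the sample values are hypothetical, and the sketch assumes there are no tied measurements.

```python
def rank_sum(x, y):
    """Sum of the ranks of the x values within the pooled sample x + y.

    Assumes no tied values; ranks start at 1 for the smallest value.
    """
    pooled = sorted(x + y)
    rank = {value: position + 1 for position, value in enumerate(pooled)}
    return sum(rank[value] for value in x)


# Hypothetical signals for one block (n = 3 repeats per treatment).
x = [5.1, 6.3, 7.2]  # experimental group (target DNA)
y = [4.0, 5.5, 6.0]  # control group (mismatch DNA)
print(rank_sum(x, y))  # ranks of x are 2, 5 and 6, so this prints 13
```

With n = 3 in both groups the result always lies between 6 (all x values smallest) and 15 (all x values largest).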

For example, if Peking-Analysis-n%3D3.gif, Peking-Analysis-x_sample.gif, Peking-Analysis-y_sample.gif,
so Peking-Analysis-xy_sample.gif, which implies that Peking-Analysis-R_sample.gif.
Under the null hypothesis, enumerating all possible orderings of the two sample sets gives the distribution of the statistic Peking-Analysis-W_j.gif shown below:

Wj      6     7     8     9     10    11    12    13    14    15
f(Wj)   0.05  0.05  0.10  0.15  0.15  0.15  0.15  0.10  0.05  0.05
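
The table can be reproduced by brute-force enumeration. The sketch below assumes equal group sizes of 3 and no ties; the helper name `null_distribution` is ours. Under the null hypothesis every choice of 3 ranks out of 6 for the experimental group is equally likely, so there are 20 equally probable assignments.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def null_distribution(n_x, n_y):
    """Exact null distribution of the rank sum of the first group.

    Every subset of n_x ranks out of 1..(n_x + n_y) is equally likely.
    """
    total = n_x + n_y
    counts = Counter(sum(ranks) for ranks in combinations(range(1, total + 1), n_x))
    n_assignments = sum(counts.values())
    return {w: Fraction(k, n_assignments) for w, k in sorted(counts.items())}


dist = null_distribution(3, 3)
for w, prob in dist.items():
    print(w, float(prob))
```

Only one of the 20 assignments gives a rank sum of 15, which is why 0.05 is the smallest probability a single block can reach.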

Because the sample size is small, the smallest attainable significance level is 0.05: the null hypothesis is rejected only if Peking-Analysis-Wj%3D15.gif, in other words only when the minimum value of Peking-Analysis-X_j.gif is greater than the maximum value of Peking-Analysis-Y_j.gif. The Wilcoxon Rank Sum Test therefore has little power within a single block when the experimental and control groups differ only slightly. However, by using Block Design, we can integrate data from m blocks, similar to the idea of ANOVA. We calculate the sum of Peking-Analysis-W_j.gif as the test statistic.

Peking-Analysis-W_sumWj.gif
The Wilcoxon Rank Sum statistics Peking-Analysis-W_j.gif from the m blocks are independent and identically distributed (i.i.d.). According to the central limit theorem (CLT), as m approaches infinity, the random variable Peking-Analysis-W_BD_statistics.gif converges in distribution to the standard normal distribution N(0,1):
Peking-Analysis-W_BD_statistics_CLT.gif
We therefore use the statistic Peking-Analysis-W_BD_statistics.gif and calculate the p-value Peking-Analysis-W_BD_p_value.gif, where Peking-Analysis-Phi%28x%29.gif is the distribution function of the standard normal distribution. If the p-value is less than 0.01, or equivalently Peking-Analysis-W_BD_gt_2.33.gif, we accept the alternative hypothesis that the two treatments, i.e. target and mismatch DNA, differ with high statistical significance.
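
A minimal sketch of the block-design test follows. Because the formula images are not reproduced here, the standardization is an assumption based on the CLT statement: the sum of the W_j is centred by m·E(W_j) and scaled by the square root of m·Var(W_j), using the standard rank-sum moments for two groups of equal size n, E(W_j) = n(2n+1)/2 and Var(W_j) = n²(2n+1)/12. The function name and the W_j values are hypothetical.

```python
import math


def block_design_test(w_values, n):
    """Standardized sum of per-block rank sums and its one-sided p-value.

    Assumes equal group size n in every block and the standard
    rank-sum moments under the null hypothesis.
    """
    m = len(w_values)
    mean_w = n * (2 * n + 1) / 2      # E(W_j) under the null hypothesis
    var_w = n * n * (2 * n + 1) / 12  # Var(W_j) under the null hypothesis
    z = (sum(w_values) - m * mean_w) / math.sqrt(m * var_w)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # equals 1 - Phi(z)
    return z, p_value


# Hypothetical rank sums from m = 8 blocks with n = 3 repeats each.
w_values = [15, 14, 15, 13, 15, 15, 14, 15]
z, p_value = block_design_test(w_values, n=3)
print(z, p_value)
```

For these made-up values the statistic is well above 2.33, so the p-value falls below 0.01 even though no single block could do better than 0.05.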

Result

Peking-CRISPR-Figure13.png

Figure 1. Results of high-throughput assay for MTB and control strain. F denotes fragments obtained from MTB genome (a) or control strain (b); P denotes markers from each fragment.

In our experiment, Peking-Analysis-n%3D3.gif, so Peking-Analysis-E%28W_j%29.gif Peking-Analysis-Var%28W_j%29.gif

Peking-Analysis-W_BD_statistics_p_value.gif
so the target and mismatch DNA differ highly significantly in signal.
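
For reference, the moments for n = 3 can be worked out from the standard rank-sum formulas for two groups of equal size n. The formula images are not reproduced here, so the form below is an assumption consistent with the null distribution table, and the symbol W_{BD} is taken from the image file name:

```latex
\begin{aligned}
E(W_j) &= \frac{n(2n+1)}{2} = \frac{3 \cdot 7}{2} = 10.5, \\
\operatorname{Var}(W_j) &= \frac{n^{2}(2n+1)}{12} = \frac{9 \cdot 7}{12} = 5.25, \\
W_{BD} &= \frac{\sum_{j=1}^{m} W_j - 10.5\,m}{\sqrt{5.25\,m}} .
\end{aligned}
```

Both values agree with the tabulated distribution, which is symmetric about 10.5 and has variance 5.25.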

Reference