Difference between revisions of "Team:UESTC Software/Software.html"

Line 61: Line 61:
 
</header>
 
</header>
 
<div class="modeling_div">
 
<div class="modeling_div">
<h1>Modeling &amp; Validation</h1>
 
<p>(Tools: PHP &amp; MATLAB)</p>
 
<h2>Modeling</h2>
 
 
<section>
 
<section>
<h3>Overview</h3>
+
            <h2>Work Flow</h2>
<p>Modeling is one of the most important parts of our project. The data we used come mainly from the CEG (Cluster of Essential Genes)<sup>1</sup> database. In <b>CEG</b>, genes are categorized by their corresponding <b>KEGG</b> (Kyoto Encyclopedia of Genes and Genomes) Orthology (KO) and <b>COG</b><sup>2</sup> (Clusters of Orthologous Groups) categories and function descriptions.</p>
+
            <p>This section briefly introduces the workflow of our software, which has four parts: selecting reference species, minimal gene set display, gene network display, and analysis of assembly.</p>
<p>The <b>ubiquity-retaining strategy</b><sup>3</sup> was first proposed by Koonin E. V.; we use a half-retaining strategy to cover its shortcomings. In practice we use a ratio of 42% rather than exactly one half. Our modeling finds this ratio and justifies it logically.</p>
+
<h3>1) Selecting reference species</h3>
<p>The modeling relies on statistics: we look for the relation between the size of the minimal gene set, the number of organisms, and the HNR (Half Number Ratio). The HNR is the ratio value that replaces the “half” in our strategy. We then build a 3D model of this relation.</p>
+
<p>This is the first part of MCCAP, and all the work of constructing a minimal cell starts here. You can choose the species you want by dragging them to the right side or double-clicking them. If you have trouble finding a certain organism, use the sort button in the lower-left corner of the window.</p>
<p>Besides, to get the <b>ideal value</b> of the HNR, we use <b>multivariable linear regression</b> to fit a polynomial relating the OE rate to the HNR. From the polynomial we find the appropriate <b>HNR</b> for our strategy, which has the highest OE rate and many other advantages over other HNR values. Finally, we test our software’s <b>calculating time</b>, measured from the moment we input the organisms until we get the output result.</p>
+
<img src="" alt="" />
<p>We divide our modeling into 3 steps.</p>
+
<h3>2) Minimal gene set display</h3>
 +
<p>Click the arrow button and three bubbles will appear on the screen. The result is the filtered minimal gene set.</p>
 +
<img src="" alt="">
 +
<p>The accession number and function of each gene cluster are displayed directly; you can click any of them to see the details.</p>
 +
<img src="" alt="">
 +
<p>You may also want to compare two minimal gene sets to see how much they differ. To do so, save the selected species and then view the results in the history.</p>
 +
<img src="" alt="">
 +
<h3>3) Gene network display</h3>
 +
<p>Click the pathway or module bubble to see the details of the gene network. All genes that belong to the minimal gene set are marked in red.</p>
 +
<img src="" alt="">
 +
<h3>4) Analysis of assembly</h3>
 +
<p>The detailed information of a gene cluster is shown in the following figure. It contains the cluster information, the corresponding genes, and the BioBrick content.</p>
 +
<img src="" alt="">
 +
<p>Here you can see how many genes each gene cluster contains, along with detailed information on each gene. You can also see all of the BioBricks by clicking the “view all” button.</p>
 +
<img src="" alt="">
 
</section>
 
</section>
 
<section>
 
<section>
<h3>Step 1</h3>
+
<h2><i>Software Architecture</i></h2>
<p>Firstly, we randomly choose 4 organisms from the database (29 organisms in total) as experiment no.1, then choose 5 organisms as experiment no.2, and so on until experiment no.25 uses 28 organisms. (We repeat the random choice 10 times for each experiment and take the <b>average of every measurement</b>.) We observed the trend in the size of the minimal gene set across experiments no.1~25 and noticed a decreasing trend under the <b>ubiquity-retaining strategy</b> (Koonin) but a stabilizing trend under the <b>half-retaining strategy</b>. The line chart is as follows (Fig 1.1):</p>
+
<p>MCCAP is a typical C/S (client/server) application composed of a server, a client, and a database. The three communicate using JSON-formatted data.</p>
<img src="/wiki/images/6/6c/Team-UESTC_Software_figure1.1.jpg">
+
<h3>1) Server</h3>
<p>As the number of organisms increases, the half-retaining strategy clearly gives a steadier and more significant result than the ubiquity-retaining strategy.</p>
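<p>The sampling experiment described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the team's PHP or MATLAB code: each organism is assumed to be a set of gene-cluster IDs, "half-retaining" is assumed to mean "keep a cluster if at least the given ratio of the chosen organisms carry it", and the toy database and all function names are hypothetical.</p>

```python
import random


def ubiquity_retaining(genomes):
    """Keep only the gene clusters present in every selected genome."""
    return set.intersection(*genomes)


def ratio_retaining(genomes, ratio=0.5):
    """Keep clusters present in at least `ratio` of the selected genomes (assumed rule)."""
    counts = {}
    for clusters in genomes:
        for cluster in clusters:
            counts[cluster] = counts.get(cluster, 0) + 1
    threshold = ratio * len(genomes)
    return {c for c, n in counts.items() if n >= threshold}


def average_size(database, k, strategy, repeats=10, seed=0):
    """Average gene-set size over `repeats` random draws of k organisms."""
    rng = random.Random(seed)
    names = sorted(database)
    total = 0
    for _ in range(repeats):
        chosen = rng.sample(names, k)
        total += len(strategy([database[name] for name in chosen]))
    return total / repeats


def make_toy_database(n_organisms=29, n_clusters=400, seed=1):
    """Hypothetical stand-in for CEG: every cluster gets its own presence probability."""
    rng = random.Random(seed)
    presence = [rng.random() for _ in range(n_clusters)]
    return {
        f"org{i:02d}": {c for c in range(n_clusters) if rng.random() < presence[c]}
        for i in range(n_organisms)
    }


if __name__ == "__main__":
    db = make_toy_database()
    # Experiments no.1~25 use 4 to 28 organisms.
    for k in range(4, 29):
        print(k,
              average_size(db, k, ubiquity_retaining),
              average_size(db, k, ratio_retaining))
```

<p>Because both strategies are evaluated on the same random draws (same seed), the two columns can be compared directly for each number of organisms.</p>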
+
<p>The server is the core of MCCAP; all function units and APIs are implemented on the server. The MCCAP server is written in PHP (http://www.php.net/). The code is organized by category, which makes it convenient for developers to reuse our function units or code. For example, if you want to deploy our server locally, you only need to alter the files in the config folder.</p>
<p>Secondly, we look for the relationship between the <b>HNR</b> (Half Number Ratio) and the <b>size of the minimal gene set</b>. We divide 100% into <b>100 groups</b> (1%~100%), which gives 100 groups and 25 experiments. For every group we compute the size of the minimal gene set in experiments no.1~25, and obtain line charts like Fig 1.2 and Fig 1.3 (because there is too much data, Fig 1.2 lists only the results for 1/10, 2/10, ..., 10/10):</p>
+
<h3>2) Client</h3>
<img src="/wiki/images/5/5f/Team-UESTC_Software_figure1.2.jpg">
+
<p>The MCCAP client is written in C++ and QML with the Qt library (http://www.qt.io/). It is responsible for all user interaction.</p>
<img src="/wiki/images/2/21/Team-UESTC_Software_figure1.3.jpg">
+
<h3>3) Database</h3>
<p>Our ideal minimal-gene-set size is about 150~300. The conclusion is that HNRs in the interval <b>[30/100~51/100]</b> give minimal-gene-set sizes closest to this ideal range.</p>
+
<p>MCCAP uses MySQL (http://www.mysql.com/) as its database.</p>
<p>Finally, to understand the trend in more detail, we made a diagram illustrating the <b>relation</b> between the minimal gene set, the number of organisms, and the HNR.</p>
+
<img src="" alt="">
<p>We use the <b>least squares method</b> to get a <b>fitting polynomial</b> for every line in Fig 1.2 (100 polynomials in total). We choose a <b>third-order</b> polynomial, which has 4 coefficients. We then use the same method to get the <b>relation between each coefficient and the HNR</b>. The fitting polynomials for the 4 coefficients are as follows:</p>
+
<p>It is worth mentioning that we made a small change to the software architecture: on top of the C/S architecture, we borrowed some features of the B/S (browser/server) architecture. We built an embedded browser into the software so that it can display pages generated on the server side. The benefit is that frequently updated pages can be assigned to the server: a page needs to be updated only once on the server, and all users get the latest content immediately.</p>
<img src="/wiki/images/thumb/6/63/Team-UESTC_Software_gsabcd.jpg/800px-Team-UESTC_Software_gsabcd.jpg">
+
<p>The polynomial for the relation of the minimal gene set with number of organisms and HNR is as follows:</p>
+
<img src="/wiki/images/c/cd/Team-UESTC_Software_gsz.jpg">
+
<p>This polynomial relates the size of the minimal gene set to the number of organisms and the HNR. The diagram is as follows:</p>
+
<img src="/wiki/images/6/6f/Team-UESTC_Software_NOO.jpg">
+
<p>In the diagram above, we found a region where the gene number stays stable as the number of organisms changes. Using half number ratios from this region therefore gives a stable minimal gene set. To find the most stable HNR for the “half” retaining strategy, let’s move to step 2.</p>
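<p>The two-stage fit in this step (a cubic in the number of organisms for each HNR, then a polynomial in the HNR for each of the 4 coefficients) can be sketched in pure Python. This is an assumption-laden illustration rather than the team's MATLAB code: the function names are hypothetical, and the degree used for the coefficient fits is left as a parameter because the text does not state it.</p>

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns c such that y ~ c[0] + c[1]*x + ... + c[degree]*x**degree.
    Needs at least degree + 1 distinct x values.
    """
    n = degree + 1
    # Normal equations (V^T V) c = V^T y, with V the Vandermonde matrix.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate a polynomial (lowest order first) with Horner's method."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def fit_surface(sizes, n_values, hnr_values, coeff_degree):
    """sizes[i][j] is the gene-set size for hnr_values[i] and n_values[j]."""
    # Stage 1: one cubic in the number of organisms per HNR line.
    per_line = [polyfit(n_values, row, 3) for row in sizes]
    # Stage 2: each of the 4 cubic coefficients as a polynomial in the HNR.
    return [
        polyfit(hnr_values, [line[k] for line in per_line], coeff_degree)
        for k in range(4)
    ]


def predict(surface, n, hnr):
    """Gene-set size predicted for n organisms at the given HNR."""
    cubic = [polyval(c, hnr) for c in surface]
    return polyval(cubic, n)
```

<p>The normal equations become ill-conditioned for large inputs, so in practice it may help to rescale the number of organisms (for example, dividing by 29) before fitting.</p>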
+
</section>
+
<section>
+
<h3>Step 2</h3>
+
<p>In Step 1, we divided 100% into <b>100 groups</b> (1%~100%) and set up 25 experiments. Before running the experiments, we define a new variable that measures whether the strategy is stable and effective. We call it the <b>OE rate</b> (Overlapping Effective Rate). It is calculated with this formula:</p>
+
<img src="/wiki/images/d/dd/Team-UESTC_Software_gsOR.jpg">
+
<p><i>cUGS: the size of the universal gene set obtained in the current experiment</i></p>
+
<p><i>UGS29: the size of the universal gene set obtained in the 29-organism experiment</i></p>
+
<p>This formula reduces the influence of the number of genes. A higher OE rate means the strategy gives a more sufficient result.</p>
+
<p>Now we have 100 groups and 25 experiments. For every group we calculate the <b>average OE rate</b> over experiments no.1~25, and obtain the line chart in Fig 2.1:</p>
+
<img src="/wiki/images/f/f0/Team-UESTC_Software_HNR.jpg">
+
<p>Clearly, several groups give an ideal result, and some give unexpected results.</p>
+
<p>Next, we calculate the <b>variance</b> over experiments no.1~25 in every group. The line chart is as follows:</p>
+
<img src="/wiki/images/6/6a/Team-UESTC_Software_figure2.2.jpg">
+
<p>Comparing the variances in Fig 2.2, we can see that the groups between 26/100 and 61/100 have smaller variances than the others, which means the strategy is steadier in this interval. Fortunately, this nearly matches the best interval found in step 1.</p>
+
<p>Fortunately, we got good results. We use the <b>least squares method</b> to get the <b>fitting polynomial</b>. The fitted curve looks like this:</p>
+
<img src="/wiki/images/thumb/5/5c/UESTC_Software_figure2.3.jpg/800px-UESTC_Software_figure2.3.jpg">
+
<p>The polynomial is as follows:</p>
+
<img src="/wiki/images/5/5e/Team-UESTC_Software_gsy.jpg">
+
<p>After comparing polynomial orders, we choose the fourth-order polynomial. The <b>correlation coefficient</b> is 0.9703. According to the list, the polynomial can be accepted at the 97% level.</p>
+
<p>Next we take the <b>maximum</b> point of the polynomial as our ideal HNR (42%). That is how we define the ratio value of the retaining strategy.</p>
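<p>Locating the maximum of a fitted fourth-order polynomial over the 100 HNR groups can be sketched as below. The coefficients are a toy example chosen so that the peak sits at 42%; they are not the team's fitted polynomial (which is only given as an image), and the function names are hypothetical.</p>

```python
def polyval(coeffs, x):
    """Evaluate a polynomial (lowest order first) with Horner's method."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def best_hnr(coeffs, percents=range(1, 101)):
    """Return the HNR group (in percent) where the fitted OE rate is highest."""
    return max(percents, key=lambda p: polyval(coeffs, p / 100))


# Toy coefficients for -(x - 0.42)**2, padded to fourth order (assumption).
toy_coeffs = [-0.1764, 0.84, -1.0, 0.0, 0.0]

if __name__ == "__main__":
    print(best_hnr(toy_coeffs))
```

<p>A grid search over the 100 groups is enough here because the HNR is only ever evaluated at whole percentages; solving for the roots of the derivative would be an alternative.</p>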
+
</section>
+
                <section>
+
                        <h3>Step 3</h3>
+
                        <p>In step 3, we measure MCCAP’s calculating time. We run experiments no.1~no.25 (4 to 28 organisms, as in steps 1 and 2) with our 42% ratio-retaining method and record the running time.</p>
+
                        <p>The bar chart is as follows:</p>
+
<img src="/wiki/images/b/b3/Team-UESTC_Software_times.png">
+
                        <p>The bar chart shows that the running time increases gradually with the number of organisms. Even when we use all of the organisms in the CEG database, it takes less than 0.5 s to get the results. MCCAP’s computational efficiency is therefore acceptable.</p>
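<p>A timing measurement of this kind can be sketched with Python's standard library. The helper name is hypothetical, and taking the best of several runs is our own choice to reduce noise; the text does not say how many runs the team timed.</p>

```python
import time


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload; in practice this would be the gene-set screening call.
    print(time_call(sorted, list(range(100000, 0, -1))))
```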
+
                </section>
+
<section>
+
<h2>Validation</h2>
+
<h3>Step 1</h3>
+
<p>We pick out each of the 29 organisms (no.1~no.29) <b>one by one</b>, giving 29 experiments. In each experiment we use the <b>remaining</b> 28 organisms to screen the minimal gene set again and compare the result with the complete (29-organism) one to get the <b>overlapRatio</b>.</p>
+
<img src="/wiki/images/a/a6/Team-UESTC_Software_table.jpg">
+
<p>The line chart is as follows:</p>
+
<img src="/wiki/images/3/38/Team-UESTC_Software_figure2.1.1.jpg">
+
<p>The <b>variance of the overlapRatio</b> is 0.000479</p>
+
<p>From the line chart and the variance, we can conclude that the overlapRatios of all the experiments are <b>more than 90%</b> and <b>vary only slightly</b>, which means our strategy is <b>accurate</b> and <b>stable</b>.</p>
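<p>The leave-one-out check in this step can be sketched as follows. Two points are assumptions, since the text does not define them: the overlapRatio is taken to be the share of the full (all-organism) gene set that is recovered without the left-out organism, and the variance is taken to be the population variance. The screening rule and all names are hypothetical.</p>

```python
from statistics import pvariance


def ratio_retaining(genomes, ratio=0.42):
    """Keep clusters present in at least `ratio` of the genomes (assumed rule)."""
    counts = {}
    for clusters in genomes:
        for cluster in clusters:
            counts[cluster] = counts.get(cluster, 0) + 1
    threshold = ratio * len(genomes)
    return {c for c, n in counts.items() if n >= threshold}


def leave_one_out_overlap(database, ratio=0.42):
    """Map each left-out organism to the overlap ratio of the reduced result."""
    full = ratio_retaining(list(database.values()), ratio)
    if not full:
        raise ValueError("the full gene set is empty; overlap ratio is undefined")
    ratios = {}
    for left_out in database:
        rest = [g for name, g in database.items() if name != left_out]
        reduced = ratio_retaining(rest, ratio)
        # Assumed definition: |full & reduced| / |full|.
        ratios[left_out] = len(full & reduced) / len(full)
    return ratios


def overlap_variance(ratios):
    """Population variance of the overlap ratios (assumption: not sample variance)."""
    return pvariance(ratios.values())
```

<p>With the real data, <code>database</code> would hold the 29 CEG organisms, and the resulting ratios and variance would be compared against the chart and the value reported above.</p>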
+
</section>
+
<section>
+
<h3>Step 2</h3>
+
<p>In step 1 we showed that our strategy is stable and accurate. Next, we compare our result with two other results (<b>Gil, Mushegian&amp;Koonin</b>), which obtained minimal gene sets by other methods, to show that our method is <b>reliable</b> and <b>significant</b>.</p>
+
<img src="/wiki/images/thumb/f/fb/Team-UESTC_Software_figure3.jpg/800px-Team-UESTC_Software_figure3.jpg">
+
<p>As the figure shows, our minimal gene set contains fewer genes than those of the other two groups, yet overlaps heavily with them: it shares 190 genes with Gil’s result and 156 genes with Mushegian&amp;Koonin’s result. Only 5 of our genes differ from both of the other two groups. This step shows that our result is reliable and stable.</p>
+
</section>
+
<section>
+
<h2>References</h2>
+
<p>1. Yuan-Nong Ye, Zhi-Gang Hua, Jian Huang, Nini Rao and Feng-Biao Guo*: <b>CEG: a database of essential gene clusters</b>. BMC Genomics 2013.</p>
+
<p>2. Roman L. Tatusov, Michael Y. Galperin, Darren A. Natale and Eugene V. Koonin*: <b>The COG database: a tool for genome-scale analysis of protein functions and evolution</b>.</p>
+
<p>3. Mushegian AR, Koonin EV: <b>A minimal gene set for cellular life derived by comparison of complete bacterial genomes</b>. Proc Natl Acad Sci U S A 93, 10268-10273 (1996).</p>
+
 
</section>
 
</section>
 
</div>
 
</div>
 
</body>
 
</body>
 
</html>
 
</html>

Revision as of 09:39, 15 September 2015

Modeling

Work Flow

This section briefly introduces the workflow of our software, which has four parts: selecting reference species, minimal gene set display, gene network display, and analysis of assembly.

1) Selecting reference species

This is the first part of MCCAP, and all the work of constructing a minimal cell starts here. You can choose the species you want by dragging them to the right side or double-clicking them. If you have trouble finding a certain organism, use the sort button in the lower-left corner of the window.

2) Minimal gene set display

Click the arrow button and three bubbles will appear on the screen. The result is the filtered minimal gene set.

The accession number and function of each gene cluster are displayed directly; you can click any of them to see the details.

You may also want to compare two minimal gene sets to see how much they differ. To do so, save the selected species and then view the results in the history.

3) Gene network display

Click the pathway or module bubble to see the details of the gene network. All genes that belong to the minimal gene set are marked in red.

4) Analysis of assembly

The detailed information of a gene cluster is shown in the following figure. It contains the cluster information, the corresponding genes, and the BioBrick content.

Here you can see how many genes each gene cluster contains, along with detailed information on each gene. You can also see all of the BioBricks by clicking the “view all” button.

Software Architecture

MCCAP is a typical C/S (client/server) application composed of a server, a client, and a database. The three communicate using JSON-formatted data.
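To make the JSON exchange concrete, here is a minimal Python sketch of what a client request and a server response might look like. The field names ("action", "species", "ratio", "status", "clusters") are hypothetical; the actual MCCAP message format is not documented here, and the real client and server are written in C++/QML and PHP rather than Python.

```python
import json


def build_request(species_ids, ratio=0.42):
    """Serialize a (hypothetical) minimal-gene-set request as JSON text."""
    return json.dumps({
        "action": "minimal_gene_set",
        "species": sorted(species_ids),
        "ratio": ratio,
    })


def parse_response(text):
    """Decode a (hypothetical) server reply and return its list of gene clusters."""
    payload = json.loads(text)
    if payload.get("status") != "ok":
        raise ValueError(payload.get("message", "request failed"))
    return payload["clusters"]


if __name__ == "__main__":
    print(build_request({"eco", "bsu"}))
    print(parse_response('{"status": "ok", "clusters": [{"id": "C1"}]}'))
```

Keeping every message as a single JSON object lets the same parsing code serve both the native client views and pages rendered by the server.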

1) Server

The server is the core of MCCAP; all function units and APIs are implemented on the server. The MCCAP server is written in PHP (http://www.php.net/). The code is organized by category, which makes it convenient for developers to reuse our function units or code. For example, if you want to deploy our server locally, you only need to alter the files in the config folder.

2) Client

The MCCAP client is written in C++ and QML with the Qt library (http://www.qt.io/). It is responsible for all user interaction.

3) Database

MCCAP uses MySQL (http://www.mysql.com/) as its database.

It is worth mentioning that we made a small change to the software architecture: on top of the C/S architecture, we borrowed some features of the B/S (browser/server) architecture. We built an embedded browser into the software so that it can display pages generated on the server side. The benefit is that frequently updated pages can be assigned to the server: a page needs to be updated only once on the server, and all users get the latest content immediately.