Team:Korea U Seoul/Project/Result/Discussion/content
① Data Processing
To search all possible paths between a set of compounds, we used NetworkX (Hagberg A., Schult D., Swart P. 2008), a Python software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (NetworkX n.d.). Among its many graph algorithms, our program ‘Gil’ uses the one named “all simple paths”, which generates every simple path in a graph from a source (the starting node) to a target (the ending node).
This approach has several advantages. First, it returns only simple paths, that is, paths that do not contain repeated nodes. Since a repeated node indicates repeated reactions, such paths are ruled out. Second, the results are better than those of “All shortest paths”, another well-known algorithm. Because “All shortest paths” computes only the shortest paths in the graph, its output is often unsuitable for biologists. Finally, we can set a cutoff, the depth at which the search stops. We chose the cutoff value because a path that needs more than eight different transgenes is not a reasonable part design for a synthetic biologist.
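The difference between the two algorithms can be seen on a toy graph. The sketch below is only an illustration: the node labels, the reaction labels, and the cutoff of 8 (taken here from the eight-transgene limit) are assumptions, not the actual ‘Gil’ data or settings. It uses the NetworkX functions `all_simple_paths` and `all_shortest_paths`.

```python
import networkx as nx

# Toy directed graph: nodes stand for compounds, edges for reactions.
# All labels are hypothetical placeholders, not real KEGG IDs.
G = nx.DiGraph()
G.add_edges_from([
    ("A", "B", {"reaction": "R1"}),
    ("B", "C", {"reaction": "R2"}),
    ("C", "D", {"reaction": "R3"}),
    ("A", "D", {"reaction": "R4"}),
    ("B", "A", {"reaction": "R5"}),  # back edge; simple paths never revisit A
])

# Assumed cutoff: paths with more than eight reactions are not searched.
CUTOFF = 8

# Every path from A to D without repeated nodes, up to CUTOFF edges long.
simple = sorted(nx.all_simple_paths(G, "A", "D", cutoff=CUTOFF))

# Only the paths of minimum length.
shortest = sorted(nx.all_shortest_paths(G, "A", "D"))

print(simple)    # [['A', 'B', 'C', 'D'], ['A', 'D']]
print(shortest)  # [['A', 'D']]
```

Here the longer route through B and C appears only in the simple-path result, and lowering `cutoff` to 2 would drop it again.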
Then, we calculated the scores of every output path and picked the three optimal paths for each scoring factor. For a set of compounds, at most twelve paths are combined into a network, together with scores, BioBrick interlinking information, and other related data. We saved this information in JSON and text formats. For the standard biological parts, there are 24,133 part IDs and sequences. We filtered out the sequences shorter than 100 bp, leaving 19,808 parts. Running Nucleotide BLAST against the KEGG GENES data then produced 565,163 matches (with an E-value of 1e-5). In total, 3,972 parts were linked to 20,276 gene IDs.
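The selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (`nodes`, `scores`), the function names, and the rule that a higher score is better for every factor are all hypothetical, since the actual scoring rules are not given here. With four factors and three paths each, at most twelve distinct paths are kept.

```python
import json

FACTORS = ("NADH", "NADPH", "ATP", "CO2")
TOP_N = 3  # three paths per scoring factor


def top_paths(scored_paths, factor, n=TOP_N):
    """Return the n paths with the highest score for one factor.

    Assumption: a higher score is better for every factor.
    """
    ranked = sorted(scored_paths, key=lambda p: p["scores"][factor], reverse=True)
    return ranked[:n]


def select_network_paths(scored_paths, n=TOP_N):
    """Union of the top-n paths per factor (at most len(FACTORS) * n paths)."""
    selected = {}
    for factor in FACTORS:
        for path in top_paths(scored_paths, factor, n):
            # Key on the node sequence so a path chosen twice is stored once.
            selected[tuple(path["nodes"])] = path
    return list(selected.values())


# Hypothetical scored paths between compounds A and D.
paths = [
    {"nodes": ["A", "B", "D"], "scores": {"NADH": 2, "NADPH": 0, "ATP": 1, "CO2": 0}},
    {"nodes": ["A", "C", "D"], "scores": {"NADH": 1, "NADPH": 1, "ATP": 3, "CO2": 1}},
    {"nodes": ["A", "E", "D"], "scores": {"NADH": 0, "NADPH": 2, "ATP": 0, "CO2": 2}},
    {"nodes": ["A", "F", "D"], "scores": {"NADH": 3, "NADPH": 3, "ATP": 2, "CO2": 3}},
    {"nodes": ["A", "G", "D"], "scores": {"NADH": -1, "NADPH": -1, "ATP": -1, "CO2": -1}},
]

network_paths = select_network_paths(paths)
as_json = json.dumps(network_paths, indent=2)  # ready to be written to a .json file
```

Keying the selection on the node sequence keeps a path that ranks in the top three for several factors from being counted more than once.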
② Database Construction
After collecting and processing all the necessary data, our team stored it in MySQL tables and JSON files. A comparison between the KEGG database and ‘Gil’ is shown in the table below.
In addition, the Gil database has XXXX E. coli K-12 genes, covering XXX percent of overall reactions. There are also XXXX Gibbs energy values, covering XXXX percent of overall reactions.
The program ‘Gil’ does not merely enumerate results; it visualizes the output networks composed of biological chemicals and reactions. Along with the visualization, it provides further information that helps users build intuition about the pathways. ‘Gil’ is also aimed at the general public, so anyone with Internet access can design their own paths simply by entering the start and end compounds.
① The Front Page of ‘Gil’
There are two input boxes on the front page. The user can enter either a chemical name or a KEGG compound ID. The page also supports an auto-complete function. As shown in Figure 1, the page provides the user with a list of possible routes.
② The Result Page
If the user enters a valid input pair and clicks the “Search” button, the result page is shown. This page consists of several kinds of information.
First of all, all result paths are shown as a single combined graph in the left box. The yellow nodes are compounds. When the user places the mouse cursor on a node, the corresponding chemical name appears. The reactions, on the other hand, are shown as edges labeled with KEGG reaction IDs.
To find out more about a specific compound or reaction, the user can click the node or edge. A table of detailed information then appears on the right side of the result page. A compound table contains the chemical name, formula, exact mass, molecular weight, and a structure image from the KEGG Compound database or PubChem, while a reaction table comprises the reaction name, overall equation, Gibbs free energy, and a gene download button. The most noteworthy function of ‘Gil’ is that the Registry of Standard Biological Parts is interlinked with the KEGG GENES database. Making this connection was a hard problem, which we solved with BLAST. If a specific gene is highly similar in sequence to a biological part, the BioBrick ID is added to the end of that gene's definition line in the FASTA file.
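The linking step can be sketched in a few lines. The sketch below rests on several assumptions that are not stated on this page: that the BLAST hits are in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11), that parts were the queries and KEGG genes the subjects, and that 1e-5 acts as an upper E-value limit. The function names and all IDs are hypothetical.

```python
MIN_PART_LENGTH = 100  # bp; shorter parts are filtered out
MAX_EVALUE = 1e-5      # assumed to be an upper limit on the E-value


def link_parts_to_genes(blast_lines, part_lengths):
    """Map each gene ID to the sorted part IDs that hit it.

    Assumes tab-separated BLAST output with the query (part) ID in
    column 1, the subject (gene) ID in column 2, and the E-value in column 11.
    """
    links = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        part_id, gene_id, evalue = fields[0], fields[1], float(fields[10])
        if part_lengths.get(part_id, 0) < MIN_PART_LENGTH:
            continue  # part sequence too short
        if evalue > MAX_EVALUE:
            continue  # hit too weak
        links.setdefault(gene_id, set()).add(part_id)
    return {gene: sorted(parts) for gene, parts in links.items()}


def annotate_fasta(fasta_lines, links):
    """Append linked part IDs to the end of matching FASTA definition lines."""
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            gene_id = line[1:].split()[0]
            parts = links.get(gene_id)
            if parts:
                line = line + " " + ",".join(parts)
        out.append(line)
    return out


# Hypothetical hits and sequences, for illustration only.
blast = [
    "BBa_X0001\torg:gene1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t350",
    "BBa_X0002\torg:gene1\t80.0\t120\t24\t0\t1\t120\t1\t120\t2e-3\t40",
    "BBa_X0003\torg:gene2\t99.0\t60\t1\t0\t1\t60\t1\t60\t1e-20\t100",
]
part_lengths = {"BBa_X0001": 250, "BBa_X0002": 150, "BBa_X0003": 60}
fasta = [">org:gene1 hypothetical protein", "ATGC", ">org:gene2 other protein", "GGCC"]

links = link_parts_to_genes(blast, part_lengths)
annotated = annotate_fasta(fasta, links)
```

In this toy input only the first hit survives both filters, so only the definition line of `org:gene1` receives a part ID.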
Lastly, according to the purpose of the experiment, the user can select the factor by which the paths are sorted. ‘Gil’ provides four criteria: the numbers of NADH, NADPH, ATP, and CO2 molecules. Team Korea U Seoul chose these four factors to evaluate all output paths and thus give researchers optimal pathways. The reasons for choosing them are explained under “Project Description > Biological Background”. Furthermore, E. coli K-12 information is included on the result page. If the user clicks the “E. coli Metabolism” button, reactions that E. coli K-12 natively possesses are highlighted in blue.
Under the network, compound, and reaction boxes, there is detailed information on NADH, NADPH, ATP, and CO2. For each factor, the information is presented as a pair of graphs and a table. A bar graph shows the overall difference among the three paths, and a line graph shows how the number of factor molecules changes as a path progresses. Finally, all of the data above are organized in a table.
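The line graph described above is essentially a running total along a path. A minimal sketch of how such a series could be prepared before plotting is given below, assuming each reaction contributes a signed change in the number of factor molecules; the function name and the example values are hypothetical.

```python
from itertools import accumulate


def cumulative_profile(per_reaction_change):
    """Running total of a factor along a path, starting from zero.

    per_reaction_change holds the signed change contributed by each
    reaction in path order (assumed data layout).
    """
    return [0] + list(accumulate(per_reaction_change))


# Hypothetical ATP changes for the four reactions of one path.
atp_changes = [-1, 0, 2, 1]
profile = cumulative_profile(atp_changes)

print(profile)      # [0, -1, -1, 1, 2]  -> points of the line graph
print(profile[-1])  # 2                  -> height of the bar for this path
```

The last value of the series is the net total for the path, which is the quantity a bar graph comparing the three paths would show.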
We used Python 2.7, Ubuntu 12.04.5, NetworkX, MySQL, and BLAST to collect data and construct the database. In addition, we used D3.js and JavaScript to render the JSON files as various graphs.