① Data Processing
KEGG RPAIR is a collection of substrate-product pairs, or reactant pairs, within KEGG REACTION, a database of biochemical reactions (KEGG REACTION, n.d.). The reactant pairs are categorized into five types. Among them, we chose the main pairs, which describe the changes of the main compounds. We did so because a reaction generally consists of multiple reactant pairs, and only the main pairs appear on the KEGG metabolic pathway maps. The ‘PathPred’ server likewise uses only the main pairs (Moriya Y., et al. 2010).
Next, to search all possible paths between a set of compounds in the main pairs, we utilized NetworkX (Hagberg A., Schult D., Swart P. 2008), a Python software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (NetworkX n.d.). Among its many graph algorithms, our program ‘Gil’ uses the one named “all simple paths”, which generates all simple paths in a graph from a source (the starting node) to a target (the ending node).
This algorithm has several advantages. First, it returns only simple paths, that is, paths that do not contain repeated nodes. Since a repeated node indicates that a path passes through repeated reactions, such paths are ruled out. Second, the results are better than those of “all shortest paths”, another well-known algorithm: because that algorithm computes only the shortest paths between two nodes, it often returns paths that are unsuitable for biologists. Finally, we can set a cutoff, the depth at which the search stops. We fixed the cutoff because a longer path that needs more than eight different transgenes is not a reasonable part design for a synthetic biologist.
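The search described above can be sketched in plain Python. This is only an illustration of the technique, not the code used in ‘Gil’: NetworkX offers the same search as `all_simple_paths(G, source, target, cutoff=...)`, while the toy graph, the node names, and the cutoff value of 8 below are assumptions made for the example (the example is written for Python 3).

```python
def all_simple_paths(graph, source, target, cutoff):
    """Yield each simple path from source to target using at most `cutoff` edges.

    `graph` maps a node to the list of nodes it points to.
    """
    path = [source]
    on_path = {source}

    def extend(node):
        if node == target:
            yield list(path)
            return
        if len(path) - 1 >= cutoff:  # depth limit reached, stop searching
            return
        for neighbor in graph.get(node, ()):
            if neighbor in on_path:  # skip repeated nodes: keeps the path simple
                continue
            path.append(neighbor)
            on_path.add(neighbor)
            yield from extend(neighbor)
            path.pop()
            on_path.remove(neighbor)

    return extend(source)


# Hypothetical toy network; letters stand in for KEGG compound IDs.
toy_graph = {"A": ["B", "D"], "B": ["C"], "C": ["A", "D"], "D": []}

CUTOFF = 8  # assumed from the "more than eight transgenes" limit in the text
paths = sorted(all_simple_paths(toy_graph, "A", "D", CUTOFF))
short_only = list(all_simple_paths(toy_graph, "A", "D", 2))
```

With the generous cutoff, both the direct route and the longer route through B and C are found, and the loop back from C to A is never followed. Lowering the cutoff to 2 leaves only the direct route, which is also all that a shortest-path search would report.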
Then, we calculated the scores of every output path and picked the three optimal paths for each scoring factor. For a given set of compounds, at most twelve paths are combined into a network and stored together with their scores, BioBrick interlinking information, and other related data. We saved this information in JSON and text formats. As for the standard biological parts, there were 24,133 part IDs and sequences. We filtered out the sequences shorter than 100 bp, which left 19,808 parts. Running Nucleotide BLAST against the KEGG GENES data then yielded 565,163 matches (with an E-value threshold of 1e-5). In total, 3,972 parts were linked to 20,276 gene IDs.
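The selection of three paths per scoring factor, and the resulting ceiling of twelve paths, might look like the following sketch. The path IDs and scores are invented, and the ranking direction (more NADH, NADPH, and ATP is better, less CO2 is better) is an assumption, since the text does not state how "optimal" is defined.

```python
# Assumed ranking direction for each factor; not stated in the text.
HIGHER_IS_BETTER = {"NADH": True, "NADPH": True, "ATP": True, "CO2": False}


def pick_top_paths(scored_paths, per_factor=3):
    """Return {factor: [path ids]} with the best `per_factor` paths per factor."""
    best = {}
    for factor, higher in HIGHER_IS_BETTER.items():
        ranked = sorted(scored_paths, key=lambda p: p[factor], reverse=higher)
        best[factor] = [p["id"] for p in ranked[:per_factor]]
    return best


def merge_ids(best):
    """Union of the selected path ids, in first-seen order."""
    merged = []
    for ids in best.values():
        for path_id in ids:
            if path_id not in merged:
                merged.append(path_id)
    return merged


# Hypothetical scores for five candidate paths.
scored = [
    {"id": "p1", "NADH": 3, "NADPH": 0, "ATP": 1, "CO2": 2},
    {"id": "p2", "NADH": 1, "NADPH": 2, "ATP": 0, "CO2": 0},
    {"id": "p3", "NADH": 2, "NADPH": 1, "ATP": 3, "CO2": 1},
    {"id": "p4", "NADH": 0, "NADPH": 3, "ATP": 2, "CO2": 3},
    {"id": "p5", "NADH": 4, "NADPH": 4, "ATP": -1, "CO2": 4},
]

best = pick_top_paths(scored)
network_paths = merge_ids(best)
```

Because the same path can rank highly for several factors, the merged network holds anywhere from three to twelve distinct paths.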
② Database Construction
After collecting and processing all the necessary data, our team stored them in MySQL tables and JSON files. The table below compares the KEGG database with the ‘Gil’ database.
Name | KEGG DB | 'Gil' DB | Coverage (%) |
Compounds | 7,269 | | |
Reactions | 9,911 | | |
Paths | 1,179,511 | | |
In addition, the Gil database has XXXX E. coli K-12 genes, covering XXX percent of all reactions. There are also XXXX Gibbs energy values, covering XXXX percent of all reactions.
The program ‘Gil’ does not just enumerate results; it also visualizes the output networks of chemical compounds and reactions. The visualization, together with the additional information it provides, helps users build an intuition for the pathways. ‘Gil’ also targets non-specialists as potential users: anyone with Internet access can design their own paths simply by entering the start and end compounds.
① The Front Page of ‘Gil’
There are two input boxes on the front page. The user can enter either a chemical name or a KEGG compound ID. The page also supports auto-completion. As shown in Figure 1, the page provides the user with a list of possible routes.
② The Result Page
If the user enters a valid input pair and clicks the “Search” button, the result page is shown. This page consists of several kinds of information.
First of all, all result paths are shown as a single combined graph in the left box. The yellow nodes are compounds. When the user hovers the mouse cursor over a node, the corresponding chemical name appears. The reactions are shown as edges labeled with KEGG reaction IDs.
To find out more about a specific compound or reaction, the user can click the node or edge. A table of detailed information then appears on the right side of the result page. A compound table contains the chemical name, formula, exact mass, molecular weight, and a structure image from the KEGG COMPOUND database or PubChem, while a reaction table comprises the reaction name, overall equation, Gibbs free energy, and a gene download button. The most noteworthy function of ‘Gil’ is that the Registry of Standard Biological Parts is interlinked with the KEGG GENES database. Making this connection was a hard problem, which we solved with BLAST. If a specific gene is highly similar to a biological part, the BioBrick ID is added to the end of the corresponding definition line of the FASTA file.
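The interlinking step can be illustrated with a short sketch. It assumes that the BLAST results are available in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11), that the parts were the queries and the KEGG genes the subjects, and that part IDs are appended as a comma-separated list; all IDs and sequences are made up, and none of this is confirmed as the team's actual file layout.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-5  # threshold quoted in the text


def parts_by_gene(blast_lines, cutoff=EVALUE_CUTOFF):
    """Map each gene ID to the part IDs that match it at or below the cutoff."""
    links = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        part_id, gene_id, evalue = cols[0], cols[1], float(cols[10])
        if evalue <= cutoff and part_id not in links[gene_id]:
            links[gene_id].append(part_id)
    return links


def annotate_fasta(fasta_lines, links):
    """Append matching part IDs to the end of each FASTA definition line."""
    annotated = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            gene_id = line[1:].split()[0]
            parts = links.get(gene_id)
            if parts:
                line += " " + ",".join(parts)  # separator is an assumption
        annotated.append(line)
    return annotated


# Hypothetical BLAST hits: the second one fails the E-value threshold.
hits = [
    "BBa_X0001\torg:gene1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-50\t350",
    "BBa_X0002\torg:gene2\t80.0\t120\t24\t0\t1\t120\t1\t120\t0.01\t40",
]
fasta = [">org:gene1 example gene A", "ATGC", ">org:gene2 example gene B", "GGCC"]

annotated = annotate_fasta(fasta, parts_by_gene(hits))
```

Only the first definition line gains a part ID, because the second hit has an E-value above the threshold.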
Lastly, according to the purpose of the experiment, the user can select the factor by which the paths are sorted. ‘Gil’ provides four criteria: the numbers of NADH, NADPH, ATP, and CO2 molecules. Team Korea U Seoul chose these four factors to evaluate the whole path output and thus give researchers optimal pathways. The reasons for choosing them are explained under “Project Description > Biological Background”. Furthermore, E. coli K-12 information is included on the result page. If the user clicks the “E. coli Metabolism” button, the reactions that E. coli K-12 carries out natively are highlighted in blue.
Below the network, compound, and reaction boxes, there is detailed information on NADH, NADPH, ATP, and CO2. For each factor, the information is presented as a pair of graphs and a table. A bar graph shows the overall difference among the three paths, and a line graph shows how the number of factor molecules changes as a path progresses. Finally, all of the data above are organized in a table.
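The data behind the two graphs can be derived from per-reaction changes, as in the sketch below: a running total along the path feeds the line graph, and its last value feeds the bar graph. The path names and the per-step numbers are hypothetical, and the data layout is an assumption rather than the format ‘Gil’ actually stores.

```python
from itertools import accumulate


def factor_profile(steps, factor):
    """Return (running totals along the path, final total) for one factor.

    `steps` is a list of dicts giving the net change of each factor per reaction.
    """
    running = list(accumulate(step.get(factor, 0) for step in steps))
    total = running[-1] if running else 0
    return running, total


# Hypothetical per-reaction changes for two candidate paths.
example_paths = {
    "path 1": [{"NADH": 1}, {"ATP": -1}, {"NADH": 1, "CO2": 1}],
    "path 2": [{"NADH": 1}, {"NADH": -1}],
}

line_data = {name: factor_profile(steps, "NADH")[0]
             for name, steps in example_paths.items()}
bar_data = {name: factor_profile(steps, "NADH")[1]
            for name, steps in example_paths.items()}
```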
We used Python 2.7, Ubuntu 12.04.5, NetworkX, MySQL, and BLAST to collect the data and construct the database.
In addition, we used D3.js and JavaScript to render the JSON files as various graphs.
The novelty of this project lies in supporting the researcher’s intuition through well-made visualization. ‘Gil’ can help in designing synthetic biology experiments. The program has several prominent features, such as manually supplemented data that make up for some gaps in the KEGG database and the connection between the BioBrick part registry and the KEGG GENES database. In addition, the network visualization in ‘Gil’ stimulates researchers’ intuition and gives them easy access to more detailed information. However, some details still need improvement before the software is fully suited to its purpose.
To show more precise and feasible output that meets the needs of all users, our team is planning to work on three functions. The first is an advanced search in which a user can enter a stopover compound or a compound that should be avoided. Because the program does not contain all known biological facts, an output path may contain a toxic compound, or a compound that is absent from the output may be biologically more productive.
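One simple way to realize such an advanced search is to filter the enumerated paths afterwards. The sketch below is a hypothetical illustration of that idea, not a planned implementation; the function name, its parameters, and the compound labels are all invented.

```python
def filter_paths(paths, via=None, avoid=()):
    """Keep paths that pass through `via` (if given) and touch nothing in `avoid`."""
    avoid = set(avoid)
    kept = []
    for path in paths:
        # A stopover must be an intermediate compound, not the start or end.
        if via is not None and via not in path[1:-1]:
            continue
        if avoid.intersection(path):
            continue
        kept.append(path)
    return kept


# Hypothetical candidate paths between compounds S and T.
candidates = [
    ["S", "a", "b", "T"],
    ["S", "a", "tox", "T"],
    ["S", "c", "T"],
]

through_a = filter_paths(candidates, via="a")
safe = filter_paths(candidates, avoid=["tox"])
safe_through_a = filter_paths(candidates, via="a", avoid=["tox"])
```

Filtering after enumeration is the easiest option to add; pruning avoided compounds from the graph before the search would do less work on large networks.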
We also want to add a function called “two degrees”, which shows not only the nodes in a path but also their surrounding nodes, much like the “People You May Know” function on Facebook. We have already constructed the required database, but visualizing all the two-degree nodes remains the obstacle, since each node is connected to 20.5 other compounds on average.
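Collecting the surrounding nodes could be done with a two-step breadth-first expansion from the path, as sketched here. The reading of “two degrees” as "within two steps of a path node", the undirected adjacency map, and the node names are all assumptions for illustration.

```python
def two_degree_nodes(adjacency, path):
    """Return nodes within two steps of any path node, excluding the path itself."""
    seen = set(path)
    frontier = set(path)
    for _ in range(2):  # expand one hop at a time, twice
        reached = set()
        for node in frontier:
            reached.update(adjacency.get(node, ()))
        frontier = reached - seen
        seen |= frontier
    return seen - set(path)


# Hypothetical undirected network: Z is three steps away from the path.
adjacency = {
    "A": ["B", "X"],
    "B": ["A", "C"],
    "C": ["B"],
    "X": ["A", "Y"],
    "Y": ["X", "Z"],
    "Z": ["Y"],
}

surrounding = two_degree_nodes(adjacency, ["A", "B", "C"])
```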
So far, ‘Gil’ highlights only the reactions of E. coli, the organism most frequently used in synthetic biology. We will add information on more species and visualize it by assigning a color to each of them. This is a relatively easy task, since we did the same work for the E. coli K-12 genes; the challenge is to keep the visualization from becoming too complicated.
Until now, the paths have been suggested according to four selected criteria: NADH, NADPH, ATP, and CO2. We are going to add more criteria for prioritizing pathways. First, to improve reliability, we are going to prioritize reactions in order of their citation frequency. In addition, the method of evaluating carbon loss will be improved: our program currently treats carbon loss as CO2 production, whereas the improved version will compare the carbon numbers of the starting and terminating compounds. Ultimately, by merging the scoring criteria and weighting each criterion by its importance, we aim to calculate an overall order of priority that considers all elements.
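The two planned changes, carbon loss from carbon numbers and a weighted merge of criteria, can be sketched as follows. The formulas, scores, and weights are placeholders chosen for the example; the text does not specify how the criteria will actually be weighted.

```python
import re


def carbon_count(formula):
    """Count carbon atoms in a simple molecular formula such as 'C6H12O6'."""
    total = 0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element == "C":  # 'Cl', 'Ca', etc. are matched as separate elements
            total += int(count) if count else 1
    return total


def carbon_loss(start_formula, end_formula):
    """Carbon atoms lost between the starting and terminating compounds."""
    return carbon_count(start_formula) - carbon_count(end_formula)


def combined_score(scores, weights):
    """Weighted sum of the individual criteria."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical weights: carbon loss counts against a path.
weights = {"NADH": 1.0, "NADPH": 1.0, "ATP": 0.5, "carbon_loss": -1.0}
scores = {
    "NADH": 2,
    "NADPH": 1,
    "ATP": 1,
    "carbon_loss": carbon_loss("C6H12O6", "C3H4O3"),
}

priority = combined_score(scores, weights)
```

A weighted sum is only one way to merge criteria; it assumes the scores are on comparable scales, so in practice they might need normalizing first.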
Besides, there are some features of the web application that we are going to develop further. First, when two or more reactions share the same reactant and product, their edges overlap. We will find a way to show them separately. Also, the current UX design is not very user-friendly. We need to study this field further so that users can understand the functions of the program more easily. For instance, we will add a ‘dashboard’ from which a user can see the results intuitively. Finally, we will apply responsive CSS, which makes the software usable on desktops, tablets, and mobile phones by automatically adapting the layout to the screen size of the device.
Finally, in order to increase the reliability of ‘Gil’, we used the software to search for pathways reported in recent journals. Through this work, we expect to show how the program can be applied in actual experiments.
Our goal was to design a biological battery with novel metabolic pathways. Enzymatic fuel cells (EFCs) are fuel cells that obtain electrons from enzymatic oxidation reactions to produce electricity. Because an EFC is eco-friendly and does not require advanced technology, it is attracting public attention as a promising future energy source. However, most EFCs rely on food sources such as glucose or starch, which makes them impractical for actual electricity production. We therefore looked for a fuel source that is abundant and not consumed as food, and finally settled on low-priced agar as the starting compound.
We searched for an agar-degrading pathway to select the final compound for ‘Gil’. According to an article published this year (Yun EJ., et al.), agar, which is extracted from red macroalgae (Gelidium amansii), dissolves in acetic acid to become agarooligosaccharides. These are then degraded into D-galactose and 2-keto-3-deoxy-galactonate. During the process, NADH is formed, which can be used as an electron donor in an EFC (see the red path in Figure 4).
In this context, we chose 2-keto-3-deoxy-galactonate as our final compound. Figure 4 below shows the result produced by ‘Gil’. The program finds the pathway described in the article. In addition, it shows a path that degrades D-galactose and produces additional NADH or NADPH, which means higher electricity output (see the blue path in Figure 4).
Based on the search using ‘Gil’, here is how our EFC is going to work. At the anode of our EFC, the enzymes will degrade agar to produce NADH, and diaphorase will transfer the electrons of NADH to the wire, producing electricity. Building this agar-utilizing EFC will be our next iGEM challenge.
Hagberg, A., D. Schult, and P. Swart. "Exploring network structure, dynamics, and function using NetworkX." Proceedings of the 7th Python in Science Conference (SciPy2008). Pasadena, 2008. 11-15.
NetworkX. Overview, all simple paths, all shortest paths. n.d. https://networkx.github.io/ (accessed September 2015).
Yun, EJ., et al. "The novel catabolic pathway of 3,6-anhydro-L-galactose, the main component of red macroalgae, in a marine bacterium." Environmental Microbiology, 2015: 1677-1688.