Team:Technion HS Israel/LINK TO EXPLANATION
The Constants Database
Abstract
We wrote a scraping program (a crawler) that scans the wiki pages of all the teams and looks for tables of parameters and constants that the teams supplied. It extracts them and builds a database. To make the database accessible, we built the Constants Database page, which enables easy search.
Introduction
An important aspect of every model is choosing the right parameters. After all, the parameters are the thing that connects the equations that we derive on a blackboard or in our notebook to the real world. They connect the heuristic description of the reactions and processes to a quantitative one. It's important because it enables us to compare the results of the modelling with the results in the experiment.
When we worked on our model, we had a hard time finding our constants. There are many different articles about the subjects we dealt with, and it is hard to find the ones that are relevant to our assay.
Therefore, like many other teams in the past and probably in the future, we began our constants hunt by looking at the constants that past teams had found, either by themselves or from the literature, and gathering them. The second part was diving into the references supplied by each team that gave a value for a constant we needed.
The first part frustrated us the most, because the constants are hard to find: they are scattered, without order, across the tens of thousands of pages in the wikis of all the teams over the years. A search engine, like the one we've built or Google, can help you find some of the constants, but it's far from perfect. You have to know exactly what you are searching for, and even if you do, you won't always find the constant you were looking for, as it's usually buried in irrelevant information. Moreover, if you want to compare the values that different groups gave to a constant, and the justifications they provided, you have to pick them out separately from the wiki pages of many different groups, which is a lot of work. To demonstrate this point, let's look at the results you get from a search engine (for example, Google) when you try to find all the constants that teams from 2014 (or any other year) gave in their wikis.
Description for the image: the results you get from Google when you look for constants that teams from 2014 gave. The Xs denote pages that don't contain any constants, and the Vs denote pages with constants. Google, like most of the other ways to search the iGEM wiki pages, can't distinguish between regular pages and constants pages.
Our search engine, the iGEM Explorer, succeeds in distinguishing constants pages from other pages, but the information in it is still less accessible than it should be when it comes to constants.
Therefore we've built a user interface which is designed especially for looking for and researching constants. We call this database and user interface "The Constants Database".
Methods
The aggregation was done using our crawler, the same one that is used by our search engine. Each page the crawler visits is scanned for a few keywords that might hint that it is a constants page. If the presence of these keywords makes it safe to assume that the page is a constants page, the page is scanned for tables. Each table is analyzed, and the crawler uses the header row of the table to deduce the role of each column. It tries to fit one of the following roles to each column:
After finding the roles of the columns, it extracts and sorts the data in the table accordingly. The result of the crawler in this case is a homogeneous constants table. We converted it to HTML and put it on our wiki (link). To make it more accessible, we built a user interface with a search box that enables free-text search across the whole database.
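The steps described in this section (keyword check, table scan, role deduction from the header row) can be illustrated with a small sketch. It is written in Python for brevity, whereas our crawler itself is written in Java, so it is an illustration under stated assumptions and not the crawler's actual code. The keyword list `PAGE_KEYWORDS`, the header hints in `ROLE_HINTS` and all function names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical keyword list and header hints; the lists used by the real
# crawler are not given in the text.
PAGE_KEYWORDS = ("constant", "parameter")
ROLE_HINTS = {
    "name": ("name", "parameter", "constant", "symbol"),
    "description": ("description", "meaning"),
    "value": ("value",),
    "units": ("unit",),
    "source": ("source", "reference"),
}


class TableParser(HTMLParser):
    """Collects every <table> on a page as a list of rows of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a table cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def looks_like_constants_page(html):
    """Keyword check: does the page mention any of the hint words?"""
    text = html.lower()
    return any(keyword in text for keyword in PAGE_KEYWORDS)


def role_of(header):
    """Map one header cell to a role, or None if no hint matches."""
    header = header.lower()
    for role, hints in ROLE_HINTS.items():
        if any(hint in header for hint in hints):
            return role
    return None


def extract_constants(html):
    """Return one dict per table row, keyed by the deduced column roles."""
    if not looks_like_constants_page(html):
        return []
    parser = TableParser()
    parser.feed(html)
    records = []
    for table in parser.tables:
        if len(table) < 2:
            continue  # a header row alone carries no constants
        roles = [role_of(header) for header in table[0]]
        if "value" not in roles:
            continue  # assumption: a constants table has a value column
        for row in table[1:]:
            records.append({r: cell for r, cell in zip(roles, row) if r})
    return records
```

Because every table is mapped onto the same small set of roles, rows from tables with different column orders or header wordings end up in one uniform list of records, which is what makes a single homogeneous table possible.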
Description
In the Constants Database page there is a table of about 650 constants that were collected automatically by our crawler from the wiki pages of iGEM groups from the years 2008-2014. You can filter the results using the search box. For each constant entry, its name, description, value, units and source are given, along with the name of the iGEM team that gave this constant and a link to their constants page.
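As a rough illustration of how a free-text filter over such entries can work, here is a minimal Python sketch. It is not the code behind the page's search box: the function names are hypothetical, and it assumes each entry is stored as a dictionary of the fields listed above.

```python
def matches(record, query):
    """True if every word of the query appears in some field of the record."""
    haystack = " ".join(str(value) for value in record.values()).lower()
    return all(word in haystack for word in query.lower().split())


def search(records, query):
    """Keep only the records that match the free-text query."""
    return [record for record in records if matches(record, query)]
```

With this approach a query such as `degradation 1/min` keeps only entries that mention both words in any of their fields, and an empty query leaves the table unfiltered.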
We hope that the iGEM community will find the Constants Database useful.
Technologies
The open-source Java library Crawler4J was used to create the crawler and to manipulate DOM elements in it. The web service http://tableizer.journalistopia.com was used to convert the database to an HTML table.
Screen Shots
Future work
Automatic unit conversion – there are still things to improve in the user experience that our Constants Database gives. One of them is the variety of units the teams used, which leads to a variety of units among the constants in our database. This makes it harder to compare constants and to use several constants from the database together.
Quality assurance – like any automated data gathering, ours needs manual quality assurance to get close to perfection. It has two aspects: the first is finding constants for which our aggregator took a wrong description or value, or other problems in collecting the information for the database; the second is checking the quality of the constants themselves, that is, checking that the team that calculated them or took them from an article didn't make a mistake along the way. Providing a way for users to rate the quality of the results is also possible.
Dealing with other information structures – currently our aggregator finds only tables that can be identified as constants tables using a few parameters of the page. In order to find all the constants that groups used over the years, we have to deal with the ones that are less convenient to extract, like those given in a list or those on pages that aren't titled in accordance with their content.
Expand the look-up space – our constants table extraction algorithm can be run on data sources other than the iGEM pages of groups, such as articles, to extract their constants into one homogeneous constants table.
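For the automatic unit conversion item in the list above, one possible direction is a lookup table that maps each unit string to a base unit and a conversion factor. The Python sketch below only illustrates this idea; nothing like it exists in the database yet, and the table `TO_BASE` with its handful of units, as well as the function name, are assumptions made for the example.

```python
# Hypothetical conversion table: unit string -> (base unit, factor that
# turns a value in that unit into the base unit).
TO_BASE = {
    "1/s": ("1/s", 1.0),
    "1/min": ("1/s", 1.0 / 60.0),
    "1/h": ("1/s", 1.0 / 3600.0),
    "M": ("M", 1.0),
    "mM": ("M", 1e-3),
    "uM": ("M", 1e-6),
    "nM": ("M", 1e-9),
}


def normalize(value, units):
    """Return (value, units) in base units, or the input unchanged if the
    unit is unknown or the value is not a plain number."""
    try:
        base, factor = TO_BASE[units.strip()]
        return float(value) * factor, base
    except (KeyError, ValueError):
        return value, units
```

Entries whose units or values cannot be interpreted are left untouched, so a conversion pass of this kind would not lose any of the information the teams originally supplied.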