Difference between revisions of "Team:Technion HS Israel/Practices/SearchEngine"
(23 intermediate revisions by 4 users not shown) | |||
Line 2: | Line 2: | ||
<html> | <html> | ||
<link href="//2015.igem.org/Template:Technion_HS_Israel4/Technion_HS_Israel_menu_style?action=raw&ctype=text/css" rel="stylesheet"> | <link href="//2015.igem.org/Template:Technion_HS_Israel4/Technion_HS_Israel_menu_style?action=raw&ctype=text/css" rel="stylesheet"> | ||
− | <a href="https://2015.igem.org/wiki/index.php?title=Team:Technion_HS_Israel/headings">headings</a> | + | <!--<a href="https://2015.igem.org/wiki/index.php?title=Team:Technion_HS_Israel/headings">headings</a>--!> |
− | <h1><font color="# | + | <h1><font color="#008080">Search Engine – Previous iGEM Projects</font></h1> |
<h2>Abstract</h2> | <h2>Abstract</h2> | ||
Line 10: | Line 10: | ||
<h2>Introduction and Overview</h2> | <h2>Introduction and Overview</h2> | ||
<p> | <p> | ||
− | As a part of our human practice and software section, we designed a search engine of previous iGEM project. It consists mainly of | + | As a part of our human practice and software section, we designed a search engine of previous iGEM project. It consists mainly of two parts – <b>crawler</b> and <b>searcher</b>. The crawler is responsible of obtaining the data from all the iGEM wikis, abstracts, etc. The searcher is the user-friendly searching platform, in which the user can type queries, sort results and get a direct and convenient access to all of the shown websites.</p> |
<h2>Methods</h2> | <h2>Methods</h2> | ||
<h3>Crawler</h3> | <h3>Crawler</h3> | ||
− | <p>We've written a crawler using Java library cralwer4j. </p> | + | <p>We've written a crawler using the open source Java library <a href="https://github.com/yasserg/crawler4j">cralwer4j</a>. A crawler is a program that scans the web by following hyperlinks and downloads the pages it visits. We customized the Crawler4J to scan the info pages (example to such <a href="https://igem.org/Team.cgi?id=1767">page</a>), results pages (<a href="https://igem.org/Results?year=2014®ion=All">example</a>) and the wiki pages (example: this page). Each of this is downloaded and then analyzed. In the cases of the first two types of pages it is just collecting the uncovered and neatly organized information that is given there (teams names, regions, sections, abstracts, awards, posters, etc) and a few other fields. However, in the last type of pages, the team wikis, the situation is more complicated. The data there isn't orginaized to we have to deal with using other means. The text of wiki pages is extracted using the Open Source JSoup Java Library and saved, and the page, the title of the page and the link that has led to the page are analized in order to put the page into one of the following categories. |
+ | <ul> | ||
+ | <li>home</li> | ||
+ | <li>map (site map)</li> | ||
+ | <li>const (parameters page)</li> | ||
+ | <li>software</li> | ||
+ | <li>policy (AKA human practice)</li> | ||
+ | <li>judging forms</li> | ||
+ | <li>parts (biobricks)</li> | ||
+ | <li>safety</li> | ||
+ | <li>blog</li> | ||
+ | <li>team (team page)</li> | ||
+ | <li>notebook</li> | ||
+ | <li>bio (biology)</li> | ||
+ | <li>model</li> | ||
+ | <li>results</li> | ||
+ | <li>sponsor (sponsors page)</li> | ||
+ | <li>attributes</li> | ||
+ | <li>overview (project overview/introductions)</li> | ||
+ | <li>other (if didn't find a matching category)</li> | ||
+ | </ul> | ||
+ | It also extracts constants tables, as explained in the Constants Database documentation page. All the results are saved in the format XML, which is compatible with Solr. | ||
+ | </p> | ||
+ | |||
+ | <p> Example for the result of the Crawler: <a href="https://static.igem.org/mediawiki/2015/9/9a/Technion_HS_2015_final_2014hs.txt">2014HS data</a>. | ||
<h3>Searcher</h3> | <h3>Searcher</h3> | ||
<p> | <p> | ||
Line 24: | Line 48: | ||
<h3>Searching inside Wikis</h3> | <h3>Searching inside Wikis</h3> | ||
<p> | <p> | ||
− | Unlike similar existing iGEM projects search engines, our database consists | + | Unlike many similar existing iGEM projects search engines, our database consists of information from all the wiki pages for searching. Including all the wikis in the database wasn't an easy task at all, but it ensures that all the information that each team posted uploaded online will be searchable. Thus, even details and terms which were not included in the abstracts, but are still important enough to be in the wikis, are not omitted from the searching.</p> |
<h3>Sorting and Ordering</h3> | <h3>Sorting and Ordering</h3> | ||
Line 32: | Line 56: | ||
<h3>Results Grouping</h3> | <h3>Results Grouping</h3> | ||
<p> | <p> | ||
− | + | Our search engine searches all the wikis as well as the abstracts.Therefore, a results grouping mechanism must be used for organizing the results for convenience. This is an advantage of ours over the Google search engine. In our search engine you can search for an iGEM team or a term and limiting the search to the iGEM site will result a lot of duplicates of teams wikis (e.g. 3 results from the same wiki but different pages that contains the searched term). In our search engine, the results grouping mechanism ensure no duplicates will be shown, thus presenting the results in a concise, convenient form. | |
</p> | </p> | ||
<h3>Spell Checker</h3> | <h3>Spell Checker</h3> | ||
Line 41: | Line 65: | ||
<p> | <p> | ||
This feature highlights the query in each result in order to help the user find the searched term in the abstract and the wiki pages.</p> | This feature highlights the query in each result in order to help the user find the searched term in the abstract and the wiki pages.</p> | ||
+ | |||
<h2>Conclusions </h2> | <h2>Conclusions </h2> | ||
Line 52: | Line 77: | ||
In the future, we will extend this search engine by improving the user interface, making it Android/iOS compatible, and adding more features to the searcher, for the convenience of iGEM users. | In the future, we will extend this search engine by improving the user interface, making it Android/iOS compatible, and adding more features to the searcher, for the convenience of iGEM users. | ||
</p> | </p> | ||
+ | |||
+ | <a href="http://beam.to/igemExplorer"><h1 style="color:#274C9C">Take me to the iGEM Exlporer!</h1></a> | ||
+ | (If this link doesn't work, we are probably having problems with server. Sorry about it.) | ||
+ | <br/><br/> | ||
+ | <h2>Screen Shots</h2> | ||
+ | <span style="font-size:2em;line-height:200%"> | ||
+ | Search results page, including result highlighting. | ||
+ | <img src="https://static.igem.org/mediawiki/2015/3/34/Technion_HS_2015_search_engine_highlighting2.png" class="full"> | ||
+ | More Like This feature. | ||
+ | <img src="https://static.igem.org/mediawiki/2015/7/7f/Technion_HS_2015_search_engine_mlt2.png" class="full"> | ||
+ | Another results page. | ||
+ | <img src="https://static.igem.org/mediawiki/2015/0/0a/Technion_HS_2015_search_engine_highlighting1.png" class="full"> | ||
+ | Advanced Search Menu. | ||
+ | <img src="https://static.igem.org/mediawiki/2015/1/13/Technion_HS_2015_search_engine_adv_search_menu.png" class="full"> | ||
+ | Did You Mean feature. | ||
+ | <img src="https://static.igem.org/mediawiki/2015/d/db/Technion_HS_2015_search_engine_did_you_mean.png" class="full"> | ||
+ | </span> | ||
<h1>Source code</h1> | <h1>Source code</h1> | ||
− | <p>Our source code is released under the MIT license. You can find it in <a href=" | + | <p>Our source code is released under the MIT license. You can find it in <a href="https://github.com/itaynn/iGEM-Technion-HS-2015">our github repository</a> or here:</p> |
− | <p><a href=" | + | <p><a href="https://static.igem.org/mediawiki/2015/d/d8/Technion_HS_Israel_LINK_TO_OUR_GITHUB.zip">The iGEM Explorer's crawler source code</a> (which is diligently documented in the source code). Here is a <a href="https://static.igem.org/mediawiki/2015/7/75/Technion_HS_Israel_UML.png">UML diagram</a> of the program</p>. |
− | <p><a href=" | + | <p><a href="https://static.igem.org/mediawiki/2015/c/cd/Technion_HS_2015_conf_dir.zip">The iGEM Explorer's Solr and Velocity config files</a></p> |
<p>You can find usage explanation in the Github repository.</p> | <p>You can find usage explanation in the Github repository.</p> | ||
</html> | </html> |
Latest revision as of 23:55, 18 September 2015
Search Engine – Previous iGEM Projects
Abstract
Every iGEM team has difficulty coming up with a project. A common step towards the decision is searching previous iGEM projects. Our search engine is designed especially for this purpose: helping iGEM novices through this process and making it easier. In addition, it can serve the needs of any iGEMer who wants to find out about previous work, get inspiration, or just learn new and exciting things.
Introduction and Overview
As part of our human practices and software work, we designed a search engine for previous iGEM projects. It consists mainly of two parts: a crawler and a searcher. The crawler is responsible for obtaining the data from all the iGEM wikis, abstracts, etc. The searcher is the user-friendly search platform, in which the user can type queries, sort results and get direct and convenient access to all of the listed websites.
Methods
Crawler
We've written a crawler using the open-source Java library crawler4j. A crawler is a program that scans the web by following hyperlinks and downloading the pages it visits. We customized crawler4j to scan the team info pages (example of such a page), the results pages (example) and the wiki pages (example: this page). Each page is downloaded and then analyzed. For the first two types of pages, the crawler simply collects the neatly organized information given there (team names, regions, sections, abstracts, awards, posters, etc.) and a few other fields. The team wikis, however, are more complicated: the data there isn't organized, so we have to handle it by other means. The text of each wiki page is extracted using the open-source Java library JSoup and saved. Then the page, its title and the link that led to it are analyzed in order to assign the page to one of the following categories:
- home
- map (site map)
- const (parameters page)
- software
- policy (AKA human practice)
- judging forms
- parts (biobricks)
- safety
- blog
- team (team page)
- notebook
- bio (biology)
- model
- results
- sponsor (sponsors page)
- attributes
- overview (project overview/introductions)
- other (if no matching category was found)
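The categorization step described above can be illustrated with a small keyword matcher. This is only a sketch in Python (the crawler itself is written in Java), and the keyword rules, the function name `categorize` and its parameters are assumptions made for illustration; the real rules live in the crawler's source code.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
# Only some of the categories listed above are covered here.
RULES = [
    ("safety", ("safety",)),
    ("notebook", ("notebook", "protocol")),
    ("parts", ("parts", "biobrick")),
    ("model", ("model",)),
    ("software", ("software",)),
    ("policy", ("practice", "policy", "outreach")),
    ("results", ("result",)),
    ("sponsor", ("sponsor",)),
    ("team", ("team", "members")),
]


def categorize(subpage, title, link_text):
    """Guess a page category from the sub-path after the team name,
    the page title, and the text of the link that led to the page."""
    if not subpage:
        # The wiki root (no sub-path) is treated as the home page.
        return "home"
    text = " ".join([subpage, title, link_text]).lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


print(categorize("Practices/SearchEngine", "Search Engine", "Search Engine"))
```

The sub-path is passed without the "Team:" prefix so that the prefix itself does not trigger the "team" rule.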
Example of the crawler's output: 2014HS data.
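The crawler's results are saved as XML that is compatible with Solr. Solr's XML update format wraps each record in a `doc` element inside an `add` element, with one `field` element per value. The sketch below builds such a file in Python; the field names (`id`, `team`, `year`, `category`, `text`) and the sample record are assumptions for illustration and may differ from the fields in the linked 2014HS data.

```python
import xml.etree.ElementTree as ET


def to_solr_xml(records):
    """Serialize a list of dicts into Solr's <add><doc><field> update format."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes <, > and & on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical record for one crawled wiki page.
xml_text = to_solr_xml([
    {
        "id": "2014HS/Example/Safety",
        "team": "Example",
        "year": 2014,
        "category": "safety",
        "text": "Lab safety <rules>",
    }
])
print(xml_text)
```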
Searcher
We used an open-source search platform called Solr. By using this platform, we avoided some technical problems that others have already solved. Thus, we were able to focus on improving the functionality of the engine and making it more suitable for searching iGEM projects.
Our modifications to Solr were made mainly through its configuration file (solrconfig.xml), which determines the behavior and functionality of the engine (e.g. searching algorithm, sorting order, features).
Features
Besides the database we built, our search engine has many features that distinguish it from other engines and help the user by organizing the results in convenient ways: searching inside wikis, sorting, results grouping, a spell checker and highlighting.
Searching inside Wikis
Unlike many existing iGEM project search engines, our database includes the content of all the wiki pages. Including all the wikis in the database wasn't an easy task at all, but it ensures that all the information each team posted online is searchable. Thus, even details and terms that were not included in the abstracts, but were still important enough to appear in the wikis, are not omitted from the search.
Sorting and Ordering
After typing the query, the user can sort and order the results by name, year, region, prizes and awards, number of instances, etc. This feature makes it easy to find award-winning teams or teams from a specific year or region.
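In Solr, ordering is requested through the `sort` query parameter, written as a field name followed by `asc` or `desc`. The sketch below assembles such a request in Python. The field name `year`, the helper `build_query` and the use of the standard `/select` handler are assumptions for illustration; the sketch is not taken from the repository.

```python
from urllib.parse import urlencode, parse_qs


def build_query(terms, sort_field=None, descending=True):
    """Build the path and query string of a Solr search request."""
    params = {"q": terms, "wt": "json"}
    if sort_field:
        direction = "desc" if descending else "asc"
        params["sort"] = f"{sort_field} {direction}"
    return "/select?" + urlencode(params)


# Newest projects first for the query "biosensor".
url = build_query("biosensor", sort_field="year")
print(url)
```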
Results Grouping
Our search engine searches all the wikis as well as the abstracts. Therefore, a results grouping mechanism is needed to organize the results conveniently. This is an advantage of ours over the Google search engine. In Google, searching for an iGEM team or a term while limiting the search to the iGEM site returns many duplicate results from the same team wiki (e.g. 3 results from different pages of the same wiki that contain the searched term). In our search engine, the results grouping mechanism ensures that no duplicates are shown, presenting the results in a concise, convenient form.
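The idea behind grouping can be shown in a few lines: hits from different pages of the same wiki are collapsed into one entry per team. Solr offers this through its result grouping parameters (`group=true` together with `group.field`); the Python sketch below only imitates the effect on a list of hits, and the keys `team` and `url` as well as the sample data are hypothetical.

```python
def group_by_team(hits):
    """Collapse page-level hits into one entry per team.

    `hits` is assumed to be ordered by relevance, so each team appears
    at the position of its best-ranked page.
    """
    groups = {}
    for hit in hits:
        groups.setdefault(hit["team"], []).append(hit["url"])
    return groups


hits = [
    {"team": "Team A", "url": "/Team:A/Project"},
    {"team": "Team B", "url": "/Team:B/Results"},
    {"team": "Team A", "url": "/Team:A/Notebook"},
    {"team": "Team A", "url": "/Team:A/Parts"},
]
for team, pages in group_by_team(hits).items():
    print(team, len(pages))
```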
Spell Checker
We all make mistakes once in a while… The spell checker makes sure that even if the user doesn't know the spelling of a team name or a technical term, the search engine is fault-tolerant enough to produce appropriate results. It also suggests a few possible corrections, based on the database.
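A "did you mean" suggestion can be approximated by comparing the query with the words that actually occur in the database. The sketch below uses Python's standard `difflib` module as a stand-in for this; it is not Solr's spell-check component, and the vocabulary, the function name `did_you_mean` and the similarity cutoff of 0.7 are assumptions for illustration.

```python
import difflib


def did_you_mean(query, vocabulary, max_suggestions=3):
    """Suggest indexed words that closely resemble a possibly misspelled query."""
    by_lowercase = {word.lower(): word for word in vocabulary}
    if query.lower() in by_lowercase:
        return []  # the word is known, nothing to correct
    matches = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=max_suggestions, cutoff=0.7
    )
    return [by_lowercase[match] for match in matches]


# Hypothetical vocabulary taken from the index.
vocabulary = ["biosensor", "plasmid", "promoter"]
print(did_you_mean("biosensr", vocabulary))
```

Drawing suggestions from the indexed vocabulary rather than from a general dictionary means that team names and technical terms can be suggested as well.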
Highlighting
This feature highlights the query in each result in order to help the user find the searched term in the abstract and the wiki pages.
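Highlighting amounts to wrapping each occurrence of a query term in a marker element. The Python sketch below does this with a regular expression and wraps matches in `<em>` tags; the function name `highlight` and the sample text are assumptions for illustration, and the sketch is not Solr's own highlighter.

```python
import re


def highlight(text, query):
    """Wrap every case-insensitive occurrence of a query term in <em> tags."""
    terms = [re.escape(term) for term in query.split()]
    if not terms:
        return text
    pattern = re.compile("(" + "|".join(terms) + ")", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", text)


print(highlight("Arsenic biosensor in E. coli", "biosensor"))
```

The query terms are escaped with `re.escape` so that characters such as "." or "+" in a query are matched literally.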
Conclusions
This search engine is meant to help iGEM newcomers in the process of searching previous projects. The features that we added to the engine are drawn from our own experience, in order to make searching easier and more user-friendly. We included wiki pages in our database for a more detailed search, unlike a few search engines that were previously presented in iGEM (e.g. iGEM42 - Heidelberg 2013, Team Seeker - Aalto-Helsinki 2014).
Future Work
We want this search engine to serve the iGEM community for years and even generations to come. It will be a part of every iGEM team's journey throughout the competition, answering the team's needs for knowledge about the competition's history and previous projects.
In the future, we will extend this search engine by improving the user interface, making it Android/iOS compatible, and adding more features to the searcher, for the convenience of iGEM users.
Take me to the iGEM Explorer!
(If this link doesn't work, we are probably having problems with the server. Sorry about that.)
Screen Shots
Search results page, including result highlighting.
More Like This feature.
Another results page.
Advanced Search Menu.
Did You Mean feature.
Source code
Our source code is released under the MIT license. You can find it in our GitHub repository or here:
The iGEM Explorer's crawler source code (thoroughly documented in the code itself). Here is a UML diagram of the program.
The iGEM Explorer's Solr and Velocity config files
You can find usage instructions in the GitHub repository.