Difference between revisions of "Team:Austin UTexas/Failure Analysis"

Line 5: Line 5:
 
[[Image:2015_UT_AUSTIN_fri_failure_diagram.jpg|900px|thumb|center| Diagram of fri-failure-analysis work flow. Dashed boxes represent collections of files.]]
 
  
<b><i>fri-failure-analysis is a computational pipeline that identifies mutations that occur in bacterial plasmids. Each sample read, generated by Sanger sequencing, is aligned to its ancestor strain (template). By identifying homology between samples and mobile elements, the pipeline is able to correctly identify a host of mutations while minimizing false positives. The program streamlines the process of mutation analysis and offers large efficiency gains over human analysis.
+
<b><i>fri-failure-analysis</i></b> is a computational pipeline that identifies mutations that occur in bacterial plasmids. Each sample read, generated by Sanger sequencing, is aligned to its ancestor strain (template). By identifying homology between samples and mobile elements, the pipeline is able to correctly identify a host of mutations while minimizing false positives. The program streamlines the process of mutation analysis and offers large efficiency gains over human analysis.
  
 
The program is divided into phases. The initial phase creates FASTA template files that will be used to create alignments. These each contain two sequences, either a template and a sample, or a sample and a mobile element. The next phase employs the MAFFT program to perform fast pairwise alignments between each sequence. The final phase consists of two steps; first, determine if a sample exhibits homology to a mobile element; then, identify mutations in the sample using this information. These mutations are stored in the Genomediff filetype, which you can read about  [http://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/gd_format.html here.]
 
  
=== Gdparse ===
+
=== gdparse ===
  
Gdparse is a tool designed to process Genomediff files by counting up mutations and categorizing them. Since the terminal output of fri-failure-analysis is in this format, the program works well as a post-processing step in our "Breaking is Bad" experiment. The program currently outputs a CSV file that tabulates mutations based on where they occur in the reference sequence. Combined with FRI-Failure Analysis, this program offers further efficiency gains over hand counting of mutations.
+
gdparse is a tool designed to process Genomediff files by counting up mutations and categorizing them. Since the terminal output of fri-failure-analysis is in this format, the program works well as a post-processing step in our "Breaking is Bad" experiment. The program currently outputs a CSV file that tabulates mutations based on where they occur in the reference sequence. Combined with FRI-Failure Analysis, this program offers further efficiency gains over hand counting of mutations.
  
Gdparse works by creating dictionaries that map a category (in this case, a BioBrick name) onto other dictionaries that are organized by the type of mutation, the location of the mutation on the reference sequence, or both. Since reference sequences are stored along with each mutation in a Genomediff file, Gdparse simply finds the reference sequence in a separate folder and determines what annotation contains the given mutation.
+
<b><i>gdparse</i></b> works by creating dictionaries that map a category (in this case, a BioBrick name) onto other dictionaries that are organized by the type of mutation, the location of the mutation on the reference sequence, or both. Since reference sequences are stored along with each mutation in a Genomediff file, gdparse simply finds the reference sequence in a separate folder and determines which annotation contains the given mutation.
  
 
=== Results ===
 
  
[[Image:2015_UT_AUSTIN_gdparse_output.jpg|900px|thumb|center| Output of Gdparse program, using Genomediff files generated by fri-failure-analysis as input.]]
+
[[Image:2015_UT_AUSTIN_gdparse_output.jpg|900px|thumb|center| Output of gdparse program, using Genomediff files generated by fri-failure-analysis as input.]]
  
 
Combining the two programs, we identified a set of mutations similar to those obtained through hand analysis. Currently, we still check data by hand to remove some false positives, such as those gathered from poor alignments or low-quality samples. Even so, the programs have demonstrated vast efficiency improvements over manual analysis; additionally, the programs minimize human-introduced errors by enforcing stringent naming constraints on template annotations, sequence files, and part labels.
 
Line 23: Line 23:
 
=== References ===
 
  
All code maintained by Tyler Camp.
+
All code maintained by Tyler Camp. fri-failure-analysis and gdparse are released under the GNU General Public License and are free software. Links to the repositories are included below.
 +
 
 +
* [https://github.com/stationarysalesman/fri-failure-analysis fri-failure-analysis]
 +
* [https://github.com/stationarysalesman/gdparse gdparse]

Revision as of 22:55, 18 September 2015


A Computational Approach: fri-failure-analysis and gdparse

Diagram of the fri-failure-analysis workflow. Dashed boxes represent collections of files.

fri-failure-analysis is a computational pipeline that identifies mutations in bacterial plasmids. Each sample read, generated by Sanger sequencing, is aligned to the sequence of its ancestor strain (the template). By also identifying homology between samples and mobile elements, the pipeline can correctly identify a variety of mutations while minimizing false positives. The program streamlines mutation analysis and offers large efficiency gains over manual analysis.

The program is divided into three phases. The first phase creates the FASTA files that serve as input for the alignments. Each file contains two sequences: either a template and a sample, or a sample and a mobile element. The second phase uses the MAFFT program to perform a fast pairwise alignment of the two sequences in each file. The final phase consists of two steps: first, it determines whether a sample exhibits homology to a mobile element; then, it uses this information to identify mutations in the sample. These mutations are stored in the Genomediff file format, which is documented [http://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/gd_format.html here].
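The three phases can be illustrated with a minimal Python sketch. This is not the fri-failure-analysis source; the function names, the `--auto` choice for MAFFT, and the simplified mutation tuples are assumptions made for illustration. It writes a two-sequence FASTA file, runs MAFFT on it (MAFFT prints the alignment to standard output), and scans a gapped pairwise alignment for substitutions, insertions, and deletions. The mobile-element homology check, quality filtering, and the handling of unaligned read ends are omitted.

```python
import subprocess


def write_pair_fasta(path, first, second):
    """Phase 1 (sketch): write a FASTA file holding exactly two sequences.

    Each argument is a (name, sequence) tuple, e.g. a template and a sample.
    """
    with open(path, "w") as handle:
        for name, seq in (first, second):
            handle.write(f">{name}\n{seq}\n")


def run_mafft(fasta_path):
    """Phase 2 (sketch): align the two sequences and return aligned FASTA text.

    Assumes the `mafft` executable is installed and on PATH.
    """
    result = subprocess.run(
        ["mafft", "--auto", str(fasta_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_mutations(aligned_template, aligned_sample):
    """Phase 3 (sketch): list differences between two gapped sequences.

    Returns (type, template_position, detail) tuples with 1-based positions.
    Runs of gap columns are merged into a single INS or DEL event.
    """
    if len(aligned_template) != len(aligned_sample):
        raise ValueError("aligned sequences must have equal length")
    mutations = []
    pos = 0  # number of template bases consumed so far
    i, n = 0, len(aligned_template)
    while i < n:
        t, s = aligned_template[i], aligned_sample[i]
        if t == "-" and s == "-":
            i += 1  # column is a gap in both sequences: nothing to report
        elif t == "-":
            # Insertion: collect sample bases while the template shows a gap.
            inserted = []
            while i < n and aligned_template[i] == "-":
                if aligned_sample[i] != "-":
                    inserted.append(aligned_sample[i].upper())
                i += 1
            mutations.append(("INS", pos, "".join(inserted)))
        elif s == "-":
            # Deletion: count template bases missing from the sample.
            start = pos + 1
            length = 0
            while (i < n and aligned_sample[i] == "-"
                   and aligned_template[i] != "-"):
                length += 1
                i += 1
            pos += length
            mutations.append(("DEL", start, length))
        else:
            pos += 1
            if t.upper() != s.upper():
                mutations.append(("SNP", pos, s.upper()))
            i += 1
    return mutations
```

For example, `find_mutations("ACGT-ACGTA", "ACCTGAC--A")` reports one substitution at template position 3, a one-base insertion after position 4, and a two-base deletion starting at position 7.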

gdparse

gdparse is a tool that processes Genomediff files by counting the mutations they contain and categorizing them. Since the final output of fri-failure-analysis is in this format, gdparse works well as a post-processing step in our "Breaking is Bad" experiment. The program currently outputs a CSV file that tabulates mutations by where they occur in the reference sequence. Combined with fri-failure-analysis, it offers further efficiency gains over counting mutations by hand.
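The counting step might look like the following Python sketch. It is not the gdparse source: the function names and the CSV column layout are assumptions. It relies only on the documented Genomediff conventions that lines are tab-delimited, that lines starting with `#` are headers or comments, that mutation entries use three-letter type codes, and that the fourth field names the reference sequence.

```python
import csv
import io
from collections import Counter

# Three-letter mutation type codes defined by the Genomediff format.
MUTATION_TYPES = {"SNP", "SUB", "DEL", "INS", "MOB", "AMP", "CON", "INV"}


def count_mutations(gd_text):
    """Count mutation entries per (reference sequence, mutation type)."""
    counts = Counter()
    for line in gd_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, headers, and comments
        fields = line.split("\t")
        # Fields: type, id, parent ids, seq_id, position, ...
        if fields[0] in MUTATION_TYPES and len(fields) >= 5:
            counts[(fields[3], fields[0])] += 1
    return counts


def counts_to_csv(counts):
    """Render the counts as CSV text (the column names are hypothetical)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["reference", "mutation_type", "count"])
    for (seq_id, mutation_type), n in sorted(counts.items()):
        writer.writerow([seq_id, mutation_type, n])
    return buffer.getvalue()
```

Evidence entries, which use two-letter codes such as `RA`, are ignored, so only called mutations are counted.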

gdparse works by creating dictionaries that map a category (in this case, a BioBrick name) onto other dictionaries that are organized by the type of mutation, the location of the mutation on the reference sequence, or both. Since each mutation entry in a Genomediff file names its reference sequence, gdparse looks up that reference sequence in a separate folder and determines which annotation contains the mutation.
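A minimal Python sketch of this nested-dictionary layout follows. It is an illustration under stated assumptions rather than the gdparse implementation: annotations are represented here as hypothetical `(name, start, end)` tuples with 1-based inclusive coordinates, the reference sequence name stands in for the BioBrick name, and mutations falling outside every annotation are grouped under an assumed `"unannotated"` key.

```python
def find_annotation(annotations, position):
    """Return the name of the first annotation containing position.

    `annotations` is a list of (name, start, end) tuples with 1-based,
    inclusive coordinates (a hypothetical representation).
    """
    for name, start, end in annotations:
        if start <= position <= end:
            return name
    return "unannotated"


def tabulate(mutations, annotations_by_reference):
    """Build {reference: {annotation: {mutation_type: count}}}.

    `mutations` is an iterable of (reference, mutation_type, position).
    """
    table = {}
    for reference, mutation_type, position in mutations:
        annotations = annotations_by_reference.get(reference, [])
        region = find_annotation(annotations, position)
        by_region = table.setdefault(reference, {})
        by_type = by_region.setdefault(region, {})
        by_type[mutation_type] = by_type.get(mutation_type, 0) + 1
    return table
```

Nesting by annotation and then by mutation type corresponds to the "both" case described above; dropping either inner level gives a table organized by type only or by location only.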

Results

Output of gdparse program, using Genomediff files generated by fri-failure-analysis as input.

Combining the two programs, we identified a set of mutations similar to those obtained through hand analysis. Currently, we still check the data by hand to remove some false positives, such as those arising from poor alignments or low-quality samples. Even so, the programs have demonstrated vast efficiency improvements over manual analysis. They also minimize human-introduced errors by enforcing stringent naming constraints on template annotations, sequence files, and part labels.

References

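All code is maintained by Tyler Camp. fri-failure-analysis and gdparse are free software released under the GNU General Public License. Links to the repositories are included below.

* [https://github.com/stationarysalesman/fri-failure-analysis fri-failure-analysis]
* [https://github.com/stationarysalesman/gdparse gdparse]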