Difference between revisions of "Team:Evry/Software/Pipeline"

Revision as of 00:44, 21 November 2015

All the information presented on this page (quality-control, differential expression analysis, data visualisation, variant discovery) is also available as a PDF file.

Data processing and quality control

What we produced: FASTQ files (when not already available), FASTQC reports, BAM and SAM files.

Figure 1: schematic overview of the pipeline for RNA-seq data analysis.

Differential expression analysis

What we produced: a script for differential expression analysis and a table of read counts (tab-separated format, 7 columns, ENSG IDs).

RNA-seq data can be difficult to interpret, especially when quantifying differential expression. We therefore adopted a simple count-based method: for each gene and each sample, we count the available reads and then test for significant differences between two experimental conditions or groups.

We wrote an R script that automatically creates a PDF file (in the current directory) with all the figures necessary for visual inspection and interpretation of the results. The input is a tab-separated file of read counts.

ensembl_id	melanocyte_1	melanocyte_2	melanoma_1	melanoma_2
ENSG00000000003	1964	2409	2328	2451
ENSG00000000005	0	2	10	12
ENSG00000000419	15122	19592	38225	36654
ENSG00000000457	12129	14893	7483	7812
ENSG00000000460	21930	25575	13123	13840
ENSG00000000938	48	58	26	42
ENSG00000000971	125	229	124	236
ENSG00000001036	11611	14125	14067	13518
ENSG00000001084	11429	13795	3549	3279

Figure 2: Example input format for DE analysis.
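The team's script is written in R and is not reproduced on this page. As an illustration only, the input format shown in Figure 2 could be read as follows in Python; the function name `read_counts` and the in-memory example are assumptions made for this sketch.

```python
import csv
import io

# First rows of the example input shown in Figure 2 (tab-separated).
EXAMPLE = (
    "ensembl_id\tmelanocyte_1\tmelanocyte_2\tmelanoma_1\tmelanoma_2\n"
    "ENSG00000000003\t1964\t2409\t2328\t2451\n"
    "ENSG00000000005\t0\t2\t10\t12\n"
    "ENSG00000000419\t15122\t19592\t38225\t36654\n"
)


def read_counts(handle):
    """Return (sample_names, {gene_id: [counts in sample order]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the Ensembl gene id
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = [int(value) for value in row[1:]]
    return samples, counts


samples, counts = read_counts(io.StringIO(EXAMPLE))
```

For a file on disk, the same function can be given an open file handle instead of the `io.StringIO` wrapper.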

We tested two designs, as illustrated in the tables below: normal cells vs cancerous cells (4 samples), and cancerous cells vs drug-treated cancerous cells (4 samples).

Sample name	Condition
melanocyte_1	M
melanocyte_2	M
melanoma_1	C
melanoma_2	C

Sample name	Condition
melanoma_1	C
melanoma_2	C
melanoma_drug_1	D
melanoma_drug_2	D

Tables 1 and 2: tested designs.

Visual exploration of the samples

Prior to checking distances between our samples, we applied a regularized-logarithm transformation (rlog) to stabilise the variance across the mean. The effects of the transformation are shown in the figure below.

Figure 3: Effect of the regularized-logarithm transformation on 'melanocyte_1' and 'melanocyte_2' samples.

We noticed that this step was particularly important for genes with low read counts.
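To see why low read counts are the problem case, the sketch below compares two replicates on a plain log2(count + 1) scale. This is deliberately not the rlog transformation itself, only the naive transform whose low-count behaviour motivates variance stabilisation. The counts come from the example input; the pseudocount of 1 is an assumption.

```python
import math

# Replicate counts from the example input (melanocyte_1, melanocyte_2).
replicates = {
    "ENSG00000000005": (0, 2),        # low-count gene
    "ENSG00000000003": (1964, 2409),  # well-expressed gene
}


def log2_gap(a, b, pseudocount=1):
    """Absolute replicate difference on the log2(count + pseudocount) scale."""
    return abs(math.log2(a + pseudocount) - math.log2(b + pseudocount))


gaps = {gene: log2_gap(*pair) for gene, pair in replicates.items()}
for gene, gap in gaps.items():
    print(f"{gene}\t{gap:.2f}")
```

A difference of only two reads in the low-count gene produces a larger gap than a difference of several hundred reads in the well-expressed gene, so low-count genes would dominate sample distances without a variance-stabilising step.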

We then checked the distances between our samples by performing Principal Components Analysis of the count data.

Figure 4: Principal Components Analysis (PCA) plot, normal vs cancerous cells.

We observed that the differences between groups (normal vs cancerous cells, as shown in the PCA plot above) were greater than the differences within groups, which is expected for this kind of design. However, because the inter-group differences were so pronounced, we expected a large number of genes to appear as differentially expressed. We therefore applied very stringent detection thresholds:

- log2 fold change (logFC) > 5 for upregulated genes or log2 fold change (logFC) < -5 for downregulated genes.
- AND adjusted-p-value < 0.01
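The two criteria above can be expressed as a small filter. This is a minimal sketch: the function name `classify` is hypothetical, and treating a missing adjusted p-value as not significant is an assumption of this example.

```python
def classify(log2_fc, padj, lfc_cutoff=5.0, padj_cutoff=0.01):
    """Return 'up', 'down' or 'ns' (not significant) under both thresholds."""
    # Assumption: a missing adjusted p-value is treated as not significant.
    if padj is None or padj >= padj_cutoff:
        return "ns"
    if log2_fc > lfc_cutoff:
        return "up"
    if log2_fc < -lfc_cutoff:
        return "down"
    return "ns"


# Hypothetical (logFC, adjusted p-value) pairs, for illustration only.
examples = [(6.2, 0.001), (-7.0, 0.0001), (6.2, 0.2), (3.0, 0.001)]
labels = [classify(lfc, padj) for lfc, padj in examples]
print(labels)
```

Both conditions must hold at once: a large fold change with a weak adjusted p-value, or a strong adjusted p-value with a modest fold change, is not selected.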

Differential expression analysis

First, we looked at the raw data (prior to any kind of normalization). We calculated the mean count for each gene in each condition and then the log2 fold change.
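The per-condition means and the log2 fold change can be sketched as follows, using two rows of the example input and the normal vs cancerous design. Returning `None` when a condition mean is zero is an assumption of this sketch; the page does not say how such genes were handled.

```python
import math

samples = ["melanocyte_1", "melanocyte_2", "melanoma_1", "melanoma_2"]
condition = {
    "melanocyte_1": "M", "melanocyte_2": "M",
    "melanoma_1": "C", "melanoma_2": "C",
}
counts = {
    "ENSG00000000005": [0, 2, 10, 12],
    "ENSG00000001084": [11429, 13795, 3549, 3279],
}


def condition_means(row):
    """Mean count per condition for one gene."""
    groups = {}
    for sample, value in zip(samples, row):
        groups.setdefault(condition[sample], []).append(value)
    return {cond: sum(values) / len(values) for cond, values in groups.items()}


def log2_fold_change(row, numerator="C", denominator="M"):
    """log2(mean cancerous / mean normal); None when a mean is zero."""
    means = condition_means(row)
    if means[numerator] == 0 or means[denominator] == 0:
        return None  # assumption: the ratio is left undefined
    return math.log2(means[numerator] / means[denominator])


log_fc = {gene: log2_fold_change(row) for gene, row in counts.items()}
```

A positive value means higher expression in the cancerous samples, a negative value means higher expression in the normal samples.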

Prior to normalization, we filtered the data set to remove rows carrying little or no information (genes with no counts or with only a single count). This step alone eliminated 17,386 transcripts.
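A minimal sketch of such a filter is shown below. It assumes that "a single count" refers to the total across all samples; the two placeholder gene names are hypothetical rows added to show what gets removed.

```python
counts = {
    "ENSG00000000005": [0, 2, 10, 12],  # row from the example input
    "GENE_ALL_ZERO": [0, 0, 0, 0],      # hypothetical row, no counts
    "GENE_SINGLE_READ": [0, 1, 0, 0],   # hypothetical row, a single count
}


def filter_low_counts(table, min_total=2):
    """Keep genes whose total count across samples is at least min_total."""
    return {gene: row for gene, row in table.items() if sum(row) >= min_total}


kept = filter_low_counts(counts)
print(sorted(kept))
```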

Using the DESeq2 R package (from Bioconductor), we estimated size factors and normalized our data. We then recalculated the mean count for each gene in each condition and, finally, the logFC.
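Size factors of this kind are commonly computed with the median-of-ratios method. The sketch below illustrates that idea in plain Python on a few rows of the example input; it is not the package's own code, and the function name `size_factors` is chosen for this example.

```python
import math
import statistics

counts = {
    "ENSG00000000003": [1964, 2409, 2328, 2451],
    "ENSG00000000005": [0, 2, 10, 12],
    "ENSG00000000419": [15122, 19592, 38225, 36654],
    "ENSG00000000457": [12129, 14893, 7483, 7812],
}


def size_factors(table):
    """Median-of-ratios size factor for each sample (column)."""
    n_samples = len(next(iter(table.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in table.values():
        if min(row) == 0:
            continue  # geometric mean would be zero, so the gene is skipped
        log_geo_mean = sum(math.log(value) for value in row) / len(row)
        for j, value in enumerate(row):
            log_ratios[j].append(math.log(value) - log_geo_mean)
    # The median is taken on the log scale, then transformed back.
    return [math.exp(statistics.median(ratios)) for ratios in log_ratios]


factors = size_factors(counts)
normalized = {
    gene: [value / factor for value, factor in zip(row, factors)]
    for gene, row in counts.items()
}
```

Dividing each column by its size factor makes counts comparable between samples sequenced at different depths, which is why the logFC is recomputed after this step.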

Figure 5: Distribution of logFC(cancerous/normal) values - raw data.

Figure 6: Distribution of logFC(cancerous/normal) values - normalized data.

Finally, we applied the nbinomWaldTest() function from the DESeq2 package to test the significance of the coefficients of a negative binomial GLM, the model we used to assess differences in expression. As previously stated, significantly up- or downregulated genes were selected using two thresholds: logFC and adjusted p-value (Wald test, M vs C).
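The page does not state how the p-values were adjusted. Assuming the Benjamini-Hochberg procedure (the default adjustment in DESeq2 results), the computation can be sketched as follows; the input p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values, for illustration only.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value is then compared with the 0.01 threshold in place of the raw p-value.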

Figure 7: Differential expression as a function of mean expression. Left panel: threshold set at logFC > 2 or < -2. Right panel: threshold set at logFC > 5 or < -5. The red dots indicate genes for which the logFC was significantly higher than 5 or lower than -5. The circled point indicates the gene with the lowest adj-p-value.

We obtained a list of 1,649 differentially expressed genes: 931 upregulated genes and 718 downregulated genes.

Enrichment analysis

We retrieved the list of the 931 upregulated genes and the list of the 718 downregulated genes and looked for significantly enriched GO (Gene Ontology) terms in each list independently.
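The page does not name the tool or the statistical test used for the enrichment. One common choice is a one-sided hypergeometric test per GO term, sketched below under that assumption; all the numbers in the example call are hypothetical except the size of the upregulated list.

```python
from math import comb


def enrichment_pvalue(population, annotated, selected, overlap):
    """P(X >= overlap) for a hypergeometric X.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    selected:   size of the gene list being tested
    overlap:    genes in the list that carry the GO term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical GO term: 200 annotated genes among 20,000 background genes,
# 30 of which appear in the list of 931 upregulated genes.
p_value = enrichment_pvalue(20000, 200, 931, 30)
```

When many GO terms are tested, the resulting p-values would normally be corrected for multiple testing before any term is called enriched.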

After identification of genes that are both overexpressed and mutated in tumor samples, we want to know whether good candidate antigens can be predicted. Read more about the prediction step.


@@ Line 128: / Line 128: @@
 transformation (rlog) to stabilise the variance across the mean. The effects of the transformation are shown in the figure below.</p>
-<img src="https://static.igem.org/mediawiki/2015/0/02/Rlog_transformation_plot.png" class="img-responsive" style="margin: 0 auto; "/>
+<img src="https://static.igem.org/mediawiki/2015/0/02/Rlog_transformation_plot.png" class="img-responsive" style="margin: 0 auto; max-width: 800px; height: auto; "/>
 <p class="text-center"><strong>Figure 3:</strong> Effect of the regularized-logarithm
 transformation on 'melanocyte_1' and 'melanocyte_2' samples.</p>
 <p class="text-justify">We noticed that this step was particularly important for genes with low read counts.</p>
 <p class="text-justify">We  then  checked  the  distances  between  our  samples  by  performing  Principal  Components Analysis of the count data. </p>
-<img src="https://static.igem.org/mediawiki/2015/6/6b/Pca_samples.png" class="img-responsive" style="margin: 0 auto; "/>
+<img src="https://static.igem.org/mediawiki/2015/6/6b/Pca_samples.png" class="img-responsive" style="margin: 0 auto; max-width: 800px; height: auto;"/>
 <p class="text-center"><strong>Figure 4:</strong> Principal Components Analysis (PCA) plot, normal vs cancerous cells.</p>
@@ Line 143: / Line 143: @@
 <li>- AND adjusted-p-value &#60; 0.01</li>
 </ul>
-</p>
+</p><br>
 <h3>Differential expression analysis</h3>
@@ Line 160: / Line 160: @@
 </div>
 </div>
+<br>
 <p class="text-justify">Finally, we applied the nbinomWaldTest() function from the DESeq package to test for significance of coefficients in a negative binomial GLM, the model we used to assess differences in expression. As  previously  stated,  selection  of  significantly  up-  or  downregulated  genes  was  based  on  the
 establishment of two selection thresholds: logFC and adjusted p-value (Wald test M vs C).</p>
 <img src="https://static.igem.org/mediawiki/2015/4/47/Differential_expression_plots.png" class="img-responsive" style="margin: 0 auto; "/>
 <p class="text-center"><strong>Figure 7:</strong> Differential expression as a function of mean expression. Left panel: threshold set at logFC > 2 or < -2. Right panel: threshold set at logFC > 5 or < -5. <em>The red dots indicate genes for which the logFC was significantly higher than 5 or lower than -5. The circled point indicates the gene with the lowest adj-p-value.</em></p>
-<p class="text-justify">We  obtained  a  list  of  1,649  differentially  expressed  genes:  931  upregulated  genes  and  718 downregulated genes. </p>
+<p class="text-justify">We  obtained  a  list  of  1,649  differentially  expressed  genes:  931  upregulated  genes  and  718 downregulated genes. </p><br>
 <h3>Enrichment analysis</h3>