Team:Evry/Software/Predictions

Protein sequences retrieval

What we produced: a shell script that retrieves all protein sequences associated with a list of gene identifiers, and a FASTA file containing all the proteins encoded by a list of genes (i.e. genes that are both overexpressed and mutated in tumor samples).

So far, our pipeline has allowed us to select genes that are both overexpressed (differential expression analysis) and mutated (variant discovery) in tumor samples. We retrieved the proteins encoded by these genes with a shell script that queries the Ensembl REST API and produces a FASTA file containing all the sequences.
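The retrieval step can be illustrated with a short Python sketch. This is not our shell script, only an illustration of the same idea: it assumes the genes are given as Ensembl gene identifiers and uses the `sequence/id` endpoint of the Ensembl REST API with `type=protein` and `multiple_sequences=1`, so that every protein of a gene is returned. The function names and the `opener` parameter (which allows the network call to be replaced, for example in tests) are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def protein_fasta_url(gene_id):
    """Build the sequence/id URL that asks for every protein of one gene."""
    return (f"{ENSEMBL_REST}/sequence/id/{quote(gene_id)}"
            "?type=protein;multiple_sequences=1")


def fetch_protein_fasta(gene_id, opener=urlopen):
    """Return the FASTA text for all proteins encoded by gene_id."""
    request = Request(protein_fasta_url(gene_id),
                      headers={"Content-Type": "text/x-fasta"})
    with opener(request) as response:
        return response.read().decode("utf-8")


def write_fasta(gene_ids, path, opener=urlopen):
    """Concatenate the FASTA records of all genes into a single file."""
    with open(path, "w") as handle:
        for gene_id in gene_ids:
            text = fetch_protein_fasta(gene_id, opener)
            # Make sure consecutive records are separated by a newline.
            handle.write(text if text.endswith("\n") else text + "\n")
```

Calling `write_fasta(["ENSG00000157764"], "proteins.fasta")` would write the protein sequences of that gene to `proteins.fasta`; the gene identifier and file name here are only examples.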

Two machine learning tools were then run on this dataset to predict proteasome cleavage of the proteins and MHC-I binding affinity of the resulting peptides.

Immune system processing

What we produced: scripts that launch NetChop and NetMHCpan and parse their results, and a final result table containing the candidate antigens and their final scores.

A good candidate antigen must be tumor-specific and sufficiently expressed in tumor cells, and it must also be processed efficiently by the immune system.

Proteasome cleavage prediction

The major histocompatibility complex class I (MHC-I) binds short peptides (8 to 10 amino acids). These peptides are products of proteasomal degradation, a process that does not cut proteins at random positions. A candidate antigen must therefore not contain an internal proteasome cleavage site: it would be cut into fragments and could not be presented intact to the immune system.

We predicted proteasome cleavage sites using NetChop, a machine learning tool that is freely available for academic use. This yields a list of short peptides (each with its associated NetChop score) that can be presented to the immune system. The list is then filtered by predicting whether these peptides can bind MHC-I.
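The peptide selection can be sketched as follows. This is a minimal illustration, not our parsing script: it assumes one cleavage score per residue (the predicted probability of a cut right after that residue), a cut-off of 0.5, and that a peptide is kept only if no internal position reaches the cut-off while its C-terminal position does. Using the C-terminal score as the peptide's score is also an assumption, and the function name is hypothetical.

```python
def candidate_peptides(sequence, cleavage_scores,
                       lengths=(8, 9, 10), threshold=0.5):
    """Return (peptide, c_terminal_score) pairs without internal cleavage sites.

    cleavage_scores[i] is the predicted probability of a cut right after
    residue i of the sequence.
    """
    if len(sequence) != len(cleavage_scores):
        raise ValueError("one cleavage score per residue is required")
    results = []
    for length in lengths:
        for start in range(len(sequence) - length + 1):
            end = start + length                      # exclusive end index
            internal = cleavage_scores[start:end - 1]  # all but the last residue
            c_term = cleavage_scores[end - 1]          # cut that releases the peptide
            if all(s < threshold for s in internal) and c_term >= threshold:
                results.append((sequence[start:end], c_term))
    return results
```

For a 10-residue sequence with predicted cuts after positions 8 and 10 only, the sole peptide kept is the 8-mer ending at the first cut, since every longer window would span that cut.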

MHC-I affinity prediction

Not all antigens bind MHC-I efficiently. We used NetMHCpan, a machine learning tool that is freely available for academic use, to predict the binding affinity of every peptide in our list. The predicted affinity is reported as an IC50 value in nM, so a lower number indicates higher affinity. Peptides with IC50 < 50 nM are generally considered high-affinity binders, and peptides with IC50 < 500 nM intermediate-affinity binders. However, binding affinity alone does not predict the strength of the immune response: some antigens bind very effectively without triggering an intense immune response.
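The thresholds above translate directly into a small classification helper. This is an illustrative sketch: the function names and the label "low" for peptides at or above 500 nM are our own choices for the example, not NetMHCpan output.

```python
def binding_class(ic50_nm):
    """Classify a predicted IC50 (in nM) with the 50 nM / 500 nM thresholds."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration in nM")
    if ic50_nm < 50:
        return "high"
    if ic50_nm < 500:
        return "intermediate"
    return "low"


def keep_binders(predictions, max_ic50_nm=500):
    """Keep (peptide, ic50_nm) pairs that bind with IC50 below max_ic50_nm."""
    return [(peptide, ic50) for peptide, ic50 in predictions
            if ic50 < max_ic50_nm]
```

For example, `binding_class(12.5)` gives `"high"`, while a peptide predicted at exactly 50 nM falls in the intermediate class because the thresholds are strict.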

Finally, we ranked the candidate antigens with a scoring function that combines the two types of predictions as a linear combination of normalized scores.
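One possible form of such a scoring function is sketched below. The details are assumptions made for illustration only: min-max normalization, a log10 transform of the IC50 (negated so that stronger binders score higher), and equal weights of 0.5. The function names are hypothetical.

```python
import math


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_candidates(candidates, w_cleavage=0.5, w_affinity=0.5):
    """Rank (peptide, cleavage_score, ic50_nm) tuples by a combined score.

    Returns (peptide, score) pairs sorted from best to worst.
    """
    if not candidates:
        return []
    cleavage = min_max([c[1] for c in candidates])
    # A lower IC50 means stronger binding, so the log is negated
    # before normalizing: the best binder gets 1, the worst gets 0.
    affinity = min_max([-math.log10(c[2]) for c in candidates])
    scored = [(c[0], w_cleavage * cl + w_affinity * af)
              for c, cl, af in zip(candidates, cleavage, affinity)]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this sketch, a peptide that has both the highest cleavage score and the lowest IC50 in the list receives the maximum combined score of 1. Changing the weights shifts the ranking toward one prediction or the other.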

The YETIpredict web application

What we produced: a web interface to explore the results of the differential expression analysis, the variant analysis and the immune predictions.

To make our pipeline as easy to use as possible, we created a web application that lets users run every step of the pipeline that does not require a lot of computing resources. You can reach it here (beta version).

Visualisations of the differential expression and variant analyses are available, as well as interactive tables that let users easily browse, sort and search the list of differentially expressed genes or the list of candidate antigens.

Figure 1: Input page of YETIpredict
