GePhEx external resources
In order to infer the phenotypic relationships, GePhEx leverages various resources.
The primary data source is the GWAS catalog, a repository of published genome-wide association studies, which is downloaded every six months and integrated into GePhEx to keep the genetic information up to date.
During input processing, the APIs of the external databases Reactome and ChEMBL are queried to enable the subsequent retrieval of genetic variants related to pathways and drugs, respectively.
GePhEx takes advantage of the hierarchical classification implemented in the Experimental Factor Ontology (EFO) to cluster the phenotypes in the relationship graph.
Various external resources are leveraged for the GePhEx output tables. The Variant Effect Predictor (VEP) is used to obtain variant annotation, while the EGA Beacon service (version 0.3) is used to check whether a given variant is present among the publicly available datasets stored at the EGA. Finally, a private EGA API is used to retrieve the links to the GWAS studies stored at the EGA.
GePhEx input processing
Regardless of the input type provided, GePhEx retrieves the genetic variants from the GWAS catalog that are related to the specific query. The procedures used for the different input types are explained below.
SNPs: for each provided SNP id (dbSNP id), GePhEx retrieves from the GWAS catalog all the entries whose column SNPS matches the queried SNP(s).
Genes: for each provided gene name, GePhEx parses the columns REPORTED GENE(S) and MAPPED_GENE in the GWAS catalog and retrieves all the entries in which the queried gene appears among the reported or mapped genes.
Regions: for each provided genomic region, in the format chr<number/string>:start-end, GePhEx retrieves all the entries from the GWAS catalog in which the column CHR_ID matches the given chromosome and the column CHR_POS has a value ≥ the given start position and ≤ the given end position. Please check that the GWAS catalog version used by GePhEx uses the same genome assembly coordinate system as your input.
Phenotypes: phenotypes can be queried by entering either free text or EFO terms. We strongly encourage you to use EFO terms in order to avoid the inclusion of false positive phenotypes in your results. If free text is used, GePhEx retrieves all the entries in which the column DISEASE/TRAIT of the GWAS catalog contains your text. For instance, if you provide the input Lung cancer, the results will also include the entries in which the column DISEASE/TRAIT is Non-small cell lung cancer. If an EFO term is used, the query can optionally retrieve, recursively, all the terms that fall under the term of interest in the EFO tree. In that case, GePhEx builds a list of EFO terms related to your input term and retrieves from the GWAS catalog all the entries in which the column MAPPED_TRAIT_URI contains at least one of the terms in the list.
Pathways: pathways can be queried by entering either free text or Reactome pathway id(s). As for phenotype queries, we strongly encourage you to use Reactome id(s) in order to avoid the inclusion of false positive pathways in your results. If free text is used, a Reactome API is used to infer the pathway id. For a given pathway id, a second Reactome API (/data/participants/{id}/referenceEntities) is then used to retrieve the list of genes participating in the pathway. This method retrieves the ReferenceEntities of all PhysicalEntities that take part in a given pathway. Because a pathway can contain smaller pathways (sub-pathways), the method recursively retrieves the ReferenceEntities for all PhysicalEntities in every contained pathway. Among all the possible contained objects, only those labeled as ReferenceGeneProduct or ReferenceIsoform are considered. The obtained list of unique genes is then used as explained previously in the genes query section.
Drugs: drugs can be queried by entering either free text or ChEMBL drug id(s). As for phenotype and pathway queries, we strongly encourage you to use ChEMBL id(s) in order to avoid the inclusion of false positive drugs in your results. If free text is used, the ChEMBL drug id is obtained through the ChEMBL client library. Several filters are applied in order to retrieve only high-confidence target genes for each drug. For a given drug id, only the drug activities with a standard_type equal to IC50, EC50, XC50, AC50, Ki, Kd or Potency are retrieved. Among the obtained activities, only those having the maximum confidence score (confidence_score=9), a standard_value ≤ 1000 and standard_units equal to nM are considered. The target genes of the valid activities are then collected into a list of unique genes, which is used as explained previously in the genes query section.
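The matching and filtering rules listed above can be sketched as small Python predicates. This is only an illustration, not the GePhEx implementation: a catalog entry is assumed to be a dictionary keyed by GWAS catalog column name, the separators used inside multi-valued cells are an assumption, the EFO hierarchy is assumed to be available as a term-to-children mapping with ids in the form found at the end of the trait URI, and a drug activity is assumed to be a flat dictionary carrying the four fields named above. All function names are hypothetical.

```python
import re

# Assumption: multi-valued catalog cells use these separators.
_SEPARATORS = re.compile(r"\s*(?:,|;| x | - )\s*")
# Activity types accepted for drug queries, as listed in the text.
POTENCY_TYPES = {"IC50", "EC50", "XC50", "AC50", "Ki", "Kd", "Potency"}


def _tokens(cell):
    """Split a multi-valued cell into a set of non-empty tokens."""
    return {t for t in _SEPARATORS.split(cell or "") if t}


def matches_snp(row, rsids):
    """True if any queried dbSNP id appears in the SNPS column."""
    return bool(_tokens(row.get("SNPS")) & set(rsids))


def matches_gene(row, gene):
    """True if the gene is among the reported or mapped genes."""
    genes = _tokens(row.get("REPORTED GENE(S)")) | _tokens(row.get("MAPPED_GENE"))
    return gene in genes


def matches_region(row, region):
    """True if CHR_ID matches and start <= CHR_POS <= end."""
    m = re.fullmatch(r"chr(\w+):(\d+)-(\d+)", region)
    if m is None:
        raise ValueError(f"expected chr<name>:start-end, got {region!r}")
    chrom, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    pos = str(row.get("CHR_POS") or "")
    return row.get("CHR_ID") == chrom and pos.isdigit() and start <= int(pos) <= end


def matches_free_text(row, text):
    """Case-insensitive substring match on DISEASE/TRAIT."""
    return text.lower() in (row.get("DISEASE/TRAIT") or "").lower()


def expand_efo(term, children):
    """Return the term plus every descendant reachable through `children`."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def matches_efo(row, terms):
    """True if the last segment of any MAPPED_TRAIT_URI is in `terms`."""
    uris = _tokens(row.get("MAPPED_TRAIT_URI"))
    return any(uri.rsplit("/", 1)[-1] in terms for uri in uris)


def is_high_confidence_activity(activity):
    """Apply the four activity filters described for drug queries."""
    try:
        value = float(activity["standard_value"])
    except (KeyError, TypeError, ValueError):
        return False
    return (
        activity.get("standard_type") in POTENCY_TYPES
        and activity.get("confidence_score") == 9
        and activity.get("standard_units") == "nM"
        and value <= 1000
    )
```

With these predicates, the free-text query Lung cancer matches an entry whose DISEASE/TRAIT is Non-small cell lung cancer, which is the false-positive risk that EFO terms help to avoid.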
Linkage disequilibrium (LD) procedure
For each genetic variant reported in the GWAS catalog version integrated into GePhEx, variants in LD are computed separately for each available population, using the 1000 Genomes Project Phase 3 data.
For a given SNP, a region of 500 kb centered at the position of the variant of interest is downloaded, and the variants in LD are obtained with Plink (version 1.9). Only variants having an R² ≥ 0.5 are kept and loaded into an internal database to speed up the computation carried out by GePhEx.
Thanks to the implemented LD method, GePhEx can potentially also analyze genetic variants that are not reported in the GWAS catalog but are in LD with at least one variant reported in the GWAS catalog.
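The window and threshold described above can be illustrated with a short Python sketch. It assumes that "centered" means 250 kb on each side of the variant and that LD results are available as simple tuples; the function names and the record layout are hypothetical and do not reflect the Plink invocation or the internal database schema used by GePhEx.

```python
WINDOW_BP = 500_000  # total window width stated in the text
R2_MIN = 0.5         # minimum R² stated in the text


def ld_window(position, width=WINDOW_BP):
    """Return (start, end) of a window centered on `position`, clipped at 1."""
    half = width // 2
    return max(1, position - half), position + half


def keep_ld_pairs(records, r2_min=R2_MIN):
    """Keep records whose R² reaches the threshold.

    Each record is assumed to be (index_snp, other_snp, population, r2).
    """
    return [record for record in records if record[3] >= r2_min]


records = [
    ("rs1", "rs2", "EUR", 0.82),
    ("rs1", "rs3", "EUR", 0.31),
    ("rs1", "rs2", "AFR", 0.50),
]
kept = keep_ld_pairs(records)
```

In this example only the two rs1/rs2 records are kept, one per population, because LD is evaluated separately in each population.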
GePhEx output
When GePhEx is executed, several outputs are produced. All the result objects can be filtered on the fields of interest and downloaded. The output of a given query is also stored in our cache system for 24 hours, so the results of long executions can be retrieved quickly.
The interactive phenotypic relationship graph is the result object shown by default. The same information can be viewed in the form of a table. The table “Input SNPs” contains all the GWAS catalog entries related to the input SNPs. When LD is leveraged, the table “LD related SNPs” reports all the GWAS catalog entries for the genetic variants in LD with the input SNPs.
Details about each generated result object are provided below.
Phenotypic relationship graph and table
The phenotypic relationship graph and table report all the links existing between the traits obtained for a given input. The phenotypes shown in the graph are those associated with the genetic variants related to the input SNPs and, when LD is leveraged, those associated with the SNPs in LD with the input SNPs. Very long phenotype names are truncated with a trailing “...”; hovering over the truncated text shows the full phenotype name.
The strength of a given relationship is indicated by line thickness, which is proportional to the number of different SNP pairs supporting the relationship, without correcting for possible LD among the variants of the various SNP pairs. However, only the thickness of links obtained from the same execution can be compared. Thickness is scaled to the dimensions of the specific graph, so links involving the same phenotypes but generated by different executions should not be compared.
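A small Python sketch can show why thickness is not comparable across executions. The exact scaling used by GePhEx is not described here, so the sketch assumes a simple linear scaling relative to the best-supported link in the same graph; the function names and the max_width parameter are hypothetical.

```python
def support_counts(links):
    """Count the distinct unordered SNP pairs supporting each link.

    `links` maps a phenotype pair to a list of (snp_a, snp_b) tuples.
    """
    return {
        link: len({frozenset(pair) for pair in pairs})
        for link, pairs in links.items()
    }


def link_widths(counts, max_width=8.0):
    """Scale counts linearly so the best-supported link gets `max_width`."""
    if not counts:
        return {}
    top = max(counts.values())
    return {link: max_width * n / top for link, n in counts.items()}


# The same link, supported by the same two SNP pairs, in two executions.
run_a = support_counts({
    ("asthma", "eczema"): [("rs1", "rs2"), ("rs2", "rs1"), ("rs3", "rs4")],
    ("asthma", "height"): [("rs5", "rs6"), ("rs7", "rs8"), ("rs9", "rs10"), ("rs11", "rs12")],
})
run_b = support_counts({
    ("asthma", "eczema"): [("rs1", "rs2"), ("rs3", "rs4")],
})
widths_a = link_widths(run_a)
widths_b = link_widths(run_b)
```

Under this assumed scaling, the asthma/eczema link is drawn at half the maximum width in the first execution and at the full width in the second, although the supporting evidence is identical.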
Two types of phenotypic relationships can be shown in the graph: direct and indirect. Direct relationships are drawn as blue lines and link phenotypes that share at least one SNP reported in the GWAS catalog as associated with both phenotypes. Indirect relationships are drawn as green lines and link phenotypes for which, in every SNP pair supporting the link, one SNP of the pair is associated with one phenotype and the other SNP is associated with the other phenotype. Relationships supported by both direct and indirect links are shown only as direct, so indirect relationships are those that can be obtained only by leveraging LD.
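The distinction between the two link types can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the GePhEx implementation: associations are given as a mapping from SNP id to the set of associated phenotypes, LD is given as a list of SNP pairs, and the function name is hypothetical.

```python
from itertools import combinations, product


def classify_links(associations, ld_pairs):
    """Label each phenotype pair as 'direct' or 'indirect'.

    associations: dict mapping a SNP id to its set of phenotypes.
    ld_pairs: iterable of (snp_a, snp_b) tuples that are in LD.
    """
    links = {}
    # Indirect candidates: phenotypes connected through two SNPs in LD.
    for snp_a, snp_b in ld_pairs:
        phenos_a = associations.get(snp_a, ())
        phenos_b = associations.get(snp_b, ())
        for p, q in product(phenos_a, phenos_b):
            if p != q:
                links[frozenset((p, q))] = "indirect"
    # Direct links: two phenotypes sharing the same SNP. Written last so
    # that a link supported both ways is reported only as direct.
    for phenotypes in associations.values():
        for p, q in combinations(sorted(phenotypes), 2):
            links[frozenset((p, q))] = "direct"
    return links


associations = {
    "rs1": {"asthma", "eczema"},
    "rs2": {"height"},
    "rs3": {"asthma"},
}
links = classify_links(associations, [("rs1", "rs2"), ("rs3", "rs2")])
```

Here asthma and eczema share rs1 and are linked directly, while height is linked to asthma and eczema only through LD, so those two links are indirect.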
Within the graph, phenotypes are grouped into ad hoc, phenotypically relevant categories, which are represented as gray arcs. The category names are not shown in the graph, but hovering over an arc shows the corresponding group name. Clicking on a phenotype group selects or deselects the full set of phenotypic relationships involving all the phenotypes of the group. Phenotype groups were obtained as follows. The absolute paths of all the phenotypes in the GWAS catalog were first obtained from the EFO hierarchical tree. The groups shown in the graph were then defined through manual inspection of these absolute paths. These groups represent intermediate, biologically relevant nodes in the EFO tree that group together a substantial number of the terms available in the GWAS catalog. For phenotypes having more than one absolute path in the EFO tree, GePhEx assigns the category of the path with the minimum average tree distance to the other traits obtained in the same query.
The phenotypic relationship table shows the same information represented in the graph, but in a different format. For each linked pair of traits in the graph, the table reports the list of SNP pairs that support the relationship, together with additional LD information.
To keep only the entries of interest, a filter is automatically applied to the phenotypic relationship table and to the two tables described in the sections below while users explore and interact with the graph. Conversely, if rows are filtered in the tables through the filtering box, the graph is not updated. To summarize, the interaction between the graph and the tables is unidirectional: actions in the graph modify the content of the tables, while actions in a table have no effect on the graph or on the other tables.
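The path selection rule for phenotypes with several absolute paths can be sketched in Python. The sketch makes assumptions that are not stated above: each path is a tuple of terms from the root to the phenotype, every other trait in the query contributes a single path, and the tree distance between two paths is the number of edges separating their end nodes through the deepest shared ancestor. The function names and the example terms are hypothetical.

```python
def tree_distance(path_a, path_b):
    """Edges between the ends of two root-to-node paths in a tree."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return len(path_a) + len(path_b) - 2 * common


def pick_path(candidate_paths, other_paths):
    """Choose the candidate with the lowest average distance to the others."""
    if not other_paths:
        return candidate_paths[0]

    def average_distance(path):
        return sum(tree_distance(path, o) for o in other_paths) / len(other_paths)

    return min(candidate_paths, key=average_distance)


candidates = [
    ("root", "disease", "cancer", "lung cancer"),
    ("root", "respiratory system", "lung cancer"),
]
others = [("root", "disease", "cancer", "breast cancer")]
chosen = pick_path(candidates, others)
```

In this example the first path is chosen, because it lies two edges away from the other trait while the second path lies five edges away, so the phenotype would be grouped with the other cancers in this query.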
Input SNPs table
The table “Input SNPs” contains information related to the input SNPs. Regardless of the input type provided, GePhEx retrieves the genetic variants from the GWAS catalog that are related to the specific query. For these SNPs, the table “Input SNPs” reports information retrieved directly from the GWAS catalog, complemented with additional information. Variant annotation is obtained through VEP; if the variant is among those available in the EGA Beacon service, the EGA datasets and studies containing the variant are reported. Finally, when available at the EGA, the link to the original study that reported the association is provided.
LD related SNPs
When LD is leveraged, the table “LD related SNPs” contains information related only to the genetic variants in LD with the input SNPs. The table reports the same information already described for the “Input SNPs” table.