Proteogenomics combines genomics and transcriptomics with proteomic data to discover novel biomarkers, demonstrate the causal effects of variants, and identify new drug targets. Nucleic acid sequencing data alone has already enabled remarkable discoveries: scientists have advanced our understanding of the relationship between genes and disease, predicted the heritability of complex traits, reconstructed population histories, and performed DNA fingerprinting of embryos.
Sequencing alone, however, is limited in what it can reveal about complex traits, the effects (if any) of unknown variants, and causation. With the recent development of affordable high-plex proteomics technologies such as mass spectrometry, Olink PEA immunoassays, and SomaScan, it’s now possible to combine sequencing data with large-scale proteomics data sets and fill in the gaps left by sequencing alone.
Neuroscience has seen some phenomenal developments using just the Olink data from a relatively small UK Biobank sample set of ~50,000 subjects. For example, in a study that followed more than 50,000 adults over a 14-year period, four key circulating proteins were identified that are predictive of dementia onset within ten years. Findings that identify targets early in the disease process can enable researchers to develop novel treatments designed to prevent or delay disease onset.
Until recently, most research on neurodegenerative diseases was conducted on symptomatic subjects or post-mortem tissue samples. UK Biobank clinical data, which follows subjects for decades, combined with Olink high-plex proteomics has given researchers the ability to identify changes well before the onset of neurological symptoms. The initial data have been so compelling that UK Biobank has just announced the world’s largest proteomics study, to be done using Olink HT on 600,000 subjects.
In this neuroscience-focused article of our Proteogenomics Blog Series, we will explore cases in which proteomics was used to validate targets from existing genome-wide association studies (GWAS) and one unique case in which sequencing data was used to confirm the genes responsible for the altered proteomics results. For other articles in this series, please see “Introduction to Proteogenomics” and “Proteogenomics in Oncology.”
Alzheimer’s Disease Protein Associations
Scientists at the National Institute on Aging have also used UK Biobank data. Their goal, however, was to assess the risk of all-dementia, and Alzheimer’s disease (AD) in particular, in combination with existing GWAS data. External GWAS data was used to develop a standard polygenic risk score (PRS) specifically for AD. This score was split into three tiers based on the level of risk imparted (low, medium, high). The AD-PRS was applied to data from ~50,000 subjects in the UK Biobank, and its impact on survival was evaluated via Kaplan-Meier curves. AD-PRS showed a significant impact on all-dementia risk, with the strongest effect detected for Tier 3 (highest risk) in women.
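To make the scoring and survival steps concrete, here is a minimal sketch in Python. It assumes a PRS computed as a weighted sum of risk-allele dosages, a rank-based split into three equal-sized tiers, and a hand-rolled Kaplan-Meier estimator. The SNP identifiers, weights, and function names are hypothetical illustrations, not the study's actual pipeline or data.

```python
from collections import defaultdict


def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per SNP).

    weights maps a SNP id to its GWAS effect size; SNPs missing from
    dosages are treated as 0 copies (an assumption of this sketch).
    """
    return sum(weights[snp] * dosages.get(snp, 0) for snp in weights)


def assign_tiers(scores, n_tiers=3):
    """Rank-based split into equal-sized tiers; tier 1 is the lowest risk."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    tiers = [0] * len(scores)
    for rank, idx in enumerate(order):
        tiers[idx] = rank * n_tiers // len(scores) + 1
    return tiers


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time.

    events[i] is 1 if dementia onset was observed at times[i] and 0 if the
    subject was censored (left the study or reached the end of follow-up).
    """
    onsets = defaultdict(int)
    leaving = defaultdict(int)
    for t, e in zip(times, events):
        leaving[t] += 1
        if e:
            onsets[t] += 1
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = onsets.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical effect sizes for three made-up SNP ids.
weights = {"rs_a": 0.2, "rs_b": -0.1, "rs_c": 0.5}
score = polygenic_risk_score({"rs_a": 2, "rs_c": 1}, weights)
tiers = assign_tiers([0.1, 0.9, 0.5, 0.3, 0.7, 1.1])
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

In practice, one curve would be estimated per tier (and per sex) and the curves compared, for example with a log-rank test, which this sketch omits.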
Researchers next used mathematical modeling and statistical analyses to determine the strongest relationships between AD-PRS and the plasma proteome. As seen in the volcano plot in Figure 1, PLA2G7 had the strongest positive association and TREM2 had the largest inverse association with AD-PRS.
Figure 1. Volcano plot of plasma proteomic biomarkers in relation to AD PRS: UK Biobank 2006–2010. Abbreviations: See list of abbreviations for protein abbreviations and https://www.ncbi.nlm.nih.gov/gene/; AD = Alzheimer’s Disease; PRS = Polygenic Risk Score. Note: Based on a series of multiple linear regression models, with the main predictor being AD PRS and the outcome being each of 1,463 plasma proteomic biomarkers (Log2 transformed, z-scored). The y-axis is the predictor’s associated p-value on a -Log10 scale and the x-axis is the β coefficient (effect of AD PRS exposure on standardized z-scores of plasma proteomic markers) from the multiple linear regression models. Estimates with a Bonferroni-corrected p-value < 0.05 are marked by a different color, and the plasma proteomic marker abbreviation is added for relatively stronger effect sizes of > 0.050 in absolute value (See UKB showcase URL: https://biobank.ndph.ox.ac.uk/showcase/). Proteins selected for further mediation analysis (k = 86) have a corrected p-value < 0.05. Details are provided on GitHub: https://github.com/baydounm/UKB-paper12-supplementarydata
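The per-protein analysis behind a volcano plot like this one can be sketched roughly as follows. This is a simplified illustration, not the authors' code: it fits a single-predictor regression (the covariate adjustment of the published multiple regression models is omitted), uses a normal approximation for the p-value, and runs on made-up deterministic data. The names PROT_UP and PROT_FLAT are hypothetical.

```python
import math
import statistics


def zscore(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def regress(x, y):
    """Ordinary least squares slope of y on x with a two-sided p-value.

    The p-value uses a normal approximation to the t distribution, which
    is reasonable for large cohorts but is an assumption of this sketch.
    """
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = my - beta * mx
    rss = sum((yi - intercept - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    p = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, p


def volcano_table(prs, proteins, alpha=0.05):
    """Build one volcano-plot row per protein.

    proteins maps a protein name to raw abundance values, one per subject.
    Each protein is Log2 transformed and z-scored before regression, and
    significance uses a Bonferroni correction across all proteins tested.
    """
    n_tests = len(proteins)
    rows = []
    for name, levels in proteins.items():
        y = zscore([math.log2(v) for v in levels])
        beta, p = regress(prs, y)
        rows.append({
            "protein": name,
            "beta": beta,                                          # x-axis
            "neg_log10_p": -math.log10(p) if p > 0 else math.inf,  # y-axis
            "significant": min(1.0, p * n_tests) < alpha,
        })
    return rows


# Made-up data: 41 subjects with PRS values evenly spaced from -2 to 2.
prs = [(i - 20) / 10 for i in range(41)]
proteins = {
    # Rises with PRS, plus a deterministic wobble standing in for noise.
    "PROT_UP": [2 ** (0.5 * x + 0.3 * math.sin(7 * i)) for i, x in enumerate(prs)],
    # Symmetric in PRS, so it has no linear association with the score.
    "PROT_FLAT": [2 ** (x * x) for x in prs],
}
rows = volcano_table(prs, proteins)
```

Plotting `beta` against `neg_log10_p` for every row, and coloring the rows flagged `significant`, gives the general shape of the figure.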
In addition to the strongest associations, several other proteins showed significant AD-PRS associations, including BRK1, GFAP, and NFL. These discoveries can open the door to disease prevention in those most susceptible to dementia. In addition, knowing that altered levels of these plasma proteins are associated with dementia risk, researchers have a chance to study subjects before they become symptomatic and potentially uncover causative factors and novel drug targets.
Inflammatory Proteins and Their Role in Schizophrenia
In this next case study, a global team of scientists used Mendelian randomization analysis on publicly available GWAS and Olink proteomics data sets to identify nine inflammatory proteins that play a causal role in schizophrenia. Many factors contribute to the development of schizophrenia, including genetics, life stress, prenatal infections, and inflammatory cytokines. For example, research has demonstrated that IL6 and TNF-α, which can lead to neuroinflammation, are elevated in the circulation of subjects with schizophrenia.
The nine circulating proteins identified are TNFB (Tumor Necrosis Factor β), CCL23 (C-C Motif Chemokine Ligand 23), TNFSF14 (Tumor Necrosis Factor Superfamily Member 14), IL-1A (Interleukin 1 Alpha), TNFSF12 (TWEAK, Tumor Necrosis Factor-Related Weak Inducer of Apoptosis), 4EBP1 (EIF4EBP1, Eukaryotic Translation Initiation Factor 4E Binding Protein 1), CCL19 (C-C Motif Chemokine Ligand 19), CD40, and DNER (Delta/Notch-Like EGF Repeat-Containing Transmembrane Protein). The forest plot in Figure 2 shows the relationship of each with schizophrenia, whether neuroprotective (inverse) or neuroinflammatory (risk).
Figure 2. Forest plot for the causal effects of circulating inflammatory proteins on schizophrenia.
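For readers curious how a Mendelian randomization analysis yields the odds ratios shown in a forest plot, here is a minimal sketch of the inverse-variance weighted (IVW) estimator, one commonly used MR method. Whether the authors relied on IVW is an assumption on our part, and the instrument values below are invented for illustration.

```python
import math


def ivw_mr(instruments):
    """Fixed-effect inverse-variance weighted Mendelian randomization.

    instruments is a list of (beta_exposure, beta_outcome, se_outcome)
    tuples, one per genetic variant used as an instrument:
      beta_exposure: variant effect on the protein level (from a pQTL study)
      beta_outcome:  variant effect on disease, as a log odds ratio (GWAS)
      se_outcome:    standard error of beta_outcome
    """
    num = sum(bx * by / se ** 2 for bx, by, se in instruments)
    den = sum(bx ** 2 / se ** 2 for bx, by, se in instruments)
    beta = num / den           # causal effect estimate on the log-odds scale
    se = math.sqrt(1 / den)
    return {
        "odds_ratio": math.exp(beta),
        "ci_low": math.exp(beta - 1.96 * se),
        "ci_high": math.exp(beta + 1.96 * se),
        # Two-sided p-value from a normal approximation.
        "p_value": math.erfc(abs(beta / se) / math.sqrt(2)),
    }


# Invented instruments whose outcome effects are exactly 0.2 times their
# exposure effects, so the estimated odds ratio is exp(0.2), about 1.22.
instruments = [
    (0.10, 0.020, 0.010),
    (0.25, 0.050, 0.015),
    (0.40, 0.080, 0.020),
]
result = ivw_mr(instruments)
```

An odds ratio above 1 with a confidence interval that excludes 1 would appear as a risk effect in the forest plot, while one below 1 would appear as protective. Real analyses also add sensitivity checks for pleiotropy and heterogeneity, which this sketch leaves out.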
Several of these proteins are already targeted by FDA-approved drugs, although the majority of approvals are not for the treatment of mental illness. For example, CD40 is targeted by multiple drugs whose primary use is as antineoplastic agents (used in cancer treatment).
This study is significant in that it establishes a causal role for inflammatory cytokines in schizophrenia. Although the exact mechanism(s) by which they contribute to schizophrenia are not yet known, this study delivers convincing evidence that inflammation is a key factor in development and potentially in the management of schizophrenia. Several potential therapeutic targets were identified that could lead to the alleviation of symptoms and/or prevention of schizophrenia onset.
Novel Depression-Linked Genes
Our final case study takes a slightly different approach to proteogenomics. Researchers started out with proteomic data from the UK Biobank, then used whole-exome sequencing data to identify 22 genes associated with depression. Of the 22, six are newly identified (TRIM27, UBD, SVOP, ADGRB2, IRF2BPL, and ANKRD12). Interestingly, these six are all associated with immune response. IRF2BPL is expressed in the cerebral cortex, and ADGRB2 is expressed in the caudate and hippocampus. These brain regions are known to be associated with depression.
Using phenome-wide association analysis, the authors demonstrated that UBD and TRIM27 are associated with neuropsychiatric and cognitive inflammatory traits. This is quite fascinating, as Human Protein Atlas data indicate that UBD expression has not been detected in the brain. Of course, there could be many explanations for this apparent contradiction. It will be interesting to investigate the mechanisms by which the products of all six newly identified genes play a role in depression.
The above case studies showcase how widely proteogenomics is contributing to neuroscience, spanning both neurodegenerative diseases and mental illnesses. As more researchers include proteomic data in their genomics studies, the benefits to medicine and wellness will continue to grow. Please tune in for the next article in this series, where we will explore proteogenomics in cardiovascular research.