Cancers can vary greatly in their transcriptomes. In contrast to alterations in specific genes or pathways, the significance of differences in tumor cell total mRNA content is poorly understood. Studies using single-cell sequencing or model systems have suggested a role for total mRNA content in regulating cellular phenotypes. However, analytical challenges related to technical artifacts and cellular admixture have impeded examination of total mRNA expression at scale across cancers.
To address this, we evaluated total mRNA expression using single cell sequencing, and developed a computational method for quantifying tumor-specific total mRNA expression (TmS) from bulk sequencing data. We systematically estimated TmS in 5,181 patients across 15 cancer types and observed close correlations with clinicopathologic characteristics and molecular features, where high TmS generally accompanies high-risk disease. At a pan-cancer level, high TmS is associated with increased risk of disease progression and death. Moreover, TmS captures tumor type-specific effects of somatic mutations, chromosomal instability, and hypoxia, as well as aspects of intratumor heterogeneity.
Taken together, our results suggest that measuring total mRNA expression offers a broader perspective for tracking cancer transcriptomes, with important clinical and biological implications.
Genome sequencing has led to great advances in cancer research by characterising the somatic landscape of cancer genomes and identifying mechanisms of tumourigenesis. This work has led to the discovery of predictive and prognostic biomarkers, some of which may be targets for targeted therapies. This talk will provide an overview of some of the bioinformatic approaches to analyse cancer genome data with a focus on tools that can be applied to understand complex tumor-immune interactions. We will also share some of our recent work using long read whole genome sequencing of cancer genomes to characterize complex structural rearrangements.
Even though all cells of the human body essentially share the same genetic information, the cell types that form the organs and tissues all have distinct properties. At the molecular level, this cellular identity is reflected in the set of transcribed genes, the transcriptome. There are more than 40,000 annotated protein coding genes and long noncoding RNAs (lncRNAs) that can generate more than 250,000 different isoforms. Even these numbers only capture a part of the complexity of the human transcriptome: it is estimated that more than 60% of the genome is transcribed, the vast majority of which is un-annotated. A new generation of long read sequencing technology using Nanopores enables amplification-free sequencing of cDNA and native RNA, potentially overcoming the major limitations of current short read sequencing technology, and promising to provide a more detailed view of the transcriptome.
Here I will present a comprehensive benchmark of Nanopore RNA-Sequencing on 5 human cell lines. Each cell line was sequenced in multiple replicates with 4 different RNA-Seq protocols covering short read and long reads, direct RNA, cDNA and PCR amplified cDNA sequencing. A systematic evaluation of the different technologies shows notable differences in throughput, transcript coverage, and sequencing biases. We compare long read RNA-Seq data with short read data in their ability to detect novel genes, quantify gene expression, identify alternative isoforms and fusion genes. Finally I will highlight computational methods that enable the comprehensive analysis of alternative splicing events, promoters, novel genes, and RNA modifications from long read RNA-Seq data to provide a detailed view of the complexity in the human transcriptome.
Acute lymphoblastic leukaemia (ALL) is the most common form of cancer in children worldwide. Although combination chemotherapy is in general an effective treatment, resulting in an overall survival of >90%, subtypes of paediatric ALL affecting children in the first year of life or carrying rearrangements of the mixed lineage leukaemia (MLL) gene (MLL-r) still have a dismal prognosis. These poor outcomes highlight the unmet need for a better understanding of the molecular mechanisms of acute leukaemia. Genome sequencing studies of ALL patients have shown a very low frequency of somatic mutations, indicating that MLL-r may not require additional alterations to induce full transformation. However, the mechanisms by which gene fusions relate to disease transformation remain to be fully explained.
To uncover new molecular mechanisms linked to the observed poor outcome, we have performed an exhaustive multicohort analysis of gene fusions and RNA processing alterations in 428 B-ALL patients. We identified 84 fusions with significant allele frequency across patients, 6 of them novel and 19 known from other blood and solid tumours but not previously observed in ALL. We have analysed and uncovered the similarities in their potential functional impacts. Furthermore, using MLL-r and ETV6-r as proxies for high and low risk, respectively, we found an expression signature involving MYC target genes and regulators of RNA processing in association with MLL-r patients. Moreover, this signature is predictive of risk in an independent set of patients with other or no fusions. This signature includes the upregulation of the splicing factor SRRM1, which we show, through interaction with other splicing factors, potentially impacts a set of alternative splicing events associated with high risk.
Our findings provide evidence for a convergent mechanism of aberrant RNA processing that sustains a malignant phenotype in a subset of B-ALL cases. This convergent phenotype can complement diagnosis currently based on gene-fusion detection.
High throughput sequencing of cancer samples is an incredibly sensitive assay that can reveal deep information about the dynamics of a cancer. Studying how a cancer evolves typically requires looking at multiple samples from that individual; these may come from independent biopsies, from different metastatic sites, or from different points during treatment. By tracking variants across samples we can untangle the evolution of the cancer, and the mutational signatures of the variants in each clone provide information about the mutational process at different stages.
I will show how clonal tracking followed by mutational profiling can be used to analyse cancer sequencing data. We have analysed cases where somatic mutations in DNA repair genes or environmental exposures drastically alter the mutational pattern in subsequent clones. We see that changing mutational processes allows new driver genes to be mutated, which may contribute to changes in the behaviour of the disease.
Clonal tracking followed by mutational signature analysis is also a powerful quality control tool. Common technical issues, such as cancer infiltration in the matched normal or contamination from different individuals, can be identified and often mitigated through clonal tracking. If the data set has expected clonal structures, such as samples taken from different sites or timepoints, that information can be leveraged to further improve the accuracy of the variant calling.
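The tracking and quality-control ideas described above can be illustrated with a minimal sketch. It assumes variant allele frequencies (VAFs) are already available per sample, labels each variant by the tumour samples in which it is detected, and uses the mean VAF of shared variants in the matched normal as a crude indicator of cancer infiltration. The function names, the 0.05 detection threshold and the example values are all hypothetical, and the sketch does not perform clone inference or signature fitting.

```python
from statistics import mean

def classify_variants(vafs, tumour_samples, min_vaf=0.05):
    """Label each variant by the tumour samples in which it is detected.

    vafs: dict mapping variant id -> {sample name: VAF}.
    min_vaf is an assumed detection threshold, not a recommended value.
    """
    labels = {}
    for variant, by_sample in vafs.items():
        present = [s for s in tumour_samples if by_sample.get(s, 0.0) >= min_vaf]
        if len(present) == len(tumour_samples):
            labels[variant] = "shared"
        elif present:
            labels[variant] = "private:" + ",".join(present)
        else:
            labels[variant] = "absent"
    return labels

def tumour_in_normal(vafs, labels, normal_sample):
    """Mean VAF of shared variants in the matched normal (0.0 if none are shared)."""
    shared = [v for v, label in labels.items() if label == "shared"]
    if not shared:
        return 0.0
    return mean(vafs[v].get(normal_sample, 0.0) for v in shared)

# Invented example: two tumour samples and one matched normal.
example_vafs = {
    "var1": {"primary": 0.45, "met": 0.50, "normal": 0.04},
    "var2": {"primary": 0.30, "met": 0.00, "normal": 0.00},
    "var3": {"primary": 0.00, "met": 0.25, "normal": 0.00},
}
example_labels = classify_variants(example_vafs, ["primary", "met"])
example_tin = tumour_in_normal(example_vafs, example_labels, "normal")
```

In this toy input, a non-zero value of `example_tin` would suggest that tumour-derived reads are present in the normal sample, which is the kind of issue that could otherwise cause true somatic variants to be filtered out.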
Cancer immunotherapies have demonstrated remarkable efficacy for several cancer types, often leading to long-term patient survival. Unfortunately, the outcome of these treatments varies dramatically across tumour types and between patients, and this is in part due to the frequencies and characteristics of different tumour infiltrating immune cells. Due to persistent antigen exposure in the tumour microenvironment (TME), T cells often lose their functionality and express inhibitory receptors such as PD-1, a process termed ‘exhaustion’. Recent studies have shown that heterogeneity across this pool of exhausted T cells, specifically terminally differentiated vs self-renewing cells, impacts the success of immunotherapy and allows prediction of treatment outcome. Other studies have suggested that resident T cells play critical roles in anti-tumor immunity. Therefore, understanding heterogeneity and developmental trajectories of lymphocytes in the TME has crucial implications for cancer immunotherapy.
While the residency and exhaustion programs in CD8 T cells are relatively well-studied, in CD4 T cells these programs have only recently been appreciated, and in tumour resident natural killer cells they are largely unknown. Using publicly available single-cell RNA-seq data, we define molecular programs for residency and exhaustion in colorectal tumour infiltrating immune cells, including T cells and NK cells. We show that while some of these programs overlap across immune cells, there are cell-type-specific genes associated with exhaustion and residency. Finally, we show that combinations of these signatures are associated with distinct survival outcomes in colorectal cancer patients.
Prostate cancer has a predominance of large complex genomic rearrangements, known collectively as structural variations (SVs). Through deep whole-genome sequencing analysis (90X tumour/46X normal coverage) of 180 primary prostate cancer samples from African and European patients, including 138 identified as high-risk, we comprehensively studied somatic SVs and identified their signatures in SV types and genomic positions.
Using Manta and GRIDSS for high-confidence somatic SV calling, we found large variability in the number of SVs among samples (ranging from 0 to 754), including 6 hyper-duplicated and 6 hyper-deleted samples. Additionally, we identified loci most frequently targeted by SVs and further correlated the presence of SV hotspots with different SV types and ethnic groups. The TMPRSS2 and ERG gene regions were found to be SV hotspots, and their fusion has previously been identified as a common gene fusion. We identified 33 samples as TMPRSS2-ERG fusion positive, around 50% of which involved multiple SV events with different SV types. Adding findings from our previous study on TMPRSS2-ERG gene transcripts (Blackburn et al., 2019), we confirmed that a single genomic fusion event can result in multiple fusion transcript isoforms.
Copy number variation data were also used to validate duplications (DUPs) and to estimate the number of copies for DUPs in hyper-duplicated samples. Oxford Nanopore long-read sequencing data were used to validate the presence of SVs and the precision of SV breakpoints in one of the hyper-duplicated samples.
This study provides an invaluable resource for discovering SV signatures and insights into the different mechanisms underlying SV types in primary prostate cancer.
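One simple way to think about the SV hotspot analysis mentioned above is to bin breakpoints into fixed-size genomic windows and count how many distinct samples have a breakpoint in each window. The sketch below shows that idea only. The window size, the recurrence threshold, the chromosome labels and the positions are invented, and the study's actual hotspot definition is not given in the abstract.

```python
from collections import defaultdict

def sv_hotspots(breakpoints, window=1_000_000, min_samples=2):
    """Return windows hit by SV breakpoints in at least `min_samples` samples.

    breakpoints: iterable of (sample, chromosome, position) tuples.
    The result maps (chromosome, window index) -> number of distinct samples.
    """
    samples_per_window = defaultdict(set)
    for sample, chrom, pos in breakpoints:
        samples_per_window[(chrom, pos // window)].add(sample)
    return {
        key: len(samples)
        for key, samples in samples_per_window.items()
        if len(samples) >= min_samples
    }

# Invented breakpoints on placeholder chromosomes "chrA" and "chrB".
example_breakpoints = [
    ("s1", "chrA", 1_200_000),
    ("s2", "chrA", 1_900_000),
    ("s2", "chrA", 1_950_000),  # second hit from the same sample is not double counted
    ("s3", "chrB", 500),
]
example_hotspots = sv_hotspots(example_breakpoints)
```

Counting samples rather than events keeps a single heavily rearranged sample from creating a hotspot on its own, which matters when some samples are hyper-duplicated or hyper-deleted.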
Structural variations (SVs), including insertions, deletions, duplications, inversions and translocations, are common and major drivers of tumorigenesis, and can also influence phenotypic traits of cancer, including drug resistance. Despite their importance, identifying SVs in cancer genomes remains challenging. SVs range in size from a few tens of bases to megabases, or even entire chromosomes. Short reads cannot span many large SVs and instead need to be reassembled for SV detection, which is nontrivial for complex cancer genomes. Advances in long-read and linked-read technologies offer more efficient approaches to SV detection. The DNA molecules sequenced with these technologies can easily reach tens of kilobases in length, facilitating more accurate SV detection from raw reads, with improvements in both sensitivity and specificity. Here we demonstrate how Nanopore reads and 10X Genomics’ linked reads can be used to effectively detect complex SVs of different categories in cancer genomes. The results were compared to investigate the respective advantages and disadvantages of the two technologies in SV detection. We measured the differences between the SVs of primary and metastatic tumor samples to characterize the evolution of cancer genomes. Considering the potential of long reads for constructing high-quality genome assemblies, we also explored SV detection from reassembled contigs to seek improvements over detection directly from raw reads.
The phosphatidylinositol 3-kinase (PI3K)-AKT-mTOR signalling pathway is a master regulator of cell growth and its activation is frequently associated with cell transformation and cancer. This is particularly common in breast cancer, where alterations in members of this pathway occur in over 50% of patients, irrespective of tumour subtype. Over the last decade, targeted drugs directed at the PI3K pathway, particularly inhibitors directed at PI3K, have been under intense clinical development. However, the emergence of acquired and/or adaptive resistance to these agents, the latter involving dynamic rewiring of signalling networks and crosstalk, has presented major challenges for the delivery of impactful treatments. This highlights the critical need to identify the molecular mechanisms through which tumour cells rewire their signalling outputs and bypass the inhibitory effect of targeted therapies. A better understanding of these events will help overcome such resistance and develop more effective combination therapies.
To address these challenges, we constructed a multi-pathway mechanistic model based on differential equations that integrates the PI3K-AKT signalling axis with key cancer-relevant pathways, incorporating known feedback and cross-talk mechanisms. We calibrated this model using time-course kinetic data in response to inhibition of PI3K by a selective and clinically relevant inhibitor, BYL719 (BYL), obtained from the T47D breast cancer cell line. Integrative simulation and experimental analyses reveal an unexpected role for the cyclin-dependent kinase inhibitor p21, which, contrary to its known growth-inhibitory function, appears to promote resistance to PI3K inhibition. Consistent with this, model simulations further predict a dynamic and adaptive reactivation of p21 following acute BYL treatment, which we validated experimentally using immunoblotting and phosphoproteomic profiling in both parental T47D cells and cells that have become resistant to BYL. Next, following a similar approach we recently published, we simulated the effect of various potential drug combinations targeting pair-wise nodes within the PI3K integrative network to identify potential co-targets that can be effectively combined with PI3K inhibition for greater anti-tumour benefit. Among these, we predict that dual inhibition of PI3K and the kinase PDK1 displays the most potent synergistic effect in suppressing pro-growth signalling and cancer cell growth. Model predictions were subsequently validated using immunoblotting and cell viability assays. In addition, analysis of PIK3 and PDK1 alterations in breast cancer patients demonstrates that increased co-expression of the genes encoding these proteins is associated with worse patient survival, further supporting their validity as co-targets.
Collectively, our integrative predictive modelling and experimental analyses uncovered novel resistance mechanisms against PI3K inhibition, and identified effective combination therapeutic strategies that overcome such resistance, leading to better treatment for PI3K-driven breast cancer.
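To make the notion of adaptive reactivation under a feedback loop concrete, the following is a generic two-variable toy model, not the multi-pathway model described above. It assumes a downstream activity `A` that represses production of an upstream component `R`, so that inhibiting the `R`-to-`A` step relieves the repression and lets `A` partially rebound. All variable names, equations and parameter values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def simulate(inhibition, t_end=100.0, dt=0.01,
             k_r=1.0, K=0.5, d_r=0.1, k_a=1.0, d_a=1.0,
             R0=2.0, A0=2.0):
    """Forward-Euler integration of a toy negative-feedback loop.

    dR/dt = k_r / (1 + A / K) - d_r * R          (A represses production of R)
    dA/dt = k_a * (1 - inhibition) * R - d_a * A (drug blocks a fraction of the R -> A step)

    With the default parameters, (R0, A0) = (2, 2) is the drug-free steady state.
    Returns the trajectory of A, starting with its initial value.
    """
    R, A = R0, A0
    trajectory = [A]
    for _ in range(int(round(t_end / dt))):
        dR = k_r / (1.0 + A / K) - d_r * R
        dA = k_a * (1.0 - inhibition) * R - d_a * A
        R += dt * dR
        A += dt * dA
        trajectory.append(A)
    return trajectory

# 80% inhibition applied at t = 0 to a system sitting at its drug-free steady state.
activity = simulate(inhibition=0.8)
acute_minimum = min(activity)
adapted_level = activity[-1]
```

In this toy system the activity falls sharply after the inhibitor is applied and then settles at a higher level than its acute minimum while remaining below the untreated level. This is the qualitative pattern that the abstract calls dynamic and adaptive reactivation, although the real model involves many more species and calibrated parameters.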
Abundance of immune cells may be critical to immunotherapy sensitivity. In adult cancer, immune signatures such as the T cell-inflamed gene expression profile (GEP) have been developed and tested to predict a patient’s response to immunotherapy by understanding the tumour microenvironment (TME). However, the TME has not been explored in paediatrics and comprehensive analysis of the TME by RNA-sequencing (RNA-seq) can identify immune-inflamed patients who may benefit from immune checkpoint inhibitors (ICIs).
Through the ZERO childhood cancer precision medicine program we have access to 348 high-risk paediatric cancers that have undergone RNA-seq analysis. Deconvolution algorithms (CIBERSORTx, quanTIseq and MCP-counter) were used to extract the immune cell composition for every tumour, and the T cell-inflamed GEP was applied to the dataset. Combining deconvolution with the T cell-inflamed GEP identified that 36% of patients exhibit an immune-inflamed profile. We validated our findings in a cohort of 40 patients by performing IHC for CD45, CD8, CD4 and PD-L1, and observed correlation between PD-L1 mRNA and protein expression. Using this classification, we applied machine learning algorithms to identify novel immune signatures and biomarkers specific to paediatric cancers. We identified a novel 27-gene signature, including markers of CD4 and CD8 T-cells, markers of T-cell cytotoxicity, genes that promote T- and NK-cell recruitment and activation, expression of MHC Class II molecules, and immune checkpoints.
Here we will present our bioinformatic approaches to investigating the TME using RNA-seq data in high-risk paediatric cancers as an additional clinical benefit for precision medicine. We will present our novel findings, the results from an integrated RNA-seq with IHC approach and the impacts on patient management and response. This approach identifies patients that are immune inflamed and may potentially respond to immunotherapies such as ICIs. Conversely it also identifies immune cold patients that may require combination therapy and immunomodulators to maximise immune response.
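As a rough illustration of how an expression signature can be turned into a per-sample call, the sketch below averages per-gene z-scores over a gene set and applies a cut-off. This is a simplified stand-in: the published T cell-inflamed GEP uses its own gene list, weights and normalisation, none of which are reproduced here. The two-gene signature, the threshold and the expression values are invented for the example.

```python
from statistics import mean, pstdev

def signature_scores(expr, signature):
    """Mean z-score across signature genes, one score per sample.

    expr: dict mapping gene -> list of log-scale expression values (same sample order).
    Genes with zero variance across samples contribute a z-score of 0.
    """
    n_samples = len(expr[signature[0]])
    z = {}
    for gene in signature:
        values = expr[gene]
        mu, sd = mean(values), pstdev(values)
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return [mean(z[gene][i] for gene in signature) for i in range(n_samples)]

def call_inflamed(scores, threshold=0.5):
    """Flag samples whose score exceeds an assumed, illustrative threshold."""
    return [score > threshold for score in scores]

# Invented expression values for three samples and a hypothetical two-gene signature.
example_expr = {"CD8A": [1.0, 2.0, 3.0], "CD274": [2.0, 4.0, 6.0]}
example_scores = signature_scores(example_expr, ["CD8A", "CD274"])
example_calls = call_inflamed(example_scores)
```

Because the z-scores are computed within the cohort, the resulting calls are relative to the other samples analysed, which is one reason a cut-off chosen in adult cohorts may not transfer directly to paediatric data.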
We employed comparative scRNA-sequencing to extensively characterize the cellular landscape of the human liver, from development to disease. We analyzed ~212,000 cells representing human development, hepatocellular carcinoma (HCC), and mouse liver and revealed remarkable reprogramming of the tumor microenvironment. Specifically, the HCC ecosystem displayed features reminiscent of early development, including re-emergence of stromal and immune cells associated with early human development. In a cross-species comparative analysis, we discovered remarkable similarity between the gene regulatory networks of these cells. Spatial transcriptomics further revealed a shared ecosystem between development and tumor. Taken together, we report a shared immunosuppressive ecosystem during development and cancer. Our results unravel a previously unexplored reprogramming of the tumor ecosystem, provide a novel target for therapeutic interventions in HCC, and open up avenues for identifying similar paradigms in other cancers and diseases.
Spatial technologies that query the location of cells in tissues at single-cell resolution are gaining popularity and are likely to become commonplace. The resulting data includes the X, Y coordinates of millions of cells, cell phenotypes and marker or gene expression levels. However, the tools for the analysis of these data are largely underdeveloped, making us severely underpowered in our ability to extract quantifiable information. In cancer, the spatial location of lymphocytes has been linked to prognosis and response to immunotherapy. While these advances have been exciting for biomarker development, the methods currently being used are coarse and largely qualitative. Appropriate quantitative tools are desperately needed to refine and uncover novel biologically and clinically meaningful information from this rich source of data.
We have developed SPIAT (Spatial Image Analysis of Tissues), an R package with a suite of data processing, quality control, visualization, data handling and data analysis tools. SPIAT includes our novel algorithms for the identification of cell clusters, cell margins and cell gradients, the calculation of neighbourhood proportions and algorithms for the prediction of cell phenotypes. By interfacing with packages used in ecology, geographic data analysis and spatial statistics, we have begun to robustly address fundamental questions in the analysis of spatial data, such as metrics to measure mixing between cell types, the identification of tumour borders and statistical approaches to compare samples. To date, our results suggest an association with prognosis and treatment response.
SPIAT is compatible with multiplex immunohistochemistry, spatial transcriptomics and data generated from other spatial platforms, and continues to be actively developed. We expect SPIAT to become a user-friendly and speedy go-to package for the spatial analysis of cells in tissues, as well as promote the use of quantitative metrics in the spatial analysis of tumour tissues and the microenvironment.
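The neighbourhood proportion mentioned above can be sketched in a few lines: for every cell of a reference phenotype, look at all other cells within a given radius and ask what fraction of them carry a target phenotype. The code below is an illustration in Python of that general idea and is not SPIAT's R interface or its exact algorithm. The function name, the radius and the coordinates are assumptions.

```python
from math import dist

def neighbourhood_proportion(cells, reference, target, radius):
    """Fraction of neighbours of `reference` cells that have phenotype `target`.

    cells: list of (x, y, phenotype) tuples.
    A neighbour is any other cell within `radius` of a reference cell.
    Returns None when the reference cells have no neighbours at all.
    """
    hits = 0
    total = 0
    for i, (x, y, phenotype) in enumerate(cells):
        if phenotype != reference:
            continue
        for j, (x2, y2, phenotype2) in enumerate(cells):
            if i == j:
                continue
            if dist((x, y), (x2, y2)) <= radius:
                total += 1
                if phenotype2 == target:
                    hits += 1
    return hits / total if total else None

# Invented coordinates: two tumour cells, one nearby immune cell, one distant immune cell.
example_cells = [
    (0.0, 0.0, "tumour"),
    (1.0, 0.0, "immune"),
    (0.0, 1.0, "tumour"),
    (10.0, 10.0, "immune"),
]
example_proportion = neighbourhood_proportion(example_cells, "tumour", "immune", radius=1.5)
```

The all-pairs loop is quadratic in the number of cells, so for images with millions of cells a spatial index such as a k-d tree or grid would be needed in practice.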
B-Cell Acute Lymphoblastic Leukemia (B-ALL) is the most common childhood cancer. An interesting feature of the disease is that the outcome of patients is heavily related to the type of genetic variation within the malignant cells. Indeed, variants that are recurrent across patient cohorts have been identified, linked to prognosis, and used to guide treatment through risk-stratification and targeted therapies. The World Health Organisation (WHO) has segmented B-ALL into seven distinct classifications, or subtypes, as of 2016. However, a recent study from the St. Jude Children’s Hospital (Gu et al. 2019) has demonstrated that their aggregated cohort of nearly 2000 cases can be further segmented into 23 subtypes based on gene expression. A method that can identify a patient’s subtype would have clear clinical utility. Despite this, no publicly available classification methods exist for this purpose using RNA sequencing (RNA-Seq) data.
Here we present ALLSorts: a publicly available method that uses RNA-Seq to classify B-ALL samples to 18 known subtypes and 5 novel meta-subtypes. ALLSorts is the result of a hierarchical supervised machine learning algorithm applied to a training set of 1235 B-ALL samples aggregated from multiple cohorts. A validation effort revealed that ALLSorts is robust to batch effects and can accurately attribute samples to subtypes. Furthermore, when applied to our Royal Children’s Hospital (RCH) cohort, ALLSorts has been able to classify previously undefined samples into subtypes from the extended list. A further feature of ALLSorts is that it can attribute multiple subtypes to a sample. This is highlighted in a particular RCH sample that exhibits both Ph-like and ETV6-RUNX1-like signatures, where we later identified multiple genomic events supporting the subtypes.
ALLSorts is available for public use via a well documented GitHub repository (https://github.com/Oshlack/AllSorts/).
Variants that affect pre-mRNA splicing can have a substantial impact on the resulting protein. Besides variants that affect the canonical splice acceptor and splice donor regions, predicting the impact of a variant on splicing is challenging. Here we present Introme to address this challenge and apply it to a cohort of high-risk (expected survival <30%) paediatric cancers with whole genome sequencing (WGS) and matched RNA-Seq. Introme uses machine learning to integrate predictions from leading splice detection tools and novel functions based on splicing rules, evaluating the likelihood of each variant to alter splicing. We applied Introme to 252 paediatric cancer patients from the Zero Childhood Cancer program, analysing a subset of known cancer genes in both the germline and tumour WGS results. We have systematically reviewed the literature to identify 1050 splice-altering variants and 590 variants with no impact on splicing, all with functional support in the form of minigene, cDNA or RNA studies. A training set of 80% of the curated variants was used to optimise a C5.0 classifier. Based on the remaining 20% of variants, Introme achieved an AUC of 0.97; the best performing individual tools were SpliceAI (AUC: 0.95) and Spliceogen (AUC: 0.90). At a sensitivity of 0.90, Introme returns half as many false-positive predictions as SpliceAI. Introme was used to analyse 252 patients, identifying a total of 146 splice-altering variants in known paediatric cancer genes including PMS2, NF1 and ATM. All variants have been confirmed to affect splicing using matched patient RNA, and 28 of the variants have been classified as pathogenic or likely pathogenic. We have developed a program which improves our ability to detect splice-altering variants. The application of Introme to large cohorts will enable the identification of novel disease-causing variants, potentially resulting in new therapeutic targets.
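The two evaluation quantities reported above, the AUC and the number of false positives at a fixed sensitivity, can be computed from prediction scores and true labels as in the following sketch. It covers only the evaluation step, not Introme itself or the C5.0 classifier. The AUC is computed as the probability that a randomly chosen splice-altering variant scores higher than a randomly chosen neutral one, with ties counted as half. The function names and the example scores are invented.

```python
from math import ceil

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def false_positives_at_sensitivity(scores, labels, sensitivity):
    """Count negatives called positive at the highest threshold reaching `sensitivity`."""
    positives = sorted((s for s, label in zip(scores, labels) if label), reverse=True)
    needed = ceil(sensitivity * len(positives))  # true positives that must be recovered
    threshold = positives[needed - 1]
    return sum(1 for s, label in zip(scores, labels) if not label and s >= threshold)

# Invented held-out predictions: 1 = splice-altering, 0 = no impact on splicing.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
example_labels = [1, 1, 0, 1, 1, 0, 0, 0]
example_auc = auc(example_scores, example_labels)
example_fp = false_positives_at_sensitivity(example_scores, example_labels, 0.75)
```

Comparing tools at a matched sensitivity, as the abstract does, is often more informative for variant curation than AUC alone, because it translates directly into how many candidates need manual review.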