Cancer Bioinformatics Australia
June 25th, 2024


Diverse somatic mutation profiles drive common patterns of transcriptional rewiring related to loss of multicellularity in 10,000 cancer patients

Anna Trigos, Richard Pearson, Anthony Papenfuss and David Goode

Many hallmarks of cancer can be explained as a disruption of transcriptional networks shaped during the emergence of multicellularity, resulting in tumours relying on cellular processes that date back to unicellular ancestors (e.g., cell replication, glycolysis). However, how diverse profiles of genomic alterations converge to a similar phenotype of loss of multicellularity is not well understood.

We identified modules of genes with highly correlated expression in over 10,000 patients of 30 solid tumour types from The Cancer Genome Atlas. Modules were stratified by the predominant time of emergence in evolution of the genes in each module. We found modules dominated by either unicellular or multicellular genes were preserved between tumours and matched normal tissues, whereas modules with a mix of unicellular and multicellular genes were largely tumour-specific. Gene amplifications often led to the creation of new hubs around which tumour-specific modules formed, whereas deletions and point mutations tended to disrupt signalling between unicellular and multicellular genes, leading to module fragmentation. We applied this approach to detect potential novel gene drivers by quantifying the change of network localisation (peripheral vs. central) of mutated genes in tumour modules compared to their initial location in modules of normal tissues.
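
As a rough illustration of the network-localisation idea above, the sketch below compares how central a gene is in a normal-tissue module versus a tumour module. The abstract does not say which centrality measure was used; normalised degree is an assumption made here for simplicity, and the function names and toy edge lists are hypothetical.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]


def degree_centrality(edges: Iterable[Edge], gene: str) -> float:
    """Normalised degree of `gene` in an undirected co-expression module.

    Assumes `edges` lists each undirected gene pair once, with no self-loops.
    Returns 0.0 if the gene is absent from the module.
    """
    nodes = set()
    degree = 0
    for a, b in edges:
        nodes.update((a, b))
        if gene in (a, b):
            degree += 1
    if gene not in nodes or len(nodes) < 2:
        return 0.0
    return degree / (len(nodes) - 1)


def centrality_shift(normal_edges, tumour_edges, gene: str) -> float:
    """Positive values mean the gene is more central in the tumour module."""
    return degree_centrality(tumour_edges, gene) - degree_centrality(normal_edges, gene)


# Toy example: gene A is peripheral in normal tissue and a hub in the tumour.
normal = [("A", "B"), ("B", "C"), ("C", "D")]
tumour = [("A", "B"), ("A", "C"), ("A", "D")]
shift = centrality_shift(normal, tumour, "A")
```

Ranking mutated genes by such a shift is one simple way to flag candidates that move from the periphery to the centre of a module; other centrality measures could be substituted without changing the overall shape of the calculation.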

Our results reveal how integrating an evolutionary and network biology approach uncovers previously unappreciated associations between the mutational profiles and the widespread transcriptional rewiring found across tumour types, and provides a novel avenue to identify patient-specific gene drivers.

Molecular interactions as a framework to understand how the histone methyltransferase PRC2 is dysregulated by cancer-associated mutations

Emma Gail, Qi Zhang, Vitalina Levina, Nicholas McKenzie, Sarena Flanigan and Chen Davidovich

The polycomb repressive complex 2 (PRC2) is a histone methyltransferase complex that tri-methylates histone H3 at lysine 27 (H3K27me3). PRC2 represses oncogenes and tumor suppressor genes, and is dysregulated in most types of cancer. Accordingly, PRC2 is frequently mutated in cancer, with over 1000 cancer-associated and cancer-driving mutations identified in genes coding for PRC2 subunits, most notably EZH2. In addition to its core subunits, PRC2 interacts with various accessory subunits and RNA, providing it with extensive functional diversity. Determining which domains and surfaces within PRC2 are engaged in these molecular interactions would reveal how the function of PRC2 is regulated by these interactions and how they are dysregulated in disease. To this end, we developed workflows to automatically detect, filter and aggregate binding sites from two types of cross-linking mass spectrometry (XL-MS) methods to study protein–protein and protein–RNA interactions, and applied them to PRC2. The most common protein–protein interactions within subunits of the core PRC2 complex varied slightly between the presence and absence of various protein cofactors. Strikingly, similar domains within PRC2 core subunits serve as handles to bind different accessory subunits. We identified domains and surfaces in PRC2 that bind different proteins and RNA and provide a mechanistic framework to explain why certain factors are mutually exclusive for PRC2 binding while others can co-occupy the same complex. This workflow will allow determination of how cancer-associated and driver mutations in PRC2 subunits dysregulate its function and will thus open a path for the development of new therapeutic and diagnostic approaches.

Toblerone: Detecting deletions in cancer genes using RNA-seq

Andrew Lonsdale, Lauren Brown, Paul Ekert and Alicia Oshlack

B-cell precursor acute lymphoblastic leukemia (BCP-ALL, or B-ALL) is the most common childhood cancer. High-risk subtypes of B-ALL include Ph+ (presence of the BCR-ABL1 fusion) and Ph-like (similar expression profile to Ph+). These subtypes are often characterised by additional deletions in IKAROS family zinc finger 1 (IKZF1), including the deletion of exons 4 through 7, which results in the loss of zinc finger domains and a dominant negative isoform (IK6). Here we developed a method for using RNA-seq to find IKZF1 deletions. First, we generate a custom reference transcriptome that includes deletion isoforms. Next, we use fast pseudoalignment and transcript estimation to quantify deleted exon events. Using this approach we were able to detect the IK6 transcript in clinical samples with known deletions, and successfully made predictions of this deletion from RNA-seq which were subsequently validated by molecular tests on DNA. We then applied this method to publicly available annotated B-ALL data, resulting in a sensitivity of 84% for focal deletions in IKZF1 and insight into the previously unknown location of deletions. This method, Toblerone, is a targeted tool for detecting transcribed internal gene deletions from RNA-seq and can be extended to other genes of interest for research or diagnostic purposes. Exploratory work on EBF1, other members of the Ikaros gene family, and PAX5 is presented to reveal transcriptional evidence for specific deletions in public data.
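
The first step described above, building a reference that contains deletion isoforms, can be sketched as follows. This is a minimal illustration rather than Toblerone's actual implementation: the function names, the naming scheme for transcripts, and the toy exon sequences are all hypothetical, and it assumes only contiguous internal exon deletions are of interest.

```python
from typing import Dict, List


def deletion_isoforms(gene: str, exons: List[str]) -> Dict[str, str]:
    """Return the full-length transcript plus one transcript for every
    deletion of a contiguous run of internal exons.

    `exons` holds exon sequences in transcript order. The first and last
    exons are always kept, so only internal deletions are generated.
    """
    n = len(exons)
    transcripts = {f"{gene}_full": "".join(exons)}
    # `first` and `last` are the 1-based numbers of the first and last deleted exon.
    for first in range(2, n):
        for last in range(first, n):
            kept = exons[: first - 1] + exons[last:]
            transcripts[f"{gene}_del{first}-{last}"] = "".join(kept)
    return transcripts


def to_fasta(transcripts: Dict[str, str]) -> str:
    """Serialise the transcripts as FASTA text, one record per isoform."""
    return "".join(f">{name}\n{seq}\n" for name, seq in transcripts.items())


# Toy 8-exon gene with made-up 3-base exons (not real IKZF1 sequence).
toy_exons = ["AAA", "CCC", "GGG", "TTT", "ACG", "TGA", "CAT", "GGA"]
isoforms = deletion_isoforms("IKZF1", toy_exons)
fasta_text = to_fasta(isoforms)
```

The resulting FASTA could then be indexed by a pseudoaligner so that reads spanning a deletion junction are assigned to the matching deletion isoform, for example the entry named `IKZF1_del4-7` for a deletion of exons 4 through 7.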

Identification of RNA splicing therapeutic targets in high-risk paediatric acute leukaemia

Adria Closa, Marina Reixachs-Solé, Miriam Guillen-Navarro, Ronald W. Stam and Eduardo Eyras

Acute lymphoblastic leukaemia (ALL) is the most common form of cancer in children worldwide. Although combination chemotherapy is generally an effective treatment, resulting in an overall survival of >90%, subtypes of paediatric ALL affecting children in the first year of life or carrying rearrangements of the mixed lineage leukaemia (MLL) gene still have a dismal prognosis. These poor outcomes highlight the unmet need for a better understanding of the molecular mechanisms of acute leukaemia and motivate the search for new therapeutic strategies for high-risk paediatric acute leukaemia. Genome sequencing studies of ALL patients have shown a very low frequency of somatic mutations, indicating that MLL-r may not require additional alterations to induce full transformation. However, the mechanisms by which gene fusions relate to disease transformation remain to be fully explained. MLL fusions have the potential to impact the RNA processing of genes at genome scale through changes in transcriptional elongation, thereby providing a potential new layer of molecular variation that has remained undetected so far and which could lead to new prognostic markers and therapeutic strategies. We present here an exhaustive analysis of the RNA-processing alterations in infant ALL samples in relation to MLL-r. We have analysed RNA-sequencing data from a cohort of 319 paediatric ALL patients from 4 different projects and used in vitro experimental models with splicing-modulating compounds in MLL-r leukaemia.

We detected a significant overexpression of multiple splicing factors (SFs) that is conserved across MLL-r cases and distinct from other common rearrangements such as ETV6-RUNX and from cases without fusions. We identified more than 30 SFs that are differentially expressed between the different types of leukaemia analysed, such as MBNL1, which was previously found to be upregulated in infant MLL-r ALL and has been proposed as a potential therapeutic target associated with an MLL-AF9 background, although no mechanisms have been provided so far. These findings suggest that MLL-r paediatric leukaemia could be vulnerable to splicing modulation.

We initially tested this hypothesis by performing cytotoxicity assays with clinically relevant splicing-modulating compounds using in vitro models and patient-derived cells of MLL-r ALL. In particular, H3B-8800, a compound that is in Phase I clinical trials for adult leukaemia, showed a potent and homogeneous effect across all cell models tested. Moreover, H3B-8800 showed strong synergy with a standard chemotherapeutic agent for paediatric ALL treatment, suggesting that splicing modulation could increase the effectiveness of chemotherapy in high-risk patients.

In agreement with these findings, we found common and specific splicing events that are differentially spliced between MLL-r samples and controls. We further present an analysis of the potential functional impacts and the phenotypic convergence of these alterations across patients. These findings indicate that splicing is generally altered in childhood acute leukaemia and represents a novel therapeutic opportunity.

Supervised deconvolution of population heterogeneities in single cell RNA-seq via similarity-based embeddings

Soroor Hediyeh-Zadeh, Yi Xie and Melissa Davis

The emergence of large amounts of data from single-cell RNA-seq and the widespread availability of transcriptome readouts from various cells, tissues, and species have attracted interest in the integration and comparison of such data for reliable annotation of cell types in new datasets. High-throughput gene expression technologies have additionally facilitated curation of a repertoire of gene signatures and classifiers that delineate various kinds of cellular stimuli, cell types and cancer subtypes. The PAM50 signatures, for example, are frequently used to identify different breast cancer subtypes. Methods that are capable of applying the existing signatures to transcriptome profiles for mapping similar phenotypes or determining cell-type identities are, therefore, of interest.

In this work we propose a method to associate gene sets with groups of cells, hence decoding population heterogeneities in scRNA-seq profiles. To characterise different phenotypes in single-cell RNA-seq data based on pre-existing cell-type classifiers or gene expression signatures, the cells and gene sets are mapped to a lower-dimensional embedding in which the correlation between the expression profiles of the cells and the gene sets is maximised. The embedding is then used to infer biological associations between the cells and gene sets. Unlike existing methods, the proposed method does not require reference datasets and is robust to noise. We provide examples where the method successfully replicates cell type annotations in published datasets, predicts the annotation of unresolved cells, and is capable of capturing biological events such as immune infiltration and immune mimicry in tumour cells, a capability that is not reported by the existing methods for cell type annotation. The proposed method has various applications in cancer, including characterisation of cancerous and immune phenotypes in tumour scRNA-seq profiles.

Identification of epigenetic complexes driving haematopoiesis

Yih-Chih Chan, Enid Yi Ni Lam, Jessica Morison, Anthony Papenfuss and Mark Dawso

Haematopoiesis is a precise balance of self-renewal and differentiation of haematopoietic stem cells. Haematological malignancies often hijack and disrupt these essential processes. The molecular mechanisms, such as the epigenetic and transcriptional programs, that govern the critical decisions of self-renewal and differentiation are not well understood.

Most epigenetic proteins function in multi-member protein complexes, with different members having essential roles. Polycomb (PcG) and Trithorax (TrxG) complexes are evolutionarily conserved epigenetic modifiers that control transcriptional repression and activation, respectively, of key genes in cellular differentiation and development. PcG or TrxG protein complexes comprise core groups of proteins, some of which have 5-6 different members, as well as facultative proteins. It has been estimated that over 180 distinct PcG or TrxG complexes exist, each thought to have distinct functional roles.

To unravel this complexity, we used in silico methods to analyse RNA-Seq data of different normal haematopoietic cell types to reconstruct the major PcG or TrxG complexes that dominate each main stage of haematopoietic development. Using a combination of expression profiling, clustering and rank-based methods, we have identified the main PcG or TrxG complexes in haematopoietic stem cells, committed progenitors and terminally differentiated blood cells. The value of our analysis is highlighted by the fact that several of our in silico predicted complexes agree with published experimental data. For example, PcG complexes that contain CBX7 proteins are important in maintaining self-renewal in haematopoietic stem cells, whereas those that contain CBX2, CBX4 or CBX8 orchestrate differentiation in these cells.

Our data demonstrating the importance of key epigenetic complexes involved in self-renewal and lineage commitment provide a source of information to manipulate haematopoietic cells in vitro and potentially in vivo. By identifying proteins that overcome the block in differentiation, we hope to identify new candidates against which therapies can be developed to reinstate normal haematopoietic differentiation in haematological cancers.

Unsupervised Feature Extraction from Breast Cancer Multi-Omics Data with Deep Learning Techniques

Richard Lupat, Jason Li and Sherene Loi

Rapid advancement in genomic technologies has produced a vast amount of clinical genomic data across different levels of omics variables. Some of these data are accessible through public repositories such as The Cancer Genome Atlas (TCGA). However, the enormous volume of data requires the application of specialised techniques for data mining, integration and interpretation to provide valuable insights. Various machine learning algorithms, supervised and unsupervised, have been successfully applied to these data and have led to clinically relevant conclusions. However, these algorithms often rely on prior biological studies or are limited to a selected number of the most significant features in the data.

In this project, we applied an unsupervised deep learning method, the autoencoder, to extract complex patterns from breast cancer genomic data independent of prior known biology. We designed an autoencoder using features derived from gene expression and copy number data from the TCGA cohort. We used 746 samples to train and validate this model, which extracted 128 features from the combination of all input variables. To evaluate the performance of this method, these extracted features were used as the dimensionality reduction step for our downstream supervised classifiers of ER status and PAM50 intrinsic subtype. The classifiers were applied to the same training dataset and the accuracy of each was assessed on the validation set (70:30 dataset split). This combination of autoencoder and feed-forward neural network classifier distinguished ER status (92% accuracy) and Basal-like vs. non-Basal-like tumours (94% accuracy), and predicted the PAM50 intrinsic subtype of samples (88% accuracy). This initial result provides a good foundation for our further study in developing a deep-learning-based prognosis model.

The role of RNA sequencing in providing comprehensive molecular characterisation of patient’s cancer

Jacek Marzec

Introduction: Precision oncology is becoming a standard approach in cancer patient care, with molecular characterisation of cancers through genome sequencing being the major focus. In addition, there is growing evidence that patient transcriptome profiling can contribute to our knowledge of individual cancers by revealing additional layers of disease biology. In this work we developed a pipeline for using a cancer patient’s RNA sequencing (RNA-seq) data to complement genome-based findings and aid the prioritisation of therapeutic targets.

Method: We use the bcbio-nextgen RNA-seq pipeline to process the RNA-seq read data from the patient’s tumour, followed by gene fusion prioritisation, per-gene read count normalisation and transformation into standard scores to address the challenges of analysing data from a single subject. In addition, we build an internal reference cohort from a set of in-house high-quality tumour samples to ensure compatibility of input material and data processing. Finally, we integrate transcriptome data with genome-based findings from the patient’s whole-genome sequencing (WGS) data and annotate results using public knowledge bases to provide additional evidence for dysregulation of mutated genes, as well as genes located within detected structural variants or copy-number altered regions.
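
The transformation into standard scores against a reference cohort can be illustrated with a short sketch. This is not the pipeline's own code: the function name, the log2 transform with a pseudocount, and the toy counts are assumptions made for the example.

```python
import math
from statistics import mean, stdev
from typing import Dict, List


def standard_scores(
    patient: Dict[str, float],
    cohort: Dict[str, List[float]],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """Z-score each gene in a single patient against a reference cohort.

    `patient` maps gene -> normalised count for the patient's tumour.
    `cohort` maps gene -> normalised counts across the reference samples
    (at least two samples per gene). Values are log2-transformed first.
    Genes with no variation in the cohort are skipped.
    """
    scores = {}
    for gene, value in patient.items():
        reference = [math.log2(x + pseudocount) for x in cohort[gene]]
        spread = stdev(reference)
        if spread == 0:
            continue  # a z-score is undefined without variation in the cohort
        scores[gene] = (math.log2(value + pseudocount) - mean(reference)) / spread
    return scores


# Toy example with made-up counts for two hypothetical genes.
reference_cohort = {"GENE_A": [1.0, 3.0, 7.0], "GENE_B": [5.0, 5.0, 5.0]}
patient_counts = {"GENE_A": 15.0, "GENE_B": 9.0}
z = standard_scores(patient_counts, reference_cohort)
```

Here `GENE_A` sits two standard deviations above the cohort mean on the log2 scale, while `GENE_B` is dropped because the toy cohort shows no variation for it.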

Results: We developed a pipeline capable of processing and analysing RNA-seq data from an individual patient’s tumour. In addition, the inclusion of an internal reference set ensures compatibility of input material and data. The results are visualised in an approachable HTML-based interactive report with searchable tables and plots, providing variant curators with a tool to verify and prioritise genome-based findings.

Discussion: RNA-seq technology holds great promise for clinical application in molecular diagnostics. However, it is not straightforward to translate this technology into clinical practice, mainly due to its single-subject setting. We developed a pipeline for integrating information from both WGS and RNA sequencing approaches to provide additional clinically relevant information that can help prioritise variants for therapeutic intervention.

Post-Transcriptional Regulatory Networks within Breast Cancer Progression

Holly Whitfield, Joseph Cursons and Melissa Davis

Over the past decade there have been large advances in our ability to investigate 'non-coding' RNAs (ncRNAs). It has become increasingly apparent that ncRNAs play an important role in regulating cell behaviour, and accordingly, RNA is not just an intermediate product involved in the production of proteins from genes. A class of ncRNAs known as microRNAs (miRNAs) appears to be critical for the correct regulation of specific target genes.

The leading cause of death for breast cancer patients is metastasis, mediated in part by epithelial-to-mesenchymal transition (EMT), a regulatory program controlling cell phenotype. The miR-200 family plays a central role in EMT regulatory networks that underlie breast cancer progression. MicroRNAs can control cellular phenotypes through coordinated effects on multiple mRNAs, including additive effects of multiple miRNAs co-targeting individual or functionally related mRNAs, as well as individual miRNAs targeting multiple mRNAs. Furthermore, it has been proposed that transcripts with numerous binding sites for miRNAs can 'sponge up' or sequester the miRNA, decreasing its availability for regulating other transcripts; this is known as the competitive endogenous RNA (ceRNA) hypothesis. As a consequence, miRNAs operate within the context of a larger RNA regulatory network; however, experimental approaches often require the isolation of miRNAs and their targets, disregarding much of their broader biological context.

Here, I will discuss some of my work on ceRNAs and describe my approaches to address broader regulatory interactions which may help to reinforce, or lead to the dysregulation of, miRNA activity. To investigate these dynamic regulatory interactions during EMT, I have constructed a miRNA-mediated regulatory network which integrates computationally predicted interactions with experimental evidence. By using transcriptomic data from both an EMT cell line model and publicly available patient-derived tumour samples, I have prioritised candidate ceRNA relationships for experimental validation. With evidence of a regulatory effect, I propose a mechanism through which candidate ceRNAs may disrupt core regulatory motifs that control EMT and the metastatic cascade, thus contributing to our understanding of the molecular mechanisms that underpin breast cancer progression.

MINTIE: identifying cryptic variants in cancer transcriptomes using RNA-seq data

Marek Cmero, Breon Schmidt, Paul Ekert, Ian Majewski, Alicia Oshlack and Nadia Davidson

Gene fusions, tandem duplications and other transcriptomic structural variants can modify gene function and have important implications in cancer prognosis and treatment decisions. While calling fusions from RNA-seq data is well established, ‘cryptic’ variants such as fusions with non-gene sequence, tandem duplications, novel splice sites or other complex variants, are difficult to detect using existing approaches. While some of these events may be identified through DNA sequencing, others may be the result of post-transcriptional modification, and are thus only visible in the transcriptome.

To identify these variants in cancer transcriptomes, we developed MINTIE, an integrated pipeline for the detection of cryptic variants using RNA-seq data. The MINTIE pipeline first performs de novo assembly of transcripts. Next, novel transcripts are selected and all transcripts are then quantified using pseudo-alignment. Finally, differential expression versus controls is performed to identify over-expressed novel transcripts in each sample. MINTIE also performs comprehensive annotation and visualisation of candidate cryptic variants.

In order to demonstrate MINTIE, we ran the pipeline on a cohort of high-risk B-ALL patients, which included a subset of patients with poor outcome but no detected driver variant. In this subset, we identified several novel candidate driver cryptic variants. One such variant was a gene-disrupting cryptic fusion involving a truncation of the tumour suppressor gene RB1. We believe MINTIE will be able to identify new cancer driver mechanisms across a range of cancer types.

Predicting patient response to immunotherapy using Innate Immune Fitness profiles

Jared Mamrot, Nathan E. Hall and Robyn A. Lindley

Immunotherapy can result in complete cancer remission when all other treatments fail, yet caveats include variable patient response rates and a multitude of debilitating and life-threatening side-effects. Accurately predicting patient response to immunotherapy before treatment commences would significantly improve the clinical utility of these drugs.

To address this, we have developed a software platform to identify and profile deaminase-associated mutation metrics from genomic sequencing data. The presence of somatic mutations with deaminase-associated motifs is an indication of innate immune system dysfunction, and our aim is to correlate overall Innate Immune Fitness (IIF) metrics with a patient’s capacity to respond to immunotherapy.

IIF profiles have been generated for hundreds of patients from multiple immunotherapy trials. This data has been used to train machine learning models to predict patient response to immunotherapy based on their IIF profile. Application of trained models to validation datasets has shown improved predictive accuracy (>86%), compared to the current state-of-the-art.

Clinical implementation of the IIF profile will enable a more personalised approach to cancer treatment. Our findings have demonstrated successful patient outcome predictions with relatively high accuracy using our unique approach. We are currently working to refine our methods, improve predictive accuracy, and further evaluate our methods on larger validation datasets.

Whole Transcriptome Sequencing Improves Actionability in Children with High-Risk Cancer

Chelsea Mayoh, Marie Wong, Amit Kumar, Paulette Barahona, Alexandra Sherstyuk, Emily Mould, Patrick Strong, Dylan Grebert-Wade, Maely Gauthier, Noemi Fuentes-Bolanos, Sumanth Nagabushan, Dong Anh Khuong Quang, Loretta Lau, Michelle Haber, Vanessa Tyrrell, Paul Ekert and Mark Cowley

In the context of paediatric cancer precision medicine, several groups are utilising whole genome (WGS) or targeted sequencing (whole-exome (WES) or panel) and transcriptome sequencing (RNA-Seq) to identify the molecular basis for a patient’s cancer. Whilst the feasibility of using WGS/WES/panel for mutation detection is well established, most groups only use RNA-Seq in clinical context for fusion detection. The Zero Childhood Cancer (ZERO) program provides a comprehensive precision medicine approach to High-Risk (HR) paediatric malignancies (expected survival <30%) to improve treatment outcomes. We integrate findings from a comprehensive molecular profiling platform (WGS, RNA-Seq) for assessment and recommendation to a national Multidisciplinary Tumour Board (MTB). Our aim is to identify driver variants and targetable aberrations. We developed a pipeline to increase the utility of RNA-Seq in precision medicine through identification of driver fusions, somatic mutations and over-/under-expressed genes.

We have incorporated 3 fusion callers (STAR-Fusion, JAFFA and Arriba) into our RNA-Seq pipeline which accurately identify expressed in-frame driver fusions and out-of-frame fusions disrupting tumour suppressor genes. Arriba increased the detection rate of lowly expressed fusion events and detects duplications, insertions and inversion events with a significantly shorter run time. This resulted in the identification of in-frame fusions arising from complex structural variant events that were challenging to resolve in WGS. RNA-Seq mutation analysis identified 59% of driver mutations and showed loss-of-function mutations acquiring allele-specific expression, confirming pathogenicity in 18% of mutations; one example is loss of expression of the wild-type TP53 allele in tumours with heterozygous TP53 mutations. Detecting expressed fusions, SNVs and indels from RNA-Seq analysis validates the mutation findings from WGS, reducing the need for additional Sanger sequencing or further clinical testing.

Gene expression outlier analysis is a potentially valuable resource for actionability but presents significant challenges. Our RNA-Seq pipeline utilises the ZERO database to identify over-/under-expressed outlier genes for each patient. As our database has grown, we have observed an improvement in the accuracy of detection, and retrospective analysis has increased the number of potential outlier genes. This yields potentially actionable targets, increasing clinical utility and providing potential explanations for drug responses.

ZERO is currently one of the largest and most comprehensive translational cancer initiatives in paediatric precision medicine and has enrolled 282 patients since September 2017 across the full range of paediatric cancer subtypes. At least one recommendation was made for ~70% of the patients. The RNA-Seq pipeline has expanded the targeted therapeutic options identified beyond those that WGS/WES alone identify. This has been an important contribution to therapeutic recommendations to the treating clinicians. The ZERO study is generating a valuable comprehensive dataset across a wide range of HR and rare paediatric cancers to drive a deeper understanding of the biology of these malignancies and to develop more effective therapies.

Here we will present our bioinformatic approaches to investigate the additional clinical utility of integrating comprehensive RNA-Seq analysis into precision medicine. We will present our novel findings, results from the integrated pipelines and the impact on patient management and response.

GRIDSS2: sensitive and specific somatic structural variant detection

Daniel Cameron

The sensitive and specific identification of somatic structural variants remains a challenging problem. Here, I present GRIDSS2, the successor to the best-in-class structural variant caller GRIDSS, which extends the capabilities of GRIDSS to include single breakend detection, assembly-based variant adjoinment, and somatic calling. Using results from running GRIDSS2 on the 4,000 WGS metastatic samples in the Hartwig Medical Foundation clinical sequencing cohort, I show how these novel capabilities combine with the highly sensitive and specific somatic call sets generated by GRIDSS2 to not only recapitulate known biology but also bring new insights into the biological processes that produce genomic rearrangements in cancer.

Single breakend detection: Low-mappability regions such as centromeres, telomeres, and LINE elements have long been considered inaccessible to short read sequencing. Although rearrangements fully contained within such regions are indeed inaccessible, breakpoints in which only one side falls in a low-mappability region can be identified and reported as single breakends, that is, breakpoints in which only one side can be identified and placed. Whilst the VCF file format specifications have supported single breakend calls for over three years, GRIDSS2 is the first caller to report such events. By identifying repeats and viral sequence in the breakend sequence reported by GRIDSS2, single breakends can be classified. Using the Hartwig cohort, I show that this unique capability allows the detection of viral integration and uncovers novel biological insights into the nature of somatic rearrangement events involving retrotransposons, centromeres, and telomeres.
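
For readers unfamiliar with how single breakends appear in a VCF, the sketch below distinguishes them from paired breakpoints by inspecting the ALT field. It relies on the VCF specification's breakend notation, where a paired breakend carries a mate locus in square brackets and a single breakend has a leading or trailing `.` next to the sequence. The function name and category labels are hypothetical, and this is not GRIDSS2 code.

```python
def classify_alt(alt: str) -> str:
    """Classify a VCF ALT value by its structural-variant notation.

    Returns one of: "breakpoint" (paired breakend with a mate locus),
    "single_breakend", "symbolic" (e.g. <DEL>), or "other".
    """
    if "[" in alt or "]" in alt:
        return "breakpoint"  # e.g. G]17:198982] names the mate position
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"
    # A lone "." is a missing value, not a breakend, hence the length check.
    if len(alt) > 1 and (alt.startswith(".") or alt.endswith(".")):
        return "single_breakend"
    return "other"


examples = ["C.", ".A", "G]17:198982]", "<DEL>", ".", "T"]
labels = [classify_alt(a) for a in examples]
```

Records labelled as single breakends are the ones whose unplaced sequence could then be screened against repeat or viral references, as described above.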

Assembly-based variant adjoinment: The unique breakend assembly approach taken by GRIDSS enables the determination of whether proximal breakpoints are adjoining, that is, adjacent and co-occurring on the same chromatid, or occur on different chromatids. Whilst conceptually similar to phasing and linkage, adjoinment is a key capability of GRIDSS that greatly assists in the karyotypic resolution of derivative chromosomes. Startlingly, 9% of all breakpoints in the Hartwig cohort are adjoined, with a further 3% proximal but occurring on different chromatids. I show that by combining this with a breakpoint-aware copy number caller, many instances of chromothripsis can be fully resolved to base-pair resolution.

GRIDSS2 is available under a GPL license at

A 9-gene score for predicting B-ALL overall and relapse risk

Feng Yan, Nicholas C. Wong, David R. Powell and David J. Curtis

B-cell acute lymphoblastic leukemia (B-ALL) is the most common cancer in children. Although the 5-year overall survival rate is up to 85%, 20% of patients still succumb to relapse. Cytogenetic biomarkers are used as prognosticators in B-ALL; however, up to 30% of patients carry noninformative cytogenetics. Leukemia stem cells (LSCs) are a rare population in leukemia and are believed to cause relapse after conventional treatment. We aimed to determine whether an enriched LSC gene signature is predictive of relapse.

Methods: Publicly available RNA-seq data generated from chemo-resistant cells in a PDX model of B-ALL were obtained from GEO and analysed using RNAsik and edgeR. Training data, including the patient RNA-seq expression matrix and clinical information, were obtained directly from the TARGET website. Test datasets included microarray data from GEO and TARGET with associated clinical information. LASSO regression was performed on the training data with 10-fold cross-validation (CV) using different input features. CV was performed 100 times for each input to identify the 3 most frequently occurring models. All models generated were then tested in all three test datasets to validate the power of risk prediction based on the hazard ratio and p-value. Survival analysis was based on a Cox model.

Result: Differentially expressed genes from LSCs were enriched in pathways related to immune response (MHC family) and cell cycle arrest. Genes upregulated in LSCs with significant adverse survival impact were selected for LASSO regression. The final model is a linear combination of the expression of 9 genes (S100A10, ZMAT3, PSAT1, RIMS3, LRRC25, H1FX, TSPO, NID2 and CCDC69), and showed superior predictive power in all three test datasets, which included 2 paediatric and 1 adult B-ALL dataset from different platforms. Moreover, it worked not only in the full dataset but also in the subset of patients with noninformative cytogenetics. Our model provides a potential prognosticator for B-ALL with noninformative cytogenetics using gene expression data.
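
A score built as a linear combination of gene expression values can be computed as in the sketch below. The nine gene names come from the abstract, but the coefficients are not reported there: the weights used here are placeholders set to 1.0 purely for illustration, and the function name and toy expression values are hypothetical.

```python
from typing import Dict

GENES = [
    "S100A10", "ZMAT3", "PSAT1", "RIMS3", "LRRC25",
    "H1FX", "TSPO", "NID2", "CCDC69",
]

# Placeholder coefficients only; the fitted LASSO weights are not given in the abstract.
PLACEHOLDER_WEIGHTS = {gene: 1.0 for gene in GENES}


def risk_score(expression: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of expression values over the genes in `weights`."""
    missing = [gene for gene in weights if gene not in expression]
    if missing:
        raise KeyError(f"expression missing for: {', '.join(missing)}")
    return sum(weights[gene] * expression[gene] for gene in weights)


# Toy patient with a made-up expression value of 2.0 for every gene.
toy_patient = {gene: 2.0 for gene in GENES}
score = risk_score(toy_patient, PLACEHOLDER_WEIGHTS)
```

With real coefficients, patients could be ranked or split into risk groups by this score before fitting a survival model; how the groups were defined in the study is not stated in the abstract.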

Conclusion: We were able to develop a 9-gene expression score to calculate the risk of patients. The score is agnostic to gene expression platform, patient age and cytogenetics.

Opportunities and challenges of analysing multi-regional tumour biopsies to characterise heterogeneity in cancer

Sebastian Hollizeck, Dineika Chandrananda, Lavinia Tan, Stephen Wong, Christine Khoo, Lisa Devereux, Heather Thorne, Benjamin Solomon and Sarah-Jane Dawson

Even though tumour heterogeneity is a widely accepted fact, the possibilities to study this phenomenon and its impact on the treatment of patients are limited. The CASCADE program enables rapid autopsies after the death of the patient and therefore allows unique insights into the clonality and emergent resistance mechanisms of the different metastases. This, however, creates a new set of bioinformatics challenges: managing the amount of data available for each patient and analysing the data jointly across the different samples.

In our current work we explore the variant calling capabilities of different methods in a multi-tumour-matched-normal sample scenario, to allow the reconstruction of evolutionary trajectories of all the tumour sites in the metastatic process. In this analysis we have utilised multi-regional tumour samples from 5 patients with advanced non-small cell lung cancer (n=4 EGFR mutant and n=1 EGFR non-mutant) who underwent rapid autopsy through the CASCADE program. An average of 7 samples were analysed per patient by either whole exome or whole genome sequencing. To ensure high-confidence variants we have used a consensus method of three variant callers: first, an adapted version of the somatic variant calling with Freebayes from the BCBioinformatics pipeline; second, our own 2-pass variant calling workflow with Strelka2; and lastly, the newly developed joint calling capabilities of Mutect2 from GATK.
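
The consensus step can be sketched as follows. This is a minimal illustration, not the workflow used in the study: it assumes each caller's output has already been reduced to a set of (chromosome, position, ref, alt) keys, and the `min_callers` threshold is a hypothetical parameter, since the abstract does not state how many callers must agree.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# A variant is identified by chromosome, position, reference and alternate allele.
Variant = Tuple[str, int, str, str]


def consensus_calls(callsets: Dict[str, Iterable[Variant]],
                    min_callers: int = 2) -> Set[Variant]:
    """Keep variants reported by at least `min_callers` of the callers."""
    support: Counter = Counter()
    for calls in callsets.values():
        # set() guards against a caller reporting the same site twice
        support.update(set(calls))
    return {variant for variant, n in support.items() if n >= min_callers}


if __name__ == "__main__":
    # Toy coordinates for illustration only
    calls = {
        "freebayes": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
        "strelka2": [("chr1", 100, "A", "T"), ("chr3", 300, "C", "A")],
        "mutect2": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
    }
    print(sorted(consensus_calls(calls, min_callers=2)))
    print(sorted(consensus_calls(calls, min_callers=3)))
```

Requiring all three callers trades sensitivity for confidence; a two-of-three rule is the usual compromise, but either choice is an assumption here.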

This work aims to develop improved approaches for sensitively characterising the diverse mutational processes governing treatment resistance in non-small cell lung cancer.

Signalling networks in the Analysis of Proteomic Data

Hannah Huckstep, Jarrod Sandow, Andrew Webb, Liam Fearnley and Melissa Davis

Signalling networks have the potential to provide useful insight into mechanisms driving cancer progression. It has been estimated that as much as a third of the eukaryotic proteome is phosphorylated at any one time, indicating the significance of phosphorylation in modulating cell signalling. Nevertheless, the simple identification and quantification of proteins from different conditions is not sufficient to reconstruct the mechanisms underpinning the observed differences. Functional analysis methods have been developed to help with the interpretation of proteomic and phosphoproteomic data; however, these methods suffer from a range of limitations and fail to account for the complexity of cellular signalling networks. Thus, there is a need for tools, methods and frameworks that consider underlying network structures to aid accurate interpretation and reconstruction of the biological mechanisms at play. An important first step is the derivation of the network, since most knowledgebases today deal in pathways, which do not properly represent the global flow of information across the entire signalling system. Here we have developed a set of algorithms to extract and interrogate a more global signalling network from the knowledgebase determined to be the most complete for this purpose. We also demonstrate how phosphoproteomics measurements can be mapped to this network to interpret the functional consequences of the observed changes in protein phosphorylation. This approach will enable a more unbiased and complete analysis to be performed over networks encompassing specific proteins and phosphoproteins of interest in cancer.

Detecting and validating fusion genes from whole transcriptome sequencing data - challenges and insights

Sehrish Kanwal


The driving role of fusion genes during tumorigenesis has been recognized for decades, with clinical effectiveness demonstrated for targeted therapies. However, our understanding of the phenomenon has been impeded by the surge in bioinformatics methods available for predicting gene fusions, the high percentage of false positive calls, and the difficulty of prioritizing predicted fusion calls and establishing their relevance to the patient’s phenotype and disease.


In this work, we leveraged a well-established and community-driven RNA sequencing pipeline to identify fusion candidates from patients’ transcriptomic data. We have further implemented downstream prioritization, validation and visualization of predicted fusion events critical for cancer development.


Our work shows the importance of fusion predictions using patients’ transcriptome profiles and the corresponding verification of results using patients’ genomic profiling of structural variants. The exploration and evaluation of fusion calls from RNA sequencing data provided an opportunity for detailed assessment of the phenomenon to aid in the characterization of cancer patients’ transcriptomes, by providing a comprehensive overview of the role of fusions in the disease. Furthermore, it also increases our confidence in the results from patients’ whole genome sequence analysis.

Discussion and Conclusion:

Whole transcriptome sequence analysis holds great potential in providing comprehensive knowledge about a patient’s disease profile. Fusion calling from the transcriptome data enables new paths of research for reliable identification of genomic sequence changes. In this talk, I will provide an overview of the approach taken at the University of Melbourne Centre for Cancer Research (UMCCR) to expand and report this valuable information moving forward.

Predicting radiation-induced immune trafficking and activation in localised prostate cancer.

Simon Keam, Thu Nguyen, Catherine Mitchell, Franco Caramia, David Byrne, Sue Haupt, Georgina Ryland, Phillip K Darcy, Shahneen Sandhu, Piers Blombery, Ygal Haupt, Scott Williams and Paul J Neeson

Prostate cancer is frequently cured with high-dose rate brachytherapy as a front-line treatment. However, a significant number of patients unfortunately develop intrinsic resistance. Although the prostate is considered to be an immune-excluded tissue, immune responses are implicated in driving tumour eradication in prostate cancer. This has not been proven, and yet is used as the rationale for numerous clinical trials combining radiation and immunotherapies. We hypothesise that there is a predictable but differential relationship between radiation and the immune responses in prostate cancer that could be used to fulfil a clinical need: identifying patients that would benefit from immune intervention in conjunction with radiation.

We present here the results of comprehensive immunological profiling of a cohort of world-unique pre- and post-radiation tissues from 24 patients (RadBank cohort). These were assessed using pathological classification, tissue segmentation (cancer/surrounding stroma), multiplex IHC, gene expression profiling, T-cell receptor sequencing, and spatial computational analysis.

Our data resolved three classes of prostate cancer tissue based on immune infiltrate level, immune-activation and -checkpoint gene signatures, spatial clustering and T cell clone sequencing. We have begun to resolve clear patient classifiers based on immune responses to radiation, and identified patient groups likely to benefit from immune therapy alongside radiation. Importantly, these classifications are associated with baseline gene expression profiles that may be used for pre-clinical stratification and more sophisticated treatment paradigms.


1. Investigating Clonal Haematopoiesis of Indeterminate Potential (CHIP) in the ASPREE Cohort.

Nick Wong, Zoe McQuilten and David Curtis

Clonal Haematopoiesis of Indeterminate Potential (CHIP) is the phenomenon of carrying somatic mutations associated with leukaemogenesis in otherwise normal, healthy people. The ASPREE (ASPirin in Reducing Events in the Elderly) study is a large cohort study (~10,000 participants) with comprehensive clinical and phenotypic measures taken throughout the duration of the study. The study is ongoing and endpoints include the onset of cancer and cardiovascular disease. Participants are recruited into the study at 60 years or above and are otherwise healthy at the time of recruitment.
A baseline blood sample was collected from all participants, with a follow-up sample collected three years later. This provides a unique opportunity to investigate the incidence of CHIP in an otherwise healthy population and the change in CHIP three years on.
A targeted approach with deep sequencing will be designed to investigate CHIP down to 0.5% variant allele frequency. Given the scale of this study, logistical considerations in handling and processing ~20,000 samples are significant. These include data handling, transfer and analysis through standardised bioinformatic pipelines.

2. Identifying primary site of lung-limited Cancer of unknown primary based on relative gene expression orderings

Mengyao Li, Hongdong Li, Guini Hong, Zhongjie Tang, Guanghao Liu, Xiaofang Lin, Mingzhang Lin, Lishuang Qi and Zheng Guo


Precise diagnosis of the tissue of origin for metastatic cancer of unknown primary (CUP) is essential for deciding the treatment scheme to improve patients’ prognoses, since the treatment for the metastases is the same as for their primary counterparts. The purpose of this study is to identify a robust gene signature that can predict the origin of CUPs.


The within-sample relative expression orderings (REOs) of gene pairs, which are insensitive to experimental batch effects and data normalizations, were exploited for identifying the prediction signature.


Using gene expression profiles of lung-limited metastatic colorectal cancer (LmCRC), we first showed that the within-sample REOs in lung metastases of colorectal cancer (CRC) samples were concordant with the REOs in primary CRC samples rather than with the REOs in primary lung cancer. Based on this phenomenon, we selected five gene pairs with consistent REOs in 498 primary CRC samples and reversely consistent REOs in 509 lung cancer samples, which were used as a signature for predicting primary sites of metastatic CRC based on the majority voting rule. When the signature was applied to 654 primary CRC and 204 primary lung cancer samples collected from multiple datasets, the prediction accuracy reached 99.36%. This signature was also applied to 24 LmCRC samples collected from three datasets produced by different laboratories and the accuracy reached 100%, suggesting that the within-sample REOs in the primary site could reveal the original tissue of metastatic cancers.
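
The majority voting rule over REOs can be illustrated with a short sketch. The gene names below are hypothetical placeholders (the five signature pairs are not listed in the abstract); each pair (a, b) is assumed to be oriented so that a > b in primary CRC and a < b in primary lung cancer.

```python
from typing import Dict, List, Tuple

# HYPOTHETICAL gene pairs: oriented so that expression(a) > expression(b)
# is the ordering expected in CRC, and the reverse is expected in lung cancer.
SIGNATURE_PAIRS: List[Tuple[str, str]] = [
    ("geneA1", "geneB1"),
    ("geneA2", "geneB2"),
    ("geneA3", "geneB3"),
    ("geneA4", "geneB4"),
    ("geneA5", "geneB5"),
]


def predict_primary_site(expression: Dict[str, float],
                         pairs: List[Tuple[str, str]] = SIGNATURE_PAIRS) -> str:
    """Classify one sample by majority vote over within-sample orderings."""
    crc_votes = sum(1 for a, b in pairs if expression[a] > expression[b])
    # With an odd number of pairs a tie cannot occur
    return "CRC" if crc_votes > len(pairs) / 2 else "lung"


if __name__ == "__main__":
    # Toy sample: four of five pairs show the CRC ordering
    sample = {
        "geneA1": 9.1, "geneB1": 4.0,
        "geneA2": 7.5, "geneB2": 6.9,
        "geneA3": 3.2, "geneB3": 8.8,
        "geneA4": 5.5, "geneB4": 2.1,
        "geneA5": 6.0, "geneB5": 1.0,
    }
    print(predict_primary_site(sample))
```

Because only the ordering within a sample is used, the rule gives the same answer under any monotone rescaling of that sample's expression values, which is why REOs are insensitive to normalization.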


The results demonstrated that the signature based on within-sample REOs of five gene pairs could accurately and robustly identify the primary sites of CUPs.

3. RAD51C Promoter Methylation Stability Influences PARP Inhibitor Response in High Grade Serous Ovarian Carcinoma Patient Derived Xenografts

Matthew Wakefield, Ksenija Nesic, Rachel Hurley, Cordelia McGehee, Olga Kondrashova, Maria Harrell, Giada Zapparoli, Ashan Musafer, Ming Wong, Elizabeth Swisher, Melissa Southey, Alexander Dobrovic, Scott Kaufmann and Clare Scott

PARP inhibitor (PARPi) resistance in High Grade Serous Ovarian Carcinoma (HGSOC) can be acquired as a result of restored Homologous Recombination (HR) due to secondary or reversion mutations in HR genes, such as BRCA1, BRCA2 and RAD51C, or due to loss of BRCA1 promoter methylation (meBRCA1). Our group has recently demonstrated that homozygous meBRCA1 can be lost or reverted to heterozygous methylation following treatment with platinum-based chemotherapy, resulting in HR competent PARPi resistant tumours. RAD51C promoter methylation (meRAD51C) is detected in approximately 2% of HGSOC cases and, as for meBRCA1, is associated with gene silencing and HR deficiency. However, less is known about acquired PARPi resistance in this context.

Here we present two Patient Derived Xenograft (PDX) models of HGSOC with RAD51C gene silencing caused by meRAD51C. These PDX have distinct meRAD51C profiles (measured by methylation-specific high-resolution melt analysis and targeted bisulfite next generation sequencing), and different responses to PARPi treatment pressure. PDX PH039 loses methylation and regains RAD51C expression after only 2 cycles of PARPi re-treatment (niraparib), resulting in PARPi-refractory tumours by cycle 3-4. Illumina EPIC methylation array analysis of PH039 revealed increasing global methylation losses following each round of PARPi treatment. Lack of meRAD51C stability and rapid development of PARPi resistance in PH039 may be due to the high degree of meRAD51C heterogeneity within the tumour favouring selection of pre-existing HR competent clones under PARPi pressure. In contrast, PDX 183 has a very homogeneous and stable meRAD51C profile: up to 4 cycles of PARPi re-treatment (rucaparib) did not alter the degree of methylation at the RAD51C promoter, restore gene expression or reduce response to PARPi. Re-treatments of PH039 with rucaparib are underway to ensure that the effects observed are not niraparib-specific.

Using these unique PDX models, we have demonstrated that meRAD51C confers response to PARPi in HGSOC, but that PARPi treatment pressure can cause loss of methylation and drug resistance in some tumours. The contrasting PARPi responses of these PDX provide a platform for study of meRAD51C stability in vivo and may present therapeutic opportunities to improve meRAD51C durability and PARPi responses in patients.

4. Differential co-expression based detection of conditional relationships in transcriptional data: Comparative analysis and application to breast cancer

Dharmesh Bhuva, Joe Cursons, Gordon Smyth and Melissa Davis

Elucidation of regulatory networks, including identification of regulatory mechanisms specific to a given biological context, is a key aim in systems biology. This has motivated the move from co-expression to differential co-expression analysis, and numerous methods have subsequently been developed to address this task. However, evaluation of methods and interpretation of the resulting networks have been hindered by the lack of known context-specific regulatory interactions.

In this study, we develop a simulator based on dynamical systems modelling capable of simulating differential co-expression patterns from regulatory networks. With the simulator and an evaluation framework, we benchmark and characterise the performance of inference methods. Defining three different levels of “true” networks for each simulation, we show that accurate inference of causation is difficult for all methods, compared to inference of associations. We show that a z-score based method has the best general performance. The evaluation framework and inference methods used in this study are available in the dcanr R/Bioconductor package.
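
To make the idea of a z-score based differential co-expression test concrete, here is a minimal sketch using the Fisher transformation of correlation coefficients. It is a generic textbook form, not the dcanr implementation (which may, for example, use a different correlation measure); all function names are illustrative.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long samples (needs n >= 4 here)."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need two samples of equal length >= 4")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def differential_z(x_a: Sequence[float], y_a: Sequence[float],
                   x_b: Sequence[float], y_b: Sequence[float]) -> float:
    """z-score for the change in correlation of a gene pair between
    condition A and condition B, via the Fisher transformation atanh(r)."""
    r_a, r_b = pearson(x_a, y_a), pearson(x_b, y_b)
    # Standard error of the difference of two Fisher-transformed correlations
    se = math.sqrt(1.0 / (len(x_a) - 3) + 1.0 / (len(x_b) - 3))
    return (math.atanh(r_a) - math.atanh(r_b)) / se


if __name__ == "__main__":
    gene_x = [1.0, 2.0, 3.0, 4.0, 5.0]
    gene_y_cond_a = [1.0, 2.0, 3.0, 5.0, 4.0]  # positively correlated with gene_x
    gene_y_cond_b = [5.0, 4.0, 3.0, 1.0, 2.0]  # negatively correlated with gene_x
    print(differential_z(gene_x, gene_y_cond_a, gene_x, gene_y_cond_b))
```

A large absolute z suggests the association between the two genes depends on the condition; it says nothing about which gene regulates which, in line with the point above that causation is harder to infer than association.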

Our analysis of networks inferred from simulated data shows that hub nodes are more likely to be differentially regulated targets than transcription factors. Based on this observation, we propose an interpretation of the inferred differential network that can reconstruct a putative causal network. Application to a breast cancer dataset reveals differential regulation of immune processes dependent on estrogen receptor status, and we show how HSH2D can be a potential marker of tumour infiltrating lymphocytes in basal-like tumours, which are mostly hormone receptor negative. The potential of differential co-expression analysis remains largely unexplored due to difficulties in interpreting results. We have attempted to address some of the limiting factors and provide recommendations on the application of these methods to cancer datasets. The methods are not limited to co-expression and may be applied to associations in general.

5. Shiny-SoSV: A web app for interactive evaluation of somatic structural variant calls

Tingting Gong, Vanessa Hayes and Eva Chan

Somatic structural variants (SVs) play a significant role in cancer development and evolution. Accurate detection of these complex variants from whole genome sequencing data is influenced by many variables, the effects of which are not always linear. With increasing demand for the application of whole genome sequencing in clinical settings and research, there is an unmet need for clinician scientists and researchers to easily make technical decisions for every unique patient and sample.

To address this, we have developed Shiny-SoSV, an interactive web application for evaluating the effects of five common variables on the sensitivity and precision of somatic SV calls, thereby enabling users to quickly make informed sequencing and bioinformatics decisions early on in their study design.

Firstly, a somatic SV evaluation framework was developed for unbiased evaluation of the effects of several parameters, including SV caller, sequencing depth, variant allele frequency, and required SV breakpoint resolution. Secondly, a statistical model based on these parameters was developed to predict the sensitivity and precision of SV calling. Thirdly, a web app was developed that translates these findings into an interactive and visual platform, allowing users to easily explore the effects of each of these parameters as well as their combinations.
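
The two quantities being modelled, sensitivity and precision under a breakpoint resolution threshold, can be sketched as follows. This is a simplified illustration rather than the Shiny-SoSV evaluation framework: an SV is reduced to (chromosome, start, end, type), and a call counts as a true positive if both breakpoints fall within `tolerance_bp` of an as-yet-unmatched truth SV of the same type.

```python
from typing import List, Tuple

# Simplified SV record: chromosome, start breakpoint, end breakpoint, SV type
SV = Tuple[str, int, int, str]


def is_match(call: SV, truth: SV, tolerance_bp: int) -> bool:
    """Same chromosome and type, and both breakpoints within the tolerance."""
    return (call[0] == truth[0]
            and call[3] == truth[3]
            and abs(call[1] - truth[1]) <= tolerance_bp
            and abs(call[2] - truth[2]) <= tolerance_bp)


def evaluate_calls(truth: List[SV], calls: List[SV],
                   tolerance_bp: int) -> Tuple[float, float]:
    """Return (sensitivity, precision) using greedy one-to-one matching."""
    unmatched = list(truth)
    true_positives = 0
    for call in calls:
        for candidate in unmatched:
            if is_match(call, candidate, tolerance_bp):
                unmatched.remove(candidate)  # each truth SV is matched once
                true_positives += 1
                break
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(calls) if calls else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    # Toy coordinates for illustration only
    truth = [("chr1", 1000, 5000, "DEL"), ("chr2", 300, 900, "DUP")]
    calls = [("chr1", 1050, 5020, "DEL"), ("chr3", 10, 20, "INV")]
    print(evaluate_calls(truth, calls, tolerance_bp=200))
    print(evaluate_calls(truth, calls, tolerance_bp=2))
```

The toy example shows why the breakpoint threshold matters: the same call set scores differently at 200 bp and at 2 bp resolution.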

The web app is free to access. It has been tested on three SV callers (Manta, Lumpy and GRIDSS) and their pairwise combination sets with a realistic range of sequencing coverages of tumor (20x-90x) and matched normal samples (15x-90x), variant allele frequencies (5%-100%) and breakpoint precision thresholds (2bp-200bp).

Shiny-SoSV provides an easy-to-use and visually interactive platform for evaluating the interacting effects of multiple variables impacting somatic SV detection. Additional SV callers can easily be incorporated using the existing simulation datasets, while assessment of additional variables can be achieved with further simulation datasets.

6. Calling variants, filtering and then filtering some more: Somatic variant calling from unmatched tumour RNA-Seq

Andrew Pattison and Richard Tothill

Somatic changes to the tumour genome are key to the development and progression of cancer. DNA sequencing methods such as whole genome sequencing (WGS) and whole exome sequencing (WES) are commonly used to study these somatic alterations. Well-known examples of DNA sequencing being used extensively to study somatic variation include The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). The aggregation of the somatic information from these studies has greatly expanded our understanding of the genetic drivers of many tumour types. Information about DNA variation (germline and somatic) can also be found in RNA sequencing (RNA-Seq) data; however, variant calling from RNA-Seq is not common practice. Key factors obscuring information about genetic variation in RNA-Seq data include post-transcriptional modifications to RNA, low coverage within lowly expressed genes, errors in converting RNA to cDNA prior to sequencing, allele-specific gene expression and nonsense-mediated decay of transcripts. More challenging still is that most RNA-Seq data is unmatched, meaning that germline variants (which vastly outnumber somatic variants) must also be identified and filtered out. Despite these technical challenges, it is still possible to call variants from RNA-Seq experiments provided the genes containing a variant are expressed and carefully selected filters for RNA-specific artefacts are applied. Germline variants can then be aggressively filtered from the calls using databases of known human germline variation. In some instances variant calling from RNA-Seq may actually outperform WGS and WES, including when somatic variants exist in cancer driver genes (which are likely to be highly expressed) and in measuring variants from regions of the genome that may be difficult to sequence. Variants called from RNA-Seq can additionally be used to measure RNA editing and allele-specific gene expression.
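
The filtering logic described above can be sketched in a few lines. This is an illustration of the general idea only, not the uRSVC pipeline: the record fields, the default thresholds and the idea of passing known germline and RNA-editing sites as in-memory sets are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Site = Tuple[str, int, str, str]  # chromosome, position, ref allele, alt allele


@dataclass(frozen=True)
class RnaCall:
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the alternate allele

    @property
    def site(self) -> Site:
        return (self.chrom, self.pos, self.ref, self.alt)


def filter_putative_somatic(calls: Iterable[RnaCall],
                            known_germline: Set[Site],
                            known_rna_editing: Set[Site],
                            min_depth: int = 10,
                            min_alt_reads: int = 3) -> List[RnaCall]:
    """Keep calls that pass coverage filters and are absent from the
    germline and RNA-editing site lists (thresholds are hypothetical)."""
    kept = []
    for call in calls:
        if call.depth < min_depth or call.alt_reads < min_alt_reads:
            continue  # gene too lowly expressed or too little evidence
        if call.site in known_rna_editing:
            continue  # likely an RNA-specific artefact, not a DNA variant
        if call.site in known_germline:
            continue  # aggressive germline removal for unmatched data
        kept.append(call)
    return kept


if __name__ == "__main__":
    # Toy records on a made-up contig
    calls = [
        RnaCall("chrT", 101, "C", "T", depth=50, alt_reads=20),
        RnaCall("chrT", 202, "A", "G", depth=40, alt_reads=15),
        RnaCall("chrT", 303, "G", "A", depth=4, alt_reads=2),
        RnaCall("chrT", 404, "T", "C", depth=60, alt_reads=30),
    ]
    germline = {("chrT", 404, "T", "C")}
    editing = {("chrT", 202, "A", "G")}
    print([c.pos for c in filter_putative_somatic(calls, germline, editing)])
```

Every filter removes some true somatic variants along with the artefacts, which is the trade-off between call rate and false positives that aggressive filtering implies.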

Here we present ‘unmatched Variant Calling from RNA-Seq’ (uRSVC), a pipeline for calling somatic variants from RNA-Seq datasets. The uRSVC pipeline was applied to a training set of 26 lung cancers from the TCGA, with matched WES data as a validation set. Somatic variants were correctly identified at 19% of sites by the uRSVC pipeline with a false positive rate of 18%. This low rate of detection was primarily due to the aggressive filtering required by the pipeline. Despite this relatively low call rate, key somatic variants including TP53 were identified, and mutational signatures could also be identified with striking similarity to those calculated from WES. Tumour mutation burden (TMB) could also be called with reasonable accuracy from RNA-Seq (Pearson’s r = 0.65, RNA-Seq vs WES). The uRSVC pipeline has also proven useful in the analysis of a higher quality Merkel cell carcinoma dataset, where it achieved a Pearson’s correlation of 0.96 when comparing TMB derived from a DNA panel vs RNA-Seq. Successful variant calling from RNA-Seq will allow new genomic information to be extracted for thousands of existing samples where only RNA-Seq data is available. The uRSVC pipeline is still under active development but has already shown the potential to be useful in determining somatic mutations from tumour RNA-Seq data.

7. Latin American Study of Hereditary Breast and Ovarian Cancer LACAM: A Genomic Epidemiology Approach

Rosalia Quezada Urban, Felipe Vaca Paniagua, Javier Oliver, Clara Estela Diaz Velasquez, Claudia Alejandra Franco Cortés, Gabriela Torres Mejía, Luis Enrique Romero Cruz, Ernesto Rojas Jiménez, Fernando Vallejo Lecuona and Sandra Perdomo

Purpose: Hereditary Breast and Ovarian Cancer (HBOC) syndrome is responsible for approximately 5-10% of all diagnosed breast and ovarian cancer. In Latin America (LA), breast cancer is the most common malignancy and the leading cause of cancer-related mortality among women. The main objective of this study was to develop a comprehensive understanding of the genomic epidemiology of HBOC through the establishment of the Latin American consortium for HBOC (LACAM), formed by specialists from 5 countries in LA, and to describe the genomic results from the first phase of the study.

Methods: We have recruited 403 individuals that fulfilled the criteria for HBOC from 11 centres of Argentina, Colombia, Guatemala, México and Peru. A pilot cohort of 222 individuals was analyzed by NGS gene panels. The genes were selected based on their putative role in susceptibility to different hereditary cancers. Libraries were sequenced on the MiSeq (Illumina) and PGM (Ion Torrent-Thermofisher) platforms.

Results: The overall prevalence of pathogenic variants was 17% (38/222) and the distribution spanned 14 genes and varied by country. The highest relative prevalence of pathogenic variants was found in patients from Argentina (25%, 14/57), followed by Mexico (18%, 12/68), Guatemala (16%, 3/19) and Colombia (13%, 10/78). Of the total number of pathogenic variants, 20% were found in the BRCA1 and 29% in BRCA2 genes. Pathogenic variants in non-BRCA genes were found in 12 genes, including high and moderate risk genes such as MSH2, MSH6, MUTYH and PALB2. Additional pathogenic variants were found in HBOC unrelated genes such as DCLRE1C, WRN, PDE11A and PDGFB.

Conclusion: In this first phase of the project, we recruited 403 individuals and evaluated the germline genetic alterations in an initial cohort of 222 patients from 4 of the 5 countries. Our data show for the first time in LA the distribution of pathogenic variants in a broad set of cancer susceptibility genes in HBOC. Even with extended gene panels, there was still a high proportion of patients without any detectable pathogenic variants, which emphasizes the larger, unexplored genetic nature of the disease in these populations.

8. Network based modelling for Neuroblastoma drug discovery

Samuel Lee, Bellamy Cheung, Glenn Marshall and Jessica Holien

Neuroblastoma is the most common solid tumour in infants. However, due to its heterogeneous presentation more than 50% of children with high-risk Neuroblastoma do not survive despite multimodal therapies. Treatment of Neuroblastoma is further affected by the high rates of refractory and relapsed metastasis that occur in patients. There currently is an unmet need for treatments specific to this relapsed form of the disease.

While a number of proteins have been associated with Neuroblastoma (e.g., N-MYC), few druggable targets have been found to date. By utilising multiple extensive public protein-protein interaction networks together with transcriptomic data from Neuroblastoma patients, we can identify which aspects of the interaction networks are perturbed within primary and relapsed Neuroblastoma. Further, by integrating structural data for proteins in our networks, protein interactions between targets that are amenable to structure-based drug design can be prioritised for cell-based assays. This combination of disease-specific data with databases detailing protein interactions and structures allows for richer prediction of potential targets for treatment of Neuroblastoma.

9. Invasive lobular breast cancer: An integrated genetic and epigenetic approach to characterise lobular tumors and unravel unique etiology related to tumorigenesis and progression.

Medha Suman, Melissa C. Southey, Tu Nguyen-Dumont, Jihoon Eric Joo, Ee Ming Wong, Neil O’Callaghan, Melissa Yow, John L Hopper, Graham G. Giles, Roger L. Milne, ABCFS, MCCS and kConFab

Invasive lobular breast cancer (ILBC) is the second most common histological subtype of breast cancer (BC) and accounts for 10-15% of all BC cases. It is recognised as a distinct subtype and differs from ductal BC at histological, clinical and molecular levels and some studies have also reported a difference in treatment response. However, currently there is no specific treatment regimen for lobular BC. Here, we aim to characterise ILBCs based on their genome-wide DNA methylation and whole-exome somatic genetic variation profile to identify genes and biological pathways associated with this subtype.

We performed an unsupervised clustering analysis based on the genome-wide DNA methylation levels, involving 449,005 CpG sites from 151 ILBC and identified 3 groups. Differential methylation analysis revealed genes with >30% methylation difference between the selected groups. We found group 1 to be the most hypermethylated and group 3 to be the most hypomethylated group. Survival analysis suggested a significant difference in overall survival between group 1 and 3. We are now conducting somatic WES to further examine these ILBC subgroups.

We will present data from the pilot whole-exome sequencing run involving 2 samples, for which libraries were prepared from tumour DNA (FFPE), Guthrie card and frozen blood DNA using the SureSelect XT low input library preparation kit, with SureSelect Clinical Research Exomev2 as the capture library. Somatic variants were called using VarDict and the mutation signatures were detected using deconstructSigs. We sought to identify methods to integrate the genome-wide DNA methylation and somatic mutation data that will further characterise pathways involved in lobular tumour development and progression. This has the potential to support more precise prognosis and targeted therapy for ILBC.

10. Fully convolutional neural network for automatic skeletal muscle segmentation in CT scans

Kaushalya C. Amarasinghe, Jamie Lopes and Julian Beraldo


Convolutional neural networks (CNNs) have been successfully used in analysing different types of biomedical images including radiological, microscopy and histopathological images. Previous studies have demonstrated the higher performance of CNNs in delineating organs and tumours in radiological images compared to traditional machine learning methods. Skeletal muscle delineation on the cross-sectional CT slice at the third lumbar vertebra (L3) is an important step in evaluating the whole body mass of a cancer patient. Specifically, this helps the early detection of muscle wastage due to cancer cachexia, which is associated with poor outcomes and shortened survival in cancer patients. Therefore, we aim to develop a CNN-based algorithm to automatically delineate L3 muscle in full body CT scans, which can be integrated into routine clinical practice.

Material and methods:

Full body PET-CT scans of 66 non-small cell lung cancer patients who underwent radiotherapy at Peter MacCallum Cancer Centre were used to develop the model. Each patient had 1-4 CT scans taken at different time points (prior to, during and after therapy). Skeletal muscle was manually segmented on 148 L3 slices of CTs according to the Alberta protocol. These manual contours served as ground truth labels. All patients were divided into train (n=41) and test (n=25) cohorts. The test set was treated as an independent validation set of the model. There were 90 CTs in the train set, which we randomly split into train and validation sets for model training and selection purposes during the training phase. We implemented a fully convolutional neural network commonly known as UNet to perform the segmentation. The final layer of the network produced the binary classification of the pixels into muscle and non-muscle areas. The trained model was used to segment the L3 muscle in the CTs of the test set. The model performance was calculated using the Dice score, which is the pixel-wise F1 score and gives the spatial overlap between ground truth and predicted labels.
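
For reference, the Dice score used as the performance metric can be written in a few lines. The sketch below works on small binary masks given as nested lists; the function name and the convention of returning 1.0 for two empty masks are choices made for this example.

```python
from typing import List, Sequence


def dice_score(predicted: Sequence[Sequence[int]],
               truth: Sequence[Sequence[int]]) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) over binary pixel masks."""
    flat_pred: List[int] = [p for row in predicted for p in row]
    flat_truth: List[int] = [t for row in truth for t in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_truth if t)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * overlap / total


if __name__ == "__main__":
    predicted_mask = [[1, 1], [0, 0]]  # model labels two pixels as muscle
    manual_mask = [[1, 0], [0, 0]]     # ground truth labels one pixel
    print(dice_score(predicted_mask, manual_mask))
```

With one overlapping pixel, two predicted and one true, the score is 2 × 1 / (2 + 1), which equals the F1 score of the pixel-wise classification.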


We developed an automated algorithm based on a fully convolutional neural network to delineate the skeletal muscle at the L3 slice. The model outputs muscle contours in DICOM format, which can be visualised with the original CT scan using any third-party DICOM viewer. Our model achieves a mean Dice score of 0.90 on the test set. The mean absolute area difference between manual and automatic segmentations was 3.54 cm² / 3.31% (0.08% - 17.81%). In the test set, 21 patients had CTs at two or more time points and showed on average a 2.8% increase in muscle area between first and second CTs. Four patients had CTs taken at all four time points and showed an average decrease of 3.4% in muscle area.

11. Pan-Cancer Clonal Evolution Reconstruction Using Evolutionary Modelling

Luis Lara-Gonzalez, Sherene Loi, Davide Ferrari, Anthony Papenfuss and David Goode

Tumours at diagnosable sizes have undergone years of clonal adaptation and selection, leading to intratumour heterogeneity that makes accurate reconstruction of clonal evolution difficult.

To tackle this problem, we implemented a fitting procedure that compares simulated vs next-generation sequencing data using the discrete time branching process (DTBP), an evolutionary model that tracks the expansion of diverse clonal lineages as they acquire driver alterations. We simulated 13,500 tumours from the model considering different parameter combinations of mutation rates (from 10^-5 to 10^-7) and driver mutation selective advantages (from 0.1 to 0.001), then identified which simulations best recreated the tumour profile observed in the patient. This is achieved by minimising the Cramér-von Mises statistic and adjusting for sequencing assay (amplicon, WES, single or multiregion) and clinicopathologic factors (e.g. tumour size, number of nodes). We tested our approach with publicly available data using the clonality cancer cell fractions from single-region WES of the TCGA cohort (Andor et al.) and the multiregion WES study of 99 non-small cell lung cancer adenomas (Jamal-Hanjani et al.).
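
The selection of the best-fitting simulation by minimising a Cramér-von Mises statistic can be sketched as below. This is a generic two-sample form of the statistic applied to toy cancer cell fractions; it leaves out the adjustments for sequencing assay and clinicopathologic factors, and the function and variable names are illustrative.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def ecdf(sorted_sample: Sequence[float], x: float) -> float:
    """Empirical CDF: fraction of the sample that is <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def cvm_two_sample(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Cramer-von Mises statistic:
    T = n*m/(n+m)^2 * sum over pooled points of (F_a(x) - F_b(x))^2."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    pooled: List[float] = a_sorted + b_sorted
    squared_gaps = sum((ecdf(a_sorted, x) - ecdf(b_sorted, x)) ** 2 for x in pooled)
    return n * m / (n + m) ** 2 * squared_gaps


def best_fit(observed: Sequence[float],
             simulations: Dict[str, Sequence[float]]) -> str:
    """Name of the simulation whose distribution is closest to the observed one."""
    return min(simulations, key=lambda name: cvm_two_sample(observed, simulations[name]))


if __name__ == "__main__":
    # Toy cancer cell fractions for one patient and two simulated tumours
    observed = [0.1, 0.2, 0.5, 0.9]
    simulated = {
        "sim_A": [0.1, 0.2, 0.5, 0.9],
        "sim_B": [0.8, 0.9, 0.95, 1.0],
    }
    print(best_fit(observed, simulated))
```

The statistic is zero when the two empirical distributions coincide and grows as they diverge, so the minimiser identifies the parameter combination that best reproduces the observed profile.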

The applications of DTBP are multifold: it can estimate tumour growth, aid clonality tools in phylogeny reconstruction, and account for the systematic and sampling biases introduced by sequencing assays.

Reconstructed clonal histories of TCGA tumours showed an average of 2-6 clonal expansions, proportional to tumour fitness. Surprisingly, only a few clonal expansions with high fitness have shaped tumour evolution in malignancies with an elevated number of reported subpopulations, such as melanoma, lung and stomach cancer, making the remaining subpopulations consequential or passenger noise.

We found concordance between the clonality and phylogenies in the multiregion WES study, corroborating the results observed in the TCGA cohort. Fatal cases showed increased fitness, suggesting early dissemination, as more cells are committed to expansion than in non-fatal cases.

Combining evolutionary modelling with commonly used clonality tools can result in improved clonal evolution reconstruction with prognostic power.


Andor, Noemi, et al. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nature Medicine 22.1 (2016): 105.

Jamal-Hanjani, Mariam, et al. Tracking the evolution of non-small-cell lung cancer. New England Journal of Medicine 376.22 (2017): 2109-2121.

12. CNspector: a web-based tool for visualisation and clinical diagnosis of copy number variation

John Markham, Satwica Yerneni, Georgina Ryland, Huei San Leong, Andrew Fellowes, Ella Thompson, Wasanthi De Silva, Amit Kumar, Richard Lupat, Jason Li, Jason Ellul, Stephen Fox, Michael Dickinson, Anthony Papenfuss and Piers Blombery


Targeted sequencing using panels of disease-related genes is now routinely used in pathology departments to find clinically relevant somatic and germline sequence variations in patient samples. Clinical assessment of copy number variations (CNVs) and large-scale structural variations (SVs), however, remains challenging. While tools exist to estimate both, their results are normally presented separately in tables or static plots, which can be difficult to use for clinical interpretation and reporting. When CNVs and SVs are displayed together, it is often in the form of CIRCOS plots which, while useful as a summary or overview, are not suitable for detailed interrogation of the data.


We have addressed this problem with CNspector, a multi-scale interactive browser that shows CNV in the context of other relevant genomic features to enable fast and effective clinical reporting.

We illustrate the utility of CNspector at different genomic scales (exon and chromosome scale), with different sample types (formalin-fixed paraffin-embedded and fresh frozen tissue) and different sequencing strategies (targeted and whole genome sequencing).

We supply utilities to import the outputs from several popular copy number callers so that their output can be viewed within CNspector. For those callers that only use one reference sample, we are able to improve the CNV estimates by loading all samples at once (multi-sample mode) and calling copy number against a dynamically generated reference. Multi-sample mode may also be used to do one-to-many and one-to-one comparisons of samples - useful, respectively, for comparison with patient cohorts and for comparison between samples from one patient taken at different time points. Finally, in a research context, CNspector can be used to view other, log-scaled, counts data. This allows visualisation of, for example, differential RNA expression at not only the exon and gene scales but also at larger scales due to long-range regulatory processes.
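The idea of calling copy number against a reference built from all loaded samples can be illustrated with a small sketch. This is not CNspector's implementation (which is written in R): it assumes a per-bin median across the other samples as the reference and a log2 ratio as the copy number estimate, and all names and depth values are hypothetical.

```python
import math
from statistics import median


def log2_ratios_vs_pooled_reference(coverage, target):
    """Log2 ratio of one sample's per-bin depth against a pooled reference.

    coverage: dict mapping sample name -> list of normalised per-bin depths.
    The reference for each bin is the median depth across all other samples.
    Bins with non-positive depth in either the sample or the reference are
    reported as None rather than guessed.
    """
    others = [bins for name, bins in coverage.items() if name != target]
    n_bins = len(coverage[target])
    reference = [median(sample[i] for sample in others) for i in range(n_bins)]
    ratios = []
    for depth, ref in zip(coverage[target], reference):
        ratios.append(math.log2(depth / ref) if depth > 0 and ref > 0 else None)
    return ratios


# Hypothetical normalised depths for three bins in four samples.
coverage = {
    "normal_1": [100.0, 100.0, 100.0],
    "normal_2": [100.0, 100.0, 100.0],
    "normal_3": [100.0, 100.0, 100.0],
    "tumour": [100.0, 200.0, 50.0],
}
ratios = log2_ratios_vs_pooled_reference(coverage, "tumour")
```

A one-to-one comparison, such as two time points from the same patient, is the special case in which `coverage` holds only the two samples being compared.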


We have provided a web-based clinical CNV browser tailored to the clinical application of targeted sequencing for CNV assessment. We have demonstrated its utility in typical applications across a range of tissue types and disease contexts encountered in pathology departments.

CNspector is written in R using Rshiny and the source code is available for download under the GPL3 Licence from A server running CNspector loaded with the figures from this paper can be accessed at

13. Developing portable variant calling pipelines with Janis

Michael Franklin, Richard Lupat, Jiaan Yu, Evan Thomas, Daniel Park, Bernard Pope, Tony Papenfuss and Jason Li

The decreasing cost of next-generation sequencing has allowed whole genome sequencing (WGS) analysis to become commonly used in research, the results of which will lead us closer to personalised cancer treatments. At the same time, there is an increasing demand for FAIR (Findable, Accessible, Interoperable, Reusable) data principles and shareable computational analyses. The significant amount of computation required by WGS places constraints on in-house high-performance computing systems (HPCs) and creates demand for pipeline systems that are portable and reproducible across on-site and cloud environments. Projects such as the Common Workflow Language (CWL) and Workflow Description Language (WDL) have made progress towards a portable, containerised specification that decouples the workflow specification from the execution environment. However, CWL can be difficult to write, WDL does not have wide support, and neither specification has an easily extensible typing system.

To address these problems, we have developed Janis, a Python framework to create shareable and type-safe workflow specifications with the ability to generate CWL and WDL. Workflows are composed of containerised task-based components, which addresses portability and allows each tool to be independently optimised based on its resource requirements.

Using Janis, we have produced two portable WGS variant calling pipelines (germline and somatic). These workflows use GATK (HaplotypeCaller for germline and Mutect2 for somatic), Strelka and VarDict for variant calling. Both workflows a) take raw sequence data in FASTQ format; b) align to the reference genome using BWA MEM; c) mark duplicates using Picard; d) run the appropriate variant callers; and e) output the final variants in VCF format. The analysis has been parallelised across sub-genomic regions to decrease the total runtime of the workflow. These pipelines will be extended to include variant annotation, copy number and structural variation analysis.
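Parallelising by sub-genomic region amounts to splitting each contig into intervals that can be processed as independent jobs. The sketch below shows one simple way to generate such intervals; it is not taken from the Janis pipelines, and the function names, the fixed chunk size and the contig lengths are hypothetical.

```python
def split_into_regions(contig_lengths, chunk_size):
    """Split contigs into consecutive 1-based, inclusive (contig, start, end) regions."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    regions = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            regions.append((contig, start, end))
    return regions


def to_interval_string(region):
    """Format a region as 'contig:start-end'."""
    contig, start, end = region
    return f"{contig}:{start}-{end}"


# Hypothetical toy contigs; real chromosomes are far longer.
regions = split_into_regions({"chr1": 250, "chr2": 100}, chunk_size=100)
intervals = [to_interval_string(r) for r in regions]
```

Each interval would then be passed to a separate variant calling job, with the per-region results merged afterwards.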

These pipelines were tested using the Broad Institute’s Cromwell execution engine, which supports Google Cloud and other batch systems (SLURM and PBS) with minor configuration. Cromwell was configured to run Singularity on HPCs where Docker is not supported. These workflows were successfully run using the Genome in a Bottle data sets on HPCs at the Peter MacCallum Cancer Centre (Rosalind), University of Melbourne (Spartan) and Walter and Eliza Hall Institute of Medical Research (Milton), and additionally in the cloud through the Google Cloud Platform. We have validated the germline variant calls using the best practices established by the Global Alliance for Genomics and Health Benchmarking Team, achieving a recall of 99.25% and precision of 92.02%, identical across each of the environments. In this presentation, we will discuss how Janis addresses the challenge of developing these portable pipelines to work efficiently on multiple HPCs and cloud environments.
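For readers less familiar with the metrics, recall and precision are computed from the numbers of true positive, false positive and false negative calls. The sketch below is a deliberate simplification: it matches variants by exact (chromosome, position, ref, alt) identity, whereas established benchmarking practice handles differences in variant representation, and the variant sets shown are hypothetical.

```python
def benchmark_calls(truth, called):
    """Compare a called variant set against a truth set by exact match."""
    tp = len(truth & called)   # called and in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Hypothetical variants keyed as (chromosome, position, ref, alt).
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
          ("chr2", 50, "G", "A"), ("chr3", 10, "A", "T")}

metrics = benchmark_calls(truth, called)
```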

14. Experience of bioinformatics training at a cancer institute

Maria Doyle, Roxane Legaie, Miriam Manning Yeung, Richard Lupat, Liz Christie, David Ma and Anna Trigos

With the increasing amounts of data produced in biomedical research, training researchers in data skills is both necessary and in high demand. In 2018 we circulated an Expression of Interest around our institute to assess demand for acquiring data skills and received >100 responses. In response, we have so far provided several training courses in R, Python and Galaxy, and places have filled within minutes. For the teaching content, we use material developed by others in the community where possible and adapt it to what we believe is most useful to our researchers. We use short sessions lasting a couple of hours, rather than full-day or multi-day sessions, to give attendees time to digest the information between sessions and to fit the teaching more easily into the schedules of attendees, trainers and helpers. The courses have been well received, and we are iteratively improving what we deliver using the feedback and lessons learned. We will discuss our experience, what we have learnt so far, and future aims.

15. Unlocking data mining capability with Analytic Database: a use case of Molecular Genomics Core operational support

Niko Thio

Vast technological advances have resulted in a surge in demand for genomic sequencing. Within the research core genomic sequencing facility at the Peter MacCallum Cancer Centre, sequencing demand has doubled each year since 2015. Over years of sequencing activity, the various laboratory instruments involved, from sample preparation to sequencing, have generated an immense amount of data, mostly heterogeneous in structure and disparate in storage. This presents a challenge for holistic analysis of the information gained over the years.

In this talk we present MGC Database, an implementation of an analytic database in the Molecular Genomics Core (MGC), the research core genomic sequencing facility. The most prominent feature of this system is that it enables the end-to-end analysis essential for optimising and troubleshooting sequencing protocols. This is achieved through an internal data consolidation service, which allows users to analyse quality control (QC) metrics from laboratory preparation through to sequencing outcome.

MGC Database provides interfaces for two distinct use cases. The first is a dashboard-style web interface, aimed at rapid insight retrieval to support routine operations (e.g. QC monitoring and first-tier troubleshooting). The second is a notebook-style scripting interface, aimed at advanced users performing specialised analyses (e.g. protocol benchmarking and optimisation).

The implementation and deployment of the analytic database is a substantial milestone in the larger context of implementing a scalable analytic platform to support the operational activities of a core research facility.

16. NanoCrest: streamlining special-purpose data analysis of Nanostring assay for clinical trial application

Niko Thio

Methods for gene expression analysis have undergone major advances in biomedical research, allowing biomarker analysis to be incorporated into clinical trial studies. Clinical trial specimens are commonly archival and stored as formalin-fixed paraffin-embedded (FFPE) tissue, for which the Nanostring platform is often favoured because of its high efficiency in gene expression profiling of FFPE samples.

Typical requirements for data analysis delivery in clinical trials include accountability and reproducibility. This implies that every analysis outcome needs to be traceable to the version of the analysis pipeline that was used, and that every repeated analysis must yield the same outcome, regardless of the operator, machine or time point. Delegating these requirements directly to individual operators is prone to human error, and the repetition involved is inefficient to do manually.

We present NanoCrest, a desktop-based software application that streamlines data analysis delivery for Nanostring assays. NanoCrest is designed to run a specialised analysis pipeline with minimal configuration, reducing the human error factor in delivering analysis outcomes. This also allows users with minimal programming skills to operate it and deliver analysis results. NanoCrest tracks the analysis version in each generated analysis to support accountability. The dependency packages required by the analysis pipeline are bundled within the NanoCrest workflow to support reproducibility. The specialised analysis pipeline is typically developed independently of the NanoCrest workflow by bioinformaticians or researchers. Once the analysis pipeline has been tested and has reached a releasable state, it is adapted and incorporated into a NanoCrest analysis module.
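The two requirements, traceability to a pipeline version and identical outcomes on repeated runs, can be illustrated with a small provenance record. This sketch is not NanoCrest's implementation: the version string, the placeholder normalisation step and the record layout are all hypothetical.

```python
import hashlib
import json

PIPELINE_VERSION = "1.2.0"  # hypothetical version label


def run_pipeline(counts):
    """Placeholder deterministic analysis: scale each gene count by the total."""
    total = sum(counts.values())
    return {gene: round(count / total, 6) for gene, count in sorted(counts.items())}


def analysis_record(counts, pipeline_version=PIPELINE_VERSION):
    """Run the analysis and attach the pipeline version and a checksum.

    The checksum covers the input and the result, serialised with sorted
    keys, so the same input always produces the same record.
    """
    result = run_pipeline(counts)
    payload = json.dumps({"input": counts, "result": result}, sort_keys=True)
    return {
        "pipeline_version": pipeline_version,
        "checksum": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "result": result,
    }


# Two runs on the same hypothetical counts, supplied in a different order.
first = analysis_record({"GENE_A": 10, "GENE_B": 30})
second = analysis_record({"GENE_B": 30, "GENE_A": 10})
```

Comparing checksums between runs gives a quick check that a repeated analysis produced the same outcome, while the stored version identifies which pipeline release generated it.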

NanoCrest supports both single-user and multi-user environments. For multi-user environments, version management and deployment are incorporated within the NanoCrest software, which either automatically upgrades the software in the background or alerts the user that a new release is available.

NanoCrest has been actively used to support two clinical studies at the Peter MacCallum Cancer Centre. The first is the BEACON clinical trial, where NanoCrest is used in patient screening for the C1 subtype of high-grade serous ovarian cancer.

The second is the SUPER study, where NanoCrest is used to generate primary site predictions for patients diagnosed with cancer of unknown primary.

17. Removing unwanted variation from TCGA RNA-seq data

Ramyar Molania, Johann A Gagnon-Bartsch, Momeneh Foroutan, Antony T Papenfuss, Alexander Dobrovic and Terence P Speed

The Cancer Genome Atlas (TCGA) consortium assessed a large cohort of breast cancers by RNA sequencing over a span of 5 years (2010-2014). To generate the data, fresh frozen samples were collected from many institutes and allocated to different batches and processed at multiple time points. All these elements can cause batch effects that may compromise the integration and accurate interpretation of the data.

Importantly, the TCGA consortium changed flow cell chemistry in 2012. We identified a substantial batch effect in the TCGA breast cancer normalized RNA-seqV2 data set that was introduced by this change. We demonstrated that the unwanted variation introduced by this batch effect affected downstream analyses such as the identification of co-expressed genes and the comparison of paired primary and metastatic samples. We proposed an approach based on our recently developed normalization method, RUV-III, to remove this batch effect. In the absence of true replicates, we used pseudo-technical replicates to remove batch effects.

We used a range of statistical tools including RLE plots and principal component analysis, as well as biological positive controls to assess how effectively batch effects were removed and biological heterogeneity was preserved by RUV-III. We demonstrated that RUV-III normalization led to accurate estimates of gene co-expression and more precise classification of PAM50 breast cancer intrinsic subtypes.
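As an illustration of one of the diagnostics mentioned, relative log expression (RLE) values are obtained by subtracting each gene's median log expression across samples; in well-normalised data the per-sample medians of these deviations sit close to zero. The sketch below uses hypothetical gene names and values and is not the authors' code.

```python
from statistics import median


def relative_log_expression(log_expr):
    """Subtract each gene's across-sample median from its log expression.

    log_expr: dict mapping gene -> list of log expression values, one per sample.
    """
    deviations = {}
    for gene, values in log_expr.items():
        gene_median = median(values)
        deviations[gene] = [v - gene_median for v in values]
    return deviations


def sample_rle_medians(log_expr):
    """Median RLE per sample; values far from zero suggest unwanted variation."""
    deviations = relative_log_expression(log_expr)
    n_samples = len(next(iter(deviations.values())))
    return [median(d[i] for d in deviations.values()) for i in range(n_samples)]


# Hypothetical log expression for three genes in three samples; the third
# sample is shifted upwards for every gene, mimicking a batch effect.
log_expr = {
    "gene_1": [5.0, 5.0, 7.0],
    "gene_2": [3.0, 3.0, 5.0],
    "gene_3": [8.0, 8.0, 10.0],
}
medians = sample_rle_medians(log_expr)
```

In an RLE plot these per-sample distributions are drawn as boxplots, so a shifted sample such as the third one here stands out immediately.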

In summary, the use of RUV-III based on pseudo-technical replicates and suitably chosen negative control genes can lead to satisfactory normalization of RNA-seq data where current normalization methods exhibit shortcomings.

18. A bioinformatics platform for translating cancer genomics to the clinic, from fastq to clinical trial matching and reporting

John Grady, Mark Cowley and David Thomas

Identification of genetic alterations in patient tumours has the potential to substantially improve patient care through more precise diagnosis and identification of optimal therapies. We established the Molecular Screening and Therapeutics (MoST) program to bring genomics-led therapeutics to patients with rare and advanced adult cancers. To date we have screened and reported over 1000 patients using several gene capture panels, currently the tumour-only Illumina TST170 targeted gene sequencing panel with DNA and RNA from FFPE samples. In the course of this we have developed a modular and automated bioinformatics pipeline to facilitate hands-off interpretation, discussion, and reporting of patient genomic tumour profiles and genomically matched therapies.

The system comprises a workflow manager (Refynr2) driving a modular cloud-based workflow on DNAnexus to produce interim outputs from fastq inputs (e.g. variant VCFs, copy number and structural variants, RNA fusion calls). This required solutions to numerous difficult bioinformatics problems, including tumour purity estimation, determination of somatic/germline status, copy number variant calling from targeted gene panels, and gene fusion detection. We developed the final module (Gentian) to integrate and filter the genomic, clinical and other data sources (including clinical trials and drug therapies) and to identify critical genomic alterations in each patient, producing a tumour landscape report for discussion at a molecular tumour board (MTB). Gentian also integrates with clinical patient management systems (Progeny, and JIRA for case tracking) for pre- and post-MTB patient management and reporting.

The system has been designed flexibly so that it can integrate with other sources of genomic data (not restricted to gene capture panels) and with other databases and patient management systems.

This system has enabled us to upscale our patient throughput by an order of magnitude by reducing the human input from several hours per patient to minutes or less, whilst also standardising the analytics and facilitating cohort analysis and research.

19. The Australian Bioinformatics Commons Paediatric Cancer Pathfinder Project: harmonised analysis of geographically separated and jurisdictionally protected data resources

Andrew Lonie, Mark Cowley, Allison Heath, Steven Manos, Maely Gauthier, Marie Wong-Erasmus, Jack DiGiovanna, Michele Mattioni, Adam Resnick, Paul Coddington, Brian Davis, Chris Myers and Frankie Stevens

Large-scale cancer WGS, RNA-Seq and methylome analyses have made a substantial impact on our understanding of many cancers, including their aetiology, identifying disease subtypes, novel pathways and new drug targets. While there are a number of extensive genomic cancer research programs globally, most focus on adult cancer; however, as all high-risk paediatric cancer subtypes are rare diseases, statistically significant correlation between subtype and genomic variation is inherently dependent on large sample numbers. With only ~200 new cases of high-risk paediatric cancer in Australia per year, it is imperative that we aggregate Australian data with global data to understand and develop strategies to effectively treat high-risk childhood cancer.

The ZERO Childhood Cancer program (ZERO), led by the Children’s Cancer Institute in partnership with Kids Cancer Centre at Sydney Children’s Hospital, aims to recruit 400 children on the National Clinical Trial by September 2020, in addition to the 58 children recruited on the pilot study (a total of 260 patients enrolled to date), applying deep whole genome (WGS) and transcriptome (RNA-Seq) sequencing, and methylome profiling to obtain a multi-dimensional molecular portrait of each child’s cancer. The Gabriella Miller Kids First Paediatric Research Program (Kids First) is a global-scale National Institutes of Health initiative devoted to exploring and analyzing genetic predisposition and/or somatic association within various childhood cancers and structural birth defects. Data from approximately 8,000 DNA and RNA samples from children affected with cancer or structural birth defects and their families are ready for analysis now, and the resource is expected to grow to more than 30,000 over the next few years. Kids First analyses are built on the CAVATICA platform - a mature, highly capable and widely used genomics analysis platform currently underpinning the Kids First Data Resource, developed and supported by Seven Bridges Genomics.

A collaborative partnership between ZERO, Kids First, Seven Bridges Genomics, and the Australian Bioinformatics Commons (a $2.5m joint research infrastructure program funded by Bioplatforms Australia, the Australian Research Data Commons, and AARNet) aims to establish internationally federated computational infrastructure that will enable the harmonisation of ZERO Australian paediatric cancer data with the extensive genomic datasets from Kids First.

The approach in the first instance is to extend the existing CAVATICA platform to enable efficient harmonised analyses across geographically separated and jurisdictionally protected data resources, leveraging commercial cloud standards. The extended CAVATICA orchestration engine will allow ZERO and Kids First workflows and analysis tools to be used interchangeably and seamlessly across both datasets, effectively aggregating the separate datasets into a single virtual pan-continental dataset from the researcher’s perspective, highly accessible through a global best practice analysis platform.

By pooling these data, we will have the power to identify rare brain cancer subtypes. We propose to harmonise the brain cancer transcriptome data from ZERO (n=160) and the Kids First PBTA/CBTTC project (n=921), and to identify phenotypically distinct novel brain tumour subtypes.

We anticipate this project will highlight the potential for cross-cloud, international data harmonisation and integration for discovery, and clinical benefit.

20. Gene Fusions in Granulosa Cell Tumours of the Ovary

Maria Alexiadis, Kirill Tsyganov, Simon Chu, David Powell and Peter Fuller

Adult granulosa cell tumours of the ovary (aGCT) are a unique subset of malignant ovarian tumours defined by the presence of the C134W somatic mutation in the FOXL2 gene. Although aGCT are generally regarded as having a good prognosis, late recurrences occur, which usually lead to the patient’s demise. Neither reliable methods of predicting relapse nor the molecular mechanisms of relapse or aggressive behaviour are known. aGCT display few of the conventional drivers of tumorigenesis: no genes have been found to be recurrently mutated beyond FOXL2 and TERT (1). To complete the genomic landscape of aGCT, we sought to establish whether expressed fusion genes (translocations) contribute to its pathogenesis.

RNA expression analysis has confirmed the changes seen in our previous microarray transcriptomic analysis (Alexiadis et al 2016). JAFFA predicted a mean of 606 fusion events per sample (of which a mean of 11 per sample were high confidence) and Arriba predicted a mean of 19 fusion events per sample (of which a mean of 5 per sample were high confidence). One translocation was detected by both JAFFA and Arriba and was present in all samples.
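Finding fusions supported by both callers and present in every sample reduces to two set intersections. The sketch below assumes each caller's output has already been parsed into per-sample sets of (5' gene, 3' gene) pairs; the parsing step, function names and gene names are hypothetical placeholders, not results from this study.

```python
def shared_calls(calls_a, calls_b):
    """Per-sample fusions reported by both callers."""
    return {sample: calls_a[sample] & calls_b.get(sample, set()) for sample in calls_a}


def present_in_all_samples(per_sample_calls):
    """Fusions found in every sample."""
    call_sets = list(per_sample_calls.values())
    return set.intersection(*call_sets) if call_sets else set()


# Hypothetical parsed calls for two samples from two fusion callers.
caller_1 = {
    "sample_1": {("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")},
    "sample_2": {("GENE_A", "GENE_B"), ("GENE_E", "GENE_F")},
}
caller_2 = {
    "sample_1": {("GENE_A", "GENE_B")},
    "sample_2": {("GENE_A", "GENE_B"), ("GENE_G", "GENE_H")},
}

consensus = shared_calls(caller_1, caller_2)
recurrent = present_in_all_samples(consensus)
```

Matching on ordered gene pairs alone ignores breakpoint positions, so in practice a stricter key may be needed to treat two calls as the same event.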

The identification of aberrant transcripts in aGCT suggests that they may play a role in its pathogenesis.

  1. Alexiadis et al. Molecular Cancer Research 17: 177, 2019.
  2. Tsyganov et al. J. Open Source Software, 2018.

21. Development of Prostate Cancer Database: Integration of clinical, experimental and genomic data of prostate cancer cohorts

Shivakumar Keerthikumar and David Goode

Goal: To develop a web-based, user-friendly interface for annotation, storage, retrieval and analysis of clinical, experimental, genomic and proteomic data from a diverse set of advanced prostate cancers. The resource combines data collected from patient samples with data from cell lines, Patient Derived Xenografts (PDXs) and organoids derived from these samples, to facilitate data sharing and collaboration between clinicians and scientists.

Progress: The resource has been implemented as an object-oriented database using Zope as a front-end web application server, connected to a patient-centric MySQL database as the back-end data storage system, all coded in Python. We are building annotation tools into the resource to allow users to add, edit and update clinical and experimental details, as well as to search, download and visualize sample-derived high-throughput genomics data. The database is currently hosted on a Nectar Cloud server, with secured access limited to internal research purposes.

Significance: We are planning to expand the database to incorporate a broader array of tumor samples from the Melbourne Urological Alliance (MURAL) project, which was initiated in 2017 by the Peter Mac and Monash University. We believe this cancer resource will serve as a discovery tool for elucidating the molecular basis of therapy resistance in prostate cancer, leading to the development of novel therapeutic interventions to improve the diagnosis and prognosis of prostate cancer patients.