Peter Laird, University of Southern California, USA
70% of CpGs methylated. In a cancer cell a number of changes. Widespread (rather than global) hypomethylation and focal CGI hypermethylation in cancer, frequently in promoter regions. What is the relationship between those two discordant events?
125 colorectal adenocarcinoma genomes profiled for methylation, plus 29 adjacent normals. Clusters nicely, but are those clusters associated with clinical features? They align with BRAF mutation status, for one (but mutations do not induce the methylation status). Another distinct CpG-island methylator phenotype, CIMP+; micro-satellite unstable subtypes also overlap (if incompletely). Can be used to generate epigenetic subtypes of colorectal cancer, applied to TCGA with the same outcome.
Synergy between cancer genetics and epigenetics. ES-Cell polycomb repressor complex targets are prone to abnormal DNA methylation in cancer (polycomb keeps master regulators of differentiation in poised state). Large number of promoters acquire methylation in cancers compared to matched normals, enrichment of polycomb targets among those cancer-associated methylated targets. Same genes also acquire increased methylation with age.
Polycomb crosstalk likely leads to cumulative stochastic methylation. Loss of polycomb and replacement with a more permanent silencing method (i.e., methylation) means target no longer able to differentiate, gets stuck in stem cell state without full differentiation capability. Great target to accumulate additional mutations over time. Potentially not an active process in cancer, but a hallmark of the event that led to the cancer development. Not a competitive advantage to develop a cancer, but a passenger event indicating that the progenitor cell could no longer differentiate.
Would explain DNA methylation of ~50% of cancer-specific methylated genes, consistent with stem cell-like behaviour of cancer cells, explains observation of epigenetic field effects adjacent to tumors: a differentiation block.
Focal hypermethylation and long-range hypomethylation: WG bisulfite sequencing of primary tumors and normal tissues. Shows sample CpG site in normal/cancer, striking difference of signal. Zooming out to ~20kb windows, nice heatmap/scatterplot of methylation signal shows partially methylated domains. Shows erosion of methylation pattern in window from ES cells to differentiated colon to cancer. Hypomethylation not uniformly distributed, clear windows. Epigenetically unstable regions close to the nuclear periphery (late replicating regions, lamin attachment regions).
Comparison across cancer types: holds up across multiple cancers studied, with individual differences that should be interesting to tease apart. Comparison of 2200 TCGA cancers and 409 normals reveals cancer-specific profiles in unsupervised clustering and pairwise correlation analysis of all cancer types using all sites.
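[A minimal sketch of the "zooming out" step, assuming per-CpG methylation fractions (0 to 1) are already available from the bisulfite data. The function name, input layout and the example values are my own assumptions, not the speaker's pipeline; it simply averages CpG methylation in fixed 20kb bins so that domains of reduced methylation become visible.]

```python
from collections import defaultdict

def window_means(cpgs, window=20_000):
    """Average methylation per fixed-size window.

    cpgs: iterable of (position, beta) pairs on one chromosome,
          beta being the methylated fraction (0..1) at that CpG.
    Returns {window_start: mean_beta}, ordered by window start.
    """
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [sum of beta, CpG count]
    for pos, beta in cpgs:
        start = (pos // window) * window
        sums[start][0] += beta
        sums[start][1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}

# Hypothetical values: a well-methylated window followed by a partially methylated one.
example = [(100, 0.9), (15_000, 0.7), (25_000, 0.2), (39_999, 0.4)]
profile = window_means(example)
```

Comparing such window profiles between normal and tumor samples is one simple way to see where methylation erodes; real analyses would also need coverage filters.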
Peter Park, Harvard Medical School, USA
45% of human genome derived from transposable elements (TEs), able to replicate (copy/paste) across the genome via an RNA intermediate. Previous studies of TEs mostly in germline, >7000 insertions in 185 samples from the 1000G set. Implicated in single gene diseases with ~100 insertions reported so far (L1s, ALUs, others), e.g., neurofibromatosis type 1 contains 18 retrotransposon insertions. Some events found in cancer, for example L1 in MYC — but discovered in low-throughput studies.
Studied 43 cancer/matched normal genomes (150 billion reads) from TCGA (GBM, ovarian, colorectal, prostate, myeloma). Thousands of novel germline insertions, 194 high confidence somatic insertions identified (majority L1s).
Detecting events challenging. Numerous, often identical TE instances. Find cluster of read pairs where one end maps uniquely, other end maps to TE consensus sequence. Use of clipped reads (partially aligned reads) key. Added custom assembly of repeat elements to the genome to be able to find repeat families at once. Tool called Tea (transposable element analyzer); insertions validated by Sanger. All somatic L1/Alu insertions in cancers of epithelial cell origin, none in blood or brain cancers (Ouch: postdoc started with those tissues by chance).
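[To make the read-pair idea concrete, a rough sketch. This is not Tea's implementation; the function name, the 500 bp window and the minimum support of 3 pairs are assumptions for illustration. Input is one tuple per read pair in which one end maps uniquely to the genome and the mate maps to a TE consensus; nearby anchors pointing at the same TE family are grouped into candidate insertions.]

```python
from collections import defaultdict

def cluster_te_anchors(pairs, window=500, min_support=3):
    """Group uniquely mapped anchor reads whose mates hit a TE consensus.

    pairs: iterable of (chrom, pos, te_family), one per read pair where one
           end maps uniquely at chrom:pos and the other end maps to te_family.
    Returns a list of (chrom, start, end, te_family, n_pairs) candidates.
    """
    by_key = defaultdict(list)
    for chrom, pos, family in pairs:
        by_key[(chrom, family)].append(pos)

    candidates = []
    for (chrom, family), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= window:
                count += 1          # still within the current cluster
            else:
                if count >= min_support:
                    candidates.append((chrom, start, prev, family, count))
                start, count = pos, 1  # open a new cluster
            prev = pos
        if count >= min_support:
            candidates.append((chrom, start, prev, family, count))
    return candidates

# Hypothetical anchors: three L1-linked pairs close together, two stray pairs.
pairs = [("chr1", 1000, "L1"), ("chr1", 1100, "L1"), ("chr1", 1250, "L1"),
         ("chr1", 9000, "L1"), ("chr2", 500, "Alu")]
calls = cluster_te_anchors(pairs)
```

Clipped reads, which the talk calls key, would then be used to pin down the exact breakpoint inside each candidate interval; that step is not shown here.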
64 out of 194 insertions located in genes, including tumor suppressors (UTRs and introns). Somatic insertions tend to occur in genes commonly mutated in cancer, disrupt expression levels of target genes, biased towards regions of cancer-specific DNA hypomethylation (all statistically significant).
Unclear where and how often these happen
Quick shout out to Galaxy and Peter’s Refinery system, a data repository connected to the Galaxy backend for data analysis.
Matthew Ellis - WUSTL
Need for controlled screens rather than randomly sampling frozen samples from patients treated in different ways as part of a broad mutation discovery program. Relate cancer phenotype and genotype through recurrence and validation efforts, identify druggable matches, validation studies and clinical trials.
Summary of a breast cancer clinical trial (3 drugs for ER+ patients in whom the required degree of surgery is not feasible; smaller surgery possible if tumor shrinks in size). Use samples of well-controlled trial for sequencing analysis. Split discovery set of 50 cases in half based on whether a drug worked or didn’t, 2 samples each, 70+% tumor cell content, compare to germline, again tiered annotation of somatic mutations. Prioritized list of genes, familiar and novel [not published yet and asks not to share examples].
Knockdown of target genes (with and without estrogen depletion) and tracing apoptosis to zoom in on candidates for therapy followed by pharmacological targeting of pathways (PI3K in this case); study sensitivity versus mutation status. Genome-forward trial with upfront sequencing for targets, mutation status. Adjusting trials difficult for loss-of-function mutations.
[He goes through a number of additional genes and pathways, druggable status, etc., but I’m unclear on what can or cannot be blogged. Erring on the side of caution here.]
Extend from SNPs to areas of gene amplifications with potential targets, preferably drivers of higher prevalence rates (putative resistance genes). Yields even more druggable targets (‘lots of ‘em’).
Forward: do genome upfront, all patients then have a theoretical target and a realistic chance of responding to the drug.
PSCA variant identified in bladder cancer GWAS as a risk factor, also associated with kidney cancer. PSCA is a GPI-anchored signaling protein. SNP in the first exon, missense, generates alternative start codon upstream of regular start site. Patients with the risk allele show increased PSCA expression. Addition of nine amino acids likely to affect leader peptide and surface expression; confirmed through confocal microscopy, FACS and immunohistochemistry. T risk allele associated with increased RNA and surface expression. Patients might be eligible for PSCA personalized treatment.
Single DNA molecules stretched out, restriction enzyme added which cuts at specific sites, image analysis of the cut positions to generate an ‘optical’ restriction map [lovely microscopy image]. Fragments with about 10% error, 2k additive errors (fragments as large as 2kb removed, incomplete digests, etc.). Align in silico map to optical map to identify where in the genome a fragment belongs.
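[A toy sketch of aligning an in silico map to an optical map, under strong simplifying assumptions of mine: the cut is placed at the start of the recognition site, no fragments are missing, there are no incomplete digests, and sizing error is a flat relative tolerance. Function names and the example sequence are hypothetical.]

```python
def in_silico_map(sequence, site):
    """Fragment lengths after cutting `sequence` at every occurrence of `site`.

    Simplification: the cut is placed at the first base of the site.
    """
    cuts = []
    i = sequence.find(site)
    while i != -1:
        cuts.append(i)
        i = sequence.find(site, i + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments (site at the very start of the sequence).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

def place_fragments(optical, reference, tolerance=0.10):
    """Offsets in `reference` where every consecutive optical fragment length
    agrees with the in silico length within the relative tolerance."""
    hits = []
    for offset in range(len(reference) - len(optical) + 1):
        window = reference[offset:offset + len(optical)]
        if all(abs(o - r) <= tolerance * r for o, r in zip(optical, window)):
            hits.append(offset)
    return hits

# Tiny example with the EcoRI site GAATTC.
reference_map = in_silico_map("AAAAGAATTCCCCCCGAATTCTT", "GAATTC")
positions = place_fragments([10, 8], reference_map)
```

Handling the dropped small fragments and incomplete digests mentioned above would need a gapped (dynamic programming) alignment rather than this exact sliding comparison.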
Typical genome assembly constructed from de Bruijn graph, construct contigs, connect with mate pairs. Use optical maps instead of mate pairs to eliminate incorrect paths through the graph. Test for bacterial genomes in simulations with different levels of noise, restriction enzymes. Correctness measured by longest common subsequence between reference genome and algorithm results. Test on ~350 bacterial genomes with additional simulated optical map [uh…] and a full experiment for Yersinia pestis. At 1% error rate 99% of sequence reconstructed correctly; drops to 86% for real optical map, indicating an error rate of >10%.
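[For reference, the longest common subsequence measure can be sketched as below. This is the textbook quadratic dynamic program, fine for toy strings but not what one would run on whole genomes; how the speaker actually computed it was not stated, and `fraction_reconstructed` is my own naming for the resulting score.]

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)          # DP row for the previous element of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def fraction_reconstructed(reference, assembly):
    """Share of the reference recovered, in order, by the assembly."""
    if not reference:
        return 0.0
    return lcs_length(reference, assembly) / len(reference)

# Toy example: the assembly misses one base of an 8-base reference.
score = fraction_reconstructed("ACGTACGT", "ACGACGT")
```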
Simulated data on ‘average’ bacterial genomes looks promising even at similar error rates.
Peter Campbell - The Wellcome Trust Sanger Institute, UK
Exome sequencing of myeloid malignancies (MDS); progression defined by changes in proliferation and partial differentiation blocks (neoplasms - syndromes - AML). Myelodysplastic syndromes under-diagnosed as they require a bone marrow biopsy, age of onset is high (70+). Clonal stem cell disease with aberrant differentiation, enhanced self renewal; acquired somatic mutation in hematopoietic stem cells likely cause. List of MDS driver genes by now fairly long, most prevalent is TET2 present in around 20%. Majority of genes infrequently mutated in 1-5% of patients. Classification in flux and constantly updated.
MDS exome sequencing study in nine patients, 64 somatic mutations [no details on the bioinformatics workflow, unfortunately; would love to hear how they screen / filter candidates], excess of non-synonymous mutations. Mutations differ in ‘allele burden’ (frequency of reads with the mutation from 5-80%). In 7/9 patients the genomic screen hits known driver genes, so they could be diagnosed by a blood screen. Distribution of mutations within Sf3b1 with the usual hotspots, all heterozygous, knock-down in zebrafish lethal. Hematopoietic cells still being formed, but don’t initiate the expected developmental program.
Gene expression profiling in patients (CD44+, ~50 patients) to identify pathways associated with Sf3b1 expression. Enrichment in patients with mutations indicates down-regulation of aspects of mitochondrial function. Gene expression change precedes cancer development. Follow-up with exon-specific arrays on 12 patients, only 20 genes show differences in splicing patterns (at 5% FDR); example of ACACA. Unclear how the mutations in Sf3b1 affect splicing in patients, might require more samples. Genotype/phenotype correlation somewhat significant for platelet and white cell counts.
Sam Aparicio - BC Cancer Agency, Vancouver
Intro to the classification of breast cancer subtypes. Therapeutic results improved, but far from a solved problem, particularly for triple negative subtypes. Classification based on the cancer's ER status (estrogen receptor) and HER2 status; can also be classified into subgroups by expression signatures such as PAM50.
Large scale chromosomal events, copy number changes and single mutations prevalent (>5% of cases), other variations less frequent. METABRIC consortium with SNP and expression arrays from ~2200 breast cancer samples allows a survey of the variation landscape, highlights the expected landmarks, e.g., 15% of samples with ErbB2 mutation. Still staggering diversity, for example, copy number variations across all samples cover 70% of the genome.
CNV (germline), CNA (somatic), SNP (germline, single nucleotide). Expression relationships in cis present, again including ErbB2, MAP2K4 with gain or loss of expression that is clearly correlated to a specific CNA. Relationships in trans can also be detected using the same statistical analysis (example here is loss of PTEN and change in target genes). In a cohort of 1000 samples about 30% of genes are implicated in CNA in cis expression variation. Tracing extreme outliers using the distribution of expression changes to CNA defines about 25 hotspot regions of the genome enriched for known driver genes, but also identifies novel candidates such as the AIM1 locus. Resolution also allows identification of hotspots of deletion (PTEN, others) which might include likely tumor suppressors. Some hotspots specific to breast cancer subtypes; again, analysis only possible due to large sample size.
Distinct associations also present in trans loci that link CNA events with expression changes at other loci with relevant pathway annotations [nice plot of gene expression changes on one axis vs segmental copy numbers on the other].
More of an art than science. Ab initio clustering of 1000 patients using integrative clustering methods on the CNA events and expression changes. Identifies ErbB2 (positive control); at least 10 stable groupings when run genome-wide including basal cancers. Two new groups of interest:
4, genomically quiescent tumours. 17% of the overall population
chr11group, second highest hazard in ER-positive patients in 15 year survival plots
Lack ER, PR, HER2 amplification; at least two distinct subtypes (basal, non-basal) based on expression profiles, likely more subtypes. 104 TNBC samples through WG, exome and transcriptome sequencing; workflow and validation similar to what Elaine describes.
Split cancers in basal/non-basal subtypes based on PAM50, verified by clustering of transcriptome results (roughly). Alternatively expressed gene networks identify cell-shape related genes, motility, actin dynamics. INPP4B isoforms (phosphatase) as an example (surrogate for PTEN?). Differences basal/non-basal can be traced to different gene isoforms, catalytic subunit driven by a different transcription start site that is missing in tumor samples. Not a cancer-specific change; long isoform expressed in normal basal myoepithelium, not in luminal. Property of the epithelial cells. Similar patterns in other genes (MYO6).
Driver mutations can be sub-clonal, similar to previous talk identified by allele frequencies; many mutations can be grouped by function into modules. Less than half of somatic mutations are expressed in RNA.
Elaine Mardis - Washington University in St Louis, USA
[Elaine talks fast as always, slides full of text. Will try to capture at least a part of it. @bioinfosm found almost the same slides from the talk online (pdf)]
Genome evolution in liquid tumors (since we have a number of breast cancer talks later today already), AML and relapse. Since the 70s understood cancer as a disease of the genome, Rowley’s work on studying cancer chromosomes microscopically. Current work on correlative cancer genomics requires an understanding of the history of the patient’s cancer progression by studying the prevalence of mutations in a tumor sample (which is a collection of genomes in a tumor block). Applied to end-stage or difficult to diagnose patients to aid treatment decisions.
Human genome just serves as a reference; all work here completely depends on the human genome project achievements. Standard workflow and battery of tools to detect somatic changes (SNPs, CNVs, Insertions/Deletions, Inversions, Translocations, …) applied to normal and tumor matched samples sequenced at around 30X, paired-end Illumina. Assign tiers to alterations (coding / 2%, conserved / 8%, unique / 40%, other / 50%). Important validation step through comparison of variants to SNP arrays and custom probes for each [!] putative variant site for custom array-based capture validation, 1000-10,000 target regions for each chip. Combine multiple sample analysis, validate through deep seq counts (~ 1000X of variant sites).
Oldest mutations present in virtually every cell, heterozygous mutations should be around 50% (or 500 reads at average coverage), newer mutations are going to be a mix of wild type cells for that specific mutation and mutated cells. Sort by mutant allele frequency highlights the progression of gene mutations. Identify tumor subpopulations by grouping clusters of mutations with similar allele frequencies, easy to spot in a graph.
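[The allele-frequency arithmetic and grouping can be sketched as follows. This is an illustration only: the gap-based grouping is a stand-in of mine for whatever clustering the group really uses, it ignores copy number and tumor purity, and the mutation names and read counts are made up.]

```python
def expected_variant_reads(depth, cell_fraction, copies_mutated=1, ploidy=2):
    """Expected number of variant-supporting reads at a site.

    A heterozygous mutation present in every cell (cell_fraction=1.0)
    sits at 50% of reads, i.e. about 500 reads at 1000X.
    """
    return depth * cell_fraction * copies_mutated / ploidy

def cluster_by_vaf(mutations, max_gap=0.05):
    """Group mutations with similar variant allele frequency (VAF).

    mutations: {name: (variant_reads, total_reads)}
    A new cluster starts whenever the sorted VAFs jump by more than max_gap.
    Returns clusters ordered from lowest to highest VAF.
    """
    vafs = sorted((alt / total, name) for name, (alt, total) in mutations.items())
    clusters = []
    last = None
    for vaf, name in vafs:
        if last is None or vaf - last > max_gap:
            clusters.append([])
        clusters[-1].append(name)
        last = vaf
    return clusters

# Hypothetical deep-sequencing counts at ~1000X.
counts = {"mutA": (480, 1000), "mutB": (510, 1000),
          "mutC": (210, 1000), "mutD": (190, 1000), "mutE": (50, 1000)}
groups = cluster_by_vaf(counts)
```

In this toy data the highest-VAF group would correspond to the founding clone and the lower groups to later subclones.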
Plot sites and their frequencies in de novo tumor and relapse samples makes it easy to identify single dominant clones or subpopulations with multiple clones only present in the relapse group. Can also identify genes no longer present in the relapse genome; those likely dropped out during the therapy and are no longer present.
Complexity of genome alterations increases with time as more subpopulations arise at different levels of prevalence. Can be used to identify biomarkers and driver populations by highlighting known / clinically relevant genes in the different populations. These can and do change between de novo samples and relapse, allowing for targeted therapies.
[Great graph that shows disease progression over time, following subclone diversity and frequencies; typical population bottleneck during therapy followed by expansion of a single clone that carries through chemo.]
Tumor heterogeneity varies wildly across patients; information can be used to model tumor progression and relapse depending on whether one or multiple clones carry through therapy. Work on 8 patients allows definition of clonal evolution patterns through relapse. Both models invoke clonal evolution and addition of new mutations, likely through the chemotherapy itself.
Myelodysplastic syndromes share dysplasia, ineffective hematopoiesis, 1/3 progress to AML not curable by chemotherapy. Progression from background mutation to initiating and progressing mutations. Patients followed in the clinic from MDS to sAML; also have normal skin to provide germline information. 30x WGS on normal vs sAML samples (not MDS), identify somatic events in secondary phase, validate by deep sequencing. Repeat with flow-sorted MDS samples for validated sites, check for recurrence in 150 paired MDS/sAML samples. 15 pairs completed so far.
Again, plot mutation age in AML samples; MDS mutations overlap with the ‘earlier’ mutations but are not present in the later, sAML-specific mutations. Frequencies can be clustered:
Frequencies differ somewhat for each patient (different number of clusters, too), but can aid in the identification of dominant and minor subclones. sAML is multi-clonal (oligoclonal), different progression mutation in each patient resulting in a mosaic of genomes likely responsible for the poor treatment results of these patients.
Lincoln Stein - Ontario Institute for Cancer Research, Canada
Cancer as a genomic disease, coupled with the underlying stochastic processes: every cancer genome is different, less than ideal for a one-size-fits-all treatment regimen. Few notable targeted therapies reported so far (Herceptin for HER2 amplifications in breast cancer, Erlotinib for EGFR mutations).
Huge scope, no country can do all the required data analysis alone. Allows coverage even of less frequent cancers, but requires standardization of QC, formats, dissemination. Data is distributed, but accessible through a common portal: interpreted data through a federated database system. Only really tractable on the level of interpreted data rather than the raw reads.
50 different cancer types, 500 donors per type, sample tumor and normal tissue (blood or adjacent tissue), test for differences in the equivalent of 50,000 human genome projects.
Discovering patterns and mechanisms of altered genes in cancer. Difficult due to a ‘long tail’ of cancer genes, very few genes consistently involved. Mutations affect pathways which can be knocked out in multiple different ways. Using the Functional Interaction Network (part of Reactome) to analyze these dependencies. Gene lists interpreted using a simple protocol:
[Part of our standard gene list annotation system by now; highly recommend the FI modules in addition to a standard pathway or GSEA analysis!]
Example of 310 genes from just 5 ovarian cancer patients with non-synonymous mutations in the cancer specimens. Emerging patterns provide links to KRAS, Wnt/Cadherin, MAPK, Hedgehog and other known cancer-related pathways. Can be enriched [and potentially stabilized] as sample size increases.
Can also be used to derive prognostic signatures by running PCA on the gene expression profiles of relevant modules only rather than the gene level. Example uses a re-analysis of survival curves in public breast cancer GEO data sets (within the same study). Transfers well to four independent breast cancer series. One of the key benefits: the signature has a meaning (here: kinetochore and aurora-b signalling, hallmarks of rapidly dividing cells).
Apply genomics in a clinical setting at OICR. Clinical trials of targeted agents have been inefficient so far; multiple genotyping procedures to identify suitable trials. A more rational process checks (all) mutations of patients who failed standard chemotherapy prior to enrollment in relevant clinical trials. Currently limited to targeted sequencing of ~20 genes with existing trials or prognostic value / clinical implications.
Demonstrate feasibility and optimize process to get usable results back to patients and clinic quickly (three weeks or less). Using single molecule sequencing (PacBio) due to the short turnaround time. Created a workflow to turn raw gene sequence data into a usable clinical report, using a knowledgebase of common mutations and their clinical consequences. Includes information for ~200 genes obtained from local oncologists, ~800 from MSKCC, COSMIC and other sources. Used to generate a preliminary report for an expert panel that meets weekly, reviews report and requests changes or makes suggestions. Reports go back into the knowledgebase.
Draft physician report generated by the workflow system, combined with an online tracking system that handles patient consent, samples, genomic files and interfaces with a standard clinical trial informatics system. Eventually open source so the framework can be re-used at other sites.
Mike Schatz (Cold Spring Harbor Laboratory, USA)
Provides an overview of genetics and genomics, from Mendel to more recent ‘milestones’ and the increase in sequencing output, and its application to large scale projects (TCGA, 10k, Encode, Human Microbiome project, etc.). Walks through methods (alignment and variation, quantitative sample comparisons, de novo assembly). A worldwide effort, followed by the by now standard data tsunami statement and the growth charts — with a nod to the Pennisi article (Science 2011, “Will computers crash genomics?”).
Beyond the genome means that technology will continue to push the frontier forward (higher throughput, different types of data). Resulting demands for making sense of digital observations:
Rest of the talk focuses on the observation side of things. Overcome problems by co-development of protocols and computational methods, modeling error types, filtering. Strong need to overcome computational demands, though, as sensor improvements can overwhelm computational resources. Couple of parallelization options:
Storage and transfer limitations another avenue for research; compression, smarter protocols and parallel file systems with tiered storage partially address these problems.
Jnomics introduction, a new tool for cloud scale genomics (Matt Titmus as lead author). Rapid execution of parallel pipelines, format agnostic (BAM/SAM, BED, FASTQ), sorts / merges / filters, aligns, runs variation analysis, etc. Sample project of structural variation in esophageal cancer using Hydra discordant pair analysis (Aaron Quinlan) to identify unusual read pair distances and orientation to detect breakpoints, inversions. Workflow with multiple alignment steps (BWA, filter BWA, re-align with Novoalign, filter, re-align with Novoalign again) to capture difficult to map read pairs. Jnomics parallelizes alignment steps across many different nodes.
Detailed analysis still in progress, difficult to distinguish real signal from noise. Circos maps (tumor specific vs normal/tumor variation) at different thresholds.
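[What "unusual read pair distances and orientation" means can be sketched with a generic classifier. To be clear, this is not Hydra's algorithm or Jnomics code; the function name, labels and the 3-standard-deviation cutoff are assumptions, it only handles pairs on the same chromosome, and it assumes a standard paired-end library where the leftmost read is on the + strand and its mate on the - strand.]

```python
def classify_pair(pos1, strand1, pos2, strand2, mean_insert, sd_insert, n_sd=3):
    """Label a same-chromosome read pair as concordant or as a discordant type.

    Positions are leftmost mapping coordinates; strands are "+" or "-".
    Span is measured between the two start positions (read length ignored).
    """
    # Order the two reads by genomic position.
    if pos1 > pos2:
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    span = pos2 - pos1

    if strand1 == strand2:
        return "inversion_candidate"      # both reads on the same strand
    if strand1 == "-" and strand2 == "+":
        return "everted_candidate"        # reads point away from each other
    if span > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"       # mates further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion_candidate"      # mates closer than expected
    return "concordant"

# Hypothetical library: 300 bp mean insert size, 30 bp standard deviation.
labels = [
    classify_pair(1000, "+", 1300, "-", 300, 30),
    classify_pair(1000, "+", 5000, "-", 300, 30),
    classify_pair(1000, "+", 1300, "+", 300, 30),
]
```

Breakpoint calls then come from clusters of discordant pairs that agree with each other, which is where distinguishing signal from noise gets hard.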
Unfortunately no link to the framework (yet?). Sourceforge page is just a placeholder.
Cell death avoidance, uncontrolled cell division, eventually resistance to treatment. Causes include infectious diseases, mutations (predisposing inherited vs acquired through mutagens, environment). Complete decoding of genome, transcriptome and epigenome required to understand cancer biology. (Disclaimer: might also have to include the tumour microenvironment)
Cancer genome evolution: gradual accumulation of changes until the internal control systems are overwhelmed. Think of cancer as a heterogeneous group even within the same tumour (including minority variants already present which give rise to resistant subtypes). Decoding genomes helps unravel the heterogeneity and supports treatment choices. Gives an overview of Smith Genome Center projects. All tied to new developments in algorithms and methods.
Focus for this talk on recent mutation discovery in B cell lymphomas. Cancer in the lymphatic cells of the immune system, present as an enlargement of the lymph nodes, extra-nodal sites include skin, brain, bowel, bone. 55% of all blood cancers, 5% of all cancers, 43 types organized into 4 large groups (Hodgkin (4 subtypes), mature T and NK neoplasms (13), immunodeficiency associated, mature B cell neoplasms (14) — focus here on follicular and diffuse large B-cell lymphoma).
Cell type of origin for B cell malignancies seems to be diverse (different cell types in the germinal center), fundamentally different regulatory programs of development. Non-Hodgkin lymphoma with a 4% compound increase in diagnosis in North America, mortality also increases (unlike most other cancers). Subtypes with different gene mutation profile in terms of frequencies (follicular vs diffuse germ cell vs diffuse activated; also differ in response to treatment). 90% of follicular lymphoma (FL) with translocation between chr14/18 (BCL2 under control of an Ig promoter), not seen in ABC type at all.
Genome and exome data for a number of FLs and Diffuse Large B-cell Lymphoma (DLBCL), about 100 RNA-Seq libraries. Mutation discovery based on normal/tumour genome comparison to find somatic, tumour-specific changes yields 143 candidate genes (validated), 400 more in the pipeline. Intersect with RNA-seq, final set of 231 genes (change in the genome and at least two more hits in the RNA-seq data). No matched controls for RNA-seq, so those are filtered against the standard known mutation databases, but still a lot of ‘private’ (germ line) variation present, or include RNA-editing changes. Identify hot spots, mutations that hit the same amino acid but different parts of the codon, yields a small set of genes enriched in chromatin regulation. Focus on two particular identified genes with high mutation frequency. Check for genes with non-random distribution of changes (i.e., having a selection signature), both for mutations (potentially with gain of function) and truncation (loss of function).
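[The two filtering ideas here (genomic change plus at least two RNA-seq hits, and hot spots at the same amino acid) can be sketched as below. Function names, input layouts and the placeholder gene names are my own assumptions; the real pipeline also filters against known variant databases, which is not shown.]

```python
from collections import Counter, defaultdict

def supported_genes(genome_genes, rnaseq_calls, min_rna_hits=2):
    """Genes with a genomic change and candidate mutations in at least
    min_rna_hits distinct RNA-seq libraries.

    genome_genes: set of gene names with a somatic change in a genome/exome.
    rnaseq_calls: iterable of (library_id, gene) candidate mutation calls.
    """
    # set() removes duplicate calls from the same library for the same gene.
    libraries_per_gene = Counter(gene for _lib, gene in set(rnaseq_calls))
    return sorted(g for g in genome_genes if libraries_per_gene[g] >= min_rna_hits)

def hotspots(calls, min_samples=2):
    """Residues mutated in at least min_samples samples.

    calls: iterable of (sample, gene, amino_acid_position); the codon base
    that changed is deliberately ignored, only the affected residue counts.
    Returns {(gene, amino_acid_position): n_samples}.
    """
    samples = defaultdict(set)
    for sample, gene, aa_pos in calls:
        samples[(gene, aa_pos)].add(sample)
    return {site: len(s) for site, s in samples.items() if len(s) >= min_samples}

# Placeholder data, not results from the talk.
genome = {"GENE_A", "GENE_B", "GENE_C"}
rna = [("lib1", "GENE_A"), ("lib2", "GENE_A"), ("lib1", "GENE_A"),
       ("lib3", "GENE_B"), ("lib4", "GENE_D"), ("lib5", "GENE_D")]
kept = supported_genes(genome, rna)
hot = hotspots([("s1", "GENE_A", 641), ("s2", "GENE_A", 641), ("s3", "GENE_A", 100)])
```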
Is genome packaging mis-regulated in lymphomas? Potentially new therapeutic options by targeting histone modifying genes.