Sharon Plon, Baylor College of Medicine, USA
Whole exome as a clinical diagnostic test. Recap of typical diagnosis of genetic disorders (clinical diagnosis, imaging, x-ray, metabolic assays); majority of patients remain undiagnosed. NIH Undiagnosed Diseases Program with carefully selected patients, review medical records, 326 cases accepted, 160 admitted to NIH Clinical Service. Able to find causative gene in 24% of the cases, 12 due to clinical findings, 19 by molecular.
Simply finding disease genes is not enough, needs to move to the clinic. Whole Genome Laboratory at Baylor, joint effort between medical genetics and HGSC with merger of expertise, weekly meetings. CLIA-certified test includes capture platform, base calling, analysis and annotation pipeline (Mercury), hand-off of report to clinical reporting team, decision what to report on and sign-out. Whole process needs to be codified, changes difficult. Set up clinical exome metrics to ensure consistency in terms of coverage, quality, etc.
Clinical reporting includes sequence results, Sanger confirmation, parental inheritance of significant findings (no trio sequencing), SNP array data; three levels of review including certified faculty. Difficult cases reviewed at WES Sign-Out Conference. Report focused on deleterious mutations in disease genes related to phenotype, variants of unknown significance in those genes, and clinically actionable variants. No such lists exist just yet; based on guidelines / literature. Expanded report can be generated upon request, including variants unrelated to the phenotype. (Bit surprising given previous talks).
Rapid increase in scale, almost 100 cases last month alone. 85% pediatric patients, wide variety of indications with the majority being neurological. Referrals from across the US. As of September, 128 samples completed for analysis, 26% of samples with causative deleterious mutation found.
Walks through medically relevant findings (~20% of patients have actionable findings), including 7 month old child with intraventricular hemorrhage (among other symptoms). No findings from standard clinical tests, high density SNP test. Exome-seq found RBM10 mutation (TARP syndrome), de novo splice site modification. Result provides diagnosis, prognosis and conveys low recurrence risk.
16 yo female with neurodegenerative disorder and chronic headaches, 13 yo sister with similar symptoms. Autosomal recessive disorder, found ultra-rare disorder (Ataxia of Charlevoix-Saguenay), SACS gene. One rare, two novel mutations (father carried two, mother carried frameshift). Prenatal testing for future children recommended. Prognosis available, though no treatment.
Overall strong interest in exome testing with early evidence of cost-effective clinical utility. Reporting remains challenging, as is the rapidly increasing spectrum of mutations associated with a given phenotype and the number of incidental findings. Benefits include determination of mode of inheritance, potentially improved preventive care.
Cancer exomes launched, but lag behind. Similar pipeline with deeper coverage, focus on reporting somatic mutations as tiers.
Daniel MacArthur, Massachusetts General Hospital, Boston, USA
Importance of aggregating signal across a large number of patients; need to analyze sequence in the context of tens of thousands of genomes. Requires consistent variant calls across studies. Mendelian studies need large reference sets — are variants in a patient unusual — to build a catalogue of variation. The ‘squaring the matrix’ problem of large individual vs variant position data sets. Merging cohorts when a subset of sites is only called in one cohort — true difference, or difference in technology, capture problems? Recall all variants across all individuals — but how to do this simultaneously?
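The 'squaring the matrix' problem can be illustrated with a toy merge. This is only a sketch under assumptions: the data layout (nested dictionaries keyed by cohort, site and sample) and the function name `square_matrix` are made up for illustration and are not taken from the talk. It shows why a site called in only one cohort leaves ambiguous cells for everyone else.

```python
def square_matrix(cohorts):
    """Merge per-cohort call sets onto the union of all sites.

    cohorts: {cohort_name: {site: {sample: genotype}}}
    Returns {site: {sample: genotype or None}}; None means "not called",
    which could be homozygous reference or simply no usable data.
    """
    all_sites = sorted({site for calls in cohorts.values() for site in calls})
    samples = {
        name: sorted({s for gts in calls.values() for s in gts})
        for name, calls in cohorts.items()
    }
    matrix = {}
    for site in all_sites:
        row = {}
        for name, calls in cohorts.items():
            for sample in samples[name]:
                # Sites absent from a cohort's call set stay unresolved (None).
                row[sample] = calls.get(site, {}).get(sample)
        matrix[site] = row
    return matrix


# Hypothetical toy data: two cohorts, one site private to each.
cohorts = {
    "cohortA": {("chr1", 100): {"a1": "0/1", "a2": "0/0"}},
    "cohortB": {("chr1", 200): {"b1": "1/1"}},
}
merged = square_matrix(cohorts)
print(merged[("chr1", 100)])
```

The `None` cells are exactly the entries that cannot be filled from the call sets alone, which is why the talk argues for going back to the reads and calling all individuals jointly.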
Reduce reads with no diversity to synthetic read summarizing the region; from 10GB exome to ~100MB, SNP/Indel calling ‘as accurate’ as single sample, 3-10x faster. 97% of the calls shared with a reference batched calling approach, ~40k new sites likely to be real and requiring the full data set to be called correctly. Used for joint calling of ~26,000 (!) exomes. T2D, Autism, Cancer studies, 1000G.
Early data, only a small fraction of the sites called just yet. The way we think about the mutation frequency spectrum changes: nearly 50% of LoF variants only seen once in 26,000 exomes. Landscape dominated by ultra-rare variants. Distribution across splice sites: known splice site mutations enrich at the GT/AG donor/acceptor, natural selection weeds out variants > 1%. If you go down in the frequency spectrum there are more and more variants that have not been selected against yet.
Site list and population allele frequencies will be made available via Exome Variant Server and used to generate next generation genotyping arrays. All non-singleton LoF variants, eQTLs, reported disease causing variants.
Second part of the talk on how to use these reference data sets to sift through data obtained from mendelian exome seq. Goes over the key recommendations from the NHGRI workshop in implicating sequence variants in human disease (specificity of support, beware known mutations, assess probability given large reference panels of controls, honest assessment of confidence in causality).
Annotation, segregation, conservation, frequency, using expression and interaction information. xBrowse as a family exome browser, prioritize candidates based on PPI. xBrowse with the ability to share data across consortia.
GTEx project collecting RNA-sequencing data from multiple tissues alongside genetic data, 290 samples so far, cluster by expression. What are patterns across tissues for known disease genes (e.g., epilepsy genes)? Need for replication; but what to test? Large scale functional screen of candidate genes, test of genes in large cohorts?
Lynn Jorde, University of Utah, USA
Challenges associated with WGS: disease gene detection and mutation rate analysis. Sequence explosion chart, includes exome-seq of family of four with Miller syndrome (craniofacial and limb malformations). Affected children with bronchiectasis, normally not associated with the syndrome. Both children inherited five disease-causing variants (Ng et al, 2010, Nature Genetics, writeup ‘Genomes on prescription’ in Nature 2011).
34,000 Mendelian inheritance errors in two offspring, only two dozen expected. 99.9% likely to be sequencing errors. All resequenced (Agilent capture array, Illumina). Mutation rate of 1.1 x 10^-8 per nucleotide per generation, each gamete with ~35 new variants (published estimates range from 1.0 to 2.5 x 10^-8; newer results seem to converge around 1.1 x 10^-8).
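A quick back-of-the-envelope check of the per-gamete figure, as a sketch only. The haploid genome size of about 3.2 billion bases is an assumption added here, not a number from the talk.

```python
# Per-nucleotide, per-generation mutation rate quoted in the talk.
MUTATION_RATE = 1.1e-8
# Assumption: approximate haploid human genome size in base pairs.
HAPLOID_GENOME_BP = 3.2e9


def new_variants_per_gamete(rate=MUTATION_RATE, genome_bp=HAPLOID_GENOME_BP):
    """Expected number of de novo variants carried by one gamete."""
    return rate * genome_bp


print(round(new_variants_per_gamete()))        # talk's estimate
print(round(new_variants_per_gamete(2.5e-8)))  # upper end of published range
```

Under that assumed genome size the quoted rate gives roughly 35 new variants per gamete, matching the figure on the slide, while the upper end of the published range would imply about 80.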
Filtering methods to get from ~3 million variants to a tiny subset of variants tend to be somewhat ad hoc. VAAST compares patient sequences with variant frequencies in sequence databases (1000G et al), can handle indels, and incorporates functional impact (BLOSUM, OMIM, HGMD) and uses evidence of purifying selection and conservation, non-coding annotation, pedigree data and LOD scores (composite likelihood ratio test).
Family data does improve disease-gene identification. Exome-seq used four individuals in three families; WGS of first family identified four loci; after incorporating parents’ sequences only two candidate genes remain (both disease-causing). Second example of a progeria-like disease (lethal, X-linked), exome-seq of just the X chromosomes, one significant locus (NAA10) found. Loss of function effect, confirmed by genotyping complete family, resulted in CLIA approved test. Found again in a second, unrelated family.
How well do these methods work for more common loci / conditions? Apply VAAST to detect CHEK2 variants as a cause of breast cancer. Detection rate increases with the use of contextual information (phastCons in particular). Similar results for NOD2 in Crohn disease, power of 0.8 at sample size of around 300-400 cases/controls. Applying it to studies of the Utah Population Database identifies a new candidate (can’t share yet). LOD score helps; more control genomes result in more power.
Sequencing approaches can identify disease-causing mutations with small number of families when combined with methods such as VAAST. Incorporation of family data brings back ‘real genetics’.
Leslie Biesecker, National Human Genome Research Institute, USA
Clinical research paradigm: gather info, formulate hypothesis, phenotype subjects, apply assay, interpret results, refine, extend. Clinical practice paradigm: history, examine patient, differential diagnosis, select and apply clinical tests, interpret, refine, diagnose.
Both are low throughput paradigms, limited by speed of assays (expensive, noisy, time limited). Genomics got rid of two of those three. Move towards hypothesis-generating clinical research:
CMAMMA project. Autosomal recessive disease in children, 12 genes found in single trio sequencing (exome), one leader gene. Added 7 additional patients, six with two rare variants in the same gene, nails causation (ACSF3).
Change context: ClinSeq Cohort, 900 subjects, primarily for arteriosclerosis, consented for full sequencing and any downstream genotyping. Used VarSifter to work through data. Detected CMAMMA rare mutation in normal patient. Brought in 66 yo patient with memory problems; blood tests confirm that the mutation causes the biochemical phenotype, but it is associated with a much wider phenotypic spectrum than initially anticipated; genetic modifiers matter.
Need to broadly study patients in a less biased way rather than just the patients we think have the correct phenotype. Matter of trust (consent).
Secondary variants and screening. Starting point cancer susceptibility: 55 syndromes, dropped those with no causative genes, left with 27 syndromes. Filter variants of associated genes, assign score based on mutation type, family history, etc., to come up with a pathogenicity scale. Eight class 5 (pathogenic) variants found among ten subjects. Screened for malignant hyperthermia susceptibility, prevalence 1:2000 to 1:10000, two genes account for 85% of mutations (RYR1, CACNA1S), but huge allelic heterogeneity. Causative mutation (published) present in 1.4% of ClinSeq exomes, two orders of magnitude more frequent than cases attributed to this disease. Unlikely to be causative. Patient with a mutation among 30 causative RYR1 mutations with no history: he would have been diagnosed with the disease, but is not affected.
Nice samplings of results when you just go looking: false negatives, false positives, real hits. Half of the patients with hits with no family history.
How and whom do we screen? Context is crucial. Mundane for prior molecular diagnostics, mild degree of surprise for patients with prior family history, but a counseling challenge for those without.
Known unknowns: no idea of the spectrum of phenotypes associated with full spectrum of genotypes. Biases in our penetrance assessment; diagnostic abilities less than we think they are. Could use genomics to improve our trial and error medicine, even though it is challenging on the individual level. Even little improvements likely to be a big advance.
David Dooling, Michael Schatz, James Taylor
DNA-encoded famous quote inserted into reads taken from a portion of an organism’s reference genome. Identify the sequence, decode the quote and identify the speaker. Winners to be announced via @ddgenome. Get started at website.
Taken from the Church paper: convert text into ASCII. Hello World to 72-101-108…; from there to binary (01001000-01100101-01101100…). Getting close to AGCT code: 0 is A/C, 1 is T/G. Redundancy allows flexibility to avoid difficult to synthesize / sequence resulting string.
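The encoding steps above can be sketched in a few lines. This is not the `dna-encode.pl` script from the challenge; the function names are hypothetical, and the rule used to choose between the two bases for each bit (avoid repeating the previous base) is just one assumed way to exploit the redundancy.

```python
import random

# Bit-to-base mapping as described in the talk: 0 -> A or C, 1 -> T or G.
BASES_FOR_BIT = {"0": "AC", "1": "TG"}


def text_to_bits(text):
    """ASCII text -> bit string, 8 bits per character, most significant bit first."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


def bits_to_dna(bits, seed=0):
    """Pick one of the two bases for each bit, never repeating the previous base.

    Avoiding immediate repeats is an assumed heuristic to keep homopolymer
    runs out of the encoded string.
    """
    rng = random.Random(seed)
    dna = []
    for bit in bits:
        choices = [b for b in BASES_FOR_BIT[bit] if not dna or b != dna[-1]]
        dna.append(rng.choice(choices))
    return "".join(dna)


bits = text_to_bits("Hel")
print(bits)  # 010010000110010101101100
print(bits_to_dna(bits))
```

Because every bit has two valid bases, there is always at least one choice that differs from the previous base, so the encoded string never contains a run of identical bases.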
Includes algorithm as dna-encode.pl as well as Illumina reads in FASTQ format (first group 2x100, 200bp fragments, second group 2x50, 2kb fragments, third interleaved reads from 1 and 2). Mike Schatz has the details photographed.
By default encodes to DNA, but has a decode option which can be used for the challenge. Can set byte/bit order (not needed) and can handle reverse complements depending on how insertion is being handled. Note: Insertion better be a multiple of eight (due to the bit encoding).
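For the decoding direction, a minimal sketch follows. Again, this is not the challenge's Perl script; the helper names are hypothetical and it assumes most-significant-bit-first ASCII bytes. It shows why the insert length has to be a multiple of eight and why strand orientation matters.

```python
# Base-to-bit mapping as described in the talk: A/C -> 0, T/G -> 1.
BIT_FOR_BASE = {"A": "0", "C": "0", "T": "1", "G": "1"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence as read from the opposite strand."""
    return dna.translate(COMPLEMENT)[::-1]


def dna_to_text(dna):
    """Decode a DNA insert back to ASCII text, 8 bases per character."""
    if len(dna) % 8 != 0:
        raise ValueError("insert length must be a multiple of eight")
    bits = "".join(BIT_FOR_BASE[base] for base in dna.upper())
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode("ascii")


print(dna_to_text("ATCAGCAC"))                      # forward strand
print(dna_to_text(reverse_complement("GTGCTGAT")))  # same insert, other strand
```

Note that complementing a base flips its bit under this mapping (A and C pair with T and G), so an insert read from the wrong strand decodes to reversed, inverted bits rather than to the original text.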
Josh Stuart, University of California Santa Cruz, USA
Mutation analysis assumption: bad mutations are so frequent that we can consider them all good for the tumor. But for mutations context matters. Some patients with the ‘right’ mutation do not respond to treatment, but why? What about recurrent, but low frequency variants that are still uncharacterized? Can we target novel mutations in individual patients?
Mode of action (loss of function, gain of function) required to improve our understanding of disease mechanism and treatment. Pathway-based approaches offer a complementary approach to predict LoF/GoF. Two themes to the analysis: identify the drivers, but also predict the essential genes in a given genetic background, i.e., find genes with variants in 1000G, HapMap, but not mutated in cancers. The latter question will likely need more data.
Another overview of mutation frequency analysis. How to correct for background mutation rate, are we biased towards earlier rather than late driver genes?
Start with functional mutation analysis (protein domains, evolutionarily conserved, non-random pattern) along with biological examples. Driver prediction through combination of sequence / frequency analysis. Impact can be summarized across samples. Other approaches try to predict mutations based on gene expression signatures (difficult to do due to a mix of LoF/GoF). Recap of pathway / network analysis (‘pathways as the mutable unit’), followed by a summary of probabilistic graphical models (Dana Pe’er, Friedman, others) to integrate gene expression and interaction information:
Details in the Paradigm paper, software on the lab page. Uses a cohort, combine CNV, mRNA, methylation, other data and reduce to a single output of inferred activities. The resulting maps need further dissecting, e.g., testing for hubs of interest.
Sewing machine analogy (pedal as regulator, gain/loss of function of the needle as indicators if the machine works). Paradigm-shift to predict the impact of mutations on genetic pathways, uses all neighbours for inference. Disconnect gene from regulators, retain downstream neighbours. Second run is targets only, detect shifts where upstream/downstream runs disagree on a given gene activity. Identifies RB1 in GBM, NFE2L2 as a gain of function in lung cancer (known oncogene in lung cancer). Passenger mutation score shift as a negative control.
System enables probing into infrequent events, can detect the impact of non-coding mutations, and extends to data beyond mutations — but requires genes with a pathway representation.
(Rushes through the remaining slides due to lack of time, touching upon classifiers looking at neighbourhood information, Google PageRank approaches, work from Yulia Newton on finding master controllers in breast cancer).
Ben Raphael, Brown University, USA
Slight change of title to ‘Dissecting somatic mutations in cancer genomes’. Driving mutations vs passenger mutations that accumulate but have little consequence to cancer progression. Typical tumor with ~10 driver mutations, thousands of passengers. Learning from TCGA, ICGC, other projects. Challenges:
Reduce the reliance on prior information (ideally no prior information) to identify combinations of genes ‘working together’. 10^20 combinations for fewer than six gene combinations, so not quite achievable. Intermediate approaches that use the interaction network (PPI, others) reduce this to 10^18.
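The scale of the search space is easy to check with a rough count. This is a sketch only: the gene count of 20,000 is an assumption added here, and the talk did not say exactly how its figure was derived.

```python
from math import comb

# Assumption: rough number of human protein-coding genes.
N_GENES = 20_000


def n_gene_sets(n_genes, max_size):
    """Number of distinct gene sets with between 2 and max_size members."""
    return sum(comb(n_genes, k) for k in range(2, max_size + 1))


total = n_gene_sets(N_GENES, 5)
print(f"{total:.2e}")
```

With these assumptions the count for sets of up to five genes lands between 10^19 and 10^20, the same order of magnitude as the figure quoted in the talk, and it is dominated almost entirely by the five-gene sets.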
PanCancer study to look at all of TCGA data together (~1400 cancer samples) identify ‘rare’ mutation networks, e.g., Cohesin (STAG1/STAG2/RAD21), Polycomb groups.
Extended the algorithm in multiple ways to handle other data types, better statistics.
Future challenge will be to incorporate diverse data types (methylation, expression).
Gabor Marth, Boston College, USA
We know 99% of SNP variants in any individual. Number steadily increasing from 1% in May 2011 (!). As of phase 1, 38 million SNPs known. Newly discovered SNPs are now truly rare / private. Technical challenges moved on to more complex variants.
(Work from Erik Garrison of Freebayes fame). Quick overview of Indel variants, haplotype effects, frame-restoring indels. In general difficult to align (hence haplotype-aware re-alignment or micro-assembly approaches used by modern aligners). Definition of complex variants as a combination of multiple ‘allelic primitives’ (nod to Daniel MacArthur).
Shift from atomic variant analysis (orphan reads) to local haplotype calling approaches within a given detection window. SNPs detected during the 1000G pilot were used to create a genotyping array (Omni 2.5). About 5% of the variants turned out to be monomorphic after the pilot phase, but after haplotype-aware analysis most of these were truly polymorphic, with a large fraction of them being complex variants (i.e., causing probe failures).
Multi-nucleotide polymorphisms frequently problematic for older algorithms not taking context into account. Intriguing matrix of dinucleotide changes (tg to ca and others very frequent due to DNA chemical properties, usually through deamination and other processes). Better interpretation required, e.g., not misclassify a micro-satellite expansion as a series of single SNPs.
Integration of short and structural variants under active development. Detectable with high accuracy, but how to reconcile short variants detected within structural variant regions? Quick overview of Bayesian methods, genotype likelihoods and probabilities based on suitable cohorts, integration of CNV and variant calls.
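The Bayesian genotyping idea can be shown with a toy single-site model. This is a generic textbook-style sketch, not the model used by Freebayes or any tool mentioned in the talk. The function name, the binomial read model, the fixed error rate and the Hardy-Weinberg prior from a cohort allele frequency are all assumptions made for illustration.

```python
from math import comb


def genotype_posteriors(ref_reads, alt_reads, alt_freq=0.01, error=0.01):
    """Posterior P(genotype | reads) for a biallelic diploid site.

    alt_freq: cohort allele frequency used for the prior (assumed known).
    error: per-read probability of observing the wrong allele (assumed).
    """
    n = ref_reads + alt_reads
    # Probability that a single read shows the alt allele, per genotype.
    p_alt = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    # Hardy-Weinberg prior derived from the cohort allele frequency.
    q = alt_freq
    prior = {"0/0": (1 - q) ** 2, "0/1": 2 * q * (1 - q), "1/1": q ** 2}
    # Prior times binomial genotype likelihood.
    joint = {
        g: prior[g]
        * comb(n, alt_reads)
        * p_alt[g] ** alt_reads
        * (1 - p_alt[g]) ** ref_reads
        for g in p_alt
    }
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


post = genotype_posteriors(ref_reads=10, alt_reads=10)
print(max(post, key=post.get))
```

With balanced read counts the heterozygous genotype dominates even though the prior favours homozygous reference; with no reads at all the posterior simply falls back to the cohort prior, which is where a suitable reference cohort earns its keep.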
Challenges to the developer community to build the tools, making them accessible, and usable. Target audience should be the biologists. Easy to install, use, documented, ideally web-based. Pitch for gkno.me, the Boston College Sequence Analysis Tool Hub. Pipelines need to automate the full analysis workflow, not just individual steps, and assess the likely impact of variants. An overview of the gkno pipeline system that moves from FASTQ to VCF.
Just as important to have the analysis available. Store data remotely, compute remotely. Visualization and analysis paradigms will need to change for this (and cloud-based approaches); focus on the important data only rather than shipping all data around. Pitch for an ecosystem of light, flexible tools that are more specialized, but interconnected. Early view at an iPad app from Yi Qiao to explore genomic variants. Light, low-memory, low CPU.
As always, will do my best to capture snippets of information as best as I can. All errors mine, and feedback and corrections most certainly welcome (@fiamh, email). Official conference website can be found at http://www.beyond-the-genome.com.
Chair: David Dooling, introduction from Rebecca Furlong, Genome Medicine, UK
Focus on genomics and epigenomics. Quick intro of the organizers and off to a good (if late) start. Media policy is opt-out.
Moving along the groups, the problem is getting harder as each group has to make assumptions about the upstream work. Surprising amount of agreement in those assumptions, promising with regards to standards.
How to work with the bioinformatics data representation group to inform what is needed for downstream assessment? Each experiment (library level, workflow level, etc), at every base (do we need this at every base? Also found that a gVCF approach would be needed).
Group meant to support the end user, so data summary and visualization a strong need. Browser over the database (mirrors the CCD talk); agnostic to technology, sequencing approach. Allow submission of data to compare and assess against (similar to the X Prize’s validationprotocol.org).
Link to published prep protocol, target/bait details in BED/GFF, platform information, chemistry. Minimal information, a CLIA-style protocol not realistic for a volunteer group. Details on the analysis (link to protocol, read data, alignment / assembly data, variants) in the usual standard formats (FASTQ/SRA/HDF5, BAM/CRAM, VCF). Also a discussion of virtualized computes to enhance sharing / reproducibility.
Metadata around VCF 4.1, needs discussion of what is essential to be captured. Database to store each base plus meta information versus reference for each experiment (no call vs consensus call). Phasing, CNV optional in the first phase. Needs to interface with some sort of genome browser (existing!).
Build on existing browsers. Single RM but many experiments, most metadata available for filtering. Generate ‘canned’ queries (intersect all platforms for a gold set, all OMIM SNPs, etc.) vs dynamic ones which allow users to weight as needed. Slice / Dice / Export to compare and report differences.
Additional educational uses.
Take a region / variant list of interest, filter and visualize, leverage X Prize software to provide information on phasing, accuracy, concordance / discordance. Feed results into downstream software or browser tracks.
Lots of discussion around how to keep this stable — avoid disruption of coordinates, make sure users are warned in advance, release early versions. In this project phased releases make sense (first product with differences of NA12878 to reference), ultimately towards an NA12878 assembly.
Competition with the X Prize, how to handle new technologies that find regions not covered by the reference? Found regions that look incorrect when just looking at mapping, but look okay when assembled. Requires lots of work including Sanger confirmation to ensure it’s human. Another way to ask is what is the absolute truth, or where to draw the line. Probably another moving target.
(Like the suggestion of BS-score for believability, of course.)
For CLIA: discrepancies need not just be identified but resolved.