not invented here.

Notes on bioinformatics methods, software and discussions.

9.28.2012 Genomics – Catching up to Human Genetics

Richard Gibbs, Baylor College of Medicine, USA

aka Genomic Medicine 20??. From individual variation (Watson) to population variation (Desmond Tutu project), identification of actionable variants (Jim Lupski’s project, NEJM 2010), medical management and intervention (2012), everyone (20??). What is the main utility, and should we all be sequenced?

Technical development:

Still a long way to go; not a perfect genome yet despite rapid developments. Capture technologies will stick around a bit longer, as higher coverage in important regions helps

Healthy adults:

Who wants a test without medical indication? Mike Snyder, for one. Still useful information such as site frequency spectrum (1000G useful even without detailed phenotypes)

Complex disease:

GWAS vs Mendel. Few actionable alleles vs low frequency/high impact: can we have an integrated model? Can we construct models of complex diseases ('oligogenics')? The ARIC and CHARGE consortium, with 30,000 well-phenotyped individuals, 4,500 of them exome-seq'd, sees Mendelian alleles in the 'normals'

Mendelian disease:

Severe diseases, often in children, collectively frequent; cites actionable case studies (40 Mendelian diseases studied right now at HGSC alone, 'industrialized' pipeline). Studies take away endless, exhausting lists of diagnoses and treatments. Huge value in molecular diagnosis even without treatment. Problem of too many small pedigrees; need a surrogate for what a variant does as part of a functional assessment. Whole Genome Lab launched 2011, steady increase of samples (by end of 2013 10k samples/month, no way to do manual curation).

How much work will be in research vs clinical settings? Lots of additional success stories that are just impossible to cover without the pedigree pictures, gene names and impact.

Clan Genomics paper: “recent mutation may have a greater influence on disease susceptibility or protection than is conferred by variations that arose in distant ancestors”. Should be more worried about immediate family, variants not present in the broader population.


Role of inherited, acquired and environmental mutations. TCGA et al 'damn successful' in uncovering new mutations and functions. New technologies allow analysis of low-frequency cancers. Different trios (normal, primary, recurrence) allow clonal analysis, trace evolution, study time dependency between mutational events.


Need to find the family or cohort first. Data ends up in medical records which can be mined for resources; likely growth in social networks / 23andMe’s to create cohorts, whereas designed population studies will decrease.

Future prediction: all the excitement stale in a few years. Passion move to other fields as this becomes complete routine — it’s water coming out of the faucet.

sequencing ✳ conference ✳ btg2012 

9.28.2012 Surname leakage from personal genomes

Yaniv Erlich, Whitehead Institute for Biomedical Research, USA

Co-segregation between the Y-Chr and surnames is used by public services and allows you to connect to relatives; a second database of interest, SMGF, was used to extract a total of 140k surname/Y-chr-test data points.

What is the probability of recovering a surname using genomic data? Success rate of ~12%. What happens when adding age and state as additional metadata (often included in public records; publication also allowed)? Age 40, State Colorado, surname Smith: median size of 12 people identified by this combination.
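The narrowing effect of quasi-identifiers is just a multiplication of frequencies. A toy sketch of that arithmetic; every frequency below is a rough assumption chosen for illustration (the talk's median of 12 comes from actual record data, dominated by rarer surnames than the Smith example):

```python
# Each added attribute multiplicatively shrinks the set of matching people.
# All numbers are crude, assumed frequencies for illustration only.
us_males  = 155e6           # approximate male US population (Y-chr matching)
p_surname = 3e-4            # a moderately rare surname (assumed)
p_age     = 1 / 80          # a single year of age, crude uniform assumption
p_state   = 5.2e6 / 315e6   # Colorado's rough share of the US population

matching = us_males * p_surname * p_age * p_state
print(f"~{matching:.0f} individuals match surname + age + state")
```

With these assumptions the candidate set drops from millions to single digits, which is the order of magnitude that makes re-identification practical.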

The Venter case, putting it all together. Profile STR markers with lobSTR. Now sufficient to get the Venter surname from (Try yourself!). The same person does NOT need to be in the database; it is sufficient for relatives to be included.

Remaining talk not tweet-able. Shame, _really fun and intriguing analysis_.

privacy ✳ sequencing ✳ conference ✳ btg2012 

9.28.2012 The implications of clonal genome evolution for cancer medicine

Samuel Aparicio, BC Cancer Research Centre, Canada

Cancers act as evolutionary systems. Provides a brief history of tumour evolution: Hauschka in the '50s, Richard Doll's mathematical models, Peter Nowell in Science 1976 on clonal evolution. Ecosystems of malignant cells, with selection operating on phenotypes resulting in growth advantages/disadvantages. Conceptualized and modeled in a large number of recent papers.

In practice:

  1. Tumours will vary in size and composition
  2. Mutations detected in bulk will not all co-occur in the same cells (clonal genotype)
  3. Powerful tool to analyze function in human tumours
  4. Most tests completely ignore this clonality

Concepts: clonal prevalence (prevalence of a given mutation across all clones), genotype (unique constellation of mutations defining a clone), lineage (related by hierarchical descent). Difficult to track, as mutations occur at every scale. First demonstration of clonal evolution over nine years (primary tumour, recurrence and metastasis). Three patterns: present in both primary and metastatic cells, present in a subset of either population, undetectable in the primary. Again during AML relapse, tracing clonal abundance using allelic frequencies. Patient relapse was caused by a mutation already present prior to therapy, but at very low abundance. Observing this kind of evolution provides hints of the relevance (significance) of the genotype.

Triple negative breast cancer has additional subtypes. Analyzed 104 TNBC (early stages), 2,164 validated SNVs (single base, small indels). Wide variation in mutation abundance by case, but all would be treated the same way, even after adjusting for differences in CNV. (Aside: unusual pattern of somatic mutations in genes downstream of ITG, cell shape, cytoskeleton processes.)

Mixture model to account for copy-number variation; a Dirichlet process mixture model to cluster mutations and predict discrete clonal frequencies. Some cancers have only 2-3 clonal groups, others six groups that are distinct from each other. Huge variation between patients depending on how much the cancers evolved. In some cases p53 is not in the initial (earliest) clone, so other events must be driving cancer initiation.
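The copy-number adjustment that precedes any such clustering can be sketched simply: convert a raw variant allele frequency (VAF) into an estimated fraction of tumour cells carrying the mutation. This is a standard simplification assuming one mutant copy per mutated cell; the talk's actual Dirichlet process model is richer.

```python
def clonal_prevalence(vaf: float, total_cn: int, tumour_purity: float) -> float:
    """Estimated fraction of tumour cells carrying a mutation, assuming a
    single mutant copy per mutated cell and diploid normal contamination.

    Expected VAF of a mutation at prevalence f:
        vaf = (f * purity) / (purity * total_cn + (1 - purity) * 2)
    Solving for f gives the expression below.
    """
    return vaf * (tumour_purity * total_cn + (1 - tumour_purity) * 2) / tumour_purity

# A heterozygous mutation in a 100%-pure diploid tumour: VAF 0.5 -> prevalence 1.0
print(clonal_prevalence(0.5, total_cn=2, tumour_purity=1.0))   # -> 1.0
```

The same clonal heterozygous mutation at 50% tumour purity shows a VAF of only 0.25, which is why clustering raw VAFs without this correction conflates purity, copy number and clonality.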

(Shah et al, Nature 2012) showing clonal frequencies of TNBC organized by pathways rather than by genes, resulting in an interesting crosstalk network. (Navin et al, Nature 2011) another paper showing lots of regional variation (clonal differences) across a tumour. Single-cell sequencing will help resolve the clonal genotypes. Preliminary data from seven nuclei, able to infer 4 clonal genotypes based on informative sites. Detect mutations in blood plasma before radiological progression.

Clonal evolution of cancers at the root of treatment failure. Can be tackled with modern genomics.

sequencing ✳ btg2012 ✳ conference ✳ evolution 

9.28.2012 Analyzing Genomes: is there a duty to disclose?

Amy McGuire, Baylor College of Medicine, USA

Is there a requirement for an investigator to contact the patient when incidental findings are discovered? Aka, how to return individual-level research results. Guidelines include wording like 'proven clinical value' that is difficult to define. Some contracts extend this to secondary analysts, who are expected to contact biobanks, who follow up with the sample provider. There is no consensus.

Duty of reciprocity, and a right to receive (neither well-recognized legal principles in clinical practice). Patients, when asked, would prefer to receive results about themselves.

Arguments against disclosure: obligations are role-specific, no duty to rescue for researchers. There are other ways (aggregate information) for participants to receive information; preliminary research results are often not replicated and of unclear significance. Routine return of results imposes a burden on the research enterprise.

GWAS Investigators survey, list of published study corresponding authors; 200 completed surveys, 35 interviewed. Only 7 primary users returned results, none of the secondary researchers did. 68% felt results should be returned in some circumstances: type of study matters. Results not significant enough in large-scale GWAS, but frequently so in small family linkage studies.

Informed consent documents (ICDs) matter to what is being returned to participants. ~20% of documents silent on return of results; none stated they would be returned under all circumstances, 10% when significant findings were discovered. (As @larsgt puts it: the EULA of research.) Legal obligations are being created (negligence on behalf of the investigator to return results as outlined), in line with standards of care. A moral obligation to return data can quickly turn into a legal one.

Back to Zack's talk: incidental findings in clinical care. Fiduciary obligation to disclose, but what about variants of unknown significance (first time I have seen the VUS acronym)? Is there a duty to hunt for more information? How does this mesh with the incidentalome problem discussed earlier?

Working group on Secondary Findings in Whole Exome/Genome Sequencing (article in Genetics in Medicine, Robert C Green). Bound to get more difficult as lines between research and clinical care continue to blur. Clinical Sequencing Exploratory Research (CSER) projects to assess the integration of clinical information, sequencing, utilization outcomes, clinical care.

Problem of iatrogenic harm — over-utilization of healthcare resources. No good data available.

Other ethical issues, e.g., identification of incestuous parental relationships using SNP arrays. Seven cases at Baylor (first degree, in some cases with the mother being a minor) in the last year; general consent to treat includes this kind of testing. Disclose information to authorities, child protective services?

btg2012 ✳ conference ✳ ethics ✳ sequencing 

9.28.2012 How to Avoid One Thousand Opportunities to Do Harm In Genomic Medicine

Isaac Kohane, Harvard Medical School and Children’s Hospital Boston, USA

Analogy of Google Maps' GIS layer system to create useful maps, and how this will apply to the combination of 'omes (3 billion bases to one report). Includes mandatory jab at Apple Maps.

Threat of the Incidentalome: the danger of large N and small p(Disease) is that false (dangerous) diagnoses will become prevalent. With only 10k tests per person, 60% of the population will be diagnosed falsely. Non-genomic follow-up tests will be cost-prohibitive. Shout-out to Dan MacArthur's LoF variant paper in Science — high confidence in 100 LoF alleles, with 20 complete LoF events per genome without noticeable phenotype.
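The 60% figure follows from simple compounding of a small per-test false-positive rate. A minimal sketch; the 1e-4 rate is my assumption for illustration, not a number from the talk:

```python
# Even a tiny per-test false-positive rate compounds across many independent
# tests: P(at least one false positive) = 1 - (1 - p)^n.

def p_any_false_positive(per_test_fp_rate: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - per_test_fp_rate) ** n_tests

for n in (100, 1_000, 10_000):
    print(f"{n:>6} tests -> P(>=1 false dx) = {p_any_false_positive(1e-4, n):.1%}")
```

At a 1-in-10,000 false-positive rate, 10k tests already push the probability of at least one false diagnosis past 60%, matching the figure quoted above.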

Four components of the Incidentalome:

  1. Wrong annotations
  2. Measurement error

Findings per genome: >50 variants at highly conserved, disease-causing sites, in healthy individuals. Easily 20% annotation errors in OMIM, HGMD, dbSNP (e.g., due to bad genome assembly updates). Systematic errors when sequencing the same genome multiple times, let alone between different technology platforms (5-20% disagreement). For indels, concordance is only 20%. 10^3 to 10^5 errors per genome. Still, this problem can be fixed over time.

  3. Wrong priors (comparison group)

41,000 patients tested, 5 in 1,000 homozygous for a highly penetrant mutation for haemochromatosis. Only one with an actual history: from 80% penetrance to less than 1%. Initially measured in families with (obviously) shared genetic background, but also shared environmental exposure. Compare to mouse strains: a knockout is only disease-causing in one out of three mouse strains. Clinical cohorts are already heavily biased.
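The penetrance collapse is plain arithmetic on the numbers quoted above:

```python
# Haemochromatosis example: penetrance observed in an unselected clinical
# population vs. the ~80% estimated in affected families.
patients_tested   = 41_000
homozygote_rate   = 5 / 1000     # homozygous for the high-penetrance mutation
affected_carriers = 1            # only one with an actual disease history

homozygotes = patients_tested * homozygote_rate          # ~205 people
observed_penetrance = affected_carriers / homozygotes

print(f"{homozygotes:.0f} homozygotes, observed penetrance ~{observed_penetrance:.2%}")
```

One affected individual among ~205 homozygotes puts the observed penetrance under 1%, nearly two orders of magnitude below the family-based estimate.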

I2B2 toolkit, viral dissemination of a population study system, now at 84 centers, comes with VMware image.

  4. Multiple comparisons

With one variant, picking the right comparison group is feasible. Assume only 10^4 important variants: there are no groups large enough to compare to. iSnyder study as an alternative with a mechanistic evaluation of data.

All that said, what’s a genome worth? Predicted value to healthy individuals is very limited. Extremely valuable to sick patients, risk relatively slight.

ethics ✳ clinic ✳ sequencing ✳ conference ✳ btg2012 

9.28.2012 Clinical diagnostic whole genome sequencing in a paediatric population

Elizabeth Worthey, The Medical College of Wisconsin, USA

Pediatric WGS-based MDx pilot at MCW: 18 months, 3-6 cases reviewed/month, 24 approved by committee. 8-10 hours of counseling required for the consent process. A need to define the term 'medically actionable': treatment of manifestations, prevention of primary and secondary symptoms, etc. Where possible, patients decide on what information is to be returned (within State laws). Broad range of diseases; MDx rate is ~40%, based on the pilot agreement to move forward.


  • how to ensure systematic, validated analysis? Complex data processing pipeline with controlled data updates, curated annotations, controlled development schedule, validation of each change (six month development schedule).
  • how to support clinical interpretation? Like other speakers, different categories of findings. Prioritize on likely error, likely pathogenicity
  • how to handle regions not sequenced (false negatives)? Cross-sample analysis (GapMine scripts). Found ~60 genes with clinical utility that are consistently poorly covered. Most NEMO disease-associated variants fall into such a coverage gap. Can use other technologies (very low coverage PacBio) to close such gaps.
  • how to handle bad reference data? WGS found a candidate mutation in a patient that looked promising, but turned out to be polymorphic when studying mutation databases (goes back to previous talks on needing lots and lots of references). Tracing the annotation evidence paper trail indicates, though, that the mutation isn't polymorphic. Data needs to be reviewed carefully (given the current state of public data)
  • when do you declare failure? Sanger-confirmed variant with insignificant clinical findings. Six months later, diagnosis due to new publications of patients with the same mutation and similar phenotype

NGS is already providing great insight into human health. Need to be able to query each other's variant data, ideally with associated phenotype information. Need to fix wrong or misleading information in databases.

conference ✳ wgs ✳ btg2012 ✳ clinic ✳ sequencing 

9.28.2012 First year experience with the introduction of clinical whole exome sequencing

Sharon Plon, Baylor College of Medicine, USA

Whole exome as a clinical diagnostic test. Recap of typical diagnosis of genetic disorders (clinical diagnosis, imaging, x-ray, metabolic assays); the majority of patients remain undiagnosed. NIH Undiagnosed Diseases Program with carefully selected patients and review of medical records: 326 cases accepted, 160 admitted to the NIH Clinical Service. Able to find the causative gene in 24% of the cases, 12 due to clinical findings, 19 by molecular.

Simply finding disease genes is not enough; this needs to move to the clinic. The Whole Genome Laboratory at Baylor is a joint effort between medical genetics and the HGSC, with a merger of expertise and weekly meetings. The CLIA-certified test includes the capture platform, base calling, analysis and annotation pipeline (Mercury), hand-off of the report to the clinical reporting team, decision on what to report, and sign-out. The whole process needs to be codified; changes are difficult. Set up clinical exome metrics to ensure consistency in terms of coverage, quality, etc.

Clinical reporting includes sequence results, Sanger confirmation, parental inheritance of significant findings (no trio sequencing), SNP array data; three levels of review including certified faculty. Difficult cases are reviewed at the WES Sign-Out Conference. Report focused on deleterious mutations in disease genes related to the phenotype, variants of unknown significance in those genes, and clinically actionable variants. No such lists exist just yet; based on guidelines/literature. An expanded report can be generated upon request, including variants unrelated to the phenotype. (Bit surprising given previous talks.)

Rapid increase in scale, almost 100 cases last month alone. 85% pediatric patients, wide variety of indications with the majority being neurological. Referrals from across the US. As of September 128 samples completed for analysis, 26% samples with causative deleterious mutation found.

Walks through medically relevant findings (~20% of patients have actionable findings), including a 7-month-old child with intraventricular hemorrhage (among other symptoms). No findings from standard clinical tests or a high-density SNP test. Exome-seq found an RBM10 mutation (TARP syndrome), a de novo splice site modification. Result provides diagnosis, prognosis and conveys low recurrence risk.

16 yo female with a neurodegenerative disorder and chronic headaches, 13 yo sister with similar symptoms. Autosomal recessive disorder; found an ultra-rare disorder (Ataxia of Charlevoix-Saguenay), SACS gene. One rare, two novel mutations (father carried two, mother carried a frameshift). Prenatal testing for future children recommended. Prognosis available, though no treatment.

Overall strong interest in exome testing with early evidence of cost-effective clinical utility. Reporting remains challenging, as is the rapidly increasing spectrum of mutations associated with a given phenotype and the number of incidental findings. Benefits include determination of mode of inheritance, potentially improved preventive care.

Cancer exomes launched, but lag behind. Similar pipeline with deeper coverage, focus on reporting somatic mutations as tiers.

sequencing ✳ clinic ✳ btg2012 

9.28.2012 Sifting disease-causing signal from genomic noise

Daniel MacArthur, Massachusetts General Hospital, Boston, USA

Importance of aggregating signal across a large number of patients; need to analyze sequence in the context of tens of thousands of genomes. Requires consistent variant calls across studies. Mendelian studies need large reference sets — are variants in a patient unusual — to build a catalogue of variation. The ‘squaring the matrix’ problem of large individual vs variant position data sets. Merging cohorts when a subset of sites is only called in one cohort — true difference, or difference in technology, capture problems? Recall all variants across all individuals — but how to do this simultaneously?

  • 100s of TBs of read data even for exome seq
  • memory requirements to keep data for all samples in memory, even for a single site
  • one approach: ReducedBAMs for active storage and calling (not for archival purposes) to reduce memory footprint

Reduce reads with no diversity to a synthetic read summarizing the region: from a 10GB exome to ~100MB, SNP/indel calling 'as accurate' as single-sample, 3-10x faster. 97% of the calls shared with a reference batched-calling approach; ~40k new sites likely to be real and requiring the full data set to be called correctly. Used for joint calling of ~26,000 (!) exomes. T2D, Autism, Cancer studies, 1000G.
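The reduction idea can be illustrated with a toy: pileup columns where every read agrees carry no information for variant calling, so runs of such columns can be collapsed into one synthetic summary. This is only a sketch of the concept behind ReducedBAMs, not GATK's actual algorithm:

```python
# Collapse runs of invariant pileup columns into synthetic summary records,
# keeping full detail only where reads disagree (candidate variant sites).

def reduce_columns(columns):
    """columns: list of (position, bases_seen_across_reads)."""
    reduced, run_start = [], None
    for pos, bases in columns:
        if len(set(bases)) == 1:          # all reads agree: no information
            run_start = pos if run_start is None else run_start
        else:                             # disagreement: keep full detail
            if run_start is not None:
                reduced.append(("synthetic", run_start, pos - 1))
                run_start = None
            reduced.append(("keep", pos, pos))
    if run_start is not None:             # close a trailing invariant run
        reduced.append(("synthetic", run_start, columns[-1][0]))
    return reduced

cols = [(1, "AAAA"), (2, "CCCC"), (3, "GGAG"), (4, "TTTT"), (5, "TTTT")]
print(reduce_columns(cols))
# -> [('synthetic', 1, 2), ('keep', 3, 3), ('synthetic', 4, 5)]
```

Since most of an exome is invariant, almost everything collapses, which is how a 10GB BAM can shrink to ~100MB while keeping the evidence needed at variable sites.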

Early data, only a small fraction of the sites called just yet. The way we think about the mutation frequency spectrum changes: nearly 50% of LoF variants are seen only once in 26,000 exomes. The landscape is dominated by ultra-rare variants. Distribution across splice sites: known splice site mutations are enriched at the GT/AG donor/acceptor; natural selection weeds out variants above 1% frequency. Going down the frequency spectrum, there are more and more variants that have not been selected against yet.

Site list and population allele frequencies will be made available via Exome Variant Server and used to generate next generation genotyping arrays. All non-singleton LoF variants, eQTLs, reported disease causing variants.

Second part of the talk on how to use these reference data sets to sift through data obtained from Mendelian exome-seq. Goes over the key recommendations from the NHGRI workshop on implicating sequence variants in human disease (specificity of support, beware known mutations, assess probability given large reference panels of controls, honest assessment of confidence in causality).

Annotation, segregation, conservation, frequency, using expression and interaction information. xBrowse as a family exome browser, prioritize candidates based on PPI. xBrowse with the ability to share data across consortia.

GTEx project collecting RNA-sequencing data from multiple tissues alongside genetic data, 290 samples so far, clustered by expression. What are the patterns across tissues for known disease genes (e.g., epilepsy genes)? Need for replication, but what to test? Large-scale functional screens of candidate genes, tests of genes in large cohorts?

sequencing ✳ conference ✳ btg2012 

9.28.2012 Whole-genome sequencing and disease-gene detection

Lynn Jorde, University of Utah, USA

Challenges associated with WGS: disease gene detection and mutation rate analysis. Sequence explosion chart; includes exome-seq of a family of four with Miller syndrome (craniofacial and limb malformations). Affected children with bronchiectasis, normally not associated with the syndrome. Both children inherited five disease-causing variants (Ng et al, 2010, Nature Genetics; writeup 'Genomes on prescription' in Nature 2011).

34,000 Mendelian inheritance errors in the two offspring, only two dozen expected; 99.9% likely to be sequencing errors. All resequenced (Agilent capture array, Illumina). Mutation rate of 1.1 x 10^-8 per nucleotide per generation, each gamete with ~35 new variants (published estimates range from 1.0 to 2.5; newer results seem to converge around 1.1).
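The two per-generation figures are mutually consistent, which is a quick sanity check worth making explicit (the ~3.2 Gb haploid genome size is my assumption, the standard human estimate):

```python
# Expected de novo variants per gamete = per-base mutation rate x genome size.
mutation_rate  = 1.1e-8   # per nucleotide per generation (from the talk)
haploid_genome = 3.2e9    # bases, approximate human haploid genome (assumed)

expected_de_novo = mutation_rate * haploid_genome
print(f"~{expected_de_novo:.0f} new variants per gamete")   # ~35
```

This also shows why 34,000 apparent Mendelian errors against ~two dozen expected true events implies an error rate vastly exceeding the mutation rate.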

Filtering methods to get from ~3 million variants to a tiny subset tend to be somewhat ad hoc. VAAST compares patient sequences with variant frequencies in sequence databases (1000G et al), can handle indels, incorporates functional impact (BLOSUM, OMIM, HGMD), and uses evidence of purifying selection and conservation, non-coding annotation, pedigree data and LOD scores (composite likelihood ratio test).
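The shape of the ad hoc filtering being replaced is easy to sketch: intersect rarity with predicted impact. VAAST's composite likelihood test is far richer than this; the field names and thresholds below are hypothetical, for illustration only:

```python
# Toy hard-filter: keep variants that are both rare and predicted damaging.
# Real pipelines weight these signals probabilistically instead of thresholding.
variants = [
    {"gene": "G1", "af": 0.20,    "impact": "missense"},     # too common
    {"gene": "G2", "af": 0.0001,  "impact": "stop_gained"},  # rare + damaging
    {"gene": "G3", "af": 0.00005, "impact": "synonymous"},   # rare but benign
]

rare_damaging = [v for v in variants
                 if v["af"] < 0.001 and v["impact"] != "synonymous"]
print([v["gene"] for v in rare_damaging])   # -> ['G2']
```

Hard thresholds like these discard information at the boundaries, which is exactly the motivation for a likelihood-based composite score.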

Family data does improve disease-gene identification. Exome-seq used four individuals in three families; WGS of the first family identified four loci; after incorporating the parents' sequences, only two candidate genes remain (both disease-causing). Second example of a progeria-like disease (lethal, X-linked): exome-seq of just the X chromosomes, one significant locus (NAA10) found. Loss-of-function effect, confirmed by genotyping the complete family, resulted in a CLIA-approved test. Found again in a second, unrelated family.

How well do these methods work for more common loci/conditions? Apply VAAST to detect CHEK2 variants as a cause of breast cancer. Detection rate increases with the use of contextual information (phastCons in particular). Similar results for NOD2 in Crohn disease: power of 0.8 at a sample size of around 300-400 cases/controls. Applying it to studies of the Utah Population Database identifies a new candidate (can't share yet). The LOD score helps; more control genomes result in more power.

Sequencing approaches can identify disease-causing mutations with small number of families when combined with methods such as VAAST. Incorporation of family data brings back ‘real genetics’.

sequencing ✳ genetics ✳ btg2012 

9.28.2012 Hypothesis-generating clinical genomics research and predictive medicine

Leslie Biesecker, National Human Genome Research Institute, USA

Clinical research paradigm: gather info, formulate hypothesis, phenotype subjects, apply assay, interpret results, refine, extend. Clinical practice paradigm: history, examine patient, differential diagnosis, select and apply clinical tests, interpret, refine, diagnose.

Both are low throughput paradigms, limited by speed of assays (expensive, noisy, time limited). Genomics got rid of two of those three. Move towards hypothesis-generating clinical research:

  • assemble cohort
  • generate large dataset
  • parse genomic data for patterns, perturbations, etc.
  • Hypothesis for how attributes affect the subjects
  • Test hypothesis with clinical research

CMAMMA project. Autosomal recessive disease in children, 12 genes found in single-trio sequencing (exome), one leader gene. Added 7 additional patients; six with two rare variants in the same gene nails causation (ACSF3).

Change of context: the ClinSeq Cohort, 900 subjects, primarily for atherosclerosis, consented for full sequencing and any downstream genotyping. Used VarSifter to work through the data. Detected a rare CMAMMA mutation in a normal patient. Brought in the 66 yo patient — memory problems; blood tests confirm that the mutation causes the biochemical phenotype, but it is associated with a much wider phenotypic spectrum than initially anticipated; genetic modifiers matter.

Need to broadly study patients in a less biased way rather than just the patients we think have the correct phenotype. Matter of trust (consent).

Secondary variants and screening. Starting point cancer susceptibility: 55 syndromes; dropping those without causative genes leaves 27 syndromes. Filter variants of associated genes, assign a score based on mutation type, family history, etc., to come up with a pathogenicity scale. Eight class 5 (pathogenic) variants found among ten subjects. Screened for malignant hyperthermia susceptibility, prevalence 1:2000 to 1:10000; two genes account for 85% of mutations (RYR1, CACNA1S), but huge allelic heterogeneity. Published causative mutations present in 1.4% of ClinSeq exomes, two orders of magnitude more frequent than cases attributed to this disease. Unlikely to be causative. A patient with one of 30 causative RYR1 mutations but no history — he would have been diagnosed with the disease, but is not affected.
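The discrepancy argument is simple arithmetic on the figures above:

```python
# "Causative" mutation carriers in an unselected cohort vs. disease prevalence.
carrier_freq  = 0.014       # fraction of ClinSeq exomes carrying a published mutation
prevalence_hi = 1 / 2000    # upper estimate of disease prevalence
prevalence_lo = 1 / 10000   # lower estimate

print(f"{carrier_freq / prevalence_hi:.0f}x to {carrier_freq / prevalence_lo:.0f}x "
      "more carriers than expected cases")   # 28x to 140x
```

Carriers outnumber expected cases by 28x to 140x, so most of these published mutations cannot be fully penetrant causes of the disease.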

Nice sampling of results when you just go looking: false negatives, false positives, real hits. Half of the patients with hits have no family history.

How and who do we screen? Context is crucial. Mundane for prior molecular diagnostics, mild degree of surprise for patients with prior family challenge, but a counseling challenge for those without.

Known unknowns: no idea of the spectrum of phenotypes associated with full spectrum of genotypes. Biases in our penetrance assessment; diagnostic abilities less than we think they are. Could use genomics to improve our trial and error medicine, even though it is challenging on the individual level. Even little improvements likely to be a big advance.

clinic ✳ personal genomics ✳ sequencing ✳ conference ✳ btg2012