not invented here.

Notes on bioinformatics methods, software and discussions.

1.27.2014 Genome in a Bottle 2014: Real Time Genomics and Platinum Genomes Pedigree Analyses

(Presented by Francisco De La Vega)

Venn diagram hell required ground truth to sort out. All kinds of components: experimental validation, orthogonal data, references from different platforms, all with their own biases and issues. Idea: use Mendelian segregation as ground truth.

Given sufficient offspring this becomes feasible. Illumina/CG sequenced complete CEPH/Utah Pedigree 1463 and made the data available. 2x100bp, 50x, three generations. Trace variant segregation across 11 offspring; can be done for SNPs, InDels, more complex variants. How to distinguish between false positives and ‘real’ mismatches?

Use phasing approach to identify crossovers: phase contiguity extension, connect haplotype islands, check calls vs haplotype framework. Requires filtering of problematic regions; algorithm minimizes the number of crossovers required to explain the haplotypes. Found ~1-2 events per chromosome, total of ~650 crossovers. Can then test for phase consistency of variants: discordant variants that meet the phase restrictions are likely to be true (others likely false positives).

Can filter based on the likelihood that a set of genotypes is phase consistent by chance alone. See good improvement of using trio pedigree over raw NA12878 truth set (in ROC curves). In GiaB “confident” regions good agreement, improvements in the remaining regions. See GiaB note for pointers to software and Co.
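To make the phase-consistency idea concrete, here is a minimal sketch (mine, not the RTG or Platinum Genomes implementation). It assumes the simplest case: one parent heterozygous, the other homozygous reference, so a child carries the alt allele only if it inherited the alt-bearing haplotype. The `inherit` vector (which of the het parent's two haplotypes each child received at this locus) and both function names are hypothetical; in practice the vector would come out of the crossover analysis.

```python
from typing import List


def phase_consistent(inherit: List[int], carriers: List[bool]) -> bool:
    """Check whether the carrier pattern matches one parental haplotype.

    inherit[i]  -- 0 or 1, haplotype of the het parent passed to child i
    carriers[i] -- True if child i was called as carrying the alt allele
    """
    if len(inherit) != len(carriers):
        raise ValueError("one inheritance state per child required")
    # The alt allele sits on haplotype 0 or on haplotype 1; try both.
    on_hap0 = all(c == (h == 0) for h, c in zip(inherit, carriers))
    on_hap1 = all(c == (h == 1) for h, c in zip(inherit, carriers))
    return on_hap0 or on_hap1


def chance_probability(n_offspring: int) -> float:
    """Probability that random carrier calls look phase consistent.

    Assumes each child is an independent coin flip: 2 of the 2**n
    possible carrier patterns match a haplotype.
    """
    return 2 / 2 ** n_offspring


# Hypothetical inheritance vector for 11 offspring at one locus.
inherit = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
good_calls = [h == 1 for h in inherit]      # alt allele on haplotype 1
bad_calls = [not good_calls[0]] + good_calls[1:]  # one child miscalled

print(phase_consistent(inherit, good_calls))
print(phase_consistent(inherit, bad_calls))
print(chance_probability(11))
```

With 11 offspring the chance of a random pattern passing is 2/2048, which is why large sibships make this kind of filter workable.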

Requires collecting data for whole family. Can merge multiple platforms for a phase-consistent truth set. Still needs to deal with CNVs, SVs, systematic errors.

giab ✳ workshop ✳ ngs ✳ software 

1.27.2014 Genome in a Bottle 2014: Working group reports

WG1 (Reference material): NCI-based fosmid data for somatic variant calls; additional cancer-specific mutations from Affymetrix. How to make use of plasmid samples with undefined quality characteristics? How to vet these, compare different types of material for suitability? What other methods / sets are needed? Deep sequencing of a cell line? Cell lines will change over time. ACGT collection might fit the required needs (pending the usual consent issues).

Come up with a study design and conduct experiments required to answer these questions (volunteers welcome).

WG2 (Characterization): Degraded DNA material, synthetic mutations, maybe mixed in with trio data, test sequencing depth required to detect synthetic variants. Start making use of long reads for phasing purposes in the trio (PacBio, Moleculo).

WG3 (Bioinformatics): From data to integration. Reviewed FTP structure, needs layer of representation for pedigrees etc., tools folder starting to link to GitHub and Co for better re-usability, can also include binary tools that otherwise could not be re-distributed. Accession approach (submitter, format, version information) for BAMs, VCFs and Co required, including everything needed for subsequent submission to SRA/ENA.

Consortium data model in flux, Global Alliance for Genomics and Health might be a good partner to gain recognition. Would also provide minimal metadata framework. Will require format conversion at least for a while so data remains accessible.

New genome build a major discussion point. Value of hg20 over hg19 at this point not entirely clear; but if consortium wants to drive / lead developments switch is required. Haplotype-aware aligners not widely used just yet; right now at least need minimal solution (i.e., ignoring the alternates). Extra layer to handle alternates, to decide which ones fit the data best. Final layer would be de novo assembly; back to the question what the truth set is. Pipeline can be imperfect but the answer set is perfectly concordant with reference set.

What are the frameworks for pedigree-based analysis, how to keep them reproducible? How to handle problematic regions — compile internal region lists, consolidate with NIST regions? How to use low coverage survey sequencing projects?

WG4 (Visualization / Figures of Merit): How to allow users to evaluate their own data in the context of NIST, other reference data? Target audience for now prioritized to regulators (FDA), clinical labs, and to some extent industry for R&D in platform development. Requires web interface for labs without bioinformatics support; uploaded files still need some sort of standardization (API for bioinformaticians who want to go beyond the UI).

Requires high-level report (performance metrics), with the ability to drill down into the whys of discordance. Data, scores, matrices need to be made available for certification purposes. Punt file formats to WG3 (most likely not VCF). Key point is that context is needed: what evidence for what variant, how were variants filtered.

Reference set should not be changeable, but mechanism to provide feedback along with evidence needed. [See previous blog entry for more details on this session.]

giab ✳ workshop ✳ sequencing 

1.27.2014 Genome in a Bottle 2014 workshop in Stanford: Visualization group notes

Previously led by Justin Johnson (through EdgeBio and the XPRIZE validation framework). Now led by Deanna Church (Personalis.. still feels weird writing that) and Melissa Landrum (NCBI); converging with Lisa Kalman’s GeT-RM project.

Goal: user-friendly interfaces to reference genome data. Identify target audience, define user interface requirements (truth sets to use, input/output, visualization, integration).

Existing: GCAT, GeT-RM with consensus SNP tracks. Started to collect metadata, gather tracks (region categorization), coordinate different efforts.

Quick intro to (ten year old) GeT-RM program from Lisa. Interaction with clinical labs, require quality control materials. Need predates NGS: develop, validate, maintain; not just for individual genes but now ~20k genes. Coordinate with NIST, Coriell, etc. Also selected 1000G/HapMap samples, collected variant call data (VCF/BAM where available) from ~30 clinical labs (gene/exome/WGS); metadata hard to capture. [Still surprised there was no pushback to get participating labs to send unified VCF rather than tons of different formats]. Visualize results in GeT-RM browser to make results available and accessible to participants. Plan is to add more tools to the browser, e.g., ability to choose different truth sets, ability to upload one's own call sets, reporting functionality. Target audience: clinical labs (with limited bioinformatics expertise). Maybe an API [fingers crossed].

NIST might set up a competition for visualization approaches; how to abstract from browser tracks to higher level feedback.

Questions / notes from the discussion:

  1. Difficult to understand where discrepancies between data sets are coming from. Filtered by QC, filtered because of validation, thrown out due to non-pathogenicity (which is not reported at all, a problem for validation studies)? Might need this level of granularity.
  2. Also need a stronger focus on the difficult variants. 40bp indels, tricky CNVs and Co; the non-standard variants that have a high likelihood of being pathogenic.
  3. Specifying validation tests almost an impossible task (technical, political reasons). Tool that follows some of the better workflows, is reproducible, allow it to be tweaked. I.e., rather than specification paper find ways to make the environment useful to the scientific community.
  4. Should we use existing resources to slice/dice subsets? What is platform specific, what do you need to pick up under any condition, etc. Goes back to #1; if I find something present in my VCF but not in the reference set where else was it found? I.e., platform or workflow specific variant call or something entirely different?

Overall problem of having to focus on some aspects to move working group forward. Focus on regulators, clinical decision makers?

Invitae (Steve Lincoln’s) example of clinical validation workflow. For each sample: take previous data, run through own lab and workflow, obtain HGVS string, list concordance, discordant pairs, add reproducibility data (agreement with own data run on a different day). Calculate sensitivity & specificity (not ROC curves, not FDR). Also, Analytic Range: given the increasing size of samples and sites not everything can be validated. State ‘in regions like this we did find’ rather than global statements (genomic regions, class of variants, etc.). One table for each of the subsets (concordant, discordant, reproducibility arm). No good standard exists for that. [Back from the days of single gene comparisons.] Discussion on what to use as range for GiaB. RefSeq CDS, other simple sets?

Could be useful to have an intersect in frameworks such as GeT-RM: pick truth set, pick analytical range (rather than merge the two into individual truth sets).
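A small sketch of what per-range reporting could look like, assuming the comparison against a truth set has already been reduced to TP/FP/FN/TN counts for each analytic range. The range names and all counts below are made up for illustration; this is not Invitae's or GeT-RM's code.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Counts:
    tp: int  # truth variant, called
    fp: int  # called, not in truth set
    fn: int  # truth variant, missed
    tn: int  # truth reference site, called reference


def ratio(num: int, den: int) -> Optional[float]:
    """Return num/den, or None when the range holds no such sites."""
    return num / den if den else None


def sensitivity(c: Counts) -> Optional[float]:
    return ratio(c.tp, c.tp + c.fn)


def specificity(c: Counts) -> Optional[float]:
    return ratio(c.tn, c.tn + c.fp)


def report(by_range: Dict[str, Counts]) -> Dict[str, Dict[str, Optional[float]]]:
    """One row per analytic range instead of a single global number."""
    return {
        name: {"sensitivity": sensitivity(c), "specificity": specificity(c)}
        for name, c in by_range.items()
    }


# Hypothetical counts, for illustration only.
example = {
    "coding SNVs": Counts(tp=95, fp=10, fn=5, tn=990),
    "coding indels": Counts(tp=18, fp=4, fn=2, tn=196),
    "untested range": Counts(tp=0, fp=0, fn=0, tn=0),
}

for name, row in report(example).items():
    print(name, row)
```

Returning `None` for an empty range keeps the ‘we did not look here’ case visible rather than silently reporting a perfect or zero score.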

Focus on clinical end user, what is needed for accreditation (summarized results, but also the full data matrix). Other users might be interested in ways to slice and dice the variant information.

Need to have the ability to capture community information. When users go to the trouble of re-sequencing reference material to solve difficult regions that information should be accessible… somehow. [Wiki? Browser tracks? VCFs?]

[Footnote: Invitae to release complete, open-source HGVS parser; there’s also one from Counsyl.]

giab ✳ visualization ✳ workshop 

1.27.2014 Genome in a Bottle 2014 workshop in Stanford: Day 1 morning session

Couple of notes taken on the fly; all errors mine.

Intro from Marc Salit

NIST initiative in the Stanford area (ABMS, Advances in Biological/Medical Measurement Science Program). Hub of genomics innovation despite Boston claims (hah!). Life, Agilent initial partners. Short summary of GiaB history, from idea to kickoff meeting to recent developments.

This meeting about initial reference materials and suite of tools to evaluate. “Yellow light” tools to indicate all looks well, carry on with caution. “Red light” the priceless signal that is needed. Technology and platform comparisons important second priority. Trying to stay away from regulatory involvement.

NA12878 still the initial plan, but consenting limited. Extension to PGP trios for a more complete set. Eight trios, focus on children; varying biogeographic ancestry. Open consent, state of the art. Details on Marc’s slides.

Usual working groups, with additional ‘meta’ topics. Includes move beyond the easy regions of the genome, incorporating new technologies, and selecting future genomes. Larger families for more power in confident variant calling, phasing? What approach to take for somatic variant calls? How to version reference materials?

Justin Zook providing more context

Whole genome RMs vs current validation methods: Sanger confirmation, high depth NGS confirmation, arrays, Mendelian inheritance, simulated data, Ti/Tv: all useful, complementary. Goal: approximately 0 false positive/negative calls in confident regions, don’t just take the intersection, avoid biases towards platforms or workflows.

Integration approach

Integration currently includes 14 datasets from 5 different platforms, keeps growing. Complex workflow to derive integrated call set, check characteristics of data sets at discordant sites to come up with reference agreement along with confidence score (excluding difficult regions with known duplications, long repeats, structural variation areas). Paper now accepted in Nature Biotechnology.

Work with GCAT to compare different workflows, algorithms. Can already spot platform systematic errors, regions where different aligners are struggling, negative variants dominated by complex variant representation problems (225k highly confident variants are within 10bp of another variant). FP/FN enriched for complex variants; RealTimeGenomics’ vcfeval tries to deconvolute these.
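The representation problem is easy to show with a toy example: two call sets can list different records and still describe the same haplotype. The sketch below replays variants onto a reference string and compares the resulting sequences. It only illustrates the idea, it is not vcfeval's algorithm; `apply_variants` and the sequences are hypothetical, and positions are 0-based.

```python
from typing import Iterable, Tuple

Variant = Tuple[int, str, str]  # (0-based position, ref allele, alt allele)


def apply_variants(ref: str, variants: Iterable[Variant]) -> str:
    """Replay non-overlapping variants onto ref and return the haplotype."""
    out = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not handled here")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"ref allele mismatch at position {pos}")
        out.append(ref[cursor:pos])   # untouched sequence before the variant
        out.append(alt_allele)        # substituted allele
        cursor = pos + len(ref_allele)
    out.append(ref[cursor:])          # remainder of the reference
    return "".join(out)


# Same deletion of one "AG" unit, anchored at two different positions.
ref_indel = "CAGAGT"
left_aligned = [(0, "CAG", "C")]
right_shifted = [(2, "GAG", "G")]

# One MNP versus two adjacent SNVs.
ref_mnp = "ACGTCA"
as_mnp = [(1, "CG", "TA")]
as_snvs = [(1, "C", "T"), (2, "G", "A")]

print(apply_variants(ref_indel, left_aligned) == apply_variants(ref_indel, right_shifted))
print(apply_variants(ref_mnp, as_mnp) == apply_variants(ref_mnp, as_snvs))
```

A record-by-record comparison would score both pairs as discordant, which is how nearby variants end up inflating FP/FN counts.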

Structural variation

Moving towards structural variation with a similar approach. Combine different methods (depth of coverage, paired end mapping, assembly based, etc.). Strong disagreement (less than 10% in the intersection of methods). Gathering metrics (soft clipping, mapping quality, # SNP calls, coverage) summarized by new software package, will be used to assess performance.

The todos

Todo: need to tackle ‘hard’ regions of the genome. Move towards pedigree calls as a different way to get high confidence calls. How do we integrate these calls with existing variant calls, and should we be looking for additional large, consented families to assess?

Pilot reference material (NA12878) being tested at various sites (at NIST, Illumina, CG, Garvan Institute, NCI, INOVA); up to 300x, arrays, exome/WGS, etc. Illumina/SOLiD/IonTorrent included. Assessment now in progress, including additional QC: are there differences between first, last vial of a batch (vial, library, day, flow cell, lane, sampling, etc.)? How stable is the same material (freeze/thaw cycles, vortexing, etc.)? Check size distribution with PFGE on top of that. Size degradation likely to be the biggest confounding factor.


On NIST FTP site, Amazon S3. Official release of reference material in the next couple of months.

giab ✳ workshop ✳ stanford ✳ standards 

9.29.2012 Exploring the Cancer Methylome

Peter Laird, University of Southern California, USA

70% of CpGs methylated. In a cancer cell a number of changes. Widespread (rather than global) hypomethylation and focal CGI hypermethylation in cancer, frequently in promoter regions. What is the relationship between those two discordant events?

125 colorectal adenocarcinoma genomes profiled for methylation, 29 adjacent normals. Clusters nicely, but are those associated with clinical features? Align with BRAF mutation status, for one (but mutations do not induce the methylation status). Another distinct CpG-island methylator phenotype, CIMP+, micro-satellite unstable subtypes also overlap (if incompletely). Can be used to generate epigenetic subtypes of colorectal cancer, applied to TCGA with the same outcome.

Synergy between cancer genetics and epigenetics. ES-Cell polycomb repressor complex targets are prone to abnormal DNA methylation in cancer (polycomb keeps master regulators of differentiation in poised state). Large number of promoters acquire methylation in cancers compared to matched normals, enrichment of polycomb targets among those cancer-associated methylated targets. Same genes also acquire increased methylation with age.

Polycomb crosstalk likely leads to cumulative stochastic methylation. Loss of polycomb and replacement with a more permanent silencing method (i.e., methylation) means target no longer able to differentiate, gets stuck in stem cell state without full differentiation capability. Great target to accumulate additional mutations over time. Potentially not an active process in cancer, but a hallmark of the event that led to the cancer development. Not a competitive advantage to develop a cancer, but a passenger event indicating that the progenitor cell could no longer differentiate.

Would explain DNA methylation of ~50% of cancer-specific methylated genes, consistent with stem cell-like behaviour of cancer cells, explains observation of epigenetic field effects adjacent to tumors: a differentiation block.

Focal hypermethylation and long-range hypomethylation: WG bisulfite sequencing of primary tumors and normal tissues. Shows sample CpG site in normal/cancer, striking difference of signal. Zooming out to ~20kb windows, nice heatmap/scatterplot of methylation signal shows partially methylated domains. Shows erosion of methylation pattern in window from ES cells to differentiated colon to cancer. Hypomethylation not uniformly distributed, clear windows. Epigenetically unstable regions close to the nucleus (late replicating regions, lamin attachment regions).

Comparison across cancer types: holds up across multiple cancers studied, with individual differences that should be interesting to tease apart. Comparison of 2200 TCGA cancers and 409 normals reveals cancer-specific profiles in unsupervised clustering and pairwise correlation analysis of all cancer types using all sites.

btg2012 ✳ Conference ✳ epigenomics ✳ Cancer 

9.29.2012 Analysis of somatic retrotransposition in human cancers

Peter Park, Harvard Medical School, USA

45% of human genome derived from transposable elements (TEs), able to replicate (copy/paste) across the genome via an RNA intermediate. Previous studies of TEs mostly in germline, >7000 insertions in 185 samples from the 1000G set. Implicated in single gene diseases with ~100 insertions reported so far (L1s, ALUs, others), e.g., neurofibromatosis type 1 contains 18 retrotransposon insertions. Some events found in cancer, for example L1 in MYC — but discovered in low-throughput studies.

Studied 43 cancer/matched normal genomes (150 billion reads) from TCGA (GBM, ovarian, colorectal, prostate, myeloma). Thousands of novel germline insertions, 194 high confidence somatic insertions identified (majority L1s).

Detecting events challenging. Numerous, often identical TE instances. Find cluster of read pairs where one end maps uniquely, other end maps to TE consensus sequence. Use of clipped reads (partially aligned reads) key. Added custom assembly of repeat elements to the genome to be able to find all repeat families at once. Tool called Tea (transposable element analyzer); insertions validated by Sanger. All somatic L1/Alu insertions in cancers of epithelial cell origin, none in blood or brain cancers. (Ouch: postdoc started with those tissues by chance.)
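The read-pair part of that search can be sketched in a few lines, assuming the alignments have already been reduced to anchor positions: the uniquely mapped end of each pair whose mate hit a TE consensus. This is not Tea's code; `cluster_anchors`, the 500 bp gap and the three-read support threshold are all placeholders I picked for illustration, and the clipped-read step is left out.

```python
from typing import List, Tuple

Anchor = Tuple[str, int]              # (chromosome, position of unique end)
Cluster = Tuple[str, int, int, int]   # (chromosome, start, end, n_pairs)


def cluster_anchors(anchors: List[Anchor],
                    max_gap: int = 500,
                    min_support: int = 3) -> List[Cluster]:
    """Group nearby anchors into candidate insertion sites.

    Anchors on the same chromosome are merged while consecutive
    positions are at most max_gap apart; clusters with fewer than
    min_support pairs are dropped.
    """
    clusters: List[Cluster] = []
    current: List[Anchor] = []

    def flush() -> None:
        if len(current) >= min_support:
            clusters.append((current[0][0], current[0][1],
                             current[-1][1], len(current)))

    for chrom, pos in sorted(anchors):
        if current and (chrom != current[-1][0]
                        or pos - current[-1][1] > max_gap):
            flush()
            current = []
        current.append((chrom, pos))
    flush()
    return clusters


# Hypothetical anchors: three supporting pairs on chr1, two stray ones.
anchors = [("chr1", 1100), ("chr1", 1000), ("chr1", 1250),
           ("chr1", 9000), ("chr2", 1050)]
print(cluster_anchors(anchors))
```

Candidates from a step like this would still need the clipped reads to pin down the breakpoint, which is the part the talk called key.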

64 out of 194 insertions located in genes, including tumor suppressors (UTRs and introns). Somatic insertions tend to occur in genes commonly mutated in cancer, disrupt expression levels of target genes, biased towards regions of cancer-specific DNA hypomethylation (all statistically significant).

Unclear where and how often these happen.

Quick shout out to Galaxy and Peter’s Refinery system, a data repository connected to the Galaxy backend for data analysis.

btg2012 ✳ Conference ✳ Cancer ✳ sequencing 

9.29.2012 MutaScope: a high sensitivity variant caller for amplicon sequencing

Shawn Yost, University of California San Diego, USA

Targeted therapy in cancer hinges on having more actionable genes: only 47 genes actionable (approved or in clinical trial). Designed ~2000 amplicons (150kb, 1000X coverage) to cover these. PCR amplicon and MiSeq allow for quick turnaround (UDT-Seq, Ultra-Deep Targeted Sequencing). Amplicon has fixed directionality, start/stop position, differs from whole exome sequencing. All mutations will have the same position in a read, making it difficult to identify false positives by usual metrics. High depth of coverage also unusual, but needed due to sample heterogeneity.

Applied to ~50 samples with wide range of percent of invasive tumour cells. MutaScope approach: BWA alignment, refine alignment, calculate experimental error rate using a germline sample, variant detection and classification.

Read refinement: each read assigned to the amplicon from which it was sequenced, which allows use of RG (read group) information to improve variant calls. Unclip soft-clipped bases to allow variant calls at the end of reads.

Error rates: based on germ line sample to correct for error rates dependent on position in read, GC content of reference. Use two models of tumor heterogeneity (spiked in tumor samples at 1/5/20% as a control or two mixed germ line samples). Tested against Samtools, Varscan, GATK. MutaScope works better for this kind of sequencing.
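The talk did not spell out the statistical model, so the following is only a generic sketch of testing a candidate against a background error rate: a one-sided binomial tail probability for the alt read count at a site. The binomial model, the function names and the example numbers are my assumptions, not MutaScope's method; the error rate would come from the germline sample for the matching read position and sequence context.

```python
from math import exp, lgamma, log


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def detection_pvalue(alt_reads: int, depth: int, error_rate: float) -> float:
    """P(X >= alt_reads) if every alt read were a sequencing error."""
    if not 0 < error_rate < 1:
        raise ValueError("error_rate must be strictly between 0 and 1")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must be between 0 and depth")
    tail = sum(exp(log_binom_pmf(k, depth, error_rate))
               for k in range(alt_reads, depth + 1))
    return min(1.0, tail)  # guard against rounding slightly above 1


# Hypothetical site at 1000x with a background error rate of 0.1%.
print(detection_pvalue(alt_reads=20, depth=1000, error_rate=0.001))
print(detection_pvalue(alt_reads=2, depth=1000, error_rate=0.001))
```

At 1000x, 20 alt reads (2% allele fraction) are very unlikely to be noise at that error rate, while 2 alt reads are not distinguishable from it, which is the reason for the depth.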

Applied successfully to clinical samples, studying prevalence of somatic and germline variants. Designed specifically for PCR amplicon, modular workflow from FASTQ to VCF. Additional information in VCF such as detection p-value, classification p-value and mutant allele read group bias.

software ✳ sequencing ✳ btg2012 ✳ Conference 

9.29.2012 Translating cancer genomes

Lynda Chin, MD Anderson Cancer Center, USA

Cancer as a disease of the genome. Discovery of the BRAF mutation by Wellcome Trust as the poster child of what genome analysis can do (proof of concept). Know the target, what it does, have the right target, and identify the right patient population subset should lead to therapeutic success — in theory. A large scale catalogue of mutations insufficient. Overview of RAC1 mutations, biological evidence of activating function of the mutation, still need better understanding to translate this into aims for a drug design.

Another example, Prex2 in melanoma, mutated and highly re-arranged in different patients. Unclear whether this is a driver mutation or noise — large number of mutations, but scattered all over the gene, no hotspots. Need to engineer mutations and test in vivo model system. That still does not define what it does or how, and more importantly is it rate limiting for the tumor?

All model systems with strengths and weaknesses; need to run several independent tests that should converge on the same result before trusting the results to not be an artifact.

Case study: landscape of somatic mutations in melanoma (Hodis, Watson; Cell 2012). Number of patients without BRAF or NRAS mutations, no treatment for this group. Start with genetic model, NRAS mouse promoter can be turned on and off in a tumor. NRAS initiates tumor, mutations in TRRAP, GRM3, SETD2 present. Switch NRAS off, tumor shrinks — NRAS is required for maintenance and a valid therapeutic target. Genetic ablation of NRAS induces tumor regression, not achieved by inhibition of MEK. Identified genes significantly altered by MEK inhibitor and the effect of NRAS (plenty of overlap, MEKi represents partial inhibition of NRAS activity). Majority of NRAS-associated genes not affected by MEK inhibitors though.

Tested for pathway enriched by RAS-specific genes (cell cycle proliferation, opposite of expectations). P53 decreased upon NRAS extinction, not after MEK inhibition. Network modeling (TRAP) to analyze which key regulators are responsible for the pathway differences, identified CDK4 as a putative key driver (proliferation checkpoint).

Missing step: pharmacological validation with a CDK4 inhibitor (commercially available). Only works in combination with MEK inhibitor (synergistic effect); confirmed in ex vivo test. Partial inhibition uncouples apoptosis and cell cycle arrest.

Model system vital to understand how the signaling of the pathway works, bypass redundant / complex feedback systems. Enables a wider therapeutic window. Systems approach to collect data at will crucial to develop combination therapies.

btg2012 ✳ clinic ✳ sequencing ✳ drugs ✳ Conference 

9.28.2012 Genomics – Catching up to Human Genetics

Richard Gibbs, Baylor College of Medicine, USA

aka Genomic Medicine 20??. From individual variation (Watson) to population variation (Desmond Tutu project), identification of actionable variants (Jim Lupski’s project, NEJM 2010), medical management and intervention (2012), everyone (20??). What is the main utility, and should we all be sequenced?

Technical development:

Still long way to go, not a perfect genome yet despite rapid developments. Capture technologies to stick around a bit longer as higher coverage in important regions helps

Healthy adults:

Who wants a test without medical indication? Mike Snyder, for one. Still useful information such as site frequency spectrum (1000G useful even without detailed phenotypes)

Complex disease:

GWAS vs Mendel. Few actionable alleles vs low frequency/high impact, can we have an integrated model? Can we construct models of complex diseases (‘oligogenics’). ARIC and CHARGE consortium with 30,000 well phenotyped individuals, 4,500 exome-seq’d sees Mendelian alleles in the ‘normals’

Mendelian disease:

Severe diseases, often children, collectively frequent, cites actionable case studies (40 mendelian diseases studied right now at HGSC alone, ‘industrialized’ pipeline). Studies take away endless exhausting lists of diagnosis, treatments. Huge value in molecular diagnosis even without treatment. Problem of too many small pedigrees, need a surrogate for what a variant does as part of a functional assessment. Whole Genome Lab launched 2011, steady increase of samples (by end of 2013 10k samples/month, no way to do manual curation).

How much work will be in research vs clinical settings? Lots of additional success stories that are just impossible to cover without the pedigree pictures, gene names and impact.

Clan Genomics paper: “recent mutation may have a greater influence on disease susceptibility or protection than is conferred by variations that arose in distant ancestors”. Should be more worried about immediate family, variants not present in the broader population.


Role of inherited, acquired mutations, environmental mutations. TCGA et al ‘damn successful’ in uncovering new mutations and functions. New technologies allow analysis of low frequency cancers. Different trios: normal, primary, recurrence allow clonal analysis, trace evolution, study time dependency between mutational events.


Need to find the family or cohort first. Data ends up in medical records which can be mined for resources; likely growth in social networks / 23andMe’s to create cohorts, whereas designed population studies will decrease.

Future prediction: all the excitement stale in a few years. Passion will move to other fields as this becomes complete routine: it’s water coming out of the faucet.

sequencing ✳ conference ✳ btg2012 

9.28.2012 Surname leakage from personal genomes

Yaniv Erlich, Whitehead Institute for Biomedical Research, USA

Co-segregation between Y-Chr and surnames used by public services, allows you to connect to relatives; second database of interest SMGF used to extract a total of 140k surname/Y-chr-test data points.

What is the probability to recover a surname using genomic data? Success rate of ~12%. What about adding age, state as additional metadata (often included in public records, publication also allowed)? Age 40, State Colorado, surname Smith. Median size of 12 people identified by this combination.

The Venter case, putting it all together. Profile STR markers with lobSTR. Now sufficient to get Venter surname from (Try yourself!). The same person does NOT need to be in the database, sufficient for relatives to be included.

Remaining talk not tweet-able. Shame, _really fun and intriguing analysis_.

privacy ✳ sequencing ✳ Conference ✳ btg2012