Mihai Pop - University of Maryland, USA
Conclusion: assembly is impossible. [How’s that for an intro!]. Analogy of being unable to put together a puzzle. 25-mer example of an M. genitalium assembly graph that is impossible to resolve (i.e., finding the correct / ideal path through the graph is NP-hard). Alternative is to make things fit: take 100bp reads and generate 36bp contigs as output. Longer read lengths help but do not resolve all ambiguity (Kingsford, BMC Bioinformatics 2010). Even with Sanger or PacBio and reads of around 1kb not all genomes can be assembled, using Y. pestis as an example. Most of the genes can be successfully assembled even with short reads, though.
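To make the ambiguity argument concrete, here is a minimal sketch of a de Bruijn graph built from reads (my own toy code, not anything shown in the talk). The genome string, read length and helper names are made up for illustration. A repeat shorter than the node size disappears as a branch point, which is the intuition behind "longer reads help".

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor, i.e. where the path is ambiguous."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)

# Hypothetical toy genome: the repeat "ATCG" occurs in two different contexts.
genome = "TTATCGGAATCGCC"
# Error-free reads of length 8 covering every position (an idealised assumption).
reads = [genome[i:i + 8] for i in range(len(genome) - 8 + 1)]

ambiguous_k4 = branching_nodes(de_bruijn_graph(reads, k=4))
ambiguous_k6 = branching_nodes(de_bruijn_graph(reads, k=6))
print("branch points at k=4:", ambiguous_k4)
print("branch points at k=6:", ambiguous_k6)
```

With k=4 the graph forks after the repeat; with k=6 the nodes span the repeat and the fork goes away. Real genomes have repeats longer than any practical k, so the forks stay.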
Metagenomic assembly in practice: existing assemblers equally good or bad. Settled on SOAPdenovo and a specific recipe [Yay for making that available!]. Results differ depending on the microbial communities assembled, tested on different body site data from the HMP. Collector’s curves for stool samples (lowest human DNA contamination). Despite thousands of expected species, the majority of species are captured at about 60-80% of the data (1-2 lanes of Illumina).
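For readers who have not met collector's curves: the idea is to count how many distinct taxa have been seen as more and more of the data is added. A rough sketch, assuming each read already carries a taxon label (the labels, step count and function name are hypothetical, not from the talk):

```python
import random

def collectors_curve(labels, steps=10, seed=0):
    """Return (fraction_of_data, distinct_taxa_seen) pairs for one shuffled pass."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    # Positions in the shuffled data at which the curve is sampled.
    checkpoints = {round(len(shuffled) * s / steps) for s in range(1, steps + 1)}
    seen = set()
    curve = []
    for i, label in enumerate(shuffled, start=1):
        seen.add(label)
        if i in checkpoints:
            curve.append((i / len(shuffled), len(seen)))
    return curve

# Made-up community: a few abundant taxa and one rare one.
toy_labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
toy_curve = collectors_curve(toy_labels)
for fraction, taxa in toy_curve:
    print(f"{fraction:.0%} of reads -> {taxa} taxa")
```

A curve that flattens well before 100% of the data suggests additional sequencing mostly adds depth rather than new species. In practice one would average over many shuffles.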
MetAMOS to handle hundreds or thousands of haplotypes at the same time. Different questions being asked, depth of coverage a different property. Collapsing bubbles (as sequencing errors) is not a wanted feature here; better repeat detection needed. Tons of work to be done; single genome assemblers have been in development for years. Pipeline includes gene finding, annotation, reporting, taxonomic reports. Each project needs a somewhat different workflow, tools and parameters, so the system needs to be flexible.
Created HMP mock communities as test sets. Mixtures of bacterial and eukaryotic cells of known composition (even composition, staggered with different abundance across a log scale). MetAMOS misses only about 20% of the reference genomes, but (similar to other assemblers) covers only 60% of the overall genomic sequence. Very low error count given the circumstances (chimeras, etc.). Integrated visualization (Krona) to present the diversity of samples.
Testing for functional enrichment in obese gut microbiome. Homocysteine production pathways enriched in microbiome of obese patients (a link to heart disease? should we be treating the microbiome rather than the host?).
What are the microbiome dynamics? Time series data from mice undergoing a change of diet, infer which microbes interact with each other, currently only resolved at a high phylogenetic level, but the same interaction network is being generated even given seemingly unrelated time series information. Need to unravel those interactions.
Karen Nelson - J. Craig Venter Institute (JCVI), USA
[Pretty much completely unfamiliar with the metagenomics fields, my notes will reflect that, I’m afraid.]
General theme of ‘new opportunities’, quick overview of the Sargasso Sea study which created a backlog of sequencing. Expanded to global ocean sampling and analysis and the need to move away from sequencing individual microbes to metagenomics approaches. Diverse range of topics (soil, air, lake water, oceans, …), some projects funded since 2001 to estimate microbial diversity.
Majority of microbes remain unknown, likewise the relationships between the microbiome and the human host. Clear role in health, for example in the gastrointestinal tract. Field moving towards studying larger cohorts rather than individuals to address these relationships. Quick detour through the history of metagenomics studies of the human gut from 2005 onwards, highlighting the increase in microbiome species discovered, challenges of IRB processes, how to handle host DNA contamination (up to 90% of the sequences) when it comes to informed consent.
NIH Roadmap Human Microbiome Project (2007, 160 million dollars), > 3000 reference genomes of microbes affiliated with the human body, 300 healthy subjects. Overall aim is to generate a healthy, well defined reference cohort of specimens to study the human microbiome. Required reagent repository, protocols, databases, technology development. Demonstration projects on microbiome changes in health and disease. Catalog of reference genomes published in 2010 (Nelson et al, Science).
Some of the technologies involved include single cell sequencing, specialized cell sorting to handle bacteria that cannot be cultivated. Catching up on viruses and phages as part of the project. Some of the findings include that different body sites have distinct communities; even different teeth have different microbial signatures.
Mostly collaborative projects with medical communities:
Total of ~15 projects at any given time that try to go beyond just sequencing and signatures. What are the microbes doing over time, what is their impact on the host, how are population changes contributing to this? Can we identify probiotics from healthy individuals? Can microbe populations be manipulated for the host’s benefit?
Microbiome itself only lists the diversity at the sequence level, no information about active species or genes. Transcriptomics and metabolomic approaches to elucidate these questions, but raise additional problems of methods development, lack of standardization and integration across cohorts.
Still in the early stages of learning about this superorganism. Need for geneticists and microbial biologists to talk to each other a lot more. Lots of room for bioinformatics tool development.
Stephen Kingsmore - Children’s Mercy Hospital in Kansas City, USA
OMIM lists ~7000 inherited diseases, 3-4% affect children, for half the molecular basis is unknown. Exome sequencing helps to elucidate causes for others and leads to reclassification of common complex diseases. Large number of different inheritance patterns (autosomal recessive, X-linked, dominant, mitochondrial, imprinting disorders, …). About 1000 diseases with sufficient knowledge for molecular testing [Paging @phylogenomics.. Mendelianome?]. Typical diagnosis is one-at-a-time Sanger sequencing, newborn screening of ~60 treatable conditions ($1-3/test), pre-conception carrier testing.
FDA-approved process or CLIA laboratory, sequencing not an approved process yet (see Mark Boguski’s talk); patent processes another quagmire (BRCA1/2).
A need for mendelian genomic medicine with multiple concepts:
NGS-based clinical sequencing is very different to Sanger-based molecular diagnostics. Multi-plexing is obligatory to achieve economic efficiency (multiple disease tests, samples at the same time), turnaround as low as four weeks. Exome-seq as a research tool too expensive still for diagnostic purposes, too high a risk of missing exons, poor handling of GC-rich regions. Ultimate approach WGS once price becomes reasonable to avoid inefficiencies of capture methods.
Outline of a standardized NGS screen for 619 inherited diseases. Counseling, blood sample, enrich 0.1% of genome, sequence around 4GB, genotype, report differences, results report and counseling. Need to take ethics into account (do not check for variants not specifically requested). Around 3 million dollar investment (1/3 IT infrastructure, 96-well semi-automated system). Custom library for Illumina TruSeq; HiSeq 2000 running 96 samples in 5 days, ~95% of targets have > 16X coverage sufficient for clinical diagnostic.
Need to translate alignments into allele frequency to reliably distinguish heterozygous and homozygous calls, cannot undercall these. Tweaked sequencing depth and enrichment technology until reaching 99.99% specificity, 96% sensitivity. Interpretation can be automated (90%, 9% manual, 1% case conference as aim). Despite three decades of variant annotation there is no workflow to identify benign/pathologic variants (mentioned a few times before).
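A minimal sketch of what "translate alignments into allele frequency" could look like, purely my own illustration. The 16X minimum depth comes from the coverage figure in these notes; the 0.2 / 0.8 allele-fraction cut-offs and the function names are assumptions I picked for the example, not values from the talk.

```python
def call_genotype(ref_reads, alt_reads, min_depth=16, het_range=(0.2, 0.8)):
    """Call a genotype from read counts using the alternate-allele fraction.

    min_depth and het_range are illustrative assumptions, not validated cut-offs.
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "no-call"
    alt_fraction = alt_reads / depth
    if alt_fraction < het_range[0]:
        return "hom-ref"
    if alt_fraction <= het_range[1]:
        return "het"
    return "hom-alt"

def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

print(call_genotype(10, 10))   # balanced reads
print(call_genotype(1, 29))    # almost all reads carry the alternate allele
print(call_genotype(5, 5))     # too shallow to call under the assumed minimum depth
```

The sensitivity / specificity helper is there only to pin down the two terms; the counts you feed it would come from comparing calls against a validated truth set.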
[Skipping sample cases as I got distracted, ahem…]
Many (~20%) of the disease mutations cited in the current literature are common SNPs or falsely annotated. Echoes previous calls for a clinical-grade variant database; ClinVar at NIH is one such initiative.
“Circle of hopelessness”: paltry funding for rare diseases, limiting testing and therapies, little known about the mutations, lack of ascertainment or timely diagnosis, limited therapeutic options. Try to tackle this by compiling the collectively large disease burden to raise collective R&D funding, generate comprehensive genetic testing, improve diagnosis and therapy.
Opens with Brenner’s quote that we are the model organism now. Can we find unreported idiopathic disorders and identify their genetic basis? Shows large Utah pedigree with disease information; 10-30 genetic cases presented at a weekly case conference, many of which are not described in the literature. Sample case of ‘old’ babies who tended to die at around 1 year of age from cardiac arrhythmia; not described in the literature despite distinct features (rigid, loose skin, superficially similar to progeria).
Autopsies unremarkable other than cardiac failure. Did X-linked capture for sequencing, ANNOVAR and VAAST analysis, identified one mutation that perfectly separated affected / unaffected family members affecting the amino-terminal acetylation of proteins. Now identified additional family with exact same mutation, completely unrelated with no common founder.
Joris Veltman - Nijmegen Centre for Molecular Life Sciences
Take-home message to start with: Dogma of inborn diseases inherited from generation to generation inherently wrong; think rather about de novo mutations. Many diseases occur sporadically and are associated with reduced fitness (pedigree with intellectual disability, autism, schizophrenia for which linkage analysis likely not helpful).
If sporadic and selected against why are they so frequent? Intellectual disability as an example, IQ < 70, limited adaptive behaviour, heritability score > 0.8, in 5% of patients a large chromosomal rearrangement. Additional causes unknown; de novo CNVs seem to be causal for 15% of ID cases, happening throughout the genome. Assumption that they are causal based on size of deletion and absence in controls, though recurrent CNVs improve clinical interpretation. Copy number screening now a standard diagnostic approach.
Selection-mutation balance hypothesis: balance between copying errors and selection eliminating these mutations. Requires an estimate of the mutation rate, currently pegged at around 50-100 de novo mutations per genome. If de novo mutations are important in disease then the mutational target size determines the frequency of the disease. Potentially hundreds of genes as targets in ID, autism and other common diseases:
The main challenge is that each patient likely will have a de novo mutation in a different gene. Tackled (again) using ten patient trios with severe ID (IQ < 50) to filter out private variants present only in a given family. Negative family history, no obvious karyotype. Standard sequencing and analysis workflow. Screening out variants shared with the parents enriches for sequencing artifacts; requires very high coverage and careful filtering. Prioritization strategy resulted in 143 private variants (on average), and an average of 5 variants after excluding inherited variants, which can be handled through Sanger (0-2 validated per patient). De novo mutations affecting nine different genes identified in 7 out of 10 patients. Two ‘positive’ controls with known affiliation to ID.
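The core of the trio filter is a set difference: keep what the child has and neither parent (nor a reference panel) has. A toy sketch with made-up variant tuples and a hypothetical function name; it says nothing about the actual pipeline used in the study.

```python
def de_novo_candidates(child, mother, father, known_variants=frozenset()):
    """Variants seen in the child but in neither parent nor a known-variant set.

    Variants are (chrom, pos, ref, alt) tuples; all inputs are plain iterables.
    """
    return sorted(set(child) - set(mother) - set(father) - set(known_variants))

# Made-up example calls, not real data.
child_calls = [("1", 100, "A", "G"), ("2", 200, "C", "T"), ("3", 300, "G", "A"), ("X", 50, "T", "C")]
mother_calls = [("1", 100, "A", "G")]
father_calls = [("2", 200, "C", "T")]
panel = {("X", 50, "T", "C")}  # stands in for a population variant database

candidates = de_novo_candidates(child_calls, mother_calls, father_calls, panel)
print(candidates)
```

Note the catch from the talk: anything that survives this filter is as likely to be a sequencing artifact in the child (or a missed call in a parent) as a true de novo event, hence the need for high coverage in all three samples and validation afterwards.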
Genes with a functional link to ID have a higher likelihood to be disease causing based on protein structure and conservation. Validated 600 ID patients for mutations in YY1, a TF involved in memory and plasticity. New work indicates that as much as 50% of patients with severe, non-syndromic ID may be explained by de novo mutations, in addition to the 15% of the CNV cases. Similar study results in autism, schizophrenia (publications in Nature Genetics).
What is the frequency of de novo mutations, are there hotspots, what are the risk factors influencing their generation? How do we discriminate benign and pathogenic mutations?
Sekar Kathiresan - Harvard Medical School, USA
[Talk given by Ron Do. @bioinfosm again found almost matching slides (pdf)].
Myocardial infarction a leading cause of death in the US, heritable component, but disease mechanisms and pathways remain unidentified. Around 30 risk loci identified in coronary heart disease GWAS. Use exome sequencing as a discovery tool. Correlated with LDL-C levels.
Familial low LDL in hypobetalipoproteinemia. Type 1 APOB related, type 2 unknown. Studied family with 38 members across 3 generations, APOB mutations were ruled out. Selected the most extreme sibling pair (wrt low lipid phenotypes), parents with higher lipid values but still below the population mean. Sequenced exomes at around 200X, variant calling (GATK), custom filters and annotations [again no details given]. After basic QC around 16k candidates, ~500 not in the 1000 Genomes set, 260 not in the control exomes. Usual split of missense, nonsense, splice-site distributions. Only one gene (ANGPTL3) affected in both siblings, sharing two nonsense variants in the first exon. Confirmed by Sanger sequencing.
Genotyped mutations in all 38 family members, transmission paralleled affection status. Individuals with both variants have low TG (triglycerides), LDL-C and HDL-C; the impact on individuals with one mutation indicates a recessive effect. Replicated in an independent family and extended to a population level through the Global Lipids Genetics Consortium.
Looking at rare variants in population, assuming that both private and low-frequency variants can contribute to the MI risk. Again through the ESP, studied early-onset MI cases (1200 cases, 1200 controls with high Framingham risk score potentially enriched for protective mutations). Currently at n=970, cases on average 20 years younger than controls. T1 test (variant CMC test) to identify non-synonymous variants below 1% in cases and controls (per gene). No definitive gene stands out, but there is a systematic deflation of p-values, partially because many genes do not have rare variants and result in a low allele count (20% of the genes). NBEAL1 gene one of the top hits, a GWAS locus for MI, p = 10^-4.
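As a rough illustration of a collapsing-style burden test: count, per gene, how many cases and controls carry at least one rare variant and compare the two counts. This is a simplified stand-in of my own, not the exact T1 / CMC implementation used in the study; the 1% threshold comes from the notes, everything else (function names, Fisher's exact test as the comparison, toy counts) is an assumption.

```python
from math import comb

def collapse_carriers(genotypes, freqs, max_freq=0.01):
    """Count individuals carrying at least one variant with frequency below max_freq."""
    rare = {variant for variant, freq in freqs.items() if freq < max_freq}
    return sum(1 for variants in genotypes if rare & set(variants))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Made-up data for one hypothetical gene: each set lists the variants one person carries.
freqs = {"v1": 0.002, "v2": 0.008, "v3": 0.15}
cases = [{"v1"}, {"v2", "v3"}, {"v3"}, set()]
controls = [{"v3"}, set(), set(), {"v3"}]

case_carriers = collapse_carriers(cases, freqs)
control_carriers = collapse_carriers(controls, freqs)
p_value = fisher_exact_two_sided(case_carriers, len(cases) - case_carriers,
                                 control_carriers, len(controls) - control_carriers)
print(case_carriers, control_carriers, p_value)
```

Genes without any rare variant give a table with no carriers at all and a p-value of 1, which fits the remark above that such genes contribute to the deflated p-value distribution.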
Sequencing 5 T1 genes in 1000 individuals through Sanger, genotype 200 SNPs in 10k individuals. Target SNPs prioritization strategy [no details given] re-discovered known MI protective nonsense mutations. Expanded through imputation to a panel of ~60k samples (!), found a low-frequency MI SNP not present in the SNP panel but present in the exome panel. Top association results include LPA, MIA3, CDKN2BAS and other known risk factor genes, indicating that imputation approaches can be effective.
Still underpowered to reach genome-wide significance for a mutation burden result. Example calculation for NBEAL1, would need around 5000 cases/controls. Single low frequency variants (PCSK9 as an example) also require around 5000 samples for robust results.
Josh Akey - UW Genome Sciences, USA
[Talk given by Timothy O’Connor, a postdoc in Josh’s lab]. NHLBI Exome Sequencing Project (ESP), goal to sequence > 7000 high coverage exomes (Seattle, Broad), sample from 250,000 participants to find cases for complex diseases. Start with ~2500 exomes at average 140X, mapped, multisample variant calling to generate 1.2 million variants, removed duplicates and related exomes, retained intersection of all targets, quality filtered for about 500,000 variants.
About 90k shared between the European Americans and African Americans (roughly the same number of samples in the cohort). 82% of 500k variants novel, including > 6000 nonsense mutations circulating in the population. Site frequency model shows high skew towards rare variants which make up the majority of the mutations (60% singletons, one sample only; 72% present in 1-3 samples). About 15,000 variants per person.
Number of variants a function of sample size: exponential growth. Most of the variation >1% frequency found already, bulk of growth from rare variants. Large magnitude of change of mutations between genes, African Americans with greater diversity. Immune-related genes with highest diversity. Can map diversity across pathways rather than genes (rank ordering across KEGG pathways in this case); olfaction, immune related pathways on the high end of the spectrum. Conserved: cell cycle, recombination, protein export (basic pathways).
Check for purifying selection across different classes of variants (synonymous, missense, etc). Even among synonymous variants a significant proportion appears to be under non-neutral selection (~70% of variants).
Study recent demographic patterns (work from Carlos Bustamante) [way out of my depth in this part of the talk]. Incorporating admixture into the model to explain the 90k shared SNPs doesn’t explain the overlap perfectly; needs additional factors such as exponential population growth. Explore the global context of variation by mapping the ESP variants to other population samples. Allows identification of individual groups when using rare variants only (not sampling artifact, but unknown biological source). These do not cluster with the northern / southern European populations (HGDP), but fall outside. Potentially eastern European ancestry, but not enough resolution to determine this conclusively just yet.
82% of variants rare / novel, 72% of variants found at most 3 times. HLA/immune-related pathways with high variation, 14,000 variants per person on average, ~ 625 (!) deleterious variants per person; 120 genes with evidence for positive selection, 85 novel.
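The singleton / "1-3 samples" numbers are read off a site frequency spectrum: for each variant, count how many samples carry it, then tabulate those counts. A small sketch with made-up counts chosen only to mimic the proportions above (function names are my own):

```python
from collections import Counter

def site_frequency_spectrum(carrier_counts):
    """Map 'number of samples carrying a variant' to 'number of such variants'."""
    return Counter(carrier_counts)

def fraction_at_most(sfs, k):
    """Fraction of variants found in at most k samples."""
    total = sum(sfs.values())
    return sum(n for count, n in sfs.items() if count <= k) / total

# Toy input: one entry per variant giving the number of samples that carry it.
toy_counts = [1] * 60 + [2] * 8 + [3] * 4 + [10] * 28
toy_sfs = site_frequency_spectrum(toy_counts)

singleton_fraction = fraction_at_most(toy_sfs, 1)
rare_fraction = fraction_at_most(toy_sfs, 3)
print(f"singletons: {singleton_fraction:.0%}, in 1-3 samples: {rare_fraction:.0%}")
```

A spectrum this heavily weighted towards the low counts is what one would expect after recent rapid population growth, which ties in with the demographic modelling later in the talk.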