INVESTIGATORS

Sending samples to the Center:

Once your samples have been accepted for sequencing through the CMG, we will mail a sample kit to your address and will email a corresponding sample manifest. We will also send along instructions.

Please ensure that sample quantity and quality are sufficient for sequencing: 3 ug of DNA per sample, concentrated at 50-100 ng/ul (quantified via the PicoGreen method), to allow for exome sequencing and whole genome sequencing if needed. Sample volume must be no less than 20 uL and no more than 600 uL; sample volumes outside this range cannot be accepted for automation purposes.
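The acceptance criteria above can be sketched as a simple pre-submission check. This is an illustrative sketch only; the function name and per-sample inputs are not part of the CMG manifest format:

```python
def check_sample(total_dna_ug, concentration_ng_ul, volume_ul):
    """Return a list of reasons a sample fails the submission criteria.

    Thresholds follow the requirements above: >= 3 ug of DNA,
    50-100 ng/ul concentration, and a 20-600 uL volume.
    """
    problems = []
    if total_dna_ug < 3:
        problems.append("insufficient DNA (need >= 3 ug)")
    if not 50 <= concentration_ng_ul <= 100:
        problems.append("concentration outside 50-100 ng/ul")
    if not 20 <= volume_ul <= 600:
        problems.append("volume outside 20-600 uL (automation limit)")
    return problems
```

A sample with 3 ug of DNA at 75 ng/ul in 40 uL returns an empty list (passes); a sample in 700 uL is flagged for volume.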

We will quantify all samples when they arrive at our Center using PicoGreen, and will send these metrics back to you prior to submitting your samples for sequencing.

You will be provided with a login for our secure online analysis portal, seqr, that provides collaborators with the ability to deeply analyze the genetic and phenotypic data from their samples. Each sample will be linked directly to PhenoTips, to ensure seamless entering, editing, and viewing of patient phenotype information.

PhenoTips is a software tool for collecting and analyzing phenotypic information for patients with genetic disorders (see https://phenotips.org/). This tool has been integrated with seqr to record phenotype information and pedigree structure. This platform allows for easier phenotyping by providing collaborators with a standardized language, the Human Phenotype Ontology (HPO), autocompletion/suggestion of phenotype terms, and customizable forms.

The Human Phenotype Ontology is developed by members of the Monarch Initiative and by the community at large. The HPO aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease. Each term in the HPO describes a phenotypic abnormality, such as atrial septal defect; terms are linked to hereditary and common diseases through over 115,000 annotations to known diseases. The HPO is built to be interoperable with other ontologies in support of data integration and comparison with model organisms. HPO profiles generated for rare disease patients form the basis of non-exact comparisons to known diseases and to model organisms, and Monarch algorithms using these data are used in numerous diagnostic tools and for matchmaking patients. An annotation sufficiency metric is used to assess the diagnostic capability of a given HPO profile by comparison against all the gold standard data available within Monarch. This results in the five-star rating seen in PhenoTips, Patient Archive, and other tools.

Find out more about the Monarch Initiative at monarchinitiative.org and about the HPO at www.human-phenotype-ontology.org (see also http://dx.doi.org/10.1093/nar/gkw1039).

If you already use PhenoTips to collect phenotypic information on your patients, please let us know and we can arrange to transfer this information into our system.

A minimum amount of phenotypic information is required for each participant. We will review your submission and may request that additional information be provided.

Introduction to seqr and Phenotips Training: Watch video tutorial

How to draw a pedigree in PhenoTips: Watch video tutorial

All sequencing data produced by the Broad Institute will be made available to collaborators on request. The CMG will make the following data accessible:

  • Array data
    • Access to array data (if generated for your samples) will be granted via Aspera.
  • Sample BAMs
  • VCF file containing all samples in cohort/batch
    • Access to sample BAMs and VCFs will be made available via Aspera. In preparation for data transfer, please click this link and use the following login credentials:

Note: Any software difficulties, firewall restrictions or any network issues experienced by collaborators are to be addressed with their respective institutional IT department. Broad Institute IT will provide support for server side issues.

The account created for data download/upload will be limited to 1 terabyte and will expire 30 days after creation. If additional space or time is required, collaborators will need to make separate arrangements with us.

Exome Sequencing

Whole exome sequencing and data processing is performed by the Genomics Platform at the Broad Institute of MIT and Harvard.  Libraries from DNA samples (>250 ng of DNA, at >2 ng/ul) are created with an Illumina exome capture (38 Mb target) and sequenced (150 bp paired reads) to cover >90% of targets at 20x and a mean target coverage of >80x. Sample identity quality assurance checks are performed on each sample. The exome sequencing data is de-multiplexed and each sample's sequence data is aggregated into a single Picard BAM file. 

Exome sequencing data is processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known indels. The BWA aligner is used for mapping reads to the human genome build 37 (hg19). Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) are jointly called across all samples using the Genome Analysis Toolkit (GATK) HaplotypeCaller package version 3.4. Default filters are applied to SNP and indel calls using the GATK Variant Quality Score Recalibration (VQSR) approach. Annotation is performed using the Variant Effect Predictor (VEP). Lastly, the variant call set is uploaded onto seqr for collaborative analysis between the CMG and the investigator.

The exome specifications were developed specifically for the Broad CMG. Additional details on Broad research exome products are available at http://genomics.broadinstitute.org/products/whole-exome-sequencing.

Genome Sequencing

Whole genome sequencing and data processing is performed by the Genomics Platform at the Broad Institute of MIT and Harvard. PCR-free preparation of sample DNA (350 ng input at >2 ng/ul) is accomplished using Illumina HiSeq X Ten v2 chemistry. Libraries are sequenced to a mean target coverage of >30x.

Genome sequencing data is processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known indels. The BWA aligner is used for mapping reads to the human genome build 37 (hg19). Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) are jointly called across all samples using the Genome Analysis Toolkit (GATK) HaplotypeCaller package version 3.4. Default filters are applied to SNP and indel calls using the GATK Variant Quality Score Recalibration (VQSR) approach. Annotation is performed using the Variant Effect Predictor (VEP). Lastly, the variant call set is uploaded onto seqr for collaborative analysis between the CMG and the investigator.

Additional details on PCR-free whole genome sequencing are available at http://genomics.broadinstitute.org/products/whole-genome-sequencing.

Transcriptome Sequencing

Human whole transcriptome sequencing is performed by the Genomics Platform at the Broad Institute of MIT and Harvard. The transcriptome product combines poly(A)-selection of mRNA transcripts with a strand-specific cDNA library preparation, with a mean insert size of 550 bp. Libraries are sequenced on the HiSeq 2500 platform to a minimum depth of 50 million STAR-aligned reads. ERCC RNA controls are included for all samples, allowing additional control of variability between samples.

Additional details on human whole transcriptome sequencing are available at http://genomics.broadinstitute.org/products/whole-transcriptome-sequencing.  

Annotation / QC

After sequencing data is generated, extensive quality control is performed in two stages on each sample. At the BAM level, the following metrics will be examined: contamination percentage estimates, chimeric read percentage, GC/AT dropout, and insert size distribution. Samples identified as outliers will be further investigated and excluded from variant calling. After variant calling, the following will be inferred for each sample: sex, ancestry, and relatedness to other samples. Any discrepancies with reported information will need to be resolved with collaborators before proceeding. In addition, the following metrics will be examined: number of variants, transition/transversion ratio (TiTv), insertion/deletion ratio, heterozygous/homozygous alternate allele ratio, number of singletons, and Mendelian violations. Outliers for these metrics will be highlighted to collaborators in a report but will typically not halt progress. A summary of the quality control will be included as an appendix to the analysis report.
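Outlier flagging of the kind described above can be sketched as a z-score check across the cohort. The metric name and the cutoff below are illustrative assumptions, not the pipeline's actual rule:

```python
from statistics import mean, stdev

def flag_outliers(metric_by_sample, z_cutoff=3.0):
    """Flag samples whose per-sample metric (e.g. TiTv ratio) deviates
    from the cohort mean by more than z_cutoff standard deviations."""
    values = list(metric_by_sample.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:  # all samples identical; nothing to flag
        return []
    return [sample for sample, value in metric_by_sample.items()
            if abs(value - mu) / sigma > z_cutoff]
```

For a cohort of exomes with TiTv near 3.0, a sample at 1.0 stands out; note that with small cohorts an outlier inflates the standard deviation, so a lower cutoff (or a robust statistic such as the median absolute deviation) may be preferable.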

Reprocessing Sequencing Data

We will reprocess externally generated Illumina sequencing data that meets the CMG criteria through the Broad’s Picard exome processing pipeline to create an analysis-ready BAM file. This includes any necessary format conversion and merging of data per sample through to creation of a standard Picard BAM file per sample, including short indel co-cleaning where appropriate. At the end of this processing, samples are available and eligible for inclusion in joint variant calling.

Reprocessing sequencing data – minimum requirements

FASTQ

  • Reads are from paired-end Illumina sequencing with minimum length 76 bp
  • Must be gzip compressed
  • Two files, one for each read pair (e.g. sample1_1.fastq.gz, sample1_2.fastq.gz)
  • All reads must be paired
  • Truncated files will not be accepted
  • Must align using bwa aln or bwa mem without any errors/warnings
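A minimal pre-submission check for the FASTQ requirements above might look like the following sketch (standard library only). Reading each gzip stream to the end catches truncation; this is not a substitute for the required bwa alignment test:

```python
import gzip

def check_fastq_pair(r1_path, r2_path, min_len=76):
    """Validate a paired FASTQ submission: both files must be intact
    gzip, contain the same number of reads, and every read must be at
    least min_len bases long. Returns (ok, message)."""
    counts = []
    for path in (r1_path, r2_path):
        n = 0
        try:
            with gzip.open(path, "rt") as fh:
                while True:
                    header = fh.readline()
                    if not header:  # clean end of file
                        break
                    seq = fh.readline().strip()
                    fh.readline()  # '+' separator line
                    fh.readline()  # quality line
                    if len(seq) < min_len:
                        return False, f"read shorter than {min_len} bp in {path}"
                    n += 1
        except (OSError, EOFError):
            return False, f"{path} is not a readable gzip file (truncated?)"
        counts.append(n)
    if counts[0] != counts[1]:
        return False, "read counts differ; all reads must be paired"
    return True, f"{counts[0]} read pairs OK"
```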

BAM Specifications

  • Reads are from paired-end Illumina sequencing with minimum length 76 bp
  • Must be able to index using samtools/Picard. Note: truncated files will fail indexing
  • All reads must be paired regardless of quality and mapping
  • Must contain all unmapped reads
  • Original base qualities should be present for each read
  • Header should contain all necessary metadata to reproduce fastq file and subsequent re-processing

Variant Calling Format (VCF)

  • VCF version 4.1 (or higher)
  • Produced by Genome Analysis Toolkit (GATK) 3.x (or higher) and must have no GATK errors/warnings with the --validation_strictness STRICT option applied
  • Must be compressed by either gzip or bgzip
  • All samples in a cohort must be contained in a single joint-called VCF file, not in individual VCF files per sample. Individual sample VCF files merged into one VCF file (e.g. using GATK CombineVariants) will also not be accepted.
  • Calls must be made against the hg19/GRCh37 human reference, similar to http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta. Chromosomes should be named ‘1’, ‘2’, ‘3’, etc., not ‘chr1’, ‘chr2’, ‘chr3’, etc.
  • Sample IDs cannot contain any spaces or special characters.
  • Sample IDs must match those specified in other accompanying documentation, or any differences must be noted.
  • Sample genotypes must be in the format GT:DP:AD:GQ:PL
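Several of the VCF requirements above lend themselves to a light-weight pre-submission check. The sketch below is illustrative (the helper name and the spot-check of a single data line are assumptions, not the CMG's actual validator):

```python
import re

def check_vcf_header(lines):
    """Check a VCF (iterable of text lines) against the submission
    requirements: file format version, chromosome naming, sample ID
    characters, and the GT:DP:AD:GQ:PL genotype FORMAT. Returns a
    list of problems (empty means the spot checks passed)."""
    problems = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##fileformat="):
            m = re.match(r"##fileformat=VCFv(\d+)\.(\d+)", line)
            if not m or (int(m.group(1)), int(m.group(2))) < (4, 1):
                problems.append("fileformat must be VCFv4.1 or higher")
        elif line.startswith("#CHROM"):
            samples = line.split("\t")[9:]
            if not samples:
                problems.append("no sample columns; joint VCF required")
            for s in samples:
                if not re.fullmatch(r"[A-Za-z0-9_.-]+", s):
                    problems.append(f"bad sample id: {s!r}")
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            if fields[0].startswith("chr"):
                problems.append("chromosomes must be '1', not 'chr1'")
            if len(fields) > 8 and fields[8] != "GT:DP:AD:GQ:PL":
                problems.append("genotype FORMAT must be GT:DP:AD:GQ:PL")
            break  # one data line is enough for these spot checks
    return problems
```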

dbGaP

The National Institutes of Health (NIH) has an established central data repository called the database of Genotypes and Phenotypes (dbGaP) for securely storing and sharing human data submitted to NIH under the Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS). Implicit in the establishment of dbGaP is the expectation that scientific progress in genomic research will be greatly enhanced if the data are readily available to all scientific investigators and shared in a manner consistent with the research participants’ informed consent.

Controlled-access data in dbGaP can only be obtained if a user has been authorized by the appropriate Data Access Committee (DAC). Information on requesting controlled data access is available on the NIH website. Data available to authorized investigators may include de-identified phenotypes and genotypes for individual study subjects, pedigrees, and pre-computed univariate associations between genotype and phenotype (if not made available on the public site).

All data generated in the Broad CMG will be deposited to dbGaP within a year of data generation. To request access to the data, please visit our dbGaP site.

Matchmaker

The Matchmaker Exchange (MME) is a federated network of genomic centers that share their patient data to help find the underlying genetic changes that cause rare disease. The goal of this network is to aggregate data from patients with similar disorders around the world. These centers connect to each other through a common application programming interface (API) that uses a mutually agreed-upon data format.

We built the matchbox software application to serve as our primary bridge to the MME. Building an MME server is a resource-intensive process and a limiting factor for new centers interested in joining this data-sharing network. To help such institutions, we made matchbox open source, easy to install, and free to use.

Our collaborators deposit various types of patient data, such as phenotypes, into our seqr web application, which is our central data aggregation and analysis platform. seqr is a web-based technology platform that allows the collaborative analysis of genomic data aggregated by family. It allows users to annotate, comment on, search, and visualize genomic data. This helps our collaborators and data contributors from around the world efficiently conduct analysis and exchange information.

Once phenotypes have been entered in seqr and one or more candidate genes identified, families can be “matched” to find other patients around the world with similar disorders. This aggregation of similar cases helps users build evidence for gene causality and can greatly accelerate novel gene discovery.

More information can be found at www.matchmakerexchange.org and in a special issue of Human Mutation (http://onlinelibrary.wiley.com/doi/10.1002/humu.2015.36.issue-10/issuetoc).

matchbox can be downloaded from github. Please contact matchmaker@broadinstitute.org for any questions.

ClinVar

ClinVar is a publicly available database of genomic variation and its relationship to human health, maintained by the National Center for Biotechnology Information (NCBI) and funded by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine. ClinVar catalogs and aggregates variant submission with their reported clinical significance and supporting information, when available. ClinVar adds value to submitted interpretations by standardizing descriptions of variants, conditions, and terms for clinical significance. This information is made publicly available through ClinVar for use in the healthcare community.

Key ClinVar facts:

  • ClinVar is fully public and freely available.
  • ClinVar is a submission-driven database that holds both primary submissions and expert-curated submissions. The scope of the submission may be as small as a single variant.
  • ClinVar welcomes submissions from clinical testing labs, researchers, locus-specific databases, expert panels, and professional societies.
  • ClinVar adds value to submitted interpretations by standardizing descriptions of variants, conditions, and terms for clinical significance.
    • Variants are mapped to reference sequences and reported in HGVS.
    • Conditions are mapped to concepts in MedGen.
    • Clinical significance terms for Mendelian disorders are reported by ACMG categories.
    • Following variant submission, ClinVar provides a conflict report of any differences in interpretation between the submitted variants and those already in ClinVar.
  • More information on ClinVar is available at www.ncbi.nlm.nih.gov/clinvar.

CMG analysts will work together with collaborators using the seqr framework to identify strong candidate genes/variants in two stages. The following analytic approach is applied to exome and whole genome Next Generation Sequencing data.

The initial analysis will aim to identify candidate variants in known genes associated with the primary disease/phenotype. The majority of filters listed below are relaxed to increase sensitivity to detect potential variants in these known genes.

The subsequent analysis will attempt to identify rare, likely candidates in genes not previously associated with disease by applying filters to exclude unlikely variants:

Frequency Filters 
Allele frequencies from the Genome Aggregation Database (gnomAD) and 1000 Genomes will be used to exclude common variants. An additional popmax filter can be used to further exclude variants that are common in any one population. We typically consider a variant rare if it occurs at a frequency below 1% in these populations.
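The popmax-style logic above reduces to comparing the maximum per-population allele frequency against the threshold. This is a sketch; the population labels are illustrative, not the exact gnomAD field names:

```python
def passes_frequency_filter(pop_afs, max_af=0.01):
    """Keep a variant only if its allele frequency is below max_af in
    every reference population (a popmax-style filter).

    pop_afs maps population label -> allele frequency; a variant absent
    from all populations (empty dict) is treated as frequency 0.
    """
    popmax = max(pop_afs.values(), default=0.0)
    return popmax < max_af
```

For example, a variant at 0.4% in its most common population passes the default 1% threshold, while a variant at 5% in East Asians is excluded even if it is vanishingly rare elsewhere.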

Inheritance Patterns 
We interrogate recessive (including homozygous, compound heterozygous, and X-linked), dominant, and de novo modes of inheritance where appropriate based on family structure. In complicated cases, we can use custom inheritance filters to explicitly specify individual genotypes. For more information about the specific criteria for each of these searches, please see the documentation on seqr.
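The standard inheritance searches reduce to genotype predicates over a trio. A minimal sketch, with genotypes coded as alt-allele counts (0, 1, 2); the helper names are illustrative, not seqr's actual search implementation:

```python
def de_novo(child, mother, father):
    """Child carries a variant absent from both parents."""
    return child >= 1 and mother == 0 and father == 0

def homozygous_recessive(child, mother, father):
    """Child is homozygous alt; both parents are heterozygous carriers."""
    return child == 2 and mother == 1 and father == 1

def compound_het(variants_in_gene):
    """Two heterozygous variants in the same gene, one inherited from
    each parent. Each variant is a dict of alt-allele counts."""
    from_mother = any(v["child"] == 1 and v["mother"] == 1 and v["father"] == 0
                      for v in variants_in_gene)
    from_father = any(v["child"] == 1 and v["father"] == 1 and v["mother"] == 0
                      for v in variants_in_gene)
    return from_mother and from_father
```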

Functional Annotation
The Variant Effect Predictor tool is used to annotate variants, and results can be filtered on functional annotations using Sequence Ontology terms. The following functional classes are considered in a moderate- to high-impact search: nonsense, essential splice site, missense, frameshift, and in-frame variants. In addition, deleterious predictions from PolyPhen, SIFT, MutationTaster, and FATHMM can be used as a tentative guide to further prioritize candidate variants. Lastly, all genes/variants present in ClinVar and OMIM will be indicated.
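The consequence filtering above can be sketched as membership in a set of Sequence Ontology terms. The set below is an illustrative subset corresponding to the listed classes, not seqr's exact term mapping:

```python
# Sequence Ontology consequence terms corresponding to the moderate-
# to high-impact classes listed above (an illustrative subset).
MODERATE_HIGH_IMPACT = {
    "stop_gained",               # nonsense
    "splice_acceptor_variant",   # essential splice site
    "splice_donor_variant",
    "missense_variant",
    "frameshift_variant",
    "inframe_insertion",         # in-frame variants
    "inframe_deletion",
}

def is_candidate(consequences):
    """True if any VEP consequence term for a variant falls in the
    moderate-to-high-impact set."""
    return bool(MODERATE_HIGH_IMPACT & set(consequences))
```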

Genes and Regions
Similar to the initial analysis, searches can be restricted to a set of candidate genes or genomic loci previously associated with the disease/phenotype. These searches are typically performed by applying additional filters.

Quality Filters
All variants detected by the Genome Analysis Toolkit (GATK) based pipeline have associated quality metrics for both the variant site and for each individual genotype that can be filtered.
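Quality filtering of this kind can be sketched as threshold checks at the site and genotype levels. The GQ/DP cutoffs below are common illustrative defaults, not the CMG's fixed values:

```python
def genotype_passes(gq, dp, min_gq=20, min_dp=10):
    """Genotype-level filter on genotype quality (GQ) and read depth (DP)."""
    return gq >= min_gq and dp >= min_dp

def site_passes(filter_field):
    """Site-level filter: keep variants whose FILTER column indicates
    they passed VQSR ('PASS', or '.' when no filter was applied)."""
    return filter_field in ("PASS", ".")
```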

All strong candidates from both stages of analysis will be tagged and classified according to the American College of Medical Genetics and Genomics (ACMG) standards and guidelines. Further analysis will be performed specific to each collaboration agreement. A summary of strong candidates will be available to the collaborator through the seqr interface.

Copy Number Variation

Copy number variation (CNV) analysis will be conducted on exome sequencing data using ExomeDepth and GATK gCNV. Genome sequencing data will be processed with GenomeSTRiP to identify CNVs and Manta to identify both CNVs and structural variants.  

 

Publication Policy

In general, it is anticipated that collaborators will retain first and senior authorship for straightforward discoveries made on their samples (either discovered by the core CMG staff or if made directly by the collaborators). The Center must be named as an author on resulting publications, along with individual CMG analysts who played a major role in the discovery.

The Center will notify all collaborators in cases where the same gene is discovered in samples spanning multiple sites; in such cases, we expect that authorship decisions will be resolved fairly between the parties involved. If a discovery requires an unusually large or innovative effort made by the core CMG staff to solve certain cases, we expect authorship will be awarded according to the actual contributions of all participating investigators.

In addition, the CMG funding must be noted in the acknowledgments of resulting publications. Please include the following statement:

Sequencing and analysis was provided by the Broad Institute of MIT and Harvard Center for Mendelian Genomics (Broad CMG) and was funded by the National Human Genome Research Institute, the National Eye Institute, and the National Heart, Lung and Blood Institute grant UM1 HG008900 to Daniel MacArthur and Heidi Rehm.

Whole Exome Methods Template Text

Whole exome sequencing and data processing were performed by the Genomics Platform at the Broad Institute of MIT and Harvard (Broad Institute, Cambridge, MA, USA). We performed whole exome sequencing on DNA samples (>250 ng of DNA, at >2 ng/ul) using Illumina exome capture (38 Mb target). Our exome-sequencing pipeline included sample plating, library preparation (2-plexing of samples per hybridization), hybrid capture, sequencing (150 bp paired reads), sample identification QC check, and data storage. Our hybrid selection libraries cover >90% of targets at 20x with a mean target coverage of ~100x. The exome sequencing data were de-multiplexed and each sample's sequence data were aggregated into a single Picard BAM file.

Exome sequencing data was processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known indels. We used the BWA aligner for mapping reads to the human genome build 37 (hg19). Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) were jointly called across all samples using the Genome Analysis Toolkit (GATK) HaplotypeCaller package version 3.4. Default filters were applied to SNP and indel calls using the GATK Variant Quality Score Recalibration (VQSR) approach. Lastly, the variants were annotated using the Variant Effect Predictor (VEP). For additional information, please refer to Supplementary Section 1 of the ExAC paper (Lek et al.). The variant call set was uploaded onto seqr and analysis was performed using the various inheritance patterns.