Sequencing

Exome Sequencing

Whole exome sequencing and data processing are performed by the Genomics Platform at the Broad Institute of MIT and Harvard. Libraries from DNA samples (>250 ng of DNA, at >2 ng/µl) are created with an Illumina exome capture (38 Mb target) and sequenced (150 bp paired reads) to cover >90% of targets at 20x with a mean target coverage of >80x. Sample identity quality assurance checks are performed on each sample. The exome sequencing data is de-multiplexed and each sample's sequence data is aggregated into a single Picard BAM file.

Exome sequencing data is processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known indels. The BWA aligner is used for mapping reads to the human genome build 37 (hg19). Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) are jointly called across all samples using the Genome Analysis Toolkit (GATK) HaplotypeCaller, version 3.4. Default filters are applied to SNP and indel calls using the GATK Variant Quality Score Recalibration (VQSR) approach. Annotation is performed using Variant Effect Predictor (VEP). Lastly, the variant call set is uploaded to seqr for collaborative analysis between the CMG and the investigator.

The exome specifications were developed specifically for the Broad CMG. Additional details on Broad research exome products are available at http://genomics.broadinstitute.org/products/whole-exome-sequencing.

Genome Sequencing

Whole genome sequencing and data processing are performed by the Genomics Platform at the Broad Institute of MIT and Harvard. PCR-free preparation of sample DNA (350 ng input at >2 ng/µl) is accomplished using Illumina HiSeq X Ten v2 chemistry. Libraries are sequenced to a mean target coverage of >30x.

Genome sequencing data is processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known indels. The BWA aligner is used for mapping reads to the human genome build 37 (hg19). Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) are jointly called across all samples using the Genome Analysis Toolkit (GATK) HaplotypeCaller, version 3.4. Default filters are applied to SNP and indel calls using the GATK Variant Quality Score Recalibration (VQSR) approach. Annotation is performed using Variant Effect Predictor (VEP). Lastly, the variant call set is uploaded to seqr for collaborative analysis between the CMG and the investigator.

Additional details on PCR-free whole genome sequencing are available at http://genomics.broadinstitute.org/products/whole-genome-sequencing.

Transcriptome Sequencing

Human whole transcriptome sequencing is performed by the Genomics Platform at the Broad Institute of MIT and Harvard. The transcriptome product combines poly(A)-selection of mRNA transcripts with a strand-specific cDNA library preparation, with a mean insert size of 550 bp. Libraries are sequenced on the HiSeq 2500 platform to a minimum depth of 50 million STAR-aligned reads. ERCC RNA controls are included for all samples, allowing additional control of variability between samples.

Additional details on human whole transcriptome sequencing are available at http://genomics.broadinstitute.org/products/whole-transcriptome-sequencing.  

Annotation / QC

After sequencing data is generated, extensive quality control is performed in two stages on each sample. At the BAM level, the following metrics will be examined: contamination percent estimates, chimeric read percent, GC/AT dropout and insert size distribution. Samples identified as outliers will be further investigated and excluded from variant calling. After variant calling, the following will be inferred for each sample: sex, ancestry and relatedness to other samples. Any discrepancies with reported information will need to be resolved with collaborators before proceeding. In addition, the following metrics will be examined: number of variants, transition to transversion ratio (TiTv), insertion/deletion ratio, heterozygous/homozygous alternate allele ratio, number of singletons and Mendelian violations. Outliers for these metrics will be highlighted to collaborators in a report but will typically not halt progress. A summary of the quality control will be included as an appendix to the analysis report.
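The text above does not say how outliers are identified. As an illustration only, the sketch below flags samples whose metric lies far from the cohort median, using the median absolute deviation (MAD). The rule, the threshold of 4 scaled MADs, the function name flag_outliers and the sample values are all assumptions made for this example, not the CMG's actual criteria.

```python
from statistics import median


def flag_outliers(values, n_mads=4.0):
    """Return sorted sample IDs whose metric is more than n_mads scaled MADs from the median.

    values maps sample ID -> metric value (e.g. TiTv ratio).
    The MAD rule and the default threshold are illustrative assumptions.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # No spread at all: anything that differs from the median stands out.
        return sorted(s for s, v in values.items() if v != med)
    scaled_mad = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return sorted(s for s, v in values.items() if abs(v - med) / scaled_mad > n_mads)


# Hypothetical per-sample TiTv ratios for a small exome cohort.
titv = {"S1": 3.01, "S2": 2.98, "S3": 3.05, "S4": 2.10, "S5": 3.00}
print(flag_outliers(titv))  # ['S4']
```

A median-based rule is used here because a single extreme sample shifts the mean and standard deviation, which can hide the very outlier being looked for. The same function could be applied to each metric listed above in turn.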

Reprocessing Sequencing Data

We will reprocess externally generated Illumina sequencing data that meets the CMG criteria through the Broad’s Picard exome processing pipeline to create an analysis-ready BAM file. This includes any necessary format conversion and merging of data per sample through to creation of a standard Picard BAM file per sample, as well as short indel co-cleaning where appropriate. At the end of this processing, samples are available and eligible for inclusion in joint variant calling.

Reprocessing sequencing data – minimum requirements

FASTQ

  • Reads are from paired-end Illumina sequencing with minimum length 76 bp
  • Must be gzip compressed
  • Two files, one for each read of the pair (e.g. sample1_1.fastq.gz, sample1_2.fastq.gz)
  • All reads must be paired
  • Truncated files will not be accepted
  • Must align using bwa aln or bwa mem without any errors/warnings
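
Some of the FASTQ requirements above can be checked before submission. The following is a minimal sketch that uses only the Python standard library. It assumes four-line FASTQ records, interprets "minimum length 76 bp" as applying to every read, and treats equal read counts in the two files as evidence of pairing; the function names are hypothetical. It does not replace the bwa alignment check.

```python
import gzip

MIN_READ_LENGTH = 76  # minimum read length from the requirements above


def summarize_fastq(path):
    """Return (number of reads, shortest read length) for a gzip-compressed FASTQ file."""
    n_reads, shortest = 0, None
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # clean end of file
            seq = handle.readline().rstrip("\n")
            plus = handle.readline()
            qual = handle.readline().rstrip("\n")
            if not (header.startswith("@") and plus.startswith("+") and len(seq) == len(qual)):
                raise ValueError(f"malformed record {n_reads + 1}")
            n_reads += 1
            shortest = len(seq) if shortest is None else min(shortest, len(seq))
    return n_reads, shortest


def check_fastq_pair(path_1, path_2):
    """Return a list of problems found in a pair of FASTQ files (empty if none)."""
    problems, counts = [], []
    for path in (path_1, path_2):
        try:
            n_reads, shortest = summarize_fastq(path)
        except (OSError, EOFError, ValueError) as exc:
            # gzip raises OSError for non-gzip input and EOFError for truncated files.
            problems.append(f"{path}: not gzip, truncated or malformed ({exc})")
            continue
        counts.append(n_reads)
        if shortest is None:
            problems.append(f"{path}: no reads")
        elif shortest < MIN_READ_LENGTH:
            problems.append(f"{path}: shortest read is {shortest} bp, below {MIN_READ_LENGTH} bp")
    if len(counts) == 2 and counts[0] != counts[1]:
        problems.append(f"files are not fully paired: {counts[0]} vs {counts[1]} reads")
    return problems
```

Calling check_fastq_pair("sample1_1.fastq.gz", "sample1_2.fastq.gz") returns an empty list when none of these checks fail. Each file is read to the end, so a truncated gzip stream is reported rather than silently accepted.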

BAM Specifications

  • Reads are from paired-end Illumina sequencing with minimum length 76 bp
  • Must be able to index using samtools/Picard. Note: truncated files will fail indexing
  • All reads must be paired regardless of quality and mapping
  • Must contain all unmapped reads
  • Original base qualities should be present for each read
  • Header should contain all metadata needed to regenerate the FASTQ files and to re-process the data

Variant Call Format (VCF)

  • VCF version 4.1 (or higher)
  • Produced by Genome Analysis Toolkit (GATK) 3.x (or higher) and must have no GATK errors/warnings with the --validation_strictness STRICT option applied
  • Must be compressed by either gzip or bgzip
  • All samples in a cohort must be contained in a single VCF file, not in individual VCF files for each sample. Individual sample VCF files merged into one VCF file (e.g. using GATK CombineVariants) will also not be accepted.
  • Must use the hg19/GRCh37 human reference, matching or similar to http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta. Chromosomes should be named ‘1’, ‘2’, ‘3’, etc. and not ‘chr1’, ‘chr2’, ‘chr3’, etc.
  • Sample IDs cannot contain any spaces or special characters.
  • Sample IDs must match those specified in other accompanying documentation, or any differences must be noted.
  • Sample genotypes must be in the format GT:DP:AD:GQ:PL
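
Several of the VCF requirements above can be screened with a short script before submission. The sketch below is a rough pre-check, not a substitute for GATK validation. It assumes that "special characters" means anything other than letters, digits, underscore, hyphen and period, and that the genotype requirement is met when all five FORMAT keys are present in any order. Both interpretations, and the function name check_vcf, are assumptions made for this example.

```python
import gzip
import re

REQUIRED_FORMAT_KEYS = ["GT", "DP", "AD", "GQ", "PL"]
# Assumption: allowed sample ID characters are letters, digits, '_', '-' and '.'.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]+")


def check_vcf(path, max_records=1000):
    """Return a list of problems found in a gzip- or bgzip-compressed VCF (empty if none).

    Only the header and the first max_records variant records are inspected.
    """
    problems = []

    def report(message):
        if message not in problems:  # report each distinct problem once
            problems.append(message)

    saw_fileformat = False
    n_records = 0
    try:
        with gzip.open(path, "rt") as handle:
            for raw in handle:
                line = raw.rstrip("\n")
                if line.startswith("##fileformat="):
                    saw_fileformat = True
                    match = re.fullmatch(r"##fileformat=VCFv(\d+)\.(\d+)", line)
                    if match is None or (int(match[1]), int(match[2])) < (4, 1):
                        report(f"VCF version below 4.1 or unreadable: {line}")
                elif line.startswith("##"):
                    continue
                elif line.startswith("#CHROM"):
                    for sample in line.split("\t")[9:]:
                        if SAMPLE_ID_PATTERN.fullmatch(sample) is None:
                            report(f"sample ID with space or special character: {sample!r}")
                elif line:
                    fields = line.split("\t")
                    if fields[0].lower().startswith("chr"):
                        report(f"chromosome named with 'chr' prefix: {fields[0]}")
                    if len(fields) > 8:
                        keys = fields[8].split(":")
                        missing = [k for k in REQUIRED_FORMAT_KEYS if k not in keys]
                        if missing:
                            report("FORMAT lacks " + ",".join(missing))
                    n_records += 1
                    if n_records >= max_records:
                        break
    except (OSError, EOFError) as exc:
        report(f"not gzip/bgzip compressed, or truncated ({exc})")
    if not saw_fileformat:
        report("missing ##fileformat header line")
    return problems
```

Python's gzip module reads bgzip output as well, because bgzip files are a series of concatenated gzip members. Requirements that depend on information outside the file, such as whether all cohort samples are present or whether sample IDs match the accompanying documentation, still need to be confirmed by hand.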